Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

i.e. Enforcing MDR Uniqueness

Description

When posting a Master Data Record, MDM should prevent duplicate records from getting into the system.  Therefore, part of the Domain definition should include some criteria to be checked when a an MDR is created or modified to guarantee it is unique before allowing it to be POSTed/PUT.  

MDM will support 2 modes of MDR duplicate prevention, used exclusively of one another.  The first one (v1) will be easier to implement, easier for the end user to understand, and more efficient to execute.  Hence, we don't intend for v2 to replace v1, but be an "advanced" feature to be used optionally instead of v1.  It  V1 will allow the Domain creator to specify a list of columns properties whose values for a given MDR cannot match, in their entirety, the values of another MDR.

Example: A Student domain might want to be is created with the columns many properties including the subset of properties FirstName, LastName, CCCID as the "uniquenessRules" columnsproperties. So, if there exists a Student Joe Blow Jane Doe with ABC1234 as their CCCID and another an MDM adaptor using or MDM API consumer attempts to create a student with the name Joe Blow Jane Doe and CCCID ABC1234, it should fail with some sort of Duplicate Record error messagethe HTTP response code 409 CONFLICT.  It's possible, in this scenario, that CCCID may be the only column necessary for duplicate prevention .  But however, that would be domain specific and use case specific.

The v2 version of Duplicate Prevention will be rules-based.  We are still evaluating libraries which may be able to keep us from having to roll our own.  The idea will be that you can specify more sophisticated boolean logic to determine whether one MDR represents the same entity that another does. For example:

 student.cccID == $1 OR (student.last == $4 AND (student.phone[any] == $2 OR student.email[any] == $3))

More sophisticated rules can be added that consider case, string distance, etc.

Implementation - v1

In the md_domains table, we will add a varchar column that stores the uniquenessRule.  This will be a comma-separated list of property names in the domain.  Property names should not allow commas in them, so there is no need to escape the property names (this will not be the case, however, with the values stored in each MDR, where if commas in the data are not escaped, it will seem like there are too many columns stored).  

...

idnameversion...uniqueness_rule
1Student1...lower:firstname,lastname,ccc_id
...




Validation of uniquenessRule

Upon initial save or saving a new version of the Domain, we must validate that all properties specified in the uniquenessRule exist in the DomainSchemaDomain's Schema.  So, we will iterate over the comma-separated values, and verify that the properties exist.  If not, we must fail the save and return an appropriate error messagea 400 Bad Request.  Using the above example, we would walk the related DomainSchemaEntryEntities, insuring there are ones name properties named firstname, lastname, and ccc_id.  Notice I have "lower:" preceding the comma-separated list.  I think even in the simple version, we might be able to support a modifier that could help us to be more efficient as well as benefit the user.  This "lower:" would essentially mean that we'll be storing the values as all-lowercase, which would allow us to do a case-insensitive search.  

We may need to disallow certain property_types, such as arrays, to simplify things.  The value of an array of strings could be written out as comma-separated values.  If we escape those commas, it would still be a straightforward equals check.  So, we may be able to support certain arrays, or we may want to disallow it altogether for simplicity (both simplicity of implementation and simplicity of presentation). 

Application of uniquenessRule

In order to apply the uniquenessRule to the saving or updating of an MDR, we must store the values of the properties in each DomainEntryEntity.  To support this, we will add a column to md_domains_entries called uniqueness_values to store the values as comma-separated values.

...

Notice that the values are saved in all lowercase.  I would expect us to have 2 indexes on this column, one that is lower() and one that is not, so that if they are using case insensitive uniqueness, we can use the lower index which should be faster.  So, the query we would use for duplicate detection in this case would look something like:

SELECT CASE WHEN EXISTS (SELECT id FROM md_domains_entries 
WHERE lower(uniqueness_values)=$1
AND md_domains=$2
AND version=$3
)
THEN CAST (1 AS BIT)
ELSE CAST (0 AS BIT)
END;

This will return 1 if a duplicate exists, at which point we can fail the operation, otherwise allow the save/update.  

...