MDR Duplicate Prevention

i.e. Enforcing MDR Uniqueness

Description

When posting a Master Data Record, MDM should prevent duplicate records from getting into the system.  Therefore, part of the Domain definition should include some criteria to be checked when an MDR is created or modified to guarantee it is unique before allowing it to be POSTed/PUT.  

MDM will support 2 modes of MDR duplicate prevention, used exclusively of one another.  The first one (v1) will be easier to implement, easier for the end user to understand, and more efficient to execute.  Hence, we don't intend for v2 to replace v1, but be an "advanced" feature to be used optionally instead of v1.  V1 will allow the Domain creator to specify a list of properties whose values for a given MDR cannot match, in their entirety, the values of another MDR.

Example: A Student domain is created with many properties including the subset of properties FirstName, LastName, CCCID as the "uniquenessRules" properties. So, if there exists a Student Jane Doe with ABC1234 as their CCCID and an MDM adaptor or MDM API consumer attempts to create a student with the name Jane Doe and CCCID ABC1234, it should fail with the HTTP response code 409 CONFLICT.  It's possible, in this scenario, that CCCID may be the only column necessary for duplicate prevention however, that would be domain and use case specific.

The v2 version of Duplicate Prevention will be rules-based.  We are still evaluating libraries which may be able to keep us from having to roll our own.  The idea will be that you can specify more sophisticated boolean logic to determine whether one MDR represents the same entity that another does. For example:

 student.cccID == $1 OR (student.last == $4 AND (student.phone[any] == $2 OR student.email[any] == $3))

More sophisticated rules can be added that consider case, string distance, etc.

Implementation - v1

In the md_domains table, we will add a varchar column that stores the uniquenessRule.  This will be a comma-separated list of property names in the domain.  Property names should not allow commas in them, so there is no need to escape the property names (this will not be the case, however, with the values stored in each MDR, where if commas in the data are not escaped, it will seem like there are too many columns stored).  

Table 1 - Example Domain (data from md_domains table)

idnameversion...uniqueness_rule
1Student1...lower:firstname,lastname,ccc_id
...




Validation of uniquenessRule

Upon initial save or saving a new version of the Domain, we must validate that all properties specified in the uniquenessRule exist in the Domain's Schema.  So, we will iterate over the comma-separated values, and verify that the properties exist.  If not, we must fail the save and return a 400 Bad Request.  Using the above example, we would walk the related DomainSchemaEntryEntities, insuring there are properties named firstname, lastname, and ccc_id.  Notice I have "lower:" preceding the comma-separated list.  I think even in the simple version, we might be able to support a modifier that could help us to be more efficient as well as benefit the user.  This "lower:" would essentially mean that we'll be storing the values as all-lowercase, which would allow us to do a case-insensitive search.  

We may need to disallow certain property_types, such as arrays, to simplify things.  The value of an array of strings could be written out as comma-separated values.  If we escape those commas, it would still be a straightforward equals check.  So, we may be able to support certain arrays, or we may want to disallow it altogether for simplicity (both simplicity of implementation and simplicity of presentation). 

Application of uniquenessRule

In order to apply the uniquenessRule to the saving or updating of an MDR, we must store the values of the properties in each DomainEntryEntity.  To support this, we will add a column to md_domains_entries called uniqueness_values to store the values as comma-separated values.

Table 2 - Example Domain Entries (data from md_domains_entries)

idmd_domainsversion...uniqueness_values
111..joe,blow,abc1111
211..angie,davis,abc1234
...



Notice that the values are saved in all lowercase.  I would expect us to have 2 indexes on this column, one that is lower() and one that is not, so that if they are using case insensitive uniqueness, we can use the lower index which should be faster.  So, the query we would use for duplicate detection in this case would look something like:

SELECT CASE WHEN EXISTS (SELECT id FROM md_domains_entries 
WHERE lower(uniqueness_values)=$1
AND md_domains=$2
AND version=$3
)
THEN CAST (1 AS BIT)
ELSE CAST (0 AS BIT)
END;

This will return 1 if a duplicate exists, at which point we can fail the operation, otherwise allow the save/update.  

Implementation - v2

Proposed: Use Drools.  The idea would be to hopefully store the rule in the same column of md_domains, and infer based on formatting whether the domain uses v1 or v2 duplicate detection.  Obviously, since a rules engine would provide a lot more power, it would come at a cost.  Instead of simply querying the md_domains_entries table, we would probably have to store data for any columns referenced in the rule in md_domains_data_entries (even if MDM is a federated install).  And then we would have to load all the referenced data for all the entities and iterate over each one, executing the rule logic.  

Alternate Implementation Details

It was proposed to store the uniqueness_rules as "AND" separated rather than comma-separated, as it may match the Drools Rule Language (DRL) more closely.  But I think that DRL would represent and with &&.  So, we could &&, which I would prefer over the text AND mainly because it seems easier to handle.  There will eventually be a case where someone defines a domain with a property like "BAND_NAME" and it will throw off our parse, and we'll have to somehow escape "AND" found in property names, and it just ends up being more difficult with little benefit.