Combating Duplicate Data: Who Is the Survivor?
One of the foremost objectives of implementing a Master Data hub is to identify and resolve duplicate customer records. This is a crucial step toward achieving a single version of truth about customer information, which in turn helps lower operational costs and maximize analytical capabilities.
Duplicate processing starts as soon as customer records are consolidated in MDM from different source systems, or, at worst, when duplicate records are found within the same system in the absence of business rules to prevent them.
The process has the following two steps and can take significant effort depending on how distorted the data is.
Step 1: Identifying duplicates (Data Matching)
Data matching is the process of identifying duplicate records. There are two matching techniques that today’s MDM applications use to detect duplicates.
- Deterministic – Exact comparison of data elements to assign scores.
- Probabilistic – Matching based on phonetic similarity (“sounds like”), likelihood of occurrence, and statistical theory, which helps pinpoint variations and nuances. This technique assigns a percentage indicating the probability of a match.
Although there are certain advantages to both matching techniques, probabilistic matching has the upper hand due to its higher accuracy in matching records.
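To make the distinction concrete, here is a minimal, hypothetical Python sketch contrasting an exact deterministic compare with a probabilistic-style weighted score. The field names, weights, and threshold are my own illustrative assumptions, and a simple fuzzy string ratio stands in for the phonetic and statistical models a real matching engine would use.

```python
from difflib import SequenceMatcher

def deterministic_match(a, b, fields=("first_name", "last_name", "zip")):
    # Exact comparison of selected data elements: every field must match exactly.
    return all(a.get(f, "").strip().lower() == b.get(f, "").strip().lower() for f in fields)

def probabilistic_score(a, b, weights=None):
    # Weighted fuzzy similarity across fields, returned as a percentage (0-100).
    weights = weights or {"first_name": 0.3, "last_name": 0.4, "zip": 0.3}
    score = 0.0
    for field, weight in weights.items():
        sim = SequenceMatcher(None, a.get(field, "").lower(), b.get(field, "").lower()).ratio()
        score += weight * sim
    return round(score * 100, 1)

rec_a = {"first_name": "Robert", "last_name": "Harris", "zip": "43219"}
rec_b = {"first_name": "Bob",    "last_name": "Harris", "zip": "43219"}

print(deterministic_match(rec_a, rec_b))   # False: "Robert" != "Bob", so the exact compare fails
print(probabilistic_score(rec_a, rec_b))   # A high percentage; a tuned threshold would decide the match
```

The deterministic check gives a hard yes/no, while the probabilistic score leaves room for a threshold and for steward review of borderline pairs, which is where its accuracy advantage comes from.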
In a recent article, Scott Schumacher compares probabilistic and deterministic data matching and explains the key differences between the two techniques. Although matching is a very interesting topic (and one of my favorites), let’s save it for another column and concentrate on Step 2, the main objective of this post.
Step 2: Merging Duplicate Records (Surviving golden record)
Once duplicate records are identified, they need to be merged to create what we call the “single version of truth” or “golden record”. This record, by definition, holds the best possible data, representing a current, complete, and accurate view of customer information.
The main challenge we often face during the merging process is determining which data elements and values from the duplicate customer records should be used to create the golden record. The cornerstone is building “intelligent” data survivorship rules for automated merging, when the system has to decide which data to pick from the duplicate records. The controls built into MDM should be comprehensive enough that complete and accurate master information is realized.
Below are some of the rules that can be followed in an automated customer de-duplication process (a small code sketch follows the list).
- Select the data element coming from the most trusted source. In other words, the rules should know which source’s data element has higher priority. We usually determine which source gets priority by profiling the source data, which gives us metrics on data quality. More on profiling can be found here.
- Pick the data that was most recently changed. You would want to select a recently updated address from source A over an older address from a duplicate record coming from source B.
- Choose the data elements that are populated with more detail. For example:
  - For names, choose Caroline over Carry, Robert over Bob, Michael Harris over M. Harris, etc.
  - For addresses, choose 1607 Chestnut Drive over 1607 Chestnut Dr.
  - Choose postal codes that are complete, for example 43219-1933 over 03219 (ZIP+4 is better than an incomplete ZIP).
- Ignore null and empty values in suspect records.
- Ignore ambiguous or placeholder values, e.g., 123-45-6789 for a Social Security number. (These values should not be here at all if you had practiced good data governance and data quality control beforehand.)
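Here is a minimal Python sketch of how such survivorship rules might be chained field by field during an automated merge. The source names, trust ranking, placeholder list, and record layout are hypothetical assumptions for illustration, not any product’s actual API.

```python
from datetime import datetime

# Hypothetical source trust ranking (lower number = more trusted), e.g., derived from data profiling.
SOURCE_PRIORITY = {"CRM": 1, "BILLING": 2, "WEB": 3}
PLACEHOLDERS = {"123-45-6789", "N/A", "UNKNOWN"}

def pick_survivor_value(field, candidates):
    """candidates: list of dicts like {"value": ..., "source": ..., "updated": datetime}."""
    # Rule: ignore null, empty, and known placeholder values.
    valid = [c for c in candidates
             if c["value"] and str(c["value"]).strip() and str(c["value"]) not in PLACEHOLDERS]
    if not valid:
        return None
    # Rule: prefer the most trusted source, then the most recent update, then the most detailed value.
    # (Real implementations often vary this precedence per field; 'field' is kept for that purpose.)
    valid.sort(key=lambda c: (SOURCE_PRIORITY.get(c["source"], 99),
                              -c["updated"].timestamp(),
                              -len(str(c["value"]))))
    return valid[0]["value"]

# Usage: merge two duplicate records field by field into a golden record.
dupes = {
    "first_name": [
        {"value": "Bob",    "source": "WEB", "updated": datetime(2012, 1, 5)},
        {"value": "Robert", "source": "CRM", "updated": datetime(2011, 6, 1)},
    ],
    "zip": [
        {"value": "43219-1933", "source": "CRM", "updated": datetime(2012, 3, 9)},
        {"value": "03219",      "source": "WEB", "updated": datetime(2011, 6, 1)},
    ],
}
golden = {field: pick_survivor_value(field, cands) for field, cands in dupes.items()}
print(golden)  # {'first_name': 'Robert', 'zip': '43219-1933'}
```

In practice the precedence among these rules (trust vs. recency vs. completeness) is itself a design decision and often differs from one data element to another.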
The list above covers some of the rules commonly used in data stewardship. However, there may be variations depending on data quality and the customer’s situation. Sometimes you may need to build specialized rules based on how new customer records are created in the system versus how existing records are updated. Still, the above guidelines are a good starting point.
Do share your thoughts.
COMMENTS
Nice introduction to the domain, Prashanta.
One missing element is the binning that usually takes place prior to every matching process.
I have covered the overall process for building MDM on the following blog post:
http://www.semarchy.com/blog/mdm-deep-dive-the-convergence-hub-pattern/
Prashanta,
I enjoyed the post. It might be worth mentioning that duplicate consolidation is particular to the repository MDM model. Registry models do not consolidate duplicates but rather just identify them.
I appreciate the first line of the post as I am currently at a client that wants a registry but purchased a repository. It makes for a rough implementation.
On survivorship, have you encountered rules based on more complex logic? If so, I’d love to hear about it!
William,
Glad you liked this post.
You are correct about the registry model, where we only identify the duplicates by applying suitable matching logic and notify stewards about the duplicate records. Most of the time, I see the hub style being the first choice, except in healthcare scenarios, where the registry model is considered most often (due to the dispersed nature of patient data across a network of hospitals).
I have seen variations on survivorship rules in every implementation. Although the list I put up in this blog covers most purposes, there are additional rules built in to make sure we do not end up merging records that should be kept separate. For example, we unmarked a duplicate record that differed by a Person/Organization indicator (the case when a customer has a person account and also operates as a point of contact for the company he works for).
Similarly, there are instances where I had to change the rules to disable merging when the lines of business of the duplicate records differed; in that case the customer wanted to remove duplicates only within the same line of business. There are many such scenarios I can think of.
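As a rough illustration of that kind of guard, here is a hypothetical pre-merge check (the field names and indicator values are made up for the example) that suppresses an automated merge when the Person/Organization indicator or the line of business differs, so the suspect pair can be routed to a steward instead:

```python
def should_auto_merge(rec_a: dict, rec_b: dict) -> bool:
    # Guard 1: never auto-merge a person record with an organization record.
    if rec_a.get("party_type") != rec_b.get("party_type"):
        return False
    # Guard 2: only de-duplicate within the same line of business.
    if rec_a.get("line_of_business") != rec_b.get("line_of_business"):
        return False
    return True

suspect_pair = (
    {"name": "Michael Harris", "party_type": "PERSON",       "line_of_business": "RETAIL"},
    {"name": "Michael Harris", "party_type": "ORGANIZATION", "line_of_business": "RETAIL"},
)
if not should_auto_merge(*suspect_pair):
    print("Merge suppressed - route to data steward for manual review")
```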
By the way, I am surprised when customers choose the wrong style of implementation. That can cost dearly.
[…] Constructing rules to identify suspects so as to create single version of truth (As discussed in my earlier post) […]
Hi Prashant,
Thank you.
I have read your article. I am very junior and working with the Oracle Data Quality tool.
Please first consider the scenario of my project:
DQ tool: Oracle EDQ
Master Data Hub: Siebel UCM
My question is about the real-time flow: after potential duplicate records are found by the intermediate matching operation, merging takes place between the potential duplicates and the existing master record that Siebel UCM sends to EDQ. Then survivorship rules come into effect to decide which data should be treated as golden data, and the result is transferred back to Siebel UCM as master data.
That means the survivorship rules operate in EDQ. Am I right?
Second question:
Why don’t we remove duplicate records rather than merging them?
What is meant by merging records? Selecting a better record than the existing one?
Suppose the existing record in the DB is:
first name: g
last name: samanta
and the incoming (driving) record from the GUI is:
first name: goutam
last name: samanta
If the survivorship rules come into effect, the second record should be selected as the master record.
Then what will be the status of the first record? Will it be deleted from the system, or will it be reported as an ambiguous record?
Please, please clarify this for me.
Regards,
Goutam samanta
Hi,
I recently joined the MDM practice of a major global consultancy after working in capital markets for half a decade. Can you guide me on how to go about learning the key skills for the functional role? I would mostly be working on MarkIT EDM (Cadis).
FYI – the above is interesting, but realize it is scratching the surface; I feel it both over-complicates and oversimplifies the situation, but then it’s hard to really explain all this in a few paragraphs. You really need a couple-hour presentation on the subject at minimum.
I’m using the Melissa Data DQ tool (way simpler, more sophisticated, and a fraction of the cost of Oracle EDQ) and doing very advanced survivorship at a column-by-column level, and I had a POC working in a few days. But refining it and making it performant enough to do hourly updates/re-matches has taken a couple of months.
Whatever tool you pick, make sure it can support whatever survivorship requirements you have AND your performance requirements! And test, test, test before you buy! Some products in this area are not what they claim (shocking, I know). For some columns I have a combination of multiple criteria for picking, including customer status coming from source systems, last updated (specific to a set of data such as primary addresses, email, or phone), most recent customer transaction, and prioritization of customers and sources, just to name a few.
Hello Prashanth,
Thanks for your post. Our requirement involves OEDQ web services.
DQ Tool: OEDQ
Hub: Oracle CDH
When we do de-duplication using web services, it returns many duplicate records for the same exact-match rules, which is a valid scenario. But we need to limit the count of duplicate records returned in the web service output to 10. How can we achieve this?
Thanks