1. Introduction
Identifying duplicate customers is no unique challenge to any company. As a matter of fact, it is one of the most common data quality problems faced by many organizations.
Many data quality vendors offer viable solutions to help identify duplicate customers. However, you can't expect an out-of-the-box solution. That's because there is a business issue intrinsic to the technical problem, which requires a comprehensive evaluation from multiple angles.
In this topic, the People-Process-Technology triangle is as relevant as ever. Having the right people in the right business areas is critical when making a final decision on duplicates, proper course of action, and impact analysis. Technology is key in expediting the data analysis and improving accuracy, scalability, and repeatability. Lastly, defining the most effective process to bring both technology and people together to achieve the desired results efficiently is equally important.
2. Background
At Sun Microsystems, we have been through a massive Master Data Management (MDM) project, consolidating customer data from over 800 disparate legacy applications into a Customer Data Hub (CDH). The ultimate goal is to have a single source of truth to enable a 360 degree view of the customer.
Needless to say, this major effort comes with a price. Technical difficulties are certainly a major challenge. Bringing so many disparate data systems into a common structure is no easy task, but an even more daring challenge is aligning multiple business units into an integrated set of data, processes, standards, policies, and rules. Such subject deserves its own paper. For the purpose of this case study, we will focus specifically on one of the major issues we had by combining multiple legacy systems into one: duplicated data.
Duplicated data was not simply the result of bringing multiple data sources together. Not surprisingly, duplicated data already existed within each of them. The problem simply became magnified and over-exposed because the newly centralized data was available to a much larger audience.
3. About the Tool
Throughout this article, I'll make constant references to DataFlux, which has multiple products. For this particular effort, we have used dfPower Studio. Here is a brief description about dfPower Studio as provided by DataFlux:
"dfPower Studio is the award-winning DataFlux data quality and data integration platform. dfPower Studio offers a unique set of workflow tools built on an industry-leading technology platform that encompasses every facet of the data management process. Through its intuitive interface, dfPower Studio gives business users powerful data improvement capabilities and complete control over data quality and data governance initiatives.
dfPower Studio allows data stewards, business analysts and data owners to build complex data improvement workflows quickly and logically through an innovative job flow builder. These same jobs can then be implemented, in batch or real time, through the DataFlux Integration Server, easily transforming data quality projects into ongoing data governance policies."
dfPower Studio has allowed us to automate several tasks within our complex process. It has given us the ability to quickly and accurately execute what would be, otherwise, very time consuming and labor intensive steps. dfPower Studio is a critical piece in making this process scalable and repeatable. From now on, I'll use DataFlux and dfPower Studio interchangeably.
4. The Methodology
Figure 1 depicts the overall process flow. The 4 main steps in this process are:
- Identify potential duplicates
- Collect detail data for scoring
- Review results and get approvals
- Consider disposition and execute actions

Figure 1 - Methodology
The next sections describe in detail each of the steps above.
4.1 Identify Potential Duplicates
Identifying duplicates can be a very challenging and time consuming effort if you have lots of data. In our case, we have hundreds of thousands of records, and certainly looking through them manually is impossible. We use DataFlux to help with the search in a semi automated fashion.
We started with a list of customers prioritized by the business. Our first pass was using a list of our top couple hundred customers, which represent a sizable part of our revenues. Using DataFlux and their fuzzy matching techniques, we try to find records in our system related to those customers.
Needless to say, the fuzzy matching is not perfect, so we need to manually detect false positives and false negatives. False positives are basically identified matches that are not truly matches, and false negatives are missed matches. We have a group of data analysts that look through the fuzzy matching results and flag the false positives. That is a relatively simple and quick task, since it is mostly quite obvious the records that do not belong. The false negatives can be a little bit more difficult since we need to deal not only with a tool imperfection, but also with business updates. Merges and acquisitions are very frequent. Our challenge is bigger than simply trying to find matching names. We need to find ways to understand the ever changing company structure of our clients.
To overcome this obstacle, we have data analysts using external data references, such as Dunn & Bradstreet, to help model the most recent company structure and its subsidiaries. A list of search strings is created and used by DataFlux to complement the acquisition of initially missed parties.
At the end of this step, we have a fairly complete set of data records within our system identified as potential matches to our original input list.
4.2 Collect Detail Data for Scoring
The overall goal in this step is to score the matching records using a business defined set of rules. The score is based on a multitude of attributes related to that particular customer record, such as number of service requests, service contracts, install base records, sites, tax id, credit limits, and a lot more. This scoring is useful in deciding the "best" records we have that truly represent the relationship with our customers. The advantage of this method is that it is automated with DataFlux.
We start with the list generated from the previous step. Using DataFlux, we query our data source to collect all relevant attributes that are needed in our scoring system. Using the retrieved data, we run our automated scoring algorithm. Records are clustered together by the customer country of incorporation since our ultimate goal is to have a representative set of records based on their legal definitions.
Our scoring varies from 0 to 100. The highest the score, the more complete the data record is according to the business definitions. The record or records with the highest scores are more likely to survive. The records with the lowest scores will either be merged to a surviving record or inactivated if they provide no business value.
At the end of this step, we have a list of potential duplicates along with a score and accompanying detail data for each of them.
4.3 Review Results and Get Approvals
In this step, Data Analysts look through the scoring results provided by DataFlux along with the supporting detail data. The scoring helps analysts make a determination about which record(s) should survive, and which of them should be inactivated or merged.
Some cases are pretty obvious. For example, when a single record has a very high score compared to the others, there is not much of a question about which record should survive. In more complex cases, where the scores are more even, the business analysts may be required to do some additional research.
It is important to notice that we are not necessarily looking for a single record to survive for all cases. First of all, as mentioned earlier, we cluster records per country of incorporation. So, when looking at a particular set of customer records, we are trying to identify the best legal company data we want to maintain. Furthermore, in many cases, multiple legal entities will still exist within a single country of incorporation.
That is why we need a thorough Business Review to make the determination. In this review, several business areas are brought together to approve the recommendations provided by the analysts. Business areas included are Finance, Support, Sales, Marketing, and more. These reviews are led by the Customer Data Governance group, which is acting as a mediator in the process.
This can be a very time consuming effort, and it is, in my opinion, the price to pay for having a Master Data system. Gaining consensus from everyone can be difficult, but in the end, it pays off to have a single source of truth instead of multiple fragmented siloed data systems.
Once agreement is reached, we are ready to move on to the next step.
4.4 Consider Disposition and Execute Actions
Now that the survivors are marked and approved, a DataFlux job creates a final report to be distributed to the field for correction.
In our case, even though we may have identified which records are the survivors and which ones are not, we cannot necessarily merge them immediately. We still have dependencies with some not yet EOL'd legacy applications that prevent us from performing certain tasks. We also have dependencies related to interface systems and spoke applications that use our data. Therefore, some constraints have to be considered.
The DataFlux job evaluates the input data, considers the constraints, and then make a determination on what actions can be taken. A report for execution is generated, which is assigned to the proper teams for execution.
5. Lessons Learned
Data de-duplication is not a perfect science. The reason for duplicates can be many, and can be very difficult to understand. Mistakes will be made, but we must be proactive.
Costs associated to bad data quality are high and, a lot of times, very difficult to measure. Getting executive buy-in is critical. They must be willing to sponsor these type of activities for the long-term achievement of the organization's operational and strategic business objectives.
Bringing multiple business areas together is also very hard and essential. In our case, some groups had to experience some painfully inefficient processes before they could realize the importance of an effort like this. Having a Customer Data Governance team to voice these issues and organize the proper structure for carrying on the process is a must.
The only way to make this process scalable and repeatable is to use a tool like DataFlux. Without DataFlux, it would have been impossible to quickly produce the required results. With a few data analysts and DataFlux, we can process dozens of company data sets in a single day. Otherwise, we would spend days analyzing a single company.
About the Author
Dalton Cervo is the Customer Data Quality Lead at Sun Microsystems. Dalton is part of Customer Data Steward and is a member of the Customer Data Governance team responsible for defining policies and procedures governing the oversight of master customer data.
Dalton has been with Sun for over 8 years. At Sun, Dalton started as an IT Software Engineer developing web applications. Later, he moved into IT Project Management and eventually into the business area as part of the Market and Business Intelligence team.
In the last 3 years, Dalton has led several customer data management efforts, including: data collection, cleansing, standardization, enrichment, and hierarchy classification; master data management conversion, consolidation, standardization, and cleansing; and data quality standards, policies, metrics and monitors.
Dalton is a member of the Customer Advisory Board for DataFlux, and was a speaker at the IDEAS conference in 2008 on the topic of customer data metrics.
 
 
 

 
No comments:
Post a Comment