Friday, June 26, 2009

I'm on Expert Panel - Data Quality PRO

I am now a member of the Expert Panel on Data Quality PRO (http://www.dataqualitypro.com/).

Thanks to founder Dylan Jones for the opportunity. I look forward to contributing to the community by sharing my experiences in Master Data Management (MDM) and Customer Data Integration (CDI).

You can find my profile on Data Quality PRO at: http://www.dataqualitypro.com/dalton-cervo/

Tuesday, June 23, 2009

A Quadrant for Data Quality Initiatives


Most of us in Master Data Management (MDM) are very familiar with the Dimensions of Data Quality, which are well explained by David Loshin in his Master Data Management book and by Thomas Ravn and Martin Høedholt in their informative article in Information Management Magazine (see References at the end of this posting for more details).

From the sources above and others, the data quality dimensions normally include, but are not limited to, the following: Accuracy, Completeness, Consistency, Currency, Referential Integrity, Timeliness, Uniqueness, and Validity. Please refer to the references for definitions, or add a comment if you would like me to expand some more on the subject.

I am a big proponent of Data Quality Dimensions, and have used them extensively to help organize and classify the multitude of metrics I have implemented so far. But I also like to organize my data quality initiatives using a complementary view. I like to call it the Quadrant for Data Quality Initiatives.

Data Quality initiatives normally fall into two categories: pro-active or reactive. In general terms, pro-active initiatives are measures you establish to prevent problems from happening, while reactive initiatives are measures you adopt after a problem has already occurred and needs correction.

Either type of initiative can lead to one of two results. The first is Full Compliance, meaning the entire problematic data set is corrected and the remaining risk of bad quality data is near zero. The second is Partial Compliance, where there is no guarantee the problem is fully fixed.

A classic example of a Pro-active/Full Compliance activity is establishing referential integrity rules to prevent incorrect data from being added to the system. In my opinion, that is the ideal scenario. If we could establish those types of rules for every data element, our lives would be much easier. Unfortunately, that's not the case. We can't possibly prevent all data errors from happening.

That's when the Quadrant comes in handy. You can define your data quality initiatives in the terms described above, and place them as appropriate in the quadrant. The next diagram shows a possible classification:


Figure 1 - Quadrant with Data Quality Initiatives

The usefulness of the Quadrant comes from the fact that you can easily identify which initiatives require closer attention. If a certain initiative leads to Partial Compliance, you need to establish control mechanisms to track progress and mitigate the impact and risk of the bad quality data still in place. On the other hand, if you know for sure an initiative will achieve Full Compliance, the need for control is lower.
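To make the idea a bit more concrete, here is a minimal sketch in Python of how such a classification could be represented. The initiative names and their placements are only illustrative (they mirror Figure 1, plus one hypothetical entry), not a prescription.

from enum import Enum

class Approach(Enum):
    PROACTIVE = "pro-active"
    REACTIVE = "reactive"

class Compliance(Enum):
    FULL = "full"        # problematic data set fully corrected; residual risk near zero
    PARTIAL = "partial"  # no guarantee the problem is fully fixed

# Illustrative placements only (see Figure 1).
quadrant = {
    "Referential integrity rules":    (Approach.PROACTIVE, Compliance.FULL),
    "Data entry validation/training": (Approach.PROACTIVE, Compliance.PARTIAL),  # hypothetical example
    "High Impact Cleansing Efforts":  (Approach.REACTIVE,  Compliance.FULL),
    "Data Cleansing":                 (Approach.REACTIVE,  Compliance.PARTIAL),
}

# Initiatives that only reach Partial Compliance are the ones that need controls.
needs_control = [name for name, (_, compliance) in quadrant.items()
                 if compliance is Compliance.PARTIAL]
print(needs_control)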

You can also look at the Quadrant from the opposite direction: you can put as much emphasis on a specific initiative as needed to achieve as much compliance as required. In my diagram, you'll notice I have data cleansing as a reactive initiative leading to two results: full and partial compliance. I call one of them Data Cleansing and the other High Impact Cleansing Efforts. The difference comes from the desired outcome. If a “perfect” outcome is required, more attention and resources need to be invested in the cleansing activity itself. If not, less attention and resources are required, and the monitoring comes from the overall ongoing data quality control activities.

The next diagram adds the area where more control is needed. By control, I mean data monitors and data quality scorecards. Those controls are established within the parameters of the Data Quality Dimensions, and that is how you can bring both techniques together.


Figure 2 - Quadrant with Data Quality Initiatives and Where to Focus Control
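To give a rough idea of what such a control can look like, here is a minimal sketch of two dimension metrics (Completeness and Uniqueness) rolled into a simple scorecard. It assumes customer records as plain Python dictionaries with made-up field names; it is not tied to any particular tool.

def completeness(records, field):
    # Completeness dimension: share of records where the field is populated.
    filled = sum(1 for r in records if r.get(field))
    return filled / len(records) if records else 1.0

def uniqueness(records, field):
    # Uniqueness dimension: share of populated values that are distinct.
    values = [r[field] for r in records if r.get(field)]
    return len(set(values)) / len(values) if values else 1.0

def scorecard(records):
    # A tiny scorecard over hypothetical fields; real monitors would cover many more.
    return {
        "tax_id completeness":  completeness(records, "tax_id"),
        "tax_id uniqueness":    uniqueness(records, "tax_id"),
        "country completeness": completeness(records, "country"),
    }

records = [
    {"tax_id": "123", "country": "US"},
    {"tax_id": "123", "country": "US"},   # duplicate tax id drags uniqueness down
    {"tax_id": None,  "country": "BR"},   # missing tax id drags completeness down
]
print(scorecard(records))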


As a final note, some of the initiatives above could be defined differently or placed elsewhere in the quadrant, depending on your environment and requirements.



References:

Loshin, David: Master Data Management – Copyright © 2009 by Elsevier Inc.

Ravn, Thomas; Høedholt, Martin: How to Measure and Monitor the Quality of Master Data – Information Management Magazine, May 2009

Friday, June 19, 2009

MDM Interview with Data Quality PRO

I was recently interviewed by Data Quality PRO regarding our MDM project at Sun, and our Customer Data De-duplication effort.

Please check the article at:

http://www.dataqualitypro.com/data-quality-home/identifying-duplicate-customer-records-an-mdm-case-study-wit.html

Thursday, June 11, 2009

Identifying Duplicate Customer Records - Case Study


1. Introduction


Identifying duplicate customers is not a challenge unique to any one company. As a matter of fact, it is one of the most common data quality problems organizations face.

Many data quality vendors offer viable solutions to help identify duplicate customers. However, you can't expect an out-of-the-box solution. That's because there is a business issue intrinsic to the technical problem, which requires a comprehensive evaluation from multiple angles.

In this topic, the People-Process-Technology triangle is as relevant as ever. Having the right people in the right business areas is critical when making a final decision on duplicates, proper course of action, and impact analysis. Technology is key in expediting the data analysis and improving accuracy, scalability, and repeatability. Lastly, defining the most effective process to bring both technology and people together to achieve the desired results efficiently is equally important.


2. Background

At Sun Microsystems, we have been through a massive Master Data Management (MDM) project, consolidating customer data from over 800 disparate legacy applications into a Customer Data Hub (CDH). The ultimate goal is to have a single source of truth that enables a 360-degree view of the customer.

Needless to say, this major effort comes with a price. Technical difficulties are certainly a major challenge. Bringing so many disparate data systems into a common structure is no easy task, but an even more daunting challenge is aligning multiple business units around an integrated set of data, processes, standards, policies, and rules. That subject deserves its own paper. For the purposes of this case study, we will focus specifically on one of the major issues we faced when combining multiple legacy systems into one: duplicated data.

Duplicated data was not simply the result of bringing multiple data sources together. Not surprisingly, duplicated data already existed within each of them. The problem simply became magnified and over-exposed because the newly centralized data was available to a much larger audience.


3. About the Tool

Throughout this article, I'll make constant references to DataFlux, which has multiple products. For this particular effort, we used dfPower Studio. Here is a brief description of dfPower Studio as provided by DataFlux:

"dfPower Studio is the award-winning DataFlux data quality and data integration platform. dfPower Studio offers a unique set of workflow tools built on an industry-leading technology platform that encompasses every facet of the data management process. Through its intuitive interface, dfPower Studio gives business users powerful data improvement capabilities and complete control over data quality and data governance initiatives.

dfPower Studio allows data stewards, business analysts and data owners to build complex data improvement workflows quickly and logically through an innovative job flow builder. These same jobs can then be implemented, in batch or real time, through the DataFlux Integration Server, easily transforming data quality projects into ongoing data governance policies."

dfPower Studio has allowed us to automate several tasks within our complex process. It has given us the ability to quickly and accurately execute what would otherwise be very time-consuming and labor-intensive steps. dfPower Studio is a critical piece in making this process scalable and repeatable. From now on, I'll use DataFlux and dfPower Studio interchangeably.


4. The Methodology

Figure 1 depicts the overall process flow. The four main steps in this process are:

- Identify potential duplicates
- Collect detail data for scoring
- Review results and get approvals
- Consider disposition and execute actions



Figure 1 - Methodology

The next sections describe in detail each of the steps above.
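Conceptually, the flow can be thought of as a simple four-stage pipeline. The skeleton below is only meant to show the sequencing; in reality each stage is a mix of DataFlux jobs and analyst work, and the function names are hypothetical.

def identify_potential_duplicates(priority_customers):
    """Step 4.1: fuzzy match the prioritized customer list against the hub."""
    ...

def collect_detail_data(candidates):
    """Step 4.2: pull related attributes and compute survivorship scores."""
    ...

def review_and_approve(scored_candidates):
    """Step 4.3: analyst review followed by business sign-off."""
    ...

def consider_disposition(approved_candidates):
    """Step 4.4: apply constraints and generate the execution report."""
    ...

def run(priority_customers):
    candidates = identify_potential_duplicates(priority_customers)
    scored = collect_detail_data(candidates)
    approved = review_and_approve(scored)
    return consider_disposition(approved)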


4.1 Identify Potential Duplicates

Identifying duplicates can be a very challenging and time-consuming effort if you have lots of data. In our case, we have hundreds of thousands of records, and looking through them manually is certainly impossible. We use DataFlux to help with the search in a semi-automated fashion.

We started with a list of customers prioritized by the business. Our first pass used a list of our top couple hundred customers, which represent a sizable part of our revenue. Using DataFlux and its fuzzy matching techniques, we try to find records in our system related to those customers.
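DataFlux does the heavy lifting here, and I won't try to reproduce how its matching works. Purely to illustrate the idea, a crude name-similarity pass could be sketched with Python's standard difflib module, as below; the normalization rules and the 0.85 threshold are assumptions for the example, not what DataFlux actually does.

from difflib import SequenceMatcher

def normalize(name):
    # Crude normalization (an assumption for this sketch): lowercase,
    # drop punctuation and a few common legal suffixes.
    name = name.lower().replace(".", "").replace(",", "")
    for suffix in (" inc", " corp", " ltd", " llc"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name.strip()

def find_candidates(target, records, threshold=0.85):
    # Return records whose normalized name is close enough to the target customer.
    matches = []
    for record in records:
        score = SequenceMatcher(None, normalize(target), normalize(record["name"])).ratio()
        if score >= threshold:
            matches.append((record, score))
    return sorted(matches, key=lambda m: m[1], reverse=True)

records = [
    {"id": 1, "name": "Acme Corp."},
    {"id": 2, "name": "ACME Corporation"},
    {"id": 3, "name": "Acme Networks Inc."},
]
print(find_candidates("Acme Corp", records))

In this toy run, only "Acme Corp." comes back; "ACME Corporation" is missed at that threshold, which is exactly the kind of imperfection the next paragraph talks about.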

Needless to say, the fuzzy matching is not perfect, so we need to manually detect false positives and false negatives. False positives are identified matches that are not true matches, and false negatives are missed matches. We have a group of data analysts who look through the fuzzy matching results and flag the false positives. That is a relatively simple and quick task, since it is usually quite obvious which records do not belong. The false negatives can be a bit more difficult, since we need to deal not only with the tool's imperfections, but also with business changes. Mergers and acquisitions are very frequent. Our challenge is bigger than simply trying to find matching names: we need to find ways to understand the ever-changing company structures of our clients.

To overcome this obstacle, we have data analysts use external data references, such as Dun & Bradstreet, to help model the most recent company structure and its subsidiaries. A list of search strings is created and used by DataFlux to pick up the parties missed in the initial pass.

At the end of this step, we have a fairly complete set of data records within our system identified as potential matches to our original input list.


4.2 Collect Detail Data for Scoring

The overall goal in this step is to score the matching records using a business-defined set of rules. The score is based on a multitude of attributes related to that particular customer record, such as number of service requests, service contracts, install base records, sites, tax id, credit limits, and more. This scoring is useful in deciding which records best represent the relationship with our customers. The advantage of this method is that it is automated with DataFlux.

We start with the list generated in the previous step. Using DataFlux, we query our data source to collect all the relevant attributes needed by our scoring system. Using the retrieved data, we run our automated scoring algorithm. Records are clustered together by the customer's country of incorporation, since our ultimate goal is to have a representative set of records based on their legal definitions.

Our scores range from 0 to 100. The higher the score, the more complete the data record is according to the business definitions. The record or records with the highest scores are more likely to survive. The records with the lowest scores will either be merged into a surviving record or inactivated if they provide no business value.
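The actual scoring rules are business-defined and run inside DataFlux, so the sketch below is only meant to show the general shape: a weighted sum over attribute checks, capped at 100, with candidates ranked within each country cluster. The weights and attribute names are hypothetical.

# Hypothetical weights; the real rules are defined by the business.
WEIGHTS = {
    "service_requests": 20,   # record has associated service requests
    "service_contracts": 20,  # record has service contracts
    "install_base": 20,       # record has install base entries
    "sites": 15,              # record has site/address data
    "tax_id": 15,             # tax id is populated
    "credit_limit": 10,       # credit limit is populated
}

def score(record):
    # Weighted sum over attributes present on the record (0 to 100).
    return sum(weight for attr, weight in WEIGHTS.items() if record.get(attr))

def rank_by_country(records):
    # Cluster by country of incorporation, then rank candidates within each cluster.
    clusters = {}
    for record in records:
        clusters.setdefault(record["country_of_incorporation"], []).append(record)
    return {country: sorted(members, key=score, reverse=True)
            for country, members in clusters.items()}

The top-scoring record in each country cluster becomes the leading survivor candidate; low scorers become candidates for merge or inactivation, as described above.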

At the end of this step, we have a list of potential duplicates along with a score and accompanying detail data for each of them.


4.3 Review Results and Get Approvals

In this step, Data Analysts look through the scoring results provided by DataFlux along with the supporting detail data. The scoring helps analysts make a determination about which record(s) should survive, and which of them should be inactivated or merged.

Some cases are pretty obvious. For example, when a single record has a very high score compared to the others, there is not much question about which record should survive. In more complex cases, where the scores are more even, the business analysts may need to do some additional research.
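That kind of triage can be expressed as a simple rule of thumb. The sketch below is only an illustration; the 25-point margin is a made-up threshold, not an actual business rule we used.

def triage(scored_cluster, margin=25):
    # scored_cluster: records in one country cluster, each with a precomputed "score".
    ranked = sorted(scored_cluster, key=lambda r: r["score"], reverse=True)
    if len(ranked) == 1 or ranked[0]["score"] - ranked[1]["score"] >= margin:
        return "obvious survivor", ranked[0]
    return "needs additional research", None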

It is important to note that we are not necessarily looking for a single record to survive in every case. First of all, as mentioned earlier, we cluster records by country of incorporation. So, when looking at a particular set of customer records, we are trying to identify the best legal company data we want to maintain. Furthermore, in many cases, multiple legal entities will still exist within a single country of incorporation.

That is why we need a thorough Business Review to make the determination. In this review, several business areas are brought together to approve the recommendations provided by the analysts. The business areas included are Finance, Support, Sales, Marketing, and more. These reviews are led by the Customer Data Governance group, which acts as a mediator in the process.

This can be a very time-consuming effort, and it is, in my opinion, the price to pay for having a Master Data system. Gaining consensus from everyone can be difficult, but in the end, it pays off to have a single source of truth instead of multiple fragmented, siloed data systems.

Once agreement is reached, we are ready to move on to the next step.


4.4 Consider Disposition and Execute Actions

Now that the survivors are marked and approved, a DataFlux job creates a final report to be distributed to the field for correction.

In our case, even though we may have identified which records are the survivors and which ones are not, we cannot necessarily merge them immediately. We still have dependencies on legacy applications that have not yet been EOL'd, which prevent us from performing certain tasks. We also have dependencies related to interface systems and spoke applications that use our data. Therefore, some constraints have to be considered.

The DataFlux job evaluates the input data, considers the constraints, and then makes a determination on what actions can be taken. An execution report is generated and assigned to the proper teams to carry out.
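In rough terms, the disposition logic amounts to something like the sketch below; the constraint names are placeholders for our actual legacy and spoke-application dependencies, which I'm not listing here.

def disposition(record, constraints):
    # constraints: set of dependency flags attached to the record (placeholder names).
    if "legacy_app_dependency" in constraints or "spoke_app_dependency" in constraints:
        return "defer"  # cannot act yet; keep on the report for later execution
    if record["survivor"]:
        return "keep"
    return "merge" if record["has_business_value"] else "inactivate"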


5. Lessons Learned

Data de-duplication is not a perfect science. The reasons for duplicates can be many, and they can be very difficult to understand. Mistakes will be made, but we must be proactive.

Costs associated with bad data quality are high and, a lot of times, very difficult to measure. Getting executive buy-in is critical. Executives must be willing to sponsor these types of activities for the long-term achievement of the organization's operational and strategic business objectives.

Bringing multiple business areas together is also very hard, but essential. In our case, some groups had to experience some painfully inefficient processes before they realized the importance of an effort like this. Having a Customer Data Governance team to voice these issues and organize the proper structure for carrying out the process is a must.

The only way to make this process scalable and repeatable is to use a tool like DataFlux. Without DataFlux, it would have been impossible to quickly produce the required results. With a few data analysts and DataFlux, we can process dozens of company data sets in a single day. Otherwise, we would spend days analyzing a single company.



About the Author

Dalton Cervo is the Customer Data Quality Lead at Sun Microsystems. Dalton is part of the Customer Data Steward group and a member of the Customer Data Governance team responsible for defining policies and procedures governing the oversight of master customer data.

Dalton has been with Sun for over 8 years. At Sun, Dalton started as an IT Software Engineer developing web applications. Later, he moved into IT Project Management and eventually into the business area as part of the Market and Business Intelligence team.

In the last 3 years, Dalton has led several customer data management efforts, including: data collection, cleansing, standardization, enrichment, and hierarchy classification; master data management conversion, consolidation, standardization, and cleansing; and data quality standards, policies, metrics and monitors.

Dalton is a member of the Customer Advisory Board for DataFlux, and was a speaker at the IDEAS conference in 2008 on the topic of customer data metrics.