Thursday, November 19, 2009


Apologies for the inconvenience, but my site has moved. All my postings in this blog are in my new site, which I hope will give you a much better experience.

Please visit:


Sunday, November 1, 2009

A well-oiled machine is critical to Data Quality

The beauty of human beings is that people will look for creative ways to solve their problems. That means, when users have technical problems or run into business limitations during data entry, they will find ways to do it, even if it means breaking business rules or overriding well defined processes. From a Data Quality perspective, that is not a good thing, but who is to blame the users? After all, they may be facing a particular customer need that doesn't fit an existing business process, or a system bug that is delaying a high profit transaction.

Let's assume you organization does have all elements in place, such as Data Governance, Data Stewardship, Data Quality, IT support, etc. Users are less likely to engage the proper teams if their confidence in the support process is low. They may think: “oh boy, by the time I get this problem resolved through the proper mechanisms, it will be too long and I'll have a customer satisfaction issue beyond repair.” Therefore, for the “benefit” of the organization, they act with imagination and solve the immediate problem with non-approved solutions. Making matters worse, detecting these out-of-spec practices and associated data issues are sometimes difficult to monitor and correct.

With that said, your goal as an organization should be not only to have the proper elements of a well governed organization, but have them working effectively as well. That comes with maturity, and a constant focus on process improvement. Simply improving your data entry process alone is not enough. You have to improve the support process around it. Just about everything is constantly changing: business needs, business landscape, technology, people, etc. Your only hope is to have an efficiently adaptive model that in spite of all these changes, can continue to deliver results quickly. Let's focus our creativity on this problem, and be really dull when it comes down to being creative breaking business rules!

Tuesday, October 27, 2009

Monday, October 19, 2009

Wild wild MDM...

I am a visual learner, and that ends up transpiring into how I demonstrate my experiences. If you have been reading my postings, I'm sure you noticed I have lots of diagrams.

This visual need has led me to think about a good comic strip for MDM. More recently, inspired by Jim Harris and Phil Simon debate about which board game is the better metaphor for an Information Technology (IT) project, I have “resurrected” my long time desire.

I did come up with an original idea, I think. But obviously, only time will tell if the theme I came up with is any good.

Growing up, I used to read western comics. I don't read them anymore, but I am a long time fan of Tex Willer, which was originally created in Italy. In any event, I can see a good analogy between MDM and the old west. It was a time of political compromise, technological innovation, treaties, and establishment of law and order. Sounds familiar?

With that, I came up with “Wild Wild MDM...”

The characters:

1. Native Americans: the IT department. Granted, the native americans weren't technologically advanced, but I see them similar in their wild spirit, like to “hunt their own food,” value and take care of their resources (land, water, etc vs. hardware, software, etc).

2. Cavalry: the Business. They share a somewhat pompous attitude, highly structured and formal. Don't be offended, please. I am a business person, btw.

3. Sheriff: Data Governance. Obvious analogy, I believe.

4. Ranger: data quality people. The duties of the Rangers consisted of conducting criminal and special investigations, apprehending wanted felons, and suppressing major disturbances. I can see a plausible analogy here. Guns could be data quality tools...

5. Outlaws: bad quality data. Data will assume various forms, which I think is very reasonable since data is indeed very elusive. Bad quality data will be like outlaws at times. Data could also be smoke as shown on one of the cartoons below.

More characters will be added as needed. I'm still evolving them.

With that said, I have to excuse a few things in advance:

1. My drawings are not good. I can't draw real cartoons, so I'll have to use some pre-defined objects, and get them to express my ideas. Please, use your imagination.

2. Incorrect historical facts. This is a very “lose” analogy. I will use fictitious characters/situations combined with real ones, or combine characters/situations from different times. I will not be doing vast research on American history. I'll stick to the fundamental stereotype defined, and sometimes show character relationships that didn't quite exist.

3. Strips will not necessarily be funny. As a matter of fact, most of them probably won't be. They likely will reflect a common situation just to express a message.

4. I may run out of ideas very quickly, which could be a sign my analogy wasn't that good, or that my drawing capabilities (or lack thereof) are preventing me from representing my thoughts.

I have two strips to start. Here they are:

Tuesday, October 13, 2009

Implementing an Analytical MDM


Last week I was at the DataFlux IDEAS conference. Without a doubt, the biggest attraction from a product perspective was Project Unity, which is a joint effort between DataFlux and its parent company SAS. Unity is a next-generation enterprise data management platform encompassing data quality, data integration and master data management (MDM). As a DataFlux user for years, I can certainly say that Unity will vastly improve existing functionality, add outstanding new features, and ameliorate already ease of use interface.

I also had a chance to take a look into their existing qMDM for Customer Data product. qMDM helps you integrate multiple customer data sources into a single structure. Within the resulting centralized repository, pre-defined cleansing/standardization/enrichment rules can be customized as needed to provide the proper level of consolidation. Finally, cluster of customer records can be mapped properly into hierarchies via a web-driven interface for a unified view of customer information.

Too bad qMDM wasn't available a few years ago, when I was working on an Analytical MDM implementation for Customer Data. It would have certainly saved me a lot of time and grief. I did use DataFlux, but had to write logic that is now available out-of-the-box with qMDM.

This posting is not about describing qMDM. The short presentation I saw doesn't qualify me for that. However, I will describe in high level terms what we implemented a few years ago. It should be generic enough that you can adapt to your particular situation, whether you develop everything in-house or use a vendor tool. Remember that even if you use a tool, from DataFlux or not, quite a bit of customization may be required depending on your business needs.


Our objective with this project was to improve Business Intelligence and Analytics as it relates to customer. Needless to say, customer information was the primary driver. As such, we needed a unified view of the customer information throughout the entire organization to achieve the so elusive 360-degree view of the customer.

Customer data exists in multiple system across the enterprise. So does the transaction data associated with them. We had to bring both customer and transaction information into a single repository that we could use. By cleansing, standardizing, enriching and consolidating customer data, we could better make sense of the associated transaction information and use the results for predictive analytics and reporting.

For example, understanding that a particular customer on a given industry sector had purchased a particular hardware configuration with combined storage, software, and services could lead to cross-sell opportunities for other customers in the same industry sector. There wasn't a single system holding all the transaction information associated to a customer, meaning we had to bring data together from multiple systems to make that determination. Furthermore, even when data was all stored on a particular repository, there was a fair amount of customer data duplication, making it difficult to understand all the associations.

In essence, fragmentation and duplication was preventing us from realizing the opportunity we had to really use our data as an asset.


There are 4 main logical components in this project:
1. Data Extract Process
2. Customer Data Cleansing, Standardization, Enrichment and Consolidation
3. Customer Hierarchy Mapping
4. Analytics and Reporting

Activities performed by each component is well defined and can be quite complex. Multiple tools are used, and most of the process is automated. However, some steps require human intervention, and reports and analytics are constantly evolving.

Here is a list of tools/technologies used: Oracle RDMS, Oracle PL/SQL, DataFlux, D&B, Java, CORDA, Oracle ODM, SAS, BRIO. In my diagrams, I have added a logo indicating which technology is used for each activity.

Data Extract Process

Before we can analyze our data, we need to bring it together. This is the first step on a Customer Data Integration (CDI) project, where multiple sources containing disparate information are queried and the relevant information loaded into a single repository.

I like separating the data into two parts: the Customer Master Data or Customer Identity, and the Transaction Data. Master Data is the persistent, non-transactional data defining a business entity for which there should be one consistent and understood view across the organization. That is one of the things we are achieving with this project. By having the identity separated from the transaction, we can use tools and techniques to consolidate it without losing the keys that map the data back to the transaction piece.

In any event, in this first component, we use Oracle stored procedures to extract/transform/load the disparate data from multiple sources into a common structure. For the Customer Identity Data tables, any system specific idiosyncrasies are eliminated during this step. Therefore, following steps are applied independent of the source.

Customer Data Cleansing, Standardization, Enrichment and Consolidation

The ultimate goal in this step is to eliminate duplicates. But the data is dirty and even though they conform to a common structure at this point, they don't necessarily conform to the same standard. Address lines could be mixed, customer names non-complaint, missing information, etc.

We use D&B to enrich data with DUNS number, and standard company and address information. We also use DataFlux to standardize company information, cleanse data fields, right-field address information, etc.

Once data is cleansed and standardized, we use DataFlux and fuzzy matching to consolidate similar records into clusters. This is fully automated, and as such, we want to minimize risk. Our clustering is conservative, purposely leading to false negatives, but avoiding false positives.

Customer Hierarchy Mapping

After data is consolidated, we have a much lower number of records to map into hierarchies. To be precise, the number of records is the same, but since they are in clusters, you only need to map the cluster, and every record under it is automatically mapped.

Customer hierarchy allows for rollups, facilitating the interpretation and reporting of transactions associated to a group of customer records.

Some records are automatically mapped by DataFlux by means of getting grouped into an already mapped cluster. A group of hierarchy managers use a java web-driven application to manually map records into hierarchies.

Analytics and Reporting

This is where everything comes together. We now have hierarchical and consolidated Customer Master Data. We also have transaction information. We can use Analytics and Data Mining tools to combine the two for Business Intelligence. Dashboards using CORDA and ad-hoc reports with BRIO complement the suite.


Pretty easy, isn't it? This is not for the faint-of-heart, but once in operation can reap great benefits. Our “clients” (sales and marketing teams) are extremely satisfied with the results this implementation has provided.

Remember also that the data at the source is dynamic. Our process supports new records as well as changes to existing records. It gets quite complex because you can have an already mapped record modified at the source. There are different implications if the record is the survivor in its cluster or not, if it is mapped to a hierarchy or not, etc. But that deserves a total new posting.

Tuesday, October 6, 2009

Presentation at DataFlux IDEAS conference

If you were at my presentation today at DataFlux IDEAS, and is wondering where to find more info, please check my blog entry: Identifying Duplicate Customer Records - Case Study.

I hope you enjoy the other postings too.

Monday, September 14, 2009

The Data Profile Spectrum

No matter how long you have been working with data quality, or simply data for that matter, you certainly know how important data profiling is. You don't know what you don't know, and data profile helps you bridge the gap.

Data profiling is a critical activity whether you are migrating a data system, integrating a new source into your data warehouse, synchronizing multiple systems, implementing an MDM repository, or just trying to measure and improve the quality of your data.

However, data profile is quite often an unqualified activity. Sometimes that is OK, but sometimes it is not. By “unqualified” I mean not much information or requirements are given about what the data profile is all about. Sometimes that is OK because you have either no knowledge at all or very minimum knowledge about the data you're profiling. But very often, you do know quite a bit already, and maybe you're simply trying to fit your data to a specific set of rules.

Bear with me, but I feel like I need to add one more definition before I continue expressing my point. I keep using the term “knowledge about your data.” What do I mean? There are multiple levels of knowledge in this case. For each data element or combination of data elements, there are so many associated properties. There is data type, data content, data pattern, data association, data conformance, business rules, etc. It could also be about what the data should be and not only about what it is. As you can see, how much you might know could vary a lot.

When you combine the objective of your profile along with how much you know about the data already, you end up with a lot of different combinations. That is why I like to use the term Data Profile Spectrum. And remember, different attributes could be at different parts of the spectrum. No wonder data profiling can be a lot more complex than people give it credit for.

Next picture depicts the Data Profile Spectrum.

Let's first talk about Data Profile Artifacts. By that I mean what is usually provided by a data quality tool, or maybe something you put together yourself. Basically it is what you have to analyze your data, from data completeness to pattern analysis, data distribution, and a lot more. I won't get into a lot of detail about the artifacts. Please refer to Jim Harris' article Adventures in Data Profiling for more on that and some other cool stuff. The only thing I'll point out is notice I used tetrominoes to represent the artifacts. That is just to call attention to the fact that data profile artifacts are pieces that can be applied and/or combined in a variety of ways to accomplish what you need. For example, you may use the data distribution artifact during discovery just to understand what random values you have at what percentage. However, you may use the same artifact on a Country Code field to identify the percentage of valid values. It is the same artifact applied slightly different dependent on where you are in the spectrum.

The Prior Knowledge scale represents how much you already know about what the data is or what the data should be. It is important to grasp where you are in that scale so you know how to apply the right artifacts properly. I mean, why would you need to verify uniqueness when a primary key constraint already exist in the database for that particular field? That is just an example, but hopefully you get the idea.

Another twist is being able to identify where you should be in that scale for a given profiling activity. I can see some eyes rolling, but I'll explain. Here is a real example I faced. We were about to start a data conversion activity. I was asked to “go profile the data to be converted.” My reply was that we needed more information than that. I mean, if we were to convert one system into another, we should have quite a bit of knowledge about the new system, which would drive what and how we profile the old system. This is definitely not a low-end of the scale profile activity in my spectrum.

Interesting enough, my reply wasn't quite well received. I hadn't written this blog entry yet, so this concept wasn't quite formalized in my mind. I was reminded that data profiling should be the first thing to occur, so we could “discover” things about our data. My point was our goal was not to find out information about our data. Our goal was to fit our data into the new system. Doing “primitive” data profiling would be a useless activity. We had to profile our data bounded by the new system. Well, I eventually convinced them, but I wish I had the Data Profile Spectrum handy back then.

In summary, I had a request to do a “No Knowledge” profile, when I should be asked to do something at a higher end of the Data Profile Spectrum. At the time of the request, we didn't know much. One could have thought the request was pertinent when using the Data Profile Spectrum. However, you not only need to consider where you are in the spectrum, but also where you should be. If they don't match, something is missing.

I have several other real examples of data profiling requests, but it is getting pretty late, and I want to post this entry before I go to bed. If you care to read more about them, please let me know.

Wednesday, August 19, 2009

Migrating Data into an MDM Repository - Case Study


Most likely, a Master Data Management (MDM) project will require a data migration (a.k.a. data conversion) step. Simply speaking, data migration is copying data from one repository into another. Technically speaking, it is an Extract-Transform-Load (ETL) exercise.

On an Operational or Enterprise MDM project, chances are you'll end-of-life (EOL) multiple legacy systems after their data is migrated into a single source. If you're implementing a Hub and Spoke architecture, you may be migrating multiple source systems, but not necessarily EOL'ing all of them.

Notice that if you're using Data Federation to implement your MDM solution, there is no data migration. Data Federation acts as a virtual central repository, and as such, does not require a physical copy of your source data. Data Federation "translates" the source information in real time according to required business rules and definitions. It is, so to speak, a real-time Extract-Transform process.

At Sun Microsystems, our MDM implementation utilizes a Hub and Spoke architecture, and we did go through a complex data migration step. This case study describes the challenges and method we used to convert the data while improving the quality at the same time.


At Sun, we have been implementing a phased Operational MDM project. The ultimate goal is to convert multiple data systems into a single Oracle 11i Hub and Spoke architecture. Most of the systems will be EOL'ed after conversion, while others will continue to exist, but dependent on the Hub for Master Data. Multiple data entities are in scope, but for the purposes of this case study, I'll focus on the customer entity.

In this project, the MDM repository for customer data is the Customer Data Hub (CDH), which is part of the Oracle Trading Community Architecture (TCA).

As expected in any Operational MDM implementation, there is a tremendous business impact. It is a very intrusive project, where multiple Lines of Business (LOB) are affected, including changes to a large multitude of existing processes and procedures. A task this big can't be carried on without a massive Change Management Plan emphasizing: readiness assessment; sponsorship; communication; education and training; coaching; measurement systems, rewards and reinforcement.

Challenges to Data Migration

All the changes occasioned by MDM also affect the design of the data migration process. As stated previously, data migration is mostly an ETL exercise. Mapping attributes from the legacy systems into the new one is very challenging for two primary reasons. The first one is structural: tables and attributes of the new system will certainly differ from the old one(s). The second is about semantics: meanings and dependencies might vary widely according to the new business purposes of entities and attributes. The second is usually far more difficult to resolve than the first, and usually requires a lot more data profiling.

One of the lessons learned for us is that we should have done a lot more data profiling than we did. We did a lot, but certainly not enough. While it might not always have changed the data migration step itself, it probably would have avoided some pitfalls and made us better prepare and better train the affected LOB's.

Another interesting aspect of data migration, or more likely from a big project like this, is managing expectations. It is quite amusing to witness multiple LOB's in the organization thinking that after conversion, just because we're moving data to a new system, the data will look near perfect. They forget that years of their own incorrect or non-standardized practice created a large amount of bad quality data. The typical garbage-in/garbage-out (GIGO) attacks again!

Are they off base to expect data should be corrected during conversion? In an ideal situation, they are probably right. But there are several reasons why it is not always possible or desirable to clean the data as you convert.

In our case, we had a highly complex conversion task, combined with a very aggressive schedule. We made a clear decision that we would perform data quality improvements first and foremost to minimize data fallout. Data fallout is simply data that fails to convert because the new system usually imposes a higher degree of completeness, consistency and referential integrity when compared to the legacy data.

In addition to the above fundamental data quality improvement, we would also apply additional ones dependent on justified business needs. Due to tight schedule, we had to be very strict. After all, if LOB's could operate before with the existing quality, or lack thereof, why couldn't they continue the same way in the new system? Obviously I'm being a bit sarcastic, but that is a very valid question, and I used it at times. I was confronted occasionally by users complaining about data issues in the new system, only to show them the issue existed in legacy, and created by their own selves!

Don't get me wrong, as a Data Quality Lead myself, I thrive to improve the data. I pushed very hard at times to do as much as possible to improve the data during conversion so the bad data wouldn't "pollute" the new system. But you have to prioritize.

Above, I mentioned it might not always be possible or desirable to clean the data during conversion. Fitting the possible category are schedule, technical expertise, or feasibility. Less obvious is the desirable category. One might wonder: isn't it always desirable to cleanse the data as early as possible? It depends. The new system might provide a better infrastructure and better tools to achieve the task. Here is an example: the Oracle CDH provides a party merge function. A party could be a company, person, prospect, etc. When merging two parties, the Oracle built-in function will also move all associated entities and transactions from the merged party into the surviving party while maintaining data integrity. Replicating this logic during conversion could be time consuming and redundant effort.

Data Migration Approach

Next picture depicts our overall data migration approach (click on the picture to enlarge).

A staging area is used to store intermediate results before data is transferred to the Customer Data Hub (CDH). We use two primary schemas in staging. One called Staging 1, and the other Staging 2.

The table structure of Staging 1 is very generic, and can easily accommodate the data from the source systems. For that reason, a very small transformation is required to bring data into Staging 1. This is done for a reason. Very few users have access to the Source Systems. It is much less intrusive to bring the data over without much change into the staging first. That way, we don't impact performance of the still operational systems, or risk corrupting any data. Staging 1 becomes the source for data profiling, isolated from the "live" systems.

The table structure of Staging 2 is very similar to the final one. That means the bulk of the transformation happens between Staging 1 and 2 while the transformation between Staging 2 and the CDH is very minimum again.

Next diagram depicts what happens between Staging 1 and 2 (click on the picture to enlarge).

As I mentioned, that's where most of the transformation takes place. That's where we apply Data Quality improvements, such as data cleansing, standardization, consolidation, and enrichment.

DataFlux dfPower Studio is a critical component in the conversion. DataFlux is used to profile the data in Staging 1 and help determine where the attributes fall into the following categories:
A1 - attributes to be migrated
A2 - attributes to be cleansed
A3 - attributes to be consolidated
A4 - attributes to be cleansed and consolidated

Upon such determination, DataFlux is employed to perform data cleansing where needed. Dun & Bradstreet (D&B) is also utilized to enrich the data, and facilitate data de-duplication later.

Once data cleansing is complete, intermediate data is classified as:
B1 - attributes to be migrated
B2 - attributes to be consolidated

Another round of DataFlux is in place to de-duplicate the data. Initially, the DataFlux process is very interactive, with manual intervention, profiling, and validation until a solid set of rules is in place.

The final result is the following set of data, cleansed and consolidated:
C1 - attributes to be migrated


Data migration can be quite complex depending on number of systems to convert, disparity among them, and volume of data.

Decide on a conversion strategy and on a target data quality improvement. From a data quality perspective, prioritize your list and have a clear set of measurable and achievable goals. Be specific and reasonable. A lot of times you have to decide not only what to cleanse but how much to cleanse. For certain data elements, you may have to achieve 99.9% quality, but for others, maybe 80% will suffice. Data quality should always be driven by the business. If it doesn't benefit the business, it has no purpose.

Make sure you manage expectations appropriately and communicate, communicate, and communicate!! Believe me, no matter how much you do, somebody later will say they were not aware.

Last, but not least, profile the data! Understanding your data is critical and will avoid many problems later.

Tuesday, July 14, 2009

Which MDM approach to use: Analytical, Operational, or Enterprise?

In my last posting, Setting the stage for MDM: some definitions, I described the 3 approaches to MDM that can solve for operational, analytical or both operational and analytical aspects of Master Data. They are, respectively, Operational MDM, Analytical MDM, and Enterprise MDM.

But which one is the right one for your organization?

Before I answer that question, let's talk about the major drivers to an MDM initiative. When building a business case for MDM, several experts agree to three major benefits: risk mitigation, cost reduction, and revenue growth (please see references at the end of the posting).

Risk mitigation involves reducing the possibility of non regulatory compliance (financial reports, SOX, etc.); non legal compliance (contracts, compensation, privacy, etc.); lawsuits and litigation; audit findings and loss of certifications; audit and legal costs; and a company's reputation.

Cost reduction involves lowering costs due to inconsistent, inaccurate, and non-timely delivery of data. These costs are: returned mailing and products; shipping fines; invoicing delays; wasted direct marketing; IT (maintenance of redundant systems, data reconciliation, consulting fees, software maintenance fees, etc.); low productivity related to inefficient processes, redundancy, and rework.

Revenue growth are mostly related to strategic objectives, such as marketing campaigns; channel management; value justification; brand identity; demand creation; and merges & acquisitions.

I recommend you first decide around which one(s) of those areas you're going to write your business case. Once you make that decision, you can use the following table as a guideline to what MDM approach you should implement.

But please, read the following “warnings” regarding the ensuing table:
- Enterprise MDM is a combination of Analytical and Operational MDM's. With that said, Enterprise MDM could be used to solve for all cases. My objective is to recommend the minimum necessary to achieve what you need.
- Use this as a general guideline only. I'm sure you can tell there are some overlaps among the three business cases. One could correctly argue that by mitigating certain risks, you could lower the costs of doing business raising questions if it should be put in the cost reduction category. Also, better data could reduce cost and improve marketing and consequently grow revenue. Anyway, please use your discretion.

Please comment if you have used this type of rationale when deciding what MDM approach to use, or if you have a different suggestion.


Fisher, Tony: The Data Asset - How Smart Companies Govern Their Data For Business Success - Wiley & SAS Business Series

Butler, Maureen: Building the Business Case (parts 1, 2, and 3) - Hub Solutions Design

Tuesday, July 7, 2009

Setting the stage for MDM: some definitions

Master Data Management or MDM is everywhere these days. Executives have heard how MDM is going to save their organizations by revolutionizing how companies deal with their data, and making them more agile, competitive, and successful.

I'm not arguing with that at all. I do believe MDM is capable of achieving all what has been said if done correctly. But it seems like there is quite a bit of disparity to what people call MDM. I've seen organizations simply doing a Data Integration project and calling it MDM. Granted, Data Integration is often enough an important step to getting to MDM, but it is not an MDM per se.

With that in mind, I think it is time to set the stage and have some definitions. I say we need to understand the MD (Master Data) part first, before we can define the second M (Management).

Master data is information that is key to operational and analytics/reporting aspects of business. This key business information may include data about the following entities: customers, products, suppliers, partners, employes, materials, etc. Master data is often non-transactional in nature, but it supports transactional processes and operations, as well as business intelligence via analytics and reporting. Master data is normally used by multiple functional groups and stored in disparate systems across the organization. Since it is commonly stored in siloed systems, the possibility for inaccurate and/or duplicate master data exists. Simply put, master data is that persistent, non-transactional data defining a business entity for which there should be an agreed upon view across the organization.

Notice the two distinct aspects of Master Data: operational and analytics. From that definition, one may say that an MDM project not addressing both aspects is not truly MDM. I'm not that extremist. However, I like to use the following terms to distinguish what is being addressed: Operational MDM, Analytical MDM, and Enterprise MDM. These are not new terms – that's right, I won't take credit for them. I have seen white papers using those terms (sorry for not providing appropriate credits, but I really don't have links to those documents anymore). I'm just surprised that those terms are not used more often to help distinguish what MDM approach in being implemented.

Which one should you implement? You guessed it: it depends on what you're trying to accomplish. Historically, Analytical MDM, implemented in Data Warehouses, has been the most common MDM approach adopted by many organizations, mostly due to its low impact to the company's operational systems. This is still where most of master data is managed today. That is not saying it is the right one. As a matter of fact, most would argue that's not the appropriate solution to manage master data. But that is a topic for another posting.

Next diagram depicts the level of intrusiveness of each approach. This picture is not suggesting phases to follow when implementing MDM (another posting).

To complicate matters even more, there are potentially multiple architecture definitions for each of the three approaches. You guessed it again: more postings to come! I'm getting tired, but I hope you're not!!

Friday, June 26, 2009

I'm on Expert Panel - Data Quality PRO

I am now a member of the Expert Panel on Data Quality PRO (

Thanks to founder Dylan Jones for the opportunity. I look forward to contribute to the community by sharing my experiences in Master Data Management (MDM) and Customer Data Integration (CDI).

You can find my profile on Data Quality PRO at:

Tuesday, June 23, 2009

A Quadrant for Data Quality Initiatives

Most of us in Master Data Management (MDM) are very familiar with the Dimensions of Data Quality, which are very well explained by David Loshin in his Master Data Management book, or by Thomas Ravn and Martin Høedholt in their very informative article in the Information Management Magazine (see References at the end of this posting for more details).

From the sources above and others, the data quality dimensions normally include, but are not limited to, the following: Accuracy, Completeness, Consistency, Currency, Referential Integrity, Timeliness, Uniqueness, and Validity. Please refer to the references for definitions, or add a comment if you would like me to expand some more on the subject.

I am a big proponent of Data Quality Dimensions, and have used them extensively to help organize and classify the multitude of metrics I have implemented so far. But I also like to organize my data quality initiatives using a complementary view. I like to call it the Quadrant for Data Quality Initiatives.

Data Quality initiatives normally fall into 2 categories: pro-active or reactive. In general terms, pro-active initiatives are measures you establish to avoid problems from happening, while reactive initiatives are measures you adopt after the problem has already occurred and needs correction.

Either of those initiatives can lead to 2 results. One of the results is of Full Compliance, meaning the entire problematic data set is corrected, and risk of bad quality data left is near zero. The second result is of Partial Compliance, where there is no guarantee the problem is fully fixed.

A classic example of a Pro-active/Full Compliance activity is when we establish referential integrity rules to avoid incorrect data from being added to the system. In my opinion, that is the ideal scenario. If we could only establish those types of rules for every data element, our life would be much easier. Unfortunately, that's not the case. We can't possibly prevent all data errors from happening.

That's when the Quadrant comes handy. You can define your data quality initiatives in the terms describe above, and place them as appropriate in the quadrant. The next diagram shows a possible classification:

Figure 1 - Quadrant with Data Quality Initiatives

The usefulness of the Quadrant comes from the fact that you can easily identify which initiatives require closer attention. If a certain initiative leads to Partial Compliance, that means you need to establish control mechanisms to track the progress and mitigate the impact and risk of the bad data quality still in place. On the other hand, if you know for sure an initiative will achieve Full Compliance, the need for control is lower.

You can also look at the Quadrant from an opposite direction. You can put as much emphasis on a specific initiative as needed to achieve as much compliance as required. In my diagram, you'll notice I have data cleansing as a reactive initiative leading to 2 results: full and partial compliances. I call one of them Data Cleansing and the other High Impact Cleansing Efforts. The difference comes from the desired outcome. If a “perfect” outcome is required, more attention and resources need to be put during the cleansing activity itself. If not, less attention and resources are required, and the monitoring comes from the overall on-going data quality controlling activities.

The next diagram adds the area where more control is needed. By control, I mean data monitors and data quality scorecards. Those controls are established within the parameters of the Data Quality Dimensions, and that is how you can bring both techniques together.

Figure 2 - Quadrant with Data Quality Initiatives and Where to Focus Control

As a final note, some of the initiatives I have above could be different or placed elsewhere in the quadrant based on your environment conditions and requirements.


Loshin, David: Master Data Management – Copyright © 2009 by Elsevier Inc.

Ravn, Thomas; Høedholt, Martin: How to Measure and Monitor the Quality of Master Data – Information Management Magazine, May 2009

Friday, June 19, 2009

MDM Interview with Data Quality PRO

I was recently interviewed by Data Quality PRO regarding our MDM project at Sun, and our Customer Data De-duplication effort.

Please check the article at:

Thursday, June 11, 2009

Identifying Duplicate Customer Records - Case Study

1. Introduction

Identifying duplicate customers is no unique challenge to any company. As a matter of fact, it is one of the most common data quality problems faced by many organizations.

Many data quality vendors offer viable solutions to help identify duplicate customers. However, you can't expect an out-of-the-box solution. That's because there is a business issue intrinsic to the technical problem, which requires a comprehensive evaluation from multiple angles.

In this topic, the People-Process-Technology triangle is as relevant as ever. Having the right people in the right business areas is critical when making a final decision on duplicates, proper course of action, and impact analysis. Technology is key in expediting the data analysis and improving accuracy, scalability, and repeatability. Lastly, defining the most effective process to bring both technology and people together to achieve the desired results efficiently is equally important.

2. Background

At Sun Microsystems, we have been through a massive Master Data Management (MDM) project, consolidating customer data from over 800 disparate legacy applications into a Customer Data Hub (CDH). The ultimate goal is to have a single source of truth to enable a 360 degree view of the customer.

Needless to say, this major effort comes with a price. Technical difficulties are certainly a major challenge. Bringing so many disparate data systems into a common structure is no easy task, but an even more daring challenge is aligning multiple business units into an integrated set of data, processes, standards, policies, and rules. Such subject deserves its own paper. For the purpose of this case study, we will focus specifically on one of the major issues we had by combining multiple legacy systems into one: duplicated data.

Duplicated data was not simply the result of bringing multiple data sources together. Not surprisingly, duplicated data already existed within each of them. The problem simply became magnified and over-exposed because the newly centralized data was available to a much larger audience.

3. About the Tool

Throughout this article, I'll make constant references to DataFlux, which has multiple products. For this particular effort, we have used dfPower Studio. Here is a brief description about dfPower Studio as provided by DataFlux:

"dfPower Studio is the award-winning DataFlux data quality and data integration platform. dfPower Studio offers a unique set of workflow tools built on an industry-leading technology platform that encompasses every facet of the data management process. Through its intuitive interface, dfPower Studio gives business users powerful data improvement capabilities and complete control over data quality and data governance initiatives.

dfPower Studio allows data stewards, business analysts and data owners to build complex data improvement workflows quickly and logically through an innovative job flow builder. These same jobs can then be implemented, in batch or real time, through the DataFlux Integration Server, easily transforming data quality projects into ongoing data governance policies."

dfPower Studio has allowed us to automate several tasks within our complex process. It has given us the ability to quickly and accurately execute what would be, otherwise, very time consuming and labor intensive steps. dfPower Studio is a critical piece in making this process scalable and repeatable. From now on, I'll use DataFlux and dfPower Studio interchangeably.

4. The Methodology

Figure 1 depicts the overall process flow. The 4 main steps in this process are:

- Identify potential duplicates
- Collect detail data for scoring
- Review results and get approvals
- Consider disposition and execute actions

Figure 1 - Methodology

The next sections describe in detail each of the steps above.

4.1 Identify Potential Duplicates

Identifying duplicates can be a very challenging and time consuming effort if you have lots of data. In our case, we have hundreds of thousands of records, and certainly looking through them manually is impossible. We use DataFlux to help with the search in a semi automated fashion.

We started with a list of customers prioritized by the business. Our first pass was using a list of our top couple hundred customers, which represent a sizable part of our revenues. Using DataFlux and their fuzzy matching techniques, we try to find records in our system related to those customers.

Needless to say, the fuzzy matching is not perfect, so we need to manually detect false positives and false negatives. False positives are basically identified matches that are not truly matches, and false negatives are missed matches. We have a group of data analysts that look through the fuzzy matching results and flag the false positives. That is a relatively simple and quick task, since it is mostly quite obvious the records that do not belong. The false negatives can be a little bit more difficult since we need to deal not only with a tool imperfection, but also with business updates. Merges and acquisitions are very frequent. Our challenge is bigger than simply trying to find matching names. We need to find ways to understand the ever changing company structure of our clients.

To overcome this obstacle, we have data analysts using external data references, such as Dunn & Bradstreet, to help model the most recent company structure and its subsidiaries. A list of search strings is created and used by DataFlux to complement the acquisition of initially missed parties.

At the end of this step, we have a fairly complete set of data records within our system identified as potential matches to our original input list.

4.2 Collect Detail Data for Scoring

The overall goal in this step is to score the matching records using a business defined set of rules. The score is based on a multitude of attributes related to that particular customer record, such as number of service requests, service contracts, install base records, sites, tax id, credit limits, and a lot more. This scoring is useful in deciding the "best" records we have that truly represent the relationship with our customers. The advantage of this method is that it is automated with DataFlux.

We start with the list generated from the previous step. Using DataFlux, we query our data source to collect all relevant attributes that are needed in our scoring system. Using the retrieved data, we run our automated scoring algorithm. Records are clustered together by the customer country of incorporation since our ultimate goal is to have a representative set of records based on their legal definitions.

Our scoring varies from 0 to 100. The highest the score, the more complete the data record is according to the business definitions. The record or records with the highest scores are more likely to survive. The records with the lowest scores will either be merged to a surviving record or inactivated if they provide no business value.

At the end of this step, we have a list of potential duplicates along with a score and accompanying detail data for each of them.

4.3 Review Results and Get Approvals

In this step, Data Analysts look through the scoring results provided by DataFlux along with the supporting detail data. The scoring helps analysts make a determination about which record(s) should survive, and which of them should be inactivated or merged.

Some cases are pretty obvious. For example, when a single record has a very high score compared to the others, there is not much of a question about which record should survive. In more complex cases, where the scores are more even, the business analysts may be required to do some additional research.

It is important to notice that we are not necessarily looking for a single record to survive for all cases. First of all, as mentioned earlier, we cluster records per country of incorporation. So, when looking at a particular set of customer records, we are trying to identify the best legal company data we want to maintain. Furthermore, in many cases, multiple legal entities will still exist within a single country of incorporation.

That is why we need a thorough Business Review to make the determination. In this review, several business areas are brought together to approve the recommendations provided by the analysts. Business areas included are Finance, Support, Sales, Marketing, and more. These reviews are led by the Customer Data Governance group, which is acting as a mediator in the process.

This can be a very time consuming effort, and it is, in my opinion, the price to pay for having a Master Data system. Gaining consensus from everyone can be difficult, but in the end, it pays off to have a single source of truth instead of multiple fragmented siloed data systems.

Once agreement is reached, we are ready to move on to the next step.

4.4 Consider Disposition and Execute Actions

Now that the survivors are marked and approved, a DataFlux job creates a final report to be distributed to the field for correction.

In our case, even though we may have identified which records are the survivors and which ones are not, we cannot necessarily merge them immediately. We still have dependencies with some not yet EOL'd legacy applications that prevent us from performing certain tasks. We also have dependencies related to interface systems and spoke applications that use our data. Therefore, some constraints have to be considered.

The DataFlux job evaluates the input data, considers the constraints, and then make a determination on what actions can be taken. A report for execution is generated, which is assigned to the proper teams for execution.

5. Lessons Learned

Data de-duplication is not a perfect science. The reason for duplicates can be many, and can be very difficult to understand. Mistakes will be made, but we must be proactive.

Costs associated to bad data quality are high and, a lot of times, very difficult to measure. Getting executive buy-in is critical. They must be willing to sponsor these type of activities for the long-term achievement of the organization's operational and strategic business objectives.

Bringing multiple business areas together is also very hard and essential. In our case, some groups had to experience some painfully inefficient processes before they could realize the importance of an effort like this. Having a Customer Data Governance team to voice these issues and organize the proper structure for carrying on the process is a must.

The only way to make this process scalable and repeatable is to use a tool like DataFlux. Without DataFlux, it would have been impossible to quickly produce the required results. With a few data analysts and DataFlux, we can process dozens of company data sets in a single day. Otherwise, we would spend days analyzing a single company.

About the Author

Dalton Cervo is the Customer Data Quality Lead at Sun Microsystems. Dalton is part of Customer Data Steward and is a member of the Customer Data Governance team responsible for defining policies and procedures governing the oversight of master customer data.

Dalton has been with Sun for over 8 years. At Sun, Dalton started as an IT Software Engineer developing web applications. Later, he moved into IT Project Management and eventually into the business area as part of the Market and Business Intelligence team.

In the last 3 years, Dalton has led several customer data management efforts, including: data collection, cleansing, standardization, enrichment, and hierarchy classification; master data management conversion, consolidation, standardization, and cleansing; and data quality standards, policies, metrics and monitors.

Dalton is a member of the Customer Advisory Board for DataFlux, and was a speaker at the IDEAS conference in 2008 on the topic of customer data metrics.