Tuesday, October 27, 2009

Monday, October 19, 2009

Wild wild MDM...

I am a visual learner, and that ends up transpiring into how I demonstrate my experiences. If you have been reading my postings, I'm sure you noticed I have lots of diagrams.

This visual need has led me to think about a good comic strip for MDM. More recently, inspired by Jim Harris and Phil Simon debate about which board game is the better metaphor for an Information Technology (IT) project, I have “resurrected” my long time desire.

I did come up with an original idea, I think. But obviously, only time will tell if the theme I came up with is any good.

Growing up, I used to read western comics. I don't read them anymore, but I am a long time fan of Tex Willer, which was originally created in Italy. In any event, I can see a good analogy between MDM and the old west. It was a time of political compromise, technological innovation, treaties, and establishment of law and order. Sounds familiar?

With that, I came up with “Wild Wild MDM...”

The characters:

1. Native Americans: the IT department. Granted, the native americans weren't technologically advanced, but I see them similar in their wild spirit, like to “hunt their own food,” value and take care of their resources (land, water, etc vs. hardware, software, etc).

2. Cavalry: the Business. They share a somewhat pompous attitude, highly structured and formal. Don't be offended, please. I am a business person, btw.

3. Sheriff: Data Governance. Obvious analogy, I believe.

4. Ranger: data quality people. The duties of the Rangers consisted of conducting criminal and special investigations, apprehending wanted felons, and suppressing major disturbances. I can see a plausible analogy here. Guns could be data quality tools...

5. Outlaws: bad quality data. Data will assume various forms, which I think is very reasonable since data is indeed very elusive. Bad quality data will be like outlaws at times. Data could also be smoke as shown on one of the cartoons below.

More characters will be added as needed. I'm still evolving them.

With that said, I have to excuse a few things in advance:

1. My drawings are not good. I can't draw real cartoons, so I'll have to use some pre-defined objects, and get them to express my ideas. Please, use your imagination.

2. Incorrect historical facts. This is a very “lose” analogy. I will use fictitious characters/situations combined with real ones, or combine characters/situations from different times. I will not be doing vast research on American history. I'll stick to the fundamental stereotype defined, and sometimes show character relationships that didn't quite exist.

3. Strips will not necessarily be funny. As a matter of fact, most of them probably won't be. They likely will reflect a common situation just to express a message.

4. I may run out of ideas very quickly, which could be a sign my analogy wasn't that good, or that my drawing capabilities (or lack thereof) are preventing me from representing my thoughts.

I have two strips to start. Here they are:

Tuesday, October 13, 2009

Implementing an Analytical MDM


Last week I was at the DataFlux IDEAS conference. Without a doubt, the biggest attraction from a product perspective was Project Unity, which is a joint effort between DataFlux and its parent company SAS. Unity is a next-generation enterprise data management platform encompassing data quality, data integration and master data management (MDM). As a DataFlux user for years, I can certainly say that Unity will vastly improve existing functionality, add outstanding new features, and ameliorate already ease of use interface.

I also had a chance to take a look into their existing qMDM for Customer Data product. qMDM helps you integrate multiple customer data sources into a single structure. Within the resulting centralized repository, pre-defined cleansing/standardization/enrichment rules can be customized as needed to provide the proper level of consolidation. Finally, cluster of customer records can be mapped properly into hierarchies via a web-driven interface for a unified view of customer information.

Too bad qMDM wasn't available a few years ago, when I was working on an Analytical MDM implementation for Customer Data. It would have certainly saved me a lot of time and grief. I did use DataFlux, but had to write logic that is now available out-of-the-box with qMDM.

This posting is not about describing qMDM. The short presentation I saw doesn't qualify me for that. However, I will describe in high level terms what we implemented a few years ago. It should be generic enough that you can adapt to your particular situation, whether you develop everything in-house or use a vendor tool. Remember that even if you use a tool, from DataFlux or not, quite a bit of customization may be required depending on your business needs.


Our objective with this project was to improve Business Intelligence and Analytics as it relates to customer. Needless to say, customer information was the primary driver. As such, we needed a unified view of the customer information throughout the entire organization to achieve the so elusive 360-degree view of the customer.

Customer data exists in multiple system across the enterprise. So does the transaction data associated with them. We had to bring both customer and transaction information into a single repository that we could use. By cleansing, standardizing, enriching and consolidating customer data, we could better make sense of the associated transaction information and use the results for predictive analytics and reporting.

For example, understanding that a particular customer on a given industry sector had purchased a particular hardware configuration with combined storage, software, and services could lead to cross-sell opportunities for other customers in the same industry sector. There wasn't a single system holding all the transaction information associated to a customer, meaning we had to bring data together from multiple systems to make that determination. Furthermore, even when data was all stored on a particular repository, there was a fair amount of customer data duplication, making it difficult to understand all the associations.

In essence, fragmentation and duplication was preventing us from realizing the opportunity we had to really use our data as an asset.


There are 4 main logical components in this project:
1. Data Extract Process
2. Customer Data Cleansing, Standardization, Enrichment and Consolidation
3. Customer Hierarchy Mapping
4. Analytics and Reporting

Activities performed by each component is well defined and can be quite complex. Multiple tools are used, and most of the process is automated. However, some steps require human intervention, and reports and analytics are constantly evolving.

Here is a list of tools/technologies used: Oracle RDMS, Oracle PL/SQL, DataFlux, D&B, Java, CORDA, Oracle ODM, SAS, BRIO. In my diagrams, I have added a logo indicating which technology is used for each activity.

Data Extract Process

Before we can analyze our data, we need to bring it together. This is the first step on a Customer Data Integration (CDI) project, where multiple sources containing disparate information are queried and the relevant information loaded into a single repository.

I like separating the data into two parts: the Customer Master Data or Customer Identity, and the Transaction Data. Master Data is the persistent, non-transactional data defining a business entity for which there should be one consistent and understood view across the organization. That is one of the things we are achieving with this project. By having the identity separated from the transaction, we can use tools and techniques to consolidate it without losing the keys that map the data back to the transaction piece.

In any event, in this first component, we use Oracle stored procedures to extract/transform/load the disparate data from multiple sources into a common structure. For the Customer Identity Data tables, any system specific idiosyncrasies are eliminated during this step. Therefore, following steps are applied independent of the source.

Customer Data Cleansing, Standardization, Enrichment and Consolidation

The ultimate goal in this step is to eliminate duplicates. But the data is dirty and even though they conform to a common structure at this point, they don't necessarily conform to the same standard. Address lines could be mixed, customer names non-complaint, missing information, etc.

We use D&B to enrich data with DUNS number, and standard company and address information. We also use DataFlux to standardize company information, cleanse data fields, right-field address information, etc.

Once data is cleansed and standardized, we use DataFlux and fuzzy matching to consolidate similar records into clusters. This is fully automated, and as such, we want to minimize risk. Our clustering is conservative, purposely leading to false negatives, but avoiding false positives.

Customer Hierarchy Mapping

After data is consolidated, we have a much lower number of records to map into hierarchies. To be precise, the number of records is the same, but since they are in clusters, you only need to map the cluster, and every record under it is automatically mapped.

Customer hierarchy allows for rollups, facilitating the interpretation and reporting of transactions associated to a group of customer records.

Some records are automatically mapped by DataFlux by means of getting grouped into an already mapped cluster. A group of hierarchy managers use a java web-driven application to manually map records into hierarchies.

Analytics and Reporting

This is where everything comes together. We now have hierarchical and consolidated Customer Master Data. We also have transaction information. We can use Analytics and Data Mining tools to combine the two for Business Intelligence. Dashboards using CORDA and ad-hoc reports with BRIO complement the suite.


Pretty easy, isn't it? This is not for the faint-of-heart, but once in operation can reap great benefits. Our “clients” (sales and marketing teams) are extremely satisfied with the results this implementation has provided.

Remember also that the data at the source is dynamic. Our process supports new records as well as changes to existing records. It gets quite complex because you can have an already mapped record modified at the source. There are different implications if the record is the survivor in its cluster or not, if it is mapped to a hierarchy or not, etc. But that deserves a total new posting.

Tuesday, October 6, 2009

Presentation at DataFlux IDEAS conference

If you were at my presentation today at DataFlux IDEAS, and is wondering where to find more info, please check my blog entry: Identifying Duplicate Customer Records - Case Study.

I hope you enjoy the other postings too.