Wednesday, August 19, 2009

Migrating Data into an MDM Repository - Case Study


Most likely, a Master Data Management (MDM) project will require a data migration (a.k.a. data conversion) step. Simply speaking, data migration is copying data from one repository into another. Technically speaking, it is an Extract-Transform-Load (ETL) exercise.

On an Operational or Enterprise MDM project, chances are you'll end-of-life (EOL) multiple legacy systems after their data is migrated into a single source. If you're implementing a Hub and Spoke architecture, you may be migrating multiple source systems, but not necessarily EOL'ing all of them.

Notice that if you're using Data Federation to implement your MDM solution, there is no data migration. Data Federation acts as a virtual central repository, and as such, does not require a physical copy of your source data. Data Federation "translates" the source information in real time according to required business rules and definitions. It is, so to speak, a real-time Extract-Transform process.

At Sun Microsystems, our MDM implementation utilizes a Hub and Spoke architecture, and we did go through a complex data migration step. This case study describes the challenges and method we used to convert the data while improving the quality at the same time.


At Sun, we have been implementing a phased Operational MDM project. The ultimate goal is to convert multiple data systems into a single Oracle 11i Hub and Spoke architecture. Most of the systems will be EOL'ed after conversion, while others will continue to exist, but dependent on the Hub for Master Data. Multiple data entities are in scope, but for the purposes of this case study, I'll focus on the customer entity.

In this project, the MDM repository for customer data is the Customer Data Hub (CDH), which is part of the Oracle Trading Community Architecture (TCA).

As expected in any Operational MDM implementation, there is a tremendous business impact. It is a very intrusive project, where multiple Lines of Business (LOB) are affected, including changes to a large multitude of existing processes and procedures. A task this big can't be carried on without a massive Change Management Plan emphasizing: readiness assessment; sponsorship; communication; education and training; coaching; measurement systems, rewards and reinforcement.

Challenges to Data Migration

All the changes occasioned by MDM also affect the design of the data migration process. As stated previously, data migration is mostly an ETL exercise. Mapping attributes from the legacy systems into the new one is very challenging for two primary reasons. The first one is structural: tables and attributes of the new system will certainly differ from the old one(s). The second is about semantics: meanings and dependencies might vary widely according to the new business purposes of entities and attributes. The second is usually far more difficult to resolve than the first, and usually requires a lot more data profiling.

One of the lessons learned for us is that we should have done a lot more data profiling than we did. We did a lot, but certainly not enough. While it might not always have changed the data migration step itself, it probably would have avoided some pitfalls and made us better prepare and better train the affected LOB's.

Another interesting aspect of data migration, or more likely from a big project like this, is managing expectations. It is quite amusing to witness multiple LOB's in the organization thinking that after conversion, just because we're moving data to a new system, the data will look near perfect. They forget that years of their own incorrect or non-standardized practice created a large amount of bad quality data. The typical garbage-in/garbage-out (GIGO) attacks again!

Are they off base to expect data should be corrected during conversion? In an ideal situation, they are probably right. But there are several reasons why it is not always possible or desirable to clean the data as you convert.

In our case, we had a highly complex conversion task, combined with a very aggressive schedule. We made a clear decision that we would perform data quality improvements first and foremost to minimize data fallout. Data fallout is simply data that fails to convert because the new system usually imposes a higher degree of completeness, consistency and referential integrity when compared to the legacy data.

In addition to the above fundamental data quality improvement, we would also apply additional ones dependent on justified business needs. Due to tight schedule, we had to be very strict. After all, if LOB's could operate before with the existing quality, or lack thereof, why couldn't they continue the same way in the new system? Obviously I'm being a bit sarcastic, but that is a very valid question, and I used it at times. I was confronted occasionally by users complaining about data issues in the new system, only to show them the issue existed in legacy, and created by their own selves!

Don't get me wrong, as a Data Quality Lead myself, I thrive to improve the data. I pushed very hard at times to do as much as possible to improve the data during conversion so the bad data wouldn't "pollute" the new system. But you have to prioritize.

Above, I mentioned it might not always be possible or desirable to clean the data during conversion. Fitting the possible category are schedule, technical expertise, or feasibility. Less obvious is the desirable category. One might wonder: isn't it always desirable to cleanse the data as early as possible? It depends. The new system might provide a better infrastructure and better tools to achieve the task. Here is an example: the Oracle CDH provides a party merge function. A party could be a company, person, prospect, etc. When merging two parties, the Oracle built-in function will also move all associated entities and transactions from the merged party into the surviving party while maintaining data integrity. Replicating this logic during conversion could be time consuming and redundant effort.

Data Migration Approach

Next picture depicts our overall data migration approach (click on the picture to enlarge).

A staging area is used to store intermediate results before data is transferred to the Customer Data Hub (CDH). We use two primary schemas in staging. One called Staging 1, and the other Staging 2.

The table structure of Staging 1 is very generic, and can easily accommodate the data from the source systems. For that reason, a very small transformation is required to bring data into Staging 1. This is done for a reason. Very few users have access to the Source Systems. It is much less intrusive to bring the data over without much change into the staging first. That way, we don't impact performance of the still operational systems, or risk corrupting any data. Staging 1 becomes the source for data profiling, isolated from the "live" systems.

The table structure of Staging 2 is very similar to the final one. That means the bulk of the transformation happens between Staging 1 and 2 while the transformation between Staging 2 and the CDH is very minimum again.

Next diagram depicts what happens between Staging 1 and 2 (click on the picture to enlarge).

As I mentioned, that's where most of the transformation takes place. That's where we apply Data Quality improvements, such as data cleansing, standardization, consolidation, and enrichment.

DataFlux dfPower Studio is a critical component in the conversion. DataFlux is used to profile the data in Staging 1 and help determine where the attributes fall into the following categories:
A1 - attributes to be migrated
A2 - attributes to be cleansed
A3 - attributes to be consolidated
A4 - attributes to be cleansed and consolidated

Upon such determination, DataFlux is employed to perform data cleansing where needed. Dun & Bradstreet (D&B) is also utilized to enrich the data, and facilitate data de-duplication later.

Once data cleansing is complete, intermediate data is classified as:
B1 - attributes to be migrated
B2 - attributes to be consolidated

Another round of DataFlux is in place to de-duplicate the data. Initially, the DataFlux process is very interactive, with manual intervention, profiling, and validation until a solid set of rules is in place.

The final result is the following set of data, cleansed and consolidated:
C1 - attributes to be migrated


Data migration can be quite complex depending on number of systems to convert, disparity among them, and volume of data.

Decide on a conversion strategy and on a target data quality improvement. From a data quality perspective, prioritize your list and have a clear set of measurable and achievable goals. Be specific and reasonable. A lot of times you have to decide not only what to cleanse but how much to cleanse. For certain data elements, you may have to achieve 99.9% quality, but for others, maybe 80% will suffice. Data quality should always be driven by the business. If it doesn't benefit the business, it has no purpose.

Make sure you manage expectations appropriately and communicate, communicate, and communicate!! Believe me, no matter how much you do, somebody later will say they were not aware.

Last, but not least, profile the data! Understanding your data is critical and will avoid many problems later.