Wednesday, August 19, 2009

Migrating Data into an MDM Repository - Case Study

Introduction

Most likely, a Master Data Management (MDM) project will require a data migration (a.k.a. data conversion) step. Simply speaking, data migration is copying data from one repository into another. Technically speaking, it is an Extract-Transform-Load (ETL) exercise.

On an Operational or Enterprise MDM project, chances are you'll end-of-life (EOL) multiple legacy systems after their data is migrated into a single source. If you're implementing a Hub and Spoke architecture, you may be migrating multiple source systems, but not necessarily EOL'ing all of them.

Notice that if you're using Data Federation to implement your MDM solution, there is no data migration. Data Federation acts as a virtual central repository, and as such, does not require a physical copy of your source data. Data Federation "translates" the source information in real time according to required business rules and definitions. It is, so to speak, a real-time Extract-Transform process.

At Sun Microsystems, our MDM implementation utilizes a Hub and Spoke architecture, and we did go through a complex data migration step. This case study describes the challenges and method we used to convert the data while improving the quality at the same time.


Background

At Sun, we have been implementing a phased Operational MDM project. The ultimate goal is to consolidate multiple data systems into a single hub built on Oracle 11i, following a Hub and Spoke architecture. Most of the systems will be EOL'ed after conversion, while others will continue to exist but will depend on the Hub for Master Data. Multiple data entities are in scope, but for the purposes of this case study, I'll focus on the customer entity.

In this project, the MDM repository for customer data is the Customer Data Hub (CDH), which is part of the Oracle Trading Community Architecture (TCA).

As expected in any Operational MDM implementation, there is a tremendous business impact. It is a very intrusive project, where multiple Lines of Business (LOB) are affected, including changes to a large number of existing processes and procedures. A task this big can't be carried out without a massive Change Management Plan emphasizing: readiness assessment; sponsorship; communication; education and training; coaching; measurement systems, rewards and reinforcement.


Challenges to Data Migration

All the changes occasioned by MDM also affect the design of the data migration process. As stated previously, data migration is mostly an ETL exercise. Mapping attributes from the legacy systems into the new one is very challenging for two primary reasons. The first is structural: tables and attributes of the new system will certainly differ from the old one(s). The second is semantic: meanings and dependencies might vary widely according to the new business purposes of entities and attributes. The second is usually far more difficult to resolve than the first, and typically requires a lot more data profiling.
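
The two mapping problems can be sketched as follows. This is a minimal illustration, not our actual mapping: the column names and legacy codes are made up. The structural part is a rename from legacy column to target column; the semantic part re-codes values whose meaning differs between the systems.

```python
# Structural mapping: legacy column -> target column (names hypothetical).
COLUMN_MAP = {"cust_type": "party_type"}

# Semantic mapping: legacy codes re-expressed in the target vocabulary.
VALUE_MAP = {"party_type": {"C": "ORGANIZATION", "P": "PERSON"}}

def map_record(legacy):
    """Translate one legacy record into the target layout and vocabulary."""
    target = {}
    for old_col, new_col in COLUMN_MAP.items():
        value = legacy.get(old_col)
        # Re-code the value if a semantic mapping exists; otherwise pass it through.
        target[new_col] = VALUE_MAP.get(new_col, {}).get(value, value)
    return target
```

In practice the semantic side is where the profiling effort goes: you cannot build `VALUE_MAP` until you know every code the legacy system actually contains.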

One of the lessons we learned is that we should have done a lot more data profiling than we did. We did a lot, but certainly not enough. While it might not always have changed the data migration step itself, it probably would have avoided some pitfalls and helped us better prepare and train the affected LOB's.

Another interesting aspect of data migration, or rather of any big project like this, is managing expectations. It is quite amusing to witness multiple LOB's in the organization thinking that, just because we're moving data to a new system, the data will look near perfect after conversion. They forget that years of their own incorrect or non-standardized practices created a large amount of bad-quality data. The typical garbage-in/garbage-out (GIGO) strikes again!

Are they off base to expect data should be corrected during conversion? In an ideal situation, they are probably right. But there are several reasons why it is not always possible or desirable to clean the data as you convert.

In our case, we had a highly complex conversion task, combined with a very aggressive schedule. We made a clear decision that we would perform data quality improvements first and foremost to minimize data fallout. Data fallout is simply data that fails to convert because the new system usually imposes a higher degree of completeness, consistency and referential integrity when compared to the legacy data.
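
A pre-load check of this kind can be sketched as below. The rules and field names are hypothetical, not Sun's actual conversion rules; the point is simply that running the target system's completeness, consistency, and referential-integrity rules against staged data predicts fallout before the load is attempted.

```python
VALID_COUNTRIES = {"US", "GB", "DE", "JP"}  # illustrative reference set

def predict_fallout(customers, known_account_ids):
    """Split staged records into (loadable, fallout) before attempting the load.

    Each entry is a (record, errors) pair; fallout records carry the reasons.
    """
    loadable, fallout = [], []
    for rec in customers:
        errors = []
        if not rec.get("name", "").strip():
            errors.append("missing name")           # completeness
        if rec.get("country") not in VALID_COUNTRIES:
            errors.append("bad country code")       # consistency
        if rec.get("account_id") not in known_account_ids:
            errors.append("orphan account ref")     # referential integrity
        (fallout if errors else loadable).append((rec, errors))
    return loadable, fallout
```

The fallout list doubles as the prioritized cleansing worklist: fixing exactly these records is the "first and foremost" improvement mentioned above.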

In addition to the above fundamental data quality improvement, we would also apply additional ones depending on justified business needs. Due to the tight schedule, we had to be very strict. After all, if LOB's could operate before with the existing quality, or lack thereof, why couldn't they continue the same way in the new system? Obviously I'm being a bit sarcastic, but that is a very valid question, and I used it at times. I was occasionally confronted by users complaining about data issues in the new system, only to show them that the issue existed in legacy, and was created by their own teams!

Don't get me wrong: as a Data Quality Lead myself, I strive to improve the data. I pushed very hard at times to do as much as possible to improve the data during conversion so the bad data wouldn't "pollute" the new system. But you have to prioritize.

Above, I mentioned it might not always be possible or desirable to clean the data during conversion. Schedule, technical expertise, and feasibility fall into the "possible" category. Less obvious is the "desirable" category. One might wonder: isn't it always desirable to cleanse the data as early as possible? It depends. The new system might provide a better infrastructure and better tools to achieve the task. Here is an example: the Oracle CDH provides a party merge function. A party could be a company, person, prospect, etc. When merging two parties, the Oracle built-in function will also move all associated entities and transactions from the merged party into the surviving party while maintaining data integrity. Replicating this logic during conversion would be a time-consuming and redundant effort.


Data Migration Approach


The picture below depicts our overall data migration approach.



A staging area is used to store intermediate results before data is transferred to the Customer Data Hub (CDH). We use two primary schemas in staging: one called Staging 1, and the other Staging 2.

The table structure of Staging 1 is very generic, and can easily accommodate the data from the source systems. For that reason, only a minimal transformation is required to bring data into Staging 1. This is done for a reason. Very few users have access to the Source Systems. It is much less intrusive to first bring the data over into staging without much change. That way, we don't impact the performance of the still-operational systems, or risk corrupting any data. Staging 1 becomes the source for data profiling, isolated from the "live" systems.

The table structure of Staging 2 is very similar to the final one. That means the bulk of the transformation happens between Staging 1 and Staging 2, while the transformation between Staging 2 and the CDH is again minimal.
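
The Staging 1 to Staging 2 step can be sketched as a rename-and-reshape toward the target model. All column names here are illustrative, not the actual CDH schema: Staging 1 rows stay close to the legacy layout, and the transform moves them toward target-shaped rows, applying simple standardization along the way.

```python
# Staging 1 column -> near-target (Staging 2) column; names hypothetical.
STAGE1_TO_STAGE2 = {
    "cust_nm": "party_name",
    "cntry_cd": "country",
    "tel_no": "phone",
}

def to_stage2(stage1_row):
    """Reshape one generic Staging 1 row into the near-target Staging 2 layout."""
    row = {target: stage1_row.get(src, "") for src, target in STAGE1_TO_STAGE2.items()}
    # A token example of standardization work done at this stage.
    row["party_name"] = row["party_name"].strip().upper()
    return row
```

Because Staging 2 already mirrors the target structure, the final load into the CDH stays a near pass-through, which is exactly the intent of the two-schema design.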

The next diagram depicts what happens between Staging 1 and 2.



As I mentioned, that's where most of the transformation takes place. That's where we apply Data Quality improvements, such as data cleansing, standardization, consolidation, and enrichment.

DataFlux dfPower Studio is a critical component in the conversion. DataFlux is used to profile the data in Staging 1 and help determine where the attributes fall into the following categories:
A1 - attributes to be migrated
A2 - attributes to be cleansed
A3 - attributes to be consolidated
A4 - attributes to be cleansed and consolidated
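
The A1-A4 bucketing above can be expressed as a toy classifier. In reality the decision came from DataFlux profiling results; here two hypothetical per-attribute flags (derived from such profiling) stand in for that analysis.

```python
def classify(attr_profiles):
    """Bucket attributes into A1-A4 from profiling-derived flags.

    attr_profiles: {name: {"needs_cleansing": bool, "needs_consolidation": bool}}
    """
    buckets = {"A1": [], "A2": [], "A3": [], "A4": []}
    for name, profile in attr_profiles.items():
        if profile["needs_cleansing"] and profile["needs_consolidation"]:
            buckets["A4"].append(name)      # cleansed and consolidated
        elif profile["needs_consolidation"]:
            buckets["A3"].append(name)      # consolidated only
        elif profile["needs_cleansing"]:
            buckets["A2"].append(name)      # cleansed only
        else:
            buckets["A1"].append(name)      # migrated as-is
    return buckets
```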

Upon such determination, DataFlux is employed to perform data cleansing where needed. Dun & Bradstreet (D&B) is also utilized to enrich the data, and facilitate data de-duplication later.

Once data cleansing is complete, intermediate data is classified as:
B1 - attributes to be migrated
B2 - attributes to be consolidated

Another DataFlux pass is then used to de-duplicate the data. Initially, the DataFlux process is very interactive, with manual intervention, profiling, and validation, until a solid set of rules is in place.
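
A minimal match-key approach illustrates the kind of rule-based grouping involved. This is a stand-in for DataFlux matching, not its actual algorithm: records sharing a normalized key (name plus country here, both hypothetical choices) are grouped as duplicate candidates.

```python
import re
from collections import defaultdict

def match_key(rec):
    """Build a normalized match key from name and country."""
    name = re.sub(r"[^A-Z0-9]", "", rec["name"].upper())
    # Strip a few common legal suffixes so "Acme Inc." matches "ACME".
    for suffix in ("INC", "CORP", "LLC", "LTD"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return (name, rec.get("country", ""))

def dedupe(records):
    """Group records sharing a match key; each group is a duplicate candidate set."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec)
    return list(groups.values())
```

In the real process, candidate groups like these are exactly what the interactive rounds of manual review and rule refinement operate on, with D&B enrichment supplying additional match attributes.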

The final result is the following set of data, cleansed and consolidated:
C1 - attributes to be migrated


Conclusion

Data migration can be quite complex depending on the number of systems to convert, the disparity among them, and the volume of data.

Decide on a conversion strategy and on a target data quality improvement. From a data quality perspective, prioritize your list and have a clear set of measurable and achievable goals. Be specific and reasonable. A lot of times you have to decide not only what to cleanse but how much to cleanse. For certain data elements, you may have to achieve 99.9% quality, but for others, maybe 80% will suffice. Data quality should always be driven by the business. If it doesn't benefit the business, it has no purpose.

Make sure you manage expectations appropriately and communicate, communicate, and communicate!! Believe me, no matter how much you do, somebody later will say they were not aware.

Last, but not least, profile the data! Understanding your data is critical and will avoid many problems later.

9 comments:

  1. Hi Dalton:

    Thanks for the great article. Do you mind answering couple of questions that are not addressed in the case study.

    1) How long did it take to complete the project?

2) What surprises (if any) have you encountered during the implementation?

3) Did the initiative for MDM come from Business or IT? I think in most cases, it comes from IT. However, it's hard to justify the costs for such an effort in this environment. Did you encounter any such issues in terms of funding?

    4) How long will it take to realize the ROI as specified in the business case?

    Thanks,

    Krishna.

    ReplyDelete
  2. Hi Krishna,

    Thanks for your questions.

    1) The project is still on-going. It is roughly a 3 year project, with 5 major deployment phases.

    2) I'm not sure we would call them surprises, but we were able to confirm several assumptions we had regarding data fragmentation and different business practices across multiple geographies.

    3) It was mandated by the CEO, but we were limited to a relative "standard" implementation. Funding for customization was very controlled. You might want to check the interview I gave to DataQualityPRO about the effort:

    http://www.dataqualitypro.com/data-quality-home/identifying-duplicate-customer-records-an-mdm-case-study-wit.html


    4) I was not part of the ROI analysis and business case justification.

    Regards,
    Dalton.

    ReplyDelete
  3. You bring up a good point about adequate profiling of data prior to migration. By comparing data to existing or new standards of data quality you not only cleanse in the process but it allows you to understand the fall out and set realistic expectations with the business. I leveraged a DQ scorecard against data to migrate establishing a report that identified records that would pass and fail along with those that may need minor cleansing to get through and presented that back to the business. Depending on the issues we would perform cleansing efforts both prior to ETL and during the ETL process. It is also important to note that in later migrations a data quality solution was leveraged in the migration process. You can consider this ETL or not depending. I consider these tools ETL on steroids myself.

    Once records were migrated, they were incorporated into the overall DQ process where our regular monitoring, business rule updates, and cleansing efforts began to be applied. So in the end, DQ was before, during and after migration.

    I think a thing that is important to remember is that it may make sense to relax DQ rules in migration projects as 100% compliance may not be realistic to achieve prior. Depending on the issue and the impact to the business, it can either be handled after migration or ignored depending on overall thresholds of DQ expectations.

    Good post.

    ReplyDelete
  4. Thanks for your comments, brainvibe (Michele, right?).

    Couldn't agree with you more. DQ shouldn't be an afterthought. It should be a priority applied with the proper dose depending on the circumstances.

    Really appreciate you sharing your experiences and providing feedback! Like your blog, btw.

    ReplyDelete
  5. Hi Dalton,

As I've mentioned before on LinkedIn, great blog! I'm interested to hear about what is next. With a project as large as this, what are your views on what Data Stewardship is going to be like, and how involved/responsible is the business going to be in a DQ role?

    ReplyDelete
  6. Alexander,

    Thanks a lot for your comments. Please make sure you check this posting and others at my new web site at http://www.dcervo.com

    Thanks,
    Dalton.

    ReplyDelete
  7. Hi Dalton
    Great Blog
    I'm undergoing a very similar project creating CDH to provide customer data to two very different E-Business instances. Am profiling source data as I write!! We are struggling mapping source to target - particularly the relationships and the documentation isn't great. Can you point us in the right direction?

    ReplyDelete
  8. Hi Michelle, please notice I have a new URL: http://www.dcervo.com

Back to your question. Yes, mapping can be a tremendous challenge, especially if there is no existing documentation.

    It is a somewhat brute force approach, which takes a group of people to clearly understand the existing data model and business processes behind the data.

    What you can do to help is to minimize what needs to be mapped. A lot of times not everything in the old system is needed in the new system, so you should go through an exercise to descope what is no longer needed. You can use the results of your data profiling for that. For example, if a column is scarcely populated, it is a pretty good indication that the data is not used and there is no need to convert it.

    It is not uncommon to be able to descope 20%-30% of your data elements. If you can achieve that, the universe that needs to be mapped is much smaller.
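
A quick fill-rate check of the kind described above can be as simple as this sketch; the 5% cutoff is illustrative, and the right threshold depends on your data.

```python
def descope_candidates(rows, threshold=0.05):
    """Return columns populated below the threshold: candidates for descoping."""
    if not rows:
        return []
    counts = {}
    for row in rows:
        for col, val in row.items():
            counts.setdefault(col, 0)
            if val not in (None, ""):
                counts[col] += 1  # count non-empty occurrences per column
    return [col for col, n in counts.items() if n / len(rows) < threshold]
```

Any column it flags still deserves a sanity check with the business before you drop it, since a sparse column can occasionally carry critical data for a small population.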

    Also, if you have multiple touch points for mapping, e.g. from legacy to staging, staging to destination, please consider using a metadata tool. Data lineage can become very difficult to manage with spreadsheets.

    Hope that helps.

    Thanks,
    Dalton.

    ReplyDelete