Monday, September 14, 2009

The Data Profile Spectrum


No matter how long you have been working with data quality, or simply with data for that matter, you certainly know how important data profiling is. You don't know what you don't know, and data profiling helps you bridge that gap.

Data profiling is a critical activity whether you are migrating a data system, integrating a new source into your data warehouse, synchronizing multiple systems, implementing an MDM repository, or just trying to measure and improve the quality of your data.

However, data profiling is quite often an unqualified activity. Sometimes that is OK, but sometimes it is not. By “unqualified” I mean that not much information or many requirements are given about what the data profile is all about. Sometimes that is OK because you have either no knowledge at all or very minimal knowledge about the data you're profiling. But very often you do know quite a bit already, and maybe you're simply trying to fit your data to a specific set of rules.

Bear with me, but I feel I need to add one more definition before I continue making my point. I keep using the term “knowledge about your data.” What do I mean? There are multiple levels of knowledge here. For each data element, or combination of data elements, there are many associated properties: data type, data content, data pattern, data association, data conformance, business rules, etc. The knowledge could also be about what the data should be, not only about what it is. As you can see, how much you know can vary a lot.

When you combine the objective of your profile with how much you already know about the data, you end up with many different combinations. That is why I like to use the term Data Profile Spectrum. And remember, different attributes could sit at different parts of the spectrum. No wonder data profiling can be a lot more complex than people give it credit for.

The picture below depicts the Data Profile Spectrum.




Let's first talk about Data Profile Artifacts. By that I mean what is usually provided by a data quality tool, or maybe something you put together yourself. Basically, it is what you have available to analyze your data, from data completeness to pattern analysis, data distribution, and a lot more. I won't get into a lot of detail about the artifacts; please refer to Jim Harris' article Adventures in Data Profiling for more on that and some other cool stuff.

The only thing I'll point out is that I used tetrominoes to represent the artifacts. That is just to call attention to the fact that data profile artifacts are pieces that can be applied and/or combined in a variety of ways to accomplish what you need. For example, you may use the data distribution artifact during discovery just to understand what random values you have and at what percentage. However, you may use the same artifact on a Country Code field to identify the percentage of valid values. It is the same artifact applied slightly differently, depending on where you are in the spectrum.
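To make that distinction concrete, here is a minimal sketch of a distribution artifact used both ways. This is not from the original post; the sample values, field contents, and the reference list of valid country codes are all hypothetical.

```python
from collections import Counter

def value_distribution(values):
    """A basic profiling artifact: share of records per distinct value."""
    total = len(values)
    return {v: count / total for v, count in Counter(values).items()}

# Discovery (low end of the spectrum): no prior knowledge, just see
# what values occur and how often.
statuses = ["A", "A", "I", "A", "X"]
print(value_distribution(statuses))  # {'A': 0.6, 'I': 0.2, 'X': 0.2}

# Validation (high end of the spectrum): the same artifact, now measured
# against a known reference list of valid country codes.
valid_codes = {"US", "CA", "BR"}
countries = ["US", "US", "BR", "ZZ", "CA"]
dist = value_distribution(countries)
pct_valid = sum(share for code, share in dist.items() if code in valid_codes)
print(round(pct_valid, 2))  # 0.8
```

The artifact itself never changes; only the question asked of its output does, which is the point of placing the same activity at different spots on the spectrum.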

The Prior Knowledge scale represents how much you already know about what the data is or what it should be. It is important to grasp where you are on that scale so you know how to apply the right artifacts properly. I mean, why would you need to verify uniqueness when a primary key constraint already exists in the database for that particular field? That is just an example, but hopefully you get the idea.
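As a sketch of that point (the function and sample IDs below are hypothetical, not from the post), a basic uniqueness artifact is trivial to express, which is exactly why it is worth knowing when prior knowledge already makes it redundant:

```python
from collections import Counter

def duplicate_values(values):
    """Uniqueness artifact: report any values that appear more than once."""
    return {v: n for v, n in Counter(values).items() if n > 1}

# With no prior knowledge of the field, run the artifact:
print(duplicate_values(["id-1", "id-2", "id-2"]))  # {'id-2': 2}

# With prior knowledge -- e.g. a primary-key constraint on the column
# already guarantees uniqueness -- the same check adds nothing and can
# be skipped in favor of artifacts that answer open questions.
```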

Another twist is being able to identify where you should be on that scale for a given profiling activity. I can see some eyes rolling, but let me explain with a real example I faced. We were about to start a data conversion activity, and I was asked to “go profile the data to be converted.” My reply was that we needed more information than that. If we were converting one system into another, we should already have quite a bit of knowledge about the new system, and that knowledge should drive what and how we profile the old system. This is definitely not a low-end-of-the-scale profiling activity in my spectrum.

Interestingly enough, my reply wasn't very well received. I hadn't written this blog entry yet, so the concept wasn't quite formalized in my mind. I was reminded that data profiling should be the first thing to occur, so we could “discover” things about our data. My point was that our goal was not to find out information about our data; our goal was to fit our data into the new system. Doing “primitive” data profiling would be a useless activity. We had to profile our data bounded by the new system. Well, I eventually convinced them, but I wish I had had the Data Profile Spectrum handy back then.

In summary, I got a request for a “No Knowledge” profile when I should have been asked for something at the higher end of the Data Profile Spectrum. At the time of the request we didn't know much, so judged by our current position on the spectrum alone the request might have seemed pertinent. However, you need to consider not only where you are in the spectrum, but also where you should be. If the two don't match, something is missing.

I have several other real examples of data profiling requests, but it is getting pretty late, and I want to post this entry before I go to bed. If you care to read more about them, please let me know.


4 comments:

  1. I think there are different data profiling levels in a data quality or migration project. I refer to the discussion on Henrik's blog:
    lilendahl.wordpress.com (when to cleanse in a Data Migration project) - and there are many solutions.

    I believe that to get a rough estimate for the project we need at least a pre-profiling at the beginning, when the target system may not yet be known.

    br
    Tibor Bossányi

    http://migration.hu

    ReplyDelete
  2. Another good blog Dalton. I fully agree and support the approach of knowing where you are in the ‘spectrum’ and where you need to be before you start to profile. Once you are in the depths of the profiling process, you can often lose focus on the end goal. Our MDM programme is an example of this: putting too much focus on the profiling work-stream and the DQ processes resulted in a fantastic solution, but it was far bigger than it needed to be. We didn’t build the profiling and DQ processes to fit the end goal!

    ReplyDelete
  3. Hi Dalton,

    I like your approach, and your 'spectrum'. I particularly like your real life example, and would like to read more about your real life data profiling experience. Theory is fine - it is the real world application that matters.

    You were put under pressure to perform 'No knowledge' data profiling "because data profiling should be the first thing to occur, so we could “discover” things about our data".

    I have more often faced situations where no-one accepted the need for data profiling in the first place!!

    Hence, I am keen to build and contribute to the "Business Reasons" for data profiling.

    Looking forward to learning more - Ken

    ReplyDelete
  4. Thank you all for your comments!

    Tibor, I agree very much with your comments. There are indeed different profiling levels. The intent of the spectrum model is to generalize and abstract the actual profiling activity (data quality, data migration, etc.) into a knowledge-based approach. It is just a different way of looking at it, really. Pre-profiling when the target system is not known could certainly be a valid exercise, and that activity can be positioned in the spectrum accordingly to set the right expectations.

    Charles, thanks for the feedback. I can relate to your experience. We also missed the target a bit in some of our data profiling projects. I hope we have learned from some of that now, and can do a better job in the future.

    Ken, thanks for your comments! I'll add a new posting with more info on some of the experiences I've had so far.

    ReplyDelete