Monday, September 14, 2009

The Data Profile Spectrum

No matter how long you have been working with data quality, or simply data for that matter, you certainly know how important data profiling is. You don't know what you don't know, and data profile helps you bridge the gap.

Data profiling is a critical activity whether you are migrating a data system, integrating a new source into your data warehouse, synchronizing multiple systems, implementing an MDM repository, or just trying to measure and improve the quality of your data.

However, data profile is quite often an unqualified activity. Sometimes that is OK, but sometimes it is not. By “unqualified” I mean not much information or requirements are given about what the data profile is all about. Sometimes that is OK because you have either no knowledge at all or very minimum knowledge about the data you're profiling. But very often, you do know quite a bit already, and maybe you're simply trying to fit your data to a specific set of rules.

Bear with me, but I feel like I need to add one more definition before I continue expressing my point. I keep using the term “knowledge about your data.” What do I mean? There are multiple levels of knowledge in this case. For each data element or combination of data elements, there are so many associated properties. There is data type, data content, data pattern, data association, data conformance, business rules, etc. It could also be about what the data should be and not only about what it is. As you can see, how much you might know could vary a lot.

When you combine the objective of your profile along with how much you know about the data already, you end up with a lot of different combinations. That is why I like to use the term Data Profile Spectrum. And remember, different attributes could be at different parts of the spectrum. No wonder data profiling can be a lot more complex than people give it credit for.

Next picture depicts the Data Profile Spectrum.

Let's first talk about Data Profile Artifacts. By that I mean what is usually provided by a data quality tool, or maybe something you put together yourself. Basically it is what you have to analyze your data, from data completeness to pattern analysis, data distribution, and a lot more. I won't get into a lot of detail about the artifacts. Please refer to Jim Harris' article Adventures in Data Profiling for more on that and some other cool stuff. The only thing I'll point out is notice I used tetrominoes to represent the artifacts. That is just to call attention to the fact that data profile artifacts are pieces that can be applied and/or combined in a variety of ways to accomplish what you need. For example, you may use the data distribution artifact during discovery just to understand what random values you have at what percentage. However, you may use the same artifact on a Country Code field to identify the percentage of valid values. It is the same artifact applied slightly different dependent on where you are in the spectrum.

The Prior Knowledge scale represents how much you already know about what the data is or what the data should be. It is important to grasp where you are in that scale so you know how to apply the right artifacts properly. I mean, why would you need to verify uniqueness when a primary key constraint already exist in the database for that particular field? That is just an example, but hopefully you get the idea.

Another twist is being able to identify where you should be in that scale for a given profiling activity. I can see some eyes rolling, but I'll explain. Here is a real example I faced. We were about to start a data conversion activity. I was asked to “go profile the data to be converted.” My reply was that we needed more information than that. I mean, if we were to convert one system into another, we should have quite a bit of knowledge about the new system, which would drive what and how we profile the old system. This is definitely not a low-end of the scale profile activity in my spectrum.

Interesting enough, my reply wasn't quite well received. I hadn't written this blog entry yet, so this concept wasn't quite formalized in my mind. I was reminded that data profiling should be the first thing to occur, so we could “discover” things about our data. My point was our goal was not to find out information about our data. Our goal was to fit our data into the new system. Doing “primitive” data profiling would be a useless activity. We had to profile our data bounded by the new system. Well, I eventually convinced them, but I wish I had the Data Profile Spectrum handy back then.

In summary, I had a request to do a “No Knowledge” profile, when I should be asked to do something at a higher end of the Data Profile Spectrum. At the time of the request, we didn't know much. One could have thought the request was pertinent when using the Data Profile Spectrum. However, you not only need to consider where you are in the spectrum, but also where you should be. If they don't match, something is missing.

I have several other real examples of data profiling requests, but it is getting pretty late, and I want to post this entry before I go to bed. If you care to read more about them, please let me know.