I have been having a lot of conversations lately with organizations that are implementing “Big Data,” and many of them are considering Atlas or Navigator as their “data catalog” of choice for data governance. Because this topic keeps coming up, I think it is time to discuss why those technologies are not sufficient as a data catalog or for solving real enterprise data governance problems.
First off, these technologies are built to serve technical users. They don’t organize and simplify the underlying metadata in a way that is searchable and usable by business users; serving these less tech-oriented users requires an additional layer of “smarts” to hide the underlying complexity. These technologies also do not solve the fundamental problems that data catalogs MUST solve to be considered a data catalog (rather than a metadata repository) or to actually be useful for data governance. So what does that layer of value look like?
Data catalogs provide additional intelligence that allows business users (or applications) to search for data they need for analysis or some other purpose (like controlling access for governance). Users should be able to rely on everyday business terms to locate the needed information, independent of metadata or schema details (such as actual column names). The user of the data catalog should then be able to decide which data sets they want to use based on both objective (profiling) and subjective (ratings or reviews) information provided in the catalog. Access to that information should, of course, be controlled or governed based on a set of roles or permissions.
This means a data catalog can’t just be a repository of raw column names with their associated schemas. It needs an additional layer of “smarts” that delivers the following fundamental capabilities to business users:
(Note: I am only mentioning the top three capabilities here. For a more detailed breakdown, check out this blog post on the top 10 features to look for in a data catalog.)
Let’s dig into these in more detail:
Automatically populating the data catalog with integrated technical AND business metadata: First off, for a data catalog to be useful, it has to connect a business term, like “First Name”, with all the physical attributes that might contain a first name across all of the data sets that contain that attribute. This needs to be done independent of the naming of those columns (or data elements), because almost no organization is able to maintain perfect naming discipline for its datasets over any reasonable period of time. This matters because later, when a business user searches for information, they shouldn’t need to know every synonym that has previously been used to name that term within a dataset.
A more realistic scenario is to imagine you have 5 different columns of data in 5 different data sets in your enterprise data lake, and their column names are as follows: First Name, FN, G_name, GN, Col_5. If users just search on “First Name,” they would only get back one of the 5 columns that contained first names. This is obviously not a very useful result in the context of a data catalog.
So how do you solve this problem? A good data catalog starts with introspection of the data itself, using machine learning to automatically examine the data values in each column, creating a “fingerprint” of each of those columns, and then matching the fingerprint to a known business term. In this way, even columns with cryptic names like “Col_5” can be automatically tagged with the business term “First Name”. (There is much more to the process of “data fingerprinting,” so if you want more details on how this works, read more about “Data Fingerprinting” here.)
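To make the fingerprinting idea concrete, here is a minimal sketch in Python. The feature set, the `fingerprint` and `tag_column` helpers, and the reference values are all illustrative assumptions, not Waterline’s actual algorithm, which is far more sophisticated:

```python
# Minimal sketch of "data fingerprinting": profile a column's values into a
# small feature vector, then match it against fingerprints built from columns
# already known to hold each business term. Everything here is illustrative.
from statistics import mean

def fingerprint(values):
    """Summarize a column's values as a few simple content features."""
    vals = [str(v) for v in values if v is not None]
    return {
        "avg_len": mean(len(v) for v in vals),
        "alpha_ratio": mean(v.isalpha() for v in vals),
        "capitalized_ratio": mean(v[:1].isupper() for v in vals),
        "distinct_ratio": len(set(vals)) / len(vals),
    }

# Reference fingerprints for known business terms.
BUSINESS_TERMS = {
    "First Name": fingerprint(["Alice", "Bob", "Carol", "Dave", "Erin"]),
    "ZIP Code": fingerprint(["94105", "10001", "60601", "30301", "02139"]),
}

def tag_column(values):
    """Tag a column with the business term whose fingerprint is closest."""
    fp = fingerprint(values)
    def distance(ref):
        return sum(abs(fp[k] - ref[k]) for k in fp)
    return min(BUSINESS_TERMS, key=lambda term: distance(BUSINESS_TERMS[term]))

# A cryptically named column ("Col_5") is still tagged by its content.
print(tag_column(["Grace", "Heidi", "Ivan", "Judy", "Ken"]))  # → First Name
```

Even though the query column might be named “Col_5”, its content profile matches the “First Name” reference more closely than the “ZIP Code” one, so it gets the right tag regardless of its name.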
The most important point is that this process must be automated. Automation is critical for any practical deployment of a data catalog for data governance, for two reasons. First, most data is not well documented with good column names, and there is simply too much data in most organizations to go back and manually tag it with good, consistent names. Second, once your data is tagged, an automated approach allows you to easily integrate the tagging process with underlying security systems for items that have been tagged with terms considered “sensitive”.
But if it isn’t clear already, for any of this to work in a real environment, the population of business level metadata in the data catalog has to first be automated to be useful at a meaningful scale to business users.
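As a sketch of what that security integration might look like, the snippet below maps automatically assigned tags to a simple access check. The tag list, the `column_tags` mapping, and the `check_access` helper are hypothetical names for illustration, not any product’s real API:

```python
# Sketch: once columns are tagged, sensitive tags can drive access decisions.
# The tag list, column_tags mapping, and check_access helper are hypothetical.
SENSITIVE_TAGS = {"First Name", "SSN", "Email"}

# Output of an automated tagging pass: physical column -> business term.
column_tags = {
    "customers.Col_5": "First Name",
    "customers.acct_no": "Account Number",
    "claims.tax_id": "SSN",
}

def check_access(column, user_roles):
    """Allow sensitive columns only for users holding the 'pii_reader' role."""
    if column_tags.get(column) in SENSITIVE_TAGS:
        return "pii_reader" in user_roles
    return True

print(check_access("claims.tax_id", {"analyst"}))      # → False
print(check_access("customers.acct_no", {"analyst"}))  # → True
```

The point is that the policy is written once against business terms, and every physical column the tagging pass finds, however cryptically named, inherits it automatically.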
Crowdsourced ratings and reviews: Automated tagging is great, and it can tell you a lot about the datasets you are interested in working with. Beyond the tags themselves, the automated process will give you statistical demographic information like min/max values, most frequent values, selectivity, etc. However, that isn’t always enough to determine whether a data set is valuable in the context within which you want to use it. That is why every competitive data catalog also has a democratized system of ratings and reviews that allows users to comment on a data set, noting things like “Data is good for automotive claims but incomplete for homeowners policies”. And while this is something that any smart data analyst could figure out after playing with the data set for a while, ratings and reviews make it easier for a user of a data catalog to determine whether a data set is fit for their needs. Think of it like Yelp for your organization’s data. It is extremely useful to see who has used a data set before and what they thought of it, before spending too much time diving into it yourself. Once again, Navigator and Atlas lack this concept.
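Under the hood, a “Yelp for data” layer boils down to a small data model attached to each data set. Here is a minimal sketch; the `Review` record and the `add_review`/`summary` helpers are illustrative assumptions, not any catalog’s actual API:

```python
# Sketch of crowdsourced ratings and reviews attached to data sets, with a
# crowd-level summary. The data model and helpers are illustrative only.
from dataclasses import dataclass
from statistics import mean

@dataclass
class Review:
    user: str
    rating: int   # 1-5 stars
    comment: str

reviews = {}  # data set name -> list of Review

def add_review(dataset, user, rating, comment):
    reviews.setdefault(dataset, []).append(Review(user, rating, comment))

def summary(dataset):
    """Aggregate the crowd's verdict on a data set."""
    rs = reviews.get(dataset, [])
    return {"avg_rating": round(mean(r.rating for r in rs), 1), "count": len(rs)}

add_review("auto_claims", "jsmith", 4, "Good for automotive claims")
add_review("auto_claims", "mlee", 2, "Incomplete for homeowners policies")
print(summary("auto_claims"))
```

A searcher seeing the average rating, the review count, and who wrote the reviews can judge fitness for purpose before investing time in the data set itself.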
Provide business-user-friendly (yet secure) search results: Data governance isn’t a technical process. It is a business process about managing data as a business asset. Just as we now have business self-service for business intelligence and business self-service for data integration via data wrangling, business users are also expecting business self-service for data governance. This means that the data catalog, which will be the first place business users look to find the data they need, must provide a secure and user-friendly search experience.
In order for this to work, searching for data must be done using easy-to-understand business terms. This connects back to the first point above about automatically populating the data catalog with business terms that are tagged to the actual (user-unfriendly) column names used within most datasets. In addition, search results must provide context, which connects back to the second point: ratings and reviews provide human context for data from actual users.
Lastly, you also need a nice business-friendly UI with customizable facets that make it easy to filter search results similar to how you do a shopping search on Amazon.
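Faceted filtering of this kind can be sketched as a business-term match followed by successive facet filters. The catalog entries and facet names below are made-up examples, not a real catalog schema:

```python
# Sketch of Amazon-style faceted search over catalog entries. The entry
# fields and facet names are illustrative assumptions.
catalog = [
    {"name": "claims_2017", "term_tags": ["First Name", "Claim Amount"],
     "source": "data lake", "format": "parquet"},
    {"name": "policies", "term_tags": ["First Name", "Policy Number"],
     "source": "warehouse", "format": "csv"},
]

def search(term, **facets):
    """Match on business term, then narrow by any facet key/value pairs."""
    hits = [e for e in catalog if term in e["term_tags"]]
    for key, value in facets.items():
        hits = [e for e in hits if e.get(key) == value]
    return [e["name"] for e in hits]

print(search("First Name"))                      # → ['claims_2017', 'policies']
print(search("First Name", source="warehouse"))  # → ['policies']
```

Each facet (source, format, rating, and so on) simply narrows the result set, just as clicking a filter in a shopping search does.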
When all of these components come together, and the catalog is also integrated with the security layer so that data tagged as sensitive can be protected, the data catalog becomes the centerpiece for driving big data governance for your data lake. That said, neither Navigator nor Atlas provides any of the services mentioned above. They document raw metadata captured within their scope for big data, but they aren’t actually data catalogs that enable accurate, business-friendly search of data. The result: if you bet on either of those technologies, you will still need the capabilities provided by enterprise-class data catalog solutions.
Want to learn more about Waterline Data and how we work with MapR to provide robust data governance capabilities for your Hadoop data lake (and for your non-Big-Data sources as well) that you won’t get from those other technologies? Ask your MapR sales rep or contact Waterline Data, and we will be glad to tell you more.