Polyglot Data Management
At the Big Data Everywhere conference held in Atlanta, Senior Software Engineer Mike Davis and Senior Solution Architect Matt Anderson from Liaison Technologies gave an in-depth talk titled “Polyglot Data Management,” where they discussed how to build a polyglot data management platform that gives users the flexibility to choose the right tool for the job, instead of being forced into a solution that might not be optimal. They discussed the makeup of an enterprise data management platform and how it can be leveraged to meet a wide variety of business use cases in a scalable, supportable, and configurable way.
Matt began the talk by describing the three components that make up a data management system: structure, governance and performance. “Person data” was presented as a good example when thinking about these different components, as it includes demographic information, sensitive information such as social security numbers and credit card information, as well as public information such as Facebook posts, tweets, and YouTube videos. The data management system components include:
1. Structure: Data does not have a single schema; it can take a variety of shapes. Streams, cubes, graphs, relational tables, and trees are all examples of different data shapes. However, it’s important to think of shape and structure as separate from the data itself, because a single piece of data can take multiple shapes. In the case of person data, for example, your social data may be best represented as a stream, a graph could be used to examine links between friends, relatives, and co-workers, and a relational model would work best for demographic information.
_Different types of data shapes_
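The idea that one piece of data can take several shapes can be sketched in a few lines of Python (all names here are illustrative, not from the talk): the same person appears as a relational row, as graph edges, and as a stream of events.

```python
# Illustrative sketch: the same "person data" held in three different shapes.
from collections import namedtuple

# Relational shape: demographic data as a flat row.
PersonRow = namedtuple("PersonRow", ["person_id", "name", "city"])
row = PersonRow(person_id=42, name="Joe Smith", city="Atlanta")

# Graph shape: links between friends, relatives, and co-workers as edges.
edges = [
    (42, "FRIEND_OF", 7),
    (42, "COWORKER_OF", 19),
]

# Stream shape: social activity as a time-ordered sequence of events.
social_stream = [
    {"ts": "2015-06-01T10:00:00Z", "person_id": 42, "kind": "tweet"},
    {"ts": "2015-06-01T10:05:00Z", "person_id": 42, "kind": "facebook_post"},
]

# The underlying data (person 42) is the same; only the shape differs.
```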
2. Governance: In addition to the data itself, you have metadata about the data. Data management requires governance, so you have to think about security/compliance issues as well as quality issues. Security/compliance areas include encryption, access controls, lineage, and auditing (who saw what data, and when). Quality issues include validation, business rules, cleansing and access. If you are thinking about creating a data management platform, you need to ensure that your data is clean and valid, and that you have the right data at the right time. Matt mentioned that MapR security features have made a lot of headway in this area.
Validation, cleansing, and identity resolution can be applied to demographic data; how do you know if Joe Smith is the same person as Joseph Smith? Being able to run those types of rules and have a system in place that can take that information, cleanse it, and put it into a clean record is vitally important.
3. Performance: Your data management solution needs to be scalable, fast, fault tolerant and robust.
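The identity-resolution question raised above (is Joe Smith the same person as Joseph Smith?) can be sketched with a simple rule: normalize the names and expand common nicknames before comparing. This is a minimal, hypothetical sketch, not the approach described in the talk; production systems use far richer matching rules.

```python
# Toy identity resolution: expand nicknames, then compare normalized names.
NICKNAMES = {"joe": "joseph", "bob": "robert", "bill": "william"}

def canonical(name: str) -> str:
    """Lowercase a full name and expand any known nicknames."""
    parts = name.lower().split()
    return " ".join(NICKNAMES.get(p, p) for p in parts)

def same_person(a: str, b: str) -> bool:
    """True when both names normalize to the same canonical form."""
    return canonical(a) == canonical(b)

print(same_person("Joe Smith", "Joseph Smith"))  # True
```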
When looking at the entire data management spectrum, there really is no “silver bullet.” Not every type of data needs all the different properties. There is a wide range of data management solutions, from the “safe” traditional approach of an RDBMS to the more flexible approaches found in big data technologies. In terms of polyglot data management, it’s important to pick the right tool for your use case.
Mike Davis then spoke about polyglot data management, which is essentially the ability to choose the correct type of data management solution for the job instead of using the same, possibly ill-fitting solution for everything. Ideally, you want a data management system where you can view your data in any way that’s necessary.
He then discussed the three basic tenets of what you should look for in a polyglot data management platform: Primitives, Specification, and Orchestration.
Primitives include Persist, Define, Query, Event, Flow, Explore, and Secure.
The Persist component consists of the underlying storage technology and a data access layer for interacting with the lower level APIs. It’s responsible for supporting (at a low level) CRUD operations, indexes, partitioning, cache strategies, lineage, etc.
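One way to picture the Persist component is as a thin CRUD contract that hides the storage engine behind an interface. The sketch below is hypothetical (the class and method names are not from the talk), assuming a simple key-value access layer:

```python
# Hypothetical sketch of a Persist primitive: a backend-agnostic CRUD
# contract, with concrete stores wrapping the lower-level storage APIs.
from abc import ABC, abstractmethod

class Store(ABC):
    @abstractmethod
    def create(self, key, value): ...
    @abstractmethod
    def read(self, key): ...
    @abstractmethod
    def update(self, key, value): ...
    @abstractmethod
    def delete(self, key): ...

class InMemoryStore(Store):
    """Toy backend; a real platform would wrap a database or filesystem."""
    def __init__(self):
        self._data = {}
    def create(self, key, value):
        self._data[key] = value
    def read(self, key):
        return self._data.get(key)
    def update(self, key, value):
        self._data[key] = value
    def delete(self, key):
        self._data.pop(key, None)

store = InMemoryStore()
store.create("person:42", {"name": "Joe Smith"})
```

Indexes, partitioning, and cache strategies would live behind the same interface, so callers never touch the lower-level APIs directly.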
The graph below shows a typical example of data flow. Data has motion, which is represented in the graph below as a configurable sequence of activities triggered by events.
_Processing can be stream or batch_
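A configurable sequence of activities triggered by events can be sketched as a simple pipeline (a hypothetical illustration, assuming validate/cleanse/persist steps; the real platform's activities are configured, not hard-coded):

```python
# Data in motion: an event arrives and triggers a configured sequence
# of activities. All step names here are illustrative.
def validate(record):
    if "name" not in record:
        raise ValueError("missing name")
    return record

def cleanse(record):
    record["name"] = record["name"].strip().title()
    return record

def persist(record):
    print("stored:", record)
    return record

# The flow itself is configuration: reorder or swap steps per use case.
FLOW = [validate, cleanse, persist]

def on_event(record, flow=FLOW):
    for step in flow:
        record = step(record)
    return record

on_event({"name": "  joe smith "})
```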
The solution specification ties it all together, describing which primitives a solution uses and how they are configured.
The orchestration component processes a solution specification and coordinates all of the other components to execute the specification.
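The relationship between specification and orchestration can be sketched as a small interpreter (hypothetical names throughout, assuming a declarative spec that names registered components):

```python
# Hypothetical orchestration sketch: read a declarative solution spec
# and wire together the components it names.
spec = {
    "flow": ["validate", "cleanse"],  # ordered activities to run per record
}

REGISTRY = {
    "validate": lambda r: r,
    "cleanse": lambda r: {**r, "name": r["name"].title()},
}

def orchestrate(spec, record):
    """Coordinate the components named in the spec against one record."""
    for step_name in spec["flow"]:
        record = REGISTRY[step_name](record)
    return record

result = orchestrate(spec, {"name": "joe smith"})
print(result)  # {'name': 'Joe Smith'}
```

The point of the indirection is that changing the solution means editing the spec, not the code that executes it.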