Editor's Note: In this week's Whiteboard Walkthrough, Dale Kim, product marketing manager, explains how document databases fit in your enterprise's use cases.
Here's the unedited transcription:
One use case I want to talk about is known as the enterprise data hub, which some people use synonymously with data lake. I like to think that enterprise data hubs do a bit more than that, in that there's necessarily two-way communication with other systems within your enterprise architecture. To look at an enterprise data hub, you start off with a number of sources from across your enterprise. That'll include things like order information, customer information, maybe even social media and product information. You collect all that information into your enterprise data hub, and it will necessarily have a combination of structured data and unstructured data.
Your document database handles structured data extremely well. Of course, document databases are often used for unstructured or non-relational data, but they're also good for structured formats. Then, to complement that, you have Hadoop, which will handle your unstructured data, and the combination of the two will allow you to run a number of analytics, ETL processing, and so on. You can share that information with your existing enterprise data warehouse or your data marts. The document databases provide value here because, with all the different data formats from across your enterprise, your application developers will have an easier time mapping those different formats into a central repository.
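To make the mapping idea concrete, here is a minimal sketch in Python. It is only an illustration: the `hub` list is a hypothetical in-memory stand-in for a document database collection, and the `ingest` and `find` helpers, the source names, and the sample records are all assumptions, not part of any particular product's API.

```python
import json

# Hypothetical in-memory stand-in for a document database collection.
hub = []

def ingest(source, record):
    """Wrap a record from any source system as a self-describing document."""
    doc = {"source": source, "payload": record}
    hub.append(doc)
    return doc

def find(source):
    """Return the payloads of all documents that came from one source."""
    return [doc["payload"] for doc in hub if doc["source"] == source]

# Records with different shapes land in the same repository unchanged.
ingest("orders", {"order_id": 1001, "customer_id": 7, "total": 59.90})
ingest("customers", {"customer_id": 7, "name": "Ada", "tags": ["vip"]})
ingest("social", {"handle": "@ada", "text": "Love the new product!"})

print(json.dumps(find("orders"), indent=2))
```

The point of the sketch is that no shared schema has to be agreed on up front: each source keeps its own fields, and the document wrapper only records where the data came from.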
Another high-level use case, which is actually more of an environment, is this notion of time series data, where you're necessarily collecting data from measurements taken at various intervals of time. Now, you can expand that definition of time series data to include things like click streams, where they're not necessarily based on specific times but they are necessarily ordered. That's a more informal view of what time series data is, but I think you get the picture. In a time series data environment you will necessarily have a lot of different sources. If you're looking at an oil field or you're looking at wearable technologies, you might be thinking of hundreds, thousands, or even millions of connected devices, all delivering their own unique stream of data.
You'll typically have a collector that aggregates all of that information. This information often comes as JSON. A lot of systems still send binary formats, but you can easily convert those into a JSON format, which allows your application developers to aggregate all these different data formats into one single repository. There's a lot to be said about the collector side, but I'll hold off on that for now and instead jump to the data store mechanism, which again would be a document database or, more specifically, a NoSQL document database. With all the different formats that you have in JSON, you can store the data here as the repository, but that's only part of the equation when it comes to time series data.
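As a rough illustration of the binary-to-JSON step, here is a minimal Python sketch. The binary layout (a big-endian unsigned 32-bit device ID, an unsigned 32-bit Unix timestamp, and a 32-bit float reading) is an assumption made up for this example, as is the `binary_to_json` helper; real devices will have their own formats.

```python
import json
import struct

# Assumed (hypothetical) packet layout: big-endian uint32 device id,
# uint32 Unix timestamp in seconds, float32 sensor reading.
READING = struct.Struct(">IIf")

def binary_to_json(packet: bytes) -> str:
    """Decode one fixed-size binary reading and re-encode it as JSON."""
    device_id, ts, value = READING.unpack(packet)
    return json.dumps({"device_id": device_id, "ts": ts, "value": round(value, 3)})

# Simulate a device sending one reading, then convert it in the collector.
packet = READING.pack(42, 1700000000, 21.5)
doc = binary_to_json(packet)
print(doc)
```

Once every stream has been converted this way, the collector can hand the resulting JSON documents to the document database without caring which device produced them.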
If you're collecting all of this information from around the world, with this huge volume of data, you necessarily want to do some type of analysis on it to get some value. That's where Hadoop comes in. You use Hadoop as the analytics platform, working with NoSQL, so that you can easily store information in your NoSQL database and then have Hadoop do the hardcore number crunching in your expandable cluster. That's pretty much everything I know about document databases in your enterprise. If you enjoyed this Whiteboard Walkthrough, please be sure to add your comments below. Of course, if you have ideas for other topics that you'd like to hear about in our Whiteboard Walkthrough series, please comment below as well. Thank you very much!