The recent release of Apache Hive 0.13 prompted me to think about how far we have come in serving both operational and analytical systems within the Hadoop ecosystem.
So, let's step back a bit. The MapReduce paradigm, still considered one of the tenets of Hadoop, enjoys many production deployments and success stories. However, due to its batch nature, people have spent the past two to three years working on interactive large-scale query systems (again, in the context of Hadoop; obviously such systems have been around much longer).
Further, enterprises depend on both operational systems (think: customer purchase orders) and analytical systems (for example, BI tools). Now the question arises: can we provide one platform that is capable of catering to both? A bit like Oracle databases, just at scale and with a sensible price tag.
In the figure above I've tried to provide some orientation on where we are at the time of writing. Let's now have a closer look at what is already available and which work items remain.
Acknowledging that most use cases benefit from a polyglot persistence mindset—using the right datastore for a certain task, depending on the nature of the data, the workload and the SLAs—I'd argue that in practice it is SQL and NoSQL rather than SQL or NoSQL. In this sense, the available SQL-on-Hadoop offerings along with the capabilities of the Apache Spark stack enable us to address the different types of workloads we encounter in the enterprise:
In the NoSQL category, we typically find Apache HBase and M7 in use, providing low-latency access to structured and semi-structured data at scale: billions of records and tens of thousands of concurrent users, with latency SLAs in the hundreds of milliseconds.
It's fair to state that we've come a long way, being able to cover most of the use cases of operational and analytical systems with our MapR Platform. However, one work item still needs to be addressed before we can offer 100% coverage: transactions. In this context, we consider a commit, including the ability to roll back, as the unit of work. There are several strands of work:
Last but not least, a purely functional view (being able to deal with transactions) is not sufficient in an enterprise setup. Any solution that offers this feature must also be scalable, reliable, and secure, guaranteeing business continuity and disaster recovery, ideally out of the box. You expect this from your enterprise database, so why would you expect less from a Hadoop solution?
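For readers less familiar with the notion of a commit as the unit of work, including rollback, here is a minimal sketch of what that means in practice. It uses an in-memory SQLite database from Python's standard library purely as a stand-in; the `accounts` table and the `transfer` helper are made up for illustration and say nothing about how any Hadoop component implements transactions.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as a single unit of work."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit to `dst`
        # made earlier in the same transaction has been rolled back too.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

ok = transfer(conn, "alice", "bob", 60)      # fits within alice's balance
failed = transfer(conn, "alice", "bob", 60)  # would overdraw alice
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer credits one account before the debit fails, yet neither change survives: either both updates are committed or neither is. That all-or-nothing behaviour is the guarantee the paragraph on transactions refers to.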