Apache Drill 1.11 on MapR 6.0 was released with new enterprise capabilities for a faster, more secure, and more robust interactive BI analytics experience. Drill now leverages the new secondary index technology on semi-structured operational data stored in MapR Database, the database for the world’s most data-intensive applications, to speed up analytics, deliver insights, and drive better decisions. New security features protect sensitive data as it is accessed, processed, and delivered to end users. Several enhancements improve the handling of analytic workloads running on the system, including spooling data-intensive queries to disk and managing them through queues.
The current trend in big data analytics marketing dictates that every product release, such as this one, warrants that the product manager (yours truly!) make a big splash in a blog post, promoting the greatest features that the team has shipped. Performance benchmark numbers must be published in carefully orchestrated setups to make the query engine appear like the North Star. Self-service? Fast ETL? Sub-second response times faster than the speed of thought? No problems. They got it all.
Big data marketing myths are far from enterprise reality. I am not going to do any of that. Instead, I want to talk about the Apache Drill release on MapR in the context of the problems we see our enterprise customers facing on their big data journey toward enterprise-wide analytics access. Democratization of access and self-service analytics continue to be the cornerstones of the current BI wave. From that perspective, there are three big challenges we see:
None of these problems are easy to solve, especially when you are designing an industry-grade solution at massive scale.
One of our core design beliefs is that a query engine can never be designed in isolation but must be tightly integrated with the underlying data layer. It should come as no surprise that we continue to invest time and effort to strengthen the core data platform that we have built from the ground up. Keeping reliability and performance at scale as our end goal, we have built an industry-grade distributed file and object store (MapR XD), a NoSQL database for data-intensive operational applications (MapR Database), and real-time streams for IoT scale (MapR Event Store). To simplify their operation, management, and deployment, these technologies are unified within the software layer, yielding convergence (e.g., data can be streamed directly to the database and stored as files, all within a single machine). Such a design removes the need to maintain these as separate systems and manage their complexity.
For several years now, MapR has contributed to Apache Drill, an open source MPP query engine project. What is truly unique about Apache Drill is its ability to discover the schema of data on the fly. This capability becomes important as the amount of unstructured data stored in the enterprise continues to increase exponentially. In my opinion, most MPP query engines currently have a short-term focus on solving the structured data problem, reminiscent of the relational world. On D-Day, they will have to contend with querying all this unstructured data in place without complete knowledge of its underlying schema. And in real time. From the MapR perspective, we believe that Apache Drill prepares the world better for this impending data explosion.
To address the above challenges we see in the enterprise, the Apache Drill 1.11 integration on MapR has the following capabilities:
Conceptual diagram, showing how Apache Drill integration on MapR enables operational and historical data analytics
The performance benefits can be substantial at scale at a fraction of the cost of the traditional warehousing system! Analysts now do not have to wait for the operational data collected from transaction systems to be ETL-ed into historical data stored in columnar Parquet format. Besides, columnar formats are ill-suited for queries that are highly selective. For example, we ran a simple test rig that confirmed these results as shown below.
Results from a simple test verify that secondary indexes speed up Drill queries on MapR Database JSON, especially when the selectivity is high, corresponding to a low selectivity (%) value.
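To see why selectivity matters, here is a toy Python model. It is not Drill or MapR Database code: it only counts how many rows a full scan reads versus a lookup through a secondary index on the filter column. The table, the column names, and the `build_index` helper are all made up for illustration.

```python
from collections import defaultdict

def build_index(rows, column):
    """Map each value of `column` to the positions of the rows holding it."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[column]].append(pos)
    return index

def full_scan(rows, column, value):
    """Return the matching rows and the number of rows read."""
    matches = [r for r in rows if r[column] == value]
    return matches, len(rows)

def index_lookup(rows, index, value):
    """Return the matching rows and the number of rows read via the index."""
    positions = index.get(value, [])
    return [rows[p] for p in positions], len(positions)

# Hypothetical table: 10,000 orders spread evenly over 100 customers.
orders = [{"order_id": i, "customer_id": i % 100} for i in range(10_000)]
by_customer = build_index(orders, "customer_id")

scan_rows, scan_reads = full_scan(orders, "customer_id", 42)
idx_rows, idx_reads = index_lookup(orders, by_customer, 42)
```

Both paths return the same rows, but the filter keeps only 1% of the table, so the index path reads 100 rows where the scan reads 10,000. As a filter matches a larger share of the table, that gap narrows.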
We also carried out a concurrency test with different numbers of concurrent users (each user is a user stream) sending simple queries into the same system on a TPC-DS dataset. The queries were fired in batches, with the next query being executed when a query slot became available. Across the query types, filter selectivities, and available capacity we tested, we observed that highly selective queries increase system throughput.
Simple concurrency test results showing query throughput and response times as type of query, selectivity and number of users are varied.
As stated earlier, our vision at MapR is to leverage Apache Drill as a unified SQL access layer across files, tables, and streams. We have made significant strides in enabling this capability on files and, with this release, are expanding it to tables. Our near-term focus is to introduce streaming analytics on the platform. For the curious, we are beginning to experiment with streaming SQL analytics in the open source community by building an experimental Kafka plugin. Even though KSQL from Confluent has a developer preview out, we are not convinced that its semantics and architecture considerations have been well thought out, and this needs deeper discussion on our end.
Enterprise-grade query engine capabilities: Security and resource management are the primary concerns.
Authentication options on various communication paths within the Drill architecture
Apache Drill supports spooling to disk for two memory-intensive operators that inevitably appear in most BI/SQL queries: aggregation and sorting. Queries containing such operators will therefore slow down rather than fail. We have tested the functionality on MapR XD and found that queries that would normally fail under limited memory conditions now complete. As expected, these queries do suffer a deterioration in performance. In the next several releases, there are plans in the open source community to add this functionality to other operators, such as join, so that queries as a whole can spool to disk if the need arises.
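The general idea behind spilling a sort to disk can be sketched in Python. This is a generic external merge sort, not Drill's operator, and `max_rows_in_memory` is a hypothetical stand-in for a real memory budget.

```python
import heapq
import tempfile

def external_sort(values, max_rows_in_memory):
    """Sort integers while buffering at most `max_rows_in_memory` of them.

    Each full buffer is sorted and spilled to a temporary file as a "run";
    the runs are then combined with a k-way merge.
    """
    runs = []
    buffer = []

    def spill():
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(f"{v}\n" for v in sorted(buffer))
        run.seek(0)
        runs.append(run)
        buffer.clear()

    for value in values:
        buffer.append(value)
        if len(buffer) >= max_rows_in_memory:
            spill()
    if buffer:
        spill()

    try:
        # heapq.merge reads each run lazily, one line at a time; list() is
        # used here only so the demo can return the whole result at once.
        streams = [(int(line) for line in run) for run in runs]
        merged = list(heapq.merge(*streams))
    finally:
        for run in runs:
            run.close()
    return merged, len(runs)

result, run_count = external_sort([5, 3, 9, 1, 7, 2, 8], max_rows_in_memory=3)
```

With a budget of three rows, the seven input values are written out as three runs and merged back in order. The sort completes within the budget, at the cost of extra disk writes and reads, which mirrors the slowdown described above.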
We are also supporting the ZooKeeper queue feature to better manage the concurrency of the system, with a setting that decides the number of queries that can run concurrently in the Drill cluster at any point in time while all other queries wait. This Drill cluster concurrency should not be confused with user concurrency, which is typically defined as the number of concurrent queries being sent by all users through some front-end client (e.g., Tableau, Web UI, or REST API). User concurrency includes both the queries that are being processed and the queries that are waiting to be processed by the Drill cluster. The waiting time adds to the response time seen by all users, which in turn depends on the cluster response time.
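A back-of-the-envelope model makes the difference between the two concurrency numbers concrete. The sketch below assumes, purely for illustration, that all user queries arrive at the same moment, that each takes the same time to execute, and that they are served in FIFO order; real workloads are messier.

```python
def response_times(user_concurrency, cluster_concurrency, exec_seconds):
    """Response time of each query when all queries arrive at once (FIFO)."""
    times = []
    for position in range(user_concurrency):
        # Number of full batches that must finish before this query starts.
        batches_ahead = position // cluster_concurrency
        wait = batches_ahead * exec_seconds
        times.append(wait + exec_seconds)
    return times

# 10 queries sent by users, 4 allowed to run at once, 2 seconds each.
times = response_times(user_concurrency=10, cluster_concurrency=4, exec_seconds=2.0)
average = sum(times) / len(times)
```

Under these assumptions the first four queries return in 2 seconds, the next four in 4 seconds, and the last two in 6 seconds, for an average of 3.6 seconds, even though the cluster itself never takes more than 2 seconds per query.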
The feature was first released in open source in November of last year, but we have tested and added new features to this capability. The key idea behind using queues is that you can tune the system to allow light workload queries (typically interactive) to run more frequently than heavy workload queries (typically reporting or batch data-intensive queries), which can burden the system and lead to slower system response times or even failures.
For every 5 queries in the small queue that are getting executed, only 1 query in the large queue is allowed to execute. All other queries must wait in FIFO order and are routed to the appropriate queue when a slot opens.
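The 5-to-1 behavior can be modeled with a small Python sketch. It is a toy admission controller, not Drill's implementation: the class name, the slot counts passed as arguments, and the rule that a query whose estimated cost is below the threshold counts as "small" are assumptions made for illustration.

```python
from collections import deque

class QueueAdmission:
    """Toy two-queue admission control with FIFO waiting per queue."""

    def __init__(self, cost_threshold, small_slots=5, large_slots=1):
        self.cost_threshold = cost_threshold
        self.slots = {"small": small_slots, "large": large_slots}
        self.running = {"small": set(), "large": set()}
        self.waiting = {"small": deque(), "large": deque()}

    def queue_for(self, cost):
        """Pick a queue from the estimated query cost."""
        return "small" if cost < self.cost_threshold else "large"

    def submit(self, query_id, cost):
        """Start the query if its queue has a free slot, else make it wait."""
        name = self.queue_for(cost)
        if len(self.running[name]) < self.slots[name]:
            self.running[name].add(query_id)
            return "running"
        self.waiting[name].append(query_id)
        return "waiting"

    def finish(self, query_id):
        """Release a slot and admit the oldest waiting query of that queue."""
        for name in self.running:
            if query_id in self.running[name]:
                self.running[name].remove(query_id)
                if self.waiting[name]:
                    self.running[name].add(self.waiting[name].popleft())
                return
        raise KeyError(query_id)

admission = QueueAdmission(cost_threshold=1_000)
states = [admission.submit(f"s{i}", cost=100) for i in range(6)]
states += [admission.submit(f"l{i}", cost=10_000) for i in range(2)]
admission.finish("s0")  # frees a small slot, so s5 starts running
```

Five light queries and one heavy query start immediately; the sixth light query and the second heavy query wait until a slot in their own queue opens.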
If you are smart enough to manage the spill to disk options, then the heavy workload queries will slow down but not fail. This slowdown, due to queuing and spilling, discourages those users who may have overloaded the system. Incidentally, the control on the concurrency also sets an upper bound on the number of CPU cores that could be used overall, as each query uses the same planner.width.max_per_query parameter. If these resources are sized appropriately, the response time of the system to interactive workloads for a given user concurrency can be improved as the waiting time for such queries can be reduced.
Sample parameters of Drill queues, showing that 5 light workload queries ("Small") run for every heavy workload ("Large") query.
As shown above, Cost Threshold is a measure of the workload size computed by Drill at query time. Since the parameter is an estimate, it needs to be tuned. To enable tuning during POCs and testing, we provide this estimate in the query profile itself, so that you can trace how queries with well-defined workloads were queued based on that parameter. For the curious, query profiles are JSON files that can be queried by Drill itself. From our point of view, this manual queuing feature represents an important step towards our future goal of making queues automatic and dynamic by putting in more intelligent control through feedback mechanisms.
To enable threshold parameter tuning, query profiles contain a new column for total cost and its associated queue.
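One way to use those profile values during a POC is sketched below in Python. The records are invented, and the field names (`totalCost`, `queueName`) and the hand-labeled `expected` field are assumptions for this sketch; check them against the profiles your own cluster produces.

```python
import json

# Hypothetical profile excerpts. Real query profiles contain many more
# fields; "expected" is a label added by hand for known workloads.
profiles_json = """
[
  {"query": "dashboard_1", "totalCost": 1200.0,   "queueName": "small", "expected": "light"},
  {"query": "dashboard_2", "totalCost": 8500.0,   "queueName": "small", "expected": "light"},
  {"query": "report_1",    "totalCost": 40000.0,  "queueName": "small", "expected": "heavy"},
  {"query": "report_2",    "totalCost": 900000.0, "queueName": "large", "expected": "heavy"}
]
"""

def misrouted(profiles):
    """List queries that ran in a different queue than their label implies."""
    wanted = {"light": "small", "heavy": "large"}
    return [p["query"] for p in profiles if p["queueName"] != wanted[p["expected"]]]

def suggest_threshold(profiles):
    """Pick a cost halfway between the costliest light and cheapest heavy query."""
    light = [p["totalCost"] for p in profiles if p["expected"] == "light"]
    heavy = [p["totalCost"] for p in profiles if p["expected"] == "heavy"]
    if max(light) >= min(heavy):
        raise ValueError("cost estimates overlap; no single threshold separates them")
    return (max(light) + min(heavy)) / 2

profiles = json.loads(profiles_json)
wrong_queue = misrouted(profiles)
threshold = suggest_threshold(profiles)
```

In this made-up sample, `report_1` was treated as a light query, and a threshold of 24250.0 would separate the labeled light and heavy workloads. The midpoint rule is only a starting point; with more samples you would look at how the costs cluster before settling on a value.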
If you are an aspiring data scientist, learning SQL could be one thing worth doing now. We recently announced the launch of a new product called the MapR Data Science Refinery, a scalable data science offering that comes pre-packaged with a notebook, Apache Zeppelin. You can use the notebook as a SQL interface to retrieve data and visualize it.
Drill queries alongside a simple pie chart representation.
No release is ever complete without you giving it a try and seeing for yourself if it delivers the value you are looking for. With that, I invite you to give the new Apache Drill 1.11 on MapR 6.0 a try, and let me know your feedback.
The MapR Community pages for Apache Drill also got a face-lift. Check it out here.