Congratulations to the Apache Drill community on reaching a big milestone. Apache Drill 0.4.0—a developer preview—has just been released. This is the first in a series of monthly builds the project team will deliver as it drives towards Beta and GA milestones.
Let’s take a brief look at why Apache Drill matters and its key features.
Modern applications such as social and mobile apps, along with sensor data from the Internet of Things, are generating an order of magnitude more data than before. The challenge is not just scale: the variety and variability of these datasets are growing as well. The datasets associated with these applications are often self-describing, include complex or multi-structured content, and evolve rapidly.
Take the example of the JSON data format, which is quickly becoming the lingua franca of data on the Internet for APIs, data exchange, data storage, and data processing. HBase, another example, is a highly scalable NoSQL database capable of storing thousands of columns in a single row, allowing every row to have its own schema. Many other formats and systems are growing rapidly in the Hadoop ecosystem, such as Parquet, Avro, and Protobuf, where data is neither flat nor bound to a fixed, static schema as it is in relational databases.
So how does one perform interactive SQL queries on such datasets?
To solve this problem, Apache Drill takes a different approach to SQL-on-Hadoop than Hive and related technologies. The goal for Drill is to enable self-service data exploration by bringing the SQL ecosystem and the performance of relational systems to Hadoop-scale data, without compromising the flexibility of Hadoop/NoSQL.
Below are core elements of Drill that enable this goal.
Agility: Apache Drill allows direct queries on self-describing and semi-structured data in files (such as JSON and Parquet) and in HBase tables, without needing to specify metadata definitions in a centralized store such as the Hive metastore. This means that you can explore live data on your own as it arrives on Hadoop, before spending weeks or months on data preparation, modeling, and subsequent schema management. You can optionally create models in Drill once you understand the value of the data and want to operationalize it for recurring questions and reporting needs.
Flexibility: Drill provides a JSON-like internal data model to represent and process data. The flexibility of this data model allows Drill to query, without flattening, both simple and complex/nested data types as well as constantly changing application-driven schemas commonly seen with Hadoop/NoSQL applications. Drill also provides intuitive extensions to SQL to work with complex/nested data types.
Familiarity: With Drill, you can minimize switching costs and learning curves by using familiar ANSI SQL syntax and existing BI/analytics tools through JDBC/ODBC drivers. You can also plug-and-play with Hive environments to run ad hoc, low-latency queries on existing Hive tables and reuse Hive's metadata, hundreds of file formats, and UDFs out of the box.
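To make the agility and flexibility points more concrete, here is a minimal Python sketch of the general idea of querying self-describing, nested records without declaring a schema first and without flattening them. It is not Drill's API, its SQL syntax, or its implementation; the sample records and the helper names (`get_path`, `select`, `discovered_fields`) are hypothetical and exist only for illustration.

```python
import json

# Hypothetical sample data: self-describing JSON records whose fields differ per record.
RAW_LINES = [
    '{"user": {"name": "ana", "geo": {"city": "Lima"}}, "clicks": 3}',
    '{"user": {"name": "bo"}, "clicks": 7, "device": "mobile"}',
    '{"user": {"name": "cy", "geo": {"city": "Oslo"}}, "clicks": 1}',
]


def get_path(record, path):
    """Follow a dotted path such as 'user.geo.city'; return None if any step is missing."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def select(lines, columns, where=None):
    """Parse each record, optionally filter it, and project the requested paths."""
    rows = []
    for line in lines:
        record = json.loads(line)  # structure is read from the data itself
        if where is not None and not where(record):
            continue
        rows.append({col: get_path(record, col) for col in columns})
    return rows


def discovered_fields(lines):
    """Return the union of top-level field names seen across all records."""
    fields = set()
    for line in lines:
        fields.update(json.loads(line).keys())
    return sorted(fields)


if __name__ == "__main__":
    # Roughly "select user.name, user.geo.city where clicks >= 3", in plain Python.
    for row in select(RAW_LINES, ["user.name", "user.geo.city"],
                      where=lambda r: r["clicks"] >= 3):
        print(row)
    print(discovered_fields(RAW_LINES))
```

Note that the second record has no `geo` field and carries an extra `device` field, yet nothing had to be declared up front: missing paths simply come back as `None`, and the set of fields is discovered from the data as it is read.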
The developer preview release lets you experiment with all of the core elements of Drill. For a detailed list of features, refer to the Apache Drill 0.4 announcement and the release notes. Note that MapR does not formally support developer preview packages.
Congratulations again to the Apache Drill team! We’re looking forward to the 0.5 release.