Freedom of Data Exploration with Apache Drill

It's the 21st century, and who doesn’t want to be independent? Everyone wants to enjoy some freedom in whatever work they do. As a business user, you can experience this kind of freedom with Apache Drill, which gives you the flexibility to explore data in new and powerful ways.

Apache Drill gives you the ability to analyze your business trends and get insights from your data in real time, whether it's structured or unstructured. You don't have to wait weeks and months for the data to be transformed by ETL processes. Let's take a look at the unique features of Apache Drill:

Apache Drill Flexibility

What is Drill?

  • Drill is a low-latency, distributed SQL query engine for exploring large-scale datasets.
  • The datasets need not be structured. They can be semi-structured, nested, or schema-less.
  • It does not require any schema for query execution and discovers the schema at run time. Hence, a DBA is not required to maintain the model or schema definitions.
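To make schema discovery at run time concrete, here is a minimal Python sketch that composes a Drill query over a raw JSON file, with no table or schema defined beforehand. It assumes the file system storage plugin is registered under its usual default name, `dfs`; the file path and the helper name `file_query` are hypothetical and exist only for illustration.

```python
def file_query(path: str, limit: int = 5, plugin: str = "dfs") -> str:
    """Build a Drill SQL statement that reads a raw file in place.

    Drill addresses a file as plugin.`path` and works out the columns
    while it reads, so no CREATE TABLE or schema definition comes first.
    """
    if "`" in path:
        raise ValueError("path must not contain a backtick")
    if limit < 1:
        raise ValueError("limit must be positive")
    return f"SELECT * FROM {plugin}.`{path}` LIMIT {limit}"


# Hypothetical file path, used only for illustration.
print(file_query("/data/logs/clicks.json"))
```

The resulting statement can be pasted into any Drill client; the point is that the file itself plays the role of the table.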

Coming from a traditional database background, I often ran into limitations. For instance:

  • The schema and data types of the fields had to be defined before combining or joining two tables.
  • If one of the sources was a file, its data types had to be matched to the table's columns. Then the file either had to be loaded into a database or compared to the table through an ETL process.

Many NoSQL databases have overcome the limitations of structured data by making it easy to store any kind of data without worrying about schema definitions. Apache Drill goes a step further: it can join a raw file to a relational table or a NoSQL database.

In addition to fast processing at big data scale, Drill gives us the freedom to combine different data sources in a single query without defining any central metadata. For example, we can join a Hive table with an HBase table and a file in a single query; this powerful feature helps Drill stand out from other query engines.
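As a rough illustration of such a cross-source query, the Python sketch below assembles one SQL statement that joins a Hive table, an HBase table, and a JSON file. Everything specific in it is an assumption: the storage plugin names (`hive`, `hbase`, `dfs`) depend on how the cluster is configured, and the table names, column names, column family `profile`, and file path are hypothetical. HBase stores values as bytes, so the sketch uses Drill's `CONVERT_FROM` function to decode them before comparing.

```python
def cross_source_join(hive_table: str, hbase_table: str, file_path: str) -> str:
    """Return one Drill SQL statement spanning three data sources."""
    return (
        "SELECT o.order_id,\n"
        "       CONVERT_FROM(c.profile.name, 'UTF8') AS customer_name,\n"
        "       f.clicks\n"
        # Hive table, referenced through the (assumed) `hive` plugin.
        f"FROM hive.`{hive_table}` o\n"
        # HBase table; row_key and column values arrive as raw bytes.
        f"JOIN hbase.`{hbase_table}` c\n"
        "  ON CONVERT_FROM(c.row_key, 'UTF8') = o.customer_id\n"
        # Raw JSON file, read in place through the (assumed) `dfs` plugin.
        f"JOIN dfs.`{file_path}` f\n"
        "  ON f.customer_id = o.customer_id"
    )


# Hypothetical names, used only for illustration.
print(cross_source_join("orders", "customers", "/data/clicks.json"))
```

In practice the join keys may also need explicit casts so that their types line up; the sketch leaves that out to keep the shape of the query visible.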

How does Drill help end users?

Because Drill works with BI tools such as Tableau, MicroStrategy, QlikView, and Excel, you can derive trends by changing the criteria without depending on an ETL process to load formatted, updated data into tables. You can get the latest data in any format and drill down into it as you explore. In addition, you don't need to invest time in learning a new query language; all queries are written in the industry-standard ANSI SQL.

How does Drill handle all of these unique features at once?

  • Drill has a distributed execution engine. You can submit the query to any node of a cluster, and Drill can handle it.
  • Drill provides the ability to discover schemas on-the-fly and provides ANSI SQL extensions to natively query and manipulate complex data.
  • It processes data in a columnar format, which allows the CPU to operate on vectors of values.
  • Drill uses a pipelined execution model, and query execution occurs in memory as much as possible in order to avoid disk I/O.
  • It compiles the queries at runtime, which enables faster execution.
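The first point, that any node can accept a query, can be sketched with Drill's REST interface, which takes a JSON body with `queryType` and `query` fields at `/query.json`. The Python sketch below uses only the standard library. The host names are hypothetical, port 8047 is assumed to be the web port (it is Drill's default), and the helper names `build_request` and `submit` are made up for this example.

```python
import json
import random
import urllib.request

# Hypothetical Drillbit hosts; 8047 is assumed to be the REST/web port.
DRILLBITS = ["drill-node1:8047", "drill-node2:8047", "drill-node3:8047"]


def build_request(sql: str, host: str) -> urllib.request.Request:
    """Prepare a POST to one node's /query.json endpoint."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(sql: str, hosts=DRILLBITS, timeout: float = 30):
    """Send the query to a randomly chosen node and return the parsed reply."""
    request = build_request(sql, random.choice(hosts))
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Build (but do not send) a request, so the sketch runs without a cluster.
req = build_request("SELECT * FROM sys.version", DRILLBITS[0])
print(req.get_method(), req.full_url)
```

Picking the host at random is only one simple way to spread queries across nodes; calling `submit` requires a reachable Drill cluster.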

Apache Drill provides a new way of thinking about business intelligence and analytics. It enables you to explore data without depending on ETL and other data formatting tools, and it queries the data in situ. At the same time, it is easy to use for anyone with a basic understanding of SQL, and its integration with BI tools makes it even more powerful.

For more information on the Apache Drill architecture, please refer to the Apache Drill documentation: http://drill.apache.org/docs/


This blog post was published July 06, 2016.