Drive Innovation by Breaking Down Data Silos with Apache Drill

Contributed by Dr. Kirk Borne

Data across the enterprise are typically stored in silos belonging to different business divisions and even to different projects within the same division. These silos may be further segmented by services, products, and functions. Silos stifle data sharing and innovation, and they are often identified as a primary impediment (both practical and cultural) to business progress. For example, they make it harder to streamline important business processes, ranging from compliance to data discovery. But breaking down the silos may not be so easy to accomplish. In fact, it is often infeasible due to ownership issues, governance practices, and regulatory concerns.

Big Data silos create additional complications, including data duplication (and the associated increased costs), complicated data replication solutions, high data latency, and data quality concerns. Worse, they enable the truly problematic situation in which your data repositories hold different versions of the truth. The silos also limit business intelligence (discovery and actionable insights). As big data best practices rise above the hype and noise, we now know that actionable value is more easily extracted when multiple data sets can be integrated and viewed holistically.

Data analysts naturally want to break down silos to combine data from multiple data sources. Unfortunately, this can create its own bottleneck: a complex integration labyrinth that is costly to maintain, rarely performs well, and can’t be guaranteed to provide consistent results.

In response, many companies have deployed Apache Hadoop to address the problem of segregated data. Hadoop enables multiple types of data to be directly processed in place, and it fully supports data integration from multiple sources across different data storage technologies.

Drilling Through the Silos

Organizations that use Hadoop are finding additional benefits with Apache Drill, an open source project inspired by Google’s Dremel system. Google built Dremel for interactive analysis of large data sets, which is also a reasonable introductory description of what Drill does: (1) Drill provides analysts with self-service data discovery abilities; (2) Drill enables data analysts to explore data without waiting for IT to define new schemas or to create new ETL processes; and (3) Drill’s engine dynamically discovers the source schemas and adjusts query plans as analysts work with the data.

Working with self-describing data, while being able to process complex data types on the fly, consequently enables data analysts to test new hypotheses, to generate ad hoc queries, and to discover new actionable insights from the correlations and anomalies across multiple data sources such as social media platforms, clickstreams, customer service records, transaction data, competitive analyses, business reports, historical trends, and much more.

For example, a mix of historical, near-real-time, and real-time information provides the insight needed to understand why sales have dropped or stalled, predict customer needs, pre-target and retarget consumers, engage with active and past customers more effectively, and act proactively to address evolving issues that could impact profitability. This empowers analysts to engage in more innovative business intelligence processes: moving beyond descriptive analytics to predictive, prescriptive, and cognitive analytics.

Get To Know The Drill

Apache Drill is not a database; it is a standalone SQL query engine that works with numerous underlying data sources and formats. Drill can combine data from multiple sources on the fly in a single query, incorporating new data sources immediately, providing refreshed views of existing data sources easily, and then presenting those data in a unified manner to an end-user visualization or analysis tool. This is truly “breaking down the data silos” with ease, flexibility, and style.

For instance, one can write a single JOIN query that combines a complex nested JSON file and a simple CSV file, both sitting on a distributed file system such as HDFS, with a table stored in a NoSQL database such as Apache HBase. This functionality closes the gap that has existed between systems built for efficient use of Big Data (such as the newer Hadoop-based systems and non-relational databases) and systems that deliver traditional SQL compatibility.
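To make that concrete, here is a minimal sketch of what such a three-way join could look like when submitted through Drill's REST interface. The file paths, the HBase table name, the column family `profile`, and every field name are hypothetical, and the URL assumes a Drill instance listening on its default web port on the local machine. The query relies on two Drill conventions: a header-less CSV file is exposed as a `columns` array, and HBase values are stored as bytes, so they are decoded with `CONVERT_FROM`.

```python
import json
import urllib.request

# Assumption: a local Drill instance with its web interface on port 8047.
DRILL_URL = "http://localhost:8047/query.json"

# Hypothetical sources: a nested JSON file and a CSV file on the `dfs`
# storage plugin, plus an HBase table named `customers`.
JOIN_SQL = """
SELECT o.order_id,
       o.amount,
       c.columns[1] AS region_name,
       CONVERT_FROM(p.profile.name, 'UTF8') AS customer_name
FROM dfs.`/data/orders.json` o
JOIN dfs.`/data/regions.csv` c
  ON o.customer.region_code = c.columns[0]
JOIN hbase.`customers` p
  ON o.customer.id = CONVERT_FROM(p.row_key, 'UTF8')
"""


def build_request(sql, url=DRILL_URL):
    """Wrap a SQL string in the JSON body that Drill's REST endpoint expects."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRILL_URL):
    """Send the query to a running Drill instance and return the result rows."""
    with urllib.request.urlopen(build_request(sql, url)) as resp:
        return json.load(resp)["rows"]
```

Calling `run_query(JOIN_SQL)` requires a running Drill instance with the `dfs` and `hbase` storage plugins enabled. No schema is declared for any of the three sources before the query is issued.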

Drill also provides analysts with familiar ANSI SQL capabilities and standard ODBC connectivity that enables them to leverage their existing SQL skills and business intelligence (BI) tools to directly query self-describing data and process complex data types.

Drill is well suited to tasks that demand low latency, such as data exploration, discovery, ad hoc BI queries, and “day zero” analytics. It supports interactive queries rather than batch-oriented requests, and can quickly serve up a snapshot of summary statistics to initiate an extended, explorative analysis of a Data Lake.

Drill can easily scale from a single laptop to a large cluster of servers. It enables data to be queried in its native formats (in those data silos), including nested data, schema-less data, and dynamic data. Since there is no need to explicitly define and maintain schemas, Drill fully enables self-service data integration and exploration across distributed sources. Live data can be queried upon arrival, and analysts can change data sources on the fly without waiting for DBA services to structure the data into a static relational form.
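As an illustration of querying nested data in its native format, the sketch below builds a Drill query that unnests an array inside a JSON file using Drill's `FLATTEN` function. The file path and the field names (`order_id`, `items`) are hypothetical; the point is only that the query names the file directly, with no table definition or schema created beforehand.

```python
def nested_items_query(path, array_field="items"):
    """Build a Drill query that emits one row per element of a nested array.

    `path` is a file on the `dfs` storage plugin; `array_field` is the name
    of a repeated (array) field in each JSON record. Both are placeholders.
    """
    # Backticks delimit the path in Drill SQL, so reject them in the input.
    if "`" in path:
        raise ValueError("path must not contain a backtick")
    return (
        f"SELECT t.order_id, FLATTEN(t.{array_field}) AS item "
        f"FROM dfs.`{path}` t"
    )


# Example: one output row per line item in each order record.
query = nested_items_query("/data/orders.json")
```

The resulting string can be submitted through any Drill client (the shell, the REST interface, or an ODBC or JDBC connection).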

Mission Critical Readiness

The open source community built Drill by refining the features of Google’s Dremel and adding enhanced capabilities. These include: the extensibility of Drill’s architecture, increased agility, support for full SQL, optional schema handling, and the ability to handle nested data (such as JSON, Protobuf, and Parquet). The community continues to advance Apache Drill’s key technologies and performance.

The Apache Software Foundation announced in December 2014 that it had promoted Drill to a top-level project at Apache, where it joins other illustrious projects such as Apache Hadoop and httpd (the world's most popular Web server).

See Drill in Action

For a demo showing how an online retailer might use Drill to explore why sales have dropped, and to gather insights on how best to address the situation, view this video:

Finally, if you’re ready to test-drive Drill, you can do so using the MapR Sandbox for Hadoop with Drill, which runs on PC, Mac, and Linux platforms.

If you have any questions or comments, feel free to post them in the comments section below.

This blog post was published April 14, 2015.
