SQL on Hadoop Details

Within the big data landscape there are multiple approaches to accessing, analyzing, and manipulating data in Hadoop. Choosing among them depends on key considerations such as latency, ANSI SQL completeness (including the ability to tolerate machine-generated SQL), developer and analyst skill sets, and architecture trade-offs.

The discussion below is segmented by the broad latency characteristics of each approach.

[Diagram: SQL-on-Hadoop segments by latency]

Batch SQL

Technologies such as Hive are designed for batch queries on Hadoop by providing a declarative abstraction layer (HiveQL), which uses the MapReduce processing framework in the background. Hive is used primarily for queries on very large data sets and large ETL jobs. Queries can take anywhere from a few minutes to several hours depending on the complexity of the job. The Apache Tez project aims to provide targeted performance improvements for Hive to deliver interactive query capabilities. MapR ships and supports Apache Hive today and provides an early Developer Preview of Apache Tez.
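A typical batch workload is a large aggregation materialized into a new table. The sketch below is illustrative HiveQL; the table and column names (web_logs, event_date, url) are hypothetical, not from the text.

```sql
-- Hypothetical ETL-style HiveQL job: aggregate raw logs into a summary table.
-- Hive compiles this into MapReduce (or Tez) jobs behind the scenes.
CREATE TABLE daily_page_views AS
SELECT event_date, url, COUNT(*) AS views
FROM web_logs
GROUP BY event_date, url;
```

On very large datasets, a statement like this commonly runs for minutes to hours, which is why Hive sits in the batch segment.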

Interactive SQL

Technologies such as Impala and Apache Drill provide interactive query capabilities to enable traditional business intelligence and analytics on Hadoop-scale datasets. Response times range from milliseconds to minutes depending on query complexity. Users expect SQL-on-Hadoop technologies to support common BI tools such as Tableau and MicroStrategy for reporting and ad-hoc queries. MapR supports customers using both Apache Drill and Impala on the MapR Converged Data Platform.
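One distinguishing capability of Drill is querying raw files in place, with the schema discovered at read time. The sketch below is a hypothetical Drill query; the dfs path and JSON field names are assumptions for illustration.

```sql
-- Hypothetical Drill query over raw JSON files (schema-on-read):
-- no table definition or ETL step is required beforehand.
SELECT t.page, COUNT(*) AS clicks
FROM dfs.`/data/clickstream/*.json` t
GROUP BY t.page
ORDER BY clicks DESC
LIMIT 10;
```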

In-Memory SQL

In-memory computing has enabled new ecosystem projects such as Apache Spark to further accelerate query processing. Spark SQL uses in-memory computing while retaining full Hive compatibility, delivering queries up to 100x faster than Hive on MapReduce. MapR customers are using Spark with the MapR Converged Data Platform today.
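Spark SQL's speedup comes from keeping working data in memory across queries. The sketch below shows Spark SQL statements as they might be submitted from a Spark application (e.g., via sqlContext.sql(...)); the table name web_logs is hypothetical.

```sql
-- CACHE TABLE pins the table in executor memory, so repeated
-- queries against it avoid re-reading from disk.
CACHE TABLE web_logs;

SELECT url, COUNT(*) AS views
FROM web_logs
GROUP BY url;
```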

Operational SQL

Batch and interactive queries are read-only operations on large datasets (OLAP), used by business teams for decision making. Point queries, by contrast, are typically issued by OLTP and web applications, operate on smaller datasets, and commonly include inserts, updates, and deletes. The expected latency is very low (e.g., milliseconds) because of the high volume of requests from these applications. MapR ships and supports operational SQL capabilities by providing Apache HBase support in the MapR Converged Enterprise Edition.

Interactive SQL-on-Hadoop Technology Landscape

SQL technologies complement traditional data warehouse and analytical environments for:

  • Interactive and ad-hoc queries on large-scale data
  • Data exploration to discover new insights worth modeling into a data warehouse schema
  • Interactive queries on more or new types of data
  • Queries against online data archives kept in Hadoop rather than backed up to tape

Technologies and approaches for interactive SQL vary and include (but are not limited to):

  1. Querying the data using connectors from Hadoop to analytic platforms (upfront or at query run time with external tables)
  2. Running traditional SQL engines side by side on every node of the Hadoop cluster
  3. Providing purpose-built SQL engines directly on top of Hadoop (native SQL options)
  4. Efforts to improve MapReduce performance to make it suitable for interactive queries

Options (1) and (2) excel at SQL support, performance optimizations, and overall enterprise readiness. However, native SQL-on-Hadoop options (3) are evolving as cost-effective alternatives because they offer stronger compatibility with the Hadoop ecosystem (e.g., they use Hadoop-native file formats and a common metadata store through Hive).

SQL technologies available on MapR

The table below describes at a high level some of the key considerations for picking the right SQL-on-Hadoop technology. Please contact us for specific questions about your use case.

                              Drill               Hive                Impala              Spark SQL

Key use cases                 Self-service data   Batch / ETL /       Interactive BI /    SQL as part of
                              exploration;        long-running jobs   ad-hoc queries      Spark pipelines /
                              interactive BI /                                            advanced analytic
                              ad-hoc queries                                              workflows

Data sources
  File formats                Parquet, JSON,      All Hive file       Parquet, Sequence,  Parquet, JSON,
                              Text, all Hive      formats             RC, Text, Avro, ... Text, all Hive
                              file formats                                                file formats
  HBase/MapR Database         Yes                 Yes                 Yes                 Yes
  Beyond Hadoop               Yes                 No                  No                  Yes

Data types
  Relational                  Yes                 Yes                 Yes                 Yes
  Complex/nested              Yes                 Limited             No                  Limited

Metadata
  Schema-less/dynamic schema  Yes                 No                  No                  Limited
  Hive metastore              Yes                 Yes                 Yes                 Yes

SQL / BI tools
  SQL support                 ANSI SQL            HiveQL              HiveQL              ANSI SQL (limited)
                                                                                          and HiveQL
  Beyond memory               Yes                 Yes                 Yes                 Yes
  Optimizer                   Limited             Limited             Limited             Limited

Platform
  Latency                     Low                 Medium              Low                 Low (in-memory) /
                                                                                          Medium
  Concurrency                 High                Medium              High                Medium
  Decentralized granular      Yes                 No                  No                  No
  security