Apache Hive

Open Source Data Warehouse on Top of Hadoop




Apache Hive is a data warehouse system built on top of Apache Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in various databases and file systems that integrate with Hadoop, including the MapR Data Platform with MapR XD and MapR Database. Hive offers a simple way to apply structure to large amounts of unstructured data and then perform batch SQL-like queries on that data. Hive easily integrates with traditional data center technologies using the familiar JDBC/ODBC interface.

Free Hadoop Training:
Query and Store Data with Apache Hive

  • Review of SQL-on-Hadoop tools
  • Learn how to create, load, and manipulate tables in Hive
  • Learn how to query structured data with the Hive Query Language, without writing MapReduce code

Take a moment to explore the free course.

The Hive metastore provides a simple mechanism to project structure onto large amounts of unstructured data by applying a table schema on top of the data. This table abstraction of the underlying data structures and file locations presents users with a relational view of data in the file systems and NoSQL databases. Structure is applied to data at time of read, so users don’t need to worry about formatting the data when it is stored in their cluster. Data can be read from a variety of formats, from unstructured flat files with comma- or space-separated text, to semi-structured JSON files, to structured HBase tables.
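As a sketch of this schema-on-read approach, a table definition can be projected onto raw comma-separated files already sitting in the cluster. The table, column names, and path below are hypothetical:

```sql
-- Project a schema onto raw CSV files at read time; the files stay
-- where they are and are only interpreted when queried.
CREATE EXTERNAL TABLE web_logs (
  ip     STRING,
  ts     STRING,
  url    STRING,
  status INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/raw/web_logs';  -- hypothetical directory of CSV files
```

Because the table is EXTERNAL, dropping it removes only the metastore entry; the underlying files are untouched.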

Hive features a SQL-like programming interface called HiveQL for querying data stored in the various databases and file systems that integrate with Hadoop. Hive automatically translates HiveQL queries into batch MapReduce jobs.
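For example, an aggregation like the following (table and column names are illustrative) is compiled by Hive into a MapReduce job, with the scan and filter running in map tasks and the grouping in reduce tasks:

```sql
-- Hive compiles this into MapReduce: the table scan and WHERE filter
-- run in the map phase, the GROUP BY aggregation in the reduce phase.
SELECT status, COUNT(*) AS hits
FROM web_logs
WHERE url LIKE '/products/%'
GROUP BY status;
```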

Several efforts have emerged for faster execution of HiveQL or SQL on top of Hadoop:

  • Apache Spark is a powerful unified analytics engine for large-scale distributed data processing and machine learning. The Hive metastore can be used with Spark SQL, and HiveQL can run on the Spark execution engine, which optimizes workflows and offers in-memory processing that significantly improves performance.
  • Apache Drill is an open source distributed SQL query engine that offers fast, in-memory processing and ANSI SQL rather than HiveQL. Drill can leverage the metadata in the Hive metastore for querying, and it can also query nested data with dynamic schemas.
  • Tez emerged as a complementary high-performance execution engine with the introduction of YARN as an independent resource manager. Hive can run on Tez, allowing queries to run significantly faster.
  • Impala leverages Hive’s query language (HiveQL) and metastore to bring interactive SQL to Hadoop.
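Switching between execution engines is typically a per-session setting in Hive, assuming the chosen engine is installed and configured on the cluster:

```sql
-- Select the engine Hive uses for subsequent queries in this session;
-- the values accepted depend on the Hive version and installation.
SET hive.execution.engine=mr;    -- classic MapReduce
SET hive.execution.engine=tez;   -- Tez
SET hive.execution.engine=spark; -- Hive-on-Spark
```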



Challenges with Previous Technologies

  • Before Hive, there was MapReduce, a scalable, resilient distributed processing framework that enabled Google to index the exploding volume of content on the web across large clusters of commodity servers.
  • MapReduce provides batch analysis of massive volumes of semi-structured and unstructured data in Hadoop, but the MapReduce Java API is not easy to program with, especially for non-programmers.

Advantages of Hive

  • Hive was initially developed at Facebook to summarize, query, and analyze large amounts of data stored on a distributed file system. Hive makes it easy for non-programmers to read, write, and manage large datasets residing in distributed Hadoop storage using HiveQL SQL-like queries. Hive has gained a lot of popularity due to its ease of use and compatibility with existing business applications through ODBC.



Shared Metastore and Data

Hive can function as a relational data store for Spark in predictive analytics, machine learning, and other programming models via API access or Spark SQL. Hive features a metastore that maintains metadata about Hive tables (location and schema) and their partitions, and makes that metadata programmatically available to developers via a metastore service API.
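The same metadata that the metastore service exposes programmatically can also be inspected directly from HiveQL; for instance (table name hypothetical):

```sql
-- Show the schema, storage location, input/output formats, and other
-- metadata the metastore holds for a table.
DESCRIBE FORMATTED web_logs;
SHOW PARTITIONS web_logs;  -- only valid if the table is partitioned
```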


Data Warehouse Optimization at Lower TCO

The Hadoop/Spark platform is increasingly used to augment, optimize, or even replace a traditional data warehouse. Hive is used to provide SQL access to new generation big data warehouses.

Data lakes are a foundational use case of the MapR Data Platform. Enterprise data centers looking to rationalize and liberate potentially valuable data of all types use the MapR Data Platform to provide the storage and management capabilities to put all enterprise datasets into play.

BI/DW teams are discovering that many BI applications and queries can run on the MapR Data Platform at a fraction of the price of traditional data warehouse platforms.


Data Integration

Data produced by different systems across a business is rarely clean or consistent enough to be simply and easily combined for reporting or analysis. Extract, transform, and load (ETL) processes are often used to pull data from different systems, clean and standardize it, and then load it into a separate system for analysis. Hive, Drill, Spark, and Hadoop are increasingly being used to reduce the cost and time required for this ETL process.
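A minimal sketch of one such ETL step in HiveQL, with hypothetical table and column names: raw staged records are cleaned and standardized as they are loaded into an analysis table stored in a columnar format:

```sql
-- Transform raw staged records into a cleaned, columnar table
-- suitable for reporting and analysis queries.
CREATE TABLE orders_clean (
  order_id   BIGINT,
  customer   STRING,
  amount_usd DECIMAL(10,2),
  order_date DATE
)
STORED AS ORC;

INSERT OVERWRITE TABLE orders_clean
SELECT CAST(order_id AS BIGINT),
       TRIM(LOWER(customer)),        -- standardize customer names
       CAST(amount AS DECIMAL(10,2)),
       TO_DATE(order_ts)             -- normalize timestamps to dates
FROM raw_orders
WHERE order_id IS NOT NULL;          -- drop malformed records
```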


Rich Processing Ecosystem for Data Mining

Hive-on-MapR users benefit from the integration of the key core open source projects (Drill, Spark, HBase, Zeppelin, etc.) as well as optimized native services (MapR XD, MapR Database, and MapR Event Store for Apache Kafka).




  • Data analysts use Hive to query, summarize, explore, and analyze data, then turn it into actionable business insight.


  • Hive provides BI/DW teams a natural progression from traditional data warehousing environments to the world of big data.


  • Meet line of business data needs at a lower cost. Grant fast, secure, multi-tenant access to all data for the full spectrum of analytics needs.
  • Accelerate the business. Support in-place analytics, stateful containerized applications, and much more.
  • Deploy anywhere – in the public cloud, on-premises, at the edge, or all of the above at once.


  • The MapR Data Platform is built for production. Consistent snapshots, replicas, and mirroring deliver enterprise-grade high availability and disaster recovery.
  • MapR XD is multi-tenant by design. Assign policies (quotas, permissions, placement) to logical units of management called volumes.
  • Balance cost and performance with MapR XD. Leverage policy-based data tiering, erasure coding, data placement, and more.
  • Establish big data capabilities with Hadoop, HBase, Hive, and MapR Database.


A confluence of several different technology shifts has dramatically changed big data and machine learning applications. The combination of distributed computing, streaming analytics, and machine learning is accelerating the development of next-generation intelligent applications that take advantage of modern computational paradigms and infrastructure. The MapR Data Platform integrates global event streaming, real-time database capabilities, and scalable enterprise storage with Hadoop, Hive, Spark, Drill, and machine learning libraries to power this new generation of data processing pipelines and intelligent applications. Diverse and open APIs allow all types of analytics workflows to run on the data in place:

  • The MapR XD Distributed File and Object Store is designed to store data at exabyte scale, support trillions of files, and combine analytics and operations into a single platform. MapR XD supports industry standard protocols and APIs, including POSIX, NFS, S3, and HDFS. Unlike Apache HDFS, which is write once, append-only, the MapR Data Platform delivers a true read-write, POSIX-compliant file system. Support for the HDFS API enables Spark and Hadoop ecosystem tools, for both batch and streaming, to interact with MapR XD. Support for POSIX enables Spark and all non-Hadoop libraries to read and write to the distributed data store as if the data were mounted locally, which greatly expands the possible use cases for next-generation applications. Support for an S3-compatible API means MapR XD can also serve as the foundation for Spark applications that leverage object storage.
  • The MapR Event Store for Apache Kafka is the first big-data-scale streaming system built into a unified data platform and the only big data streaming system to support global event replication reliably at IoT scale. Support for the Kafka API enables Spark streaming applications to interact with data in real time in a unified data platform, which minimizes maintenance and data copying.
  • MapR Database is a high-performance NoSQL database built into the MapR Data Platform. MapR Database is multi-model: wide-column, key-value with the HBase API or JSON (document) with the OJAI API. Spark connectors are integrated for both HBase and OJAI APIs, enabling real-time and batch pipelines with MapR Database:
    • The MapR Database Connector for Apache Spark enables you to use MapR Database as a sink for Spark Structured Streaming or Spark Streaming.
    • The Spark MapR Database Connector enables users to perform complex SQL queries and updates on top of MapR Database, while applying critical techniques such as projection and filter pushdown, custom partitioning, and data locality.

MapR puts the key technologies essential to high scale and high reliability into a fully distributed architecture that spans on-premises, cloud, and multi-cloud deployments, including edge-first IoT, while dramatically lowering both the hardware and operational costs of your most important applications and data.