Open Source Engines

MapR packages a broad set of Apache open source ecosystem projects that enable big data applications. The goal is to provide an open platform that lets you choose the right tool for the job. MapR tests and integrates open source ecosystem projects such as Apache Hive, Pig, HBase, and Mahout, among others. The MapR Converged Data Platform ties these open source projects together through an advanced management console for monitoring and managing the system.

The MapR Ecosystem Pack (MEP) gives customers quick access to the latest innovations from the open source community, while ensuring interoperability of all ecosystem projects in a given MEP release. MapR pioneered the decoupling of platform versions from project versions, and MEP is the next evolution of that process. This decoupling gives customers flexibility in when to upgrade their environment, and each MEP release ensures a fully compatible deployment.


Core Hadoop


Apache Hadoop

Apache Hadoop was born out of a need to process an avalanche of big data. The web was generating more and more information on a daily basis, and it was becoming very difficult to index over one billion pages of content. Hadoop has moved far beyond its beginnings in web indexing and is now used in many industries for a wide variety of tasks that share a common theme: high volume, velocity, and variety of data, both structured and unstructured.

Apache Hadoop YARN

YARN (Yet Another Resource Negotiator) is a core component of Hadoop that manages access to all resources in a cluster. Before YARN, jobs were forced to go through the MapReduce framework, which is designed for long-running batch operations. Now, YARN brokers access to cluster compute resources on behalf of multiple applications, using selectable criteria such as fairness or capacity, allowing for a more general-purpose experience.
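
To make the scheduling idea concrete, here is a toy sketch in plain Python (not YARN code; the application names and numbers are invented) of a fair policy that hands containers to whichever application currently holds the fewest, capped at what each actually requested:

```python
# A toy illustration of the "fair" scheduling criterion mentioned above:
# each application receives an equal share of the cluster's containers,
# capped at what it actually requested; surplus flows to the apps that
# still want more.
def fair_allocate(total_containers, requests):
    """requests: dict of application name -> containers requested."""
    allocation = {app: 0 for app in requests}
    remaining = total_containers
    # Repeatedly hand one container to the least-served app that still wants more.
    while remaining > 0:
        candidates = [a for a in requests if allocation[a] < requests[a]]
        if not candidates:
            break
        neediest = min(candidates, key=lambda a: allocation[a])
        allocation[neediest] += 1
        remaining -= 1
    return allocation

print(fair_allocate(10, {"spark-job": 8, "hive-query": 3}))
# {'spark-job': 7, 'hive-query': 3}
```

Note how "hive-query" gets everything it asked for, while "spark-job" absorbs the leftover share, which is the behavior a fair scheduler with demand caps aims for.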

Apache MapReduce

Apache MapReduce is a powerful framework for processing large, distributed sets of structured or unstructured data on a Hadoop cluster. The key feature of MapReduce is its ability to perform processing across an entire cluster of nodes, with each node processing its local data. This feature makes MapReduce orders of magnitude faster than legacy methods of processing big data, which often consisted of a single node accessing and processing data located in remote SAN or NAS devices.
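
The model itself is easy to see in miniature. Below is a single-process Python sketch of MapReduce-style word counting (illustrative only; on a real cluster the map calls run in parallel on the nodes that hold each input split, and the shuffle moves data over the network):

```python
from collections import defaultdict

# Map phase: emit (key, value) pairs from each input record.
def map_phase(line):
    for word in line.split():
        yield (word.lower(), 1)

# Shuffle: group all values by key, as the framework does between phases.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: combine each key's values into a final result.
def reduce_phase(key, values):
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog"]
pairs = [p for line in lines for p in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # 2
```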


Apache Hive

Apache Hive is an open source Hadoop application for data warehousing. It offers a simple way to apply structure to large amounts of unstructured data, and then perform batch SQL-like queries on that data.


Apache Pig

Many users who are new to Hadoop find that the MapReduce framework has a steep learning curve. Apache Pig helps these users by offering a simpler alternative for transforming and analyzing large data sets. Users write scripts in a high-level language called Pig Latin, which Pig translates into MapReduce jobs that run on a Hadoop cluster.


Apache Spark

Apache Spark is a general-purpose execution engine for Hadoop that allows users to analyze large data sets with very high performance. Rather than forcing every job through separate map and reduce stages, Spark expresses a computation as a directed acyclic graph (DAG) of operations and keeps intermediate results in memory where possible. One common use case is replacing a chain of MapReduce jobs with a single Spark job, achieving high-performance batch processing in Hadoop.
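
The lazy, DAG-style execution described above can be sketched in a few lines of plain Python (a conceptual toy, not the Spark API):

```python
# A toy model of lazy execution: transformations only record work to do,
# and nothing runs until an "action" such as collect() asks for results.
class ToyRDD:
    def __init__(self, data, ops=()):
        self.data = data
        self.ops = ops          # recorded transformations, not yet executed

    def map(self, fn):
        return ToyRDD(self.data, self.ops + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self.data, self.ops + (("filter", fn),))

    def collect(self):          # the action that triggers execution
        items = iter(self.data)
        for kind, fn in self.ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

squares = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x > 4)
print(squares.collect())  # [9, 16, 25]
```

Because the plan is recorded before anything runs, an engine like Spark can optimize the whole graph and pipeline steps together instead of materializing every intermediate result.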


Interactive SQL


Apache Drill

Apache Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google’s Dremel, with the additional flexibility needed to support a broader range of query languages, data sources and data formats, including nested, self-describing data.
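
The "self-describing data" idea can be illustrated in plain Python (conceptual only, not Drill): each JSON record carries its own structure, so a query needs no pre-declared schema:

```python
import json

# Nested, self-describing records: the structure travels with the data.
records = [json.loads(s) for s in (
    '{"name": "alice", "address": {"city": "Austin"}}',
    '{"name": "bob",   "address": {"city": "Boston"}, "tags": ["admin"]}',
)]

# Roughly: SELECT name FROM records WHERE address.city = 'Boston'
boston = [r["name"] for r in records if r["address"]["city"] == "Boston"]
print(boston)  # ['bob']
```

Drill performs this kind of query at scale, directly against files and other sources, discovering the schema as it reads.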


Apache HBase

Apache HBase is a non-relational (NoSQL) database that runs on a Hadoop cluster and provides random, real-time read/write access to very large tables. Clients can access HBase data through a native Java API, or through Thrift or REST gateways, making it accessible from virtually any language.
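
For example, a cell behind the REST gateway is addressed by table, row, and column. The sketch below only constructs such a request URL; the hostname is a placeholder, and 8080 is the gateway's default port:

```python
# HBase's REST gateway addresses a cell as /<table>/<row>/<family:qualifier>;
# a client in any language can then fetch it with an ordinary HTTP GET.
# The host below is a placeholder for illustration.
def hbase_rest_url(host, table, row, column, port=8080):
    return f"http://{host}:{port}/{table}/{row}/{column}"

url = hbase_rest_url("hbase-gateway.example.com", "users", "row-42", "info:email")
print(url)  # http://hbase-gateway.example.com:8080/users/row-42/info:email
```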

GraphX

GraphX is a graph library that runs on top of Apache Spark. Developers can use the languages and tools they are familiar with using for Spark to implement new types of algorithms that require the modeling of relationships between objects.
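
As a small taste of this style of computation, here is a simplified PageRank in plain Python (a single-machine toy, not the GraphX API):

```python
# A simplified PageRank: each node's importance derives from the nodes
# linking to it. GraphX runs this kind of iterative graph algorithm
# distributed across a Spark cluster.
def pagerank(edges, num_iters=50, damping=0.85):
    nodes = {n for edge in edges for n in edge}
    out_links = {n: [dst for src, dst in edges if src == n] for n in nodes}
    ranks = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(num_iters):
        incoming = {n: 0.0 for n in nodes}
        for src, targets in out_links.items():
            for dst in targets:
                incoming[dst] += ranks[src] / len(targets)
        ranks = {n: (1 - damping) / len(nodes) + damping * incoming[n]
                 for n in nodes}
    return ranks

# "c" is linked to by both "a" and "b", so it ends up with the highest rank.
ranks = pagerank([("a", "c"), ("b", "c"), ("c", "a")])
print(max(ranks, key=ranks.get))  # c
```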


Machine Learning


Apache Mahout

Apache Mahout is a powerful, scalable, machine-learning library that runs on top of Hadoop MapReduce. Machine learning is a discipline of artificial intelligence that enables systems to learn based on data alone, continuously improving performance as more data is processed. Machine learning is the basis for many technologies that are part of our everyday lives.

MLlib

MLlib is Apache Spark's scalable machine learning library. It provides distributed implementations of common algorithms, including classification, regression, clustering, and collaborative filtering, that take advantage of Spark's fast in-memory processing.


Spark Streaming

When Hadoop first emerged, it provided a platform to store petabytes of data and perform batch queries on that data to gather insights. This model works well for many use cases, like analyzing vast amounts of customer data for interesting patterns. However, not all data can wait for a batch query to be performed. Spark Streaming addresses these cases by dividing live data streams into small batches and processing them with the same APIs used for batch jobs, so results are available shortly after the data arrives.
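
The micro-batch idea behind Spark Streaming can be sketched in plain Python (conceptual only): cut the stream into small batches, then process each batch with ordinary batch logic:

```python
# A toy micro-batcher: group a stream of events into fixed-size batches,
# then run the same aggregation on each batch that a batch job would run
# on a full data set.
def micro_batches(stream, batch_size):
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch   # flush the final partial batch

events = ["click", "view", "click", "click", "view"]
per_batch_clicks = [b.count("click") for b in micro_batches(events, 2)]
print(per_batch_clicks)  # [1, 2, 0]
```

In Spark Streaming the batch interval is a time window rather than a count, but the principle is the same: streaming reuses batch machinery on small slices of data.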


Data Tools

HttpFS

HttpFS is one of several tools available for interacting with the MapR distributed file system. It exposes the file system over plain HTTP through a REST API compatible with WebHDFS, so any client with an HTTP library can use it. Differentiating features of HttpFS include programmatic access, version independence, and remote access.
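
Because HttpFS speaks the WebHDFS-style REST API over plain HTTP, clients need nothing Hadoop-specific installed. The sketch below only builds a typical request URL (the hostname and user are placeholders; 14000 is the usual HttpFS port):

```python
# HttpFS requests follow the WebHDFS URL layout:
#   http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>&user.name=<user>
# The host and user below are placeholders for illustration.
def httpfs_url(host, path, op, user, port=14000):
    return f"http://{host}:{port}/webhdfs/v1{path}?op={op}&user.name={user}"

url = httpfs_url("gateway.example.com", "/data/logs/app.log", "OPEN", "analyst")
print(url)
# http://gateway.example.com:14000/webhdfs/v1/data/logs/app.log?op=OPEN&user.name=analyst
```

A GET on such a URL streams the file back; other operations (for example LISTSTATUS) follow the same pattern, which is what makes the interface language- and version-independent.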


Apache Sqoop

Hadoop users often want to perform analysis of data across multiple sources and formats, and a common source is a relational database or data warehouse. Sqoop allows users to efficiently move structured data from these sources into Hadoop for analysis and correlation with other data types, such as semi-structured and unstructured data stored in the distributed file system. Once analysis has been completed, Sqoop can be used to push any resulting structured data back into a database or data warehouse so it is available for operational use.


Apache Flume

Apache Flume is a distributed and reliable system for efficiently collecting, aggregating, and moving large amounts of log or event data from many sources to a centralized data store like the MapR Data Platform.


Apache Oozie

Apache Oozie is a valuable tool for Hadoop users to automate commonly performed tasks in order to save time and prevent user error. With Oozie, users can describe workflows to be performed on a Hadoop cluster, schedule those workflows to execute under a specified condition, and even combine multiple workflows and schedules together into a package to manage their full lifecycle.

Apache ZooKeeper

In any distributed cluster, it is important that all nodes be able to share configuration and state data in a reliable way. Hadoop relies on ZooKeeper to keep each of its distributed processes, including MapReduce and HBase, consistent across the cluster. ZooKeeper nodes store a shared hierarchical name space of data registers in RAM, allowing clients to access it with high throughput and low latency. Hadoop clusters should be provisioned with an odd number of ZooKeeper nodes, typically either 3 or 5, to provide high availability and maintain a quorum.
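
The odd-number recommendation follows from simple quorum arithmetic, sketched below: an ensemble stays available only while a strict majority of its nodes is up, so adding a fourth node raises the quorum size without raising fault tolerance:

```python
# A ZooKeeper ensemble of n nodes needs a strict majority (a quorum) to
# operate, so it tolerates n - (n // 2 + 1) node failures. Note that 4
# nodes tolerate no more failures than 3, which is why odd sizes are used.
def tolerable_failures(ensemble_size):
    quorum = ensemble_size // 2 + 1        # strict majority
    return ensemble_size - quorum

for n in (3, 4, 5):
    print(f"{n} nodes: quorum {n // 2 + 1}, tolerates {tolerable_failures(n)} failure(s)")
```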


Apache Myriad

Apache Myriad is an open source Hadoop project that lets YARN applications run side by side with Apache Mesos frameworks. It does this by registering YARN as a Mesos framework, requesting Mesos resources on which to launch YARN applications. This allows YARN applications to run on top of a Mesos cluster without any modification.


GUI, Configuration, Monitoring

Hue

Hue (Hadoop User Experience) offers a web GUI to Hadoop users to simplify the process of creating, maintaining, and running many types of Hadoop jobs. Hue is made up of several applications that interact with Hadoop components, and has an open SDK to allow new applications to be created.
