MapR packages a broad set of Apache open source ecosystem projects that enable batch, interactive, and real-time applications. The distribution is complete, pre-tested, pre-integrated, and hardened. It includes Hive, Pig, Apache HBase™, Oozie™, Sqoop™, Flume, and Mahout, along with a large amount of innovative engineering that moves Hadoop forward. An advanced management console ties the data platform and the projects together, so the entire system can be monitored and managed.
Click each project below to learn more about it and to find links to resources managed by the open source community.
Apache Hadoop™ was born out of a need to process an avalanche of big data. The web was generating more and more information on a daily basis, and it was becoming very difficult to index over one billion pages of content. Hadoop has moved far beyond its beginnings in web indexing and is now used in many industries for a large variety of tasks that all share the common theme of lots of variety, volume and velocity of data – both structured and unstructured.
YARN (Yet Another Resource Negotiator) is a core component of Hadoop that manages access to all resources in a cluster. Before YARN, jobs were forced to go through the MapReduce framework, which is designed for long-running batch operations. Now, YARN brokers access to cluster compute resources on behalf of multiple applications, using selectable criteria such as fairness or capacity, allowing for a more general-purpose experience.
Apache MapReduce is a powerful framework for processing large, distributed sets of structured or unstructured data on a Hadoop cluster. The key feature of MapReduce is its ability to perform processing across an entire cluster of nodes, with each node processing its local data. This data locality can make MapReduce orders of magnitude faster than legacy methods of processing big data, which often consisted of a single node accessing and processing data located on remote SAN or NAS devices.
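The map, shuffle, and reduce steps can be illustrated with a minimal single-process sketch in plain Python. This is not the Hadoop API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and each inner list merely stands in for the data that would be local to one node.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(partitions):
    """Count words across partitions; each partition models one node's local data."""
    mapped = []
    for lines in partitions:
        for line in lines:
            mapped.extend(map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

counts = word_count([["big data"], ["Big cluster"]])
print(counts)  # {'big': 2, 'data': 1, 'cluster': 1}
```

On a real cluster the map calls run in parallel on the nodes that hold each partition, and only the much smaller intermediate pairs travel over the network.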
Apache Hive is an open source Hadoop application for data warehousing. It offers a simple way to apply structure to large amounts of unstructured data, and then perform batch SQL-like queries on that data.
Tez is a generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases.
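To make "an arbitrary DAG of tasks" concrete, here is a small sketch in plain Python that runs tasks in dependency order. It does not use the Tez API; run_dag and the task names are hypothetical, and the sketch runs sequentially in one process, whereas an engine like Tez distributes the work across a cluster.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def run_dag(tasks, dependencies):
    """Run each task once, after all of the tasks it depends on.

    tasks: dict mapping task name -> function taking a dict of upstream results.
    dependencies: dict mapping task name -> set of upstream task names.
    """
    results = {}
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
    return results

tasks = {
    "load": lambda up: [3, 1, 2],
    "sort": lambda up: sorted(up["load"]),
    "total": lambda up: sum(up["load"]),
    "report": lambda up: (up["sort"], up["total"]),
}
dependencies = {
    "load": set(),
    "sort": {"load"},
    "total": {"load"},
    "report": {"sort", "total"},
}
results = run_dag(tasks, dependencies)
print(results["report"])  # ([1, 2, 3], 6)
```

The "sort" and "total" tasks do not depend on each other, so a DAG engine is free to run them at the same time. A fixed map-then-reduce pipeline cannot express that kind of flexibility directly.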
Many users who are new to Hadoop find that the MapReduce framework has a steep learning curve. Apache Pig helps these users by offering a simpler alternative for transforming and analyzing large data sets. Users write scripts in a high-level language called Pig Latin, which Pig translates into MapReduce jobs that run on a Hadoop cluster.
Cascading is a data processing API and query planner used for defining, sharing, and executing data processing workflows. On a distributed computing cluster running the Apache Hadoop platform, Cascading adds an abstraction layer over the Hadoop API, which greatly simplifies Hadoop application development, job creation, and job scheduling.
Apache Spark is a general-purpose graph execution engine for Hadoop that allows users to analyze large data sets with very high performance. One common use case for Spark is executing MapReduce-style graphs, achieving high performance batch processing in Hadoop.
Apache Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google’s Dremel, with the additional flexibility needed to support a broader range of query languages, data sources and data formats, including nested, self-describing data.
Apache HBase is a non-relational (NoSQL) database that runs on a Hadoop cluster. Clients can access HBase data through a native Java API or through a Thrift or REST gateway, which makes the data accessible from any language.
Integrating search with Hadoop provides a single platform for predictive analytics, full search and discovery, and advanced database operations. The MapR Distribution including Hadoop now includes LucidWorks Search.
GraphX is a graph library that runs on top of Apache Spark. Developers can use the languages and tools they already use with Spark to implement new types of algorithms that require modeling the relationships between objects.
Apache Mahout is a powerful, scalable machine-learning library that runs on top of Hadoop MapReduce. Machine learning is a discipline of artificial intelligence that enables systems to learn based on data alone, continuously improving performance as more data is processed. Machine learning is the basis for many technologies that are part of our everyday lives.
MLlib is a machine learning library that runs on top of Apache Spark.
When Hadoop first emerged, it provided a platform to store petabytes of data, and perform batch queries on that data to gather insights. This model works well for many use cases, like analyzing vast amounts of customer data for interesting patterns. However, not all data can wait for a batch query to be performed.
HttpFS is one of several tools available to interact with the MapR distributed file system. Some differentiating features of HttpFS include programmatic access, version independence, and remote access.
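As an illustration of programmatic, remote access, the sketch below builds the URL for a directory listing request. It assumes the gateway exposes the WebHDFS-compatible REST interface that Apache Hadoop's HttpFS provides, listens on port 14000, and accepts a user.name query parameter for simple authentication. The helper httpfs_url, the host name, and the path are hypothetical, and the code only constructs the URL without contacting a server.

```python
from urllib.parse import quote, urlencode

def httpfs_url(host, path, op, user, port=14000, **params):
    """Build an HttpFS REST URL for one file system operation.

    Assumes the WebHDFS-style layout /webhdfs/v1/<path>?op=<OPERATION>.
    """
    query = urlencode({"op": op, "user.name": user, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"

# Hypothetical host and path, for illustration only.
url = httpfs_url("httpfs.example.com", "/user/alice/logs", "LISTSTATUS", "alice")
print(url)
# http://httpfs.example.com:14000/webhdfs/v1/user/alice/logs?op=LISTSTATUS&user.name=alice
```

Because the interface is plain HTTP, any client that can issue a GET request to such a URL can list a directory without Hadoop client libraries installed, which is where the version independence comes from.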
Hadoop users often want to perform analysis of data across multiple sources and formats, and a common source is a relational database or data warehouse. Sqoop allows users to efficiently move structured data from these sources into Hadoop for analysis and correlation with other data types, such as semi-structured and unstructured data stored in the distributed file system. Once analysis has been completed, Sqoop can be used to push any resulting structured data back into a database or data warehouse so it is available for operational use.
Apache Flume is a distributed and reliable system for efficiently collecting, aggregating, and moving large amounts of log or event data from many sources to a centralized data store such as the MapR Data Platform.
Apache Oozie is a valuable tool for Hadoop users to automate commonly performed tasks in order to save time and prevent user error. With Oozie, users can describe workflows to be performed on a Hadoop cluster, schedule those workflows to execute under a specified condition, and even combine multiple workflows and schedules together into a package to manage their full lifecycle.
In any distributed cluster, it is important that all nodes be able to share configuration and state data in a reliable way. Hadoop relies on ZooKeeper to keep each of its distributed processes, including MapReduce and HBase, consistent across the cluster. ZooKeeper nodes store a shared hierarchical namespace of data registers in RAM, which lets clients access it with high throughput and low latency. Hadoop clusters should be provisioned with an odd number of ZooKeeper nodes, typically either 3 or 5, to provide high availability and maintain a quorum.
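The preference for an odd number of nodes follows from majority-quorum arithmetic, sketched below. The function names are hypothetical, and the sketch assumes the usual rule that a strict majority of the ensemble must be available.

```python
def quorum_size(ensemble_size):
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size):
    """Number of nodes that can fail while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)

for n in (3, 4, 5, 6):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
# 6 4 2
```

A fourth node tolerates no more failures than three nodes do, and a sixth no more than five, so the extra even-numbered node adds cost without adding availability.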
Hue (Hadoop User Experience) offers Hadoop users a web GUI that simplifies creating, maintaining, and running many types of Hadoop jobs. Hue is made up of several applications that interact with Hadoop components, and it has an open SDK that allows new applications to be created.