Apache Hadoop

Open Source Framework for the Distributed Storage and Processing of Very Large Datasets

WHAT IS APACHE HADOOP?

Apache Hadoop is an open-source framework designed for distributed storage and processing of very large datasets across clusters of commodity computers. Apache Hadoop consists of four main components:

  • Hadoop Distributed File System (HDFS), the bottom layer component for storage. HDFS breaks files into chunks and distributes them across the nodes of the cluster.
  • YARN (Yet Another Resource Negotiator) for job scheduling and cluster resource management.
  • MapReduce for parallel processing (a minimal word-count sketch follows this list).
  • Common libraries needed by the other Hadoop subsystems.
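
To make the MapReduce layer concrete, here is a minimal word-count sketch using Hadoop Streaming, which lets any executable serve as the mapper and reducer. The file name and the HDFS paths in the comments are illustrative, not part of the original page.

```python
#!/usr/bin/env python3
# wordcount_streaming.py -- a minimal Hadoop Streaming word count.
# Run the same file in "map" and "reduce" mode, e.g.:
#   hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#       -files wordcount_streaming.py \
#       -mapper "python3 wordcount_streaming.py map" \
#       -reducer "python3 wordcount_streaming.py reduce" \
#       -input /data/text -output /data/wordcounts
import sys

def map_phase():
    # Emit "word<TAB>1" for every word; Hadoop shuffles and sorts by key.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word.lower()}\t1")

def reduce_phase():
    # Input arrives sorted by word, so counts can be summed in one pass.
    current, count = None, 0
    for line in sys.stdin:
        word, _, n = line.rstrip("\n").partition("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    map_phase() if sys.argv[1:] == ["map"] else reduce_phase()
```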

Hadoop is often used in conjunction with Apache Spark and NoSQL databases to provide data storage and management for Spark-powered data pipelines. A modern Hadoop deployment also features an ecosystem of related projects that provide a rich set of big data services:

Apache Spark
Spark is a general-purpose, distributed processing engine that performs high-performance, in-memory processing of large datasets.
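
A minimal PySpark illustration of that in-memory model: load a dataset once, cache it, and run repeated queries against the cached copy. The file path and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quick-agg").getOrCreate()

# Load a dataset once, cache it in cluster memory, then run repeated
# aggregations against the cached copy instead of re-reading from disk.
events = spark.read.json("/data/events")   # hypothetical path
events.cache()

events.groupBy("event_type").count().show()
events.filter(events.status == "error").count()   # reuses the cached data
```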

Apache Hive
Hive provides data warehousing capabilities on top of the Hadoop system, with a SQL-like language for querying data and analytics.
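
One common way to reach Hive tables programmatically is through Spark with Hive support enabled; a sketch, where the "sales" table and its columns are assumptions:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark use the Hive metastore, so existing
# Hive tables can be queried with ordinary SQL.
spark = (SparkSession.builder
         .appName("hive-query")
         .enableHiveSupport()
         .getOrCreate())

# "sales" is a hypothetical Hive table.
spark.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
""").show()
```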

Apache HBase
HBase is a scalable, distributed NoSQL wide-column database built on top of HDFS.
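
From Python, HBase is typically reached through its Thrift gateway, for example with the happybase library. A sketch, assuming a Thrift server on localhost and a pre-created table named "user_profiles":

```python
import happybase

# Connect to the HBase Thrift gateway (assumed to run on localhost:9090).
connection = happybase.Connection("localhost")
table = connection.table("user_profiles")   # hypothetical table

# Columns are addressed as b"column_family:qualifier".
table.put(b"user-1001", {b"info:name": b"Ada", b"info:city": b"London"})

row = table.row(b"user-1001")
print(row[b"info:name"])   # b'Ada'
```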

Apache Zeppelin
Zeppelin is a web-based, multi-purpose notebook that enables interactive data ingestion, exploration, visualization, and collaboration for Hadoop and Spark.

The MapR approach to enterprise Hadoop

The MapR Distribution including Hadoop is the only distribution built from the ground up for your business-critical production applications. Built-in enterprise-grade features such as high availability, disaster recovery, security, and consistent snapshots let you deploy production-ready systems.


WHY APACHE HADOOP?

CHALLENGES WITH PREVIOUS TECHNOLOGIES

  • The old approach of painstakingly cleansing information from transactional systems and neatly placing it into data warehouses doesn’t work in an era where data is arriving in huge volumes from many diverse sources in formats that are constantly changing.
  • Partly spurred by legal and regulatory mandates, storage requirements are exploding because retention periods for previously ephemeral data can now extend to years or even decades.
  • The rigid structure of yesterday’s data warehouses made data a precious asset that could cost upwards of $10,000 per terabyte. With Hadoop on commodity hardware, those costs fall by more than 90%, removing many of the cost and technical barriers to data agility.

ADVANTAGES OF HADOOP

  • Apache Hadoop was born out of the need to process an avalanche of big data. The web was generating more and more information daily, and it was becoming very difficult to index over one billion pages of content. To cope, Google invented a new style of distributed data storage and processing on clusters of commodity computers, embodied in the Google File System (GFS) and MapReduce. Inspired by the papers Google published describing this framework, Doug Cutting and Mike Cafarella created Hadoop (HDFS and MapReduce) to apply the same concepts in an open-source framework, initially to support distribution for the Nutch search engine project. Given this original use case, HDFS was designed as a simple write-once storage infrastructure.
  • Apache Hadoop controls costs by storing data more affordably per terabyte than other platforms. Instead of thousands to tens of thousands of dollars per terabyte, Hadoop delivers compute and storage for hundreds of dollars per terabyte.
  • Fault tolerance is one of the most important advantages of Hadoop. Individual nodes can and do fail when jobs run on a large cluster, but because data is replicated across the cluster, it can be recovered easily in the face of disk, node, or rack failures.
  • Hadoop has moved far beyond its beginnings in web indexing and is now used in many industries for a huge variety of tasks that all share the common theme of high variety, volume, and velocity of data – both structured and unstructured.

WHY APACHE HADOOP WITH MAPR?

CHALLENGES WITH HDFS AND HADOOP

  • HDFS, the core bottom layer in Hadoop, stores a large amount of data on local disks across a distributed cluster of computers.
  • HDFS is written in Java on top of the Linux file system and is a write-once storage layer: updates to closed files are conducted via an append process, and there is no support for continuous updates to a file. This batch-update model is a major limitation. Moreover, HDFS relies on the underlying Linux file system to store its content.
  • The success of HDFS and distributed Hadoop processing jobs hangs on the performance of the NameNode, the master service that tracks the location of every file block in the cluster. The NameNode has scalability limits: it can accommodate roughly 100 to 200 million files, depending on the memory capacity of the node. If the NameNode fails, it may lose track of the blocks and must re-establish communication with each individual block, a repair process that can take eight hours or more in a typical cluster.
  • These design decisions have caused difficulties for enterprises that base their operations on Hadoop.

ADVANTAGES OF MAPR XD AND HADOOP

  • Hadoop is architected with a component model down to the file system level. MapR replaces one or more components, packages the rest of the open source components, and maintains compatibility with Hadoop.
  • While maintaining the core distribution, MapR has conducted proprietary development in some critical areas where the open-source community has not been able to solve Hadoop’s design flaws.
  • Improving HDFS for high performance and high availability
    • MapR replaced HDFS so it would not be reliant on Java, Java garbage collection, or the underlying Linux file system.
    • MapR’s distributed file system (MapR XD) implements a random read-write file system natively in C++ and accesses disks directly. HDFS, by contrast, is a write-once, append-only file system.
    • MapR decentralized and distributed the NameNode, increasing its capacity from 100 million files to 1 trillion. Because the NameNode function is distributed across every node in the cluster, all nodes can participate in failure recovery, instead of each node having to call back to a central instance.
  • MapR provides open APIs between Hadoop clusters and other common environments in the enterprise, including POSIX NFS, S3, HDFS, HBase, SQL, and Kafka. With MapR, Hadoop gains a full read/write storage system that supports multiple, fully random readers and writers (see the sketch after this list).
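
Because MapR XD exposes the cluster over POSIX NFS, ordinary file operations, including the random in-place writes that an append-only HDFS file cannot perform, work directly against cluster storage. A minimal sketch; the /mapr mount point follows MapR's NFS convention, and the path is assumed:

```python
# On a node where the cluster is NFS-mounted (conventionally under /mapr),
# plain Python file I/O works -- including seek-and-overwrite, which an
# append-only HDFS file cannot do.
path = "/mapr/my.cluster.com/projects/demo/records.bin"  # assumed path

with open(path, "r+b") as f:     # open an existing file for random read/write
    f.seek(4096)                 # jump into the middle of the file
    f.write(b"\x00" * 16)        # overwrite 16 bytes in place
    f.seek(0)
    header = f.read(64)          # random read from the start
```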

HDFS vs. MapR FS: 3 Numbers for a Superior Architecture

Learn about the architectural differences between HDFS and MapR FS that boil down to three numbers.

  1. Block size - 8 KB
  2. Chunks - 256 MB, adjustable
  3. Containers - 10-30 GB, self-adjusting

These allow your data to scale into the exabyte range.
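
As a back-of-envelope check on how these three numbers compose, here is a small arithmetic sketch (using binary units, with 1 EiB standing in for "exabyte range"):

```python
# Rough arithmetic with the three numbers above (chunk size and container
# size are adjustable; the container figure uses the top of the range).
BLOCK = 8 * 1024                 # 8 KB block
CHUNK = 256 * 1024**2            # 256 MB chunk
CONTAINER = 30 * 1024**3         # upper end of the 10-30 GB container range
EXABYTE = 1024**6                # 1 EiB

print(CHUNK // BLOCK)            # 32,768 blocks per chunk
print(CONTAINER // CHUNK)        # 120 chunks per 30 GB container
print(EXABYTE // CONTAINER)      # ~36 million containers per exabyte
```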

KEY BENEFITS OF MAPR AND HADOOP

SPEED, SCALE, AND RELIABILITY

The MapR Data Platform provides a low-friction method of ingesting, storing, and organizing petabytes of data that may reside on various operating systems, file systems, and cloud storage services. MapR also automatically replicates metadata along with application data, making high availability part of the core architecture.

MACHINE LEARNING

As data volumes grow, machine learning approaches become more feasible and increasingly accurate. Software can be trained to identify and act upon triggers within well-understood data sets before applying the same solutions to new and unknown data. Spark’s ability to store data in memory and rapidly run repeated queries makes it a good choice for training machine learning algorithms. Running broadly similar queries again and again at scale significantly reduces the time required to go through a set of possible solutions in order to find the most efficient algorithms.
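
A minimal sketch of that pattern with Spark MLlib: cache the training set in memory, then fit a model whose iterations repeatedly scan the cached data. The dataset path and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("train").getOrCreate()

# Hypothetical labeled dataset with numeric feature columns f1..f3.
df = spark.read.parquet("/data/training")
features = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train = features.transform(df).select("features", "label").cache()

# Iterative training repeatedly scans the cached data in memory.
model = LogisticRegression(maxIter=50).fit(train)
print(model.coefficients)
```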

DATA WAREHOUSE OPTIMIZATION AT LOWER TCO

Data lakes are a foundational use case of the MapR Data Platform. Enterprise data centers looking to rationalize and liberate potentially valuable data of all types use the MapR Data Platform to provide the storage and management capabilities to put all enterprise datasets into play.

BI/DW teams are discovering that many BI applications and queries can run on the MapR Data Platform at a fraction of the price of traditional data warehouse platforms.

EVENT STREAM PROCESSING

From log files to sensor data, application developers are increasingly having to cope with streams of data. This data arrives in a steady stream, often from multiple sources simultaneously. While it is possible to store these data streams on disk and analyze them retrospectively, it is sometimes necessary to process and act upon the data as it arrives. Streams of data related to financial transactions, for example, can be processed in real time to identify – and refuse – potentially fraudulent transactions.
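
A sketch of that fraud-detection pattern with Spark Structured Streaming, reading transactions from a Kafka topic. The broker address, topic name, message schema, and amount threshold are all assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("fraud-stream").getOrCreate()

schema = (StructType()
          .add("card_id", StringType())
          .add("amount", DoubleType())
          .add("country", StringType()))

# Read a live stream of transactions from Kafka (hypothetical broker/topic).
txns = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "transactions")
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("t"))
        .select("t.*"))

# Flag transactions over an arbitrary threshold as they arrive.
suspicious = txns.filter(col("amount") > 10_000)

query = (suspicious.writeStream
         .format("console")      # in practice: a database or an alert topic
         .outputMode("append")
         .start())
query.awaitTermination()
```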

INTERACTIVE ANALYTICS

Rather than running pre-defined queries to create static dashboards of sales, production-line productivity, or stock prices, business analysts and data scientists want to explore their data by asking a question, viewing the result, and then either altering the initial question slightly or drilling deeper into the results. This interactive query process requires systems, such as Apache Spark or Apache Drill, that can respond and adapt quickly.
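
That refine-and-drill-down loop looks something like this in an interactive PySpark session (where `spark` is predefined by the shell); the "sales" table, its columns, and the region value are hypothetical:

```python
# First question: which regions sell the most?
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region") \
     .orderBy("total", ascending=False).show()

# Follow-up, seconds later: drill into the top region by product line.
spark.sql("""
    SELECT product_line, SUM(amount) AS total
    FROM sales
    WHERE region = 'EMEA'
    GROUP BY product_line
""").show()
```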

DATA INTEGRATION

Data produced by different systems across a business is rarely clean or consistent enough to be simply and easily combined for reporting or analysis. Extract, transform, and load (ETL) processes are often used to pull data from different systems, clean and standardize it, and then load it into a separate system for analysis. Spark and Hadoop are increasingly being used to reduce the cost and time required for this ETL process.
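
A compact sketch of that Spark-based ETL pattern: extract raw CSV exports, standardize them, and load Parquet files for analysis. Paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, lower, to_date, col

spark = SparkSession.builder.appName("etl").getOrCreate()

# Extract: raw exports from an operational system.
raw = spark.read.option("header", True).csv("/landing/orders/*.csv")

# Transform: standardize casing, trim whitespace, parse dates, drop dupes.
clean = (raw
         .withColumn("customer", trim(lower(col("customer"))))
         .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))
         .dropDuplicates(["order_id"]))

# Load: columnar files partitioned for analytic queries.
clean.write.partitionBy("order_date").parquet("/warehouse/orders")
```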

WHY HADOOP WITH MAPR MATTERS TO YOU

CIO / ENTERPRISE ARCHITECT

  • Meet line of business data needs at lower cost. Grant fast, secure, multi-tenant access to all data for the full spectrum of analytics needs.
  • Accelerate the business. Support in-place ML/AI and analytics, stateful containerized applications, and much more.
  • Deploy anywhere – in the public cloud, on-premises, at the edge, or all of the above at once.

IT / STORAGE ADMINISTRATOR

  • The MapR Data Platform is built for production. Consistent snapshots, replicas, and mirroring deliver enterprise-grade high availability and disaster recovery.
  • MapR XD is multi-tenant by design. Assign policies (quotas, permissions, placement) to logical units of management called volumes.
  • Balance cost and performance with MapR XD. Leverage policy-based data tiering, erasure coding, data placement, and more.
  • Establish big data capabilities with Hadoop, HBase, Hive, and MapR NoSQL databases.

DEVELOPERS AND DATA ENGINEERS

  • Persist data for containerized applications. MapR Data Fabric for Kubernetes allows MapR volumes to be mounted for access by containers.
  • Scale data as containers grow. With a “grow as you go” feature, MapR handles growth in data without having to move it to a separate, dedicated environment.
  • Use tools like Hive and Pig to ingest, transform, and cleanse large data sets.
  • Easier, faster data pipelines with Spark:
    • Build complex ETL pipelines that can speed up data ingestion and deliver superior performance.
    • Combine event streams with machine learning to handle the logistics of machine learning.

DATA SCIENTISTS

  • Faster time to insight. With support for POSIX, MapR XD works with newer Python-based ML and AI tools like TensorFlow and PyTorch. No need to move the data to a separate cluster (see the sketch after this list).
  • Better support for machine learning logistics. Containerize AI and ML models and train them against all data – not just a subset – leading to more accurate results.
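
Because the data is visible as ordinary files over POSIX, standard Python ML tooling can read it in place; a PyTorch sketch, with the /mapr mount path and .npy file layout assumed:

```python
import glob
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class MountedArrayDataset(Dataset):
    """Reads .npy samples straight from a POSIX-mounted cluster volume."""
    def __init__(self, pattern):
        self.files = sorted(glob.glob(pattern))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, i):
        return torch.from_numpy(np.load(self.files[i]))

# Hypothetical NFS/POSIX mount point for the cluster volume.
ds = MountedArrayDataset("/mapr/my.cluster.com/ml/samples/*.npy")
loader = DataLoader(ds, batch_size=32, num_workers=4)

for batch in loader:
    pass  # feed batches to a training loop
```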

What Is MapR XD? (formerly MapR-FS)

Learn some of the key themes, including "real-time" and "standard interfaces," that are important in a big data environment for driving business value.

BEYOND HADOOP

A confluence of several technology shifts has dramatically changed big data and machine learning applications. The combination of distributed computing, streaming analytics, and machine learning is accelerating the development of next-generation intelligent applications that take advantage of modern computational paradigms and infrastructure. The MapR Data Platform integrates global event streaming, real-time database capabilities, and scalable enterprise storage with Hadoop, Spark, Apache Drill, and machine learning libraries to power this new generation of data processing pipelines and intelligent applications. Diverse and open APIs allow all types of analytics workflows to run on the data in place:

  • The MapR XD Distributed File and Object Store is designed to store data at exabyte scale, support trillions of files, and combine analytics and operations in a single platform. MapR XD supports industry-standard protocols and APIs, including POSIX, NFS, S3, and HDFS. Unlike Apache HDFS, which is write-once and append-only, the MapR Data Platform delivers a true read-write, POSIX-compliant file system. Support for the HDFS API enables Spark and Hadoop ecosystem tools, both batch and streaming, to interact with MapR XD. Support for POSIX enables Spark and all non-Hadoop libraries to read and write to the distributed data store as if the data were mounted locally, which greatly expands the possible use cases for next-generation applications. Support for an S3-compatible API means MapR XD can also serve as the foundation for Spark applications that leverage object storage.
  • The MapR Event Store for Apache Kafka is the first big-data-scale streaming system built into a unified data platform and the only big data streaming system to support reliable global event replication at IoT scale. Support for the Kafka API enables Spark streaming applications to interact with data in real time within a unified platform, minimizing maintenance and data copying (a minimal producer sketch follows this list).
  • MapR Database is a high-performance NoSQL database built into the MapR Data Platform. MapR Database is multi-model: wide-column, key-value with the HBase API, or JSON (document) with the OJAI API. Spark connectors are integrated for both HBase and OJAI APIs, enabling real-time and batch pipelines with MapR Database:
    • The MapR Database Connector for Apache Spark enables you to use MapR Database as a sink for Spark Structured Streaming or Spark Streaming.
    • The Spark MapR Database Connector enables users to perform complex SQL queries and updates on top of MapR Database, while applying critical techniques such as projection and filter pushdown, custom partitioning, and data locality.
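
Because MapR Event Store speaks the Kafka API, producer code follows the familiar Kafka shape. The sketch below assumes MapR's Python client (mapr-streams-python), which mirrors the confluent-kafka Producer API; the configuration key, stream path, and topic name are assumptions, shown here only to illustrate the stream:topic addressing convention.

```python
# A sketch, assuming the mapr-streams-python client, which mirrors the
# confluent-kafka Producer API. Stream path and topic are illustrative.
from mapr_streams_python import Producer

# Topics in a MapR stream are addressed as "<stream-path>:<topic-name>";
# a default stream can be configured so plain topic names resolve to it.
p = Producer({"streams.producer.default.stream": "/pipelines/iot"})
p.produce("sensor-readings", b'{"sensor": 7, "temp": 21.4}')
p.flush()
```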

MapR puts the key technologies essential to high scale and high reliability into a fully distributed architecture that spans on-premises, cloud, and multi-cloud deployments, including edge-first IoT, while dramatically lowering both the hardware and operational costs of your most important applications and data.
