Enterprise Data Hub: Optimizing Your Data Architecture with Hadoop

Solution Overview

Many organizations today face the challenges of big data and need a scalable, cost-effective way to manage their data growth. They find that existing technologies such as relational database management systems (RDBMS), data warehouses (DW), and storage area networks (SAN) cost too much, on their own, to meet the scaling and complex data processing requirements of big data. These existing technology assets must be complemented with newer technologies that were designed to handle big data challenges. That is why many organizations boost their enterprise data architecture with the ecosystem of technologies around Apache™ Hadoop®.

Central Access to Data

An enterprise data hub (EDH) is a large central repository of multi-structured data including structured, semi-structured, and “unstructured” data. Deploying an EDH with MapR leverages the high-performance, massively scalable, and reliable MapR Distribution for Hadoop to give organizations a powerful, enterprise-grade, distributed computing platform. Some of the important features of the MapR EDH include:

Data storage in native formats.
The MapR EDH supports all data types without requiring predefined schemas so that all data sources from across the enterprise can be included for a 360-degree view of your business. Data with a variety of data structures including documents, log files, audio, social media, and transactional data can be easily loaded and accessed. In addition, structured data types can be loaded with the included NoSQL database capabilities for real-time, operational access to data.

High performance.
The MapR EDH was designed for high performance, with respect to both high throughput and low latency. This provides the responsiveness that end users require when accessing data. In addition, the MapR EDH requires only a fraction of the servers that other Hadoop distributions need, leading to architectural simplicity and lower capital and operational expenses.

Easy data access.
Copying data to and from the MapR EDH is as simple as copying data to a standard file system. Direct Access NFS™ lets users access data without special tools, so they can read and write files with their existing applications. The integrated security in MapR ensures that users can access only the data they are authorized to access.
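As a minimal sketch of what file-system-style access can look like, the following Python snippet copies a local file into a directory and reads it back using only standard library file calls. It assumes the cluster has already been NFS-mounted on the client; the mount path, the directory layout, and the function names load_into_hub and read_from_hub are hypothetical and chosen only for illustration.

```python
import shutil
from pathlib import Path

# Assumption: the cluster is NFS-mounted on this machine.
# This path is hypothetical; substitute the mount point used at your site.
DEFAULT_HUB_DIR = Path("/mapr/my.cluster.com/data/landing")


def load_into_hub(source: Path, hub_dir: Path = DEFAULT_HUB_DIR) -> Path:
    """Copy a local file into the hub directory with ordinary file-system calls."""
    hub_dir.mkdir(parents=True, exist_ok=True)
    target = hub_dir / source.name
    shutil.copy2(source, target)  # same call used for any local or NFS path
    return target


def read_from_hub(path: Path) -> str:
    """Read a file back from the hub exactly as from any mounted file system."""
    return path.read_text(encoding="utf-8")
```

Because the functions take plain paths, the same code can be exercised against a local directory before pointing it at a mounted cluster.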

The MapR EDH supports multiple user groups, any and all enterprise data sets, and multiple applications in the same cluster. Features such as volumes and integrated security ensure that user groups are separated and get access only to their own data. Job placement control and resource management ensure that multiple jobs and applications can run simultaneously in the same cluster without conflict.

Business Continuity
For environments with stringent service level agreements (SLA), MapR has a proven track record of reliable production deployments. The MapR EDH provides integrated high availability (HA), data protection, and disaster recovery (DR) capabilities to protect against both hardware failure and site-wide failure.

MapR Enterprise Data Hub Use Cases

The MapR EDH is the primary component for several business initiatives:

Data Warehouse Optimization.
Organizations seek larger data sets from new and existing data sources in their data warehouses (DW) to get more value. They derive more accurate critical insights by analyzing a complete picture of enterprise-wide data. While critically important to the business, DWs have a hard time keeping up with growing data volumes. DWs cannot cost-effectively scale to the levels that organizations require, and they demand upfront data transformation and data modeling before analysts can query the data.

As a result, organizations make trade-offs like analyzing fewer data points via summaries or aggregates. In many cases, detailed data older than a few months are discarded. This limited view of data inhibits the ability to gain important insights or provide a complete audit trail for regulatory compliance or data governance best practices.

The MapR EDH is the core of a data warehouse optimization strategy that lets organizations add more data and more capabilities to their data warehouse environments. They can keep more data to get deeper insights. They can gain value from the many new, disparate data formats that are available – cloud or mobile app data, social media, machine data, and more. They can also keep up with higher speeds of incoming data, enabling real-time analytics for faster responsiveness to new insights.

Data Storage and Offload.
In many situations, organizations seek a cost-effective alternative to a SAN or other large-scale storage platform. They expect the typical capabilities of a storage platform, including HA/DR, consistent snapshots, volumes, and high throughput. The MapR EDH addresses these requirements to help organizations get more value from their expenditures. Long-term data storage for compliance or archiving purposes is an ideal use case for the MapR EDH. Instead of using high-cost storage or resorting to cold “tape” backups, organizations can leverage MapR for warm archiving of less frequently used data. Organizations seeking to offload data or processing cycles from expensive mainframes can also leverage MapR as a repository and compute engine for all data types. And in situations where high-speed throughput is essential, such as a data store for a high-performance compute grid, the MapR EDH is an ideal complement for persistent data storage.

Data Integration: Extract/Transform/Load.
Many organizations use a platform based on Hadoop to handle their unique extract/transform/load (ETL) jobs. With tools like Apache Pig as well as the Hadoop MapReduce paradigm, organizations can use the MapR EDH to complement their existing ETL tools for handling complex and extremely resource-intensive jobs that can be parallelized across many servers. The MapR EDH is also ideal for custom transformations such as running facial-recognition algorithms on video files.
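To illustrate why such jobs parallelize well, here is a small, single-process Python sketch of the MapReduce pattern applied to an ETL-style aggregation. It does not use any Hadoop or MapR API; the three-field log format and the function names map_record, shuffle, reduce_group, and run_job are assumptions made for this example. On a real cluster, the map and reduce steps would be distributed across many servers by the framework.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_record(line: str) -> Iterator[tuple[str, int]]:
    """Map step: parse one '<timestamp> <status> <bytes>' line into (status, bytes).

    The log format is hypothetical. Malformed lines are skipped.
    """
    parts = line.split()
    if len(parts) != 3:
        return
    _, status, size = parts
    if size.isdigit():
        yield status, int(size)


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle step: group all values by key, as the framework does between phases."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_group(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce step: total the bytes for one status code."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence within a single process."""
    pairs = (pair for line in lines for pair in map_record(line))
    return dict(reduce_group(key, values) for key, values in shuffle(pairs).items())
```

Each call to map_record depends only on its own input line, and each call to reduce_group depends only on one key's values, which is what allows the work to be spread across servers.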

Search and Analytics.
By storing massive volumes of data in a single repository, organizations can get a complete picture of all enterprise data. MapR partners provide search, analytics, and visualization technologies that give organizations the tools to find and analyze data, and make better business decisions.

Key Features

  • Support for structured, semi-structured, and unstructured data – all data in the EDH
  • Multi-tenancy – supports multiple business groups and applications in one cluster without conflicts
  • High performance – fast, responsive access to data and higher throughput
  • Direct Access NFS – easy access to data
  • Integrated security – built-in data access controls
  • Volume support – separates disparate user groups and data by logical volumes
  • Job placement control and resource management – jobs run simultaneously in the same cluster
  • HA/DR – business continuity
  • Data protection – consistent snapshots with point-in-time audits and recovery

Key Benefits

  • Simplified architecture with easy access to all enterprise data in a single repository
  • Fast, responsive access to data to enable real-time operations
  • Low cost storage along with the benefits of high-end storage platforms
  • High uptime for the reliability to meet stringent SLAs and avoid costly downtime

Key Use Cases

  • Reduce the load from your data warehouse, while gaining more value from detailed and varied data in Hadoop that is accessible from the data warehouse
  • Enable low cost, live archival and compliance data stores
  • Run compute-intensive and unique transformations efficiently
  • Provide a complete view of enterprise data for search and analytics