A Transformative Partnership to Accelerate Real-Time Analytics Using a Data Lake


As enterprises grow, the amount of data they accumulate inevitably increases. They continuously acquire data on their customers, suppliers, partners, and market competitors, which makes storing that data, managing it, and deriving key business insights from it in real time increasingly complex. Trends show how the 3 Vs of big data – Volume, Variety, and Velocity – have all scaled over time.

Moreover, businesses today need to adapt to changing market conditions and increasing competition while staying innovative in their product offerings. Success often comes down to how they leverage their most important asset – data. To be an agile business, it is imperative to extract insights in real time. Enterprises can no longer wait weeks or even days to generate analytics and uncover new business insights. Digital transformation is enabling businesses to leverage data from traditional data warehouses as well as new data streams arriving from edge devices and gateways. This trend is now common across industries, including financial services, healthcare, manufacturing, retail, telecom, oil & gas, and media & entertainment, to name a few. The driving objective is to exploit these new ingest streams and make real-time decisions on the data, achieving new operational efficiencies, identifying new revenue streams, and ultimately improving customer satisfaction. Key challenges customers face in attempting this transition include:

  • Lack of infrastructure solutions built for the cloud, on-premises, and the edge
  • Slow-performing Hadoop solutions, leading to infrastructure cluster sprawl
  • Lack of low latency IT solutions for edge computing and advanced edge analytics
  • Inability to build and deploy AI across all three environments – cloud, on-premises, and edge
  • Lack of an automated, policy-based mechanism to move hot, warm, and cold data across storage tiers based on cost, performance, and capacity trade-offs

With industries facing new varieties of data and ingest rates never seen before, the underlying compute and storage platform needs rethinking. Traditional data lakes, historically viewed as a unified platform for storing structured enterprise data, must now be transformed to accommodate these new varieties and velocities of data. Imagine a car manufacturer ingesting data for its autonomous car initiative, generating 4 TB of data per day from just one hour of driving – a combination of images, event logs, clickstream data, and traffic conditions. Part of this data must be processed within the car itself (the edge) to keep driving responsive to changing road conditions, while the rest is sent to the cloud for improved decision-making. Ingesting and analyzing such data demands a robust data pipeline on purpose-built infrastructure. Once such a pipeline is in place, experience tells us it becomes much easier not only to create visualizations and dashboards but also to train machine learning models, rapidly generate insights for those dashboards, and feed the predictions back into operational systems. Most legacy infrastructure, used as is, will prove inadequate for this need.

For enterprises, this often results in multiple silos of data assets, created to experiment with different use cases and then deploy them in production. Add to that, different business units want to run different real-time applications against the same data. Imagine a custom application that writes customers' financial transaction data to a table using the OJAI API, while an ad hoc analytics query using the Drill SQL interface simultaneously reads and analyzes customer spend in real time.
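As a minimal sketch of that write path, the snippet below uses the OJAI Java API to persist one transaction document to a MapR Database JSON table. It assumes a running MapR cluster with the OJAI client library on the classpath; the table path `/apps/transactions` and the field names are hypothetical, chosen only for illustration.

```java
import org.ojai.Document;
import org.ojai.store.Connection;
import org.ojai.store.DocumentStore;
import org.ojai.store.DriverManager;

public class TransactionWriter {
    public static void main(String[] args) {
        // Connect to the cluster through the OJAI driver (requires a MapR client configuration)
        Connection conn = DriverManager.getConnection("ojai:mapr:");
        // Open the JSON table; path is a hypothetical example
        DocumentStore store = conn.getStore("/apps/transactions");

        // Build and persist a single transaction document
        Document txn = conn.newDocument()
            .set("_id", "txn-0001")
            .set("customerId", "c-42")
            .set("amount", 129.95)
            .set("ts", System.currentTimeMillis());
        store.insertOrReplace(txn);

        store.close();
        conn.close();
    }
}
```

While the application writes, an analyst could aggregate spend over the same table from Drill with a query along the lines of `SELECT customerId, SUM(amount) FROM dfs.` `/apps/transactions` ` GROUP BY customerId` – the same data, two concurrent access paths, no silo.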

A Data Lake for Real-Time Analytics

HPE and MapR have partnered to bring their respective innovations in hardware infrastructure and a software data platform to bear on these industry challenges. The table below details the flexibility this architecture gives enterprises.

Shown below is a high-level architecture diagram of the joint solution. It shows the MapR Data Platform – consisting of MapR XD (a distributed file and object store), MapR Event Store, and MapR Database – running on HPE Apollo Gen10 compute and storage infrastructure built from the Apollo 2000 / 4200 / 6500 servers. This architecture also allows edge computing – a combination of MapR Edge and HPE Edgeline – to integrate seamlessly with the data lake at the core.

As shown above, this architecture also lets customer applications remain portable through the entire development and deployment process by containerizing them. MapR's volume driver lets these applications stay stateful by persisting their storage throughout that process – a breakthrough in itself. This matters for two reasons: the nature of the data, and the repositories it is extracted from, can change constantly for real-time applications; and such applications usually need to be ported from one environment to another as they move from development to production.
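To make the persistence point concrete, the config fragment below is one way a containerized application might claim MapR-backed storage in Kubernetes through a volume driver. This is a sketch under stated assumptions: the storage class name, provisioner identifier, and sizes here are hypothetical and would need to match the actual MapR Data Fabric for Kubernetes documentation for your release.

```yaml
# Hypothetical StorageClass backed by the MapR volume driver
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mapr-volume            # assumed name
provisioner: mapr.com/maprfs   # assumed provisioner identifier
---
# The application claims persistent storage through the class above,
# so its state survives as the container moves from dev to production.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-state
spec:
  storageClassName: mapr-volume
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi            # illustrative size
```

Because the claim, not the container, owns the storage, the same application image can be redeployed in a new environment and reattach to its existing state.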

For more information on the MapR - HPE joint architecture, please email us at info@mapr.com or reach out to your sales representative.

This blog post was published January 22, 2019.
