The Modern Data Fabric - What It Means to Your Business

Summary

Catalyst

Data-driven organizations have a natural competitive edge when it comes to sensing their markets, responding to customers, anticipating cyberthreats, and optimizing their processes. But the explosion of real-time connected device data, coupled with the new practicality of applying artificial intelligence, has raised the stakes for data-driven enterprises. No longer can they rely on segregated processes for analyzing real-time and historic data – smart decisions today are often driven by predictive or prescriptive analytics that require the combination of real-time data with sophisticated offline analysis and modeling. The business challenges or use cases that demand such solutions are not new. Examples include customer engagement for preventing churn or spurring cross-selling or upselling; deciphering market activity to stem or prevent fraud; and making maintenance decisions to prevent unplanned downtime. Each instance can benefit from the ability to integrate real-time decision-making with insights or AI models that can optimize those decisions. Until recently, having the capability to blend real time with batch and incorporate machine learning was differentiating; increasingly, these capabilities are becoming the norm.

Ovum view

Ovum's big data research has always been premised on the vision that big data must become a first-class citizen in the enterprise. When enterprises began analyzing and operationalizing big data, the conventional wisdom was to implement a Lambda architecture with distinct batch and real-time tiers. Yet the Lambda architecture required considerable duplication of infrastructure and processes. A growing number of enterprises are turning to a new take on an architectural pattern that originated out of the need for enterprise integration: the modern data fabric. The fabric reaches to the places where data resides and processing is conducted: from the edge to on-premises data centers and the cloud. The fabric reduces, if not eliminates, the need to replicate or move data to the point of consumption, and it removes artificial barriers between real-time, interactive, and batch processing. The fabric treats data and processing as a utility that must accommodate scale, bring together all types of data and content, and support multiple modes of access. MapR's data platform, which is based on an architecture that brings real-time, interactive, and batch processing together, is well suited as a launchpad for delivering a modern data fabric.

Key messages

  • Business problems cannot be arbitrarily classified as batch or real time
  • By treating data as a utility, the modern data fabric bridges the artificial silos between real-time, interactive, and batch processing
  • MapR has extended its data platform into a modern data fabric through the ability to secure, manage, and process data in real time and/or batch in the same infrastructure, under a common security and management umbrella

Business problems are not batch or real time

Technology has imposed artificial limitations

Data is proliferating in all shapes, sizes, and velocities. Today's generation of business applications requires organizations to make smart decisions on heterogeneous data from a variety of sources, including IoT devices, smartphones, social networks, messaging, rich media, and traditional transaction systems. And they must make those decisions while working with data that may be generated anywhere from connected devices out on the "edge" of the network to back-office transactions and weblogs.

Traditionally, enterprises kept data in silos because of the difficulty of integrating and harmonizing diverse data sets that are modeled or optimized for specific applications. With schema on read, the generation of Hadoop and NoSQL big data platforms addressed the data harmonization bottleneck, but on their own they did not solve the problem of data silos – especially the silos separating real-time from batch processes.

Initially, the "Lambda architecture," which separated real-time computing from batch processing, was employed as a workaround because of the shortcomings of existing big data platforms. The traditional model of separating the tiers was adequate for a world where data was dedicated for single purposes, such as parsing a stream, running analytics from a data warehouse, delivering query and reporting from a data mart, or conducting exploratory analytics in Hadoop. The result was a complex, multitiered architecture encompassing the streaming engine, an operational database, a data warehouse and/or data mart, and a big data platform such as Hadoop. Besides duplicating infrastructure, such architectures involved significant duplication and movement of data.
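To make that duplication concrete, the sketch below (plain Python, with hypothetical field names) implements the same "orders per customer" metric twice, once as a batch recomputation and once as an incremental streaming update – the two code paths a Lambda architecture forces an enterprise to build and keep in sync:

```python
# Hypothetical sketch of Lambda-style duplication: the same "orders per
# customer" metric implemented once per tier.
from collections import defaultdict

def batch_orders_per_customer(order_records):
    """Batch layer: recompute totals by scanning the full historical dataset."""
    totals = defaultdict(int)
    for record in order_records:  # e.g., rows read from a warehouse or Hadoop
        totals[record["customer_id"]] += 1
    return dict(totals)

class StreamingOrdersPerCustomer:
    """Speed layer: maintain the same totals incrementally from a live feed."""
    def __init__(self):
        self.totals = defaultdict(int)

    def on_event(self, event):  # e.g., a message consumed from a stream engine
        self.totals[event["customer_id"]] += 1

# Serving a query means merging both views -- and maintaining two codebases,
# two clusters, and two copies of the data.
batch_view = batch_orders_per_customer([{"customer_id": "a"}, {"customer_id": "b"}])
speed_view = StreamingOrdersPerCustomer()
speed_view.on_event({"customer_id": "a"})
merged = {k: batch_view.get(k, 0) + speed_view.totals.get(k, 0)
          for k in set(batch_view) | set(speed_view.totals)}
print(merged)  # totals: a -> 2, b -> 1
```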

Ovum has noted that data platforms have become more heterogeneous by embracing overlapping functions. For instance, many traditional data warehouses have added in-memory column stores to deliver faster, more scalable analytics; many NoSQL platforms have added SQL-like query processing; and Hadoop has added a mix of real-time processing and ingestion, interactive SQL query, higher-performance Spark processing to expand on its original batch processing heritage, and a resource scheduler (YARN) that allows multiple processes to coexist side by side. But even in the best cases, running batch and interactive processes on such platforms requires extremely careful attention to resource management and load balancing, with real-time processes (such as using Kafka to manage real-time streams) typically requiring a separate cluster.

The boundaries between batch and real time are no longer sustainable

Anyone who shops online understands the expectations that customers have for instant response. For the retailer, that requires segmenting customers and modeling behavior to keep them engaged in the moment, whether it involves cross-selling or simply preventing churn. In turn, effective cybersecurity increasingly requires not only the ability to combine historical intelligence with real-time sensing, but the ability to harness machine learning in the moment to decipher whether patterns of attack are mutating. The common thread is that the line between real-time and batch processing is increasingly an artificial one.

Many businesses are looking to real-time processing as an approach not only to transform their business but to revolutionize their industry.

Real-time financial transactions are core to an emerging small-business lender's approach to radically changing the underwriting model for lending capital to small and midsize businesses. Because many small businesses lack the depth of balance sheet data needed to generate meaningful credit scores, this organization is implementing metrics that monitor a variety of event-based feeds, such as transactions with conventional banks and/or PayPal, sales via eBay and Etsy, and even transactions conducted by Airbnb hosts. The lender also taps into the APIs of QuickBooks and other small-business accounting packages. This wealth of event and transactional data then feeds machine learning models that segment, profile, and score prospective business customers for underwriting and fraud prevention, all within minutes of the business's application.

To support this business model, the organization required a platform that took a different approach to event processing: instead of treating events as transient records, it had to manage them as an immutable ledger that feeds the underwriting models that drive its business. It needed to converge the streaming data and transform it into the formats most appropriate for each use case, including scoring models, fraud protection, portfolio management, and risk mitigation. It embraced a platform that manages data as a utility.
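A minimal sketch of that pattern in Python – events are appended to an immutable ledger, and derived views are replayed from it. The file path, event fields, and feature names are hypothetical; in practice the ledger would be a replicated stream or table rather than a local file:

```python
import json
import time

LEDGER_PATH = "/mapr/cluster1/apps/lending/ledger.jsonl"  # hypothetical path

def append_event(event, path=LEDGER_PATH):
    """Events are only ever appended, never updated or deleted in place."""
    event["ingested_at"] = time.time()
    with open(path, "a") as ledger:
        ledger.write(json.dumps(event) + "\n")

def underwriting_features(path=LEDGER_PATH):
    """Replay the full ledger to derive per-applicant features for scoring."""
    features = {}
    with open(path) as ledger:
        for line in ledger:
            event = json.loads(line)
            f = features.setdefault(event["applicant_id"],
                                    {"txn_count": 0, "total_volume": 0.0})
            f["txn_count"] += 1
            f["total_volume"] += event.get("amount", 0.0)
    return features  # fed into segmentation, scoring, and fraud models
```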

Introducing the modern data fabric

The fabric is the data utility

An enterprise fabric treats data as a utility. It is an architectural pattern that answers the need to bridge the data and application silos that are the reality in most large enterprises. The fabric is not a centralized hub but treats data and processing as virtual resources across a common architecture. Recently, this pattern has received new attention thanks to innovations in connectivity, scale-out architecture, and the power and cost-effectiveness of commodity infrastructure.

Under a common management and security umbrella, enterprise data fabrics do not arbitrarily separate data, processes, or applications. They apply consistent security to all data and processes. And with a flexible architecture that does not rely on a central hub, fabrics empower enterprises to deploy data and processes where they reside, based on financial, data governance, and data gravity criteria.

The requirements for an enterprise data fabric

As a utility, the modern data fabric is not limited to specific data types. It responds to the needs of modern analytic and operational use cases and applications that digest data in all of its forms, including files, tables, rich media (e.g., images, audio, and/or video), streams, logs, messaging, and containers. The fabric must scale, because getting a complete picture may involve petabytes of data contained in billions of files. It must extend to wherever data is created or collected: from the "edge," where connected, smart devices lie, to existing on-premises data centers and the cloud. That requires support for virtualized management, where the fabric manages, secures, catalogs, and controls access to data, and brings processing to the data, regardless of whether the fabric is deployed in the cloud, on premises, or at the edge. With this distributed processing comes a global namespace that provides a single picture of all of the data, regardless of where it physically resides.
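Because MapR conventionally exposes that global namespace as ordinary file paths (mounted under /mapr/<cluster>/...), the same code can address data wherever it physically lives. A minimal sketch, with hypothetical cluster and directory names:

```python
from pathlib import Path

# Hypothetical cluster and directory names; each root could sit at the edge,
# in a data center, or in the cloud, but the access pattern is identical.
LOCATIONS = [
    Path("/mapr/edge-site-01/sensors/raw"),
    Path("/mapr/dc-primary/warehouse/curated"),
    Path("/mapr/cloud-east/archive/cold"),
]

def inventory(locations=LOCATIONS):
    """List every file, identically, regardless of where it resides."""
    for root in locations:
        if not root.exists():
            continue
        for path in root.rglob("*"):
            if path.is_file():
                print(path, path.stat().st_size)

inventory()
```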

Because of the diverse nature and scale of data addressed by a data fabric, it must support multitemperature data management, also known as data tiering or information lifecycle management: policies that keep "hot" (frequently accessed) data on faster, more accessible media while tiering cooler data to more economical storage. And because, as noted above, today's business challenges require the ability to interweave all modes of data processing, from real-time streaming to operational interactions and offline modeling, the fabric must accommodate analytics, operational, and real-time streaming workloads under the same umbrella. That entails treating the real-time stream as a system of record through native publish/subscribe support. For today's highly distributed applications that encompass IoT, the fabric must begin out at the edge.
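Treating the stream as a system of record means producers append and any number of consumers replay from the beginning, independently. The sketch below uses the open source kafka-python client for illustration – MapR Event Store exposes a Kafka-compatible API of the same shape, addressing topics by a /stream:topic path – and the stream path, field names, and broker setup are hypothetical:

```python
import json
from kafka import KafkaConsumer, KafkaProducer  # open source kafka-python client

TOPIC = "/apps/events:readings"  # hypothetical MapR-style "/stream:topic" path

# Producers append to the stream; nothing is overwritten.
producer = KafkaProducer(value_serializer=lambda v: json.dumps(v).encode())
producer.send(TOPIC, {"device": "pump-7", "temp_c": 81.4})
producer.flush()

# Consumers replay from the earliest offset: the stream itself is the record,
# so a new model or application can rebuild its state at any time.
consumer = KafkaConsumer(TOPIC,
                         auto_offset_reset="earliest",
                         value_deserializer=json.loads)
for message in consumer:
    print(message.value)
```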

Enterprises are embracing the cloud because of its economics and flexibility, thanks to how it decouples the application from the underlying infrastructure. That allows customers to manage resources and SLAs through selective mixing and matching of compute infrastructure and the economics of elasticity. The fabric must deliver the same advantages, providing the ability to take advantage of heterogeneous infrastructure for allocating the right resources to get the right SLAs, at the right cost, for the business problem. Underscoring that flexibility is the need to take advantage of modern, cloud-friendly architectural innovations with containers, microservices, and orchestration.

For all enterprises, the last word in data management is security. The modern data fabric must operate under common security to consistently manage access control, data protection, and compliance with emerging privacy mandates such as GDPR.

How MapR supports the data fabric

The MapR Data Fabric converges the traditional silos of real-time, interactive, and batch processing on a single software-based platform that spans on-premises, public and private cloud, and edge deployments. The MapR Data Platform provides seamless access to and processing of data via a common API, under a single security model. MapR provides core common data services to deliver a consistent, secure, scalable platform regardless of the underlying physical infrastructure. This allows a MapR-based data fabric to encompass multiple data types and extend across multiple locations.

The MapR Data Fabric is enabling a leading global credit risk information provider to eliminate data silos while providing flexibility to support its evolving analytic and operational needs. It provides a common data tier that replaced multiple standalone data warehouses and network-attached storage file systems. The MapR Data Fabric allows the agency to provide common access to existing enterprise applications and tools such as Ab Initio, R, Spark, and other analytic data processing tools, while supporting new approaches for querying variably structured data using Apache Drill. Additionally, with integrated streaming using MapR Event Store, the provider can perform operational analytics. The MapR fabric could eventually transform the organization's system of record, with MapR Event Store providing a real-time event stream and MapR Database functioning as the immutable event store. All of this is implemented on a single platform, under a common security and management umbrella.
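Drill's standard REST API illustrates what "querying variably structured data" looks like in practice: SQL over raw files, with no schema declared up front. A minimal sketch, with a hypothetical host and file path:

```python
import requests

DRILL_URL = "http://drill-node:8047/query.json"  # hypothetical Drill host

# Query raw JSON files in place; Drill infers the structure at read time.
sql = ("SELECT t.customer_id, COUNT(*) AS events "
       "FROM dfs.`/data/events/*.json` t "
       "GROUP BY t.customer_id LIMIT 10")

response = requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql})
response.raise_for_status()
for row in response.json()["rows"]:
    print(row)
```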

Key building blocks of the MapR Data Fabric encompass:

  • MapR XD, providing the scale-out file system supporting global namespace
  • MapR Database, as the operational database
  • Apache Drill, for interactive query
  • MapR Event Store, for self-contained publish/subscribe messaging that supports real-time streaming
  • MapR Edge, which provides processing for devices deployed out on the edge

In turn, the MapR Persistent Application Client Container provides the persistence tier that connects enterprise applications to the MapR cluster as if they were writing to a local file system – but with ubiquitous and persistent access. Additionally, with a natively integrated Kubernetes volume driver, stateful applications can be deployed in containers for production use cases, machine learning pipelines, and multitenant use cases. This allows organizations to deploy existing applications in containers, with no code changes, onto any node in their infrastructure.

Takeaways

The separations between batch, interactive, and real-time processing are no longer sustainable as business challenges require a mix of perspectives. Increasingly, real-time streaming is essential for working in conjunction with advanced segmentation, classification, or predictive models (often using AI). That demands a unified data platform that delivers consistent access to all forms of data and allows flexibility in how and when to process it. MapR Data Fabric provides the platform that delivers the seamless access to data in all its forms, wherever it resides, while breaking down the silos between batch, interactive, and real-time processing.

