Architect's Guide

to Implementing a Digital Transformation

by George Demarest

Phase IV: Optimization

Operate Shared Services and Deploy Converged Application Architectures

In this final phase, there is critical mass in the enterprise and all key LOBs are seeing the benefits of big data and advanced analytics. The data-driven processes that make up the digital business are now the new normal. You should leverage this capability to improve business and operational processes, which will reshape the business and give you a competitive edge. The goal is to predict most aspects of the business and to respond and change operations in real time across more than one line of business.

Companies that are operating at this level are disrupting their industries and demonstrating how responsive and adaptable a modern enterprise can be. Certain industries, such as ad tech, multinational finance, logistics, and social networks, simply cannot exist without a complete commitment to digital transformation and operational agility. As in any important discipline, it pays to observe the most advanced players to help you up your own game and to progress successfully through the four phases of big data adoption.

Disclaimer

The number of organizations at this advanced level is relatively small. Of all MapR customers, there are probably fewer than a dozen, including American Express, UnitedHealthcare Group, Ericsson, comScore, a Fortune 50 retailer, a Fortune 100 telecommunications company, and a few others. In the broader business landscape, one can cite many examples of companies that operate at this level: Google, LinkedIn, Facebook, and other born-of-the-web companies. But even for these companies, the new normal is still new. Therefore, it is difficult to be too prescriptive about what the new normal will look like.

Additionally, the most advanced practitioners understandably see their digital transformation as important intellectual property and a competitive advantage.

Motivating Factors

Meeting Enterprise Service Levels and Business Targets

As the digital business becomes the new normal, expectations for service levels rise in both an IT and a business context. IT systems are now expected to deliver essentially 100% availability and be able to routinely survive significant server, storage, network, and data center failures. The clustered architecture of the MapR Converged Data Platform provides multiple levels of redundancy and HA capabilities, and is a single security and governance entity that can greatly streamline business processes and regulatory compliance.

Achieving Operational Agility from Real-time Analytics

A data-driven business is looking to make real-time business decisions based on just-in-time information. But companies are looking for more than good decisions: they want information and the ability to act on it. For manufacturers, it may mean creating an emergency maintenance window before a disastrous malfunction occurs. Digital processes in healthcare can anticipate problems with a patient based on real-time data from medical devices. Most industries have good cause to be able to react quickly to changing conditions on the ground.

Establish Advanced Governance and Development Methodology

One of the critical artifacts of today’s IT operations is the data silos that have been created by "hand-rolling" different patchworks of systems/OSS projects. These silos slow down the data-to-action cycle, because data has to move between these different systems in the data pipeline. This is one of the basic motivators to begin a digital transformation in the first place. Different administration controls, security frameworks, and data center issues with floor space, power, cooling, etc., are a nightmare for IT operations to manage. Because of this, the big data program that powers your digital transformation provides a real opportunity to lay the groundwork for a new IT operations and software development methodology.

Key Activities and Use Cases

Established executive and organizational accountability

At this point, there is likely a designated Chief Data Officer (CDO) role, along with hired data scientists and data engineers who are looking for problems no one has solved yet. There is a deep investment and commitment to data science across all lines of business. Any actionable insights that are created have the potential to be automated. All key decision makers should have real-time visibility into the performance of their operations.

Formalize strategic converged architectures

Data-centric organizations such as Google and LinkedIn have pioneered converged architectures to deliver data and computation across thousands of servers, multiple data centers, and different geographies. All consumers now expect 100% uptime, instant answers, and personalized service. Why should enterprise data systems be any different?

Jim Scott, Director of Enterprise Strategy and Architecture at MapR, outlined just such a converged architecture with his development of the Zeta Architecture. The Zeta Architecture is a high-level enterprise architecture not unlike the Lambda architecture, which enables simplified business processes and defines a scalable way to increase the speed of integrating data into the business.

Zeta Architecture diagram

There are seven pluggable components of the Zeta Architecture which work together, reducing system-level complexity while radically increasing resource utilization and efficiency.

Distributed file system - all applications read and write to a common, scalable solution, which dramatically simplifies the system architecture.

Real-time data storage - supports the need for high-speed business applications through the use of real-time databases.

Pluggable compute model / execution engine - delivers different processing engines and models in order to meet the needs of diverse business applications and users in an organization.

Deployment / Container management system - provides a standardized approach for deploying software. All resource consumers are isolated and deployed in a standard way.

Solution architecture - focuses on solving specific business problems, and combines one or more applications built to deliver the complete solution. These solution architectures encompass a higher-level interaction among common algorithms or libraries, software components, and business workflows.

Enterprise applications - brings simplicity and reusability by delivering the components necessary to realize all of the business goals defined for an application.

Dynamic and Global resource management - allows dynamic allocation of resources so that you can accommodate whatever task is the most important for that day.

A Note on Phase IV Use Cases

While big data use cases will become increasingly sophisticated over time, companies in the optimization phase or “peak digital transformation” will not necessarily be using state-of-the-art technologies for all use cases. If the Expansion phase featured the development of groupings of related use cases into suites, the most advanced practitioners work in terms of practices.

A marketing suite of use cases simply becomes the new normal of marketing. Security use cases are assembled into a Security Information and Event Management platform. Analytics begin to pervade all lines of business and deliver real-time intelligence across a broad range of end users. While the nature of individual use cases may not be appreciably different from Phase III, the ease and speed with which new use cases are developed and deployed have increased. More importantly, the nature of enterprise solutions begins to change.

Developing new converged application models

While it is likely that big data use cases will continue to evolve and be refined, it is also likely that application architectures and development methodologies will become a crucial part of that evolution.

Converged applications

Converged applications are software applications that can simultaneously process both operational and analytical data, allowing real-time, interactive access to both current and historical data. This class of applications delivers real-time analytics, high-frequency decisioning, and other solution architectures that require immediate operations on large volumes of data.

Converged applications provide real-time access to large volumes of data in an efficient architecture to cost-effectively drive combined operational and analytical workloads on big data. They are often deployed in a modular architecture, especially as microservices that work together as a cohesive unit, not as monolithic processes in distinct data silos that require continual data movement. This architecture leads to greater responsiveness, better decisions, less complexity, lower cost, and lower risk.

Converged application architecture

Converged applications exist in the overlap between historical data (data used for analytical applications) and immediate/live data (data used for operational applications). The MapR Converged Data Platform is the only platform that enables complete access to real-time and historical data in one platform.

The key benefit of creating this new class of applications is the greater business value of immediate responses to events as they happen, combined with the context provided by access to historical data. As the lines blur between operational and analytical systems, data movement is lessened and management overhead is reduced, and therefore human error and security gaps are minimized. By adopting this updated application model, you can future-proof your deployment because scaling up is a matter of simply adding more servers to the cluster.
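The idea of responding to a live event with historical context can be illustrated with a minimal sketch. It assumes a toy in-memory store and a simple threshold rule; the names `ConvergedStore`, `handle_transaction`, and the `factor` parameter are hypothetical and are not part of any MapR API.

```python
from collections import defaultdict
from statistics import mean


class ConvergedStore:
    """Toy in-memory store holding live and historical records together."""

    def __init__(self):
        # account id -> list of past transaction amounts (the historical view)
        self._history = defaultdict(list)

    def historical_average(self, account):
        """Analytical read: average of all amounts seen so far for an account."""
        amounts = self._history[account]
        return mean(amounts) if amounts else None

    def record(self, account, amount):
        """Operational write: append the new event to the same data set."""
        self._history[account].append(amount)


def handle_transaction(store, account, amount, factor=3.0):
    """Decide on a live event using historical context, then persist it."""
    baseline = store.historical_average(account)
    # Flag the event when it is far above the account's historical average.
    suspicious = baseline is not None and amount > factor * baseline
    store.record(account, amount)
    return "review" if suspicious else "approve"


store = ConvergedStore()
decisions = [handle_transaction(store, "acct-1", a) for a in (20.0, 25.0, 30.0, 400.0)]
print(decisions)
```

Because the operational write and the analytical read touch the same data set, no copy step sits between the event and the decision, which is the property the converged model is after.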

A recent production example of a converged application architecture is the introduction of two new financial analytics solutions from TransUnion through their Prama platform. Prama Insights bases its analysis on TransUnion’s anonymized consumer credit database and a seven-year historical view of data. Data sources include records compiled from over 85,000 data feeds, covering about 300 million consumers.

This self-service solution enables TransUnion customers to explore data and act on insights. With this new platform, TransUnion is allowing customers direct access to their content, but with the power of an advanced analytical platform and team of experts behind it.

TransUnion's Prama platform architecture

The underlying Prama architecture shows a mixture of batch delivery from TransUnion's voluminous data feeds through an ETL process into a data lake. Prama then provides customers with portal access into a personalized data hub and analytic sandbox.

The MapR Converged Data Platform in Digital Transformation

In 2014, Gartner introduced the concept of Hybrid Transactional/Analytical Processing or HTAP. They characterized this concept as a new set of systems and applications that could handle both traditional transaction processing workloads as well as OLAP-style analytical functions. The aforementioned Converged Application architecture and the MapR Converged Data Platform are designed to achieve the goals of HTAP and beyond.

The MapR Converged Data Platform (MapR-FS, MapR-DB, and MapR Streams) serves as the data layer for enterprise-grade platforms that allow open source processing engines and applications to run together on the only fully converged data platform. Supported APIs include: HDFS, POSIX, HBase, JSON, Kafka, and most other open source API standards.

Converged data management

The Platform Services of the MapR Converged Data Platform — MapR-FS, MapR-DB, and MapR Streams — provide core data management capabilities such as a global namespace, high availability, data protection, self-healing, unified security, real-time access, multi-tenancy, and management and monitoring.

Components supported by the MapR Converged Data Platform.

A key design criterion of the MapR Platform is the strict use of existing enterprise standards and APIs. For easy data ingestion, MapR uses the NFS protocol and a POSIX standard file system, the same used on the vast majority of server systems in today’s data centers. When practical, MapR has also relied on emerging open source standards and APIs such as HBase, Apache Kafka APIs, and native support for JSON documents in MapR-DB.
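The practical consequence of a POSIX file system exposed over NFS is that ingestion needs nothing beyond ordinary file I/O. The following is a minimal sketch under stated assumptions: a temporary directory stands in for the NFS mount point of the cluster, and the `ingest_rows` helper and the `landing/sensors/readings.csv` path are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def ingest_rows(mount_point, relative_path, rows):
    """Write rows as CSV under mount_point using only standard file I/O."""
    target = Path(mount_point) / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return target


# Assumption: a temporary directory stands in for the NFS mount of the cluster.
with tempfile.TemporaryDirectory() as fake_mount:
    path = ingest_rows(
        fake_mount,
        "landing/sensors/readings.csv",
        [("sensor", "value"), ("s1", "21.5"), ("s2", "19.0")],
    )
    ingested = path.read_text().splitlines()

print(ingested)
```

On a real deployment the only change would be the mount point passed in; the writing code itself stays the same as for any local directory.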

It is the combination of open source projects merged with a hardened UNIX-style file system that encapsulates the fundamental strength of the MapR Platform.

Streaming architectures and microservice development

In their book Streaming Architecture: New Designs Using Apache Kafka and MapR Streams, authors Ted Dunning and Ellen Friedman discuss the creation of new application architectures based upon microservices development and stream processing in order to deliver low latency analytic applications on a far larger scale. (More details about microservices can be found in chapter 3 of “Streaming Architecture.”)

In a related Datanami article entitled Streaming Architecture–Why Flow Instead of State?, Dunning argues:

Instead of a program with a finite input, we now have programs with infinite streams as inputs.... By adopting a streaming data architecture, we get a better fit between applications and real life. The advantages of this type of design are substantial: systems become simpler, more flexible and more robust. Multiple consumers can use the same streaming data for a variety of different purposes without interfering with each other. This independent multi-tenancy approach makes systems more productive and opens the way to data exploration that need not jeopardize existing processes. And where real time insights are needed, low latency analytics make it possible to react to life (and business) as it happens.
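The point about multiple consumers sharing one stream without interfering with each other can be sketched with a toy append-only log in which each consumer tracks its own read position. This is an illustration only: the `Stream` and `Consumer` classes are hypothetical and do not represent the Kafka or MapR Streams APIs.

```python
class Stream:
    """Toy append-only event log; a stand-in for a real message stream."""

    def __init__(self):
        self._events = []

    def publish(self, event):
        self._events.append(event)

    def read_from(self, offset):
        """Return every event at or after the given offset."""
        return self._events[offset:]


class Consumer:
    """Tracks its own read position, so consumers never affect each other."""

    def __init__(self, stream):
        self._stream = stream
        self._offset = 0

    def poll(self):
        events = self._stream.read_from(self._offset)
        self._offset += len(events)
        return events


trades = Stream()
alerting = Consumer(trades)    # hypothetical low-latency consumer
reporting = Consumer(trades)   # hypothetical consumer that reads in batches

trades.publish({"symbol": "ABC", "qty": 100})
first_alerts = alerting.poll()        # alerting reads immediately
trades.publish({"symbol": "XYZ", "qty": 50})
report_batch = reporting.poll()       # reporting catches up later
second_alerts = alerting.poll()

print(len(first_alerts), len(report_batch), len(second_alerts))
```

Reading never removes events from the log, so adding a new consumer for data exploration leaves the existing ones untouched.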

The following financial services converged application example illustrates the data pipeline and microservices of a financial services application that uses event streaming from MapR Streams, with microservices consuming and processing data from those streams, while Spark is used to query data in real time from an event stream or microservice output.

Financial services converged application diagram

Containerization at the data platform layer

The advent of cloud computing has pushed IT architects to test the limits of virtualization, and now containerization, to achieve operational agility, portability of applications and data, and rapid application/microservice provisioning. The big data world is following suit through technologies like Docker, Kubernetes, and Apache Mesos.

Apache Myriad is a relatively new open source Hadoop project that lets YARN applications run side by side with Apache Mesos frameworks. It does this by registering YARN as a Mesos framework, and requesting Mesos resources on which to launch YARN applications. This allows YARN applications to run on top of a Mesos cluster without any modification.

Big data virtualization overview

Myriad is useful for organizations that use Hadoop with Docker and/or Apache Mesos and want to create a converged application environment between their enterprise applications and analytics. It lets you run Hadoop YARN applications on top of Apache Mesos clusters. This lets you share all resources, including data, across different workloads to improve time-to-value.
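The resource-sharing idea described above, in which YARN registers as one framework on a shared cluster and requests resources before launching applications, can be modeled with a small toy simulation. The sketch below does not use the real Mesos, YARN, or Myriad APIs; the `MesosCluster` and `YarnFramework` classes, the CPU counts, and the job names are all hypothetical.

```python
class MesosCluster:
    """Toy shared resource pool; not the real Mesos API."""

    def __init__(self, total_cpus):
        self.free_cpus = total_cpus
        self.frameworks = []

    def register(self, framework):
        self.frameworks.append(framework)

    def allocate(self, cpus):
        """Grant the request only if enough CPUs remain."""
        if cpus > self.free_cpus:
            return False
        self.free_cpus -= cpus
        return True


class YarnFramework:
    """Plays the role Myriad gives YARN: one framework among others."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.capacity = 0      # CPUs obtained from the shared pool
        self.running = []
        cluster.register(self)

    def launch(self, app_name, cpus):
        # Ask the shared cluster for resources before starting the application.
        if not self.cluster.allocate(cpus):
            return False
        self.capacity += cpus
        self.running.append(app_name)
        return True


cluster = MesosCluster(total_cpus=16)
yarn = YarnFramework(cluster)

web_ok = cluster.allocate(6)               # another workload takes 6 CPUs
etl_ok = yarn.launch("etl-job", 8)         # fits in the remaining 10 CPUs
ml_ok = yarn.launch("ml-training", 4)      # only 2 CPUs left, so refused

print(web_ok, etl_ok, ml_ok, cluster.free_cpus)
```

The design point is that analytics jobs and other workloads draw from a single pool, rather than each owning a statically partitioned cluster.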

The combination of YARN, Docker, and Mesos makes up key components of the Zeta Architecture. The diagram below depicts a high-level deployment architecture that uses Mesos and Myriad in an automotive Internet of Things (IoT) context. While the Apache Myriad project currently resides with the Apache Incubator program, hopes are high that it will become an indispensable technology for cloud-based big data initiatives and beyond.

Create an IoT architecture for the automotive industry using cameras, sensors, radar, and other data sources. MapR Streams and MapR-FS (both part of the MapR Converged Data Platform) enable data exploration, real-time analytics and dashboards, and advanced applications.

Digital infrastructure monitoring: The Spyglass Initiative

The Spyglass Initiative is a multi-release MapR effort with the vision of increasing user and administrator productivity. It takes a comprehensive, open, and extensible approach to simplifying big data deployments.

Spyglass phase 1 – summer 2016: In the first phase of the Spyglass Initiative, MapR focuses on operational visibility to help customers with their ongoing big data successes. Successful big data deployments continue to get bigger and more complex. With new data sources, new use cases, new workloads, and new user groups, managing that growth requires a complete understanding of what is currently happening in the system.

MapR Monitoring helps you to manage successful big data deployments by giving you a converged, customizable, and extensible platform for cluster-wide visibility.

MapR Spyglass Initiative overview.

Conclusion

While it can be useful and instructive to look at maturity models as an indicator of your progress and a high-level roadmap for future activities, there is a danger of getting hung up on which “bucket” you are currently in. The maturity model suggested in this document and the accompanying tables is somewhat subjective. Different organizations develop at different speeds along different axes, so it is important not to be overly concerned about where you fit on the maturity curve.

Some customers make amazing strides by successfully deploying a single use case. Others are focused on going deep in a particular line of business or application area such as supply chain, customer 360, or risk management. Others create a pipeline of dozens of use cases and quick-march toward their digital transformation goals.

Whichever route you take, MapR will be there to support and inspire you to achieve great things with the MapR Converged Data Platform.

Summary

Phase IV: Optimization
Description:
  Integrate and expand data-driven apps and analytics to all lines of business and more business functions

Motivation:
  Meet business SLAs
  100% availability
  Redundancy
  Security
  Make real-time business decisions based on just-in-time information
  Eliminate data/technology silos
  Optimize operations
  Administrative controls
  Security frameworks
  Capacity and resource management

Executive Sponsor (many of):
  CEO
  CFO
  COO
  CIO
  CTO
  CDO
  CRO
  CISO
  VP of DW/Analytics
  VP of Development/Enterprise Applications
  Multiple LOB Executives, GMs

Staffing:
  Chief Data Officer (CDO)
  Chief Analytics Officer
  Cluster administrators (2-5)
  Developers (5-100)
  BI analysts (5-50)
  DevOps (1-3)
  Data engineers (10-30)
  Data scientists (5-100)

Participating Organizations/Groups:
  Central IT (infrastructure)
  Application Development
  BI/Analytics
  Lines of Business (2-5)
  Marketing
  Sales
  Security
  Operations
  Finance

Hardware Investment:
  Large-scale cluster, distributed across data centers
  Public Cloud
  Hybrid Cloud

Number of Nodes: 50 - 1,000s

Key Capabilities:
  High-throughput, frictionless ingest
  Streaming ingest
  Enterprise-grade persistence/storage
  High Availability
  Replication
  Failover
  Mirroring
  Multi-tenancy
  Security, Privacy and Governance
  Authorization
  Access Control
  Auditing
  Data protection
  Job/Data Placement
  Resource monitoring and management

Key Technologies:
  NFS
  MapR-FS
  MapR-DB
  MapR Streams
  Spark
  Spark Streaming
  Streaming Analytics
  Apache Drill

Key Skills Required:
  IT Operations:
    Data Management
    Data Lifecycle
    Advanced data integration, cleansing
    Master data management
    Converged data ingest
    Hadoop administration
    Offload/extend life of legacy
    Storage offload (temperature-tiering)
    Ad hoc data engineering
    In-memory processing

  Software Development:
    Batch: Hadoop ecosystem
    Interactive: Spark, Query: SQL on Hadoop
    Search: Solr, Elasticsearch
    Microservices development
    Streaming architectures

  DW/BI/Analytics Team:
    Predictive/preventative analytics
    NoSQL and real-time analytics
    Log analytics
    Schema-less ad hoc query (Drill et al.)
    Expand statistical skills (R)

# of Production Use Cases: 20 - 100s

Common Use Cases:
  *from previous phase

  IT focused:
    User home directories
    Machine log analytics
    File management
    Cold data offload/archive

  Data lake/hub:
    Analytics platform
    Data platform
    Vertical data lake (expert system)
    Analytics as a service

  Application replatforming:
    Data warehouse retirement
    RDBMS application re-engineering
    Reengineer legacy apps

  Marketing/Sales:
    Recommendation engine
    Customer/Patient/Citizen 360
    Next likely purchase
    Customer churn
    Ad/Content/Customer targeting
    Social/sentiment analysis

  Security:
    Security log analytics
    Fraud detection/prevention
    Advanced threat/intrusion detection
    SIEM system

  Operations:
    IoT/Industrial Internet
    Supply Chain optimization/analytics
    Logistics
    Predictive/preventative maintenance

  Vertical specific:
    Finance:
      Trading systems
      Risk management
      Data vault
      Market/Trading analytics
    Telecoms:
      System-wide cost takeout
      Network monitoring, optimization
      Subscriber analytics
      Content/ad targeting
      Revenue management
    Healthcare and life sciences:
      Clinical decision support
      Fraud, waste and abuse
      Re-admission avoidance
      Smart devices and real-time patient monitoring
      Genomics
    Retail and CPG:
      Path to purchase
      Customer experience
      Ad targeting
      In-store operations
      Market basket analysis
      Pricing optimization

Number of Data Sources: 100s-1000s

Data Sources:
  IT systems
  IoT and sensor networks
  Data warehouses
  File servers, SAN/NAS
  Application/System logs
  Clickstream
  ERP/CRM, SCM, HCM systems
  External data sets
  Public data sources
  Data brokers