BI and Analytics on a Data Lake:

The Definitive Guide

New Platforms Bring New Approaches

Setting the scene

Venerable businesses with global brands didn’t get that way by standing still and ignoring the mandate for continuous change.

Thus when Wells Fargo Capital Markets mulled the changing nature of data and how those changes could impact core businesses, IT leaders decided the group needed a big data platform to support analytics in ways traditional systems could not. Founded in 1852, parent company Wells Fargo is a global giant with 265,000 employees and $1.8 trillion in assets.

As the company’s data mix was changing to include vast volumes of unstructured data, IT focused attention on platform solutions supporting NoSQL and Hadoop.

Really big, big data

What’s the meaning of big data and analytics at Wells Fargo Capital Markets? Consider this: the group regularly processes market tick data, which includes all the price points for thousands of equities and the movements of those stocks, at rates of up to three million ticks per second.

Being in a highly regulated industry, Wells Fargo listed security considerations as a top criterion for platform selection. Also on the company’s wish list were superior scalability to handle spikes in big data volumes, ultra-high performance for customer-facing apps, and multi-tenancy to efficiently share IT resources. And the platform had to support the most robust analytics tools available.

As told by Paul Cao, Director of Data Services at Wells Fargo Capital Markets,

“The MapR solution provides powerful features to logically partition a physical cluster to provide separate administrative control, data placement and network access.”

Brave new world of big data analytics

Wells Fargo is one of many forward-leaning organizations that have recognized the need for new platforms for BI and analytics in a big data world.

New platforms.

For many CIOs and IT leaders, these two simple words can cause panic, especially for those seasoned enough to recall the many disastrous ERP ventures of a generation ago. So now they are asking what new platforms for big data really mean:

  • Are we talking about a clear break with the past?
  • Do we now put aside the enormous investments in so-called “traditional” data processing?
  • Will CIOs be asked to shutter warehouses, abandon data marts, establish data lakes, and oversee a complete retooling and training on everything “new” for the IT staff?
  • Are we talking forklift-style upgrades?

The short answers to these four questions are “No,” “No,” “No,” and “No.”

How we got here

It helps to take a quick look back at where we’ve come from in order to frame the discussion as one of IT evolution, because that is what the new platform conversation is really all about.

Before there was big data, there was just data, processed by sophisticated databases and excellent tools first developed in the 1970s. The most popular were (and still are) relational database management systems (RDBMS), which are transactionally based. Structured query language (SQL) is the decoder ring for managing data and simplifying processing within an RDBMS.
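
To make that concrete, here is a minimal sketch, in Python, of the relational idea: a table with a fixed schema and a SQL query that declares what result is wanted while the RDBMS works out how to produce it. (The sqlite3 module from Python’s standard library stands in for a full RDBMS, and the “trades” table and its columns are hypothetical.)

    import sqlite3

    # A tiny, hypothetical relational table with a fixed schema.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trades (symbol TEXT, price REAL, qty INTEGER)")
    conn.executemany(
        "INSERT INTO trades VALUES (?, ?, ?)",
        [("WFC", 54.10, 100), ("WFC", 54.25, 50), ("AAPL", 182.30, 10)],
    )

    # SQL states *what* is wanted; the engine decides *how* to retrieve it.
    query = "SELECT symbol, AVG(price), SUM(qty) FROM trades GROUP BY symbol"
    for symbol, avg_price, total_qty in conn.execute(query):
        print(symbol, round(avg_price, 2), total_qty)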

Other DBMS variants include columnar, key/value, and graph databases. For the most part, they work with structured, often highly normalized data, typically residing in a warehouse or special-purpose data mart.

Another form, object databases, was IT’s first foray into working with less structured or unstructured data, such as videos and images. These objects are placed in specialized repositories and usually require niche skill sets and dedicated infrastructure to make them work. In other words, they are expensive to run.

RDBMS benefits package

Billions and billions of dollars globally have been invested in the infrastructure to run these databases, and the people to operate and refine them for various vertical market applications. For real-time transaction processing, they remain the undisputed king of the hill.

Other RDBMS benefits include:

  • Recoverability from failure is very good, right up to the most recent state in most instances
  • An RDBMS can be distributed easily across more than one physical location
  • RDBMSs virtually guarantee a high degree of data consistency
  • SQL is easy to learn
  • There is an enormous installed base of IT talent familiar with RDBMS
  • Users can carry out reasonably complex data queries

What’s the downside? The truth is, as long as the data being managed is structured and relational in nature, there are few. Scalability can be a problem, however: most of these systems are proprietary, and core storage is very expensive, especially as the database grows. But these venerable databases and their entourage of tools and applications are highly visible in every Fortune 1000 company for a good reason: they deliver value.

The fox in the hen house

But then came big data, much of it arriving from the unstructured hinterlands: clickstreams, website logs, photos, videos, audio clips, XML documents, email, tweets, and more.

Initially, to IT, most of this data resembled the background noise emanating from deep in the universe: just a lot of noise. But remember this: Arno Penzias and Robert Wilson detected that deep-space background noise in 1964, eventually interpreting it as evidence for the since-validated Big Bang theory of the universe. The discovery earned them a Nobel Prize.

And so it is with big data. As it turns out, locked in all those disparate big data sources are invaluable insights into customer behavior, market trends, services demand, and many other nuggets. It is the Big Bang of information technology.

With big data far and away the biggest component of the overall growth in data volumes, and with the relative inability of traditional analytics platforms and solutions to efficiently handle unstructured data, the analytics landscape is undergoing profound changes.

IT evolution, not revolution

But here is the important thing to bear in mind. Big data analytics is not going to replace traditional structured data analytics, certainly not in the foreseeable future.

Quite to the contrary. As stated in The Executive’s Guide to Big Data & Apache Hadoop, “Things get really intriguing when you blend big data with traditional sources of information to come up with innovative solutions that produce significant business value.”

So you might see a manufacturer tying its inventory system (in an RDBMS) to images and video instructions from a document-store-based product catalog, letting customers help themselves by immediately selecting and ordering the right part.

Or a hotel chain could join web-based property search results with its own historical occupancy metrics in an RDBMS to optimize nightly pricing and boost revenues via better yield management.

Coexistence, not replacement. That is the correct way to view the relationship between Hadoop-based big data analytics and the RDBMS and MPP world. Thus organizations are wise to focus on Hadoop distributions that optimize the flow of data between Hadoop-based data lakes and traditional systems. In other words, keep the old, and innovate with the new.

Which platform to use?

There are three basic data architectures in common use: data warehouses, massively parallel processing (MPP) systems, and Hadoop. Each accommodates SQL in different ways.

Data warehouses

Data warehouses are essentially large database management systems that are optimized for read-only queries across structured data. They are relational databases and, as such, are very SQL-friendly. They provide fast performance and relatively easy administration, in large part because their symmetric multiprocessing (SMP) architecture shares resources such as memory and the operating system and routes all operations through a single processing node.

The biggest negatives are cost and flexibility. Most data warehouses are built upon proprietary hardware and are dramatically more expensive than other approaches. In one financial comparison conducted by Wikibon, the break-even period for a traditional data warehouse was found to be more than six times as long as that of a data lake implementation.

Traditional data warehouses can also only operate on data they know about. They have fixed schemas and aren’t very flexible at handling unstructured data. They are good for transactional analytics, in which decisions must be made quickly based upon a defined set of data elements, but are less effective in applications in which relationships aren’t well-defined, such as recommendation engines.

Massively parallel processing systems (MPPs)

MPP data warehouses are an evolution of traditional warehouses that make use of multiple processors lashed together via a common interconnect. Whereas SMP architectures share everything between processors, MPP architectures share nothing: each server has its own operating system, processors, memory, and storage. A master processor coordinates the activity of the others, distributing data across the nodes and gathering their results.

MPP data warehouses are highly scalable, because the addition of a processor results in a nearly linear increase in performance, typically at a lower cost than would be required for a single-node data warehouse. MPP architectures are also well suited to working on multiple databases simultaneously. This makes them somewhat more flexible than traditional data warehouses. However, like data warehouses, they can only work on structured data organized in a schema.

MPP architectures nonetheless have some of the same limitations as SMP data warehouses. Because they require sophisticated engineering, most are proprietary to individual vendors, which makes them costly and relatively inflexible. They are also subject to the same ETL requirements as traditional data warehouses.

MPPs & SQL

From a SQL perspective, MPP data warehouses have one major architectural difference: in order to realize maximum performance gains, the rows of a table are spread across processors. This means that queries must account for a single logical table being physically split into many pieces. Fortunately, most MPP vendors hide this detail in their SQL implementations.
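
As a rough illustration of what happens underneath an ordinary-looking query, the toy Python sketch below (not any vendor’s actual engine) hash-distributes the rows of one logical table across several “nodes,” lets each node aggregate only its own rows, and then combines the partial results in a master step, which is essentially how an MPP engine parallelizes a GROUP BY.

    from collections import defaultdict

    # Toy model of an MPP engine: one logical table, physically split across nodes.
    NUM_NODES = 4
    rows = [("WFC", 100), ("AAPL", 10), ("WFC", 50), ("MSFT", 25), ("AAPL", 40)]

    # Distribute each row to a node by hashing its key (the symbol).
    nodes = defaultdict(list)
    for symbol, qty in rows:
        nodes[hash(symbol) % NUM_NODES].append((symbol, qty))

    # Each node computes a partial aggregate over only its own rows...
    partials = []
    for node_rows in nodes.values():
        local = defaultdict(int)
        for symbol, qty in node_rows:
            local[symbol] += qty
        partials.append(local)

    # ...and the master combines the partials into the final answer,
    # equivalent to: SELECT symbol, SUM(qty) FROM trades GROUP BY symbol
    final = defaultdict(int)
    for local in partials:
        for symbol, qty in local.items():
            final[symbol] += qty
    print(dict(final))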

Hadoop

Hadoop is similar in architecture to MPP data warehouses, but with some significant differences. Instead of being rigidly defined by a parallel architecture, processors are loosely coupled across a Hadoop cluster, and each can work on different data sources. The data manipulation engine, data catalog, and storage engine can work independently of each other, with Hadoop serving as a collection point.

Also critical is that Hadoop can easily accommodate both structured and unstructured data. This makes it an ideal environment for iterative inquiry. Instead of having to define analytics outputs according to narrow constructs defined by the schema, business users can experiment to find what queries matter to them most. Relevant data can then be extracted and loaded into a data warehouse for fast queries.
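
Here is a minimal sketch of that “schema-on-read” style of iterative inquiry, assuming a hypothetical batch of raw JSON clickstream events: no schema is imposed when the data lands, structure is applied only when a question is asked, and the distilled result is what might then be loaded into a warehouse for fast, repeated queries.

    import json

    # Hypothetical raw clickstream events as they might land in a data lake;
    # no schema was imposed when they were written.
    raw_events = [
        '{"user": "u1", "page": "/parts/743", "ms_on_page": 5400}',
        '{"user": "u2", "action": "search", "query": "water pump"}',
        '{"user": "u1", "page": "/checkout", "ms_on_page": 2100}',
    ]

    # Schema-on-read: decide which "columns" matter only at query time.
    page_views = []
    for line in raw_events:
        event = json.loads(line)
        if "page" in event:  # keep only the events this question cares about
            page_views.append((event["user"], event["page"], event["ms_on_page"]))

    # The distilled, structured rows are what could be loaded into a warehouse.
    for row in page_views:
        print(row)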

Data lakes vs. data warehouses

David Menninger of Ventana Research says that data lakes can provide unique opportunities to take advantage of big data and create new revenue opportunities. His article describing data lakes as a safe way to swim in big data provides a good view of how their adoption is on the rise.

As noted above, there are many parallel efforts to bring the power of SQL to Hadoop, but these projects all face the same structural barrier: Hadoop is schema-less, and much of the data is unstructured.

Applying a “structured” query language to unstructured data is a bit of an unnatural act, but these projects are maturing rapidly. Below is an architecture diagram that shows how some of these different approaches fit together in a modern data architecture.

Data architecture:

[Figure: diagram of a modern data architecture combining these approaches]

Choices

In summary, there are valuable use cases for each platform; in fact, data warehouses, MPP data warehouses, and Hadoop are complementary in many ways. Many organizations use Hadoop for data discovery across large pools of unstructured information called data lakes, and then load the most useful data into a relational warehouse for rapid and repetitive queries.

Criteria for selecting the right big data analytics platform

First, let’s state that there is no one right way to select a big data analytics platform, but there are many wrong ways to do so. The best advice is to consider a checklist like the following. As you will see, selecting the right platform is as much a self-examination of business needs as it is an evaluation of the IT features that will fulfill them.

  • Know the business requirements. This may seem like simple table stakes more than sound advice, but it is the single most mission-critical criterion when selecting a big data analytics platform. Doing this right means having keen insight into your organization’s requirements three to five years out, not just in the next 12 months. What new revenue streams might the business explore? How might compliance and regulatory requirements impact data use going forward? What previously untapped data sources hold the most potential value for your business if they could be accessed and analyzed? How are data volumes from those sources expected to grow? These are questions that straddle the boundary between IT and the business, and the answers are vital to your platform selection.
  • Security, as always, is job #1. Real-time cyber attack detection and mitigation have never been more critical. Your big data platform must be highly capable of analyzing data from an ever-growing myriad of sources and devices, then using the most advanced security analytics to prevent attacks from ever happening. Be certain that vital functions such as app log monitoring, fraud detection, event management, intrusion detection, and other security chores are handled well, and in real time, on the platform you choose.
  • Look for openness and interoperability. Always look for solutions that support open source and for components that support Hadoop’s APIs. And look for solutions that interoperate with existing apps so that all of them will work with the data you store in Hadoop.
  • Big data and scalability are a power couple. If the platform cannot scale to meet the ultra-aggressive requirements of big data volumes, then “that dog won’t hunt,” as the saying goes. comScore watched its initial Hadoop cluster grow to process more than 1.7 trillion events per month globally. You surely do not want to face such a processing load without first confirming the platform can handle it without performance degradation.
  • Warehouse integration is often needed. Typical organizations store lots of data in warehouses and data marts. To cut costs, they will want to move some of that archived data to less expensive platforms such as Hadoop-based data lakes, and at some point the data may well need to migrate back the other way (see the sketch following this list). Be sure to closely examine the ease with which this data integration can take place on the big data analytics platforms you are considering. Remember, those distributed and relational database management systems are here to stay for many years to come. The name of the game in big data analytics platforms is coexistence and cooperation, not the replacement of one by the other.
  • The future is now, and it’s all about self-service. A basic question to ask is just how data will be used and accessed. In the world of big data analytics, IT needs to cast itself in the role of a bystander. That means the platform you choose must support a rich array of self-service tools that are truly user friendly for any user, not just quants and data scientists.
  • Is multi-tenancy important to you? Multi-tenancy allows you to share IT resources while letting different business units, as well as data from partners and customers, coexist on the same cluster. But the highest levels of customizable security must accompany this feature. If the platforms you are considering don’t measure up, consider others instead.
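
Here is the sketch referenced in the warehouse-integration point above: a minimal, hypothetical example of moving data in both directions, with Python’s sqlite3 standing in for the warehouse and a CSV file standing in for inexpensive data lake storage.

    import csv
    import sqlite3

    # Stand-in "warehouse": an in-memory relational table of historical orders.
    wh = sqlite3.connect(":memory:")
    wh.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
    wh.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "EMEA", 120.0), (2, "APAC", 75.5), (3, "EMEA", 310.0)],
    )

    # Archive the table out to the "data lake" (here, just a CSV file).
    with open("orders_archive.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "region", "amount"])
        writer.writerows(wh.execute("SELECT order_id, region, amount FROM orders"))

    # Later, pull the archived rows back into the warehouse when needed again.
    with open("orders_archive.csv", newline="") as f:
        archived = [(int(r["order_id"]), r["region"], float(r["amount"]))
                    for r in csv.DictReader(f)]
    wh.executemany("INSERT INTO orders VALUES (?, ?, ?)", archived)
    print(wh.execute("SELECT COUNT(*) FROM orders").fetchone()[0])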

Defining terms

MapReduce – This is a programming model—some say “paradigm”—and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. Thus, it is tailor-made for the big data era where deriving speedy insights into massive, often disparate data volumes is essential.
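
The canonical illustration of the model is word count. The pure-Python sketch below mimics the map, shuffle, and reduce phases in a single process; a real Hadoop job would express the same two functions and let the framework distribute them across a cluster.

    from collections import defaultdict

    documents = ["big data is big", "data lakes hold big data"]

    # Map phase: emit a (key, value) pair for every word in every document.
    mapped = []
    for doc in documents:
        for word in doc.split():
            mapped.append((word, 1))

    # Shuffle phase: group all emitted values by key.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce phase: collapse each key's values into a single result.
    word_counts = {word: sum(counts) for word, counts in grouped.items()}
    print(word_counts)  # {'big': 3, 'data': 3, 'is': 1, 'lakes': 1, 'hold': 1}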

Hadoop – A specific approach for implementing the MapReduce architecture, including a foundational platform and a related ecosystem. This Java-based programming framework supports the processing of large data sets in a distributed computing environment. So again, we have a perfect fit for the big data era of enormous, often disparate data volumes. Wikibon estimates Hadoop will comprise a $22 billion US market by 2018.

Hadoop ecosystem – The Hadoop ecosystem includes additional tools to address specific needs. These include Hive, Pig, Zookeeper, and Oozie, to name a few.

Open source – Software, including Apache Hadoop and Linux, whose source code is available for modification or enhancement by anyone. Source code is typically the domain of heads-down coders and is seldom seen or touched by end users; it is changed or extended to enable specific applications to run on it.

SQL – The broadly accepted language of choice for data query, manipulation, and extraction from an RDBMS.