Exploring the Relationship Between Hadoop and a Data Warehouse (Part 2)


Editor's note: Read Part 1 of this blog series.

This blog post is the second in a series based on the ebook “BI & Analytics on a Data Lake” by Sameer Nori and Jim Scott. In their book, they discuss how BI and analytics are being paired with big data, and how this union promises to be a disruptive game-changer.

In the second chapter, they discuss the need for new platforms for BI and analytics in a big data world, and describe the three basic data architectures in common use: data warehouses, massively parallel processing (MPP) systems, and Hadoop.

Here is an excerpt:

How we got here

To frame the discussion, it helps to take a quick look back at where we’ve come from, because the new platform conversation is really a story of IT evolution.

Before there was big data, there was just data, processed by sophisticated databases and excellent tools developed over several decades, going back to the 1970s. The most popular were (and still are) relational database management systems (RDBMS), which are transaction-oriented. Structured Query Language (SQL) is the decoding ring for managing data and simplifying processing within an RDBMS.

Other DBMS variants include columnar, key/value, and graph databases. For the most part, they worked with structured, often highly normalized, data, typically residing in a data warehouse or special-purpose data mart.

Another form, object databases, was IT’s first foray into working with less structured, even unstructured, data such as videos and images. These objects are placed in specialized data repositories and usually require specialized skill sets and infrastructure to make them work. In other words, they are expensive to run.

RDBMS benefits package

Billions and billions of dollars globally have been invested in the infrastructure to run these databases and the people to operate and refine them for various vertical market applications. For transaction processing, they remain the undisputed king of the hill.

Other RDBMS benefits include:

  • Recoverability from failure is very good, right up to the most recent state in most instances
  • An RDBMS can be distributed easily across more than one physical location
  • An RDBMS virtually guarantees a high degree of data consistency
  • SQL is easy to learn
  • There is an enormous installed base of IT talent familiar with RDBMS
  • Users can carry out reasonably complex data queries

What’s the downside? The truth is, as long as the data being managed is structured and relational in nature, there are few downsides. Scalability is a problem, as most of these systems are proprietary. And core storage is very expensive, especially as the database grows. But these venerable databases and their entourage of tools and applications are highly visible in every Fortune 1000 company for a good reason: they deliver value.

The fox in the hen house

But then came big data, a lot of it coming from the unstructured hinterlands. It encompassed data from clickstreams, website logs, photos, videos, audio clips, XML docs, email, Tweets, etc.

Initially, to IT, most of this data resembled the background noise emanating from deep in the universe: just a lot of noise. But remember this: Arno Penzias and Robert Wilson detected that deep-space background noise in 1964, and it was eventually interpreted as proof of the since-validated Big Bang theory of the universe. Penzias shared a Nobel Prize for the discovery.

And so it is with big data. As it turns out, locked in all those disparate big data sources are invaluable insights into customer behavior, market trends, services demand, and many other nuggets. It is the Big Bang of information technology.

With big data far and away the biggest component of the overall growth in data volumes, and with the relative inability of traditional analytics platforms and solutions to efficiently handle unstructured data, the analytics landscape is undergoing profound changes.

IT evolution, not revolution

But here is the important thing to bear in mind. Big data analytics is not going to replace traditional structured data analytics, certainly not in the foreseeable future.

Quite to the contrary. As stated in The Executive’s Guide to Big Data & Apache Hadoop, “Things get really intriguing when you blend big data with traditional sources of information to come up with innovative solutions that produce significant business value.”

So you might see a manufacturer tying its inventory system (in an RDBMS) to images and video instructions from a document store-based product catalog. This would let customers help themselves, immediately selecting and ordering the right part.

Or a hotel chain could join web-based property search results with its own historical occupancy metrics in an RDBMS to optimize nightly pricing and boost revenues via better yield management.
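
To make the hotel scenario concrete, here is a minimal Python sketch of the blend, with sqlite3 standing in for the RDBMS and a plain dict standing in for demand signals distilled from web search logs in the data lake. All table, field, and property names are invented for illustration.

```python
import sqlite3

# Traditional side: historical occupancy metrics in an RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE occupancy (property TEXT, night TEXT, rate REAL, occupied_pct REAL)"
)
conn.executemany(
    "INSERT INTO occupancy VALUES (?, ?, ?, ?)",
    [("downtown", "2016-11-10", 149.0, 0.72),
     ("airport", "2016-11-10", 99.0, 0.91)],
)

# Big data side: search interest per property, e.g. distilled from
# web clickstream logs in the data lake (hypothetical numbers).
search_interest = {"downtown": 1840, "airport": 310}

# A naive yield rule: nudge the rate up where search demand is high
# but the property still has rooms to sell.
for prop, night, rate, occ in conn.execute("SELECT * FROM occupancy"):
    demand = search_interest.get(prop, 0)
    suggested = rate * 1.10 if demand > 1000 and occ < 0.80 else rate
    print(prop, night, round(suggested, 2))
```

The point is not the pricing rule, which a real revenue-management team would make far more sophisticated, but that neither data source alone could support the decision.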

Coexistence, not replacement. That is the correct way to view the relationship between Hadoop-based big data analytics and the RDBMS and MPP world. Thus organizations are wise to focus on Hadoop distributions that optimize the flow of data between Hadoop-based data lakes and traditional systems. In other words, keep the old, and innovate with the new.

Which platform to use?

There are three basic data architectures in common use: data warehouses, massively parallel processing (MPP) systems, and Hadoop. Each accommodates SQL in different ways.

Data warehouses are essentially large database management systems that are optimized for read-only queries across structured data. They are relational databases and, as such, are very SQL-friendly. They provide fast performance and relatively easy administration, in large part because their symmetric multiprocessing (SMP) architecture shares resources like memory and the operating system, and routes all operations through a single processing node.

The biggest negatives are cost and flexibility. Most data warehouses are built upon proprietary hardware and are many orders of magnitude more expensive than other approaches. In one financial comparison conducted by Wikibon, the break-even period for a traditional data warehouse was found to be more than six times as long as that of a data lake implementation.

Traditional data warehouses can also operate only on data they know about. They have fixed schemas and aren’t very flexible at handling unstructured data. They are good for transactional analytics, in which decisions must be made quickly based upon a defined set of data elements, but are less effective in applications where relationships aren’t well-defined, such as recommendation engines.

MPP data warehouses are an evolution of traditional warehouses that make use of multiple processors lashed together via a common interconnect. Whereas SMP architectures share everything between processors, MPP architectures share nothing. Each server has its own operating system, processors, memory and storage. The activities of multiple processors are coordinated by a master processor that distributes data across the nodes and coordinates actions and results.

MPP data warehouses are highly scalable, because adding a processor results in a nearly linear increase in performance, typically at a lower cost than scaling a single-node data warehouse. MPP architectures are also well suited to working on multiple databases simultaneously, which makes them somewhat more flexible than traditional data warehouses. However, like traditional data warehouses, they can mostly work only on structured data organized in a schema.

MPP architectures also share some of the limitations of SMP data warehouses. Because they require sophisticated engineering, most are proprietary to individual vendors, which makes them costly and relatively inflexible. They are also subject to the same ETL requirements as traditional data warehouses.

From a SQL perspective, MPP data warehouses have one major architectural difference: in order to realize maximum performance gains, rows are spread across processors, which means queries must take into account that a single logical table is physically split across nodes. Fortunately, most MPP vendors hide this detail in their SQL implementations.
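
As a toy illustration of that distribution (not any vendor’s actual implementation), the Python sketch below hashes a distribution key to assign each row to a node, then shows why a full-table aggregate must fan out to every node and combine partial results:

```python
import zlib

NUM_NODES = 4

def node_for(key: str) -> int:
    """Hash the distribution key to pick the node that stores this row."""
    return zlib.crc32(key.encode()) % NUM_NODES

# Toy rows: (order id, amount).
orders = [("order-1001", 99.50), ("order-1002", 12.25), ("order-1003", 7.00)]
shards = {n: [] for n in range(NUM_NODES)}
for key, amount in orders:
    shards[node_for(key)].append((key, amount))

# A full-table aggregate gathers partial results from every node;
# the vendor's SQL layer hides this fan-out from the user.
total = sum(amount for shard in shards.values() for _, amount in shard)
print(total)  # 118.75
```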

Hadoop is similar in architecture to MPP data warehouses, but with some significant differences. Instead of being rigidly defined by a parallel architecture, processors are loosely coupled across a Hadoop cluster, and each can work on different data sources. The data manipulation engine, data catalog, and storage engine can work independently of each other, with Hadoop serving as a collection point. Also critical is that Hadoop can easily accommodate both structured and unstructured data, which makes it an ideal environment for iterative inquiry. Instead of having to define analytics outputs according to narrow constructs defined by the schema, business users can experiment to find which queries matter most to them. Relevant data can then be extracted and loaded into a data warehouse for fast queries.
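
Here is a hedged sketch of that iterative workflow using PySpark, one common engine on Hadoop-style clusters; the file name and field names are assumptions for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-inquiry").getOrCreate()

# Schema-on-read: Spark infers the structure of the raw JSON at load
# time; no upfront table definition is required.
clicks = spark.read.json("clickstream.json")
clicks.createOrReplaceTempView("clicks")

# Business users can iterate on ad-hoc questions...
top_pages = spark.sql("""
    SELECT page, COUNT(*) AS views
    FROM clicks
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""")

# ...and, once a query proves its worth, extract the relevant slice
# for loading into the warehouse.
top_pages.write.mode("overwrite").parquet("extracts/top_pages")
```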

Let’s take a look at the main differences between a data lake and a data warehouse (summarized from KDNuggets):

  • Data: While data is structured in a data warehouse, data lakes support all data types: structured, semi-structured, or unstructured.
  • Processing: Data is schema-on-write in a data warehouse, while it’s schema-on-read in a data lake (see the sketch after this list).
  • Storage: It can be expensive to store large volumes of data in a data warehouse, while data lakes are designed for low-cost storage.
  • Agility: In a data warehouse, data is in a fixed configuration and is much less agile, while data in a data lake is easy to configure as needed.
  • Users: The data lake approach supports all users (data scientists, business professionals), while a data warehouse is used primarily by business professionals.
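
To make the processing difference concrete, here is a minimal Python sketch of schema-on-write versus schema-on-read, with sqlite3 standing in for the warehouse and JSON lines standing in for the lake; all record and field names are illustrative:

```python
import json
import sqlite3

# Schema-on-write: the warehouse table's structure is fixed before any
# data lands, and every record must conform to it.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE events (user TEXT, action TEXT)")
wh.execute("INSERT INTO events VALUES ('alice', 'login')")
# A record carrying an unexpected field has nowhere to go until the
# schema is altered.

# Schema-on-read: the lake stores raw records as-is, and structure is
# imposed only when a query reads them.
lake = [
    '{"user": "alice", "action": "login"}',
    '{"user": "bob", "action": "click", "page": "/pricing"}',  # extra field is fine
]
for line in lake:
    record = json.loads(line)  # structure applied at read time
    print(record["user"], record.get("page", "-"))
```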

The foremost use case for Hadoop continues to be the “data lake,” because it stores a lot of unstructured data for refinement and extraction into relational “data marts” or data warehouses. In fact, Gartner has seen a big uptick in customer inquiries about data lakes, as evidenced by the tweet below:

Just looked at the numbers. @Gartner_inc inquiries on data lakes increased 72% from 2014 to 2015.

— Nick Heudecker (@nheudecker) April 12, 2016

There are many parallel efforts to bring the power of SQL to Hadoop, but these projects all face the same structural barriers: Hadoop is schema-less and the data is unstructured. Applying a “structured” query language to unstructured data is a bit of an unnatural act, but these projects are maturing rapidly, and they fit together as complementary pieces of a modern data architecture.
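
As one hedged example, Apache Drill, one such SQL-on-Hadoop engine, exposes a REST endpoint that accepts ANSI SQL over raw files; the host, file path, and field names below are assumptions for illustration:

```python
import requests

# Query a raw JSON file in place; Drill applies structure at read time.
resp = requests.post(
    "http://localhost:8047/query.json",  # Drill's default web UI port
    json={
        "queryType": "SQL",
        "query": (
            "SELECT page, COUNT(*) AS views "
            "FROM dfs.`/data/clickstream.json` "
            "GROUP BY page ORDER BY views DESC LIMIT 5"
        ),
    },
)
for row in resp.json()["rows"]:
    print(row)
```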

Choices

In summary, there are valuable use cases for each platform; in fact, data warehouses, MPP data warehouses, and Hadoop are complementary in many ways. Many organizations use Hadoop for data discovery across large pools of unstructured information called data lakes, and then load the most useful data into a relational warehouse for rapid and repetitive queries.

In the next blog post on the subject, we’ll talk about navigating the SQL-on-Hadoop landscape, which is the subject of chapter 3 of “BI & Analytics on a Data Lake.”

Compliments of MapR.


This blog post was published November 10, 2016.