In this week's Whiteboard Walkthrough video, Sameer Nori, Senior Product Marketing Manager at MapR Technologies, compares a traditional data warehouse or MPP database with a modern data lake. Sameer explains the advantages in data agility and data exploration you get with a data lake, and how a data lake can complement an existing data warehouse deployment.
The full video transcription follows:
Hi, my name is Sameer Nori. In the next few minutes I'm going to talk about data warehouse architectures and MPP databases and contrast those with data lakes, with Hadoop being the underlying architecture and paradigm for data lakes. We'll wrap up by talking a little bit about how you can have these coexisting within a modern enterprise data architecture.
Let's start by talking about data warehouses. Data warehouses have been around for a long time, three or four decades almost, and they have served businesses well for the classical use cases of operational reporting and financial reporting, providing those reports and dashboards to business users on a consistent and predictable basis. Data warehouses are largely based on SMP architectures, a shared-everything concept, if you will, and, as I said, they've worked really well. For those of you who have built data warehouses for a living, and I've done this in my past, I spent a lot of time at customer sites building star and snowflake schemas like the one you see here, where you've got customers, orders, products, and suppliers, and you join these together through a fact table to then report on things like the sum of orders, the count of products, how many products sold in a particular geography, and so on.
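To make the star-schema idea concrete, here is a minimal sketch using SQLite (a stand-in for any relational warehouse). The table and column names are hypothetical, invented for illustration: two dimension tables (customers, products) joined to a central orders fact table, with a classic reporting query over them.

```python
import sqlite3

# Toy star schema: dimension tables joined to a central fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE products  (product_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        product_id  INTEGER REFERENCES products(product_id),
        amount      REAL
    );
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "EMEA"), (2, "APAC")])
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(1, "widget"), (2, "gadget")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 10.0), (2, 1, 2, 20.0), (3, 2, 1, 5.0)])

# Classic reporting query: products sold and total order value by geography.
rows = conn.execute("""
    SELECT c.region,
           COUNT(o.product_id) AS products_sold,
           SUM(o.amount)       AS total_amount
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()
print(rows)  # → [('APAC', 1, 5.0), ('EMEA', 2, 30.0)]
```

The schema here is fixed up front (schema-on-write): the tables and their columns must be defined before any data can land, which is exactly the property that makes later schema changes expensive.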
I think the challenge with data warehouses is that they've been pretty costly to implement, and from a storage perspective, as you add more data to them, the cost can start getting pretty prohibitive.
MPP databases came along about a decade or so back and helped take data warehouses to the next level: the same underlying schema-on-write concept from a data perspective, but more of a shared-nothing architecture, a little more distributed in nature. The underlying premise, though, for both data warehouses and MPP databases has been that you want your data normalized, you want to model it, you want to have schemas like this, and then you can do the reporting and dashboarding you require and make it available to your business.
However, as data volumes have grown and new data types have emerged, things like clickstream data, sensor data, and device data, increasingly driven by the Internet of Things, data warehouses and MPP databases have started to hit capacity issues, both from a cost perspective and in terms of how quickly you can adapt them to accommodate these new data types.
In talking with some of our customers, it's not uncommon that if they want to change these underlying schemas, and this is a very simplistic representation, often these are hundreds of tables and hundreds of attributes, it can be a three- to six-month project costing anywhere from half a million to a million dollars. This is really the promise of data lakes, and why data lakes have become so popular and increasingly prevalent within an enterprise data architecture: you can accommodate a whole lot more data types, structured, semi-structured, and unstructured, in a schema-on-read fashion, where you don't need to model everything in a well-defined schema up front, and you can do more exploration of the data and test hypotheses much faster, which at the end of the day matters in terms of agility. Businesses today thrive on agility, and this is increasingly coming to bear in the architectures we see our customers deploying.
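The schema-on-read contrast can be sketched in a few lines of Python. The record shapes and field names below are hypothetical: heterogeneous records (clickstream and sensor events) land in the lake as raw JSON lines with no upfront modeling, and structure is imposed only at query time.

```python
import json

# Raw, heterogeneous records as they might land in a data lake: no table
# definition or schema change was required to ingest the new sensor type.
raw = [
    '{"type": "clickstream", "user": "a", "page": "/home"}',
    '{"type": "sensor", "device": "d1", "temp_c": 21.5}',
    '{"type": "clickstream", "user": "b", "page": "/pricing"}',
]

# Schema-on-read: parse and project the fields we care about at query time.
records = [json.loads(line) for line in raw]
pages = [r["page"] for r in records if r["type"] == "clickstream"]
print(pages)  # → ['/home', '/pricing']
```

Compare this with the warehouse model, where adding the sensor records would first require defining (or altering) a table to hold them.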
At the end of the day, there are also things like taking SQL and embedding it, let's say, within some machine learning algorithms. You have tools like Spark SQL and the whole Apache Spark stack that are starting to become more prevalent for building next-generation data warehouses and reducing time for ETL processing, for instance.
We really see it as a coexistence story: keep your existing data warehouses and MPP databases for the use cases they're best suited for, but take on data lakes and Hadoop architectures for new use cases that can drive faster agility within your enterprise.
Hopefully this has been educational and helpful to you. We'll look forward to talking with you again. Thanks.