Get Real with Hadoop: Lower Your TCO

In this blog series, we’re showcasing the top 10 reasons customers are turning to MapR in order to run their data-driven businesses. Here’s reason #6: MapR provides the lowest total cost of ownership of any Hadoop distribution.

In the past 13 years of my career in the BI, data warehousing, and Hadoop market, I've seen a lot of innovation and change. Hadoop has been by far the biggest sea change, with customers rethinking their enterprise architecture to reduce cost and increase the flexibility of data management, storage, processing, and analytics.

While much has changed, the laws of physics still apply to any distributed system, as do best practices in enterprise architecture, deployment, and IT operations. Hadoop is NOT a silver bullet that suddenly makes data management faster, easier, and cheaper. This post focuses on the hidden costs of Hadoop that you need to consider once you get past the nice, shiny advertising sign of the fluffy yellow elephant and the idea that “Hadoop” == “free”! As they say, Hadoop is free like a puppy: someone still has to feed and care for it.

The most common misconception I’ve seen about Hadoop is that, because the software is open source, all you need are some data scientists and/or DevOps folks to get the most from it. Hardware price/performance ratios continue to drop, and any performance issue can be solved by throwing more hardware at the problem, right? Not so fast.

The reality is that most enterprises want commercial support, and as their deployments grow, the architectural differences between Hadoop distributions turn into dramatic cost differences across both capital and operational expenses. These differences can save companies 20-50% in total cost of ownership, taking only hard savings into consideration.

A Practical TCO Comparison
For example, here is a cost comparison for a 500TB cluster between two commercial Hadoop distributions based on a customer-validated total cost of ownership (TCO) model we’ve developed. The TCO for MapR is $3.2 million over 3 years compared to another distribution at $4.7 million for the same period. MapR provides a 32% savings.
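As a quick sanity check on that percentage, here is a minimal Python sketch. The function name `savings_pct` is only illustrative; the two inputs are the three-year figures quoted above.

```python
def savings_pct(baseline_cost: float, alternative_cost: float) -> float:
    """Percentage saved by choosing the alternative over the baseline."""
    return (baseline_cost - alternative_cost) / baseline_cost * 100


# Three-year TCO figures quoted in the post, in millions of dollars.
other_distro_tco = 4.7
mapr_tco = 3.2

print(f"{savings_pct(other_distro_tco, mapr_tco):.0f}% savings")
```

The savings are measured against the more expensive option, which is why $1.5 million on $4.7 million comes out at roughly 32%.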

This model takes into consideration the following variables and assumptions:

  • Hard Savings

    • Software - With Hadoop, people get hung up on whether something is “100% open source” or “proprietary.” The reality is that every Hadoop vendor I’m aware of has a “free” version that can be used in production without paying for support. But let’s assume you don’t feel like troubleshooting issues yourself (or hiring specialized skills) and that you’re willing to pay for commercial support. In that case, let’s just lump all software costs into an operational expense under “support and maintenance fees” and calculate it on a per-node basis, which is the most common and consistent approach across Hadoop distros. In this example, I assumed a “list” price of $4k/node for both MapR and the other distribution.

    • Hardware - Every Hadoop distribution can use “commodity” servers with direct-attached storage because Hadoop’s resiliency is built into the software. This cost is driven by the number of nodes and by specifications such as the amount of memory, type of processor, ports, interconnect, and the number/speed/density of disks. For purposes of this comparison, we assumed 24TB nodes using 12 x 2TB drives at a cost of $9,000/node. 2 ports/node with a 10Gb interconnect is very common.

    • Environment - Power, space, and cooling. Many large-scale customers I speak with say that data center capacity planning and floor space are much more of a cost and operational headache than the type or cost of a hardware node. One source references $150/sq-ft/yr for power and space. This can add up quickly based on cluster size and the number of racks required.

  • Soft Savings

    • Administration FTEs - There is an argument to be made as to which Hadoop distribution is easier to manage and how many people it takes. MapR customers state that they get about 25% better ops staff productivity using MapR than with other distributions because of the better uptime, self-healing availability, the no-NameNode architecture, and the higher file limit (more on this below). The time saved can go toward strategic problems, such as the next-generation tools and technologies that are emerging rapidly in Hadoop, instead of fire-fighting operations issues like NameNode failures and HBase region server failures. In this model, I left this productivity gain out completely to be conservative; the slightly lower FTE cost reflects only the difference in the number of nodes each distribution requires for the same workload.

    • Business continuity - One factor you should also consider is the cost of downtime to your business. Many customers have switched to MapR after experiencing downtime or corrupted or lost data with HDFS-based competitors who cannot provide reliable disaster recovery strategies such as mirroring and consistent snapshots. (You’d be surprised what our competition claims in order to look as if they’ve caught up in areas like NFS and snapshots. See it in action for yourself.)
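To make the hard-savings variables concrete, here is a minimal Python sketch of how they might combine. It is not the TCO model described in this post. The per-node prices and the $150/sq-ft/yr figure come from the text above; the class name `ClusterCosts`, the `sqft_per_node` value, the assumption that the $4k support fee is billed annually, and the node counts of 100 and 70 are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class ClusterCosts:
    nodes: int                              # node count (hypothetical input)
    hardware_per_node: float = 9_000.0      # one-time cost, from the post
    support_per_node_year: float = 4_000.0  # list price from the post; annual billing is assumed
    env_per_sqft_year: float = 150.0        # power and space, from the post
    sqft_per_node: float = 1.0              # hypothetical floor space per node

    def hard_tco(self, years: int = 3) -> float:
        """Capital plus operating cost over the period, hard savings only."""
        capex = self.nodes * self.hardware_per_node
        support = self.nodes * self.support_per_node_year * years
        environment = (
            self.nodes * self.sqft_per_node * self.env_per_sqft_year * years
        )
        return capex + support + environment


# Hypothetical comparison: the same workload on 100 nodes versus 70 nodes.
larger = ClusterCosts(nodes=100).hard_tco()
smaller = ClusterCosts(nodes=70).hard_tco()
print(f"${larger:,.0f} vs ${smaller:,.0f} over 3 years")
```

Every term in this sketch scales with node count, so a distribution that needs fewer nodes for the same workload comes out ahead on hardware, support, and environment at once.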

How MapR Helps You Lower TCO
So, what drives the major cost difference? Aren’t most Hadoop distributions roughly the same, only with some minor differences in query engines, management consoles, or other value-added software? The short answer is “no.” With the MapR Data Platform, which underlies the Apache Hadoop projects within the MapR Distribution, you get the benefits of all the open source innovation on an enterprise-grade platform that is as reliable and performant as best-in-class DBMS or NAS systems. The MapR architecture dramatically lowers TCO in four key areas:

  1. MapR No-NameNode Architecture Reduces Hardware Required (and Dramatically Improves Reliability and Ease of Administration)
    MapR provides a fully-distributed data platform which distributes and triplicates the NameNode metadata across every worker node in the cluster. This no-NameNode architecture is ultra-reliable, and also has a major cost benefit. With no NameNode, there are no practical limits to the number of files that can be used on MapR. This results in much less hardware in the cluster compared to HDFS where you require multiple NameNodes to deal with the file limit at scale, and multiple active standby servers to implement NameNode HA.

    It is well documented that you can expect an HDFS NameNode to handle about 100 million files. When an organization has a lot of files (particularly lots of small files), it ends up doing file management acrobatics, with a LOT of extra coding to combine small files and make better use of the block size, before it needs to add another NameNode (which then of course requires implementing the journaling NameNode if you want the system to be HA). This has hard dollar costs as well as the soft costs of Hadoop developers writing and maintaining code and running jobs to work around the file limit.

  2. Automatic File Compression
    MapR applies compression automatically to files in the cluster. Compression is 2-3x depending on file types and compression settings, which reduces storage requirements and uses less network bandwidth, resulting in improved performance. In HDFS-based distributions, compression is a manual process within the application itself and adds administrative overhead. The moment you programmatically alter the file format, anything else that uses the data in that file needs to know how to read it programmatically.

  3. Higher Performance = More Throughput with Less Hardware
    MapR is known for record-setting speed (MinuteSort and TeraSort). What you may not know is that we had a customer break the official record by sorting 1.65TB with a 298-node cluster. That is about one-seventh of the hardware used for the previous record, which took 2,200 nodes. This means you can expect a much smaller data center footprint for the same performance found with other distros.

    Customers generally find that they get much more throughput (3-6x) than other distributions on the same amount of hardware. Consider a staggering comparison based on a recent blog post about Yahoo’s infrastructure.

    “Y!Grid is Yahoo’s Grid of Hadoop Clusters that’s used for all the “big data” processing that happens in Yahoo today. It currently consists of 16 clusters in multiple data centers, spanning 32,500 nodes, and accounts for almost a million Hadoop jobs every day”

    Yahoo Y!Grid: 1 million jobs over 32,500 nodes every day = ~31 jobs/node

    Leading Ad Tech MapR customer: 65,000 jobs over 400 nodes every day = ~162 jobs/node

    31 vs 162 jobs/node => MapR >5x more efficient.

    (I’m glad I don’t have to pay Yahoo!’s electric bill.)
    In fact, a survey of MapR customers who had experience with other distributions showed that 31% chose MapR because it required less hardware (which of course also means lower data center costs).

  4. True Multi-Tenancy for Hadoop
    While the benefit is often hard to calculate, systems with sophisticated workload management squeeze much more efficiency from resources. In the TCO example above, we did not take into account MapR resource and workload management. We’ll save that math for a future post, but it’s worth mentioning that MapR combines YARN with volumes, data placement control, job placement control, and label-based scheduling to give you much more fine-grained resource management, not only over compute but all the way down to the node. As an example, one media customer of MapR was able to consolidate 8 HBase clusters into a single cluster with MapR.

    In addition, multi-tenancy helps address cluster sprawl. As organizations use Hadoop over time, new users and new use cases emerge, and each one often seems to demand a new cluster. But if the storage and compute resources can be truly partitioned, a new cluster is rarely needed. A production cluster is clearly needed, and a test cluster representative of the production cluster is a recommended practice. For business continuity, a DR cluster is also recommended. But each new major project that comes along should not require a separate production cluster along with corresponding test and DR clusters. That sprawl would quickly become an administrative burden and likely an inefficient use of hardware. If instead, with true multi-tenancy, each new major project becomes a new rack of servers in the existing cluster, or maybe even just a set of servers in an existing rack, the administrative burden is contained and hardware is used most efficiently. Not to mention that teams working in the same environment tend to share best practices better than teams working in the silos that separate clusters tend to become. This actually has a HUGE impact on TCO for customers, but we left it out of the example model above.
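The arithmetic behind the first three areas can be sketched in a few lines of Python. This is a back-of-the-envelope illustration, not a sizing tool. The 100-million-file limit, the 2-3x compression range, the 24TB node size, the 500TB data set, and the job counts come from this post; the function names, the 250-million-file example, and the replication factor of 3 (the usual Hadoop default) are assumptions, and no headroom for temp space or growth is included.

```python
import math

NAMENODE_FILE_LIMIT = 100_000_000  # ~100 million files per NameNode, per the post


def namenodes_for(files: int, limit: int = NAMENODE_FILE_LIMIT) -> int:
    """NameNodes needed just to hold the namespace (HA standbys not counted)."""
    return math.ceil(files / limit)


def data_nodes_needed(data_tb: float, compression: float = 1.0,
                      replication: int = 3, node_tb: float = 24.0) -> int:
    """Nodes needed to store data_tb of logical data.

    replication=3 is an assumption; compression is the ratio achieved
    (1.0 means uncompressed).
    """
    return math.ceil(data_tb * replication / compression / node_tb)


def jobs_per_node(jobs_per_day: int, nodes: int) -> float:
    """Average daily jobs handled by each node."""
    return jobs_per_day / nodes


# 1. File limit: a hypothetical namespace of 250 million files.
print(namenodes_for(250_000_000), "NameNodes")

# 2. Compression: 500TB stored uncompressed, at 2x, and at 3x.
for ratio in (1.0, 2.0, 3.0):
    print(f"{ratio:.0f}x compression: {data_nodes_needed(500, ratio)} nodes")

# 3. Throughput: the two jobs-per-node figures quoted above.
yahoo = jobs_per_node(1_000_000, 32_500)
ad_tech = jobs_per_node(65_000, 400)
print(f"{yahoo:.0f} vs {ad_tech:.1f} jobs/node, ratio {ad_tech / yahoo:.1f}x")
```

Under these assumptions, 500TB needs 63 nodes uncompressed, 32 at 2x compression, and 21 at 3x, which shows how quickly the compression ratio alone moves the node count that drives every other cost.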

How Much Is That Doggie in the Window?
So, when you’re thinking about how Hadoop can save you a ton of money and the more free software the better (i.e., eyeing that cute little puppy in the window), don’t forget to think about the care and feeding. With great (big) data comes great responsibility. MapR knows this, which is why we invested heavily in an architecture built for maximum performance, reliability, and efficiency.

If you’re interested in trying out our new TCO analysis tool to see how much you can save by using MapR, contact us. There are lots of details you can configure and adjust based on your own environment, and it’s a great planning tool for sizing and estimating your costs, regardless of which Hadoop distro you choose.

Be sure to check out the complete top 10 list here.

This blog post was published October 10, 2014.
