How Big Data Inefficiency Is Costing You a Bundle

Information wants to be free. Open source is free. Moore’s law is making computing free. Free, Free, Free.

Enough already with the free. In the real world, computing costs money. Making great products costs money. More efficient computing saves money. If you’re running a serious big data infrastructure, you must first focus on getting value, but once that’s done, you must make sure you are not bleeding money in a variety of ways.

Is your cluster too big? Are you wasting storage? Do you spend too much time on admin? Are you building up technical debt? How much is downtime wasting? If you aren’t asking these questions, you are surely wasting money.

Here’s why. In the world of open source, especially one increasingly powered by the cloud, complexity and inefficiency drive up costs. Open source projects are great at breaking new ground and encouraging a community of innovation. Open source projects are not great at productizing and optimizing code for use by nonprogrammers. Here are some points to consider:

  • Tableau and Qlik made great businesses out of data visualization and discovery. Why hasn’t an open source project sprung up to compete? The D3 JavaScript library is a masterpiece, but for programmers, not for end users.
  • The team at TIBCO Spotfire implemented an enterprise runtime for R called TERR, achieving massive efficiency gains. Why? Because R wasn’t developed by software engineers, but by statisticians. As a result, the software engineering was lacking, leading to massive inefficiency. TERR has much better scalability and memory management. Models can be developed in R and then run in production in TERR, when efficiency and reliability matter.
  • We’ve been waiting years for open source desktop applications to seriously compete with their commercial counterparts. Why haven’t they made a dent?
  • There are a zillion ways to process log data using languages like Perl, tools like grep, and various other Unix utilities. If those tools were enough, why did Splunk build a successful business that started with crunching log files?
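To make that last point concrete, here is a minimal sketch of the kind of ad-hoc log crunching Unix tools handle well. The log file, its format, and the endpoint names are all hypothetical, invented for illustration; the point is that stitching these one-liners into something reliable, shareable, and usable by non-programmers is exactly the productization work a tool like Splunk sells.

```shell
#!/bin/sh
# Build a tiny synthetic access log (hypothetical data, for illustration only).
printf '%s\n' \
  'GET /api/users 200' \
  'GET /api/orders 500' \
  'POST /api/orders 503' \
  'GET /api/users 200' > access.log

# Classic Unix pipeline: find HTTP 5xx responses, count them per endpoint,
# and list the worst offenders first.
grep -E ' 5[0-9][0-9]$' access.log | awk '{print $2}' | sort | uniq -c | sort -rn
```

This works fine on one machine and one file; it is when the logs span hundreds of hosts and the audience is not fluent in awk that the "free" approach starts costing real money.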

The world of big data is being stretched by the tension between the dynamic innovation and freedom of open source and the cost to forge a product that is efficient, powerful, and easy to use. The Hadoop ecosystem is breaking new ground and filling whitespace for developers in a breathtaking way. But is the Hadoop ecosystem creating efficient products that are easy to use?

All the major Hadoop providers, of course, say that they are creating a fabulous way to use the Hadoop ecosystem. But which ones have done the work to solve the hardest problems? You don’t have to guess. You can ask vendors for references, compare costs, and do the math.

Here are some of the dimensions to look at. (For a deeper dive on these issues, see the recent CITO Research white paper, “Five Questions to Ask Before Choosing a Hadoop Distribution”):

  • How big does a cluster need to be to handle a high-volume workload? Whether you are paying for on-premises machines or for cloud computing, doing more with a smaller cluster can save serious money.
  • How fast can you recover from errors and roll back work? Face it: Hadoop jobs often fail. Being able to roll back quickly to a checkpoint can save hours or days.
  • How complex is your distribution? Can you meet enterprise standards for security, backup, and routine operations with normal levels of effort? Or do you need a lot of time from people with deep expertise?
  • How productized is your big data infrastructure? Are you dealing with a product that offers high-level abstractions to get the job done, or with settings as complex as an airplane cockpit?

There is no such thing as a free lunch. There is no escaping the strengths and weaknesses of open source projects. In my view, free does not mean efficient, secure, or easy to use. Some Hadoop providers essentially say, “Don’t worry. We are getting better. Hadoop’s open source ecosystem will eventually be just like a great commercial product.”

MapR takes a different approach, which recognizes that getting a product right requires extra work, some of which simply doesn’t happen inside an open source community. MapR argues as follows: “We are going to take responsibility for everything needed to create great commercial products for enterprise use, in a way that stays compatible with the open source APIs. It won’t be free, but it will be worth it.” (And, recognizing that free does have value, MapR offers a free Community Edition that lets customers get started, without commercial support and lacking some enterprise features such as high availability.)

Free may be good enough if efficiency, system uptime, and ease of use really don’t matter to you. Going without these traits costs money. You can do the analysis and figure out how much. Then you can determine how free free really is and make a decision that’s right for you.

This blog post was published January 22, 2015.
