Winning with Hadoop: Decisions That Drive Successful Projects



Big data challenges and opportunities are rapidly spreading across organizations large and small, in a wide range of verticals. Not surprisingly, people are turning to scalable solutions such as Apache Hadoop and NoSQL-based technologies to meet these challenges. Choosing an excellent data platform and making smart selections among the many Hadoop ecosystem tools are, of course, important decisions that set you up for success, but there are also some more fundamental choices that can make a big difference in meeting and exceeding your goals, regardless of the particular project or tool involved.

Three Decisions That Can Transform What You Do

What works? In a presentation for the Data Driven Business Day at the recent Strata + Hadoop World conference in San Jose, I talked about some of the choices that can make a difference. The talk was called “Real World Stories: Decisions That Drive Successful Projects,” and its content was based on what I learned from a variety of people using these new technologies to great benefit, as well as on my own experiences. Some of the ideas evolved from research for my most recent book, Real World Hadoop, in which co-author Ted Dunning and I report on how a variety of MapR customers are using Hadoop and NoSQL successfully, including in production use cases.

In this blog post, I’d like to focus on just three of the decisions that drive successful projects, because they are so important to all the other choices you will make:

Decision 1: Give Yourself Time to Think

For most of us, our work lives involve an almost non-stop flood of meetings, emails, texts, and immediate deadlines to meet—lots of good work, but no time to step away from the noise and details in order to think, to really think. Yet having the chance to let yourself look beyond what is happening to what could happen can really pay off. From better options for organization and operations to deep innovations, the great ideas are more likely to come if you give yourself the opportunity for reflection.


Give yourself a specific, protected, quiet time to think: It can really pay off.

This isn’t just trivial advice. It’s not easy to find time for serious thinking, and it doesn’t tend to just show up unless you consciously plan for it. I was reminded of that by a post in the blog Elided Branches from tech expert Camille Fournier, Head of Engineering for Rent the Runway and committer for the Apache Zookeeper project.

Looking back over the year, Camille wondered what her best decision of 2014 had been. At first she thought it was the decision to move to frequent releases—her team embraced that choice, and it worked out very well. But then she realized that the change in release frequency was only her second-best decision: her best decision was setting aside time away from all interruptions once a week in order to think. Without that, she might never have had the idea to improve her business by moving to frequent releases.

Give yourself time to think: You may be surprised by what you come up with.

Decision 2: Listen to Your Data

Once you are thinking, one of the things you need to think about is your data. Data-driven decisions are a powerful way to steer a course to success, but sometimes it’s useful to look beyond the question you’ve targeted and just ask what your data is telling you. That way, you may find deeper and more valuable insights than those you’re already pursuing. And in any event, it’s important to believe what data tells you even if it doesn’t agree with your initial assumptions. If it’s good data, listen to it.

This approach is particularly relevant in the world of big data where the combination of data sources or the expansion of data volume can provide new and sometimes surprising discoveries. The power of large-scale data aggregation combined with creative thinking and painstaking work is demonstrated by a big data, open source story that goes back to the 19th century.

Here’s what happened. A sailor named Matthew Fontaine Maury injured his leg and could no longer go to sea. The U.S. Navy gave him a desk job, and you might think that meant his exciting work was over, but that’s actually where his biggest adventure began.

Maury found boxes of dusty old log books from ship captains, collected and ignored. A ship captain would record various measurements and observations at different times of the day—basically a time series. The measurements might include wind direction, water or air temperature, speed of water currents, what he had for breakfast and whether or not the rum was holding out. There was no real consistency to it, but taken together, there were millions of measurements from a huge number of voyages, all scattered among the log books. Maury saw the potential value in that data if it could be extracted and aggregated.

Working by hand, his team painstakingly assembled the useful time series data and published a collection of Wind and Current Charts based on it. These charts provided valuable data for many shipping routes, which navigators could use to optimize the course they chose for a particular voyage. These data-driven decisions had the power to improve business returns by making voyages more efficient, and to save lives by reducing the great risks of setting out to sea.
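As a rough modern analogy, the kind of aggregation Maury’s team did by hand can be sketched in a few lines of Python. The regions, months, and current speeds below are invented for illustration; the point is only the pattern of grouping scattered observations and averaging them:

```python
from collections import defaultdict

# Hypothetical log-book entries: each voyage recorded observations at
# different times with no consistent format; here reduced to
# (region, month, current_speed_knots) tuples for illustration.
observations = [
    ("north_atlantic", 3, 1.2),
    ("north_atlantic", 3, 1.6),
    ("north_atlantic", 4, 0.9),
    ("south_atlantic", 3, 2.1),
]

# Aggregate the scattered measurements by region and month to expose
# the prevailing conditions on each route.
sums = defaultdict(lambda: [0.0, 0])
for region, month, speed in observations:
    sums[(region, month)][0] += speed
    sums[(region, month)][1] += 1

chart = {key: total / count for key, (total, count) in sums.items()}
# e.g. chart[("north_atlantic", 3)] is the average speed observed
# in the North Atlantic in March across all contributing voyages.
```

With millions of observations instead of four, this is exactly the shape of job that a MapReduce-style system distributes across a cluster: the grouping key is (region, month), and the reduce step is the average.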


Based on large-scale aggregation of time series data, Maury’s Wind and Current Charts provided 19th-century ship navigators with valuable data to inform their decisions. (Image © Dunning and Friedman 2014, used with permission)

Maury extended the open source nature of this project by asking ship captains to contribute new, more consistent data collected on a standard template; in exchange for their data, captains received a copy of the Wind and Current Charts. But the real lesson here is the valuable insight that can be gained from large-scale data when you pay attention to what it tells you and when you think about it in new and creative ways. Using big data well isn’t just about having more data—it’s also about transforming the way you think about data.

Decision 3: Transform Your Thinking

One of the most important decisions you can make in working with big data technologies such as Hadoop and NoSQL is to look beyond traditional ways of doing things and transform your thinking. It isn’t that the old ways of working with data are wrong—they are still very important—but you won’t get the best value from new approaches if you just try to use the new tools to do things in the same way you’ve always done them.

Instead, you can do traditional jobs in new ways that let you scale your systems cost-effectively. And you can do new jobs in new ways, using new data sources, larger amounts of data, and new analytic approaches to explore data and gain better insights. When working with Hadoop-based systems, it is helpful to transform your thinking in ways such as these:

  • Delay decisions – You can save data in its original form and use it later in a variety of ways. You’re not locked into your first decisions about how to process and use data.
  • Save more data – New technologies make it feasible to save months or years of data instead of days or weeks. As a result, you can ask questions, such as those involved in forensics or predictive maintenance, that simply were not possible before.
  • Be more flexible – Take advantage of the ability to use more data sources, including unstructured and nested data, and to combine data from multiple sources.
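The first point, delaying decisions, is sometimes called schema-on-read. Here is a minimal sketch in Python, with hypothetical field names and values: events are saved exactly as they arrived, and structure is imposed only when a question is finally asked:

```python
import json

# Hypothetical raw events, stored verbatim as they arrived (the field
# names are made up for illustration). No schema was imposed at write
# time, so nothing was thrown away.
raw_events = [
    '{"ts": "2015-03-01T10:00:00", "sensor": "a", "temp_c": 21.5}',
    '{"ts": "2015-03-01T10:05:00", "sensor": "b", "temp_c": 19.0}',
    '{"ts": "2015-03-01T10:10:00", "sensor": "a", "temp_c": 22.0}',
]

# A question decided long after the data was written:
# average temperature per sensor.
readings = {}
for line in raw_events:
    event = json.loads(line)  # schema applied on read, not on write
    readings.setdefault(event["sensor"], []).append(event["temp_c"])

averages = {s: sum(v) / len(v) for s, v in readings.items()}
print(averages)  # -> {'a': 21.75, 'b': 19.0}
```

Because the raw lines are kept, a different question next month—say, grouping by hour instead of by sensor—needs only a new read-side parse, not a reload of the data.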

Putting It into Practice

Ultimately it is people, not just tools, that solve problems. When you engage in big data projects, especially when the technologies are new to you, there’s a human element to be considered. In addition to the technical learning curve, there’s also the need to be comfortable with change and to be able to think in new ways. The most successful projects reflect those abilities.

For more tips on best practices and useful decisions in working with Hadoop and NoSQL solutions, you may enjoy our recent book of use cases, Real World Hadoop, by Ted Dunning and Ellen Friedman (published by O’Reilly in February 2015).

You can download a free ebook courtesy of MapR here.

To learn more about Maury and modern time series data, get a free e-copy of our book Time Series Databases: New Ways to Store and Access Data (published by O’Reilly in October 2014).

Follow the authors on Twitter: @Ellen_Friedman @ted_dunning

This blog post was published March 04, 2015.
