The number of organizations that are thinking about using Hadoop has grown astronomically over the past year. How do you know whether you’re ready to implement Hadoop, and what are the best practices? To help answer these questions, research organization TDWI has published an online survey and the Hadoop Readiness Guide to help you determine if you’re ready to implement Hadoop.
To help explain the findings, TDWI Research Directors Philip Russom and Fern Halper moderated a webinar where they discussed the findings in the survey. Watch the webinar to learn the keys to success for Hadoop readiness, Hadoop readiness patterns, where organizations are in terms of readiness, and best practices for moving ahead with Hadoop.
Following the summary of their findings, Jim Scott (MapR), Matt Allen (MarkLogic), Richard Winter (Teradata), and Andrew Popp (IBM) participated in a panel discussion regarding Hadoop best practices. In this two-part blog series, we’ll cover both the panel discussion and the Q & A following the webinar. Here is part one, which highlights the responses from the roundtable discussion.
What are some of the organizational practices in terms of big data and Hadoop?
Andrew Popp (IBM): I spend quite a bit of time with customers, and I see them falling into two camps. One is where they're able to leverage Hadoop to do better things in their organization. The other is where Hadoop becomes just a cool science project that doesn't necessarily bring a lot of business value.
If I go back to the folks that are able to leverage Hadoop to add bottom line value, I think the kernel that really sets it off in a completely different direction from the science project folks is that there's a need or some sort of business rationale that they're trying to accomplish. Everybody realizes that we've got this overabundance of data, and I think every organization, regardless of the size, acknowledges that.
I think what organizations that are successful with Hadoop do is identify a specific business case. Organizations that spend the time upfront to establish what they're trying to accomplish steer their projects in a much more successful direction, as opposed to saying, “Let's get Hadoop because it's cool,” after which it ends up sitting on a bunch of servers while people say, “I'm not sure what or who's working on that. The co-op student has left, and I'm not sure what's going on.” I really see establishing the business challenge that you're trying to address as key.
Do you think that they also have executive support?
Andrew: I definitely think that you need to go up a certain level to be able to get that support. Again, Hadoop can really solve some pretty fundamental business problems. So yes, I think that getting executive support early on is critical. It doesn't necessarily need to be there on day one, but if you've chosen the right problem, I think Hadoop can address many of the critical problems that are paralyzing organizations today. You do need to have executive support and you should get it earlier rather than later.
Jim, what are some of the organizational best practices that they're putting in place in terms of big data and Hadoop?
Jim Scott (MapR): From my personal experience, and from customers I've interacted with directly, the number one most important thing is automation. If you think about it in terms of preparation for success, one of the areas that Philip showed in all of those diagrams was that IT readiness tends to be below average in most cases. That is the same thing that I've seen from personal experience. When people evaluate these platforms, one of the things they often get stuck on really early on is, "Which platform? What tool has the prettiest interface?" What happens is that after they get maybe a month or two of experience, they start figuring out how to use it, but they need to know how to automate it. Figuring out how to automate those activities is one of the key factors of success in this type of platform and for putting in best practices.
The second area that I see is enabling self-service for data exploration. As soon as organizations can start figuring out what they can do and they start exposing that platform to others, that's when you get the data analysts or data scientists into the platform and they start getting their hands on it. Enabling them to do their job and taking down the walls is one of the most critical keys to success for a platform like this.
Reducing the cycle time for those people to get new insights into their data is remarkably important. It's predominantly driven by the importance of getting away from the RDBMS-centric thinking that so many people have. The main thing is that you can't make the assumption in big data that you have a traditional environment. There isn't just one type of data. You don't have only one type of compute engine. Those are the two most important things that I've seen that can have an impact on getting an organization ready to run with big data and Hadoop.
Richard, what about you, if you think about your customers?
Richard Winter (Teradata): My experience with this area is with customers who are building enterprise business solutions on Hadoop. A key problem that we see is lack of experience with enterprise-scale production implementation. Customers often have this experience with other environments, but they don't have it on Hadoop. I think what customers need is people on their team, whether employees or skilled outsiders, who have the know-how that comes from hands-on experience with engineering, implementing, and operating complex business solutions on Hadoop. Only with this kind of experience can your team identify the key challenges up front, create an appropriate architecture, and adopt an approach with a good chance of success. I think this actually ties in directly to the point Jim just made about automation.
An example would be data ingestion. You have to get the data into Hadoop before you can do anything else with it. People often start out writing a custom script for each data source, but this breaks down once you need more than a few data sources for your project; it's not scalable in terms of agility, cost, or human resources. So you need an automated way of ingesting a new data source, and because using Hadoop usually means that you have a wide variety of data sources, that automation has to handle the variety. This is one example of many of the complex challenges in building business solutions on Hadoop. You need an overall approach, an architecture, and the skills to be able to solve those challenges together to create a business solution that delivers real value.
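To make the contrast between one script per source and automated ingestion concrete, here is a minimal sketch in Python. It is not something the panel presented: the `SourceSpec` fields, the `PARSERS` registry, and the use of a local directory as a stand-in for a Hadoop landing zone are all assumptions made for illustration. The idea is that each source is described declaratively, and one generic routine handles all of them.

```python
import csv
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class SourceSpec:
    """Declarative description of one data source (hypothetical fields)."""
    name: str
    path: Path
    fmt: str  # key into PARSERS below


def parse_csv(path: Path) -> Iterable[dict]:
    # Yield one dict per CSV row, keyed by the header line.
    with path.open(newline="") as handle:
        yield from csv.DictReader(handle)


def parse_jsonl(path: Path) -> Iterable[dict]:
    # Yield one dict per non-empty line of a JSON Lines file.
    with path.open() as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Supporting a new format means adding one parser here, not a new script.
PARSERS: Dict[str, Callable[[Path], Iterable[dict]]] = {
    "csv": parse_csv,
    "jsonl": parse_jsonl,
}


def ingest(spec: SourceSpec, landing_zone: Path) -> int:
    """Normalize one source to JSON Lines in the landing zone; return row count."""
    parser = PARSERS[spec.fmt]  # raises KeyError for an unregistered format
    target = landing_zone / f"{spec.name}.jsonl"
    count = 0
    with target.open("w") as out:
        for record in parser(spec.path):
            out.write(json.dumps(record) + "\n")
            count += 1
    return count


def ingest_all(specs: List[SourceSpec], landing_zone: Path) -> Dict[str, int]:
    # The local directory stands in for a cluster landing zone (assumption).
    landing_zone.mkdir(parents=True, exist_ok=True)
    return {spec.name: ingest(spec, landing_zone) for spec in specs}


def demo() -> Dict[str, int]:
    # Build two tiny sample sources and ingest both with the same code path.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "orders.csv").write_text("id,amount\n1,9.50\n2,12.00\n")
        (root / "clicks.jsonl").write_text(
            '{"user": "a"}\n{"user": "b"}\n{"user": "c"}\n'
        )
        specs = [
            SourceSpec("orders", root / "orders.csv", "csv"),
            SourceSpec("clicks", root / "clicks.jsonl", "jsonl"),
        ]
        return ingest_all(specs, root / "landing")


if __name__ == "__main__":
    print(demo())
```

Onboarding another source in this sketch is a one-line `SourceSpec` entry rather than a new script, which is the kind of scaling property Richard describes. A production setup would likely lean on dedicated ingestion tooling rather than hand-rolled code like this.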
Matt, what do you think are some of the best practices in terms of big data and Hadoop that you see?
Matt Allen (MarkLogic): From our perspective, there really needs to be an aligned strategy that connects the technologist and the company to the business users. Too many times, MarkLogic's customers have thought that “If you build it, they will come” with Hadoop. In order to achieve any measure of success, the end objective must be really clearly articulated. That will help connect the software solutions to the business goals.
At MarkLogic, what we see is that some organizations think that all they need is Hadoop to be set up, and it'll solve all their problems. We've run into this a lot, because MarkLogic is an operational and transactional NoSQL database, and it works really well with Hadoop for moving data there and doing batch analytics. But there's also a misconception when we talk to people; they think that Hadoop can do fast, consistent transactions and replace the role of a database, which isn't true. The Hadoop ecosystem, from our perspective, is still relatively immature and changing very rapidly, so experimenting with some of the newer components of Hadoop can lead to really promising opportunities. However, organizations need to be realistic by designing the organization and the projects to reflect that.
Culturally, the organization has to be very open to experimentation. Somebody mentioned Hadoop science projects, and I think that mentality is where Hadoop has grown out of. It needs to be adopted into the organization like that. They have to be flexible and willing to fail fast and early in order to tweak and tune and optimize. There are literally hundreds of choices when it comes to the implementation of Hadoop, because there are so many different components, and keeping consistent with the end goals will help you stay on track so you can make the right technology choices to achieve the business goals.
The leadership, of course, needs to be supportive. They need to be all in. They must carry the vision. That's change management 101. It also helps to have a fully funded POC research and development team. They can help experiment, and they can be dedicated to exploring different solutions. Gradually, you can bring that work into the mainline of your organization and start working towards the automated approach that was mentioned.
Where do you think Hadoop fits into the enterprise ecosystem? What kind of data architectures do you see your customers implementing? Who owns Hadoop? Where is it physically located? What do organizations need to prepare in terms of existing data architecture? Jim, what do you think about this?
Jim: Hadoop alone isn't the answer to everything. We tend to hear a lot in the industry that it's the sledgehammer, but you still have to be thoughtful about the problems that you're dealing with. I tend to go back to the early days of Hadoop. There are plenty of great technologies that currently exist in the Hadoop ecosystem, but the whole point of Hadoop was to tear down the silos in a business to allow one place to perform analytics.
It's a really grand vision, but Apache Hadoop's architecture has left many opportunities for improvement. What I see all across the board when I talk to customers is that there are problems that any new user of Apache Hadoop will face at some point. The first that I see is cluster sprawl, predominantly related to the lack of multitenancy capabilities. Others are mixed workloads, like wanting to run NoSQL and analytics on the same hardware, and isolated security between clusters. I really see these a lot in financial industries.
All of these things lead to administration complexity. Additionally, consumers have been looking toward a platform that brings something more: a single converged platform in the ecosystem, with one cluster that supports true multitenancy and one location to administer the platform, including security, backups, point-in-time consistent snapshots, and disaster recovery capabilities. They should consider platforms that support the random read/write capabilities that nearly every enterprise application created in the last 40 years has come to expect from its file system.
That helps to simplify data integration. They'll be best served when they look at systems that have POSIX compliance and support NFS without any special considerations for how to configure it within their platform. The benefit is that the approach will prevent the need to purchase or build custom adapters for every legacy system they have in their business. Every business is different.
The final thing I would encourage, when people are considering this route for their enterprise architecture, is that they read up on the Zeta architecture. It's an enterprise architecture that enables any business to leverage all of its resources for any use case, and that's the most important business proposition now. Refocus your business on what's important now. It enables any business to effectively operate at Google scale without requiring thousands of data centers.
Richard, what do you think?
Richard: One trend we see emerging is Hadoop as the platform for the enterprise data lake or data repository. A lot of customers start out their journey with Hadoop thinking primarily about applications. As soon as they've even piloted more than one, they find themselves dealing with the issues of data sharing and data management in the Hadoop environment. I've spent the last year helping customers architect their data lake, and they need systematic approaches to metadata capture and to publishing data to make it available to users for self-service.
Customers often want a 360-degree view of the customer. This includes all the new data (social media data, mobile data, web log data, physical interaction data, and geo data), and these things tend to be on the Hadoop platform. So there's a large data management challenge taking shape on Hadoop, and the architectures for Hadoop are developing around that data management challenge.
Matt, where do you think Hadoop fits into the enterprise ecosystem? What kind of data architectures are you seeing being deployed?
Matt: At MarkLogic, our customers that are using Hadoop include Fortune 500 companies. Our line of sight is definitely focused on these large enterprises. Generally, we see an environment with a large data lake implemented within central IT. The use cases involve large-scale batch analytics and a few others. We also see them using Hadoop as low cost storage for regulatory commitments.
One thing that we see is how accessible Hadoop is to different departments within these large organizations, meaning that adoption often happens from the bottom up. A department will think that it can just get Hadoop going quickly so it can pull data in and build an app, and that central IT will then host it. They go to central IT and ask about hosting it, and central IT wants to know how this fits into its standard for corporate IT. We see this problem quite frequently, and we see it escalating, so our general advice is just to watch out, because it's easy to build the data lake only to realize later that you "need to build a boat to travel across it."
Regarding location, we often see hybrid models. Organizations are obviously always trying to move toward the cloud, and they get the benefits from low-cost commodity hardware. Our other advice would be just to be flexible with Hadoop. Hadoop is a rapidly changing ecosystem and compatibility is something to be very wary of. The IT executives don't understand, and the people selling it often don't understand all the different components either, so it's really important just to take a closer look at your architecture and understand the criteria for bringing new components into that architecture, so that you can ensure that you're really running software that’s proven in production environments.
Andrew, what kinds of data architecture do you see your customers implementing and what advice do you have?
Andrew: We could probably spin off a whole webinar on this series of questions alone. What we see is very similar to what the other folks have said. What we're really focusing on is the data warehouse. We feel that there's an inflection point happening within the data warehouse architecture today. Fortune 500 organizations have invested a lot of money in hardware and in managing, optimizing, and making those data warehouses as efficient as possible, so that higher order analytic systems, whether it's BI or whatever, can perform almost real-time querying and you can get the answer to how many widgets you sold.
I think Hadoop really plays an important role there, because it can now form another extension with respect to your data warehouse, whether it's in a data lake scenario or whether it's reversing ETL to ELT or something like that. Often, our customers are saying that their data warehouse is optimized to within an inch of its life, but there are all sorts of new data sets coming in, or there are all sorts of new requirements, whether it's user self-service or whatever, and they need to learn how the two can work together.
Folks are often directed to a Hadoop conversation just because of the flexibility. I think that Hadoop has to be brought into that conversation with context in terms of how it can work with your existing data warehouse. They could even potentially offload the stress or the attention that the data warehouse is getting, because as they look into their data warehouse, they’ll likely discover that a lot of that information isn't necessarily being queried all the time. There's an opportunity to potentially offload that information and put it into Hadoop. Then, if you can use a common currency like SQL to query both the data warehouse and the data lake, it's really of minimal impact, but it affords the organization much more flexibility. That’s because they can start storing and querying data that simply doesn't make sense to have sitting in a data warehouse.
From an architecture perspective, we definitely see that it’s a conversation that we're having in partnership with the existing data warehouse; we actually see it as an extension of that.
Matt, what are some leading technology and practice developments that you see as important for advancing the state of the art for Hadoop?
Matt: Regarding MarkLogic, we’re happy to have our customers running Hadoop right alongside our database. Hadoop is currently not an operational database; that's what MarkLogic offers, with the two networked together. We have a connector for Hadoop to move data back and forth and keep the indexes intact. We think that that's a good architecture, but from our perspective, it's a challenge to keep up with all the latest releases.
I think the important thing is to state a point of view. You need to take a strong stance on what Hadoop can do best, and then continue to work on those use cases with customers. That will help refine the hundreds of choices that are available when you look at the different components.
From our perspective, if vendors can simplify the use cases and walk customers down the road from design development to implementation and have those use cases proven out in an enterprise environment, we really think they can evolve to address some additional objectives after that. That will also help in getting a more stable base of developers that know Hadoop. That's often another struggle; a difficult project often involves a lot of people, and Hadoop skills are not as prevalent as typical skills like SQL. It's important to have certifications for Hadoop, but that's also a challenge when you have all these different components that are being released really quickly. Again, it's important for vendors to state a strong point of view about the direction that their product is going and the use cases they really want to focus on solving. By having that, I think they'll be more successful.
Jim, what do you see as some leading technology and practice developments?
Jim: I actually used to be a customer of MapR as well. I worked on an implementation where we supported 60 billion transactions a day. We had a service level requirement to handle those transactions in less than 100th millisecond per request on that platform. To reiterate Matt's point, Hadoop itself isn't operational, but the NoSQL databases that can run on top of those platforms like HBase or MapR Database certainly can be operational in nature. I think those advancements that come out of platforms like MapR enable businesses to be able to support six nines of availability for the platform. I think the biggest message that's out there is that consumers are best served by competition in this space. There are a lot of pundits out there who will claim that proliferation of a new technology leads to confusion, and I think that perspective is really an anti-competition type of viewpoint that everyone out there really should understand.
The competition is directly responsible for the advancements of these current technologies. If you look at something like Apache Spark, it's probably the most popular buzzword in the Hadoop ecosystem. Is it Hadoop? No, but it’s part of the Hadoop ecosystem. It was born out of the desire to perform distributed computation faster and more easily than previous technologies like Hadoop MapReduce. The ecosystem is still in its infancy. To prove that, look at Spark, which is so popular now, but there's another project called Apache Flink which showed up after Spark, and Apache Flink is gaining strong support in the community due to its performance. They just released the ability to support fifteen million transactions per second on their platform. This is stellar performance, compared to any of the other platforms out there.
Andrew, what do you think?
Andrew: I interpreted this question a little bit differently, and my perspective is from a Hadoop perspective. I see two initiatives that we're really bullish on that I think will ultimately help the end customer. One of them is standardization to some degree. Matt brought up some really good points and the fact is that Hadoop is not easy; it’s not like installing Microsoft Office. There are at least fifteen to twenty different projects within Hadoop today that you've got to sort of battle with to figure out how to install them. The degree to which you can install them really affects any sort of higher order applications that you want to put on top of it.
We've mentioned the 360-degree view of the customer, and that's definitely a use case that we see a lot. But the ability to be able to actually execute on that implies that you can lay down a Hadoop foundation. If you can't do that, you can't enable any of those higher order analytics to be able to start collecting metrics and then be able to build a customer profile. We're very bullish on standardizing the Hadoop distribution, where we can begin to normalize some of the more common projects so that the customer can forget about some of the complexities, lay down a distribution of Hadoop, and have some confidence that it will work with some of the higher order applications that they have already tested.
The other one is certainly something that is not necessarily new, but it's around the cloud and the ability to deploy Hadoop as a service. The idea there is that you take away the complexity from trying to lay down the software, but you also take away the complexity with respect to trying to provision hardware. We feel that that is certainly going to redefine the Hadoop market.
The other thing that's been implied a few times here is that a lot of the workloads associated with Hadoop can be quite elastic. I think one of the pushbacks we've seen is that customers don't necessarily want to save “x” machines to be able to lay down a Hadoop infrastructure, when they don't know if the demand is something that they can schedule, or the data is something that they can sort of forecast. We're really bullish again on offering a cloud or a hybrid environment. We see both a normalization and a standardization of the Hadoop distributions as well as giving customers the ability to deploy through the cloud as something that's really going to hopefully accelerate Hadoop and the ecosystem.
Do you see a lot of customers interested in Hadoop in the cloud?
Andrew: We actually see a lot of people interested in Hadoop in the cloud. It's a great way to get somebody to start to use the technology. With the service offering that we have, all they need to do is show up with a data set and they can start to enable some of the higher order analytics, which is really where we feel the magic happens.
Richard, I'll end with you. What are some of the leading technology and best practice developments that you see for advancing the state of the art for Hadoop?
Richard: One interesting area where we've been active at Think Big is in helping customers develop a roadmap. In many cases, there's a set of use cases they want to see implemented, and they know what they are. We help them figure out in what order to attack those, so as to best navigate the issues that you face when moving into this new technology and implementing it successfully. A roadmap with a little bit of planning around use cases is a best practice.
Another one has to do with the dreaded data swamp. There's been a lot written about how easy it is for data lakes to turn into data swamps. We talk to users who are there or worried about ending up there. The solution to that is really to have a conscious strategy and architecture for creating a data reservoir; a place where you can grow enterprise data assets that are as valuable and as well managed as they would be in a data warehouse, but that take advantage of the flexibility and technology benefits that you have on Hadoop. It's a somewhat different approach to data management; one that's looser in some ways. It's not necessarily based on a single enterprise data model, but is more agile.
I think best practices in that area are proving to be very valuable to users who want to have a managed resource of enterprise data on Hadoop. Almost everybody does once they've built their second application.
In part two of this blog series, we’ll publish the Q & A following the webinar, as well as answers to questions that weren’t covered during the webinar due to lack of time.