Amazon Elastic MapReduce (Amazon EMR) makes it easy to provision and manage Hadoop in the AWS Cloud. The latest webinar from the Amazon Web Services Partner webinar series, titled “Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS,” showed examples of how to use Amazon EMR with the MapR Distribution for Apache Hadoop, and outlined the advantages of using the cloud to increase flexibility and accelerate projects while lowering costs.
Jonathan Fritz, AWS senior product manager, opened the webinar with an overview of Amazon EMR and a demonstration of how easy it is to run Hadoop workloads on AWS. Steve Wooledge, MapR VP of product marketing, provided details on the MapR Distribution for Hadoop and discussed use cases for MapR with Amazon EMR. MapR Principal Sales Engineer Bruce Penn wrapped up by demonstrating how simple it is to create a MapR cluster using EMR and by showcasing several integration points between MapR, EMR, Hive, Impala, and Tableau.
Features of the webinar included:
Examples of real-world applications and customer successes in production, including:
IDEXX, a company that provides diagnostics and information technology solutions for animal health and water/milk quality, uses MapR on AWS to flexibly scale its business at lower cost, and gain access to critical customer data instantly for rapid response times.
The Climate Corporation, which leverages big data to manage the economic impact of extreme weather, runs sophisticated climatology simulations by deploying MapR with EMR. As a result, the company can now insure farmers and show risk profiles for individual 20x20 plots across the U.S.
Want to learn more? Check out these resources on MapR and Amazon EMR:
The following questions were also asked during the webinar but went unanswered for lack of time. Here are those questions and their answers:
Are there performance differences between EBS & JBODs? How do we get around the low IOPS that we get on EC2?
Amazon is always working to improve I/O performance on the different node types. Ephemeral drives generally perform better than EBS; this is particularly true for the instance types with solid-state disks. AWS enables provisioned IOPS for EBS volumes, which can improve overall I/O performance.
Are CPU and memory sold as a married pair, e.g. 2 X CPU = <= 64GB RAM?
Across the various instance type families (M1, C1, CC2), that is generally true, although the CPU-to-memory ratio differs from family to family.
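To make the point about differing ratios concrete, here is a minimal sketch that computes memory per vCPU for one instance type from each family mentioned above. The spec figures are illustrative values for instance types of that era and are an assumption here, not part of the webinar; check them against the current AWS documentation before relying on them. The names `INSTANCE_SPECS` and `gib_per_vcpu` are hypothetical.

```python
# Illustrative specs only: (vCPUs, memory in GiB) per instance type.
# These values are assumptions for the sake of the example; verify them
# against the AWS documentation.
INSTANCE_SPECS = {
    "m1.large": (2, 7.5),
    "c1.xlarge": (8, 7.0),
    "cc2.8xlarge": (32, 60.5),
}


def gib_per_vcpu(instance_type):
    """Return the memory-to-CPU ratio (GiB per vCPU) for an instance type."""
    vcpus, memory_gib = INSTANCE_SPECS[instance_type]
    return memory_gib / vcpus


def main():
    # List instance types from the most CPU-heavy to the most memory-heavy.
    for name in sorted(INSTANCE_SPECS, key=gib_per_vcpu):
        print(f"{name:12s} {gib_per_vcpu(name):.2f} GiB per vCPU")


if __name__ == "__main__":
    main()
```

Under these assumed figures, CPU and memory scale together within a family, but a compute-oriented type gives far less memory per vCPU than a general-purpose one.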
How big were the databases you were querying with Hive and Impala?
The two tables that were queried were quite small, with only a few hundred rows each. The goal of the demo was to illustrate the connectivity from Tableau to MapR via our Hive and Impala ODBC drivers, and to show the speed of Impala, which does not require a MapReduce job the way Hive does.
Do you have any comparison information of your version of Hadoop and your competition with performance on Netezza?
We have not performed benchmarks comparing MapR and Netezza, as there is no straightforward apples-to-apples performance comparison, just as there is none for the cost per terabyte of stored data on Netezza vs. a Hadoop cluster. To increase performance with MapR or any Hadoop distribution, you simply add more nodes to the cluster. But ultimately, the question should focus on which queries your business requires against which data sets, and how to best meet the performance requirements of those workloads.
How do Amazon and MapR address security concerns with today's regulations, particularly compliance with PCI DSS?
A valid answer would require more detail about your security needs. The AWS infrastructure and the MapR software both have comprehensive security models, including encryption at rest and over the wire. There are multiple MapR customers who have deployed a PCI-compliant solution.
What tools are available now to measure/benchmark the performance of Hadoop in AWS?
Typical tools include the DFSIO and TeraSort benchmarks, both of which are a standard part of the Hadoop distribution. You can check out some third-party tests at _http://flux7.com/white-papers/_
Where can I find information on the AWS Management Console, SDKs, and APIs, so that I can use AWS capabilities to their fullest?
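As a rough illustration of how those two benchmarks are usually invoked, the sketch below builds the command lines for a TeraSort run (generate, sort, validate) and a TestDFSIO write/read pass. The jar paths are placeholders, since the actual file names and locations depend on the Hadoop distribution and version. The helper names `terasort_commands` and `dfsio_commands` are hypothetical. The code only prints the commands and does not execute them.

```python
import shlex

# Placeholder jar locations (assumptions): the real paths and file names
# depend on the Hadoop distribution and version installed on the cluster.
EXAMPLES_JAR = "/path/to/hadoop-examples.jar"
TEST_JAR = "/path/to/hadoop-test.jar"


def terasort_commands(rows, base_dir):
    """Build the three TeraSort steps: generate data, sort it, validate it."""
    gen = f"{base_dir}/gen"
    out = f"{base_dir}/sorted"
    report = f"{base_dir}/report"
    return [
        ["hadoop", "jar", EXAMPLES_JAR, "teragen", str(rows), gen],
        ["hadoop", "jar", EXAMPLES_JAR, "terasort", gen, out],
        ["hadoop", "jar", EXAMPLES_JAR, "teravalidate", out, report],
    ]


def dfsio_commands(n_files, file_size_mb):
    """Build a TestDFSIO write pass followed by a read pass."""
    common = ["-nrFiles", str(n_files), "-fileSize", str(file_size_mb)]
    return [
        ["hadoop", "jar", TEST_JAR, "TestDFSIO", "-write"] + common,
        ["hadoop", "jar", TEST_JAR, "TestDFSIO", "-read"] + common,
    ]


if __name__ == "__main__":
    # Print the commands for review instead of running them.
    for cmd in terasort_commands(10_000_000, "/benchmarks/terasort"):
        print(shlex.join(cmd))
    for cmd in dfsio_commands(10, 1000):
        print(shlex.join(cmd))
```

Printing the commands rather than running them keeps the sketch safe to try anywhere. On a real cluster, choose the row count and file sizes to match the data volumes you expect in production.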
Please refer to the AWS documentation, which is excellent: _http://aws.amazon.com/documentation/ec2/_
For the EC2 instances on which the cluster is provisioned, how is the hard disk attached to virtual instances?
Ephemeral disks are allocated from the physical spindles on the node hosting the Amazon instance. EBS disks are remote from the instance (accessed over a network link).
How does the database performance compare against Greenplum/HAWQ or Impala?
Impala on MapR is generally faster than on other distributions because the raw I/O performance of MapR is superior. Any of the open source query engines (Hive, Stinger, Shark, etc.) will see the same base improvement. For companies that want to run a full ANSI SQL compliant database on Hadoop, MapR supports HP Vertica natively on our distribution for Hadoop.
Since this is all running on VMs, what is the performance difference between VMs and bare metal?
There’s no easy answer to that, as it is very workload-dependent. The better question to pose is, “What performance do I need?” and then pick the platform that can reliably deliver that performance.
Why did you take off the NameNode from the MapR Hadoop framework? How does that stand out from the rest?
The NameNode is a single point of failure that reduces both the functionality and performance of a distributed file system. On other distributions for Hadoop, implementing NameNode HA requires additional configuration on nodes separate from the data nodes. MapR took the data that the NameNode holds and distributed it across the cluster's data nodes. With the MapR distribution’s no-NameNode architecture, everything is triplicated and ultra-reliable. With no NameNode, there are also no practical limits on the number of files that can be stored on MapR, letting you go all the way to exabyte scale. Another major benefit is much lower cost, because the cluster needs less hardware than with HDFS, where you require multiple NameNodes to deal with the file limit at scale and multiple active standby servers to implement NameNode HA. You can find more details on why architecture matters here. This is one area where MapR stands out from other distributions for Hadoop and provides better business continuity, reliability, and performance for Hadoop projects and applications.