How MapR became the most open Hadoop distribution

July 07, 2014 | BY Michael Hausenblas

I have a background in Open Data, and I would say that this area has a lot in common with Open Source software. One of the core tenets of both is freedom. No single organization, regardless of its size or accumulated brain power, can ever anticipate everything that is possible, be it with data or with code.

There is one interesting aspect that is somewhat unique to software: while in a proprietary database, say, you don't get to see how your data is laid out on disk, nor can you choose how your data is processed, in Hadoop and NoSQL land you typically get to pick most of it. You can decide to store your data in its raw JSON form or you can, for example, use Parquet to gain certain advantages in terms of query performance and memory footprint. You can choose to use Apache Drill to query your data or you can use Apache Hive, depending on your needs. Inevitably, with great freedom comes great responsibility, and this shows. As a Data Engineer I spend considerable time with customers discussing and advising on the many options the rich Hadoop ecosystem offers, and on how best to wire up ecosystem components so that they meet the business objectives and guarantee SLAs. Open Source is an essential driver in this context.
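
To make the storage choice a bit more tangible, here is a minimal sketch in plain Python that contrasts a row-oriented layout (one JSON document per line) with a column-oriented one. It is a toy illustration of the general idea behind columnar formats, not the actual Parquet file format, and the sample records, field names and function names are all made up for this example.

```python
import json

# Hypothetical sample records; the field names are made up for illustration.
records = [
    {"user": "alice", "country": "IE", "clicks": 12},
    {"user": "bob", "country": "DE", "clicks": 7},
    {"user": "carol", "country": "IE", "clicks": 30},
]

def to_row_layout(rows):
    """Row-oriented: one JSON document per line, as in a raw JSON file."""
    return "\n".join(json.dumps(r) for r in rows)

def to_column_layout(rows):
    """Column-oriented: one list of values per field.

    Assumes every record carries the same fields.
    """
    columns = {}
    for r in rows:
        for key, value in r.items():
            columns.setdefault(key, []).append(value)
    return columns

def total_clicks_from_rows(text):
    # Every record is parsed in full, even though only one field is needed.
    return sum(json.loads(line)["clicks"] for line in text.splitlines())

def total_clicks_from_columns(columns):
    # Only the "clicks" column is touched.
    return sum(columns["clicks"])

row_data = to_row_layout(records)
col_data = to_column_layout(records)
print(total_clicks_from_rows(row_data), total_clicks_from_columns(col_data))
```

Both functions return the same total, but the column-oriented version only reads the one field the query asks for. That is, roughly speaking, where the query performance advantage of a columnar format comes from, while raw JSON keeps the data in the shape in which it arrived.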

Given the above, it's fair to say that MapR is the most open Hadoop platform: we provide the widest choice of SQL-on-Hadoop options, and the same goes for search and resource managers (such as YARN), as well as for industry standards such as ODBC and NFS that enable interoperability with existing systems. Last but not least, we support multiple versions of ecosystem components (for example Hive or HBase), which means we do not force our customers to upgrade the entire platform just because they want to benefit from new features in a particular component.

You can get your hands on a MapR cluster by downloading the MapR Sandbox for Hadoop, a fully functional Hadoop cluster running in a virtual machine.