August 21, 2013 | BY Ted Dunning
We are often asked by potential customers if Apache Mahout™ integrates well with the MapR M7 Edition. The quick answer is, "Yes!" Mahout itself is extremely portable, and it easily connects with M7 where appropriate. The advantage of running Mahout on MapR has more to do with development simplicity, speed and reproducibility.
Advantage #1: You can easily mix and match modes without having to move data assets back and forth.
Almost all of Mahout's capabilities can run either in map-reduce mode or in local mode. In map-reduce mode, Mahout programs read and write data in the cluster's distributed file system. In local mode, files are read from and written to the local file system.
Local mode is the perfect choice during the early development phase or any time input files are relatively small. Launching a map-reduce program has considerable overhead, so it’s often the case that the local version of a program can finish running before the distributed version has time to fully get started. On the other hand, you would want to use map-reduce mode when data sizes increase to the point that parallel execution pays off relative to the startup overhead.
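As a rough illustration of that trade-off, the sketch below picks a mode from the total input size and prepares the environment for launching `bin/mahout`. The helper names and the 10 GB default cutoff are hypothetical and should be tuned by measurement on your own cluster. The only Mahout-specific piece is the `MAHOUT_LOCAL` environment variable, which the stock `bin/mahout` launcher checks to force local execution; treat that as an assumption to verify against your Mahout version.

```python
import os

# Hypothetical cutoff; measure on your own cluster before relying on it.
DEFAULT_THRESHOLD_BYTES = 10 * 1024**3


def total_input_bytes(path):
    """Sum the sizes of all regular files under path (or of path itself)."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def choose_mode(input_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Return "local" for small inputs and "mapreduce" otherwise."""
    return "local" if input_bytes < threshold_bytes else "mapreduce"


def mahout_env(mode, base_env=None):
    """Return a copy of the environment for launching bin/mahout in the given mode."""
    env = dict(os.environ if base_env is None else base_env)
    if mode == "local":
        # Assumption: bin/mahout runs locally when MAHOUT_LOCAL is non-empty.
        env["MAHOUT_LOCAL"] = "true"
    else:
        env.pop("MAHOUT_LOCAL", None)
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when invoking the Mahout launcher, so the mode switch stays a one-line decision in a driver script.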
With MapR, you can mix and match local and distributed modes of operation for all of these programs without having to move data assets back and forth between the local and distributed file systems. This may not sound like a major issue, but friction like this during development can radically decrease productivity. The ability to optimize execution style during production can be a substantial advantage in tuning throughput. Even at Facebook, 90% of all Hive jobs run with less than 10GB of input. This is near the threshold where local programs would actually run faster.
Advantage #2: With MapR, you can precisely replicate model builds by using snapshots to freeze the incoming data at the start of model building.
Another key advantage of using the MapR distribution comes into play when you have built a model and need to be able to document and control exactly what training data was used to build that model. One option, and nearly the only option using other Hadoop distributions, is to stop all data ingestion during the model build. This is detrimental, as it builds in scheduling dependencies between the ingest and modeling processes. Such dependencies can quickly multiply to the point that most of the calendar is consumed with process synchronization, and little is left for doing useful work.
With MapR, you can break these schedule dependencies by using snapshots to freeze the incoming data at the start of model building. This costs almost nothing in terms of time and storage. Most importantly, it also allows you to retain a record of exactly which data were used to build a model. This ability is critical for compliance reasons in industries such as finance and pharmaceuticals, but it is a strongly recommended practice anytime models are being built. Note that the training data on Big Data systems is often too large to make a copy, so a snapshot is required. It’s also important to point out that HDFS snapshots are not actually snapshots, and thus are unusable for this purpose. Think of MapR snapshots used this way as version control for your data.
Both of these advantages are critically important when you are building models with Mahout.
Interestingly, you gain the advantages described here without compromising the portability of Mahout. No Mahout code has to change ... it all just works.
It is also important to note that these MapR advantages apply to many other machine learning systems as well. These even apply to many legacy systems other than for machine learning; the ability to have map-reduce programs interoperate transparently with legacy code has huge implications because it makes it easy to integrate big data with the rest of the computing world.
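To make the snapshot workflow under Advantage #2 more concrete, here is a minimal sketch of how a build script might name a snapshot, assemble the command that creates it, and record which frozen path the model was trained from. Everything here is illustrative: the function names, the model identifier, and the record layout are hypothetical. The `maprcli volume snapshot create` arguments and the `.snapshot` directory layout reflect our understanding of the MapR tooling and should be checked against the documentation for your release.

```python
from datetime import datetime, timezone
import json


def snapshot_name(model_id, now=None):
    """Build a snapshot name from a model id and a UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    return f"{model_id}-{now.strftime('%Y%m%dT%H%M%SZ')}"


def snapshot_command(volume, name):
    """Return the argument list for creating the snapshot.

    Assumption: maprcli accepts these flags; verify for your MapR version.
    """
    return ["maprcli", "volume", "snapshot", "create",
            "-volume", volume, "-snapshotname", name]


def training_path(volume_mount, name, relative_path):
    """Return the path of the frozen data inside the snapshot.

    Assumption: snapshots are exposed under <volume mount>/.snapshot/<name>.
    """
    mount = volume_mount.rstrip("/")
    return f"{mount}/.snapshot/{name}/{relative_path.lstrip('/')}"


def build_record(model_id, volume, volume_mount, relative_path, now=None):
    """Describe exactly which snapshot and path a model build should read."""
    name = snapshot_name(model_id, now)
    return {
        "model_id": model_id,
        "volume": volume,
        "snapshot": name,
        "command": snapshot_command(volume, name),
        "training_path": training_path(volume_mount, name, relative_path),
    }


if __name__ == "__main__":
    # Hypothetical volume and paths, printed rather than executed.
    record = build_record("churn", "training.data",
                          "/mapr/my.cluster/data", "training/input")
    print(json.dumps(record, indent=2))
```

Storing a record like this next to the model gives you the "version control for your data" trail: the Mahout job reads from the snapshot path while ingestion continues to write to the live volume.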