If you ask ten data scientists, “Which machine learning tool is best?” you’ll likely get many different answers. But slightly more surprising: if you ask any one of those data scientists the same question, you’ll likely still get many different answers.
Why is that? The answer turns out to be one of a couple of surprising things I discovered about the current state of machine learning while digging into the question of which tool is best for machine learning (or deep learning).
Here’s a quick preview of two key observations about what makes machine learning successful:
1. Most businesses, especially large ones, don’t rely on just one machine learning tool. Instead, they usually keep three to five machine learning options in their toolbox.
2. Most of the effort in successful machine learning projects is not about the model, algorithm or specialized machine learning tool – instead, it’s about how to handle data and model logistics.
Let’s start with the first observation.
It turns out that at any point in time, many organizations have adopted a range of machine learning tools. At first I assumed that was because different groups work on different projects, generating the need for different tools, but there are other reasons.
Of course, each data scientist or data science team within an organization may have their own preferences as to which is “best”, but this isn’t the only reason an organization has multiple options in the toolbox. Even within one team or project, it’s common to see multiple machine learning tools in regular use. Successful organizations tend to keep several options at hand in part because no single machine learning tool fits every situation, data set or scale. It’s generally a good idea to try out several approaches, especially in the initial stages of a project, and then see which performs better for that particular situation. And this approach continues even after models are in production, because the situation in which the model operates may change – the input data, real-world conditions and so on. For that reason, the tool that gave the best result initially may not be the best for the next generation of models.
This observation doesn’t mean that all machine learning technologies are of the same quality – some really do offer better potential performance and a better fit to your data and your business goals – but it’s still helpful to be up-to-speed on several different top-notch tools. Let’s take a look at what does matter in choosing machine learning tools, with a particular focus on your options for deep learning.
Deep learning is a very hot topic lately. Machine learning and deep learning are both becoming mainstream for a variety of business goals, but widespread interest in deep learning is a more recent trend. Deep learning is a specialized subset of machine learning that builds multiple hierarchical layers of connections (artificial neural networks) that in some ways resemble the connections of a biological brain. Deep learning serves a variety of different, and growing, applications, and is most commonly seen in projects involving perception tasks such as image and speech recognition.
There are many valuable and practical areas of machine learning that do not require deep learning – in some cases very simple approaches can deliver powerful results. For example, it’s possible to build a simple but very effective recommendation system that exploits the observed relationship between users and items. I talked about this with my co-author Ted Dunning in the O’Reilly publication Practical Machine Learning: Innovations in Recommendation (available here as a free download).
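To make the user-item idea concrete, here is a minimal toy sketch in plain Python (hypothetical data and names for illustration only, not the method from the publication): items that frequently co-occur in users’ histories are recommended alongside each other.

```python
from collections import Counter
from itertools import combinations

# Toy interaction histories: which items each user interacted with
# (hypothetical data, purely for illustration)
histories = {
    "alice": {"guitar", "amp", "picks"},
    "bob":   {"guitar", "amp", "tuner"},
    "carol": {"guitar", "picks"},
}

# Count how often each ordered pair of items co-occurs in a user's history
cooccur = Counter()
for items in histories.values():
    for a, b in combinations(sorted(items), 2):
        cooccur[(a, b)] += 1
        cooccur[(b, a)] += 1

def recommend(item, k=2):
    """Recommend the k items most often seen alongside `item`."""
    scored = [(other, n) for (i, other), n in cooccur.items() if i == item]
    return [other for other, _ in sorted(scored, key=lambda t: -t[1])[:k]]

print(recommend("guitar"))
```

Even this tiny sketch captures the essence of co-occurrence-based recommendation: no model training at all, just counting observed relationships between users and items.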
But the newly accessible techniques in deep learning can tackle otherwise out-of-reach goals as well as extend the scope of many machine learning projects, including recommendation. In his blog on a short history of deep learning, Bernard Marr said the promise of deep learning “… demonstrates that given a large enough data set, fast enough processors, and a sophisticated enough algorithm, computers can begin to accomplish tasks that used to be completely left in the realm of human perception” (from “What is Deep Learning? A Short History Everyone Should Read” May 2017).
Just like conventional machine learning, putting deep learning into production typically requires that you be able to run many models at the same time. I address this topic in a new O’Reilly publication coming in September. I’ll touch on this idea briefly at the end of this article when we look at the importance of logistics.
A variety of publicly available deep learning tools are emerging. A sample of those that deserve a closer look are TensorFlow, MXNet, Caffe and H2O. These technologies may be less familiar to you than the more traditional machine learning tools, so here’s a brief comparison.
TensorFlow™ is a very popular technology specialized for deep learning that was released under an Apache 2.0 open source license in November 2015 after being developed by researchers on the Google Brain Team. Its primary purpose was to detect patterns in a manner that resembles (on a much smaller scale) the way connections in the human brain learn.
TensorFlow is now used by Google both for research and in their products and is available for the public to use in development and production as well. Although it’s not an Apache Foundation project, it does have a broad community of developers, contributors and users. In its first year after public release, over 480 people contributed to the project. The number of GitHub repositories that refer to TensorFlow was 1500 by May 2016, with only 5 of those being from Google, as reported by Jeff Dean. By the time of the first TensorFlow Dev Summit, held in Mountain View in February 2017, when version 1.0 was released, it was reported that TensorFlow was used in over 6000 open source repositories. The current release as of this article is version 1.2.
TensorFlow uses data flow graphs for numerical computation, described on the website this way: “Nodes in the graph represent mathematical operations, while the graph edges represent multi-dimensional arrays (tensors) communicated between them.” It handles optimization very well and can be accessed through a flexible Python interface or via C++, so it does require some coding.
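To make the data-flow-graph idea concrete without depending on TensorFlow itself, here is a minimal pure-Python sketch of the same concept (an illustrative toy, not TensorFlow’s actual implementation): nodes are operations, edges carry values between them, and nothing computes until the graph is run.

```python
# Minimal sketch of a data flow graph: operations are nodes, values flow
# along edges, and evaluation is deferred until run() is called.
class Node:
    def __init__(self, op, inputs=()):
        self.op, self.inputs = op, inputs

    def run(self, feed):
        # Evaluate input nodes first (the incoming edges), then apply op.
        vals = [n.run(feed) for n in self.inputs]
        return self.op(feed, vals)

def placeholder(name):
    # A leaf node whose value is supplied at run time
    return Node(lambda feed, _: feed[name])

def add(a, b):
    return Node(lambda feed, vals: vals[0] + vals[1], (a, b))

def mul(a, b):
    return Node(lambda feed, vals: vals[0] * vals[1], (a, b))

# Build the graph y = (x * w) + b, then run it with concrete values.
x, w, b = placeholder("x"), placeholder("w"), placeholder("b")
y = add(mul(x, w), b)
print(y.run({"x": 3.0, "w": 2.0, "b": 1.0}))  # computes 3*2 + 1
```

Separating graph construction from execution is what lets a system like TensorFlow optimize the whole computation and place pieces of it on different devices before any numbers flow through it.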
As it turns out, while very popular, TensorFlow is not the only way to easily take advantage of tensors, and these techniques are not limited to deep learning. Ted Dunning, Chief Application Architect at MapR, stated in the KDNuggets June 2017 technical article “Deep Learning 101: Demystifying Tensors” that “…tensor based computational systems like TensorFlow (or Caffe or Theano or Mxnet or whatever your favorite is) can be used for optimization problems that are very, very different from deep learning.”
MXNet is a highly scalable deep learning tool that can be used on a wide variety of devices. It is currently an incubator project with the Apache Software Foundation, admitted into incubation in January 2017. MXNet is supported by a number of organizations, including Amazon AWS, Microsoft Azure, Intel and Carnegie Mellon. Although it does not yet appear to be as widely used as TensorFlow, MXNet’s growth will likely be boosted by becoming an Apache project.
Caffe is an open source deep learning tool known for its speed, expressive architecture and extensibility. It has an interesting history: it was originally authored by Yangqing Jia, then a PhD student at UC Berkeley, and further developed by the U.C. Berkeley Vision and Learning Center. It is written in C++ with a Python interface. It is released under the open source BSD license and hosted on GitHub, with many contributors from the wider community. Version 1.0 was released in April 2017. Caffe is particularly designed for image classification and segmentation and is used especially by academic researchers as well as the general public, although not (yet) as widely as TensorFlow.
Another development in this space was announced in April 2017, when Facebook open sourced a derivative known as Caffe2. As described in a TechCrunch article by John Mannes that month, Caffe2 had previously been used at scale across Facebook. Caffe programs can be converted to Caffe2 via a utility script, and Caffe2 is reportedly aimed at being particularly developer friendly.
H2O is an older open source machine learning technology that offers a broader foundation for machine learning, not just focused on deep learning, although that is included. It was developed by a Mountain View company now called H2O.ai and released under the Apache 2.0 open source license. Development was led in part by data scientist Arno Candel, and the project has a scientific advisory council made up of Stanford University professors. H2O software is written in Java, Python and R, and it has a powerful graphical interface that can be used to control training. It can use data from a variety of distributed file systems, including the distributed storage of the MapR Converge-X™ data platform.
H2O is widely used for machine learning projects. A January 2017 TechCrunch article by John Mannes reported that around 20% of Fortune 500 companies use H2O, and the H2O.ai website states that over 100,000 data scientists make use of this machine learning tool. One notable characteristic of H2O is that its iterative style of training offers a practical, business-driven option: a data scientist can interrupt training and deploy a model before it is fully optimized if time and performance considerations make this desirable.
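The value of interruptible iterative training can be sketched generically in a few lines of Python (this is an illustration of the idea only, not H2O’s API): each iteration improves the model, so stopping early still yields a usable, if not fully optimized, model.

```python
# Generic illustration of interruptible iterative training (not H2O's API):
# every step improves the fit, so training can stop at any point and still
# produce a deployable model.
def train(xs, ys, budget_steps, lr=0.01):
    w = 0.0  # single weight for the fit y = w * x
    for _ in range(budget_steps):
        # One gradient-descent step on the mean squared error
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w  # a usable snapshot, even if the budget cut training short

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]  # roughly y = 2x (toy data)
early = train(xs, ys, budget_steps=5)    # interrupted early: rough but usable
full = train(xs, ys, budget_steps=500)   # run much longer: near the optimum
print(early, full)
```

The early model is less accurate than the fully trained one, but if a deadline or a compute budget demands it, a less-than-optimal model now can be worth more than a perfect model later.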
What about the second surprising observation about machine learning, namely, that most of the effort is about logistics rather than about learning? As it turns out, the overall best tool for machine learning is the data platform itself. People generally make use of a small collection of favorite machine-learning-specialized technologies such as those described in this article, but regardless of which machine learning tools they’ve chosen, they still need a foundational technology to deal with logistics. The data platform must efficiently and reliably handle data storage and accessibility, model deployment and hand-off of new models, and do this in multi-tenant style, often across data centers on premises or in the cloud.
Part of the need for an efficient data platform on which to run various machine learning tools is being able to deploy and manage multiple machine learning models at the same time. Trying out models produced by multiple tools is just one reason why you will have so many models. From deep learning to simpler machine learning, each project requires many models at the same time, with more than one model being developed and evaluated even after a model (or models) is in production. Each project may require data from multiple sources, and projects often need to deploy models at regional data centers, as shown in this diagram.
Collect raw data locally, share it to a central data center to learn on the global data, and then deploy models back to regional centers in order to act locally. Logistics for machine learning may be handled most effectively at the platform level rather than in each program. Figure courtesy of Ted Dunning, used with permission.
Now multiply that by the number of teams running learning projects for different business goals, and you’ve got a mountain of model management (plus a lot of alliteration).
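One common pattern behind all that model management is champion/challenger evaluation: the current production model serves live traffic while candidate models are scored on the same inputs. Here is a small sketch in plain Python (hypothetical names and structure, not any specific product’s API):

```python
import statistics

# Champion/challenger sketch: the champion serves live requests while
# challengers are silently scored on the same inputs, so a better model
# can be promoted without interrupting service. (Illustrative only.)
class ModelManager:
    def __init__(self, champion, challengers):
        self.champion = champion          # (name, model) serving traffic
        self.challengers = challengers    # list of (name, model) candidates
        names = [champion[0]] + [c[0] for c in challengers]
        self.errors = {name: [] for name in names}

    def predict(self, x, truth=None):
        name, model = self.champion
        result = model(x)
        if truth is not None:  # score every model when ground truth arrives
            self.errors[name].append(abs(result - truth))
            for cname, cmodel in self.challengers:
                self.errors[cname].append(abs(cmodel(x) - truth))
        return result

    def best(self):
        # The candidate with the lowest mean error is the promotion choice
        return min(self.errors, key=lambda n: statistics.mean(self.errors[n]))

mgr = ModelManager(
    champion=("v1", lambda x: 2 * x),              # current production model
    challengers=[("v2", lambda x: 2 * x + 0.1)],   # candidate replacement
)
for x, y in [(1, 2.1), (2, 4.1), (3, 6.1)]:
    mgr.predict(x, truth=y)
print(mgr.best())
```

Multiply this bookkeeping across tools, projects and data centers, and it becomes clear why handling it at the platform level is attractive.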
The MapR Converge-X™ data platform is an example of a foundational technology designed not only to support the various machine learning tools you may want to try – all of those mentioned in this article run on MapR without special connectors or having to copy training data to a local disk – but also to deliver the logistics needed for successful machine learning. Within the same technology, running on the same cluster and under the same security and administration, MapR provides data storage from multiple sources, including streaming, and the ability to make data available to machine learning models via a global namespace, in the cloud, on premises, or with a hybrid architecture. The MapR platform’s highly scalable distributed file and object store, MapR XD, includes read/write files with accurate point-in-time snapshots and cross-data-center mirroring.
Logistics handled by MapR mirroring between data centers or within a data center also cover the embedded NoSQL database, MapR Database, and message streams, MapR Event Store, all within the same technology. In fact, models in different data centers can easily share the same streaming data thanks to MapR Event Store’s multi-master, omni-directional stream replication capability. Essentially, the same message streams can be shared by multiple consumers in multiple locations, including processing done at the IoT edge (for more, see content on MapR Edge computing).
What is the best tool for machine learning? The best “tool” is having the flexibility afforded by comparing models produced by multiple machine (or deep) learning technologies and doing this on a reliable data platform foundation such as the MapR Data Platform.
For more information on machine learning and on how to build a global data fabric, please try the following free resources: