Businesses implement machine learning (ML) solutions for one reason – to bring value to their enterprise. When dealing with big data and distributed file systems (DFS), it must seem tempting to use one tool that addresses each stage of model development and can run on vanilla HDFS (e.g., Spark). This one-size-fits-all methodology checks the boxes but serves as another limiting factor that prevents tangible business improvement. A true machine learning platform gives flexibility to choose the right tools, provides direct access to all of the data, and welcomes heterogeneity in a workflow rather than shuns it.
MapR is the only platform that allows POSIX-compliant access to distributed data: if a tool can run on Linux, it will run on MapR. In this post, we review some of the major building blocks of ML workflows and how MapR enables ML Polytheism, which means finding the assortment of tools that provides the most lift today and a roadmap for development that will evolve with technology.
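To make the POSIX point concrete, here is a minimal sketch, assuming the cluster is mounted as an ordinary directory. The mount point, file name, and column name are hypothetical placeholders, not values from this post. The point is only that ordinary file APIs such as Python's built-in `open()` are enough, with no DFS-specific client library.

```python
import csv
from collections import Counter
from pathlib import Path


def count_by_column(csv_path, column):
    """Count how often each value appears in one column of a CSV file."""
    counts = Counter()
    # An ordinary file open is all that is needed when the data is
    # reachable through a POSIX path.
    with Path(csv_path).open(newline="") as handle:
        for row in csv.DictReader(handle):
            counts[row[column]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical mount point, file, and column; substitute your own.
    path = Path("/mapr/my.cluster.com/data/transactions.csv")
    if path.exists():
        print(count_by_column(path, "merchant_category").most_common(5))
```

Any Linux tool that reads files (R, Torch, a shell script) could be pointed at the same path in the same way.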
Unless you have tried to solve a real problem with machine learning, you might be astonished to hear that most of the time spent on the task goes into presenting the raw data to the ML algorithms in a form they can learn from. This is referred to as Feature Extraction (or Feature Engineering). There are a few specific problems where feature extraction is built into the model (e.g., image classification), but even in those cases, there must be significant pre-processing of very large data sets.
Vectorization of Text - This is a common approach for preparing free-form text for analysis. It requires access to all of the data to build the dictionary of commonly used words. Spark has a tool well suited to this task. In this blog post, I used this technique to compare two sources of literature and determine whether they shared an author.
Building Risk Tables - This is a technique for assigning a numerical value to a nominal category (e.g., merchant category), used especially for classification tasks such as fraud detection. It involves simple math but requires access to massive datasets. There’s an example of using Apache Drill to power these calculations in this GitHub repository.
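Both techniques can be sketched in plain Python on toy data. This is an illustrative sketch, not the Spark or Drill code referenced above. The function names, the simple regex tokenizer, and the additive smoothing in the risk table are assumptions chosen for brevity. At scale, the same counting would run in a distributed engine.

```python
import re
from collections import Counter, defaultdict

TOKEN = re.compile(r"[a-z']+")  # assumed tokenizer: lowercase words only


def build_vocabulary(documents, max_words=1000):
    """Map the most common words across all documents to column indices."""
    counts = Counter()
    for text in documents:
        counts.update(TOKEN.findall(text.lower()))
    return {word: i for i, (word, _) in enumerate(counts.most_common(max_words))}


def vectorize(text, vocabulary):
    """Turn one document into word counts aligned with the vocabulary."""
    vector = [0] * len(vocabulary)
    for word in TOKEN.findall(text.lower()):
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector


def build_risk_table(records, smoothing=1.0):
    """Map each category to a smoothed fraud rate.

    records: iterable of (category, is_fraud) pairs.
    smoothing: assumed additive prior that pulls rare categories
    toward the overall fraud rate.
    """
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for category, is_fraud in records:
        totals[category] += 1
        frauds[category] += int(is_fraud)
    if not totals:
        raise ValueError("records must not be empty")
    overall = sum(frauds.values()) / sum(totals.values())
    return {
        category: (frauds[category] + smoothing * overall)
        / (totals[category] + smoothing)
        for category in totals
    }
```

Note that both functions need a full pass over the data before any single row can be encoded, which is why direct access to all of the data matters for feature extraction.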
The type of problem typically determines the family of algorithms from which the modeler chooses, but a multi-tool approach works well for supervised, unsupervised, and reinforcement learning scenarios alike. When the model is ready to be trained, it is important to select multiple methods for experimentation. For example, in a supervised classification setting, you may see differing results for linear, non-linear, tree-based, and boosted methods.
This is the fun part. Many of these algorithms have implementations in multiple libraries, and in some cases those implementations differ significantly. In some instances, you'll be downloading someone else's code and be forced to use their library, at least in the beginning. In other cases, you'll want to leverage a GPU for complex deep learning methods. The point is that your choices will depend on many factors, and limiting yourself to libraries that can read distributed data is not going to cut it.
Random Forest - Spark has a Random Forest library, but some prefer the common-lisp implementation, which has been shown to be faster and more reliable than other packages. This blog post provides details and tells you how to spin it up on MapR.
Recurrent Neural Network (RNN) - In some cases, you might find code that is so interesting, you just have to try it. It might not be worthwhile (or even possible) to rewrite it as a Spark job, so you want to run it as is. This blog post is an interesting tutorial on using Recurrent Neural Networks to generate text based on whatever data you train them with. Unfortunately, it runs in Torch, so if your data is located on a vanilla DFS, you’re out of luck.
The Kitchen Sink - There are too many other tools and libraries to mention, but the nature of ML polytheism is that you’ll try as many as you can. They include ML packages (SAS, R, Julia, etc.), tools (Jupyter, Zeppelin, RStudio), and libraries (scikit-learn, XGBoost, MXNet, etc.). True, some aren’t free, and some can run with limits against a DFS, but the extra effort required to remove the barriers to experimentation means that most businesses avoid them.
It would be unreasonable to assume that any one of the solutions above would be the magic bullet for your business problem. It's usually the case that some combination of models, either in sequence (i.e., cascading models) or in parallel (i.e., an ensemble), will outperform an individual method. It won't be ideal for every scenario, but if you need dexterity in combining models, then creating containers to serve the models as APIs can be an effective way of implementing an ensemble. As a natural extension of this capability, you'll also want to add new methods for evaluation and to promote and retire models.
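The difference between the two ways of combining models can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: each model is a plain callable that returns a score between 0 and 1, the function names are illustrative, and the thresholds are arbitrary. In a containerized setup, each callable could wrap an HTTP request to a model's API.

```python
def ensemble_score(models, features, weights=None):
    """Combine models in parallel: a weighted average of their scores."""
    if not models:
        raise ValueError("at least one model is required")
    if weights is None:
        weights = [1.0] * len(models)
    if len(weights) != len(models):
        raise ValueError("weights and models must have the same length")
    weighted_sum = sum(w * model(features) for w, model in zip(weights, models))
    return weighted_sum / sum(weights)


def cascade_score(cheap_model, expensive_model, features, low=0.1, high=0.9):
    """Combine models in sequence: escalate only when the first is unsure."""
    score = cheap_model(features)
    if score <= low or score >= high:
        # The cheap model is confident, so the expensive one is skipped.
        return score
    return expensive_model(features)
```

Because the models are passed in as a list, promoting or retiring a model amounts to changing that list or its weights rather than rewriting the scoring code.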
Flask - This is a simple Python library with a great deal of flexibility, as long as your data isn't on a vanilla DFS. It can serve a fairly complex model as an API; this repository shows how to classify images using a pre-built MXNet residual net model with a MapR container.
Kubernetes - Flask is a simple but effective way to get started with model servers, but taking containers into production requires something more reliable. Kubernetes is emerging as that orchestration tool. It provides structure for launching and monitoring multiple containers on a cluster. If some of the cluster nodes have GPUs, and if the jobs being launched are ML in nature, the value to modeling should be obvious. MapR believes that bringing persistence to containers and their management is important enough to include it in its data fabric.
Data science, much like other scientific disciplines, requires the design and execution of many experiments. But data science is unlike those other disciplines in that sharing of work is promoted and encouraged. However, like most free things in life, shared code comes as is. It really, really helps to have a flexible environment to run code of heterogeneous origin without limiting your access to data.
MapR provides the ability to manage complex ML workflows, run any tool that is meant for Linux, and marry a multiverse of ML environments with limitless access to your data. In a few months, it’s likely that there will be a new and exciting capability in the realm of data science. We don’t know what it is yet, but we do know that it will be readily available to run on your MapR Data Platform.