Editor's note: This is the fourth in a series of blog posts on how to build effective AI and machine learning systems. The previous blog post is titled "With Machine Learning and AI, The Win Isn’t Always Where You Think."
Is this article for you?
Many people are newly interested in machine learning and other forms of data science. If you are one of them, you may find this short article useful: it explains some basic ideas about how machine learning works and how it compares to traditional software development. And if you are already an experienced modeler or data scientist, you may also find it useful, because you need to be able to communicate with others - business leaders, architects, IT professionals, people with expert domain knowledge - about what your machine learning project entails.
With the ever-broadening adoption of machine learning and AI, it’s not surprising that people without a data science background are hearing a lot about these topics and wanting to know more about what’s involved. Recently, someone asked me: what, exactly, is a model? So I’m taking a moment here to give a very high-level, non-technical explanation.
A lot of what is involved in machine learning development, including building the model, is less "math-y" than many people expect and much more like the thinking and skills that go into traditional software programming. People hear a lot about fancy learning algorithms and, if they’re not wild about math, they often tune out. But it’s important to know that the learning algorithm is just a relatively small part of the whole machine learning process. There is also a huge amount of data manipulation: processing raw data and extracting features to build training data, as well as handling outputs, evaluating models, deploying them - and doing all of that over and over again.
Many people just getting into data science are surprised at how much code they have to write in order to build and deploy a machine learning model - and by just how much their coding background will help them. In spite of the "math-y" reputation machine learning has, there is a big role for coding, and much of it uses traditional coding skills, expanded with some modern data skills. The fact that lots of people are already halfway or more to having these data wrangling skills is one reason that the fairly new role of data engineer is rapidly growing in popularity.
Data scientists almost never develop a learning algorithm from scratch in day-to-day work; they typically use and customize learning programs built by someone else. But they do need coding skills: they have to write programs that manipulate and encode data into forms usable by the learning algorithm they have chosen.
Let’s demystify the machine learning process a bit. In machine learning, the data used to produce a trained model (the training data) is an essential part of the learning process. Raw data is processed, and the features of interest are extracted to produce training data. The training process usually is iterative, as suggested in the following figure: it’s not just write code, test, run and done. Rounds of training, evaluation, adjusting and retraining go on until a trained, executable model with acceptable performance is produced - and there will be many, many models, as performance in production in real-world settings will vary with different models.
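That iterative loop can be sketched in a few lines. Everything here is a toy stand-in of my own invention (the "models" are just threshold values and the data is made up); a real pipeline would call an actual learning algorithm, but the shape of the loop - produce a candidate model, evaluate it, adjust, repeat until performance is acceptable - is the same:

```python
def evaluate(model_threshold, data):
    # fraction of examples the candidate model classifies correctly
    correct = sum((x > model_threshold) == label for x, label in data)
    return correct / len(data)

# toy evaluation data: (input value, correct yes/no label)
eval_data = [(-0.6, False), (-0.1, False), (0.3, True), (0.9, True)]

# rounds of train -> evaluate -> adjust, stopping at acceptable performance
acceptable = 0.9
best_model, best_score = None, 0.0
for candidate in [-0.5, 0.5, 0.0]:   # each round yields a newly "trained" model
    score = evaluate(candidate, eval_data)
    if score > best_score:
        best_model, best_score = candidate, score
    if best_score >= acceptable:
        break
```

In a real project the "adjust" step means changing features, hyperparameters, or the learning algorithm itself, and evaluation continues even after a model reaches production.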
This figure shows that there are two main phases in machine learning: developing a trained, executable model and using that model with new data in order to draw insights or automate a decision. Note how new data can be used to train new versions of the model.
Why is the training data so important? And why is it used so early in the process? As we shall see in the next section, the training data effectively helps write the model code. Compare that to a traditional software program in which the developer chooses all aspects that are part of the code and then only uses data as input to run the completed software program.
In order to produce a trained machine learning model, a learning program learns parameter values from training data during the training process. That differs from the traditional software development in which the human developer specifies all the values used in a program. These learned parameter values will be used as part of the trained, executable model. Together, the learned parameter values and the scoring program comprise the trained model. Once trained, the model will be run with new data, either for testing (if using test data) or for scoring new data that is being used when the model is run in production.
Here is a toy example that illustrates the key role of data in building a trained model. Suppose we have recorded lots of colored dots, and we want to build a program that can predict what color a new dot will be (presumably because colored dots are important to us, somehow). The figure below is an example of this. Look at the three empty circles with question marks on this diagram - they mark the location of three unknown dots your program needs to predict. You can probably guess that the one on the left will be yellow, the one on the right will be green and the one in the middle is a bit of a toss-up.
This toy situation is so simple that a person could just code the predictive program because you could estimate the key values you would need. For example, based on the general knowledge that we have so far about colors and positions, we could write a program that predicts a new dot located at position (x,y) will be green if x > 0. A slightly better program would realize that if y = 0.5, we should really be saying green only if x > 0.25. And for y = -0.5, we should be more liberal and predict green if x > -0.25. Refining our first idea, we could predict green if x - y/2 > 0. Even better, we could note that for x - y/2 > 0.25, we are almost positive that the dot will be green while for x - y/2 < -0.25 we are pretty sure that it won’t be. That means that we could estimate the probability that the next dot is green with something like this little Python program:
```python
def limit(v):
    # clamp the output to the range 0 to 1
    return max(0.0, min(1.0, v))

def pGreen(x, y):
    return limit(2*x - y + 0.5)
```
Notice how we did this. We looked at data manually and thought about things. And then we typed out a program. Data was involved, but only as a way for us to build up our own mental model about how things work. The result is a program with some magic numbers in it that we came up with by inspired inference.
We could rewrite our little program to be more generally useful by labeling those numbers as a, b, and c to emphasize the fact that those values could be changed for different kinds of data.
```python
def probabilityModel(x, y):
    # limit the output to the range 0 to 1
    return limit(a * x + b * y + c)
```
In contrast, in more complex real-world situations you would need a learning program to discover the values of those magic numbers for you. If the situation involved a huge amount of data, we could not just guess values for a, b, and c. In traditional software development, you as a developer would have to specify those values, just as we did above. But with machine learning, we would give a learning program the historical data, and it would discover good values of a, b, and c to plug into a program like the one above.
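To make that concrete, here is a minimal sketch of what such a learning program might look like for the dot example. This is my own illustration, not code from the original example: I swap the hard limit for a smooth sigmoid so simple gradient descent works, and I generate synthetic dot data whose true boundary is the x - y/2 > 0 rule described earlier:

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def learn_abc(points, labels, steps=2000, lr=0.1):
    """Gradient descent on logistic loss to discover a, b, and c from data."""
    a = b = c = 0.0
    n = len(points)
    for _ in range(steps):
        ga = gb = gc = 0.0
        for (x, y), label in zip(points, labels):
            p = sigmoid(a * x + b * y + c)
            err = p - label   # gradient of logistic loss w.r.t. the linear score
            ga += err * x
            gb += err * y
            gc += err
        a -= lr * ga / n
        b -= lr * gb / n
        c -= lr * gc / n
    return a, b, c

# synthetic "colored dots": green (label 1) roughly where x - y/2 > 0
random.seed(0)
points = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
labels = [1 if x - y / 2 > 0 else 0 for x, y in points]

a, b, c = learn_abc(points, labels)
```

The learning program never sees the x - y/2 > 0 rule; it recovers equivalent values of a, b, and c purely from the labeled dots.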
What is "the model" in this example?
These values of a, b, and c (derived automatically from the data), together with these two lines of code (which are specific to the learning algorithm), are what we call "the model".
The point here is that the machine learning program will learn part of the program from the training data instead of us figuring out those values. To deploy this model is to arrange that the coordinates of any potential new dots are given to the code, and the result is returned to whoever asked about the new dot.
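As a sketch of what deployment amounts to here, suppose the magic numbers from the pGreen example had been produced by training. The deployed model is just those frozen numbers plus the fixed scoring code, applied to whatever new coordinates arrive:

```python
def limit(v):
    # clamp the output to the range 0 to 1
    return max(0.0, min(1.0, v))

# learned values, frozen at deployment time (here, the pGreen numbers)
A, B, C = 2.0, -1.0, 0.5

def score(x, y):
    """The deployed model: learned numbers plus fixed scoring code."""
    return limit(A * x + B * y + C)

# a request arrives asking about a new dot at (0.5, 0.0)
answer = score(0.5, 0.0)
```

In a real system the `score` function would sit behind a service endpoint or batch job, but the division of labor is the same: the numbers came from data, the surrounding code did not.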
For the case of colored dots, there really isn’t much advantage to doing this because the example is so silly and so simple. But there are many applications where there are not just three numbers in a model, but thousands. Or millions. Or, in a few cases, billions. In those cases, hand-coding fails because of complexity. There are also many applications where we want to have a model that exactly reflects the data we have with as little bias from the developer of the model as possible.
This explanation is, of course, a great simplification of what’s involved with machine learning, but it should give you an idea of the basic requirements of building, training and deploying models as well as help you in thinking about what resources are required.
In the process of training a model, change takes place as the learning program discovers key values. But the situation is different when running models in production. In most circumstances, machine learning models do not change continuously while they are running in production. That is, values (such as those represented in our toy example by a, b, and c) are fixed for as long as a model is deployed. Changes in response to new data would occur by starting over at the training stage and computing new values from our old training data augmented with recent observations. Notice in the earlier figure that machine learning is an iterative process - evaluation goes on even once a model is in production, and models may be re-trained using new data. So there is ongoing influence from the outside world if you choose to take advantage of it, but this influence is normally episodic rather than continuous.
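Here is a toy sketch of that episodic retraining, using a hypothetical one-dimensional model and made-up data. The deployed cutoff stays frozen while in production; a new model version is produced only by rerunning training on the old data augmented with new observations:

```python
def train_threshold(xs, labels):
    # hypothetical 1-D model: learn a cutoff as the midpoint of the class means
    greens = [x for x, label in zip(xs, labels) if label == 1]
    others = [x for x, label in zip(xs, labels) if label == 0]
    return (sum(greens) / len(greens) + sum(others) / len(others)) / 2

# version 1: trained once, then frozen while deployed
xs_old = [-0.9, -0.5, -0.2, 0.3, 0.6, 1.0]
labels_old = [0, 0, 0, 1, 1, 1]
cutoff_v1 = train_threshold(xs_old, labels_old)

# later: augment the old training data with new observations and retrain
xs_new = xs_old + [0.1, 0.15]
labels_new = labels_old + [1, 1]
cutoff_v2 = train_threshold(xs_new, labels_new)   # a new model version
```

Between the two training runs, cutoff_v1 never changes in production; the new observations only influence the model when cutoff_v2 is trained and deployed in its place.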
Keep in mind that in real-world applications of machine learning, there is a substantial amount of code that the learning algorithm does not learn. Just as in our toy example, the structure of the code that applies the model to new inputs is given, not learned.
For all these reasons, there is a lot of overlap between traditional coding skills and the coding required for machine learning, but also some distinct, qualitative differences. The biggest difference is the role of data early in the process (during training) and the way it influences the nature of the model that is developed. That emphasis on the role of data also suggests the need to know and to preserve exactly which data is used to train each version of a model. How to manage code and data versioning in machine learning and AI systems is the topic of the next blog post in this series. Stay tuned!
To learn more about best practices for AI and machine learning, try these free resources:
Whiteboard Walkthrough video: "How MapR Enables Multi-API Access to Files, Tables, Streams"
Free on-demand training: Introduction to Artificial Intelligence and Machine Learning