Editor's note: This is the third in a series of blog posts by this author on how to build effective AI and machine learning systems. The previous blog post is titled "Practical Tips for Data Access and Machine Learning Tools."
Even if you're not a data scientist, you may hold one of the most valuable skills in data science: the ability to understand your own business. And that is one type of expertise that data science specialists may lack.
It's tempting to be impressed by sophisticated algorithms and complex AI or machine learning systems, and that's OK: they are indeed impressive and often quite valuable. But, surprisingly, even very simple approaches are often just as valuable, if not more so, as long as they fit your business needs. The value in an AI or machine learning project is not directly proportional to its complexity. A better predictor of good return on investment is how well the machine learning or AI approach addresses the right question, the decision where a better answer has a significant positive impact on real business goals. Finding that spot takes real talent and knowledge, whether it comes from the data scientist or from business and domain experts. The fact is, the win is often not where you might think.
To understand this better, consider these three real-world examples.
A widespread form of machine learning, especially in retail, is building a learning system for making product recommendations. The good news is that this is an example where very simple approaches can be very powerful. The algorithms do not need to be complicated as long as you use the right training data to learn preferences. A basic principle of effective recommendation is to look at what people do rather than what they say they like. In other words, the best training data to indicate preferences comes from behaviors: what items people buy, what books or articles they read, what music they listen to, or which restaurants they go to. This behavioral data usually yields much more accurate models of what people prefer than training data based on ratings, such as restaurant reviews. Ratings in online reviews tend to be sparse and skewed, since only a small portion of people take the time to fill them out, and people who disliked an experience may be more motivated to review than those who liked it. On top of that, many reviews are fake. Even in the best of situations, people don't always know what they will like, especially for new or untried options.
The methods to build a simple but powerful recommendation system are described in the short O'Reilly Media book Practical Machine Learning: Innovations in Recommendation that I wrote with machine learning expert Ted Dunning, CTO at MapR. This figure, used with permission, is from Chapter 4: a simplified diagram showing the transition from user history to co-occurrence analysis to an indicator matrix. Explanation is in the book, which you can download as a free PDF here.
Using behavioral data, on the other hand, avoids these problems. All users produce this data, and it is much harder to spam because it comes from a much broader range of users. These ideas are not just theoretical; I know of a MapR customer, for example, who used this approach to build a very effective system for restaurant recommendations in just a few months.
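To make the idea concrete, here is a minimal sketch of co-occurrence-based recommendation, using tiny hypothetical user histories. The book linked above builds an indicator matrix using log-likelihood ratio tests to separate meaningful co-occurrence from noise; this sketch uses raw co-occurrence counts as a deliberate simplification.

```python
from collections import defaultdict
from itertools import combinations

# user -> set of items the user interacted with (behavioral data, not ratings)
histories = {
    "u1": {"espresso", "croissant", "latte"},
    "u2": {"espresso", "latte"},
    "u3": {"croissant", "bagel"},
    "u4": {"espresso", "latte", "bagel"},
}

# Count how often each pair of items appears in the same user history.
cooccur = defaultdict(int)
for items in histories.values():
    for a, b in combinations(sorted(items), 2):
        cooccur[(a, b)] += 1
        cooccur[(b, a)] += 1

def recommend(item, k=2):
    """Return the items that most often co-occur with `item`."""
    scores = {b: n for (a, b), n in cooccur.items() if a == item}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("espresso"))  # latte co-occurs with espresso most often
```

Even this naive version captures the core principle: preferences are inferred from what users did, with no ratings anywhere in the pipeline.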
But when you build a recommender, which behavioral data really indicates preferences for the action of interest? Even with this simple way to build a recommender, it can still be tricky to find the right data, as a second real-world example shows. In this instance, a machine learning system was being developed for online video recommendations. At first, it looked as though everything was working well.
But the real-world results when the models were deployed were not good: it turned out the system was learning which titles people liked, not which videos they actually wanted to watch.
By changing the training data from titles they clicked on (which is highly sensitive to spammy titles) to views of the first 30 seconds of videos, the effectiveness of the recommendation system soared: the win was not where it was expected. After the change, the model was based on behaviors that really reflected what people would actually watch. Discovering the difference – using 30-second views vs. movie title clicks as the data feature of interest – had to do with understanding the domain and the business rather than changing the algorithm.
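The fix amounts to changing which events count as training signal. Here is a minimal sketch, with hypothetical event data and field names:

```python
# Behavioral events: (user, video, watch_seconds). Data and names are hypothetical.
events = [
    ("u1", "v_clickbait", 5),
    ("u1", "v_good", 240),
    ("u2", "v_clickbait", 3),
    ("u2", "v_good", 180),
]

# Naive signal: every click counts as a preference (rewards spammy titles).
clicks = [(u, v) for u, v, _ in events]

# Better signal: only views of at least 30 seconds count as real interest.
real_views = [(u, v) for u, v, secs in events if secs >= 30]

print(real_views)  # only the videos people actually watched
```

The model and algorithm stay exactly the same; only the definition of a positive training example changes, and that is where the win came from.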
Big industrial companies are leveraging sophisticated artificial intelligence and machine learning approaches to make use of the IoT sensor data that is flooding in, and the wind industry is one example. Predictive maintenance is a common machine learning use case for big wind farms with their fleets of giant turbines. But in one recent case, the win from machine learning for wind energy came from a surprising place: the weather. One of the challenges in wind energy is dealing with fluctuations in the amount of wind. The real cost of this variability is that producers cannot safely commit to delivering a certain level of power, and thus the power they produce is worth less because consumers need to hedge against the wind not blowing. Machine learning (at this point) cannot make the wind blow, but it can better inform decision makers about what the weather most likely will be. Instead of doing away with fluctuations in wind over time, machine learning was used to reduce human uncertainty about the amount of wind energy that would be generated, decreasing the cost of hedging production uncertainties. In other words, with machine learning, the negative economic impact of this variability can be mitigated, and in doing so, the value of the wind energy is increased. Here's what happened.
Data scientists from DeepMind and Google's Carbon Free Energy Program applied neural networks to publicly available weather forecast data and turbine histories from Google's wind farms to predict energy generation 36 hours in advance.
Machine-learning-based predictions of wind energy generation: from a figure in the article "Machine learning can boost the value of wind energy" by Sims Witherspoon, Program Manager at DeepMind, and Will Fadrhonc, Carbon Free Energy Program Lead at Google, in Google's The Keyword blog. Note how rarely the wind plant under-delivers.
By combining advanced modeling of weather with historical turbine performance data, these data scientists were able to optimize hourly delivery commitments of energy to the power grid. Being able to accurately schedule and deliver energy increased its value. The authors state that "to date, machine learning has boosted the value of wind energy by 20% compared to the baseline scenario of no time-based commitments to the grid." That's an impressive win, especially in a low-margin business.
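As a rough illustration of the modeling idea (not DeepMind's actual method), here is a toy sketch that fits generation to forecast wind speed. The data, variable names, and the simple cube-law feature are all assumptions for the example; a turbine's output scales roughly with the cube of wind speed over part of its operating range, so even a linear fit on speed cubed captures much of the relationship.

```python
import numpy as np

# Hypothetical training data: 36-hour-ahead forecast wind speed vs. observed output.
forecast_speed = np.array([4.0, 6.0, 8.0, 10.0, 12.0])  # m/s
observed_mwh = np.array([0.5, 1.7, 4.1, 8.0, 13.8])     # generated energy

# Fit observed output as a linear function of speed cubed (plus an intercept).
X = np.stack([forecast_speed**3, np.ones_like(forecast_speed)], axis=1)
coef, *_ = np.linalg.lstsq(X, observed_mwh, rcond=None)

def predict(speed_ms):
    """Predicted generation for a given forecast wind speed."""
    return coef[0] * speed_ms**3 + coef[1]

print(round(float(predict(9.0)), 2))
```

The business value comes not from the model itself but from what a reliable 36-hour prediction enables: committing deliveries to the grid in advance instead of selling power as an unpredictable commodity.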
Industrial IoT settings, such as wind farms, oil and gas drill sites, manufacturing, and container shipping, are all situations in which AI and machine learning can play a role in extracting value from sensor data. But these IoT-intensive businesses are also just that: businesses. As such, they face the same processes and challenges as less technical brick-and-mortar stores or online retail sites. They handle customer orders, provide customer service, contract with suppliers, and handle billing. All these essential business processes need to run efficiently and reliably, and sometimes machine-based decisions can help.
This third real-world example of a machine learning win in a surprising place involves an IoT customer of MapR and the insight of a data scientist on MapR's professional services team. This large industrial customer is using a wide range of data, including IoT sensor data, in a variety of very sophisticated ways to improve the technical side of their business, but those situations are not the applications I want to call to your attention. Instead, it is a business bottleneck that could have happened in almost any company, an accounting issue. And applying a simple machine learning system to that bottleneck has paid off to the tune of tens of millions of dollars in savings and recovered revenues.
The accounting issue was that a small percentage of items, such as equipment acquisitions and services, were inadvertently mislabeled or just labeled as "miscellaneous," a natural side effect of a huge number of items being labeled. The problem is this: some of the items might have been subject to reimbursements or could have been listed as tax-deductible expenses. The good news was that the number of items that could have been better labeled was small – but that was also the challenge: for humans to find these and label them more accurately was like finding needles in a very large haystack. That's where machine learning was a big benefit in finding a target-rich collection: machine-based decisions were used to identify a group of transactions with a much higher likelihood of needing to be audited. This group was small enough to make it reasonable and feasible for humans to examine it and make corrections where needed, but it was large enough to be worth the effort. The result was worth tens of millions of dollars in savings and revenue that would otherwise be lost.
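The triage workflow can be sketched in a few lines. Here the model scores are hypothetical placeholders standing in for whatever classifier produced them; the point is the human-in-the-loop filtering step, not the model.

```python
# Hypothetical transaction records with model scores: the probability (from some
# classifier, not shown here) that the item is mislabeled and worth auditing.
transactions = [
    {"id": 1, "label": "miscellaneous", "score": 0.91},
    {"id": 2, "label": "equipment", "score": 0.03},
    {"id": 3, "label": "miscellaneous", "score": 0.72},
    {"id": 4, "label": "services", "score": 0.10},
]

THRESHOLD = 0.5  # tuned so the audit set is small enough for human review

# Triage: humans examine only the small, target-rich group above the threshold.
audit_queue = [t for t in transactions if t["score"] >= THRESHOLD]

print([t["id"] for t in audit_queue])  # [1, 3]
```

The model does not need to make the final call; it only needs to shrink the haystack so that human review becomes feasible and worthwhile.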
In this case, the surprising win came because of recognizing where machine learning could augment human decisions. The model did not need to be complicated. The brilliance of data science in this situation was not building an amazing model; it was in the ability to recognize a situation where machine learning would provide a big win for a small effort.
The fact that the next win from machine learning or AI may come in surprising places just underlines the need to make use of effective dataware in building your large-scale system. Machine learning and AI not only use large volumes of data but also diverse data sources and data types. Once you recognize the situations where you can best apply these learning approaches, you'll also identify what data will be involved in the training. Systems that already provide access to comprehensive datasets lower the entry cost for speculative machine learning and AI projects, making it much more likely that they will be worth the effort. This is why appropriate dataware is a key aspect of setting up for the win.
What is dataware? It is an essential concept in the modern machine learning/AI and IT landscape that helps manage data as a resource for a full range of real-time analytics and learning programs. In the best case, dataware does this regardless of location, for containerized and non-containerized applications, for on-premises data centers as well as for cloud, hybrid, and multi-cloud deployments. This gives data scientists the foundation on which to build effective applications.
Where do you get dataware? Dataware is found in the form of a modern data platform with the right capabilities to meet the requirements described above. The MapR Data Platform is a leading example of dataware that is well suited for machine learning and AI systems.
With its unique design of highly scalable files, tables, and message streams all engineered together in the same system, the MapR Data Platform provides universal access to data to avoid unwanted silos while giving system administrators fine-grained control over who does and does not have access. The MapR Data Platform supports global multi-tenancy, both with regard to direct data access needed for seamless function of machine learning models as well as safely running multiple applications that can access the same data. All this goes a long way toward setting up for the next clever discovery of where AI and machine learning could supply a win.
Webinar: "Data Management for AI and Machine Learning: Putting Dataware to Work" by Ellen Friedman
Free ebook: AI and Analytics in Production by Ted Dunning and Ellen Friedman
Blog post: "Leveraging the NVIDIA Data Science Workstation and DGX Pod for AI and Machine Learning" by Jim Scott
On-demand free online training from MapR Academy: Introduction to Artificial Intelligence and Machine Learning