Practical Tips for Data Access and Machine Learning Tools

Editor's note: This is the second in a series of blogs by this author on how to build effective AI and machine learning systems. The previous blog is titled "AI All Over the Place: Where Does Artificial Intelligence Pay Off?"

How do you get value from AI and machine learning applications?

It all starts with the data, but exactly how you handle and protect that data makes a world of difference. Data collection, data quality, and ease of data access are key elements in the success of AI/ML systems. These factors affect the accuracy of results, system performance, and how efficiently development proceeds. And ease of development, in turn, affects time-to-market.

This blog post, second in a series of articles on how to achieve success with AI and machine learning, takes a closer look at how your data habits affect the outcome of your machine-learning-based projects and provides practical tips for what you can do to improve the results. In particular, in this article we look at how data access affects outcomes and what role dataware should play to make this work well.

[The previous blog post in this series was "AI, All Over the Place: Where Does Artificial Intelligence Pay Off?" and relates to content covered in the free eBook AI and Analytics in Production by Ted Dunning and Ellen Friedman.]

Machine learning systems use data from a wide variety of sources, often at large scale. Data may come from databases, files, searchable documents, IoT sensors, or web clicks. But data variety and scale of data are not the only challenges for data collection and handling. Some of the potentially challenging issues that could slow progress in developing machine learning systems are not really under your control. For example, getting permission to collect particular data (especially for sensitive data) or the quality of raw data are just facts of life that must be dealt with. The good news is that for many other challenges, you can avoid or mitigate potential problems through your choice of system design, architecture, and infrastructure. These more sweeping issues should be under your control, and handling them efficiently can prove advantageous to a wider range of projects than just the machine learning or AI systems you are building.

Here, then, are a few practical tips regarding data handling that should improve the success of AI/ML across a wide range of projects.

Data Logistics

It can be tempting to focus your attention on the most exciting aspects of machine learning: the learning model and the choice of algorithm. Of course, these matter to the success of a learning system, but most of the effort and most of the choices required to build a successful system that delivers real business value lie elsewhere. The logistics required for AI and machine learning systems are huge compared with the learning step, and much of the logistics involves data handling. The following figure shows how other aspects of machine learning systems dwarf the learning step. Note the data-focused or data-related parts of the diagram, including data collection, data verification, feature extraction, analysis tools, and process management tools.

Only a small part of ML systems is the learning code itself. The rest is a vast and complex infrastructure that includes various aspects of data collection and processing.

[Figure is based on the article "Hidden Technical Debt in Machine Learning Systems" by Sculley et al. (Google, Inc.)]

The first tip for success is to recognize and plan for the scope of these beyond-the-learning-step issues and to do so in a way that provides a high degree of flexibility. With data collection, for instance, you should consider more than just how you will handle data ingestion for the ML project you currently have under development. Most likely, you will want to add new data sources in the future, for this project or for additional projects, and the new data sources may involve different data formats or structures.
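One way to leave room for new data formats is to keep format-specific parsing behind a small registry, so that adding a source means adding a reader rather than reworking the pipeline. The following is a minimal sketch only: the names READERS, register, and ingest are hypothetical, and the two formats shown (CSV and JSON lines) are assumptions chosen for illustration, not formats named in this article.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterator, List, TextIO

Record = Dict[str, str]

# Hypothetical registry mapping a format name to a reader function.
READERS: Dict[str, Callable[[TextIO], Iterator[Record]]] = {}


def register(fmt: str):
    """Decorator that adds a reader for one data format to the registry."""
    def wrap(fn):
        READERS[fmt] = fn
        return fn
    return wrap


@register("csv")
def read_csv(stream: TextIO) -> Iterator[Record]:
    # The first line is treated as the header row.
    yield from csv.DictReader(stream)


@register("jsonl")
def read_jsonl(stream: TextIO) -> Iterator[Record]:
    # One JSON object per line; values are normalized to strings
    # so that both readers produce records of the same shape.
    for line in stream:
        if line.strip():
            yield {k: str(v) for k, v in json.loads(line).items()}


def ingest(fmt: str, stream: TextIO) -> List[Record]:
    """Read every record from the stream using the reader registered for fmt."""
    try:
        reader = READERS[fmt]
    except KeyError:
        raise ValueError(f"no reader registered for format {fmt!r}") from None
    return list(reader(stream))


csv_records = ingest("csv", io.StringIO("id,temp\n1,20.5\n"))
jsonl_records = ingest("jsonl", io.StringIO('{"id": 1, "temp": 20.5}\n'))
```

In this sketch both calls return records of the same shape, so downstream preparation code does not need to know which source a record came from.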

So the second tip for success is to design architecture and choose infrastructure that support a variety of projects, including some as yet unknown. You should not have to build a separate system for each new project. This commonality is particularly important for machine learning and AI projects, which often are speculative and thus high-risk despite offering the potential for very high returns. If your system, including your data platform, can handle a variety of data sources, structures, tools, and multi-tenancy, you can lower the entry cost for your "2nd project" and thus make it more reasonable to try out high-risk/high-potential AI and machine learning projects.

Data Access

Regardless of the source, data for training models and for input to production machine learning systems needs to undergo data preparation that matches the needs of the particular project. Data preparation can be quite extensive, involving processes such as:

  • Aggregations
  • Deduplications
  • Reformatting
  • Feature extraction
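To make these steps concrete, here is a small sketch that runs deduplication, reformatting, and a per-user aggregation that doubles as feature extraction over a handful of records. The records, field names, and function names are invented for illustration. At the data volumes discussed in this article, this work would normally run in a distributed tool such as Apache Spark rather than in plain Python.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Invented sample records; every value arrives as a string.
raw_events = [
    {"user": "a", "ts": "2019-04-01T10:00:00", "amount": "12.50"},
    {"user": "a", "ts": "2019-04-01T10:00:00", "amount": "12.50"},  # exact duplicate
    {"user": "a", "ts": "2019-04-02T09:30:00", "amount": "7.50"},
    {"user": "b", "ts": "2019-04-01T11:15:00", "amount": "3.00"},
]


def deduplicate(events):
    """Drop records whose (user, ts, amount) triple was already seen."""
    seen = set()
    unique = []
    for e in events:
        key = (e["user"], e["ts"], e["amount"])
        if key not in seen:
            seen.add(key)
            unique.append(e)
    return unique


def reformat(events):
    """Convert string fields to typed values."""
    return [
        {
            "user": e["user"],
            "ts": datetime.fromisoformat(e["ts"]),
            "amount": float(e["amount"]),
        }
        for e in events
    ]


def extract_features(events):
    """Aggregate per user into simple numeric features."""
    amounts = defaultdict(list)
    for e in events:
        amounts[e["user"]].append(e["amount"])
    return {
        user: {"n_events": len(vals), "total": sum(vals), "mean": mean(vals)}
        for user, vals in amounts.items()
    }


features = extract_features(reformat(deduplicate(raw_events)))
```

Because feature extraction tends to be repeated many times during development, keeping each step as a separate function makes it easier to rerun only the part that changed.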

These processes often need to be carried out on large volumes of data, especially when preparing data in advance of the training process. The third tip for success is to use a data platform that scales data storage reliably and cost-effectively, whether you work on-premises or in the cloud. You may eventually consider moving large volumes of data, especially data kept for future reference, to lower-cost archival storage, but keep in mind that machine learning is an iterative style of work. Feature extraction happens repeatedly during trial-and-error exploration in development and again even after successful models are deployed in production, because the world can and will change. Retrieving data from archival storage, such as S3-style object storage, involves greater latency. (There are ways to mitigate that latency as well. Stay tuned for future blog posts.)

Maybe the biggest difference you can make through your conscious choices is to make large-scale data directly available for use by modern machine learning and AI tools. You should not have to repeatedly copy data out of a big data platform in order to make it available to modern tools and legacy systems. The problem lies with systems that rely on HDFS (Hadoop Distributed File System) for scalable data storage. Remember, big data does not equal Hadoop. Here is why that matters.

Many of the standard big data preparation and ETL tools write their output to HDFS, a write-once, read-many file system. But most machine learning and AI tools (along with standard analytics tools) do not natively read from HDFS, necessitating the cumbersome step of copying data out (and copying output back in) for training or running the learning models. This definitely slows development and production systems, but it can be avoided. The MapR Data Platform, for example, provides highly scalable data storage and management with open APIs, including the Hadoop API. So you can store the output of popular open source data preparation tools, such as Apache Spark, directly on the MapR platform. Yet MapR is also POSIX-compliant, which means that modern machine learning tools can read that same data directly, without having to copy data out, as shown in the following figure.
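In practice, POSIX access means that a tool can use ordinary file calls on the prepared data instead of going through an HDFS client or an export step. The sketch below assumes the preparation job wrote CSV part files with a header row into a directory that is visible as a normal path. The function name iter_rows, the part-file naming pattern, and the mount point mentioned in the comment are all assumptions for illustration. A temporary directory stands in for the mounted data so that the example can run anywhere.

```python
import csv
import tempfile
from pathlib import Path


def iter_rows(data_dir):
    """Yield rows from every CSV part file in a directory, using plain file I/O."""
    for part in sorted(Path(data_dir).glob("part-*.csv")):
        with part.open(newline="") as f:
            yield from csv.DictReader(f)


# On a POSIX-accessible platform the call might look like
#   iter_rows("/mnt/datalake/prepared/clicks")   # hypothetical mount point
# Here a temporary directory stands in for that location.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "part-00000.csv").write_text("id,label\n1,cat\n")
    Path(tmp, "part-00001.csv").write_text("id,label\n2,dog\n")
    rows = list(iter_rows(tmp))
```

Nothing in iter_rows is specific to a big data platform, which is the point: any library that can open a local file could read the same data in place.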

One of the biggest advantages that dataware of this nature can offer machine learning and AI systems is that it combines direct data access by data preparation tools (such as Apache Spark and Apache Hive) with direct access by a wide variety of AI and machine learning tools, without having to copy data out for the training and modeling process. MapR dataware is unusual in having both capabilities in the same system. This simplification can make a big difference in time-to-value.

This flexibility conferred by a big data platform that is POSIX-compliant is also helpful because experienced data science teams tend to use a collection of "favorite" machine learning tools. No single tool will fit every situation, so data scientists need freedom of choice. This freedom includes the ability to use the next great AI or machine learning tool to come along. For all these data access reasons, the fourth tip for success is to use dataware that provides highly scalable storage along with direct data access by AI and machine learning tools as well as by legacy systems.

Comprehensive Data

It's no surprise that the success of a particular AI or machine learning system depends in part on the overall architecture and infrastructure of your organization. If your big data systems are somehow set apart from everything else, you may still have unwanted data silos, which can prevent data scientists from easily seeing a comprehensive view of data.

In addition, if it is onerous to access data across data centers or geographical locations, including from the IoT edge, then it will likely be difficult for your data scientists to see a comprehensive view of the particular types of data needed for training or as input for model applications in production. These situations may artificially skew the available datasets and in turn skew the decisions (output) of your AI and machine learning models. These problems certainly will limit performance.
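One rough way to check for this kind of skew is to compare where the records in a training sample came from against the share of data that each location is believed to produce. The sketch below is an illustration under assumptions: the site names, the expected shares, the function name underrepresented, and the tolerance of 0.5 are all hypothetical, and a real check would need thresholds suited to the data.

```python
from collections import Counter


def underrepresented(sample_sites, expected_share, tolerance=0.5):
    """Return sites whose share of the sample is below tolerance * expected share."""
    counts = Counter(sample_sites)
    total = sum(counts.values())
    flagged = []
    for site, share in expected_share.items():
        observed = counts.get(site, 0) / total if total else 0.0
        if observed < share * tolerance:
            flagged.append(site)
    return sorted(flagged)


# Hypothetical origin of each record in a training sample.
sample = ["dc-east"] * 8 + ["dc-west"] * 2
# Hypothetical share of all data that each site is believed to produce.
expected = {"dc-east": 0.5, "dc-west": 0.3, "edge": 0.2}

missing = underrepresented(sample, expected)
```

In this made-up sample the edge location contributes no records at all, so it is the one flagged. That is the sort of gap a silo or a hard-to-reach site would leave behind.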

The fifth tip for success with regard to data-related issues in machine learning is to examine your architecture and potential lines of data flow for unintentional silos and bottlenecks that may limit what data scientists (and machine learning systems) see. To avoid these problems, choose infrastructure that supports both big data and traditional projects on the same system.

This blog post was published April 09, 2019.
