Accelerating Data Access and Production Deployments for NVIDIA-Driven Data Science

Contributed by

5 min read

Today marked significant AI announcements for both NVIDIA and MapR. NVIDIA unveiled their open source RAPIDS framework and their end-to-end data science platform. Their advancements speed model training and enable more iterations, faster, for better model accuracy. Their release also leverages containerized workflows to improve data scientist productivity.

MapR was highlighted as an important NVIDIA partner, and we announced highly complementary data management and logistics capabilities that also include containerized components.

NVIDIA has invested heavily to make core data science operations faster and more efficient for GPU-based processing represented in the green areas of the diagram.

MapR addresses the white space surrounding the green. MapR has invested heavily to make the data access and preparation faster and more efficient. In other words, speeding and simplifying the data management and deployment tasks to maximize the NVIDIA driven data science. MapR has also innovated to make the model deployment and integration with operations more efficient to speed business impact.

Data Logistics Matter

The data science and model training are obviously important, but the accuracy of a model is highly dependent on the quality and amount of data available. Often, the bulk of time spent is acquiring and preparing the right data to fuel the data science pipeline. How data is acquired, delivered, and continues to flow is an essential ingredient to Machine Learning success. In fact in their O'Reilly book, Machine Learning Logistics, authors Ted Dunning and Ellen Friedman point out that data logistics account for 90% of machine learning success. The benefit of MapR is to make these data aspects of the data science as straightforward, scalable, and secure as possible.

Google has published a white paper that provides a supporting perspective. The white paper (which by the way, is one of the best titles I've come across) Machine Learning: The High Interest Credit Card of Technical Debt, underscores the important of data logistics. The downside of getting it right can be costly. According to the Google team of authors,

"Machine learning offers a fantastically powerful toolkit for building complex systems is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning."

Some of the highlights of the MapR solution for AI include the ability to collect data at scale from a variety of sources and preserve raw data so that potentially valuable features are not lost and to avoid the data logistics dependencies warned about in the Google paper. MapR also supports accelerated stream-based delivery of standard files including Parquet, ORC, JSON, Avro, and CSV file formats directly into NVIDIA.

It's not the Science, It's the Impact

The importance of a model is ultimately its business impact. How does the model get integrated into the business operations? What is the continuous learning and adjustment process that follows? How agile is the model maintenance? How do you avoid the type of ongoing maintenance tasks and costs raised in the Google white paper? These questions all point to the importance of the far right of the above diagram. MapR capabilities to integrate model deployment into operations include:

  • Make input and output data available to many independent applications even across geographically distant locations, on premises, in the cloud or at the edge
  • Manage multiple models during development and easily roll into production
  • Improve evaluation methods for comparing models during development and production, including use of a reference model for baseline successful performance

While these announcements are significant, this marks only the beginning. MapR will continue to work with NVIDIA and the ecosystem around Apache Arrow, CUDA and DGX to enable broad adoption and support the largest breadth of AI and analytic workloads.

Read Jim Scott's blog post to learn more on how the MapR Data Platform and NVIDIA's RAPIDS framework complement each other in reducing the amount of effort spent in data logistics.

This blog post was published October 10, 2018.

50,000+ of the smartest have already joined!

Stay ahead of the bleeding edge...get the best of Big Data in your inbox.

Get our latest posts in your inbox

Subscribe Now