NVIDIA has announced the RAPIDS data science framework, a set of libraries for executing an end-to-end data science training pipeline entirely on the GPU. This is an incredibly important step forward in accelerating AI use cases. Prior to this release, leveraging the GPU for these workloads required substantial expertise with the CUDA libraries. While extremely powerful, CUDA requires expertise that many don’t have, which creates a steep barrier to entry for those interested in pushing the GPU to its limits.
With this release, NVIDIA has taken a major leap forward in simplifying the use of the GPU by doing the hard work for all of us who want to take advantage of its accelerated compute capabilities. They have created versions of Pandas and Scikit-Learn that easily benefit from a GPU. They have also created XGBoost algorithms that benefit from the massively parallel performance of a GPU.
They enabled this innovation by building on the Apache Arrow in-memory columnar format and creating a CUDA DataFrame (cuDF), which lets the data to be processed be placed directly into GPU memory. For those unaware, Apache Arrow was born out of the Apache Drill project and its value vector in-memory data representation. These dataframes help to get the data closer to the physical hardware and help to avoid additional memory copies, which speeds up processing. In addition, this methodology helps to free up the CPU. I feel like most of the time people take the CPU for granted. If reducing memory copies and data movement can free up 10% of the CPU, and that 10% is then applied to a real problem, then that is effectively a 20% opportunity cost. This is pretty substantial when considering the volume of use cases where people want to use vast amounts of data to solve their problems with AI solutions.
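To make the "fewer memory copies" point concrete, here is a minimal sketch of the columnar idea in plain Python. It is not Arrow or cuDF code. It only uses the standard library's `array` and `memoryview` to show, as a simplification, how a column kept in one contiguous buffer can be sliced and shared without copying. The column name `temperatures` and its values are made up for illustration.

```python
from array import array

# One column stored as a single contiguous buffer of 64-bit floats,
# similar in spirit to a columnar value vector (illustrative data only).
temperatures = array("d", [20.5, 21.0, 19.8, 22.3, 23.1, 18.7])

# A memoryview exposes the same buffer without copying it.
column_view = memoryview(temperatures)

# Slicing a memoryview yields another view over the same memory.
window = column_view[2:5]

# Slicing the array itself allocates an independent copy.
copied = temperatures[2:5]

# Write through the original buffer: the zero-copy view sees the change,
# the copy does not.
temperatures[3] = 99.9

print("view:", window.tolist())
print("copy:", copied.tolist())
```

The view costs almost nothing to create no matter how large the column is, while every copy costs memory and CPU time proportional to the data. That is the overhead a shared columnar representation is meant to avoid.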
If we step back a bit and look at the bigger picture, we can begin to home in on a broader set of capabilities where RAPIDS can easily be plugged in. Kubernetes (k8s) and containers give us an isolation model that allows applications to be moved quickly between physical environments (on-premises, cloud, or both) with minimal effort. MapR supports injecting storage into containers with the k8s FlexVolume driver. NVIDIA supports exposing GPUs to a container with k8s. The combination of these two capabilities is really something, because it enables access to all the data and the GPUs available in your environment without having to build extra-heavy containers. We delegate these capabilities to k8s, which lightens and simplifies the CI/CD pipeline.
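As a rough sketch of what that combination can look like, the snippet below builds a Pod manifest as a Python dictionary and prints it as JSON, which `kubectl` accepts alongside YAML. The `nvidia.com/gpu` resource limit is the name exposed by NVIDIA's Kubernetes device plugin, and `flexVolume` with `driver` and `options` is the standard Kubernetes volume field. Everything else is an assumption for illustration: the function `build_training_pod`, the image name, the driver string, and the volume path are placeholders, not values taken from MapR or NVIDIA documentation.

```python
import json


def build_training_pod(name, image, gpus=1):
    """Return a Pod manifest (as a dict) that requests GPUs and mounts
    external storage through a FlexVolume. Hypothetical helper."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "trainer",
                "image": image,
                # GPU request handled by the cluster, not baked into the image.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
                "volumeMounts": [
                    {"name": "shared-data", "mountPath": "/data"},
                ],
            }],
            "volumes": [{
                "name": "shared-data",
                "flexVolume": {
                    # Placeholder driver name and options; use the values
                    # documented for your storage driver.
                    "driver": "example.com/storage-driver",
                    "options": {"volumePath": "/projects/training"},
                },
            }],
        },
    }


manifest = build_training_pod("rapids-training", "example/rapids-image:latest")
print(json.dumps(manifest, indent=2))
```

Note that the container image carries neither the data nor any storage client configuration. Both the GPU and the volume are requested declaratively, which is what keeps the container light.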
Looking beyond the training part of the pipeline, these capabilities benefit production as well, where we need to handle real-time event-based data. The creation of our models benefits from RAPIDS, and now the models can respond to events in a highly scalable way. MapR Event Streams provides the ability to perform analytics and gather data at the edge for IoT scenarios, and it can also automatically replicate the data to other locations for larger-scale continuous learning, further enabled by RAPIDS.
Another key piece of enabling technology provided by MapR is its ability to create point-in-time views of all of the data (streams, files, databases), as well as models and source code, so that work can run against data that is fixed at a point in time. This gives the user the ability to version everything together. It is a great way to provide a sanity check, to make sure things are moving in the right direction, and to ask as many questions as necessary without the data changing. No data copy occurs in this scenario, and it can easily handle any volume of data.
MapR can also place data adjacent to specialized hardware like GPUs. This can prove extremely beneficial when running workloads over substantial volumes of data. Often, others in the industry act as if this doesn’t matter. However, most people do not have infinite bandwidth available on their network, which means we should care about how long it takes to move data across it.
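A quick back-of-the-envelope calculation shows why this matters. The sketch below estimates how long it takes to move a dataset across a link of a given nominal speed. The 10 TB dataset size and the link speeds are illustrative assumptions, not measurements, and the function `transfer_seconds` is a hypothetical helper that ignores protocol overhead unless you pass an `efficiency` below 1.

```python
def transfer_seconds(data_bytes, link_gbits_per_s, efficiency=1.0):
    """Estimate the time to move data_bytes over a network link.

    link_gbits_per_s is the nominal link speed in gigabits per second;
    efficiency (0 < efficiency <= 1) is the usable fraction of that speed.
    """
    if link_gbits_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    usable_bits_per_s = link_gbits_per_s * 1e9 * efficiency
    return data_bytes * 8 / usable_bits_per_s


TB = 10**12  # decimal terabyte, in bytes

for gbits in (1, 10, 100):
    seconds = transfer_seconds(10 * TB, gbits)
    print(f"10 TB over {gbits:>3} Gbit/s: {seconds / 3600:.2f} hours")
```

Even on a dedicated 10 Gbit/s link running at full speed, this works out to more than two hours for 10 TB before any processing starts, and real networks are shared and rarely reach their nominal rate.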
Ideally, we want to solve all these data logistics issues so that data scientists can focus on their core competencies instead of the logistics. Put the data where it needs to be, not a copy of it, because copies cause problems. Data scientists also need access to enough compute to solve their problems. The more we can free the CPU from tasks like moving data from disk to CPU memory to GPU memory, the more CPU is available to solve the problems of the business.
RAPIDS takes the burden off the CPU by moving workloads to the GPU, where they will run faster because really capable software engineers did the hard work of knowing exactly how to best use CUDA and get the most out of the GPU. By combining these new libraries, containers, and k8s on MapR, we can reduce the amount of effort wasted on data logistics.