Data scientists can solve some very difficult problems by leveraging machine learning and AI-based technologies. Solving these problems is easier said than done, however, when corporate infrastructure and data management systems are not up to the task.
Data logistics consume upwards of 90% of a data scientist's time on these problems. Given that, more must be done to enable these specialists to accomplish their goals with less pain.
A typical data science workflow resembles the following steps:

1. Identify data sources
2. Identify tools
3. Write code
4. Train the model
5. Test the model
6. Analyze the results
7. Repeat until satisfied
8. Figure out a way to get the solution into production
The bigger the problem to be solved, the bigger the data logistics challenge tends to be, because larger problems usually require larger data sets, and that leads to longer cycle times. If the training data set is 10 terabytes, the user is unlikely to be happy pulling it across the network every time they need to run another training cycle. On top of that, the output of every test run must be kept around. Constantly copying data back and forth across the network is tedious and time-consuming.
This is where we can improve the daily life of the data scientist. Putting a Data Science Workstation (DSWS) on the user's desktop expedites the entire lifecycle by bringing the necessary GPU horsepower to the user's fingertips. Combine the DSWS with a dataware solution that creates a uniform data fabric, and suddenly the data scientist no longer has to worry about how to get the data they need, whether it lives in a NoSQL document database, an event stream (real-time events), or files.
Imagine leveraging a smart mirroring capability that provides bidirectional support for all of those data types. The data scientist doesn't have to figure out how to get a copy of the data from an event stream or a NoSQL table; it is simply there to use. The output from tests can be mirrored back to the central hub without manually running the copy steps they would normally perform.
If we combine the DSWS with Kubernetes (k8s), then a tool like Kubeflow can be used by the data scientist to build their entire pipeline. The pipeline can then be run in the production environment with nearly zero additional effort because the system architecture is identical on the DSWS and in production. The shape of each environment, in total compute and storage, is very different, but k8s abstracts away the physical environment, enabling considerably shorter development-to-production lifecycles. This removes a lot of friction: the data scientists don't need to understand everything about the production systems, and the systems administrators don't need to understand every detail of the data science workflow.
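To make that portability concrete, here is a minimal sketch of a Kubernetes Job manifest for a single training step. The job name, container image, and `training-data` claim name are hypothetical. Because the manifest reaches storage through a PersistentVolumeClaim rather than a physical path, the same spec can run unchanged on a single-node k8s install on the DSWS and on the production cluster; only the cluster-side storage class and GPU capacity differ.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model                 # hypothetical job name
spec:
  template:
    spec:
      containers:
      - name: trainer
        image: registry.example.com/train:latest   # hypothetical training image
        args: ["--data", "/mnt/data"]
        resources:
          limits:
            nvidia.com/gpu: 1       # one GPU, scheduled via the NVIDIA device plugin
        volumeMounts:
        - name: data
          mountPath: /mnt/data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: training-data  # PVC backed by the data fabric (assumed)
      restartPolicy: Never
```

The same indirection works for Kubeflow pipeline steps, which ultimately run as pods against the same cluster abstractions.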
Taking this model a step further, the same architecture can be used at the edge. If we have a lot of horsepower running in our data center, DGX Pod style, we likely want to be processing massive data sets. Consider all the data collected at the edge in scenarios like connected cars or the oil and gas industry. The data is not only collected at the edge; a certain amount of intelligence is applied there as well. This architecture allows for high-speed, low-latency capture of the data as it is processed for each specific edge use case.
These edge use cases are normally driven by machine learning models that have been trained over massive amounts of data in a DGX Pod. With smart mirroring provided by a data platform that extends across data types and locations, data management and security are built in, and continuous training can occur on the DGX Pod. New models can be trained and tested, then pushed back to the edge.
The processing power of the Data Science Workstation tremendously simplifies the development and testing phases of the lifecycle. Add the abilities of Kubernetes and Kubeflow, and moving those solutions into production becomes even simpler. Push the same solutions to the edge without creating a new architecture or solution stack, and suddenly we have more power at our fingertips than ever before. However, without the dataware layer providing data management, movement, and security, none of this could run on a single architecture. The dataware removes a tremendous amount of latency from the process and elevates the stack from duct tape and baling wire to a proper enterprise solution that is easy to manage and enables deployment agility, with data management between and across the edge, cloud, and on-premises.