The Importance of Data Science


Data scientists are a unique breed, focused on drawing specific business, medical, financial, and manufacturing conclusions from data. They form hypotheses from initial analysis, then continuously test and iteratively refine those conclusions, gathering proof points along the way to ensure that the conclusions reached remain correct and true.

Data science is changing every business.

The data science community faces two major challenges: collecting and managing the iterative models that represent these conclusions, and ensuring that the data collected is the right data to run those models on.

Collecting and managing these models requires that data scientists have their own workbooks. These workbooks have unique requirements: they must retain the models that were run, identify the data set they were run against, and track how those models were iteratively changed to reach the proof point. There are many ways to solve this challenge. Proprietary workbenches were the approach most frequently adopted in early implementations, but MapR chose a different path. Leveraging the innovation engine that is open source, MapR adopted technologies like Zeppelin and Jupyter as our standard workbooks for supporting this active and highly dynamic group of engineers. Our rationale was simple: why constrain some of the most creative people in the engineering community by restricting how they work and, at the same time, prevent them from contributing their learnings back to the broader data science community?

We now deliver these capabilities as part of the MapR solution. In running our company on our own technology, feedback from our internal data scientists, customers, and the community made it abundantly clear that we needed to listen, evaluate, and adopt what those experts recommended. These workbooks are the entry point into all of our products for the data science community of users, and they have found broad adoption and success, the two metrics we test at MapR every time we decide to add capabilities to our data fabric.
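The workbook requirements above (retain the model that was run, identify the data set it was run against, and track iterative changes) can be sketched in a few lines of plain Python. This is a hypothetical illustration, not MapR's or any notebook's actual implementation; the names `ModelRun` and `dataset_fingerprint` are assumptions introduced here.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ModelRun:
    """One iteration of a model, tied to the exact data it was run against."""
    model_name: str
    params: dict
    dataset_id: str  # fingerprint of the input data set
    metrics: dict = field(default_factory=dict)

def dataset_fingerprint(rows):
    """Hash the raw rows so every run can be traced back to its data."""
    payload = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

# Record two iterations of the same model against the same data set.
rows = [{"x": 1, "y": 2}, {"x": 3, "y": 4}]
ds_id = dataset_fingerprint(rows)
history = [
    ModelRun("churn_model", {"depth": 3}, ds_id, {"auc": 0.71}),
    ModelRun("churn_model", {"depth": 5}, ds_id, {"auc": 0.78}),
]

# Each entry retains the model, its parameters, and the data it ran on,
# so iterative improvements can be compared and reproduced later.
for run in history:
    print(asdict(run))
```

Because both runs share the same `dataset_id`, a later reader can verify that the improvement in the metric came from the model change, not from a change in the underlying data.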

It has been estimated that eighty percent of a data scientist's time is spent curating and preparing data for use in finding these truths. MapR's fundamental philosophy is that customers want to bring the compute to where the data is, rather than always creating a vast pool of data to run analytics on. Pooling information is a legacy of the past and represents, in many ways, the internal IT perspective on the problem: move the data into a single silo and let analytics, typically in the form of reporting, run wild. We believe the next generation of applications will have analytics built into the production system itself. We have demonstrated this by bringing analytics directly into streams of data and by moving the analytic engine to edge computing.
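The kind of curation work that consumes so much of a data scientist's time can be as mundane as normalizing units and discarding unparseable records before any model sees the data. The following is a toy sketch of that step; the field names and records are invented for illustration.

```python
# Hypothetical raw sensor readings: inconsistent whitespace, mixed units,
# and one record that cannot be parsed at all.
raw = [
    {"temp": " 21.5 ", "unit": "C"},
    {"temp": "70.1", "unit": "F"},
    {"temp": "n/a", "unit": "C"},  # unusable record
]

def curate(records):
    """Convert every reading to Celsius, dropping records that can't be parsed."""
    clean = []
    for r in records:
        try:
            value = float(r["temp"].strip())
        except ValueError:
            continue  # discard malformed readings instead of failing
        if r["unit"] == "F":
            value = (value - 32) * 5 / 9  # normalize Fahrenheit to Celsius
        clean.append(round(value, 1))
    return clean

print(curate(raw))  # → [21.5, 21.2]
```

Only after a pass like this does the remaining twenty percent of the work, the actual modeling, begin on data that is consistent enough to trust.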

MapR understands the challenges and opportunities facing data scientists. We have built an entire suite of products, leveraging our data platform, to address the unique challenges that this powerful set of customers faces.

To learn more about the MapR Data Science Refinery, go here.

This blog post was published December 11, 2017.