Python has become the darling language of the data science and data engineering world. It's versatile and powerful, yet easy enough for beginners to use. While we encounter Python developers in every area of IT from web development to network management, we're really seeing the boom right now in machine learning and deep learning application development.
But there's a problem where data science and big data intersect: Hadoop has no native support for Python. On a filesystem like MapR-XD, this is less of an issue, since any library that supports parallel computation can use MapR-XD as a Direct NFS storage layer. If you want to leverage Apache Hadoop YARN for distributed computation, however, you are limited to the Spark Python API (PySpark).
While MapR already supports Python and PySpark with MapR-ES for global event streaming, we wanted to move even further in our support for Python developers in the community and on MapR. With MEP 4.1, we launched Python bindings for the MapR-DB OJAI Connector for Apache Spark to enable PySpark jobs to read and write to the MapR-DB document database via the OJAI API.
Modern big data applications store data in various ways. For example, a customer application could store and leverage customer data on the filesystem, then process it with Apache Spark and write the output to a database, making a customer segmentation available to real-time apps far more efficiently. With its flexible data model and easy-to-use API, MapR-DB JSON is a good candidate for this. This lets Python developers use their favorite language and libraries to read data from either the filesystem or the database, depending on requirements, and then store the resulting Spark data structure as a JSON document in the database.
In addition, we launched version 1.1 of the MapR Data Science Refinery, which adds support for distributed Python archives, enabling PySpark developers to easily access the libraries they need from wherever their jobs run in the cluster.
A pain point for PySpark developers has been that the Python version and libraries they need must exist on every node in the cluster that runs Spark. This is possible to maintain, but increases the IT management burden and creates friction between data science teams and IT administration.
The newest Data Science Refinery release eases this burden by allowing you to specify a Python archive at runtime that contains all necessary libraries and their dependencies. These libraries can be stored in the global namespace and are unzipped into staging directories at runtime to make them available to distributed PySpark jobs.
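As a sketch of the general mechanism at the Spark level (file names here are from the tutorial below; the exact Data Science Refinery settings are in its documentation), shipping an archive with a YARN job and pointing the executors at the Python inside it typically looks like:

```shell
# Ship the zipped environment with the job. The '#env' suffix is the alias
# under which YARN unzips the archive in each container's staging directory.
spark-submit \
  --master yarn \
  --archives maprfs:///user/mapr/python_envs/mapr_yelp_tutorial.zip#env \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./env/bin/python \
  my_job.py
```

The alias keeps the path stable regardless of the archive's file name, so the `PYSPARK_PYTHON` setting works on every node.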
Why is this useful? Well, it means that nearly all Python use cases can be accommodated on MapR. Python developers now have easy distributed access to MapR-DB and MapR-ES and have long had full access to MapR-XD via Direct Access NFS. In addition, we've made it easy to manage Python libraries in the MapR Data Science Refinery.
Goal: Peruse the Yelp Open Dataset and plot the probability of receiving a particular rating using Matplotlib, PySpark, SparkSQL, and MapR-DB. This tutorial assumes you've already uploaded the JSON dataset from here to your distributed file system and untarred it into the /user/mapr/ directory.
Step 1: Create a Python environment and store it to MapR-XD.
Detailed steps for doing this with Conda can be found here, but the overall process is as follows:
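As a rough sketch of the first step (the Python version and library list are illustrative; follow the linked Conda instructions for the exact commands), you create a self-contained environment directory holding everything the job needs:

```shell
# Create a self-contained Python environment in ./mapr_yelp_tutorial
# with the libraries this tutorial uses (versions are an assumption)
conda create -y -p ./mapr_yelp_tutorial python=2.7 matplotlib pandas
cd mapr_yelp_tutorial
```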
Zip this directory up from inside the directory:
zip -r mapr_yelp_tutorial.zip ./
Store this to MapR-XD:
hadoop fs -put mapr_yelp_tutorial.zip /user/mapr/python_envs/
Step 2: Load the MapR Data Science Refinery and specify the Python archive created earlier in the Docker run command or environment variable file.
Set the following variable either in the Docker Run command or in the environment variables file you're using:
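For example, assuming the variable is ZEPPELIN_ARCHIVE_PYTHON (the name used in the Data Science Refinery documentation) and the archive path from Step 1:

```shell
# In the env file passed to docker run via --env-file:
ZEPPELIN_ARCHIVE_PYTHON=/user/mapr/python_envs/mapr_yelp_tutorial.zip

# Or inline on the docker run command (other options elided):
docker run -it -e ZEPPELIN_ARCHIVE_PYTHON=/user/mapr/python_envs/mapr_yelp_tutorial.zip ...
```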
Log into Apache Zeppelin on the specified host and port.
Step 3: Run the code!
We use the following functionality:
Import the business JSON file to MapR-DB:
mapr importJSON -idField business_id -src business.json -dst /user/mapr/business
Query the data with Drill to see a scatter plot of the ratings distribution:
SELECT stars, review_count FROM dfs.`/mapr/my.cluster.com/user/mapr/business`
# Query the business table with SparkSQL via the MapR-DB OJAI Connector for Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MapRDBOJAIConnectorPythonAPI").getOrCreate()

df = spark.loadFromMapRDB("/user/mapr/business")
df.createOrReplaceTempView("tempview")  # Create SQL view of all businesses
sqlDF = spark.sql("SELECT _id, state, stars FROM tempview")

# Plot a histogram of the data resulting from our query
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO       # Python 3
import matplotlib
matplotlib.use('svg')             # Select a non-interactive backend before importing pyplot
import matplotlib.pyplot as plt
import pandas

def show(plt):
    # Render the current figure as SVG and emit it through Zeppelin's %html display
    img = StringIO()
    plt.savefig(img, format='svg')
    img.seek(0)
    print('%html ' + img.getvalue())

# density=True normalizes the counts so the y-axis reads as a probability
plt.hist(x=sqlDF.select('stars').toPandas(), histtype='stepfilled', density=True)
plt.xlabel('Stars')
plt.ylabel('Probability')
plt.grid(True)
show(plt)
plt.close()
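The counting-and-normalizing step can be checked without a cluster. A minimal standalone sketch (pandas only; the star values are synthetic stand-ins for the Yelp 'stars' column, which comes in 0.5-star increments):

```python
import pandas as pd

# Synthetic ratings standing in for sqlDF.select('stars').toPandas()
stars = pd.Series([3.5, 4.0, 4.0, 4.5, 5.0, 3.5, 4.0, 2.0])

# Normalized value counts give the empirical probability of each rating,
# which is what the histogram's y-axis is meant to show
probs = stars.value_counts(normalize=True).sort_index()
print(probs)
```

Ratings absent from the data (here, anything between 2.0 and 3.5) simply get zero probability, which is why gaps appear in the plotted distribution.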
Interesting that there seems to be zero probability of getting 2.6-3.0 stars, right?
MapR recognizes that Python is powerful, versatile, and very useful in traditional and new data pipelines and will continue to seek out ways to support Python developers across all engines and applications.