Modern Python & PySpark Application Development on MapR

Contributed by Rachel Silver

Python has become the darling language of the data science and data engineering world. It's versatile and powerful, yet easy enough for beginners to use. While we encounter Python developers in every area of IT from web development to network management, we're really seeing the boom right now in machine learning and deep learning application development.

But there's a problem at the intersection of data science and big data: Hadoop does not natively support Python. On a filesystem like MapR-XD, this is less of an issue, since any library that supports parallel computation can use MapR-XD as a Direct NFS storage layer. If you want to leverage Apache Hadoop YARN for distributed computation, however, you are limited to the Spark Python API (PySpark).

While MapR already supports Python and PySpark with MapR-ES for global event streaming, we wanted to move even further in our support for Python developers in the community and on MapR. With MEP 4.1, we launched Python bindings for the MapR-DB OJAI Connector for Apache Spark to enable PySpark jobs to read and write to the MapR-DB document database via the OJAI API.

Modern big data applications store data in various ways. For example, a customer application could store and leverage customer data on the filesystem, but then make a customer segmentation accessible to realtime apps in a more efficient way by processing these with Apache Spark and storing the output to a database. With its flexible data model and easy-to-use API, MapR-DB JSON is a good candidate for this. This enables Python developers to use their favorite language and libraries to read data from either the filesystem or the DB, based on requirements, and then store that Spark data structure as a JSON document in the database.
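As a sketch of that pattern (assuming a MapR cluster with the connector and its Python bindings installed; the table path, file path, and column names here are made up for illustration), a PySpark job can read raw data from the filesystem, derive a segmentation with SparkSQL, and persist each row as a JSON document in MapR-DB:

```python
# Sketch only: requires a MapR cluster with the MapR-DB OJAI Connector
# for Apache Spark and its Python bindings. Paths and columns are
# hypothetical, not from the tutorial below.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("segmentation").getOrCreate()

# Read raw customer data from the distributed filesystem
customers = spark.read.json("/user/mapr/customers.json")

# Derive a simple segmentation with SparkSQL
customers.createOrReplaceTempView("customers")
segments = spark.sql(
    "SELECT _id, CASE WHEN total_spend > 1000 THEN 'gold' "
    "ELSE 'standard' END AS segment FROM customers")

# Persist each row as a JSON document in a MapR-DB table
segments.saveToMapRDB("/user/mapr/customer_segments")
```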


In addition, we launched version 1.1 of the MapR Data Science Refinery, which adds support for distributed Python archives, enabling PySpark developers to easily access the libraries they need from wherever their jobs run in the cluster.

A pain point for PySpark developers has been that the Python version and libraries they need must exist on every node in the cluster that runs Spark. This is possible to maintain, but increases the IT management burden and creates friction between data science teams and IT administration.


The newest Data Science Refinery release eases this burden by allowing you to specify a Python archive at runtime that contains all necessary libraries and their dependencies. These libraries can be stored in the global namespace and are unzipped into staging directories at runtime to make them available to distributed PySpark jobs.
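Outside of the Data Science Refinery, the same idea can be sketched with stock Spark's `--archives` mechanism (the archive name, alias, and script name below are illustrative):

```shell
# Illustrative only: ship a zipped Python environment with the job and
# point the driver's and executors' Python at the unpacked copy.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives my_env.zip#my_env \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./my_env/bin/python \
  my_job.py
```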


Why is this useful? Well, it means that nearly all Python use cases can be accommodated on MapR. Python developers now have easy distributed access to MapR-DB and MapR-ES and have long had full access to MapR-XD via Direct Access NFS. In addition, we've made it easy to manage Python libraries in the MapR Data Science Refinery.

Putting This to Work

Goal: Peruse the Yelp Open Dataset and plot the probability of receiving a particular rating using MatPlotLib, PySpark, SparkSQL, and MapR-DB. This tutorial assumes you've already uploaded the JSON dataset from here to your distributed file system and untarred it into the /user/mapr/ directory.

Step 1: Create a Python environment and store it to MapR-XD.

Detailed steps for doing this with Conda can be found here, but the overall process is as follows:

  1. Create a Python environment with Pandas and MatPlotLib:
    conda create -p mapr_yelp_tutorial/ python=2 pandas matplotlib

  2. Zip this directory up from inside the directory:
    cd mapr_yelp_tutorial/
    zip -r mapr_yelp_tutorial.zip ./

  3. Store this to MapR-XD:
    hadoop fs -put mapr_yelp_tutorial.zip /user/mapr/python_envs/

Step 2: Load the MapR Data Science Refinery and specify the Python archive created earlier in the Docker run command or environment variable file.

  1. Set the following variable either in the Docker Run command or in the environment variables file you're using: ZEPPELIN_ARCHIVE_PYTHON=/user/mapr/python_envs/mapr_yelp_tutorial.zip

  2. Log into Apache Zeppelin on the specified host and port.

  3. Download or copy the link for our demo notebook from here and import it into Zeppelin.
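A docker run invocation along these lines would pass the variable through (the image name, port, and archive path are illustrative; see the Data Science Refinery documentation for the full set of required options):

```shell
# Illustrative: pass the Python archive path into the
# Data Science Refinery container at launch time.
docker run -it \
  -p 9995:9995 \
  -e ZEPPELIN_ARCHIVE_PYTHON=/user/mapr/python_envs/mapr_yelp_tutorial.zip \
  maprtech/data-science-refinery
```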


Step 3: Run the code!

We use the following functionality:

  1. Import the business JSON file to MapR-DB: mapr importJSON -idField business_id -src business.json -dst /user/mapr/business

  2. Query the data with Drill to see a scatter plot of the ratings distribution:

SELECT stars, review_count FROM dfs.`/mapr/`


  3. Load the data from MapR-DB into a PySpark DataFrame using the MapR-DB OJAI Connector for Apache Spark. Query this data using SparkSQL to generate a filtered view. Finally, chart the probability of each rating using MatPlotLib.
#Query the business table in SparkSQL after loading it from MapR-DB
#using the MapR-DB OJAI Connector for Apache Spark
from pyspark.sql import SparkSession

# In Zeppelin, `spark` is already provided; getOrCreate() reuses it
spark = SparkSession.builder.getOrCreate()
df = spark.loadFromMapRDB("/user/mapr/business")

# Create SQL view of all businesses and query it
df.createOrReplaceTempView("tempview")
sqlDF = spark.sql("SELECT _id, state, stars from tempview")

#Plot histogram of the data resulting from our query
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO

import matplotlib
matplotlib.use('Agg')  # headless backend for rendering inside Zeppelin
import matplotlib.pyplot as plt
import pandas

def show(plt):
    img = StringIO()
    plt.savefig(img, format='svg')
    print(r'%html ' + img.getvalue())

plt.hist(sqlDF.select('stars').toPandas()['stars'], histtype='stepfilled')
show(plt)


Interesting that there seems to be zero probability of getting 2.6-3.0 stars, right?
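For intuition, the "probability" plotted above is just each rating's relative frequency. With a toy sample of ratings (made-up data, not the Yelp dataset), the same numbers fall out of a quick count in plain Python:

```python
# Toy example: relative frequency of each star rating.
# The ratings list is made-up sample data for illustration.
from collections import Counter

ratings = [4.0, 4.5, 5.0, 4.0, 3.5, 4.0, 2.0, 4.5, 5.0, 4.0]

counts = Counter(ratings)
total = len(ratings)
probabilities = {stars: count / float(total) for stars, count in counts.items()}

print(probabilities[4.0])  # 4 of 10 ratings are 4.0 stars -> 0.4
```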

MapR recognizes that Python is powerful, versatile, and very useful in traditional and new data pipelines and will continue to seek out ways to support Python developers across all engines and applications.

This blog post was published February 09, 2018.
