How To: Leveraging Python Environments from DSR (Conda)

At some point, you're going to need to run some of the popular Python libraries that everybody is talking about, like Matplotlib or SciPy. You may have noticed that it's not as simple as installing them on your local machine and submitting jobs to the cluster: for the Spark executors to access these libraries, they have to live on each of the Spark worker nodes.

While you could manually install each of these environments across the cluster using pip or Anaconda, that approach carries a fair bit of IT overhead: someone has to manage the installations, keep them current, and accommodate each developer's preferences.

MapR recommends creating the environment you want to use, zipping it up, storing it in MapR-FS, and then letting Spark distribute the archive across the cluster. There is a small impact on job spin-up time, depending on the size of the archive, but the advantages are:

  • No IT involvement: archives are unzipped into the YARN temporary staging directory at runtime and then removed when the job is complete.
  • Collaboration: many users can share one environment.
  • Easy to customize: if you need to make changes, it's very simple to alter and then store back to your global namespace.
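As a rough sketch of the mechanism described above (the archive name, alias, and paths here are hypothetical; substitute the archive you actually build and store later in this post), a PySpark job can ship the environment with Spark's --archives option:

```shell
# Hypothetical location -- substitute the archive you stored in MapR-FS
ARCHIVE="maprfs:///user/mapr/python_envs/mapr_numpy.zip"

# "#mapr_numpy" is the alias YARN unpacks the archive under in each
# container's staging directory; because the zip is built from inside the
# env directory (see below), the interpreter lands at ./mapr_numpy/bin/python
spark-submit \
  --master yarn \
  --archives "${ARCHIVE}#mapr_numpy" \
  --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=./mapr_numpy/bin/python" \
  --conf "spark.executorEnv.PYSPARK_PYTHON=./mapr_numpy/bin/python" \
  your_job.py
```

YARN cleans the unpacked copy out of the staging directory when the job finishes, which is what makes the no-IT-involvement point above work.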

This post walks through how to do this with Conda. The steps are also available in our docs:

Installing Custom Packages for PySpark Using Conda

Use an Existing Conda from the MapR Data Science Refinery

Launch the MapR Data Science Refinery container, specifying the path to your Python archive with the ZEPPELIN_ARCHIVE_PYTHON environment variable, either in the docker run command or in your environment-variable file:

docker run -it [...]  
-e ZEPPELIN_ARCHIVE_PYTHON=/path/to/python_envs/ [...]

MSG: Copying archive from MapR-FS: /user/mapr/python_envs/ -> /home/mapr/zeppelin/archives/zip/
MSG: Extracting archive locally
MSG: Configuring Spark to use custom Python
MSG: Configuring Zeppelin to use custom Python with Spark interpreter

If you built the archive using the example below, this would be the path in MapR-FS where you stored it (under /user/mapr/python_envs/).
Now this environment is available to you in Apache Zeppelin and for all PySpark jobs. You can test it by checking your Python version from a notebook paragraph.
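For example, a quick check in a Zeppelin %pyspark paragraph (a minimal sketch; the exact version string depends on the archive you built) might look like:

```python
# Run inside a %pyspark paragraph in Zeppelin (or any PySpark job)
import sys

# With the archive active, this should report the interpreter shipped
# in your Conda environment rather than the system Python
print(sys.version)
print(sys.executable)
```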

Create a New Conda Using the MapR Data Science Refinery

Continuum Analytics provides an installer for Conda called Miniconda, which contains only Conda and its dependencies, and this installer is what we’ll be using today. You can also install the full build of Anaconda if you prefer.

wget -P /tmp/
sudo yum install bzip2
bash /tmp/

We typically recommend the default settings here, but feel free to change them to suit your needs.

Next, we create the Python environment that we want to package as a Conda. For this example, we'll create one with Python 2.7.x and the NumPy library, and call it 'mapr_numpy'.

mkdir mapr_numpy
[mapr@]$ /home/mapr/miniconda3/bin/conda create -p ./mapr_numpy python=2 numpy
Fetching package metadata ...........
Solving package specifications: .
Package plan for installation in environment /home/mapr/mapr_numpy:
The following NEW packages will be INSTALLED:
numpy: 1.14.0-py27...
openssl: 1.0.2n-hb7f436b_0
pip: 9.0.1-py27...
Proceed ([y]/n)? y

You can test that this Conda was created correctly by checking the Python version:

./mapr_numpy/bin/python -V
Python 2.7.14 :: Anaconda, Inc.
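Beyond the interpreter version, it's worth confirming that NumPy itself imports from the new environment. A minimal sketch (run it with ./mapr_numpy/bin/python) might be:

```python
# Sanity-check the NumPy packaged into the new environment
import numpy as np

a = np.arange(10)
print(np.__version__)  # the version conda installed into the env
print(a.mean())        # -> 4.5
```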

Store Conda to MapR-FS

First, we need to zip this environment up (use whichever tool you prefer):

cd mapr_numpy/
zip -r ./

Then, store it to a directory you have access to in MapR-FS:

hadoop fs -put /user/mapr/python_envs/

This blog post was published January 17, 2018.
