At some point, you're going to need to run some of the popular Python libraries that everybody is talking about, like Matplotlib or SciPy. You may have noticed that it's not as simple as installing them on your local machine and submitting jobs to the cluster: for the Spark executors to access these libraries, they have to live on each of the Spark worker nodes.
While you could go through and manually install each of these environments across the cluster using pip or Anaconda, that carries a fair bit of IT overhead: someone has to manage the installations, keep versions current, and accommodate individual developers' preferences.
MapR recommends creating the environment that you want to use, zipping it up, storing the archive in MapR-FS, and then leveraging Spark to distribute the binary across the cluster. There is some minimal impact to job spin-up time, depending on the size of the archive, but in exchange there is no per-node installation to manage, and every executor runs an identical, self-contained environment.
This post walks through how to do this with Conda. The steps are also available in our docs.
Launch the MapR Data Science Refinery container, specifying the path to your Python archive either on the docker run command line or in an environment variable file:

```shell
docker run -it [...] -e ZEPPELIN_ARCHIVE_PYTHON=/path/to/python_envs/custom_pyspark_env.zip [...] maprtech/data-science-refinery:v1.1_6.0.0_4.1.0_centos7
[...]
MSG: Copying archive from MapR-FS: /user/mapr/python_envs/mapr_numpy.zip -> /home/mapr/zeppelin/archives/zip/mapr_numpy.zip
MSG: Extracting archive locally
MSG: Configuring Spark to use custom Python
MSG: Configuring Zeppelin to use custom Python with Spark interpreter
[...]
```
If you built this archive using the example below, the path would be /user/mapr/python_envs/mapr_numpy.zip.
Now this environment is available to you in Apache Zeppelin and for all PySpark jobs. You can test this by checking your Python version:
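One way to check is a short paragraph run under Zeppelin's %pyspark interpreter. A minimal sketch (the exact paths and versions printed will depend on the environment you packaged):

```python
import sys

# Print which interpreter this PySpark job is actually using.
# If the custom archive was picked up, sys.executable should point
# inside the extracted environment rather than the system Python.
print(sys.executable)
print(sys.version)
print("major.minor: %d.%d" % (sys.version_info[0], sys.version_info[1]))
```

If the executable path still points at the system Python, double-check the ZEPPELIN_ARCHIVE_PYTHON setting and the archive path in MapR-FS.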
Continuum Analytics provides an installer for Conda called Miniconda, which contains only Conda and its dependencies, and this installer is what we’ll be using today. You can also install the full build of Anaconda if you prefer.
```shell
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -P /tmp/
sudo yum install bzip2
bash /tmp/Miniconda3-latest-Linux-x86_64.sh
```
We typically recommend the default settings here, but feel free to change them to suit your needs.
Next, we have to create the Python environment that we want to package as a Conda environment. For this example, we'll show how to create one with Python 2.7.x and the NumPy library, and we'll name it 'mapr_numpy'.

```shell
mkdir mapr_numpy
[/path/to/]conda create -p ./mapr_numpy python=2 numpy
```

For example:

```shell
[mapr@]$ /home/mapr/miniconda3/bin/conda create -p ./mapr_numpy python=3.5 numpy
Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment /home/mapr/mapr_numpy:

The following NEW packages will be INSTALLED:
[..]
    numpy:   1.14.0-py35h3dfced4_0
    openssl: 1.0.2n-hb7f436b_0
    pip:     9.0.1-py35h7e7da9d_4
[..]

Proceed ([y]/n)? y
```
You can test that this environment was created correctly by checking its Python version:

```shell
./mapr_numpy/bin/python -V
Python 2.7.14 :: Anaconda, Inc.
```
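Beyond the version check, you can confirm that the packaged library itself imports and works by running a small script with the environment's interpreter. A minimal sketch (check_numpy.py is a hypothetical file name, assuming NumPy was installed into the environment as above):

```python
# Run with the environment's own interpreter, e.g.:
#   ./mapr_numpy/bin/python check_numpy.py
import numpy as np

# Print the NumPy version bundled into the environment.
print(np.__version__)

# A trivial computation to confirm the library actually works.
a = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
print(a.sum())  # 15
```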
Next, we need to zip this environment up (use whichever tool you prefer):

```shell
cd mapr_numpy/
zip -r mapr_numpy.zip ./
```
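The layout of the archive matters: because we zip from inside mapr_numpy/, entries like bin/python sit at the top level of the archive rather than under a mapr_numpy/ prefix. A small sketch (Python stdlib only, with a toy stand-in directory) that builds an archive the same way and verifies the layout:

```python
import os
import tempfile
import zipfile

# Build a toy directory that mimics a conda env layout, then zip it the
# same way as `cd mapr_numpy/ && zip -r mapr_numpy.zip ./` -- i.e. with
# paths relative to the environment root.
root = tempfile.mkdtemp()
env = os.path.join(root, "mapr_numpy")
os.makedirs(os.path.join(env, "bin"))
with open(os.path.join(env, "bin", "python"), "w") as f:
    f.write("")  # placeholder for the real interpreter binary

archive = os.path.join(root, "mapr_numpy.zip")
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    for dirpath, _, filenames in os.walk(env):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # arcname is relative to the env root, so the entry is
            # "bin/python", not "mapr_numpy/bin/python".
            zf.write(full, arcname=os.path.relpath(full, env))

with zipfile.ZipFile(archive) as zf:
    print(zf.namelist())  # ['bin/python']
```

If you instead zip from the parent directory, everything lands under an extra mapr_numpy/ prefix and the extracted interpreter won't be where the configuration expects it.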
Then, store it in a directory you have access to in MapR-FS. For example:

```shell
hadoop fs -put mapr_numpy.zip /user/mapr/python_envs/
```