Apache Spark can use various cluster managers to execute applications (Standalone, YARN, Apache Mesos). When you install Apache Spark on MapR, you can submit an application in Standalone mode or by using YARN.
This blog post focuses on YARN and dynamic allocation, a feature that lets Spark add or remove executors dynamically based on the workload. You can find more information about this feature in this presentation from Databricks.
Let’s see how to configure Spark and YARN to use dynamic allocation, which is disabled by default.
The example below is for MapR 5.2 with Apache Spark 1.6.1; you just need to adapt the version to your environment.
The first thing to do is to enable dynamic allocation in Spark. To do this, you need to edit the Spark configuration file on each Spark node:

/opt/mapr/spark/spark-1.6.1/conf/spark-defaults.conf

and add the following entries:
spark.dynamicAllocation.enabled = true
spark.shuffle.service.enabled = true
spark.dynamicAllocation.minExecutors = 5
spark.executor.instances = 0
You can find additional configuration options in the Apache Spark Documentation.
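Beyond the required entries above, you may also want to bound and tune the scaling behavior. The options below are standard Spark 1.6 dynamic allocation settings; the values shown are purely illustrative, so adjust them to your workload:

```
spark.dynamicAllocation.maxExecutors = 20
spark.dynamicAllocation.executorIdleTimeout = 60s
spark.dynamicAllocation.schedulerBacklogTimeout = 1s
```

maxExecutors caps how far Spark can scale up (it is unbounded by default), executorIdleTimeout controls how long an idle executor is kept before being released, and schedulerBacklogTimeout controls how quickly new executors are requested when tasks queue up.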
Now you need to edit the YARN configuration to add information about the Spark shuffle service. Edit the following file on each YARN node:

/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/yarn-site.xml

and add these properties:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,mapr_direct_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
The Spark shuffle service jar must also be added to the YARN classpath. The jar is located in the Spark distribution:

/opt/mapr/spark/spark-1.6.1/lib/spark-1.6.1-mapr-1605-yarn-shuffle.jar

To do this, add the jar to the following folder on each node:

/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/lib

You can either copy the file or create a symlink:
$ ln -s /opt/mapr/spark/spark-1.6.1/lib/spark-1.6.1-mapr-1605-yarn-shuffle.jar /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/lib
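If you prefer to copy the jar rather than symlink it, the equivalent command with the same paths is:

```shell
# Copy the shuffle service jar into the YARN lib folder instead of symlinking.
cp /opt/mapr/spark/spark-1.6.1/lib/spark-1.6.1-mapr-1605-yarn-shuffle.jar \
   /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/lib/
```

A symlink has the advantage of automatically picking up a patched jar in the Spark lib directory, whereas a copy must be refreshed by hand after an upgrade.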
Since you have changed the YARN configuration, you must restart your node managers using the following command:
$ maprcli node services -name nodemanager -action restart -nodes [list of nodes]
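Once the NodeManagers are back up, you can sanity-check that the external shuffle service is running. By default it listens on port 7337 (configurable via the standard spark.shuffle.service.port property); this quick check, to be run on each node, is a sketch under that default-port assumption:

```shell
# Verify the Spark external shuffle service is listening on its default
# port (7337). Assumes the default spark.shuffle.service.port setting.
nc -z localhost 7337 && echo "shuffle service is up"
```

If the port is not open, check the NodeManager logs for errors from org.apache.spark.network.yarn.YarnShuffleService.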
Your MapR cluster is now ready to use Spark dynamic allocation. This means that when you submit a job, you do not need to specify any resource configuration. For example:
/opt/mapr/spark/spark-1.6.1/bin/spark-submit \
  --class com.mapr.demo.WordCountSorted \
  --master yarn \
  ~/spark-examples-1.0-SNAPSHOT.jar \
  /mapr/my.cluster.com/input/4gb_txt_file.txt \
  /mapr/my.cluster.com/user/mapr/output/
Note that you can still specify resources explicitly, but in that case dynamic allocation will not be used for that specific job. For example:
/opt/mapr/spark/spark-1.6.1/bin/spark-submit \
  --class com.mapr.demo.WordCountSorted \
  --master yarn \
  --num-executors 3 \
  --executor-memory 1G \
  ~/spark-examples-1.0-SNAPSHOT.jar \
  /mapr/my.cluster.com/input/4gb_txt_file.txt \
  /mapr/my.cluster.com/user/mapr/output/
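A middle ground is to keep dynamic allocation enabled but bound it for a single job with --conf overrides. These are the same standard Spark properties used in the cluster-wide configuration above; the min/max values here are illustrative:

```shell
# Submit with dynamic allocation still active, but bounded for this job only.
# The minExecutors/maxExecutors values below are example values, not defaults.
/opt/mapr/spark/spark-1.6.1/bin/spark-submit \
  --class com.mapr.demo.WordCountSorted \
  --master yarn \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  ~/spark-examples-1.0-SNAPSHOT.jar \
  /mapr/my.cluster.com/input/4gb_txt_file.txt \
  /mapr/my.cluster.com/user/mapr/output/
```

Per-job --conf values take precedence over spark-defaults.conf, so this lets you cap a heavy job without touching the cluster-wide settings.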
In this blog post, you learned how to set up Spark dynamic allocation on MapR. If you have any further questions about this tutorial, please ask them in the comments section below.