This post will help you get started using the Apache Spark web UI to understand how your Spark application is executing on a Hadoop cluster. The Spark web UI displays useful information about your application, including a list of scheduler stages and tasks, a summary of RDD sizes and memory usage, and information about the running executors.
This post will go over the components of execution for a Spark application, and then how to use the Spark web UI to view the behavior and performance of that application.
This post assumes a basic understanding of Spark concepts. If you have not already read the tutorial on Getting Started with Spark on MapR Sandbox, it would be good to read that first.
This tutorial will run on the MapR Sandbox. Version 5 includes Spark; on version 4.1, Spark needs to be installed separately.
Spark Components of Execution
Before looking at the web UI, you need to understand the components of execution for a Spark application. Let’s go over this using the word count example from the Getting Started with Spark on MapR Sandbox tutorial, shown in the image below.
To run this example, first get a text file to count from the command line:
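The command itself isn't preserved here; a minimal equivalent (the file name, target path, and source URL are assumptions, not the originals) would be:

```bash
# Download a public-domain text to the sandbox user's home directory
# (hypothetical path and URL -- adjust for your environment).
wget -O /user/user01/alice.txt https://www.gutenberg.org/files/11/11.txt
```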
Then you can enter the code shown below in the Spark shell. The example code results in the wordcounts RDD, which defines a directed acyclic graph (DAG) of RDDs that will be used later when an action is called. Operations on RDDs create new RDDs that refer back to their parents, thereby creating a graph. You can print out this RDD lineage with toDebugString as shown below.
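A minimal sketch of that code, assuming the alice.txt file from the previous step:

```scala
// Build the wordcounts RDD -- no computation happens yet; these
// transformations only define the DAG of RDDs.
val file = sc.textFile("/user/user01/alice.txt")
val wordcounts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Print the RDD lineage.
println(wordcounts.toDebugString)
```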
The first RDD in the lineage, HadoopRDD, was created by calling sc.textFile(); the last, ShuffledRDD, was created by reduceByKey. Below on the left is a diagram of the DAG for the wordcounts RDD; the green inner rectangles in the RDDs represent partitions. When the action collect() is called, Spark's scheduler creates a physical execution plan for the job, shown on the right, to execute the action.
The scheduler splits the RDD graph into stages based on the transformations: narrow transformations (those without data movement between partitions) are grouped (pipelined) together into a single stage. This physical plan has two stages, with everything before the ShuffledRDD in the first stage.
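You can see this boundary in the lineage itself. Illustrative toDebugString output (RDD ids, partition counts, and call sites will vary in your shell) looks like the following, where the indentation change at `+-` marks the shuffle between the two stages:

```
(2) ShuffledRDD[4] at reduceByKey at <console>:25 []
 +-(2) MapPartitionsRDD[3] at map at <console>:24 []
    |  MapPartitionsRDD[2] at flatMap at <console>:23 []
    |  /user/user01/alice.txt MapPartitionsRDD[1] at textFile at <console>:22 []
    |  /user/user01/alice.txt HadoopRDD[0] at textFile at <console>:22 []
```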
Each stage is made up of tasks, one per partition of the RDD, which perform the same computation in parallel. The scheduler submits the stage's task set to the task scheduler, which launches the tasks via a cluster manager.
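As a quick check (a sketch, reusing the wordcounts RDD defined above):

```scala
// The number of tasks in each stage equals the number of partitions.
println(wordcounts.partitions.length)

// collect() is the action that triggers the two-stage job described above.
wordcounts.collect().foreach(println)
```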
Here is a summary of the components of execution:
RDD: a parallel dataset, made up of partitions
DAG: the logical graph of RDD operations
Stage: a set of pipelined tasks that run the same computation in parallel, one task per partition
Task: a unit of execution that operates on a single partition
Job: the work triggered by an action, consisting of one or more stages
The diagram below shows Spark on an example Hadoop cluster:
Here is how a Spark application runs:
The driver program starts and creates a SparkContext, which connects to a cluster manager (for example, YARN on a Hadoop cluster).
The cluster manager allocates executors on worker nodes for the application.
The driver sends the application code to the executors.
The driver sends tasks to the executors, which run them and return results.
Using the Spark web UI to view the behavior and performance of your Spark application
You can view the behavior of your Spark application in the Spark web UI at http://&lt;hostname&gt;:4040, where 4040 is the default port for a running application's UI.
Under the Stages tab, you can see the details for each stage. Below is the stages page for the word count job; Stage 0 is named after the last RDD in the stage pipeline, and Stage 1 is named after the action.
You can view persisted (cached) RDDs in the Storage tab.
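An RDD appears on this tab only after it has been persisted and then materialized by an action; for example, with the wordcounts RDD from above:

```scala
// Mark the RDD for caching, then run an action so it is actually
// materialized and shows up on the Storage tab.
wordcounts.cache()
wordcounts.count()
```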
Under the Executors tab, you can see processing and storage metrics for each executor. You can look at an executor's thread call stack by clicking on its thread dump link.
This concludes the Getting Started with the Spark web UI tutorial. If you have any further questions about using the Spark web UI, please ask them in the comments section below.
You can find more information about these technologies here: