Pig is a platform for parallelized analysis of large data sets. Pig programs use a language called Pig Latin.
In this tutorial, create a directory for the US Constitution text file, and then create a Pig script that runs a word count MapReduce job on the text in the file. After you run the MapReduce job, view the wordcount file generated by the job.
Create a new directory for the constitution.txt file:
Note the directory path because you will need it in the next section when you create a Pig script to run the MapReduce job.
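If you prefer to script this step instead of using File Browser, the following is a minimal sketch of the same directory creation and upload using the Hadoop FileSystem API. The HDFS path /oozie/wordcount matches the Pig script later in this tutorial; the local path to constitution.txt is an assumption, so adjust both for your environment.

```java
// Sketch: create the wordcount directory and upload constitution.txt with the
// Hadoop FileSystem API instead of File Browser. The HDFS directory and the
// local file location below are assumptions; change them to suit your cluster.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadConstitution {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/oozie/wordcount");
        if (!fs.exists(dir)) {
            fs.mkdirs(dir);                         // create the target directory
        }
        // copy the local constitution.txt into the new HDFS directory
        fs.copyFromLocalFile(new Path("/tmp/constitution.txt"), dir);
        fs.close();
    }
}
```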
Create a Pig script and run a word count MapReduce job:
In the script window, enter the following Pig Latin commands:
A = LOAD '/oozie/wordcount' USING TextLoader() AS (words:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(words));
C = GROUP B BY $0;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO '/oozie/wcresults';
Note: You may need to edit the directory paths in the LOAD and STORE statements. Verify that the path in line A (the LOAD statement) points to the directory where you uploaded the constitution.txt file, and that the path in the STORE statement points to the directory where you want the word count output.
View the wordcount file that contains the MapReduce job results:
Next: Use Job Designer to create and submit a MapReduce job design.
Job Designer is an application that you can use to submit MapReduce, Hadoop streaming, or JAR jobs. A MapReduce job contains Java map and reduce functions; you can use existing mapper and reducer classes in a MapReduce job design without writing a main Java class. A Hadoop streaming job is a job whose map and reduce functions, written in a non-Java language, read from standard input and write to standard output. A JAR job is a job whose map and reduce functions are written in Java and packaged in a JAR file.
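The tutorial does not show the mapper and reducer classes inside the sample JAR, so the following is only an illustrative sketch of what such classes look like for a word count job, written against the standard org.apache.hadoop.mapreduce API. The class names are hypothetical and are not the actual contents of the sample JAR.

```java
// Illustrative word count mapper and reducer classes of the kind a MapReduce
// job design points at. No main class is needed for a MapReduce job design.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Emits (word, 1) for every token in each input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Sums the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```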
When you create a MapReduce job in Job Designer, you can configure variables of the form $variable_name in all job design settings except Name and Description. If you include variables, a dialog box appears when you submit the job so that you can specify their values. For example, if a job property value contains $output_dir, you are prompted for a value for $output_dir at submission time.
In this tutorial, use File Browser to create a directory to which you can upload the sample JAR file. Then use Job Designer to create a MapReduce job that uses the sample JAR file. Submit the job, and view the output file.
Create a new directory:
Upload the JAR file:
Create a MapReduce Job Design:
Configure the job settings with the information below:
| Setting | Action |
| --- | --- |
| Name | Enter MapReduce_Job_Design as the job name. |
| Description | Enter Job Design Tutorial as the description. |
| JAR Path | Enter the fully qualified path to the JAR file that contains the classes that implement the mapper and reducer functions. |
| Job Properties | Click the Add property button four times. Enter the following property names and their associated values: |
Click Save. The Job Designs page appears with MapReduce_Job_Design in the list.
Submit the job design:
View the output file:
Next: Use Oozie to create and submit a workflow.
Oozie is a workflow system for Hadoop. Use Oozie to set up workflows that execute MapReduce jobs and to set up a coordinator that manages workflows.
In this tutorial, create a workflow to run the same MapReduce job that you ran in the previous tutorial. Submit the workflow to run the job, and then view the output file.
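This tutorial submits the workflow through the web interface, but as a rough sketch, the same workflow can also be submitted with the Oozie Java client API. The Oozie server URL, the HDFS application path, and the jobTracker and nameNode property values below are assumptions; replace them with your cluster's values.

```java
// Sketch: submit an Oozie workflow programmatically with the Oozie Java
// client, as an alternative to the web UI used in this tutorial. All URLs,
// paths, and addresses below are placeholders for your environment.
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitWordCountWorkflow {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // HDFS directory that contains the workflow definition (workflow.xml)
        conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:8020/user/hue/oozie/workspaces/wordcount");
        conf.setProperty("jobTracker", "localhost:8021");
        conf.setProperty("nameNode", "hdfs://localhost:8020");

        String jobId = oozie.run(conf);   // submit and start the workflow
        System.out.println("Submitted workflow " + jobId);

        // poll until the workflow finishes
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10_000);
        }
        System.out.println("Final status: " + oozie.getJobInfo(jobId).getStatus());
    }
}
```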
Create a workflow:
Click the Create button. The Create Workflow page appears.
Click the Add Property button four times, and enter the following property names and values:
Click Done. The MapReduce action appears in the workflow.
View the output file: