Apache Pig is a platform for parallelized analysis of large data sets via a language called Pig Latin. (For more information about Pig, see the Pig project page.)
You'll be working with Pig from the Linux shell. Open a terminal by selecting Applications > Accessories > Terminal (see A Tour of the MapR Virtual Machine).
Note: Although this tutorial was originally designed for users of the MapR Virtual Machine, you can easily adapt these instructions for a node in a cluster, for example by using a different directory structure.
In this tutorial, we'll use version 0.11 of Pig to run a MapReduce job that counts the words in the file /in/constitution.txt in the mapr user's directory on the cluster and stores the results in an output file.
- First, make sure you have downloaded the file: On the page A Tour of the MapR Virtual Machine, select Tools > Attachments and right-click constitution.txt to save it.
- Make sure the file is loaded onto the cluster, in the directory /user/mapr/in. If you are not sure how, look at the NFS tutorial on A Tour of the MapR Virtual Machine.
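When the cluster is mounted over NFS, loading the file amounts to an ordinary copy. The following is a minimal sketch rather than the tutorial's own procedure: the helper name stage_input is hypothetical, and the mount point /mapr/my.cluster.com in the example call is an assumption that you should replace with your own.

```shell
# stage_input SRC DEST_DIR
# Copy a local file into a directory on the NFS-mounted cluster,
# creating the destination directory first if it does not exist.
stage_input() {
    src="$1"
    dest_dir="$2"
    mkdir -p "$dest_dir" && cp "$src" "$dest_dir"/
}

# Example call (both paths are assumptions; adjust to your setup):
# stage_input ~/constitution.txt /mapr/my.cluster.com/user/mapr/in
```

After the copy, listing the destination directory should show constitution.txt.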
Open a Pig shell and get started:
- In the terminal, type the command pig to start the Pig shell.
- At the grunt> prompt, type the lines of the word-count script (press ENTER after each). After you type the last line, Pig starts a MapReduce job to count the words in the file.
- When the MapReduce job is complete, type quit to exit the Pig shell and take a look at the contents of the directory /myvolume/wordcount to see the results.
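A word count in Pig Latin usually follows a load, tokenize, group, count, store pattern. The sketch below is one way to write it and is not necessarily the exact script this tutorial intended. The relation names, the script file name wordcount.pig, the helper show_counts, and the use of /myvolume/wordcount as the STORE target are all assumptions. The block saves the Pig Latin to a file so that the lines can be pasted one at a time at the grunt> prompt or run in batch mode.

```shell
# Save a word-count script in Pig Latin (the file name is hypothetical).
cat > wordcount.pig <<'EOF'
-- Load each line of the input file as a single chararray field.
lines = LOAD '/user/mapr/in/constitution.txt' USING TextLoader() AS (line:chararray);
-- Split every line into words, one word per record.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Collect identical words into groups.
grouped = GROUP words BY word;
-- Count the records in each group.
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
-- Write the (word, count) pairs; the output path is an assumption.
STORE counts INTO '/myvolume/wordcount';
EOF

# show_counts DIR [N]
# Print the N most frequent words (default 10) from the tab-separated
# part files that STORE writes into DIR.
show_counts() {
    dir="$1"
    n="${2:-10}"
    cat "$dir"/part-* | sort -t "$(printf '\t')" -k2,2nr | head -n "$n"
}
```

Running pig wordcount.pig from the Linux shell executes the script in batch mode, and show_counts /myvolume/wordcount then lists the most frequent words. STORE fails if the output directory already exists, so remove that directory before running the job again.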