Apache Pig is a platform for parallelized analysis of large data sets via a language called Pig Latin. (For more information about Pig, see the Pig project page.)

You'll be working with Pig from the Linux shell. Open a terminal by selecting Applications > Accessories > Terminal (see A Tour of the MapR Virtual Machine).

Note: Although this tutorial was originally designed for users of the MapR Virtual Machine, you can easily adapt these instructions for a node in a cluster, for example by using a different directory structure.

In this tutorial, we'll use version 0.11 of Pig to run a MapReduce job that counts the words in the file constitution.txt, located in the directory /user/mapr/in on the cluster, and stores the results in the directory /user/mapr/wordcount.

Open a Pig shell and get started:

  1. In the terminal, type the command pig to start the Pig shell.
  2. At the grunt> prompt, type the following lines (press ENTER after each):
    A = LOAD '/user/mapr/in' USING TextLoader() AS (words:chararray);
    B = FOREACH A GENERATE FLATTEN(TOKENIZE(*));
    C = GROUP B BY $0;
    D = FOREACH C GENERATE group, COUNT(B);
    STORE D INTO '/user/mapr/wordcount';
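    The first four lines only define relations; Pig does not run anything until you enter the STORE statement. At that point, Pig starts a MapReduce job to count the words in the file constitution.txt. Because the LOAD statement names a directory, any other files in /user/mapr/in are counted as well.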
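  3. When the MapReduce job is complete, type quit to exit the Pig shell. To see the results, look at the contents of the directory /user/mapr/wordcount, the path given in the STORE statement. Pig writes the output as one or more part files in that directory rather than as a single file.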
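
To make the data flow of the Pig script easier to follow, here is a minimal Python sketch of what each relation (A, B, C, D) holds. It is an illustration only, not how Pig executes the job. The function name word_count and the sample lines are hypothetical, and the tokenizer is simplified to whitespace splitting (Pig's TOKENIZE also treats a few punctuation characters as delimiters, which this sketch ignores).

```python
from collections import defaultdict


def word_count(lines):
    """Mimic the relations A, B, C, D from the Pig script on in-memory lines."""
    # A: one record per input line, similar to LOAD ... USING TextLoader()
    a = list(lines)

    # B: one record per word, similar to FLATTEN(TOKENIZE(*))
    # Assumption: plain whitespace splitting stands in for TOKENIZE.
    b = [word for line in a for word in line.split()]

    # C: identical words collected into one bag, similar to GROUP B BY $0
    c = defaultdict(list)
    for word in b:
        c[word].append(word)

    # D: (word, count) pairs, similar to GENERATE group, COUNT(B)
    return {word: len(bag) for word, bag in c.items()}


if __name__ == "__main__":
    # Hypothetical sample input; the tutorial itself reads constitution.txt.
    sample = ["We the People of the United States", "of the People"]
    for word, count in sorted(word_count(sample).items()):
        # Tab-separated output, one word per line
        print(f"{word}\t{count}")
```

Like the Pig script, the sketch is case-sensitive, so "The" and "the" would be counted as different words.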
