MapR 4.0.x Documentation : Pig

Apache Pig is a platform for parallelized analysis of large data sets via a language called Pig Latin. For more information about Pig, see the Pig project page.

Once Pig is installed, the executable is located at: /opt/mapr/pig/pig-<version>/bin/pig

Make sure the environment variable JAVA_HOME is set correctly. Example:

# export JAVA_HOME=/usr/lib/jvm/java-6-sun
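A quick way to confirm the setting before starting Pig is sketched below. The function name check_java_home is hypothetical (it is not part of MapR or Pig), and the check only assumes the usual layout in which the Java launcher lives at $JAVA_HOME/bin/java.

```shell
# Hypothetical helper: report whether JAVA_HOME points at a usable Java install.
# Assumes the conventional layout where the launcher is $JAVA_HOME/bin/java.
check_java_home() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set"
        return 1
    fi
    if [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "No java executable at $JAVA_HOME/bin/java"
        return 1
    fi
    echo "JAVA_HOME looks usable: $JAVA_HOME"
}
```

After exporting JAVA_HOME as in the example above, run check_java_home in the same shell; a non-zero return status means the variable is unset or points at a directory without bin/java.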

Installing Pig

The following procedures use the operating system package managers to download and install Pig from the MapR Repository. For instructions on setting up the ecosystem repository (which includes Pig), see Preparing Packages and Repositories.

If you want to install this component manually from package files, see Packages and Dependencies for MapR Software.

To install Pig on an Ubuntu cluster:

  1. Execute the following commands as root or using sudo.
  2. Perform this procedure on a MapR cluster. If you have not installed MapR, see the Advanced Installation Topics.
  3. Update the list of available packages:

    apt-get update
  4. On each planned Pig node, install mapr-pig:

    apt-get install mapr-pig

To install Pig on a Red Hat or CentOS cluster:

  1. Execute the following commands as root or using sudo.
  2. Perform this procedure on a MapR cluster. If you have not installed MapR, see the Advanced Installation Topics.
  3. On each planned Pig node, install mapr-pig:

    yum install mapr-pig
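On either platform, one way to confirm that the package landed where this page says it should is to look for the launcher under /opt/mapr/pig. The sketch below is illustrative only: find_pig is a hypothetical helper name, and the optional argument exists just so a different root directory can be checked.

```shell
# Hypothetical helper: print every Pig launcher found under a MapR-style root.
# The root defaults to /opt/mapr/pig, matching /opt/mapr/pig/pig-<version>/bin/pig.
find_pig() {
    root="${1:-/opt/mapr/pig}"
    for candidate in "$root"/pig-*/bin/pig; do
        # An unmatched glob stays literal, so test the path before printing it.
        [ -x "$candidate" ] && echo "$candidate"
    done
    return 0
}
```

Calling find_pig with no argument on a Pig node should print one path per installed Pig version; no output suggests the package is missing or installed elsewhere.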

Getting Started with Pig

Apache Pig is a platform for parallelized analysis of large data sets via a language called Pig Latin. (For more information about Pig, see the Pig project page.)

Open a terminal window so you can work with Pig from the Linux shell.

Note: Although this tutorial was originally designed for users of the MapR Virtual Machine, you can easily adapt these instructions for a node in a cluster, for example by using a different directory structure.

In this tutorial, we'll use version 0.11 of Pig to run a MapReduce job that counts the words in the file /user/mapr/in/constitution.txt on the cluster and stores the results in the directory /user/mapr/wordcount.

  • First, download the file: select Tools > Attachments on this Confluence page (see the top-right corner of the page) and right-click constitution.txt to save it.
  • Load the file onto the cluster and place it in the directory /user/mapr/in.
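The copy step can be done with the hadoop fs client, roughly as sketched below. This assumes the hadoop command is on the PATH of the user running it and that the parent directory /user/mapr already exists. The names upload_input and HADOOP_CMD are hypothetical: HADOOP_CMD is only an override that lets you preview the commands (for example with HADOOP_CMD=echo) before running them against the cluster.

```shell
# Hypothetical helper: copy a local file into the tutorial's input directory.
# Assumption: the hadoop client is on PATH; HADOOP_CMD is an illustrative
# override and not a variable that Hadoop or MapR reads.
upload_input() {
    src="$1"                      # local path to constitution.txt
    dest="${2:-/user/mapr/in}"    # cluster directory used by the tutorial
    hadoop_cmd="${HADOOP_CMD:-hadoop}"
    # Create the directory only if it is not already there.
    "$hadoop_cmd" fs -test -d "$dest" || "$hadoop_cmd" fs -mkdir "$dest" || return 1
    "$hadoop_cmd" fs -put "$src" "$dest" || return 1
    # List the directory so the upload can be confirmed by eye.
    "$hadoop_cmd" fs -ls "$dest"
}
```

For example, upload_input ./constitution.txt copies the downloaded file into /user/mapr/in and then lists that directory.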

Open a Pig shell and get started:

  1. In the terminal, type the command pig to start the Pig shell.
  2. At the grunt> prompt, type the following lines (press ENTER after each):

    A = LOAD '/user/mapr/in' USING TextLoader() AS (words:chararray);
    B = FOREACH A GENERATE FLATTEN(TOKENIZE(*));
    C = GROUP B BY $0;
    D = FOREACH C GENERATE group, COUNT(B);
    STORE D INTO '/user/mapr/wordcount';

    After you type the last line, Pig starts a MapReduce job to count the words in the file constitution.txt.

  3. When the MapReduce job is complete, type quit to exit the Pig shell, then look at the contents of the directory /user/mapr/wordcount to see the results.
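To sanity-check the job's output, the same count can be approximated locally on the downloaded copy of constitution.txt. The sketch below is not what Pig runs: local_wordcount is a hypothetical helper that splits on whitespace only, whereas TOKENIZE also splits on a few punctuation characters, so some counts may differ slightly. It prints word, tab, count on each line, which is the layout Pig's default storage produces for relation D.

```shell
# Hypothetical helper: count whitespace-separated words in a local file and
# print "word<TAB>count" lines sorted by word.
local_wordcount() {
    tr -s '[:space:]' '\n' < "$1" |
        awk 'NF { count[$0]++ } END { for (w in count) printf "%s\t%d\n", w, count[w] }' |
        sort
}
```

Running local_wordcount constitution.txt and comparing a few lines with the part files under /user/mapr/wordcount is usually enough to confirm that the Pig job processed the whole input.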

Attachments:

constitution.txt (text/plain)