MapR 4.0.x Documentation : Mahout

Apache Mahout™ is a scalable machine learning library. For more information about Mahout, see the Apache Mahout project.

See also Upgrading Mahout and Working with Mahout.

Installing Mahout

Mahout can be installed when MapR services are initially installed as discussed in Installing MapR Services. If Mahout wasn't installed during the initial MapR services installation, Mahout can be installed at a later date by executing the instructions in this section. These procedures may be performed on a node in a MapR cluster (see the Advanced Installation Topics) or on a client (see Setting Up the Client).

The Mahout installation procedures below use the operating system's package manager to download and install Mahout from the MapR Repository. If you want to install this component manually from package files, see Packages and Dependencies for MapR Software.

Installing Mahout on a MapR Node

Mahout only needs to be installed on the nodes in the cluster from which Mahout applications will be executed. So you may only need to install Mahout on one node. However, depending on the number of Mahout users and the number of scheduled Mahout jobs, you may need to install Mahout on more than one node.

Mahout applications may run MapReduce programs, and by default Mahout will use the cluster's default JobTracker to execute MapReduce jobs.

Install Mahout on a MapR node running Ubuntu

As root or using sudo, execute the following apt-get install command on the node:

# apt-get install mapr-mahout

Install Mahout on a MapR node running Red Hat or CentOS

As root or using sudo, execute the following yum install command on the node:

# yum install mapr-mahout
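After either install command completes, you may want to confirm that the package is registered with the package manager. The following is a minimal sketch rather than part of the MapR tooling: the function name mahout_pkg_installed is hypothetical, and it relies only on the standard dpkg -s and rpm -q queries.

```shell
# Hypothetical helper: succeed if the mapr-mahout package is known to the
# local package manager. Returns 2 if neither dpkg nor rpm is available.
mahout_pkg_installed() {
  if command -v dpkg >/dev/null 2>&1; then
    # dpkg -s exits non-zero when the package is not in dpkg's database
    dpkg -s mapr-mahout >/dev/null 2>&1
  elif command -v rpm >/dev/null 2>&1; then
    # rpm -q exits non-zero when the package is not installed
    rpm -q mapr-mahout >/dev/null 2>&1
  else
    return 2
  fi
}

if mahout_pkg_installed; then
  echo "mapr-mahout is installed"
else
  echo "mapr-mahout was not found"
fi
```

This is only a rough check: it reports what the package database knows and does not verify that the Mahout files are intact.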

Installing Mahout on a Client

If you install Mahout on a Linux client, you can run Mahout applications from the client that execute MapReduce jobs on the cluster that your client is configured to use.

Tip: You don't have to install Mahout on the cluster in order to run Mahout applications from your client.

Install Mahout on a client running Ubuntu

As root or using sudo, execute the following apt-get install command on the client:

# apt-get install mapr-mahout

Install Mahout on a client running Red Hat or CentOS

As root or using sudo, execute the following yum install command on the client:

# yum install mapr-mahout

Configuring the Mahout Environment

After installation, the Mahout executable is located at the following path:
/opt/mapr/mahout/mahout-<version>/bin/mahout

Example: /opt/mapr/mahout/mahout-0.7/bin/mahout
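Because the directory name includes the installed version, it can be convenient to discover it rather than hard-code it. The following is a minimal sketch under that assumption; the function name find_mahout_home is hypothetical and is not provided by MapR or Mahout.

```shell
# Hypothetical helper: print the last mahout-<version> directory (in glob
# sort order) under a base directory that contains an executable bin/mahout.
find_mahout_home() {
  base="${1:-/opt/mapr/mahout}"
  found=""
  for dir in "$base"/mahout-*; do
    if [ -x "$dir/bin/mahout" ]; then
      found="$dir"
    fi
  done
  [ -n "$found" ] && printf '%s\n' "$found"
}

# Example use: fall back to a message if nothing is installed.
find_mahout_home || echo "no Mahout installation found under /opt/mapr/mahout"
```

Glob results are sorted lexically, so if several versions are installed side by side the directory printed is not necessarily the newest release (for example, 0.10 sorts before 0.9).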


To use Mahout with MapR, set the following environment variables:

  • MAHOUT_HOME - the path to the Mahout directory. Example:
    $ export MAHOUT_HOME=/opt/mapr/mahout/mahout-0.7
  • JAVA_HOME - the path to the Java directory. Example for Ubuntu:
    $ export JAVA_HOME=/usr/lib/jvm/java-6-sun
    Example for Red Hat and CentOS:
    $ export JAVA_HOME=/usr/java/jdk1.6.0_24
  • HADOOP_HOME - the path to the Hadoop directory. Example:
    $ export HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2
  • HADOOP_CONF_DIR - the path to the directory containing Hadoop configuration parameters. Example:
    $ export HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-0.20.2/conf
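Once the variables listed above are exported, a quick sanity check can catch a missing variable or a mistyped path before a Mahout job is launched. The following is a minimal sketch; the function name check_mahout_env is hypothetical, and it only tests that each variable is set and names an existing directory.

```shell
# Hypothetical helper: verify that each variable Mahout needs is set and
# points to an existing directory. Returns non-zero if any check fails.
check_mahout_env() {
  status=0
  for name in MAHOUT_HOME JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR; do
    eval "value=\${$name:-}"   # POSIX-compatible indirect lookup
    if [ -z "$value" ]; then
      echo "$name is not set" >&2
      status=1
    elif [ ! -d "$value" ]; then
      echo "$name points to a missing directory: $value" >&2
      status=1
    fi
  done
  return "$status"
}

check_mahout_env && echo "Mahout environment looks complete"
```

The check reports every problem it finds instead of stopping at the first one, which makes it easier to correct all of the entries in one pass.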

You can set these environment variables persistently for all users by adding them to the /etc/environment file as root or using sudo. The order of the environment variables in the file doesn't matter.

Example entries for setting environment variables in the /etc/environment file for Ubuntu:

      JAVA_HOME=/usr/lib/jvm/java-6-sun
      MAHOUT_HOME=/opt/mapr/mahout/mahout-0.7
      HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2
      HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-0.20.2/conf

Example entries for setting environment variables in the /etc/environment file for Red Hat and CentOS:

      JAVA_HOME=/usr/java/jdk1.6.0_24
      MAHOUT_HOME=/opt/mapr/mahout/mahout-0.7
      HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2
      HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-0.20.2/conf

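After adding or editing environment variables in the /etc/environment file, you can load them into the current shell without rebooting. Because the entries in /etc/environment are plain NAME=value lines without the export keyword, turn on automatic export while sourcing the file so that the variables are passed on to the Mahout process:

$ set -a; source /etc/environment; set +a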

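Note: A user who doesn't have root or sudo permissions can instead add these environment variables to their own ~/.bashrc file, writing each entry as an export statement (for example, export MAHOUT_HOME=/opt/mapr/mahout/mahout-0.7). The environment variables will then be set each time the user starts a new Bash session.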
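For the per-user case, the entries can be appended to the startup file from the command line. The following is a minimal sketch, assuming a Bourne-style shell; the function name add_export is hypothetical, and it skips a variable that the file already exports so that repeated runs do not create duplicates.

```shell
# Hypothetical helper: append "export NAME=value" to a shell startup file
# unless a line exporting NAME is already present.
add_export() {
  file="$1" name="$2" value="$3"
  if grep -q "^export $name=" "$file" 2>/dev/null; then
    return 0
  fi
  printf 'export %s=%s\n' "$name" "$value" >> "$file"
}

# Example (paths taken from the examples on this page; adjust for your system):
# add_export "$HOME/.bashrc" MAHOUT_HOME /opt/mapr/mahout/mahout-0.7
# add_export "$HOME/.bashrc" HADOOP_HOME /opt/mapr/hadoop/hadoop-0.20.2
```

The sketch writes the value unquoted, so it is only suitable for paths that contain no spaces or shell metacharacters.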

Getting Started with Mahout

To see the sample applications bundled with Mahout, execute the following command:

$ ls $MAHOUT_HOME/examples/bin

To run the Twenty Newsgroups Classification Example, execute the following commands:

$ cd $MAHOUT_HOME
$ ./examples/bin/classify-20newsgroups.sh

The output from this example will look similar to the following:

[Screenshot.png: screenshot of the example output]
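The two commands used to run the example can be wrapped so that a missing MAHOUT_HOME or a mistyped script name is reported clearly. The following is a minimal sketch; the function name run_mahout_example is hypothetical and is not part of Mahout.

```shell
# Hypothetical wrapper: run a bundled example script by name, after checking
# that MAHOUT_HOME is set and the script exists and is executable.
run_mahout_example() {
  script="examples/bin/$1"
  if [ -z "${MAHOUT_HOME:-}" ]; then
    echo "MAHOUT_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$MAHOUT_HOME/$script" ]; then
    echo "example not found: $MAHOUT_HOME/$script" >&2
    return 1
  fi
  # Run from MAHOUT_HOME in a subshell so the caller's directory is unchanged
  ( cd "$MAHOUT_HOME" && "./$script" )
}

# Example:
# run_mahout_example classify-20newsgroups.sh
```

Running the script from a subshell mirrors the cd $MAHOUT_HOME step while leaving the calling shell in its original directory.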