World's Fastest Implementation of Random Forest Algorithms on the MapR Data Platform


Introduction

The MapR Data Platform is a foundational technology, designed not only to be a highly scalable distributed storage system but also to support the various machine learning tools you may want.

At MapR Professional Services, we are fortunate to be able to solve a wide variety of problems across dozens of industries. In the course of our work, we occasionally uncover some hidden gems, such as an implementation of Random Forest that is faster and more accurate than other major implementations, such as scikit-learn or Ranger. With MapR’s POSIX-compliant file system, these advanced applications can often be run in a distributed fashion with little or no modification.

In this blog post, we would like to show you how to get started using “the world's fastest implementation of Random Forest,” called cl-random-forest (CLRF). CLRF is Satoshi Imai's implementation of Random Forest for multiclass classification and univariate regression. It also includes a global refinement of Random Forest (Ren, Cao, Wei, and Sun. “Global Refinement of Random Forest.” CVPR, 2015), which makes it faster and more accurate than a standard Random Forest. For performance details, visit the official homepage (https://github.com/masatoi/cl-random-forest), where the project's benchmarks compare it with scikit-learn and Ranger.

We demonstrate how easy it is to adapt advanced algorithms to distributed architectures, using the MapR Data Platform.

Random Forest

Random Forest is an ensemble technique that trains a group of decision trees, each on a random subset of a training set. A new record is presented to this group, and the class chosen by the majority of the underlying decision trees becomes the prediction. For our purposes, one of the most important properties of this algorithm is that each decision tree is trained independently on a random subset of variables and records. When in the parallel mindset, independence is next to godliness: it means you can spread the training of a forest over multiple cores and machines.
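To make the voting step concrete, here is a minimal sketch (illustrative only, not cl-random-forest code) of how a forest turns individual tree predictions into a single class:

;; Illustrative sketch only (not part of cl-random-forest):
;; each tree votes for a class, and the forest returns the majority class.
(defun majority-vote (tree-predictions n-class)
  "TREE-PREDICTIONS is a list of class indices, one per decision tree."
  (let ((votes (make-array n-class :initial-element 0)))
    (dolist (class tree-predictions)
      (incf (aref votes class)))
    ;; the index of the most-voted class is the forest's prediction
    (position (reduce #'max votes) votes)))

;; Example: five trees vote 2, 2, 7, 2, 1 => the forest predicts class 2
(majority-vote '(2 2 7 2 1) 10)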

To start, here is a simple single-machine example followed by a multi-machine example. The original cl-random-forest was designed for a single machine; below we show how to run it across distributed machines. Distributing the computation over many machines improves training time for large models (e.g., 4,000 decision trees).

Building a Random Forest Model with Common Lisp

- Demo Environment

We installed MapR 6.0.1 with the MapR POSIX Client on AWS (5 nodes) and trained a Random Forest model with 4,000 decision trees.

- Dataset

As the training dataset, we use MNIST. After downloading it, place it at /mnist on the MapR cluster.

First, start the mapr-posix-client service and confirm that the dataset is visible. Because the file system is POSIX-compliant, no data conversion is needed: there is no HDFS and no ‘Hadoop-style’ custom code or libraries.

# mkdir /mapr
# service mapr-posix-client-basic start
# ls /mapr/cluster1/mnist/
mnist.scale    mnist.scale.t

- Software

You need a Lisp implementation (we use Clozure Common Lisp, CCL) and the de facto standard library manager, Quicklisp, to download cl-random-forest and its dependencies on all nodes.

1) Download CCL and extract.

# curl -OL https://github.com/Clozure/ccl/releases/download/v1.11.5/ccl-1.11.5-linuxx86.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   616    0   616    0     0    608      0 --:--:--  0:00:01 --:--:--   608
100 48.3M  100 48.3M    0     0   959k      0  0:00:51  0:00:51 --:--:-- 2283k

# tar xvzf ccl-1.11.5-linuxx86.tar.gz

2) Install Quicklisp.

To install Quicklisp, download quicklisp.lisp and load it.

# curl -O https://beta.quicklisp.org/quicklisp.lisp
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 49843  100 49843    0     0  33639      0  0:00:01  0:00:01 --:--:-- 50397

# ./ccl/dx86cl64 --load quicklisp.lisp

  ==== quicklisp quickstart 2015-01-28 loaded ====

        To continue with installation, evaluate: (quicklisp-quickstart:install)

        For installation options, evaluate: (quicklisp-quickstart:help)

Welcome to Clozure Common Lisp Version 1.11-r16635  (DarwinX8664)!

CCL is developed and maintained by Clozure Associates. For more information

about CCL, visit http://ccl.clozure.com.  To enquire about Clozure's Common Lisp

consulting services, email info@clozure.com or visit http://www.clozure.com.

? (quicklisp-quickstart:install)

... (skip)

? (ql:add-to-init-file)

;;; The following lines added by ql:add-to-init-file:

  #-quicklisp

  (let ((quicklisp-init (merge-pathnames "quicklisp/setup.lisp"

                                         (user-homedir-pathname))))

        (when (probe-file quicklisp-init)

          (load quicklisp-init)))

Press Enter to continue.

? (quit)

? is the CCL REPL prompt. Type (quicklisp-quickstart:install) and you will see several packages being installed. Running (ql:add-to-init-file) adds the lines shown above to your init file, so that Quicklisp loads automatically in future Lisp sessions.

After installing CCL and Quicklisp, you can load cl-random-forest with:

# ./ccl/lx86cl64

Clozure Common Lisp Version 1.11.5/v1.11.5  (LinuxX8664)

For more information about CCL, please see http://ccl.clozure.com.

CCL is free software.  It is distributed under the terms of the Apache

Licence, Version 2.0.

? (ql:quickload :cl-random-forest)

Building a Random Forest

1) On a Single Node

Let’s use the Read-Eval-Print-Loop (REPL) on a client node.

First, we run the REPL and load the cl-random-forest package with (ql:quickload :cl-random-forest).

# ./ccl/lx86cl64

Clozure Common Lisp Version 1.11.5/v1.11.5  (LinuxX8664)

For more information about CCL, please see http://ccl.clozure.com.

CCL is free software.  It is distributed under the terms of the Apache

Licence, Version 2.0.

? (ql:quickload :cl-random-forest)

To load "cl-random-forest":

  Load 1 ASDF system:

        cl-random-forest

; Loading "cl-random-forest"

(:CL-RANDOM-FOREST)

Note: ; is the comment character in Lisp.

Next, we load the training dataset from which cl-random-forest will build the Random Forest:

(defparameter mnist-dim 784)

(defparameter mnist-n-class 10)

;; Read training data

(multiple-value-bind (datamat target)

   (cl-random-forest.utils:read-data "/mapr/cluster1/mnist/mnist.scale" mnist-dim)

  (defparameter mnist-datamatrix datamat)

  (defparameter mnist-target target))

;; Add 1 to labels because the labels of this dataset begin from 0

(loop for i from 0 below (length mnist-target) do

  (incf (aref mnist-target i)))

cl-random-forest supports parallel learning on a single machine via the lparallel library.

Let’s set lparallel’s kernel object to enable/disable parallelization. For example, to enable parallelization with 8 threads:

(setf lparallel:*kernel* (lparallel:make-kernel 8))

;; To disable parallelization:
;; (setf lparallel:*kernel* nil)

Next, we train a Random Forest Classifier with these parameters:

(defparameter mnist-forest

   (cl-random-forest:make-forest mnist-n-class mnist-datamatrix mnist-target

               :n-tree 4000 :bagging-ratio 0.1 :max-depth 10 :n-trial 10 :min-region-samples 5))

The function make-forest takes the number of classes, the data matrix, and the target vector, plus optional keyword arguments such as the number of trees, the bagging ratio, the maximum depth of each tree, the minimum number of samples in a region the tree divides, and the number of split trials:

- :n-tree: the number of decision trees to build

- :bagging-ratio: the fraction of the training data sampled to build each decision tree

- :max-depth: the maximum depth of each tree. Deeper trees make the model more expressive, but take longer to train.

The Random Forest model will be trained by making associations between the input features and the labeled output associated with those features.
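As a quick aside before the full run: the same call scales down, so a much smaller forest is handy for verifying that the data loaded correctly before committing to 4,000 trees. This is a sketch using only the parameters described above; mnist-forest-small is a name we introduce here:

;; Sketch: a small 100-tree forest as a quick smoke test of the setup,
;; using the same make-forest parameters described above.
(defparameter mnist-forest-small
  (cl-random-forest:make-forest mnist-n-class mnist-datamatrix mnist-target
                                :n-tree 100 :bagging-ratio 0.1
                                :max-depth 10 :n-trial 10 :min-region-samples 5))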

We can wrap the call in the time macro to display the total processing time for building the Random Forest. For our test run:

(time
  (defparameter mnist-forest
    (cl-random-forest:make-forest mnist-n-class mnist-datamatrix mnist-target
                                  :n-tree 4000 :bagging-ratio 0.1 :max-depth 10
                                  :n-trial 10 :min-region-samples 5)))
(DEFPARAMETER MNIST-FOREST (MAKE-FOREST MNIST-N-CLASS MNIST-DATAMATRIX MNIST-TARGET :N-TREE 4000 :BAGGING-RATIO 0.1 :MAX-DEPTH 10 :N-TRIAL 10 :MIN-REGION-SAMPLES 5))

took 100,018,377 microseconds (100.018380 seconds) to run.

 28,893,651 microseconds (28.893652 seconds, 28.89%) of which was spent in GC.

During that period, and with 8 available CPU cores,

301,393,794 microseconds (301.393800 seconds) were spent in user mode

  7,785,833 microseconds (7.785833 seconds) were spent in system mode

286,976 bytes of memory allocated.

184,144 minor page faults, 0 major page faults, 0 swaps.

You can see it took 100,018,377 microseconds (about 100 seconds) to train 4,000 trees on a single node. That is good, and beats many other implementations.
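To sanity-check the model, you can also score it on the held-out split mnist.scale.t that we saw earlier. This is a sketch assuming the predict-forest and test-forest functions described in cl-random-forest's README; mnist-datamatrix-test and mnist-target-test are names we introduce here:

;; Sketch: evaluate the single-node forest on the test split (mnist.scale.t).
;; Assumes cl-random-forest's predict-forest and test-forest, per the project README.
(multiple-value-bind (datamat target)
    (cl-random-forest.utils:read-data "/mapr/cluster1/mnist/mnist.scale.t" mnist-dim)
  (defparameter mnist-datamatrix-test datamat)
  (defparameter mnist-target-test target))

;; Shift the test labels the same way as the training labels
(loop for i from 0 below (length mnist-target-test) do
  (incf (aref mnist-target-test i)))

;; Predicted class of the first test sample
(cl-random-forest:predict-forest mnist-forest mnist-datamatrix-test 0)

;; Overall accuracy on the test set
(cl-random-forest:test-forest mnist-forest mnist-datamatrix-test mnist-target-test)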

However, since we have a MapR cluster of compute nodes available, the next step is to build the trees in a distributed fashion.

2) On Distributed Nodes

We use 4 distributed nodes (node1 through node4) to build the Random Forest model and 1 node (node5) as the client.

For the distributed version, we need two more packages from lfarm: lfarm-server and lfarm-client. To load these packages, we use Quicklisp again.

Let’s run the REPL on each of node1 through node4:

# ./ccl/lx86cl64

Clozure Common Lisp Version 1.11.5/v1.11.5  (LinuxX8664)

For more information about CCL, please see http://ccl.clozure.com.

CCL is free software.  It is distributed under the terms of the Apache

Licence, Version 2.0.

? (ql:quickload :cl-random-forest)

? (ql:quickload :lfarm-server)

;; PLEASE run on each server

;; (lfarm-server:start-server "node1" 11524 :background t) ;; on node1

;; (lfarm-server:start-server "node2" 11524 :background t) ;; on node2

;; (lfarm-server:start-server "node3" 11524 :background t) ;; on node3

;; (lfarm-server:start-server "node4" 11524 :background t) ;; on node4

Run the corresponding start-server form on each of node1 through node4.
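Before wiring up the forest, it can help to confirm that the client can reach the servers. The following is a sketch (not from the original post) run from node5, using lfarm's lparallel-style API:

;; Sketch: verify from node5 that an lfarm server is reachable.
;; lfarm mirrors the lparallel API (make-kernel / make-channel / submit-task).
(ql:quickload :lfarm-client)
(setf lfarm:*kernel* (lfarm:make-kernel '(("node1" 11524))))
(let ((channel (lfarm:make-channel)))
  (lfarm:submit-task channel '+ 1 2 3)
  (lfarm:receive-result channel))   ; => 6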


From this point on, the code is similar to the single-node version; we give all of the client-side instructions in one block.

The parts specific to the distributed version are the lfarm calls in the code below.

Now we can build the Random Forest model. Let’s run the REPL on node5 and load the following code:

;; on node5

(ql:quickload :lfarm-client)

(ql:quickload :cl-random-forest)

(defpackage clrf-multinode

  (:use :cl :cl-random-forest))

(in-package :clrf-multinode)

;; Connect to the servers

(setf lfarm:*kernel* (lfarm:make-kernel '(("node1" 11524) ("node2" 11524) ("node3" 11524) ("node4" 11524))))

;; load Library

(lfarm:broadcast-task #'ql:quickload :cl-random-forest)

;; Init random-state

(lfarm:broadcast-task

 (lambda ()

   (setq *random-state* (make-random-state t))))

;; It is necessary to distribute the dataset file to each server beforehand

(lfarm:broadcast-task

 (lambda ()

   (defparameter mnist-dim 784)

   (defparameter mnist-n-class 10)

   (multiple-value-bind (datamat target)

      (cl-random-forest.utils:read-data "/mapr/cluster1/mnist/mnist.scale" mnist-dim)

     (defparameter mnist-datamatrix datamat)

     (defparameter mnist-target target))

   (loop for i from 0 below (length mnist-target) do

     (incf (aref mnist-target i)))))

;; build model

(time

  (lfarm:broadcast-task

   (lambda ()

     ;;multi-thread conf

     (setf lparallel:*kernel* (lparallel:make-kernel 8))

     (defparameter mnist-forest

       (cl-random-forest:make-forest mnist-n-class mnist-datamatrix mnist-target

                 :n-tree 1000 :bagging-ratio 0.1 :max-depth 10 :n-trial 10 :min-region-samples 5)))))

;; Predict

(defun predict-forest-multinode (index)

  (let* ((result (lfarm:broadcast-task

                  `(lambda ()

                     (clrf::class-distribution-forest mnist-forest mnist-datamatrix ,index))))

         (n-node (length result))

         (n-dim (length (aref result 0)))

         (class-dist (make-array n-dim :element-type 'double-float :initial-element 0d0)))

    (print result)

    (loop for res across result do

      (loop for i from 0 below n-dim

            for elem across res do

              (incf (aref class-dist i) elem)))

    (loop for i from 0 below n-dim do

      (setf (aref class-dist i) (/ (aref class-dist i) n-node)))

    class-dist))

(predict-forest-multinode 0)

(predict-forest-multinode 1)

Most of the code is the same as in the single-node case, except for the lfarm calls. We did not need to modify anything in the original cl-random-forest; lfarm alone distributes the training.
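As a small follow-up (a sketch, not from the original post), the averaged class distribution returned by predict-forest-multinode can be turned into a class label by taking the index of the largest value:

;; Sketch: turn the averaged class distribution into a predicted class label.
(defun predict-class-multinode (index)
  (let ((dist (predict-forest-multinode index)))
    ;; index of the largest averaged probability = predicted class
    (position (reduce #'max dist) dist)))

;; Example: predicted classes for the first two training samples
(predict-class-multinode 0)
(predict-class-multinode 1)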

This run uses 4 distributed nodes, each with its own multi-threaded lparallel kernel:

... (skip)

;; Connect to the servers

(setf lfarm:*kernel* (lfarm:make-kernel '(("node1" 11524) ("node2" 11524) ("node3" 11524) ("node4" 11524))))

... (skip)

 (lfarm:broadcast-task

   (lambda ()

     ;;multi-thread conf

     (setf lparallel:*kernel* (lparallel:make-kernel 4))

     (defparameter mnist-forest

       (cl-random-forest:make-forest mnist-n-class mnist-datamatrix mnist-target

                 :n-tree 1000 :bagging-ratio 0.1 :max-depth 10 :n-trial 10 :min-region-samples 5))))

... (skip)

- lfarm:make-kernel: connects the client to the 4 worker nodes used for training

- lparallel:make-kernel: sets the number of worker threads used on each node

With these options, the training time looks like this:

(LFARM-CLIENT.KERNEL:BROADCAST-TASK (LAMBDA NIL (SETF LPARALLEL.KERNEL:*KERNEL* (LPARALLEL.KERNEL:MAKE-KERNEL 4)) (DEFPARAMETER MNIST-FOREST (MAKE-FOREST MNIST-N-CLASS MNIST-DATAMATRIX MNIST-TARGET :N-TREE 1000 :BAGGING-RATIO 0.1 :MAX-DEPTH 10 :N-TRIAL 10 :MIN-REGION-SAMPLES 5))))

took 23,067,552 microseconds (23.067553 seconds) to run.

During that period, and with 8 available CPU cores,

 3,976 microseconds (0.003976 seconds) were spent in user mode

   390 microseconds (0.000390 seconds) were spent in system mode

2,944 bytes of memory allocated.

4 minor page faults, 0 major page faults, 0 swaps.

#(MNIST-FOREST MNIST-FOREST MNIST-FOREST MNIST-FOREST)

It took 23,067,552 microseconds (about 23 seconds) to build a Random Forest with 4,000 decision trees (1,000 trees on each of the 4 nodes).

Note that the user and system times reported above are tiny because the client merely waits while the worker nodes do the training; the wall-clock time is what matters. Compared with the single-node run, distributing the same 4,000-tree forest across 4 nodes cut the training time from roughly 100 seconds to roughly 23 seconds.

What’s Next?

In the next blog, we will compare the performance of the implementation across multiple datasets and cluster sizes and extrapolate the performance gains of a distributed implementation.


This blog post was published July 30, 2018.