Apache Accumulo is a popular BigTable-like framework created by the NSA and open-sourced as an Apache project. We’ve previously blogged about using Accumulo 1.4 with MapR, and thought now was a good time to update the post with the latest versions of Accumulo and MapR.
By using MapR and Accumulo together, Accumulo users get all the benefits of the MapR Distribution for Apache Hadoop. Accumulo users inherit the same strong dependability of the MapR platform that users of HBase and MapR tables have enjoyed for a long time. For example, snapshots provide consistent point-in-time recovery in the event of user and application errors, while mirrors provide disaster recovery and backup for Accumulo-based applications.
In brief, here are the MapR-specific steps to install Accumulo on your existing MapR cluster. These instructions assume that you follow the official Accumulo installation documentation. This document only addresses the MapR-specific pieces. After the summary, we’ll go into more details about each of the steps.
The steps in this article are for running Accumulo 1.6 on MapR 3.1. The steps should be similar with MapR 3.0.x.
Perform the following steps.
1) Download Accumulo 1.6.0 or later and untar the tar file under /opt/accumulo. This will create
2) Create a MapR volume for Accumulo and mount it under /accumulo. You can use the MCS console or execute this command:
maprcli volume create -name project.accumulo -path /accumulo
3) Disable compression for Accumulo data using this command:
hadoop mfs -setcompression off /accumulo
4) Create an Accumulo-specific “shadow tree” for Hadoop so we can disable read/write caching.
cd /opt/accumulo/accumulo-1.6.0
mkdir hadoop
mkdir hadoop/hadoop-0.20.2
cd hadoop/hadoop-0.20.2
ln -s /opt/mapr/hadoop/hadoop-0.20.2/* .
rm conf
mkdir conf
cd conf
ln -s /opt/mapr/hadoop/hadoop-0.20.2/conf/* .
cp core-site.xml t
mv t core-site.xml
5) Edit /opt/accumulo/accumulo-1.6.0/hadoop/hadoop-0.20.2/conf/core-site.xml and add this:
<property>
  <name>fs.mapr.readbuffering</name>
  <value>false</value>
</property>
<property>
  <name>fs.mapr.aggregate.writes</name>
  <value>false</value>
</property>
6) Create a few symbolic links under /opt/mapr/hadoop.
cd /opt/mapr/hadoop/hadoop-0.20.2/lib
ln -s /opt/mapr/lib/json-20080701.jar .
ln -s /opt/mapr/lib/libprotodefs.jar .
7) Edit warden.conf to leave space for Accumulo (refer to Accumulo docs for amount needed). In this example we are assuming 2GB is needed. Change this:
service.command.os.heapsize.max=2750 (from 750)
service.command.os.heapsize.min=2256 (from 256)
Note: you may also wish to configure Accumulo as a first class service that is managed by MapR’s warden. The general instructions for doing so can be found here: http://doc.mapr.com/display/MapR/warden.%3Cservicename%3E.conf.
8) Create appropriate initial configuration files in /opt/accumulo/accumulo-1.6.0/conf by following the Accumulo installation instructions for copying from the examples.
9) Edit accumulo-env.sh. Follow the Accumulo documentation for this, but take note of these two MapR related settings:
10) Edit accumulo-site.xml. Edit the stanza for zookeeper (the Zookeeper location information can be found in warden.conf):
<property>
  <name>instance.zookeeper.host</name>
  <value>host1:5181,host2:5181,host3:5181</value>
  <description>comma separated list of zookeeper servers</description>
</property>
Add this stanza to change the Accumulo tablet server port:
<property>
  <name>tserver.port.client</name>
  <value>9996</value>
</property>
Add these two stanzas to enable MapR features:
<property>
  <name>master.walog.closer.implementation</name>
  <value>org.apache.accumulo.server.master.recovery.MapRLogCloser</value>
</property>
<property>
  <name>tserver.wal.blocksize</name>
  <value>562M</value>
</property>
At this point, you can complete the installation and configuration of Accumulo on a single node in a MapR cluster by continuing to follow Accumulo installation instructions. Perform the same steps on additional nodes when you put Accumulo on them.
Now that we’ve shown you the steps to configure Accumulo and MapR, we are going to explain the reasoning behind these settings. For each previous step, we provide more detailed information here:
Step #2: Create a volume for Accumulo
We recommend creating a volume for Accumulo data and mounting it at /accumulo which is the default Accumulo location. By creating a volume just for Accumulo, you’ll be able to leverage the many volume management features of MapR (such as snapshots, mirroring, and quotas) with Accumulo data. For example, this makes it possible for you to easily schedule snapshots of your Accumulo data and mirrored replication of that same data for enhanced protection.
Step #3: Disable compression
Recall that by default MapR transparently compresses all data, which greatly reduces storage requirements and in general improves performance by reducing I/O requirements. However, database-like applications such as HBase already compress their own data, so the transparent compression duplicates that work and should be disabled. Accumulo is no different.
Steps #4 and #5: Create an Accumulo-specific “shadow tree” for Hadoop
As with HBase, MapR’s transparent read caching and write aggregation are not appropriate for Accumulo, since database-like systems tend to have fairly random data access profiles (as opposed to sequential access). Therefore, read caching and write aggregation should be disabled by setting the properties shown earlier.
Ordinarily these properties would be set in a localized site configuration file such as accumulo-site.xml since the total Hadoop configuration is normally a merger of the core-site.xml file and every subsidiary configuration file. This is what we do with HBase (in fact you’ll see those exact same properties set in hbase-site.xml in MapR). Unfortunately, Accumulo does not pass properties found in accumulo-site.xml to the parent Apache Hadoop configuration layers. As a result, we have to do something a bit cleverer. Essentially we will create a custom Hadoop core-site.xml that is used only by Accumulo—we want the rest of your cluster to be able to take advantage of the MapR transparent caching and aggregation.
To ensure that the Hadoop runtime used by Accumulo picks up this Accumulo-specific core-site.xml file, there are two options. In the first option, you would edit the Accumulo classpath settings to remove the normal Hadoop conf directory from the classpath (removing $HADOOP_HOME/conf). These settings are in the accumulo-site.xml file, specified as the general.classpaths property. You could then copy the modified core-site.xml file to the Accumulo conf directory and rely on the usual Hadoop classpath-based search for core-site.xml to pick up the correct file. Note that other Hadoop conf files are not in the Accumulo conf directory, so you may have to copy them as well, depending on the scenario.
The second approach (and the one chosen here) is to create a shadow Hadoop tree that is identical to the real MapR-defined Hadoop tree except for this custom configuration file. Symbolic links make this work: as the commands in step 4 show, everything in the shadow tree is just a symbolic link into the real Hadoop tree except for the core-site.xml file.
Once we have the shadow tree with only a real core-site.xml file (everything else being links), we simply edit core-site.xml as described earlier to add the two properties. As is usual with Hadoop, make sure you perform these steps on every node that is running an Accumulo server component (or just copy the changes).
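As a rough illustration of what the shadow tree should look like once it is built, the sketch below repeats the step 4 command sequence against a throwaway directory and then checks which entries in the shadow conf directory are real files rather than links. The temp-directory layout and the sample file names are stand-ins invented for this example, not the real /opt/mapr or /opt/accumulo paths.

```shell
#!/bin/sh
# Sketch only: build a mock "real" Hadoop tree and a shadow copy of it,
# then confirm core-site.xml is the only non-link in the shadow conf dir.
# Every path here lives under a temp directory and is purely illustrative.
set -eu

work=$(mktemp -d)
real="$work/real/hadoop-0.20.2"      # stand-in for the MapR Hadoop tree
shadow="$work/shadow/hadoop-0.20.2"  # stand-in for the Accumulo shadow tree

# Mock contents of the real tree.
mkdir -p "$real/conf" "$real/lib" "$real/bin"
echo '<configuration></configuration>' > "$real/conf/core-site.xml"
echo '<configuration></configuration>' > "$real/conf/mapred-site.xml"

# Same sequence as step 4: link everything, replace the conf link with a
# real directory of links, then turn core-site.xml into a private copy.
mkdir -p "$shadow"
cd "$shadow"
ln -s "$real"/* .
rm conf
mkdir conf
cd conf
ln -s "$real"/conf/* .
cp core-site.xml t
mv t core-site.xml

# Collect the names of conf entries that are NOT symbolic links.
non_links=""
for f in "$shadow"/conf/*; do
  [ -L "$f" ] || non_links="$non_links $(basename "$f")"
done
non_links=${non_links# }

echo "non-link entries in shadow conf: $non_links"
```

Running the same loop against the real shadow conf directory on a node is a quick way to confirm that nothing other than core-site.xml was accidentally copied instead of linked.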
Note that a side effect of either approach is that if you edit the real core-site.xml file, those changes are not visible to Accumulo. You’ll need to define a process to ensure that the two copies of core-site.xml are consistent.
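One possible way to notice that the two copies have drifted, offered only as a sketch, is to record a checksum of the real core-site.xml when the shadow copy is made and compare against it later. The stamp file and the helper names record_base and is_in_sync are hypothetical, and the files live in a temp directory so the example runs anywhere.

```shell
#!/bin/sh
# Sketch only: detect edits to the cluster-wide core-site.xml made after
# the shadow copy was taken. File locations are temp-dir stand-ins.
set -eu

work=$(mktemp -d)
real_conf="$work/real-core-site.xml"   # stands in for the MapR copy
stamp="$work/core-site.base.cksum"     # hypothetical stamp file

echo '<configuration></configuration>' > "$real_conf"

# Record a checksum of the real file at the moment the shadow copy is made.
record_base() { cksum < "$1" > "$2"; }

# Succeed only if the real file still matches the recorded checksum.
is_in_sync() { [ "$(cksum < "$1")" = "$(cat "$2")" ]; }

record_base "$real_conf" "$stamp"
if is_in_sync "$real_conf" "$stamp"; then before=in-sync; else before=drifted; fi

# Simulate someone editing the real file later.
echo '<!-- edited -->' >> "$real_conf"
if is_in_sync "$real_conf" "$stamp"; then after=in-sync; else after=drifted; fi

echo "before edit: $before, after edit: $after"
```

A check like this only tells you that the real file changed; merging the change into the Accumulo copy while keeping the two MapR properties is still a manual step.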
Step #6: Create symbolic links
These symbolic links work around a minor bug in the MapR classpath configuration by making json-20080701.jar and libprotodefs.jar visible in the Hadoop lib directory. This will be fixed in the next MapR release.
Step #7: Edit warden.conf to leave space for Accumulo
We need to ensure that Warden is aware of the resources used by Accumulo. Recall that MapR Warden automatically starts, stops, and manages the resources used by the various Hadoop components. In particular, the Warden takes into account the expected memory utilization of each component so that memory is used appropriately. Since the Warden does not know about Accumulo, we need to change the warden.conf file on each node running Accumulo so that the Warden leaves sufficient memory for it. If you look in the warden.conf file, you’ll see a number of settings related to heap size for each service. Since there is currently no explicit option for Accumulo, we instead tell the Warden that the operating system is consuming additional memory that it cannot use.
If you look at the default warden.conf, you’ll see these properties with these values:
service.command.os.heapsize.max=750
service.command.os.heapsize.min=256
Increase the max and min values to take into account the expected memory usage of the Accumulo servers on nodes that will run them. For example, if you expect that the Accumulo server processes will consume 2GB of memory, you would change the values to this:
service.command.os.heapsize.max=2750
service.command.os.heapsize.min=2256
We are not claiming to be experts in Accumulo sizing. Therefore, you will need to determine yourself what Accumulo processes are running on each node and what their expected memory utilization is, and then update warden.conf appropriately.
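The arithmetic for this step is simple enough to script. The sketch below adds an Accumulo reservation to the two default values quoted in step 7; the variable names are made up for the example, and the 2GB reservation is written as 2000 MB because that is what the 750 to 2750 change in step 7 implies.

```shell
#!/bin/sh
# Sketch only: compute new warden OS heap values from the defaults plus
# an Accumulo memory reservation. All values are in MB.
set -eu

default_max=750     # default service.command.os.heapsize.max (from step 7)
default_min=256     # default service.command.os.heapsize.min (from step 7)
accumulo_mb=2000    # assumed reservation; size this for your own nodes

new_max=$((default_max + accumulo_mb))
new_min=$((default_min + accumulo_mb))

echo "service.command.os.heapsize.max=$new_max"
echo "service.command.os.heapsize.min=$new_min"
```

With the values above, the script prints the same two lines shown in step 7 (2750 and 2256). Substitute your own estimate for accumulo_mb on each node.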
Step #9: Edit accumulo-env.sh
The Accumulo documentation explains the meaning of the values in this file and the need to change them. We just want to point out two things that are relevant to MapR.
First, since we’ve created a shadow Hadoop tree, you’ll want to point Accumulo to that tree rather than the cluster default under /opt/mapr/hadoop. Second, MapR includes a Zookeeper client on every node, and that is all Accumulo actually requires. There is no need for a separate Zookeeper install. As such, the Zookeeper “home” is really just the location of the Zookeeper client library: /opt/accumulo/accumulo-1.6.0/hadoop/hadoop-0.20.2/lib.
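In accumulo-env.sh, those two points might look something like the excerpt below. Treat it as a sketch: the variable names HADOOP_HOME and ZOOKEEPER_HOME are assumptions based on the names used in this article, and your copy of accumulo-env.sh may use a different name for the Hadoop location (for example HADOOP_PREFIX), so match whatever the example file in your Accumulo release uses. ACCUMULO_SHADOW is a helper variable introduced only for this example.

```shell
# Hypothetical accumulo-env.sh excerpt. Variable names are assumptions;
# check the example accumulo-env.sh shipped with your Accumulo version.

# Helper for this example only: root of the shadow Hadoop tree from step 4.
ACCUMULO_SHADOW=/opt/accumulo/accumulo-1.6.0/hadoop/hadoop-0.20.2

# Point Accumulo at the shadow Hadoop tree, not /opt/mapr/hadoop.
export HADOOP_HOME="$ACCUMULO_SHADOW"

# The Zookeeper "home" is just the directory holding the client library.
export ZOOKEEPER_HOME="$ACCUMULO_SHADOW/lib"
```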
Step #10: Edit accumulo-site.xml
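The MapR advanced file system layer (MapR XD) has some internal behaviors which are not quite the same as HDFS. The two stanzas that enable MapR features, master.walog.closer.implementation and tserver.wal.blocksize, account for those differences.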
The Accumulo documentation explains the meaning of the remaining values in this file and the need to change them. We just point out how to determine those values with MapR.
The Zookeeper endpoints are defined in warden.conf and can easily be found by looking for the value of the zookeeper.servers property. In addition, the default port for tablet servers in Accumulo conflicts with a MapR default port. Therefore, we recommend changing the tablet server port.
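Pulling the Zookeeper endpoints out of warden.conf can be done with a one-line filter. The sketch below runs against a small sample file written to a temp directory so that it is self-contained; the sample lines and the zk_servers helper name are invented for the example, and on a real node you would pass the path to your actual warden.conf instead.

```shell
#!/bin/sh
# Sketch only: read the zookeeper.servers value out of a warden.conf-style
# properties file. The sample file is illustrative, not a full warden.conf.
set -eu

work=$(mktemp -d)
conf="$work/warden.conf"   # stand-in for the node's real warden.conf
cat > "$conf" <<'EOF'
# sample lines only
zookeeper.servers=host1:5181,host2:5181,host3:5181
service.command.os.heapsize.max=750
EOF

# Print everything after "zookeeper.servers=" on the matching line.
zk_servers() {
  sed -n 's/^zookeeper\.servers=//p' "$1"
}

zk=$(zk_servers "$conf")
echo "instance.zookeeper.host value: $zk"
```

The printed value is what goes into the instance.zookeeper.host stanza in accumulo-site.xml.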
We hope that you find running Accumulo on MapR to be an excellent pairing of the many enterprise features of MapR with the function and power of Accumulo. Please let us know what you think—we are listening.