MapR Hadoop features the complete Hadoop distribution including components such as Hive and HBase. There are a few things to know about migrating Hive and HBase, or about migrating custom components you have patched yourself.
Hive facilitates the analysis of large datasets stored in the Hadoop filesystem by organizing that data into tables that can be queried and analyzed using a dialect of SQL called HiveQL. The schemas that define these tables and all other Hive metadata are stored in a centralized repository called the metastore.
If you want to keep using Hive tables that were developed on an HDFS cluster after you move to a MapR cluster, you can import Hive metadata from the metastore to recreate those tables in MapR. Depending on your needs, you can import a subset of table schemas or the entire metastore in one operation.
Importing table schemas into a MapR cluster
Use this procedure to import a subset of the Hive metastore from an HDFS cluster to a MapR cluster. This method is preferred when you want to test a subset of applications using a smaller subset of data.
Use the following procedure to import Hive metastore data into a new metastore running on a node in the MapR cluster. You will need to redirect all of the links that formerly pointed to HDFS (
hdfs://<namenode>:<port number>/<path>) to point to MapR-FS (
Importing an entire Hive metastore into a MapR cluster
Use this procedure to import an entire Hive metastore from an HDFS cluster to a MapR cluster. This method is preferred when you want to test all applications using a complete dataset. MySQL is a popular choice for the Hive metastore, so it is used as the example here. If you are using another RDBMS, consult the relevant documentation.
- Ensure that both Hive and your database are installed on one of the nodes in the MapR cluster. For step-by-step instructions on setting up a standalone MySQL metastore, see Setting Up Hive with a MySQL Metastore.
On the HDFS cluster, back up the metastore to a file.
- Ensure that queries in the dumpfile point to the MapR-FS rather than HDFS. Search the dumpfile and edit all of the URIs that point to
hdfs:// so that they point to
Import the data from the dumpfile into the metastore running on the node in the MapR cluster:
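The backup, URI rewrite, and import steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the procedure MapR ships: the function names, the `metastore` database name, and the `maprfs://` target prefix used in the usage note are hypothetical and chosen only for illustration. `mysqldump` and `mysql` are the standard MySQL command-line clients.

```python
import re

# Matches the scheme and authority of an HDFS URI, for example
# "hdfs://namenode:8020", and stops before the path component.
HDFS_URI_PREFIX = re.compile(r"hdfs://[^/\s'\"]*")


def rewrite_dump_uris(dump_text, target_prefix):
    """Replace every hdfs://<namenode>:<port> prefix with target_prefix.

    The path after the authority is left unchanged. A function is used as
    the replacement so that target_prefix is always inserted literally.
    """
    return HDFS_URI_PREFIX.sub(lambda _match: target_prefix, dump_text)


def backup_command(user, database):
    """Argument list for dumping the metastore; output goes to stdout."""
    return ["mysqldump", "-u", user, "-p", database]


def import_command(user, database):
    """Argument list for loading a dumpfile supplied on stdin."""
    return ["mysql", "-u", user, "-p", database]
```

One possible way to use these helpers is to run `backup_command(...)` with `subprocess.run` on the HDFS cluster and redirect standard output to a dumpfile, pass the file contents through `rewrite_dump_uris(text, "maprfs://")`, and then feed the edited file to `import_command(...)` on the MapR node. Review the rewritten dumpfile before importing it, because the right target prefix depends on how your cluster is addressed.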
Using Hive with MapR volumes
MapR-FS does not allow moving or renaming across volume boundaries. Be sure to set the Hive Scratch Directory and Hive Warehouse Directory in the same volume where the data for the Hive job resides before running the job. For more information see Using Hive with MapR Volumes.
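As a rough illustration of the same-volume requirement, the sketch below checks that the scratch and warehouse directories both sit under one volume mount path and then renders the two matching `hive-site.xml` properties. `hive.exec.scratchdir` and `hive.metastore.warehouse.dir` are standard Hive property names; the helper names and the `/hive-data` mount path in the usage note are hypothetical. The check is purely path-based and assumes no other volume is mounted beneath the given mount path.

```python
from pathlib import PurePosixPath


def in_same_volume(volume_mount, *directories):
    """Return True if every directory is the mount path or lies beneath it."""
    root = PurePosixPath(volume_mount)
    return all(
        root == PurePosixPath(d) or root in PurePosixPath(d).parents
        for d in directories
    )


def hive_site_properties(scratch_dir, warehouse_dir):
    """Render the two directory settings as hive-site.xml property elements."""
    settings = {
        "hive.exec.scratchdir": scratch_dir,
        "hive.metastore.warehouse.dir": warehouse_dir,
    }
    return "\n".join(
        f"<property>\n  <name>{name}</name>\n  <value>{value}</value>\n</property>"
        for name, value in settings.items()
    )
```

For example, `in_same_volume("/hive-data", "/hive-data/tmp", "/hive-data/warehouse")` passes, while a scratch directory such as `/tmp/hive` would fail the check and should be moved before the job runs.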
HBase is the Hadoop database, which provides random, real-time read/write access to very large datasets. The MapR Hadoop distribution includes HBase and is fully integrated with MapR enhancements for speed, usability, and dependability. MapR provides a volume (normally mounted at
/hbase) to store HBase data.
- HBase bulk load jobs: If you are currently using HBase bulk load jobs to import data into the HDFS, make sure to load your data into a path under the
- Compression: The HBase write-ahead log (WAL) writes many tiny records, and compressing it would cause massive CPU load. Before using HBase, turn off MapR compression for directories in the HBase volume.
If you have applied your own patches to a component and wish to continue to use that customized component with the MapR distribution, you should keep the following considerations in mind:
- MapR libraries: All Hadoop components must point to MapR for the Hadoop libraries. Change any absolute paths. Do not hardcode
maprfs:// into your applications. This is also true of Hadoop ecosystem components that are not included in the MapR Hadoop distribution (such as Cascading). For more information see Working with MapR-FS.
- Component compatibility: Before you commit to the migration of a customized component (for example, customized HBase), check the MapR release notes to see if MapR Technologies has issued a patch that satisfies your business requirements. MapR Technologies publishes a list of Hadoop common patches and MapR patches with each release and makes those patches available for customers to take, build, and deploy. For more information, see the MapR Release Notes.
- ZooKeeper coordination service: Certain components, such as HBase, depend on ZooKeeper. When you migrate your customized component from the HDFS cluster to the MapR cluster, make sure it points correctly to the MapR ZooKeeper service.
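For the MapR libraries consideration above, one way to locate hardcoded filesystem URIs in a customized component is to scan its source and configuration text before migrating. The sketch below is a hypothetical helper, not a MapR tool: it reports each `hdfs://` or `maprfs://` URI together with its line number so that it can be reviewed and replaced with a path that relies on the cluster's configured default filesystem.

```python
import re

# Matches hdfs:// or maprfs:// URIs up to the next whitespace, quote,
# angle bracket, or closing parenthesis.
HARDCODED_FS_URI = re.compile(r"\b(?:hdfs|maprfs)://[^\s'\"<>)]*")


def find_hardcoded_uris(source_text):
    """Return (line_number, uri) pairs for every hardcoded filesystem URI."""
    hits = []
    for line_no, line in enumerate(source_text.splitlines(), start=1):
        for match in HARDCODED_FS_URI.finditer(line):
            hits.append((line_no, match.group(0)))
    return hits
```

Running this over each file of the component gives a checklist of locations to edit. It is a plain text search, so URIs that are assembled at runtime from separate strings will not be found.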