Drill serves as a query layer that connects to data sources through storage plugins. A storage plugin is a software module for connecting Drill to data sources. A storage plugin typically optimizes execution of Drill queries, provides the location of the data, and configures the workspace and file formats for reading data. Several storage plugins are installed with Drill that you can configure to suit your environment. Through a storage plugin, Drill connects to a data source, such as a database, a file on a local or distributed file system, or a Hive metastore.
You can modify the default configuration X of a storage plugin and give the new configuration a unique name Y. This document refers to Y as a different storage plugin, although it is actually just a reconfiguration of the original interface.
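The relationship between X and Y can be sketched as a copy-and-rename operation. This is only an illustration: the dictionary contents, the registry, and the name dfs_test are hypothetical and are not taken from a real Drill installation.

```python
import copy

# Hypothetical default configuration X (contents are illustrative only).
default_dfs = {
    "type": "file",
    "enabled": True,
    "connection": "maprfs:///",
    "workspaces": {"root": {"location": "/", "writable": False}},
}


def derive_configuration(base, **overrides):
    """Return a modified deep copy of `base`; the original stays untouched."""
    derived = copy.deepcopy(base)
    derived.update(overrides)
    return derived


# Registry keyed by unique configuration name: X stays "dfs", Y is "dfs_test".
registry = {"dfs": default_dfs}
registry["dfs_test"] = derive_configuration(default_dfs, connection="file:///")
```

Both entries share the same plugin type, which is why Y is a reconfiguration rather than a new plugin.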
On the Storage tab of the Web Console, you can view and reconfigure a storage plugin, assuming you have permission. You can access each node running a Drillbit from a browser by starting the Drill Web Console.
Storage plugin configurations are available for the following data sources:
The Apache Drill documentation describes the attributes and definitions you configure for storage plugins, except for the MapR-DB format, which is included only with Drill for MapR. The MapR-DB format is described later in this documentation.
The Web Console includes some default storage plugin configurations. The following table lists the default configurations and their descriptions:
|Points to a JAR file in the Drill classpath that contains the Transaction Processing Performance Council (TPC) benchmark schema TPC-H that you can query.|
|Points to MapR-FS by default. Drill automatically configures this instance when you install Drill in a MapR cluster, but you can configure the instance to point to any distributed file system, such as a Hadoop or S3 file system.|
|Provides a connection to HBase/M7.|
|Integrates Drill with the Hive metadata abstraction of files, HBase/M7, and libraries to read data and operate on SerDes and UDFs.|
|Provides a connection to MongoDB data.|
When you add or update a storage plugin configuration on one Drill node in a Drill cluster, Drill broadcasts the information to all of the other Drill nodes. All nodes have identical storage plugin configurations. You do not need to restart any Drillbits when you add or update a storage plugin configuration.
Configuring Storage Plugin Configurations
You can add, remove, or update Drill storage plugin configurations using the Web Console:
If you click Update next to dfs, the following default configuration appears:
The dfs configuration includes the storage plugin type, connection information, default workspaces, and file formats that the data source supports. You can add and remove workspaces and file formats.
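As a minimal sketch of adding a workspace and removing a file format, the following Python code edits a dfs-style configuration held as a dictionary and serializes it to JSON. The configuration values, the logs workspace, and the helper functions are assumptions made for illustration. They do not reproduce the actual default configuration.

```python
import json

# Illustrative dfs-style configuration; values are assumptions, not a real default.
dfs_config = {
    "type": "file",
    "enabled": True,
    "connection": "maprfs:///",
    "workspaces": {
        "root": {"location": "/", "writable": False, "defaultInputFormat": None},
    },
    "formats": {
        "json": {"type": "json"},
        "parquet": {"type": "parquet"},
    },
}


def add_workspace(config, name, location, writable=False, default_format=None):
    """Add a workspace entry; refuse to overwrite an existing one."""
    if name in config["workspaces"]:
        raise ValueError(f"workspace {name!r} already exists")
    config["workspaces"][name] = {
        "location": location,
        "writable": writable,
        "defaultInputFormat": default_format,
    }


def remove_format(config, name):
    """Remove a file format entry if it is present."""
    config["formats"].pop(name, None)


add_workspace(dfs_config, "logs", "/data/logs", default_format="json")
remove_format(dfs_config, "parquet")

# JSON text in the general shape shown on the Storage tab.
config_json = json.dumps(dfs_config, indent=2)
```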
Changing the Connection Attribute
You can also change the connection if you want the configuration to point to a different cluster.
By default, Drill connects to the cluster that the Drill node belongs to. You do not need to modify the connection unless you want to connect Drill to a different cluster. To connect to a different cluster, edit the connection to include the name of the cluster that you want to connect to.
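The connection edit can be sketched as follows. The URI form maprfs://<cluster-name>/ and the cluster name my.cluster.com are assumptions used for illustration. Check the connection string that your own cluster expects before applying a change like this.

```python
def point_to_cluster(config, cluster_name):
    """Return a copy of `config` whose connection names a specific cluster.

    Assumption: the connection URI takes the form maprfs://<cluster-name>/.
    """
    if not cluster_name or "/" in cluster_name:
        raise ValueError("cluster name must be a non-empty name without '/'")
    updated = dict(config)  # shallow copy suffices: only a top-level key changes
    updated["connection"] = f"maprfs://{cluster_name}/"
    return updated


# Default: no cluster named, so the local cluster is used.
local = {"type": "file", "enabled": True, "connection": "maprfs:///"}
remote = point_to_cluster(local, "my.cluster.com")
```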
Using the MapR-DB Format
The MapR-DB format is defined within the default dfs storage plugin configuration when you install Drill from the mapr-drill package on a MapR node. The maprdb format improves the row-count estimates that Drill uses to plan a query. It also enables you to query tables as you would query files in a file system because MapR-DB and MapR-FS share the same namespace.
You can query tables stored across multiple directories. You do not need to create a table mapping to a directory before you query a table in the directory. You can select from any table in any directory the same way you would select from files in MapR-FS, using the same syntax.
Instead of including the name of a file, you include the table name in the query. The userid running the query must have read permission to access the MapR table.
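To illustrate that a table is addressed with the same syntax as a file, the sketch below builds both kinds of query from a path. The helper function and both paths are hypothetical. Only the backtick-quoted path after the storage plugin name follows standard Drill query syntax.

```python
def table_query(path, plugin="dfs", limit=10):
    """Build a Drill SELECT that addresses a file or table by its path."""
    if "`" in path:
        raise ValueError("path must not contain a backtick")
    return f"SELECT * FROM {plugin}.`{path}` LIMIT {int(limit)}"


# Same syntax for a file and for a table in the shared namespace.
file_sql = table_query("/data/logs/2015/events.json")
table_sql = table_query("/tables/customers")
```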
The following image shows a portion of the
dfs configuration with the
Configuring the HBase Storage Plugin
The hbase storage plugin configuration, which is included in the Apache Drill and Drill on MapR distributions, needs to be configured for use on a MapR cluster. To point the hbase storage plugin configuration to MapR-DB, set the size.calculator.enabled parameter in the configuration to false.
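A minimal sketch of that change, assuming the parameter is a top-level boolean attribute of the configuration. The entries under "config" are placeholders, not values for a particular cluster.

```python
import json

# Illustrative hbase-type configuration; the "config" entries are assumptions.
hbase_config = {
    "type": "hbase",
    "config": {
        "hbase.zookeeper.quorum": "localhost",
        "hbase.zookeeper.property.clientPort": "2181",
    },
    "size.calculator.enabled": True,
    "enabled": True,
}


def disable_size_calculator(config):
    """Return a copy of `config` with size.calculator.enabled turned off."""
    updated = dict(config)
    updated["size.calculator.enabled"] = False
    return updated


maprdb_ready = disable_size_calculator(hbase_config)
print(json.dumps(maprdb_ready, indent=2))  # the flag serializes as JSON false
```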
Configuring the Hive Storage Plugin
MapR Drill supports all the Hive versions supported by MapR (Hive 0.13, 1.0, and 1.2). Drill can work with only one version of Hive on a given cluster. To query across multiple versions of Hive from Drill, install each version of Hive on a separate cluster. For example, suppose you have Drill and Hive 0.13 deployed in a production cluster, while a customer is testing Hive 1.0 on a test cluster. Drill can query data from Hive tables on the test cluster as well as Hive tables on the production cluster. To do so, define separate storage plugin configurations, each corresponding to the metastore of a specific Hive version, as described in the next section.
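The production and test scenario could be sketched as two Hive configurations that differ only in the metastore they point to. The host names, the configuration names hive_prod and hive_test, and the use of port 9083 (the usual Hive metastore Thrift port) are assumptions for illustration.

```python
def hive_plugin_config(metastore_host, port=9083):
    """Build a hive-type configuration for one remote metastore."""
    return {
        "type": "hive",
        "enabled": True,
        "configProps": {
            "hive.metastore.uris": f"thrift://{metastore_host}:{port}",
        },
    }


# One uniquely named configuration per cluster (hypothetical host names).
plugins = {
    "hive_prod": hive_plugin_config("metastore.prod.example.com"),
    "hive_test": hive_plugin_config("metastore.test.example.com"),
}

# A query would then qualify tables by configuration name, for example
# hive_prod.`orders` versus hive_test.`orders` (hypothetical table names).
```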
To access Hive tables using custom SerDes or InputFormat/OutputFormat, all nodes running Drillbits must have the SerDes or InputFormat/OutputFormat JAR files in the following location:
Hive Remote Metastore Configuration
In the remote Hive metastore configuration, the metastore runs as a separate service outside of Hive. The metastore service communicates with the Hive database over JDBC. To connect Drill to Hive, point Drill to the Hive metastore service address and provide the connection parameters in a Hive storage plugin configuration. In the following procedure, you change the default Hive storage plugin configuration to match your MapR-FS environment.
Verify that Hive is running.
- Issue the following command to start the Hive metastore service on the system specified in the
hive --service metastore
Start the Drill Web Console.
Select the Storage tab. If Web Console security is enabled, you must have administrator privileges to perform this step.
- In the list of disabled storage plugins in the Drill Web Console, click Update next to
Update these Hive storage plugin parameters to match the location of the Hive metastore URI, version, and location of Hive you are using:
- "jdbc:<database>://<host:port>/<metastore database>"
- Change the default location of files to suit your environment. For example, change "fs.default.name": "file:///" to the MapR-FS location:
- To run Drill and Hive in a secure MapR cluster, do the following tasks:
- Remove the following line from the configuration:
"hive.metastore.sasl.enabled" : "false"
- Click Enable in the Web Console to enable the Hive storage plugin.
The Hive storage plugin configuration is disabled by default.
Add the following line to <DRILL_HOME>/conf/drill-env.sh on each Drill node and then restart the Drillbit service:
Hive to Drill Type Mapping
Using Drill, you can read tables created in Hive that use the data types listed in the Hive-to-Drill type mapping table. The Hive version used in MapR supports the Hive timestamp in Unix Epoch format. Currently, the Apache Hive version used by Drill does not support this timestamp format. The workaround is to use the JDBC format for the timestamp, which Hive accepts and Drill uses, as shown in the type mapping example.
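When preparing data for such a table, a Unix Epoch value can be rewritten in the JDBC timestamp layout (yyyy-mm-dd hh:mm:ss) before loading. The following is a minimal sketch. The helper name is hypothetical, and formatting in UTC is an assumption that may not match the time zone handling of your Hive deployment.

```python
from datetime import datetime, timezone


def epoch_to_jdbc_timestamp(epoch_seconds):
    """Format Unix epoch seconds as a JDBC-style timestamp string in UTC."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


print(epoch_to_jdbc_timestamp(0))           # 1970-01-01 00:00:00
print(epoch_to_jdbc_timestamp(1420070400))  # 2015-01-01 00:00:00
```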