MapR 5.0 Documentation: Connecting Drill to Data Sources

Drill serves as a query layer that connects to data sources through storage plugins. A storage plugin is a software module that connects Drill to a data source, such as a database, a file on a local or distributed file system, or a Hive metastore. A storage plugin typically optimizes execution of Drill queries, provides the location of the data, and configures the workspace and file formats for reading data. Several storage plugins are installed with Drill, and you can configure them to suit your environment.

You can modify the default configuration X of a storage plugin and give the new configuration a unique name Y. This document refers to Y as a different storage plugin, although it is actually just a reconfiguration of the original interface.

You can access the Drill Web Console from a browser on any node that runs a Drillbit; the way you start the Web Console depends on your security setup. On the Storage tab of the Web Console, you can view and reconfigure a storage plugin, assuming you have permission.

Storage plugin configurations are available for several data sources, including the file system, HBase, Hive, and MongoDB.

The Apache Drill documentation describes the attributes and definitions you configure for storage plugins, except for the MapR-DB format, which is included only with Drill for MapR. The MapR-DB format is described later in this documentation. 

The Web Console includes some default storage plugin configurations. The following list describes the default configurations:

• cp: Points to a JAR file in the Drill classpath that contains the Transaction Processing Performance Council (TPC) benchmark schema TPC-H that you can query (see the example query after this list).
• dfs: Points to MapR-FS by default. Drill automatically configures this instance when you install Drill in a MapR cluster, but you can configure the instance to point to any distributed file system, such as a Hadoop or S3 file system.
• hbase: Provides a connection to HBase/M7.
• hive: Integrates Drill with the Hive metadata abstraction of files, HBase/M7, and libraries to read data and operate on SerDes and UDFs.
• mongo: Provides a connection to MongoDB data.
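
For example, the following query reads the TPC-H sample data through the cp configuration. This is a minimal sketch: it assumes the tpch/nation.parquet sample file that ships with the Drill distribution.

SELECT * FROM cp.`tpch/nation.parquet` LIMIT 5;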

When you add or update a storage plugin configuration on one Drill node in a Drill cluster, Drill broadcasts the information to all of the other Drill nodes. All nodes have identical storage plugin configurations. You do not need to restart any Drillbits when you add or update a storage plugin configuration.

Configuring Storage Plugin Configurations

You can add, remove, or update Drill storage plugin configurations using the Web Console. The following image shows the default storage plugin configurations: 


If you click Update next to dfs, the following default configuration appears:

{
  "type": "file",
  "enabled": true,
  "connection": "maprfs:///",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [
         "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json"
    },
    "maprdb": {
      "type": "maprdb"
    }
  }
}

The dfs configuration includes the storage plugin type, connection information, default workspaces, and file formats that the data source supports. You can add and remove workspaces and file formats.
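
For example, to expose a project directory as its own workspace, you might add an entry such as the following to the workspaces section of the dfs configuration. This is a sketch only; the projects workspace name and the /user/max/projects path are hypothetical placeholders for your own environment.

    "projects": {
      "location": "/user/max/projects",
      "writable": true,
      "defaultInputFormat": null
    }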

Changing the Connection Attribute

By default, Drill connects to the cluster that the Drill node belongs to, so you do not need to modify the connection attribute unless you want the configuration to point to a different cluster. To connect to a different cluster, edit the connection to include the name of the cluster that you want to connect to.

Example: "connection": "maprfs://<cluster_name>/"

Using the MapR-DB Format

The MapR-DB format is defined within the default dfs storage plugin configuration when you install Drill from the mapr-drill package on a MapR node. The maprdb format improves the estimated number of rows that Drill uses to plan a query. It also enables you to query tables like you would query files in a file system because MapR-DB and MapR-FS share the same namespace.

You can query tables stored across multiple directories. You do not need to create a table mapping to a directory before you query a table in the directory. You can select from any table in any directory the same way you would select from files in MapR-FS, using the same syntax.

Instead of including the name of a file, you include the table name in the query. The user ID running the query must have read permission on the MapR-DB table.

SELECT * FROM dfs.`/users/max/mytable`;

The maprdb format appears in the formats section of the dfs configuration, as shown in the following excerpt from the configuration above:
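
    "maprdb": {
      "type": "maprdb"
    }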


Configuring the HBase Storage Plugin

The hbase storage plugin configuration, which is included in the Apache Drill and Drill on MapR distributions, must be modified before you use it on a MapR cluster. To point the hbase storage plugin configuration at MapR-DB, set the size.calculator.enabled parameter in the configuration to false.

Example
{
  "type": "hbase",
  "config": {
    "hbase.zookeeper.quorum": "localhost",
    "hbase.zookeeper.property.clientPort": "2181"
  },
  "size.calculator.enabled": false,
  "enabled": false
}

Configuring the Hive Storage Plugin

MapR Drill supports all the Hive versions supported by MapR (Hive 0.13, 1.0, and 1.2). Drill can work with only one version of Hive on a given cluster, so to query across multiple versions of Hive from Drill, install each version of Hive on a separate cluster. For example, suppose you have Drill and Hive 0.13 deployed in a production cluster, while a customer is testing Hive 1.0 on a test cluster. Drill can query data from Hive tables on the test cluster as well as Hive tables on the production cluster, but you need to define separate storage plugin configurations, each corresponding to the Hive metastore version on that cluster, as described in the next section.
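
For example, you might register two Hive storage plugin configurations under different names, one per cluster, each pointing at that cluster's metastore. This is a sketch only; the hive_prod and hive_test names, the host names, and the port are hypothetical and depend on your metastore setup.

    hive_prod (Hive 0.13 metastore on the production cluster):
      "hive.metastore.uris": "thrift://prod-metastore.example.com:9083"

    hive_test (Hive 1.0 metastore on the test cluster):
      "hive.metastore.uris": "thrift://test-metastore.example.com:9083"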

To access Hive tables using custom SerDes or InputFormat/OutputFormat, all nodes running Drillbits must have the SerDes or InputFormat/OutputFormat JAR files in the following location:

 <drill_installation_directory>/jars/3rdparty
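
For example, you might copy a custom SerDe JAR into that directory on each node; mycustomserde.jar is a placeholder for your own JAR file:

    cp mycustomserde.jar <drill_installation_directory>/jars/3rdparty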

Hive Remote Metastore Configuration

The remote Hive metastore runs as a separate service outside of Hive, and the metastore service communicates with the Hive database over JDBC. To configure a connection for Drill, point Drill to the Hive metastore service address and provide the connection parameters in a Hive storage plugin configuration. In the following procedure, you change the default Hive storage plugin configuration to match your MapR-FS environment.

  1. Verify that Hive is running.

  2. Issue the following command to start the Hive metastore service on the system specified in the hive.metastore.uris property:
    hive --service metastore
  3. Start the Drill Web Console.

  4. Select the Storage tab. If Web Console security is enabled, you must have administrator privileges to perform this step.

  5. In the list of disabled storage plugins in the Drill Web Console, click Update next to hive.
  6. Update the following Hive storage plugin parameters to match the Hive metastore URI and the metastore database connection for the version of Hive you are using:

    • "hive.metastore.uris"
    • "javax.jdo.option.ConnectionURL": "jdbc:<database>://<host:port>/<metastore database>"

    Default Hive Storage Plugin Definition
    {
      "type": "hive",
      "enabled": false,
      "configProps": {
      "hive.metastore.uris": "",
      "javax.jdo.option.ConnectionURL": "jdbc:<database>://<host:port>/<metastore database>",
      "hive.metastore.warehouse.dir": "/tmp/drill_hive_wh",
      "fs.default.name": "file:///",
      "hive.metastore.sasl.enabled": "false"
      }
    }

     

  7. Change the default location of files to suit your environment. For example, change "fs.default.name": "file:///" to the MapR-FS location: maprfs:/// (see the example configuration after this procedure).
  8. To run Drill and Hive in a secure MapR cluster, do the following tasks:
    1. Remove the following line from the configuration:
      "hive.metastore.sasl.enabled" : "false"
    2. Click Enable in the Web Console to enable the Hive storage plugin.
      The Hive storage plugin configuration is disabled by default.  
    3. Add the following line to <DRILL_HOME>/conf/drill-env.sh on each Drill node and then restart the Drillbit service:

      export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS -Dmapr_sec_enabled=true -Dhadoop.login=maprsasl -Dzookeeper.saslprovider=com.mapr.security.maprsasl.MaprSaslProvider -Dmapr.library.flatclass"
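
For reference, a completed Hive storage plugin configuration for a remote metastore on a non-secure MapR cluster might look like the following sketch. The host name, port, and MySQL-backed metastore database in this example are hypothetical and depend on your environment.

{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://metastore-host.example.com:9083",
    "javax.jdo.option.ConnectionURL": "jdbc:mysql://metastore-host.example.com:3306/hive",
    "hive.metastore.warehouse.dir": "/tmp/drill_hive_wh",
    "fs.default.name": "maprfs:///",
    "hive.metastore.sasl.enabled": "false"
  }
}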

Hive to Drill Type Mapping

Using Drill, you can read tables created in Hive that use the data types listed in the Hive-to-Drill type mapping table. The Hive version used in MapR supports the Hive timestamp in Unix Epoch format. Currently, the Apache Hive version used by Drill does not support this timestamp format. The workaround is to use the JDBC format for the timestamp, which Hive accepts and Drill uses, as shown in the type mapping example.
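
For example, rather than writing a native Hive timestamp, you can store the value in Hive as a string in the JDBC timestamp format (for example, '2015-03-25 21:36:32.0') and cast it to a timestamp in Drill. This is a sketch only; the events table and event_time column are hypothetical.

SELECT CAST(event_time AS TIMESTAMP) FROM hive.`events`;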

For more information about connecting Drill to data sources, refer to Connect to Data Sources on the Apache Drill documentation web site. For information about workspaces, refer to Workspaces.

 
