This section describes the configuration of the Drill system that you have installed and introduces the overall use case for the tutorial.
Storage Plugins Overview
The Hadoop cluster within the sandbox is set up with MapR-FS, MapR-DB, and Hive, which all serve as data sources for Drill in this tutorial. Before you can run queries against these data sources, Drill requires each one to be configured as a storage plugin. A storage plugin defines the abstraction on the data sources for Drill to talk to and provides interfaces to read/write and get metadata from the data source. Each storage plugin also exposes optimization rules for Drill to leverage for efficient query execution.
Take a look at the pre-configured storage plugins by opening the Drill Web UI.
Feel free to skip this section and jump directly to the queries: Lesson 1: Learn About the Data Set
- Launch a web browser and go to:
http://<IP address of the sandbox>:8047
- Go to the Storage tab
- Open the configured storage plugins one at a time by clicking Update
- You will see the following plugins configured.
A storage plugin configuration for MapR-DB in the sandbox. Drill uses a single storage plugin for connecting to HBase as well as MapR-DB, which is an enterprise grade in-Hadoop NoSQL database. See the Drill Documentation for information on how to configure Drill to query HBase.
This is a storage plugin configuration for the MapR file system (MapR-FS) in the sandbox. The connection attribute indicates the type of distributed file system: in this case, MapR-FS. Drill can work with any distributed system, including HDFS, S3, and so on.
The configuration also includes a set of workspaces; each one represents a location in MapR-FS:
root: access to the root file system location
clicks: access to nested JSON log data
logs: access to flat (non-nested) JSON log data in the logs directory and its subdirectories
views: a workspace for creating views
A workspace in Drill is a location where users can easily access a specific set of data and collaborate with each other by sharing artifacts. Users can create as many workspaces as they need within Drill.
Each workspace can also be configured as “writable” or not, which indicates whether users can write data to this location and defines the storage format in which the data will be written (parquet, csv, json). These attributes become relevant when you explore Drill SQL commands, especially CREATE TABLE AS (CTAS) and CREATE VIEW.
Drill can query files and directories directly and can detect the file formats based on the file extension or the first few bits of data within the file. However, additional information around formats is required for Drill, such as delimiters for text files, which are specified in the “formats” section below.
A storage plugin configuration for a Hive data warehouse within the sandbox. Drill connects to the Hive metastore by using the configured metastore thrift URI. Metadata for Hive tables is automatically available for users to query.
Client Application Interfaces
Drill also provides additional application interfaces for the client tools to connect and access from Drill. The interfaces include the following.
Drill provides ODBC/JDBC drivers to connect from BI tools such as Tableau, MicroStrategy, SQUirrel, and Jaspersoft; refer to Drill Interfaces Introduction to learn more.
SQLLine is a JDBC application that comes packaged with Drill. In order to start working with it, you can use the command line on the demo cluster to log in as root, then enter
mapr as the login password. For example:
Drill Web UI
The Drill Web UI is a simple user interface for configuring and manage Drill. This UI can be launched from any of the nodes in the Drill cluster. The configuration for Drill includes setting up storage plugins that represent the data sources on which Drill performs queries. The sandbox comes with storage plugins configured for the Hive, HBase, MapR file system, and local file system.
Users and developers can get the necessary information for tuning and performing diagnostics on queries, such as the list of queries executed in a session and detailed query plan profiles for each.
Detailed configuration and management of Drill is out of scope for this tutorial.
The Web interface for Drill also provides a query UI where users can submit queries to Drill and observe results. Here is a screen shot of the Web UI for Drill:
Drill provides a simple REST API for the users to query data as well as manage the system. The Web UI leverages the REST API to talk to Drill.
This tutorial introduces sample queries that you can run by using SQLLine. Note that you can run the queries just as easily by launching the Drill Web UI. No additional installation or configuration is required.
Use Case Overview
As you run through the queries in this tutorial, put yourself in the shoes of an analyst with basic SQL skills. Let us imagine that the analyst works for an emerging online retail business that accepts purchases from its customers through both an established web-based interface and a new mobile application.
The analyst is data-driven and operates mostly on the business side with little or no interaction with the IT department. Recently the central IT team has implemented a Hadoop-based infrastructure to reduce the cost of the legacy database system, and most of the DWH/ETL workload is now handled by Hadoop/Hive. The master customer profile information and product catalog are managed in MapR-DB, which is a NoSQL database. The IT team has also started acquiring clickstream data that comes from web and mobile applications. This data is stored in Hadoop as JSON files.
The analyst has a number of data sources that he could explore, but exploring them in isolation is not the way to go. There are some potentially very interesting analytical connections between these data sources. For example, it would be good to be able to analyze customer records in the clickstream data and tie them to the master customer data in MapR DB.
The analyst decides to explore various data sources and he chooses to do that by using Drill. Think about the flexibility and analytic capability of Drill as you work through the tutorial.
Start running queries by going to Lesson 1: Learn About the Data Set.