SQL Query on Mixed Schema Data Using Apache Drill


You may have heard this statement before:

_Apache Drill does schema discovery on-the-fly._

What does that mean, and why should it matter to you?

The power of SQL for business analytics is a given, but the challenge in big data settings is that SQL is normally a static language that assumes a pre-defined, fixed, and well-known schema. SQL also expects flat data structures, and it has long been assumed that you need a fixed schema for performance. In contrast, working in the realm of big data requires being flexible and dynamic. You’ll encounter a wide range of data types, including unstructured and semi-structured data and nested formats such as JSON and Parquet; these formats are a poor fit for traditional approaches.

The effort required for transformation of these classes of data in order to query them with SQL on traditional databases (AKA data modeling) often puts these data sources off limits in a pragmatic sense. That’s where the innovative design of open source Apache Drill comes in. Using standard ANSI SQL syntax, Drill tackles this wide variety of big data sources and formats with ease and without the need for a lot of expensive ETL processing. Drill explores and queries data as it is, with existing data types, by interacting using a JSON data model.

In a talk titled “Just-in-Time Optimization” at the Strata + Hadoop World conference in San Jose, Ted Dunning, Chief Application Architect at MapR, discussed this problem and how Drill tackles it. Ted explained that Drill “moves much of the optimization and type specificity out of the query-parsing and static-optimization processes and into the execution process itself” and does this without needing to know much about the schema in advance. In addition, Drill can optimize queries across data sets with changing schema. Handling data with multiple schemas is important with big data because even if data starts out with a well-defined schema on day 1, it typically isn’t long before that schema changes. The result is data that can’t be handled directly by traditional SQL engines. Ted emphasized this point by saying “Drill brings together the sophistication and familiarity of SQL with the flexibility of the Hadoop ecosystem.”

Example: Apache Drill SQL query across mixed schema data

This ability of Drill to discover schema metadata on-the-fly has practical advantages. Suppose, in a retail setting, that a new column for a new product was added to a table six months ago. You would like to query across the older data (lacking the column) and the newer data (with the column) using a single query. Drill can do this because schema mutation is a core feature.

I asked Systems Engineer Chris Matta about his experience in the field working with MapR customers who are using Drill to solve similar problems. His reply was “Sure, it happens all the time. When Drill runs across a new field that it hadn't encountered before it includes it in all of the results, setting the fields from the data without that column to NULL. This is what makes Drill so powerful in real-world use and avoids having to maintain schema and do ETL to make the data ‘fit’.” Chris went on to explain that Drill works on all the underlying storage plugins (and across them if you’re joining).
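As a sketch of the cross-plugin behavior Chris describes, a join spanning two storage plugins might look something like the following (the hive plugin, sales workspace, and customers table are hypothetical names, not part of this example):

```sql
-- Hypothetical cross-plugin join: a JSON file accessed through a
-- file-system plugin joined against a table in a Hive plugin.
select j.name, c.region
from maprfs.ted.`json/new.json` j
join hive.sales.customers c on j.name = c.name;
```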

That got me thinking it might be fun and useful to show how you would query two data sets with different schema.

In one, you know the name, company, and title of individuals. Later you find out that some of these people have a secret identity as a super-hero, so you add a column with that identity for the new cases being added. With traditional approaches this mixed-schema data would be difficult to query. But with Drill you just point your query at the old and new data and go.

Notice what happens in these three situations:

  1. Query old and new data sets together: a new column was added in the new data set, but Drill can still query both old and new data using one query.
    Note: Notice how select * is handled in the presence of a variable schema. The effect may not be what you are looking for.
  2. Masking: there is a secret identity for some individuals, so you may want to mask their real name. That can be done with query design, still across old and new data. Users will see “—secret—” instead of a name if the person is a super-hero with a secret identity – this protects their cover.
  3. Apache Drill views for masking: Drill views have chained impersonation. This won’t be surprising if your background is standard SQL, but users of SQL-like tools such as Hive or Impala may be surprised. See also Chapter 3 of Sharing Big Data Safely.

Here’s how it looks (my thanks to Ted Dunning & Chris Matta for help with the example):

Keep in mind that files can be treated as tables. Often we collect data one way and then later collect more data in a new format. Here we have two files, old.json and new.json. With Drill, we can query each of these as if they were a table. We have to use a data source that accesses data found in a file system, and we have to put the path name inside back-ticks to avoid having Drill treat the path name as if it were some kind of SQL expression or reserved word.

0: jdbc:drill:> select * from maprfs.ted.`json/old.json`;
+---------------+-------------------+-------------------+--------------+
|       name    |       company     |       title       |       city   |
+---------------+-------------------+-------------------+--------------+
| Tony Stark    | Stark Industries  | Engineer; ex-CEO  | Malibu       |
| Pepper Potts  | Stark Industries  | CEO               | Los Angeles  |
| JARVIS        | Stark household   | Assistant         | Malibu       |
+---------------+-------------------+-------------------+--------------+
3 rows selected (0.369 seconds)
0: jdbc:drill:> select * from maprfs.ted.`json/new.json`;
+---------------+------------+--------------------+-----------------+--------------+
|       name    |  identity  |      company       |       title     |    city      |
+---------------+------------+--------------------+-----------------+--------------+
| Clark Kent    | Superman   | Daily Planet       | Reporter        | Metropolis   |
| Lois Lane     | null       | Daily Planet       | Reporter        | Metropolis   |
| Peter Parker  | Spiderman  | Empire State Univ  | Student         | New York     |
| Bruce Wayne   | Batman     | Wayne Enterprises  | Philanthropist  | Gotham City  |
+---------------+------------+--------------------+-----------------+--------------+
4 rows selected (0.344 seconds)

Notice that the new data has an extra column.

Query Old and New Data Sets: Multiple Files as One Table

We can also query the entire directory as if it were a single table. This implicitly queries all of the files together. Notice how the composite table only has the columns that are common between the two files.

0: jdbc:drill:> select * from maprfs.ted.`json`;
+---------------+--------------------+-------------------+--------------+
|       name    |     company        |       title       |       city   |
+---------------+--------------------+-------------------+--------------+
| Tony Stark    | Stark Industries   | Engineer; ex-CEO  | Malibu       |
| Pepper Potts  | Stark Industries   | CEO               | Los Angeles  |
| JARVIS        | Stark household    | Assistant         | Malibu       |
| Clark Kent    | Daily Planet       | Reporter          | Metropolis   |
| Lois Lane     | Daily Planet       | Reporter          | Metropolis   |
| Peter Parker  | Empire State Univ  | Student           | New York     |
| Bruce Wayne   | Wayne Enterprises  | Philanthropist    | Gotham City  |
+---------------+--------------------+-------------------+--------------+
7 rows selected (0.368 seconds)

We can include the identity column by mentioning it explicitly. Again, we need back-ticks, but this time because identity is a reserved word in SQL and we want to treat it as a field name.

0: jdbc:drill:> select name, company, `identity`, title, city from maprfs.ted.`json/*` ;
+---------------+--------------------+------------+-------------------+--------------+
|       name    |     company        |  identity  |          title    |     city     |
+---------------+--------------------+------------+-------------------+--------------+
| Tony Stark    | Stark Industries   | null       | Engineer; ex-CEO  | Malibu       |
| Pepper Potts  | Stark Industries   | null       | CEO               | Los Angeles  |
| JARVIS        | Stark household    | null       | Assistant         | Malibu       |
| Clark Kent    | Daily Planet       | Superman   | Reporter          | Metropolis   |
| Lois Lane     | Daily Planet       | null       | Reporter          | Metropolis   |
| Peter Parker  | Empire State Univ  | Spiderman  | Student           | New York     |
| Bruce Wayne   | Wayne Enterprises  | Batman     | Philanthropist    | Gotham City  |
+---------------+--------------------+------------+-------------------+--------------+
7 rows selected (0.304 seconds)

Masking

We can even hide the identity of anybody who is known to be a super-hero using a bit of fancy SQL. Anybody from the old file will have no known super-hero identity because that wasn’t recorded back then. In the new file, anybody known to be a super-hero can have their name hidden. Lois Lane doesn’t have a super-hero identity (in this table, that is, and ignoring the complexities of DC alternative universes) so even though she is mentioned in the new data, her name is not masked.

0: jdbc:drill:> select
. . . . . . . >    case
. . . . . . . >    when `identity` is null then name
. . . . . . . >    else ' -- secret --'
. . . . . . . >    end
. . . . . . . >    as name,
. . . . . . . >    company,
. . . . . . . >    `identity`,
. . . . . . . >    title,
. . . . . . . >    city
. . . . . . . > from maprfs.ted.`json`;
+----------------+--------------------+------------+-------------------+--------------+
|       name     |     company        |  identity  |      title        |     city     |
+----------------+--------------------+------------+-------------------+--------------+
| Tony Stark     | Stark Industries   | null       | Engineer; ex-CEO  | Malibu       |
| Pepper Potts   | Stark Industries   | null       | CEO               | Los Angeles  |
| JARVIS         | Stark household    | null       | Assistant         | Malibu       |
|  -- secret --  | Daily Planet       | Superman   | Reporter          | Metropolis   |
| Lois Lane      | Daily Planet       | null       | Reporter          | Metropolis   |
|  -- secret --  | Empire State Univ  | Spiderman  | Student           | New York     |
|  -- secret --  | Wayne Enterprises  | Batman     | Philanthropist    | Gotham City  |
+----------------+--------------------+------------+-------------------+--------------+
7 rows selected (0.374 seconds)

Using Apache Drill Views to Mask Data

You can use views in Drill to hide some data more securely than is possible with just a fancy query. To create the view, figure out a query that gives you the data you want to expose in the view. In this case, we will simply omit the identity column altogether, but the previous masking query would have worked just as well.

0: jdbc:drill:> create view maprfs.ted.some_columns as select
. . . . . . . >    name,
. . . . . . . >    company,
. . . . . . . >    title,
. . . . . . . >    city
. . . . . . . > from maprfs.ted.`json`;
+-------+------------------------------------------------------------------+
|  ok   |                             summary                              |
+-------+------------------------------------------------------------------+
| true  | View 'some_columns' created successfully in 'maprfs.ted' schema  |
+-------+------------------------------------------------------------------+
1 row selected (0.494 seconds)
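As noted above, the earlier masking query could have served as the view body instead of simply dropping the column; a sketch of that variant (the view name masked_names is made up for illustration) would look like this:

```sql
-- Hypothetical view built on the earlier masking query: names of
-- known super-heroes are hidden, and identity is not exposed at all.
create view maprfs.ted.masked_names as select
   case
   when `identity` is null then name
   else ' -- secret --'
   end
   as name,
   company,
   title,
   city
from maprfs.ted.`json`;
```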

Once you have the Drill view, you can use it as an ordinary table, but you can’t access data in the underlying table if it isn’t exposed in the view; selecting identity from the view gives an unknown-column error.

0: jdbc:drill:> select * from maprfs.ted.some_columns;
+---------------+--------------------+-------------------+--------------+
|     name      |     company        |     title         |     city     |
+---------------+--------------------+-------------------+--------------+
| Tony Stark    | Stark Industries   | Engineer; ex-CEO  | Malibu       |
| Pepper Potts  | Stark Industries   | CEO               | Los Angeles  |
| JARVIS        | Stark household    | Assistant         | Malibu       |
| Clark Kent    | Daily Planet       | Reporter          | Metropolis   |
| Lois Lane     | Daily Planet       | Reporter          | Metropolis   |
| Peter Parker  | Empire State Univ  | Student           | New York     |
| Bruce Wayne   | Wayne Enterprises  | Philanthropist    | Gotham City  |
+---------------+--------------------+-------------------+--------------+
7 rows selected (0.519 seconds)
0: jdbc:drill:>
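By contrast, asking the view for the hidden column fails; a sketch of such a query (the exact error text varies by Drill version, so it isn’t reproduced here):

```sql
-- Fails: the view does not expose the identity column,
-- so Drill reports a validation error for the unknown column.
select `identity` from maprfs.ted.some_columns;
```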

Note that if you try to create a view that already exists, you will need to drop the old one first.

0: jdbc:drill:> drop view maprfs.ted.some_columns;
+-------+---------------------------------------------------------------------+
|  ok   |                               summary                               |
+-------+---------------------------------------------------------------------+
| true  | View [some_columns] deleted successfully from schema [maprfs.ted].  |
+-------+---------------------------------------------------------------------+
1 row selected (0.489 seconds)

Internally, a view is just another file that is treated specially. The content of the file that defines the view looks like this:

[tdunning@se-node10 ~]$ cat some_columns.view.drill
{
  "name" : "some_columns",
  "sql" : "SELECT `name`, `company`, `title`, `city`\nFROM `maprfs`.`ted`.`json`",
  "fields" : [ {
      "name" : "name",
      "type" : "ANY",
      "isNullable" : true
  }, {
      "name" : "company",
      "type" : "ANY",
      "isNullable" : true
  }, {
      "name" : "title",
      "type" : "ANY",
      "isNullable" : true
  }, {
      "name" : "city",
      "type" : "ANY",
      "isNullable" : true
  } ],
  "workspaceSchemaPath" : [ ]
}
[tdunning@se-node10 ~]$

As you can see, the Drill view is written in JSON and contains our original query and some type information (which here just says that everything is of type ANY, a Drill extension to SQL). It is also possible to embed a workspace path in a view so that your users don’t have to worry about where the tables in the query are actually located. Opening up the permissions on a view lets others query the view, and thus the underlying files, even if they cannot query those files directly. This technique allows you to enforce masking of certain data for particular users.

Note that the phrase “opening up the permissions” is not used in any database sense. It is literally a matter of changing the permissions on the file that contains the view. The same thing is true when you query files: whether or not you can read them using Drill is determined by the permissions on the file, just as it would be for any other way of reading the files.

Conclusion

Apache Drill is an open source query engine that brings together the power of standard SQL with the flexibility needed for modern big data settings. It also offers some excellent ways to share or protect the identity of super-heroes. For more on working with Drill, see the following resources.

To get started, download the Apache Drill sandbox.

To learn more about Drill views see Chapter 3 of Sharing Big Data Safely © 2015 Ted Dunning and Ellen Friedman (O’Reilly)

Free online training with Drill is also available.


This blog post was published June 13, 2016.