Green Data: Using Apache Drill on Solar Inverter Data

Contributed by

20 min read

As part of my Green Data blog post series, I am going through a process of optimizing my solar arrays in order to allow for better tracking. Through the course of this blog post, I realized that the data I would need to truly compare my two arrays' power production did exist in my SolarEdge inverter monitoring portal but was not immediately available to me in an automated fashion. I could look at it manually at any point in time, but I could not download the data throughout the day to compare production.

Here is a screen shot from the portal; note the individual panel amounts, but this data is not available in the SolarEdge API.

Finding a great community of SolarEdge users, I realized I may be able to get the data directly from the inverter. This community conversation coalesced around a GitHub repository at: There were some ups and downs there, but once I started getting that data, I needed to understand the raw data and make it into something usable. As a sidecar to my Green Data blog post, I put together this post, which shows me writing the raw JSON data from the SolarEdge inverter into a MapR XD directory, and then using Apache Drill to understand and utilize this data.

Exploring the Data

When I go first to explore any data with Apache Drill, the first thing I need to think about is its location. For the the SolarEdge inverter JSON, it's in a directory called solarinverterdata in a workspace I have called As it turns out, this location in dfs (MapR XD) is /data/prod/solarinverterdata, and from a POSIX perspective, it's /zeta/brewpot/data/prod/solarinverterdata – ‘zeta' being my MapR POSIX mount point (typically /mapr) and ‘brewpot' being my cluster name. You will notice that in all the views and queries I am running, I use‘tablename.' I could also issue a query of


and then the workspace would be implicit for my session, allowing me to just issue queries directly to ‘tablename' (or in my case, the directory name of ‘solarinverterdata').

A Note on Virtual Columns

Apache Drill has some helpful virtual columns that can be used to help make queries easier. More information can be found here: I will discuss two virtual columns that I am using in more detail here.


This is one of a few implicit columns that Drill creates when querying a file system. It does NOT automatically return this in a select * query; you have to ask for it by name, hence, why I do:

select `FILENAME`, a.* from table a

This allows me to get the filename for the record, and then all the fields in that record.

dir0 - Directories as Data

The way the inverter logger outputs files is hourly files, including the date and hour in the filename. It ALSO puts each day's hourly files into a directory under the solarinverterdata directory. Thus, we have a directory in YYYY-MM-DD format (i.e., 2018-12-01, 2018-12-02, etc.) containing up to 24 hourly JSON files of my raw inverter data. The directory that the data is in is returned by the virtual column dir0, so I can actually do queries that include a where clause, like this

select * from`solarinverterdata`
where dir0 = '2018-12-02'

to only read data from the directory solarinverterdata/2018-12-02 in the workspace This is very helpful, but I do want to point out that dir0 is a bit different from the FILENAME virtual column. It will return as part of select * queries.

Initial Query of the Data

select `FILENAME`, a.* from`solarinverterdata` a
where `FILENAME` like 'solardata_2018-11-28-08%'

This is my initial query; let's talk about why I ran it this way and what I got back from the query:

The reason for the query is I wanted to see the filename AND be able to focus on the files for 2018-11-28-08 (2018-11-28 is the data; 08 is the hour). Essentially, I didn't want to look at the data from when the solar collectors were not collecting (at night). This allows me to see those records that are of importance to me and more quickly dig into my data. The output files are in the format YYYY-MM-DD-HH. By choosing a time when the sun was up (8 in the morning), I would have valuable data to explore.

The data returned is not super clean. I will have to dig through it to determine its value; here are my observations:

  • FILENAME: The file the record is sourced from
  • dir0: The directory the FILENAME is located in
  • inverters: This appears to be information specifically about inverters. Since I only take network traffic from one inverter, I only get data from one inverter. The serial number is the key here: "73132E72." This information is for the inverter as a whole, and thus is not important for me as I want to see one array vs. another array on the same inverter.
  • meters_0x0022, events, Unknown_device_x: These fields do not seem important to me yet, so while they exist in the data, I will ignore them.
  • optimizers: This appears to have the data I am looking for – in other words, the data on the individual power optimizers on each panel. Since I know which optimizers are on which array, this is the data I can use to compare the two arrays' production to determine the efficacy of my Green Data project.

Focusing in on the Optimizers Column

So, running the query

select `FILENAME`, a.`optimizers` as optimizers
from`solarinverterdata` a
where `FILENAME` like 'solardata_2018-11-28-08%'

to focus on the optimizers column allows me to see that for each record, there are zero or more optimizers reporting on their data. Not every optimizer reports in each record; however, the time of the report is included in each record, and it appears if an optimizer doesn't report, it just has a blank record.

The record appears to be a map of Key (the Optimizer Serial Number) : {Map of data for that optimizer}. The data includes Uptime, Temp, Edat Vmod, Time, Date, and more.

Formatting out the data from the solar optimizer column:


Working with Nested Data

The data in my optimizer records is nested, which can be hard to work with. Specifically, my optimizer data is a map of objects with the key as the optimizer serial number, and the value as the data for that optimizer. I want to discuss this, but in order to convey the concepts, I am going to use a simplified representation of the data:

{"SN1":{"field1": 1, "field2": 2}, "SN2":{"field1": 1, "field2": 2}, "SN3": {"field1": 1, "field2": 2}}

The optimizer field is getting us closer to usable data, but I need to have this data split out into values, so I can perform queries and aggregations on them without confusing myself in the process. I really just need each value (the {"field1":1, "field2: 2} part in my simplified example) as a record split out and associated with the optimizer serial number (the "SN1," "SN2" part in my example).


Luckily, Apache Drill has some neat tools for working with this nested data. The first tool I am going to use is a function called KVGEN(). Since my key names are actually data (the serial number of each optimizer), it make it hard to reference them in an automatic fashion. Yes, I could create a view that pulls out each known optimizer ID, but then this query wouldn't work for other people trying to use my work to interpret their inverter data, forcing them to start from scratch.

So what KVGEN() does is take each key and value in the map and outputs an array of objects. These objects all have two values: a key (the key from before, which will be the optimizer serial number) and the value (which will be the object from before with all the data).

Using my simplified example from above, KVGEN() would output:

{"key": "SN1", "value":{"field1": 1, "field2": 2}},
{"key": "SN2", "value":{"field1": 1, "field2": 2}},
{"key": "SN3", "value":{"field1": 1, "field2": 2}}

Applying that to the optimizer column would look like this:

FLATTEN() - Making Sense of the Output of KVGEN()

Since I now have an array of objects, I can use another Apache Drill built-in function: FLATTEN(). What this will do is take any record with an array, and then create a record for each value in that array.

To illustrate this, here is a very small table named mytable:

1    [1, 3, 5, 7, 9]
2    [2, 4, 6, 8, 10]

This table has two columns: an INTEGER column named `ID` and an ARRAY column named `MYARRAY.` If I were to run a query like this:

select `ID`, FLATTEN(`MYARRAY`) as `MYVAL`
from `mytable`

The results would look like this:

1       1
1       3
1       5
1       7
1       9
2       2
2       4
2       6
2       8
2      10

Essentially, for each array value in a single record of `MYARRAY,` it returns a record, duplicating the `ID` field for the record. This is perfect for the results of my KVGEN(), because I had an array of objects returned, and now I will get each object as a record. I can now access the key name (the optimizer serial number) generically by the keyname "key" and the value of that optimizers report as "value." Using my simplified example, applying FLATTEN() to my example output of KVGEN():

Original simplified example:

{"SN1":{"field1": 1, "field2": 2}, "SN2":{"field1": 1, "field2": 2}, "SN3": {"field1": 1, "field2": 2}}

Original simplified example using KVGEN():

{"key": "SN1", "value":{"field1": 1, "field2": 2}},
{"key": "SN2", "value":{"field1": 1, "field2": 2}},
{"key": "SN3", "value":{"field1": 1, "field2": 2}}

Original simplified example after FLATTEN() and KVGEN():

key  value
SN1 {"field1": 1, "field2": 2}
SN2 {"field1": 1, "field2": 2}
SN3 {"field1": 1, "field2": 2}

Applying that to the optimizer data would look like this:

Putting It All Together

Now that we've explored the data and started to pull out the general format and nesting of the data, let's put it together into something usable.

Referencing the Fields

Each record returned is an object of key: optimizer serial number and value: {data returned from the optimizer for that report}, and we can simply reference the fields to make a query that shows each data field as a column. Doing this by example, I will pull out the optimizer serial number by referencing:

  • The subquery alias (b)
  • The column alias (opt)
  • The field name (key)
  • I use an alias to be more meaningful to me (opt_sn)

SELECT b.`opt`.`key` as `opt_sn`

Let's look at an item in the value field, specifically a field like `Eday` which I believe is short for EnergyDay (the amount of energy returned for the day). To return this value, I use:

SELECT b.`opt`.`value`.`Eday` as `Eday`

  • b is the reference to the subquery I am selecting from
  • opt is the name of the field I gave the results of FLATTEN(KVGEN(a.optimizers))
  • value is the name of the field that KVGEN() returns with the value of the record
  • Eday is the name of the field in that object I am looking for
  • Eday is also the name of the alias I am giving this field

As you will see in the query below, I am also casting my data just to make it easy to work with later. While this isn't required, it does make things easier down the line. If you can cast, and your data is small, it doesn't hurt to do it now.

You can see in some of these results, some records return None or NaN for various columns. That's because, as noted above, some records don't have reports from the optimizers. Thus, those are blank. They still return, just without values. This is acceptable.

Creating a View

Since I now have the report for each optimizer, including the time, the date, and the amount of energy in a nice tabular format, let's make a view so I can more easily work with the data. It's much easier to just reference fields in a view than it is to do all the fancy FLATTEN/KVGEN work for every query. It also makes it easier for others to understand my resultant dataset, with a handy view they can use to see how I got to it.

Prior to showing the view creation, I want to explain three fields I am adding to the view and why:

  • part_dt: This is the dir0 value renamed to part_dt. This helps me as a partitioning field. If we only want data from one day, say 2018-12-04, I can assign part_dt = '2018-12-04,' and this will only read the JSON files in the directory 2018-12-04. This improves query performance. It's not needed, but it's handy to have.
  • opt_id: This is the dot notation ID seen in the SolarEdge monitoring portal. Essentially, I know the serial number of each optimizer, and I know the dot notation ID assigned to it (from reviewing my portal data), so I created a CASE statement to return that. It just helps me see the optimizers in a different way and helps group them together per array while also aligning the serial number data with the notation IDs seen in the SolarEdge portal. This is custom to each installation, so when I do this, it's specifically for me.
  • opt_string: This is the Inverter String it belongs to. It is a confusing name when we are talking about data engineering with data types and applying that to the world of solar inverters. Inverters can group panels into "strings." Essentially, String 1 is the north array from my Green Data blog post, and String 2 is the south array. Since I know which physical optimizers are on each array, I do this in my view query with a CASE statement. This helps us get the data we need later on for aggregation and comparison.

Other than those additions, the view statement is just the previous SELECT statement with a


prefixing the query. The return just shows that the view was created successfully, and now I can easily reference my inverter data in a tabular format that makes for easy aggregations and comparisons. All of this is done without changing how the raw data, complex and nested JSON from an IoT device, is stored in MapR XD, making the data engineering component of my solar project very straightforward.

Using the Data

Using my view, I can now run simple, easy to understand queries on my data:

select ds, opt_id, max(opt_string) as opt_string, max(ts) as max_ts, max(Eday) as max_Eday
where part_dt = '2018-12-02' and ds is not Null
group by ds, opt_id
order by opt_id ASC

This query groups by the optimizer ID, shows the string, the max(ts) (which will be the last report time of that optimizer for the day), and the Max of Eday (which will be the most it has reported on that day). Since I limited the query to a specific day, this should show me the output for the optimizers for the day, and then sort it by opt_id. I believe Eday to be the cumulative for the day; I need to validate that with the SolarEdge portal. I also included a "and ds is not Null" in the where clause; I don't need to see records that don't have data for a given optimizer on that day.

A More Complex Query with Visualization

Here is a query where the goal is to show the power utilization every 10 minutes between the two arrays (grouped by opt_string and having all the optimizers on each array summed together):

Since I am using my jupyter_drill module, so that when I run a query in Jupyter Notebooks I automatically get the results in a grid AND a variable with the results in a Pandas DataFrame, I can apply simple visualization by doing this:

And then running this code to get a simple line chart that shows the comparison between the two arrays:

While there are some interesting data breaks (I was updating and modifying the inverter monitoring code), it's really cool to see the output of going from raw, nested JSON to usable results, visualized via Apache Drill queries.


Apache Drill is a great tool for anyone working with data. Allowing me to take complex, ugly JSON files and put usable structure with standard SQL statements is very powerful. While I don't "Drill" into the meaning of my results in this post, as part of my Green Data blog post series, I am sure after more months of data collections, I'll have a new post that really outlines how data helped me convert more sunlight into electricity.

This blog post was published February 26, 2019.

50,000+ of the smartest have already joined!

Stay ahead of the bleeding edge...get the best of Big Data in your inbox.

Get our latest posts in your inbox

Subscribe Now