Green Data, Part 3: Exploring Data with Jupyter Notebooks/Apache Drill


This blog post is the third in a 5-part series that explores the use of IoT and Data Science to maximize solar energy by leveraging the MapR Data Platform. Read the first 2 blog posts here and here.

To justify the project, I needed to better define the problem; that is, I needed to quantify how "off" my arrays were, which meant gathering data. I knew I would be working with this data in Jupyter notebooks, and since I had already written a module for connecting Jupyter Notebooks to Apache Drill, I decided that logging the data as JSON would be a simple approach: it would let me work with my data while using the visualization tools available in Jupyter Notebooks.

I created a GitHub repo for the project; much of the work I am doing can be found there, including test scripts that allowed me to test the individual components (data gathering, motor control, solar calculations, etc.). Working with sample code for the sensors and a neat Python module I found called Pysolar, I was off to the races.
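Pysolar provides the sun's altitude and azimuth for a given time and location. To give a feel for the quantity involved without depending on the library, here is a rough pure-Python approximation of the sun's altitude at local solar noon; this is only a sanity-check sketch, and the real module accounts for far more:

```python
import math

def solar_declination_deg(day_of_year):
    """Approximate solar declination (degrees) for a given day of the year."""
    return -23.44 * math.cos(math.radians(360.0 / 365.0 * (day_of_year + 10)))

def noon_altitude_deg(latitude_deg, day_of_year):
    """Approximate sun altitude (degrees) at local solar noon."""
    dec = solar_declination_deg(day_of_year)
    # At solar noon the hour angle is zero, so altitude = 90 - |lat - dec|.
    return 90.0 - abs(latitude_deg - dec)

# Wisconsin sits near 43 degrees N; compare early November to the June solstice.
november_alt = noon_altitude_deg(43.0, 307)  # around Nov 3rd
june_alt = noon_altitude_deg(43.0, 172)      # around Jun 21st
```

The gap between those two numbers is a big part of why November is a rough month for solar in Wisconsin.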

For each array, I collected all of the sensor data, plus the solar altitude and azimuth at the time of data collection; this is handled in the script. It's simple: you configure some basic information in the env.list file, run the script to collect sample data, and basic results are written to a JSON file. I then copied those daily JSON files from each array back to a directory in my MapR Data Platform and began to explore.
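A record for each sample might look like the following sketch. The field names here are my assumptions, not the actual script's schema, but writing one JSON object per line is what lets Drill query the files later with no ETL:

```python
import json
from datetime import datetime, timezone

def log_reading(path, array_name, x_axis, altitude, azimuth):
    """Append one sensor sample as a single JSON line (illustrative schema)."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),  # sample timestamp
        "array": array_name,          # which tracker: "north" or "south"
        "x_axis": x_axis,             # accelerometer X-axis reading
        "solar_altitude": altitude,   # degrees above horizon (from Pysolar)
        "solar_azimuth": azimuth,     # degrees from north (from Pysolar)
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

rec = log_reading("/tmp/north_2018-11-03.json", "north", 512, 21.7, 168.4)
```

Each day's file then just accumulates lines, one per sample, ready to copy back to the platform.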

Exploring the Data

For this project, and for all screenshots of data and graphs shown here, I am using Jupyter Notebooks with Apache Drill for my queries. I am using a module I wrote called "jupyter_drill" that allows me to interact with Apache Drill easily and run queries on the JSON data produced by the Raspberry Pis.

Drill was a great choice because I did not have to do any ETL work to use the data: I just copied the JSON files to a directory in MapR XD and ran the queries you see here. The jupyter_drill module returns results both as a table (if there are fewer than 1,000 results) and as a Pandas DataFrame, which is how I feed graphing modules to display the data. The variable prev_drill always holds a DataFrame with the results of my last Drill query. While this may seem to gloss over some important ETL facets, it really just shows how easy it is to work with data in Apache Drill, and why I chose Drill for this project: more time in the data, less time trying to make heads or tails of the data and of processes to load it.
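Since prev_drill always holds a Pandas DataFrame, everything downstream of the query is ordinary Pandas. A minimal sketch, using a hand-built stand-in DataFrame (the real one comes back from a Drill query over the JSON files; the column names are my assumptions):

```python
import pandas as pd

# Stand-in for prev_drill: the DataFrame the jupyter_drill module returns.
prev_drill = pd.DataFrame({
    "ts": pd.to_datetime(["2018-11-03 08:00", "2018-11-03 12:00",
                          "2018-11-03 16:00"]),
    "array": ["north", "north", "north"],
    "x_axis": [1100, 30, -1050],  # roughly east -> flat -> west over the day
})

# A typical follow-up: restrict to one array and summarize the readings.
north = prev_drill[prev_drill["array"] == "north"]
summary = north["x_axis"].agg(["min", "max", "mean"])
```

From there, plotting the position over the day is just a matter of handing the DataFrame to whatever graphing module you prefer.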

Interpreting the Data

I settled on using the X-axis reading from the accelerometer (see my previous post on the hardware I used for my reasoning on that). I started collecting data on the north array (left in the pictures) on November 2nd, and on the south array on November 3rd. Unfortunately, November is NOT a great month in Wisconsin for solar performance, so finding days when the optical sensor was working well was difficult. I also found that the sensor returns values from approximately 1250 (fully turned east for morning sun) to -1250 (fully turned west for evening sun). I have since done further calibration, described later in this blog post.
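Given that the raw X-axis reading spans roughly +1250 (fully east) to -1250 (fully west), a first-pass calibration is just a linear map from reading to tracker angle. The ±90° endpoints and the east-positive convention here are my assumptions for illustration, not the calibrated values:

```python
def reading_to_angle(reading, full_scale=1250.0, max_angle=90.0):
    """Linearly map a raw accelerometer X-axis reading to a tracker angle.

    Assumed convention: positive angles face east, 0 is flat (due south),
    negative angles face west.
    """
    # Clamp so a noisy reading can't claim an impossible angle.
    reading = max(-full_scale, min(full_scale, reading))
    return reading / full_scale * max_angle
```

For example, a raw reading of 0 maps to flat/due south, and +1250 maps to fully east.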

Even with the poorly calibrated sensors and poor weather days, I found some interesting things. Here is a query on a single day (November 3rd) showing the tracking position for both arrays. Blue is the north array (left in pictures); orange is the south array (right in pictures).


  • The south array went full west in the morning. I remember this detail: it had frost on the sensor, so once I cleared it, it came back to normal tracking.
  • Much of the time, the two arrays' tracking is in agreement. But even though they mostly agree, there are times when they are out of sync, which indicates that one or the other is tracking better.
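One way to quantify those "out of sync" moments is to align the two arrays' readings by timestamp and flag samples where they differ by more than some threshold. A sketch, with an arbitrary 200-count threshold that I am assuming for illustration:

```python
import pandas as pd

def divergence(north, south, threshold=200):
    """Return samples where two position series disagree beyond threshold.

    north, south: Series of X-axis readings indexed by timestamp.
    Pandas aligns the two series on their shared index before subtracting.
    """
    diff = (north - south).abs()
    return diff[diff > threshold]

ts = pd.to_datetime(["2018-11-03 09:00", "2018-11-03 12:00",
                     "2018-11-03 15:00"])
north = pd.Series([900, 10, -850], index=ts)
south = pd.Series([880, 15, -1250], index=ts)  # south stuck farther west
out_of_sync = divergence(north, south)
```

In this toy data, only the 3:00 pm sample exceeds the threshold, so only that timestamp is flagged.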

Here is another example of a confused array tracking poorly (this one shows the north array only):


  • Here, the array moved about halfway, to perfectly flat (0 on the graph, a due-south orientation), way too early, around 8:00 am.
  • The array then went full west until 10:00 am, when it decided to go back and resume tracking somewhat normally.

Even without knowing what was causing the obvious errors, I wanted to understand how close the tracker gets to the ideal angle when things are working well. I’ll cover this topic in Part 4 of my blog post series.

This blog post was published December 19, 2018.
