Recording the time at which a measurement was made or an event occurred can make data much more useful for revealing valuable insights, so it’s no wonder that there’s an increasing interest in time series data and in methods and technologies for building time series databases. That’s why co-author Ted Dunning and I have written a short book titled Time Series Databases: New Ways to Store and Access Data, published by O’Reilly this week. The book examines the fundamental concepts and practical methods for implementing scalable, cost-effective time series databases.
While the idea of using time series data is not new, the astounding quantity of data being generated from a wide variety of sources makes this a new world for time series. During a recent flight from San Francisco to New York City, I was thinking about the huge number and range of measurements being made throughout the flight, recorded many times per second.
Figure © 2014 Friedman & Dunning, used with permission. Graph shows altitude data from planes taking off at a busy California airport.
You’re likely familiar with the existence of a so-called “black box” that records flight data – an odd term since these boxes are usually a bright color. The boxes are mentioned in the news because they are retrieved when possible after an accident in order to reconstruct the events that may have caused the problem. But the data collected by aircraft sensors is not just used in these dire situations. It’s analyzed regularly to optimize various aspects of a flight, such as saving fuel or monitoring equipment performance. Flight data measurements include parameters such as air speed, altitude and flight path, fuel consumption and control settings, and the time of each measurement is also recorded. That time information is the key to getting real value from the measurements. Not only can the data provide a view of conditions at a particular moment in time, but by recording the data as a time series, it’s possible to cross-correlate various parameters and events throughout the flight.
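To make the idea of cross-correlating parameters concrete, here is a minimal Python sketch. It pairs up two series on their shared timestamps and computes a Pearson correlation at zero lag, which is only the simplest case of cross-correlation. The function names and the sample values are made up for illustration; they are not real flight data and are not taken from the book.

```python
from math import sqrt


def align(series_a, series_b):
    """Pair values from two {timestamp: value} series on their shared timestamps."""
    shared = sorted(series_a.keys() & series_b.keys())
    return [(series_a[t], series_b[t]) for t in shared]


def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


# Hypothetical samples keyed by seconds since takeoff (illustrative only).
altitude_ft = {0: 0, 1: 500, 2: 1000, 3: 1500, 4: 2000}
fuel_flow = {1: 9.0, 2: 8.0, 3: 7.0, 4: 6.0, 5: 5.5}

# Only timestamps present in both series can be compared.
pairs = align(altitude_ft, fuel_flow)
r = pearson(pairs)
print(f"{len(pairs)} shared samples, correlation {r:.2f}")
```

Without the recorded timestamps there would be no way to know which altitude reading belongs with which fuel reading, which is why the time of each measurement matters so much.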
The airline company is not the only business interested in data from these sensors. Some manufacturers of big equipment such as the turbines used in jet engines, power generation or windmills are incorporating sensors that can report back to the manufacturer throughout the life of the equipment. These “smart parts” not only provide feedback on equipment design and quality assurance, but they also have a real added value to the customer. The manufacturer may make this data available to customers as a service.
Sensor data in the Internet of Things is among the rapidly expanding sources of data that is collected as time series. And that raises the issue of how best to collect, ingest, store and access such huge amounts of data in time series databases. New challenges need new tools, so in our latest book we’ve described some new ways to build scalable, cost-effective NoSQL databases using Apache HBase or MapR-DB and with table designs specially aimed at time series data.
In the book we provide an overview of time series use cases from high frequency stock trading to environmental sampling done by unmanned robots on the ocean, but our main focus is on how to use new open source tools that enable you to build very high performance time series databases. We focus on the design and function of databases rather than on the analysis methods for time series data. We include a detailed explanation of these open source tools:
Our explanation of time series databases includes pointers to better performance through clever table design. We provide details about the strategies needed for good design of unique row keys and how this can make a big difference for performance in data retrieval. We also show that OpenTSDB is a great tool for efficient data storage. Time series data can be added to tables point-by-point, in a wide table format, or whole rows can be compressed to a single data structure (blob).
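The three storage layouts can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not OpenTSDB's actual byte encoding and not the MapR extensions: the one-hour bucket size, the function names (`row_key`, `to_wide_rows`, `to_blob`, `from_blob`), the sample metric and tag, and the use of JSON plus zlib for the blob are all choices made for this sketch.

```python
import json
import struct
import zlib

BUCKET_SECONDS = 3600  # assumption: one row per series per hour


def row_key(metric, tags, timestamp):
    """Row key made of the metric name, the hour-aligned base time, and sorted tags."""
    base = timestamp - (timestamp % BUCKET_SECONDS)
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    # Big-endian base time, so rows for one metric sort chronologically.
    return metric.encode() + b"\x00" + struct.pack(">I", base) + b"\x00" + tag_part.encode()


def to_wide_rows(metric, tags, points):
    """Wide-table layout: {row_key: {offset_in_bucket: value}}, one column per point."""
    rows = {}
    for timestamp, value in points:
        key = row_key(metric, tags, timestamp)
        rows.setdefault(key, {})[timestamp % BUCKET_SECONDS] = value
    return rows


def to_blob(columns):
    """Blob layout: pack every column of one row into a single compressed value."""
    payload = json.dumps(sorted(columns.items())).encode()
    return zlib.compress(payload)


def from_blob(blob):
    """Recover the {offset: value} columns from a blob written by to_blob."""
    return {offset: value for offset, value in json.loads(zlib.decompress(blob))}


# Hypothetical altitude samples, one per second (illustrative only).
start = 1_400_000_000
points = [(start + i, 30000.0 + i) for i in range(5)]

rows = to_wide_rows("altitude", {"tail": "N123"}, points)
for key, columns in rows.items():
    blob = to_blob(columns)
    print(len(columns), "columns stored as one blob of", len(blob), "bytes")
```

Point-by-point storage corresponds to writing each `(timestamp, value)` pair as its own cell. The wide-table form groups an hour of points under one row key, and the blob form then replaces those many columns with a single value, so a reader fetches one cell instead of thousands.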
In the book, we also describe in detail the use of open source extensions that MapR developed to make it possible to do direct blob-loading, as diagrammed in this figure:
Direct blob-loading in a time series database, using open source software developed by MapR to extend what can be achieved with OpenTSDB. (Image © 2014 Friedman & Dunning, used with permission.)
With the open source MapR extensions, very high ingestion rates have been achieved. For example, direct blob loading made it possible to ingest data at a rate of 100 million points per second into MapR-DB, using only 4 nodes of a 10-node cluster. The machines were high performance, but substantial results should be possible through this method even with less powerful hardware.
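For a sense of scale, the quoted rate can be turned into a back-of-the-envelope calculation. The sketch below assumes the load was spread evenly across the four ingesting nodes and that the rate could be sustained for a full day; neither assumption is stated in the benchmark description.

```python
TOTAL_POINTS_PER_SECOND = 100_000_000  # rate quoted above
INGEST_NODES = 4                       # nodes used for ingestion, as quoted above
SECONDS_PER_DAY = 86_400

# Assumes an even split across nodes (not stated in the source).
per_node = TOTAL_POINTS_PER_SECOND // INGEST_NODES
# Assumes the rate is sustained for 24 hours (not stated in the source).
per_day = TOTAL_POINTS_PER_SECOND * SECONDS_PER_DAY

print(f"{per_node:,} points/second per node")
print(f"{per_day:,} points/day at a sustained rate")
```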
The book also provides an introductory view of using time series databases in conjunction with machine learning models for predictive maintenance scheduling. We go on to describe a more advanced topic, the use of geo-temporal databases. The book combines high-level concepts with technical details about building time series databases, so it should have something of interest both for a very technical audience of developers and data scientists and for business analysts and system administrators.
You can learn more about these topics by reading Time Series Databases: New Ways to Store and Access Data, which for the present is being provided for free, courtesy of MapR.
Access the open source extensions developed by MapR here.