19 min read
Editor's Note: In this week's Whiteboard Walkthrough, Ted Dunning, Chief Application Architect at MapR, talks about how current trends are turning the internet upside down. He also talks about how this is leading to the requirements for very very high speed time series databases, which leads to practical designs based on modern NoSQL architectures to implement these high speed time series databases.
Here is the transcription:
Hi, my name's Ted Dunning. I'm also known as the guy in the red hat. I work for MapR as Chief Application Architect. What I do is help customers design stuff. I may be the guy in the red hat, but today we're doing a Whiteboard Walkthrough and it's fun and informal so I'm going to lose the hat.
What we're going to talk about today is a couple things. One is how current trends are turning the internet upside down. Also going to talk about how this is leading to the requirements for very very high speed time series databases. Then that in turn leads to practical designs based on modern NoSQL architectures to implement these high speed time series databases.
Let's start. What I'm going to talk about first is how the internet's turning upside down, but to talk about that we have to talk about how the internet works now. The way it works right now is the internet is like a big pipe. I'm not the one who invented that. Some Alaskan senator did that for me, but the idea is real.
Data comes from central servers, servers like at NetFlix or other places where they push bits into the internet and the internet isn't really built a series of tubes. It's built layer on layer, secondary servers, caches, routers and switches, things like that, are interconnected. The property of the internet is for most intents and purposes, if you put a bit in at one end, it will come out the other.
Now, what people haven't really thought about although obviously the engineers have, is that having gazillions of people pulling bits from the internet means that somebody has to stick those bits in at the other end, at the servers upstream. The bits have to go in and the bits have to come out. There are, as we said, a gazillion consumers and there's only a few sources.
What that means is that the sources are very very hot, hot places for technology and each box in the server room is expensive. It's much more expensive than a home wireless router. Now, on the other hand, even though there are very very small pieces of hardware there and relatively small bandwidth requirements, what happens is that this entire region over here is where most of the money goes. It's purely because there are so many millions and even billions of bandwidth consumers. Each one of them just costs a little bit. Each one of them just consumers a little bit of data. It seems like a lot when you're watching a movie or something like that but it really isn't a huge amount.
Since there are so many, the aggregate cost and the aggregate bandwidth are very high. The aggregate bandwidth is what makes it necessary for these servers to be fancy and big and advanced, but the sheer number is what causes the massive investment on this side, the cost of it.
What's happening is the role of the internet is changing. It's changing from a way of broadcasting bits to consumers into a way to pull bits from machines, from things, from things like your water heater, from things like every panel on every photovoltaic installation or array. Each one of those panels typically in a modern install has its own inverter that converts to a standardized form of electricity and that allows each panel to adapt to its own changing conditions, but that also causes data.
That data then is sent back for central analysis. Every stationary turbine that generates electricity, every jet engine that General Electric makes, all of these come with data plans just like your phone. The data on all these things, you know you might have a turbine here and you might have a wind farm there and a solar PV panel there. You might have my whole water heater at home, which is busy measuring data right now.
All of these things are generating bits and sending them to a central place. Just like in reverse, you can see how this is turning the entire internet backwards or reversed or something like that. Bits are coming in and being concentrated so now we have a huge number of bits here.
In the industrial internet, what we have is machines, valves, pipes, temperature sensors, being collected on site in what are called plant historians. This is called Tier 1 in the industrial internet, the nascent internet of things. That's being collected and routed and transmitted ultimately to Tier 2.
Now, Tier 1 is fairly well understood and the data rates are fairly moderate. The sensors are pretty well understood as well. People are putting a huge amount of effort into making these sensors and Tier 1 systems work well because this is where the aggregate dollars are very high so we have to drive the cost of the sensors very very low. They have to be at most a few dollars and preferably a few cents. That takes technology. It takes pretty advanced thinking to make that happen.
Tier 1 is where the aggregate dollars are going but Tier 2 is where the bits are going and that's where the very very high bandwidth and very very high technology is required. That's what we're going to talk about today. Talk about time series.
Now, time series have different meanings in different industries, and the techniques I'm going to talk to you about today started with data centers. In data centers, you typically start collecting data but it doesn't ever really matter so much if the system that's monitoring is out for a few minutes now and again, or if it loses a few bits, you can correct it next week and everything will get better.
In industrial settings, what happens is you have to know that data system is going to work and it's not going to lose any data. What that means is the people who buy these Tier 2 times series databases insist on testing, testing its scale. They require that you load typically say three years of data, three years of data from the entire world as far as the customer is concerned and test the database, demonstrate it works at scale.
Of course, they don't want you to test for three years. They want you to load all that data in a day, 1000-to-1 real time. If we're getting a million samples per second out of the entire plant here and all of the different Tier 1 collection points, we're going to have 1,000 times or 100 times that if we're willing to let the test take 10 days, bandwidth requirement for back-filling for test purposes at the Tier 2.
That means instead of a million data points per second or 100,000 per second or something like that, we're going to need to handle 100 million data points per second. Potentially even considerably higher than that, we may be able to fill the database at a billion data points per second.
If you think about it, typical databases on a single machine are pretty happy with a few 1,000 data points per second in just rate. With clever design and good optimization you can push it to 20,000 or 50,000 per second fairly reasonably, maybe even reach 100,000 if you're very clever, but 100,000 is a far far cry from 100 million.
Let's talk about how to make that happen. The way we're going to talk about making that happen is also described in more detail in the book that Ellen Friedman and I wrote recently, which is available for free on the MapR website, but let's talk about that.
The data in one of these databases is organized in an unusual way, unusual in the sense of how a relational database would normally be organized. In a relational database, you would have a time, a marker for which measurement is being made, and the value of that measurement.
This would be a fine relational form for the data that we're recording, a so-called fact table network. Each measurement therefore would require one row to be inserted. Now, in a relational database, there's a lot of general purpose capabilities and those general purpose capabilities get in our way sometimes when we're building a special purpose thing.
Here we're building a special purpose time series database and some of the characteristics that are unusual about time series databases come back to haunt us and prevent us necessarily from getting the best performance.
One thing about them is that as we make measurements, the measurements once made do not typically change. We might go back and say we're going to restate a region of history here because we had some late arriving data or something like that, but once we measure the temperature on Thursday, September 19, 1901, we're not going to go back and due to accounting rules or something like that, change what we measured.
We may have measured the wrong thing and there may have been some error in the thermometer but it is what we measured, so data is roughly immutable after we measure it.
Secondly, in a time series database, we usually want data for a range from a particular starting time to an ending time. This temporal range query dominates the workload.
Now, let's see how we can use these characteristics in a non-relational database to get some big advantages. What we do in the database that we're going to be looking at which is based, these are roughly based on the design of openTSDB which was roughly based on some of the Google time series databases that are kept.
What's going to happen here is here we have the metric ID itself, 101, 102, 103, different numbers for different measurements that are being made. We'd have another table somewhere which will describe the exact characteristics of each sensor.
Then we have time here but this is not time of the measurement, this is the time of the beginning of a time window. Here for brevity I'm using one minute as a time window. It's more common to use an hour so you can put thousands of measurements into one time window.
Here we have minute by minute time windows and you can see that I miswrote this, measurements 101 and 102 both had measurements at 15:00. 102 and 103 had measurements at 15:01. Measurement 103 also had another measurement time window at 15:02.
Now, we don't really care that 101 didn't have any measurements at 15:02 and we don't really care that measurement 103 didn't have any measurements at 15:00. That's okay. Missing data's okay. We might have added sensor 103 and removed sensor 101.
Then for recent data, and these last two lines are the recent data, what we do is we store the actual measurement like this value of 11 there in a column which is indicated, the name of the column is the time offset within this time window. We have a time window here of one minute so we have offsets here in this 103 15:01 row of +7 and +35. That means at 15:01:07 and 15:01:35 we'd made measurements of 11 and 20 respectively for measurement device 103.
Likewise, measurement device 103 at 15:03:50 had a measurement value of 33. This is very cool because what it does is it gives us one row for potentially 1,000 or more measurements. Having one row for many measurements makes the retrieval of time ranges very fast.
Retrieving each row, whether it contains one sample or 1,000, is not nearly 1,000 times more expensive in the wide row case. This is a characteristic of databases. Finding the data is very expensive. Starting a read is very expensive but reading more or reading less doesn't make that much difference. Having these rows with many measurements facilitates retrieval of time series data.
There's one more trick. What happens in openTSDB is after a period of time, so here for measurement 101 and 102 at 15:00, measurement 102 at 15:01, what's happened is the rows, the data that were in these columns has been compacted into a BLOB format that contains all of the time offsets and the values for all of the measurements in that window.
Those are stored as a single value in the database typically in a compressed format like protobufs or something like that and typically in a compressed format a binary representation that does not contain schema information since the schema information would be highly repetitive and therefore expensive to store.
That means that we can retrieve even faster the entire contents for that time window but it means something else. It means that we could also cheat and especially when backfilling years of data, we could just insert the BLOB format instead of the wide table format that's used by the openTSDB.
What that lets us do is insert thousands of samples with one database insertion. The acceleration is very nearly equal to the number of samples we insert with each operation.
Now we've done just that. We've tested it using standard openTSDB code on MapR DB. MapR DB is an in-file system implementation of the Hbase API. We've tested that and on four machines we were able to achieve an insertion date of just over 100 million data points per second, and on eight machines we were able to achieve just over 200 million inserts per second, and on nine almost 300 million per second. This represents essentially linear scaling as we add more machines.
What that means is that each machine that we added could handle about 30 million data points per second. Four times that gets you just over 100 million. Eight times that gets you into the 200s and nine times that get you right up around 300 million data points per second. This is really fast.
The typical reaction of people when you talk to them and tell them about 100 million data points per second is they just kind of go, and give you a classic Scooby Doo sort of expression.
This is exciting and especially for the backfill situation where I have three years of antique data that I need to insert into a test database very quickly, this is perfect because we need no changes to the operational aspects of the time series database. We can just back-fill the data very very fast.
It's less useful as it stands for a live database, and in a live database you want to be inserting data as statelessly as possible. If we were to insert the BLOB format, we would have to have some software entity that would in memory accumulate those values for a time window and insert them all at once. That accumulation is state.
The good news is that with queuing systems like Kafka or things you can build with MapR file system itself, you can have what's called a resilient queue. That resilient queue or buffer allows this reader to be building the state in memory but if it fails or if you have to restart it somewhere else, it can just start reading from that resilient queue back at the beginning of the time series for each time window. That lets it accumulate the stuff in memory at almost no cost in complexity.
These techniques can be extended even to real time data acquisition. Of course, if you acquire data at 10 or 100 million data points per second, you're just going to make the testing problem harder because then you're going to need 100 or 1,000 times that rate for the back-fill, but you can even achieve that by having a reasonably good sized cluster to do these retrievals on.
That's the nubben, that's the heart of how turning the internet inside out or backwards to front or upside down, is causing time series requirements to balloon and how to implement it with things you find around the house.
That's been a whiteboard walkthrough for high speed time series databases. If you have any questions, go ahead and send them to us. If you have any suggestions for subsequent walkthroughs, we could talk about all kinds of things. You can reach us at @MapR. That's for the main MapR Twitter identity. You can also reach me at @ted_dunning, or you can reach me by email, the quickest one is ted@MapR.com. That's me. Send me some ideas. I love to hear from people. Thanks.
Stay ahead of the bleeding edge...get the best of Big Data in your inbox.