Real-Time Processing: Why Interactive Queries Aren’t Good Enough

January 05, 2015 | BY Ted Dunning

A number of people have been claiming lately that interactive responses to queries constitute real-time processing. For instance, Mike Olson has been quoted saying that interactive queries are what is needed for real-time processing.

I like to start with something more like the Wikipedia definition of real-time computing instead. Wikipedia defines real time as a response before a deadline. A relaxed form of this is stream processing, where each record gets a response as soon as possible, but with no clear latency deadline.
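To make the distinction concrete, here is a minimal sketch of the difference between "fast on average" and "before a deadline". The function names, the sample latencies, and the 100 ms deadline are all made up for illustration; they are not measurements from any system.

```python
from typing import Sequence


def meets_deadline(latencies_ms: Sequence[float], deadline_ms: float) -> bool:
    """Deadline sense of real time: every single response must arrive in time."""
    return all(latency <= deadline_ms for latency in latencies_ms)


def mean_latency(latencies_ms: Sequence[float]) -> float:
    """Average latency, which is roughly what 'interactive' tends to describe."""
    return sum(latencies_ms) / len(latencies_ms)


# Hypothetical samples: four quick responses and one long stall.
samples_ms = [2.0, 3.0, 2.5, 250.0, 2.0]
deadline_ms = 100.0  # assumed deadline, chosen only for this example

average_is_fine = mean_latency(samples_ms) < deadline_ms   # the mean is under the deadline
is_real_time = meets_deadline(samples_ms, deadline_ms)     # the stall still misses it
```

Under these assumed numbers the average looks comfortable, yet one response misses the deadline, so the system is interactive without being real time in the deadline sense.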

Clearly, the latency graph below, which showcases the performance consistency of MapR-DB versus HBase running on a competitive Hadoop distribution, simply blows away any reasonable claim of real time by our competition, unless they completely avoid persistence guarantees of any kind. For instance, you could have an in-memory queue that writes to HBase whenever it can, with the understanding that a crash or any kind of maintenance action has a very good chance of causing the loss of data. If you don’t use any persistence, then either Storm or Spark Streaming allows such a system to respond to incoming records in bounded time, but it is very unusual to have absolutely no persistence requirements. Sadly, another common strategy is simply to lie about the risks, pretending that never analyzing the problem, and so never learning of any failure modes, is the same as not having failure modes.

If you have good bounds on latency, however, you can make real guarantees backed up by real analysis. You can guarantee not to lose data and to acknowledge every incoming record in a bounded time. That is why the latency performance of MapR-DB is so significant.

Figure: Read latency test, MapR-DB

The Other Side of the Coin

Interestingly, the flip side of real time is the need to freeze a continually changing dataset. For instance, if you are going to run any kind of large aggregation query or train a machine learning model, you need the data to sit still during the scan in order to have repeatable results. With real snapshots, this is trivial. With no more than fuzzy snapshots, it is essentially impossible, and you have to rely on application cooperation and correctness to get correct answers. Or, as with real-time response, you can just pretend. For instance, you can pretend that applications will stop writing to all files in an hourly directory by five minutes past the hour, so that you can start training with any hourly directory from at least five minutes in the past.
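The hourly-directory heuristic just described might be sketched as follows. This is only an illustration under one reading of the rule (a directory counts as frozen once its hour ended at least five minutes ago); the function name and the grace period constant are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed grace period taken from the "five minutes past the hour" rule.
GRACE = timedelta(minutes=5)


def hour_is_assumed_closed(hour_start: datetime, now: datetime) -> bool:
    """Return True if the heuristic treats the hourly directory as frozen.

    The directory for hour H holds records from [H, H + 1h). The heuristic
    only looks at the clock: it never checks whether any writer is actually
    still appending to a file in that directory.
    """
    return now >= hour_start + timedelta(hours=1) + GRACE


hour = datetime(2015, 1, 5, 10)                      # the 10:00 directory
too_early = hour_is_assumed_closed(hour, datetime(2015, 1, 5, 11, 4))
assumed_safe = hour_is_assumed_closed(hour, datetime(2015, 1, 5, 11, 5))
```

Nothing in this check enforces the assumption. A writer that is delayed past the grace period can still change the data while a scan or a training run is already in progress.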

This will work. Sometimes.

That word “sometimes” is the key point. Our competition can build systems that do real time. Sometimes. They can meet SLAs. Sometimes. They can stand behind guarantees. Sometimes.

For customers who believe “sometimes” just isn't good enough, MapR is the only option.