January 04, 2017 | BY Ted Dunning
In this week’s Whiteboard Walkthrough Ted Dunning, Chief Application Architect at MapR, provides some pointers for building better machine learning models, including the advantages of data streams and microservices style design in the example of a credit card fraud detector, the need for metrics, and how reconstruction of data from an auto-encoder can serve as a figure of merit that helps identify good models.
For additional resources on fraud detection, anomaly detection, microservices and streaming data:
- Watch Ted’s Whiteboard Walkthrough video “Key Requirements for Streaming Platforms: A Microservices Advantage”
- Read blog/tutorial “Real Time Credit Card Fraud Detection Using Apache Spark and Event Streaming” by Carol McDonald
- Watch Ted’s Whiteboard Walkthrough video “State vs. Flow Data Architecture in the Financial Sector”
- Read free online Chapter 3 “Streaming Architecture: Ideal Platform for Microservices” in O’Reilly ebook Streaming Architecture: New Designs Using Apache Kafka and MapR Streams © 2016 Ted Dunning and Ellen Friedman
- Download free pdf for the O’Reilly book Practical Machine Learning: A New Look at Anomaly Detection by Ted Dunning and Ellen Friedman
Here is the full video transcription:
Hi. I'd like to talk about how you can use anomaly detection and auto encoders to build a fraud detector, at least in a vague outline, using microservices. Now, the idea here is that we're going to build something that takes transactions in, that's what this stands for, and gives us immediately an indication of whether or not that transaction is okay. One way to do that is to use something called an auto encoder, which takes features of the incoming transaction, encodes that using a model through something called an information bottleneck, that is, it encodes it into a very small amount of information. Then it decodes it back into the same reconstructed features.
If we can do that well, then the model that is doing the encoding and decoding can be said to understand the inputs. The quality of the reconstruction can be considered to be an indication of whether or not the transaction incoming is like most transactions. Now, let's just for a moment not worry about the details of how auto encoders work. Let's take a very, very simple case, because this model is a form of history, and this idea of impinging a current transaction against some historical intuition is the essence of that model. Let's get back to that.
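To make the reconstruction idea concrete, here is a minimal sketch rather than the model from the talk. It assumes a linear auto encoder with a one-number bottleneck, fitted by power iteration (which is the same thing as keeping the top principal component), and it uses made-up two-feature transactions. All names and data are hypothetical.

```python
import math

def fit_linear_autoencoder(rows, iterations=100):
    """Learn a mean and one direction; the code is a single number per row."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    w = [1.0] * d
    for _ in range(iterations):
        # Power iteration: multiply w by X^T X and renormalize.
        proj = [sum(c[j] * w[j] for j in range(d)) for c in centered]
        w_new = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(v * v for v in w_new))
        w = [v / norm for v in w_new]
    return mean, w

def encode(x, mean, w):
    # The information bottleneck: many features squeezed into one number.
    return sum((x[j] - mean[j]) * w[j] for j in range(len(x)))

def decode(code, mean, w):
    return [mean[j] + code * w[j] for j in range(len(mean))]

def reconstruction_error(x, mean, w):
    rebuilt = decode(encode(x, mean, w), mean, w)
    return sum((a - b) ** 2 for a, b in zip(x, rebuilt))

# Hypothetical history in which the second feature is about twice the first.
history = [[1, 2.1], [2, 3.9], [3, 6.0], [4, 8.1], [5, 9.9]]
mean, w = fit_linear_autoencoder(history)

typical_error = reconstruction_error([3.5, 7.0], mean, w)  # fits the pattern
odd_error = reconstruction_error([5.0, 1.0], mean, w)      # breaks the pattern
print(typical_error < odd_error)
```

A transaction that looks like history reconstructs almost perfectly, while one that breaks the pattern reconstructs badly, so the error can serve as the figure of merit. A real auto encoder would usually be nonlinear and trained on far more features.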
Let's assume that we're going to simplify this down. The incoming transaction will have a location and a time. If I have a credit card that's used today in San Jose and four seconds later in New York and the card is actually present in both places, I say present in both places, there's a very good chance one of those uses is fraud. This is a concept called card velocity. In the early 90s that feature, not alone, but that feature was really major in the stopping of a very large amount of credit card fraud, because at that time the way it was done was by duplicating cards, sending them all around the country, and they would be used all over the place. Card velocity's what we're going to focus on today.
The idea with card velocity is we get a location and a time and we look at the previous location and time for that card. We subtract locations to get a distance, and we subtract times to get a duration. We take the ratio to get a speed. That's what card velocity is. It's how fast the card would have had to go from one transaction to the next. Our history database will hold our previous location, our previous time. The incoming transaction has the current location and the current time, and obviously after we decide whether that is a large card velocity or a small one, large meaning fraud, small meaning non-fraud, then we store that current location, that current time into the history database.
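The arithmetic above can be sketched as follows. This is an illustration under stated assumptions: locations are latitude and longitude pairs, distance is great-circle distance from the haversine formula, and the threshold `MAX_PLAUSIBLE_KMH` is a made-up value roughly equal to airliner speed. None of these choices come from the talk.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 1000.0  # assumed threshold, not from the talk

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def card_velocity_kmh(prev, curr):
    """prev and curr are (lat, lon, unix_seconds) tuples for the same card."""
    distance = haversine_km(prev[0], prev[1], curr[0], curr[1])
    # Clamp the duration to one second to avoid dividing by zero.
    hours = max(curr[2] - prev[2], 1) / 3600.0
    return distance / hours

def looks_fraudulent(prev, curr):
    return card_velocity_kmh(prev, curr) > MAX_PLAUSIBLE_KMH

san_jose = (37.3382, -121.8863, 0)
new_york = (40.7128, -74.0060, 4)        # four seconds later
san_jose_later = (37.3382, -121.8863, 3600)

print(looks_fraudulent(san_jose, new_york))        # coast to coast in 4 seconds
print(looks_fraudulent(san_jose, san_jose_later))  # same place an hour later
```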
This is a pretty standard sort of thing you might build. You build the little model here. It would store locations, it would look up locations, and it would make decisions. That's the old school way of building it. The problem is if you want to scale this, you wind up with a very hot database, because many different applications are trying to update it and read from it. You also have the problem that the database itself represents a very wide API. All kinds of things could happen to it. People could update it. People could read it. People might start scanning it. A lot of these uses are difficult to record and constrict to some sort of simple interface. Even worse, a lot of these uses of this database will cause problems in terms of performance of the fraud detector itself. It may break its SLAs. This ultimately is a mis-design, an anti-pattern in the modern microservice sort of context.
A better way to build this, and this is kind of surprising, a better way to build this system is to have the same transaction coming in, the same decision happening, the same history being read, but look at what's missing. The fraud detector does not write back to the database. Instead, what it does is it writes historical data to a transaction stream, a message stream, a topic in Kafka parlance, a topic in MapR Streams parlance, a topic within a stream for the MapR sort of world. That historical transaction, which is now a kind of high level, not a relational sort of update anymore, it's a high level data structure that describes the transaction. That transaction is then used to update the database.
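A rough sketch of that shape, with loud caveats: a plain Python list stands in for the topic, the "different city within an hour" rule is a toy substitute for card velocity, and the class and field names are hypothetical. The point is only that `decide` appends to the stream and never writes to the private table directly; the table is updated solely by reading the stream back.

```python
class FraudDetector:
    def __init__(self, topic):
        self._topic = topic      # shared, append-only history stream (stand-in list)
        self._offset = 0         # how far into the stream this detector has read
        self._last_seen = {}     # private table: card -> last transaction

    def decide(self, txn):
        self._catch_up()
        previous = self._last_seen.get(txn["card"])
        suspicious = (
            previous is not None
            and previous["city"] != txn["city"]
            and txn["time"] - previous["time"] < 3600
        )
        # Publish the transaction as history; do not touch the table here.
        self._topic.append(dict(txn))
        return suspicious

    def _catch_up(self):
        # The only code path that updates the private table reads the stream.
        while self._offset < len(self._topic):
            record = self._topic[self._offset]
            self._last_seen[record["card"]] = record
            self._offset += 1

topic = []
detector = FraudDetector(topic)
first = detector.decide({"card": "c1", "city": "San Jose", "time": 0})
second = detector.decide({"card": "c1", "city": "New York", "time": 4})
print(first, second, len(topic))
```

With a real message stream the list would be replaced by a producer and a consumer on a topic, but the separation of "decide and publish" from "consume and update" stays the same.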
Now, this looks very little different from this original, except that I've added complexity and added new mechanisms to this. Before I had decision and database. Now I have decision, and queue, and database. It looks like a bad move, but it's actually a really good move. The reason is that we can now draw an abstraction barrier around the fraud decision-ing system. This abstraction barrier here means that nobody else can see inside that, except where arrows go in or go out. They can send a transaction in, and that works as before. History comes out and is now exposed to the world in this transactional queue. The history then updates the database, and so the historical database works as before, but there's a big difference.
Now, nobody can look at the database inside the fraud detector. That means that if I build a new fraud detector, I can use any design or technology I like to build that historical database. I can change it up so that this system works better, so that I'm more efficient, so I'm cheaper. Whatever a team wants to do inside the abstraction barrier the outside world isn't allowed to really comment on. In the previous design, if we want to change the database, all of the consumers of that database in all different ways have to be on board with that change. It isn't just a technical problem. It becomes a political problem, a people problem. This could be much harder. You have to get 12 people in the room who agree on every change. That can't happen quickly. Here this database is completely hidden.
Now, again, that doesn't sound entirely good, but the fact is we can reconstruct the contents of that database from this stream. Tables and streams of changes are in some sense dual representations of each other. The table is an instantiation, an instance of the most recent values seen on the stream. The stream is the entire history of the table. They are the same thing in a very interesting way. That means that new applications could read from this queue and perhaps build a visualization of where are transactions happening, where are they happening in real time practically.
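The table and stream duality can be illustrated with a short sketch. The record layout and function names here are hypothetical, and a list stands in for the persistent stream. Replaying the stream from the start rebuilds the "last seen" table, and an unrelated consumer can derive a completely different view from the same records.

```python
from collections import Counter

# A stand-in for the history stream: every transaction ever published, in order.
stream = [
    {"card": "c1", "city": "San Jose", "time": 0},
    {"card": "c2", "city": "Austin", "time": 2},
    {"card": "c1", "city": "New York", "time": 4},
]

def rebuild_table(records):
    """The table is just the most recent value seen on the stream for each key."""
    table = {}
    for record in records:
        table[record["card"]] = (record["city"], record["time"])
    return table

def transactions_per_city(records):
    """A different consumer, for example a dashboard, reading the same stream."""
    return Counter(record["city"] for record in records)

table = rebuild_table(stream)
by_city = transactions_per_city(stream)
print(table["c1"])
print(by_city["Austin"])
```

Neither consumer touches the fraud detector's own database, which is why new uses of the history do not affect it.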
You can imagine a CEO might come in and say, "Oh. We've got to have a great big monitor in the lobby that shows where things are happening. It'll impress investors so much." We can now do that with no impact on the mission critical fraud detector, no impact because we're not touching its database. We can scale the fraud detector by building another abstraction boundary hidden fraud detector like this that sends its history out to the history queue and reads from the queue all of the updates. This is a really critically important idiom in microservices. This is how we would send data through here and how we would keep private the state-ful nature of this.
Now, it seems wasteful to have multiple copies of that database, but typically the database could be very small, because it's special purpose. It can often be in memory, giving performance benefits and things like that, the sort of benefits that you get whenever you have special purpose aspects of things, and you can only code to a special purpose then. Having all of history come out in a transactional stream is really, really exciting then, because anybody who needs to, and is allowed to of course, can build historical databases. They can do data mining. They can do machine learning. They can build alternative implementations of fraud detectors. You get enormous flexibility, and you get very fast design changes, because you can run test instances of this. If this is a persistent queue, then you also get the benefit that you can run a fraud detector against a whole lot of history and see how it does relative to the current implementations.
That's very exciting. That's also a great way to build these auto encoders, because that historical data can now feed the auto encoder itself. These systems now should always use this model. The fraud detector itself, because it is a microservice, should always have what's called a consistent micro-architecture, architecture below the level of detail that we normally draw when we draw a system diagram. Here is an abstraction of it, and you'll find out more about this in the video on anomaly detection, operational anomaly detection, but the microservice should have inputs and outputs, and it should always also have metrics and exceptions.
In the case of the fraud detector an exception might be fraud being found, or it might be the fraud detector has detected some ill-formed input. The metrics might be here are the latencies for the last 10 seconds as a histogram. They might be per transaction latency, but overall these metrics that we put out of this can now help us find faults in the system. When the system starts breaking in a distributed way it's very difficult to find and understand, but if we build a system like this with good abstraction boundaries and good micro-architecture, we can help it not only work better, be quicker to build, but be quicker to diagnose when it fails.
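One way to picture that micro-architecture is a small wrapper that gives every service the same four channels. This is a sketch under assumptions: the class name, the handler, and the record fields are all hypothetical, and plain lists stand in for whatever streams would carry outputs, metrics, and exceptions in practice.

```python
import time

class Microservice:
    """Wraps a handler so every call produces an output or an exception, plus a metric."""

    def __init__(self, handler):
        self._handler = handler
        self.outputs = []
        self.metrics = []
        self.exceptions = []

    def process(self, message):
        started = time.perf_counter()
        try:
            self.outputs.append(self._handler(message))
        except (KeyError, TypeError, ValueError) as err:
            # Ill-formed input goes to the exception channel instead of crashing.
            self.exceptions.append({"input": message, "error": repr(err)})
        finally:
            # Per-transaction latency; these could be rolled up into a histogram.
            self.metrics.append({"latency_s": time.perf_counter() - started})

def score(txn):
    # Hypothetical handler: flags large amounts.
    return {"card": txn["card"], "large": txn["amount"] > 1000}

service = Microservice(score)
service.process({"card": "c1", "amount": 50})
service.process({"card": "c2"})  # missing "amount", so this is ill-formed

print(len(service.outputs), len(service.exceptions), len(service.metrics))
```

Because metrics are recorded even when the handler fails, the latency record stays complete, which is what makes it useful for diagnosing faults.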
This is really an exciting new idiom for building these large distributed systems. This doesn't have to be a single computer. We now have abstracted over that. It can be an entire cluster. We have enormous flexibility about how we build things and how they work. It's a really exciting time, and this is a really exciting way to build distributed systems. I hope you build one with me sometime. Thanks very much.