AI and ML technologies, and their use in facial recognition systems, have received a lot of attention recently – some of it negative, pointing to the poor accuracy that is sometimes seen. If you’ve tried building such a system, you quickly realize that reasonable accuracy requires multiple models. For example, in the setup I describe in this blog, there are separate models for face detection (finding faces in an image) and face matching (deciding how close two images of a face are).
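To make "how close are these two images of a face" concrete, here is a minimal sketch of the matching step. It assumes each detected face has already been reduced to a fixed-length feature vector by some matching model. The vectors, the `face_distance` and `is_match` helpers, and the 0.6 threshold are all illustrative assumptions, not the models used in this blog.

```python
import math

def face_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_match(a, b, threshold=0.6):
    """Treat two faces as the same person if their vectors are close enough.

    The threshold is a made-up value; a real system would tune it on data.
    """
    return face_distance(a, b) <= threshold

# Hypothetical feature vectors, standing in for the output of a matching model.
known = [0.10, 0.80, 0.30]
candidate_same = [0.12, 0.79, 0.33]
candidate_other = [0.90, 0.10, 0.70]

same_person = is_match(known, candidate_same)
different_person = is_match(known, candidate_other)
```

The smaller the distance, the more alike the two faces are according to the model. The threshold turns that distance into a yes/no answer.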
The models I’m using here are good for frontal face images; for a sophisticated system, there would likely be other models that could deal with images from different angles (side, up, down, etc.) or that could interpolate an image from one angle to match an image from another.
If you want real-time operation, you have only a short window in which to execute each of the models. The result is likely to be an aggregate across the results of many models – giving a probabilistic rather than absolute determination.
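One way to picture that aggregation is a weighted average over whichever models answered inside the time window. The sketch below is only an illustration under stated assumptions: the model names, the weights, the 100 ms deadline, and the `aggregate_scores` helper are all hypothetical, and a real system might combine scores quite differently.

```python
def aggregate_scores(results, weights, deadline_ms):
    """Weighted mean of match scores from models that answered in time.

    results: list of (model_name, score in [0, 1], latency_ms)
    weights: dict of model_name -> relative trust in that model
    Returns (probability, models_used), or (None, []) if nothing usable arrived.
    """
    # Keep only the results that arrived within the real-time window.
    on_time = [(name, score) for name, score, latency in results
               if latency <= deadline_ms]
    total_weight = sum(weights.get(name, 1.0) for name, _ in on_time)
    if total_weight == 0:
        return None, []
    probability = sum(weights.get(name, 1.0) * score
                      for name, score in on_time) / total_weight
    return probability, [name for name, _ in on_time]

# Hypothetical results from three models for one pair of faces.
results = [
    ("frontal", 0.92, 40),
    ("profile", 0.70, 55),
    ("interpolated", 0.30, 180),  # too slow for a 100 ms window
]
weights = {"frontal": 2.0, "profile": 1.0, "interpolated": 1.0}

probability, used = aggregate_scores(results, weights, deadline_ms=100)
```

Here the slow model is simply ignored, and the answer is a probability built from the two models that met the deadline rather than a hard yes or no.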
You quickly arrive at the need for a streaming architecture – particularly when you consider that a practical application is likely to be distributed in nature. You might want to take a look at the rendezvous architecture described in Machine Learning Logistics by Ted Dunning and Ellen Friedman for an overview of how it might work.
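As a rough, in-memory illustration of the rendezvous idea: every model sees the same request and publishes its answer to a results stream, and a rendezvous step then picks which answer to return. The sketch below is a simplified reading of that pattern, not the book's implementation. The `Rendezvous` class, the model names, and the timings are hypothetical, and plain method calls stand in for reading from a stream.

```python
from collections import defaultdict

class Rendezvous:
    """Collects per-request results and picks one by model preference."""

    def __init__(self, preference, deadline_ms):
        self.preference = preference      # model names, most preferred first
        self.deadline_ms = deadline_ms
        self.pending = defaultdict(dict)  # request_id -> {model: result}

    def on_result(self, request_id, model, result, elapsed_ms):
        """Record a result from the results stream; late results are dropped."""
        if elapsed_ms <= self.deadline_ms:
            self.pending[request_id][model] = result

    def resolve(self, request_id):
        """Return (model, result) from the most preferred model that answered."""
        answers = self.pending.pop(request_id, {})
        for model in self.preference:
            if model in answers:
                return model, answers[model]
        return None  # no model answered in time

# Two hypothetical models score the same request; the preferred one is late.
rv = Rendezvous(preference=["deep", "fast"], deadline_ms=100)
rv.on_result("req-1", "fast", "alice", elapsed_ms=20)
rv.on_result("req-1", "deep", "alice", elapsed_ms=250)
chosen = rv.resolve("req-1")
```

Because every model's answer travels over a stream, adding or retiring a model does not change how requests are submitted. Only the preference list changes.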
For the real-time communication aspect, MapR Event Store provides an implementation based on the Kafka API. You could use Apache Kafka for it, but we’re talking about simplification here, so let’s come back to that option later.
For execution of the face detection model, the MapR REST API and Kafka Connect module provide methods for direct injection of the image into the communication system. For widely distributed systems, the MapR Data Platform and MapR Edge provide a few ways of easily implementing the system on a global scale. In terms of simplification, the key here is that nothing changes when moving from one machine to running on a cluster, in the cloud, across clusters, or across clouds – it all stays the same.
The multi-model execution and result aggregation for the rendezvous implementation are well served by MapR Event Store, as described in the referenced text.
The image database and metadata database can be easily implemented in MapR Database (which offers both HBase-style binary tables and a document database). Could I not use Apache Hadoop for this? Well, yes and no. At a certain scale, it would be fine; however, at some point you’ll run into issues with the number of files that Hadoop’s file system (HDFS) can handle, particularly if you are storing the results of each match for later analysis. Are there ways around this? Possibly, but we’re trying for real-time, and, again, we’re talking about this in the context of simplification – why not just use a file system that doesn’t have that problem?
Lastly, let’s consider portability: we want our code to run on a laptop for easy development and on a cluster, so we want to write it such that it can execute anywhere. The changes required to move from local to cluster execution are – none! Because of MapR’s API compliance, code that runs in a standard computer environment also runs on MapR. This makes portability easy and opens access to all the libraries and repositories of shared code available (consider all that exists for Python, R, TensorFlow, etc.); they simply work. What that means here is that I was able to pull down example code from the web and follow it without changing it, vastly simplifying the development and giving me “execute anywhere” for free.
So far, we’ve been looking at functionality; however, in a production world, we need to include operational aspects, such as security and auditing. That brings us back to the question of Apache Kafka: yes, the system could be built with it; it could also use HBase, Cassandra, or MongoDB as the NoSQL database and HDFS as the file system. However, in such an architecture, I have to consider operational aspects in multiple places – for example, how to implement security across Kafka, Cassandra, and HDFS. Can it be done? Yes. Is it simple? No. If I want to operate at scale, I also have to deploy dedicated infrastructure for each functional area and monitor each separately. With MapR, I deploy one thing – MapR. I define security at the data level, and that applies to my files, my database records, and my streaming data, which is much simpler.