The Data Platform for Better Biometrics


Biometrics has gained considerable traction in recent years as a core enabling technology for a broad spectrum of use cases. Biometrics is used today for border control, efficient disbursement of government services and entitlements, and controlling access to everything from secure government facilities to popular theme parks and resorts. New systems continue to be deployed with some frequency, while previously deployed systems are entering into a “technology refresh” stage, whereby initial systems and architectures are being reassessed in order to incorporate recent developments in technology.

While biometric algorithms (the defining and most tangible characteristic of a biometric system) are rightfully receiving considerable attention, MapR believes that an equally important architectural component deserves its due consideration: the underlying data platform. The last ten years have seen a sea change in approaches to data management, as companies such as Google, Facebook, LinkedIn, Yahoo, and others were forced to innovate in order to control costs while delivering a new breed of services to the market. Other industries have followed suit, and we are now witnessing a once-in-a-generation replatforming of the Information Technology enterprise.

Biometric Identity Management solutions have much to gain from these technological advances. As new systems are developed and aging systems reach their end-of-life, organizations would do well to consider the data platform that will power their biometric capability today and into the future. MapR provides world record performance across a breadth of data services, in support of a range of biometric use cases that previously required multiple subsystems, each of which had to be engineered separately for data resiliency, high availability, cross-data center replication, access control, and more. By choosing the right data platform, customers can now deploy biometric systems with less complexity, less cost, faster performance, and greater interoperability than previous generation architectures allowed.

MapR capabilities have fundamentally altered the art of the possible for biometric identity management systems. This paper shows how a better data platform can reduce complexity, deliver superior capabilities, and greatly reduce the administrative overhead of an enterprise-scale biometric system.


Previously, architects of biometric identity management systems have been forced to cobble together an array of subsystems in order to achieve the comprehensive set of required functionality. Biometric systems have been deployed that use one data store for imagery, another for biometric templates, perhaps a third as a system of record. Specific biometric services are sometimes hosted on dedicated and distinct infrastructure (identification vs. authentication, for example). Analytical functions are frequently offloaded to separate systems. Auditing and system monitoring might require yet another database, on physically separate hardware, while messaging might be a complete subsystem in its own right.

Yet many requirements apply to the system as a whole and therefore need to be architected, implemented, and maintained for each subsystem. Continuity of operations (COOP) and disaster recovery (DR) are an illustrative case, generally requiring measures to be enacted for each and every technology integrated into the solution. The cost of each subsystem therefore has to be multiplied by the number of locations. Their respective data stores will have to be replicated, each according to its own specific capabilities and administrative functions. And each will have ongoing operations and maintenance (O&M) costs, including onerous processes such as system testing and upgrades. The cost to the project is likely to be considerable, in terms of both money and time, and the introduction of risk is unavoidable. The complexity of the task inevitably creates potential for misconfiguration and for software defects to be exposed, typically when you can least afford them.

There are many such examples. High availability, security, backup procedures, capacity monitoring and planning—these are all system-level requirements and tasks that will have to be considered for each subsystem. Add to this picture the typical challenges encountered when integrating multiple point solutions into a complex system and the true impact of this complexity comes into painful focus:

  • Higher costs, for both the initial implementation and ongoing O&M
  • Redundant and underutilized hardware
  • Longer implementation schedules
  • Challenging system upgrades that take months to plan and likely require system downtime
  • Undesirable data movement, introducing latency and the possibility of “data drift” (inconsistent data across the various systems)
  • An inability to keep pace with industry innovation and provide continually improved services to customers
  • Increased staffing and training requirements
  • Higher risk across a number of dimensions

These challenges appear only to be getting bigger, as new modalities are introduced and new applications are created. Given these challenges and what appears to be a steadily increasing role for biometrics in the future, it is hardly surprising that adopters of this technology are now seeking innovative new approaches.

Fortunately, there is a better way.


The MapR Converged Data Platform integrates real-time database capabilities, global event streaming, and scalable enterprise storage with a collection of data processing and analytical engines to power a new generation of biometric systems. The integration of these capabilities into a single platform is illustrated below in Figure 1.

Figure 1. The MapR Converged Data Platform includes core data services (rectangle outlined in red) and data processing engines (squares outlined in black). Commercial applications (squares outlined in gray) access the platform using a suite of open APIs.

MapR software is installed on a cluster of commodity servers. The software makes the cluster of individual machines appear as a single entity, or rather three single entities: a single file system (MapR-FS), a single database (MapR Database), and a single messaging system (MapR Event Store). These core data services are available immediately, as soon as you launch MapR, without any additional installation or configuration. All three scale horizontally by adding nodes (i.e. servers) to the cluster. With the addition of each new node, the system increases its storage capacity (disks), its processing power (CPUs), and its RAM. Clusters can scale to thousands of machines.

The three components (MapR-FS, MapR Database, and MapR Event Store) are built on a converged platform. From an architectural perspective, they are built on top of the same tightly integrated architectural pieces and are optimized to utilize system resources efficiently (data caching mechanisms, for example). From a functional perspective, they are administered under a single umbrella and share consistent features such as data protection (e.g. snapshots and mirrors), security (e.g. access control expressions), and multi-tenancy (e.g. data placement constraints and size quotas).

MapR clusters can be set up in the cloud or on premises. Data can be replicated and synchronized across multiple MapR clusters around the world, whether that data exists as files, database tables, or event streams.

Data services are accessed using industry-standard open APIs, providing familiar interfaces to system administrators and application developers while avoiding vendor lock-in. These APIs are provided in Table 1.

Data service        Open APIs
MapR Database       HBase, OJAI, SQL
MapR Event Store    Kafka, SQL

Table 1. Open APIs provide familiar interfaces and avoid vendor lock-in.


MapR Database is an enterprise-grade, high-performance NoSQL database. It excels at fast data insertion and even faster recall—exactly what is needed for handling biometric-relevant scenarios such as enrollment, encounters, and identity verification. And in contrast to traditional relational databases, MapR Database is “schema-less”—arbitrary attributes can be added as needed, allowing for sparse data sets and evolving schemas. Identities within the system can therefore have an essentially unlimited number of attributes attached to them, and unique or infrequently used attributes can be selectively applied to subsets of the population. MapR Database provides native JSON support, an increasingly popular choice among system developers.
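
The schema-less model described above can be illustrated with a minimal sketch. It uses plain Python dictionaries and the standard json module rather than any MapR client library, and the field names (identity_id, fingerprints, watchlist_flag) and values are hypothetical.

```python
import json

# Two hypothetical identity documents. The second carries an attribute
# ("watchlist_flag") that the first lacks; no schema change is needed.
identities = [
    {"identity_id": "A-0001", "name": "Resident One",
     "fingerprints": {"right_index": "template-bytes-1"}},
    {"identity_id": "A-0002", "name": "Resident Two",
     "fingerprints": {"right_index": "template-bytes-2"},
     "watchlist_flag": True},
]


def attributes_in_use(documents):
    """Return the set of top-level attribute names seen across all documents."""
    names = set()
    for doc in documents:
        names.update(doc.keys())
    return names


def with_attribute(documents, name):
    """Return the documents that carry the given (possibly sparse) attribute."""
    return [doc for doc in documents if name in doc]


# Each document serializes to JSON independently of the others.
serialized = [json.dumps(doc, sort_keys=True) for doc in identities]
flagged_ids = [doc["identity_id"]
               for doc in with_attribute(identities, "watchlist_flag")]
```

Only the second identity carries the sparse attribute, so a query on it touches a subset of the population without affecting the other records.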

An excellent demonstration of the scalability, performance, and reliability of MapR Database, as well as its suitability for this use case, is Aadhaar. Aadhaar is the largest biometric database in the world, used to authenticate over a billion people across the entire country of India for a wide variety of services. Biometric templates for all enrolled citizens are stored in MapR Database, which is queried for all authentication requests. Some key system metrics are as follows (additional details are provided in a later section):

  • Over 1 billion citizens enrolled
  • 1 million new enrollments added daily
  • Millions of requests per day, with response times under 200 ms
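
To put the enrollment figure in perspective, the short calculation below converts a daily volume into an average per-second rate. It assumes load is spread evenly over 24 hours, which is a simplification; real traffic is likely to peak well above the average.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(events_per_day):
    """Average events per second for a given daily volume."""
    return events_per_day / SECONDS_PER_DAY


# 1 million enrollments a day averages out to roughly 11.6 per second.
enrollments_per_second = per_second(1_000_000)

# At a constant 1 million a day, 1 billion enrollments take 1,000 days.
days_to_one_billion = 1_000_000_000 / 1_000_000
```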


MapR-FS is the high-performance, distributed file system built into the MapR Converged Data Platform. It is POSIX compliant and provides transparent read/write file access via NFS, allowing system components and applications to mount the MapR cluster and interact with it as if it were direct-attached storage. It includes many important features for production deployments such as advanced data replication, access controls, and transparent data compression at virtually unlimited scale. MapR-FS provides system-of-record reliability for file-based data such as raw biometric imagery.
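
Because the cluster can be mounted over NFS, an application can store imagery using ordinary file operations. The sketch below shows the idea with Python's standard library only. A temporary directory stands in for the mount point so the code runs anywhere; the directory layout and file name are hypothetical.

```python
import os
import tempfile


def store_image(mount_point, subject_id, image_bytes):
    """Write raw image bytes under the mount point using ordinary file I/O."""
    directory = os.path.join(mount_point, "imagery", subject_id)
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "face.jpg")
    with open(path, "wb") as handle:
        handle.write(image_bytes)
    return path


def load_image(path):
    """Read the image back with the same standard calls."""
    with open(path, "rb") as handle:
        return handle.read()


# A temporary directory stands in for the NFS mount (an assumption made so
# that the sketch is runnable without a cluster).
with tempfile.TemporaryDirectory() as mount_point:
    saved_path = store_image(mount_point, "A-0001", b"\xff\xd8 fake jpeg bytes")
    round_trip = load_image(saved_path)
```

Nothing in the application code is specific to the storage system; in a real deployment only the mount point passed to store_image would change.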


MapR Event Store is a global publish-subscribe event streaming system that connects data producers and consumers worldwide in real time with unlimited scale. Events can be published by any number of information producers (potentially millions), and MapR Event Store will reliably persist those messages and make them accessible to any number of subscribers (potentially millions more). As with files and database records, MapR scales by automatically spreading data and processing across all nodes in the cluster.


Figure 2. MapR Event Store is the global pub-sub event streaming system for big data.

MapR Event Store can be used within biometric systems in a number of interesting ways. The most obvious is as a global event notification system. Whenever an encounter is recorded, an alert is raised, or an entity’s biometric information is updated, the relevant event can be published to the stream and all subscribers to that information worldwide will be instantly notified. Information exchange between systems can, of course, be bidirectional, allowing service requests to be received in the same manner as encounters are communicated to interested parties.

MapR Event Store can also be used as an alternative means of designing workflows. Rather than orchestrate processes by hardwiring various stages of processing within a monolithic application, each stage can act as a “microservice” that simply listens for information of interest, performs some processing, and publishes its outputs back to the stream. New stages are added by simply subscribing to those outputs.
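
The workflow style described above can be sketched with a toy in-memory bus. This is not the MapR Event Store API: the InMemoryBus class, the topic names, and the quality threshold are all hypothetical, and delivery here is synchronous, whereas a real streaming system delivers asynchronously.

```python
from collections import defaultdict


class InMemoryBus:
    """Toy publish-subscribe bus: topic name -> list of subscriber callbacks."""

    def __init__(self):
        self.subscribers = defaultdict(list)
        self.delivered = []  # (topic, message) pairs, in publish order

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        self.delivered.append((topic, message))
        for callback in list(self.subscribers[topic]):
            callback(message)


bus = InMemoryBus()


# Stage 1: listens for raw encounters and publishes a quality-checked version.
def quality_stage(encounter):
    checked = dict(encounter, quality_ok=encounter["score"] >= 50)
    bus.publish("encounters.checked", checked)


# Stage 2: added later by subscribing to stage 1's output; stage 1 is unchanged.
alerts = []


def alert_stage(encounter):
    if not encounter["quality_ok"]:
        alerts.append(encounter["subject"])


bus.subscribe("encounters.raw", quality_stage)
bus.subscribe("encounters.checked", alert_stage)

bus.publish("encounters.raw", {"subject": "A-0001", "score": 80})
bus.publish("encounters.raw", {"subject": "A-0002", "score": 20})
```

The point of the design is that the alert stage was wired in without touching the quality stage; each stage knows only the topics it reads and writes.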

Figure 3. Microservices can leverage the MapR Converged Data Platform for immediate access to operational (current) and analytical (historical) data in MapR. MapR is used by Aadhaar today primarily for authenticating users for government subsidies, but with the MapR Converged Data Platform and microservices, use cases could easily be extended to banking, border security, law enforcement, and more.

Event streams can be used as a system of record. Service requests can, for example, be received on an input stream and responded to on an output stream—both the request and the response can live forever in the system. In the event of a system failure or a change to the process workflow, event streams can be replayed, in the proper sequence, extending as far back as the input stream is retained.
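
Replay can be illustrated with a small append-only log. The sketch below is a conceptual model only; the EventLog class and the event fields are hypothetical, and it shows derived state being rebuilt purely from retained events, either from the beginning or from a chosen offset.

```python
class EventLog:
    """Append-only log; each event is addressed by its integer offset."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the event just written

    def replay(self, from_offset=0):
        """Yield (offset, event) pairs in original order, starting at from_offset."""
        for offset in range(from_offset, len(self._events)):
            yield offset, self._events[offset]


def rebuild_counts(log, from_offset=0):
    """Recompute derived state (requests per subject) purely from the log."""
    counts = {}
    for _, event in log.replay(from_offset):
        counts[event["subject"]] = counts.get(event["subject"], 0) + 1
    return counts


requests = EventLog()
for subject in ["A-0001", "A-0002", "A-0001"]:
    requests.append({"type": "verify", "subject": subject})

state_after_failure = rebuild_counts(requests)     # full replay
state_from_offset_1 = rebuild_counts(requests, 1)  # partial replay
```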

Finally, MapR Event Store can be used for analytics and system monitoring. As an example, a metrics microservice might listen for incoming requests for the sole purpose of tracking the number of service requests received. The microservice could publish updated metrics to MapR Database for display within a real-time system monitoring dashboard.


All data ingested into MapR via file, table, or stream can be analyzed in place using a variety of analytical engines in either batch or real-time. Common choices include Apache Drill for SQL-based analytics and Apache Spark for more complex programmatic analyses. MapR supports operational applications and analytical processing on the same cluster.


The ability to simultaneously host each of these data services on a single, horizontally scalable platform represents an incredible feat of engineering and has quite simply never before been possible. This opens up new possibilities for architects of biometric identity management systems, who were previously compelled to integrate multiple point solutions into a cohesive whole. In contrast to the complexities of many currently deployed systems, MapR offers a greatly simplified picture: a single, horizontally scalable cluster that provides all of the following:

  • Comprehensive data storage for both structured and unstructured data (i.e. templates, demographics, imagery, event logs, system metrics, etc.)
  • System-of-record reliability for all data, including files, tables, and streams
  • Fast database, file system, and messaging operations in support of varied identity services and applications
  • Global publish/subscribe messaging and event notification
  • Multiple analytical engines for in-place analytics across files, tables, and streams

These capabilities and some relevant enterprise-class features are described below.


With MapR-FS, MapR Database, and MapR Event Store, MapR provides unified data storage within a global namespace that can accommodate all biometric data types, including templates, demographic data, biometric imagery, encounter histories, audit trails, and more. Whatever the data type, the MapR Converged Data Platform provides uniform mechanisms for distributing the data across the cluster, protecting against data loss, and replicating data across remote data centers. This uniformity translates to superior system reliability and greatly simplified system administration across the full range of services.

Data is automatically compressed as it comes into the system, and on-disk data structures are optimized for files, JSON data models, and streams. Customers can have full faith and confidence in the MapR Converged Data Platform as a system of record for files, tables, and streams.


Identity Management systems are almost always mission critical to an organization and carry the associated requirements for high availability. MapR provides built-in high availability for file system, database, and messaging system services, requiring no additional hardware and no additional configuration or maintenance. Services are distributed across multiple nodes of the cluster, providing seamless failover in the event of unexpected process failures. The MapR architecture contains no single points of failure.

MapR is known for high reliability in production environments. MapR customers have been known to run for multiple years in heavily used production environments with zero downtime. Rolling upgrades are supported, such that core system software can be upgraded without loss of service. This was one of many reasons Aadhaar chose MapR as the backend database for authentication services.


Data written to MapR is automatically replicated to protect against failures of disk drives, servers, and even entire racks of equipment. All copies are automatically dispersed across the cluster, algorithmically placed for optimal protection and automatic load balancing. Recovery from failures is automatic. This entire process is completely transparent and applies to all data types: files, tables, and streams. System administrators need only worry about overall system storage capacity, an easily monitored metric.

To protect against user and application errors, MapR provides true point-in-time snapshots. Snapshots can be initiated manually or scheduled to occur at regular time intervals—a simple administrative function that can be configured in minutes. When a snapshot is created, the existing data within MapR is captured instantly, with zero storage overhead. Snapshots reliably capture the existing state of files, tables, and streams.


As a data management system, MapR is fast, efficient, and cost-effective. As demonstrated in publicly available benchmarks, MapR has set multiple world performance records in file, database, and streaming scenarios. Customer experience supports these claims: independently validated surveys show that 77% of MapR customers cite performance as a key factor in their selection of MapR, and MapR enjoys a 99% customer retention rate. On Aadhaar (see case study below), MapR services millions of authentication requests a day at sub-second response times across the entire country of India, where more than 1 billion residents are registered.


MapR can replicate and synchronize data across globally distributed clusters, both in the cloud and on premises. Move only the data you need by specifying the appropriate files, tables, and/or event streams (which are organized by message topic). Data transfer is secure and extremely efficient: data changes are tracked at the block level (8 KB in size) and only new or modified blocks are transferred. The MapR Converged Data Platform offers an easily configured replication technology that provides reliable worldwide transport of all biometric and related data, including imagery, templates, database tables, and event streams. Replication across all data types can be configured in minutes, and once configured it requires very little administrative attention (“set and forget”).
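
The benefit of transferring only new or modified blocks can be shown with a short sketch. How MapR tracks changed blocks internally is not described here; comparing per-block SHA-256 digests is a stand-in technique chosen for illustration, and the function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 8 * 1024  # 8 KB blocks, matching the size quoted in the text


def block_digests(data):
    """Split data into fixed-size blocks and return one SHA-256 digest per block."""
    return [
        hashlib.sha256(data[start:start + BLOCK_SIZE]).digest()
        for start in range(0, len(data), BLOCK_SIZE)
    ]


def changed_blocks(old, new):
    """Return indexes of blocks in `new` that are new or differ from `old`."""
    old_digests = block_digests(old)
    changed = []
    for index, digest in enumerate(block_digests(new)):
        if index >= len(old_digests) or digest != old_digests[index]:
            changed.append(index)
    return changed


original = bytes(BLOCK_SIZE * 4)   # four zero-filled blocks
modified = bytearray(original)
modified[BLOCK_SIZE + 10] = 1      # touch one byte inside block 1
modified += b"appended"            # start a fifth block (index 4)
to_transfer = changed_blocks(original, bytes(modified))
```

In this example two of the five blocks would cross the network, rather than the whole file.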

Table replication can be multi-master (or active-active): all sites can be producing information, simultaneously sending and receiving updates to and from multiple remote locations. MapR Event Store replicates not only the message data but also the consumer offset positions (i.e. which consumers have consumed which messages), such that client failover across multiple locations is seamless. For the file system, MapR provides mirrors, which automatically replicate all file operations (create, delete, and update) to one or more remote locations.

Figure 4. MapR Database multi-master table replication allows you to (a) minimize network latency on global data through local clusters, (b) reduce risk of data loss through bidirectional replication, and (c) failover applications by redirecting them to another cluster if the primary one fails.

MapR supports complex, globally distributed topologies. A single source can replicate to up to 64 destinations and receive data from up to 64 sources. Replication chaining is supported, and MapR provides built-in loop detection, such that data is never replicated to a remote cluster more than once.
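
One common way to prevent replication loops is to give every update a unique identifier and have each cluster ignore identifiers it has already applied. The sketch below uses that approach on a three-cluster ring. It is an assumed technique for illustration; the text does not say how MapR's built-in loop detection works, and the Cluster class is hypothetical.

```python
class Cluster:
    """Toy cluster that applies each update once and forwards it to its peers."""

    def __init__(self, name):
        self.name = name
        self.peers = []    # clusters this one replicates to
        self.applied = []  # update ids applied locally, in order
        self._seen = set()

    def write(self, update_id):
        """A local write: apply it, then replicate it onward."""
        self.receive(update_id)

    def receive(self, update_id):
        if update_id in self._seen:
            return  # already applied here, so the loop stops
        self._seen.add(update_id)
        self.applied.append(update_id)
        for peer in self.peers:
            peer.receive(update_id)


# A ring topology (a -> b -> c -> a) would loop forever without the check.
a, b, c = Cluster("a"), Cluster("b"), Cluster("c")
a.peers, b.peers, c.peers = [b], [c], [a]
a.write("a:1")
b.write("b:1")
```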


Applications and connected systems often wish to be notified about events that are occurring within the biometric identity system, such as an encounter being processed or an entity’s biometric information being updated. MapR Event Store provides this capability as a core data service of the MapR Converged Data Platform. No additional engineering or administrative effort is required to scale this capability, make it highly available, make it resilient against data loss, or implement the many other features required of a mission critical system. MapR Event Store inherits this functionality directly from the platform. Clients access this core capability using the industry-standard Kafka API, often with just a few lines of code.
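
As a rough sketch of how little client code a Kafka-style publish needs, the function below serializes an encounter as JSON and hands it to any producer object that exposes send(topic, value). That call shape matches, for example, the KafkaProducer class in the kafka-python library, but which client libraries a given MapR release supports is not stated here and is left as an assumption. A recording stand-in is used so the example runs without a cluster; the topic name and event fields are hypothetical.

```python
import json


def publish_encounter(producer, topic, encounter):
    """Serialize an encounter as JSON and send it through a Kafka-style producer."""
    payload = json.dumps(encounter, sort_keys=True).encode("utf-8")
    producer.send(topic, payload)
    return payload


class RecordingProducer:
    """Stand-in producer that records what would have been sent."""

    def __init__(self):
        self.sent = []

    def send(self, topic, value):
        self.sent.append((topic, value))


producer = RecordingProducer()
publish_encounter(producer, "encounters", {"subject": "A-0001", "site": "gate-3"})
```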


MapR scales linearly, such that doubling the nodes in the cluster doubles the storage and processing capacity of the system. From a biometric systems perspective, simply add nodes in order to do any of the following:

  • Manage a greater number of user identities
  • Handle more daily enrollments
  • Process more encounters per day
  • Service more authentication and/or identification requests
  • Store more biometric imagery
  • Analyze greater amounts of data faster
  • Send more notifications through the system

Adding nodes to the system is a simple administrative process. Once a new node is registered with the cluster, MapR automatically redistributes data to it and redirects service requests accordingly.
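
The linear-scaling claim lends itself to simple capacity arithmetic. The figures below (24 TB of raw disk per node, three copies of each block) are hypothetical planning inputs rather than MapR specifications, and the calculation ignores compression and file system overhead.

```python
def usable_capacity_tb(nodes, raw_tb_per_node, copies):
    """Usable storage when every block is stored `copies` times."""
    return nodes * raw_tb_per_node / copies


# Doubling the node count doubles usable capacity under these assumptions.
before = usable_capacity_tb(nodes=10, raw_tb_per_node=24, copies=3)
after = usable_capacity_tb(nodes=20, raw_tb_per_node=24, copies=3)
```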


Once the data lands in MapR (the file system, database, or streams), it can be immediately processed, in place and in parallel, by any of the open-source processing frameworks packaged with the MapR distribution and deployed to the cluster. Some of the more popular processing frameworks include Spark, Drill, MapReduce, Hive, and Pig. Data can be normalized, aggregated, and transformed as needed. Advanced analytics can be performed using Spark, Mahout, or other 3rd-party machine learning libraries. Output data sets (in the form of analytical results, aggregated data, or normalized, enriched data sets) can be written back to MapR-FS, or they can be fed into MapR Database to support real-time applications. Output data (alerts, aggregates, or individual data records, for example) can be published to MapR Event Store as a series of messages, identified by topic, and MapR Event Store will reliably deliver those messages to all system components that have registered as subscribers to that information.

Apache Drill is an open-source distributed SQL processing engine that can query petabytes of data. It talks natively to MapR-FS and MapR Database, allowing you to combine data from these sources within a single query. Drill supports JSON natively, allowing data that arrives in JSON format to be queried instantly, without any schema registration required—Drill will discover the schema on-the-fly. New attributes are discovered automatically and are immediately available for query. Drill provides ODBC and JDBC interfaces, allowing access to MapR data (files, database tables, JSON documents, and streams) from a large number of 3rd-party Business Intelligence and reporting tools such as Tableau, MicroStrategy, and many others.
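
The idea of discovering a schema while reading the data can be sketched in a few lines. This is a conceptual illustration only and does not reflect how Drill is implemented; the discover_schema function and the sample records are hypothetical.

```python
import json


def discover_schema(json_lines):
    """Map each field name to the set of Python type names observed for it."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


records = [
    '{"subject": "A-0001", "score": 80}',
    '{"subject": "A-0002", "score": 72.5, "site": "gate-3"}',
]
schema = discover_schema(records)
```

The "site" field appears only in the second record, yet it shows up in the discovered schema without any prior registration, which mirrors the behavior described above for newly arriving attributes.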


The MapR Converged Data Platform provides converged security: consistent access control mechanisms across files, tables, and streams. MapR can limit data access across all data services to specific groups and users as specified, using MapR Access Control Expressions (ACEs). System auditing is also comprehensive: all data access can be logged and queried, regardless of data type. The audit logs themselves are written directly to MapR-FS in JSON format, which can be queried natively using Apache Drill.

Figure 5. MapR offers comprehensive security, covering the four pillars of authentication, authorization, auditing, and data protection.


The flexibility of MapR comes from its breadth of data services that are available for use whenever needed. The time-consuming tasks of evaluating, procuring, installing, integrating, administering, securing, and maintaining multiple point solutions are all avoided, which frees developers to do what they want to do: build better apps. With MapR, they have a world-class file system, document database, and messaging system at their fingertips, with a high level of assurance that it is up and running and performing optimally. The use of popular, open APIs decreases the learning curve and increases productivity, as they will already be familiar with many of the interfaces. Native JSON support across the platform is another attractive feature, JSON being the data format most commonly used by developers.

MapR has fully embraced the industry shift away from monolithic applications towards collections of microservices. Microservices are more easily updated and deployed, leading to more nimble applications that provide new capabilities to consumers at a faster pace than could previously be accomplished. MapR Event Store allows multiple versions of a microservice to run concurrently, ensuring that new functionality can be thoroughly tested in production environments before any switchover is made. And the ability to replay event streams from a specific point in time means that new services can “catch up” to the current state of the system before operating in parallel on any new data as it arrives.

MapR is a multi-tenant environment. Individual tenants can be restricted to specific work areas (called MapR volumes) to which administrative policies can be applied that control access, restrict disk space consumption, and protect data, according to pre-defined schedules. Data and processing can be restricted to specific nodes of the cluster, further ensuring that the actions of one tenant do not adversely impact other tenants hosted within the same MapR environment.


Aadhaar case study

Aadhaar is the resident identity and authentication system being deployed across India to provide a unique identity (the Aadhaar Number) to every resident in India and a digital platform to authenticate, anytime and anywhere. It is administered by the Unique Identification Authority of India (UIDAI).

Before Aadhaar, it was difficult to verify someone’s identity. There was no universal identifier (nothing analogous to the USA’s social security number, for example). Existing repositories contained only a small subset of the population and also suffered from data integrity issues (duplicate and untrusted entries). A large percentage of the country is illiterate and spread across many remote villages, so documented proof of identity was often hard to come by. Distributing government services and entitlements was difficult—legitimate, deserving citizens often had great difficulty obtaining them, and the incidents of fraud, waste, and abuse were high.

Aadhaar aims to register every resident by capturing a set of basic demographic information (name, address, date of birth, and gender) and biometrics (10 fingerprints, 2 iris scans, and 1 face photo). That information is first checked against the existing Enrollment Database to verify uniqueness (a process called “de-duplication”). Once verified, the resident is given a 12-digit Aadhaar Number (a unique, lifetime, biometric-based identity), which is recorded alongside the resident’s demographic attributes and biometric derivatives in both the Enrollment and Authentication Databases. The multi-factor authentication service is provided as a platform: government and commercial service providers across India can query the system to authenticate a person (answering the question “are you who you say you are?”). The calling application chooses the subset of user identity information it is willing to accept, based on its security needs, and the service responds with a simple Yes/No. No transactional information is ever stored in Aadhaar; it is built purely as an Identity Platform, unaffiliated with any specific government entitlement or commercial service.
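
The enrollment and authentication flow described above can be modeled with a toy class. This is a simplified sketch and not Aadhaar's implementation: exact string equality stands in for a real biometric matcher, the 12-digit number is generated at random because the actual numbering scheme is not described here, and the class and field names are hypothetical.

```python
import secrets


class IdentityPlatform:
    """Toy identity store: de-duplicates on enrollment, answers Yes/No on auth."""

    def __init__(self):
        self._records = {}  # identity number -> demographics and biometric

    def enroll(self, demographics, biometric):
        # De-duplication: refuse enrollment if the biometric is already on file.
        for record in self._records.values():
            if record["biometric"] == biometric:
                return None
        number = self._new_number()
        self._records[number] = {
            "demographics": dict(demographics),
            "biometric": biometric,
        }
        return number

    def authenticate(self, number, **claims):
        """Return True only if every supplied claim matches the stored record."""
        record = self._records.get(number)
        if record is None or not claims:
            return False
        for field, value in claims.items():
            if field == "biometric":
                stored = record["biometric"]
            else:
                stored = record["demographics"].get(field)
            if stored != value:
                return False
        return True

    def _new_number(self):
        # Random 12-digit string that is not already in use.
        while True:
            number = "".join(secrets.choice("0123456789") for _ in range(12))
            if number not in self._records:
                return number


platform = IdentityPlatform()
number = platform.enroll({"name": "Resident One", "gender": "F"}, biometric="template-1")
duplicate = platform.enroll({"name": "Someone Else", "gender": "F"}, biometric="template-1")
name_only = platform.authenticate(number, name="Resident One")
name_and_bio = platform.authenticate(number, name="Resident One", biometric="template-1")
wrong_bio = platform.authenticate(number, biometric="template-9")
```

The caller decides which claims to supply (name only, or name plus biometric), and the service reveals nothing beyond a Yes/No answer.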

Aadhaar was launched in 2010 and has evolved into a vital digital infrastructure. With over a billion residents currently registered (the population of India is 1.2 billion), Aadhaar now claims to be the largest biometric database in the world, and the only billion-user platform outside of American commercial companies (Facebook, for example).

For more than two years, Aadhaar has been operating 24/7 using MapR as the backend database for their authentication subsystem. Millions of authentication requests are serviced every day, at sub-second response times (200 ms). Requests come in from across the entire country of India, with 1 million more identities being registered every day.

Aadhaar replaced their existing database technology with the MapR Converged Data Platform because of its reliability, speed, and ability to handle multiple geographically dispersed clusters. In addition to being able to handle the authentication workload, Aadhaar has to meet strict availability requirements, provide robustness in the event of hardware failure, and operate across multiple data centers. The ability to replicate data selectively to remote data centers was a key factor in their evaluation. MapR also enabled Aadhaar to implement rolling upgrades of core system software, allowing them to update their technology without incurring any system downtime or loss of service to customers. Finally, and perhaps even more fundamentally, MapR exhibited zero data loss at scale. Experience with alternative technologies had led to multiple instances of data loss, forcing Aadhaar to introduce yet another database management system. This protected the program against data loss (which absolutely could not be tolerated) but at the cost of greater complexity and many of the associated issues described in this paper (higher O&M costs, latency, etc.). Aadhaar relies on both Morpho/Safran and NEC AFIS to provide biometric matching algorithms.


MapR has built a world-class data services tier that is highly reliable, scales to global deployments, and delivers some of the fastest performance characteristics achievable. At any scale, MapR efficiently and reliably manages the full range of essential data types and exposes a wide suite of services and capabilities for building modern, next-generation applications and running in-place analytics. MapR delivers these capabilities on a single converged platform, allowing system-level enterprise requirements (data distribution, replication, security, and more) to be provided in a uniform fashion, leading to superior system performance and stability at lower cost while greatly reducing administrative overhead.

In the world of biometrics, the MapR Converged Data Platform fundamentally changes the art of the possible. By expanding the focus beyond the biometric algorithms and asking more of the underlying data platform, architects and practitioners are now in a position to deliver greatly enhanced system capabilities using far simpler and more cost-effective architectures.
