January 19, 2015 | BY Michael Hausenblas
In this series of blog posts on the Internet of Things (IoT), we have so far established why IoT naturally lends itself to big data, reviewed the current IoT landscape, and looked at some IoT use cases (smart cities, smart phones, and smart homes). In this post, we'll discuss the requirements for an IoT data processing platform and introduce a high-level architecture that can meet them.
A data platform that needs to process data from IoT devices reliably and at scale should meet the following requirements:
- Native raw data support. Both in terms of data ingestion and processing, the platform should be able to natively deal with IoT data. Hadoop in general and the MapR Data Platform in particular make it possible to land the incoming data in its raw format (JSON, log files, etc.) and—for optimization purposes—to convert data downstream to more sophisticated formats such as Parquet.
- Support for a variety of workload types. IoT applications usually require that the platform can natively support stream processing, and that it can deal with low-latency queries against semi-structured data items, at scale.
- Business continuity. Commercial IoT applications usually come with SLAs in terms of availability, latency and disaster recovery metrics (Recovery Point Objective/Recovery Time Objective). Hence, the platform should be able to guarantee those SLAs, innately. This is especially critical in the context of IoT applications in domains such as health care, where people’s lives are at stake.
- Security & Privacy. The platform must ensure secure end-to-end operation, including integration with existing enterprise authentication and authorization systems such as LDAP, Active Directory, Kerberos, SAML or PAM. Last but not least, the platform must guarantee user privacy, with measures ranging from ACLs and data provenance support to data encryption and masking.
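To make the first requirement more concrete, here is a minimal sketch of landing raw data as-is and converting it downstream. It uses only the Python standard library. The in-memory list `RAW_LANDING_ZONE`, the field names, and the `to_columns` helper are all hypothetical. The column-oriented dictionary merely stands in for a real columnar format such as Parquet, which would normally be written with a dedicated library.

```python
import json

# Hypothetical raw sensor events, kept exactly as received (one JSON document per line).
RAW_LANDING_ZONE = [
    '{"device": "thermo-1", "ts": 1, "temp_c": 21.5}',
    '{"device": "thermo-2", "ts": 1, "temp_c": 19.0}',
    '{"device": "thermo-1", "ts": 2, "temp_c": 21.7, "battery": 0.93}',
]


def to_columns(raw_lines):
    """Pivot row-oriented JSON records into a column-oriented layout.

    Fields missing from a record are filled with None so that every
    column ends up with one entry per record.
    """
    records = [json.loads(line) for line in raw_lines]
    field_names = sorted({name for record in records for name in record})
    columns = {name: [] for name in field_names}
    for record in records:
        for name in field_names:
            columns[name].append(record.get(name))
    return columns


# Downstream conversion step: the raw lines stay untouched in the landing zone.
columns = to_columns(RAW_LANDING_ZONE)
```

The raw lines remain available for reprocessing, while the derived columns suit analytical scans over a single field such as `temp_c`.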
The Internet of Things Architecture (iot-a)
To help architect and reason about concrete IoT applications, let's now discuss a polyglot processing architecture: the Internet of Things Architecture (iot-a). Note that the iot-a is, in a sense, a meta-architecture that operates on a higher abstraction level than, for example, the Lambda or Kappa architecture. The iot-a assumes that input data (typically time series data) from, say, a sensor arrives as a stream and that there are up to three major query modes in use:
1. output is generated as-it-happens, that is, in a continuous fashion.
2. output is generated based on an interactive query by an end user or another system.
3. output is generated in batches.
To serve these three query modes, three main building blocks can be used:
1. a Message Queue/Stream Processing (MQ/SP) block
2. a Database (DB) block and
3. a Distributed File System (DFS) block
The MQ/SP block is capable of buffering data, applying arbitrary business logic to it, and ingesting it into the downstream blocks. The DB block provides fine-grained, low-latency access to the data. Due to the nature of the data, the DB block usually uses a NoSQL solution. Finally, the DFS block performs batch jobs (aggregations, etc.) over the entire dataset, integrates with unstructured data sources (such as images or PDF documents), and offers long-term storage (archiving). In the MQ/SP block especially, we expect to see approximation algorithms whose results are eventually corrected by the DFS layer logic.
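The interplay of the three blocks can be sketched with in-memory stand-ins. This is an illustration under stated assumptions, not a reference implementation of the iot-a. The class names, the reading layout (`sensor`, `ts`, `value`), and the choice of a sliding-window mean as the approximation are all hypothetical. A real deployment would use a message queue, a NoSQL store, and a distributed file system.

```python
from collections import deque


class DatabaseBlock:
    """DB block stand-in: fine-grained lookups keyed by (sensor, timestamp)."""

    def __init__(self):
        self._rows = {}

    def put(self, reading):
        self._rows[(reading["sensor"], reading["ts"])] = reading["value"]

    def get(self, sensor, ts):
        return self._rows.get((sensor, ts))


class FileSystemBlock:
    """DFS block stand-in: append-only archive plus batch jobs over all data."""

    def __init__(self):
        self._archive = []

    def append(self, reading):
        self._archive.append(reading)

    def batch_mean(self):
        # Exact aggregate over the entire dataset.
        values = [reading["value"] for reading in self._archive]
        return sum(values) / len(values)


class StreamBlock:
    """MQ/SP block stand-in: buffers readings, emits an approximate mean,
    and ingests every reading into the downstream blocks."""

    def __init__(self, database, file_system, window=3):
        self._buffer = deque()
        self._recent = deque(maxlen=window)  # keeps only the last `window` values
        self._database = database
        self._file_system = file_system

    def publish(self, reading):
        self._buffer.append(reading)

    def drain(self):
        while self._buffer:
            reading = self._buffer.popleft()
            self._recent.append(reading["value"])
            self._database.put(reading)
            self._file_system.append(reading)
            # Continuous output: mean of the recent window only.
            yield sum(self._recent) / len(self._recent)


db = DatabaseBlock()
dfs = FileSystemBlock()
mq = StreamBlock(db, dfs, window=3)
for ts, value in enumerate([10, 20, 30, 40, 50]):
    mq.publish({"sensor": "s1", "ts": ts, "value": value})

continuous = list(mq.drain())   # query mode 1: as-it-happens
interactive = db.get("s1", 3)   # query mode 2: interactive lookup
batch = dfs.batch_mean()        # query mode 3: batch over everything
```

In this run the last continuous value is 40.0, because the window only sees the three most recent readings, while the batch job over the full archive yields 30.0. That gap is the kind of approximation the DFS layer logic is expected to correct eventually.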
There are many more aspects of the iot-a that are worth discussing, so we've set up a community advocacy site dedicated to this topic.
The iot-a.info site contains a more detailed discussion of the three processing blocks mentioned above, examples for each building block, and a summary of available time series databases.