In his most recent post, my colleague Ronak Chokshi described the evolution of use cases in manufacturing and how the rise of IoT technology has spurred a wave of new investment and innovation. In this post, I'm going to dive deeper into some technical considerations that software architects should weigh when designing infrastructure that can drive improvements in product quality, manufacturing line efficiency, and supply chain accuracy.
Most manufacturing infrastructure modernization projects start with a simple goal: open up existing plant data to modern tools, such as business intelligence tools for better reporting, or data science tools, like R or Python for building predictive models. To achieve these goals, data collection can be as simple as querying the historian for already-collected data points and putting that data into an open data format that new tools can access. While this may indeed be a fine way to kick-start data collection, it is dangerous to be too short-sighted and design application infrastructure around this initial need.
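As an illustration of this kick-start approach, the sketch below pulls already-collected points from a historian and writes them out as CSV, an open format that BI and data science tools can read. The `query_historian` function is hypothetical; real historians expose vendor-specific SQL, REST, or SDK interfaces, and the simulated result set below stands in for whatever yours returns.

```python
import csv
import io

# Hypothetical stand-in for a historian query; real historians expose
# SQL, REST, or vendor SDK interfaces for retrieving tag history.
def query_historian(tag, start, end):
    # Simulated result set of (timestamp, value) pairs.
    return [
        ("2020-01-01T00:00:00Z", 101.2),
        ("2020-01-01T00:01:00Z", 100.9),
    ]

rows = query_historian("pressure-01", "2020-01-01", "2020-01-02")

# Write to an open format that R, Python, and BI tools can all consume.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["timestamp", "value"])
writer.writerows(rows)
csv_text = buf.getvalue()
```

In practice you would write to shared storage rather than an in-memory buffer, but the shape of the pipeline is the same: query, reformat, land in the open.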
The use cases that drive the most value in manufacturing are those that rely on artificial intelligence and machine learning to predict equipment failures, detect quality issues in real time, or pinpoint sources of sub-optimal efficiency. These algorithms require data at a frequency that historians are unable to collect. For example, Tupras, a major oil refinery in Turkey, struggled with its historian's 60-second collection interval across its 200k sensors and designed an infrastructure that could collect data every second to support its new use cases. Collecting at a shorter interval means augmenting the existing data pipelines that feed the historian systems and building new collection mechanisms that interact directly with data sources, like sensors, PLCs, and DCSs. Depending on the source, this could be as simple as interfacing with an existing protocol, like OPC, SCADA, or MQTT, or as involved as writing a custom integration.
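As a sketch of what a second-interval collector might look like, the snippet below polls a sensor-read callable at a fixed interval and buffers timestamped readings. The `read_sensor` callable is a hypothetical stand-in for whatever integration you build (an OPC client call, an MQTT subscription, a PLC register read); it is not from the article.

```python
import time

def collect(read_sensor, interval_s=1.0, max_samples=5):
    """Poll a sensor-read callable at a fixed interval.

    `read_sensor` is a placeholder for a real integration, e.g. an
    OPC client call or a PLC register read.
    """
    buffer = []
    for _ in range(max_samples):
        ts = time.time()
        buffer.append((ts, read_sensor()))
        # Sleep only the remainder of the interval so that the time spent
        # reading the sensor does not accumulate as drift.
        time.sleep(max(0.0, interval_s - (time.time() - ts)))
    return buffer

# Simulated sensor for illustration; a tiny interval keeps the demo fast.
readings = collect(lambda: 42.0, interval_s=0.01, max_samples=3)
```

A production collector would publish each reading downstream as it arrives rather than returning a batch, but the cadence logic is the same.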
In addition to collecting more frequently from existing sources, often it's necessary to add new types of data sources to the factory to support emerging use cases. I wrote an article recently on how video is rapidly gaining popularity for IoT use cases, due to decreases in cost and advancements in image recognition algorithms.
The first decision to make when bringing in a data source is whether to write to a bulk object, like a file or database table, or to send individual events into a stream. It may be tempting to choose based on an initial use case (for example, writing to files to make the data easier to access from R), but this is dangerous for the same reason that siphoning data off the historian is: you may be limiting your options down the road. Consider this: streaming data can always be written to files, but file data can't be replayed into a stream in real time. Starting with file-based ingestion therefore means giving up future real-time use cases.
A common design pattern for building multiple applications off of a single source of data is called "stream of record." The basic idea is to take source data and push it event-by-event into a stream where the data is held indefinitely. Each new real-time application that comes online simply subscribes to this stream. Each new batch or file-oriented application can be supported by building a lightweight process that pulls data out of the stream and writes it to files that are optimized for the application. The best part of this approach is that you can arbitrarily add new applications with different requirements or change your mind on an existing application. A simple example of this technique being used to collect pressure data from instruments in a factory is pictured below.
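To make the pattern concrete, here is a minimal in-memory sketch in Python. The `StreamOfRecord` class stands in for a durable, replayable log (such as Apache Kafka or MapR Event Store); in production the stream would be persistent and distributed, not a Python list, and the consumer names below are illustrative.

```python
import json
import os
import tempfile

class StreamOfRecord:
    """Toy stand-in for a durable log: events are appended once and
    retained indefinitely; each consumer tracks its own read cursor."""

    def __init__(self):
        self.events = []   # retained indefinitely
        self.cursors = {}  # consumer name -> next offset to read

    def publish(self, event):
        self.events.append(event)

    def consume(self, consumer):
        start = self.cursors.get(consumer, 0)
        batch = self.events[start:]
        self.cursors[consumer] = len(self.events)
        return batch

stream = StreamOfRecord()
for i in range(3):
    stream.publish({"sensor": "pressure-01", "reading": 100 + i})

# A real-time application subscribes to the stream directly...
live = stream.consume("alerting-app")

# ...while a lightweight process drains the same stream into files
# optimized for batch tools, without disturbing other consumers.
path = os.path.join(tempfile.gettempdir(), "pressure_batch.jsonl")
with open(path, "w") as f:
    for event in stream.consume("file-writer"):
        f.write(json.dumps(event) + "\n")
```

Because each consumer keeps an independent cursor over the same retained events, adding a new application later is just adding a new cursor.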
Congratulations! Based on the suggestions above, you've built a streaming pipeline that integrates data from 50k sensors every second, plus 2 HD video streams! Assuming 10B per sensor reading, and 5Mbps per video stream, you're now ingesting 14Mbps worth of data. If you're lucky, your factory is in a metropolitan area, and you have a 10Mbps internet connection; if you're not, you may be hanging off of a slow, flaky, high-latency satellite connection. Either way, you're not going to be able to pipe everything to the cloud.
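The back-of-the-envelope arithmetic works out as follows, using the figures quoted above:

```python
# 50k sensors, one 10-byte reading per sensor per second.
sensor_mbps = 50_000 * 10 * 8 / 1_000_000  # bytes -> bits -> megabits: 4 Mbps

# Two HD video streams at 5 Mbps each.
video_mbps = 2 * 5                         # 10 Mbps

total_mbps = sensor_mbps + video_mbps      # 14 Mbps, exceeding a 10 Mbps uplink
```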
Fear not, the majority of the heavy lifting needed to build your AI-assisted factory applications can happen at the "edge," or the factory premises, by using a technique called "think globally, act locally." This technique is based on the realization that training a machine learning model and using that model are two very different actions that can occur in different locations.
Machine learning model training works best under two conditions:
- Access to large amounts of data, aggregated across locations and joined with public or existing datasets.
- Access to large amounts of compute that can be provisioned on demand.
As such, the place where both of these conditions are typically met is in the cloud, or a centralized corporate data center. This is where data would have been aggregated from multiple distributed locations and joined with public or existing data, and compute can be summoned with an API call.
On the other hand, models are best deployed close to where the action is—at the edge. This allows insights and actions to be derived as close to event-time as possible and significantly cuts down on data that needs to leave the factory. That isn't to say that no data is leaving the factory—you need new data to train models on in the cloud, but you can strip out the uninteresting data. The diagram below illustrates the relationship between the applications sitting at the cloud and edge.
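One simple way to strip out the uninteresting data at the edge is to forward only readings that fall outside an expected band. The sketch below illustrates the idea; the `baseline` and `tolerance` values are made up for illustration, and a real deployment would more likely score readings with the deployed model itself.

```python
def should_forward(reading, baseline=100.0, tolerance=5.0):
    """Forward only readings that deviate from the expected band.

    Thresholds here are illustrative; in practice they might come from
    the deployed model or from process control limits.
    """
    return abs(reading - baseline) > tolerance

readings = [100.1, 99.8, 127.4, 100.3, 71.0]

# Only the interesting outliers cross the WAN link to the cloud,
# where they become training data for the next model version.
to_cloud = [r for r in readings if should_forward(r)]
```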
Once you have your factory IoT applications written, the next hurdle is figuring out the operational pieces: how to get the applications deployed, versioned, and secured. Here, there are three challenges to overcome:
- Deployment: rolling the same application out to many factory sites, each with its own hardware and software environment.
- Versioning: keeping applications and models consistent across sites as they are updated from a central location.
- Security: locking down software that runs outside the data center, on the factory floor.
Containerizing applications solves all three of these challenges: a container image bundles an application with its dependencies, so it runs identically at every site; images carry immutable version tags and are distributed from a central registry; and containers isolate applications from one another and from the host.
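As a sketch of what this looks like in practice, here is a minimal, hypothetical Dockerfile for an edge inference service. All names below (the base image choice, the `inference` package, the model filename) are illustrative, not from the article.

```dockerfile
# Hypothetical edge-inference service image; names are illustrative.
FROM python:3.11-slim
WORKDIR /app

# Dependencies are pinned in the image, so every site runs the same stack.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Application code and the model artifact are baked into a versioned image,
# so each factory pulls exactly the build that was tested centrally.
COPY inference/ ./inference/
COPY models/pressure_v7.onnx ./models/

CMD ["python", "-m", "inference.serve"]
```

Tagging this image with a version and pushing it to a central registry gives every site an identical, immutable artifact to pull, which is what makes consistent deployment and rollback across dozens of factories tractable.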
Only MapR provides a comprehensive platform for factory IoT applications.