Microservices are relatively simple, single-purpose applications that work in unison via lightweight communications. Contrast these with monolithic applications, which take on multiple functions and require heavy coordination when they are integrated with other applications. Event-driven microservices use event streaming engines (like MapR Event Store or Apache Kafka) as the lightweight communications system, and are an increasingly popular means of powering modern big data solutions. In fact, some organizations refer to this architecture as a “streams first architecture,” because the streams component is a primary differentiator from other architectures. But the principles behind microservices still play a big role in the overall benefits.
Coordinated microservices form a data pipeline as part of a larger big data solution.
Why are microservices useful? You gain agility because they are small and therefore relatively easy to build, and typically require minimal coordination when integrating them with other microservices. They’re efficient because they can typically be built and maintained by small, often cross-functional teams. And they offer flexibility by promoting reuse across different solutions.
While converged applications benefit from a microservices architecture, the reverse is true as well: a microservices approach benefits from a converged application approach, and from a converged data platform. As an example, fewer components are needed to tie the operational (current, real-time) data to the analytical (historical) data. If each microservice has complete and immediate access to all data, you remove the steps in the pipeline that merely move and copy data. A microservices architecture also gains from other characteristics of a converged data platform including:
The last bullet point above calls out capabilities in the MapR Converged Data Platform that help with microservices lifecycles. Both capabilities are based on a built-in MapR construct known as “volumes.” MapR volumes are a powerful way to organize data in a cluster. They are logical partitions that grow automatically as the data in the volume grows and are not tied to specific nodes or storage devices. But they can optionally be tied to specific hardware if desired through “data placement control” for the purposes of multi-temperature data topologies where “hot data” gets placed on more powerful nodes.
These two lifecycle management patterns, data versioning and output testing, are described below.
Data versioning is especially important in an event streaming environment, particularly when new application versions need to be tested against prior versions. During an application development lifecycle, enhancements to the code will result in different outputs. These outputs need to be preserved so that results can be compared and improvements verified. They can take the form of database records, files, and event data. Managing these different formats as a single collective “version” can be difficult without platform support.
Different versions of database records, files, and event data need to be tracked and managed together in a streaming environment.
This is where MapR volumes help. Since volumes are logical partitions in your cluster that can contain databases, files, and streams, the output of each application version can be directed to its own volume. All data for version 1.0 of your application is organized together, and each successive version gets a separate volume for its newer data. Once the newer data has been checked and validated against an older version, the entire older data version can be deleted in one step by removing its volume.
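One way to picture this layout is a small helper that derives every output location from the application version. The sketch below is illustrative only: the mount prefix `/mapr/my.cluster.com/apps/scoring`, the sub-path names, and the `VersionedOutputs` helper are all hypothetical, and it assumes that a volume for each version has already been created and mounted at the corresponding path. The `<stream path>:<topic>` form used for event data is likewise an assumption about how topics are addressed.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical directory under which one volume per application version is mounted.
APP_ROOT = PurePosixPath("/mapr/my.cluster.com/apps/scoring")


@dataclass(frozen=True)
class VersionedOutputs:
    """Output locations for one application version, all inside one volume."""

    version: str

    @property
    def volume_path(self) -> PurePosixPath:
        # Everything this version writes lives below this single path.
        return APP_ROOT / f"v{self.version}"

    @property
    def files_dir(self) -> PurePosixPath:
        return self.volume_path / "files"

    @property
    def table_path(self) -> PurePosixPath:
        return self.volume_path / "tables" / "results"

    def stream_topic(self, topic: str) -> str:
        # Assumed "<stream path>:<topic>" naming for event data.
        return f"{self.volume_path / 'streams' / 'events'}:{topic}"


outputs_v1 = VersionedOutputs("1.0")
outputs_v2 = VersionedOutputs("1.1")
```

Because files, tables, and streams for a version share one path prefix, retiring version 1.0 in this sketch means removing the single volume behind `outputs_v1.volume_path` rather than cleaning up three kinds of storage separately.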
In addition, input data can be organized in a volume and then be preserved with a snapshot. This creates an immutable copy of the data that can be used as the basis for ongoing testing for future versions of your application. You can keep enhancing your application and run it against a known data set to ensure you can identify changes that are a direct result of your code changes, not due to changes in the data.
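The value of a fixed input set can be sketched as a simple regression comparison. The example below assumes the snapshotted input has already been read into a list of records (reading from the snapshot is not shown), and the `compare_versions`, `score_v1`, and `score_v2` names are hypothetical stand-ins for two versions of an application.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, float]
Scorer = Callable[[Record], float]


def compare_versions(
    frozen_input: Iterable[Record],
    old_version: Scorer,
    new_version: Scorer,
) -> List[Tuple[Record, float, float]]:
    """Run both versions over the same input and return the records whose outputs differ."""
    diffs = []
    for record in frozen_input:
        old_out = old_version(record)
        new_out = new_version(record)
        if old_out != new_out:
            diffs.append((record, old_out, new_out))
    return diffs


# Hypothetical scoring logic for two versions of the application.
def score_v1(record: Record) -> float:
    return record["amount"] * 0.10


def score_v2(record: Record) -> float:
    # The enhancement: larger amounts are scored at a higher rate.
    rate = 0.10 if record["amount"] < 100 else 0.12
    return record["amount"] * rate


# Stand-in for records read from the snapshot; the input never changes between runs.
frozen_input = [{"amount": 50.0}, {"amount": 200.0}]
diffs = compare_versions(frozen_input, score_v1, score_v2)
```

Since the input is identical on every run, any entry in `diffs` can be attributed to the code change between the two versions.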
In streaming environments, especially those dealing with machine learning, testing different outputs simultaneously can be a useful practice. Model validation in a test environment is certainly valuable, but applying your models in a real world environment can offer an additional level of insight.
Different outputs need to be organized and tracked separately in a manageable way.
One efficient way to implement A/B or multivariate testing is to segregate your outputs into different MapR volumes. As noted in the previous section, database records, files, and event data can be organized and managed together, so the outputs for each model in your A/B or multivariate testing can be stored in distinct volumes. The outputs of each model can then be compared to identify which modified characteristics of your application lead to more desirable results.
In the diagram above, three distinct versions of the application are run in parallel, with one version getting the bulk of the inputs. Should one of the application versions prove to be superior, all inputs can be redirected to that version. The volumes for the other versions can then be snapshotted to preserve the data for future analysis, or deleted completely.
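A minimal sketch of such a split is a weighted router that assigns each input to one version. Everything here is an assumption made for illustration: the version names, the 80/10/10 weights, and the choice to hash an event key so that the same key always reaches the same version.

```python
import hashlib
from typing import Dict

# Hypothetical traffic split: one version receives the bulk of the inputs.
WEIGHTS = {"v1": 80, "v2": 10, "v3": 10}


def route(event_key: str, weights: Dict[str, int]) -> str:
    """Pick an application version for an event, in proportion to the weights."""
    if any(w < 0 for w in weights.values()) or sum(weights.values()) <= 0:
        raise ValueError("weights must be non-negative and sum to a positive number")
    total = sum(weights.values())
    # Hash the key so the assignment is stable across runs and processes.
    digest = hashlib.sha256(event_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    upper = 0
    for version, weight in weights.items():
        upper += weight
        if bucket < upper:
            return version
    raise AssertionError("unreachable: bucket is always below the total weight")


chosen = route("customer-42", WEIGHTS)
```

Redirecting all inputs to a winning version then amounts to changing the weights, for example `route(key, {"v2": 1})`, while each version keeps writing to its own volume.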
The topology in the diagram above also applies to deploying new applications in a production environment. Instead of “flipping the switch” to a brand new application version, you can deploy a new version in parallel, using the same input data as the prior version, to ensure a more graceful upgrade process.
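The parallel deployment described above can be sketched as follows. This is a simplified, single-process illustration, not a description of any platform feature: `handle_event`, `live`, `candidate`, and `candidate_log` are hypothetical names, and in practice each version would typically be a separate consumer of the same input stream writing to its own volume.

```python
from typing import Any, Callable, List, Optional, Tuple

LogEntry = Tuple[Any, Any, Optional[Exception]]


def handle_event(
    event: Any,
    live: Callable[[Any], Any],
    candidate: Callable[[Any], Any],
    candidate_log: List[LogEntry],
) -> Any:
    """Serve the live version's result and record the candidate's result on the side."""
    try:
        candidate_log.append((event, candidate(event), None))
    except Exception as exc:
        # A failing candidate must not affect the result served by the live version.
        candidate_log.append((event, None, exc))
    return live(event)
```

Only the prior version's output is served until the recorded candidate results have been reviewed, at which point the new version can take over the input.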
Get started with microservices using the resources below, especially the O’Reilly Streaming Architecture eBook by MapR chief application architect Dr. Ted Dunning and big data expert Dr. Ellen Friedman.
Agile Data Processing Pipelines Using Microservices and MapR
This paper describes the mechanics of a data processing pipeline, implemented as a series of microservices and powered by a next-generation converged data platform.
Streaming Architecture: New Designs Using Apache Kafka and MapR Event Store
This book by Ted Dunning and Ellen Friedman is a great resource on event streams and how they make up the foundation of a microservices architecture.