With the advent of the Internet of Things, organizations are constantly on a mission to track the whereabouts of their assets in real time, whether on the move or stationary. On the other end of the spectrum, they also want to be able to look at the historical data accumulated over time to project the future growth of business and make the right investments based on good analytical results.
Traditionally, these outcomes have been hard to achieve. They required significant investment and careful planning of a system whose hardware and software could monitor movement and capture the data generated by thousands, if not tens of thousands, of sensors moving in all directions at the same time.
The good news is that now organizations have various modern tools on hand to make it all happen – cloud, microservices, software as a service (SaaS), big data analytics, etc. Open source tools such as Kubernetes to orchestrate containers for microservices, Apache Spark for real-time analytics and machine learning to project the future growth, and an advanced modern data platform such as the MapR Data Platform that combines the power of cloud and open source software and supports the strictest of privacy regulations like GDPR, are all at your disposal.
MapR Data Fabric For Kubernetes
Amazon Elastic Container Service for Kubernetes (Amazon EKS) is a managed service that makes it easy for you to run Kubernetes on AWS without needing to install and operate your own Kubernetes clusters. Customers who want a tightly integrated platform that scales with their data analytics needs can benefit from combining EKS with MapR running on AWS.
Below is a list of the benefits of MapR Data Fabric for Kubernetes:
Persist data for containerized applications, so they are stateful.
Microservices running as containers are designed to be ephemeral and highly disposable, since they only provide compute. This means that any data kept in a container's local storage is lost if the container dies for any reason. The MapR Data Fabric for Kubernetes addresses this challenge. By securely persisting data to MapR XD, containers become stateful and can pick up their tasks from where they left off after a restart.
Scale data as containers grow.
By decoupling the storage (MapR Data Fabric) and compute (Kubernetes) infrastructure, organizations can scale their data independently and avoid the overspending that occurs when storage and compute resources are coupled and must grow together.
Protect data with replication, mirroring, and instant snapshots.
Persisted data in MapR XD will be managed by MapR for disaster recovery, data protection, access control, auditing, etc.
Benefit from MapR tickets for end-to-end security.
In the proposed architecture, the containers are orchestrated by Kubernetes and communicate with MapR through the FUSE POSIX client running on the Kubernetes workers. The client inherits the security features that MapR offers, including wire-level security for encryption and MapR tickets for authentication. See https://mapr.com/whitepapers/security-and-big-data-governance-mapr/.
Create MapR volumes easily with dynamic volume provisioning, defined in a Kubernetes storage class.
Static and dynamic volume provisioning are both supported; however, if you have a large number of containers that need volumes, then dynamic provisioning will automatically and effortlessly handle the creation/deletion of the volumes, according to the policies defined in a storage class.
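To make the storage class idea concrete, here is a minimal sketch that builds a StorageClass and a matching PersistentVolumeClaim as plain Python dictionaries and prints them as JSON. The manifest structure follows the standard Kubernetes API, but the class name, the provisioner string, and every key under `parameters` are hypothetical placeholders, not the actual MapR values; take the real ones from the MapR documentation.

```python
import json


def storage_class(name, provisioner, parameters, reclaim_policy="Delete"):
    """Build a Kubernetes StorageClass manifest as a plain dict."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,
        "reclaimPolicy": reclaim_policy,
        "parameters": dict(parameters),
    }


def volume_claim(name, storage_class_name, size, namespace="default"):
    """Build a PersistentVolumeClaim that requests one volume from the class."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "storageClassName": storage_class_name,
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


# Assumption: all three values below are placeholders for illustration only.
sc = storage_class(
    "citibike-dynamic",               # hypothetical class name
    "example.com/mapr-provisioner",   # hypothetical provisioner string
    {"namePrefix": "citibike", "mountPrefix": "/citibike"},  # hypothetical keys
)

# Each claim that names the class asks the provisioner for a new volume.
pvc = volume_claim("influxdb-data", sc["metadata"]["name"], "5Gi")

manifest = {"apiVersion": "v1", "kind": "List", "items": [sc, pvc]}
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output could be piped to `kubectl apply -f -`. With `reclaimPolicy` set to `Delete`, removing the claim also removes the provisioned volume, which is the automatic creation/deletion behavior described above.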
Taken together, these benefits yield a system in the cloud that can scale compute and storage independently and does not require an army of in-house IT professionals to maintain uptime and handle software updates and maintenance.
The diagram below describes the architecture of this demo. There are three containers orchestrated by Amazon EKS, and a MapR Sandbox is created in the same subnet where the Kubernetes workers are located, so the containers can mount the MapR volumes.
Lambda Architecture in the NYC Citi Bike Demo
The first container is a microservice that grabs the data in real time. The data includes the geolocation of every bike station, along with its address, capacity, number of available bikes, and so on. The second container hosts a time-series InfluxDB microservice that mounts a MapR dynamic volume (similar to MySQL, PostgreSQL, or MSSQL, where a mount point in the OS is used to store the database). The third container is a Grafana service that visualizes the time-series data.
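As an illustration of what the first container might hand to InfluxDB, the sketch below converts one station reading into InfluxDB line protocol, the text format InfluxDB accepts for writes. The record layout, the field names, and the `station_status` measurement name are assumptions made for this example; the demo's actual schema may differ.

```python
def escape_tag(value):
    """Escape commas, equals signs, and spaces, as line protocol requires in tags."""
    text = str(value)
    for ch in (",", "=", " "):
        text = text.replace(ch, "\\" + ch)
    return text


def format_field(value):
    """Integers carry an 'i' suffix in line protocol; floats are written plainly."""
    if isinstance(value, int):
        return f"{value}i"
    return repr(float(value))


def to_line_protocol(station, timestamp_s):
    """Render one station reading as a single line-protocol point."""
    tags = {"station_id": station["id"], "name": station["name"]}
    fields = {
        "bikes_available": int(station["bikes_available"]),
        "capacity": int(station["capacity"]),
        "lat": float(station["lat"]),
        "lon": float(station["lon"]),
    }
    tag_part = ",".join(f"{k}={escape_tag(v)}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={format_field(v)}" for k, v in fields.items())
    # Line protocol timestamps default to nanosecond precision.
    return f"station_status,{tag_part} {field_part} {timestamp_s * 1_000_000_000}"


# Hypothetical reading; the keys are illustrative, not the feed's real schema.
station = {
    "id": 72,
    "name": "W 52 St & 11 Ave",
    "lat": 40.76727216,
    "lon": -73.99392888,
    "capacity": 39,
    "bikes_available": 12,
}

line = to_line_protocol(station, 1500000000)
print(line)
```

Keeping the station identity in tags and the counts in fields is one reasonable design choice: tags are indexed, so a Grafana query that filters by station stays cheap, while the numeric fields remain available for aggregation.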
The ingested data splits into two paths upon entering the system. In the first, the data is processed, persisted, and visualized in real time. In the second, raw data is persisted into a MapR XD volume, where city planners can look back at the historical demand and supply of each bike station to determine whether to expand or downsize the station's operational capacity, according to budget. Here is how they can do it: the raw historical data can be analyzed by the various open source tools that come with the MapR Data Platform, such as YARN, Spark, Drill, Hive, and Zeppelin. Note that these tools can also be containerized via the MapR Persistent Application Client Container (PACC) and moved into Kubernetes.
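The two-path flow described above can be sketched in a few lines. This is not the demo's code, only a minimal illustration under stated assumptions: a temporary directory stands in for the mounted MapR XD volume, a Python list stands in for the InfluxDB write, and the record keys and the `fill_ratio` metric are hypothetical.

```python
import json
import os
import tempfile


def ingest(record, raw_dir, speed_sink):
    """Send one reading down both paths: raw to disk, processed to the sink."""
    # Batch path: append the untouched record, one JSON object per line,
    # to a per-day file so the history can be analyzed later.
    os.makedirs(raw_dir, exist_ok=True)
    day = record["timestamp"][:10]
    raw_path = os.path.join(raw_dir, f"stations-{day}.json")
    with open(raw_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

    # Speed path: keep only what a live dashboard needs.
    capacity = record["capacity"]
    point = {
        "station_id": record["station_id"],
        "timestamp": record["timestamp"],
        "fill_ratio": record["bikes_available"] / capacity if capacity else 0.0,
    }
    speed_sink.append(point)
    return point


# Assumption: a temp directory plays the role of the mounted volume.
raw_dir = tempfile.mkdtemp(prefix="citibike-raw-")
speed_sink = []  # stands in for the time-series database

record = {
    "station_id": 72,
    "timestamp": "2018-06-01T08:00:00Z",
    "bikes_available": 12,
    "capacity": 40,
}
point = ingest(record, raw_dir, speed_sink)
print(point)
print(os.listdir(raw_dir))
```

Writing the raw record before deriving anything from it keeps the batch path lossless, so questions that the real-time view was never designed to answer can still be asked of the history later.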
Visualization of Citi Bike Demo with Grafana
Analyzing Raw JSONs with Apache Drill
You can also use the MapR Control System (MCS) portal to manage disaster recovery, quotas, access logs, and more for the dynamic volumes you just created.
The MapR Control System (MCS) UI
The demo instructions are available here: https://github.com/maprpartners/citibike/blob/master/README.md.