A Practical Guide to Microservices and Containers

by James A. Scott

Buyer's Guide to a Modern Data Architecture

As you begin the process of moving to a modern and agile data architecture, keep a set of operating principles and objectives in mind to minimize complexity and future-proof your choices.

Walking the Walk: Quantium’s Journey to Developer Agility

Data analytics developer Quantium has embraced many of the principles and technologies described in this book, and its development and IT teams are seeing the payoff in improved business agility, better decision-making and faster results.

A few years ago, the advanced analytics provider decided to move much of its development work from a Microsoft SQL Server environment to a MapR big data platform using containers. The idea was to give developers more control over their own environments while enabling innovations to be easily shared.

“The old way of coding would be going to the IT department and asking them to spin up a VM. You’d have to wait a week for that,” said Gerard Paulke, a platform architect at Quantium. “If you used up all the RAM, you’d have to ask for another VM. Managing resources was very difficult.”

Shared infrastructure had other shortcomings as well. If a VM went down, so did all the processes running on it, and there was no guarantee that other VMs would be able to pick up the load. Version control was a chore. Developers couldn’t use the latest versions of their favorite tools until they had been installed and tested by the IT organization. And upgrades could break software that had been created to work with earlier versions of those tools.

Containers now provide much of the functionality that was formerly served by virtual machines. Developers have the freedom not only to launch their own environments whenever they want but also to work with their preferred tools without interfering with others. “For example, we can have multiple versions of the Hive metastore in different containers,” without causing conflicts, Paulke said. “It’s agile and resilient.”

Quantium created a common base Docker image that has all the components needed for secure development. Developers can use this image as a template for constructing their own environments.

“If developers want to try something new, they can just spin up an edge node,” Paulke said. “If they like it, we can containerize it and put it into our app store for anyone to launch.” Sharing containers enables everyone to benefit from each other’s innovations.
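
As a rough illustration of that workflow, the sketch below uses the Docker SDK for Python to launch a personal development environment from a shared base image. The registry path, labels and resource limit are hypothetical placeholders, not Quantium’s actual configuration.

```python
# Launch a personal development environment from a shared base image.
# A minimal sketch: the image name, labels and limits are hypothetical.
import docker

client = docker.from_env()  # connect to the local Docker daemon

container = client.containers.run(
    "registry.example.com/dev/base-secure:latest",  # hypothetical shared base image
    name="dev-env-jsmith",
    detach=True,                      # run in the background
    mem_limit="4g",                   # cap RAM so one environment can't starve others
    labels={"team": "analytics", "purpose": "edge-node"},
    command="sleep infinity",         # keep the container alive for interactive use
)
print(f"Started {container.name} ({container.short_id})")
```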

Automation and orchestration tools have taken over the deployment, scaling and management of containerized applications from human administrators. Apache Mesos and Marathon, orchestration tools that predate Kubernetes, provide isolation and resource allocation so containers don’t conflict with each other. If a VM fails, any containers running on it are automatically shifted to another VM. Orchestration software also automatically finds the resources a given container needs so that it launches smoothly.
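
To give a sense of how an orchestrator is driven, here is a hedged sketch that submits a simple application definition to Marathon’s REST API with the requests library. The Marathon endpoint, image name and resource figures are assumptions for illustration only.

```python
# Submit a simple application definition to a Marathon orchestrator.
# A sketch only: the Marathon endpoint, image and sizing are hypothetical.
import requests

app_definition = {
    "id": "/analytics/hive-metastore-23",     # hypothetical application id
    "cpus": 0.5,
    "mem": 1024,                              # MB per instance
    "instances": 2,
    "container": {
        "type": "DOCKER",
        "docker": {"image": "registry.example.com/hive-metastore:2.3"},
    },
}

# If a host fails, the orchestrator restarts the lost instances elsewhere
# to keep the application at its declared scale.
response = requests.post("http://marathon.example.com:8080/v2/apps",
                         json=app_definition, timeout=30)
response.raise_for_status()
print("Marathon accepted the app; HTTP status", response.status_code)
```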

“From the user’s perspective, their services are always available,” Paulke said. “We’re running hundreds of applications, and literally one person can manage the whole infrastructure.”

For others who are interested in adopting containers, Quantium advises designing the platform to be self-service from the ground up. Use a basic set of common container images that developers can check out of a library, and expose as much as possible through well-documented APIs.

Bare-metal servers should be identically configured to the greatest degree possible so that automation can be handled smoothly in software, using tools like Puppet and Ansible. Use playbooks to track changes to the environment and enable backtracking to earlier versions, if necessary.

Finally, talk a lot and listen even more. In moving to an agile environment, “I’ve found people issues are the biggest issues,” Paulke said. Developers need to get comfortable with the idea of self-service, and IT administrators must learn to give up some control. Once they see how fast and flexible the new world of development can be, however, they won’t want to go back.

Backups, disaster recovery and business continuity. Among new technologies this tends to be the most overlooked capability, because it is generally ignored until too late in an implementation. Ensuring that the core platform storing the data can be properly backed up and restored is critical to the longevity of any business. Backing up massive volumes of data is often easier said than done: when the question of backing up a Hadoop cluster was raised at a large international conference, only one person out of 100 said they had tested backing up and recovering their platform. Make this the number one capability in your research.
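
One way to make “test your backups” concrete is a restore smoke test like the standard-library Python sketch below, which compares checksums of restored files against the originals. The directory paths are hypothetical.

```python
# Minimal backup smoke test: restore into a scratch area, then verify that
# checksums of the restored files match the originals. Paths are hypothetical.
import hashlib
from pathlib import Path

def checksum(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore(source_dir: str, restored_dir: str) -> list:
    """List files that are missing or differ after a restore."""
    source, restored = Path(source_dir), Path(restored_dir)
    problems = []
    for original in source.rglob("*"):
        if not original.is_file():
            continue
        copy = restored / original.relative_to(source)
        if not copy.exists() or checksum(copy) != checksum(original):
            problems.append(str(original.relative_to(source)))
    return problems

if __name__ == "__main__":
    mismatches = verify_restore("/data/critical", "/restore-test/critical")
    print("Restore verified" if not mismatches else f"Mismatched files: {mismatches}")
```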

Minimize data movement. As the internet of things brings more data-generating devices online and overall data volumes continue to swell, you’ll want to think about the most appropriate places to process data in order to minimize bandwidth usage and its accompanying network latency. Intelligent edge devices will increasingly be critical to efficient design. Avoid centralizing all your processing in a few servers. When working with cloud vendors, use service-level agreements that specify response time thresholds. Be aware of surcharges for data transfer between on-premises and cloud environments.

Unified/flexible security model. As noted earlier, containers and microservices introduce complexity, which can lead to security vulnerabilities. Adopt a policy-driven governance structure, use group-based authentication and minimize colocation of production, development and public-facing services. Automate as much as you can through scripting and orchestration services like Kubernetes.

Establish data governance practices. Data governance procedures ensure that data is available, usable, valid and secure. A data governance program typically includes a governing body, a defined set of procedures, ownership criteria and execution plans. A policy specifies who is accountable for data, and processes are established that define how the data is stored, archived, and protected. The policy also specifies how data can be used and by whom. If government regulations are involved, that is included as well. Having a mature data governance practice in place enhances security, reduces duplication and creates a foundation for use and growth of data.

Establish data lineage and auditing procedures. As data flows through a process from capture to storage to refinement to use in production or analytic applications, it’s important to track and audit its status at all stages in order to prevent errors or omissions. Data lineage provides an audit trail of data points at each stage, with the ability to step back through the process for debugging or error correction.
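
A lineage trail can start as simply as an append-only log of events recorded at each stage. The sketch below, using only the Python standard library, shows one possible shape for such records; the stage names and toy pipeline are illustrative assumptions rather than a prescribed format.

```python
# Append-only lineage log: one record per dataset per pipeline stage.
# Stage names and the example pipeline are illustrative, not a standard.
import hashlib
import json
import time

LINEAGE_LOG = []  # in practice this would be a durable, queryable store

def record_lineage(dataset_id, stage, payload, derived_from=None):
    """Capture what a stage produced, when, and what it was derived from."""
    LINEAGE_LOG.append({
        "dataset_id": dataset_id,
        "stage": stage,                      # e.g. capture -> store -> refine -> serve
        "derived_from": derived_from,
        "content_sha256": hashlib.sha256(payload).hexdigest(),
        "recorded_at": time.time(),
    })

# Walk a toy pipeline and leave an audit record at every step.
raw = b'{"sensor": 42, "temp_c": 21.5}'
record_lineage("sensor-feed-001", "capture", raw)
refined = json.dumps({"sensor": 42, "temp_f": 70.7}).encode()
record_lineage("sensor-feed-001", "refine", refined, derived_from="capture")

# Stepping back through the process is a reverse scan of the log.
for event in reversed(LINEAGE_LOG):
    print(event["stage"], "<-", event["derived_from"], event["content_sha256"][:12])
```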

Apply Master Data Management. MDM is a technique for linking all critical data in an organization to one master file that serves as a common point of reference. With MDM, you have one canonical reference to critical information that points to all related data, regardless of where it resides. Access to that information can then be granted on a need-to-know basis without regard to organizational boundaries. MDM can be applied across cloud and on-premises data. The discipline minimizes errors and duplication, provides employees with the most up-to-date data, simplifies ETL and reduces compliance risk.
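
The master-file idea can be illustrated with a minimal sketch: a canonical record that points to the related records in the systems where they actually live. The system names and keys below are hypothetical, and real MDM products layer matching, survivorship and stewardship workflows on top of this.

```python
# One canonical master record per customer, pointing to the systems of record.
# System names and keys are hypothetical placeholders.
from dataclasses import dataclass, field

@dataclass
class MasterRecord:
    master_id: str
    canonical_name: str
    # Where the related data actually resides: system name -> native key.
    references: dict = field(default_factory=dict)

master = MasterRecord(
    master_id="CUST-000042",
    canonical_name="Acme Pty Ltd",
    references={
        "crm": "acct-9913",            # cloud CRM record
        "billing": "B-7781",           # on-premises billing system
        "support": "org-55210",        # ticketing system
    },
)

def lookup(system: str) -> str:
    """Resolve a system's native key through the single point of reference."""
    return master.references[system]

print(lookup("billing"))   # -> B-7781
```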

Support multiple data formats. Data can no longer be assumed to fit neatly into rows and columns, or to conform to standard formats. As new data sources like the internet of things come online, data formats will proliferate. Consider your future needs to process such data types as images, audio/video, geolocation, temperature and geometric data. Data stores should accommodate the JSON data interchange format for flexibility and readability.
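
As a small example of why JSON suits heterogeneous sources, the snippet below (standard library only, with made-up readings) combines geolocation, temperature and a media reference in one self-describing document that a flexible data store can accept without a rigid schema.

```python
# A self-describing IoT reading: fields vary by device, so no fixed schema.
# The readings, field names and storage URI are made up for illustration.
import json

reading = {
    "device_id": "sensor-17",
    "recorded_at": "2018-03-02T08:15:00Z",
    "geolocation": {"lat": -33.8688, "lon": 151.2093},
    "temperature_c": 22.4,
    "image_ref": "s3://bucket/frames/sensor-17/000123.jpg",  # binary stored elsewhere
}

document = json.dumps(reading)          # human-readable interchange format
restored = json.loads(document)         # round-trips without schema negotiation
print(restored["geolocation"]["lat"])
```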

Build scalable infrastructure. A scale-out architecture gives you maximum flexibility to expand your resources as needed. It should be your architectural choice for servers, storage and networking equipment. Scale-up architectures are sometimes needed for performance-intensive uses, such as network switching, but they impose limitations that can be expensive to resolve if you run out of capacity.

Allow for geo-distribution. Many organizations are decentralizing data analysis, enabling regional teams to maintain independent databases that are periodically consolidated with the head office. Low-latency geo-distribution of very large-scale databases and computation engines can provide a competitive edge, but only if good data governance principles are in place. The architecture must provide persistence across multiple data types, with data shared and updated everywhere at the same time with fine-grained access control. Containers and microservices are well-suited for a geo-distributed approach.

Use standard APIs. Avoid hard-wired interfaces where possible and instead use APIs. This makes it easier for your organization to share application functionality and for you to selectively expose functions to others. Where possible, use standardized APIs such as those specified by OpenStack or the frameworks of your preferred programming languages.
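
To make the contrast with hard-wired interfaces concrete, the sketch below exposes a single function over HTTP and JSON using Flask. Flask, the endpoint path and the payload shape are assumptions chosen for brevity, not a recommendation from the text.

```python
# Expose one function through a small HTTP/JSON API instead of a hard-wired call.
# Flask is an assumed choice; the endpoint and payload shape are illustrative.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/v1/customers/<customer_id>/score", methods=["GET"])
def customer_score(customer_id: str):
    # In a real service this would call the scoring logic; here it is stubbed.
    return jsonify({"customer_id": customer_id, "score": 0.87, "model": "v3"})

if __name__ == "__main__":
    app.run(port=8080)  # consumers call GET /v1/customers/<id>/score
```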

Support containerization. We’ve already made the case at length for adopting containers to take advantage of their flexibility, speed of deployment, configurability and automation. Containers should be a standard part of any agile infrastructure.

Support SQL across all data formats. SQL is the lingua franca of data access, and it is supported by nearly every major database platform, either natively or by extension. Not every product that claims SQL compatibility supports the full set of SQL commands, however, and there are also multiple versions of SQL to consider. Define a base set of query functions you will need and ensure that they are supported by any database management system you bring into your environment. ANSI SQL is the key to compatibility.
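
One way to maintain that base set of query functions is as a checklist of probe queries that can be run against any candidate engine. The sketch below runs a few such probes against an in-memory SQLite database purely as a stand-in target; the feature list is illustrative, not a complete ANSI SQL conformance test.

```python
# Probe a candidate database for the SQL features your applications rely on.
# SQLite is used only as a stand-in target; the probe list is illustrative.
import sqlite3

PROBES = {
    "joins":             "SELECT 1 FROM (SELECT 1 AS a) x JOIN (SELECT 1 AS a) y ON x.a = y.a",
    "aggregates":        "SELECT COUNT(*), AVG(a) FROM (SELECT 1 AS a)",
    "common table expressions": "WITH t AS (SELECT 1 AS a) SELECT a FROM t",
    "window functions":  "SELECT ROW_NUMBER() OVER (ORDER BY a) FROM (SELECT 1 AS a)",
}

def check_support(connection) -> dict:
    """Run each probe and record whether the engine accepts it."""
    results = {}
    for feature, query in PROBES.items():
        try:
            connection.execute(query)
            results[feature] = True
        except Exception:
            results[feature] = False
    return results

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    for feature, supported in check_support(conn).items():
        print(f"{feature}: {'supported' if supported else 'NOT supported'}")
```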

Support multiple machine learning frameworks. Machine learning is in its early stages of maturity, and there are numerous libraries available, as was described in Chapter 4. Each of these libraries has its own strengths and weaknesses, so ensure that your environment can accommodate a variety of libraries as your needs evolve. File system standards are the key.

Support on-premises and cloud infrastructure. Give yourself as much room as possible to decide which platforms to use for which applications. If adopting a private or hybrid cloud infrastructure, choose platforms that are compatible with those of public cloud providers so that you may easily shift workloads. This approach also enables you to move on-premises applications to the public cloud in stages by first testing them on a private cloud.

Budget for training. Many of the technologies described earlier in this book will require significant retraining of existing staff. This can be costly, but it is usually less expensive to retrain than to hire new developers, who don’t have full knowledge of your business. Perform a knowledge assessment as part of your transition planning and develop a timeline and budget for upgrading necessary skills.

Build a DevOps and DataOps culture. When moving to a DevOps and DataOps development approach, be prepared for significant disruption as the organization adopts agile processes. This is a major departure from traditional waterfall lifecycles, and education will be required of both developers and users. Developers must acclimate themselves to more rapid code releases and greater control over their environment. Users must be ready to engage more closely with the development process as code reviews become more frequent. This is a cultural shift that often requires the IT organization to play the role of the evangelist in selling the benefits of DevOps and DataOps across the organization.

Prepare an application migration plan. Moving existing applications to cloud infrastructure can involve as much work as rebuilding them from scratch. Analyze your application portfolio and determine the degree of difficulty involved in re-platforming. Some legacy applications may be best left to on-premises infrastructure until replaced. Some may be better candidates for rebuilding as microservices in the short term. Others may move smoothly to the cloud with little modification. If your plans call for rebuilding any existing applications, always consider a microservices approach.

Define administrative tasks. Automation and DevOps can have significant impact on tasks like systems administration, backup and security. Many tasks that were once performed manually can be automated. Backup may become more complex in a heavily containerized environment. As noted earlier, containers and microservices also introduce new security concerns. In most cases, moving to agile infrastructure reduces overall systems management overhead, but it doesn’t reduce the need for new skills. Plan your training calendar and budget accordingly.

Assess network impact. Containers and microservices don’t necessarily increase network overhead, but network performance can be affected if load-balancing and resource optimization aren’t effectively applied. Move to containerization in stages and use your test environment to measure the impact on networks. Your network managers may require some retraining.

Adopt orchestration. As noted earlier, anything greater than a modest container deployment benefits from orchestration. Tools like Kubernetes eliminate manual provisioning and management and give you the flexibility you will need to scale your containerized environment in a highly automated fashion.

Developing the Business Case for a Modern Data Architecture

As organizations move to a modern data architecture, many are using what can be called a “connected” approach. For example, they may use Hadoop and Spark for analytics, Kafka for streaming, HBase or Cassandra for operational data stores and MongoDB for document-oriented processing. Data sources may include MQTT from lightweight IoT devices, JSON, structured log data, and traditional relational tables. Each engine is assigned a subset of available nodes, and hard-coded interfaces are used to transfer data between services. This approach is common in environments that have added new data types in a piecemeal fashion, or where ownership of data resides in different departments.
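
The hand-coded interfaces in a connected architecture tend to look like the sketch below: a small bridge process that consumes from one engine and writes to another. The kafka-python and pymongo client libraries, topic name and connection strings are illustrative assumptions; the point is that every arrow in the diagram needs a bridge like this, each with its own error handling, monitoring and upgrades.

```python
# A typical hand-coded bridge in a "connected" architecture: read sensor events
# from Kafka and write them into MongoDB. Library choices (kafka-python, pymongo),
# topic names and connection strings are illustrative assumptions.
import json

from kafka import KafkaConsumer        # pip install kafka-python
from pymongo import MongoClient        # pip install pymongo

consumer = KafkaConsumer(
    "sensor-events",                                   # hypothetical topic
    bootstrap_servers=["kafka.example.com:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
collection = MongoClient("mongodb://mongo.example.com:27017")["iot"]["events"]

# Every engine-to-engine interface needs a process like this one,
# maintained separately from the engines themselves.
for message in consumer:
    collection.insert_one(message.value)
```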

An alternative architecture is what can be called a “converged” approach. In this scenario, all services can run anywhere in a cluster and share resources. There may be different topics in a stream that need to read or write data from different sources, but these topics all reside in the same place. The advantage of this approach is that any application can consume data from any source and all resources are shared because they live on the same cluster.

The converged approach has several major advantages over the connected approach, including linear scalability of the resource pool. Shared infrastructure reduces overhead and simplifies management. A unified data platform enables greater application flexibility and streamlines the use of microservices. Data can be persisted as part of the platform through APIs. For example, JSON documents can be persisted directly into the database without connecting code because they’re on the same platform.

A converged platform is superior to a connected one in nearly every case, but moving from a connected to a converged platform is not always simple. The connections between engines in a connected architecture represent a substantial integration effort that need not exist in a converged architecture. Many of the advantages of containers, microservices and orchestration are lost when interfaces must be hand-coded and manually maintained. Developing the business case for a modern data architecture starts with making a commitment to adopt converged principles.

Making a business case starts with understanding the stakeholders and their priorities. If the initiative is driven entirely by the IT organization, it will probably fail. Understand the business context. Is the company in growth mode or is it focused on keeping costs down? Is IT seen as a source of competitive advantage or as a cost of doing business? Is the company attempting a digital transformation or is it content to stay the course? Is management’s vision long-term-oriented or quarter-to-quarter? Organizations that are in cost-control mode or unable or unwilling to embrace change will struggle to make the commitments needed to transform the application architecture.

One company that built a strong business case was Karlsruhe Institute of Technology (KIT), one of the largest and most prestigious research and education institutions in Germany. KIT’s challenge was to develop new technologies to run control centers for renewable energy networks. Because many more devices are active in the creation of renewable energy, the application needed to be able to scale and seamlessly integrate new data formats, particularly those generated by internet of things (IoT) devices. KIT sought to create an ecosystem of big data storage applications using Apache Kafka, Spark and Apache Drill for processing high-velocity data, combined with microservices-based applications that would be integrated into future control centers as its infrastructure expanded.

By adopting the MapR Converged Data Platform, the organization has achieved unprecedented flexibility. The IT organization can execute any file-based application directly on data in the cluster without modification. Both legacy applications and new microservices-based applications can access the same data.

KIT was also able to break down some legacy applications and split them into smaller, scalable parts using microservices and containers. The MapR platform enabled it to combine container automation and big data software on one computing cluster for better performance and data integration.

Ultimately, KIT will build an intelligent national IoT infrastructure that can also be used in other projects that combine high data velocity and volume. By selecting a high-performance platform that supports dedicated services on top to store and analyze data, the Institute is paving the way for a future free of reliance on fossil fuels.[1]

There are four stages to building a business case for a modern data architecture.

Stage 1: Define the business issue. As noted above, companies in cost-containment mode have very different motivations for considering a modern data architecture than those that are undergoing a digital transformation. Making a business case involves understanding where the greatest perceived value will be. A modern data architecture has significant potential both to cut costs and to transform the business, but implementation approaches are quite different.

For example, a company seeking to grow its business can use modernization to unshackle itself from the limits of legacy applications and database platforms, develop applications faster and more nimbly and transform the customer experience. It will probably want to undertake a more sweeping overhaul of its IT infrastructure than a company that is looking to reduce licensing costs by moving to open software, for example. In the latter case, a more modular approach should be pursued, looking for quick wins and incremental growth.

Stage 2: Analyze alternatives and select the best options. Many factors go into selection criteria, including timeframe, budget, technical abilities of current staff, robustness of existing infrastructure and the need to reengineer legacy applications. Rolling your own solution using open source tools is generally the lowest-cost option, but it can place heavy demands upon existing staff. Buying packaged solutions from vendors is more costly but also quicker and less risky. Or you can attempt some combination of the two. Many organizations choose a core set of open source tools – such as Hadoop, Apache Spark, Apache Drill and Docker – and then select an integrator or software vendor that can deliver a platform to support them. Whatever your approach, be sure you are thinking of the long-term and planning for scalability and adaptability to new data sources and applications such as deep learning.

Stage 3: Prepare the business case. Modernizing your data architecture can require a lot of persuasion. Significant cost and time may be involved before benefits are evident. Rather than risk the project on a single presentation to the executive committee, meet individually with key stakeholders in advance to test your case. Ask for feedback on how to best present your case and anticipate the questions and objections you will encounter.

When preparing the case, keep the business objectives in mind. Don’t fall into the trap of making this an IT-centric proposition. Align outcomes with the business goals, providing as much detail as possible about actual anticipated costs, revenues and overall ROI. Lean on business stakeholders to help with this process. Be prepared to present three different cost scenarios: best case, most likely and worst case. Also be prepared to defend your assumptions. An executive committee or board of directors will want your facts and figures to be grounded in reality. Integrators or vendors can help, since they have engaged in similar projects many times.

Stage 4: Deliver the business case. If you have prepared as described above, delivering the case is the easy part. Don’t let your enthusiasm get in the way of making your argument. Your audience will most likely be skeptical and will attempt to poke holes in your model. Thoroughly researching the costs, timelines and paybacks is your best defense.

Final Thoughts

Building applications based upon services isn’t a new idea, but until now structural impediments have prevented organizations from realizing the benefits of this architecture. The combination of software-defined infrastructure and cloud computing has now removed the major obstacles, setting the stage for the most significant transformation in the way applications are built and delivered in the past 50 years.

The monolithic, vertically integrated applications that have defined enterprise computing since the 1960s will increasingly become a liability for companies seeking to realize the agility that will be demanded by digital transformation. All organizations that hope to operate at the speed of business in the future will need to adopt a services model. The only question is when.

The technology tools are now in place. The benefits of a DevOps and DataOps approach are clear. While it may not make sense for every organization to dive in headfirst at this stage, there is no reason not to begin experimentation and long-term planning. The risks of not doing so are too great.

[1] https://mapr.com/resources/karlsruhe-institute-technology-uses-mapr-build-control-centers-renewable-energy

[2] https://mapr.com/ebooks/architects-guide