Vulcan's Impact team is a key part of late Microsoft co-founder and philanthropist Paul G. Allen's network of organizations and initiatives that work to be catalysts for positive change around the world. Empowered by Paul's vision to help create a better world, we take an unconventional approach to tackling some of the world's hardest problems, including using machine learning to curb wildlife poaching and leveraging satellite imagery to combat illegal fishing. But doing so challenges our project teams to iterate quickly on complex software systems to keep up with evolving technologies, innovations in the problem domain, and refinements of project scope to maximize the impact of our effort. For almost two decades, Agile software development methodologies have provided the framework to help engineering teams meet these challenges, and concrete practices like continuous integration (CI), scrum, and a supporting git branching model have been reliable enablers, but they have limitations.
Figure 1 depicts a typical CI workflow. Engineers perform development work and limited testing on a code branch in a local environment, then merge their changes into the main code branch in a central repository. This action triggers automated build, testing, and - if extended with continuous deployment (CD) - deployment to a testing, staging, or production site. If teams are following some common best practices, individual work items are tightly scoped to small, incremental changes which can be built, tested, and deployed rapidly. Keeping the surface area of each change as small as possible makes problems easier to troubleshoot, and small code changes are also easier to deconflict with changes made by other contributors to the code base.
But in this workflow, the CI system can become a quality bottleneck. That's precisely what it's for, you might say, and we agree in principle: a key qualification of any CI system is that it prevents code from being released to end users if it fails any automated build or test steps. However, build and test failures in the CI pipeline can stop all work flowing through the pipeline until the offending code is reverted or corrected, as illustrated in Figure 2. How can we improve our code quality before the CI process begins so that we minimize the likelihood of a disruption in the CI pipeline? How can we enable developers to perform more comprehensive testing before their changes are merged to the main code line?
Assuming, for the moment, that we enjoy a culture in which software engineers embrace their primary responsibility for code quality throughout the software development lifecycle, engineers face the quite practical limitation that complex software systems are just hard to test in local development environments. While good unit tests supply a healthy measure of protection against breaking an individual component of a system, well-designed integration and functional tests will tell us if the components of our system are still working properly together. These are the tests that most often fail our CI pipeline, so these are the tests we want to run before we get there.
Suppose a Vulcan software engineer - let's call her Heather - has completed development on a new feature in component "A" of one of our projects. All of the unit tests are passing, but Heather wants to run the system-wide integration tests before opening a pull request. Component "A" also interacts with components "B" and "C", so she'll need all three running. Can she run the whole system in her local development environment? Maybe... if all of the following are true:
- She has sufficient CPU, memory, and other computing resources to support all three components.
- She has a deployment process that targets her local environment.
- Her local environment is configured precisely the same as the non-local deployment environment.
But these are unlikely to be true in the real world, because local development environments tend to have these well-known limitations:
- They generally aren't powerful enough to run large, complex software systems efficiently.
- They require a different deployment process than a remote environment.
- They run different operating systems, at different patch levels and with different ancillary software installed, than the intended deployment environment. Those differences alone can cause a problem that appears only in the local environment. Worse, they might mask a problem that will only surface in a remote environment.
There are mitigations, of course. Perhaps Heather could run only component "A" on her local machine and configure it to interact with components "B" and "C" that are already deployed somewhere else. But this might require some tricky configuration. It might pollute the "B" and "C" environment with test data. And it depends on network connectivity between the two environments, which may be blocked for security or other reasons. Heather's team might have created a deployment process for local environments, but in doing so, they have accepted the added cost of maintaining two separate deployment processes instead of just one. And there are few practical ways to avoid configuration differences between local and remote environments. What Heather needs is a quick way to build and run the entire software system in an isolated environment that can be repeatably constructed to be identical to the production environment in which it will eventually run. Enter the Vulcan Cloud developer experience platform.
The Platform is a suite of tools and technologies that we developed or integrated to improve and standardize the developer experience here at Vulcan, including addressing this problem specifically. Using the Platform, with a single command, developers can create a CI pipeline identical to the main pipeline and build, test, and deploy their own code to an isolated, ephemeral environment identical to the production environment.
Relevant components of the Platform include:
- Concourse CI, our selection for a continuous integration (CI) platform. Among its many advantages, Concourse CI pipelines are configured from file-based artifacts which are kept under source control and treated with the same control and rigor as other parts of the code base. This "configuration as code" concept helps ensure a reliable, repeatable CI process.
- Kubernetes, our orchestration engine for Docker containers. Building Docker images from Dockerfiles under source control helps ensure that every image built from the same Dockerfile is consistent with the last. Deploying Docker containers to Kubernetes using file-based Kubernetes workload definitions provides the same repeatability for each Kubernetes deployment.
- Managed Kubernetes services on Google Cloud and Azure, with support for Amazon Web Services on the roadmap. Cloud computing deployments ensure that sufficient computing power is always available to run the complete software system.
- HashiCorp Vault, a platform for securely storing and retrieving passwords, tokens, keys, and other application secrets.
- A Python command-line application that interacts with cloud IaaS providers and Kubernetes clusters to:
  - Provision and de-provision cloud computing resources
  - Deploy parameterized Kubernetes workloads to managed Kubernetes clusters in the cloud
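To make "parameterized Kubernetes workloads" concrete, here is a minimal sketch of how a Python tool might fill in a Deployment definition for one component. It is not the Platform's actual code: the function name `render_deployment`, the registry URL, and the namespace naming are assumptions made for illustration. The manifest is emitted as JSON, which Kubernetes accepts alongside YAML, so the sketch needs only the standard library.

```python
import json


def render_deployment(component, image, namespace, replicas=1):
    """Build an apps/v1 Deployment manifest for one component.

    All parameter names are hypothetical; they stand in for whatever
    values a real tool would substitute per environment.
    """
    labels = {"app": component}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": component, "namespace": namespace},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": component, "image": image}]},
            },
        },
    }


if __name__ == "__main__":
    # Illustrative values only: an ephemeral namespace and an image tag
    # that a CI build might have produced.
    manifest = render_deployment(
        component="component-a",
        image="registry.example.com/component-a:abc123",
        namespace="heather-new-feature",
    )
    print(json.dumps(manifest, indent=2))
```

Because the same function produces every manifest, two environments differ only in the parameters passed in, which is the property the workload definitions are meant to provide.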
The Platform has enabled the following enhanced software development workflow at Vulcan. We pick up with Heather's earlier conundrum: how can she build and test the entire software stack, including her recent changes to component "A", before her changes are merged into the main code line and run through the project's common CI pipeline?
Satisfied with the performance of her code in her local development environment, but before she opens a pull request, Heather runs a single command on the Platform, then watches as the following actions take place:
- The Platform creates a Concourse CI pipeline identical to her project's common CI pipeline, but one that reads from her branch in the code for component "A".
- The CI pipeline builds all of the components of the software system from code, including component "A" with Heather's changes.
- The CI pipeline runs unit, integration, and other tests from the software components' code bases, testing the entire system with Heather's changes.
- The CI pipeline uses the specified cloud IaaS provider to provision the necessary compute resources, including a Kubernetes cluster.
- The CI pipeline deploys Kubernetes workloads to the cluster, using the specific Docker images built by the same CI pipeline.
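The first of these steps, creating a pipeline that matches the common one except for the branch it reads, can be sketched as follows. This is an illustration under stated assumptions rather than the Platform's implementation: the pipeline is represented as a Python dictionary mirroring the structure of a Concourse pipeline file (`resources` with a `git` type and a `source.branch`, `jobs` with a `plan`), while the resource names, repository URIs, branch name, and the function `derive_ephemeral_pipeline` are hypothetical.

```python
import copy

COMPONENTS = ("component-a", "component-b", "component-c")

# Hypothetical stand-in for a project's common pipeline definition.
COMMON_PIPELINE = {
    "resources": [
        {
            "name": name,
            "type": "git",
            "source": {
                "uri": f"https://git.example.com/project/{name}.git",
                "branch": "main",
            },
        }
        for name in COMPONENTS
    ],
    "jobs": [
        {
            "name": "build-test-deploy",
            "plan": [{"get": name, "trigger": True} for name in COMPONENTS],
        }
    ],
}


def derive_ephemeral_pipeline(common, resource_name, branch):
    """Return a copy of `common` whose named git resource tracks `branch`.

    The common pipeline is left untouched, and every other resource keeps
    following its original branch.
    """
    pipeline = copy.deepcopy(common)
    matched = False
    for resource in pipeline.get("resources", []):
        if resource.get("type") == "git" and resource.get("name") == resource_name:
            resource.setdefault("source", {})["branch"] = branch
            matched = True
    if not matched:
        raise KeyError(f"no git resource named {resource_name!r}")
    return pipeline


if __name__ == "__main__":
    ephemeral = derive_ephemeral_pipeline(
        COMMON_PIPELINE, "component-a", "heather/new-feature"
    )
    for resource in ephemeral["resources"]:
        print(resource["name"], "->", resource["source"]["branch"])
```

The derived definition could then be written to a file and registered under its own pipeline name, for example with Concourse's `fly set-pipeline` command. Copying the definition rather than editing it in place keeps the common pipeline as the single source of truth.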
Figure 3 depicts the resulting infrastructure. Not only has Heather run all of the project's tests on a complete, isolated, working copy of the entire software system, but now she can run manual tests, add data, change configurations, or perform any other actions with confidence because:
- This ephemeral environment is hers to break. No other external systems have dependencies on it.
- If her environment gets into an unrecoverable bad state, Heather can easily destroy it and re-deploy it with strong guarantees that the new ephemeral environment will be identical to the first one.
And because her environment includes its own CI pipeline, any subsequent changes that Heather commits to her component "A" code branch will automatically trigger another build-test-deploy sequence. Furthermore, because her CI pipeline is also watching the main code lines for components "B" and "C", any changes there will also trigger an automatic build-test-deploy sequence in Heather's ephemeral environment, which ensures that she always knows how her component "A" changes will fare with the current versions of components "B" and "C".
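The triggering rule described here is simple enough to state as code. The sketch below is only a model of the behavior, not how Concourse evaluates triggers internally; the branch name `heather/new-feature` and the function `triggers_rebuild` are hypothetical.

```python
# Branches that Heather's ephemeral pipeline watches (illustrative names):
# her feature branch for component "A", the main line for "B" and "C".
WATCHED_BRANCHES = {
    "component-a": "heather/new-feature",
    "component-b": "main",
    "component-c": "main",
}


def triggers_rebuild(component, branch, watched=WATCHED_BRANCHES):
    """Return True if a commit to `branch` of `component` should start a
    build-test-deploy run in the ephemeral environment."""
    return watched.get(component) == branch


if __name__ == "__main__":
    print(triggers_rebuild("component-a", "heather/new-feature"))  # True
    print(triggers_rebuild("component-b", "main"))                 # True
    print(triggers_rebuild("component-a", "main"))                 # False
```

Note the last case: a commit to the main line of component "A" does not rebuild Heather's environment, because her pipeline reads component "A" from her branch only.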
Satisfied that her changes to component "A" will play nicely with the rest of the system, Heather opens a pull request to get her changes reviewed and merged into the main code line. Her ephemeral environment remains available to assist with code reviews, demonstrations, and subsequent changes until her code is merged or she otherwise decides the environment is no longer needed. At that point, Heather runs another single Platform command that destroys the ephemeral CI pipeline, Kubernetes cluster, and other resources. Figure 4 shows the last step in the new workflow.