DevOps: The Dev experience and the Ops experience
When we started working on our products to protect ocean health, we built a streaming platform to monitor fishing vessels in near real time around the globe. This means hundreds of messages a second flowing in and through our processing stages before ending in a GIS-enabled Postgres database. At the time, our organization was moving to a new paradigm: Kubernetes orchestration of deployable containers in the cloud.
When organizations get to this stage, the system architects, software developers and operational teams start reading a lot about Kubernetes clusters, trying to decide whether to have one large cluster where everything is deployed or many smaller clusters. Philosophical discussions over the advantages of each are bound to happen, and will only get more complicated when someone brings up the principles of micro services, developer velocity, operational consistency and site reliability.
We had plenty of these conversations, and in this blog post I want to take you through our evolutionary journey and current conclusions. And in this DevOps world, I want to break this discussion down into the pros and cons from the perspective of Dev and Ops.
We started by containerizing each component and building our Kubernetes knowledge base, and wound up with a single cluster with a lot of tightly coupled components working together to take the stream of vessel position data and make coherent sense of it. We built small containers, applying the philosophy that a container should handle a single process and do it well. So, we had containers to read streams of data and place them on a semi-persistent message queue like Kafka, other containers to read these topics, process the data and add derived data to the immutable source message, until finally the various derived data streams were consumed and stored in PostGIS. This worked wonderfully.
The developers were able to build and deploy containers locally when working on parts of the streaming pipeline, spinning up only the portions they needed to verify upstream dependencies and downstream consumers. This led to isolation and decent development velocity, although not testing the entire cluster did give us some integration troubles when two developers worked on similar or adjacent components.
The Ops team found deploying all these containers into long-lived clusters for QA, Staging and Production to be a more arduous process. Build times for a single cluster were long, and they often waited hours on a single build step to complete before any of the cluster could be deployed. They also ran into noisy neighbor problems running these containers as pods on Kubernetes nodes. Many times, one of our more CPU intensive processes would end up being scheduled on the same node as our database, and Ops would watch as the database was starved of compute power. All in all, we managed to get this system deployed and stable in our production environment, but occasionally they still had to kill a streaming pod that landed on the wrong node.
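To make the "derived data added to an immutable source message" idea concrete, here is a minimal sketch of two processing stages in plain Python. It is not our actual pipeline code: the field names, the `enrich` helper and the 5 knot threshold are all hypothetical, and the Kafka plumbing is left out so the example runs on its own.

```python
from types import MappingProxyType


def enrich(message, derive):
    """Return a new message with extra derived fields; the source is never modified."""
    derived = dict(message["derived"])          # copy, so earlier stages' output is kept
    derived.update(derive(message["source"]))   # each stage only reads the source
    return {"source": message["source"], "derived": derived}


def speed_stage(source):
    # Hypothetical derivation: convert speed from knots to km/h.
    return {"speed_kmh": round(source["speed_knots"] * 1.852, 2)}


def fishing_stage(source):
    # Hypothetical rule of thumb, for illustration only.
    return {"possibly_fishing": source["speed_knots"] < 5.0}


# MappingProxyType gives a read-only view, standing in for the immutable source message.
raw = {
    "source": MappingProxyType(
        {"vessel_id": "V-1", "lat": 1.5, "lon": 103.8, "speed_knots": 3.0}
    ),
    "derived": {},
}

# Each stage would normally be its own container reading one topic and writing another.
result = enrich(enrich(raw, speed_stage), fishing_stage)
print(result["derived"])
```

Because every stage returns a new message and only reads the source, stages can be developed and deployed independently, which is what let developers spin up just the portion of the pipeline they needed.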
A New Product; A New Hope; A New Cluster
Next, we were tasked with building a new product! And of course, we applied the pains and lessons from the first round to this new product. Instead of many repos on GitHub deploying into a single large cluster, we thought we would try a single repo deploying to a single cluster, and build a lightweight micro service. Now everybody knows you want to use micro services for their advantages: agility, isolation, scalability, and availability.
Learning from our previous pains, we built our second product to be far more lightweight, and created a new Kubernetes cluster to deploy it into. But there was one hitch: this new product needed to get data from Kafka in the other cluster. So, we had a new problem to solve, cluster to cluster communication.
From the Developer Experience, this was a large improvement. No longer having to wait hours for builds, and no longer bothered by other developers committing work that kicked off our builds, we found developer velocity skyrocketed! The isolation of developing on the full stack in a single repo in a single cluster meant we had fewer issues integrating with other developers' code. And best of all, no noisy neighbor problems from the other cluster or components! However, because this new cluster needed streaming data from the other cluster, we could not stream data into our dev clusters, and had to rely upon a db snapshot to get data we could work against. And in the few cases where we needed streaming data, setting up two clusters and wiring them up to talk to each other proved onerous and time intensive.
The Operational team found this new product to be much easier to deploy and run! The deploys were simple and quick, and the isolation from the large cluster made scalability a breeze. The only difficulty the operations team had was working with the cluster to cluster communication in each environment. All in all, this was an improvement over the first product for the Ops team.
More Services! More Clusters! Micro Services Ahoy!
We followed this pattern for a while, adding more clusters to our growing list of features and backend tools. We liked this idea of simple small micro services, and got excited, building new repos that deployed to new clusters.
The Developers loved this, for the isolation allowed them to develop quickly, failing fast and fixing it even faster. We built and deployed several new micro services in record time. However, all of these new services had a lot of setup work to do: each new service needed a new repo, a new deployment directory, containerization, a container registry and lots and lots of YAML files. And the wiring up of all these micro services started to get so complicated that it was a constant conversation point, and we even had several wiki pages dedicated to who was using what IP ranges, and how to wire up what clusters to other clusters. Our complexity was growing faster and faster.
From the Ops perspective, we loved this; individual clusters were easy to deploy, configuration driven and, best of all, operated on their own nodes. Monitoring these clusters and keeping them running was fairly simple. However, the overall responsibility was growing in complexity, as we had to update architectural diagrams and wiki pages to keep track of what cluster was talking to what cluster and how. Worst of all, sometimes redeploying one cluster would require us to scale down containers in an upstream cluster.
Noisy Neighbors and Inter-Cluster Communication
Meanwhile, we were still running our first product, the tightly coupled single cluster mentioned above. It had one main problem that we spent lots of time trying to fix: the infamous noisy neighbor.
The Developers never really experienced the noisy neighbor problems in their dev clusters, or if they did, they barely noticed as they were focused on their own component. We tried tuning our database to perform better. Then we vertically scaled our nodes. Then we tried Kubernetes resource requests to make sure certain components got the resources they needed. Then we tried Kubernetes anti-affinity policies to make sure CPU intensive processes didn't get scheduled on the same node.
The Ops team was constantly monitoring for CPU or IO bound processes, and often would kill a pod to get it scheduled on another node. And finally, we found a solution that works: Kubernetes Node Pools!
We built new cluster specs and node pools isolating our intensive and critical components from the others, and finally the Ops team could breathe a sigh of relief as the alarms stopped ringing, and the manual interventions became a relic of the past.
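For readers who have not used resource requests and anti-affinity, the sketch below builds a Deployment manifest as a plain Python dict and prints it as JSON, which kubectl accepts alongside YAML. The manifest field names are the standard Kubernetes ones; the component name, image, label and request sizes are made up for illustration and are not our real configuration.

```python
import json


def cpu_heavy_deployment(name, image, cpu_request, memory_request, avoid_labels):
    """Build a Deployment that reserves resources and refuses to share a node
    with any pod carrying the labels in `avoid_labels` (for example the database)."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "affinity": {
                        "podAntiAffinity": {
                            # Hard rule: never schedule onto a node (hostname)
                            # that already runs a pod matching the selector.
                            "requiredDuringSchedulingIgnoredDuringExecution": [
                                {
                                    "labelSelector": {"matchLabels": avoid_labels},
                                    "topologyKey": "kubernetes.io/hostname",
                                }
                            ]
                        }
                    },
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            # Requests tell the scheduler how much to reserve.
                            "resources": {
                                "requests": {
                                    "cpu": cpu_request,
                                    "memory": memory_request,
                                }
                            },
                        }
                    ],
                },
            },
        },
    }


# Hypothetical values throughout.
manifest = cpu_heavy_deployment(
    name="track-processor",
    image="example.invalid/track-processor:latest",
    cpu_request="2",
    memory_request="4Gi",
    avoid_labels={"app": "postgis"},
)
print(json.dumps(manifest, indent=2))
```

Requests and anti-affinity both work at the level of individual pods, so every new CPU hungry component needs its own rules, which may be part of why this approach needed constant attention.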
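As a rough illustration of what pinning a component to a node pool can look like, the sketch below adds a node selector and a matching toleration to a pod spec. It assumes GKE, where nodes carry the `cloud.google.com/gke-nodepool` label, and it assumes the pool was created with a `dedicated=<pool name>:NoSchedule` taint; the pool name, taint key and helper function are hypothetical, not our actual cluster specs.

```python
import copy

# Label that GKE sets on every node with the name of its node pool.
GKE_NODE_POOL_LABEL = "cloud.google.com/gke-nodepool"


def pin_to_node_pool(pod_spec, pool_name, taint_key="dedicated"):
    """Return a copy of `pod_spec` that only schedules onto `pool_name`.

    The nodeSelector keeps the pod on the pool; the toleration lets it past
    the (assumed) taint that keeps every other pod off the pool.
    """
    pinned = copy.deepcopy(pod_spec)
    pinned.setdefault("nodeSelector", {})[GKE_NODE_POOL_LABEL] = pool_name
    pinned.setdefault("tolerations", []).append(
        {
            "key": taint_key,
            "operator": "Equal",
            "value": pool_name,
            "effect": "NoSchedule",
        }
    )
    return pinned


# Hypothetical database pod spec and pool name.
database_spec = {
    "containers": [{"name": "postgis", "image": "example.invalid/postgis:latest"}]
}
pinned_spec = pin_to_node_pool(database_spec, "database-pool")
print(pinned_spec["nodeSelector"])
```

The selector and the taint do different jobs: the selector keeps the database on its pool, while the taint keeps everything else off it, and it is the combination that removes the noisy neighbor.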
Cluster to cluster communication was still a pain, and we could write another long blog post about TLS and exposing multiple Kafka brokers, and about figuring out how best to handle cluster to cluster communication within the same GCP project and region. But that work helped us expose our message data plane to all the other clusters.
Now we simply have to configure the micro services that talk to Kafka with the correct IP ranges and voilà, we can communicate from one cluster to another fairly simply! This added to the growing complexity for the Ops team, as they had to keep careful documentation of which IP ranges QA, Staging and Prod were using. This growing complexity was worrisome, until we implemented Kubernetes Name Spacing!
From Many Clusters to One Cluster
Name Spacing allows us to move our micro services from separate clusters into a single cluster without a ton of rework and rewiring! The micro services are still deployed independently, and in this new one cluster name spaced world, they each get their own Node Pools! A single Kubernetes cluster can support up to 5,000 Nodes and up to 150,000 Pods, so it is clear that Kubernetes can handle all we need to throw at it.
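A small sketch of why moving into one cluster removes so much wiring: inside a single cluster, a service in another namespace is reachable through cluster DNS at `<service>.<namespace>.svc.<cluster domain>`, so no IP ranges need to be exchanged. The namespace and service names below are hypothetical, and `cluster.local` is the default cluster domain, which a given cluster may override.

```python
def namespace_manifest(name):
    """Minimal Namespace manifest, one per micro service or product."""
    return {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": name}}


def service_address(service, namespace, port, cluster_domain="cluster.local"):
    """In-cluster DNS address of a Service that lives in another namespace."""
    return f"{service}.{namespace}.svc.{cluster_domain}:{port}"


# Hypothetical layout: Kafka in a "streaming" namespace, a consumer in "alerts".
namespaces = [namespace_manifest(n) for n in ("streaming", "alerts")]
kafka_bootstrap = service_address("kafka", "streaming", 9092)
print(kafka_bootstrap)  # kafka.streaming.svc.cluster.local:9092
```

The same address works unchanged in QA, Staging and Prod as long as the namespace and service names match, which is the kind of environment-specific bookkeeping the IP range pages used to carry.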
With some clever deployment work by our internal shared services team, we have recipes that allow us to pick and choose what components we wish to deploy, so the Devs can spin up only what upstream and downstream dependencies they need to work on their service. The complicated cluster to cluster communication is gone, so the need to keep track of who is using what IP address is removed, and suddenly we are free!
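The recipes themselves belong to our shared services team and are not reproduced here, but the core idea can be sketched as a walk over a dependency graph: given the service being worked on, collect everything upstream and downstream of it and deploy only that. The component names and the `components_to_deploy` helper below are hypothetical.

```python
from collections import deque

# Hypothetical pipeline: each key produces data consumed by the listed components.
PIPELINE = {
    "ingest": ["kafka"],
    "kafka": ["enricher", "alerts"],
    "enricher": ["postgis-writer"],
    "alerts": [],
    "postgis-writer": [],
}


def _reachable(start, edges):
    """Breadth-first walk returning every node reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def components_to_deploy(service, downstream=PIPELINE):
    """The service plus its transitive upstream producers and downstream consumers."""
    # Invert the graph so producers can be found from a consumer.
    upstream = {}
    for producer, consumers in downstream.items():
        for consumer in consumers:
            upstream.setdefault(consumer, []).append(producer)
    selected = {service} | _reachable(service, upstream) | _reachable(service, downstream)
    return sorted(selected)


print(components_to_deploy("enricher"))
# ['enricher', 'ingest', 'kafka', 'postgis-writer']
```

Note that `alerts` is left out when working on `enricher`: it shares an upstream dependency but is neither upstream nor downstream of the service itself, so a dev cluster does not need it.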
From the Operational perspective, this is a large improvement, allowing us to clean up duplicate monitoring pods, and stop having to check 7+ Grafana instances to get a sense of system health. As we move into this single cluster world, the Ops team is excited to streamline their deploys and simplify the monitoring of our production clusters. They also get to ditch the pages of IP ranges for QA, Staging and Prod, and the notes on which cluster is talking to which. They are excited about the future and the new lessons we will learn, especially as we move to multi zone and multi region failover setups.
Conclusion: One Cluster or Many?
After doing the hard work of starting with one tightly coupled cluster, moving to many loosely coupled micro services, and now to one large name spaced, node pooled cluster, I can say without a doubt that it is best to have one large cluster.
Simplify your life, simplify your YAML, simplify your deploys by correctly architecting your Kubernetes clusters to allow for micro service growth in a single large cluster.
Simplify your developers' experience by letting them work in isolated dev clusters, deploying only what they need, and not wasting valuable time spinning up multiple clusters just to test a few lines of code.
Simplify your operational experience by having one cluster to manage, with one alert manager and one set of Grafana dashboards. Simplify your deployments with less YAML and less configuration duplicated across repos of micro services.
Kubernetes with Node Pool isolation and Name Spacing can greatly empower your Devs and your Ops; making both sides happy, if only for a while.