For the last two years at InfluxData I worked on our custom orchestrator that empower InfluxCloud v1 to run. I have some talk about it at InfluxDays, but they are not recorded so, I can’t really post them here, sadly.
If you are thinking: “Why you should write your own orchestrator?”, I have few answers for you.
Btw now InfluxCloud v2 leverages Kubernetes.
Writing a good orchestrator is super fun! When I started but still today a big part of it are frustrating and not so good but the one we wrote following reactive planning and control theory are lovely! This article is an introduction about Control Theory. Chris Goller Solution Architect at InfluxData was the first person that told me about how Control Theory works in theory, and he pushed me to try reactive planning for our orchestrator.
As Kubernetes contributor I recognized some of those patterns as looking at shared informers, controller and so on. So I understood since the beginning that those patterns was everywhere around me!
Colm MacCárthaigh from Amazon Web Service with his talks (like the one posted here) helped me to find resources to read, more patterns and use cases for it.
When I started to work as a Web Developer, designing APIs or websites I had different challenges to face. To write a solid CRUD you put all your effort when a request comes to your API, you validate it, apply transformation to sanitize the input and if it is valid you save it in your database. You need to build good UX, complex validations systems and so on. But what lands in the database is right and rock solid.
There are other systems where you do not have a database that tells you what is right or not. You need to measure the current state, calculate what needs to get back to your desired state and you need to apply what you calculated.
Those systems are everywhere:
Orchestrator but more in general big microservices environment do not have the concept of data locality as we used to have in the past. The data you need can change continously, and they need to collected from different sources and combined in order to calculate what needs to be done.
I think this is the main reason about why patterns coming from Control Theory works well.
If you need to write a program that provisions 3 virtual machines and attach them to a random DNS record you can approach this problems in 2 ways. You can write a procedure that:
Another way you have to fix this issue is to start from checking what you have, making a plan to matches what it is not as you desire. So it will look like this:
If you are wondering how all those checks makes the system more reliable is because you never know what you already created or what it is already where. Let’s assume you are on AWS. API requests can fails at the middle of your process and you need to know where you are. AWS itself can stop or terminate instances, or some other procedures can do it or for manual mistake.
Approaching the problem in this way allows you to repeat the flow over and over because it idempotent and at every retry the process will be able to reconcile any divergence between what you asked for (3 VMs and one DNS record) and what it is actually running. This process is called reconciliation loop.
Colm MacCárthaigh highlights three major areas around how a successful Control Theory implementation looks like:
The way you retrieve the current state of the system is crucial in order to have a low latency. They are crucial in order to calculate what needs to be done because from the current state your program get different decisions.
This section is where I have more experience with. The desired state is stored and clear usually. You know here to go. You get the measurements and with this information you need to write a procedure capable of making a plan stating from your current state to get to the desired one.
I wrote a few weeks ago an introduction about reactive planning it is the way I used to calculate a plan.
I am also preparing a PoC in Golang with actual code you can run and test to share in practice what means reactive planning.
It is the part that take a calculated plan, and it executes it. I worked a lot with schedulers that are able to take a set of steps and execute them one by one or in parallel based on needs.
Think about one of them problem you have a try to think in a more reactive way, starting from checking where you are and not from doing things. Reliability and stability for your code will improve drastically.