Reactive planning is a cloud native pattern

28 Nov 2018 - Tags: software design, cost, cloud, scale, design pattern, reactive planning, cloud native

Probably this title can sound a bit weird to anyone that already know what reactive plan is and how far it can look from all the cloud-native and distributed system hipster movement but recently one of my colleagues Chris Goller pushed this pattern to one of the projects that we have at InfluxData and I find it glorious!

“In artificial intelligence, reactive planning denotes a group of techniques for action selection by autonomous agents. These techniques differ from classical planning in two aspects. First, they operate in a timely fashion and hence can cope with highly dynamic and unpredictable environments. Second, they compute just one next action in every instant, based on the current context.” (Wikipedia)

The Wikipedia definition of reactive planning as you can see is perfect to handle a system where the current status can change very frequently based on external and unpredictable events.

This is a perfect approach for provisioning/orchestrator tool like Mesos, Cloud Formation, Kubernetes, Swarm, Terraform. Some of them are working like this already.

The general idea is that before any action you need a plan because for these tools an action means: cloud interaction, spin up of resources that cost money. You need to be proactive avoiding useless execution.

A plan is made of a serious of steps and every step can return other steps if it needs. The plan is complete when there are no steps anymore. The plan gets executed at least twice, the second time it should return zero steps because the first attempts built everything you need, this is the signal that determines its conclusion. If it keeps returning steps it means that there is something to do and it tries again.

Let’s start with an example. Think about what Cloud Formation does. You can declare a set of resources and before to take action it needs to understand what to do. It is making a plan checking the current state of the system. This first part makes the flow idempotent and solid because you always start from the current state of the system. It doesn’t matter if it changes over time because of somebody that removed one of the resources. If something doesn’t exist it creates or modify it. Very solid.

Every single step is very small. Let’s take another example like creating a pod in Kubernetes. When you create a pod there are a lot of actions to do:

  • Validation
  • Generate the pod id, the pod name
  • Register the pod to the DNS, you
  • Store it to etcd Reach to CNI to configure the network
  • Reach to docker, container or whatever you use to get the container
  • Maybe reach to AWS to create a persistent volume
  • Attach the PV

If you try to design all interaction in a single “controller” you will end up with a lot of * if/else, error handling and so on. Mainly because as you can see almost every step interact over the network with something: database, DNS, CNI, docker and so on. So it can fail, it needs circuit breaking, retry policy and much more complexity.

It is a lot better to design the code where every point is a small step if the step that reaches docker fails it can return itself as “retry” or it can return other steps to abort everything and clean up. You will end up with small reusable (or not that much reusable) steps.

All the steps are combined within a plan, the “PodCreation” plan. There is a scheduler that takes and execute every step in the plan recursively.

This freedom allows you to use an incremental approach

The scheduler as first call a create method for the plan, the create method checks what to do based on the current state of the system, it is the responsibility of this function to return no steps when there is nothing to do.

I think this Reactive Planning is one of the best ways to organize the code in a cloud-native ecosystem for its reactive nature as I said and for the fact that it forces you to check the state of the system if you don’t do that the plan will keep executing forever. Obviously, you can use a high-level check to skip a lot of steps, this requires a balance if the plan you are executing is critical and frequently used you should check for every step if it requires an effort that won’t pay back you can implement deepest and preciser checks. You can check for the PodStatus. If it is running we are good nothing to do. Or you can check if Docker has a container running and if it has the right network configuration. If it is running but with no network, you can return the step that interacts with CNI to set the right interface. This freedom allows you to use an incremental approach, you start with an easy creation method with checks for only critical and high-level signal demanding a more solid and sophisticated set of checks for later, when you will have the best knowledge about where the system fails.

Hero image via Pixabay