Durable Executions and the Future of Workflows
From BPMN to choreography to durable executions: why workflow changes went from O(N) to O(1)
I was invited to join Temporal’s Constellation program last week. This is a good excuse to write down what I’ve learned about workflow orchestration the hard way.
The Fundamental Problem
Building distributed systems means writing failure handling code. A lot of it. Retries, timeouts, idempotency checks, state recovery, duplicate detection, partial failure rollback. At some point you realize that a major part of your codebase exists only to handle things going wrong. The happy path is 20% of the logic. The other 80% is dealing with “what if this call fails halfway through?”
Traditional workflow engines tried to solve this with explicit state machines. You draw a diagram, define states and transitions, and the engine tracks where you are and resumes from there. Years ago, while building a fintech product, we used Camunda BPMN. It worked, but the operational model was painful—JVM-only, hard to test locally, hard to decompose into services. The technology has improved since then, but the fundamental model of explicit state machines defined separately from your application code creates friction. When the workflow definition and the application code diverge, you have two systems to keep in sync, and that’s one system too many.
Choreography: The Other Extreme
We pivoted to event choreography. No central workflow at all. Services publish events, other services react. This solved the “two systems” problem by eliminating the workflow engine entirely. Now the workflow was just emergent behavior from event handlers scattered across services.
The new problem was that workflow changes became O(N). Adding a step to the application flow meant touching every service in the chain. You’d find the service emitting the “before” event, update it to emit something new, create or update the service handling the new step, have it emit the “after” event, then update the downstream listeners to consume the new event instead of the old one. Five services, five deployments, five things that could fail independently.
In a regulated industry where product requirements shift constantly, this was brutal. Every small workflow tweak became a multi-sprint project because of coordination overhead. Choreography traded the workflow engine for coordination tax, and you paid it on every change.
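A toy event bus makes the O(N) problem concrete. The service names and events below are hypothetical, but the shape is the one described above: each handler only knows the event it consumes and the event it emits, so inserting a step means editing both of its neighbors.

```python
# Minimal in-process event bus standing in for a message broker.
handlers = {}  # event name -> handler function
log = []       # every event that flowed through the system

def on(event):
    def register(fn):
        handlers[event] = fn
        return fn
    return register

def emit(event, payload):
    log.append(event)
    if event in handlers:
        handlers[event](payload)

@on("order.placed")          # lives in the payments service
def charge(payload):
    emit("payment.captured", payload)

@on("payment.captured")      # lives in the fulfillment service
def ship(payload):
    emit("order.shipped", payload)

emit("order.placed", {"id": 1})
# log == ["order.placed", "payment.captured", "order.shipped"]
```

To insert a fraud check between payment and shipping, you would change the payments service to emit a new event, add a fraud service, and re-point the fulfillment service at the fraud service's output event: three services touched, three deployments, for one new step.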
The Insight: Durable Async/Await
I discovered Temporal in 2021 while working in biotech, looking for a way to track sample lifecycles in a lab. A blood sample arrives, gets processed through multiple stages, waits for batching, maybe needs re-testing, and eventually produces results. This process takes days or weeks, involves multiple systems and humans, and must be traceable for compliance. A blood sample is a durable execution—its lifecycle is a workflow that needs to survive failures, restarts, and deployments without drowning in failure handling code.
The insight behind Temporal, Restate, and similar systems is durable async/await: what if your function calls could survive process crashes? What if a promise, once started, would eventually complete even if the machine running it caught fire?
```python
from datetime import timedelta
from temporalio import workflow

@workflow.defn
class SampleProcessing:
    @workflow.run
    async def run(self, sample_id: str) -> ProcessingResult:
        timeout = timedelta(minutes=5)  # activities need an explicit timeout
        await workflow.execute_activity(
            receive_sample, sample_id, start_to_close_timeout=timeout)
        analysis = await workflow.execute_activity(
            run_analysis, sample_id, start_to_close_timeout=timeout)
        if analysis.needs_retest:
            analysis = await workflow.execute_activity(
                run_analysis, sample_id, start_to_close_timeout=timeout)
        await workflow.execute_activity(
            generate_report, args=[sample_id, analysis],
            start_to_close_timeout=timeout)
        return ProcessingResult(sample_id, analysis)
```
This looks like normal async code, but each await is a durable checkpoint. If the process dies after receive_sample completes, the workflow resumes from that point on restart. The event log is the ground truth, and the running process is just transient state derived from replaying that log. The happy path is the fault-tolerant path—you write the logic you’d write if nothing ever failed, and the runtime handles the rest.
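The replay mechanism can be sketched in a few lines of plain Python. This is a toy model of the idea, not how Temporal actually implements it: completed steps are appended to an event log, and on restart, steps whose results are already in the log are replayed from it rather than re-executed.

```python
class Crash(Exception):
    pass

def durable_run(steps, event_log, crash_after=None):
    """Run (name, fn) steps in order, resuming from event_log on restart."""
    results = []
    for i, (name, fn) in enumerate(steps):
        if i < len(event_log):
            # Step already completed in a previous run: replay its recorded
            # result instead of executing the side effect again.
            results.append(event_log[i][1])
            continue
        if crash_after is not None and i >= crash_after:
            raise Crash(f"process died before {name}")
        result = fn()                     # execute, then durably record
        event_log.append((name, result))
        results.append(result)
    return results

executed = []  # tracks real side effects, to show each runs exactly once
steps = [
    ("receive_sample",  lambda: executed.append("receive") or "received"),
    ("run_analysis",    lambda: executed.append("analyze") or "ok"),
    ("generate_report", lambda: executed.append("report") or "report.pdf"),
]

event_log = []
try:
    durable_run(steps, event_log, crash_after=2)  # crash before the report
except Crash:
    pass

results = durable_run(steps, event_log)  # "restart": resumes from the log
# executed == ["receive", "analyze", "report"] — the first two steps ran
# once, survived the crash via the log, and only the report ran on restart
```

The running process is disposable precisely because everything it knows can be reconstructed by replaying the log, which is what lets the real systems survive crashes, restarts, and redeployments mid-workflow.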
O(1) Workflow Changes
Back to that fintech choreography problem. With Temporal, the workflow is code in your repo—same version control, same deployment, same CI as everything else. Want to add a step? Add a line, commit, and deploy. New executions use the new code while old executions continue on the old path, thanks to workflow versioning. No N-service coordination, no multi-week project for a simple change. O(N) became O(1).
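The versioning trick that keeps old executions on the old path can also be sketched as a toy. Temporal's Python SDK exposes this idea as `workflow.patched()`; the mechanics below are a simplification of it: new executions record a patch marker in their history, while replays of histories without the marker take the old branch.

```python
def patched(patch_id, history, replaying):
    """Toy version of a patch check against a workflow's event history."""
    if replaying:
        return patch_id in history  # old histories lack the marker
    history.add(patch_id)           # new executions durably record it
    return True

def run_workflow(history, replaying):
    steps = ["receive_sample", "run_analysis"]
    if patched("add-qc-step", history, replaying):
        steps.append("qc_review")   # the newly deployed step
    steps.append("generate_report")
    return steps

# An execution started before the change replays without the marker...
old_path = run_workflow(set(), replaying=True)
# ...while a fresh execution records the marker and takes the new path.
new_path = run_workflow(set(), replaying=False)
```

One deployment, both behaviors: in-flight workflows finish on the code they started with, and everything new picks up the extra step.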
This matters most in domains where workflows change frequently. In finance, regulatory changes hit constantly. In healthcare, protocols get updated. In logistics, processes get optimized based on operational data. The cost of change determines how fast you can adapt, and if every change requires coordinating deployments across a dozen services, you’re not adapting—you’re drowning in ceremony.
The Ecosystem
I maintain Awesome Durable Executions to track what’s happening in this space, and it’s growing fast. Temporal and Cadence (Uber’s original implementation) are the established players. Restate implements durable async/await as an SDK that lets you deploy anywhere. Orkes Conductor (originally open-sourced by Netflix), DBOS (durable execution on Postgres), Inngest, Resonate, Cloudflare Workflows, and Dapr Workflow all represent different approaches to the same core insight: make the happy path durable, and failure handling becomes the runtime’s problem instead of yours.
The convergence is real. Different implementations, different trade-offs, but the same fundamental recognition that workflow orchestration needs to live in your codebase and evolve at the same speed as your business logic.
If you’re building systems where workflows change frequently, or where most of your code is failure handling, durable executions aren’t optional. They’re how you stop paying the complexity tax on every feature.