👉 Brought to you by:
- OpenOps - the open source FinOps & CloudOps automation platform. Automate common cloud workflows with human-in-the-loop approvals and first-class AI support.
- WorkOS - Modern APIs for auth, identity, and enterprise features like SSO, SCIM, RBAC, and more.

Durable Workflows
Synonyms: Durable Execution, Workflows-as-Code.
💡 ✅ Who is this for?
- Backend developers building multi-step async workflows
- Teams running AI agents or long-running LLM tasks that need reliability
- Anyone tired of duct-taping queues, cron jobs, and retry logic together

TL;DR:
- Problem: Most async workflows are held together with queues, retries, cron jobs, and hope. Durable workflows are the abstraction trying to replace that mess.
- Solution: Durable workflows let you write long-running, multi-step workflows as normal code. They handle crashes, retries, waits, and human approvals for you.
- In Sum: Sparked by the AI agent boom, durable workflows are becoming a core offering for building resilient, complex, long-running workflows.
How does it work? 💡
Say you're calling an LLM, then saving the result to a database.

```ts
const res = await llm.gen(prompt) // costs $, takes 30s
await db.save(res)                // crashes here
```
Server dies after the LLM call but before the save. You retry, the LLM runs again, you pay twice, and you might get a different answer. Add a queue to decouple them? Now you need retry logic, dead-letter handling, and state tracking. Three systems for two lines of code. With durable execution, you wrap each step and the runtime tracks your progress - handling crashes, retries, and even pauses that last days:

```ts
const res = await step.run("gen", () => llm.gen(prompt))
await step.run("save", () => db.save(res))
```
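Under the hood, `step.run` is essentially memoization backed by durable storage: each completed step's result is checkpointed, so a retry replays past steps from the record instead of re-executing them. A minimal sketch of that idea (an in-memory `Map` stands in for the runtime's persistent store, and the LLM call is a stub - real platforms persist checkpoints out of process):

```typescript
// Sketch of how a durable runtime skips completed steps on retry.
// Assumption: an in-memory Map stands in for durable storage.
type CheckpointStore = Map<string, unknown>;

function makeStep(store: CheckpointStore) {
  return {
    async run<T>(id: string, fn: () => Promise<T>): Promise<T> {
      // Replay path: this step already completed, reuse its recorded result.
      if (store.has(id)) return store.get(id) as T;
      const result = await fn();
      store.set(id, result); // checkpoint before advancing
      return result;
    },
  };
}

// Simulate a crash between "gen" and "save", then a retry.
export async function demo(): Promise<number> {
  const store: CheckpointStore = new Map();
  const step = makeStep(store);
  let llmCalls = 0;

  const workflow = async (crashAfterGen: boolean) => {
    const res = await step.run("gen", async () => {
      llmCalls++; // the expensive, non-idempotent LLM call
      return "answer";
    });
    if (crashAfterGen) throw new Error("server died");
    await step.run("save", async () => res /* db.save(res) */);
  };

  await workflow(true).catch(() => {}); // first attempt crashes after step 1
  await workflow(false);                // retry: "gen" is skipped, "save" runs
  return llmCalls;                      // 1 - the LLM was only billed once
}
```

The design point is that the step id ("gen", "save") is the checkpoint key, which is why steps need stable names.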
If the server crashes after step 1, it skips the LLM call entirely on retry and picks up at step 2. Scale this to ten steps, add retries, timeouts, human approvals that pause for days - the durable service handles all of it. This matters most where silent failure is unacceptable: AI agents running long jobs overnight, payment flows where failures cost money, and multi-step onboarding where losing a user mid-process is a serious issue.

Questions ❔
- Queues vs workflows? Queues are a low-level primitive. You still need retry logic, state storage, and concurrency control. Durable workflows bundle everything into a single abstraction that tracks the full execution path.
- Worth the complexity? Only if your workflows must complete. For fire-and-forget tasks where occasional failure is acceptable, you might not need this. For payments, long-running tasks, or AI agent orchestration? It makes a lot of sense.
Why? 🤔
- Kill boilerplate: No more building queues, dead-letter handling, retry logic, state storage, or cron jobs. The runtime handles it all. You just write async functions.
- Create code that survives crashes: Execution resumes exactly where it left off. Netflix said that applying the workflow abstraction to their home-grown system let them "simplify it quite a lot".
- Long-running workflows become really easy to spin up: Send an email sequence over 90 days? Just await step.sleep("30-days", "30 days") between sends. Without durability, you'd need cron jobs, a database to track progress, and complex scheduling.
- Built-in observability: Most platforms provide visual timelines, step-by-step debugging, event history, and one-click replay for failed runs. Debugging distributed systems is hard; these tools have it out of the box.
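The long-running bullet above hinges on durable sleep: a sleep is just a checkpoint with a wake-up time, so nothing runs (and nothing is billed, on some platforms) while waiting. A hedged sketch of the mechanism - the `sleep(id, ms)` signature is modeled loosely on Inngest-style APIs, with milliseconds instead of days so the demo finishes instantly:

```typescript
// Sketch of a durable sleep: the runtime records the deadline, suspends the
// workflow, and on a later re-invocation past the deadline treats the sleep
// as complete. Assumption: in-memory Map instead of persistent storage.
type TimerStore = Map<string, number>; // step id -> wake-up timestamp

function makeSleep(timers: TimerStore) {
  return async (id: string, ms: number): Promise<"slept" | "suspended"> => {
    const now = Date.now();
    if (!timers.has(id)) {
      timers.set(id, now + ms); // checkpoint the deadline, then suspend
      return "suspended";
    }
    // Re-invoked later by the runtime's scheduler: is the deadline past?
    return now >= (timers.get(id) as number) ? "slept" : "suspended";
  };
}

export async function dripDemo(): Promise<string> {
  const timers: TimerStore = new Map();
  const sleep = makeSleep(timers);

  const first = await sleep("30-days", 0); // 0ms so the demo runs instantly
  if (first === "suspended") {
    // ...the process can exit here; no worker sits idle while waiting...
  }
  const second = await sleep("30-days", 0); // scheduler wakes the workflow
  return second; // "slept" - execution resumes past the sleep
}
```

This is why a 90-day email sequence needs no cron jobs or progress tables: each sleep deadline lives in the runtime's store, not in your infrastructure.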
Why not? 🙅
- Latency overhead: Every step involves persisting to durable storage. Not suitable for sub-10ms requirements. Restate claims 15ms median latency (best in class), but there's always overhead.
- You might not need it: If your async jobs are quick and occasional failure is fine, durable execution is overkill. A Redis queue with basic retry logic handles 80% of use cases. Be honest about whether you're solving a real problem or adding complexity because it's trendy.
- An ongoing race with no clear winner: The market is maturing fast but still noisy. Temporal's $5B valuation and AWS Lambda durable functions signal the pattern is real, but there are now 10+ competing tools with different trade-offs. Nobody wants to pick the one that ends up obsolete.
- Replay-based tools have extra gotchas: Temporal and similar replay-based systems require deterministic workflow code (no Math.random(), no Date.now()), and changing step order while workflows are running breaks checkpoints. Step-based tools like Inngest, Trigger.dev, and Cloudflare Workflows checkpoint after each step instead of replaying, so these constraints are much lighter. Worth understanding before you pick a tool.
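The determinism gotcha is easiest to see with a replay in hand: anything non-deterministic must be captured inside a step, so its value is recorded once on first execution and reused on every replay. A sketch using the same memoization idea (`Math.random` stands in for any non-deterministic call; the in-memory history map is an assumption, real engines persist it):

```typescript
// Why replay-based engines ban raw Math.random()/Date.now() in workflow code:
// on replay the value would differ from the first execution and diverge from
// the recorded history. Wrapping it in a step records it once and replays it.
const history = new Map<string, unknown>();

async function step<T>(id: string, fn: () => T): Promise<T> {
  if (history.has(id)) return history.get(id) as T; // replay: reuse recording
  const value = fn();
  history.set(id, value);
  return value;
}

export async function replayDemo(): Promise<boolean> {
  // First execution records the random value...
  const first = await step("make-id", () => Math.random());
  // ...a simulated replay (e.g. after a crash) reuses it instead of re-rolling.
  const replayed = await step("make-id", () => Math.random());
  return first === replayed; // true: the history keeps replay deterministic
}
```

Step-based tools sidestep this because they resume from the last checkpoint rather than re-running the whole function, so only the code inside an individual step needs to be retry-safe.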
Tools 🛠️
- Temporal - The OG. Enterprise-grade, multi-language SDKs, MIT licensed. Ships integrations with OpenAI Agents SDK and Vercel AI SDK.
- Trigger.dev - TypeScript-first, fully open source (Apache 2.0). Positioning around AI agent workflows.
- Inngest - Event-driven durable functions. Works on Lambda, Cloudflare Workers, or servers.
- Cloudflare Workflows - Edge-native durable execution on Workers + Durable Objects. Pay for CPU time, not wait time. GA since April 2025.
- Restate - Low-latency, single binary, Rust-based. Built by Apache Flink creators.
- Vercel Workflow SDK - Uses "use workflow" and "use step" directives. Works with Next.js, SvelteKit, Astro.
- Absurd - Durable execution with just Postgres. A single SQL file + thin SDK. Experimental, not production-ready yet.
- More: AWS Lambda Durable Functions (first-party durable execution), Azure Durable Functions, Hatchet (v1 rewrite, 10k tasks/sec, MIT), DBOS (Postgres-backed, OpenAI Agents SDK integration), Golem (WebAssembly-based), Prefect (Python data pipelines, transactional tasks), Resonate (single-binary on Postgres).
🤠 My opinion: I'm actively using the Vercel Workflow SDK - it fits my Vercel ecosystem and is good enough, though observability could improve. To be honest, I don't use all of the durable aspects; I mostly use it as a serverless replacement for long-running tasks, but it's good to know I have the option. I've heard great things about Trigger.dev too. Beyond that, choose by language support, feature set, and pricing. Whatever you pick, start with their managed cloud; self-hosting is something you can always punish yourself with later ;)

Forecast 🧞
- AI agents made this urgent overnight: OpenAI, Vercel, Dapr, and AWS all shipped durable execution integrations in early 2026. 20% of Temporal Cloud actions come from AI-native companies. Durability is becoming table stakes for production-grade agents.
- Postgres is eating the workflow layer: Supabase Queues, Absurd, Resonate, and DBOS all bet on Postgres as the only infrastructure you need. Absurd is literally a single SQL file - absurdctl init and you have durable execution. If your app already runs Postgres, why deploy a separate orchestration cluster?
- $650M+ in VC says consolidation is coming: Temporal alone has raised $650M, including a $300M Series D at a $5B valuation in early 2026. Multiple TypeScript-first tools are competing for the same developers, and not all will survive. Expect acquisitions or sharp pivots within 18 months.
- This segment is more durable (pun intended) than most devtools: AI is eating devtools. But durable execution is the rare category where AI is a tailwind, not a headwind. AI agents create demand for crash recovery; you can't AI-away the need for reliability infrastructure. This is plumbing, not UI, and AI disrupts the UI layer first. Infra seems to be one of the things we will be using more of, not less of.
- On-prem AI may pull workflows back from the cloud: Most durable execution tools are cloud-managed. But enterprises are moving LLM inference on-prem for cost and compliance, and the workflows orchestrating those models will follow. Running Temporal Cloud to coordinate a local GPU cluster feels backwards. Self-hostable tools like Restate and Absurd may get a second wind, not from developer preference, but from "data gravity".
Who uses it? 🎡
- Snap - Powers microservice orchestration for 453M daily users
- GitBook - Cut sync times from hours to minutes using Inngest for bi-directional sync
- Repl.it - Running AI agents reliably at scale
- Papermark - Real-time PDF conversion at scale using Trigger.dev
Thanks 🙏
I wanted to thank @TomGranot, who edits every issue and recently started a YouTube channel exploring up-and-coming tools here. I am not saying he got the idea from me, but he did not not get the idea from me. :)

EOF
I've been working on something that would solve some of my issues with OpenClaw: security, no way to feel confident it wouldn't blow up anything connected to it, incremental improvement, easy integrations with the outside world without API keys, and more. I'll post about it on my LinkedIn soon - follow in case this sounds cool. Any questions, feedback, or suggestions are welcome 🙏 Simply reply to this e-mail or tweet at me @agammore.