What's on Our Side
Durable execution for distributed systems
The infrastructure we don't think about
Worker configuration & writing workflows
@Miquel is on vacation. LLMs allowed.
01
Durable execution for distributed systems
Durable execution platform
Temporal lets you write code that survives failures. Your workflows run to completion even if processes crash, networks fail, or servers restart.
🔄
Failed activities are retried without manual intervention
💾
Workflow state is persisted and survives crashes
👁
Every workflow execution is observable and queryable
We use it for: Durable execution of conversations and enrichments.
The core architecture
Client
(starts workflows)
gRPC
Temporal Server
(orchestration + persistence)
Task Queues
Workers
(execute code)
Workflows: deterministic logic (decisions, branching, orchestration)
Activities: side effects (API calls, DB queries, I/O operations)
Signals, Updates & Queries
Signals: fire-and-forget
Send data to a running workflow without waiting for a response. The workflow decides when and how to process it.
Our example: Call Manager
A single workflow acts as a queue — receives signals to dispatch calls to leads.
Updates: send and get a result
Send data to a workflow and wait for it to process and return a result. Like a synchronous RPC into a running workflow.
Our example: Send Message from Inbox
User sends a message manually — update returns the result of the send operation.
Queries: read-only, no mutations
Read the current state of a workflow without changing it. Useful for inspecting workflow state from outside.
We don't use them
We prefer writing state to the database so it's queryable outside of Temporal.
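The difference between the three message types can be sketched with a toy in-memory model. This is not Temporal SDK code: CallManagerModel and its method names are hypothetical stand-ins for handlers that a real workflow would register through the SDK.

```typescript
// Toy in-memory model of the three message types (hypothetical names throughout).
type QueuedLead = { id: string };

class CallManagerModel {
  private pending: QueuedLead[] = [];

  // Signal: fire-and-forget. The caller gets nothing back; the workflow
  // decides later when to process the queued lead.
  signalEnqueue(lead: QueuedLead): void {
    this.pending.push(lead);
  }

  // Update: the caller waits until the request has been processed
  // and receives the result.
  updateDispatchNext(): { dispatched: string | null } {
    const next = this.pending.shift();
    return { dispatched: next === undefined ? null : next.id };
  }

  // Query: read-only view of the current state, no mutation.
  queryPendingCount(): number {
    return this.pending.length;
  }
}

const callManager = new CallManagerModel();
callManager.signalEnqueue({ id: 'lead-1' });
callManager.signalEnqueue({ id: 'lead-2' });
const pendingBefore = callManager.queryPendingCount();
const dispatchResult = callManager.updateDispatchNext();
const pendingAfter = callManager.queryPendingCount();
```

Only the update hands a result back to the caller; the signal returns nothing and the query leaves the state untouched.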
02
The infrastructure we don't think about
So we can focus on business logic
📚
Every workflow event durably stored, enabling replay and recovery
📋
Routing workflow and activity tasks to the right workers
📈
Query and search across all workflow executions via the UI and APIs
🕑
Sleep, cron schedules, and delayed execution without external schedulers
🔒
Namespace isolation, mTLS certificates, identity management
💨
Server-side scaling, replication, and high availability
Bottom line: We don't manage databases, queues, or server infrastructure. Temporal Cloud is the control plane; our workers are the data plane.
03
Worker configuration & writing workflows
Tuning workers for Node.js
How workers pick up work
Pollers: fetch tasks from the server. Pollers long-poll the Temporal Server for new tasks. More pollers = faster task pickup, but each holds a connection.
Slots: execute tasks concurrently. Slots are the concurrent execution capacity. A task occupies a slot while running. More slots = more throughput, but more resource usage.
Default config doesn't work for us
Temporal defaults are tuned for Go/Java where high concurrency reduces I/O wait. Node.js has a single-threaded event loop — too many concurrent workflow tasks cause contention and tasks exceed the 10s start-to-close timeout.
Workflow slots: 5–10 (fixed)
Activity slots: resource-based (CPU 1.0, memory 0.6)
Workflow thread pool: 2 threads
Sticky queues: Workers cache workflows locally. When the same worker gets the next task for a workflow, it skips full replay — reducing latency and CPU. We tune the nonStickyToStickyPollRatio to favor sticky execution.
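A rough sketch of how the ratio divides workflow task pollers between the shared (non-sticky) queue and a worker's sticky queue. It assumes the ratio is the share of pollers given to the non-sticky queue and that each queue keeps at least one poller; the rounding and the poller count of 10 are illustrative assumptions, not values from our config or the SDK source.

```typescript
// Assumed semantics: nonSticky is about maxPollers * ratio, sticky gets the rest,
// and both queues keep at least one poller. Rounding is an illustrative choice.
function splitPollers(
  maxPollers: number,
  nonStickyToStickyPollRatio: number,
): { nonSticky: number; sticky: number } {
  if (maxPollers < 2) {
    throw new Error('need at least 2 pollers to serve both queues');
  }
  const rounded = Math.round(maxPollers * nonStickyToStickyPollRatio);
  const nonSticky = Math.min(maxPollers - 1, Math.max(1, rounded));
  return { nonSticky, sticky: maxPollers - nonSticky };
}

// Ratio 0.5 with a hypothetical 10 pollers: an even split.
const evenSplit = splitPollers(10, 0.5);
// Ratio 2/6 with the same 10 pollers: fewer pollers on the shared queue.
const stickyHeavySplit = splitPollers(10, 2 / 6);
```

Under these assumptions a smaller ratio leaves more pollers on the sticky queue, which is the side that benefits from the workflow cache.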
Lessons from production
Each cached workflow runs in its own V8 isolate. The workflow bundle is loaded into every isolate — so bundle size directly multiplies with cache size.
                 Before fix   After fix
Bundle size      12 MB        2.9 MB
Base RSS         1,200 MB     500 MB
Per cached WF    ~12 MB       ~1 MB
92 cached WFs    2,950 MB     593 MB
Root cause: An unused gpt-tokenizer package (~12MB) was pulled in via barrel exports. Fix: "sideEffects": false in package.json to enable tree-shaking.
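The numbers above can be sanity-checked with a simple linear estimate: base RSS plus one per-workflow footprint for every cached workflow. The function name and the linear model are assumptions for illustration; real memory use also depends on per-workflow state and V8 overhead.

```typescript
// Linear estimate of worker memory in MB: base RSS plus one per-workflow
// footprint for each cached workflow.
function estimateRssMb(
  baseRssMb: number,
  perCachedWorkflowMb: number,
  cachedWorkflows: number,
): number {
  return baseRssMb + perCachedWorkflowMb * cachedWorkflows;
}

// Inputs are the measured figures quoted above.
const afterFixEstimateMb = estimateRssMb(500, 1, 92); // 592
const beforeFixEstimateMb = estimateRssMb(1200, 12, 92); // 2304
```

The after-fix estimate (592 MB) lands close to the measured 593 MB. The before-fix estimate (2,304 MB) is below the measured 2,950 MB, so the ~12 MB per cached workflow is best read as a lower bound.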
Our actual configuration: enrichment workers
tuner: {
  workflowTaskSlotSupplier: {
    type: 'fixed-size',
    numSlots: 8,
  },
  activityTaskSlotSupplier: {
    type: 'resource-based',
    rampThrottle: '100 ms',
    tunerOptions, // CPU 1.0, memory 0.6
  },
  localActivityTaskSlotSupplier: {
    type: 'resource-based',
    tunerOptions,
  },
},
nonStickyToStickyPollRatio: 0.5,
maxCachedWorkflows: 300,
Fixed-size to avoid contention on the event loop. Enrichments are short-lived workflows.
Scales with available CPU/memory. 100ms ramp throttle prevents sudden spikes.
Half of polls target sticky queues. 300 cached workflows to reduce replays.
Our actual configuration: conversation workers
workflowThreadPoolSize: 2,
tuner: {
  workflowTaskSlotSupplier: {
    type: 'fixed-size',
    numSlots: 10, // 5 per thread
  },
  activityTaskSlotSupplier: {
    type: 'fixed-size',
    numSlots: 10,
  },
  localActivityTaskSlotSupplier: {
    type: 'resource-based',
    tunerOptions,
  },
},
maxCachedWorkflows: 700,
nonStickyToStickyPollRatio: 2 / 6,
5 slots per thread. Conversations are long-lived workflows that need more parallelism.
Fixed rather than resource-based — conversation activities (AI calls, API) are I/O-bound, not CPU-bound.
Higher cache because conversations are long-lived. Lower nonStickyToStickyPollRatio (2/6 instead of 0.5) because we have many workers sharing the load.
Where code runs and how to think about it
Two very different execution environments
Activity Execution
Full Node.js environment
Workflow Execution
Sandboxed & deterministic
Why sandboxed? Workflows must be deterministic so they can be replayed to reconstruct state after a crash. Side effects must go through activities.
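Replay can be illustrated with a toy model. This is not how the Temporal SDK is implemented: ToyWorkflowContext, rollDice, and toyWorkflow are hypothetical names, and the model is synchronous for brevity. It shows why a non-deterministic side effect is safe inside an activity: its result is recorded once and reused on replay.

```typescript
// Toy replay model. Activity results are recorded as events; on replay the
// recorded result is returned instead of re-running the side effect.
type RecordedEvent = { activity: string; result: number };

class ToyWorkflowContext {
  private cursor = 0;
  constructor(public readonly events: RecordedEvent[]) {}

  activity(activityName: string, sideEffect: () => number): number {
    const recorded = this.events[this.cursor];
    this.cursor += 1;
    if (recorded !== undefined) {
      return recorded.result; // replay: reuse the recorded result
    }
    const result = sideEffect(); // first execution: run and record
    this.events.push({ activity: activityName, result });
    return result;
  }
}

let sideEffectRuns = 0;
function rollDice(): number {
  sideEffectRuns += 1;
  return 1 + Math.floor(Math.random() * 6); // non-deterministic on purpose
}

// Workflow logic only branches on values that came from the recorded events.
function toyWorkflow(ctx: ToyWorkflowContext): string {
  const roll = ctx.activity('rollDice', rollDice);
  return roll > 3 ? 'high' : 'low';
}

const firstRun = new ToyWorkflowContext([]);
const firstDecision = toyWorkflow(firstRun);

// Simulate a crash: a fresh context replays the persisted events.
const replayRun = new ToyWorkflowContext(firstRun.events.slice());
const replayedDecision = toyWorkflow(replayRun);
```

The replayed run reaches the same decision without calling rollDice again. Calling Math.random() directly in toyWorkflow would break that guarantee.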
The sandbox and safe deployments
Workflow code is bundled into a single JS file and loaded into a V8 sandbox that restricts what you can use — no fs, no fetch, no Date.now(), no Math.random(). This ensures workflows are deterministic and can be safely replayed.
After a deploy, running workflows replay their existing history against the new code. If the new code schedules different steps than the history recorded, replay breaks. Temporal's patching API lets old and new workflows coexist:
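import { patched, deprecatePatch } from '@temporalio/workflow';

// Step 1: deploy both branches behind the patch
if (patched('my-change')) {
  // new behavior
} else {
  // old behavior (for replaying existing workflows)
}

// Step 2: once no workflow on the old branch is still running
deprecatePatch('my-change');
// Now only the new behavior runs.
// Clean up the branch later.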
Watch out: Anything you import in your workflow file ends up in the bundle. Keep workflow files lean — barrel exports can silently pull in large dependencies (see OOM slide).
Every activity = network round trips
Workflow
schedules activity
Server
persists + queues
Activity
executes
Server
persists result
Workflow
continues
2 network hops to/from Temporal Server + 2 persistence writes. Even a trivial activity adds latency overhead.
Logic between activity calls (branching, loops, transforms) runs locally in the V8 sandbox with no network cost.
Note: Temporal supports eager activity dispatch — the server can send an activity task directly in the workflow task response, skipping one round trip. So in practice, it's not as bad as the diagram suggests.
What belongs together in one activity
An activity is a unit of failure and retry. Group things that should fail or succeed together.
fetchLinkedInProfile()
fetchLinkedInExperience()
fetchLinkedInEducation()
3x latency overhead. These are all fetches from the same source — if one fails, you'd retry them all anyway.
fetchLinkedInData()
Fetches profile, experience, and education in one go
1x latency overhead. Retry is meaningful: all fetches from the same source share a failure boundary.
Rule of thumb: An activity should encapsulate things that you'd want to retry as a unit. If step B only makes sense after step A succeeds, they belong in the same activity.
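A minimal sketch of the combined activity, assuming a hypothetical LinkedInClient interface. The method names mirror the three separate activities, but the types and the client itself are invented for illustration; a real activity would obtain its client from module scope or injected dependencies.

```typescript
// Hypothetical client and result types, for illustration only.
interface LinkedInClient {
  fetchProfile(leadId: string): Promise<string>;
  fetchExperience(leadId: string): Promise<string[]>;
  fetchEducation(leadId: string): Promise<string[]>;
}

interface LinkedInData {
  profile: string;
  experience: string[];
  education: string[];
}

// One activity, one failure boundary: if any fetch rejects, the whole call
// rejects, and a retry re-runs all three fetches together.
async function fetchLinkedInData(
  client: LinkedInClient,
  leadId: string,
): Promise<LinkedInData> {
  const [profile, experience, education] = await Promise.all([
    client.fetchProfile(leadId),
    client.fetchExperience(leadId),
    client.fetchEducation(leadId),
  ]);
  return { profile, experience, education };
}
```

The workflow pays the scheduling and persistence overhead once instead of three times, and the three fetches still run concurrently inside the activity.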
04
@Miquel is on vacation. Use your favorite LLM. ~10 min
🤖
Claude, ChatGPT, Gemini — whatever you prefer. Good prompting is part of the challenge.
⏰
Read the scenario, diagnose the problem, propose a fix. Then we discuss.
🏆
These all happened to us in production. Context matters — generic answers won't cut it.
Non-determinism errors after deploy
It's Monday morning. A deploy went out on Friday that added a new enrichment step to the conversation workflow. Over the weekend, hundreds of running conversation workflows started throwing non-determinism errors. New conversations work fine. The diff:
  // conversation workflow
  const lead = await enrichLead(leadId);
+ const company = await enrichCompany(lead.companyId);
  const campaign = await getCampaign(campaignId);
  await startConversation(lead, campaign);
NonDeterminismError
Old workflows already have getCampaign as their 2nd event in history. The new code expects enrichCompany as the 2nd event. On replay, the history doesn't match the code → NonDeterminismError.
const lead = await enrichLead(leadId);
if (patched('add-company-enrichment')) {
  const company = await enrichCompany(lead.companyId);
}
const campaign = await getCampaign(campaignId);
await startConversation(lead, campaign);
But wait: This fixes old workflows, but breaks new ones. Workflows that started after the original deploy already have enrichCompany in their history — but without a patch marker. When they replay with this code, patched() sees no marker and takes the old branch → non-determinism again.
A single deploy can't fix both. You'd need to either terminate and restart the affected workflows, or do a two-phase rollout. Prevention: always use patched() before any workflow runs the new code path.
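The three cases can be reproduced with a toy replayer. This is a simplified model, not the Temporal SDK: ToyReplayer, its synchronous methods, and the event shapes are hypothetical, and the real SDK tracks patch markers differently. The model only checks whether each step the code takes matches the next recorded event.

```typescript
// Toy model of replay with patch markers (hypothetical, synchronous).
type HistoryEvent =
  | { kind: 'activity'; name: string }
  | { kind: 'patch'; id: string };

function nonDeterminismError(message: string): Error {
  const err = new Error(message);
  err.name = 'NonDeterminismError';
  return err;
}

class ToyReplayer {
  private cursor = 0;
  constructor(private readonly events: HistoryEvent[]) {}

  private isReplaying(): boolean {
    return this.cursor < this.events.length;
  }

  patched(id: string): boolean {
    if (!this.isReplaying()) {
      this.events.push({ kind: 'patch', id }); // first execution: record marker
      this.cursor += 1;
      return true;
    }
    const next = this.events[this.cursor];
    if (next.kind === 'patch' && next.id === id) {
      this.cursor += 1;
      return true;
    }
    return false; // no marker recorded: take the old branch
  }

  activity(activityName: string): void {
    if (!this.isReplaying()) {
      this.events.push({ kind: 'activity', name: activityName });
      this.cursor += 1;
      return;
    }
    const next = this.events[this.cursor];
    if (next.kind !== 'activity' || next.name !== activityName) {
      throw nonDeterminismError('recorded event does not match ' + activityName);
    }
    this.cursor += 1;
  }
}

// The patched workflow from the proposed fix.
function patchedWorkflow(ctx: ToyReplayer): void {
  ctx.activity('enrichLead');
  if (ctx.patched('add-company-enrichment')) ctx.activity('enrichCompany');
  ctx.activity('getCampaign');
  ctx.activity('startConversation');
}

function replayOutcome(recorded: HistoryEvent[]): 'ok' | 'non-determinism' {
  try {
    patchedWorkflow(new ToyReplayer(recorded.slice()));
    return 'ok';
  } catch (err) {
    if (err instanceof Error && err.name === 'NonDeterminismError') return 'non-determinism';
    throw err;
  }
}

const act = (activityName: string): HistoryEvent => ({ kind: 'activity', name: activityName });

// Started before Friday's deploy: no enrichCompany in the recorded events.
const preDeployOutcome = replayOutcome([act('enrichLead'), act('getCampaign')]);
// Started after Friday's deploy: enrichCompany recorded, but no patch marker.
const postDeployOutcome = replayOutcome([act('enrichLead'), act('enrichCompany'), act('getCampaign')]);
// Started on the patched code: the marker is recorded on first execution.
const freshOutcome = replayOutcome([]);
```

In this model the pre-deploy and fresh workflows replay cleanly, while the workflow started between the two deploys fails because its recorded enrichCompany step has no marker in front of it.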
Workers OOM-ing every hour
Workers started crashing with OOM kills. It started after a deploy, but the deploy didn't change any workflow or activity code. Memory climbs steadily and hits the 3.8GB container limit after ~60 seconds. Restarting buys you another minute before it crashes again.
maxCachedWorkflows: 300
The deploy added a function that imported gpt-tokenizer (~12MB of tokenizer merge rules). Via barrel exports, the package got pulled into the workflow bundle even though no workflow used it. Each cached workflow loads the bundle into its own V8 isolate: 12MB × 300 cached workflows = 3.6GB.
Bundle size before
12 MB
Bundle size after fix
2.9 MB
Fix: Add "sideEffects": false to the shared package's package.json so webpack can tree-shake unused exports. Lesson: Always inspect your built workflow bundle, not just source code.
Workflow tasks timing out
Workflow tasks are consistently timing out with NOT_FOUND errors. Individual workflow task execution takes only a few milliseconds when measured. The worker is running with default Temporal SDK configuration on Node.js. CPU usage hovers around 60%.
What does NOT_FOUND mean here?
The default config gives the worker 40 concurrent workflow task slots. That's tuned for Go/Java where goroutines/threads run in parallel. In Node.js, all 40 tasks contend for the single-threaded event loop. Tasks queue up waiting for their turn — by the time the worker responds, the server has already timed out and returns NOT_FOUND.
tuner: {
  workflowTaskSlotSupplier: {
    type: 'fixed-size',
    numSlots: 8, // not 40
  },
},
workflowThreadPoolSize: 2, // spread across 2 threads
Lesson: Temporal SDK defaults are designed for multi-threaded runtimes. Node.js workers need explicitly tuned, lower slot counts. More slots ≠ more throughput when you have a single event loop.
Temporal: What's on Our Side