Temporal: What's on Our Side

⏱

Temporal in a Nutshell

Durable execution platform

Temporal lets you write code that survives failures. Your workflows run to completion even if processes crash, networks fail, or servers restart.

🔄

Automatic Retries

Failed activities are retried without manual intervention

💾

State Persistence

Workflow state is persisted and survives crashes

👁

Full Visibility

Every workflow execution is observable and queryable

We use it for: Durable execution of conversations and enrichments.

🛠

How It Works

The core architecture

💻

Client

(starts workflows)

→

gRPC

⏱

Temporal Server

(orchestration + persistence)

← polls →

Task Queues

⚙

Workers

(execute code)

Workflow Tasks

Deterministic logic: decisions, branching, orchestration

Activity Tasks

Side effects: API calls, DB queries, I/O operations

📩

Talking to Running Workflows

Signals, Updates & Queries

Signals

Fire-and-forget

Send data to a running workflow without waiting for a response. The workflow decides when and how to process it.

Our example: Call Manager

A single workflow acts as a queue — receives signals to dispatch calls to leads.

Updates

Send and get a result

Send data to a workflow and wait for it to process and return a result. Like a synchronous RPC into a running workflow.

Our example: Send Message from Inbox

User sends a message manually — update returns the result of the send operation.

Queries

Read-only, no mutations

Read the current state of a workflow without changing it. Useful for inspecting workflow state from outside.

We don't use them

We prefer writing state to the database so it's queryable outside of Temporal.
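
To make the contrast concrete, here is a toy in-process model of the three interaction styles. It is plain TypeScript, not the Temporal SDK: ToyWorkflow and all of its method names are hypothetical, and the send is stubbed out.

```typescript
// Toy in-process model, NOT the Temporal SDK. All names are hypothetical.
type QueuedCall = { leadId: string };

class ToyWorkflow {
  private pendingCalls: QueuedCall[] = [];

  // Signal: fire-and-forget. Nothing is returned; the item just gets queued.
  signalDispatchCall(call: QueuedCall): void {
    this.pendingCalls.push(call);
  }

  // Update: the caller awaits the handler and receives its result.
  async updateSendMessage(text: string): Promise<{ delivered: boolean }> {
    // A real handler would do the send via an activity; this stub only validates input.
    return { delivered: text.trim().length > 0 };
  }

  // Query: read-only view of current state, no mutation.
  queryPendingCount(): number {
    return this.pendingCalls.length;
  }

  // The workflow's own loop decides when a queued signal is processed.
  processNextCall(): string | undefined {
    return this.pendingCalls.shift()?.leadId;
  }
}

const toyWorkflow = new ToyWorkflow();
toyWorkflow.signalDispatchCall({ leadId: 'lead-1' });
toyWorkflow.signalDispatchCall({ leadId: 'lead-2' });
const pendingBefore = toyWorkflow.queryPendingCount();
const firstDispatched = toyWorkflow.processNextCall();
const sendResult = toyWorkflow.updateSendMessage('Hello');
```

The signal caller never learns when its call is dispatched, the update caller gets the send result back, and the query only reads.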

☁️

Temporal Cloud Handles

So we can focus on business logic

📚

Event History Persistence

Every workflow event durably stored, enabling replay and recovery

📋

Task Queue Management

Routing workflow and activity tasks to the right workers

📈

Visibility & Search

Query and search across all workflow executions via the UI and APIs

🕑

Timers & Scheduling

Sleep, cron schedules, and delayed execution without external schedulers

🔒

Multi-tenancy & Security

Namespace isolation, mTLS certificates, identity management

💨

Scaling & Availability

Server-side scaling, replication, and high availability

Bottom line: We don't manage databases, queues, or server infrastructure. Temporal Cloud is the control plane; our workers are the data plane.

03.A

Worker Configuration

Tuning workers for Node.js

⚙

Polling & Slots

How workers pick up work

Task Queue

← long poll →

Pollers

Fetch tasks from server

→

Slots

Execute tasks concurrently

Pollers

Long-poll the Temporal Server for new tasks. More pollers = faster task pickup, but each holds a connection.

Slots

Concurrent execution capacity. A task occupies a slot while running. More slots = more throughput, but more resource usage.
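
A minimal sketch of the slot idea, assuming a slot behaves like a counting semaphore. SlotPool is a hypothetical stand-in written in plain TypeScript, not the SDK's slot supplier; the 5 ms sleep stands in for task execution.

```typescript
// Toy slot limiter (not the Temporal SDK): at most `numSlots` tasks run at once.
class SlotPool {
  private readonly numSlots: number;
  private inUse = 0;
  private waiters: Array<() => void> = [];
  peakInUse = 0;

  constructor(numSlots: number) {
    this.numSlots = numSlots;
  }

  private async acquire(): Promise<void> {
    if (this.inUse >= this.numSlots) {
      // No free slot: wait until a finishing task hands its slot over.
      await new Promise<void>((resolve) => this.waiters.push(resolve));
      return; // the slot was transferred to us, so inUse already counts it
    }
    this.inUse++;
    this.peakInUse = Math.max(this.peakInUse, this.inUse);
  }

  private release(): void {
    const next = this.waiters.shift();
    if (next) {
      next(); // hand the slot directly to the next waiter
    } else {
      this.inUse--;
    }
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function demoSlots(): Promise<number> {
  const pool = new SlotPool(3);
  // Ten "polled" tasks compete for three slots.
  await Promise.all(Array.from({ length: 10 }, () => pool.run(() => sleep(5))));
  return pool.peakInUse;
}
```

However many tasks arrive, no more than numSlots run concurrently; the rest wait for a slot to free up.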

⚠️

The Node.js Challenge

Default config doesn't work for us

PROBLEM

Default: 40 workflow slots

Temporal defaults are tuned for Go/Java, where many tasks can run in parallel across threads. Node.js has a single-threaded event loop: too many concurrent workflow tasks contend for it, and tasks exceed the 10s workflow task (start-to-close) timeout.

OUR CONFIG

Fixed-size, low slot counts

Workflow slots: 5–10 (fixed)
Activity slots: resource-based (CPU 1.0, memory 0.6)
Workflow thread pool: 2 threads

Sticky queues: Workers cache workflows locally. When the same worker gets the next task for a workflow, it skips full replay — reducing latency and CPU. We tune the nonStickyToStickyPollRatio to favor sticky execution.

💣

Workflow Cache & OOM

Lessons from production

Each cached workflow runs in its own V8 isolate. The workflow bundle is loaded into every isolate — so bundle size directly multiplies with cache size.

BEFORE

Bundle size

12 MB

Base RSS

1,200 MB

Per cached WF

~12 MB

92 cached WFs

2,950 MB

AFTER

Bundle size

2.9 MB

Base RSS

500 MB

Per cached WF

~1 MB

92 cached WFs

593 MB

Root cause: An unused gpt-tokenizer package (~12MB) was pulled in via barrel exports. Fix: "sideEffects": false in package.json to enable tree-shaking.
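
A back-of-the-envelope sketch of how cache size drives memory, assuming a simple linear model (base RSS plus a fixed cost per cached workflow). The function names and the 2,048 MB budget are hypothetical; the other numbers are the AFTER figures above.

```typescript
// Rough RSS estimate: base process memory plus a fixed cost per cached workflow.
function estimateRssMb(baseRssMb: number, perCachedWorkflowMb: number, cachedWorkflows: number): number {
  return baseRssMb + perCachedWorkflowMb * cachedWorkflows;
}

// AFTER figures: 500 MB base, ~1 MB per cached workflow, 92 cached workflows.
const afterEstimateMb = estimateRssMb(500, 1, 92); // 592, close to the measured 593 MB

// Largest cache that fits a memory budget under the same linear model.
function maxCachedWorkflowsFor(budgetMb: number, baseRssMb: number, perCachedWorkflowMb: number): number {
  return Math.max(0, Math.floor((budgetMb - baseRssMb) / perCachedWorkflowMb));
}

// Hypothetical 2,048 MB budget with the AFTER figures.
const cacheCeiling = maxCachedWorkflowsFor(2048, 500, 1); // 1548
```

The model is only as good as the per-workflow figure, so it is worth re-measuring after any change to the bundle.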

🔎

Enrich Service Worker

Our actual configuration

tuner: {
  workflowTaskSlotSupplier: {
    type: 'fixed-size',
    numSlots: 8,
  },
  activityTaskSlotSupplier: {
    type: 'resource-based',
    rampThrottle: '100 ms',
    tunerOptions, // CPU 1.0, memory 0.6
  },
  localActivityTaskSlotSupplier: {
    type: 'resource-based',
    tunerOptions,
  },
},
nonStickyToStickyPollRatio: 0.5,
maxCachedWorkflows: 300,

Workflow slots: 8 (fixed)

Fixed-size to avoid contention on the event loop. Enrichments are short-lived workflows.

Activity slots: resource-based

Scales with available CPU/memory. 100ms ramp throttle prevents sudden spikes.

Sticky ratio: 0.5

Half of polls target sticky queues. 300 cached workflows to reduce replays.

💬

Conversation Service Worker

Our actual configuration

workflowThreadPoolSize: 2,
tuner: {
  workflowTaskSlotSupplier: {
    type: 'fixed-size',
    numSlots: 10, // 5 each thread
  },
  activityTaskSlotSupplier: {
    type: 'fixed-size',
    numSlots: 10,
  },
  localActivityTaskSlotSupplier: {
    type: 'resource-based',
    tunerOptions,
  },
},
maxCachedWorkflows: 700,
nonStickyToStickyPollRatio: 2 / 6,

2 workflow threads, 10 slots

5 slots per thread. Conversations are long-lived workflows that need more parallelism.

Activity slots: 10 (fixed)

Fixed rather than resource-based — conversation activities (AI calls, API) are I/O-bound, not CPU-bound.

Sticky ratio: 2/6, 700 cached WFs

Higher cache because conversations are long-lived. Lower nonStickyToStickyPollRatio (2/6 instead of 0.5) because we have many workers sharing the load.

03.B

Writing Workflows

Where code runs and how to think about it

📍

Where Tasks Execute

Two very different execution environments

Inside a Temporal Worker (Node.js)

Main Thread

Activity Execution

API calls
Database queries
File I/O
Any side effects

Full Node.js environment

Worker Threads (V8 Sandbox)

Workflow Execution

Deterministic logic only
No I/O, no network
No real randomness or wall-clock time (Math.random() and Date are replaced with deterministic versions)
Replayed on recovery

Sandboxed & deterministic

Why sandboxed? Workflows must be deterministic so they can be replayed to reconstruct state after a crash. Side effects must go through activities.

🔐

Determinism & Versioning

The sandbox and safe deployments

The Workflow Bundle

Workflow code is bundled into a single JS file and loaded into a V8 sandbox that restricts what you can use: no fs, no fetch, and no real clock or randomness (the TypeScript SDK replaces Date and Math.random() with deterministic versions). This ensures workflows are deterministic and can be safely replayed.

Deploying workflow changes safely

Running workflows replay their history against the new code. If the new code schedules different steps than the history recorded, replay fails with a non-determinism error. Temporal's patching API lets old and new workflows coexist:

1. Add the patch

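import { patched } from '@temporalio/workflow';

if (patched('my-change')) {
  // new behavior (new executions, and replays that recorded the marker)
} else {
  // old behavior (for replaying existing workflows)
}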

2. Once old workflows complete

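import { deprecatePatch } from '@temporalio/workflow';

deprecatePatch('my-change');
// Now only the new behavior runs.
// Remove the deprecatePatch call later, once workflows from this deploy complete.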

Watch out: Anything you import in your workflow file ends up in the bundle. Keep workflow files lean — barrel exports can silently pull in large dependencies (see OOM slide).

⏳

Latency

Every activity = network round trips

What happens when a workflow calls an activity

Workflow

schedules activity

→

Server

persists + queues

→

Activity

executes

→

Server

persists result

→

Workflow

continues

Overhead per activity call

2 network hops to/from Temporal Server + 2 persistence writes. Even a trivial activity adds latency overhead.

Workflow code is free (almost)

Logic between activity calls (branching, loops, transforms) runs locally in the V8 sandbox with no network cost.

Note: Temporal supports eager activity dispatch — the server can send an activity task directly in the workflow task response, skipping one round trip. So in practice, it's not as bad as the diagram suggests.

🎯

Activity Scope

What belongs together in one activity

An activity is a unit of failure and retry. Group things that should fail or succeed together.

TOO GRANULAR

fetchLinkedInProfile()

fetchLinkedInExperience()

fetchLinkedInEducation()

3x latency overhead. These are all fetches from the same source — if one fails, you'd retry them all anyway.

RIGHT SCOPE

fetchLinkedInData()

Fetches profile, experience, and education in one go

1x latency overhead. Retry is meaningful: all fetches from the same source share a failure boundary.

Rule of thumb: An activity should encapsulate things that you'd want to retry as a unit. If step B only makes sense after step A succeeds, they belong in the same activity.
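
A minimal sketch of the RIGHT SCOPE version, assuming three independent fetches against the same source. The data shapes and the three helper fetchers are hypothetical stubs; the real LinkedIn client is not shown here.

```typescript
// Hypothetical data shape and stub fetchers, for illustration only.
type LinkedInData = { profile: string; experience: string[]; education: string[] };

async function fetchProfile(leadId: string): Promise<string> {
  return `profile:${leadId}`;
}
async function fetchExperience(leadId: string): Promise<string[]> {
  return [`job:${leadId}`];
}
async function fetchEducation(leadId: string): Promise<string[]> {
  return [`school:${leadId}`];
}

// One activity = one retry unit: if any fetch rejects, the whole function
// rejects, and a retry would redo all three together.
async function fetchLinkedInData(leadId: string): Promise<LinkedInData> {
  const [profile, experience, education] = await Promise.all([
    fetchProfile(leadId),
    fetchExperience(leadId),
    fetchEducation(leadId),
  ]);
  return { profile, experience, education };
}
```

Registered as a single activity, this pays the scheduling overhead once instead of three times.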

Rules

🤖

LLMs Allowed

Claude, ChatGPT, Gemini — whatever you prefer. Good prompting is part of the challenge.

⏰

3 min per scenario

Read the scenario, diagnose the problem, propose a fix. Then we discuss.

🏆

Real incidents

These all happened to us in production. Context matters — generic answers won't cut it.

🚨

Scenario 1

Non-determinism errors after deploy

INCIDENT

It's Monday morning. A deploy went out on Friday that added a new enrichment step to the conversation workflow. Over the weekend, hundreds of running conversation workflows started throwing non-determinism errors. New conversations work fine. The diff:

  // conversation workflow
  const lead = await enrichLead(leadId);
+ const company = await enrichCompany(lead.companyId);
  const campaign = await getCampaign(campaignId);
  await startConversation(lead, campaign);

What you know

Only workflows started before the deploy are failing
New workflows work perfectly
The error is: NonDeterminismError

Questions

Why are only old workflows failing?
Can you fix it for both old and new workflows in a single deploy? How?
How do you prevent this next time?

✅

Scenario 1 — Answer

Old workflows already have getCampaign recorded as the second activity in their history. The new code schedules enrichCompany second. On replay, the history doesn't match the code → NonDeterminismError.
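
The mismatch can be sketched with a toy replay check in plain TypeScript. This is not how the SDK is implemented; firstReplayMismatch is a hypothetical helper that only compares activity names in order.

```typescript
// Toy replay check (not the Temporal SDK): compare the activities recorded in
// history with the activities the current code would schedule, in order.
function firstReplayMismatch(recorded: string[], scheduledByCode: string[]): number {
  for (let i = 0; i < recorded.length; i++) {
    if (recorded[i] !== scheduledByCode[i]) return i;
  }
  return -1; // every recorded step matches; the code may continue past the history
}

// History written by the old code for a workflow started before the deploy.
const oldHistory = ['enrichLead', 'getCampaign', 'startConversation'];
// Order in which the new code schedules activities.
const newCodeSteps = ['enrichLead', 'enrichCompany', 'getCampaign', 'startConversation'];

const mismatchIndex = firstReplayMismatch(oldHistory, newCodeSteps); // 1: getCampaign vs enrichCompany
const freshWorkflow = firstReplayMismatch([], newCodeSteps); // -1: a new workflow has nothing to contradict
```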

The naive fix: use patching

const lead = await enrichLead(leadId);
if (patched('add-company-enrichment')) {
  const company = await enrichCompany(lead.companyId);
}
const campaign = await getCampaign(campaignId);
await startConversation(lead, campaign);

But wait: This fixes old workflows, but breaks new ones. Workflows that started after the original deploy already have enrichCompany in their history — but without a patch marker. When they replay with this code, patched() sees no marker and takes the old branch → non-determinism again.
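
A toy model of that trap, assuming patched() returns true on first execution (recording a marker) and on replay only when the history contains the marker. toyPatched and the helpers are hypothetical plain TypeScript, not the SDK implementation.

```typescript
// Toy model of patched(): true on first execution (a marker is recorded),
// and on replay only if the history already holds the marker.
function toyPatched(patchId: string, replaying: boolean, markers: Set<string>): boolean {
  if (!replaying) {
    markers.add(patchId);
    return true;
  }
  return markers.has(patchId);
}

// Activities the naive fix schedules, given what toyPatched() returns.
function scheduledActivities(replaying: boolean, markers: Set<string>): string[] {
  const steps = ['enrichLead'];
  if (toyPatched('add-company-enrichment', replaying, markers)) steps.push('enrichCompany');
  steps.push('getCampaign', 'startConversation');
  return steps;
}

const sameSteps = (a: string[], b: string[]) => a.length === b.length && a.every((s, i) => s === b[i]);

// Replay with no marker in history: the old branch is taken.
const replayWithoutMarker = scheduledActivities(true, new Set<string>());
const preDeployHistory = ['enrichLead', 'getCampaign', 'startConversation'];
const postDeployHistory = ['enrichLead', 'enrichCompany', 'getCampaign', 'startConversation'];

const preDeployOk = sameSteps(replayWithoutMarker, preDeployHistory); // true
const postDeployOk = sameSteps(replayWithoutMarker, postDeployHistory); // false: non-determinism

// Started after the patch deploy: marker recorded on first run, found again on replay.
const patchMarkers = new Set<string>();
const firstRun = scheduledActivities(false, patchMarkers);
const replayed = scheduledActivities(true, patchMarkers);
```

Only workflows that record the marker on their first run replay consistently; the ones started between the two deploys have enrichCompany in history but no marker.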

A single deploy can't fix both. You'd need to either terminate and restart the affected workflows, or do a two-phase rollout. Prevention: always use patched() before any workflow runs the new code path.

🚨

Scenario 2

Workers OOM-ing after every restart

INCIDENT

Workers started crashing with OOM kills. It started after a deploy, but the deploy didn't change any workflow or activity code. Memory climbs steadily and hits the 3.8GB container limit after ~60 seconds. Restarting buys you another minute before it crashes again.

What you know

V8 heap usage is normal (~700MB)
RSS is 3.3GB — most memory is outside the heap
Config: maxCachedWorkflows: 300
Reverting the deploy fixes it

Questions

Where is the memory if it's not on the V8 heap?
The deploy didn't change workflow code. What else could cause this?
What would you investigate next?

✅

Scenario 2 — Answer

The deploy added a function that imported gpt-tokenizer (~12MB of tokenizer merge rules). Via barrel exports, it got pulled into the workflow bundle, even though no workflow used it. Each cached workflow loads the bundle into its own V8 isolate: 12MB × 300 cached workflows = 3.6GB.

Bundle size before

12 MB

Bundle size after fix

2.9 MB

Fix: Add "sideEffects": false to the shared package's package.json so webpack can tree-shake unused exports. Lesson: Always inspect your built workflow bundle, not just source code.

🚨

Scenario 3

Workflow tasks timing out

INCIDENT

Workflow tasks are consistently timing out with NOT_FOUND errors. Individual workflow task execution takes only a few milliseconds when measured. The worker is running with default Temporal SDK configuration on Node.js. CPU usage hovers around 60%.

What you know

Worker config: defaults (no custom tuning)
Runtime: Node.js (TypeScript SDK)
Each individual task takes ~5ms to execute
The Temporal Server responds with NOT_FOUND

Questions

How can a 5ms task exceed a 10s timeout?
What does NOT_FOUND mean here?
What would you change in the config?

✅

Scenario 3 — Answer

The default config gives the worker 40 concurrent workflow task slots. That's tuned for Go/Java where goroutines/threads run in parallel. In Node.js, all 40 tasks contend for the single-threaded event loop. Each task needs only ~5ms of execution, but it queues behind the others waiting for its turn. By the time the worker responds, the server has already timed the task out, so it no longer recognizes the completion and returns NOT_FOUND.

The fix: reduce concurrency

tuner: {
  workflowTaskSlotSupplier: {
    type: 'fixed-size',
    numSlots: 8,   // not 40
  },
},
workflowThreadPoolSize: 2,  // spread across 2 threads

Lesson: Temporal SDK defaults are designed for multi-threaded runtimes. Node.js workers need explicitly tuned, lower slot counts. More slots ≠ more throughput when you have a single event loop.


Questions?