From eca74d6ac4bd9bb69dcdd2e9c376a857a1fd51d7 Mon Sep 17 00:00:00 2001 From: Donald Pinckney Date: Mon, 2 Feb 2026 18:20:32 -0500 Subject: [PATCH 1/4] Add initial skill for testing, which is simply Steve's skill (#1) * Add initial skill for testing, which is simply Steve's skill * Rename skill to 'temporal-dev' and update version Updated skill name and version for Temporal Python. --- SKILL.md | 117 +++++++++++++++ references/concepts.md | 38 +++++ references/error-reference.md | 24 +++ references/interactive-workflows.md | 53 +++++++ references/logs.md | 23 +++ references/tool-reference.md | 84 +++++++++++ references/troubleshooting.md | 184 +++++++++++++++++++++++ scripts/analyze-workflow-error.sh | 225 ++++++++++++++++++++++++++++ scripts/bulk-cancel-workflows.sh | 145 ++++++++++++++++++ scripts/ensure-server.sh | 98 ++++++++++++ scripts/ensure-worker.sh | 91 +++++++++++ scripts/find-project-workers.sh | 73 +++++++++ scripts/find-stalled-workflows.sh | 117 +++++++++++++++ scripts/get-workflow-result.sh | 202 +++++++++++++++++++++++++ scripts/kill-all-workers.sh | 134 +++++++++++++++++ scripts/kill-worker.sh | 108 +++++++++++++ scripts/list-recent-workflows.sh | 113 ++++++++++++++ scripts/list-workers.sh | 221 +++++++++++++++++++++++++++ scripts/monitor-worker-health.sh | 111 ++++++++++++++ scripts/wait-for-worker-ready.sh | 76 ++++++++++ scripts/wait-for-workflow-status.sh | 118 +++++++++++++++ 21 files changed, 2355 insertions(+) create mode 100644 SKILL.md create mode 100644 references/concepts.md create mode 100644 references/error-reference.md create mode 100644 references/interactive-workflows.md create mode 100644 references/logs.md create mode 100644 references/tool-reference.md create mode 100644 references/troubleshooting.md create mode 100755 scripts/analyze-workflow-error.sh create mode 100755 scripts/bulk-cancel-workflows.sh create mode 100755 scripts/ensure-server.sh create mode 100755 scripts/ensure-worker.sh create mode 100755 scripts/find-project-workers.sh create mode 100755 scripts/find-stalled-workflows.sh create mode 100755 scripts/get-workflow-result.sh create mode 100755 scripts/kill-all-workers.sh create mode 100755 scripts/kill-worker.sh create mode 100755 scripts/list-recent-workflows.sh create mode 100755 scripts/list-workers.sh create mode 100755 scripts/monitor-worker-health.sh create mode 100755 scripts/wait-for-worker-ready.sh create mode 100755 scripts/wait-for-workflow-status.sh diff --git a/SKILL.md b/SKILL.md new file mode 100644 index 0000000..70af934 --- /dev/null +++ b/SKILL.md @@ -0,0 +1,117 @@ +--- +name: temporal-dev +description: "Start, stop, debug, and troubleshoot Temporal workflows for Python projects. Use when: starting workers, executing workflows, workflow is stalled/failed, non-determinism errors, checking workflow status, or managing temporal server start-dev lifecycle." +version: 0.1.0 +allowed-tools: "Bash(.claude/skills/temporal/scripts/*:*), Read" +--- + +# Temporal Skill + +Manage Temporal workflows using local development server (Python SDK, `temporal server start-dev`). + +## Environment Variables + +| Variable | Default | Description | +|----------|---------|-------------| +| `CLAUDE_TEMPORAL_LOG_DIR` | `/tmp/claude-temporal-logs` | Worker log directory | +| `CLAUDE_TEMPORAL_PID_DIR` | `/tmp/claude-temporal-pids` | Worker PID directory | +| `TEMPORAL_ADDRESS` | `localhost:7233` | Temporal server gRPC address | +| `TEMPORAL_WORKER_CMD` | `uv run worker` | Command to start worker | + +--- + +## Quick Start + +```bash +# 1. 
Start server
+./scripts/ensure-server.sh
+
+# 2. Start worker (kills old workers, starts fresh)
+./scripts/ensure-worker.sh
+
+# 3. Execute workflow
+uv run starter # Capture workflow_id from output
+
+# 4. Wait for completion
+./scripts/wait-for-workflow-status.sh --workflow-id <workflow-id> --status COMPLETED
+
+# 5. Get result (verify it's correct, not an error message)
+./scripts/get-workflow-result.sh --workflow-id <workflow-id>
+
+# 6. CLEANUP: Kill workers when done
+./scripts/kill-worker.sh
+```
+
+---
+
+## Common Recipes
+
+### Clean Start
+```bash
+./scripts/kill-all-workers.sh
+./scripts/ensure-server.sh
+./scripts/ensure-worker.sh
+uv run starter
+```
+
+### Debug Stalled Workflow
+```bash
+./scripts/find-stalled-workflows.sh
+./scripts/analyze-workflow-error.sh --workflow-id <workflow-id>
+tail -100 $CLAUDE_TEMPORAL_LOG_DIR/worker-$(basename "$(pwd)").log
+# See references/troubleshooting.md for decision tree
+```
+
+### Clear Stalled Environment
+```bash
+./scripts/find-stalled-workflows.sh
+./scripts/bulk-cancel-workflows.sh
+./scripts/kill-worker.sh
+./scripts/ensure-worker.sh
+```
+
+### Check Recent Results
+```bash
+./scripts/list-recent-workflows.sh --minutes 30
+./scripts/get-workflow-result.sh --workflow-id <workflow-id>
+```
+
+---
+
+## Key Scripts
+
+| Script | Purpose |
+|--------|---------|
+| `ensure-server.sh` | Start dev server if not running |
+| `ensure-worker.sh` | Kill old workers, start fresh one |
+| `kill-worker.sh` | Kill current project's worker |
+| `kill-all-workers.sh` | Kill all workers (`--include-server` option) |
+| `find-stalled-workflows.sh` | Detect stalled workflows |
+| `analyze-workflow-error.sh` | Extract errors from history |
+| `wait-for-workflow-status.sh` | Block until status reached |
+| `get-workflow-result.sh` | Get workflow output |
+
+See `references/tool-reference.md` for full details.
+
+---
+
+## References (Load When Needed)
+
+| Reference | When to Read |
+|-----------|--------------|
+| `references/concepts.md` | Understanding workflow vs activity tasks, component architecture |
+| `references/troubleshooting.md` | Workflow stalled, failed, or misbehaving - decision tree and fixes |
+| `references/error-reference.md` | Looking up specific error types and recovery steps |
+| `references/tool-reference.md` | Script options and worker management details |
+| `references/interactive-workflows.md` | Signals, updates, queries for human-in-the-loop workflows |
+| `references/logs.md` | Log file locations and search commands |
+
+---
+
+## Critical Rules
+
+1. **Always kill workers when done** - Don't leave stale workers running
+2. **One worker instance only** - Multiple workers cause non-determinism
+3. **Capture workflow_id** - You need it for all monitoring/troubleshooting
+4. **Verify results** - COMPLETED status doesn't mean correct result; check payload
+5. **Non-determinism: analyze first** - Use `analyze-workflow-error.sh` to understand the mismatch. If accidental: fix code to match history. If intentional v2 change: terminate and start fresh. See `references/troubleshooting.md`
diff --git a/references/concepts.md b/references/concepts.md
new file mode 100644
index 0000000..144709e
--- /dev/null
+++ b/references/concepts.md
@@ -0,0 +1,38 @@
+# Temporal Concepts
+
+Understanding how Temporal components interact is essential for troubleshooting.
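+
+To make the diagram below concrete, here is a minimal sketch of all three pieces in the Python SDK. It is illustrative only: `GreetingWorkflow`, `say_hello`, and the `greeting-tasks` queue are hypothetical names, not part of this skill.
+
+```python
+import asyncio
+from datetime import timedelta
+
+from temporalio import activity, workflow
+from temporalio.client import Client
+from temporalio.worker import Worker
+
+
+@activity.defn
+async def say_hello(name: str) -> str:
+    # Activity: ordinary business logic; retried on failure.
+    return f"Hello, {name}!"
+
+
+@workflow.defn
+class GreetingWorkflow:
+    @workflow.run
+    async def run(self, name: str) -> str:
+        # Workflow: deterministic orchestration, replayed from history.
+        return await workflow.execute_activity(
+            say_hello, name, start_to_close_timeout=timedelta(seconds=10)
+        )
+
+
+async def main() -> None:
+    client = await Client.connect("localhost:7233")
+    # The worker polls the task queue and hosts both definitions.
+    worker = Worker(
+        client,
+        task_queue="greeting-tasks",
+        workflows=[GreetingWorkflow],
+        activities=[say_hello],
+    )
+    await worker.run()
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```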
+ +## How Workers, Workflows, and Tasks Relate + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ TEMPORAL SERVER │ +│ Stores workflow history, manages task queues, coordinates work │ +└─────────────────────────────────────────────────────────────────┘ + │ + Task Queue (named queue) + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ WORKER │ +│ Long-running process that polls task queue for work │ +│ Contains: Workflow definitions + Activity implementations │ +│ │ +│ When work arrives: │ +│ - Workflow Task → Execute workflow code decisions │ +│ - Activity Task → Execute activity code (business logic) │ +└─────────────────────────────────────────────────────────────────┘ +``` + +**Key Insight**: The workflow code runs inside the worker. If worker code is outdated or buggy, workflow execution fails. + +## Workflow Task vs Activity Task + +| Task Type | What It Does | Where It Runs | On Failure | +|-----------|--------------|---------------|------------| +| **Workflow Task** | Makes workflow decisions (what to do next) | Worker | **Stalls the workflow** until fixed | +| **Activity Task** | Executes business logic | Worker | Retries per retry policy | + +**CRITICAL**: Workflow task errors are fundamentally different from activity task errors: +- **Workflow Task Failure** → Workflow **stops making progress entirely** +- **Activity Task Failure** → Workflow **retries the activity** (workflow still progressing) diff --git a/references/error-reference.md b/references/error-reference.md new file mode 100644 index 0000000..0926e4f --- /dev/null +++ b/references/error-reference.md @@ -0,0 +1,24 @@ +# Common Error Types Reference + +| Error Type | Where to Find | What Happened | Recovery | +|------------|---------------|---------------|----------| +| **Non-determinism** | `WorkflowTaskFailed` in history | Replay doesn't match history | Analyze error first. **If accidental**: fix code to match history → restart worker. **If intentional v2 change**: terminate → start fresh workflow. 
| **Workflow code bug** | `WorkflowTaskFailed` in history | Bug in workflow logic | Fix code → Restart worker → Workflow auto-resumes |
+| **Missing workflow** | Worker logs | Workflow not registered | Add to worker.py → Restart worker |
+| **Missing activity** | Worker logs | Activity not registered | Add to worker.py → Restart worker |
+| **Activity bug** | `ActivityTaskFailed` in history | Bug in activity code | Fix code → Restart worker → Auto-retries |
+| **Activity retries** | `ActivityTaskFailed` (count >2) | Repeated failures | Fix code → Restart worker → Auto-retries |
+| **Sandbox violation** | Worker logs | Bad imports in workflow | Fix workflow.py imports → Restart worker |
+| **Task queue mismatch** | Workflow never starts | Different queues in starter/worker | Align task queue names |
+| **Timeout** | Status = TIMED_OUT | Operation too slow | Increase timeout config |
+
+## Workflow Status Reference
+
+| Status | Meaning | Action |
+|--------|---------|--------|
+| `RUNNING` | Workflow in progress | Wait, or check if stalled |
+| `COMPLETED` | Successfully finished | Get result, verify correctness |
+| `FAILED` | Error during execution | Analyze error |
+| `CANCELED` | Explicitly canceled | Review reason |
+| `TERMINATED` | Force-stopped | Review reason |
+| `TIMED_OUT` | Exceeded timeout | Increase timeout |
diff --git a/references/interactive-workflows.md b/references/interactive-workflows.md
new file mode 100644
index 0000000..f1bafe3
--- /dev/null
+++ b/references/interactive-workflows.md
@@ -0,0 +1,53 @@
+# Interactive Workflows
+
+Interactive workflows pause and wait for external input (signals or updates).
+
+## Signals
+
+Fire-and-forget messages to a workflow.
+
+```bash
+# Send signal to workflow
+temporal workflow signal \
+  --workflow-id <workflow-id> \
+  --name "signal_name" \
+  --input '{"key": "value"}'
+
+# Or via interact script (if available)
+uv run interact --workflow-id <workflow-id> --signal-name "signal_name" --data '{"key": "value"}'
+```
+
+## Updates
+
+Request-response style interaction (returns a value).
+
+```bash
+# Send update to workflow
+temporal workflow update \
+  --workflow-id <workflow-id> \
+  --name "update_name" \
+  --input '{"approved": true}'
+```
+
+## Queries
+
+Read-only inspection of workflow state.
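+
+For reference, the workflow-side handlers that the signal and query commands in this file target might be declared like this (a sketch using the Python SDK's `@workflow.signal` and `@workflow.query` decorators; `ApprovalWorkflow` and its handler bodies are hypothetical, with names matching the CLI examples):
+
+```python
+from temporalio import workflow
+
+
+@workflow.defn
+class ApprovalWorkflow:
+    def __init__(self) -> None:
+        self._approved = False
+
+    @workflow.run
+    async def run(self) -> str:
+        # Pause here until a signal flips the flag.
+        await workflow.wait_condition(lambda: self._approved)
+        return "approved"
+
+    @workflow.signal
+    def approval(self, data: dict) -> None:
+        # Fire-and-forget: nothing is returned to the sender.
+        self._approved = bool(data.get("approved"))
+
+    @workflow.query
+    def get_status(self) -> str:
+        # Read-only: must not mutate state or schedule new work.
+        return "approved" if self._approved else "waiting"
+```
+
+The query below invokes the `get_status` handler: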
+ +```bash +# Query workflow state (read-only) +temporal workflow query \ + --workflow-id \ + --name "get_status" +``` + +## Testing Interactive Workflows + +```bash +./scripts/ensure-worker.sh +uv run starter # Get workflow_id +./scripts/wait-for-workflow-status.sh --workflow-id $workflow_id --status RUNNING +uv run interact --workflow-id $workflow_id --signal-name "approval" --data '{"approved": true}' +./scripts/wait-for-workflow-status.sh --workflow-id $workflow_id --status COMPLETED +./scripts/get-workflow-result.sh --workflow-id $workflow_id +./scripts/kill-worker.sh # CLEANUP +``` diff --git a/references/logs.md b/references/logs.md new file mode 100644 index 0000000..9be8b5f --- /dev/null +++ b/references/logs.md @@ -0,0 +1,23 @@ +# Log Files + +| Log | Location | Content | +|-----|----------|---------| +| Worker logs | `$CLAUDE_TEMPORAL_LOG_DIR/worker-{project}.log` | Worker output, activity logs, errors | + +Default log directory: `/tmp/claude-temporal-logs` + +## Useful Log Searches + +```bash +# Find errors +grep -i "error" $CLAUDE_TEMPORAL_LOG_DIR/worker-*.log + +# Check worker startup +grep -i "started" $CLAUDE_TEMPORAL_LOG_DIR/worker-*.log + +# Find activity issues +grep -i "activity" $CLAUDE_TEMPORAL_LOG_DIR/worker-*.log + +# Tail live logs +tail -f $CLAUDE_TEMPORAL_LOG_DIR/worker-$(basename "$(pwd)").log +``` diff --git a/references/tool-reference.md b/references/tool-reference.md new file mode 100644 index 0000000..4702172 --- /dev/null +++ b/references/tool-reference.md @@ -0,0 +1,84 @@ +# Tool Reference + +## Lifecycle Scripts + +| Tool | Description | Key Options | +|------|-------------|-------------| +| `ensure-server.sh` | Start dev server if not running | - | +| `ensure-worker.sh` | Kill old workers, start fresh one | Uses `$TEMPORAL_WORKER_CMD` | +| `kill-worker.sh` | Kill current project's worker | - | +| `kill-all-workers.sh` | Kill all workers | `--include-server` | +| `list-workers.sh` | List running workers | - | + +## Monitoring Scripts + +| Tool | Description | Key Options | +|------|-------------|-------------| +| `list-recent-workflows.sh` | Show recent executions | `--minutes N` (default: 5) | +| `find-stalled-workflows.sh` | Detect stalled workflows | `--query "..."` | +| `monitor-worker-health.sh` | Check worker status | - | +| `wait-for-workflow-status.sh` | Block until status | `--workflow-id`, `--status`, `--timeout` | + +## Debugging Scripts + +| Tool | Description | Key Options | +|------|-------------|-------------| +| `analyze-workflow-error.sh` | Extract errors from history | `--workflow-id`, `--run-id` | +| `get-workflow-result.sh` | Get workflow output | `--workflow-id`, `--raw` | +| `bulk-cancel-workflows.sh` | Mass cancellation | `--pattern "..."` | + +## Worker Management Details + +### The Golden Rule + +**Ensure no old workers are running.** Stale workers with outdated code cause: +- Non-determinism errors (history mismatch) +- Executing old buggy code +- Confusing behavior + +**Best practice**: Run only ONE worker instance with the latest code. + +### Starting Workers + +```bash +# PREFERRED: Smart restart (kills old, starts fresh) +./scripts/ensure-worker.sh +``` + +This command: +1. Finds ALL existing workers for the project +2. Kills them +3. Starts a new worker with fresh code +4. 
Waits for worker to be ready + +### Verifying Workers + +```bash +# List all running workers +./scripts/list-workers.sh + +# Check specific worker health +./scripts/monitor-worker-health.sh + +# View worker logs +tail -f $CLAUDE_TEMPORAL_LOG_DIR/worker-$(basename "$(pwd)").log +``` + +**What to look for in logs**: +- `Worker started, listening on task queue: ...` → Worker is ready +- `Worker process died during startup` → Startup failure, check logs for error + +### Cleanup (REQUIRED) + +**Always kill workers when done.** Don't leave workers running. + +```bash +# Kill current project's worker +./scripts/kill-worker.sh + +# Kill ALL workers (full cleanup) +./scripts/kill-all-workers.sh + +# Kill all workers AND server +./scripts/kill-all-workers.sh --include-server +``` diff --git a/references/troubleshooting.md b/references/troubleshooting.md new file mode 100644 index 0000000..bb58c85 --- /dev/null +++ b/references/troubleshooting.md @@ -0,0 +1,184 @@ +# Troubleshooting Temporal Workflows + +## Step 1: Identify the Problem + +```bash +# Check workflow status +temporal workflow describe --workflow-id + +# Check for stalled workflows (workflows stuck in RUNNING) +./scripts/find-stalled-workflows.sh + +# Analyze specific workflow errors +./scripts/analyze-workflow-error.sh --workflow-id +``` + +## Step 2: Diagnose Using This Decision Tree + +``` +Workflow not behaving as expected? +│ +├── Status: RUNNING but no progress (STALLED) +│ │ +│ ├── Is it an interactive workflow waiting for signal/update? +│ │ └── YES → Send the required interaction +│ │ +│ └── NO → Run: ./scripts/find-stalled-workflows.sh +│ │ +│ ├── WorkflowTaskFailed detected +│ │ │ +│ │ ├── Non-determinism error (history mismatch)? +│ │ │ └── See: "Fixing Non-Determinism Errors" below +│ │ │ +│ │ └── Other workflow task error (code bug, missing registration)? +│ │ └── See: "Fixing Other Workflow Task Errors" below +│ │ +│ └── ActivityTaskFailed (excessive retries) +│ └── Activity is retrying. Fix activity code, restart worker. +│ Workflow will auto-retry with new code. +│ +├── Status: COMPLETED but wrong result +│ └── Check result: ./scripts/get-workflow-result.sh --workflow-id +│ Is result an error message? → Fix workflow/activity logic +│ +├── Status: FAILED +│ └── Run: ./scripts/analyze-workflow-error.sh --workflow-id +│ Fix code → ./scripts/ensure-worker.sh → Start NEW workflow +│ +├── Status: TIMED_OUT +│ └── Increase timeouts → ./scripts/ensure-worker.sh → Start NEW workflow +│ +└── Workflow never starts + └── Check: Worker running? Task queue matches? Workflow registered? +``` + +--- + +## Fixing Workflow Task Errors + +**Workflow task errors STALL the workflow** - it stops making progress entirely until the issue is fixed. + +### Fixing Non-Determinism Errors + +Non-determinism occurs when workflow code produces different commands during replay than what's recorded in history. + +**Symptoms**: +- `WorkflowTaskFailed` events in history +- "Non-deterministic error" or "history mismatch" in logs/error message + +**CRITICAL: First understand the error**: +```bash +# 1. ALWAYS analyze the error first - understand what mismatched +./scripts/analyze-workflow-error.sh --workflow-id + +# Look for details like: +# - "expected ActivityTaskScheduled but got TimerStarted" +# - "activity type mismatch: expected X got Y" +# - "timer ID mismatch" +``` + +**Report the error to user** - They need to know what changed and why. 
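+
+As a concrete illustration of an accidental mismatch, reordering activities between deployments is enough to break replay. The workflow and activity names below are hypothetical, not from this project:
+
+```python
+from datetime import timedelta
+
+from temporalio import workflow
+
+
+@workflow.defn
+class OrderWorkflow:
+    @workflow.run
+    async def run(self) -> None:
+        # v1 of this workflow called "reserve_stock" first, so the recorded
+        # history expects a reserve_stock command at this position. Replaying
+        # the reordered v2 code below emits "charge_card" instead →
+        # WorkflowTaskFailed with a non-determinism / history-mismatch error.
+        await workflow.execute_activity(
+            "charge_card", start_to_close_timeout=timedelta(seconds=30)
+        )
+        await workflow.execute_activity(
+            "reserve_stock", start_to_close_timeout=timedelta(seconds=30)
+        )
+```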
+ +**Recovery options** (choose based on intent): + +**Option A: Fix code to match history (accidental change / bug)** +```bash +# Use when: You accidentally broke compatibility and want to recover the workflow +# 1. Understand what commands the history expects +# 2. Fix workflow code to produce those same commands during replay +# 3. Restart worker +./scripts/ensure-worker.sh +# 4. Workflow task retries automatically and continues +``` + +**Option B: Terminate and restart fresh (intentional v2 change)** +```bash +# Use when: You intentionally deployed breaking changes (v1→v2) and want new behavior +# The old workflow was started on v1; you want v2 going forward +temporal workflow terminate --workflow-id +./scripts/ensure-worker.sh +uv run starter # Start fresh workflow with v2 code +``` + +**Common non-determinism causes**: +- Changed activity order or added/removed activities mid-execution +- Changed activity names or signatures +- Added/removed timers or signals +- Conditional logic that depends on external state (time, random, etc.) + +**Key insight**: Non-determinism means "replay doesn't match history." +- **Accidental?** → Fix code to match history, workflow recovers +- **Intentional v2 change?** → Terminate old workflow, start fresh with new code + +### Fixing Other Workflow Task Errors + +For workflow task errors that are NOT non-determinism (code bugs, missing registration, etc.): + +**Symptoms**: +- `WorkflowTaskFailed` events +- Error is NOT "history mismatch" or "non-deterministic" + +**Fix procedure**: +```bash +# 1. Identify the error +./scripts/analyze-workflow-error.sh --workflow-id + +# 2. Fix the root cause (code bug, worker config, etc.) + +# 3. Kill and restart worker with fixed code +./scripts/ensure-worker.sh + +# 4. NO NEED TO TERMINATE - the workflow will automatically resume +# The new worker picks up where it left off and continues execution +``` + +**Key point**: Unlike non-determinism, the workflow can recover once you fix the code. + +--- + +## Fixing Activity Task Errors + +**Activity task errors cause retries**, not immediate workflow failure. + +### Workflow Stalling Due to Retries + +Workflows can appear stalled because an activity keeps failing and retrying. + +**Diagnosis**: +```bash +# Check for excessive activity retries +./scripts/find-stalled-workflows.sh + +# Look for ActivityTaskFailed count +# Check worker logs for retry messages +tail -100 $CLAUDE_TEMPORAL_LOG_DIR/worker-$(basename "$(pwd)").log +``` + +**Fix procedure**: +```bash +# 1. Fix the activity code + +# 2. Restart worker with fixed code +./scripts/ensure-worker.sh + +# 3. Worker auto-retries with new code +# No need to terminate or restart workflow +``` + +### Activity Failure (Retries Exhausted) + +When all retries are exhausted, the activity fails permanently. + +**Fix procedure**: +```bash +# 1. Analyze the error +./scripts/analyze-workflow-error.sh --workflow-id + +# 2. Fix activity code + +# 3. Restart worker +./scripts/ensure-worker.sh + +# 4. 
Start NEW workflow (old one has failed)
+uv run starter
+```
diff --git a/scripts/analyze-workflow-error.sh b/scripts/analyze-workflow-error.sh
new file mode 100755
index 0000000..c348227
--- /dev/null
+++ b/scripts/analyze-workflow-error.sh
@@ -0,0 +1,225 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Get the directory where this script is located
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+# Environment variables with defaults
+TEMPORAL_CLI="${TEMPORAL_CLI:-temporal}"
+TEMPORAL_ADDRESS="${TEMPORAL_ADDRESS:-localhost:7233}"
+CLAUDE_TEMPORAL_NAMESPACE="${CLAUDE_TEMPORAL_NAMESPACE:-default}"
+
+usage() {
+  cat <<'USAGE'
+Usage: analyze-workflow-error.sh --workflow-id <workflow-id> [options]
+
+Parse workflow history to extract error details and provide recommendations.
+
+Options:
+  --workflow-id   workflow ID to analyze, required
+  --run-id        specific workflow run ID (optional)
+  -h, --help      show this help
+USAGE
+}
+
+workflow_id=""
+run_id=""
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --workflow-id) workflow_id="${2-}"; shift 2 ;;
+    --run-id) run_id="${2-}"; shift 2 ;;
+    -h|--help) usage; exit 0 ;;
+    *) echo "Unknown option: $1" >&2; usage; exit 1 ;;
+  esac
+done
+
+if [[ -z "$workflow_id" ]]; then
+  echo "workflow-id is required" >&2
+  usage
+  exit 1
+fi
+
+if ! command -v "$TEMPORAL_CLI" >/dev/null 2>&1; then
+  echo "Temporal CLI not found: $TEMPORAL_CLI" >&2
+  exit 1
+fi
+
+# Build temporal command
+DESCRIBE_CMD=("$TEMPORAL_CLI" "workflow" "describe" "--workflow-id" "$workflow_id" "--address" "$TEMPORAL_ADDRESS" "--namespace" "$CLAUDE_TEMPORAL_NAMESPACE")
+SHOW_CMD=("$TEMPORAL_CLI" "workflow" "show" "--workflow-id" "$workflow_id" "--address" "$TEMPORAL_ADDRESS" "--namespace" "$CLAUDE_TEMPORAL_NAMESPACE")
+
+if [[ -n "$run_id" ]]; then
+  DESCRIBE_CMD+=("--run-id" "$run_id")
+  SHOW_CMD+=("--run-id" "$run_id")
+fi
+
+echo "=== Workflow Error Analysis ==="
+echo "Workflow ID: $workflow_id"
+if [[ -n "$run_id" ]]; then
+  echo "Run ID: $run_id"
+fi
+echo ""
+
+# Get workflow description
+if ! describe_output=$("${DESCRIBE_CMD[@]}" 2>&1); then
+  echo "❌ Failed to describe workflow" >&2
+  echo "$describe_output" >&2
+  exit 1
+fi
+
+# Get workflow history
+if ! show_output=$("${SHOW_CMD[@]}" 2>&1); then
+  echo "❌ Failed to get workflow history" >&2
+  echo "$show_output" >&2
+  exit 1
+fi
+
+# Extract status
+status=$(echo "$describe_output" | grep -E "^\s*Status:" | awk '{print $2}' | tr -d ' ' || echo "UNKNOWN")
+echo "Current Status: $status"
+echo ""
+
+# Analyze different error types
+# Note: grep -c already prints "0" when there is no match (it only exits
+# non-zero), so fall back with '|| true'; '|| echo "0"' would produce a
+# doubled "0" and break the numeric comparisons below.
+workflow_task_failures=$(echo "$show_output" | grep -c "WorkflowTaskFailed" || true)
+activity_task_failures=$(echo "$show_output" | grep -c "ActivityTaskFailed" || true)
+workflow_exec_failed=$(echo "$show_output" | grep -c "WorkflowExecutionFailed" || true)
+
+# Report findings
+if [[ "$workflow_task_failures" -gt 0 ]]; then
+  echo "=== WorkflowTaskFailed Detected ==="
+  echo "Attempts: $workflow_task_failures"
+  echo ""
+
+  # Extract error details
+  echo "Error Details:"
+  echo "$show_output" | grep -A 10 "WorkflowTaskFailed" | head -n 15
+  echo ""
+
+  echo "=== Diagnosis ==="
+  echo "Error Type: WorkflowTaskFailed"
+  echo ""
+  echo "Common Causes:"
+  echo "  1. Workflow type not registered with worker"
+  echo "  2. Worker missing workflow definition"
+  echo "  3. Workflow code has syntax errors"
+  echo "  4. Worker not running or not polling correct task queue"
+  echo ""
+  echo "=== Recommended Actions ==="
+  echo "1. 
Check if worker is running:" + echo " $SCRIPT_DIR/list-workers.sh" + echo "" + echo "2. Verify workflow is registered in worker.py:" + echo " - Check workflows=[YourWorkflow] in Worker() constructor" + echo "" + echo "3. Restart worker with updated code:" + echo " $SCRIPT_DIR/ensure-worker.sh" + echo "" + echo "4. Check worker logs for errors:" + echo " tail -f \$CLAUDE_TEMPORAL_LOG_DIR/worker-\$(basename \"\$(pwd)\").log" + +elif [[ "$activity_task_failures" -gt 0 ]]; then + echo "=== ActivityTaskFailed Detected ===" + echo "Attempts: $activity_task_failures" + echo "" + + # Extract error details + echo "Error Details:" + echo "$show_output" | grep -A 10 "ActivityTaskFailed" | head -n 20 + echo "" + + echo "=== Diagnosis ===" + echo "Error Type: ActivityTaskFailed" + echo "" + echo "Common Causes:" + echo " 1. Activity code threw an exception" + echo " 2. Activity type not registered with worker" + echo " 3. Activity code has bugs" + echo " 4. External dependency failure (API, database, etc.)" + echo "" + echo "=== Recommended Actions ===" + echo "1. Check activity logs for stack traces:" + echo " tail -f \$CLAUDE_TEMPORAL_LOG_DIR/worker-\$(basename \"\$(pwd)\").log" + echo "" + echo "2. Verify activity is registered in worker.py:" + echo " - Check activities=[your_activity] in Worker() constructor" + echo "" + echo "3. Review activity code for errors" + echo "" + echo "4. If activity code is fixed, restart worker:" + echo " $SCRIPT_DIR/ensure-worker.sh" + echo "" + echo "5. Consider adjusting retry policy if transient failure" + +elif [[ "$workflow_exec_failed" -gt 0 ]]; then + echo "=== WorkflowExecutionFailed Detected ===" + echo "" + + # Extract error details + echo "Error Details:" + echo "$show_output" | grep -A 20 "WorkflowExecutionFailed" | head -n 25 + echo "" + + echo "=== Diagnosis ===" + echo "Error Type: WorkflowExecutionFailed" + echo "" + echo "Common Causes:" + echo " 1. Workflow business logic error" + echo " 2. Unhandled exception in workflow code" + echo " 3. Workflow determinism violation" + echo "" + echo "=== Recommended Actions ===" + echo "1. Review workflow code for logic errors" + echo "" + echo "2. Check for non-deterministic code:" + echo " - Random number generation" + echo " - System time calls" + echo " - Threading/concurrency" + echo "" + echo "3. Review full workflow history:" + echo " temporal workflow show --workflow-id $workflow_id" + echo "" + echo "4. After fixing code, restart worker:" + echo " $SCRIPT_DIR/ensure-worker.sh" + +elif [[ "$status" == "TIMED_OUT" ]]; then + echo "=== Workflow Timeout ===" + echo "" + echo "The workflow exceeded its timeout limit." + echo "" + echo "=== Recommended Actions ===" + echo "1. Review workflow timeout settings in starter code" + echo "" + echo "2. Check if activities are taking too long:" + echo " - Review activity timeout settings" + echo " - Check activity logs for performance issues" + echo "" + echo "3. Consider increasing timeouts if operations legitimately take longer" + +elif [[ "$status" == "RUNNING" ]]; then + echo "=== Workflow Still Running ===" + echo "" + echo "The workflow appears to be running normally." + echo "" + echo "To monitor progress:" + echo " temporal workflow show --workflow-id $workflow_id" + echo "" + echo "To wait for completion:" + echo " $SCRIPT_DIR/wait-for-workflow-status.sh --workflow-id $workflow_id --status COMPLETED" + +elif [[ "$status" == "COMPLETED" ]]; then + echo "=== Workflow Completed Successfully ===" + echo "" + echo "No errors detected. 
Workflow completed normally." + +else + echo "=== Status: $status ===" + echo "" + echo "Review full workflow details:" + echo " temporal workflow describe --workflow-id $workflow_id" + echo " temporal workflow show --workflow-id $workflow_id" +fi + +echo "" +echo "=== Additional Resources ===" +echo "Web UI: http://localhost:8233/namespaces/$CLAUDE_TEMPORAL_NAMESPACE/workflows/$workflow_id" diff --git a/scripts/bulk-cancel-workflows.sh b/scripts/bulk-cancel-workflows.sh new file mode 100755 index 0000000..c6fa27e --- /dev/null +++ b/scripts/bulk-cancel-workflows.sh @@ -0,0 +1,145 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Environment variables with defaults +TEMPORAL_CLI="${TEMPORAL_CLI:-temporal}" +TEMPORAL_ADDRESS="${TEMPORAL_ADDRESS:-localhost:7233}" +CLAUDE_TEMPORAL_NAMESPACE="${CLAUDE_TEMPORAL_NAMESPACE:-default}" + +usage() { + cat <<'USAGE' +Usage: bulk-cancel-workflows.sh [options] + +Cancel multiple workflows. + +Options: + --workflow-ids file containing workflow IDs (one per line), required unless --pattern + --pattern cancel workflows matching pattern (regex) + --reason cancellation reason (default: "Bulk cancellation") + -h, --help show this help + +Examples: + # Cancel workflows from file + ./bulk-cancel-workflows.sh --workflow-ids stalled.txt + + # Cancel workflows matching pattern + ./bulk-cancel-workflows.sh --pattern "test-.*" + + # Cancel with custom reason + ./bulk-cancel-workflows.sh --workflow-ids stalled.txt --reason "Cleaning up test workflows" +USAGE +} + +workflow_ids_file="" +pattern="" +reason="Bulk cancellation" + +while [[ $# -gt 0 ]]; do + case "$1" in + --workflow-ids) workflow_ids_file="${2-}"; shift 2 ;; + --pattern) pattern="${2-}"; shift 2 ;; + --reason) reason="${2-}"; shift 2 ;; + -h|--help) usage; exit 0 ;; + *) echo "Unknown option: $1" >&2; usage; exit 1 ;; + esac +done + +if [[ -z "$workflow_ids_file" && -z "$pattern" ]]; then + echo "Either --workflow-ids or --pattern is required" >&2 + usage + exit 1 +fi + +if ! command -v "$TEMPORAL_CLI" >/dev/null 2>&1; then + echo "Temporal CLI not found: $TEMPORAL_CLI" >&2 + exit 1 +fi + +# Collect workflow IDs +workflow_ids=() + +if [[ -n "$workflow_ids_file" ]]; then + if [[ ! -f "$workflow_ids_file" ]]; then + echo "File not found: $workflow_ids_file" >&2 + exit 1 + fi + + # Read workflow IDs from file + while IFS= read -r line; do + # Skip empty lines and comments + [[ -z "$line" || "$line" =~ ^[[:space:]]*# ]] && continue + # Trim whitespace + line=$(echo "$line" | xargs) + workflow_ids+=("$line") + done < "$workflow_ids_file" +fi + +if [[ -n "$pattern" ]]; then + echo "Finding workflows matching pattern: $pattern" + + # List workflows and filter by pattern + LIST_CMD=("$TEMPORAL_CLI" "workflow" "list" "--address" "$TEMPORAL_ADDRESS" "--namespace" "$CLAUDE_TEMPORAL_NAMESPACE") + + if workflow_list=$("${LIST_CMD[@]}" 2>&1); then + # Parse workflow IDs from list and filter by pattern + while IFS= read -r wf_id; do + [[ -z "$wf_id" ]] && continue + if echo "$wf_id" | grep -E "$pattern" >/dev/null 2>&1; then + workflow_ids+=("$wf_id") + fi + done < <(echo "$workflow_list" | awk 'NR>1 && $1 != "" {print $1}' | grep -v "^-") + else + echo "Failed to list workflows" >&2 + echo "$workflow_list" >&2 + exit 1 + fi +fi + +# Check if we have any workflow IDs +if [[ "${#workflow_ids[@]}" -eq 0 ]]; then + echo "No workflows to cancel" + exit 0 +fi + +echo "Found ${#workflow_ids[@]} workflow(s) to cancel" +echo "Reason: $reason" +echo "" + +# Confirm with user +read -p "Continue with cancellation? 
(y/N) " -n 1 -r +echo +if [[ ! $REPLY =~ ^[Yy]$ ]]; then + echo "Cancellation aborted" + exit 0 +fi + +echo "" +echo "Canceling workflows..." +echo "" + +success_count=0 +failed_count=0 + +# Cancel each workflow +for workflow_id in "${workflow_ids[@]}"; do + echo -n "Canceling: $workflow_id ... " + + if "$TEMPORAL_CLI" workflow cancel \ + --workflow-id "$workflow_id" \ + --address "$TEMPORAL_ADDRESS" \ + --namespace "$CLAUDE_TEMPORAL_NAMESPACE" \ + --reason "$reason" \ + >/dev/null 2>&1; then + echo "✓" + success_count=$((success_count + 1)) + else + echo "❌ (may already be canceled or not exist)" + failed_count=$((failed_count + 1)) + fi +done + +echo "" +echo "=== Summary ===" +echo "Successfully canceled: $success_count" +echo "Failed: $failed_count" +echo "Total: ${#workflow_ids[@]}" diff --git a/scripts/ensure-server.sh b/scripts/ensure-server.sh new file mode 100755 index 0000000..1627909 --- /dev/null +++ b/scripts/ensure-server.sh @@ -0,0 +1,98 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Environment variables with defaults +CLAUDE_TEMPORAL_PID_DIR="${CLAUDE_TEMPORAL_PID_DIR:-${TMPDIR:-/tmp}/claude-temporal-pids}" +CLAUDE_TEMPORAL_LOG_DIR="${CLAUDE_TEMPORAL_LOG_DIR:-${TMPDIR:-/tmp}/claude-temporal-logs}" +TEMPORAL_CLI="${TEMPORAL_CLI:-temporal}" +TEMPORAL_ADDRESS="${TEMPORAL_ADDRESS:-localhost:7233}" + +# Create directories if they don't exist +mkdir -p "$CLAUDE_TEMPORAL_PID_DIR" +mkdir -p "$CLAUDE_TEMPORAL_LOG_DIR" + +PID_FILE="$CLAUDE_TEMPORAL_PID_DIR/server.pid" +LOG_FILE="$CLAUDE_TEMPORAL_LOG_DIR/server.log" + +# Check if temporal CLI is installed +if ! command -v "$TEMPORAL_CLI" >/dev/null 2>&1; then + echo "❌ Temporal CLI not found: $TEMPORAL_CLI" >&2 + echo "Install temporal CLI:" >&2 + echo " macOS: brew install temporal" >&2 + echo " Linux: https://github.com/temporalio/cli/releases" >&2 + exit 1 +fi + +# Function to check if server is responding +check_server_connectivity() { + # Try to list namespaces as a connectivity test + if "$TEMPORAL_CLI" operator namespace list --address "$TEMPORAL_ADDRESS" >/dev/null 2>&1; then + return 0 + fi + return 1 +} + +# Check if server is already running +if check_server_connectivity; then + echo "✓ Temporal server already running at $TEMPORAL_ADDRESS" + exit 0 +fi + +# Check if we have a PID file from previous run +if [[ -f "$PID_FILE" ]]; then + OLD_PID=$(cat "$PID_FILE") + if kill -0 "$OLD_PID" 2>/dev/null; then + # Process exists but not responding - might be starting up + echo "⏳ Server process exists (PID: $OLD_PID), checking connectivity..." + sleep 2 + if check_server_connectivity; then + echo "✓ Temporal server ready at $TEMPORAL_ADDRESS" + exit 0 + fi + echo "⚠️ Server process exists but not responding, killing and restarting..." + kill -9 "$OLD_PID" 2>/dev/null || true + fi + rm -f "$PID_FILE" +fi + +# Start server in background +echo "🚀 Starting Temporal dev server..." +"$TEMPORAL_CLI" server start-dev > "$LOG_FILE" 2>&1 & +SERVER_PID=$! +echo "$SERVER_PID" > "$PID_FILE" + +echo "⏳ Waiting for server to be ready..." + +# Wait up to 30 seconds for server to become ready +TIMEOUT=30 +ELAPSED=0 +INTERVAL=1 + +while (( ELAPSED < TIMEOUT )); do + if check_server_connectivity; then + echo "✓ Temporal server ready at $TEMPORAL_ADDRESS (PID: $SERVER_PID)" + echo "" + echo "Web UI: http://localhost:8233" + echo "gRPC: $TEMPORAL_ADDRESS" + echo "" + echo "Server logs: $LOG_FILE" + echo "Server PID file: $PID_FILE" + exit 0 + fi + + # Check if process died + if ! 
kill -0 "$SERVER_PID" 2>/dev/null; then + echo "❌ Server process died during startup" >&2 + echo "Check logs: $LOG_FILE" >&2 + rm -f "$PID_FILE" + exit 2 + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) +done + +echo "❌ Server startup timeout after ${TIMEOUT}s" >&2 +echo "Server might still be starting. Check logs: $LOG_FILE" >&2 +echo "Server PID: $SERVER_PID" >&2 +exit 2 diff --git a/scripts/ensure-worker.sh b/scripts/ensure-worker.sh new file mode 100755 index 0000000..5b9265b --- /dev/null +++ b/scripts/ensure-worker.sh @@ -0,0 +1,91 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Get the directory where this script is located +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +# Environment variables with defaults +CLAUDE_TEMPORAL_PID_DIR="${CLAUDE_TEMPORAL_PID_DIR:-${TMPDIR:-/tmp}/claude-temporal-pids}" +CLAUDE_TEMPORAL_LOG_DIR="${CLAUDE_TEMPORAL_LOG_DIR:-${TMPDIR:-/tmp}/claude-temporal-logs}" +CLAUDE_TEMPORAL_PROJECT_DIR="${CLAUDE_TEMPORAL_PROJECT_DIR:-$(pwd)}" +CLAUDE_TEMPORAL_PROJECT_NAME="${CLAUDE_TEMPORAL_PROJECT_NAME:-$(basename "$CLAUDE_TEMPORAL_PROJECT_DIR")}" +TEMPORAL_WORKER_CMD="${TEMPORAL_WORKER_CMD:-uv run worker}" + +# Create directories if they don't exist +mkdir -p "$CLAUDE_TEMPORAL_PID_DIR" +mkdir -p "$CLAUDE_TEMPORAL_LOG_DIR" + +PID_FILE="$CLAUDE_TEMPORAL_PID_DIR/worker-$CLAUDE_TEMPORAL_PROJECT_NAME.pid" +LOG_FILE="$CLAUDE_TEMPORAL_LOG_DIR/worker-$CLAUDE_TEMPORAL_PROJECT_NAME.log" + +# Always kill any existing workers (both tracked and orphaned) +# This ensures we don't accumulate orphaned processes +echo "🔍 Checking for existing workers..." + +# Use the helper function to find all workers +source "$SCRIPT_DIR/find-project-workers.sh" +existing_workers=$(find_project_workers "$CLAUDE_TEMPORAL_PROJECT_DIR" 2>/dev/null || true) + +if [[ -n "$existing_workers" ]]; then + worker_count=$(echo "$existing_workers" | wc -l | tr -d ' ') + echo "Found $worker_count existing worker(s), stopping them..." + + if "$SCRIPT_DIR/kill-worker.sh" 2>&1; then + echo "✓ Existing workers stopped" + else + # kill-worker.sh will have printed error messages + echo "⚠️ Some workers may not have been stopped, continuing anyway..." + fi +elif [[ -f "$PID_FILE" ]]; then + # PID file exists but no workers found - clean up stale PID file + echo "Removing stale PID file..." + rm -f "$PID_FILE" +fi + +# Clear old log file +> "$LOG_FILE" + +# Start worker in background +echo "🚀 Starting worker for project: $CLAUDE_TEMPORAL_PROJECT_NAME" +echo "Command: $TEMPORAL_WORKER_CMD" + +# Start worker, redirect output to log file +eval "$TEMPORAL_WORKER_CMD" > "$LOG_FILE" 2>&1 & +WORKER_PID=$! + +# Save PID +echo "$WORKER_PID" > "$PID_FILE" + +echo "Worker PID: $WORKER_PID" +echo "Log file: $LOG_FILE" + +# Wait for worker to be ready (simple approach: wait and check if still running) +echo "⏳ Waiting for worker to be ready..." + +# Wait 10 seconds for worker to initialize +sleep 10 + +# Check if process is still running +if ! 
kill -0 "$WORKER_PID" 2>/dev/null; then + echo "❌ Worker process died during startup" >&2 + echo "Last 20 lines of log:" >&2 + tail -n 20 "$LOG_FILE" >&2 || true + rm -f "$PID_FILE" + exit 1 +fi + +# Check if log file has content (worker is producing output) +if [[ -f "$LOG_FILE" ]] && [[ -s "$LOG_FILE" ]]; then + echo "✓ Worker ready (PID: $WORKER_PID)" + echo "" + echo "To monitor worker logs:" + echo " tail -f $LOG_FILE" + echo "" + echo "To check worker health:" + echo " $SCRIPT_DIR/monitor-worker-health.sh" + exit 0 +else + echo "⚠️ Worker is running but no logs detected" >&2 + echo "Check logs: $LOG_FILE" >&2 + exit 2 +fi diff --git a/scripts/find-project-workers.sh b/scripts/find-project-workers.sh new file mode 100755 index 0000000..8b8bf7b --- /dev/null +++ b/scripts/find-project-workers.sh @@ -0,0 +1,73 @@ +#!/usr/bin/env bash +# Helper function to find all worker processes for a specific project +# This can be sourced by other scripts or run directly + +# Usage: find_project_workers PROJECT_DIR +# Returns: PIDs of all worker processes for the project (one per line) +find_project_workers() { + local project_dir="$1" + + # Normalize the project directory path (resolve symlinks, remove trailing slash) + project_dir="$(cd "$project_dir" 2>/dev/null && pwd)" || { + echo "Error: Invalid project directory: $project_dir" >&2 + return 1 + } + + # Find all processes where: + # 1. Command contains the project directory path + # 2. Command contains "worker" (either .venv/bin/worker or "uv run worker") + # We need to be specific to avoid killing unrelated processes + + # Strategy: Find both parent "uv run worker" processes and child Python worker processes + # We'll use the project directory in the path as the key identifier + + local pids=() + + # Use ps to get all processes with their commands + if [[ "$(uname)" == "Darwin" ]]; then + # macOS - find Python workers + while IFS= read -r pid; do + [[ -n "$pid" ]] && pids+=("$pid") + done < <(ps ax -o pid,command | grep -E "\.venv/bin/(python[0-9.]*|worker)" | grep -E "${project_dir}" | grep -v grep | awk '{print $1}') + + # Also find "uv run worker" processes in this directory + while IFS= read -r pid; do + [[ -n "$pid" ]] && pids+=("$pid") + done < <(ps ax -o pid,command | grep "uv run worker" | grep -v grep | awk -v dir="$project_dir" '{ + # Check if process is running from the project directory by checking cwd + cmd = "lsof -a -p " $1 " -d cwd -Fn 2>/dev/null | grep ^n | cut -c2-" + cmd | getline cwd + close(cmd) + if (index(cwd, dir) > 0) print $1 + }') + else + # Linux - find Python workers + while IFS= read -r pid; do + [[ -n "$pid" ]] && pids+=("$pid") + done < <(ps ax -o pid,cmd | grep -E "\.venv/bin/(python[0-9.]*|worker)" | grep -E "${project_dir}" | grep -v grep | awk '{print $1}') + + # Also find "uv run worker" processes in this directory + while IFS= read -r pid; do + [[ -n "$pid" ]] && pids+=("$pid") + done < <(ps ax -o pid,cmd | grep "uv run worker" | grep -v grep | awk -v dir="$project_dir" '{ + # Check if process is running from the project directory + cmd = "readlink -f /proc/" $1 "/cwd 2>/dev/null" + cmd | getline cwd + close(cmd) + if (index(cwd, dir) > 0) print $1 + }') + fi + + # Print unique PIDs + printf "%s\n" "${pids[@]}" | sort -u +} + +# If script is executed directly (not sourced), run the function +if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then + if [[ $# -eq 0 ]]; then + # Default to current directory + find_project_workers "$(pwd)" + else + find_project_workers "$1" + fi +fi diff --git 
a/scripts/find-stalled-workflows.sh b/scripts/find-stalled-workflows.sh new file mode 100755 index 0000000..be57cd5 --- /dev/null +++ b/scripts/find-stalled-workflows.sh @@ -0,0 +1,117 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Environment variables with defaults +TEMPORAL_CLI="${TEMPORAL_CLI:-temporal}" +TEMPORAL_ADDRESS="${TEMPORAL_ADDRESS:-localhost:7233}" +CLAUDE_TEMPORAL_NAMESPACE="${CLAUDE_TEMPORAL_NAMESPACE:-default}" + +usage() { + cat <<'USAGE' +Usage: find-stalled-workflows.sh [options] + +Detect workflows with systematic issues (e.g., workflow task failures). + +Options: + --query filter workflows by query (optional) + -h, --help show this help +USAGE +} + +query="" + +while [[ $# -gt 0 ]]; do + case "$1" in + --query) query="${2-}"; shift 2 ;; + -h|--help) usage; exit 0 ;; + *) echo "Unknown option: $1" >&2; usage; exit 1 ;; + esac +done + +if ! command -v "$TEMPORAL_CLI" >/dev/null 2>&1; then + echo "Temporal CLI not found: $TEMPORAL_CLI" >&2 + exit 1 +fi + +if ! command -v jq >/dev/null 2>&1; then + echo "⚠️ jq not found. Install jq for better output formatting." >&2 + echo "This script will continue with basic text parsing..." >&2 +fi + +# Build list command - only look for RUNNING workflows since stalled workflows must be running +LIST_CMD=("$TEMPORAL_CLI" "workflow" "list" "--address" "$TEMPORAL_ADDRESS" "--namespace" "$CLAUDE_TEMPORAL_NAMESPACE") + +if [[ -n "$query" ]]; then + # Append user query to running filter + LIST_CMD+=("--query" "ExecutionStatus='Running' AND ($query)") +else + # Default: only find running workflows + LIST_CMD+=("--query" "ExecutionStatus='Running'") +fi + +echo "Scanning for stalled workflows..." +echo "" + +# Get list of running workflows +if ! workflow_list=$("${LIST_CMD[@]}" 2>&1); then + echo "Failed to list workflows" >&2 + echo "$workflow_list" >&2 + exit 1 +fi + +# Parse workflow IDs from list +# The output format is: Status WorkflowId Type StartTime +# WorkflowId is in column 2 +workflow_ids=$(echo "$workflow_list" | awk 'NR>1 {print $2}' | grep -v "^-" | grep -v "^$" || true) + +if [[ -z "$workflow_ids" ]]; then + echo "No workflows found" + exit 0 +fi + +# Print header +printf "%-40s %-35s %-10s\n" "WORKFLOW_ID" "ERROR_TYPE" "ATTEMPTS" +printf "%-40s %-35s %-10s\n" "----------------------------------------" "-----------------------------------" "----------" + +found_stalled=false + +# Check each workflow for errors +while IFS= read -r workflow_id; do + [[ -z "$workflow_id" ]] && continue + + # Get workflow event history using 'show' to see failure events + if show_output=$("$TEMPORAL_CLI" workflow show --workflow-id "$workflow_id" --address "$TEMPORAL_ADDRESS" --namespace "$CLAUDE_TEMPORAL_NAMESPACE" 2>/dev/null); then + + # Check for workflow task failures + workflow_task_failures=$(echo "$show_output" | grep -c "WorkflowTaskFailed" 2>/dev/null || echo "0") + workflow_task_failures=$(echo "$workflow_task_failures" | tr -d '\n' | tr -d ' ') + activity_task_failures=$(echo "$show_output" | grep -c "ActivityTaskFailed" 2>/dev/null || echo "0") + activity_task_failures=$(echo "$activity_task_failures" | tr -d '\n' | tr -d ' ') + + # Report if significant failures found + if [[ "$workflow_task_failures" -gt 0 ]]; then + found_stalled=true + # Truncate long workflow IDs for display + display_id=$(echo "$workflow_id" | cut -c1-40) + printf "%-40s %-35s %-10s\n" "$display_id" "WorkflowTaskFailed" "$workflow_task_failures" + elif [[ "$activity_task_failures" -gt 2 ]]; then + # Only report activity failures if they're excessive (>2) + 
found_stalled=true
+      display_id=$(echo "$workflow_id" | cut -c1-40)
+      printf "%-40s %-35s %-10s\n" "$display_id" "ActivityTaskFailed" "$activity_task_failures"
+    fi
+  fi
+done <<< "$workflow_ids"

+echo ""
+
+if [[ "$found_stalled" == false ]]; then
+  echo "No stalled workflows detected"
+else
+  echo "Found stalled workflows. To investigate:"
+  echo "  ./scripts/analyze-workflow-error.sh --workflow-id <workflow-id>"
+  echo ""
+  echo "To cancel all stalled workflows:"
+  echo "  ./scripts/find-stalled-workflows.sh | awk 'NR>2 {print \$1}' > stalled.txt"
+  echo "  ./scripts/bulk-cancel-workflows.sh --workflow-ids stalled.txt"
+fi
diff --git a/scripts/get-workflow-result.sh b/scripts/get-workflow-result.sh
new file mode 100755
index 0000000..05783a4
--- /dev/null
+++ b/scripts/get-workflow-result.sh
@@ -0,0 +1,202 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Environment variables with defaults
+TEMPORAL_CLI="${TEMPORAL_CLI:-temporal}"
+TEMPORAL_ADDRESS="${TEMPORAL_ADDRESS:-localhost:7233}"
+CLAUDE_TEMPORAL_NAMESPACE="${CLAUDE_TEMPORAL_NAMESPACE:-default}"
+
+usage() {
+  cat <<'USAGE'
+Usage: get-workflow-result.sh --workflow-id <workflow-id> [options]
+
+Get the result/output from a completed workflow execution.
+
+Options:
+  --workflow-id   Workflow ID to query (required)
+  --run-id        Specific run ID (optional)
+  --raw           Output raw JSON result only
+  -h, --help      Show this help
+
+Examples:
+  # Get workflow result with formatted output
+  ./scripts/get-workflow-result.sh --workflow-id my-workflow-123
+
+  # Get raw JSON result only
+  ./scripts/get-workflow-result.sh --workflow-id my-workflow-123 --raw
+
+  # Get result for specific run
+  ./scripts/get-workflow-result.sh --workflow-id my-workflow-123 --run-id abc-def-ghi
+
+Output:
+  - Workflow status (COMPLETED, FAILED, etc.)
+  - Workflow result/output (if completed successfully)
+  - Failure message (if failed)
+  - Termination reason (if terminated)
+USAGE
+}
+
+workflow_id=""
+run_id=""
+raw_mode=false
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --workflow-id) workflow_id="${2-}"; shift 2 ;;
+    --run-id) run_id="${2-}"; shift 2 ;;
+    --raw) raw_mode=true; shift ;;
+    -h|--help) usage; exit 0 ;;
+    *) echo "Unknown option: $1" >&2; usage; exit 1 ;;
+  esac
+done
+
+if [[ -z "$workflow_id" ]]; then
+  echo "Error: --workflow-id is required" >&2
+  usage
+  exit 1
+fi
+
+if ! command -v "$TEMPORAL_CLI" >/dev/null 2>&1; then
+  echo "Temporal CLI not found: $TEMPORAL_CLI" >&2
+  exit 1
+fi
+
+# Build describe command
+DESCRIBE_CMD=("$TEMPORAL_CLI" "workflow" "describe" "--workflow-id" "$workflow_id" "--address" "$TEMPORAL_ADDRESS" "--namespace" "$CLAUDE_TEMPORAL_NAMESPACE")
+
+if [[ -n "$run_id" ]]; then
+  DESCRIBE_CMD+=("--run-id" "$run_id")
+fi
+
+# Get workflow details
+if ! describe_output=$("${DESCRIBE_CMD[@]}" 2>&1); then
+  echo "Failed to describe workflow: $workflow_id" >&2
+  echo "$describe_output" >&2
+  exit 1
+fi
+
+# Extract workflow status
+status=$(echo "$describe_output" | grep -i "Status:" | head -n1 | awk '{print $2}' || echo "UNKNOWN")
+
+if [[ "$raw_mode" == true ]]; then
+  # Raw mode: just output the result payload
+  # Use 'temporal workflow show' to get execution history with result
+  if ! 
show_output=$("$TEMPORAL_CLI" workflow show --workflow-id "$workflow_id" --address "$TEMPORAL_ADDRESS" --namespace "$CLAUDE_TEMPORAL_NAMESPACE" 2>&1); then + echo "Failed to get workflow result" >&2 + exit 1 + fi + + # Extract result from WorkflowExecutionCompleted event + echo "$show_output" | grep -A 10 "WorkflowExecutionCompleted" | grep -E "result|Result" || echo "{}" + exit 0 +fi + +# Formatted output +echo "════════════════════════════════════════════════════════════" +echo "Workflow: $workflow_id" +echo "Status: $status" +echo "════════════════════════════════════════════════════════════" +echo "" + +case "$status" in + COMPLETED) + echo "✅ Workflow completed successfully" + echo "" + echo "Result:" + echo "────────────────────────────────────────────────────────────" + + # Get workflow result using 'show' command + if show_output=$("$TEMPORAL_CLI" workflow show --workflow-id "$workflow_id" --address "$TEMPORAL_ADDRESS" --namespace "$CLAUDE_TEMPORAL_NAMESPACE" 2>/dev/null); then + # Extract result from WorkflowExecutionCompleted event + result=$(echo "$show_output" | grep -A 20 "WorkflowExecutionCompleted" | grep -E "result|Result" || echo "") + + if [[ -n "$result" ]]; then + echo "$result" + else + echo "(No result payload - workflow may return None/void)" + fi + else + echo "(Unable to extract result)" + fi + ;; + + FAILED) + echo "❌ Workflow failed" + echo "" + echo "Failure details:" + echo "────────────────────────────────────────────────────────────" + + # Extract failure message + failure=$(echo "$describe_output" | grep -A 5 "Failure:" || echo "") + if [[ -n "$failure" ]]; then + echo "$failure" + else + echo "(No failure details available)" + fi + + echo "" + echo "To analyze error:" + echo " ./tools/analyze-workflow-error.sh --workflow-id $workflow_id" + ;; + + CANCELED) + echo "🚫 Workflow was canceled" + echo "" + + # Try to extract cancellation reason + cancel_info=$(echo "$describe_output" | grep -i "cancel" || echo "") + if [[ -n "$cancel_info" ]]; then + echo "Cancellation info:" + echo "$cancel_info" + fi + ;; + + TERMINATED) + echo "⛔ Workflow was terminated" + echo "" + + # Extract termination reason + term_reason=$(echo "$describe_output" | grep -i "reason:" | head -n1 || echo "") + if [[ -n "$term_reason" ]]; then + echo "Termination reason:" + echo "$term_reason" + fi + ;; + + TIMED_OUT) + echo "⏱️ Workflow timed out" + echo "" + + timeout_info=$(echo "$describe_output" | grep -i "timeout" || echo "") + if [[ -n "$timeout_info" ]]; then + echo "Timeout info:" + echo "$timeout_info" + fi + ;; + + RUNNING) + echo "🏃 Workflow is still running" + echo "" + echo "Cannot get result for running workflow." 
+ echo "" + echo "To wait for completion:" + echo " ./tools/wait-for-workflow-status.sh --workflow-id $workflow_id --status COMPLETED" + exit 1 + ;; + + *) + echo "Status: $status" + echo "" + echo "Full workflow details:" + echo "$describe_output" + ;; +esac + +echo "" +echo "════════════════════════════════════════════════════════════" +echo "" +echo "To view full workflow history:" +echo " temporal workflow show --workflow-id $workflow_id" +echo "" +echo "To view in Web UI:" +echo " http://localhost:8233/namespaces/$CLAUDE_TEMPORAL_NAMESPACE/workflows/$workflow_id" diff --git a/scripts/kill-all-workers.sh b/scripts/kill-all-workers.sh new file mode 100755 index 0000000..c3a619f --- /dev/null +++ b/scripts/kill-all-workers.sh @@ -0,0 +1,134 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Environment variables with defaults +CLAUDE_TEMPORAL_PID_DIR="${CLAUDE_TEMPORAL_PID_DIR:-${TMPDIR:-/tmp}/claude-temporal-pids}" + +# Graceful shutdown timeout (seconds) +GRACEFUL_TIMEOUT=5 + +usage() { + cat <<'USAGE' +Usage: kill-all-workers.sh [options] + +Kill all tracked workers across all projects. + +Options: + -p, --project kill only specific project worker + --include-server also kill temporal dev server + -h, --help show this help +USAGE +} + +specific_project="" +include_server=false + +while [[ $# -gt 0 ]]; do + case "$1" in + -p|--project) specific_project="${2-}"; shift 2 ;; + --include-server) include_server=true; shift ;; + -h|--help) usage; exit 0 ;; + *) echo "Unknown option: $1" >&2; usage; exit 1 ;; + esac +done + +# Check if PID directory exists +if [[ ! -d "$CLAUDE_TEMPORAL_PID_DIR" ]]; then + echo "No PID directory found: $CLAUDE_TEMPORAL_PID_DIR" + exit 0 +fi + +# Function to kill a process gracefully then forcefully +kill_process() { + local pid=$1 + local name=$2 + + if ! kill -0 "$pid" 2>/dev/null; then + echo "$name (PID $pid): already dead" + return 0 + fi + + # Attempt graceful shutdown + kill -TERM "$pid" 2>/dev/null || true + + # Wait for graceful shutdown + local elapsed=0 + while (( elapsed < GRACEFUL_TIMEOUT )); do + if ! 
kill -0 "$pid" 2>/dev/null; then + echo "$name (PID $pid): stopped gracefully ✓" + return 0 + fi + sleep 1 + elapsed=$((elapsed + 1)) + done + + # Force kill if still running + if kill -0 "$pid" 2>/dev/null; then + kill -9 "$pid" 2>/dev/null || true + sleep 1 + + if kill -0 "$pid" 2>/dev/null; then + echo "$name (PID $pid): failed to kill ❌" >&2 + return 1 + fi + echo "$name (PID $pid): force killed ✓" + fi + + return 0 +} + +killed_count=0 + +# Kill specific project worker if requested +if [[ -n "$specific_project" ]]; then + PID_FILE="$CLAUDE_TEMPORAL_PID_DIR/worker-$specific_project.pid" + if [[ -f "$PID_FILE" ]]; then + WORKER_PID=$(cat "$PID_FILE") + if kill_process "$WORKER_PID" "worker-$specific_project"; then + rm -f "$PID_FILE" + killed_count=$((killed_count + 1)) + fi + else + echo "No worker found for project: $specific_project" + exit 1 + fi +else + # Kill all workers + shopt -s nullglob + PID_FILES=("$CLAUDE_TEMPORAL_PID_DIR"/worker-*.pid) + shopt -u nullglob + + for pid_file in "${PID_FILES[@]}"; do + # Extract project name from filename + filename=$(basename "$pid_file") + project="${filename#worker-}" + project="${project%.pid}" + + # Read PID + worker_pid=$(cat "$pid_file") + + if kill_process "$worker_pid" "worker-$project"; then + rm -f "$pid_file" + killed_count=$((killed_count + 1)) + fi + done +fi + +# Kill server if requested +if [[ "$include_server" == true ]]; then + SERVER_PID_FILE="$CLAUDE_TEMPORAL_PID_DIR/server.pid" + if [[ -f "$SERVER_PID_FILE" ]]; then + SERVER_PID=$(cat "$SERVER_PID_FILE") + if kill_process "$SERVER_PID" "server"; then + rm -f "$SERVER_PID_FILE" + killed_count=$((killed_count + 1)) + fi + fi +fi + +if [[ "$killed_count" -eq 0 ]]; then + echo "No processes to kill" +else + echo "" + echo "Total: $killed_count process(es) killed" +fi diff --git a/scripts/kill-worker.sh b/scripts/kill-worker.sh new file mode 100755 index 0000000..596acdb --- /dev/null +++ b/scripts/kill-worker.sh @@ -0,0 +1,108 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Get the directory where this script is located +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +# Source the helper function to find project workers +source "$SCRIPT_DIR/find-project-workers.sh" + +# Environment variables with defaults +CLAUDE_TEMPORAL_PID_DIR="${CLAUDE_TEMPORAL_PID_DIR:-${TMPDIR:-/tmp}/claude-temporal-pids}" +CLAUDE_TEMPORAL_PROJECT_NAME="${CLAUDE_TEMPORAL_PROJECT_NAME:-$(basename "$(pwd)")}" +CLAUDE_TEMPORAL_PROJECT_DIR="${CLAUDE_TEMPORAL_PROJECT_DIR:-$(pwd)}" + +PID_FILE="$CLAUDE_TEMPORAL_PID_DIR/worker-$CLAUDE_TEMPORAL_PROJECT_NAME.pid" + +# Graceful shutdown timeout (seconds) +GRACEFUL_TIMEOUT=5 + +# Find ALL workers for this project (both tracked and orphaned) +echo "🔍 Finding all workers for project: $CLAUDE_TEMPORAL_PROJECT_NAME" + +# Collect all PIDs +worker_pids=() + +# Add PID from file if it exists +if [[ -f "$PID_FILE" ]]; then + TRACKED_PID=$(cat "$PID_FILE") + if kill -0 "$TRACKED_PID" 2>/dev/null; then + worker_pids+=("$TRACKED_PID") + fi +fi + +# Find all workers for this project using the helper function +while IFS= read -r pid; do + [[ -n "$pid" ]] && worker_pids+=("$pid") +done < <(find_project_workers "$CLAUDE_TEMPORAL_PROJECT_DIR" 2>/dev/null || true) + +# Remove duplicates +worker_pids=($(printf "%s\n" "${worker_pids[@]}" | sort -u)) + +if [[ ${#worker_pids[@]} -eq 0 ]]; then + echo "No workers running for project: $CLAUDE_TEMPORAL_PROJECT_NAME" + rm -f "$PID_FILE" + exit 1 +fi + +echo "Found ${#worker_pids[@]} worker process(es): ${worker_pids[*]}" + +# 
Attempt graceful shutdown of all workers +echo "⏳ Attempting graceful shutdown..." +for pid in "${worker_pids[@]}"; do + kill -TERM "$pid" 2>/dev/null || true +done + +# Wait for graceful shutdown +ELAPSED=0 +while (( ELAPSED < GRACEFUL_TIMEOUT )); do + all_dead=true + for pid in "${worker_pids[@]}"; do + if kill -0 "$pid" 2>/dev/null; then + all_dead=false + break + fi + done + + if [[ "$all_dead" == true ]]; then + echo "✓ All workers stopped gracefully" + rm -f "$PID_FILE" + exit 0 + fi + + sleep 1 + ELAPSED=$((ELAPSED + 1)) +done + +# Force kill any still running +still_running=() +for pid in "${worker_pids[@]}"; do + if kill -0 "$pid" 2>/dev/null; then + still_running+=("$pid") + fi +done + +if [[ ${#still_running[@]} -gt 0 ]]; then + echo "⚠️ ${#still_running[@]} process(es) still running after ${GRACEFUL_TIMEOUT}s, forcing kill..." + for pid in "${still_running[@]}"; do + kill -9 "$pid" 2>/dev/null || true + done + sleep 1 + + # Verify all are dead + failed_pids=() + for pid in "${still_running[@]}"; do + if kill -0 "$pid" 2>/dev/null; then + failed_pids+=("$pid") + fi + done + + if [[ ${#failed_pids[@]} -gt 0 ]]; then + echo "❌ Failed to kill worker process(es): ${failed_pids[*]}" >&2 + exit 1 + fi +fi + +echo "✓ All workers killed (${#worker_pids[@]} process(es))" +rm -f "$PID_FILE" +exit 0 diff --git a/scripts/list-recent-workflows.sh b/scripts/list-recent-workflows.sh new file mode 100755 index 0000000..b20801f --- /dev/null +++ b/scripts/list-recent-workflows.sh @@ -0,0 +1,113 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Environment variables with defaults +TEMPORAL_CLI="${TEMPORAL_CLI:-temporal}" +TEMPORAL_ADDRESS="${TEMPORAL_ADDRESS:-localhost:7233}" +CLAUDE_TEMPORAL_NAMESPACE="${CLAUDE_TEMPORAL_NAMESPACE:-default}" + +usage() { + cat <<'USAGE' +Usage: list-recent-workflows.sh [options] + +List recently completed/terminated workflows within a time window. + +Options: + --minutes Look back N minutes (default: 5) + --status Filter by status: COMPLETED, FAILED, CANCELED, TERMINATED, TIMED_OUT (optional) + --workflow-type Filter by workflow type (optional) + -h, --help Show this help + +Examples: + # List all workflows from last 5 minutes + ./tools/list-recent-workflows.sh + + # List failed workflows from last 10 minutes + ./tools/list-recent-workflows.sh --minutes 10 --status FAILED + + # List completed workflows of specific type from last 2 minutes + ./tools/list-recent-workflows.sh --minutes 2 --status COMPLETED --workflow-type MyWorkflow + +Output format: + WORKFLOW_ID STATUS WORKFLOW_TYPE CLOSE_TIME +USAGE +} + +minutes=5 +status="" +workflow_type="" + +while [[ $# -gt 0 ]]; do + case "$1" in + --minutes) minutes="${2-}"; shift 2 ;; + --status) status="${2-}"; shift 2 ;; + --workflow-type) workflow_type="${2-}"; shift 2 ;; + -h|--help) usage; exit 0 ;; + *) echo "Unknown option: $1" >&2; usage; exit 1 ;; + esac +done + +if ! 
command -v "$TEMPORAL_CLI" >/dev/null 2>&1; then + echo "Temporal CLI not found: $TEMPORAL_CLI" >&2 + exit 1 +fi + +# Validate status if provided +if [[ -n "$status" ]]; then + case "$status" in + COMPLETED|FAILED|CANCELED|TERMINATED|TIMED_OUT) ;; + *) echo "Invalid status: $status" >&2; usage; exit 1 ;; + esac +fi + +# Calculate time threshold (minutes ago) +if [[ "$OSTYPE" == "darwin"* ]]; then + # macOS + time_threshold=$(date -u -v-"${minutes}M" +"%Y-%m-%dT%H:%M:%SZ") +else + # Linux + time_threshold=$(date -u -d "$minutes minutes ago" +"%Y-%m-%dT%H:%M:%SZ") +fi + +# Build query +query="CloseTime > \"$time_threshold\"" + +if [[ -n "$status" ]]; then + query="$query AND ExecutionStatus = \"$status\"" +fi + +if [[ -n "$workflow_type" ]]; then + query="$query AND WorkflowType = \"$workflow_type\"" +fi + +echo "Searching workflows from last $minutes minute(s)..." +echo "Query: $query" +echo "" + +# Execute list command +if ! workflow_list=$("$TEMPORAL_CLI" workflow list \ + --address "$TEMPORAL_ADDRESS" \ + --namespace "$CLAUDE_TEMPORAL_NAMESPACE" \ + --query "$query" 2>&1); then + echo "Failed to list workflows" >&2 + echo "$workflow_list" >&2 + exit 1 +fi + +# Check if any workflows found +if echo "$workflow_list" | grep -q "No workflows found"; then + echo "No workflows found in the last $minutes minute(s)" + exit 0 +fi + +# Parse and display results +echo "$workflow_list" | head -n 50 + +# Count results +workflow_count=$(echo "$workflow_list" | awk 'NR>1 && $1 != "" && $1 !~ /^-+$/ {print $1}' | wc -l | tr -d ' ') + +echo "" +echo "Found $workflow_count workflow(s)" +echo "" +echo "To get workflow result:" +echo " ./scripts/get-workflow-result.sh --workflow-id " diff --git a/scripts/list-workers.sh b/scripts/list-workers.sh new file mode 100755 index 0000000..ea75f4c --- /dev/null +++ b/scripts/list-workers.sh @@ -0,0 +1,221 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Get the directory where this script is located +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +# Source the helper function to find project workers +source "$SCRIPT_DIR/find-project-workers.sh" + +# Environment variables with defaults +CLAUDE_TEMPORAL_PID_DIR="${CLAUDE_TEMPORAL_PID_DIR:-${TMPDIR:-/tmp}/claude-temporal-pids}" + +# Check if PID directory exists +if [[ ! 
-d "$CLAUDE_TEMPORAL_PID_DIR" ]]; then + echo "No PID directory found: $CLAUDE_TEMPORAL_PID_DIR" + exit 0 +fi + +# Function to get process uptime +get_uptime() { + local pid=$1 + if [[ "$(uname)" == "Darwin" ]]; then + # macOS + local start_time=$(ps -o lstart= -p "$pid" 2>/dev/null | xargs -I{} date -j -f "%c" "{}" "+%s" 2>/dev/null || echo "0") + else + # Linux + local start_time=$(ps -o etimes= -p "$pid" 2>/dev/null | tr -d ' ' || echo "0") + fi + + if [[ "$start_time" == "0" ]]; then + echo "-" + return + fi + + local now=$(date +%s) + local elapsed=$((now - start_time)) + + # For Linux, etimes already gives elapsed seconds + if [[ "$(uname)" != "Darwin" ]]; then + elapsed=$start_time + fi + + local hours=$((elapsed / 3600)) + local minutes=$(((elapsed % 3600) / 60)) + local seconds=$((elapsed % 60)) + + if (( hours > 0 )); then + printf "%dh %dm" "$hours" "$minutes" + elif (( minutes > 0 )); then + printf "%dm %ds" "$minutes" "$seconds" + else + printf "%ds" "$seconds" + fi +} + +# Function to get process command +get_command() { + local pid=$1 + ps -o command= -p "$pid" 2>/dev/null | cut -c1-50 || echo "-" +} + +# Print header +printf "%-20s %-8s %-10s %-10s %-10s %s\n" "PROJECT" "PID" "STATUS" "TRACKED" "UPTIME" "COMMAND" +printf "%-20s %-8s %-10s %-10s %-10s %s\n" "--------------------" "--------" "----------" "----------" "----------" "-----" + +# Find all PID files +found_any=false + +# Track all PIDs we've seen to detect orphans later +declare -A tracked_pids + +# List server if exists +SERVER_PID_FILE="$CLAUDE_TEMPORAL_PID_DIR/server.pid" +if [[ -f "$SERVER_PID_FILE" ]]; then + found_any=true + SERVER_PID=$(cat "$SERVER_PID_FILE") + tracked_pids[$SERVER_PID]=1 + if kill -0 "$SERVER_PID" 2>/dev/null; then + uptime=$(get_uptime "$SERVER_PID") + command=$(get_command "$SERVER_PID") + printf "%-20s %-8s %-10s %-10s %-10s %s\n" "server" "$SERVER_PID" "running" "yes" "$uptime" "$command" + else + printf "%-20s %-8s %-10s %-10s %-10s %s\n" "server" "$SERVER_PID" "dead" "yes" "-" "-" + fi +fi + +# List all worker PID files +shopt -s nullglob +PID_FILES=("$CLAUDE_TEMPORAL_PID_DIR"/worker-*.pid) +shopt -u nullglob + +# Store project directories for orphan detection +declare -A project_dirs + +for pid_file in "${PID_FILES[@]}"; do + found_any=true + # Extract project name from filename + filename=$(basename "$pid_file") + project="${filename#worker-}" + project="${project%.pid}" + + # Read PID + worker_pid=$(cat "$pid_file") + tracked_pids[$worker_pid]=1 + + # Check if process is running + if kill -0 "$worker_pid" 2>/dev/null; then + uptime=$(get_uptime "$worker_pid") + command=$(get_command "$worker_pid") + printf "%-20s %-8s %-10s %-10s %-10s %s\n" "$project" "$worker_pid" "running" "yes" "$uptime" "$command" + + # Try to determine project directory from the command + # Look for project directory in the command path + if [[ "$command" =~ ([^[:space:]]+/${project})[/[:space:]] ]]; then + project_dir="${BASH_REMATCH[1]}" + project_dirs[$project]="$project_dir" + fi + else + printf "%-20s %-8s %-10s %-10s %-10s %s\n" "$project" "$worker_pid" "dead" "yes" "-" "-" + fi +done + +# Now check for orphaned workers for each project we know about +for project in "${!project_dirs[@]}"; do + project_dir="${project_dirs[$project]}" + + # Find all workers for this project + while IFS= read -r pid; do + [[ -z "$pid" ]] && continue + + # Skip if we already tracked this PID + if [[ -n "${tracked_pids[$pid]:-}" ]]; then + continue + fi + + # This is an orphaned worker + found_any=true + if kill -0 "$pid" 
2>/dev/null; then + uptime=$(get_uptime "$pid") + command=$(get_command "$pid") + printf "%-20s %-8s %-10s %-10s %-10s %s\n" "$project" "$pid" "running" "ORPHAN" "$uptime" "$command" + tracked_pids[$pid]=1 + fi + done < <(find_project_workers "$project_dir" 2>/dev/null || true) +done + +# Also scan for workers that have no PID file at all (completely orphaned) +# Find all Python worker processes and group by project +if [[ "$(uname)" == "Darwin" ]]; then + # macOS + while IFS= read -r line; do + [[ -z "$line" ]] && continue + + pid=$(echo "$line" | awk '{print $1}') + command=$(echo "$line" | cut -d' ' -f2-) + + # Skip if already tracked + if [[ -n "${tracked_pids[$pid]:-}" ]]; then + continue + fi + + # Extract project name from path + if [[ "$command" =~ /([^/]+)/\.venv/bin/ ]]; then + project="${BASH_REMATCH[1]}" + found_any=true + uptime=$(get_uptime "$pid") + cmd_display=$(echo "$command" | cut -c1-50) + printf "%-20s %-8s %-10s %-10s %-10s %s\n" "$project" "$pid" "running" "ORPHAN" "$uptime" "$cmd_display" + tracked_pids[$pid]=1 + fi + done < <(ps ax -o pid,command | grep -E "\.venv/bin/(python[0-9.]*|worker)" | grep -v grep) + + # Also check for "uv run worker" processes + while IFS= read -r line; do + [[ -z "$line" ]] && continue + + pid=$(echo "$line" | awk '{print $1}') + + # Skip if already tracked + if [[ -n "${tracked_pids[$pid]:-}" ]]; then + continue + fi + + # Get the working directory for this process + cwd=$(lsof -a -p "$pid" -d cwd -Fn 2>/dev/null | grep ^n | cut -c2- || echo "") + if [[ -n "$cwd" ]]; then + project=$(basename "$cwd") + found_any=true + uptime=$(get_uptime "$pid") + printf "%-20s %-8s %-10s %-10s %-10s %s\n" "$project" "$pid" "running" "ORPHAN" "$uptime" "uv run worker" + tracked_pids[$pid]=1 + fi + done < <(ps ax -o pid,command | grep "uv run worker" | grep -v grep | awk '{print $1}') +else + # Linux - similar logic using /proc + while IFS= read -r line; do + [[ -z "$line" ]] && continue + + pid=$(echo "$line" | awk '{print $1}') + command=$(echo "$line" | cut -d' ' -f2-) + + # Skip if already tracked + if [[ -n "${tracked_pids[$pid]:-}" ]]; then + continue + fi + + # Extract project name from path + if [[ "$command" =~ /([^/]+)/\.venv/bin/ ]]; then + project="${BASH_REMATCH[1]}" + found_any=true + uptime=$(get_uptime "$pid") + cmd_display=$(echo "$command" | cut -c1-50) + printf "%-20s %-8s %-10s %-10s %-10s %s\n" "$project" "$pid" "running" "ORPHAN" "$uptime" "$cmd_display" + tracked_pids[$pid]=1 + fi + done < <(ps ax -o pid,cmd | grep -E "\.venv/bin/(python[0-9.]*|worker)" | grep -v grep) +fi + +if [[ "$found_any" == false ]]; then + echo "No workers or server found" +fi diff --git a/scripts/monitor-worker-health.sh b/scripts/monitor-worker-health.sh new file mode 100755 index 0000000..6c15a82 --- /dev/null +++ b/scripts/monitor-worker-health.sh @@ -0,0 +1,111 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Environment variables with defaults +CLAUDE_TEMPORAL_PID_DIR="${CLAUDE_TEMPORAL_PID_DIR:-${TMPDIR:-/tmp}/claude-temporal-pids}" +CLAUDE_TEMPORAL_LOG_DIR="${CLAUDE_TEMPORAL_LOG_DIR:-${TMPDIR:-/tmp}/claude-temporal-logs}" +CLAUDE_TEMPORAL_PROJECT_NAME="${CLAUDE_TEMPORAL_PROJECT_NAME:-$(basename "$(pwd)")}" + +PID_FILE="$CLAUDE_TEMPORAL_PID_DIR/worker-$CLAUDE_TEMPORAL_PROJECT_NAME.pid" +LOG_FILE="$CLAUDE_TEMPORAL_LOG_DIR/worker-$CLAUDE_TEMPORAL_PROJECT_NAME.log" + +# Function to get process uptime +get_uptime() { + local pid=$1 + if [[ "$(uname)" == "Darwin" ]]; then + # macOS + local start_time=$(ps -o lstart= -p "$pid" 2>/dev/null | xargs -I{} date 
-j -f "%c" "{}" "+%s" 2>/dev/null || echo "0") + else + # Linux + local start_time=$(ps -o etimes= -p "$pid" 2>/dev/null | tr -d ' ' || echo "0") + fi + + if [[ "$start_time" == "0" ]]; then + echo "unknown" + return + fi + + local now=$(date +%s) + local elapsed=$((now - start_time)) + + # For Linux, etimes already gives elapsed seconds + if [[ "$(uname)" != "Darwin" ]]; then + elapsed=$start_time + fi + + local hours=$((elapsed / 3600)) + local minutes=$(((elapsed % 3600) / 60)) + local seconds=$((elapsed % 60)) + + if (( hours > 0 )); then + printf "%dh %dm %ds" "$hours" "$minutes" "$seconds" + elif (( minutes > 0 )); then + printf "%dm %ds" "$minutes" "$seconds" + else + printf "%ds" "$seconds" + fi +} + +echo "=== Worker Health Check ===" +echo "Project: $CLAUDE_TEMPORAL_PROJECT_NAME" +echo "" + +# Check if PID file exists +if [[ ! -f "$PID_FILE" ]]; then + echo "Worker Status: NOT RUNNING" + echo "No PID file found: $PID_FILE" + exit 1 +fi + +# Read PID +WORKER_PID=$(cat "$PID_FILE") + +# Check if process is alive +if ! kill -0 "$WORKER_PID" 2>/dev/null; then + echo "Worker Status: DEAD" + echo "PID file exists but process is not running" + echo "PID: $WORKER_PID (stale)" + echo "" + echo "To clean up and restart:" + echo " rm -f $PID_FILE" + echo " ./tools/ensure-worker.sh" + exit 1 +fi + +# Process is alive +echo "Worker Status: RUNNING" +echo "PID: $WORKER_PID" +echo "Uptime: $(get_uptime "$WORKER_PID")" +echo "" + +# Check log file +if [[ -f "$LOG_FILE" ]]; then + echo "Log file: $LOG_FILE" + echo "Log size: $(wc -c < "$LOG_FILE" | tr -d ' ') bytes" + echo "" + + # Check for recent errors in logs (last 50 lines) + if tail -n 50 "$LOG_FILE" | grep -iE "(error|exception|fatal|traceback)" >/dev/null 2>&1; then + echo "⚠️ Recent errors found in logs (last 50 lines):" + echo "" + tail -n 50 "$LOG_FILE" | grep -iE "(error|exception|fatal)" | tail -n 10 + echo "" + echo "Full logs: $LOG_FILE" + exit 1 + fi + + # Show last log entry + echo "Last log entry:" + tail -n 1 "$LOG_FILE" 2>/dev/null || echo "(empty log)" + echo "" + + echo "✓ Worker appears healthy" + echo "" + echo "To view logs:" + echo " tail -f $LOG_FILE" +else + echo "⚠️ Log file not found: $LOG_FILE" + echo "" + echo "Worker is running but no logs found" + exit 1 +fi diff --git a/scripts/wait-for-worker-ready.sh b/scripts/wait-for-worker-ready.sh new file mode 100755 index 0000000..17fbadf --- /dev/null +++ b/scripts/wait-for-worker-ready.sh @@ -0,0 +1,76 @@ +#!/usr/bin/env bash +set -euo pipefail + +usage() { + cat <<'USAGE' +Usage: wait-for-worker-ready.sh --log-file file [options] + +Poll worker log file for startup confirmation. + +Options: + --log-file path to worker log file, required + -p, --pattern regex pattern to look for (default: "Worker started") + -F, --fixed treat pattern as a fixed string (grep -F) + -T, --timeout seconds to wait (integer, default: 30) + -i, --interval poll interval in seconds (default: 0.5) + -h, --help show this help +USAGE +} + +log_file="" +pattern="Worker started" +grep_flag="-E" +timeout=30 +interval=0.5 + +while [[ $# -gt 0 ]]; do + case "$1" in + --log-file) log_file="${2-}"; shift 2 ;; + -p|--pattern) pattern="${2-}"; shift 2 ;; + -F|--fixed) grep_flag="-F"; shift ;; + -T|--timeout) timeout="${2-}"; shift 2 ;; + -i|--interval) interval="${2-}"; shift 2 ;; + -h|--help) usage; exit 0 ;; + *) echo "Unknown option: $1" >&2; usage; exit 1 ;; + esac +done + +if [[ -z "$log_file" ]]; then + echo "log-file is required" >&2 + usage + exit 1 +fi + +if ! 
[[ "$timeout" =~ ^[0-9]+$ ]]; then + echo "timeout must be an integer number of seconds" >&2 + exit 1 +fi + +# End time in epoch seconds +start_epoch=$(date +%s) +deadline=$((start_epoch + timeout)) + +while true; do + # Check if log file exists and has content + if [[ -f "$log_file" ]]; then + log_content="$(cat "$log_file" 2>/dev/null || true)" + + if [[ -n "$log_content" ]] && printf '%s\n' "$log_content" | grep $grep_flag -- "$pattern" >/dev/null 2>&1; then + exit 0 + fi + fi + + now=$(date +%s) + if (( now >= deadline )); then + echo "Timed out after ${timeout}s waiting for pattern: $pattern" >&2 + if [[ -f "$log_file" ]]; then + echo "Last content from $log_file:" >&2 + tail -n 50 "$log_file" >&2 || true + else + echo "Log file not found: $log_file" >&2 + fi + exit 1 + fi + + sleep "$interval" +done diff --git a/scripts/wait-for-workflow-status.sh b/scripts/wait-for-workflow-status.sh new file mode 100755 index 0000000..5a3bd54 --- /dev/null +++ b/scripts/wait-for-workflow-status.sh @@ -0,0 +1,118 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Environment variables with defaults +TEMPORAL_CLI="${TEMPORAL_CLI:-temporal}" +TEMPORAL_ADDRESS="${TEMPORAL_ADDRESS:-localhost:7233}" +CLAUDE_TEMPORAL_NAMESPACE="${CLAUDE_TEMPORAL_NAMESPACE:-default}" + +usage() { + cat <<'USAGE' +Usage: wait-for-workflow-status.sh --workflow-id id --status status [options] + +Poll workflow for specific status. + +Options: + --workflow-id workflow ID to monitor, required + --status status to wait for, required + (RUNNING, COMPLETED, FAILED, CANCELED, TERMINATED, TIMED_OUT) + --run-id specific workflow run ID (optional) + -T, --timeout seconds to wait (integer, default: 300) + -i, --interval poll interval in seconds (default: 2) + -h, --help show this help +USAGE +} + +workflow_id="" +run_id="" +target_status="" +timeout=300 +interval=2 + +while [[ $# -gt 0 ]]; do + case "$1" in + --workflow-id) workflow_id="${2-}"; shift 2 ;; + --run-id) run_id="${2-}"; shift 2 ;; + --status) target_status="${2-}"; shift 2 ;; + -T|--timeout) timeout="${2-}"; shift 2 ;; + -i|--interval) interval="${2-}"; shift 2 ;; + -h|--help) usage; exit 0 ;; + *) echo "Unknown option: $1" >&2; usage; exit 1 ;; + esac +done + +if [[ -z "$workflow_id" || -z "$target_status" ]]; then + echo "workflow-id and status are required" >&2 + usage + exit 1 +fi + +if ! [[ "$timeout" =~ ^[0-9]+$ ]]; then + echo "timeout must be an integer number of seconds" >&2 + exit 1 +fi + +if ! 
command -v "$TEMPORAL_CLI" >/dev/null 2>&1; then + echo "Temporal CLI not found: $TEMPORAL_CLI" >&2 + exit 1 +fi + +# Build temporal command +TEMPORAL_CMD=("$TEMPORAL_CLI" "workflow" "describe" "--workflow-id" "$workflow_id" "--address" "$TEMPORAL_ADDRESS" "--namespace" "$CLAUDE_TEMPORAL_NAMESPACE") + +if [[ -n "$run_id" ]]; then + TEMPORAL_CMD+=("--run-id" "$run_id") +fi + +# Normalize target status to uppercase +target_status=$(echo "$target_status" | tr '[:lower:]' '[:upper:]') + +# End time in epoch seconds +start_epoch=$(date +%s) +deadline=$((start_epoch + timeout)) + +echo "Polling workflow: $workflow_id" +echo "Target status: $target_status" +echo "Timeout: ${timeout}s" +echo "" + +while true; do + # Query workflow status + if output=$("${TEMPORAL_CMD[@]}" 2>&1); then + # Extract status from output + # The output includes a line like: " Status COMPLETED" + if current_status=$(echo "$output" | grep -E "^\s*Status\s" | awk '{print $2}' | tr -d ' '); then + echo "Current status: $current_status ($(date '+%H:%M:%S'))" + + if [[ "$current_status" == "$target_status" ]]; then + echo "" + echo "✓ Workflow reached status: $target_status" + exit 0 + fi + + # Check if workflow reached a terminal state different from target + case "$current_status" in + COMPLETED|FAILED|CANCELED|TERMINATED|TIMED_OUT) + if [[ "$current_status" != "$target_status" ]]; then + echo "" + echo "⚠️ Workflow reached terminal status: $current_status (expected: $target_status)" + exit 1 + fi + ;; + esac + else + echo "⚠️ Could not parse workflow status from output" >&2 + fi + else + echo "⚠️ Failed to query workflow (it may not exist yet)" >&2 + fi + + now=$(date +%s) + if (( now >= deadline )); then + echo "" + echo "❌ Timeout after ${timeout}s waiting for status: $target_status" >&2 + exit 1 + fi + + sleep "$interval" +done From 5998c6406d97d4d0d5b695c722ab99342865c901 Mon Sep 17 00:00:00 2001 From: Donald Pinckney Date: Wed, 4 Feb 2026 14:35:08 -0500 Subject: [PATCH 2/4] Use claude to merge Steve's, Max's, and Mason's skills. (#2) * Use claude to merge Steve's, Max's, and Mason's skills. 
Did a review pass using claude's skill development skills * Add missing things from Steve * trigger tweaks * Add in common gotchas from Johann --- README.md | 113 ++++- SKILL.md | 245 ++++---- references/concepts.md | 38 -- references/core/ai-integration.md | 237 +++++++++ references/core/common-gotchas.md | 190 ++++++++ references/core/determinism.md | 144 ++++++ references/{ => core}/error-reference.md | 5 + .../{ => core}/interactive-workflows.md | 0 references/{ => core}/logs.md | 0 references/core/patterns.md | 239 +++++++++ references/{ => core}/tool-reference.md | 0 references/core/troubleshooting.md | 308 ++++++++++++ references/core/versioning.md | 172 +++++++ references/python/advanced-features.md | 378 +++++++++++++++ references/python/ai-patterns.md | 362 ++++++++++++++ references/python/data-handling.md | 230 +++++++++ references/python/determinism.md | 143 ++++++ references/python/error-handling.md | 145 ++++++ references/python/gotchas.md | 390 +++++++++++++++ references/python/observability.md | 191 ++++++++ references/python/patterns.md | 453 ++++++++++++++++++ references/python/python.md | 176 +++++++ references/python/sandbox.md | 180 +++++++ references/python/sync-vs-async.md | 243 ++++++++++ references/python/testing.md | 120 +++++ references/python/versioning.md | 355 ++++++++++++++ references/troubleshooting.md | 184 ------- references/typescript/advanced-features.md | 438 +++++++++++++++++ references/typescript/data-handling.md | 256 ++++++++++ references/typescript/determinism.md | 133 +++++ references/typescript/error-handling.md | 180 +++++++ references/typescript/gotchas.md | 403 ++++++++++++++++ references/typescript/observability.md | 231 +++++++++ references/typescript/patterns.md | 422 ++++++++++++++++ references/typescript/testing.md | 128 +++++ references/typescript/typescript.md | 121 +++++ references/typescript/versioning.md | 307 ++++++++++++ 37 files changed, 7550 insertions(+), 310 deletions(-) delete mode 100644 references/concepts.md create mode 100644 references/core/ai-integration.md create mode 100644 references/core/common-gotchas.md create mode 100644 references/core/determinism.md rename references/{ => core}/error-reference.md (91%) rename references/{ => core}/interactive-workflows.md (100%) rename references/{ => core}/logs.md (100%) create mode 100644 references/core/patterns.md rename references/{ => core}/tool-reference.md (100%) create mode 100644 references/core/troubleshooting.md create mode 100644 references/core/versioning.md create mode 100644 references/python/advanced-features.md create mode 100644 references/python/ai-patterns.md create mode 100644 references/python/data-handling.md create mode 100644 references/python/determinism.md create mode 100644 references/python/error-handling.md create mode 100644 references/python/gotchas.md create mode 100644 references/python/observability.md create mode 100644 references/python/patterns.md create mode 100644 references/python/python.md create mode 100644 references/python/sandbox.md create mode 100644 references/python/sync-vs-async.md create mode 100644 references/python/testing.md create mode 100644 references/python/versioning.md delete mode 100644 references/troubleshooting.md create mode 100644 references/typescript/advanced-features.md create mode 100644 references/typescript/data-handling.md create mode 100644 references/typescript/determinism.md create mode 100644 references/typescript/error-handling.md create mode 100644 references/typescript/gotchas.md create mode 100644 
references/typescript/observability.md create mode 100644 references/typescript/patterns.md create mode 100644 references/typescript/testing.md create mode 100644 references/typescript/typescript.md create mode 100644 references/typescript/versioning.md diff --git a/README.md b/README.md index 922884a..3340372 100644 --- a/README.md +++ b/README.md @@ -1 +1,112 @@ -# skill-temporal-dev +# Temporal Development Skill + +A comprehensive skill for building Temporal applications in Python and TypeScript. + +## Overview + +This skill provides multi-language guidance for Temporal development, combining: +- **Core concepts** shared across languages (determinism, patterns, versioning) +- **Language-specific references** for Python and TypeScript +- **Operational scripts** for worker and workflow management +- **AI/LLM integration patterns** for building durable AI applications + +## Structure + +``` +temporal-dev/ +├── SKILL.md # Core architecture, quick references (always loaded) +├── references/ +│ ├── core/ # Language-agnostic concepts +│ │ ├── determinism.md # Why determinism matters, replay mechanics +│ │ ├── patterns.md # Signals, queries, saga, child workflows +│ │ ├── versioning.md # Patching, workflow types, worker versioning +│ │ ├── troubleshooting.md # Decision trees, recovery procedures +│ │ ├── error-reference.md # Common error types, workflow status +│ │ ├── interactive-workflows.md # Testing signals, updates, queries +│ │ ├── tool-reference.md # Script options, worker management +│ │ ├── logs.md # Log file locations, search patterns +│ │ └── ai-integration.md # AI/LLM integration concepts +│ ├── python/ # Python SDK references +│ │ ├── python.md # SDK overview, quick start +│ │ ├── sandbox.md # Python sandbox mechanics +│ │ ├── sync-vs-async.md # Activity type selection +│ │ ├── patterns.md # Python implementations +│ │ ├── testing.md # WorkflowEnvironment, mocking +│ │ ├── error-handling.md # ApplicationError, retries +│ │ ├── data-handling.md # Pydantic, encryption +│ │ ├── observability.md # Logging, metrics +│ │ ├── versioning.md # Python patching API +│ │ ├── advanced-features.md # Continue-as-new, interceptors +│ │ └── ai-patterns.md # Python AI Cookbook patterns +│ └── typescript/ # TypeScript SDK references +│ ├── typescript.md # SDK overview, quick start +│ ├── patterns.md # TypeScript implementations +│ ├── testing.md # TestWorkflowEnvironment +│ ├── error-handling.md # ApplicationFailure +│ ├── data-handling.md # Data converters +│ ├── observability.md # Sinks, logging +│ ├── versioning.md # TypeScript patching API +│ └── advanced-features.md # Cancellation scopes +├── scripts/ # Operational utilities +│ ├── ensure-server.sh # Start Temporal dev server +│ ├── ensure-worker.sh # Start worker for project +│ ├── list-workers.sh # List running workers +│ ├── kill-worker.sh # Stop specific worker +│ ├── kill-all-workers.sh # Stop ALL workers +│ ├── monitor-worker-health.sh # Check worker health +│ ├── list-recent-workflows.sh # Show recent executions +│ ├── get-workflow-result.sh # Get workflow output +│ ├── find-stalled-workflows.sh # Find stuck workflows +│ ├── analyze-workflow-error.sh # Diagnose failures +│ ├── bulk-cancel-workflows.sh # Cancel multiple workflows +│ ├── wait-for-workflow-status.sh # Poll workflow status +│ ├── wait-for-worker-ready.sh # Poll worker startup +│ └── find-project-workers.sh # Helper: find worker PIDs +``` + +## Progressive Disclosure + +The skill uses progressive loading to manage context efficiently: + +1. 
**SKILL.md** - Always loaded when skill triggers + - Core architecture diagram + - Determinism quick reference + - Pattern index with links + - Troubleshooting quick reference + +2. **Core references** - Loaded when discussing concepts + - Language-agnostic theory and patterns + - Versioning strategies + - Troubleshooting decision trees + +3. **Language references** - Loaded when working in that language + - SDK-specific implementations + - Language-specific gotchas + - Testing patterns + +## Content Sources + +This skill merges content from multiple sources: +- **Steve's temporal-dev skill** - Operational scripts, troubleshooting +- **Max's temporal-claude-skill** - Multi-SDK structure, AI integration +- **Mason's python-sdk skill** - Python deep-dive, sandbox, sync/async +- **Mason's typescript-sdk skill** - TypeScript patterns, V8 isolation + +## Trigger Phrases + +The skill activates on phrases like: +- "create a Temporal workflow" +- "write a Temporal activity" +- "debug workflow stuck" +- "fix non-determinism error" +- "Temporal Python" / "Temporal TypeScript" +- "workflow replay" +- "activity timeout" +- "signal workflow" / "query workflow" +- "worker not starting" +- "activity keeps retrying" +- "Temporal heartbeat" +- "continue-as-new" +- "child workflow" +- "saga pattern" +- "workflow versioning" diff --git a/SKILL.md b/SKILL.md index 70af934..390a12e 100644 --- a/SKILL.md +++ b/SKILL.md @@ -1,117 +1,188 @@ --- -name: temporal-dev -description: "Start, stop, debug, and troubleshoot Temporal workflows for Python projects. Use when: starting workers, executing workflows, workflow is stalled/failed, non-determinism errors, checking workflow status, or managing temporal server start-dev lifecycle." -version: 0.1.0 -allowed-tools: "Bash(.claude/skills/temporal/scripts/*:*), Read" +name: Temporal Development +description: This skill should be used when the user asks to "create a Temporal workflow", "write a Temporal activity", "debug stuck workflow", "fix non-determinism error", "Temporal Python", "Temporal TypeScript", "workflow replay", "activity timeout", "signal workflow", "query workflow", "worker not starting", "activity keeps retrying", "Temporal heartbeat", "continue-as-new", "child workflow", "saga pattern", "workflow versioning", "durable execution", "reliable distributed systems", or mentions Temporal SDK development. Provides multi-language guidance for Python and TypeScript with operational scripts. +version: 1.0.0 --- -# Temporal Skill +# Temporal Development -Manage Temporal workflows using local development server (Python SDK, `temporal server start-dev`). +## Overview -## Environment Variables +Temporal is a durable execution platform that makes workflows survive failures automatically. This skill provides guidance for building Temporal applications in Python and TypeScript. 
-| Variable | Default | Description | -|----------|---------|-------------| -| `CLAUDE_TEMPORAL_LOG_DIR` | `/tmp/claude-temporal-logs` | Worker log directory | -| `CLAUDE_TEMPORAL_PID_DIR` | `/tmp/claude-temporal-pids` | Worker PID directory | -| `TEMPORAL_ADDRESS` | `localhost:7233` | Temporal server gRPC address | -| `TEMPORAL_WORKER_CMD` | `uv run worker` | Command to start worker | +## Core Architecture ---- +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Temporal Cluster │ +│ ┌─────────────────┐ ┌─────────────────┐ ┌────────────────┐ │ +│ │ Event History │ │ Task Queues │ │ Visibility │ │ +│ │ (Durable Log) │ │ (Work Router) │ │ (Search) │ │ +│ └─────────────────┘ └─────────────────┘ └────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ + ▲ + │ Poll / Complete + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ Worker │ +│ ┌─────────────────────────┐ ┌──────────────────────────────┐ │ +│ │ Workflow Definitions │ │ Activity Implementations │ │ +│ │ (Deterministic) │ │ (Non-deterministic OK) │ │ +│ └─────────────────────────┘ └──────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ +``` -## Quick Start +**Components:** +- **Workflows** - Durable, deterministic functions that orchestrate activities +- **Activities** - Non-deterministic operations (API calls, I/O) that can fail and retry +- **Workers** - Long-running processes that poll task queues and execute code +- **Task Queues** - Named queues connecting clients to workers -```bash -# 1. Start server -./scripts/ensure-server.sh +## History Replay: Why Determinism Matters -# 2. Start worker (kills old workers, starts fresh) -./scripts/ensure-worker.sh +Temporal achieves durability through **history replay**: -# 3. Execute workflow -uv run starter # Capture workflow_id from output +1. **Initial Execution** - Worker runs workflow, generates Commands, stored as Events in history +2. **Recovery** - On restart/failure, Worker re-executes workflow from beginning +3. **Matching** - SDK compares generated Commands against stored Events +4. **Restoration** - Uses stored Activity results instead of re-executing -# 4. Wait for completion -./scripts/wait-for-workflow-status.sh --workflow-id --status COMPLETED +**If Commands don't match Events = Non-determinism Error = Workflow blocked** -# 5. Get result (verify it's correct, not an error message) -./scripts/get-workflow-result.sh --workflow-id +| Workflow Code | Command | Event | +|--------------|---------|-------| +| Execute activity | `ScheduleActivityTask` | `ActivityTaskScheduled` | +| Sleep/timer | `StartTimer` | `TimerStarted` | +| Child workflow | `StartChildWorkflowExecution` | `ChildWorkflowExecutionStarted` | -# 6. CLEANUP: Kill workers when done -./scripts/kill-worker.sh -``` +See `references/core/determinism.md` for detailed explanation. 
---- +## Determinism Quick Reference -## Common Recipes +| Forbidden | Python | TypeScript | +|-----------|--------|------------| +| Current time | `workflow.now()` | `Date.now()` (auto-replaced) | +| Random | `workflow.random()` | `Math.random()` (auto-replaced) | +| UUID | `workflow.uuid4()` | `uuid4()` from workflow | +| Sleep | `asyncio.sleep()` | `sleep()` from workflow | -### Clean Start -```bash -./scripts/kill-all-workers.sh -./scripts/ensure-server.sh -./scripts/ensure-worker.sh -uv run starter -``` +**Python sandbox**: Explicit protection, use `workflow.unsafe.imports_passed_through()` for libraries +**TypeScript sandbox**: V8 isolation, automatic replacements, use type-only imports for activities -### Debug Stalled Workflow -```bash -./scripts/find-stalled-workflows.sh -./scripts/analyze-workflow-error.sh --workflow-id -tail -100 $CLAUDE_TEMPORAL_LOG_DIR/worker-$(basename "$(pwd)").log -# See references/troubleshooting.md for decision tree -``` +## Language Selection -### Clear Stalled Environment -```bash -./scripts/find-stalled-workflows.sh -./scripts/bulk-cancel-workflows.sh -./scripts/kill-worker.sh -./scripts/ensure-worker.sh -``` +### Python +- Decorators: `@workflow.defn`, `@workflow.run`, `@activity.defn` +- Async/await throughout +- Explicit sandbox with pass-through pattern +- **Critical**: Separate workflow and activity files for performance +- See `references/python/python.md` -### Check Recent Results -```bash -./scripts/list-recent-workflows.sh --minutes 30 -./scripts/get-workflow-result.sh --workflow-id -``` +### TypeScript +- Functions exported from workflow file +- `proxyActivities()` with type-only imports +- V8 sandbox with automatic replacements +- Webpack bundling for workflows +- See `references/typescript/typescript.md` ---- +## Pattern Index -## Key Scripts +| Pattern | Use Case | Python | TypeScript | +|---------|----------|--------|------------| +| **Signals** | Fire-and-forget events to running workflow | `references/python/patterns.md` | `references/typescript/patterns.md` | +| **Queries** | Read-only state inspection | `references/python/patterns.md` | `references/typescript/patterns.md` | +| **Updates** | Synchronous state modification with response | `references/python/patterns.md` | `references/typescript/patterns.md` | +| **Child Workflows** | Break down large workflows, isolate failures | `references/python/patterns.md` | `references/typescript/patterns.md` | +| **Continue-as-New** | Prevent unbounded history growth | `references/python/advanced-features.md` | `references/typescript/advanced-features.md` | +| **Saga** | Distributed transactions with compensation | `references/python/patterns.md` | `references/typescript/patterns.md` | -| Script | Purpose | -|--------|---------| -| `ensure-server.sh` | Start dev server if not running | -| `ensure-worker.sh` | Kill old workers, start fresh one | -| `kill-worker.sh` | Kill current project's worker | -| `kill-all-workers.sh` | Kill all workers (`--include-server` option) | -| `find-stalled-workflows.sh` | Detect stalled workflows | -| `analyze-workflow-error.sh` | Extract errors from history | -| `wait-for-workflow-status.sh` | Block until status reached | -| `get-workflow-result.sh` | Get workflow output | +## Troubleshooting Quick Reference -See `references/tool-reference.md` for full details. 
+| Symptom | Likely Cause | Action | +|---------|--------------|--------| +| Workflow stuck (RUNNING but no progress) | Worker not running or wrong task queue | Check worker, verify task queue name | +| `NondeterminismError` | Code changed mid-execution | Use patching API or reset workflow | +| Activity keeps retrying | Activity throwing errors | Check activity logs, fix root cause | +| Workflow FAILED | Unhandled exception in workflow | Check workflow error, fix code | +| Timeout errors | Timeout too short or activity stuck | Increase timeout or add heartbeats | ---- +See `references/core/troubleshooting.md` for decision trees and detailed recovery steps. -## References (Load When Needed) +## Versioning -| Reference | When to Read | -|-----------|--------------| -| `references/concepts.md` | Understanding workflow vs activity tasks, component architecture | -| `references/troubleshooting.md` | Workflow stalled, failed, or misbehaving - decision tree and fixes | -| `references/error-reference.md` | Looking up specific error types and recovery steps | -| `references/tool-reference.md` | Script options and worker management details | -| `references/interactive-workflows.md` | Signals, updates, queries for human-in-the-loop workflows | -| `references/logs.md` | Log file locations and search commands | +To safely change workflow code while workflows are running: ---- +1. **Patching API** - Code-level branching for old vs new paths +2. **Workflow Type Versioning** - New workflow type for incompatible changes +3. **Worker Versioning** - Deployment-level control with Build IDs + +See `references/core/versioning.md` for concepts, language-specific files for implementation. -## Critical Rules +## Scripts (Operational) -1. **Always kill workers when done** - Don't leave stale workers running -2. **One worker instance only** - Multiple workers cause non-determinism -3. **Capture workflow_id** - You need it for all monitoring/troubleshooting -4. **Verify results** - COMPLETED status doesn't mean correct result; check payload -5. **Non-determinism: analyze first** - Use `analyze-workflow-error.sh` to understand the mismatch. If accidental: fix code to match history. If intentional v2 change: terminate and start fresh. 
See `references/troubleshooting.md` +Available scripts in `scripts/` for worker and workflow management: + +### Server & Worker Lifecycle +| Script | Purpose | +|--------|---------| +| `ensure-server.sh` | Start Temporal dev server if not running | +| `ensure-worker.sh` | Start worker for project (kills existing first) | +| `list-workers.sh` | List running workers | +| `kill-worker.sh` | Stop a specific worker | +| `kill-all-workers.sh` | Stop ALL workers (cleanup) | +| `monitor-worker-health.sh` | Check worker health, uptime, recent errors | + +### Workflow Operations +| Script | Purpose | +|--------|---------| +| `list-recent-workflows.sh` | Show recent workflow executions | +| `get-workflow-result.sh` | Get output/result from completed workflow | +| `find-stalled-workflows.sh` | Find workflows not making progress | +| `analyze-workflow-error.sh` | Diagnose workflow failures | +| `bulk-cancel-workflows.sh` | Cancel multiple workflows by ID or pattern | + +### Utilities (used by other scripts) +| Script | Purpose | +|--------|---------| +| `wait-for-workflow-status.sh` | Poll until workflow reaches target status | +| `wait-for-worker-ready.sh` | Poll log file for worker startup | +| `find-project-workers.sh` | Helper to find worker PIDs for a project | + +## Additional Resources + +### Core References (Language-Agnostic) +- **`references/core/determinism.md`** - Why determinism matters, replay mechanics +- **`references/core/patterns.md`** - Conceptual patterns (signals, queries, saga) +- **`references/core/versioning.md`** - Versioning strategies and concepts +- **`references/core/troubleshooting.md`** - Decision trees, recovery procedures +- **`references/core/error-reference.md`** - Common error types, workflow status reference +- **`references/core/common-gotchas.md`** - Anti-patterns and common mistakes +- **`references/core/interactive-workflows.md`** - Testing signals, updates, queries +- **`references/core/tool-reference.md`** - Script options and worker management details +- **`references/core/logs.md`** - Log file locations and search patterns +- **`references/core/ai-integration.md`** - AI/LLM integration patterns + +### Python References +- **`references/python/python.md`** - Python SDK overview, quick start +- **`references/python/sandbox.md`** - Python sandbox mechanics +- **`references/python/sync-vs-async.md`** - Activity type selection, event loop +- **`references/python/patterns.md`** - Python pattern implementations +- **`references/python/testing.md`** - WorkflowEnvironment, mocking +- **`references/python/error-handling.md`** - ApplicationError, retries +- **`references/python/data-handling.md`** - Pydantic, encryption +- **`references/python/observability.md`** - Logging, metrics, tracing +- **`references/python/versioning.md`** - Python patching API +- **`references/python/advanced-features.md`** - Continue-as-new, interceptors +- **`references/python/ai-patterns.md`** - Python AI Cookbook patterns +- **`references/python/gotchas.md`** - Python-specific anti-patterns + +### TypeScript References +- **`references/typescript/typescript.md`** - TypeScript SDK overview, quick start +- **`references/typescript/patterns.md`** - TypeScript pattern implementations +- **`references/typescript/testing.md`** - TestWorkflowEnvironment +- **`references/typescript/error-handling.md`** - ApplicationFailure, retries +- **`references/typescript/data-handling.md`** - Data converters +- **`references/typescript/observability.md`** - Sinks, logging +- 
**`references/typescript/versioning.md`** - TypeScript patching API +- **`references/typescript/advanced-features.md`** - Cancellation scopes, interceptors +- **`references/typescript/gotchas.md`** - TypeScript-specific anti-patterns diff --git a/references/concepts.md b/references/concepts.md deleted file mode 100644 index 144709e..0000000 --- a/references/concepts.md +++ /dev/null @@ -1,38 +0,0 @@ -# Temporal Concepts - -Understanding how Temporal components interact is essential for troubleshooting. - -## How Workers, Workflows, and Tasks Relate - -``` -┌─────────────────────────────────────────────────────────────────┐ -│ TEMPORAL SERVER │ -│ Stores workflow history, manages task queues, coordinates work │ -└─────────────────────────────────────────────────────────────────┘ - │ - Task Queue (named queue) - │ - ▼ -┌─────────────────────────────────────────────────────────────────┐ -│ WORKER │ -│ Long-running process that polls task queue for work │ -│ Contains: Workflow definitions + Activity implementations │ -│ │ -│ When work arrives: │ -│ - Workflow Task → Execute workflow code decisions │ -│ - Activity Task → Execute activity code (business logic) │ -└─────────────────────────────────────────────────────────────────┘ -``` - -**Key Insight**: The workflow code runs inside the worker. If worker code is outdated or buggy, workflow execution fails. - -## Workflow Task vs Activity Task - -| Task Type | What It Does | Where It Runs | On Failure | -|-----------|--------------|---------------|------------| -| **Workflow Task** | Makes workflow decisions (what to do next) | Worker | **Stalls the workflow** until fixed | -| **Activity Task** | Executes business logic | Worker | Retries per retry policy | - -**CRITICAL**: Workflow task errors are fundamentally different from activity task errors: -- **Workflow Task Failure** → Workflow **stops making progress entirely** -- **Activity Task Failure** → Workflow **retries the activity** (workflow still progressing) diff --git a/references/core/ai-integration.md b/references/core/ai-integration.md new file mode 100644 index 0000000..acf79f3 --- /dev/null +++ b/references/core/ai-integration.md @@ -0,0 +1,237 @@ +# AI/LLM Integration with Temporal + +## Overview + +Temporal provides durable execution for AI/LLM applications, handling retries, rate limits, and long-running operations automatically. These patterns apply across languages, with Python being the most mature for AI integration. + +For Python-specific implementation details and code examples, see `references/python/ai-patterns.md`. + +## Why Temporal for AI? 
+ +| Challenge | Temporal Solution | +|-----------|-------------------| +| LLM API timeouts | Automatic retries with backoff | +| Rate limiting | Activity retry policies handle 429s | +| Long-running agents | Durable state survives crashes | +| Multi-step pipelines | Workflow orchestration | +| Cost tracking | Activity-level visibility | +| Debugging | Full execution history | + +## Core Patterns + +### Pattern 1: Generic LLM Activity + +Create flexible, reusable activities for LLM calls: + +``` +Activity: call_llm_generic( + model: string, + system_instructions: string, + user_input: string, + tools?: list, + response_format?: schema +) -> response +``` + +**Benefits**: +- Single activity handles multiple use cases +- Consistent retry handling +- Centralized configuration + +### Pattern 2: Activity-Based Separation + +Isolate each operation in its own activity: + +``` +Workflow: + ├── Activity: call_llm (get tool selection) + ├── Activity: execute_tool (run selected tool) + └── Activity: call_llm (interpret results) +``` + +**Benefits**: +- Independent retry for each step +- Clear audit trail in history +- Easier testing and mocking +- Failure isolation + +### Pattern 3: Centralized Retry Management + +**Critical**: Disable retries in LLM client libraries, let Temporal handle retries. + +``` +LLM Client Config: + max_retries = 0 ← Disable client retries + +Activity Retry Policy: + initial_interval = 1s + backoff_coefficient = 2.0 + maximum_attempts = 5 + maximum_interval = 60s +``` + +**Why**: +- Temporal retries are durable (survive crashes) +- Single retry configuration point +- Better visibility into retry attempts +- Consistent backoff behavior + +### Pattern 4: Tool-Calling Agent + +Three-phase workflow for LLM agents with tools: + +``` +┌─────────────────────────────────────────────┐ +│ Phase 1: Tool Selection │ +│ Activity: Present tools to LLM │ +│ LLM returns: tool_name, arguments │ +└─────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────┐ +│ Phase 2: Tool Execution │ +│ Activity: Execute selected tool │ +│ (Separate activity per tool type) │ +└─────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────┐ +│ Phase 3: Result Interpretation │ +│ Activity: Send results back to LLM │ +│ LLM returns: final response or next tool │ +└─────────────────────────────────────────────┘ + │ + ▼ + Loop until LLM returns final answer +``` + +### Pattern 5: Multi-Agent Orchestration + +Complex pipelines with multiple specialized agents: + +``` +Deep Research Example: + │ + ├── Planning Agent (Activity) + │ └── Output: subtopics to research + │ + ├── Query Generation Agent (Activity) + │ └── Output: search queries per subtopic + │ + ├── Parallel Web Search (Multiple Activities) + │ └── Output: search results (resilient to partial failures) + │ + └── Synthesis Agent (Activity) + └── Output: final report +``` + +**Key Pattern**: Use parallel execution with `return_exceptions=True` to continue with partial results when some searches fail. 
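+
+As a minimal Python sketch of this partial-failure fan-out (the Python SDK is assumed here; `run_search` and `queries` are illustrative names, not part of this skill):
+
+```python
+import asyncio
+from datetime import timedelta
+from temporalio import workflow
+
+# Inside a @workflow.run method: fan out one search activity per query.
+# return_exceptions=True makes gather() hand exceptions back as values
+# instead of failing the whole batch, so synthesis can proceed with
+# whatever succeeded.
+results = await asyncio.gather(
+    *(
+        workflow.execute_activity(
+            run_search,  # hypothetical search activity
+            query,
+            start_to_close_timeout=timedelta(minutes=5),
+        )
+        for query in queries
+    ),
+    return_exceptions=True,
+)
+successes = [r for r in results if not isinstance(r, BaseException)]
+```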
+ +### Pattern 6: Structured Outputs + +Define schemas for LLM responses: + +``` +Input: Raw LLM prompt +Schema: { action: string, confidence: float, reasoning: string } +Output: Validated, typed response +``` + +**Benefits**: +- Type safety +- Automatic validation +- Easier downstream processing + +## Timeout Recommendations + +| Operation Type | Recommended Timeout | +|----------------|---------------------| +| Simple LLM calls (GPT-4, Claude-3) | 30 seconds | +| Reasoning models (o1, o3, extended thinking) | 300 seconds (5 min) | +| Web searches | 300 seconds (5 min) | +| Simple tool execution | 30-60 seconds | +| Image generation | 120 seconds | +| Document processing | 60-120 seconds | + +**Rationale**: +- Reasoning models need time for complex computation +- Web searches may hit rate limits requiring backoff +- Fast timeouts catch stuck operations +- Longer timeouts prevent premature failures for expensive operations + +## Rate Limit Handling + +### From HTTP Headers + +Parse rate limit info from API responses: + +``` +Response Headers: + Retry-After: 30 + X-RateLimit-Remaining: 0 + +Activity: + If rate limited: + Raise retryable error with retry_after hint + Temporal handles the delay +``` + +### Retry Policy Configuration + +``` +Retry Policy: + initial_interval: 1s (or from Retry-After header) + backoff_coefficient: 2.0 + maximum_interval: 60s + maximum_attempts: 10 + non_retryable_errors: [InvalidAPIKey, InvalidInput] +``` + +## Error Handling + +### Retryable Errors +- Rate limits (429) +- Timeouts +- Temporary server errors (500, 502, 503) +- Network errors + +### Non-Retryable Errors +- Invalid API key (401) +- Invalid input/prompt +- Content policy violations +- Model not found + +## Best Practices + +1. **Disable client retries** - Let Temporal handle all retries +2. **Set appropriate timeouts** - Based on operation type +3. **Separate activities** - One per logical operation +4. **Use structured outputs** - For type safety and validation +5. **Handle partial failures** - Continue with available results +6. **Monitor costs** - Track LLM calls at activity level +7. **Version prompts** - Track prompt changes in code +8. **Test with mocks** - Mock LLM responses in tests + +## Observability + +- **Activity duration**: Track LLM latency +- **Retry counts**: Monitor rate limiting +- **Token usage**: Log in activity output +- **Cost attribution**: Tag workflows with cost centers + +## Language-Specific Resources + +### Python +See `references/python/ai-patterns.md` for: +- Pydantic data converter setup +- OpenAI client configuration +- LiteLLM multi-model support +- OpenAI Agents SDK integration +- Complete code examples +- Testing patterns + +### TypeScript +AI integration patterns in TypeScript follow the same concepts: +- Use `proxyActivities` for LLM activities +- Configure timeouts per activity type +- Handle errors with `ApplicationFailure` diff --git a/references/core/common-gotchas.md b/references/core/common-gotchas.md new file mode 100644 index 0000000..7d1ee23 --- /dev/null +++ b/references/core/common-gotchas.md @@ -0,0 +1,190 @@ +# Common Temporal Gotchas + +Common mistakes and anti-patterns in Temporal development. Learning from these saves significant debugging time. + +## Idempotency Issues + +### Non-Idempotent Activities + +**The Problem**: Activities may execute more than once due to retries or Worker failures. If an activity calls an external service without an idempotency key, you may charge a customer twice, send duplicate emails, or create duplicate records. 
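+
+As a minimal Python sketch of the fix described under **The Fix** below (`payments` is a hypothetical external client, not part of this skill):
+
+```python
+from temporalio import activity
+
+@activity.defn
+async def charge_customer(order_id: str, amount_cents: int) -> str:
+    info = activity.info()
+    # Retries of this activity reuse the same key, so the external
+    # service can deduplicate and the customer is charged only once.
+    idempotency_key = f"{info.workflow_id}:{info.activity_id}"
+    return await payments.charge(  # hypothetical payment client
+        order_id, amount_cents, idempotency_key=idempotency_key
+    )
+```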
+ +**Symptoms**: +- Duplicate side effects (double charges, duplicate notifications) +- Data inconsistencies after retries + +**The Fix**: Always use idempotency keys when calling external services. Use the workflow ID, activity ID, or a domain-specific identifier (like order ID) as the key. + +### Local Activities + +Local Activities skip the task queue for lower latency, but they're still subject to retries. The same idempotency rules apply. + +## Replay Safety Violations + +### Side Effects in Workflow Code + +**The Problem**: Code in workflow functions runs on first execution AND on every replay. Any side effect (logging, notifications, metrics) will happen multiple times. + +**Symptoms**: +- Duplicate log entries +- Multiple notifications for the same event +- Inflated metrics + +**The Fix**: +- Use the SDK's replay-aware logger (only logs on first execution) +- Put all side effects in Activities + +### Non-Deterministic Time + +**The Problem**: Using system time (`datetime.now()`, `Date.now()`) in workflow code returns different values on replay, causing non-determinism errors. + +**Symptoms**: +- Non-determinism errors mentioning time-based decisions +- Workflows that worked once but fail on replay + +**The Fix**: Use the SDK's deterministic time function (`workflow.now()` in Python, `Date.now()` is auto-replaced in TypeScript). + +## Worker Management Issues + +### Multiple Workers with Different Code + +**The Problem**: If Worker A runs part of a workflow with code v1, then Worker B (with code v2) picks it up, replay may produce different Commands. + +**Symptoms**: +- Non-determinism errors after deploying new code +- Errors mentioning "command mismatch" or "unexpected command" + +**The Fix**: +- Use Worker Versioning for production deployments +- During development: kill old workers before starting new ones +- Ensure all workers run identical code + +### Stale Workflows During Development + +**The Problem**: Workflows started with old code continue running after you change the code. + +**Symptoms**: +- Workflows behave unexpectedly after code changes +- Non-determinism errors on previously-working workflows + +**The Fix**: +- Terminate stale workflows: `temporal workflow terminate --workflow-id ` +- Use `find-stalled-workflows.sh` to detect stuck workflows +- In production, use versioning for backward compatibility + +## Workflow Design Anti-Patterns + +### The Mega Workflow + +**The Problem**: Putting too much logic in a single workflow. + +**Issues**: +- Hard to test and maintain +- Event history grows unbounded +- Single point of failure +- Difficult to reason about + +**The Fix**: +- Keep workflows focused on a single responsibility +- Use Child Workflows for sub-processes +- Use Continue-as-New for long-running workflows + +### Failing Too Quickly + +**The Problem**: Using aggressive retry policies that give up too easily. + +**Symptoms**: +- Workflows failing on transient errors +- Unnecessary workflow failures during brief outages + +**The Fix**: Use appropriate retry policies. Let Temporal handle transient failures with exponential backoff. Reserve `maximum_attempts=1` for truly non-retryable operations. + +## Query Handler Mistakes + +### Modifying State in Queries + +**The Problem**: Queries are read-only. Modifying state in a query handler causes non-determinism on replay because queries don't generate history events. + +**Symptoms**: +- State inconsistencies after workflow replay +- Non-determinism errors + +**The Fix**: Queries must only read state. 
Use Updates for operations that need to modify state AND return a result. + +### Blocking in Queries + +**The Problem**: Queries must return immediately. They cannot await activities, child workflows, timers, or conditions. + +**Symptoms**: +- Query timeouts +- Deadlocks + +**The Fix**: Queries return current state only. Use Signals or Updates to trigger async operations. + +### Query vs Signal vs Update + +| Operation | Modifies State? | Returns Result? | Can Block? | Use For | +|-----------|-----------------|-----------------|------------|---------| +| **Query** | No | Yes | No | Read current state | +| **Signal** | Yes | No | Yes | Fire-and-forget mutations | +| **Update** | Yes | Yes | Yes | Mutations needing results | + +**Key rule**: Query to peek, Signal to push, Update to pop. + +## File Organization Issues + +Each SDK has specific requirements for how workflow and activity code should be organized. Mixing them incorrectly causes sandbox issues, bundling problems, or performance degradation. + +See language-specific gotchas for details. + +## Testing Mistakes + +### Only Testing Happy Paths + +**The Problem**: Not testing what happens when things go wrong. + +**Questions to answer**: +- What happens when an Activity exhausts all retries? +- What happens when a workflow is cancelled mid-execution? +- What happens during a Worker restart? + +**The Fix**: Test failure scenarios explicitly. Mock activities to fail, test cancellation handling, use replay testing. + +### Not Testing Replay Compatibility + +**The Problem**: Changing workflow code without verifying existing workflows can still replay. + +**Symptoms**: +- Non-determinism errors after deployment +- Stuck workflows that can't make progress + +**The Fix**: Use replay testing against saved histories from production or staging. + +## Error Handling Mistakes + +### Swallowing Errors + +**The Problem**: Catching errors without proper handling hides failures. + +**Symptoms**: +- Silent failures +- Workflows completing "successfully" despite errors +- Difficult debugging + +**The Fix**: Log errors and make deliberate decisions. Either re-raise, use a fallback, or explicitly document why ignoring is safe. + +### Wrong Retry Classification + +**The Problem**: Marking transient errors as non-retryable, or permanent errors as retryable. + +**Symptoms**: +- Workflows failing on temporary network issues (if marked non-retryable) +- Infinite retries on invalid input (if marked retryable) + +**The Fix**: +- **Retryable**: Network errors, timeouts, rate limits, temporary unavailability +- **Non-retryable**: Invalid input, authentication failures, business rule violations, resource not found + +## Language-Specific Gotchas + +- [Python Gotchas](../python/gotchas.md) +- [TypeScript Gotchas](../typescript/gotchas.md) diff --git a/references/core/determinism.md b/references/core/determinism.md new file mode 100644 index 0000000..9a606d1 --- /dev/null +++ b/references/core/determinism.md @@ -0,0 +1,144 @@ +# Determinism in Temporal Workflows + +## Overview + +Temporal workflows must be deterministic because of **history replay** - the mechanism that enables durable execution. + +## Why Determinism Matters + +### The Replay Mechanism + +When a Worker needs to restore workflow state (after crash, cache eviction, or continuing after a long timer), it **re-executes the workflow code from the beginning**. But instead of re-running activities, it uses results stored in the Event History. 
+ +``` +Initial Execution: + Code runs → Generates Commands → Server stores as Events + +Replay (Recovery): + Code runs again → Generates Commands → SDK compares to Events + If match: Use stored results, continue + If mismatch: NondeterminismError! +``` + +### Commands and Events + +Every workflow operation generates a Command that becomes an Event: + +| Workflow Code | Command Generated | Event Stored | +|--------------|-------------------|--------------| +| Execute activity | `ScheduleActivityTask` | `ActivityTaskScheduled` | +| Sleep/timer | `StartTimer` | `TimerStarted` | +| Child workflow | `StartChildWorkflowExecution` | `ChildWorkflowExecutionStarted` | +| Complete workflow | `CompleteWorkflowExecution` | `WorkflowExecutionCompleted` | + +### Non-Determinism Example + +``` +First Run (11:59 AM): + if datetime.now().hour < 12: → True + execute_activity(morning_task) → Command: ScheduleActivityTask("morning_task") + +Replay (12:01 PM): + if datetime.now().hour < 12: → False + execute_activity(afternoon_task) → Command: ScheduleActivityTask("afternoon_task") + +Result: Commands don't match history → NondeterminismError +``` + +## Sources of Non-Determinism + +### Time-Based Operations +- `datetime.now()`, `time.time()`, `Date.now()` +- Different value on each execution + +### Random Values +- `random.random()`, `Math.random()`, `uuid.uuid4()` +- Different value on each execution + +### External State +- Reading files, environment variables, databases +- State may change between executions + +### Non-Deterministic Iteration +- Map/dict iteration order (in some languages) +- Set iteration order + +### Threading/Concurrency +- Race conditions produce different outcomes +- Non-deterministic ordering + +## SDK Protection Mechanisms + +### Python Sandbox +The Python SDK runs workflows in a sandbox that: +- Intercepts non-deterministic calls +- Raises errors for forbidden operations +- Requires explicit pass-through for libraries + +```python +# Python: Use SDK alternatives +workflow.now() # Instead of datetime.now() +workflow.random() # Instead of random +workflow.uuid4() # Instead of uuid.uuid4() +``` + +### TypeScript V8 Isolation +The TypeScript SDK runs workflows in an isolated V8 context that: +- Automatically replaces `Date.now()`, `Math.random()` with deterministic versions +- Prevents access to Node.js APIs +- Bundles workflow code separately from activities + +```typescript +// TypeScript: Auto-replaced to be deterministic +Date.now() // Returns workflow task start time +Math.random() // Returns seeded PRNG value +new Date() // Deterministic +``` + +## Detecting Non-Determinism + +### During Execution +- `NondeterminismError` raised when Commands don't match Events +- Workflow becomes blocked until code is fixed + +### Testing with Replay +Export workflow history and replay against new code: + +```python +# Python +from temporalio.worker import Replayer +replayer = Replayer(workflows=[MyWorkflow]) +await replayer.replay_workflow(history) # Raises if incompatible +``` + +```typescript +// TypeScript +import { Worker } from '@temporalio/worker'; +await Worker.runReplayHistory({ + workflowsPath: require.resolve('./workflows'), + history, +}); +``` + +## Recovery from Non-Determinism + +### Accidental Change +If you accidentally introduced non-determinism: +1. Revert code to match what's in history +2. Restart worker +3. Workflow auto-recovers + +### Intentional Change +If you need to change workflow logic: +1. Use the **Patching API** to support both old and new code paths +2. 
Or terminate old workflows and start new ones with updated code + +See `versioning.md` for patching details. + +## Best Practices + +1. **Use SDK-provided alternatives** for time, random, UUID +2. **Move I/O to activities** - workflows should only orchestrate +3. **Test with replay** before deploying workflow changes +4. **Use patching** for intentional changes to running workflows +5. **Keep workflows focused** - complex logic increases non-determinism risk diff --git a/references/error-reference.md b/references/core/error-reference.md similarity index 91% rename from references/error-reference.md rename to references/core/error-reference.md index 0926e4f..b39be4a 100644 --- a/references/error-reference.md +++ b/references/core/error-reference.md @@ -22,3 +22,8 @@ | `CANCELED` | Explicitly canceled | Review reason | | `TERMINATED` | Force-stopped | Review reason | | `TIMED_OUT` | Exceeded timeout | Increase timeout | + +## See Also + +- [Common Gotchas](common-gotchas.md) - Anti-patterns that cause these errors +- [Troubleshooting](troubleshooting.md) - Decision trees for diagnosing issues diff --git a/references/interactive-workflows.md b/references/core/interactive-workflows.md similarity index 100% rename from references/interactive-workflows.md rename to references/core/interactive-workflows.md diff --git a/references/logs.md b/references/core/logs.md similarity index 100% rename from references/logs.md rename to references/core/logs.md diff --git a/references/core/patterns.md b/references/core/patterns.md new file mode 100644 index 0000000..2ecc726 --- /dev/null +++ b/references/core/patterns.md @@ -0,0 +1,239 @@ +# Temporal Workflow Patterns + +## Overview + +Common patterns for building robust Temporal workflows. For language-specific implementations, see the Python or TypeScript references. + +## Signals + +**Purpose**: Send data to a running workflow asynchronously (fire-and-forget). + +**When to Use**: +- Human approval workflows +- Adding items to a workflow's queue +- Notifying workflow of external events +- Live configuration updates + +**Characteristics**: +- Asynchronous - sender doesn't wait for response +- Can mutate workflow state +- Durable - signals are persisted in history +- Can be sent before workflow starts (signal-with-start) + +**Example Flow**: +``` +Client Workflow + │ │ + │──── signal(approve) ────▶│ + │ │ (updates state) + │ │ + │◀──── (no response) ──────│ +``` + +## Queries + +**Purpose**: Read workflow state synchronously without modifying it. + +**When to Use**: +- Building dashboards showing workflow progress +- Health checks and monitoring +- Debugging workflow state +- Exposing current status to external systems + +**Characteristics**: +- Synchronous - caller waits for response +- Read-only - must not modify state +- Not recorded in history +- Executes on the worker, not persisted + +**Example Flow**: +``` +Client Workflow + │ │ + │──── query(status) ──────▶│ + │ │ (reads state) + │◀──── "processing" ───────│ +``` + +## Updates + +**Purpose**: Modify workflow state and receive a response synchronously. 
+ +**When to Use**: +- Operations that need confirmation (add item, return count) +- Validation before accepting changes +- Replace signal+query combinations +- Request-response patterns within workflow + +**Characteristics**: +- Synchronous - caller waits for completion +- Can mutate state AND return values +- Supports validators to reject invalid updates +- Recorded in history + +**Example Flow**: +``` +Client Workflow + │ │ + │──── update(addItem) ────▶│ + │ │ (validates, modifies state) + │◀──── {count: 5} ─────────│ +``` + +## Child Workflows + +**Purpose**: Break complex workflows into smaller, reusable pieces. + +**When to Use**: +- Prevent history from growing too large +- Isolate failure domains (child can fail without failing parent) +- Reuse workflow logic across multiple parents +- Different retry policies for different parts + +**Characteristics**: +- Own history (doesn't bloat parent) +- Independent lifecycle options (ParentClosePolicy) +- Can be cancelled independently +- Results returned to parent + +**Parent Close Policies**: +- `TERMINATE` - Child terminated when parent closes (default) +- `ABANDON` - Child continues running independently +- `REQUEST_CANCEL` - Cancellation requested but not forced + +## Continue-as-New + +**Purpose**: Prevent unbounded history growth by "restarting" with fresh history. + +**When to Use**: +- Long-running workflows (entity workflows, subscriptions) +- Workflows with many iterations +- When history approaches 10,000+ events +- Periodic cleanup of accumulated state + +**How It Works**: +``` +Workflow (history: 10,000 events) + │ + │ continueAsNew(currentState) + ▼ +New Workflow Execution (history: 0 events) + │ (same workflow ID, fresh history) + │ (receives currentState as input) +``` + +**Best Practice**: Check `historyLength` or `continueAsNewSuggested` periodically. + +## Saga Pattern + +**Purpose**: Distributed transactions with compensation for failures. + +**When to Use**: +- Multi-step operations that span services +- Operations requiring rollback on failure +- Financial transactions, order processing +- Booking systems with multiple reservations + +**How It Works**: +``` +Step 1: Reserve inventory + └─ Compensation: Release inventory + +Step 2: Charge payment + └─ Compensation: Refund payment + +Step 3: Ship order + └─ Compensation: Cancel shipment + +On failure at step 3: + Execute: Refund payment (step 2 compensation) + Execute: Release inventory (step 1 compensation) +``` + +**Implementation Pattern**: +1. Track compensation actions as you complete each step +2. On failure, execute compensations in reverse order +3. Handle compensation failures gracefully (log, alert, manual intervention) + +## Parallel Execution + +**Purpose**: Run multiple independent operations concurrently. + +**When to Use**: +- Processing multiple items that don't depend on each other +- Calling multiple APIs simultaneously +- Fan-out/fan-in patterns +- Reducing total workflow duration + +**Patterns**: +- `Promise.all()` / `asyncio.gather()` - Wait for all +- Partial failure handling - Continue with successful results + +## Entity Workflow Pattern + +**Purpose**: Model long-lived entities as workflows that handle events. 
+ +**When to Use**: +- Subscription management +- User sessions +- Shopping carts +- Any stateful entity receiving events over time + +**How It Works**: +``` +Entity Workflow (user-123) + │ + ├── Receives signal: AddItem + │ └── Updates state + │ + ├── Receives signal: UpdateQuantity + │ └── Updates state + │ + ├── Receives query: GetCart + │ └── Returns current state + │ + └── continueAsNew when history grows +``` + +## Timer Patterns + +**Purpose**: Durable delays that survive worker restarts. + +**Use Cases**: +- Scheduled reminders +- Timeout handling +- Delayed actions +- Polling with intervals + +**Characteristics**: +- Timers are durable (persisted in history) +- Can be cancelled +- Combine with cancellation scopes for timeouts + +## Polling Pattern + +**Purpose**: Repeatedly check external state until condition met. + +**Implementation**: +``` +while not condition_met: + result = await check_activity() + if result.done: + break + await sleep(poll_interval) +``` + +**Best Practice**: Use exponential backoff for polling intervals. + +## Choosing Between Patterns + +| Need | Pattern | +|------|---------| +| Send data, don't need response | Signal | +| Read state, no modification | Query | +| Modify state, need response | Update | +| Break down large workflow | Child Workflow | +| Prevent history growth | Continue-as-New | +| Rollback on failure | Saga | +| Process items concurrently | Parallel Execution | +| Long-lived stateful entity | Entity Workflow | diff --git a/references/tool-reference.md b/references/core/tool-reference.md similarity index 100% rename from references/tool-reference.md rename to references/core/tool-reference.md diff --git a/references/core/troubleshooting.md b/references/core/troubleshooting.md new file mode 100644 index 0000000..c931c9f --- /dev/null +++ b/references/core/troubleshooting.md @@ -0,0 +1,308 @@ +# Temporal Troubleshooting Guide + +## Workflow Diagnosis Decision Tree + +``` +Workflow not behaving as expected? +│ +├─▶ What is the workflow status? +│ │ +│ ├─▶ RUNNING (but no progress) +│ │ └─▶ Go to: "Workflow Stuck" section +│ │ +│ ├─▶ FAILED +│ │ └─▶ Go to: "Workflow Failed" section +│ │ +│ ├─▶ TIMED_OUT +│ │ └─▶ Go to: "Timeout Issues" section +│ │ +│ └─▶ COMPLETED (but wrong result) +│ └─▶ Go to: "Wrong Result" section +``` + +## Workflow Stuck (RUNNING but No Progress) + +### Decision Tree + +``` +Workflow stuck in RUNNING? +│ +├─▶ Is a worker running? +│ │ +│ ├─▶ NO: Start a worker +│ │ └─▶ scripts/ensure-worker.sh +│ │ +│ └─▶ YES: Is it on the correct task queue? +│ │ +│ ├─▶ NO: Start worker with correct task queue +│ │ +│ └─▶ YES: Check for non-determinism +│ │ +│ ├─▶ NondeterminismError in logs? +│ │ └─▶ Go to: "Non-Determinism" section +│ │ +│ └─▶ No errors? +│ └─▶ Check if workflow is waiting for signal/timer +``` + +### Common Causes + +1. **No worker running** + - Check: `scripts/list-workers.sh` + - Fix: `scripts/ensure-worker.sh ` + +2. **Worker on wrong task queue** + - Check: Worker logs for task queue name + - Fix: Start worker with matching task queue + +3. **Worker has stale code** + - Check: Worker startup time vs code changes + - Fix: Restart worker with updated code + +4. **Workflow waiting for signal** + - Check: Workflow history for pending signals + - Fix: Send expected signal or check signal sender + +5. **Activity stuck/timing out** + - Check: Activity retry attempts in history + - Fix: Investigate activity failure, increase timeout + +## Non-Determinism Errors + +### Decision Tree + +``` +NondeterminismError? 
+│ +├─▶ Was code intentionally changed? +│ │ +│ ├─▶ YES: Use patching API +│ │ └─▶ See: references/core/versioning.md +│ │ +│ └─▶ NO: Accidental change +│ │ +│ ├─▶ Can you identify the change? +│ │ │ +│ │ ├─▶ YES: Revert and restart worker +│ │ │ +│ │ └─▶ NO: Compare current code to expected history +│ │ └─▶ Check: Activity names, order, parameters +``` + +### Common Causes + +1. **Changed activity order** + ``` + # Before # After (BREAKS) + await activity_a await activity_b + await activity_b await activity_a + ``` + +2. **Changed activity name** + ``` + # Before # After (BREAKS) + await process_order(...) await handle_order(...) + ``` + +3. **Added/removed activity call** + - Adding new activity mid-workflow + - Removing activity that was previously called + +4. **Using non-deterministic values** + - `datetime.now()` in workflow (use `workflow.now()`) + - `random.random()` in workflow (use `workflow.random()`) + +### Recovery + +**Accidental Change:** +1. Identify the change +2. Revert code to match history +3. Restart worker +4. Workflow automatically recovers + +**Intentional Change:** +1. Use patching API for gradual migration +2. Or terminate old workflows, start new ones + +## Workflow Failed + +### Decision Tree + +``` +Workflow status = FAILED? +│ +├─▶ Check workflow error message +│ │ +│ ├─▶ Application error (your code) +│ │ └─▶ Fix the bug, start new workflow +│ │ +│ ├─▶ NondeterminismError +│ │ └─▶ Go to: "Non-Determinism" section +│ │ +│ └─▶ Timeout error +│ └─▶ Go to: "Timeout Issues" section +``` + +### Common Causes + +1. **Unhandled exception in workflow** + - Check error message and stack trace + - Fix bug in workflow code + +2. **Activity exhausted retries** + - All retry attempts failed + - Check activity logs for root cause + +3. **Non-retryable error thrown** + - Error marked as non-retryable + - Intentional failure, check business logic + +## Timeout Issues + +### Timeout Types + +| Timeout | Scope | What It Limits | +|---------|-------|----------------| +| `WorkflowExecutionTimeout` | Entire workflow | Total time including retries and continue-as-new | +| `WorkflowRunTimeout` | Single run | Time for one run (before continue-as-new) | +| `ScheduleToCloseTimeout` | Activity | Total time including retries | +| `StartToCloseTimeout` | Activity | Single attempt time | +| `HeartbeatTimeout` | Activity | Time between heartbeats | + +### Diagnosis + +``` +Timeout error? +│ +├─▶ Which timeout? +│ │ +│ ├─▶ Workflow timeout +│ │ └─▶ Increase timeout or optimize workflow +│ │ +│ ├─▶ ScheduleToCloseTimeout +│ │ └─▶ Activity taking too long overall (including retries) +│ │ +│ ├─▶ StartToCloseTimeout +│ │ └─▶ Single activity attempt too slow +│ │ +│ └─▶ HeartbeatTimeout +│ └─▶ Activity not heartbeating frequently enough +│ └─▶ Add heartbeat() calls in long activities +``` + +### Fixes + +1. **Increase timeout** if operation legitimately takes longer +2. **Add heartbeats** to long-running activities +3. **Optimize activity** to complete faster +4. **Break into smaller activities** for better granularity + +## Activity Keeps Retrying + +### Decision Tree + +``` +Activity retrying repeatedly? +│ +├─▶ Check activity error +│ │ +│ ├─▶ Transient error (network, timeout) +│ │ └─▶ Expected behavior, will eventually succeed +│ │ +│ ├─▶ Permanent error (bug, invalid input) +│ │ └─▶ Fix the bug or mark as non-retryable +│ │ +│ └─▶ Resource exhausted +│ └─▶ Add backoff, check rate limits +``` + +### Common Causes + +1. 
**Bug in activity code**
+   - Fix the bug
+   - Consider marking certain errors as non-retryable
+
+2. **External service down**
+   - Retries are working as intended
+   - Monitor service recovery
+
+3. **Invalid input**
+   - Validate inputs before activity
+   - Return non-retryable error for bad input
+
+## Wrong Result (Completed but Incorrect)
+
+### Diagnosis
+
+1. **Check workflow history** for unexpected activity results
+2. **Verify activity implementations** produce correct output
+3. **Check for race conditions** in parallel execution
+4. **Verify signal handling** if signals are involved
+
+### Common Causes
+
+1. **Activity bug** - Wrong logic in activity
+2. **Stale data** - Activity using outdated information
+3. **Signal ordering** - Signals processed in unexpected order
+4. **Parallel execution** - Race condition in concurrent operations
+
+## Worker Issues
+
+### Worker Not Starting
+
+```
+Worker won't start?
+│
+├─▶ Connection error
+│   └─▶ Check Temporal server is running
+│       └─▶ scripts/ensure-server.sh
+│
+├─▶ Registration error
+│   └─▶ Check workflow/activity definitions are valid
+│
+└─▶ Import error
+    └─▶ Check Python imports, TypeScript bundling
+```
+
+### Worker Crashing
+
+1. **Out of memory** - Reduce concurrent tasks, check for leaks
+2. **Unhandled exception** - Add error handling
+3. **Dependency issue** - Check package versions
+
+## Useful Commands
+
+```bash
+# Check Temporal server
+temporal server start-dev
+
+# List workflows
+temporal workflow list
+
+# Describe specific workflow
+temporal workflow describe --workflow-id <workflow-id>
+
+# Show workflow history
+temporal workflow show --workflow-id <workflow-id>
+
+# Terminate stuck workflow
+temporal workflow terminate --workflow-id <workflow-id>
+
+# Reset workflow to specific point
+temporal workflow reset --workflow-id <workflow-id> --event-id <event-id>
+```
+
+## Quick Reference: Status → Action
+
+| Status | First Check | Common Fix |
+|--------|-------------|------------|
+| RUNNING (stuck) | Worker running? | Start/restart worker |
+| FAILED | Error message | Fix bug, handle error |
+| TIMED_OUT | Which timeout? | Increase timeout or optimize |
+| TERMINATED | Who terminated? | Check audit log |
+| CANCELED | Cancellation source | Expected or investigate |
+
+## See Also
+
+- [Common Gotchas](common-gotchas.md) - Anti-patterns that cause these issues
+- [Error Reference](error-reference.md) - Quick error type lookup
diff --git a/references/core/versioning.md b/references/core/versioning.md
new file mode 100644
index 0000000..4c1e92d
--- /dev/null
+++ b/references/core/versioning.md
@@ -0,0 +1,172 @@
+# Workflow Versioning Concepts
+
+## Overview
+
+Workflow versioning allows safe deployment of code changes without breaking running workflows. Three approaches are available:
+
+1. **Patching API** - Code-level version branching
+2. **Workflow Type Versioning** - New workflow types for incompatible changes
+3. **Worker Versioning** - Deployment-level control with Build IDs
+
+## Why Versioning is Needed
+
+When workers restart after deployment, they resume open workflows through history replay. If updated code produces different Commands than the original code, it causes non-determinism errors.
+
+```
+Original Code (recorded in history):
+  await activity_a()
+  await activity_b()
+
+Updated Code (during replay):
+  await activity_a()
+  await activity_c()  ← Different! NondeterminismError
+```
+
+## Approach 1: Patching API
+
+### Concept
+
+The patching API lets you branch code based on whether a workflow was started before or after a code change.
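+
+In the Python SDK this is exposed as `workflow.patched()` (with
+`workflow.deprecate_patch()` for the deprecation phase). A minimal sketch;
+the workflow and activity names are illustrative, and the activities are
+assumed to be defined elsewhere:
+
+```python
+from datetime import timedelta
+
+from temporalio import workflow
+
+@workflow.defn
+class OrderWorkflow:
+    @workflow.run
+    async def run(self) -> None:
+        if workflow.patched("add-fraud-check"):
+            # New path: new executions record the patch marker; replaying
+            # executions take this branch only if the marker is in history.
+            await workflow.execute_activity(
+                fraud_check,
+                start_to_close_timeout=timedelta(seconds=30),
+            )
+        await workflow.execute_activity(
+            process_order,
+            start_to_close_timeout=timedelta(seconds=30),
+        )
+```
+
+Language-agnostic, the branching has this shape: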
+ +``` +if patched("my-change"): + // New code path (for new and replaying new workflows) +else: + // Old code path (for replaying old workflows) +``` + +### Three-Phase Lifecycle + +**Phase 1: Patch In** +- Add both old and new code paths +- New workflows take new path, old workflows take old path + +**Phase 2: Deprecate** +- After all old workflows complete, remove old code +- Keep deprecation marker for history compatibility + +**Phase 3: Remove** +- After all deprecated workflows complete +- Remove patch entirely, only new code remains + +### When to Use + +- Adding new activities or steps +- Changing activity parameters +- Reordering operations +- Any change that would cause non-determinism + +### When NOT to Use + +- Changes to activity implementations (activities aren't replayed) +- Adding new signal/query handlers (additive changes are safe) +- Bug fixes that don't change Command sequence + +## Approach 2: Workflow Type Versioning + +### Concept + +Create a new workflow type (e.g., `OrderWorkflowV2`) instead of patching. + +``` +// Old: OrderWorkflow +// New: OrderWorkflowV2 (completely new implementation) +``` + +### When to Use + +- Major incompatible changes +- Complete rewrites +- When patching would be too complex +- When you want clean separation + +### Process + +1. Create new workflow type with new name +2. Register both with worker +3. Start new workflows with new type +4. Wait for old workflows to complete +5. Remove old workflow type + +## Approach 3: Worker Versioning + +### Concept + +Manage versions at deployment level using Build IDs. Multiple worker versions can run simultaneously. + +``` +Worker v1.0 (Build ID: abc123) + └── Handles workflows started on this version + +Worker v2.0 (Build ID: def456) + └── Handles new workflows + └── Can also handle upgraded old workflows +``` + +### Key Concepts + +**Worker Deployment**: Logical service grouping (e.g., "order-service") + +**Build ID**: Specific code version (e.g., git commit hash) + +**Versioning Behaviors**: +- `PINNED` - Workflows stay on original worker version +- `AUTO_UPGRADE` - Workflows can move to newer versions + +### When to Use PINNED + +- Short-running workflows (minutes to hours) +- Consistency is critical +- Want simplest development experience +- Building new applications + +### When to Use AUTO_UPGRADE + +- Long-running workflows (weeks or months) +- Workflows need bug fixes during execution +- Still requires patching for version transitions + +## Choosing an Approach + +| Scenario | Recommended Approach | +|----------|---------------------| +| Small change, few running workflows | Patching API | +| Major rewrite | Workflow Type Versioning | +| Many short workflows, frequent deploys | Worker Versioning (PINNED) | +| Long-running workflows needing updates | Worker Versioning (AUTO_UPGRADE) + Patching | +| Quick fix, can wait for completion | Wait for workflows to complete | + +## Best Practices + +1. **Check for open executions** before removing old code +2. **Use descriptive patch IDs** (e.g., "add-fraud-check" not "patch-1") +3. **Deploy incrementally**: patch → deprecate → remove +4. **Test replay compatibility** before deploying changes +5. 
**Monitor old workflow counts** during migration
+
+## Finding Workflows by Version
+
+```bash
+# Find workflows with specific patch
+temporal workflow list --query \
+  'WorkflowType = "OrderWorkflow" AND TemporalChangeVersion = "add-fraud-check"'
+
+# Find pre-patch workflows
+temporal workflow list --query \
+  'WorkflowType = "OrderWorkflow" AND TemporalChangeVersion IS NULL'
+
+# Find workflows on specific worker version
+temporal workflow list --query \
+  'TemporalWorkerDeploymentVersion = "my-service:v1.0.0"'
+```
+
+## Common Mistakes
+
+1. **Removing old code too early** - Breaks replaying workflows
+2. **Not testing with replay** - Misses incompatibilities before production
+3. **Patching non-Command changes** - Unnecessary complexity
+4. **Forgetting to deprecate** - Accumulates dead code
+
+For language-specific implementation details, see:
+- `references/python/versioning.md`
+- `references/typescript/versioning.md`
diff --git a/references/python/advanced-features.md b/references/python/advanced-features.md
new file mode 100644
index 0000000..3fffa77
--- /dev/null
+++ b/references/python/advanced-features.md
@@ -0,0 +1,378 @@
+# Python SDK Advanced Features
+
+## Continue-as-New
+
+Use continue-as-new to prevent unbounded history growth in long-running workflows.
+
+```python
+@workflow.defn
+class BatchProcessingWorkflow:
+    @workflow.run
+    async def run(self, state: ProcessingState) -> str:
+        while not state.is_complete:
+            # Process next batch
+            state = await workflow.execute_activity(
+                process_batch, state,
+                schedule_to_close_timeout=timedelta(minutes=5),
+            )
+
+            # Check history size and continue-as-new if needed
+            if workflow.info().get_current_history_length() > 10000:
+                workflow.continue_as_new(args=[state])
+
+        return "completed"
+```
+
+### Continue-as-New with Different Arguments
+
+```python
+# Continue with modified state
+workflow.continue_as_new(args=[new_state])
+
+# Continue with memo and search attributes
+workflow.continue_as_new(
+    args=[new_state],
+    memo={"last_processed": item_id},
+    search_attributes=TypedSearchAttributes([
+        SearchAttributePair(BATCH_NUMBER, state.batch + 1),
+    ]),
+)
+```
+
+## Workflow Updates
+
+Updates allow synchronous interaction with running workflows.
+
+### Defining Update Handlers
+
+```python
+@workflow.defn
+class OrderWorkflow:
+    def __init__(self):
+        self._items: list[str] = []
+
+    @workflow.update
+    async def add_item(self, item: str) -> int:
+        """Add item and return new count."""
+        self._items.append(item)
+        return len(self._items)
+
+    @workflow.update
+    async def add_item_with_validation(self, item: str) -> int:
+        """Update with validation."""
+        # Runs only after the validator accepts the update; raising here
+        # fails the update, and the update is still recorded in history
+        if not item:
+            raise ValueError("Item cannot be empty")
+        self._items.append(item)
+        return len(self._items)
+
+    # Validator runs before the update is accepted; rejected updates are
+    # not recorded in workflow history
+    @add_item_with_validation.validator
+    def validate_add_item(self, item: str) -> None:
+        if len(self._items) >= 100:
+            raise ValueError("Order is full")
+```
+
+### Calling Updates from Client
+
+```python
+handle = client.get_workflow_handle("order-123")
+
+# Execute update and wait for result
+count = await handle.execute_update(
+    OrderWorkflow.add_item,
+    "new-item",
+)
+print(f"Order now has {count} items")
+```
+
+## Schedules
+
+Create recurring workflow executions.
+
+```python
+from temporalio.client import (
+    Schedule,
+    ScheduleActionStartWorkflow,
+    ScheduleSpec,
+    ScheduleIntervalSpec,
+)
+
+# Create a schedule
+schedule_id = "daily-report"
+await client.create_schedule(
+    schedule_id,
+    Schedule(
+        action=ScheduleActionStartWorkflow(
+            DailyReportWorkflow.run,
+            id="daily-report",
+            task_queue="reports",
+        ),
+        spec=ScheduleSpec(
+            intervals=[ScheduleIntervalSpec(every=timedelta(days=1))],
+        ),
+    ),
+)
+
+# Manage schedules
+schedule = client.get_schedule_handle(schedule_id)
+await schedule.pause(note="Maintenance window")
+await schedule.unpause()
+await schedule.trigger()  # Run immediately
+await schedule.delete()
+```
+
+## Interceptors
+
+Interceptors allow cross-cutting concerns like logging, metrics, and auth.
+
+### Creating a Custom Activity Interceptor
+
+The interceptor pattern uses a chain of interceptors. You create an `Interceptor` class that returns specialized inbound interceptors for activities and workflows.
+
+```python
+from typing import Any
+
+from temporalio import activity
+from temporalio.worker import (
+    Interceptor,
+    ActivityInboundInterceptor,
+    ExecuteActivityInput,
+)
+
+class LoggingActivityInboundInterceptor(ActivityInboundInterceptor):
+    async def execute_activity(self, input: ExecuteActivityInput) -> Any:
+        activity.logger.info(f"Activity starting: {input.fn.__name__}")
+        try:
+            # Delegate to next interceptor in chain
+            result = await self.next.execute_activity(input)
+            activity.logger.info(f"Activity completed: {input.fn.__name__}")
+            return result
+        except Exception as e:
+            activity.logger.error(f"Activity failed: {e}")
+            raise
+
+class LoggingInterceptor(Interceptor):
+    def intercept_activity(
+        self,
+        next: ActivityInboundInterceptor,
+    ) -> ActivityInboundInterceptor:
+        # Return our interceptor wrapping the next one
+        return LoggingActivityInboundInterceptor(next)
+
+# Apply to worker
+worker = Worker(
+    client,
+    task_queue="my-queue",
+    workflows=[MyWorkflow],
+    activities=[my_activity],
+    interceptors=[LoggingInterceptor()],
+)
+```
+
+### Creating a Custom Workflow Interceptor
+
+```python
+from typing import Any
+
+from temporalio import workflow
+from temporalio.worker import (
+    Interceptor,
+    WorkflowInboundInterceptor,
+    WorkflowInterceptorClassInput,
+    ExecuteWorkflowInput,
+)
+
+class LoggingWorkflowInboundInterceptor(WorkflowInboundInterceptor):
+    async def execute_workflow(self, input: ExecuteWorkflowInput) -> Any:
+        workflow.logger.info(f"Workflow starting: {input.type}")
+        try:
+            result = await self.next.execute_workflow(input)
+            workflow.logger.info(f"Workflow completed: {input.type}")
+            return result
+        except Exception as e:
+            workflow.logger.error(f"Workflow failed: {e}")
+            raise
+
+class LoggingInterceptor(Interceptor):
+    def workflow_interceptor_class(
+        self,
+        input: WorkflowInterceptorClassInput,
+    ) -> type[WorkflowInboundInterceptor] | None:
+        return LoggingWorkflowInboundInterceptor
+```
+
+## Dynamic Workflows and Activities
+
+Handle workflows/activities not known at compile time.
+
+### Dynamic Workflow Handler
+
+```python
+@workflow.defn(dynamic=True)
+class DynamicWorkflow:
+    @workflow.run
+    async def run(self, args: Sequence[RawValue]) -> Any:
+        workflow_type = workflow.info().workflow_type
+        # Route based on type
+        if workflow_type == "order-workflow":
+            return await self._handle_order(args)
+        elif workflow_type == "refund-workflow":
+            return await self._handle_refund(args)
+```
+
+### Dynamic Activity Handler
+
+```python
+@activity.defn(dynamic=True)
+async def dynamic_activity(args: Sequence[RawValue]) -> Any:
+    activity_type = activity.info().activity_type
+    # Handle based on type
+    ...
+```
+
+## Async Activity Completion
+
+For activities that complete asynchronously (e.g., human tasks, external callbacks).
+
+```python
+from temporalio import activity
+from temporalio.client import Client
+from temporalio.exceptions import ApplicationError
+
+@activity.defn
+async def request_approval(request_id: str) -> None:
+    # Get task token for async completion
+    task_token = activity.info().task_token
+
+    # Store task token for later completion (e.g., in database)
+    await store_task_token(request_id, task_token)
+
+    # Raise to indicate async completion
+    activity.raise_complete_async()
+
+# Later, complete the activity from another process
+async def complete_approval(request_id: str, approved: bool):
+    client = await Client.connect("localhost:7233")
+    task_token = await get_task_token(request_id)
+
+    if approved:
+        await client.get_async_activity_handle(task_token=task_token).complete("approved")
+    else:
+        await client.get_async_activity_handle(task_token=task_token).fail(
+            ApplicationError("Rejected")
+        )
+```
+
+## Sandbox Customization
+
+The Python SDK runs workflows in a sandbox to ensure determinism. You can customize sandbox restrictions when needed.
+
+### Passing Through Modules
+
+If you need to use modules that are blocked by the sandbox:
+
+```python
+from temporalio.worker import Worker
+from temporalio.worker.workflow_sandbox import (
+    SandboxedWorkflowRunner,
+    SandboxRestrictions,
+)
+
+# Allow specific modules through the sandbox
+restrictions = SandboxRestrictions.default.with_passthrough_modules("my_module")
+
+worker = Worker(
+    client,
+    task_queue="my-queue",
+    workflows=[MyWorkflow],
+    workflow_runner=SandboxedWorkflowRunner(
+        restrictions=restrictions,
+    ),
+)
+```
+
+### Passing Through All Modules (Use with Caution)
+
+```python
+# Disable module restrictions entirely - use only if you trust all code
+restrictions = SandboxRestrictions.default.with_passthrough_all_modules()
+```
+
+### Temporary Passthrough in Workflow Code
+
+```python
+@workflow.run
+async def run(self) -> str:
+    # Temporarily disable sandbox restrictions for imports
+    with workflow.unsafe.imports_passed_through():
+        import some_restricted_module
+    # Use the module...
+```
+
+### Customizing Invalid Module Members
+
+```python
+import dataclasses
+
+from temporalio.worker.workflow_sandbox import SandboxMatcher, SandboxRestrictions
+
+# Block additional members on top of the defaults, e.g. datetime.datetime.now
+# (matcher composition may vary slightly by SDK version)
+restrictions = dataclasses.replace(
+    SandboxRestrictions.default,
+    invalid_module_members=SandboxRestrictions.invalid_module_members_default
+    | SandboxMatcher(
+        children={"datetime": SandboxMatcher(children={"datetime": SandboxMatcher(use={"now"})})},
+    ),
+)
+```
+
+## Gevent Compatibility Warning
+
+**The Python SDK is NOT compatible with gevent.** Gevent's monkey patching replaces Python's blocking primitives in ways that are incompatible with the SDK's asyncio-based runtime.
+
+If your application uses gevent:
+- You cannot run Temporal workers in the same process
+- Consider running workers in a separate process without gevent
+- Use a message queue or HTTP API to communicate between gevent and Temporal processes
+
+## Worker Tuning
+
+Configure worker performance settings.
+
+```python
+from concurrent.futures import ThreadPoolExecutor
+from datetime import timedelta
+
+worker = Worker(
+    client,
+    task_queue="my-queue",
+    workflows=[MyWorkflow],
+    activities=[my_activity],
+    # Workflow task concurrency
+    max_concurrent_workflow_tasks=100,
+    # Activity task concurrency
+    max_concurrent_activities=100,
+    # Executor for sync activities
+    activity_executor=ThreadPoolExecutor(max_workers=50),
+    # Graceful shutdown timeout
+    graceful_shutdown_timeout=timedelta(seconds=30),
+)
+```
+
+## Workflow Init Decorator
+
+Use `@workflow.init` to initialize workflow state from the workflow's input. The decorated `__init__` must accept the same arguments as the `@workflow.run` method, and it runs before any signal, query, or update handler executes.
+
+```python
+@workflow.defn
+class MyWorkflow:
+    @workflow.init
+    def __init__(self, initial_value: str) -> None:
+        # Receives the same arguments as run(); state is initialized
+        # before any handler can observe it
+        self._value = initial_value
+        self._items: list[str] = []
+
+    @workflow.run
+    async def run(self, initial_value: str) -> str:
+        # self._value and self._items are already initialized
+        return self._value
+```
+
+## Workflow Failure Exception Types
+
+Control which exceptions cause workflow task failures vs workflow failures:
+
+```python
+@workflow.defn(
+    # These exception types will fail the workflow execution (not just the task)
+    failure_exception_types=[ValueError, CustomBusinessError]
+)
+class MyWorkflow:
+    @workflow.run
+    async def run(self) -> str:
+        raise ValueError("This fails the workflow, not just the task")
+```
diff --git a/references/python/ai-patterns.md b/references/python/ai-patterns.md
new file mode 100644
index 0000000..9a3ef62
--- /dev/null
+++ b/references/python/ai-patterns.md
@@ -0,0 +1,362 @@
+# Python AI/LLM Integration Patterns
+
+## Overview
+
+This document provides Python-specific implementation details for integrating LLMs with Temporal. For conceptual patterns, see `references/core/ai-integration.md`.
+ +## Pydantic Data Converter Setup + +**Required** for handling complex types like OpenAI response objects: + +```python +from temporalio.client import Client +from temporalio.contrib.pydantic import pydantic_data_converter + +client = await Client.connect( + "localhost:7233", + data_converter=pydantic_data_converter, +) +``` + +## OpenAI Client Configuration + +**Critical**: Disable client retries, let Temporal handle them: + +```python +from openai import AsyncOpenAI + +openai_client = AsyncOpenAI( + api_key=os.getenv("OPENAI_API_KEY"), + max_retries=0, # CRITICAL: Disable client retries + timeout=30.0, +) +``` + +## LiteLLM Configuration + +For multi-model support: + +```python +import litellm + +litellm.num_retries = 0 # Disable LiteLLM retries +``` + +## Generic LLM Activity + +Flexible, reusable activity for LLM calls: + +```python +from temporalio import activity +from pydantic import BaseModel +from typing import Optional, Any + +class LLMRequest(BaseModel): + model: str + system_prompt: str + user_input: str + tools: Optional[list] = None + response_format: Optional[type] = None + temperature: float = 0.7 + +class LLMResponse(BaseModel): + content: str + tool_calls: Optional[list] = None + usage: dict + +@activity.defn +async def call_llm(request: LLMRequest) -> LLMResponse: + """Generic LLM activity supporting multiple use cases.""" + response = await openai_client.chat.completions.create( + model=request.model, + messages=[ + {"role": "system", "content": request.system_prompt}, + {"role": "user", "content": request.user_input}, + ], + tools=request.tools, + temperature=request.temperature, + ) + + return LLMResponse( + content=response.choices[0].message.content or "", + tool_calls=response.choices[0].message.tool_calls, + usage=response.usage.model_dump(), + ) +``` + +## Activity Retry Policy + +Configure retries at the workflow level: + +```python +from datetime import timedelta +from temporalio import workflow +from temporalio.common import RetryPolicy + +with workflow.unsafe.imports_passed_through(): + from activities.llm import call_llm, LLMRequest + +@workflow.defn +class LLMWorkflow: + @workflow.run + async def run(self, prompt: str) -> str: + response = await workflow.execute_activity( + call_llm, + LLMRequest( + model="gpt-4", + system_prompt="You are a helpful assistant.", + user_input=prompt, + ), + start_to_close_timeout=timedelta(seconds=30), + retry_policy=RetryPolicy( + initial_interval=timedelta(seconds=1), + backoff_coefficient=2.0, + maximum_interval=timedelta(seconds=60), + maximum_attempts=5, + non_retryable_error_types=["InvalidAPIKeyError"], + ), + ) + return response.content +``` + +## Tool-Calling Agent Workflow + +```python +from temporalio import workflow +from datetime import timedelta + +with workflow.unsafe.imports_passed_through(): + from activities.llm import call_llm, LLMRequest, LLMResponse + from activities.tools import execute_tool + from models.tools import ToolDefinition + +@workflow.defn +class AgentWorkflow: + @workflow.run + async def run(self, user_request: str, tools: list[ToolDefinition]) -> str: + messages = [] + + while True: + # Phase 1: Get LLM response with tools + response = await workflow.execute_activity( + call_llm, + LLMRequest( + model="gpt-4", + system_prompt="You are a helpful agent with tools.", + user_input=user_request, + tools=[t.to_openai_format() for t in tools], + ), + start_to_close_timeout=timedelta(seconds=30), + ) + + # Check if LLM wants to use a tool + if not response.tool_calls: + return response.content + + 
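+            # Each execute_activity call below is a separate, durable step:
+            # tool results are recorded in history, so a worker crash
+            # mid-loop resumes here instead of re-calling the LLM.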
# Phase 2: Execute tools
+            for tool_call in response.tool_calls:
+                tool_result = await workflow.execute_activity(
+                    execute_tool,
+                    tool_call,
+                    start_to_close_timeout=timedelta(seconds=60),
+                )
+                messages.append({
+                    "role": "tool",
+                    "tool_call_id": tool_call.id,
+                    "content": tool_result,
+                })
+
+            # Phase 3: Continue conversation with tool results
+            user_request = f"Tool results: {messages}"
+```
+
+## Structured Outputs
+
+Using Pydantic for validated responses:
+
+```python
+from pydantic import BaseModel
+from temporalio import activity
+
+class AnalysisResult(BaseModel):
+    sentiment: str
+    confidence: float
+    key_topics: list[str]
+    summary: str
+
+@activity.defn
+async def analyze_text(text: str) -> AnalysisResult:
+    response = await openai_client.beta.chat.completions.parse(
+        model="gpt-4o",
+        messages=[
+            {"role": "system", "content": "Analyze the following text."},
+            {"role": "user", "content": text},
+        ],
+        response_format=AnalysisResult,
+    )
+    return response.choices[0].message.parsed
+```
+
+## Rate Limit Handling
+
+Parse rate limit headers and raise retryable errors:
+
+```python
+import openai
+
+from temporalio import activity
+from temporalio.exceptions import ApplicationError
+
+@activity.defn
+async def call_llm_with_rate_limit(request: LLMRequest) -> LLMResponse:
+    try:
+        response = await openai_client.chat.completions.create(...)
+        return LLMResponse(...)
+    except openai.RateLimitError as e:
+        # Extract retry-after if available
+        retry_after = e.response.headers.get("retry-after", 30)
+        raise ApplicationError(
+            f"Rate limited, retry after {retry_after}s",
+            non_retryable=False,  # Allow Temporal to retry
+        )
+```
+
+## Multi-Agent Pipeline (Deep Research)
+
+```python
+from temporalio import workflow
+from datetime import timedelta
+import asyncio
+
+with workflow.unsafe.imports_passed_through():
+    from activities.research import (
+        generate_subtopics,
+        generate_search_queries,
+        search_web,
+        synthesize_report,
+    )
+
+@workflow.defn
+class DeepResearchWorkflow:
+    @workflow.run
+    async def run(self, topic: str) -> str:
+        # Phase 1: Planning
+        subtopics = await workflow.execute_activity(
+            generate_subtopics,
+            topic,
+            start_to_close_timeout=timedelta(seconds=60),
+        )
+
+        # Phase 2: Query Generation
+        queries = await workflow.execute_activity(
+            generate_search_queries,
+            subtopics,
+            start_to_close_timeout=timedelta(seconds=60),
+        )
+
+        # Phase 3: Parallel Web Search (resilient to partial failures)
+        search_tasks = [
+            workflow.execute_activity(
+                search_web,
+                query,
+                start_to_close_timeout=timedelta(seconds=300),
+            )
+            for query in queries
+        ]
+
+        # Continue with partial results on failure
+        results = await asyncio.gather(*search_tasks, return_exceptions=True)
+        successful_results = [r for r in results if not isinstance(r, Exception)]
+
+        # Phase 4: Synthesis
+        report = await workflow.execute_activity(
+            synthesize_report,
+            {"topic": topic, "research": successful_results},
+            start_to_close_timeout=timedelta(seconds=300),
+        )
+
+        return report
+```
+
+## OpenAI Agents SDK Integration
+
+Temporal ships an OpenAI Agents SDK integration as a contrib module (`temporalio.contrib.openai_agents`). The sketch below is illustrative; entry points may vary by SDK version, and `search_tool` and `calculator_tool` are assumed to be defined elsewhere:
+
+```python
+from temporalio import workflow
+from temporalio.client import Client
+from temporalio.contrib.openai_agents import OpenAIAgentsPlugin
+
+from agents import Agent, Runner
+
+# Register the plugin when connecting; it runs model calls as activities
+client = await Client.connect("localhost:7233", plugins=[OpenAIAgentsPlugin()])
+
+@workflow.defn
+class DurableAgentWorkflow:
+    @workflow.run
+    async def run(self, task: str) -> str:
+        agent = Agent(
+            name="assistant",
+            instructions="You are a helpful agent.",
+            tools=[search_tool, calculator_tool],
+        )
+        result = await Runner.run(agent, input=task)
+        return result.final_output
+```
+
+## Testing with Mocks
+
+```python
+import pytest
+from temporalio.testing import WorkflowEnvironment
+from temporalio import activity
+from temporalio.worker import Worker
+
+@pytest.fixture
+async def workflow_environment():
+    async with await WorkflowEnvironment.start_time_skipping() as env:
+        yield env
+
+async def test_llm_workflow(workflow_environment):
+    # Mock LLM activity: register it under the real activity's name
+    @activity.defn(name="call_llm")
+    async def mock_call_llm(request: LLMRequest) -> LLMResponse:
+        return LLMResponse(
+            content="Mocked response",
+            tool_calls=None,
+            usage={"total_tokens": 100},
+        )
+
+    async with Worker(
+        workflow_environment.client,
+        task_queue="test-queue",
+        workflows=[LLMWorkflow],
+        activities=[mock_call_llm],  # Use mock
+    ):
+        result = await workflow_environment.client.execute_workflow(
+            LLMWorkflow.run,
+            "test prompt",
+            id="test-workflow",
+            task_queue="test-queue",
+        )
+        assert result == "Mocked response"
+```
+
+## Timeout Recommendations
+
+```python
+# Simple LLM calls (GPT-4, Claude-3)
+start_to_close_timeout=timedelta(seconds=30)
+
+# Reasoning models (o1, o3)
+start_to_close_timeout=timedelta(seconds=300)
+
+# Web searches
+start_to_close_timeout=timedelta(seconds=300)
+
+# Tool execution
+start_to_close_timeout=timedelta(seconds=60)
+```
+
+## Best Practices
+
+1. **Always use Pydantic data converter** for complex types
+2. **Disable retries in LLM clients** (max_retries=0)
+3. **Set appropriate timeouts** per operation type
+4. **Use structured outputs** for type safety
+5. **Handle partial failures** in parallel operations
+6. **Mock activities in tests** for fast, deterministic testing
+7. **Log token usage** for cost tracking
+8. **Version prompts** in code for reproducibility
diff --git a/references/python/data-handling.md b/references/python/data-handling.md
new file mode 100644
index 0000000..c4e16a5
--- /dev/null
+++ b/references/python/data-handling.md
@@ -0,0 +1,230 @@
+# Python SDK Data Handling
+
+## Overview
+
+The Python SDK uses data converters to serialize/deserialize workflow inputs, outputs, and activity parameters.
+
+## Default Data Converter
+
+The default converter handles:
+- `None`
+- `bytes` (as binary)
+- Protobuf messages
+- JSON-serializable types (dict, list, str, int, float, bool)
+
+## Pydantic Integration
+
+Use Pydantic models for validated, typed data.
+
+```python
+from pydantic import BaseModel
+from temporalio.contrib.pydantic import pydantic_data_converter
+
+class OrderInput(BaseModel):
+    order_id: str
+    items: list[str]
+    total: float
+    customer_email: str
+
+class OrderResult(BaseModel):
+    order_id: str
+    status: str
+    tracking_number: str | None = None
+
+@workflow.defn
+class OrderWorkflow:
+    @workflow.run
+    async def run(self, input: OrderInput) -> OrderResult:
+        # Pydantic validation happens automatically
+        return OrderResult(
+            order_id=input.order_id,
+            status="completed",
+            tracking_number="TRK123",
+        )
+
+# Configure client with Pydantic support
+client = await Client.connect(
+    "localhost:7233",
+    data_converter=pydantic_data_converter,
+)
+```
+
+## Custom Data Converter
+
+Create custom converters for special serialization needs.
+
+```python
+from temporalio.converter import (
+    DataConverter,
+    PayloadConverter,
+    DefaultPayloadConverter,
+)
+
+class CustomPayloadConverter(PayloadConverter):
+    # Implement to_payloads() and from_payloads()
+    pass
+
+custom_converter = DataConverter(
+    payload_converter_class=CustomPayloadConverter,
+)
+
+client = await Client.connect(
+    "localhost:7233",
+    data_converter=custom_converter,
+)
+```
+
+## Payload Encryption
+
+Encrypt sensitive workflow data.
+
+```python
+from temporalio.converter import PayloadCodec
+from temporalio.api.common.v1 import Payload
+from cryptography.fernet import Fernet
+from typing import Sequence
+
+class EncryptionCodec(PayloadCodec):
+    def __init__(self, key: bytes):
+        self._fernet = Fernet(key)
+
+    async def encode(self, payloads: Sequence[Payload]) -> list[Payload]:
+        return [
+            Payload(
+                metadata={"encoding": b"binary/encrypted"},
+                data=self._fernet.encrypt(p.SerializeToString()),
+            )
+            for p in payloads
+        ]
+
+    async def decode(self, payloads: Sequence[Payload]) -> list[Payload]:
+        result = []
+        for p in payloads:
+            if p.metadata.get("encoding") == b"binary/encrypted":
+                decrypted = self._fernet.decrypt(p.data)
+                decoded = Payload()
+                decoded.ParseFromString(decrypted)
+                result.append(decoded)
+            else:
+                result.append(p)
+        return result
+
+# Apply encryption codec
+client = await Client.connect(
+    "localhost:7233",
+    data_converter=DataConverter(
+        payload_codec=EncryptionCodec(encryption_key),
+    ),
+)
+```
+
+## Search Attributes
+
+Custom searchable fields for workflow visibility.
+
+```python
+from temporalio.common import (
+    SearchAttributeKey,
+    SearchAttributePair,
+    TypedSearchAttributes,
+)
+
+# Define typed keys
+ORDER_ID = SearchAttributeKey.for_keyword("OrderId")
+ORDER_STATUS = SearchAttributeKey.for_keyword("OrderStatus")
+ORDER_TOTAL = SearchAttributeKey.for_float("OrderTotal")
+CREATED_AT = SearchAttributeKey.for_datetime("CreatedAt")
+
+# Set at workflow start
+await client.execute_workflow(
+    OrderWorkflow.run,
+    order,
+    id=f"order-{order.id}",
+    task_queue="orders",
+    search_attributes=TypedSearchAttributes([
+        SearchAttributePair(ORDER_ID, order.id),
+        SearchAttributePair(ORDER_STATUS, "pending"),
+        SearchAttributePair(ORDER_TOTAL, order.total),
+        SearchAttributePair(CREATED_AT, datetime.now(timezone.utc)),
+    ]),
+)
+
+# Upsert from within workflow
+workflow.upsert_search_attributes([
+    ORDER_STATUS.value_set("completed"),
+])
+```
+
+## Workflow Memo
+
+Store arbitrary metadata with workflows (not searchable).
+
+```python
+# Set memo at workflow start
+await client.execute_workflow(
+    OrderWorkflow.run,
+    order,
+    id=f"order-{order.id}",
+    task_queue="orders",
+    memo={
+        "customer_name": order.customer_name,
+        "notes": "Priority customer",
+    },
+)
+
+# Read memo from workflow
+@workflow.defn
+class OrderWorkflow:
+    @workflow.run
+    async def run(self, order: Order) -> str:
+        memo = workflow.memo()
+        notes = memo.get("notes", "")
+        ...
+```
+
+## Large Payloads
+
+For large data, consider:
+
+1. **Store externally**: Put large data in S3/GCS, pass references in workflows
+2. **Use Payload Codec**: Compress payloads automatically
+3. **Chunk data**: Split large lists across multiple activities
+
+```python
+# Example: Reference pattern for large data
+@activity.defn
+async def upload_to_storage(data: bytes) -> str:
+    """Upload data and return reference."""
+    key = f"data/{uuid.uuid4()}"
+    await storage_client.upload(key, data)
+    return key
+
+@activity.defn
+async def download_from_storage(key: str) -> bytes:
+    """Download data by reference."""
+    return await storage_client.download(key)
+```
+
+## Deterministic APIs for Values
+
+Use these APIs within workflows for deterministic random values and UUIDs:
+
+```python
+@workflow.defn
+class MyWorkflow:
+    @workflow.run
+    async def run(self) -> str:
+        # Deterministic UUID (same on replay)
+        unique_id = workflow.uuid4()
+
+        # Deterministic random (same on replay)
+        rng = workflow.random()
+        value = rng.randint(1, 100)
+
+        return str(unique_id)
+```
+
+## Best Practices
+
+1. Use Pydantic for input/output validation
+2. 
Keep payloads small (< 2MB recommended) +3. Encrypt sensitive data with PayloadCodec +4. Store large data externally with references +5. Use dataclasses for simple data structures +6. Use `workflow.uuid4()` and `workflow.random()` for deterministic values diff --git a/references/python/determinism.md b/references/python/determinism.md new file mode 100644 index 0000000..90e5856 --- /dev/null +++ b/references/python/determinism.md @@ -0,0 +1,143 @@ +# Python SDK Determinism + +## Overview + +The Python SDK runs workflows in a sandbox that provides automatic protection against many non-deterministic operations. + +## Why Determinism Matters: History Replay + +Temporal achieves durability through **history replay**. Understanding this mechanism is key to writing correct Workflow code. + +### How Replay Works + +1. **Initial Execution**: When your Workflow runs for the first time, the SDK records Commands (like "schedule activity") to the Event History stored by Temporal Server. + +2. **Recovery/Continuation**: When a Worker restarts, loses connectivity, or picks up a Workflow Task, it must restore the Workflow's state by replaying the code from the beginning. + +3. **Command Matching**: During replay, the SDK re-executes your Workflow code but doesn't actually run Activities again. Instead, it compares the Commands your code generates against the Events in history. If there's a match, it uses the stored result. + +4. **Non-determinism Detection**: If your code generates different Commands than what's in history (e.g., different Activity name, different order), the SDK raises a `NondeterminismError`. + +### Example: Why datetime.now() Breaks Replay + +```python +# BAD - Non-deterministic +@workflow.defn +class BadWorkflow: + @workflow.run + async def run(self) -> str: + import datetime + if datetime.datetime.now().hour < 12: # Different value on replay! + await workflow.execute_activity(morning_activity, ...) + else: + await workflow.execute_activity(afternoon_activity, ...) +``` + +If this runs at 11:59 AM initially and replays at 12:01 PM, it will try to schedule a different Activity, causing `NondeterminismError`. + +```python +# GOOD - Deterministic +@workflow.defn +class GoodWorkflow: + @workflow.run + async def run(self) -> str: + if workflow.now().hour < 12: # Consistent during replay + await workflow.execute_activity(morning_activity, ...) + else: + await workflow.execute_activity(afternoon_activity, ...) 
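+            # workflow.now() is derived from the workflow-task timestamp in
+            # history, so replay takes the same branch as the first execution.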
+```
+
+### Testing Replay Compatibility
+
+Use the `Replayer` class to verify your code changes are compatible with existing histories:
+
+```python
+from temporalio.worker import Replayer
+from temporalio.client import WorkflowHistory
+
+async def test_replay_compatibility():
+    replayer = Replayer(workflows=[MyWorkflow])
+
+    # Test against a saved history
+    with open("workflow_history.json") as f:
+        history = WorkflowHistory.from_json("my-workflow-id", f.read())
+
+    # This will raise NondeterminismError if incompatible
+    await replayer.replay_workflow(history)
+```
+
+## Sandbox Behavior
+
+The sandbox:
+- Isolates global state via `exec` compilation
+- Restricts non-deterministic library calls via proxy objects
+- Passes through standard library with restrictions
+
+## Safe Alternatives
+
+| Forbidden | Safe Alternative |
+|-----------|------------------|
+| `datetime.now()` | `workflow.now()` |
+| `datetime.utcnow()` | `workflow.now()` |
+| `random.random()` | `workflow.random().random()` |
+| `random.randint()` | `workflow.random().randint()` |
+| `uuid.uuid4()` | `workflow.uuid4()` |
+| `time.time()` | `workflow.now().timestamp()` |
+
+## Pass-Through Pattern
+
+For third-party libraries that need to bypass sandbox restrictions:
+
+```python
+with workflow.unsafe.imports_passed_through():
+    import pydantic
+    from my_module import my_activity
+```
+
+## Disabling Sandbox
+
+```python
+# Per-workflow
+@workflow.defn(sandboxed=False)
+class UnsandboxedWorkflow:
+    pass
+
+# Per-block
+with workflow.unsafe.sandbox_unrestricted():
+    # Unrestricted code
+    pass
+
+# Globally (worker level)
+from temporalio.worker import UnsandboxedWorkflowRunner
+Worker(..., workflow_runner=UnsandboxedWorkflowRunner())
+```
+
+## Forbidden Operations
+
+- Direct I/O (network, filesystem)
+- Threading operations
+- `subprocess` calls
+- Global mutable state modification
+- `time.sleep()` (use `asyncio.sleep()`)
+
+## Commands and Events
+
+Understanding the relationship between your code and the Event History:
+
+| Workflow Code | Command Generated | Event Created |
+|--------------|-------------------|---------------|
+| `workflow.execute_activity()` | ScheduleActivityTask | ActivityTaskScheduled |
+| `asyncio.sleep()` | StartTimer | TimerStarted |
+| `workflow.execute_child_workflow()` | StartChildWorkflowExecution | ChildWorkflowExecutionStarted |
+| `workflow.continue_as_new()` | ContinueAsNewWorkflowExecution | WorkflowExecutionContinuedAsNew |
+| Return from `@workflow.run` | CompleteWorkflowExecution | WorkflowExecutionCompleted |
+
+## Best Practices
+
+1. Use `workflow.now()` for all time operations
+2. Use `workflow.random()` for random values
+3. Use `workflow.uuid4()` for unique identifiers
+4. Pass through third-party libraries explicitly
+5. Test with replay to catch non-determinism
+6. Keep workflows focused on orchestration, delegate I/O to activities
7. Use `workflow.logger` instead of print() for replay-safe logging
+7. Use `workflow.logger` instead of print() for replay-safe logging
diff --git a/references/python/error-handling.md b/references/python/error-handling.md
new file mode 100644
index 0000000..627e58c
--- /dev/null
+++ b/references/python/error-handling.md
@@ -0,0 +1,145 @@
+# Python SDK Error Handling
+
+## Overview
+
+The Python SDK uses `ApplicationError` for application-specific errors and provides comprehensive retry policy configuration.
+ +## Application Errors + +```python +from temporalio.exceptions import ApplicationError + +@activity.defn +async def validate_order(order: Order) -> None: + if not order.is_valid(): + raise ApplicationError( + "Invalid order", + type="ValidationError", + ) +``` + +## Non-Retryable Errors + +```python +raise ApplicationError( + "Permanent failure - invalid credit card", + type="PaymentError", + non_retryable=True, # Will not retry +) +``` + +## Handling Activity Errors + +```python +from temporalio.exceptions import ActivityError + +@workflow.run +async def run(self) -> str: + try: + return await workflow.execute_activity( + risky_activity, + schedule_to_close_timeout=timedelta(minutes=5), + ) + except ActivityError as e: + workflow.logger.error(f"Activity failed: {e}") + # Handle or re-raise + raise ApplicationError("Workflow failed due to activity error") +``` + +## Retry Policy Configuration + +```python +from temporalio.common import RetryPolicy + +result = await workflow.execute_activity( + my_activity, + schedule_to_close_timeout=timedelta(minutes=10), + retry_policy=RetryPolicy( + initial_interval=timedelta(seconds=1), + backoff_coefficient=2.0, + maximum_interval=timedelta(minutes=1), + maximum_attempts=5, + non_retryable_error_types=["ValidationError", "PaymentError"], + ), +) +``` + +## Timeout Configuration + +```python +await workflow.execute_activity( + my_activity, + start_to_close_timeout=timedelta(minutes=5), # Single attempt + schedule_to_close_timeout=timedelta(minutes=30), # Including retries + heartbeat_timeout=timedelta(seconds=30), # Between heartbeats +) +``` + +## Workflow Failure + +```python +@workflow.run +async def run(self) -> str: + if some_condition: + raise ApplicationError( + "Cannot process order", + type="BusinessError", + non_retryable=True, + ) + return "success" +``` + +## Idempotency Patterns + +When Activities interact with external systems, making them idempotent ensures correctness during retries and replay. + +### Using Workflow IDs as Idempotency Keys + +```python +@activity.defn +async def charge_payment(order_id: str, amount: float) -> str: + # Use order_id as idempotency key with payment provider + result = await payment_api.charge( + amount=amount, + idempotency_key=f"order-{order_id}", # Prevents duplicate charges + ) + return result.transaction_id +``` + +### Tracking Operation Status in Workflow State + +```python +@workflow.defn +class OrderWorkflow: + def __init__(self): + self._payment_completed = False + self._transaction_id: str | None = None + + @workflow.run + async def run(self, order: Order) -> str: + if not self._payment_completed: + self._transaction_id = await workflow.execute_activity( + charge_payment, order.id, order.total, + schedule_to_close_timeout=timedelta(minutes=5), + ) + self._payment_completed = True + + # Continue with order processing... + return self._transaction_id +``` + +### Designing Idempotent Activities + +1. **Use unique identifiers** as idempotency keys (workflow ID, activity ID, or business ID) +2. **Check before acting**: Query external system state before making changes +3. **Make operations repeatable**: Ensure calling twice produces the same result +4. **Record outcomes**: Store transaction IDs or results for verification + +## Best Practices + +1. Use specific error types for different failure modes +2. Mark permanent failures as non-retryable +3. Configure appropriate retry policies per activity +4. Log errors before re-raising +5. Use `ActivityError` to catch activity failures in workflows +6. 
Design activities to be idempotent for safe retries diff --git a/references/python/gotchas.md b/references/python/gotchas.md new file mode 100644 index 0000000..ceec619 --- /dev/null +++ b/references/python/gotchas.md @@ -0,0 +1,390 @@ +# Python Gotchas + +Python-specific mistakes and anti-patterns. See also [Common Gotchas](../core/common-gotchas.md) for language-agnostic concepts. + +## Idempotency + +```python +# BAD - May charge customer multiple times on retry +@activity.defn +async def charge_payment(order_id: str, amount: float) -> str: + return await payment_api.charge(customer_id, amount) + +# GOOD - Safe for retries +@activity.defn +async def charge_payment(order_id: str, amount: float) -> str: + return await payment_api.charge( + customer_id, + amount, + idempotency_key=f"order-{order_id}" + ) +``` + +## Replay Safety + +### Side Effects in Workflows + +```python +# BAD - Prints on every replay, notification runs in workflow +@workflow.defn +class NotificationWorkflow: + @workflow.run + async def run(self): + print("Starting workflow") # Runs on replay too + send_slack_notification("Started") # Side effect in workflow! + await workflow.execute_activity(...) + +# GOOD - Replay-safe +@workflow.defn +class NotificationWorkflow: + @workflow.run + async def run(self): + workflow.logger.info("Starting workflow") # Only logs on first execution + await workflow.execute_activity(send_notification, "Started") +``` + +### Time-Based Logic + +```python +# BAD - Different time on replay +if datetime.now() > deadline: + await cancel_order() + +# GOOD - Consistent across replays +if workflow.now() > deadline: + await cancel_order() +``` + +### Other Non-Deterministic Operations + +```python +# BAD - Different values on replay +random_id = str(uuid.uuid4()) +random_value = random.random() + +# GOOD - Deterministic alternatives +random_id = workflow.uuid4() +random_value = workflow.random().random() +``` + +## Query Handlers + +### Modifying State + +```python +# BAD - Query modifies state +@workflow.defn +class QueueWorkflow: + def __init__(self): + self._queue = [] + + @workflow.query + def get_next_item(self) -> str | None: + if self._queue: + return self._queue.pop(0) # Mutates state! + return None + +# GOOD - Query reads, Update modifies +@workflow.defn +class QueueWorkflow: + def __init__(self): + self._queue = [] + + @workflow.query + def peek(self) -> str | None: + return self._queue[0] if self._queue else None + + @workflow.update + def dequeue(self) -> str | None: + if self._queue: + return self._queue.pop(0) + return None +``` + +### Blocking in Queries + +```python +# BAD - Queries cannot await +@workflow.query +async def get_data_with_refresh(self) -> dict: + if self._data is None: + self._data = await workflow.execute_activity(fetch_data, ...) + return self._data + +# GOOD - Query returns state, signal triggers refresh +@workflow.signal +async def refresh_data(self): + self._data = await workflow.execute_activity(fetch_data, ...) + +@workflow.query +def get_data(self) -> dict | None: + return self._data +``` + +## File Organization + +### Importing Activities into Workflow Files + +**The Problem**: The Python sandbox reloads workflow files on every task. Importing heavy activity modules slows down workers. 
+ +```python +# BAD - activities.py gets reloaded constantly +# workflows.py +from activities import my_activity + +@workflow.defn +class MyWorkflow: + pass + +# GOOD - Pass-through import +# workflows.py +from temporalio import workflow + +with workflow.unsafe.imports_passed_through(): + from activities import my_activity + +@workflow.defn +class MyWorkflow: + pass +``` + +### Mixing Workflows and Activities + +```python +# BAD - Everything in one file +# app.py +@workflow.defn +class MyWorkflow: + @workflow.run + async def run(self): + await workflow.execute_activity(my_activity, ...) + +@activity.defn +async def my_activity(): + # Heavy imports, I/O, etc. + pass + +# GOOD - Separate files +# workflows.py +@workflow.defn +class MyWorkflow: + @workflow.run + async def run(self): + await workflow.execute_activity(my_activity, ...) + +# activities.py +@activity.defn +async def my_activity(): + pass +``` + +## Async vs Sync Activities + +### Blocking in Async Activities + +```python +# BAD - Blocks the event loop +@activity.defn +async def process_file(path: str) -> str: + with open(path) as f: # Blocking I/O in async! + return f.read() + +# GOOD Option 1 - Use sync activity with executor +@activity.defn +def process_file(path: str) -> str: + with open(path) as f: + return f.read() + +# Register with executor in worker +Worker( + client, + task_queue="my-queue", + activities=[process_file], + activity_executor=ThreadPoolExecutor(max_workers=10), +) + +# GOOD Option 2 - Use async I/O +@activity.defn +async def process_file(path: str) -> str: + async with aiofiles.open(path) as f: + return await f.read() +``` + +### Missing Executor for Sync Activities + +```python +# BAD - Sync activity without executor blocks worker +@activity.defn +def slow_computation(data: str) -> str: + return heavy_cpu_work(data) + +Worker( + client, + task_queue="my-queue", + activities=[slow_computation], + # Missing activity_executor! +) + +# GOOD - Provide executor +Worker( + client, + task_queue="my-queue", + activities=[slow_computation], + activity_executor=ThreadPoolExecutor(max_workers=10), +) +``` + +## Error Handling + +### Swallowing Errors + +```python +# BAD - Error is hidden +@workflow.defn +class SilentFailureWorkflow: + @workflow.run + async def run(self): + try: + await workflow.execute_activity(...) + except Exception: + pass # Error is lost! + +# GOOD - Handle appropriately +@workflow.defn +class ProperErrorHandlingWorkflow: + @workflow.run + async def run(self): + try: + await workflow.execute_activity(...) + except ActivityError as e: + workflow.logger.error(f"Activity failed: {e}") + raise # Or use fallback, compensate, etc. 
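+
+            # Fallback variant (sketch, with a hypothetical fallback_activity):
+            # instead of re-raising, a workflow could degrade gracefully:
+            #     return await workflow.execute_activity(
+            #         fallback_activity,
+            #         schedule_to_close_timeout=timedelta(minutes=5),
+            #     )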
+``` + +### Wrong Retry Classification + +```python +# BAD - Network errors should be retried +@activity.defn +async def call_api(): + try: + return await http_client.get(url) + except ConnectionError: + raise ApplicationError("Connection failed", non_retryable=True) + +# GOOD - Only permanent failures are non-retryable +@activity.defn +async def call_api(): + try: + return await http_client.get(url) + except ConnectionError: + raise # Let Temporal retry + except InvalidCredentialsError: + raise ApplicationError("Invalid API key", non_retryable=True) +``` + +## Retry Policies + +### Too Aggressive + +```python +# BAD - Gives up too easily +result = await workflow.execute_activity( + flaky_api_call, + schedule_to_close_timeout=timedelta(seconds=30), + retry_policy=RetryPolicy(maximum_attempts=1), +) + +# GOOD - Resilient to transient failures +result = await workflow.execute_activity( + flaky_api_call, + schedule_to_close_timeout=timedelta(minutes=10), + retry_policy=RetryPolicy( + initial_interval=timedelta(seconds=1), + maximum_interval=timedelta(minutes=1), + backoff_coefficient=2.0, + maximum_attempts=10, + ), +) +``` + +## Heartbeating + +### Forgetting to Heartbeat Long Activities + +```python +# BAD - No heartbeat, can't detect stuck activities +@activity.defn +async def process_large_file(path: str): + for chunk in read_chunks(path): + process(chunk) # Takes hours, no heartbeat + +# GOOD - Regular heartbeats with progress +@activity.defn +async def process_large_file(path: str): + for i, chunk in enumerate(read_chunks(path)): + activity.heartbeat(f"Processing chunk {i}") + process(chunk) +``` + +### Heartbeat Timeout Too Short + +```python +# BAD - Heartbeat timeout shorter than processing time +await workflow.execute_activity( + process_chunk, + start_to_close_timeout=timedelta(minutes=30), + heartbeat_timeout=timedelta(seconds=10), # Too short! 
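+    # heartbeat_timeout must exceed the longest expected gap between
+    # successive activity.heartbeat() calls, plus headroom for network delays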
+) + +# GOOD - Heartbeat timeout allows for processing variance +await workflow.execute_activity( + process_chunk, + start_to_close_timeout=timedelta(minutes=30), + heartbeat_timeout=timedelta(minutes=2), +) +``` + +## Testing + +### Not Testing Failures + +```python +# Test failure scenarios +@pytest.mark.asyncio +async def test_activity_failure_handling(): + async with await WorkflowEnvironment.start_time_skipping() as env: + # Create activity that always fails + @activity.defn + async def failing_activity() -> str: + raise ApplicationError("Simulated failure", non_retryable=True) + + async with Worker( + env.client, + task_queue="test", + workflows=[MyWorkflow], + activities=[failing_activity], + ): + with pytest.raises(WorkflowFailureError): + await env.client.execute_workflow( + MyWorkflow.run, + id="test-failure", + task_queue="test", + ) +``` + +### Not Testing Replay + +```python +from temporalio.worker import Replayer + +async def test_replay_compatibility(): + replayer = Replayer(workflows=[MyWorkflow]) + + # Load history from file (captured from production/staging) + with open("workflow_history.json") as f: + history = WorkflowHistory.from_json("workflow-id", f.read()) + + # Fails if current code is incompatible with history + await replayer.replay_workflow(history) +``` diff --git a/references/python/observability.md b/references/python/observability.md new file mode 100644 index 0000000..fcba076 --- /dev/null +++ b/references/python/observability.md @@ -0,0 +1,191 @@ +# Python SDK Observability + +## Overview + +The Python SDK provides comprehensive observability through logging, metrics, tracing, and visibility (Search Attributes). + +## Logging + +### Workflow Logging (Replay-Safe) + +Use `workflow.logger` for replay-safe logging that avoids duplicate messages: + +```python +@workflow.defn +class MyWorkflow: + @workflow.run + async def run(self, name: str) -> str: + workflow.logger.info("Workflow started", extra={"name": name}) + + result = await workflow.execute_activity( + my_activity, + schedule_to_close_timeout=timedelta(minutes=5), + ) + + workflow.logger.info("Activity completed", extra={"result": result}) + return result +``` + +The workflow logger automatically: +- Suppresses duplicate logs during replay +- Includes workflow context (workflow ID, run ID, etc.) + +### Activity Logging + +Use `activity.logger` for context-aware activity logging: + +```python +@activity.defn +async def process_order(order_id: str) -> str: + activity.logger.info(f"Processing order {order_id}") + + # Perform work... 
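+    # A long-running step could also report liveness here (sketch):
+    # activity.heartbeat(f"order {order_id} in progress")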
+
+    activity.logger.info("Order processed successfully")
+    return "completed"
+```
+
+Activity logger includes:
+- Activity ID, type, and task queue
+- Workflow ID and run ID
+- Attempt number (for retries)
+
+### Custom Logger Configuration
+
+```python
+import logging
+
+# Configure a custom handler
+handler = logging.StreamHandler()
+handler.setFormatter(logging.Formatter(
+    "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
+))
+
+# Apply to Temporal's logger
+temporal_logger = logging.getLogger("temporalio")
+temporal_logger.addHandler(handler)
+temporal_logger.setLevel(logging.INFO)
+```
+
+## Metrics
+
+### Enabling SDK Metrics
+
+```python
+from temporalio.client import Client
+from temporalio.runtime import Runtime, TelemetryConfig, PrometheusConfig
+
+# Configure Prometheus metrics endpoint
+runtime = Runtime(
+    telemetry=TelemetryConfig(
+        metrics=PrometheusConfig(bind_address="0.0.0.0:9090")
+    )
+)
+
+client = await Client.connect(
+    "localhost:7233",
+    runtime=runtime,
+)
+```
+
+### Key SDK Metrics
+
+- `temporal_request` - Client requests to server
+- `temporal_workflow_task_execution_latency` - Workflow task processing time
+- `temporal_activity_execution_latency` - Activity execution time
+- `temporal_workflow_task_replay_latency` - Replay duration
+
+## Tracing
+
+### OpenTelemetry Integration
+
+```python
+from opentelemetry import trace
+from opentelemetry.sdk.trace import TracerProvider
+from temporalio.contrib.opentelemetry import TracingInterceptor
+
+# Set up OpenTelemetry
+provider = TracerProvider()
+trace.set_tracer_provider(provider)
+
+# Create tracing interceptor
+tracing_interceptor = TracingInterceptor()
+
+# Apply to client and worker
+client = await Client.connect(
+    "localhost:7233",
+    interceptors=[tracing_interceptor],
+)
+
+worker = Worker(
+    client,
+    task_queue="my-queue",
+    workflows=[MyWorkflow],
+    activities=[my_activity],
+    interceptors=[tracing_interceptor],
+)
+```
+
+## Search Attributes (Visibility)
+
+### Setting Search Attributes at Start
+
+```python
+from temporalio.common import (
+    SearchAttributeKey,
+    SearchAttributePair,
+    TypedSearchAttributes,
+)
+
+# Define typed search attribute keys
+ORDER_ID = SearchAttributeKey.for_keyword("OrderId")
+CUSTOMER_TYPE = SearchAttributeKey.for_keyword("CustomerType")
+ORDER_TOTAL = SearchAttributeKey.for_float("OrderTotal")
+
+# Start workflow with typed search attributes
+await client.execute_workflow(
+    OrderWorkflow.run,
+    order,
+    id=f"order-{order.id}",
+    task_queue="orders",
+    search_attributes=TypedSearchAttributes([
+        SearchAttributePair(ORDER_ID, order.id),
+        SearchAttributePair(CUSTOMER_TYPE, order.customer_type),
+        SearchAttributePair(ORDER_TOTAL, order.total),
+    ]),
+)
+```
+
+### Upserting Search Attributes from Workflow
+
+```python
+ORDER_STATUS = SearchAttributeKey.for_keyword("OrderStatus")
+
+@workflow.defn
+class OrderWorkflow:
+    @workflow.run
+    async def run(self, order: Order) -> str:
+        # Update status as workflow progresses
+        workflow.upsert_search_attributes([
+            ORDER_STATUS.value_set("processing"),
+        ])
+
+        await workflow.execute_activity(process_order, order, ...)
+
+        workflow.upsert_search_attributes([
+            ORDER_STATUS.value_set("completed"),
+        ])
+        return "done"
+```
+
+### Querying Workflows by Search Attributes
+
+```python
+# List workflows using search attributes
+async for wf in client.list_workflows(
+    'OrderStatus = "processing" AND CustomerType = "premium"'
+):
+    print(f"Workflow {wf.id} is still processing")
+```
+
+## Best Practices
+
+1. Use `workflow.logger` in workflows, `activity.logger` in activities
+2. Don't use print() in workflows - it will produce duplicate output on replay
+3. 
Configure metrics for production monitoring +4. Use Search Attributes for business-level visibility +5. Add tracing for distributed debugging diff --git a/references/python/patterns.md b/references/python/patterns.md new file mode 100644 index 0000000..bc4f3fc --- /dev/null +++ b/references/python/patterns.md @@ -0,0 +1,453 @@ +# Python SDK Patterns + +## Signals + +### WHY: Use signals to send data or commands to a running workflow from external sources +### WHEN: +- **Order approval workflows** - Wait for human approval before proceeding +- **Live configuration updates** - Change workflow behavior without restarting +- **Fire-and-forget communication** - Notify workflow of external events +- **Workflow coordination** - Allow workflows to communicate with each other + +**Signals vs Queries vs Updates:** +- Signals: Fire-and-forget, no response, can modify state +- Queries: Read-only, returns data, cannot modify state +- Updates: Synchronous, returns response, can modify state + +```python +@workflow.defn +class OrderWorkflow: + def __init__(self): + self._approved = False + self._items = [] + + @workflow.signal + async def approve(self) -> None: + self._approved = True + + @workflow.signal + async def add_item(self, item: str) -> None: + self._items.append(item) + + @workflow.run + async def run(self) -> str: + # Wait for approval + await workflow.wait_condition(lambda: self._approved) + return f"Processed {len(self._items)} items" +``` + +### Dynamic Signal Handlers + +For handling signals with names not known at compile time: + +```python +@workflow.defn +class DynamicSignalWorkflow: + def __init__(self): + self._signals: dict[str, list[Any]] = {} + + @workflow.signal(dynamic=True) + async def handle_signal(self, name: str, args: Sequence[RawValue]) -> None: + if name not in self._signals: + self._signals[name] = [] + self._signals[name].append(workflow.payload_converter().from_payload(args[0])) +``` + +## Queries + +### WHY: Read workflow state without affecting execution - queries are read-only +### WHEN: +- **Progress tracking dashboards** - Display workflow progress to users +- **Status endpoints** - Check workflow state for API responses +- **Debugging** - Inspect internal workflow state +- **Health checks** - Verify workflow is functioning correctly + +**Important:** Queries must NOT modify workflow state or have side effects. 
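+
+From the client side, reading that state is a single call against a workflow
+handle. A minimal sketch (assumptions: the `StatusWorkflow` defined just below
+is already running with ID `status-wf-1`, and `client` is a connected
+`temporalio.client.Client`):
+
+```python
+# Queries are read-only and do not add events to the workflow history
+handle = client.get_workflow_handle("status-wf-1")
+status = await handle.query(StatusWorkflow.get_status)
+print(f"Current status: {status}")
+```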
+
+```python
+@workflow.defn
+class StatusWorkflow:
+    def __init__(self):
+        self._status = "pending"
+        self._progress = 0
+
+    @workflow.query
+    def get_status(self) -> str:
+        return self._status
+
+    @workflow.query
+    def get_progress(self) -> int:
+        return self._progress
+
+    @workflow.run
+    async def run(self) -> str:
+        self._status = "running"
+        for i in range(100):
+            self._progress = i
+            await workflow.execute_activity(
+                process_item, i,
+                schedule_to_close_timeout=timedelta(minutes=1)
+            )
+        self._status = "completed"
+        return "done"
+```
+
+### Dynamic Query Handlers
+
+```python
+@workflow.query(dynamic=True)
+def handle_query(self, name: str, args: Sequence[RawValue]) -> Any:
+    if name == "get_field":
+        field_name = workflow.payload_converter().from_payload(args[0])
+        return getattr(self, f"_{field_name}", None)
+```
+
+## Child Workflows
+
+### WHY: Break complex workflows into smaller, manageable units with independent failure domains
+### WHEN:
+- **Failure domain isolation** - Child failures don't automatically fail parent
+- **Different retry policies** - Each child can have its own retry configuration
+- **Reusability** - Share workflow logic across multiple parent workflows
+- **Independent scaling** - Child workflows can run on different task queues
+- **History size management** - Each child has its own event history
+
+**Use activities instead when:** Operation is short-lived, doesn't need its own failure domain, or doesn't need independent retry policies.
+
+```python
+@workflow.run
+async def run(self, orders: list[Order]) -> list[str]:
+    results = []
+    for order in orders:
+        result = await workflow.execute_child_workflow(
+            ProcessOrderWorkflow.run,
+            order,
+            id=f"order-{order.id}",
+            # Control what happens to child when parent completes
+            parent_close_policy=workflow.ParentClosePolicy.ABANDON,
+        )
+        results.append(result)
+    return results
+```
+
+## External Workflows
+
+### WHY: Interact with workflows that are not children of the current workflow
+### WHEN:
+- **Cross-workflow coordination** - Coordinate between independent workflows
+- **Signaling existing workflows** - Send signals to workflows started elsewhere
+- **Cancellation of other workflows** - Cancel workflows from a coordinating workflow
+
+```python
+@workflow.run
+async def run(self, target_workflow_id: str) -> None:
+    # Get handle to external workflow
+    handle = workflow.get_external_workflow_handle(target_workflow_id)
+
+    # Signal the external workflow
+    await handle.signal(TargetWorkflow.data_ready, data_payload)
+
+    # Or cancel it
+    await handle.cancel()
+```
+
+## Parallel Execution
+
+### WHY: Execute multiple independent operations concurrently for better throughput
+### WHEN:
+- **Batch processing** - Process multiple items simultaneously
+- **Fan-out patterns** - Distribute work across multiple activities
+- **Independent operations** - Operations that don't depend on each other's results
+
+```python
+@workflow.run
+async def run(self, items: list[str]) -> list[str]:
+    # Execute activities in parallel
+    tasks = [
+        workflow.execute_activity(
+            process_item, item,
+            schedule_to_close_timeout=timedelta(minutes=5)
+        )
+        for item in items
+    ]
+    return await asyncio.gather(*tasks)
+```
+
+### Deterministic Alternatives to asyncio
+
+Use Temporal's deterministic alternatives for safer concurrent operations:
+
+```python
+# workflow.wait() - like asyncio.wait()
+# (return_when accepts the same constants as asyncio)
+done, pending = await workflow.wait(
+    futures,
+    return_when=asyncio.FIRST_COMPLETED
+)
+
+# workflow.as_completed() - 
like asyncio.as_completed() +for future in workflow.as_completed(futures): + result = await future + # Process each result as it completes +``` + +## Continue-as-New + +### WHY: Prevent unbounded event history growth in long-running or infinite workflows +### WHEN: +- **Event history approaching 10,000+ events** - Temporal recommends continue-as-new before hitting limits +- **Infinite/long-running workflows** - Polling, subscription, or daemon-style workflows +- **Memory optimization** - Reset workflow state to reduce memory footprint + +**Recommendation:** Check history length periodically and continue-as-new around 10,000 events. + +```python +@workflow.run +async def run(self, state: WorkflowState) -> str: + while True: + state = await process_batch(state) + + if state.is_complete: + return "done" + + # Continue with fresh history before hitting limits + if workflow.info().get_current_history_length() > 10000: + workflow.continue_as_new(args=[state]) +``` + +## Saga Pattern (Compensations) + +### WHY: Implement distributed transactions with compensating actions for rollback +### WHEN: +- **Multi-step transactions** - Operations that span multiple services +- **Eventual consistency** - When you can't use traditional ACID transactions +- **Rollback requirements** - When partial failures require undoing previous steps + +**Important:** Compensation activities should be idempotent - they may be retried. + +```python +@workflow.run +async def run(self, order: Order) -> str: + compensations: list[Callable[[], Awaitable[None]]] = [] + + try: + await workflow.execute_activity( + reserve_inventory, order, + schedule_to_close_timeout=timedelta(minutes=5) + ) + compensations.append(lambda: workflow.execute_activity( + release_inventory, order, + schedule_to_close_timeout=timedelta(minutes=5) + )) + + await workflow.execute_activity( + charge_payment, order, + schedule_to_close_timeout=timedelta(minutes=5) + ) + compensations.append(lambda: workflow.execute_activity( + refund_payment, order, + schedule_to_close_timeout=timedelta(minutes=5) + )) + + await workflow.execute_activity( + ship_order, order, + schedule_to_close_timeout=timedelta(minutes=5) + ) + + return "Order completed" + + except Exception as e: + workflow.logger.error(f"Order failed: {e}, running compensations") + for compensate in reversed(compensations): + try: + await compensate() + except Exception as comp_err: + workflow.logger.error(f"Compensation failed: {comp_err}") + raise +``` + +## Cancellation Handling + +### WHY: Gracefully handle workflow cancellation requests and perform cleanup +### WHEN: +- **Graceful shutdown** - Clean up resources when workflow is cancelled +- **External cancellation** - Respond to cancellation requests from clients +- **Cleanup activities** - Run cleanup logic even after cancellation + +```python +@workflow.run +async def run(self) -> str: + try: + await workflow.execute_activity( + long_running_activity, + schedule_to_close_timeout=timedelta(hours=1), + ) + return "completed" + except asyncio.CancelledError: + # Workflow was cancelled - perform cleanup + workflow.logger.info("Workflow cancelled, running cleanup") + # Cleanup activities still run even after cancellation + await workflow.execute_activity( + cleanup_activity, + schedule_to_close_timeout=timedelta(minutes=5), + ) + raise # Re-raise to mark workflow as cancelled +``` + +## Wait Condition with Timeout + +### WHY: Wait for a condition with a deadline +### WHEN: +- **Approval workflows with deadlines** - Auto-reject if not approved in 
time
+- **Conditional waits with timeouts** - Proceed with default after timeout
+
+```python
+@workflow.run
+async def run(self) -> str:
+    self._approved = False
+
+    # Wait for approval with 24-hour timeout
+    try:
+        await workflow.wait_condition(
+            lambda: self._approved,
+            timeout=timedelta(hours=24)
+        )
+        return "approved"
+    except asyncio.TimeoutError:
+        return "auto-rejected due to timeout"
+```
+
+## Waiting for All Handlers to Finish
+
+### WHY: Ensure all signal/update handlers complete before workflow exits
+### WHEN:
+- **Workflows with async handlers** - Prevent data loss from in-flight handlers
+- **Before continue-as-new** - Ensure handlers complete before resetting
+
+```python
+@workflow.run
+async def run(self) -> str:
+    # ... main workflow logic ...
+
+    # Before exiting, wait for all handlers to finish
+    await workflow.wait_condition(workflow.all_handlers_finished)
+    return "done"
+```
+
+## Activity Heartbeat Details
+
+### WHY: Resume activity progress after worker failure
+### WHEN:
+- **Long-running activities** - Track progress for resumability
+- **Checkpointing** - Save progress periodically
+
+```python
+@activity.defn
+async def process_large_file(file_path: str) -> str:
+    # Get heartbeat details from previous attempt (if any)
+    heartbeat_details = activity.info().heartbeat_details
+    start_line = heartbeat_details[0] if heartbeat_details else 0
+
+    with open(file_path) as f:
+        for i, line in enumerate(f):
+            if i < start_line:
+                continue  # Skip already processed lines
+
+            process_line(line)
+
+            # Heartbeat with progress
+            activity.heartbeat(i + 1)
+
+    return "completed"
+```
+
+## Versioning with Patching
+
+### WHY: Safely deploy workflow code changes without breaking running workflows
+### WHEN:
+- **Adding new steps** - New code path for new executions, old path for replays
+- **Changing activity calls** - Modify activity parameters or logic
+- **Deprecating features** - Gradually remove old code paths
+
+```python
+@workflow.run
+async def run(self) -> str:
+    if workflow.patched("new-greeting"):
+        # New implementation
+        greeting = await workflow.execute_activity(
+            new_greet_activity,
+            schedule_to_close_timeout=timedelta(minutes=1)
+        )
+    else:
+        # Old implementation (for replay)
+        greeting = await workflow.execute_activity(
+            old_greet_activity,
+            schedule_to_close_timeout=timedelta(minutes=1)
+        )
+
+    return greeting
+```
+
+## Timers
+
+### WHY: Schedule delays or deadlines within workflows in a durable way
+### WHEN:
+- **Scheduled delays** - Wait for a specific duration before continuing
+- **Deadlines** - Set timeouts for operations
+- **Reminder patterns** - Schedule future notifications
+
+Inside a workflow, `asyncio.sleep()` is intercepted by the SDK and backed by a durable server-side timer:
+
+```python
+@workflow.run
+async def run(self) -> str:
+    # Wait for 1 hour - durable, survives worker restarts
+    await asyncio.sleep(3600)
+
+    # timedelta values work via total_seconds()
+    await asyncio.sleep(timedelta(hours=1).total_seconds())
+
+    return "Timer fired"
+```
+
+## Local Activities
+
+### WHY: Reduce latency for short, lightweight operations by skipping the task queue
+### WHEN:
+- **Short operations** - Activities completing in milliseconds/seconds
+- **High-frequency calls** - When task queue overhead is significant
+- **Low-latency requirements** - When you can't afford task queue round-trip
+
+**Note:** Local activities are experimental in Python SDK. 
+ +```python +@workflow.run +async def run(self) -> str: + result = await workflow.execute_local_activity( + quick_lookup, + "key", + schedule_to_close_timeout=timedelta(seconds=5), + ) + return result +``` + +## Using Pydantic Models + +```python +from pydantic import BaseModel +from temporalio.contrib.pydantic import pydantic_data_converter + +class OrderInput(BaseModel): + order_id: str + items: list[str] + total: float + +@workflow.defn +class OrderWorkflow: + @workflow.run + async def run(self, input: OrderInput) -> str: + return f"Processed order {input.order_id}" + +# Client setup with Pydantic support +client = await Client.connect( + "localhost:7233", + data_converter=pydantic_data_converter, +) +``` diff --git a/references/python/python.md b/references/python/python.md new file mode 100644 index 0000000..b6c296f --- /dev/null +++ b/references/python/python.md @@ -0,0 +1,176 @@ +# Temporal Python SDK Reference + +## Overview + +The Temporal Python SDK (`temporalio`) provides a fully async, type-safe approach to building durable workflows. Python 3.9+ required. Workflows run in a sandbox by default for determinism protection. + +## Quick Start + +**activities/greet.py** - Activity definitions (separate file for performance): +```python +from temporalio import activity + +@activity.defn +async def greet(name: str) -> str: + return f"Hello, {name}!" +``` + +**workflows/greeting.py** - Workflow definition (import activities through sandbox): +```python +from datetime import timedelta +from temporalio import workflow + +with workflow.unsafe.imports_passed_through(): + from activities.greet import greet + +@workflow.defn +class GreetingWorkflow: + @workflow.run + async def run(self, name: str) -> str: + return await workflow.execute_activity( + greet, name, schedule_to_close_timeout=timedelta(seconds=30) + ) +``` + +**worker.py** - Worker setup (imports both): +```python +import asyncio +from temporalio.client import Client +from temporalio.worker import Worker + +from activities.greet import greet +from workflows.greeting import GreetingWorkflow + +async def main(): + client = await Client.connect("localhost:7233") + async with Worker(client, task_queue="greeting-queue", + workflows=[GreetingWorkflow], activities=[greet]): + result = await client.execute_workflow( + GreetingWorkflow.run, "World", + id="greeting-workflow", task_queue="greeting-queue" + ) + print(result) + +asyncio.run(main()) +``` + +## Key Concepts + +### Workflow Definition +- Use `@workflow.defn` decorator on class +- Use `@workflow.run` on the entry point method +- Must be async (`async def`) +- Use `@workflow.signal`, `@workflow.query`, `@workflow.update` for handlers + +### Activity Definition +- Use `@activity.defn` decorator +- Can be sync or async functions +- **Default to sync activities** - safer and easier to debug +- Sync activities need `activity_executor` (ThreadPoolExecutor) +- Async activities require async-safe libraries throughout (e.g., `aiohttp` not `requests`) + +See `sync-vs-async.md` for detailed guidance on choosing between sync and async. + +### Worker Setup +- Connect client, create Worker with workflows and activities +- Use `async with Worker(...)` context manager +- Activities can specify custom executor + +## Why Determinism Matters: History Replay + +Temporal achieves durability through **history replay**. 
When a Worker restarts or recovers from a failure, it re-executes the Workflow code from the beginning, but instead of re-running Activities, it uses the results stored in the Event History. + +**This is why Workflow code must be deterministic:** +- During replay, the SDK compares Commands generated by your code against the Events in history +- If the sequence differs (non-determinism), the Worker cannot restore state +- Non-determinism causes `NondeterminismError` and blocks Workflow progress + +The Python SDK's sandbox provides automatic protection against many common non-deterministic operations, but understanding replay helps you write correct code. + +See `determinism.md` for detailed rules and safe alternatives. + +## Determinism Rules Summary + +**Safe alternatives:** +- `workflow.now()` instead of `datetime.now()` +- `workflow.random()` instead of `random` +- `asyncio.sleep()` works (sandbox handles it) +- `workflow.uuid4()` for UUIDs +- `workflow.wait()` instead of `asyncio.wait()` (deterministic ordering) +- `workflow.as_completed()` instead of `asyncio.as_completed()` (deterministic ordering) + +**Pass-through for libraries:** +```python +with workflow.unsafe.imports_passed_through(): + import pydantic +``` + +## Replay-Aware Logging + +Use `workflow.logger` inside Workflows for replay-safe logging that avoids duplicate log messages during replay: + +```python +@workflow.defn +class MyWorkflow: + @workflow.run + async def run(self) -> str: + workflow.logger.info("Starting workflow") # Only logs on initial execution + result = await workflow.execute_activity(...) + workflow.logger.info(f"Activity result: {result}") + return result +``` + +Use `activity.logger` in activities for context-aware logging with automatic correlation: + +```python +@activity.defn +async def my_activity() -> str: + activity.logger.info("Processing activity") # Includes activity context + return "done" +``` + +## File Organization Best Practice + +**Keep Workflow definitions in separate files from Activity definitions.** The Python SDK sandbox reloads Workflow definition files on every execution for determinism protection. Minimizing file contents improves Worker performance. + +``` +my_temporal_app/ +├── workflows/ +│ └── greeting.py # Only Workflow classes +├── activities/ +│ └── translate.py # Only Activity functions/classes +├── worker.py # Worker setup, imports both +└── starter.py # Client code to start workflows +``` + +**In the Workflow file, import Activities through the sandbox:** +```python +# workflows/greeting.py +from temporalio import workflow + +with workflow.unsafe.imports_passed_through(): + from activities.translate import TranslateActivities +``` + +## Common Pitfalls + +1. **Using `datetime.now()` in workflows** - Use `workflow.now()` instead +2. **Blocking in async activities** - Use sync activities or async-safe libraries only +3. **Missing executor for sync activities** - Add `activity_executor=ThreadPoolExecutor()` +4. **Forgetting to heartbeat** - Long activities need `activity.heartbeat()` +5. **Using gevent** - Incompatible with SDK +6. **Using `print()` in workflows** - Use `workflow.logger` instead for replay-safe logging +7. 
**Mixing Workflows and Activities in same file** - Causes unnecessary reloads, hurts performance + +## Additional Resources + +### Reference Files +- **`determinism.md`** - Sandbox behavior, safe alternatives, pass-through pattern, history replay +- **`sync-vs-async.md`** - Sync vs async activities, event loop blocking, executor configuration +- **`error-handling.md`** - ApplicationError, retry policies, non-retryable errors, idempotency +- **`testing.md`** - WorkflowEnvironment, time-skipping, activity mocking +- **`patterns.md`** - Signals, queries, child workflows, saga pattern +- **`observability.md`** - Logging, metrics, tracing, Search Attributes +- **`advanced-features.md`** - Continue-as-new, schedules, updates, interceptors +- **`data-handling.md`** - Data converters, Pydantic, payload encryption +- **`versioning.md`** - Patching API, workflow type versioning, Worker Versioning diff --git a/references/python/sandbox.md b/references/python/sandbox.md new file mode 100644 index 0000000..5ebd79b --- /dev/null +++ b/references/python/sandbox.md @@ -0,0 +1,180 @@ +# Python Workflow Sandbox + +## Overview + +The Python SDK runs workflows in a sandbox that provides automatic protection against non-deterministic operations. This is unique to Python - TypeScript uses V8 isolation with automatic replacements instead. + +## How the Sandbox Works + +The sandbox: +- Isolates global state via `exec` compilation +- Restricts non-deterministic library calls via proxy objects +- Passes through standard library with restrictions +- Reloads workflow files on each execution + +## Safe Alternatives + +| Forbidden | Safe Alternative | +|-----------|------------------| +| `datetime.now()` | `workflow.now()` | +| `datetime.utcnow()` | `workflow.now()` | +| `random.random()` | `workflow.random().random()` | +| `random.randint()` | `workflow.random().randint()` | +| `uuid.uuid4()` | `workflow.uuid4()` | +| `time.time()` | `workflow.now().timestamp()` | +| `asyncio.wait()` | `workflow.wait()` (deterministic ordering) | +| `asyncio.as_completed()` | `workflow.as_completed()` | + +## Pass-Through Pattern + +Third-party libraries that aren't sandbox-aware need explicit pass-through: + +```python +from temporalio import workflow + +with workflow.unsafe.imports_passed_through(): + import pydantic + from my_module import my_dataclass +``` + +**When to use pass-through:** +- Data classes and models (Pydantic, dataclasses) +- Serialization libraries +- Type definitions +- Any library that doesn't do I/O or non-deterministic operations + +## Importing Activities + +Activities should be imported through pass-through since they're defined outside the sandbox: + +```python +# workflows/order.py +from temporalio import workflow + +with workflow.unsafe.imports_passed_through(): + from activities.payment import process_payment + from activities.shipping import ship_order + +@workflow.defn +class OrderWorkflow: + @workflow.run + async def run(self, order_id: str) -> str: + await workflow.execute_activity( + process_payment, + order_id, + schedule_to_close_timeout=timedelta(minutes=5), + ) + return await workflow.execute_activity( + ship_order, + order_id, + schedule_to_close_timeout=timedelta(minutes=10), + ) +``` + +## Disabling the Sandbox + +### Per-Workflow + +```python +@workflow.defn(sandboxed=False) +class UnsandboxedWorkflow: + @workflow.run + async def run(self) -> str: + # No sandbox protection - use with caution + return "result" +``` + +### Per-Block + +```python +@workflow.defn +class MyWorkflow: + 
@workflow.run + async def run(self) -> str: + with workflow.unsafe.sandbox_unrestricted(): + # Unrestricted code block + pass + return "result" +``` + +### Globally (Worker Level) + +```python +from temporalio.worker import Worker, UnsandboxedWorkflowRunner + +worker = Worker( + client, + task_queue="my-queue", + workflows=[MyWorkflow], + activities=[my_activity], + workflow_runner=UnsandboxedWorkflowRunner(), +) +``` + +## Forbidden Operations + +These operations will fail or cause non-determinism in the sandbox: + +- **Direct I/O**: Network calls, file reads/writes +- **Threading**: `threading` module operations +- **Subprocess**: `subprocess` calls +- **Global state**: Modifying mutable global variables +- **Blocking sleep**: `time.sleep()` (use `asyncio.sleep()`) + +## File Organization + +**Critical**: Keep workflow definitions in separate files from activity definitions. + +The sandbox reloads workflow definition files on every execution. Minimizing file contents improves Worker performance. + +``` +my_temporal_app/ +├── workflows/ +│ └── order.py # Only workflow classes +├── activities/ +│ └── payment.py # Only activity functions +├── models/ +│ └── order.py # Shared data models +├── worker.py # Worker setup, imports both +└── starter.py # Client code +``` + +## Common Issues + +### Import Errors + +``` +Error: Cannot import 'pydantic' in sandbox +``` + +**Fix**: Use pass-through: +```python +with workflow.unsafe.imports_passed_through(): + import pydantic +``` + +### Non-Determinism from Libraries + +Some libraries do internal caching or use current time: + +```python +# May cause non-determinism +import some_library +result = some_library.cached_operation() # Cache changes between replays +``` + +**Fix**: Move to activity or use pass-through with caution. + +### Slow Worker Startup + +Large workflow files slow down worker initialization because they're reloaded frequently. + +**Fix**: Keep workflow files minimal, move logic to activities. + +## Best Practices + +1. **Separate workflow and activity files** for performance +2. **Use pass-through explicitly** for third-party libraries +3. **Keep workflow files small** to minimize reload time +4. **Move I/O to activities** always +5. **Test with replay** to catch sandbox issues early diff --git a/references/python/sync-vs-async.md b/references/python/sync-vs-async.md new file mode 100644 index 0000000..ba6b831 --- /dev/null +++ b/references/python/sync-vs-async.md @@ -0,0 +1,243 @@ +# Python SDK: Sync vs Async Activities + +## Overview + +The Temporal Python SDK supports multiple ways of implementing Activities: + +- **Asynchronous** using `asyncio` +- **Synchronous multithreaded** using `concurrent.futures.ThreadPoolExecutor` +- **Synchronous multiprocess** using `concurrent.futures.ProcessPoolExecutor` + +Choosing the correct approach is critical—incorrect usage can cause sporadic failures and difficult-to-diagnose bugs. + +## Recommendation: Default to Synchronous + +Activities should be synchronous by default. Use async only when certain the code doesn't block the event loop. + +## The Event Loop Problem + +The Python async event loop runs in a single thread. When any task runs, no other tasks can execute until an `await` is reached. If code makes a blocking call (file I/O, synchronous HTTP, etc.), the entire event loop freezes. 
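+
+The failure mode can be reproduced with plain `asyncio`, no Temporal involved.
+A minimal sketch: one blocking call starves every other coroutine on the loop:
+
+```python
+import asyncio
+import time
+
+async def blocking_task():
+    time.sleep(3)  # blocking call - freezes the whole event loop for 3 seconds
+
+async def ticker():
+    for _ in range(3):
+        print("tick")  # first tick only appears after blocking_task finishes
+        await asyncio.sleep(1)
+
+async def main():
+    await asyncio.gather(blocking_task(), ticker())
+
+asyncio.run(main())
+```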
+ +**Consequences of blocking the event loop:** +- Worker cannot communicate with Temporal Server +- Workflow progress blocks across the worker +- Potential deadlocks and unpredictable behavior +- Difficult-to-diagnose bugs + +## How the SDK Handles Each Type + +### Synchronous Activities + +- Run in the `activity_executor` (thread pool by default) +- Protected from accidentally blocking the global event loop +- Multiple activities run in parallel via OS thread scheduling +- Thread pool provides preemptive switching between tasks + +```python +from concurrent.futures import ThreadPoolExecutor +from temporalio.worker import Worker + +with ThreadPoolExecutor(max_workers=100) as executor: + worker = Worker( + client, + task_queue="my-queue", + workflows=[MyWorkflow], + activities=[my_sync_activity], + activity_executor=executor, + ) + await worker.run() +``` + +### Asynchronous Activities + +- Share the default asyncio event loop with the Temporal worker +- Any blocking call freezes the entire loop +- Require async-safe libraries throughout + +```python +@activity.defn +async def my_async_activity(name: str) -> str: + # Must use async-safe libraries only + async with aiohttp.ClientSession() as session: + async with session.get(f"http://api.example.com/{name}") as response: + return await response.text() +``` + +## HTTP Libraries: A Critical Choice + +| Library | Type | Safe in Async Activity? | +|---------|------|------------------------| +| `requests` | Blocking | No - blocks event loop | +| `urllib3` | Blocking | No - blocks event loop | +| `aiohttp` | Async | Yes | +| `httpx` | Both | Yes (use async mode) | + +**Example: Wrong way (blocks event loop)** +```python +@activity.defn +async def bad_activity(url: str) -> str: + import requests + response = requests.get(url) # BLOCKS the event loop! + return response.text +``` + +**Example: Correct way (async-safe)** +```python +@activity.defn +async def good_activity(url: str) -> str: + async with aiohttp.ClientSession() as session: + async with session.get(url) as response: + return await response.text() +``` + +## Running Blocking Code in Async Activities + +If blocking code must run in an async activity, offload it to a thread: + +```python +import asyncio + +@activity.defn +async def activity_with_blocking_call() -> str: + # Run blocking code in a thread pool + loop = asyncio.get_event_loop() + result = await loop.run_in_executor(None, blocking_function) + return result + +# Or use asyncio.to_thread (Python 3.9+) +@activity.defn +async def activity_with_blocking_call_v2() -> str: + result = await asyncio.to_thread(blocking_function) + return result +``` + +## When to Use Async Activities + +Use async activities only when: + +1. All code paths are async-safe (no blocking calls) +2. Using async-native libraries (aiohttp, asyncpg, motor, etc.) +3. Performance benefits are needed for I/O-bound operations +4. The team understands async constraints + +## When to Use Sync Activities + +Use sync activities when: + +1. Making HTTP calls with `requests` or similar blocking libraries +2. Performing file I/O operations +3. Using database drivers that aren't async-native +4. Uncertain whether code is async-safe +5. Integrating with legacy or third-party synchronous code + +## Debugging Tip + +If experiencing sporadic bugs, hangs, or timeouts: + +1. Convert async activities to sync +2. Test thoroughly +3. 
If bugs disappear, the original async activity had blocking calls + +## Threading Considerations + +### Multi-Core Usage + +For CPU-bound work or true parallelism: + +- Use multiple worker processes +- Or use `ProcessPoolExecutor` for synchronous activities + +```python +from concurrent.futures import ProcessPoolExecutor + +with ProcessPoolExecutor(max_workers=4) as executor: + worker = Worker( + client, + task_queue="cpu-intensive-queue", + activities=[cpu_bound_activity], + activity_executor=executor, + ) +``` + +### Separate Workers for Workflows vs Activities + +Some teams deploy: +- Workflow-only workers (CPU-bound, need deadlock detection) +- Activity-only workers (I/O-bound, may need more parallelism) + +This prevents resource contention and allows independent scaling. + +## Complete Example: Sync Activity with ThreadPoolExecutor + +```python +import urllib.parse +import requests +from concurrent.futures import ThreadPoolExecutor +from temporalio import activity +from temporalio.client import Client +from temporalio.worker import Worker + +@activity.defn +def greet_in_spanish(name: str) -> str: + """Synchronous activity using requests library.""" + url = f"http://localhost:9999/get-spanish-greeting?name={urllib.parse.quote(name)}" + response = requests.get(url) + return response.text + +async def main(): + client = await Client.connect("localhost:7233") + + with ThreadPoolExecutor(max_workers=100) as executor: + worker = Worker( + client, + task_queue="greeting-tasks", + workflows=[GreetingWorkflow], + activities=[greet_in_spanish], + activity_executor=executor, + ) + await worker.run() +``` + +## Complete Example: Async Activity with aiohttp + +```python +import aiohttp +import urllib.parse +from temporalio import activity +from temporalio.client import Client +from temporalio.worker import Worker + +class TranslateActivities: + def __init__(self, session: aiohttp.ClientSession): + self.session = session + + @activity.defn + async def greet_in_spanish(self, name: str) -> str: + """Async activity using aiohttp - safe for event loop.""" + url = f"http://localhost:9999/get-spanish-greeting?name={urllib.parse.quote(name)}" + async with self.session.get(url) as response: + return await response.text() + +async def main(): + client = await Client.connect("localhost:7233") + + async with aiohttp.ClientSession() as session: + activities = TranslateActivities(session) + worker = Worker( + client, + task_queue="greeting-tasks", + workflows=[GreetingWorkflow], + activities=[activities.greet_in_spanish], + ) + await worker.run() +``` + +## Summary + +| Aspect | Sync Activities | Async Activities | +|--------|-----------------|------------------| +| Default choice | Yes | Only when certain | +| Blocking calls | Safe (runs in thread pool) | Dangerous (blocks event loop) | +| HTTP library | `requests`, `httpx` | `aiohttp`, `httpx` (async) | +| Executor needed | Yes (`ThreadPoolExecutor`) | No | +| Debugging | Easier | Harder (timing issues) | diff --git a/references/python/testing.md b/references/python/testing.md new file mode 100644 index 0000000..e5de0b1 --- /dev/null +++ b/references/python/testing.md @@ -0,0 +1,120 @@ +# Python SDK Testing + +## Overview + +The Python SDK provides `WorkflowEnvironment` for testing workflows with time-skipping support and `ActivityEnvironment` for isolated activity testing. 
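+
+The examples below assume the `pytest-asyncio` plugin (an assumption; any
+asyncio-capable test runner works). A shared environment fixture keeps that
+setup in one place - a minimal sketch:
+
+```python
+# conftest.py (sketch, assuming pytest-asyncio is installed)
+import pytest_asyncio
+from temporalio.testing import WorkflowEnvironment
+
+@pytest_asyncio.fixture
+async def env():
+    async with await WorkflowEnvironment.start_time_skipping() as env:
+        yield env  # each test receives a ready time-skipping environment
+```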
+ +## Time-Skipping Test Environment + +```python +import pytest +from temporalio.testing import WorkflowEnvironment +from temporalio.worker import Worker + +@pytest.mark.asyncio +async def test_workflow(): + async with await WorkflowEnvironment.start_time_skipping() as env: + async with Worker( + env.client, + task_queue="test-queue", + workflows=[MyWorkflow], + activities=[my_activity], + ): + result = await env.client.execute_workflow( + MyWorkflow.run, + "input", + id="test-workflow-id", + task_queue="test-queue", + ) + assert result == "expected" +``` + +## Local Test Environment + +For tests that don't need time-skipping: + +```python +async with await WorkflowEnvironment.start_local() as env: + # Real-time execution + pass +``` + +## Activity Testing + +```python +from temporalio.testing import ActivityEnvironment + +@pytest.mark.asyncio +async def test_activity(): + env = ActivityEnvironment() + + # Optionally customize activity info + # env.info = ActivityInfo(...) + + result = await env.run(my_activity, "arg1", "arg2") + assert result == "expected" +``` + +## Mocking Activities + +```python +async def mock_activity(input: str) -> str: + return "mocked result" + +@pytest.mark.asyncio +async def test_with_mock(): + async with await WorkflowEnvironment.start_time_skipping() as env: + async with Worker( + env.client, + task_queue="test-queue", + workflows=[MyWorkflow], + activities=[mock_activity], # Use mock + ): + result = await env.client.execute_workflow(...) +``` + +## Workflow Replay Testing + +```python +from temporalio.worker import Replayer + +async def test_replay(): + replayer = Replayer(workflows=[MyWorkflow]) + + # From JSON file + await replayer.replay_workflow( + WorkflowHistory.from_json("workflow-id", history_json) + ) +``` + +## Testing Signals and Queries + +```python +@pytest.mark.asyncio +async def test_signals(): + async with await WorkflowEnvironment.start_time_skipping() as env: + async with Worker(...): + handle = await env.client.start_workflow( + MyWorkflow.run, + id="test-wf", + task_queue="test-queue", + ) + + # Send signal + await handle.signal(MyWorkflow.my_signal, "data") + + # Query state + status = await handle.query(MyWorkflow.get_status) + assert status == "expected" + + # Wait for completion + result = await handle.result() +``` + +## Best Practices + +1. Use time-skipping for workflows with timers +2. Mock external dependencies in activities +3. Test replay compatibility when changing workflow code +4. Test signal/query handlers explicitly +5. Use unique workflow IDs per test to avoid conflicts diff --git a/references/python/versioning.md b/references/python/versioning.md new file mode 100644 index 0000000..69600a2 --- /dev/null +++ b/references/python/versioning.md @@ -0,0 +1,355 @@ +# Python SDK Versioning + +## Overview + +Workflow versioning allows you to safely deploy changes to Workflow code without causing non-deterministic errors in running Workflow Executions. The Python SDK provides multiple approaches: the Patching API for code-level version management, Workflow Type versioning for incompatible changes, and Worker Versioning for deployment-level control. + +## Why Versioning is Needed + +When Workers restart after a deployment, they resume open Workflow Executions through History Replay. If the updated Workflow Definition produces a different sequence of Commands than the original code, it causes a non-deterministic error. 
Versioning ensures backward compatibility by preserving the original execution path for existing workflows while allowing new workflows to use updated code. + +## Workflow Versioning with Patching API + +### The patched() Function + +The `patched()` function checks whether a Workflow should run new or old code: + +```python +from temporalio import workflow + +@workflow.defn +class ShippingWorkflow: + @workflow.run + async def run(self) -> None: + if workflow.patched("send-email-instead-of-fax"): + # New code path + await workflow.execute_activity( + send_email, + schedule_to_close_timeout=timedelta(minutes=5), + ) + else: + # Old code path (for replay of existing workflows) + await workflow.execute_activity( + send_fax, + schedule_to_close_timeout=timedelta(minutes=5), + ) +``` + +**How it works:** +- For new executions: `patched()` returns `True` and records a marker in the Workflow history +- For replay with the marker: `patched()` returns `True` (history includes this patch) +- For replay without the marker: `patched()` returns `False` (history predates this patch) + +### Three-Step Patching Process + +**Step 1: Patch in New Code** + +Add the patch with both old and new code paths: + +```python +@workflow.defn +class OrderWorkflow: + @workflow.run + async def run(self, order: Order) -> str: + if workflow.patched("add-fraud-check"): + # New: Run fraud check before payment + await workflow.execute_activity( + check_fraud, + order, + schedule_to_close_timeout=timedelta(minutes=2), + ) + + # Original payment logic runs for both paths + return await workflow.execute_activity( + process_payment, + order, + schedule_to_close_timeout=timedelta(minutes=5), + ) +``` + +**Step 2: Deprecate the Patch** + +Once all pre-patch Workflow Executions have completed, remove the old code and use `deprecate_patch()`: + +```python +@workflow.defn +class OrderWorkflow: + @workflow.run + async def run(self, order: Order) -> str: + workflow.deprecate_patch("add-fraud-check") + + # Only new code remains + await workflow.execute_activity( + check_fraud, + order, + schedule_to_close_timeout=timedelta(minutes=2), + ) + + return await workflow.execute_activity( + process_payment, + order, + schedule_to_close_timeout=timedelta(minutes=5), + ) +``` + +**Step 3: Remove the Patch** + +After all workflows with the deprecated patch marker have completed, remove the `deprecate_patch()` call entirely: + +```python +@workflow.defn +class OrderWorkflow: + @workflow.run + async def run(self, order: Order) -> str: + await workflow.execute_activity( + check_fraud, + order, + schedule_to_close_timeout=timedelta(minutes=2), + ) + + return await workflow.execute_activity( + process_payment, + order, + schedule_to_close_timeout=timedelta(minutes=5), + ) +``` + +### Branching with Multiple Patches + +A Workflow can have multiple patches, each representing a modification deployed at a specific time: + +```python +@workflow.defn +class NotificationWorkflow: + @workflow.run + async def run(self) -> None: + if workflow.patched("use-sms"): + # Latest: SMS notifications + await workflow.execute_activity( + send_sms, + schedule_to_close_timeout=timedelta(minutes=5), + ) + elif workflow.patched("use-email"): + # Intermediate: Email notifications + await workflow.execute_activity( + send_email, + schedule_to_close_timeout=timedelta(minutes=5), + ) + else: + # Original: Fax notifications + await workflow.execute_activity( + send_fax, + schedule_to_close_timeout=timedelta(minutes=5), + ) +``` + +You can use a single patch ID for multiple changes 
deployed together: + +```python +if workflow.patched("v2-updates"): + # All v2 changes together + await workflow.execute_activity(validate_v2, ...) + await workflow.execute_activity(process_v2, ...) +else: + await workflow.execute_activity(validate_v1, ...) + await workflow.execute_activity(process_v1, ...) +``` + +### Query Filters for Finding Workflows by Version + +Use List Filters to find workflows with specific patch versions: + +```bash +# Find running workflows with a specific patch +temporal workflow list --query \ + 'WorkflowType = "OrderWorkflow" AND ExecutionStatus = "Running" AND TemporalChangeVersion = "add-fraud-check"' + +# Find running workflows without any patch (pre-patch versions) +temporal workflow list --query \ + 'WorkflowType = "OrderWorkflow" AND ExecutionStatus = "Running" AND TemporalChangeVersion IS NULL' +``` + +## Workflow Type Versioning + +For incompatible changes, create a new Workflow Type instead of using patches: + +```python +@workflow.defn(name="PizzaWorkflow") +class PizzaWorkflow: + @workflow.run + async def run(self, order: PizzaOrder) -> str: + # Original implementation + return await self._process_order_v1(order) + +@workflow.defn(name="PizzaWorkflowV2") +class PizzaWorkflowV2: + @workflow.run + async def run(self, order: PizzaOrder) -> str: + # New implementation with incompatible changes + return await self._process_order_v2(order) +``` + +Register both with the Worker: + +```python +worker = Worker( + client, + task_queue="pizza-task-queue", + workflows=[PizzaWorkflow, PizzaWorkflowV2], + activities=[make_pizza, deliver_pizza], +) +``` + +Update client code to start new workflows with the new type: + +```python +# Old workflows continue on PizzaWorkflow +# New workflows use PizzaWorkflowV2 +handle = await client.start_workflow( + PizzaWorkflowV2.run, + order, + id=f"pizza-{order.id}", + task_queue="pizza-task-queue", +) +``` + +Check for open executions before removing the old type: + +```bash +temporal workflow list --query 'WorkflowType = "PizzaWorkflow" AND ExecutionStatus = "Running"' +``` + +## Worker Versioning + +Worker Versioning manages versions at the deployment level, allowing multiple Worker versions to run simultaneously. + +### Key Concepts + +**Worker Deployment**: A logical service grouping similar Workers together (e.g., "loan-processor"). All versions of your code live under this umbrella. + +**Worker Deployment Version**: A specific snapshot of your code identified by a deployment name and Build ID (e.g., "loan-processor:v1.0" or "loan-processor:abc123"). 
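+
+Build IDs are plain strings, so teams typically derive them from source
+control. A minimal sketch (assumptions: `git` is available and this runs at
+worker startup, never inside workflow code):
+
+```python
+import subprocess
+
+# e.g. "abc123" -> Worker Deployment Version "loan-processor:abc123"
+build_id = subprocess.check_output(
+    ["git", "rev-parse", "--short", "HEAD"], text=True
+).strip()
+```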
+
+### Configuring Workers for Versioning
+
+```python
+from temporalio.worker import Worker
+from temporalio.worker.deployment_config import (
+    WorkerDeploymentConfig,
+    WorkerDeploymentVersion,
+)
+
+worker = Worker(
+    client,
+    task_queue="my-task-queue",
+    workflows=[MyWorkflow],
+    activities=[my_activity],
+    deployment_config=WorkerDeploymentConfig(
+        version=WorkerDeploymentVersion(
+            deployment_name="my-service",
+            build_id="v1.0.0",  # or git commit hash
+        ),
+        use_worker_versioning=True,
+    ),
+)
+```
+
+**Configuration parameters:**
+- `use_worker_versioning`: Enables Worker Versioning
+- `version`: Identifies the Worker Deployment Version (deployment name + build ID)
+- Build ID: Typically a git commit hash, version number, or timestamp
+
+### PINNED vs AUTO_UPGRADE Behaviors
+
+**PINNED Behavior**
+
+Workflows stay locked to their original Worker version:
+
+```python
+from temporalio.workflow import VersioningBehavior
+
+@workflow.defn(versioning_behavior=VersioningBehavior.PINNED)
+class StableWorkflow:
+    @workflow.run
+    async def run(self) -> str:
+        # This workflow will always run on its assigned version
+        return await workflow.execute_activity(
+            process_order,
+            schedule_to_close_timeout=timedelta(minutes=5),
+        )
+```
+
+**When to use PINNED:**
+- Short-running workflows (minutes to hours)
+- Consistency is critical (e.g., financial transactions)
+- You want to eliminate version compatibility complexity
+- Building new applications and want simplest development experience
+
+**AUTO_UPGRADE Behavior**
+
+Workflows move to newer Worker versions as those versions become the deployment's current version.
+
+**When to use AUTO_UPGRADE:**
+- Long-running workflows (weeks or months)
+- Workflows need to benefit from bug fixes during execution
+- Migrating from traditional rolling deployments
+- You are already using patching APIs for version transitions
+
+**Important:** AUTO_UPGRADE workflows still need patching to handle version transitions safely since they can move between Worker versions.
+
+### Worker Configuration with Default Behavior
+
+```python
+# For short-running workflows, prefer PINNED as the default for all
+# workflows this worker hosts
+worker = Worker(
+    client,
+    task_queue="orders-task-queue",
+    workflows=[OrderWorkflow],
+    activities=[process_order],
+    deployment_config=WorkerDeploymentConfig(
+        version=WorkerDeploymentVersion(
+            deployment_name="order-service",
+            build_id=os.environ["BUILD_ID"],
+        ),
+        use_worker_versioning=True,
+        default_versioning_behavior=VersioningBehavior.PINNED,
+    ),
+)
+```
+
+### Deployment Strategies
+
+**Blue-Green Deployments**
+
+Maintain two environments and switch traffic between them:
+1. Deploy new code to idle environment
+2. Run tests and validation
+3. Switch traffic to new environment
+4. Keep old environment for instant rollback
+
+**Rainbow Deployments**
+
+Multiple versions run simultaneously:
+- New workflows use latest version
+- Existing workflows complete on their original version
+- Add new versions alongside existing ones
+- Gradually sunset old versions as workflows complete
+
+This works well with Kubernetes where you manage multiple ReplicaSets running different Worker versions.
+
+### Querying Workflows by Worker Version
+
+```bash
+# Find workflows on a specific Worker version
+temporal workflow list --query \
+  'TemporalWorkerDeploymentVersion = "my-service:v1.0.0" AND ExecutionStatus = "Running"'
+```
+
+## Best Practices
+
+1. **Check for open executions** before removing old code paths
+2. **Use descriptive patch IDs** that explain the change (e.g., "add-fraud-check" not "patch-1")
+3. **Deploy patches incrementally**: patch, deprecate, remove
+4. 
**Use PINNED for short workflows** to simplify version management +5. **Use AUTO_UPGRADE with patching** for long-running workflows that need updates +6. **Generate Build IDs from code** (git hash) to ensure changes produce new versions +7. **Avoid rolling deployments** for high-availability services with long-running workflows diff --git a/references/troubleshooting.md b/references/troubleshooting.md deleted file mode 100644 index bb58c85..0000000 --- a/references/troubleshooting.md +++ /dev/null @@ -1,184 +0,0 @@ -# Troubleshooting Temporal Workflows - -## Step 1: Identify the Problem - -```bash -# Check workflow status -temporal workflow describe --workflow-id - -# Check for stalled workflows (workflows stuck in RUNNING) -./scripts/find-stalled-workflows.sh - -# Analyze specific workflow errors -./scripts/analyze-workflow-error.sh --workflow-id -``` - -## Step 2: Diagnose Using This Decision Tree - -``` -Workflow not behaving as expected? -│ -├── Status: RUNNING but no progress (STALLED) -│ │ -│ ├── Is it an interactive workflow waiting for signal/update? -│ │ └── YES → Send the required interaction -│ │ -│ └── NO → Run: ./scripts/find-stalled-workflows.sh -│ │ -│ ├── WorkflowTaskFailed detected -│ │ │ -│ │ ├── Non-determinism error (history mismatch)? -│ │ │ └── See: "Fixing Non-Determinism Errors" below -│ │ │ -│ │ └── Other workflow task error (code bug, missing registration)? -│ │ └── See: "Fixing Other Workflow Task Errors" below -│ │ -│ └── ActivityTaskFailed (excessive retries) -│ └── Activity is retrying. Fix activity code, restart worker. -│ Workflow will auto-retry with new code. -│ -├── Status: COMPLETED but wrong result -│ └── Check result: ./scripts/get-workflow-result.sh --workflow-id -│ Is result an error message? → Fix workflow/activity logic -│ -├── Status: FAILED -│ └── Run: ./scripts/analyze-workflow-error.sh --workflow-id -│ Fix code → ./scripts/ensure-worker.sh → Start NEW workflow -│ -├── Status: TIMED_OUT -│ └── Increase timeouts → ./scripts/ensure-worker.sh → Start NEW workflow -│ -└── Workflow never starts - └── Check: Worker running? Task queue matches? Workflow registered? -``` - ---- - -## Fixing Workflow Task Errors - -**Workflow task errors STALL the workflow** - it stops making progress entirely until the issue is fixed. - -### Fixing Non-Determinism Errors - -Non-determinism occurs when workflow code produces different commands during replay than what's recorded in history. - -**Symptoms**: -- `WorkflowTaskFailed` events in history -- "Non-deterministic error" or "history mismatch" in logs/error message - -**CRITICAL: First understand the error**: -```bash -# 1. ALWAYS analyze the error first - understand what mismatched -./scripts/analyze-workflow-error.sh --workflow-id - -# Look for details like: -# - "expected ActivityTaskScheduled but got TimerStarted" -# - "activity type mismatch: expected X got Y" -# - "timer ID mismatch" -``` - -**Report the error to user** - They need to know what changed and why. - -**Recovery options** (choose based on intent): - -**Option A: Fix code to match history (accidental change / bug)** -```bash -# Use when: You accidentally broke compatibility and want to recover the workflow -# 1. Understand what commands the history expects -# 2. Fix workflow code to produce those same commands during replay -# 3. Restart worker -./scripts/ensure-worker.sh -# 4. 
Workflow task retries automatically and continues -``` - -**Option B: Terminate and restart fresh (intentional v2 change)** -```bash -# Use when: You intentionally deployed breaking changes (v1→v2) and want new behavior -# The old workflow was started on v1; you want v2 going forward -temporal workflow terminate --workflow-id -./scripts/ensure-worker.sh -uv run starter # Start fresh workflow with v2 code -``` - -**Common non-determinism causes**: -- Changed activity order or added/removed activities mid-execution -- Changed activity names or signatures -- Added/removed timers or signals -- Conditional logic that depends on external state (time, random, etc.) - -**Key insight**: Non-determinism means "replay doesn't match history." -- **Accidental?** → Fix code to match history, workflow recovers -- **Intentional v2 change?** → Terminate old workflow, start fresh with new code - -### Fixing Other Workflow Task Errors - -For workflow task errors that are NOT non-determinism (code bugs, missing registration, etc.): - -**Symptoms**: -- `WorkflowTaskFailed` events -- Error is NOT "history mismatch" or "non-deterministic" - -**Fix procedure**: -```bash -# 1. Identify the error -./scripts/analyze-workflow-error.sh --workflow-id - -# 2. Fix the root cause (code bug, worker config, etc.) - -# 3. Kill and restart worker with fixed code -./scripts/ensure-worker.sh - -# 4. NO NEED TO TERMINATE - the workflow will automatically resume -# The new worker picks up where it left off and continues execution -``` - -**Key point**: Unlike non-determinism, the workflow can recover once you fix the code. - ---- - -## Fixing Activity Task Errors - -**Activity task errors cause retries**, not immediate workflow failure. - -### Workflow Stalling Due to Retries - -Workflows can appear stalled because an activity keeps failing and retrying. - -**Diagnosis**: -```bash -# Check for excessive activity retries -./scripts/find-stalled-workflows.sh - -# Look for ActivityTaskFailed count -# Check worker logs for retry messages -tail -100 $CLAUDE_TEMPORAL_LOG_DIR/worker-$(basename "$(pwd)").log -``` - -**Fix procedure**: -```bash -# 1. Fix the activity code - -# 2. Restart worker with fixed code -./scripts/ensure-worker.sh - -# 3. Worker auto-retries with new code -# No need to terminate or restart workflow -``` - -### Activity Failure (Retries Exhausted) - -When all retries are exhausted, the activity fails permanently. - -**Fix procedure**: -```bash -# 1. Analyze the error -./scripts/analyze-workflow-error.sh --workflow-id - -# 2. Fix activity code - -# 3. Restart worker -./scripts/ensure-worker.sh - -# 4. Start NEW workflow (old one has failed) -uv run starter -``` diff --git a/references/typescript/advanced-features.md b/references/typescript/advanced-features.md new file mode 100644 index 0000000..9739a94 --- /dev/null +++ b/references/typescript/advanced-features.md @@ -0,0 +1,438 @@ +# TypeScript SDK Advanced Features + +## Continue-as-New + +Use continue-as-new to prevent unbounded history growth in long-running workflows. 
+
+```typescript
+import { continueAsNew, workflowInfo } from '@temporalio/workflow';
+
+export async function batchProcessingWorkflow(state: ProcessingState): Promise<string> {
+  while (!state.isComplete) {
+    // Process next batch
+    state = await processNextBatch(state);
+
+    // Check history size and continue-as-new if needed
+    const info = workflowInfo();
+    if (info.historyLength > 10000) {
+      await continueAsNew<typeof batchProcessingWorkflow>(state);
+    }
+  }
+
+  return 'completed';
+}
```

+
+### Continue-as-New with Options
+
+```typescript
+import { makeContinueAsNewFunc } from '@temporalio/workflow';
+
+// Continue with modified options
+const continueWithOptions = makeContinueAsNewFunc<typeof batchProcessingWorkflow>({
+  memo: { lastProcessed: itemId },
+  searchAttributes: { BatchNumber: [state.batch + 1] },
+});
+await continueWithOptions(newState);
+```
+
+## Workflow Updates
+
+Updates allow synchronous interaction with running workflows.
+
+### Defining Update Handlers
+
+```typescript
+import { defineUpdate, setHandler, condition } from '@temporalio/workflow';
+
+// Define the update
+export const addItemUpdate = defineUpdate<number, [string]>('addItem');
+export const addItemValidatedUpdate = defineUpdate<number, [string]>('addItemValidated');
+
+export async function orderWorkflow(): Promise<string> {
+  const items: string[] = [];
+  let completed = false;
+
+  // Simple update handler
+  setHandler(addItemUpdate, (item: string) => {
+    items.push(item);
+    return items.length;
+  });
+
+  // Update handler with validator
+  setHandler(
+    addItemValidatedUpdate,
+    (item: string) => {
+      items.push(item);
+      return items.length;
+    },
+    {
+      validator: (item: string) => {
+        if (!item) throw new Error('Item cannot be empty');
+        if (items.length >= 100) throw new Error('Order is full');
+      },
+    }
+  );
+
+  // Wait for completion signal
+  await condition(() => completed);
+  return `Order with ${items.length} items completed`;
+}
+```
+
+### Calling Updates from Client
+
+```typescript
+import { Client } from '@temporalio/client';
+import { addItemUpdate } from './workflows';
+
+const client = new Client();
+const handle = client.workflow.getHandle('order-123');
+
+// Execute update and wait for result
+const count = await handle.executeUpdate(addItemUpdate, { args: ['new-item'] });
+console.log(`Order now has ${count} items`);
+```
+
+## Nexus Operations
+
+### WHY: Cross-namespace and cross-cluster service communication
+### WHEN:
+- **Multi-namespace architectures** - Call operations across Temporal namespaces
+- **Service-oriented design** - Expose workflow capabilities as reusable services
+- **Cross-cluster communication** - Interact with workflows in different Temporal clusters
+
+### Defining a Nexus Service
+
+Define the service interface shared between caller and handler:
+
+```typescript
+// api.ts - shared service definition
+import * as nexus from 'nexus-rpc';
+
+export const helloService = nexus.service('hello', {
+  // Synchronous operation
+  echo: nexus.operation<EchoInput, EchoOutput>(),
+  // Workflow-backed operation
+  hello: nexus.operation<HelloInput, HelloOutput>(),
+});
+
+export interface EchoInput { message: string; }
+export interface EchoOutput { message: string; }
+export interface HelloInput { name: string; language: string; }
+export interface HelloOutput { message: string; }
+```
+
+### Implementing Nexus Service Handlers
+
+```typescript
+// service/handler.ts
+import * as nexus from 'nexus-rpc';
+import * as temporalNexus from '@temporalio/nexus';
+import { helloService, EchoInput, EchoOutput, HelloInput, HelloOutput } from '../api';
+import { helloWorkflow } from './workflows';
+
+export const helloServiceHandler = nexus.serviceHandler(helloService, {
+  // Synchronous operation - simple async function
+  echo: async (ctx, input: EchoInput): Promise<EchoOutput> => {
+    // Can access Temporal client via temporalNexus.getClient()
+    return input;
+  },
+
+  // Workflow-backed operation
+  hello: new temporalNexus.WorkflowRunOperationHandler(
+    async (ctx, input: HelloInput) => {
+      return await temporalNexus.startWorkflow(ctx, helloWorkflow, {
+        args: [input],
+        workflowId: ctx.requestId ?? crypto.randomUUID(),
+      });
+    },
+  ),
+});
+```
+
+### Calling Nexus Operations from Workflows
+
+```typescript
+// caller/workflows.ts
+import * as wf from '@temporalio/workflow';
+import { helloService } from '../api';
+
+const HELLO_SERVICE_ENDPOINT = 'my-nexus-endpoint-name';
+
+export async function callerWorkflow(name: string): Promise<string> {
+  const nexusClient = wf.createNexusClient({
+    service: helloService,
+    endpoint: HELLO_SERVICE_ENDPOINT,
+  });
+
+  const result = await nexusClient.executeOperation(
+    'hello',
+    { name, language: 'en' },
+    { scheduleToCloseTimeout: '10s' },
+  );
+
+  return result.message;
+}
+```
+
+## Activity Cancellation and Heartbeating
+
+### ActivityCancellationType
+
+Control how activities respond to workflow cancellation:
+
+```typescript
+import { proxyActivities, ActivityCancellationType, isCancellation, log } from '@temporalio/workflow';
+import type * as activities from './activities';
+
+const { longRunningActivity } = proxyActivities<typeof activities>({
+  startToCloseTimeout: '60s',
+  heartbeatTimeout: '3s',
+  // TRY_CANCEL (default): Request cancellation, resolve/reject immediately
+  // WAIT_CANCELLATION_COMPLETED: Wait for activity to acknowledge cancellation
+  // WAIT_CANCELLATION_REQUESTED: Wait for cancellation request to be delivered
+  // ABANDON: Don't request cancellation
+  cancellationType: ActivityCancellationType.WAIT_CANCELLATION_COMPLETED,
+});
+
+export async function workflowWithCancellation(): Promise<void> {
+  try {
+    await longRunningActivity();
+  } catch (err) {
+    if (isCancellation(err)) {
+      log.info('Workflow cancelled along with its activity');
+      // Use CancellationScope.nonCancellable for cleanup
+    }
+    throw err;
+  }
+}
+```
+
+### Activity Heartbeat Details for Resumption
+
+Use heartbeat details to resume long-running activities from where they left off:
+
+```typescript
+// activities.ts
+import { activityInfo, log, sleep, CancelledFailure, heartbeat } from '@temporalio/activity';
+
+export async function processWithProgress(sleepIntervalMs = 1000): Promise<void> {
+  try {
+    // Resume from last heartbeat on retry
+    const startingPoint = activityInfo().heartbeatDetails || 1;
+    log.info('Starting activity at progress', { startingPoint });
+
+    for (let progress = startingPoint; progress <= 100; ++progress) {
+      log.info('Progress', { progress });
+      await sleep(sleepIntervalMs);
+      // Heartbeat with progress - allows resuming on retry
+      heartbeat(progress);
+    }
+  } catch (err) {
+    if (err instanceof CancelledFailure) {
+      log.warn('Activity cancelled', { message: err.message });
+      // Cleanup code here
+    }
+    throw err;
+  }
+}
+```
+
+## Schedules
+
+Create recurring workflow executions. 
+
+```typescript
+import { Client, ScheduleOverlapPolicy } from '@temporalio/client';
+
+const client = new Client();
+
+// Create a schedule
+const schedule = await client.schedule.create({
+  scheduleId: 'daily-report',
+  spec: {
+    intervals: [{ every: '1 day' }],
+  },
+  action: {
+    type: 'startWorkflow',
+    workflowType: 'dailyReportWorkflow',
+    taskQueue: 'reports',
+    args: [],
+  },
+  policies: {
+    overlap: ScheduleOverlapPolicy.SKIP,
+  },
+});
+
+// Manage schedules
+const handle = client.schedule.getHandle('daily-report');
+await handle.pause('Maintenance window');
+await handle.unpause();
+await handle.trigger(); // Run immediately
+await handle.delete();
+```
+
+## Interceptors
+
+Interceptors allow cross-cutting concerns like logging, metrics, and auth.
+
+### Creating a Custom Interceptor
+
+```typescript
+import {
+  ActivityInboundCallsInterceptor,
+  ActivityExecuteInput,
+  Next,
+} from '@temporalio/worker';
+
+class LoggingActivityInterceptor implements ActivityInboundCallsInterceptor {
+  async execute(
+    input: ActivityExecuteInput,
+    next: Next<ActivityInboundCallsInterceptor, 'execute'>
+  ): Promise<unknown> {
+    console.log(`Activity starting: ${input.activity.name}`);
+    try {
+      const result = await next(input);
+      console.log(`Activity completed: ${input.activity.name}`);
+      return result;
+    } catch (err) {
+      console.error(`Activity failed: ${input.activity.name}`, err);
+      throw err;
+    }
+  }
+}
+
+// Apply to worker
+const worker = await Worker.create({
+  workflowsPath: require.resolve('./workflows'),
+  activities,
+  taskQueue: 'my-queue',
+  interceptors: {
+    activity: [() => ({ inbound: new LoggingActivityInterceptor() })],
+  },
+});
+```
+
+## Sinks
+
+Sinks allow workflows to emit events for side effects (logging, metrics).
+
+```typescript
+import { proxySinks, Sinks } from '@temporalio/workflow';
+
+// Define sink interface
+export interface LoggerSinks extends Sinks {
+  logger: {
+    info(message: string, attrs: Record<string, unknown>): void;
+    error(message: string, attrs: Record<string, unknown>): void;
+  };
+}
+
+// Use in workflow
+const { logger } = proxySinks<LoggerSinks>();
+
+export async function myWorkflow(input: string): Promise<string> {
+  logger.info('Workflow started', { input });
+
+  const result = await someActivity(input);
+
+  logger.info('Workflow completed', { result });
+  return result;
+}
+
+// Implement sink in worker
+const worker = await Worker.create({
+  workflowsPath: require.resolve('./workflows'),
+  activities,
+  taskQueue: 'my-queue',
+  sinks: {
+    logger: {
+      info: {
+        fn(workflowInfo, message, attrs) {
+          console.log(`[${workflowInfo.workflowId}] ${message}`, attrs);
+        },
+        callDuringReplay: false, // Don't log during replay
+      },
+      error: {
+        fn(workflowInfo, message, attrs) {
+          console.error(`[${workflowInfo.workflowId}] ${message}`, attrs);
+        },
+        callDuringReplay: false,
+      },
+    },
+  },
+});
+```
+
+## CancellationScope Patterns
+
+Advanced cancellation control within workflows. 
+
+```typescript
+import {
+  CancellationScope,
+  CancelledFailure,
+  sleep,
+} from '@temporalio/workflow';
+
+export async function workflowWithCancellation(): Promise<string> {
+  // Non-cancellable scope - runs to completion even if workflow cancelled
+  const criticalResult = await CancellationScope.nonCancellable(async () => {
+    return await criticalActivity();
+  });
+
+  // Cancellable scope with timeout
+  try {
+    await CancellationScope.cancellable(async () => {
+      await Promise.race([
+        longRunningActivity(),
+        sleep('5 minutes').then(() => {
+          CancellationScope.current().cancel();
+        }),
+      ]);
+    });
+  } catch (err) {
+    if (err instanceof CancelledFailure) {
+      // Handle cancellation
+      await cleanupActivity();
+    }
+    throw err;
+  }
+
+  return criticalResult;
+}
+```
+
+## Dynamic Workflows and Activities
+
+Handle workflows/activities not known at compile time.
+
+```typescript
+// Dynamic workflow dispatch
+export async function dynamicWorkflow(
+  workflowType: string,
+  args: unknown[]
+): Promise<unknown> {
+  switch (workflowType) {
+    case 'order':
+      return handleOrderWorkflow(args);
+    case 'refund':
+      return handleRefundWorkflow(args);
+    default:
+      throw new Error(`Unknown workflow type: ${workflowType}`);
+  }
+}
+```
+
+## Best Practices
+
+1. Use continue-as-new for long-running workflows to prevent history growth
+2. Prefer updates over signals when you need a response
+3. Use sinks with `callDuringReplay: false` for logging
+4. Use CancellationScope.nonCancellable for critical cleanup operations
+5. Configure interceptors for cross-cutting concerns like tracing
+6. Use `ActivityCancellationType.WAIT_CANCELLATION_COMPLETED` when cleanup is important
+7. Store progress in heartbeat details for resumable long-running activities
+8. Use Nexus for cross-namespace service communication
diff --git a/references/typescript/data-handling.md b/references/typescript/data-handling.md
new file mode 100644
index 0000000..88aad84
--- /dev/null
+++ b/references/typescript/data-handling.md
@@ -0,0 +1,256 @@
+# TypeScript SDK Data Handling
+
+## Overview
+
+The TypeScript SDK uses data converters to serialize/deserialize workflow inputs, outputs, and activity parameters.
+
+## Default Data Converter
+
+The default converter handles:
+- `undefined` and `null`
+- `Uint8Array` (as binary)
+- Protobuf messages (if configured)
+- JSON-serializable types
+
+## Search Attributes
+
+Custom searchable fields for workflow visibility. 
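+
+Note: custom Search Attributes must be registered with the Cluster before they can be set. With the local dev server this is a one-time CLI step; the attribute names and types below are taken from the examples that follow:
+
+```bash
+temporal operator search-attribute create --name OrderId --type Keyword
+temporal operator search-attribute create --name OrderStatus --type Keyword
+temporal operator search-attribute create --name CustomerType --type Keyword
+temporal operator search-attribute create --name OrderTotal --type Double
+temporal operator search-attribute create --name CreatedAt --type Datetime
+```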
+
+### Setting Search Attributes at Start
+
+```typescript
+import { Client } from '@temporalio/client';
+
+const client = new Client();
+
+await client.workflow.start('orderWorkflow', {
+  taskQueue: 'orders',
+  workflowId: `order-${orderId}`,
+  args: [order],
+  searchAttributes: {
+    OrderId: [orderId],
+    CustomerType: ['premium'],
+    OrderTotal: [99.99],
+    CreatedAt: [new Date()],
+  },
+});
+```
+
+### Upserting Search Attributes from Workflow
+
+```typescript
+import { upsertSearchAttributes, workflowInfo } from '@temporalio/workflow';
+
+export async function orderWorkflow(order: Order): Promise<string> {
+  // Update status as workflow progresses
+  upsertSearchAttributes({
+    OrderStatus: ['processing'],
+  });
+
+  await processOrder(order);
+
+  upsertSearchAttributes({
+    OrderStatus: ['completed'],
+  });
+
+  return 'done';
+}
+```
+
+### Reading Search Attributes
+
+```typescript
+import { workflowInfo } from '@temporalio/workflow';
+
+export async function orderWorkflow(): Promise<void> {
+  const info = workflowInfo();
+  const searchAttrs = info.searchAttributes;
+  const orderId = searchAttrs?.OrderId?.[0];
+  // ...
+}
+```
+
+### Querying Workflows by Search Attributes
+
+```typescript
+const client = new Client();
+
+// List workflows using search attributes
+for await (const workflow of client.workflow.list({
+  query: 'OrderStatus = "processing" AND CustomerType = "premium"',
+})) {
+  console.log(`Workflow ${workflow.workflowId} is still processing`);
+}
+```
+
+## Workflow Memo
+
+Store arbitrary metadata with workflows (not searchable).
+
+```typescript
+// Set memo at workflow start
+await client.workflow.start('orderWorkflow', {
+  taskQueue: 'orders',
+  workflowId: `order-${orderId}`,
+  args: [order],
+  memo: {
+    customerName: order.customerName,
+    notes: 'Priority customer',
+  },
+});
+
+// Read memo from workflow
+import { workflowInfo } from '@temporalio/workflow';
+
+export async function orderWorkflow(): Promise<void> {
+  const info = workflowInfo();
+  const customerName = info.memo?.customerName;
+  // ...
+}
+```
+
+## Custom Data Converter
+
+Create custom converters for special serialization needs.
+
+```typescript
+import {
+  DataConverter,
+  Payload,
+  PayloadConverter,
+  defaultPayloadConverter,
+} from '@temporalio/common';
+
+class CustomPayloadConverter implements PayloadConverter {
+  toPayload(value: unknown): Payload | undefined {
+    // Custom serialization logic
+    return defaultPayloadConverter.toPayload(value);
+  }
+
+  fromPayload<T>(payload: Payload): T {
+    // Custom deserialization logic
+    return defaultPayloadConverter.fromPayload(payload);
+  }
+}
+
+// payload-converter.ts - exported by module path so the workflow
+// sandbox can load the same converter
+export const payloadConverter = new CustomPayloadConverter();
+
+const dataConverter: DataConverter = {
+  payloadConverterPath: require.resolve('./payload-converter'),
+};
+
+// Apply to client
+const client = new Client({
+  dataConverter,
+});
+
+// Apply to worker
+const worker = await Worker.create({
+  dataConverter,
+  // ...
+});
+```
+
+## Payload Codec (Encryption)
+
+Encrypt sensitive workflow data.
+
+```typescript
+import { PayloadCodec, Payload } from '@temporalio/common';
+
+class EncryptionCodec implements PayloadCodec {
+  private readonly encryptionKey: Uint8Array;
+
+  constructor(key: Uint8Array) {
+    this.encryptionKey = key;
+  }
+
+  async encode(payloads: Payload[]): Promise<Payload[]> {
+    return Promise.all(
+      payloads.map(async (payload) => ({
+        metadata: {
+          encoding: 'binary/encrypted',
+        },
+        data: await this.encrypt(payload.data ?? new Uint8Array()),
+      }))
+    );
+  }
+
+  async decode(payloads: Payload[]): Promise<Payload[]> {
+    return Promise.all(
+      payloads.map(async (payload) => {
+        if (payload.metadata?.encoding === 'binary/encrypted') {
+          return {
+            ...payload,
+            data: await this.decrypt(payload.data ?? new Uint8Array()),
+          };
+        }
+        return payload;
+      })
+    );
+  }
+
+  private async encrypt(data: Uint8Array): Promise<Uint8Array> {
+    // Implement encryption (e.g., using Web Crypto API)
+    return data;
+  }
+
+  private async decrypt(data: Uint8Array): Promise<Uint8Array> {
+    // Implement decryption
+    return data;
+  }
+}
+
+// Apply codec
+const dataConverter: DataConverter = {
+  payloadCodecs: [new EncryptionCodec(encryptionKey)],
+};
+```
+
+## Protobuf Support
+
+Using Protocol Buffers for type-safe serialization.
+
+```typescript
+// payload-converter.ts - reference it via payloadConverterPath as above
+import { DefaultPayloadConverterWithProtobufs } from '@temporalio/common/lib/protobufs';
+
+export const payloadConverter = new DefaultPayloadConverterWithProtobufs({
+  protobufRoot: myProtobufRoot,
+});
+```
+
+## Large Payloads
+
+For large data, consider:
+
+1. **Store externally**: Put large data in S3/GCS, pass references in workflows
+2. **Use compression codec**: Compress payloads automatically
+3. **Chunk data**: Split large arrays across multiple activities
+
+```typescript
+// Example: Reference pattern for large data
+import { proxyActivities } from '@temporalio/workflow';
+import type * as activities from './activities';
+
+const { uploadToStorage, downloadFromStorage } = proxyActivities<typeof activities>({
+  startToCloseTimeout: '5 minutes',
+});
+
+export async function processLargeDataWorkflow(dataRef: string): Promise<string> {
+  // Download data from storage using reference
+  const data = await downloadFromStorage(dataRef);
+
+  // Process data...
+  const result = await processData(data);
+
+  // Upload result and return reference
+  const resultRef = await uploadToStorage(result);
+  return resultRef;
+}
+```
+
+## Best Practices
+
+1. Keep payloads small (< 2MB recommended)
+2. Use search attributes for business-level visibility and filtering
+3. Encrypt sensitive data with a PayloadCodec
+4. Store large data externally with references
+5. Use memo for non-searchable metadata
+6. Configure the same data converter on both client and worker
diff --git a/references/typescript/determinism.md b/references/typescript/determinism.md
new file mode 100644
index 0000000..e7db427
--- /dev/null
+++ b/references/typescript/determinism.md
@@ -0,0 +1,133 @@
+# TypeScript SDK Determinism
+
+## Overview
+
+The TypeScript SDK runs workflows in an isolated V8 sandbox that automatically provides determinism.
+
+## Why Determinism Matters
+
+Temporal provides durable execution through **History Replay**. When a Worker needs to restore workflow state (after a crash, cache eviction, or to continue after a long timer), it re-executes the workflow code from the beginning.
+
+**The Critical Rule**: A Workflow is deterministic if every execution of its code produces the same Commands, in the same sequence, given the same input.
+
+During replay, the Worker:
+1. Re-executes your workflow code
+2. Compares generated Commands to Events in the history
+3. Uses stored results from history instead of re-executing Activities
+
+If the Commands don't match the history, the Worker cannot accurately restore state, causing a **non-deterministic error**.
+
+### Example of Non-Determinism
+
+```typescript
+// WRONG - Non-deterministic!
+export async function badWorkflow(): Promise<string> {
+  await importData();
+
+  // Random value changes on each execution
+  if (Math.random() > 0.5) { // Would be a problem without sandbox
+    await sleep('30 minutes');
+  }
+
+  return await sendReport();
+}
+```
+
+Without the sandbox, if the random number was 0.8 on first run (timer started) but 0.3 on replay (no timer), the Worker would see a `StartTimer` command that doesn't match history, causing a non-deterministic error.
+
+**Good news**: The TypeScript sandbox automatically makes `Math.random()` deterministic, so this specific code actually works. But the concept is important for understanding WHY the sandbox exists.
+
+## Automatic Replacements
+
+The sandbox replaces non-deterministic APIs with deterministic versions:
+
+| Original | Replacement |
+|----------|-------------|
+| `Math.random()` | Seeded PRNG per workflow |
+| `Date.now()` | Workflow task start time |
+| `Date` constructor | Deterministic time |
+| `setTimeout` | Workflow timer |
+
+## Safe Operations
+
+```typescript
+import { sleep } from '@temporalio/workflow';
+
+// These are all safe in workflows:
+Math.random();         // Deterministic
+Date.now();            // Deterministic
+new Date();            // Deterministic
+await sleep('1 hour'); // Durable timer
+
+// Object iteration is deterministic in JavaScript
+for (const key in obj) { }
+Object.keys(obj).forEach(k => { });
+```
+
+## Forbidden Operations
+
+```typescript
+// DO NOT do these in workflows:
+import fs from 'fs';  // Node.js modules
+fetch('https://...'); // Network I/O
+console.log();        // Side effects (use workflow.log)
+```
+
+## Type-Only Activity Imports
+
+```typescript
+// CORRECT - type-only import
+import type * as activities from './activities';
+
+const { myActivity } = proxyActivities<typeof activities>({
+  startToCloseTimeout: '5 minutes',
+});
+
+// WRONG - actual import brings in implementation
+import * as activities from './activities';
+```
+
+## Workflow Bundling
+
+Workflows are bundled by the worker using Webpack. The bundled code runs in isolation.
+
+```typescript
+const worker = await Worker.create({
+  workflowsPath: require.resolve('./workflows'), // Gets bundled
+  activities, // Not bundled, runs in Node.js
+  taskQueue: 'my-queue',
+});
+```
+
+## Patching for Versioning
+
+Use `patched()` to safely change workflow code while maintaining compatibility with running workflows:
+
+```typescript
+import { patched, deprecatePatch } from '@temporalio/workflow';
+
+export async function myWorkflow(): Promise<string> {
+  if (patched('my-change')) {
+    // New code path
+    return await newImplementation();
+  } else {
+    // Old code path (for replay)
+    return await oldImplementation();
+  }
+}
+
+// Later, after all old workflows complete:
+export async function myWorkflow(): Promise<string> {
+  deprecatePatch('my-change');
+  return await newImplementation();
+}
+```
+
+## Best Practices
+
+1. Use type-only imports for activities in workflow files
+2. Match all @temporalio package versions
+3. Use `sleep()` from workflow package, never `setTimeout` directly
+4. Keep workflows focused on orchestration
+5. Test with replay to verify determinism
+6. Use `patched()` when changing workflow logic for running workflows
diff --git a/references/typescript/error-handling.md b/references/typescript/error-handling.md
new file mode 100644
index 0000000..4057043
--- /dev/null
+++ b/references/typescript/error-handling.md
@@ -0,0 +1,180 @@
+# TypeScript SDK Error Handling
+
+## Overview
+
+The TypeScript SDK uses `ApplicationFailure` for application errors with support for non-retryable marking.
+
+## Application Failures
+
+```typescript
+import { ApplicationFailure } from '@temporalio/workflow';
+
+export async function myWorkflow(): Promise<void> {
+  throw ApplicationFailure.create({
+    message: 'Invalid input',
+    type: 'ValidationError',
+    nonRetryable: true,
+  });
+}
+```
+
+## Activity Errors
+
+```typescript
+import { ApplicationFailure } from '@temporalio/activity';
+
+export async function validateActivity(input: string): Promise<void> {
+  if (!isValid(input)) {
+    throw ApplicationFailure.create({
+      message: `Invalid input: ${input}`,
+      type: 'ValidationError',
+      nonRetryable: true,
+    });
+  }
+}
+```
+
+## Handling Errors in Workflows
+
+```typescript
+import { proxyActivities, ApplicationFailure, log } from '@temporalio/workflow';
+import type * as activities from './activities';
+
+const { riskyActivity } = proxyActivities<typeof activities>({
+  startToCloseTimeout: '5 minutes',
+});
+
+export async function workflowWithErrorHandling(): Promise<string> {
+  try {
+    return await riskyActivity();
+  } catch (err) {
+    if (err instanceof ApplicationFailure) {
+      log.error(`Activity failed: ${err.type} - ${err.message}`);
+    }
+    throw err;
+  }
+}
+```
+
+## Retry Configuration
+
+```typescript
+const { myActivity } = proxyActivities<typeof activities>({
+  startToCloseTimeout: '10 minutes',
+  retry: {
+    initialInterval: '1s',
+    backoffCoefficient: 2,
+    maximumInterval: '1m',
+    maximumAttempts: 5,
+    nonRetryableErrorTypes: ['ValidationError', 'PaymentError'],
+  },
+});
+```
+
+## Timeout Configuration
+
+```typescript
+const { myActivity } = proxyActivities<typeof activities>({
+  startToCloseTimeout: '5 minutes',     // Single attempt
+  scheduleToCloseTimeout: '30 minutes', // Including retries
+  heartbeatTimeout: '30 seconds',       // Between heartbeats
+});
+```
+
+## Cancellation Handling in Activities
+
+```typescript
+import { CancelledFailure, heartbeat } from '@temporalio/activity';
+
+export async function cancellableActivity(): Promise<void> {
+  try {
+    while (true) {
+      heartbeat();
+      await doWork();
+    }
+  } catch (err) {
+    if (err instanceof CancelledFailure) {
+      await cleanup();
+    }
+    throw err;
+  }
+}
+```
+
+## Idempotency Patterns
+
+Activities may be executed more than once due to retries. Design activities to be idempotent to prevent duplicate side effects.
+
+### Why Activities Need Idempotency
+
+Consider this scenario:
+1. Worker polls and accepts an Activity Task
+2. Activity function completes successfully
+3. Worker crashes before notifying the Cluster
+4. Cluster retries the Activity (doesn't know it completed)
+
+If the Activity charged a credit card, the customer would be charged twice. 
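+
+A minimal sketch of the failure mode, assuming a hypothetical `paymentService` client (the idempotent fix follows below):
+
+```typescript
+// UNSAFE - not idempotent: every retry issues a brand-new charge
+export async function chargePaymentNaive(
+  customerId: string,
+  amount: number
+): Promise<string> {
+  // If the Worker crashes after this call succeeds but before it
+  // reports completion, Temporal retries the Activity and the
+  // customer is charged a second time.
+  const result = await paymentService.charge({ customerId, amount });
+  return result.transactionId;
+}
+```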
+
+### Using Idempotency Keys
+
+Use the Workflow Run ID + Activity ID as an idempotency key - this is constant across retries but unique across workflow executions:
+
+```typescript
+import { activityInfo } from '@temporalio/activity';
+
+export async function chargePayment(
+  customerId: string,
+  amount: number
+): Promise<string> {
+  // Create idempotency key from workflow context
+  const { workflowExecution, activityId } = activityInfo();
+  const idempotencyKey = `${workflowExecution.runId}-${activityId}`;
+
+  // Pass to external service (e.g., Stripe, payment processor)
+  const result = await paymentService.charge({
+    customerId,
+    amount,
+    idempotencyKey, // Service ignores duplicate requests with same key
+  });
+
+  return result.transactionId;
+}
+```
+
+**Important**: Use the Run ID (`workflowExecution.runId`), not the Workflow ID, because Workflow IDs can be reused.
+
+### Granular Activities
+
+Make activities more granular to reduce the scope of potential retries:
+
+```typescript
+// BETTER - Three small activities
+export async function lookupCustomer(customerId: string): Promise<Customer> {
+  return await db.findCustomer(customerId);
+}
+
+export async function processPayment(paymentInfo: PaymentInfo): Promise<string> {
+  const { workflowExecution, activityId } = activityInfo();
+  const idempotencyKey = `${workflowExecution.runId}-${activityId}`;
+  return await paymentService.process(paymentInfo, idempotencyKey);
+}
+
+export async function sendReceipt(transactionId: string): Promise<void> {
+  await emailService.sendReceipt(transactionId);
+}
+
+// WORSE - One large activity doing multiple things
+export async function processOrder(order: Order): Promise<void> {
+  const customer = await db.findCustomer(order.customerId);
+  await paymentService.process(order.payment); // If this fails here...
+  await emailService.sendReceipt(order.id);    // ...all three retry
+}
+```
+
+## Best Practices
+
+1. Use specific error types for different failure modes
+2. Set `nonRetryable: true` for permanent failures
+3. Configure `nonRetryableErrorTypes` in retry policy
+4. Handle `CancelledFailure` in activities that need cleanup
+5. Always re-throw errors after handling
+6. Use idempotency keys for activities with external side effects
+7. Make activities granular to minimize retry scope
diff --git a/references/typescript/gotchas.md b/references/typescript/gotchas.md
new file mode 100644
index 0000000..b535c93
--- /dev/null
+++ b/references/typescript/gotchas.md
@@ -0,0 +1,403 @@
+# TypeScript Gotchas
+
+TypeScript-specific mistakes and anti-patterns. See also [Common Gotchas](../core/common-gotchas.md) for language-agnostic concepts.
+
+## Idempotency
+
+```typescript
+// BAD - May charge customer multiple times on retry
+export async function chargePayment(customerId: string, orderId: string, amount: number): Promise<string> {
+  return await paymentApi.charge(customerId, amount);
+}
+
+// GOOD - Safe for retries
+export async function chargePayment(customerId: string, orderId: string, amount: number): Promise<string> {
+  return await paymentApi.charge(customerId, amount, {
+    idempotencyKey: `order-${orderId}`,
+  });
+}
+```
+
+## Replay Safety
+
+### Side Effects in Workflows
+
+```typescript
+// BAD - console.log runs on every replay
+export async function notificationWorkflow(): Promise<void> {
+  console.log('Starting workflow'); // Runs on replay too
+  await sendSlackNotification('Started'); // Side effect in workflow!
+  await activities.doWork();
+}
+
+// GOOD - Use workflow logger and activities for side effects
+import { log } from '@temporalio/workflow';
+
+export async function notificationWorkflow(): Promise<void> {
+  log.info('Starting workflow'); // Only logs on first execution
+  await activities.sendNotification('Started');
+}
+```
+
+### Non-Deterministic Operations
+
+The TypeScript SDK automatically replaces some non-deterministic operations:
+
+```typescript
+// These are SAFE - automatically replaced by SDK
+const now = Date.now();          // Deterministic
+const random = Math.random();    // Deterministic
+const id = crypto.randomUUID();  // Deterministic (if using workflow's crypto)
+
+// For explicit deterministic UUID, use:
+import { uuid4 } from '@temporalio/workflow';
+const id = uuid4();
+```
+
+## Query Handlers
+
+### Modifying State
+
+```typescript
+// BAD - Query modifies state
+export const getNextItemQuery = defineQuery<string | undefined>('getNextItem');
+
+export async function queueWorkflow(queueId: string): Promise<void> {
+  const queue: string[] = [];
+
+  setHandler(getNextItemQuery, () => {
+    return queue.shift(); // Mutates state!
+  });
+
+  await condition(() => false);
+}
+
+// GOOD - Query reads, Update modifies
+export const peekQuery = defineQuery<string | undefined>('peek');
+export const dequeueUpdate = defineUpdate<string | undefined>('dequeue');
+
+export async function queueWorkflow(): Promise<void> {
+  const queue: string[] = [];
+
+  setHandler(peekQuery, () => queue[0]);
+
+  setHandler(dequeueUpdate, () => queue.shift());
+
+  await condition(() => false);
+}
+```
+
+### Blocking in Queries
+
+```typescript
+// BAD - Queries cannot await
+setHandler(getDataQuery, async () => {
+  if (!data) {
+    data = await activities.fetchData(); // Cannot await in query!
+  }
+  return data;
+});
+
+// GOOD - Query returns state, signal triggers refresh
+setHandler(refreshSignal, async () => {
+  data = await activities.fetchData();
+});
+
+setHandler(getDataQuery, () => data);
+```
+
+## Activity Imports
+
+### Importing Implementations Instead of Types
+
+**The Problem**: Importing activity implementations brings Node.js code into the V8 workflow sandbox, causing bundling errors or runtime failures.
+
+```typescript
+// BAD - Brings actual code into workflow sandbox
+import * as activities from './activities';
+
+const { greet } = proxyActivities<typeof activities>({
+  startToCloseTimeout: '1 minute',
+});
+
+// GOOD - Type-only import
+import type * as activities from './activities';
+
+const { greet } = proxyActivities<typeof activities>({
+  startToCloseTimeout: '1 minute',
+});
+```
+
+### Importing Node.js Modules in Workflows
+
+```typescript
+// BAD - fs is not available in workflow sandbox
+import * as fs from 'fs';
+
+export async function myWorkflow(): Promise<void> {
+  const data = fs.readFileSync('file.txt'); // Will fail!
+}
+
+// GOOD - File I/O belongs in activities
+export async function myWorkflow(): Promise<void> {
+  const data = await activities.readFile('file.txt');
+}
+```
+
+## Bundling Issues
+
+### Missing Dependencies in Workflow Bundle
+
+```typescript
+// If using external packages in workflows, ensure they're bundled
+
+// worker.ts
+const worker = await Worker.create({
+  workflowsPath: require.resolve('./workflows'),
+  bundlerOptions: {
+    // Exclude Node-only packages from the workflow bundle if needed
+    ignoreModules: ['some-node-only-package'],
+  },
+});
+```
+
+### Package Version Mismatches
+
+All `@temporalio/*` packages must have the same version:
+
+```json
+// BAD - Version mismatch
+{
+  "dependencies": {
+    "@temporalio/client": "1.9.0",
+    "@temporalio/worker": "1.8.0",
+    "@temporalio/workflow": "1.9.1"
+  }
+}
+
+// GOOD - All versions match
+{
+  "dependencies": {
+    "@temporalio/client": "1.9.0",
+    "@temporalio/worker": "1.9.0",
+    "@temporalio/workflow": "1.9.0"
+  }
+}
+```
+
+## Error Handling
+
+### Swallowing Errors
+
+```typescript
+// BAD - Error is hidden
+export async function riskyWorkflow(): Promise<void> {
+  try {
+    await activities.riskyOperation();
+  } catch {
+    // Error is lost!
+  }
+}
+
+// GOOD - Handle appropriately
+import { log } from '@temporalio/workflow';
+
+export async function riskyWorkflow(): Promise<void> {
+  try {
+    await activities.riskyOperation();
+  } catch (err) {
+    log.error('Activity failed', { error: err });
+    throw err; // Or use fallback, compensate, etc.
+  }
+}
+```
+
+### Wrong Retry Classification
+
+```typescript
+// BAD - Network errors should be retried
+export async function callApi(): Promise<Response> {
+  try {
+    return await fetch(url);
+  } catch (err) {
+    throw ApplicationFailure.nonRetryable('Connection failed');
+  }
+}
+
+// GOOD - Only permanent failures are non-retryable
+export async function callApi(): Promise<Response> {
+  try {
+    return await fetch(url);
+  } catch (err) {
+    if (err instanceof InvalidCredentialsError) {
+      throw ApplicationFailure.nonRetryable('Invalid API key');
+    }
+    throw err; // Let Temporal retry network errors
+  }
+}
+```
+
+## Retry Policies
+
+### Too Aggressive
+
+```typescript
+// BAD - Gives up too easily
+const { flakyApiCall } = proxyActivities<typeof activities>({
+  scheduleToCloseTimeout: '30 seconds',
+  retry: { maximumAttempts: 1 },
+});
+
+// GOOD - Resilient to transient failures
+const { flakyApiCall } = proxyActivities<typeof activities>({
+  scheduleToCloseTimeout: '10 minutes',
+  retry: {
+    initialInterval: '1 second',
+    maximumInterval: '1 minute',
+    backoffCoefficient: 2,
+    maximumAttempts: 10,
+  },
+});
+```
+
+## Cancellation
+
+### Not Handling Cancellation
+
+```typescript
+// BAD - Cleanup doesn't run on cancellation
+export async function workflowWithCleanup(): Promise<void> {
+  await activities.acquireResource();
+  await activities.doWork();
+  await activities.releaseResource(); // Never runs if cancelled!
+}
+
+// GOOD - Use CancellationScope for cleanup
+import { CancellationScope } from '@temporalio/workflow';
+
+export async function workflowWithCleanup(): Promise<void> {
+  await activities.acquireResource();
+  try {
+    await activities.doWork();
+  } finally {
+    // Run cleanup even on cancellation
+    await CancellationScope.nonCancellable(async () => {
+      await activities.releaseResource();
+    });
+  }
+}
+```
+
+## Heartbeating
+
+### Forgetting to Heartbeat Long Activities
+
+```typescript
+// BAD - No heartbeat, can't detect stuck activities
+export async function processLargeFile(path: string): Promise<void> {
+  for await (const chunk of readChunks(path)) {
+    await processChunk(chunk); // Takes hours, no heartbeat
+  }
+}
+
+// GOOD - Regular heartbeats with progress
+import { heartbeat } from '@temporalio/activity';
+
+export async function processLargeFile(path: string): Promise<void> {
+  let i = 0;
+  for await (const chunk of readChunks(path)) {
+    heartbeat(`Processing chunk ${i++}`);
+    await processChunk(chunk);
+  }
+}
+```
+
+### Heartbeat Timeout Too Short
+
+```typescript
+// BAD - Heartbeat timeout shorter than processing time
+const { processChunk } = proxyActivities<typeof activities>({
+  startToCloseTimeout: '30 minutes',
+  heartbeatTimeout: '10 seconds', // Too short!
+});
+
+// GOOD - Heartbeat timeout allows for processing variance
+const { processChunk } = proxyActivities<typeof activities>({
+  startToCloseTimeout: '30 minutes',
+  heartbeatTimeout: '2 minutes',
+});
+```
+
+## Testing
+
+### Not Testing Failures
+
+```typescript
+import { TestWorkflowEnvironment } from '@temporalio/testing';
+import { Worker } from '@temporalio/worker';
+import { ApplicationFailure } from '@temporalio/common';
+
+test('handles activity failure', async () => {
+  const env = await TestWorkflowEnvironment.createTimeSkipping();
+
+  const worker = await Worker.create({
+    connection: env.nativeConnection,
+    taskQueue: 'test',
+    workflowsPath: require.resolve('./workflows'),
+    activities: {
+      // Activity that always fails
+      riskyOperation: async () => {
+        throw ApplicationFailure.nonRetryable('Simulated failure');
+      },
+    },
+  });
+
+  await worker.runUntil(async () => {
+    await expect(
+      env.client.workflow.execute(riskyWorkflow, {
+        workflowId: 'test-failure',
+        taskQueue: 'test',
+      })
+    ).rejects.toThrow('Simulated failure');
+  });
+
+  await env.teardown();
+});
+```
+
+### Not Testing Replay
+
+```typescript
+import { Worker } from '@temporalio/worker';
+
+test('replay compatibility', async () => {
+  const history = await import('./fixtures/workflow_history.json');
+
+  // Fails if current code is incompatible with history
+  await Worker.runReplayHistory(
+    { workflowsPath: require.resolve('./workflows') },
+    history
+  );
+});
+```
+
+## Timers and Sleep
+
+### Using JavaScript setTimeout
+
+```typescript
+// BAD - setTimeout is not durable
+export async function delayedWorkflow(): Promise<void> {
+  await new Promise(resolve => setTimeout(resolve, 60000)); // Not durable!
+  await activities.doWork();
+}
+
+// GOOD - Use workflow sleep
+import { sleep } from '@temporalio/workflow';
+
+export async function delayedWorkflow(): Promise<void> {
+  await sleep('1 minute'); // Durable, survives restarts
+  await activities.doWork();
+}
+```
diff --git a/references/typescript/observability.md b/references/typescript/observability.md
new file mode 100644
index 0000000..74e24bc
--- /dev/null
+++ b/references/typescript/observability.md
@@ -0,0 +1,231 @@
+# TypeScript SDK Observability
+
+## Overview
+
+The TypeScript SDK provides replay-aware logging, metrics, and OpenTelemetry integration for production observability. 
+
+## Replay-Aware Logging
+
+Temporal's logger automatically suppresses duplicate messages during replay, preventing log spam when workflows recover state.
+
+### Workflow Logging
+
+```typescript
+import { log } from '@temporalio/workflow';
+
+export async function orderWorkflow(orderId: string): Promise<string> {
+  log.info('Processing order', { orderId });
+
+  const result = await processPayment(orderId);
+  log.debug('Payment processed', { orderId, result });
+
+  return result;
+}
+```
+
+**Log levels**: `log.debug()`, `log.info()`, `log.warn()`, `log.error()`
+
+### Activity Logging
+
+```typescript
+import * as activity from '@temporalio/activity';
+
+export async function processPayment(orderId: string): Promise<string> {
+  const context = activity.Context.current();
+  context.log.info('Processing payment', { orderId });
+
+  // Activity logs don't need replay suppression
+  // since completed activities aren't re-executed
+  return 'payment-id-123';
+}
+```
+
+## Customizing the Logger
+
+### Basic Configuration
+
+```typescript
+import { DefaultLogger, Runtime } from '@temporalio/worker';
+
+const logger = new DefaultLogger('DEBUG', ({ level, message }) => {
+  console.log(`Custom logger: ${level} - ${message}`);
+});
+Runtime.install({ logger });
+```
+
+### Winston Integration
+
+```typescript
+import winston from 'winston';
+import { DefaultLogger, Runtime } from '@temporalio/worker';
+
+const winstonLogger = winston.createLogger({
+  level: 'debug',
+  format: winston.format.json(),
+  transports: [
+    new winston.transports.File({ filename: 'temporal.log' })
+  ],
+});
+
+const logger = new DefaultLogger('DEBUG', (entry) => {
+  winstonLogger.log({
+    label: entry.meta?.activityId ? 'activity' : entry.meta?.workflowId ? 'workflow' : 'worker',
+    level: entry.level.toLowerCase(),
+    message: entry.message,
+    timestamp: Number(entry.timestampNanos / 1_000_000n),
+    ...entry.meta,
+  });
+});
+
+Runtime.install({ logger });
+```
+
+## OpenTelemetry Integration
+
+The `@temporalio/interceptors-opentelemetry` package provides tracing for workflows and activities. 
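+
+A typical installation for the setup below - package names are taken from the imports shown, so check version compatibility with your SDK release:
+
+```bash
+npm install @temporalio/interceptors-opentelemetry \
+  @opentelemetry/sdk-node \
+  @opentelemetry/exporter-trace-otlp-grpc \
+  @opentelemetry/resources \
+  @opentelemetry/semantic-conventions
+```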
+
+### Setup
+
+```typescript
+// instrumentation.ts - require before other imports
+import { NodeSDK } from '@opentelemetry/sdk-node';
+import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
+import { Resource } from '@opentelemetry/resources';
+import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions';
+
+export const resource = new Resource({
+  [ATTR_SERVICE_NAME]: 'my-temporal-service',
+});
+
+// Use OTLP exporter for production
+export const traceExporter = new OTLPTraceExporter({
+  url: 'http://127.0.0.1:4317',
+  timeoutMillis: 1000,
+});
+
+export const otelSdk = new NodeSDK({
+  resource,
+  traceExporter,
+});
+
+otelSdk.start();
+```
+
+### Worker Configuration
+
+```typescript
+import { Worker } from '@temporalio/worker';
+import {
+  OpenTelemetryActivityInboundInterceptor,
+  OpenTelemetryActivityOutboundInterceptor,
+  makeWorkflowExporter,
+} from '@temporalio/interceptors-opentelemetry/lib/worker';
+import { resource, traceExporter } from './instrumentation';
+import * as activities from './activities';
+
+const worker = await Worker.create({
+  workflowsPath: require.resolve('./workflows'),
+  activities,
+  taskQueue: 'my-queue',
+
+  // OpenTelemetry sinks and interceptors
+  sinks: {
+    exporter: makeWorkflowExporter(traceExporter, resource),
+  },
+  interceptors: {
+    workflowModules: [require.resolve('./workflows')],
+    activity: [
+      (ctx) => ({
+        inbound: new OpenTelemetryActivityInboundInterceptor(ctx),
+        outbound: new OpenTelemetryActivityOutboundInterceptor(ctx),
+      }),
+    ],
+  },
+});
+```
+
+## Metrics
+
+### Prometheus Metrics
+
+```typescript
+import { Runtime } from '@temporalio/worker';
+
+Runtime.install({
+  telemetryOptions: {
+    metrics: {
+      prometheus: {
+        bindAddress: '127.0.0.1:9091',
+      },
+    },
+  },
+});
+```
+
+### OTLP Metrics
+
+```typescript
+Runtime.install({
+  telemetryOptions: {
+    metrics: {
+      otel: {
+        url: 'http://127.0.0.1:4317',
+        metricsExportInterval: '1s',
+      },
+    },
+  },
+});
+```
+
+## Debugging with Event History
+
+### Viewing Event History
+
+Use the Temporal CLI or Web UI to inspect workflow execution history:
+
+```bash
+# CLI
+temporal workflow show --workflow-id my-workflow
+
+# Get history as JSON
+temporal workflow show --workflow-id my-workflow --output json
+```
+
+### Key Events to Look For
+
+| Event | Indicates |
+|-------|-----------|
+| `ActivityTaskScheduled` | Activity was requested |
+| `ActivityTaskStarted` | Worker started executing activity |
+| `ActivityTaskCompleted` | Activity completed successfully |
+| `ActivityTaskFailed` | Activity threw an error |
+| `ActivityTaskTimedOut` | Activity exceeded timeout |
+| `TimerStarted` | `sleep()` called |
+| `TimerFired` | Sleep completed |
+| `WorkflowTaskFailed` | Non-deterministic error or workflow bug |
+
+### Debugging Non-Determinism
+
+If you see `WorkflowTaskFailed` with a non-determinism error:
+
+1. Export the history: `temporal workflow show -w <workflow-id> -o json > history.json`
+2. Run replay test to reproduce:
+
+```typescript
+import { Worker } from '@temporalio/worker';
+
+await Worker.runReplayHistory(
+  { workflowsPath: require.resolve('./workflows') },
+  history
+);
+```
+
+## Best Practices
+
+1. Use `log` from `@temporalio/workflow` - never `console.log` in workflows
+2. Include correlation IDs (orderId, customerId) in log messages
+3. Configure Winston or similar for production log aggregation
+4. Enable OpenTelemetry for distributed tracing across services
+5. Monitor Prometheus metrics for worker health
+6. Use Event History for debugging workflow issues
diff --git a/references/typescript/patterns.md b/references/typescript/patterns.md
new file mode 100644
index 0000000..c6a606d
--- /dev/null
+++ b/references/typescript/patterns.md
@@ -0,0 +1,422 @@
+# TypeScript SDK Patterns
+
+## Signals
+
+**WHY**: Signals allow external clients or other workflows to send data to a running workflow asynchronously. Unlike queries (read-only), signals can mutate workflow state.
+
+**WHEN to use**:
+- Sending events to a running workflow (e.g., approval, cancellation request)
+- Adding items to a workflow's queue or collection
+- Notifying a workflow about external state changes
+- Implementing human-in-the-loop workflows
+
+```typescript
+import { defineSignal, setHandler, condition } from '@temporalio/workflow';
+
+const approveSignal = defineSignal<[boolean]>('approve');
+const addItemSignal = defineSignal<[string]>('addItem');
+
+export async function orderWorkflow(): Promise<string> {
+  let approved = false;
+  const items: string[] = [];
+
+  setHandler(approveSignal, (value) => {
+    approved = value;
+  });
+
+  setHandler(addItemSignal, (item) => {
+    items.push(item);
+  });
+
+  await condition(() => approved);
+  return `Processed ${items.length} items`;
+}
+```
+
+## Queries
+
+**WHY**: Queries provide a synchronous, read-only way to inspect workflow state. They execute instantly without modifying workflow state or history.
+
+**WHEN to use**:
+- Exposing workflow progress or status to external systems
+- Building dashboards or monitoring UIs
+- Debugging workflow state during development
+- Implementing "get current state" endpoints
+
+```typescript
+import { defineQuery, setHandler } from '@temporalio/workflow';
+
+const statusQuery = defineQuery<string>('status');
+const progressQuery = defineQuery<number>('progress');
+
+export async function progressWorkflow(): Promise<void> {
+  let status = 'running';
+  let progress = 0;
+
+  setHandler(statusQuery, () => status);
+  setHandler(progressQuery, () => progress);
+
+  for (let i = 0; i < 100; i++) {
+    progress = i;
+    await doWork();
+  }
+  status = 'completed';
+}
+```
+
+## Updates
+
+**WHY**: Updates combine the state mutation capability of signals with the synchronous response of queries. The caller waits for the update handler to complete and receives a return value. 
+
+**WHEN to use**:
+- Operations that modify state AND need to return a result (e.g., "add item and return new count")
+- Validation before accepting a change (use validators to reject invalid updates)
+- Synchronous request-response patterns within a workflow
+- Replacing signal+query combos where you signal then immediately query
+
+### Defining Update Handlers
+
+```typescript
+import { defineUpdate, setHandler, condition } from '@temporalio/workflow';
+
+// Define the update - specify return type and argument types
+export const addItemUpdate = defineUpdate<number, [string]>('addItem');
+export const addItemValidatedUpdate = defineUpdate<number, [string]>('addItemValidated');
+
+export async function orderWorkflow(): Promise<string> {
+  const items: string[] = [];
+  let completed = false;
+
+  // Simple update handler - returns new item count
+  setHandler(addItemUpdate, (item: string) => {
+    items.push(item);
+    return items.length;
+  });
+
+  // Update handler with validator - rejects invalid input before execution
+  setHandler(
+    addItemValidatedUpdate,
+    (item: string) => {
+      items.push(item);
+      return items.length;
+    },
+    {
+      validator: (item: string) => {
+        if (!item) throw new Error('Item cannot be empty');
+        if (items.length >= 100) throw new Error('Order is full');
+      },
+    }
+  );
+
+  await condition(() => completed);
+  return `Order with ${items.length} items completed`;
+}
+```
+
+### Calling Updates from Client
+
+```typescript
+import { Client } from '@temporalio/client';
+import { addItemUpdate } from './workflows';
+
+const client = new Client();
+const handle = client.workflow.getHandle('order-123');
+
+// Execute update and wait for result
+const count = await handle.executeUpdate(addItemUpdate, { args: ['new-item'] });
+console.log(`Order now has ${count} items`);
+
+// Start update and get handle for later result retrieval
+const updateHandle = await handle.startUpdate(addItemUpdate, {
+  args: ['another-item'],
+  waitForStage: 'ACCEPTED',
+});
+const result = await updateHandle.result();
+```
+
+## Signal-with-Start
+
+**WHY**: Atomically starts a workflow and sends it a signal in a single operation. Avoids race conditions where the workflow might complete before receiving the signal.
+
+**WHEN to use**:
+- Starting a workflow and immediately sending it data
+- Idempotent "create or update" patterns
+- Ensuring a signal is delivered even if the workflow needs to be started first
+
+```typescript
+import { Client } from '@temporalio/client';
+import { orderSignal } from './workflows';
+
+const client = new Client();
+
+const handle = await client.workflow.signalWithStart('orderWorkflow', {
+  workflowId: `order-${customerId}`,
+  taskQueue: 'orders',
+  args: [customerId],
+  signal: orderSignal,
+  signalArgs: [{ item: 'product-123', quantity: 2 }],
+});
+```
+
+## Child Workflows
+
+**WHY**: Child workflows decompose complex workflows into smaller, reusable units. Each child has its own history, preventing history bloat. 
+
+**WHEN to use**:
+- Breaking down large workflows to prevent history growth
+- Reusing workflow logic across multiple parent workflows
+- Isolating failures - a child can fail without failing the parent
+
+```typescript
+import { executeChild } from '@temporalio/workflow';
+
+export async function parentWorkflow(orders: Order[]): Promise<string[]> {
+  const results: string[] = [];
+
+  for (const order of orders) {
+    const result = await executeChild(processOrderWorkflow, {
+      args: [order],
+      workflowId: `order-${order.id}`,
+    });
+    results.push(result);
+  }
+
+  return results;
+}
+```
+
+### Child Workflow Options
+
+```typescript
+import { executeChild, ParentClosePolicy, ChildWorkflowCancellationType } from '@temporalio/workflow';
+
+const result = await executeChild(childWorkflow, {
+  args: [input],
+  workflowId: `child-${workflowInfo().workflowId}`,
+
+  // ParentClosePolicy - what happens to child when parent closes
+  // TERMINATE (default), ABANDON, REQUEST_CANCEL
+  parentClosePolicy: ParentClosePolicy.TERMINATE,
+
+  // ChildWorkflowCancellationType - how cancellation is handled
+  // WAIT_CANCELLATION_COMPLETED (default), WAIT_CANCELLATION_REQUESTED, TRY_CANCEL, ABANDON
+  cancellationType: ChildWorkflowCancellationType.WAIT_CANCELLATION_COMPLETED,
+});
+```
+
+## Parallel Execution
+
+**WHY**: Running multiple operations concurrently improves workflow performance when operations are independent.
+
+**WHEN to use**:
+- Processing multiple independent items
+- Calling multiple APIs that don't depend on each other
+- Fan-out/fan-in patterns
+
+```typescript
+export async function parallelWorkflow(items: string[]): Promise<string[]> {
+  return await Promise.all(
+    items.map((item) => processItem(item))
+  );
+}
+```
+
+## Continue-as-New
+
+**WHY**: Prevents unbounded history growth by completing the current workflow and starting a new run with the same workflow ID.
+
+**WHEN to use**:
+- Long-running workflows that would accumulate too much history
+- Entity/subscription workflows that run indefinitely
+- Batch processing with large numbers of iterations
+
+```typescript
+import { continueAsNew, workflowInfo } from '@temporalio/workflow';
+
+export async function longRunningWorkflow(state: State): Promise<string> {
+  while (true) {
+    state = await processNextBatch(state);
+
+    if (state.isComplete) {
+      return 'done';
+    }
+
+    const info = workflowInfo();
+    if (info.continueAsNewSuggested || info.historyLength > 10000) {
+      await continueAsNew<typeof longRunningWorkflow>(state);
+    }
+  }
+}
+```
+
+## Cancellation Scopes
+
+**WHY**: Control how cancellation propagates to activities and child workflows. Essential for cleanup logic and timeout behavior.
+
+**WHEN to use**:
+- Ensuring cleanup activities run even when workflow is cancelled
+- Implementing timeouts for activity groups
+- Manual cancellation of specific operations
+
+```typescript
+import { CancellationScope, sleep } from '@temporalio/workflow';
+
+export async function scopedWorkflow(): Promise<void> {
+  // Non-cancellable scope - runs even if workflow cancelled
+  await CancellationScope.nonCancellable(async () => {
+    await cleanupActivity();
+  });
+
+  // Timeout scope
+  await CancellationScope.withTimeout('5 minutes', async () => {
+    await longRunningActivity();
+  });
+
+  // Manual cancellation
+  const scope = new CancellationScope();
+  const promise = scope.run(() => someActivity());
+  scope.cancel();
+}
+```
+
+## Saga Pattern
+
+**WHY**: Implement distributed transactions by tracking compensation actions. If any step fails, previously completed steps are rolled back in reverse order. 
+
+**WHEN to use**:
+- Multi-step business transactions that span multiple services
+- Operations where partial completion requires cleanup
+- Financial transactions, order processing, booking systems
+
+```typescript
+import { log } from '@temporalio/workflow';
+
+export async function sagaWorkflow(order: Order): Promise<string> {
+  const compensations: Array<() => Promise<void>> = [];
+
+  try {
+    await reserveInventory(order);
+    compensations.push(() => releaseInventory(order));
+
+    await chargePayment(order);
+    compensations.push(() => refundPayment(order));
+
+    await shipOrder(order);
+    return 'Order completed';
+  } catch (err) {
+    for (const compensate of compensations.reverse()) {
+      try {
+        await compensate();
+      } catch (compErr) {
+        log.error('Compensation failed', { error: compErr });
+      }
+    }
+    throw err;
+  }
+}
+```
+
+## Entity Workflow Pattern
+
+**WHY**: Model a long-lived entity as a single workflow that handles events over its lifetime.
+
+**WHEN to use**:
+- Modeling stateful entities that exist for extended periods
+- Subscription management, user sessions
+- Any entity that receives events and must maintain consistent state
+
+```typescript
+import { defineSignal, defineQuery, setHandler, condition, continueAsNew, workflowInfo } from '@temporalio/workflow';
+
+const eventSignal = defineSignal<[Event]>('event');
+const stateQuery = defineQuery<EntityState>('state');
+
+export async function entityWorkflow(entityId: string, initialState: EntityState): Promise<void> {
+  let state = initialState;
+
+  setHandler(stateQuery, () => state);
+  setHandler(eventSignal, (event: Event) => {
+    state = applyEvent(state, event);
+  });
+
+  while (!state.deleted) {
+    await condition(() => state.deleted || workflowInfo().continueAsNewSuggested);
+    if (workflowInfo().continueAsNewSuggested && !state.deleted) {
+      await continueAsNew<typeof entityWorkflow>(entityId, state);
+    }
+  }
+}
+```
+
+## Triggers (Promise-like Signals)
+
+**WHY**: Triggers provide a one-shot promise that resolves when a signal is received. Cleaner than `condition()` for single-value signals.
+
+**WHEN to use**:
+- Waiting for a single response (approval, completion notification)
+- Converting signal-based events into awaitable promises
+
+```typescript
+import { Trigger } from '@temporalio/workflow';
+
+export async function triggerWorkflow(): Promise<string> {
+  const approvalTrigger = new Trigger<boolean>();
+
+  setHandler(approveSignal, (approved) => {
+    approvalTrigger.resolve(approved);
+  });
+
+  const approved = await approvalTrigger;
+  return approved ? 'Approved' : 'Rejected';
+}
+```
+
+## Timers
+
+**WHY**: Durable timers that survive worker restarts. Use `sleep()` for delays instead of JavaScript `setTimeout`.
+
+**WHEN to use**:
+- Implementing delays between steps
+- Scheduling future actions
+- Timeout patterns (combined with cancellation scopes)
+
+```typescript
+import { sleep, CancellationScope } from '@temporalio/workflow';
+
+export async function timerWorkflow(): Promise<string> {
+  await sleep('1 hour');
+
+  const timerScope = new CancellationScope();
+  const timerPromise = timerScope.run(() => sleep('1 hour'));
+
+  setHandler(cancelSignal, () => {
+    timerScope.cancel();
+  });
+
+  try {
+    await timerPromise;
+    return 'Timer completed';
+  } catch {
+    return 'Timer cancelled';
+  }
+}
+```
+
+## uuid4() Utility
+
+**WHY**: Generate deterministic UUIDs safe to use in workflows. Uses the workflow's seeded PRNG, so the same UUID is generated during replay. 
**WHEN to use**:
- Generating unique IDs for child workflows
- Creating idempotency keys
- Any situation requiring unique identifiers in workflow code

```typescript
import { executeChild, uuid4 } from '@temporalio/workflow';

export async function workflowWithIds(input: unknown): Promise<void> {
  const childWorkflowId = uuid4();
  await executeChild(childWorkflow, {
    workflowId: childWorkflowId,
    args: [input],
  });
}
```
diff --git a/references/typescript/testing.md b/references/typescript/testing.md
new file mode 100644
index 0000000..70e4a0b
--- /dev/null
+++ b/references/typescript/testing.md
@@ -0,0 +1,128 @@
# TypeScript SDK Testing

## Overview

The TypeScript SDK provides `TestWorkflowEnvironment` for testing workflows with time-skipping and activity mocking support.

## Test Environment Setup

```typescript
import { TestWorkflowEnvironment } from '@temporalio/testing';
import { Worker } from '@temporalio/worker';
import { greetingWorkflow } from './workflows';

describe('Workflow', () => {
  let testEnv: TestWorkflowEnvironment;

  beforeAll(async () => {
    testEnv = await TestWorkflowEnvironment.createLocal();
  });

  afterAll(async () => {
    await testEnv?.teardown();
  });

  it('runs workflow', async () => {
    const { client, nativeConnection } = testEnv;

    const worker = await Worker.create({
      connection: nativeConnection,
      taskQueue: 'test',
      workflowsPath: require.resolve('./workflows'),
      activities: require('./activities'),
    });

    await worker.runUntil(async () => {
      const result = await client.workflow.execute(greetingWorkflow, {
        taskQueue: 'test',
        workflowId: 'test-workflow',
        args: ['World'],
      });
      expect(result).toEqual('Hello, World!');
    });
  });
});
```

## Time Skipping

```typescript
// Create a time-skipping environment
const testEnv = await TestWorkflowEnvironment.createTimeSkipping();

// Time automatically advances when workflows wait
await worker.runUntil(async () => {
  const result = await client.workflow.execute(longRunningWorkflow, {
    taskQueue: 'test',
    workflowId: 'test-workflow',
  });
  // Even if the workflow has a 1-hour timer, the test completes almost instantly
});

// Manual time advancement
await testEnv.sleep('1 day');
```

## Activity Mocking

```typescript
const worker = await Worker.create({
  connection: nativeConnection,
  taskQueue: 'test',
  workflowsPath: require.resolve('./workflows'),
  activities: {
    // Mock activity implementation
    greet: async (name: string) => `Mocked: ${name}`,
  },
});
```

## Replay Testing

```typescript
import { Worker } from '@temporalio/worker';

describe('Replay', () => {
  it('replays workflow history', async () => {
    // Fetch history, e.g. via client.workflow.getHandle('workflow-id').fetchHistory()
    const history = await fetchWorkflowHistory('workflow-id');

    await Worker.runReplayHistory(
      {
        workflowsPath: require.resolve('./workflows'),
      },
      history
    );
  });
});
```

## Testing Signals and Queries

```typescript
it('handles signals and queries', async () => {
  await worker.runUntil(async () => {
    const handle = await client.workflow.start(approvalWorkflow, {
      taskQueue: 'test',
      workflowId: 'approval-test',
    });

    // Query current state
    const status = await handle.query('getStatus');
    expect(status).toEqual('pending');

    // Send signal
    await handle.signal('approve');

    // Wait for completion
    const result = await handle.result();
    expect(result).toEqual('Approved!');
  });
});
```

## Best Practices

1. Use time-skipping for workflows with timers
2. Mock external dependencies in activities
3. Test replay compatibility when changing workflow code
4. Use unique workflow IDs per test
5. Clean up the test environment after tests
diff --git a/references/typescript/typescript.md b/references/typescript/typescript.md
new file mode 100644
index 0000000..663d01f
--- /dev/null
+++ b/references/typescript/typescript.md
@@ -0,0 +1,121 @@
# Temporal TypeScript SDK Reference

## Overview

The Temporal TypeScript SDK provides a modern async/await approach to building durable workflows. Workflows run in an isolated V8 sandbox for automatic determinism protection.

**CRITICAL**: All `@temporalio/*` packages must have the same version number.

## How Temporal Works: History Replay

Understanding how Temporal achieves durable execution is essential for writing correct workflows.

### The Replay Mechanism

When a Worker executes workflow code, it creates **Commands** (requests for operations like starting an Activity or Timer) and sends them to the Temporal Cluster. The Cluster maintains an **Event History** - a durable log of everything that happened during the workflow execution.

**Key insight**: During replay, the Worker re-executes your workflow code but uses the Event History to restore state instead of re-executing Activities. When it encounters an Activity call that has a corresponding `ActivityTaskCompleted` event in history, it returns the stored result instead of scheduling a new execution.

This is why **determinism matters**: the Worker validates that Commands generated during replay match the Events in history. A mismatch causes a non-deterministic error because the Worker cannot reliably restore state.

### Commands and Events

| Workflow Code | Command Generated | Resulting Event |
|--------------|-------------------|-----------------|
| Activity call | `ScheduleActivityTask` | `ActivityTaskScheduled` |
| `sleep()` | `StartTimer` | `TimerStarted` |
| Child workflow | `StartChildWorkflowExecution` | `ChildWorkflowExecutionStarted` |
| Return/complete | `CompleteWorkflowExecution` | `WorkflowExecutionCompleted` |

### When Replay Occurs

- The Worker crashes and recovers
- The Worker's cache fills and evicts workflow state
- A workflow continues after a long timer
- Testing with replay histories

## Quick Start

```typescript
// activities.ts
export async function greet(name: string): Promise<string> {
  return `Hello, ${name}!`;
}

// workflows.ts
import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';

const { greet } = proxyActivities<typeof activities>({
  startToCloseTimeout: '1 minute',
});

export async function greetingWorkflow(name: string): Promise<string> {
  return await greet(name);
}

// worker.ts
import { Worker } from '@temporalio/worker';
import * as activities from './activities';

async function run() {
  const worker = await Worker.create({
    workflowsPath: require.resolve('./workflows'),
    activities,
    taskQueue: 'greeting-queue',
  });
  await worker.run();
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

## Key Concepts

### Workflow Definition
- Async functions exported from the workflow file
- Use `proxyActivities()` with type-only imports
- Use `defineSignal()`, `defineQuery()`, `setHandler()` for handlers

### Activity Definition
- Regular async functions
- Can perform I/O, network calls, etc.
- Use `heartbeat()` for long operations

### Worker Setup
- Use `Worker.create()` with `workflowsPath`
- Import activities directly (not via proxy)

## Determinism Rules

The TypeScript SDK runs workflows in an isolated V8 sandbox.

**Automatic replacements:**
- `Math.random()` → deterministic seeded PRNG
- `Date.now()` → deterministic workflow time (the timestamp of the current Workflow Task)
- `setTimeout` → deterministic timer

**Safe to use:**
- `sleep()` from `@temporalio/workflow`
- `condition()` for waiting
- Standard JavaScript operations
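
As a concrete sketch (the `fetchQuote` activity is hypothetical), the workflow below stays deterministic under these rules; the comments note the Command each statement generates:

```typescript
import { proxyActivities, sleep } from '@temporalio/workflow';
import type * as activities from './activities';

const { fetchQuote } = proxyActivities<typeof activities>({
  startToCloseTimeout: '1 minute',
});

export async function quoteWorkflow(): Promise<string> {
  // Math.random() uses the seeded PRNG, so this value is identical on replay
  const jitterMs = Math.floor(Math.random() * 1000);

  // Generates a StartTimer command -> TimerStarted event
  await sleep(jitterMs);

  // Generates a ScheduleActivityTask command -> ActivityTaskScheduled event;
  // on replay, the stored result is returned instead of re-running the activity
  return await fetchQuote();
}
```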
See `determinism.md` for detailed rules.

## Common Pitfalls

1. **Importing activities without `type`** - Use `import type * as activities`
2. **Version mismatch** - All `@temporalio/*` packages must match
3. **Direct I/O in workflows** - Use activities for external calls
4. **Missing `proxyActivities`** - Required to call activities from workflows
5. **Forgetting to bundle workflows** - The Worker needs `workflowsPath`

## Additional Resources

### Reference Files
- **`determinism.md`** - V8 sandbox, bundling, safe operations, WHY determinism matters
- **`error-handling.md`** - ApplicationFailure, retry policies, idempotency keys
- **`testing.md`** - TestWorkflowEnvironment, time-skipping
- **`patterns.md`** - Signals, queries, cancellation scopes
- **`observability.md`** - Replay-aware logging, metrics, OpenTelemetry, debugging
- **`advanced-features.md`** - Interceptors, sinks, updates, schedules
- **`data-handling.md`** - Search attributes, workflow memo, data converters
- **`versioning.md`** - Patching API, workflow type versioning, Worker Versioning
diff --git a/references/typescript/versioning.md b/references/typescript/versioning.md
new file mode 100644
index 0000000..7ce2a62
--- /dev/null
+++ b/references/typescript/versioning.md
@@ -0,0 +1,307 @@
# TypeScript SDK Versioning

## Overview

The TypeScript SDK provides multiple approaches to safely change Workflow code while maintaining compatibility with running Workflows: the Patching API, Workflow Type Versioning, and Worker Versioning.

## Why Versioning Matters

Temporal provides durable execution through **History Replay**. When a Worker needs to restore Workflow state, it re-executes the Workflow code from the beginning. If you change Workflow code while executions are still running, replay can fail because the new code produces different Commands than the original history.

Versioning strategies allow you to safely deploy changes without breaking in-progress Workflow Executions.

## Workflow Versioning with the Patching API

The Patching API lets you change Workflow Definitions without causing non-deterministic behavior in running Workflows.
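
For intuition, here is a hypothetical change deployed without any versioning strategy (activity names are illustrative). Replaying a v1 history against the v2 code fails because the first Command no longer matches the first Event:

```typescript
import { proxyActivities, sleep } from '@temporalio/workflow';
import type * as activities from './activities';

const { validateOrder, sendNotification } = proxyActivities<typeof activities>({
  startToCloseTimeout: '1 minute',
});

// v1 of this Workflow was:
//   await sleep('1 day');        // history begins with TimerStarted
//   await sendNotification();
//
// v2 below adds an activity call before the timer:
export async function orderWorkflow(): Promise<void> {
  // Replay now emits ScheduleActivityTask as the first Command,
  // but v1 histories begin with TimerStarted -> non-deterministic error
  await validateOrder();
  await sleep('1 day');
  await sendNotification();
}
```

The Patching API exists precisely to guard this kind of change, as shown next.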
### The patched() Function

The `patched()` function takes a `patchId` string and returns a boolean:

```typescript
import { patched } from '@temporalio/workflow';

export async function myWorkflow(): Promise<void> {
  if (patched('my-change-id')) {
    // New code path
    await newImplementation();
  } else {
    // Old code path (for replay of existing executions)
    await oldImplementation();
  }
}
```

**How it works:**
- If the Workflow is running for the first time, `patched()` returns `true` and inserts a marker into the Event History
- During replay, if the history contains a marker with the same `patchId`, `patched()` returns `true`
- During replay, if no matching marker exists, `patched()` returns `false`

### Three-Step Patching Process

Patching is a three-step process for safely deploying changes:

#### Step 1: Patch in New Code

Add the patch alongside the old code:

```typescript
import { patched, sleep } from '@temporalio/workflow';

// The original code sent fax notifications
export async function shippingConfirmation(): Promise<void> {
  if (patched('changedNotificationType')) {
    await sendEmail(); // New code
  } else {
    await sendFax(); // Old code for replay
  }
  await sleep('1 day');
}
```

#### Step 2: Deprecate the Patch

Once all Workflows using the old code path have completed, deprecate the patch:

```typescript
import { deprecatePatch, sleep } from '@temporalio/workflow';

export async function shippingConfirmation(): Promise<void> {
  deprecatePatch('changedNotificationType');
  await sendEmail();
  await sleep('1 day');
}
```

`deprecatePatch()` records the marker but allows replay to succeed whether or not a history contains it, providing a transition period before the patch is removed entirely.

#### Step 3: Remove the Patch

After all Workflows that ran with `deprecatePatch()` have completed, remove it entirely:

```typescript
export async function shippingConfirmation(): Promise<void> {
  await sendEmail();
  await sleep('1 day');
}
```

### Multiple Patches

A Workflow can have multiple patches for different changes:

```typescript
export async function shippingConfirmation(): Promise<void> {
  if (patched('sendEmail')) {
    await sendEmail();
  } else if (patched('sendTextMessage')) {
    await sendTextMessage();
  } else if (patched('sendTweet')) {
    await sendTweet();
  } else {
    await sendFax();
  }
}
```

You can also use a single `patchId` for multiple changes deployed together.
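
A sketch of that single-`patchId` approach (activity names are hypothetical): evaluate `patched()` once and branch on the result at every affected call site, so the related changes activate together:

```typescript
import { patched, proxyActivities, sleep } from '@temporalio/workflow';
import type * as activities from './activities';

const { reserveInventory, reserveInventoryV2, notifyByFax, notifyByEmail } =
  proxyActivities<typeof activities>({ startToCloseTimeout: '1 minute' });

export async function fulfillmentWorkflow(): Promise<void> {
  // One marker covers both changes, so they switch on and off together
  const useV2 = patched('fulfillment-v2');

  if (useV2) {
    await reserveInventoryV2();
  } else {
    await reserveInventory();
  }

  await sleep('1 hour');

  if (useV2) {
    await notifyByEmail();
  } else {
    await notifyByFax();
  }
}
```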
### Query Filters for Versioned Workflows

Use List Filters to find Workflows by version:

```
# Find running Workflows with a specific patch
WorkflowType = "shippingConfirmation" AND ExecutionStatus = "Running" AND TemporalChangeVersion = "changedNotificationType"

# Find running Workflows without the patch (started before patching)
WorkflowType = "shippingConfirmation" AND ExecutionStatus = "Running" AND TemporalChangeVersion IS NULL
```

## Workflow Type Versioning

An alternative to patching is creating new Workflow functions for incompatible changes:

```typescript
// Original Workflow
export async function pizzaWorkflow(order: PizzaOrder): Promise<void> {
  // Original implementation
}

// New version with incompatible changes
export async function pizzaWorkflowV2(order: PizzaOrder): Promise<void> {
  // Updated implementation
}
```

Register both Workflows with the Worker (both are exported from the same workflows file):

```typescript
const worker = await Worker.create({
  workflowsPath: require.resolve('./workflows'),
  taskQueue: 'pizza-queue',
});
```

Update client code to start new Workflows with the new type:

```typescript
// Start new executions with V2
await client.workflow.start(pizzaWorkflowV2, {
  workflowId: 'order-123',
  taskQueue: 'pizza-queue',
  args: [order],
});
```

Use List Filters to check for remaining V1 executions:

```
WorkflowType = "pizzaWorkflow" AND ExecutionStatus = "Running"
```

After all V1 executions complete, remove the old Workflow function.

## Worker Versioning

Worker Versioning allows multiple Worker versions to run simultaneously, routing Workflows to specific versions without code-level patching.

### Key Concepts

- **Worker Deployment**: A logical name for your application (e.g., "order-service")
- **Worker Deployment Version**: A specific build of your code (deployment name + Build ID)

### Configuring Workers for Versioning

```typescript
const worker = await Worker.create({
  workflowsPath: require.resolve('./workflows'),
  taskQueue: 'my-queue',
  workerDeploymentOptions: {
    useWorkerVersioning: true,
    version: {
      deploymentName: 'order-service',
      buildId: '1.0.0', // Or a git hash, build number, etc.
+ }, + defaultVersioningBehavior: 'PINNED', // Or 'AUTO_UPGRADE' + }, + connection: nativeConnection, +}); +``` + +**Configuration options:** +- `useWorkerVersioning`: Enables Worker Versioning +- `version.deploymentName`: Logical name for your service (consistent across versions) +- `version.buildId`: Unique identifier for this build (git hash, semver, build number) +- `defaultVersioningBehavior`: How Workflows behave when versions change + +### Versioning Behaviors + +#### PINNED Behavior + +Workflows are locked to the Worker version they started on: + +```typescript +workerDeploymentOptions: { + useWorkerVersioning: true, + version: { buildId: '1.0', deploymentName: 'order-service' }, + defaultVersioningBehavior: 'PINNED', +} +``` + +**Characteristics:** +- Workflows run only on their assigned version +- No patching required in Workflow code +- Cannot use other versioning APIs +- Ideal for short-running Workflows where consistency matters + +**Use PINNED when:** +- You want to eliminate version compatibility complexity +- Workflows are short-running +- Stability is more important than getting latest updates + +#### AUTO_UPGRADE Behavior + +Workflows can move to newer Worker versions: + +```typescript +workerDeploymentOptions: { + useWorkerVersioning: true, + version: { buildId: '1.0', deploymentName: 'order-service' }, + defaultVersioningBehavior: 'AUTO_UPGRADE', +} +``` + +**Characteristics:** +- Workflows can be rerouted to new versions +- Once moved to a newer version, cannot return to older ones +- May require patching to handle version transitions +- Ideal for long-running Workflows that need bug fixes + +**Use AUTO_UPGRADE when:** +- Workflows are long-running (weeks or months) +- You want Workflows to benefit from bug fixes +- Migrating from rolling deployments + +### Deployment Strategies + +#### Blue-Green Deployments + +Maintain two environments and switch traffic between them: + +1. Deploy new version to idle environment +2. Run validation tests +3. Switch traffic to new environment +4. Keep old environment for instant rollback + +#### Rainbow Deployments + +Multiple Worker versions run simultaneously: + +```typescript +// Version 1.0 Workers +const worker1 = await Worker.create({ + workerDeploymentOptions: { + useWorkerVersioning: true, + version: { buildId: '1.0', deploymentName: 'order-service' }, + defaultVersioningBehavior: 'PINNED', + }, + // ... +}); + +// Version 2.0 Workers (deployed alongside 1.0) +const worker2 = await Worker.create({ + workerDeploymentOptions: { + useWorkerVersioning: true, + version: { buildId: '2.0', deploymentName: 'order-service' }, + defaultVersioningBehavior: 'PINNED', + }, + // ... +}); +``` + +**Benefits:** +- Existing PINNED Workflows complete on their original version +- New Workflows use the latest version +- Add new versions without replacing existing ones +- Supports gradual traffic ramping + +## Choosing a Versioning Strategy + +| Strategy | Best For | Trade-offs | +|----------|----------|------------| +| Patching API | Incremental changes to long-running Workflows | Requires maintaining patch branches in code | +| Workflow Type Versioning | Major incompatible changes | Requires code duplication and client updates | +| Worker Versioning (PINNED) | Short-running Workflows, new applications | Requires infrastructure to run multiple versions | +| Worker Versioning (AUTO_UPGRADE) | Long-running Workflows, migrations | May require patching for safe transitions | + +## Best Practices + +1. 
1. Use descriptive `patchId` names that explain the change
2. Follow the three-step patching process completely before removing patches
3. Use List Filters to verify no Workflows are still running before removing version support
4. Keep Worker Deployment names consistent across all versions
5. Use unique, traceable Build IDs (git hashes, semver, timestamps)
6. Choose PINNED for new applications with short-running Workflows
7. Choose AUTO_UPGRADE when migrating from rolling deployments or for long-running Workflows
8. Test version transitions with replay tests before deploying

From ff033bc7cf5a37c819d2b8f5b153f364df3454f0 Mon Sep 17 00:00:00 2001
From: Donald Pinckney
Date: Wed, 4 Feb 2026 17:30:40 -0500
Subject: [PATCH 3/4] add simple feedback mechanism (#3)

---
 SKILL.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/SKILL.md b/SKILL.md
index 390a12e..38ecca2 100644
--- a/SKILL.md
+++ b/SKILL.md
@@ -186,3 +186,7 @@ Available scripts in `scripts/` for worker and workflow management:
 - **`references/typescript/versioning.md`** - TypeScript patching API
 - **`references/typescript/advanced-features.md`** - Cancellation scopes, interceptors
 - **`references/typescript/gotchas.md`** - TypeScript-specific anti-patterns
+
+## Feedback
+
+If this skill's explanations are unclear, misleading, or missing important information—or if Temporal concepts are proving unexpectedly difficult to work with—draft a GitHub issue body describing the problem encountered and what would have helped, then ask the user to file it at https://github.com/temporalio/skill-temporal-developer/issues/new. Do not file the issue autonomously.

From b8b74c0589d9187d8be06862b3ef2498295dda6f Mon Sep 17 00:00:00 2001
From: Donald Pinckney
Date: Thu, 12 Feb 2026 22:20:15 -0500
Subject: [PATCH 4/4] Change skill name to kebab-case, for compatibility with
 Amp and Cline (#7)

---
 SKILL.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/SKILL.md b/SKILL.md
index 38ecca2..1673557 100644
--- a/SKILL.md
+++ b/SKILL.md
@@ -1,10 +1,10 @@
 ---
-name: Temporal Development
+name: temporal-developer
 description: This skill should be used when the user asks to "create a Temporal workflow", "write a Temporal activity", "debug stuck workflow", "fix non-determinism error", "Temporal Python", "Temporal TypeScript", "workflow replay", "activity timeout", "signal workflow", "query workflow", "worker not starting", "activity keeps retrying", "Temporal heartbeat", "continue-as-new", "child workflow", "saga pattern", "workflow versioning", "durable execution", "reliable distributed systems", or mentions Temporal SDK development. Provides multi-language guidance for Python and TypeScript with operational scripts.
 version: 1.0.0
 ---

-# Temporal Development
+# Skill: temporal-developer

 ## Overview