Skip to content

feat(rivetkit): workflows#3734

Closed
NathanFlurry wants to merge 1 commit into01-21-feat_rivetkit_add_actor_run_handlerfrom
01-02-feat_rivetkit_workflows
Closed

feat(rivetkit): workflows#3734
NathanFlurry wants to merge 1 commit into01-21-feat_rivetkit_add_actor_run_handlerfrom
01-02-feat_rivetkit_workflows

Conversation

@NathanFlurry
Copy link
Member

No description provided.

@vercel
Copy link

vercel bot commented Jan 2, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

4 Skipped Deployments
Project Deployment Review Updated (UTC)
rivet-cloud Ignored Ignored Jan 2, 2026 8:07pm
rivet-inspector Ignored Ignored Preview Jan 2, 2026 8:07pm
rivet-site Ignored Ignored Preview Jan 2, 2026 8:07pm
rivetkit-serverless Skipped Skipped Jan 2, 2026 8:07pm

Copy link
Member Author

NathanFlurry commented Jan 2, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@NathanFlurry NathanFlurry force-pushed the 01-02-feat_rivetkit_workflows branch from a1796cf to 4ca7516 Compare January 2, 2026 20:07
@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless January 2, 2026 20:07 Inactive
@NathanFlurry NathanFlurry force-pushed the 01-02-feat_rivetkit_workflows branch from 4ca7516 to cff619f Compare January 15, 2026 18:21
@NathanFlurry NathanFlurry marked this pull request as ready for review January 15, 2026 18:21
@NathanFlurry NathanFlurry force-pushed the 01-02-feat_rivetkit_workflows branch from cff619f to 3700d45 Compare January 15, 2026 18:54
@NathanFlurry NathanFlurry changed the base branch from main to graphite-base/3734 January 22, 2026 04:11
@NathanFlurry NathanFlurry force-pushed the 01-02-feat_rivetkit_workflows branch from 3700d45 to fbdbcd3 Compare January 22, 2026 04:11
@NathanFlurry NathanFlurry changed the base branch from graphite-base/3734 to actor-messages January 22, 2026 04:11
@NathanFlurry NathanFlurry changed the base branch from actor-messages to graphite-base/3734 January 22, 2026 05:36
@NathanFlurry NathanFlurry force-pushed the 01-02-feat_rivetkit_workflows branch from fbdbcd3 to 246d677 Compare January 22, 2026 06:31
@NathanFlurry NathanFlurry changed the base branch from graphite-base/3734 to 01-21-feat_rivetkit_add_actor_run_handler January 22, 2026 06:31
@NathanFlurry NathanFlurry changed the base branch from 01-21-feat_rivetkit_add_actor_run_handler to graphite-base/3734 January 22, 2026 07:14
@NathanFlurry NathanFlurry force-pushed the 01-02-feat_rivetkit_workflows branch from 246d677 to a3fbdd7 Compare January 22, 2026 17:45
@NathanFlurry NathanFlurry changed the base branch from graphite-base/3734 to 01-21-feat_rivetkit_add_actor_run_handler January 22, 2026 17:45
@claude
Copy link

claude bot commented Jan 22, 2026

PR Review: Workflow Engine for RivetKit

Overview

This PR introduces a comprehensive durable workflow engine for RivetKit TypeScript. This is a major addition (~9,257 lines) that implements a reentrant execution framework for building fault-tolerant, long-running workflows.

Summary Assessment

Overall: Strong implementation with production-ready patterns

This is high-quality code that demonstrates deep understanding of durable execution semantics. The implementation includes sophisticated features like:

  • Deterministic replay with history divergence detection
  • Automatic retry with exponential backoff
  • Rollback/compensation handlers
  • Race and join primitives for parallel execution
  • Message passing for external coordination
  • Loop state management with history trimming

Code Quality Analysis

Strengths ✅

  1. Excellent Error Handling Strategy

    • Custom error hierarchy is well-designed (CriticalError, RollbackError, EvictedError, etc.)
    • Clear distinction between retriable vs non-retriable errors
    • Proper error propagation through the execution stack
  2. Strong Type Safety

    • Comprehensive TypeScript types with good use of discriminated unions
    • Type-safe branch outputs in join/race operations
    • Well-defined interfaces for extensibility
  3. Deterministic Execution

    • History-based replay prevents non-deterministic behavior
    • Name registry pattern ensures consistent location keys
    • Input persistence for deterministic replays (lines 690-703 in index.ts)
  4. Memory Management

    • Loop history trimming to prevent unbounded growth (context.ts:718-763)
    • Lazy loading of entry metadata (storage.ts:165-189)
    • Dirty tracking for efficient flushes
  5. Testing Coverage

    • Tests cover both "yield" and "live" modes comprehensively
    • Edge cases like retry exhaustion, timeouts, and eviction are tested
    • Good use of test fixtures (CountingDriver)
  6. Documentation

    • Excellent inline comments explaining non-obvious behavior
    • Comprehensive QUICKSTART.md and architecture.md
    • Clear JSDoc annotations

Issues & Concerns 🔍

1. Critical: Race Condition in Message Consumption (High Priority)

Location: storage.ts:307-328 (consumeMessage)

export async function consumeMessage(
	storage: Storage,
	driver: EngineDriver,
	messageName: string,
): Promise<Message | null> {
	const index = storage.messages.findIndex(
		(message) => message.name === messageName,
	);
	if (index === -1) {
		return null;
	}

	const message = storage.messages[index];

	// Delete from driver first - if this fails, memory is unchanged
	await driver.delete(buildMessageKey(message.id));

	// Only remove from memory after successful driver deletion
	storage.messages.splice(index, 1);

	return message;
}

Problem: If the workflow crashes between deleting from KV and removing from memory, the message will be lost on next load (since it's deleted from KV but storage.messages still has it in memory).

Impact: Message loss in crash scenarios.

Recommendation: Consider a two-phase approach:

  • Mark message as "consumed" in KV first
  • Remove from memory
  • Delete from KV in background or during flush

Alternatively, accept this as "at-most-once" delivery and document it clearly.

2. Performance: Excessive Metadata Loading (Medium Priority)

Location: context.ts:355-359

Every retry attempt loads metadata from driver. If metadata is already loaded in storage.entryMetadata, this becomes redundant network calls. The loadMetadata function does cache, but consider loading all entry metadata upfront during loadStorage() for frequently-retried workflows.

3. Timeout Behavior May Be Surprising (Medium Priority)

Location: context.ts:464-480

The comment correctly notes that timeouts don't cancel underlying operations. This can lead to resource leaks if step operations don't respect ctx.abortSignal. Consider:

  • Documenting this more prominently in user-facing docs
  • Providing a helper to wrap operations with proper cancellation
  • Warning in logs when timeout occurs

4. Missing Input Validation (Low Priority)

Several functions lack input validation:

  • loop(): commitInterval must be > 0 (note in code but not enforced at context.ts:630)
  • listenN(): limit should be > 0
  • step(): timeout values could be validated

5. Inconsistent Error Messages (Low Priority)

Some errors are user-facing while others are internal. Consider standardizing error messages with actionable guidance.

Security Considerations 🔒

  1. No Sensitive Data Leakage: ✅ Good - error metadata filtering (index.ts:833-849) only includes serializable primitives

  2. Input Validation: ⚠️ Missing validation could allow malformed data into KV

  3. Resource Limits: ⚠️ No built-in limits on:

    • Number of loop iterations (could grow unbounded)
    • Message queue size
    • History entry count before trimming

    Consider adding configurable limits to prevent abuse.

Performance Considerations ⚡

  1. Batch Operations: ✅ Good use of batching in flush() (storage.ts:194-278)
  2. Ephemeral Steps: ✅ Excellent optimization (context.ts:417-424)
  3. Name Registry: ✅ Efficient deduplication
  4. Concerns:
    • list() operations load all entries into memory (storage.ts:122-128)
    • For workflows with thousands of entries, this could be expensive
    • Consider pagination or streaming for large histories

Test Coverage Analysis 🧪

Strengths:

  • Both "yield" and "live" modes tested
  • Good coverage of edge cases (timeouts, retries, eviction)
  • Parameterized tests reduce duplication

Gaps:

  • No tests for concurrent workflow execution
  • Limited testing of rollback scenarios
  • No chaos/fault injection tests (simulate KV failures)
  • Missing tests for history divergence recovery

Recommendation: Add tests for KV failures, partial batch writes, concurrent messages, and loop iteration limits.

Architecture Review 🏗️

Isolation Model

The driver design assumes external isolation (Durable Objects, actors). This is well-documented and appropriate.

Consideration: What happens if isolation is violated? (e.g., two processes run same workflow ID)

  • Could lead to history corruption
  • Consider adding optimistic locking or version checks

Serialization Strategy

Uses BARE schema for binary serialization. Good choice for compact storage, schema evolution, and type safety.

Note: Ensure BARE schema changes are backward compatible or migrations are handled.

Recommendations Summary

Must Fix (Before Merge)

  1. Document message delivery semantics - Is it at-most-once? exactly-once? Clarify in docs and potentially in error handling.

Should Fix (High Priority)

  1. Add resource limits (configurable max iterations, message queue size)
  2. Improve error messages with actionable guidance
  3. Add tests for concurrent scenarios and KV failures

Nice to Have (Medium Priority)

  1. Optimize metadata loading for retry-heavy workflows
  2. Add warnings when timeouts occur without cancellation support
  3. Add integration guide for implementing custom drivers
  4. Consider adding metrics/observability hooks

Documentation Improvements

  1. Add example of implementing a custom driver
  2. Document recovery scenarios (what happens if KV is corrupted?)
  3. Add migration guide for workflow code changes
  4. Document performance characteristics and scaling limits

Conclusion

This is excellent work and demonstrates a mature understanding of durable execution patterns. The code quality is high, error handling is sophisticated, and the API is well-designed.

Recommendation: APPROVE with minor fixes

The identified issues are mostly documentation gaps and potential edge cases. The core implementation is solid and ready for production use with proper driver implementations.

Key Action Items:

  1. Clarify message delivery semantics (documentation)
  2. Add resource limits to prevent unbounded growth
  3. Expand test coverage for failure scenarios
  4. Document driver implementation guide

Great job on this implementation! 🎉


Reviewed using Rivet CLAUDE.md guidelines and TypeScript best practices.

@NathanFlurry NathanFlurry force-pushed the 01-02-feat_rivetkit_workflows branch from a3fbdd7 to b4b061f Compare January 24, 2026 23:35
@NathanFlurry NathanFlurry force-pushed the 01-21-feat_rivetkit_add_actor_run_handler branch from 2fe5b7d to 329640c Compare January 24, 2026 23:35
@NathanFlurry NathanFlurry mentioned this pull request Jan 24, 2026
@NathanFlurry NathanFlurry force-pushed the 01-21-feat_rivetkit_add_actor_run_handler branch from 329640c to 4e8ca02 Compare January 28, 2026 01:12
@NathanFlurry NathanFlurry force-pushed the 01-02-feat_rivetkit_workflows branch from b4b061f to 8a67460 Compare January 28, 2026 01:12
@jog1t jog1t force-pushed the 01-21-feat_rivetkit_add_actor_run_handler branch from 4e8ca02 to f1e4e30 Compare January 28, 2026 19:55
@jog1t jog1t force-pushed the 01-02-feat_rivetkit_workflows branch from 8a67460 to a5391dd Compare January 28, 2026 19:55
@claude
Copy link

claude bot commented Jan 28, 2026

Pull Request Review: Workflow Engine Implementation

Overview

This PR introduces a comprehensive workflow engine package (@rivetkit/workflow-engine) for RivetKit. This is a substantial addition (9,257 additions across 46 files) that implements durable execution for TypeScript workflows.

Architecture Assessment

Strengths

1. Solid Core Design

  • Well-architected workflow engine with proper separation of concerns
  • Clean abstraction via EngineDriver interface for storage/scheduling
  • Deterministic replay mechanism with history tracking
  • Proper handling of eviction, rollback, and retry logic

2. Comprehensive Feature Set

  • Steps with automatic retry and backoff
  • Durable loops with state management
  • Message-based communication (listen/send)
  • Parallel execution (join/race)
  • Rollback support with checkpoints
  • Migration handling via removed() entries

3. Excellent Test Coverage

  • 13 test files covering major functionality
  • Tests for both yield and live execution modes
  • Edge cases covered (eviction, rollback, race conditions)

4. Good Documentation

  • Comprehensive QUICKSTART.md
  • Detailed docs for advanced topics (rollback, retries, migrations, etc.)
  • Architecture documentation
  • Code comments are clear and helpful

Issues and Recommendations

Critical Issues

1. Race Condition in Message Consumption (storage.ts:307-328)

If multiple workflow executions run concurrently, they could both find the same message in memory, both attempt to delete it, and both think they successfully consumed it.

Recommendation:

  • Use atomic operations (CAS) in the driver if available
  • Add transaction IDs to messages
  • Document this limitation if concurrent execution is not supported
  • Consider using a timestamp-based claim mechanism

2. Non-Atomic Batch Operations (storage.ts:194-278)

The EngineDriver interface says batch "Should be atomic" but it is not required. If a batch partially fails, the workflow could end up in an inconsistent state.

Recommendation:

  • Make atomicity a MUST requirement in the driver interface
  • Add retry logic for batch operations
  • Implement compensation logic if partial failures are possible

High Priority Issues

3. Missing Input Validation

The code does not validate critical inputs like workflow IDs, message names, or entry names. For example, context.ts:946 uses colons as delimiters in name:count. If a user passes a name containing colons, this creates ambiguous keys.

Recommendation:

  • Add input validation at API boundaries
  • Sanitize or reject names with reserved characters
  • Document naming constraints

4. Unbounded Message Queue

Messages accumulate in memory without bounds. For long-running workflows receiving many messages, this could cause OOM errors.

Recommendation:

  • Add configurable message queue limits
  • Implement message expiration
  • Add backpressure mechanisms

5. Timeout Does Not Cancel Underlying Work (context.ts:464-480)

While documented, timeouts do not cancel the underlying async operation. If a step times out but continues running, it might hold database connections, continue making API calls, or modify shared state.

Recommendation:

  • Encourage passing ctx.abortSignal to all async operations
  • Add examples showing proper cancellation patterns
  • Consider tracking background operations and warning about orphaned work

6. Hardcoded Console Logging (context.ts:1746-1751)

Using console.warn directly makes logging not configurable.

Recommendation:

  • Add a logging interface to the driver or context
  • Allow users to inject their own logger

Minor Issues

7. Magic Numbers - Hardcoded retry delay of 100ms in index.ts:405 should be a named constant

8. Type Safety - Several type assertions that could be avoided with better type design

9. Incomplete Error Serialization - extractErrorInfo only captures primitive metadata, losing objects and arrays

10. Non-null Assertions - Multiple uses of ! operator that could fail if state is corrupted

Performance Considerations

11. Inefficient Loop Cleanup (context.ts:753-763)

Deletes each old iteration sequentially. For loops with thousands of iterations, this could be very slow.

Recommendation: Batch deletes or use range-based deletion if driver supports it.

12. Sequential Metadata Loading (index.ts:338)

Could be parallelized for better performance during rollback.

Security Considerations

13. No Resource Limits

Missing limits on workflow execution time, entry count, recursion depth, and message size.

Recommendation: Add configurable limits to prevent DoS attacks or runaway workflows.

14. No Input Validation on Deserialization (index.ts:695)

No validation of deserialized input. If storage is compromised, malicious input could be injected.

Recommendation: Add input schema validation.

Testing Gaps

Missing test cases for:

  • Concurrent workflow execution
  • Storage corruption scenarios
  • Very large workflows (scalability)
  • Driver failure scenarios (partial batch success)
  • Message ordering under concurrent sends
  • Memory limits

Recommendation: Add stress tests and chaos engineering scenarios.

Documentation Issues

15. Driver Interface Requirements

The list() requirement "MUST be sorted" is mentioned in comments but easy to miss.

Recommendation: Make this more prominent in interface documentation and add runtime validation.

16. Migration Path Unclear

The removed() API is documented but there is no clear guide on versioning strategies or safe migration patterns.

Positive Patterns Worth Highlighting

  • Proper use of TypeScript generics for type-safe workflow inputs/outputs
  • Clean separation between forward execution and rollback modes
  • Good error taxonomy with specific error classes
  • Ephemeral steps to optimize write patterns
  • Dual mode execution (yield vs live) for flexibility
  • Comprehensive location tracking for debugging
  • Name registry to avoid string duplication in storage

Summary

This is a well-engineered workflow engine with strong fundamentals. The main concerns are around concurrent execution safety, resource limits, and edge case handling.

Action Items Priority

Must Fix Before Merge:

  1. Document concurrency model (single-threaded vs concurrent)
  2. Add input validation for names and IDs
  3. Make batch atomicity requirements explicit
  4. Add resource limits (message queue, execution time)

Should Fix Soon:
5. Address message consumption race condition
6. Add configurable logging
7. Improve error serialization
8. Add stress tests

Nice to Have:
9. Optimize loop cleanup performance
10. Reduce type assertions
11. Parallelize metadata loading

Overall assessment: Approve with changes. This is production-quality code with some important refinements needed.


Review generated by Claude Sonnet 4.5

@NathanFlurry NathanFlurry force-pushed the 01-21-feat_rivetkit_add_actor_run_handler branch from f1e4e30 to 1c34e5c Compare January 30, 2026 08:19
@NathanFlurry NathanFlurry force-pushed the 01-02-feat_rivetkit_workflows branch from a5391dd to 18bb55f Compare January 30, 2026 08:19
@claude
Copy link

claude bot commented Jan 30, 2026

PR Review: feat(rivetkit): workflows

Overview

This PR adds a comprehensive workflow engine to RivetKit - a durable, reentrant execution system for TypeScript workflows. The implementation is substantial (9,257 additions across 46 files) and introduces a sophisticated state management system with replay capabilities.

🎯 Architecture & Design

Strengths

  • Excellent deterministic replay: The location-based entry system with name registry ensures consistent replay across executions
  • Well-structured separation: Clean separation between driver interface, storage, and execution context
  • Comprehensive error handling: Custom error types for different failure modes
  • Thoughtful state management: Dirty tracking with batch writes minimizes persistence overhead
  • Good testing coverage: Multiple test files covering steps, loops, messages, rollback, race conditions

Concerns

1. Memory Management in Long-Running Workflows

The in-memory visitedKeys set and usedNamesInExecution set (context.ts:104, 109) grow unbounded during execution. For workflows with thousands of steps or long-running loops, this could cause memory issues.

Recommendation: Consider clearing these sets after validateComplete() or implementing periodic cleanup.

2. Race Condition in Message Handling

The notifyMessage function (index.ts:180-195) only notifies the first matching waiter when a message arrives. If multiple listen() calls are waiting for the same message name, only one will be woken.

Recommendation: Either document this as intentional behavior or support multiple waiters per message name.

🐛 Potential Bugs

1. Type Safety Issue in Step Configuration

The run! assertion in context.ts:309 assumes run is defined when nameOrConfig is a string, but TypeScript does not enforce this.

2. Silent Error in Rollback

Errors during rollback replay are silently ignored (index.ts:322), which could hide bugs. Consider logging these errors for observability.

⚡ Performance Considerations

1. Batch Write Performance

The flush() function does not limit batch size. For workflows with thousands of dirty entries, this could hit driver batch size limits.

2. Message List Sorting Overhead

Message sorting (storage.ts:134) may be redundant if the driver returns sorted results as documented.

3. History Entry Lookup

The O(n) iteration over all history entries in validateComplete() (context.ts:246-267) could be expensive for large workflows.

🔒 Security Concerns

1. Denial of Service via Large Inputs

No validation exists on workflow input/output sizes. Malicious code could store massive objects exhausting memory/storage.

2. Arbitrary Code Execution in Rollback

Rollback handlers execute with potentially stale data. Document security requirements for rollback handlers.

📚 Code Quality

Positives

  • ✅ Excellent inline documentation
  • ✅ Consistent naming conventions
  • ✅ Good use of TypeScript types
  • ✅ Clear separation of concerns
  • ✅ Comprehensive test coverage

Areas for Improvement

  • Missing JSDoc for public APIs
  • Magic numbers (e.g., 100ms retry delay at index.ts:405)
  • Long functions (executeRace is 227 lines)

✅ Recommendations Summary

Critical (Must Fix)

  1. Fix message waiter race condition
  2. Add explicit null check in step configuration
  3. Document driver sorting requirements

High Priority

  1. Add memory management for visitedKeys and usedNamesInExecution
  2. Add batch size limits in flush()
  3. Extract magic numbers to named constants
  4. Add error logging in rollback replay

Documentation

  1. Add migration guide for workflow code changes
  2. Document security requirements for rollback handlers
  3. Add performance tuning guide

🎉 Conclusion

This is a well-architected and thoughtfully implemented workflow engine. The deterministic replay mechanism is sophisticated and the separation of concerns is excellent.

Overall Assessment: ✅ Approve with recommendations

The core functionality is solid. Address the critical issues before merge, and consider the high-priority items for follow-up PRs.

Great work on this complex feature! 🚀

@jog1t jog1t force-pushed the 01-21-feat_rivetkit_add_actor_run_handler branch from 1c34e5c to 539fdca Compare January 30, 2026 21:29
@jog1t jog1t force-pushed the 01-02-feat_rivetkit_workflows branch from 18bb55f to 177df46 Compare January 30, 2026 21:29
@NathanFlurry NathanFlurry force-pushed the 01-02-feat_rivetkit_workflows branch from 177df46 to 532ab13 Compare February 3, 2026 19:40
@NathanFlurry NathanFlurry force-pushed the 01-21-feat_rivetkit_add_actor_run_handler branch from 539fdca to 21606fd Compare February 3, 2026 19:40
@NathanFlurry NathanFlurry mentioned this pull request Feb 3, 2026
11 tasks
@graphite-app
Copy link
Contributor

graphite-app bot commented Feb 4, 2026

Merge activity

  • Feb 4, 8:36 PM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Feb 4, 8:37 PM UTC: CI is running for this pull request on a draft pull request (#4114) due to your merge queue CI optimization settings.
  • Feb 4, 8:38 PM UTC: Merged by the Graphite merge queue via draft PR: #4114.

graphite-app bot pushed a commit that referenced this pull request Feb 4, 2026
@graphite-app graphite-app bot closed this Feb 4, 2026
@graphite-app graphite-app bot deleted the 01-02-feat_rivetkit_workflows branch February 4, 2026 20:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant