Skip to content

Add standby compression start delay#184

Draft
sjmiller609 wants to merge 6 commits intomainfrom
codex/standby-compression-delay
Draft

Add standby compression start delay#184
sjmiller609 wants to merge 6 commits intomainfrom
codex/standby-compression-delay

Conversation

@sjmiller609
Copy link
Copy Markdown
Collaborator

Summary

  • add a standby-only compression delay override on POST /instances/{id}/standby and a per-instance default in snapshot_policy
  • keep delayed standby compression jobs cancelable before start and distinguish pending-delay skips from active compression cancellation
  • add metrics, logs, traces, OpenAPI updates, and tests for the new standby compression delay behavior

Testing

  • go test ./lib/instances
  • go test ./cmd/api/api -run 'Test(CreateInstance_MapsStandbyCompressionDelayInSnapshotPolicy|CreateInstance_InvalidStandbyCompressionDelayInSnapshotPolicy|InstanceToOAPI_EmitsStandbyCompressionDelayInSnapshotPolicy|StandbyInstance_MapsCompressionDelay|StandbyInstance_InvalidCompressionDelay|StandbyInstance_InvalidRequest)$'

Notes

  • go test ./cmd/api/api still hits unrelated environment-dependent volume tests on this machine because mkfs.ext4 is not available in $PATH.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 4, 2026

✱ Stainless preview builds

This PR will update the hypeman SDKs with the following commit message.

feat: Add standby compression start delay

Edit this comment to update it. It will appear in the SDK's changelogs.

hypeman-typescript studio · code · diff

Your SDK build had at least one "note" diagnostic, but this did not represent a regression.
generate ✅build ✅lint ✅test ✅

npm install https://pkg.stainless.com/s/hypeman-typescript/9eac1a2ca37d6514c44185f9d188f2a79b59949c/dist.tar.gz
hypeman-openapi studio · code · diff

Your SDK build had at least one "note" diagnostic, but this did not represent a regression.
generate ✅

hypeman-go studio · code · diff

Your SDK build had at least one "note" diagnostic, but this did not represent a regression.
generate ✅build ✅lint ✅test ✅

go get github.com/stainless-sdks/hypeman-go@ae6c6f0e1d82dfdb013596beddeac78383a0adcb

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-04-06 15:55:38 UTC

Copy link
Copy Markdown
Collaborator Author

@sjmiller609 sjmiller609 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Review: Add standby compression start delay

Overall this is a well-structured PR. The state machine extension (pending_delayrunning) is clean, the preemption metric suppression for pending jobs is correctly implemented, and test coverage is thorough. A few items worth discussing:

Design: No upper bound on compression delay

parseOptionalDuration and normalizeStandbyCompressionDelay reject negative values but allow arbitrarily large positive durations. A user could set compression_delay: "8760h" which would keep a job in pending_delay state for a year, holding a slot in the compressionJobs map and effectively meaning "never compress" without being explicit about it. Consider adding a reasonable ceiling (e.g. 1h or configurable max) or at least documenting the expected range in the OpenAPI description.

Correctness: canceledCompressionJob.State read safety

The read of job.state in cancelAndWaitCompressionJob after <-job.done is safe — Go's memory model guarantees that close(job.done) happens-before the receive returns, and all state mutations occur before the close. The returned state correctly reflects whether actual compression work was performed (running) vs the job was still waiting (pending_delay). Good.

Correctness: Preemption metric gating

All five call sites of cancelAndWaitCompressionJob now gate recordSnapshotCompressionPreemption on target.State == compressionJobStateRunning. This is the right call — preemption implies interrupting active work, not canceling a job that hasn't started yet. The new skipped result and separate wait duration metric provide the observability for the pending-delay cancellation path. Clean separation.

Correctness: Timer cleanup

The stopCompressionTimer helper correctly drains the channel after Stop() returns false. In the happy path (<-timer.Chan() fires), the timer has already fired so no cleanup is needed — this is correct.

Pre-existing bug fix: cloneStoredMetadataForFork

The addition of cloneSnapshotPolicy in cloneStoredMetadataForFork (fork.go) fixes what looks like a pre-existing bug — without it, forked instances share a pointer to the parent's SnapshotPolicy, risking mutation side effects. Worth calling out in the PR description since it's a correctness fix independent of the delay feature.

Minor: newCompressionTimer nil receiver guard

func (m *manager) newCompressionTimer(delay time.Duration) compressionTimer {
    if m != nil && m.compressionTimerFactory != nil {

The m != nil check is defensive against a nil receiver call which would indicate a bug elsewhere. Not a problem, just noting it's an unusual pattern for a method receiver.

Tests

Good coverage of the key paths:

  • Delay precedence resolution (override > stored > zero default)
  • Negative delay rejection
  • Pending-delay cancellation records skipped (not preemption)
  • ensureSnapshotMemoryReady skipping pending compression without preemption metric
  • Fake timer abstraction is clean and threadsafe

LGTM with the upper-bound consideration as the main open question.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant