Skip to content

feat(taskbroker): Add Sending Status to Handle Push Failures#586

Open
george-sentry wants to merge 10 commits intomainfrom
george/push-taskbroker/add-sent-flag
Open

feat(taskbroker): Add Sending Status to Handle Push Failures#586
george-sentry wants to merge 10 commits intomainfrom
george/push-taskbroker/add-sent-flag

Conversation

@george-sentry
Copy link
Copy Markdown
Member

@george-sentry george-sentry commented Apr 2, 2026

Linear

Completes STREAM-860

Description

Currently, taskworkers pull tasks from taskbrokers via RPC. This approach works, but has some drawbacks. Therefore, we want taskbrokers to push tasks to taskworkers instead. Read this page on Notion for more information.

Right now, I rely on processing_deadline to revert processing tasks back to pending if pushing them failed. This isn't good because it eats through processing attempts, resulting in needlessly dropped tasks.

I want to add a Sending status that indicates a task is being sent. Now, upkeep increments processing attempts only for tasks that are still in "sending" when their processing deadlines expire. If the status is "processing," that means the task was already sent successfully and its processing attempts can be incremented.

This will help us avoid dropping tasks needlessly when workers are busy.

Note that my original plan was different. You can see it in the commit history. Here is a description of that plan.

I want to add a sent column to the activation table to track whether a task was successfully sent after being fetched from the table. Now, upkeep increments processing attempts only for tasks that are processing and have sent = true.

If the status is processing and sent = false, that means pushing failed or timed out (or didn't happen yet), and we can revert back to pending without incrementing processing attempts.

@george-sentry george-sentry requested a review from a team as a code owner April 2, 2026 22:03
@linear-code
Copy link
Copy Markdown

linear-code bot commented Apr 2, 2026

@george-sentry george-sentry changed the title feat(taskbroker): Add Sent Flag to Prevent Dropping Tasks on Push Failure feat(taskbroker): Add Sending Status to Handle Push Failures Apr 7, 2026

Ok(activations) => {
let inflight = &activations[0];
// If we return an error, the worker will place the result back in its internal queue and send the update again in the future, which is not desired
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should still catch these cases and at least log an error.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added logging here!

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 454e0c7. Configure here.

Comment on lines 555 to +558
.bind(InflightActivationStatus::Failure.to_string())
.bind(now)
.bind(InflightActivationStatus::Processing.to_string())
.bind(InflightActivationStatus::Sending.to_string())
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: At-most-once (AMO) tasks in Sending status are moved to Failure when their processing deadline expires, even though they were never delivered to a worker.
Severity: MEDIUM

Suggested Fix

Modify the logic for handling expired AMO tasks in the Sending status. Instead of transitioning them to Failure, revert them to Pending status without incrementing the processing attempt counter. This aligns their behavior with non-AMO tasks in the same state and allows for a retry without violating at-most-once guarantees.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: src/store/postgres_activation_store.rs#L555-L558

Potential issue: When an at-most-once (AMO) task is claimed for processing, it enters a
`Sending` status. If the task's delivery to a worker is not confirmed before the
processing deadline expires, the current logic moves the task directly to a `Failure`
state. This wastes the single execution attempt for a task that was never actually run
by any worker. This contradicts the goal of preventing unnecessary task drops, as the
task could be safely retried without violating AMO semantics. The expected behavior for
an unsent AMO task would be to revert it to `Pending` status, allowing another delivery
attempt.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional. It is possible to deliver a task to a worker successfully but observe a send failure due to downstream network issues. For example, the response may not reach the taskbroker before the request times out if the network is congested. If I used the AI's suggestion in this scenario, the task would be mistakenly retried, resulting in more than one execution (a clear violation of the at-most-once policy).

@george-sentry george-sentry requested a review from evanh April 9, 2026 21:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants