feat(taskbroker): Add Sending Status to Handle Push Failures#586
feat(taskbroker): Add Sending Status to Handle Push Failures#586george-sentry wants to merge 10 commits intomainfrom
Conversation
src/grpc/server.rs
Outdated
|
|
||
| Ok(activations) => { | ||
| let inflight = &activations[0]; | ||
| // If we return an error, the worker will place the result back in its internal queue and send the update again in the future, which is not desired |
There was a problem hiding this comment.
We should still catch these cases and at least log an error.
There was a problem hiding this comment.
Added logging here!
…tivations Marked Sent
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 454e0c7. Configure here.
| .bind(InflightActivationStatus::Failure.to_string()) | ||
| .bind(now) | ||
| .bind(InflightActivationStatus::Processing.to_string()) | ||
| .bind(InflightActivationStatus::Sending.to_string()) |
There was a problem hiding this comment.
Bug: At-most-once (AMO) tasks in Sending status are moved to Failure when their processing deadline expires, even though they were never delivered to a worker.
Severity: MEDIUM
Suggested Fix
Modify the logic for handling expired AMO tasks in the Sending status. Instead of transitioning them to Failure, revert them to Pending status without incrementing the processing attempt counter. This aligns their behavior with non-AMO tasks in the same state and allows for a retry without violating at-most-once guarantees.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: src/store/postgres_activation_store.rs#L555-L558
Potential issue: When an at-most-once (AMO) task is claimed for processing, it enters a
`Sending` status. If the task's delivery to a worker is not confirmed before the
processing deadline expires, the current logic moves the task directly to a `Failure`
state. This wastes the single execution attempt for a task that was never actually run
by any worker. This contradicts the goal of preventing unnecessary task drops, as the
task could be safely retried without violating AMO semantics. The expected behavior for
an unsent AMO task would be to revert it to `Pending` status, allowing another delivery
attempt.
There was a problem hiding this comment.
This is intentional. It is possible to deliver a task to a worker successfully but observe a send failure due to downstream network issues. For example, the response may not reach the taskbroker before the request times out if the network is congested. If I used the AI's suggestion in this scenario, the task would be mistakenly retried, resulting in more than one execution (a clear violation of the at-most-once policy).

Linear
Completes STREAM-860
Description
Currently, taskworkers pull tasks from taskbrokers via RPC. This approach works, but has some drawbacks. Therefore, we want taskbrokers to push tasks to taskworkers instead. Read this page on Notion for more information.
Right now, I rely on processing_deadline to revert processing tasks back to pending if pushing them failed. This isn't good because it eats through processing attempts, resulting in needlessly dropped tasks.
I want to add a
Sendingstatus that indicates a task is being sent. Now, upkeep increments processing attempts only for tasks that are still in "sending" when their processing deadlines expire. If the status is "processing," that means the task was already sent successfully and its processing attempts can be incremented.This will help us avoid dropping tasks needlessly when workers are busy.
Note that my original plan was different. You can see it in the commit history. Here is a description of that plan.