
feat(NODE-7142): Exponential backoff and jitter in retry loops#4871

Open
baileympearson wants to merge 20 commits into main from NODE-7142-backpressure-backoff

Conversation

@baileympearson
Contributor

@baileympearson baileympearson commented Feb 10, 2026

Description

Summary of Changes

Implements the Client Backpressure specification, which adds automatic retry with exponential backoff for commands that fail with SystemOverloadedError + RetryableError labels.

Note: this PR also implements NODE-7366, which clarifies error propagation when multiple NoWritesPerformed errors are encountered during retries. This is a prerequisite for backpressure, since the multi-retry loop can now encounter more than two errors.

Notes for Reviewers

The retry loop was restructured to match the spec pseudocode. The canRetry() function centralizes all retry eligibility logic (see the sketch after this list):

  • SystemOverloadedError + RetryableError → always retryable (including runCommand)
  • runCommand without overload labels → never retryable
  • Standard retryable read/write errors → existing behavior preserved
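For reference, a minimal sketch of that eligibility check; the type shapes and names below are illustrative assumptions, not the PR's actual AbstractOperation/MongoError code.

// Sketch of the eligibility rules listed above; parameter shapes are assumed.
interface RetryableErrorLike {
  hasErrorLabel(label: string): boolean;
}

interface OperationLike {
  isRunCommand: boolean;
  supportsReadRetry: boolean;
  supportsWriteRetry: boolean;
}

function canRetrySketch(operation: OperationLike, error: RetryableErrorLike): boolean {
  // SystemOverloadedError + RetryableError: always retryable, even for runCommand.
  if (error.hasErrorLabel('SystemOverloadedError') && error.hasErrorLabel('RetryableError')) {
    return true;
  }
  // runCommand without the overload labels is never retried.
  if (operation.isRunCommand) {
    return false;
  }
  // Otherwise, defer to the pre-existing retryable read/write rules.
  return operation.supportsReadRetry || operation.supportsWriteRetry;
}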

The maxAttempts/attemptsMade mechanism on AbstractOperation exists specifically for commitTransaction, which splits its retry budget across two executeOperation calls (first with original write concern, second with majority).

What is the motivation for this change?

NODE-7142
NODE-7366

Release Highlight

Client Backpressure

The driver now implements the Client Backpressure specification. When a MongoDB server responds with a SystemOverloadedError, the driver will automatically retry the operation with exponential backoff and jitter, up to 5 retries. A per-client token bucket rate-limits retry attempts during prolonged overload to protect server goodput.
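Roughly, the delay between attempts looks like the sketch below. The constants and the full-jitter choice are assumptions for illustration; the actual values and formula come from the backpressure specification.

// Illustrative backoff calculation: exponential growth capped at a maximum,
// with jitter drawn from [0, 1). Constants are assumed, not the spec's.
const BASE_DELAY_MS = 100;
const MAX_DELAY_MS = 10_000;

function backoffDelayMs(attempt: number): number {
  const exponential = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
  // Math.random() returns a value in [0, 1), so the delay is in [0, exponential).
  return Math.random() * exponential;
}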

Double check the following

  • Lint is passing (npm run check:lint)
  • Self-review completed using the steps outlined here
  • PR title follows the correct format: type(NODE-xxxx)[!]: description
    • Example: feat(NODE-1234)!: rewriting everything in coffeescript
  • Changes are covered by tests
  • New TODOs have a related JIRA ticket

@tadjik1 tadjik1 changed the title WIP - exponential backoff + jitter PR feat(NODE-7142): Exponential backoff and jitter in retry loops Feb 19, 2026
@tadjik1 tadjik1 force-pushed the NODE-7142-backpressure-backoff branch from bcf6a51 to 091e989 on February 19, 2026 10:00
Comment on lines +1253 to +1264
if (session.transaction.state === TxnState.STARTING_TRANSACTION) {
if (document.ok === 1) {
session.transaction.transition(TxnState.TRANSACTION_IN_PROGRESS);
} else {
const error = new MongoServerError(document.toObject());
const isRetryableError = error.hasErrorLabel(MongoErrorLabel.RetryableError);

if (!isRetryableError) {
session.transaction.transition(TxnState.TRANSACTION_IN_PROGRESS);
}
}
}
Member

@tadjik1 tadjik1 Feb 19, 2026


The idea here is to defer the transition of the transaction into TRANSACTION_IN_PROGRESS:

  • success (ok === 1) means the server accepted the command, so the transaction is in progress.
  • non-retryable error - the server processed something, so the transaction state should advance and subsequent commands don't re-send startTransaction.
  • retryable error - the session stays in STARTING_TRANSACTION so the retry can include startTransaction: true.

Previously, applySession transitioned STARTING_TRANSACTION -> TRANSACTION_IN_PROGRESS before the command was sent to the server (line 1205 removed).

}
} catch (error) {
if (options.session != null && !(error instanceof MongoServerError)) {
updateSessionFromResponse(options.session, MongoDBResponse.empty);
Member


This will consequently transition the transaction into TRANSACTION_IN_PROGRESS state (as .empty has no error and no success labels).

Contributor


Do we have coverage for this through the spec tests or is there a bespoke node test to write for this?

Member

@tadjik1 tadjik1 Feb 20, 2026


This is not a new change; rather, this code preserves the same behaviour we have on main (where we transition the transaction inside applySession). After the refactoring this condition becomes necessary so the state machine doesn't get stuck in STARTING_TRANSACTION after a non-server error.

The question is valid though, I will dig deeper to see if we cover this somehow.

Update:
I couldn't find anything about this specific transition in our tests (neither in integration/unit tests nor in the spec tests). At the same time, I would prefer to keep it out of the scope of this PR because this is pre-backpressure behaviour:

// src/sessions.ts
if (session.transaction.state === TxnState.STARTING_TRANSACTION) {
    session.transaction.transition(TxnState.TRANSACTION_IN_PROGRESS); // this line is removed now
    command.startTransaction = true;
}

)
);

function canRetry(operation: AbstractOperation, error: MongoError) {
Member


This function centralizes retry eligibility; before, it was split across the entire old loop. Key idea: SystemOverloadedError + RetryableError is always retryable, regardless of retryReads/retryWrites settings or operation type (including runCommand).

readPreference: ReadPreference.primary,
bypassPinningCheck: true
});
operation.maxAttempts = MAX_RETRIES + 1;
Member

@tadjik1 tadjik1 Feb 19, 2026


The problem is that commitTransaction is split across two executeOperation calls: the first with the original write concern, the second (on a retryable error) with the majority write concern.

The total retry budget is MAX_RETRIES + 1 (6 attempts). After the first executeOperation "spends" some attempts, remainingAttempts is computed from attemptsMade and passed as maxAttempts to the second operation.
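In other words, something along the lines of this sketch; the object shape, field names, and the Math.max clamp are illustrative assumptions rather than the exact code in the PR.

// Sketch of splitting one retry budget across two executeOperation calls.
const MAX_RETRIES = 5;

interface CommitAttempt {
  maxAttempts: number;
  attemptsMade: number;
}

const firstCommit: CommitAttempt = { maxAttempts: MAX_RETRIES + 1, attemptsMade: 0 };
// ... the first executeOperation call runs, incrementing firstCommit.attemptsMade ...

// On a retryable error, the majority-write-concern retry only gets the remainder.
const secondCommit: CommitAttempt = {
  maxAttempts: Math.max(1, MAX_RETRIES + 1 - firstCommit.attemptsMade),
  attemptsMade: 0
};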

@tadjik1 tadjik1 marked this pull request as ready for review February 19, 2026 20:08
@tadjik1 tadjik1 requested a review from a team as a code owner February 19, 2026 20:08
@tadjik1 tadjik1 requested a review from nbbeeken February 19, 2026 20:15
expect(error).to.be.instanceof(MongoServerError);
});

stub.returns(1);
Contributor

@nbbeeken nbbeeken Feb 20, 2026


This surprises me because Math.random can never return 1.

I see that the spec prose just says "between 0 and 1", but there's a comment in the pseudocode that says [0, 1). We should make sure the spec is very explicit about the range of the random number generator, and that the tests use a realistic value here.
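For example, stubbing with a value Math.random can actually produce; assuming sinon as the stubbing library here, which the stub.returns call above suggests.

import * as sinon from 'sinon';

// Math.random returns values in [0, 1), so stub with a value the real
// generator could actually produce (close to, but below, 1).
const randomStub = sinon.stub(Math, 'random').returns(0.999999);
// ... run the backoff/jitter assertions against this value ...
randomStub.restore();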

await client.close();
await clearFailPoint(this.configuration);
});

Contributor


Just picking a place to put this comment: Do we have/want any coverage for running many operations in parallel?

Something to take a close look at in the logic: are there any places where a condition is evaluated synchronously and impacts the rest of the operation execution, such that all entrants to a function respond to stale information (maybe expectedly or unexpectedly)? A naive example: is there something like deprioritizedServers that wouldn't work correctly if a bunch of operations were started synchronously, and synchronously while the driver is already under load?
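A rough shape such a test could take; the collection, database name, and thresholds are illustrative, not an existing test in the suite, and any failpoint setup is omitted.

import { MongoClient } from 'mongodb';

// Illustrative concurrency check: fire many operations at once and assert
// they all settle, so no shared synchronous state leaves one of them stuck.
async function runManyInParallel(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  try {
    const collection = client.db('test').collection('backpressure');
    const results = await Promise.allSettled(
      Array.from({ length: 100 }, (_, i) => collection.insertOne({ i }))
    );
    // Each entry is either fulfilled or rejected; a real test would assert
    // on the mix it expects under the configured overload failpoint.
    console.log(results.filter(r => r.status === 'rejected').length, 'rejected');
  } finally {
    await client.close();
  }
}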

Co-authored-by: Neal Beeken <neal.beeken@mongodb.com>
