feat(NODE-7142): Exponential backoff and jitter in retry loops #4871
baileympearson wants to merge 20 commits into main
Conversation
Force-pushed from bcf6a51 to 091e989.
```ts
if (session.transaction.state === TxnState.STARTING_TRANSACTION) {
  if (document.ok === 1) {
    session.transaction.transition(TxnState.TRANSACTION_IN_PROGRESS);
  } else {
    const error = new MongoServerError(document.toObject());
    const isRetryableError = error.hasErrorLabel(MongoErrorLabel.RetryableError);

    if (!isRetryableError) {
      session.transaction.transition(TxnState.TRANSACTION_IN_PROGRESS);
    }
  }
}
```
The idea here is to defer the transition of the transaction into TRANSACTION_IN_PROGRESS:

- success (`ok === 1`): the server accepted the command; the transaction is in progress.
- non-retryable error: the server processed something, so the transaction state should advance and subsequent commands don't re-send `startTransaction`.
- retryable error: the session stays in `STARTING_TRANSACTION` so the retry can include `startTransaction: true`.

Previously, applySession transitioned STARTING_TRANSACTION -> TRANSACTION_IN_PROGRESS before the command was sent to the server (line 1205 removed).
```ts
  }
} catch (error) {
  if (options.session != null && !(error instanceof MongoServerError)) {
    updateSessionFromResponse(options.session, MongoDBResponse.empty);
```
This will consequently transition the transaction into the TRANSACTION_IN_PROGRESS state (as `.empty` has no error and no success labels).

Do we have coverage for this through the spec tests, or is there a bespoke Node test to write for this?

This is not a new change; rather, this code preserves the same behaviour we have on main (where we transition the transaction inside applySession). After the refactoring this condition becomes necessary so the state machine doesn't get stuck in STARTING_TRANSACTION after a non-server error.

The question is valid though, I will dig deeper to see if we cover this somehow.

UPD:

I couldn't find anything about this specific transition in our tests (neither in integration/unit tests nor in spec tests). At the same time, I would prefer to keep it out of the scope of this PR because this is pre-backpressure behaviour:
```ts
// src/sessions.ts
if (session.transaction.state === TxnState.STARTING_TRANSACTION) {
  session.transaction.transition(TxnState.TRANSACTION_IN_PROGRESS); // this line is removed now
  command.startTransaction = true;
}
```
```ts
  )
);

function canRetry(operation: AbstractOperation, error: MongoError) {
```
This function centralizes retry eligibility; before, it was split across the entire old loop. Key idea: `SystemOverloadedError` + `RetryableError` is always retryable, regardless of retryReads/retryWrites settings or operation type (including runCommand).
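For intuition, a minimal sketch of that shape (the `SystemOverloadedError`/`RetryableError` labels and the `maxAttempts` machinery come from this PR; the `Aspect` and `isRetryable*Error` helper names are assumed from the driver's internals, and the real function also consults retryReads/retryWrites and server support):

```ts
// Sketch only; import paths and some names are assumed, not the PR's exact code.
import { MongoError, MongoErrorLabel, isRetryableReadError, isRetryableWriteError } from '../error';
import { AbstractOperation, Aspect } from './operation';
import { RunCommandOperation } from './run_command';

function canRetrySketch(operation: AbstractOperation, error: MongoError): boolean {
  // Overload-driven retries bypass retryReads/retryWrites and apply to every
  // operation type, including runCommand.
  if (
    error.hasErrorLabel(MongoErrorLabel.SystemOverloadedError) &&
    error.hasErrorLabel(MongoErrorLabel.RetryableError)
  ) {
    return true;
  }

  // Without the overload labels, plain runCommand is never retried.
  if (operation instanceof RunCommandOperation) return false;

  // Everything else falls back to the usual retryable reads/writes rules.
  if (operation.hasAspect(Aspect.READ_OPERATION)) return isRetryableReadError(error);
  if (operation.hasAspect(Aspect.WRITE_OPERATION)) return isRetryableWriteError(error);
  return false;
}
```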
```ts
  readPreference: ReadPreference.primary,
  bypassPinningCheck: true
});
operation.maxAttempts = MAX_RETRIES + 1;
```
The problem is that commitTransaction is split across two executeOperations: first with the original write concern, second (on retryable error) with majority write concern.

The total retry budget is MAX_RETRIES + 1 (6 attempts). After the first executeOperation "spends" some attempts, remainingAttempts is computed from attemptsMade and passed as maxAttempts to the second operation.
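A rough sketch of how that budget split could look (`makeCommitOperation` and `shouldRetryCommit` are hypothetical stand-ins; `client` and `session` are assumed in scope; only `maxAttempts`/`attemptsMade` come from this PR):

```ts
// Sketch: commitTransaction shares one budget of MAX_RETRIES + 1 attempts
// across two executeOperation calls.
const MAX_RETRIES = 5;

const firstAttempt = makeCommitOperation(session); // original write concern
firstAttempt.maxAttempts = MAX_RETRIES + 1;

try {
  await executeOperation(client, firstAttempt);
} catch (error) {
  if (!shouldRetryCommit(error)) throw error; // hypothetical eligibility check
  // The second operation only gets the attempts the first one did not spend.
  const secondAttempt = makeCommitOperation(session, { w: 'majority' });
  secondAttempt.maxAttempts = MAX_RETRIES + 1 - firstAttempt.attemptsMade;
  await executeOperation(client, secondAttempt);
}
```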
```ts
  expect(error).to.be.instanceof(MongoServerError);
});

stub.returns(1);
```
This surprises me because `Math.random` can never return 1.

I see that the spec prose just says "between 0 and 1", but there's a comment in the pseudocode that says [0, 1). We should make sure the spec is very explicit about the range of the random number generator and that the tests use a real value here.
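For context, a typical full-jitter delay computation under the [0, 1) assumption (a sketch; the constants are illustrative, not from the spec):

```ts
// Math.random() yields a value in [0, 1): 0 is possible, exactly 1 is not,
// so stubbing it to return 1 exercises a value the real generator never produces.
function jitteredDelayMS(attempt: number): number {
  const BASE_DELAY_MS = 100;   // illustrative constants, not from the spec
  const MAX_DELAY_MS = 10_000;
  const ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
  return Math.random() * ceiling; // full jitter over [0, ceiling)
}
```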
```ts
  await client.close();
  await clearFailPoint(this.configuration);
});
```
Just picking a place to put this comment: do we have/want any coverage for running many operations in parallel?

I think something to take a close look at is whether there are any places where a condition is evaluated synchronously in a way that impacts the rest of the operation execution, such that all entrants to a function respond to stale information (maybe expectedly or unexpectedly?). As a naive example, is there something like deprioritizedServers that wouldn't work correctly if a bunch of operations were started synchronously, and synchronously while the driver is already under load?
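One way such a bespoke test could look (a sketch only; `coll`, `expect`, and the overload failpoint setup are assumed from the surrounding test file):

```ts
// Sketch: fire many operations at once while the server reports overload,
// so every entrant hits the retry path against the same shared client state.
const results = await Promise.allSettled(
  Array.from({ length: 50 }, () => coll.findOne({}))
);

// If shared state (e.g. a token bucket or deprioritized-server list) is read
// synchronously by all entrants, assertions here can surface stale-read bugs.
expect(results.filter(r => r.status === 'rejected')).to.have.lengthOf(0);
```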
Co-authored-by: Neal Beeken <neal.beeken@mongodb.com>
Description
Summary of Changes
Implements the Client Backpressure specification, which adds automatic retry with exponential backoff for commands that fail with the `SystemOverloadedError` + `RetryableError` labels.

Note: this PR also implements NODE-7366, which clarified error propagation when multiple `NoWritesPerformed` errors are encountered during retries. This is a prerequisite for backpressure, since the multi-retry loop can now encounter more than two errors.

Notes for Reviewers
The retry loop was restructured to match the spec pseudocode. The `canRetry()` function centralizes all retry eligibility logic (a loop sketch follows the list below):

- `SystemOverloadedError` + `RetryableError` → always retryable (including `runCommand`)
- `runCommand` without overload labels → never retryable
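Roughly, the restructured loop has the following shape (a sketch only, reusing the hypothetical `jitteredDelayMS` and `canRetrySketch` helpers from above; not the driver's exact code):

```ts
import { setTimeout } from 'timers/promises';
// MongoError, AbstractOperation, canRetrySketch, jitteredDelayMS assumed in
// scope; see the earlier sketches.

// Sketch of the spec-shaped loop: attempt, check eligibility, back off, retry.
async function executeWithRetries<T>(
  operation: AbstractOperation,
  run: () => Promise<T>
): Promise<T> {
  let lastError: MongoError | undefined;
  for (let attempt = 1; attempt <= operation.maxAttempts; attempt++) {
    if (attempt > 1) await setTimeout(jitteredDelayMS(attempt)); // backoff + jitter
    operation.attemptsMade = attempt;
    try {
      return await run();
    } catch (error) {
      if (!(error instanceof MongoError) || !canRetrySketch(operation, error)) throw error;
      lastError = error; // NODE-7366 governs which error wins when several accumulate
    }
  }
  throw lastError; // budget exhausted
}
```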
The `maxAttempts`/`attemptsMade` mechanism on `AbstractOperation` exists specifically for `commitTransaction`, which splits its retry budget across two `executeOperation` calls (first with the original write concern, second with majority).

What is the motivation for this change?
NODE-7142
NODE-7366
Release Highlight
Client Backpressure
The driver now implements the Client Backpressure specification. When a MongoDB server responds with a `SystemOverloadedError`, the driver will automatically retry the operation with exponential backoff and jitter, up to 5 retries. A per-client token bucket rate-limits retry attempts during prolonged overload to protect server goodput.
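As a rough illustration of the token-bucket idea (a sketch, not the driver's implementation; the capacity and refill rate are made up):

```ts
// Sketch: a per-client token bucket that gates retry attempts. Each retry
// consumes a token; tokens refill over time, so sustained overload exhausts
// the budget and further retries are skipped instead of piling onto the server.
class RetryTokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(
    private readonly capacity = 100,      // illustrative, not from the spec
    private readonly refillPerSecond = 10 // illustrative, not from the spec
  ) {
    this.tokens = capacity;
  }

  tryConsume(): boolean {
    const now = Date.now();
    const refill = ((now - this.lastRefill) / 1000) * this.refillPerSecond;
    this.tokens = Math.min(this.capacity, this.tokens + refill);
    this.lastRefill = now;
    if (this.tokens < 1) return false; // bucket empty: do not retry
    this.tokens -= 1;
    return true;
  }
}
```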
Double check the following

- [ ] Ran `npm run check:lint`
- [ ] PR title follows the correct format: `type(NODE-xxxx)[!]: description` (e.g. `feat(NODE-1234)!: rewriting everything in coffeescript`)