Skip to content

REST: Add retry and timeout configuration for REST catalog#3418

Open
rahulsmahadev wants to merge 3 commits into
apache:mainfrom
rahulsmahadev:rahul/rest-retry-timeout
Open

REST: Add retry and timeout configuration for REST catalog#3418
rahulsmahadev wants to merge 3 commits into
apache:mainfrom
rahulsmahadev:rahul/rest-retry-timeout

Conversation

@rahulsmahadev

Copy link
Copy Markdown

Rationale for this change

The REST Catalog uses requests with no retries and no timeout configured, so transient 5xx / network failures bubble up immediately and slow servers can hang the client indefinitely. The reporting user on #2772 hit infinite-timeout hangs against a Polaris instance returning 504 from a proxy.

This change adds an optional connection: block on the catalog properties for opting in to a per-request timeout and a retry policy, as proposed in the issue body:

catalog:
  default:
    uri: http://rest-catalog/ws/
    connection:
      timeout: 60                     # seconds, applied to every HTTP call
      retry:
        total: 5
        backoff_factor: 1.0
        status_forcelist: [429, 500, 502, 503, 504]
        allowed_methods: [GET, HEAD, OPTIONS]

Changes:

  • Added connection, connection.timeout, connection.retry catalog properties.
  • Added a private _RetryTimeoutHTTPAdapter that subclasses requests.adapters.HTTPAdapter and applies a default timeout to every call (since requests does not expose a session-wide timeout).
  • The connection.retry sub-mapping is passed verbatim to urllib3.util.retry.Retry, so any kwarg the upstream class accepts can be configured.
  • Both keys are optional and opt-in. When neither is set, the default requests behavior is preserved — no behavior change for existing users.
  • Validation: negative or zero timeout raises ValueError; unknown Retry kwargs are caught and re-raised as ValueError with a clearer message.

The SigV4 adapter mounted at the catalog URI (see #3063) is a longer-prefix match and still wins for that host, so SigV4 users continue to get their existing retry behavior unchanged — this change deliberately stays out of that path.

Are these changes tested?

Yes. Added five tests in tests/catalog/test_rest.py:

  • test_session_without_connection_config_uses_default_adapter — no connection: block → no _RetryTimeoutHTTPAdapter is mounted (default requests behavior preserved).
  • test_session_with_connection_timeout_and_retry — full nested connection: block applies timeout and all Retry kwargs (total, backoff_factor, status_forcelist, allowed_methods).
  • test_session_with_connection_timeout_onlytimeout alone works without a retry block.
  • test_session_with_invalid_connection_timeout_raises — negative timeout raises ValueError.
  • test_session_with_invalid_connection_retry_kwarg_raises — unknown Retry kwarg raises ValueError.

Are there any user-facing changes?

Yes — two new optional REST catalog properties (connection.timeout, connection.retry) documented in mkdocs/docs/configuration.md under "Retry and timeout". No behavior change for users who do not set them.

Closes apache#2772.

The REST Catalog uses requests with no retries and no timeout by default,
so transient 5xx/network failures bubble up immediately and slow servers
can hang the client indefinitely (e.g. a Polaris instance returning 504
from a proxy).

Add an optional connection: block on the catalog properties:

  catalog:
    default:
      uri: http://rest-catalog/ws/
      connection:
        timeout: 60
        retry:
          total: 5
          backoff_factor: 1.0
          status_forcelist: [429, 500, 502, 503, 504]
          allowed_methods: [GET, HEAD, OPTIONS]

connection.retry is passed verbatim to urllib3.util.retry.Retry. Both
keys are optional and opt-in: when neither is set the default requests
behavior is preserved.

Signed-off-by: rahulsmahadev <rahul.mahadev@databricks.com>
@rambleraptor

Copy link
Copy Markdown
Collaborator

Can you take a look at the linter failures?

'make lint' will help you run the linters locally.

Comment thread mkdocs/docs/configuration.md Outdated
| Key | Example | Description |
| ---------------------------- | ------------------------------------ | ------------------------------------------------------------------------------------------------------ |
| connection.timeout | 60 | Per-request timeout in seconds. Must be a positive number. |
| connection.retry | `{total: 5, backoff_factor: 1.0}` | Mapping passed verbatim as kwargs to [`urllib3.util.retry.Retry`](https://urllib3.readthedocs.io/en/latest/reference/urllib3.util.html#urllib3.util.Retry). |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like having a dictionary here is a bit tricky; this couples the configuration tightly with urllib3. When urllib changes something, or we want to use another library, then we have to map this. How about adding the options explicitly.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on this.

Allowing so many retry options can also lead to unintended behaviors. As an example, what happens if you set raise_on_status to False, but you receive a 403 that you should raise an error on?

I think we should focus on timeout, # of retries, and a backoff factor.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — dropped the dict pass-through. The connection block now exposes only three explicit options: timeout, retries, and backoff-factor. The status_forcelist and allowed_methods are hard-coded internally to transient codes (429/500/502/503/504) and idempotent methods (GET/HEAD/OPTIONS), so users can't accidentally swallow non-transient errors or retry non-idempotent writes. Fixed in 6fb87ff.

assert "Could not find the TLS certificate file, invalid path: path_to_client_cert" in str(e.value)


def test_session_without_connection_config_uses_default_adapter(rest_mock: Mocker) -> None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get a test where we set the retry logic and then see the retries occur? We should be able to simulate this with mock HTTP calls and then see that X number of HTTP calls were made afterwards.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added test_session_retries_on_transient_5xx_then_succeeds in 6fb87ff. requests_mock actually replaces the HTTPAdapter on the session, which bypasses our retry logic, so the test instead stands up a real http.server on a loopback port. The handler returns three 503s followed by a 200, and the test asserts both that list_namespaces succeeds and that the handler saw 4 requests.

Per @Fokko + @rambleraptor: drop the urllib3 retry-dict pass-through and
expose only three explicit knobs on the connection block (timeout, retries,
backoff-factor). Hard-code the retry policy (status_forcelist of transient
codes; allowed_methods of idempotent verbs) so users cannot misconfigure
e.g. raise_on_status=False and silently swallow 4xx errors.

Per @rambleraptor: add a test that exercises the retry path end-to-end by
spinning up a loopback HTTP server that returns three 503s then a 200,
verifying the catalog makes all four attempts. requests_mock can't be used
here because it replaces the HTTPAdapter and bypasses retry logic.

Fix the three mypy errors flagged by CI:
- _RetryTimeoutHTTPAdapter.send now matches HTTPAdapter.send's full
  signature instead of (request, **kwargs).
- Test's set(adapter.max_retries.allowed_methods) now guards the
  Collection[str] | None type.

Signed-off-by: rahulsmahadev <rahul.mahadev@databricks.com>
@rahulsmahadev

Copy link
Copy Markdown
Author

Pushed 6fb87ff addressing the review feedback:

  • @Fokko / @rambleraptor: dropped the urllib3 retry-dict pass-through. The connection block now exposes only three explicit options — timeout, retries, backoff-factor — with status_forcelist and allowed_methods hard-coded internally (transient codes + idempotent verbs), so users can't accidentally swallow non-transient errors or retry non-idempotent writes.
  • @rambleraptor (retry test): added test_session_retries_on_transient_5xx_then_succeeds that uses a real loopback http.server (since requests_mock replaces the HTTPAdapter and bypasses retry logic). The handler returns three 503s then a 200; the test asserts both that list_namespaces succeeds and that the handler saw 4 requests.
  • @rambleraptor (lint): the three CI mypy errors are fixed — _RetryTimeoutHTTPAdapter.send now matches the parent signature, and the set(allowed_methods) call guards the Collection[str] | None type.

PTAL!

@rambleraptor

Copy link
Copy Markdown
Collaborator

Can you run make lint and get the linters passing? I'll take another look after that. Thanks!

requests' HTTPAdapter.send declares proxies as Mapping[str, str] | None
(via types-requests). Overriding with the more specific MutableMapping
is an invalid override under mypy 1.18 strict mode.

Signed-off-by: rahulsmahadev <rahul.mahadev@databricks.com>
@rahulsmahadev

Copy link
Copy Markdown
Author

Linters are now passing — afb1f51 fixed the last mypy holdouts: I'd used MutableMapping for the proxies parameter on the overridden HTTPAdapter.send, but the parent (per types-requests) declares it as Mapping, and overriding with a more specific type is an invalid variance under strict mypy. Switched to Mapping. All 17 checks green now. PTAL!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants