feat: Add opt-in per-domain request throttling for HTTP 429 backoff #1762
MrAliHasan wants to merge 22 commits into apify:master
Conversation
Add a new RequestThrottler component that handles HTTP 429 (Too Many Requests) responses on a per-domain basis, preventing the autoscaling death spiral where 429s cause concurrency to increase.

Key features:
- Per-domain tracking: rate limiting on domain A doesn't affect domain B
- Exponential backoff: 2s -> 4s -> 8s -> ... capped at 60s
- Retry-After header support (both seconds and HTTP-date formats)
- Throttled requests are reclaimed to the queue, not dropped
- Backoff resets on successful requests to that domain

The AutoscaledPool is completely untouched - throttling happens transparently in BasicCrawler.__run_task_function before processing.

Integration points:
- BasicCrawler: throttle check, 429 recording, success reset
- AbstractHttpCrawler: passes URL + Retry-After to detection
- PlaywrightCrawler: passes URL + Retry-After to detection

Closes apify#1437
Hi @MrAliHasan, thanks for your contribution! We'll try to review this soon.
As mentioned in #1762 (comment), the approach of reclaiming throttled requests is not optimal.
On top of that, the solution to #1437 should probably also be extensible enough to also cover #1396 without much tweaking.
I believe that such a solution could be implemented in crawlee-python quite easily. See the similar issue for crawlee-js. The Python version already supports multiple "unnamed queues" via RequestQueue.open(alias="..."), so you'd only need to implement a ThrottlingRequestManager (an implementation of the RequestManager interface) that keeps track of the per-domain queues and their delays.
Do you want to try it?
Thanks for the detailed review. That makes sense regarding the busy-wait behavior and queue writes.
Move per-domain throttling from the execution layer (BasicCrawler.__run_task_function) to the scheduling layer (ThrottlingRequestManager.fetch_next_request).

- ThrottlingRequestManager wraps RequestQueue and implements the RequestManager interface
- fetch_next_request() buffers throttled requests and asyncio.sleep()s when all domains are throttled — eliminates busy-wait and unnecessary queue writes
- Unified delay mechanism supports both HTTP 429 backoff and robots.txt crawl-delay (apify#1396)
- parse_retry_after_header moved to crawlee._utils.http
- 23 new tests covering throttling, scheduling, delegation, and crawl-delay

Addresses apify#1437, apify#1396
…queues and update its integration across crawlers.
Heads up @janbuchar @vdusek @Mantisus: I've pushed a significant refactor based on the latest feedback.

Sub-queues over memory buffer: ThrottlingRequestManager now delegates to persistent per-domain sub-queues via RequestQueue.open(alias=f"throttled-{domain}") instead of keeping throttled requests in memory.

Test structure: Completely rewrote test_throttling_request_manager.py to drop the Test... classes and conform to Crawlee's standard test structure.

BasicCrawler fixes: Addressed all inline nits (used isinstance(), renamed url to request_url in _raise_for_session_blocked_status_code, updated docstrings/comments).

The tests track the routing origin and safely aggregate get_handled_count and is_empty metrics across the main queue and sub-queues. All 24 tests pass, and Ruff and Pytest issues have been resolved. Let me know if the updated delegation architecture feels right!
Update: I just pushed a small follow-up commit fixing the MyPy typing and Ruff linting errors in the test suite that were causing the CI to fail. All local checks for ThrottlingRequestManager are now passing 100%. Ready for review whenever you have time!
Pijukatel
left a comment
Hello, thanks for the work on this PR! I just have some annoying edge cases to think about. I am not sure myself what the best way to deal with them is.
Hey @Pijukatel, thanks for looking deeply into this! Great catches all around on the edge cases. I spent some time analyzing these points, and here is how I propose we handle them. Let me know if you are aligned on this direction before I push the changes:

1. Good catch. I'll simplify.
2. Sub-queue storage strategy: I propose your "Option 2" as the default, but with the flexibility of "Option 3".
3. Fetch priority redesign: I completely agree. Relying on iteration order rather than the longest-overdue domain is a flaw in the scheduling logic.
4. Deleting handled requests: …

Does this direction look solid to you? If so, I'll get it coded and pushed up!
We discussed it internally, and there are some open points. Please let me think about it over the weekend, so that I don't point you in the wrong direction.
It is going in a good direction. Just a few more edge cases we discussed: having ad hoc request queues created for domains can lead to two undesired scenarios.
To deal with this, we agreed it would be best to preserve existing behavior and introduce the throttling as opt-in.
Hey @Pijukatel, I've pushed the refactor based on your latest feedback. Here's what changed:

- Explicit domain routing
- Opt-in only
- Simplified state
- Documentation

All local checks pass (1647 tests, 0 failures). Ready for review!
Pijukatel
left a comment
Nice.
Now I have just a few more code-related comments.
…ctor domain state management and sub-queue handling.
Hey @Pijukatel, I've pushed all the changes from your latest review. Here's what was addressed:

- Import cleanup
- Dedicated purge method
- Docstrings
- Tests

All checks pass (1648 tests, 0 failures). Ready for review!
@MrAliHasan good work! Please just fix the errors reported by …
The CI type check (Python 3.10) fails. It seems the newer version changed the return type annotation. One possible fix would be to narrow the type explicitly:

```python
result = await sq.add_request(request, forefront=forefront)
if result is None:
    raise RuntimeError("Unexpected None from add_request()")
return result
```

However, this case should never occur in practice, so I wanted to ask first: is there a preferred approach for handling this kind of situation? I also noticed that …
Hi @MrAliHasan, thank you for your inspiring work on this PR.
Make sure you are using … Regarding …
….14 wheels for brotlicffi.
Thanks for the guidance! I've pinned … Regarding #1775, understood. I'll keep the current …
This is wrong. You should have only run …
👍
@MrAliHasan there are still some unresolved comments, mainly #1762 (comment) - can you please take care of those? |
Yes, I'll work on it tomorrow. |
…ard, crawl-delay caching, revert uv.lock
Thanks for the follow-up @vdusek @janbuchar! All comments have been addressed.
```python
if purge_request_queue and isinstance(request_manager, RequestQueue):
    await request_manager.drop()
    self._request_manager = await RequestQueue.open(
        storage_client=self._service_locator.get_storage_client(),
        configuration=self._service_locator.get_configuration(),
    )
```
Even in the state before the change, this was a code smell - shouldn't we add a "purge_on_start_hook"-like abstract method to RequestManager and implement it in RequestQueue? Or should we just call .drop on request manager?
This is aimed mostly at @vdusek and @Pijukatel. We definitely don't need to resolve it in this PR if you guys don't see an obvious way out.
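The hook idea could take roughly this shape: each manager owns its purge logic, so BasicCrawler no longer needs the isinstance special case. This is a sketch under assumed names, not the crawlee API:

```python
from abc import ABC, abstractmethod


class PurgeableManager(ABC):
    """Hypothetical slice of the RequestManager interface: each
    implementation knows how to drop its own persisted state and
    hand back a fresh, equivalent manager."""

    @abstractmethod
    async def purge(self) -> "PurgeableManager":
        ...


class InMemoryManager(PurgeableManager):
    def __init__(self) -> None:
        self.requests: list[str] = []

    async def purge(self) -> "InMemoryManager":
        # Here purging just returns an empty instance; a persistent
        # queue would delete its storage and reopen itself instead.
        return InMemoryManager()
```

With this shape, the crawler calls `await manager.purge()` uniformly and never needs to know which backing store it is holding.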
Understood, happy to leave this for a follow-up.
…to mark_request_as_handled, fix docs
Added a …
janbuchar
left a comment
Two more issues. Also please resolve merge conflicts and revert changes to uv.lock.
@janbuchar
Conflicts resolved:

- src/crawlee/request_loaders/_request_manager.py: kept master's version (PR apify#1775 made the same fix on master with minor cosmetic differences).
- uv.lock: regenerated from master (pyproject.toml ended up identical to master after auto-merge — no PR-specific dependency changes).
…type Type the wrapped manager and the `request_manager_opener` callback with a shared `TRequestManager` TypeVar so callers can ensure `recreate_purged` reconstructs the throttler with the same backing-store implementation that was passed in. The opener is now required, eliminating the silent `RequestQueue.open` default that would have type-lied for any inner type other than `RequestQueue`. Addresses review comment from janbuchar on PR apify#1762. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_is_allowed_based_on_robots_txt_file` runs once per URL. Skip the ThrottlingRequestManager isinstance branch and the `set_crawl_delay` call entirely after the first URL from a given origin, instead of relying on a no-op early return inside the manager. Addresses review comment from vdusek/janbuchar on PR apify#1762. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
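The once-per-origin optimization described in this commit can be modeled in isolation. The recorder below stands in for the crawler, and `maybe_set_crawl_delay` is a hypothetical name for the guarded call:

```python
from urllib.parse import urlparse


class CrawlDelayRecorder:
    """Remember which origins have already had their robots.txt
    crawl-delay applied, and skip the manager call for subsequent
    URLs from the same origin."""

    def __init__(self) -> None:
        self._seen_origins: set[str] = set()
        self.calls: list[tuple[str, float]] = []

    def maybe_set_crawl_delay(self, url: str, delay: float) -> bool:
        parsed = urlparse(url)
        origin = f"{parsed.scheme}://{parsed.netloc}"
        if origin in self._seen_origins:
            return False  # already applied for this origin
        self._seen_origins.add(origin)
        self.calls.append((origin, delay))  # stand-in for set_crawl_delay(...)
        return True
```

Caching on the caller side avoids even entering the manager for the common repeat-origin case, rather than relying on a no-op early return inside it.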
Force-pushed from f50d68c to 6dbe696
Fixes #1437
Problem
When target websites return HTTP 429 (Too Many Requests), requests get retried without any per-domain delay — potentially making rate limiting worse.
Solution
Introduces the `ThrottlingRequestManager`, an opt-in request manager wrapper that enforces per-domain delays at the scheduling layer.

Key features:

- `record_domain_delay()` sets a per-domain `throttled_until` timestamp based on `Retry-After` headers
- `BasicCrawler` automatically calls `set_crawl_delay()` when `respect_robots_txt_file` is enabled and the request manager is a `ThrottlingRequestManager`
- A warning is emitted when `respect_robots_txt_file` is enabled but the request manager is not a `ThrottlingRequestManager`
- `fetch_next_request()` skips throttled domains, falls back to the inner queue, and sleeps only when all sub-queues are throttled and the inner queue is empty
- `recreate_purged()` handles queue reconstruction across crawler restarts

How it works

- When a domain is throttled (its `throttled_until` timestamp lies in the future), `fetch_next_request()` skips it and falls back to the inner queue
- `record_domain_delay()` updates per-domain backoff on HTTP 429 responses, respecting `Retry-After` headers
- `set_crawl_delay()` integrates robots.txt crawl-delay when enabled

Usage

…

Files changed

- `_throttling_request_manager.py`: new component
- `http.py`: `parse_retry_after_header` utility
- `_basic_crawler.py`: `recreate_purged()` integration, crawl-delay warning
- `_playwright_crawler.py`: `Retry-After` header handling
- `test_throttling_request_manager.py`: tests against `RequestQueue` with `MemoryStorageClient`

Tests

- Cover throttling, scheduling, `recreate_purged()`, and edge cases

Future work

This is a focused first step toward a more complete `RequestAnalyzer` that may include: …