feat: Add opt-in per-domain request throttling for HTTP 429 backoff #1762

Open

MrAliHasan wants to merge 22 commits into apify:master from MrAliHasan:fix/request-throttler-429-backoff
Conversation


@MrAliHasan MrAliHasan commented Feb 20, 2026

Fixes #1437

Problem

When target websites return HTTP 429 (Too Many Requests), requests get retried without any per-domain delay — potentially making rate limiting worse.

Solution

Introduces the ThrottlingRequestManager, an opt-in request manager wrapper that enforces per-domain delays at the scheduling layer.

Note: Users must explicitly pass a ThrottlingRequestManager as the request_manager to enable throttling. There is no auto-wrapping or implicit behavior change.

Key features:

  • Per-domain sub-queues — requests for configured domains are routed to dedicated sub-queues at insertion time
  • HTTP 429 backoff — record_domain_delay() sets a per-domain throttled_until timestamp based on Retry-After headers
  • robots.txt crawl-delay — BasicCrawler automatically calls set_crawl_delay() when respect_robots_txt_file is enabled and the request manager is a ThrottlingRequestManager
  • Crawl-delay warning — logs a warning if respect_robots_txt_file is enabled but the request manager is not a ThrottlingRequestManager
  • Delay-aware scheduling — fetch_next_request() skips throttled domains, falls back to the inner queue, and sleeps only when all sub-queues are throttled and the inner queue is empty
  • Persistence support — sub-queues use the same storage backend as the inner queue; recreate_purged() handles queue reconstruction across crawler restarts

How it works

  1. Requests are routed to domain-specific sub-queues at insertion time
  2. If a domain is throttled (throttled_until), fetch_next_request() skips it and falls back to the inner queue (see the sketch after this list)
  3. record_domain_delay() updates per-domain backoff on HTTP 429 responses, respecting Retry-After headers
  4. set_crawl_delay() integrates robots.txt crawl-delay when enabled
  5. On successful requests, backoff counters reset
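
The delay-aware behavior in steps 2 and 5 can be pictured with a short sketch. This is a hypothetical outline rather than the PR's exact code: _domain_states, _inner, and the loop structure are illustrative (only _get_or_create_sub_queue and _mark_domain_dispatched are names that appear in the PR), and the real method must also handle termination once every queue is finished.

import asyncio
from datetime import datetime, timezone

async def fetch_next_request(self):
    while True:
        now = datetime.now(timezone.utc)

        # Sub-queues whose backoff has expired, longest-overdue domain first.
        ready = sorted(
            (d for d, s in self._domain_states.items() if s.throttled_until <= now),
            key=lambda d: self._domain_states[d].throttled_until,
        )
        for domain in ready:
            sub_queue = await self._get_or_create_sub_queue(domain)
            request = await sub_queue.fetch_next_request()
            if request is not None:
                self._mark_domain_dispatched(domain)
                return request

        # All sub-queues throttled (or empty): fall back to the inner queue.
        request = await self._inner.fetch_next_request()
        if request is not None:
            return request

        # Sleep only when every sub-queue is throttled and the inner queue
        # is empty, waking when the earliest backoff expires.
        wake_at = min(s.throttled_until for s in self._domain_states.values())
        await asyncio.sleep(max(0.0, (wake_at - now).total_seconds()))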

Usage

import asyncio

from crawlee.crawlers import BasicCrawler
from crawlee.request_loaders import ThrottlingRequestManager
from crawlee.storages import RequestQueue


async def main() -> None:
    queue = await RequestQueue.open()

    # Only the listed domains are throttled; requests for any other domain
    # pass through to the inner queue unchanged.
    manager = ThrottlingRequestManager(
        queue,
        domains=["example.com", "api.example.com"],
    )
    crawler = BasicCrawler(request_manager=manager)


asyncio.run(main())

Files changed

File                                  Change
_throttling_request_manager.py        NEW — per-domain throttling request manager
http.py                               parse_retry_after_header utility
_basic_crawler.py                     recreate_purged() integration, crawl-delay warning
_playwright_crawler.py                Pass Retry-After header
test_throttling_request_manager.py    NEW — 30 unit tests using a real RequestQueue with MemoryStorageClient

Tests

  • 30 new tests covering: domain routing, throttling, delay scheduling, sleep behavior, delegation, recreate_purged(), and edge cases
  • All 1648 existing tests pass with zero regressions

Future work

This is a focused first step toward a more complete RequestAnalyzer that may include:

  • robots.txt integration for multiple domains
  • URL group management
  • Enhanced per-domain scheduling and analytics

Add a new RequestThrottler component that handles HTTP 429 (Too Many
Requests) responses on a per-domain basis, preventing the autoscaling
death spiral where 429s cause concurrency to increase.

Key features:
- Per-domain tracking: rate limiting on domain A doesn't affect domain B
- Exponential backoff: 2s -> 4s -> 8s -> ... capped at 60s (see the sketch after this list)
- Retry-After header support (both seconds and HTTP-date formats)
- Throttled requests are reclaimed to the queue, not dropped
- Backoff resets on successful requests to that domain
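
A minimal sketch of that delay computation, assuming a helper shaped like the PR's parse_retry_after_header; the exact signature and the consecutive-429 counter are assumptions, not the PR's code.

from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def backoff_seconds(retry_after: str | None, consecutive_429s: int) -> float:
    if retry_after is not None:
        try:
            # Delta-seconds form, e.g. "Retry-After: 120".
            return float(int(retry_after))
        except ValueError:
            pass
        try:
            # HTTP-date form, e.g. "Retry-After: Wed, 01 Apr 2026 07:28:00 GMT".
            target = parsedate_to_datetime(retry_after)
            return max(0.0, (target - datetime.now(timezone.utc)).total_seconds())
        except ValueError:
            pass  # Unparseable header: fall through to exponential backoff.

    # No usable header: 2s -> 4s -> 8s -> ... capped at 60s.
    return min(2.0 * 2 ** (consecutive_429s - 1), 60.0)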

The AutoscaledPool is completely untouched - throttling happens
transparently in BasicCrawler.__run_task_function before processing.

Integration points:
- BasicCrawler: throttle check, 429 recording, success reset
- AbstractHttpCrawler: passes URL + Retry-After to detection
- PlaywrightCrawler: passes URL + Retry-After to detection

Closes apify#1437
@vdusek
Collaborator

vdusek commented Feb 23, 2026

Hi @MrAliHasan, thanks for your contribution! We'll try to review this soon.

Collaborator

@janbuchar janbuchar left a comment


As mentioned in #1762 (comment), the approach of reclaiming throttled requests is not optimal.

On top of that, the solution to #1437 should probably be extensible enough to also cover #1396 without much tweaking.

I believe that such a solution could be implemented in crawlee-python quite easily. See the similar issue for crawlee-js. The Python version already supports multiple "unnamed queues" via RequestQueue.open(alias="..."), so you'd only need to implement a ThrottlingRequestManager (an implementation of the RequestManager interface) that would keep track of the per-domain queues and their delays.

Do you want to try it?

Comment thread src/crawlee/crawlers/_basic/_basic_crawler.py Outdated
Comment thread src/crawlee/crawlers/_basic/_basic_crawler.py
Comment thread src/crawlee/crawlers/_basic/_basic_crawler.py Outdated
@MrAliHasan
Author

Thanks for the detailed review. That makes sense regarding the busy-wait behavior and queue writes.
I’ll refactor this into a ThrottlingRequestManager implementation so that the throttling logic lives in the request scheduling layer rather than in BasicCrawler.
I’ll push an updated version soon. Appreciate the guidance.

Move per-domain throttling from execution layer (BasicCrawler.__run_task_function)
to scheduling layer (ThrottlingRequestManager.fetch_next_request).

- ThrottlingRequestManager wraps RequestQueue, implements RequestManager interface
- fetch_next_request() buffers throttled requests and asyncio.sleep()s when all
  domains are throttled — eliminates busy-wait and unnecessary queue writes
- Unified delay mechanism supports both HTTP 429 backoff and robots.txt
  crawl-delay (apify#1396)
- parse_retry_after_header moved to crawlee._utils.http
- 23 new tests covering throttling, scheduling, delegation, and crawl-delay

Addresses apify#1437, apify#1396
@janbuchar janbuchar self-requested a review February 24, 2026 10:22
Comment thread src/crawlee/crawlers/_basic/_basic_crawler.py Outdated
Comment thread src/crawlee/crawlers/_basic/_basic_crawler.py Outdated
Comment thread src/crawlee/crawlers/_basic/_basic_crawler.py Outdated
Comment thread src/crawlee/crawlers/_basic/_basic_crawler.py Outdated
Comment thread src/crawlee/crawlers/_basic/_basic_crawler.py Outdated
Comment thread src/crawlee/crawlers/_basic/_basic_crawler.py Outdated
Comment thread tests/unit/test_throttling_request_manager.py
Comment thread src/crawlee/request_loaders/_throttling_request_manager.py
Comment thread src/crawlee/request_loaders/_throttling_request_manager.py Outdated
…queues and update its integration across crawlers.
@MrAliHasan
Author

Heads up @janbuchar @vdusek @Mantisus: I've pushed a significant refactor based on the latest feedback.

Sub-queues over memory buffer: ThrottlingRequestManager now delegates to persistent per-domain sub-queues via RequestQueue.open(alias=f"throttled-{domain}") instead of keeping throttled requests in memory.
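
A rough sketch of that lazy sub-queue creation; _get_or_create_sub_queue is the helper's actual name in the PR, while the service-locator wiring shown here is assumed.

async def _get_or_create_sub_queue(self, domain: str) -> RequestQueue:
    # One persistent sub-queue per configured domain, created on first use.
    if domain not in self._sub_queues:
        self._sub_queues[domain] = await RequestQueue.open(
            alias=f"throttled-{domain}",
            storage_client=self._service_locator.get_storage_client(),
            configuration=self._service_locator.get_configuration(),
        )
    return self._sub_queues[domain]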

Test Structure: Completely rewrote test_throttling_request_manager.py to drop the Test... classes and conform to Crawlee's standard test structure.

BasicCrawler fixes: Addressed all inline nits (used isinstance(), renamed url to request_url in _raise_for_session_blocked_status_code, updated docstrings/comments).

The tests track the routing origin and safely aggregate get_handled_count and is_empty metrics across the main queue and sub-queues. All 24 tests pass, and Ruff and Pytest issues have been resolved. Let me know if the updated delegation architecture feels right!

@MrAliHasan
Author

Update: I just pushed a small follow-up commit fixing the MyPy typing and Ruff linting errors in the test suite that were causing the CI to fail. All local checks for ThrottlingRequestManager are now passing 100%. Ready for review whenever you have time!

Collaborator

@Pijukatel Pijukatel left a comment


Hello, thanks for the work on this PR! I just have some annoying edge cases to think about. I am not sure myself what the best way is to deal with them.

Comment thread src/crawlee/crawlers/_basic/_basic_crawler.py Outdated
Comment thread src/crawlee/request_loaders/_throttling_request_manager.py Outdated
Comment thread src/crawlee/request_loaders/_throttling_request_manager.py Outdated
Comment thread src/crawlee/request_loaders/_throttling_request_manager.py Outdated
@MrAliHasan
Author

Hey @Pijukatel, thanks for looking deeply into this! Great catches all around on the edge cases. I spent some time analyzing these points, and here is how I propose we handle them. Let me know if you are aligned on this direction before I push the changes:

1. BasicCrawler initialization

Good catch. I'll simplify _basic_crawler.py to just directly assign self._request_manager = ThrottlingRequestManager(inner).

2. Sub-Queue storage strategy

Regarding the service_locator mismatch: you are absolutely right. If a crawler relies on persistent storage (e.g. Apify Platform) and we silently default the sub-queues to a fast MemoryStorageClient, an interrupted crawl will lose its throttled queues entirely on restart. We would be trading determinism and correctness for raw performance.

Therefore, I propose your "Option 2" as the default, but with the flexibility of "Option 3":

  • ThrottlingRequestManager will accept an optional service_locator.
  • By default, BasicCrawler will pass its own self._service_locator down to the throttler so that sub-queues consistently match the persistence mechanism of the main queue.
  • If a power user really wants pure in-memory sub-queues for cost/speed reasons despite a persistent main queue, they can explicitly pass a custom ServiceLocator into the throttler instance.

3. Fetch priority redesign

I completely agree. Relying on iteration order rather than the longest-overdue domain is a flaw in the scheduling logic.

  • I will remove _DomainState.last_request_at entirely. Less state means fewer edge cases.
  • On dispatch, we simply update _DomainState.throttled_until immediately.
  • fetch_next_request will collect available sub-queues and explicitly sort them by throttled_until, so the domain that has been waiting the longest gets fetched first.

4. Deleting Handled Requests

Because RequestQueue does not currently support .delete(), we can't physically purge them without API additions. However, since the sub-queues only act as temporary routing buffers, this shouldn't be a problem. The source of truth remains the inner queue. The handled tracking duplication in the sub-queue doesn't affect the core deduplication logic at all.

Does this direction look solid to you? If so, I'll get it coded and pushed up!

@Pijukatel
Collaborator

...

Does this direction look solid to you? If so, I'll get it coded and pushed up!

We discussed it internally, and there are some open points. Please let me think about it over the weekend, so that I don't point you in the wrong direction.

@Pijukatel
Collaborator

...

Does this direction look solid to you? If so, I'll get it coded and pushed up!

It is going in a good direction. Just a few more edge cases we discussed:

Having ad hoc request queues being created for domains can lead to two undesired scenarios:

  • If redirecting new requests to such queues, then deduplication is difficult (would have to check two queues)
  • If not redirecting new requests to such queues, then requests are duplicated, and there is a lot of work done by just duplicating specific domain requests from the main queue to the specialized queue.

To deal with this, we agreed it would be best to have the RequestThrottler created with specific domains as init arguments. It should redirect new requests to those explicitly requested sub-queues. For domains not explicitly mentioned in init, it would not do any request throttling.

Preserve existing behavior and introduce RequestThrottler in a safe way

It would be best if we start with this feature being optional for now, until we get more usage feedback. So RequestThrottler would have to be explicitly passed to the BasicCrawler. Could you please add a short guide on how to do it and a small code example? (See: https://github.com/apify/crawlee-python/tree/master/docs/guides)

@MrAliHasan
Author

Hey @Pijukatel, I've pushed the refactor based on your latest feedback. Here's what changed:

Explicit domain routing

  • ThrottlingRequestManager now requires domains at init — only listed domains are throttled.
  • add_request / add_requests route listed-domain requests directly to their sub-queue at insertion time (see the sketch after this list). No duplication, no transfers.
  • Non-listed domains pass through to the main queue untouched.
  • Sub-queues use the same storage backend as the inner queue by default to preserve persistence across restarts.
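
A minimal sketch of that insertion-time routing; the helper names are borrowed from elsewhere in the PR, and the exact signatures are assumptions.

from urllib.parse import urlparse

async def add_request(self, request, *, forefront=False):
    domain = urlparse(request.url).hostname or ""
    if domain in self._domain_states:
        # Listed domain: route straight to its dedicated sub-queue.
        sub_queue = await self._get_or_create_sub_queue(domain)
        return await sub_queue.add_request(request, forefront=forefront)
    # Non-listed domain: pass through to the inner queue untouched.
    return await self._inner.add_request(request, forefront=forefront)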

Opt-in only

  • Removed auto-wrapping from BasicCrawler.get_request_manager(). The feature must be explicitly enabled by passing a ThrottlingRequestManager as request_manager.

Simplified state

  • Removed _DomainState.last_request_at. Only throttled_until is tracked, updated on dispatch.
  • Eliminated _dispatched_origins and _transferred_requests_count — no longer needed since requests start in the right queue.
  • fetch_next_request sorts sub-queues by throttled_until ascending (longest-overdue domain first).

Documentation

  • Added docs/guides/request_throttling.mdx with a short guide and code example showing explicit usage.

All local checks pass (1647 tests, 0 failures). Ready for review!

Collaborator

@Pijukatel Pijukatel left a comment


Nice.
Now I have just a few more code-related comments.

Comment thread src/crawlee/_utils/http.py Outdated
Comment thread src/crawlee/crawlers/_basic/_basic_crawler.py Outdated
Comment thread src/crawlee/crawlers/_playwright/_playwright_crawler.py Outdated
Comment thread src/crawlee/request_loaders/_throttling_request_manager.py Outdated
Comment thread src/crawlee/request_loaders/_throttling_request_manager.py Outdated
Comment thread src/crawlee/request_loaders/_throttling_request_manager.py Outdated
Comment thread src/crawlee/request_loaders/_throttling_request_manager.py Outdated
Comment thread src/crawlee/request_loaders/_throttling_request_manager.py Outdated
Comment thread src/crawlee/request_loaders/_throttling_request_manager.py Outdated
Comment thread tests/unit/test_throttling_request_manager.py Outdated
…ctor domain state management and sub-queue handling.
@MrAliHasan
Author

Hey @Pijukatel, I've pushed all the changes from your latest review. Here's what was addressed:

Import cleanup

  • Moved parsedate_to_datetime import to the top of http.py.
  • Inlined retry_after_header variable in _playwright_crawler.py.

ThrottlingRequestManager simplifications

  • Removed redundant _domains set. All _domain_states are now fully populated at init with throttled_until set to now.
  • _service_locator now defaults to the global service_locator when not explicitly provided, eliminating all if self._service_locator branches in _get_or_create_sub_queue and recreate_purged.
  • Added if not domain: return guard in record_success for consistency with set_crawl_delay and _mark_domain_dispatched.
  • Removed if domain not in self._domain_states guards in record_domain_delay, set_crawl_delay, and _mark_domain_dispatched since all states are initialized at init.
  • Removed the # ────── separator comment block.

Dedicated purge method

  • Added a recreate_purged() method on ThrottlingRequestManager that drops all queues and returns a fresh instance with the same domain configuration and service locator (see the sketch below). BasicCrawler now calls this directly instead of accessing private attributes.
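
A sketch of what that method might look like, assuming the constructor parameters mentioned earlier in the thread (domains and the optional service_locator); internal attribute names are illustrative.

async def recreate_purged(self) -> "ThrottlingRequestManager":
    # Drop the inner queue and every per-domain sub-queue, then rebuild the
    # manager with the same domain configuration and service locator.
    await self._inner.drop()
    for sub_queue in self._sub_queues.values():
        await sub_queue.drop()
    fresh_inner = await RequestQueue.open(
        storage_client=self._service_locator.get_storage_client(),
        configuration=self._service_locator.get_configuration(),
    )
    return ThrottlingRequestManager(
        fresh_inner,
        domains=list(self._domain_states),
        service_locator=self._service_locator,
    )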

Docstrings

  • Clarified that fetch_next_request falls back to the inner queue when all sub-queues are throttled, and only sleeps when the inner queue is also empty.

Tests

  • Replaced all AsyncMock queue mocks with real RequestQueue backed by MemoryStorageClient, which also provides the service_locator for sub-queue creation.
  • Added test for recreate_purged.

All checks pass (1648 tests, 0 failures). Ready for review!

Comment thread src/crawlee/request_loaders/_throttling_request_manager.py
@Pijukatel
Collaborator

@MrAliHasan good work! Please just fix the errors reported by Type check and I have no more comments.

@MrAliHasan
Author

@MrAliHasan good work! Please just fix the errors reported by Type check and I have no more comments.

The CI type check (Python 3.10) fails with invalid-return-type on the two return statements in add_request().

It seems the newer ty version used in CI resolves RequestQueue.add_request() as returning ProcessedRequest | None instead of ProcessedRequest, and RequestManager.add_request() as Unknown | ProcessedRequest | None.

One possible fix would be to narrow the type explicitly:

result = await sq.add_request(request, forefront=forefront)
if result is None:
    raise RuntimeError("Unexpected None from add_request()")
return result

However, this case should never occur in practice, so I wanted to ask first: Is there a preferred approach for handling this kind of ty false positive? For example, a suppression comment or a different typing pattern.

I also noticed that RequestManagerTandem.add_request() uses the same delegation pattern and does not seem to trigger this issue on CI.

@Mantisus
Collaborator

Mantisus commented Mar 5, 2026

Hi, @MrAliHasan

Thank you for your inspiring work on this PR.

However, this case should never occur in practice, so I wanted to ask first: Is there a preferred approach for handling this kind of ty false positive? For example, a suppression comment or a different typing pattern.

Make sure you are using ty 0.0.18. There are known issues with type narrowing in recent versions.

Regarding add_request, this is related to #1775. There is a possibility that the request will not be added to the queue with some backends.

@MrAliHasan
Author


Make sure you are using ty 0.0.18. There are known issues with type narrowing in recent versions.

Regarding add_request, this is related to #1775. There is a possibility that the request will not be added to the queue with some backends.

Thanks for the guidance!

I've pinned ty to version 0.0.18 in pyproject.toml and updated uv.lock. This resolves the CI type check error caused by the newer ty versions.

Regarding #1775, understood. I'll keep the current add_request return type as ProcessedRequest for now, and it can be adjusted later if the behavior changes as part of that issue.

@vdusek
Collaborator

vdusek commented Mar 6, 2026

I've pinned ty to version 0.0.18 in pyproject.toml and updated uv.lock. This resolves the CI type check error caused by the newer ty versions.

This is wrong. You should have only run uv to upgrade the package to the latest version that complies with the constraint in pyproject.toml, not hard-pin one specific version there.

@vdusek
Collaborator

vdusek commented Mar 13, 2026

Change the return type to ProcessedRequest | None to match the base class behavior

👍

Comment thread uv.lock
Comment thread src/crawlee/request_loaders/_throttling_request_manager.py Outdated
Comment thread src/crawlee/request_loaders/_throttling_request_manager.py Outdated
Comment thread src/crawlee/crawlers/_basic/_basic_crawler.py
@vdusek vdusek changed the title from "fix: add per-domain RequestThrottler for 429 backoff" to "feat: Add opt-in per-domain request throttling for HTTP 429 backoff" Mar 19, 2026
@vdusek vdusek requested review from Mantisus and janbuchar March 19, 2026 09:34
@janbuchar
Collaborator

@MrAliHasan there are still some unresolved comments, mainly #1762 (comment) - can you please take care of those?

@MrAliHasan
Author

@MrAliHasan there are still some unresolved comments, mainly #1762 (comment) - can you please take care of those?

Yes, I'll work on it tomorrow.

@MrAliHasan
Author

Thanks for the follow-up @vdusek @janbuchar! All comments have been addressed:

  • Remove type ignore, resolve proper typing: Updated the base class RequestManager.add_request to return ProcessedRequest | None (along with RequestManagerTandem and RequestQueue), so the override in ThrottlingRequestManager no longer needs # type: ignore. Also updated add_requests to handle None.
  • recreate_purged assumption: Added an explicit isinstance check that raises TypeError if the inner manager is not a RequestQueue, instead of silently assuming it.
  • Redundant set_crawl_delay calls: Added an early return in set_crawl_delay if the crawl-delay is already set for the domain, making repeated calls a no-op.
  • uv.lock: Reverted to match master. There are currently merge conflicts due to master moving forward.

Comment thread docs/guides/request_throttling.mdx Outdated
Comment on lines -710 to -715
if purge_request_queue and isinstance(request_manager, RequestQueue):
    await request_manager.drop()
    self._request_manager = await RequestQueue.open(
        storage_client=self._service_locator.get_storage_client(),
        configuration=self._service_locator.get_configuration(),
    )
Collaborator


Even in the state before the change, this was a code smell - shouldn't we add a "purge_on_start_hook"-like abstract method to RequestManager and implement it in RequestQueue? Or should we just call .drop on request manager?

This is aimed mostly at @vdusek and @Pijukatel. We definitely don't need to resolve it in this PR if you guys don't see an obvious way out.

Author


Understood, happy to leave this for a follow-up.

Comment thread src/crawlee/crawlers/_basic/_basic_crawler.py Outdated
Comment thread src/crawlee/request_loaders/_throttling_request_manager.py Outdated
@MrAliHasan
Author

Added a request_manager_opener callback parameter to __init__ (defaults to RequestQueue.open), as you suggested. recreate_purged now uses this callback instead of hardcoding RequestQueue.open, and preserves it across recreations.

@janbuchar janbuchar self-requested a review April 5, 2026 18:34
Collaborator

@janbuchar janbuchar left a comment


Two more issues. Also please resolve merge conflicts and revert changes to uv.lock.

Comment thread src/crawlee/crawlers/_basic/_basic_crawler.py
Comment thread src/crawlee/request_loaders/_throttling_request_manager.py Outdated
@MrAliHasan
Author

@janbuchar
Thanks for the feedback. Could you please clarify what exactly is still not resolved? I want to make sure I address the issue correctly.

@janbuchar
Collaborator

Conflicts resolved:
- src/crawlee/request_loaders/_request_manager.py: kept master's version
  (PR apify#1775 made the same fix on master with minor cosmetic differences).
- uv.lock: regenerated from master (pyproject.toml ended up identical to
  master after auto-merge — no PR-specific dependency changes).
vdusek added a commit to MrAliHasan/crawlee-python that referenced this pull request May 5, 2026
…type

Type the wrapped manager and the `request_manager_opener` callback with a
shared `TRequestManager` TypeVar so callers can ensure `recreate_purged`
reconstructs the throttler with the same backing-store implementation
that was passed in. The opener is now required, eliminating the silent
`RequestQueue.open` default that would have type-lied for any inner
type other than `RequestQueue`.

Addresses review comment from janbuchar on PR apify#1762.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
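
In sketch form, the typing this commit describes could look like the following; the exact parameter list is an assumption.

from collections.abc import Awaitable, Callable
from typing import Generic, TypeVar

TRequestManager = TypeVar("TRequestManager", bound=RequestManager)

class ThrottlingRequestManager(RequestManager, Generic[TRequestManager]):
    def __init__(
        self,
        inner: TRequestManager,
        *,
        domains: list[str],
        request_manager_opener: Callable[..., Awaitable[TRequestManager]],
    ) -> None:
        ...
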
vdusek added a commit to MrAliHasan/crawlee-python that referenced this pull request May 5, 2026
`_is_allowed_based_on_robots_txt_file` runs once per URL. Skip the
ThrottlingRequestManager isinstance branch and the `set_crawl_delay`
call entirely after the first URL from a given origin, instead of
relying on a no-op early return inside the manager.

Addresses review comment from vdusek/janbuchar on PR apify#1762.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
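
Conceptually, the per-origin short-circuit might look like this inside the robots.txt check; the attribute and variable names here are hypothetical.

# Only the first URL from each origin pays for the isinstance branch and the
# set_crawl_delay call; subsequent URLs from that origin skip it entirely.
if origin not in self._crawl_delay_applied_origins:
    self._crawl_delay_applied_origins.add(origin)
    if isinstance(request_manager, ThrottlingRequestManager):
        request_manager.set_crawl_delay(origin, crawl_delay)
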
@vdusek vdusek requested review from janbuchar and vdusek May 5, 2026 14:51
@vdusek vdusek force-pushed the fix/request-throttler-429-backoff branch from f50d68c to 6dbe696 Compare May 6, 2026 07:11