feat: Add opt-in per-domain request throttling for HTTP 429 backoff #1762
MrAliHasan wants to merge 22 commits into apify:master
Conversation
Add a new RequestThrottler component that handles HTTP 429 (Too Many Requests) responses on a per-domain basis, preventing the autoscaling death spiral where 429s cause concurrency to increase.

Key features:
- Per-domain tracking: rate limiting on domain A doesn't affect domain B
- Exponential backoff: 2s -> 4s -> 8s -> ... capped at 60s
- Retry-After header support (both seconds and HTTP-date formats)
- Throttled requests are reclaimed to the queue, not dropped
- Backoff resets on successful requests to that domain

The AutoscaledPool is completely untouched - throttling happens transparently in BasicCrawler.__run_task_function before processing.

Integration points:
- BasicCrawler: throttle check, 429 recording, success reset
- AbstractHttpCrawler: passes URL + Retry-After to detection
- PlaywrightCrawler: passes URL + Retry-After to detection

Closes apify#1437
Hi @MrAliHasan, thanks for your contribution! We'll try to review this soon.
As mentioned in #1762 (comment), the approach of reclaiming throttled requests is not optimal.
On top of that, the solution to #1437 should probably also be extensible enough to also cover #1396 without much tweaking.
I believe that such a solution could be implemented in crawlee-python quite easily. See the similar issue for crawlee-js. The Python version already supports multiple "unnamed queues" via RequestQueue.open(alias="..."), so you'd only need to implement a ThrottlingRequestManager (an implementation of the RequestManager interface) that keeps track of the per-domain queues and their delays.
Do you want to try it?
Thanks for the detailed review. That makes sense regarding the busy-wait behavior and queue writes.
Move per-domain throttling from the execution layer (BasicCrawler.__run_task_function) to the scheduling layer (ThrottlingRequestManager.fetch_next_request).

- ThrottlingRequestManager wraps RequestQueue and implements the RequestManager interface
- fetch_next_request() buffers throttled requests and asyncio.sleep()s when all domains are throttled — eliminates busy-wait and unnecessary queue writes
- Unified delay mechanism supports both HTTP 429 backoff and robots.txt crawl-delay (apify#1396)
- parse_retry_after_header moved to crawlee._utils.http
- 23 new tests covering throttling, scheduling, delegation, and crawl-delay

Addresses apify#1437, apify#1396
…queues and update its integration across crawlers.
Heads up @janbuchar @vdusek @Mantisus: I've pushed a significant refactor based on the latest feedback.

Sub-queues over memory buffer: ThrottlingRequestManager now delegates to persistent per-domain sub-queues via RequestQueue.open(alias=f"throttled-{domain}") instead of keeping throttled requests in memory.

Test structure: Completely rewrote test_throttling_request_manager.py to drop the Test... classes and conform to Crawlee's standard test structure.

BasicCrawler fixes: Addressed all inline nits (used isinstance(), renamed url to request_url in _raise_for_session_blocked_status_code, updated docstrings/comments).

The tests track the routing origin and safely aggregate get_handled_count and is_empty metrics across the main queue and sub-queues. All 24 tests pass, and Ruff and Pytest issues have been resolved. Let me know if the updated delegation architecture feels right!
Update: I just pushed a small follow-up commit fixing the MyPy typing and Ruff linting errors in the test suite that were causing the CI to fail. All local checks for ThrottlingRequestManager are now passing 100%. Ready for review whenever you have time!
Pijukatel
left a comment
Hello, thanks for the work on this PR! I just have some annoying edge cases to think about. I am not sure myself what the best way to deal with them is.
Hey @Pijukatel, thanks for looking deeply into this! Great catches all around on the edge cases. I spent some time analyzing these points, and here is how I propose we handle them. Let me know if you are aligned on this direction before I push the changes:

1. Good catch. I'll simplify.
2. Sub-queue storage strategy: I propose your "Option 2" as the default, but with the flexibility of "Option 3".
3. Fetch priority redesign: I completely agree. Relying on iteration order rather than the longest-overdue domain is a flaw in the scheduling logic.
4. Deleting handled requests: …

Does this direction look solid to you? If so, I'll get it coded and pushed up!
We discussed it internally, and there are some open points. Please let me think about it over the weekend, so that I don't point you in the wrong direction.
It is going in a good direction. Just a few more edge cases we discussed: having ad hoc request queues created for domains can lead to two undesired scenarios.
To deal with this, we agreed it would be best to preserve existing behavior and introduce the throttling as opt-in.
Hey @Pijukatel, I've pushed the refactor based on your latest feedback. Here's what changed:

- Explicit domain routing
- Opt-in only
- Simplified state
- Documentation

All local checks pass (1647 tests, 0 failures). Ready for review!
Pijukatel
left a comment
Nice.
Now I have just a few more code-related comments.
…ctor domain state management and sub-queue handling.
Hey @Pijukatel, I've pushed all the changes from your latest review. Here's what was addressed:

- Import cleanup
- Dedicated purge method
- Docstrings
- Tests

All checks pass (1648 tests, 0 failures). Ready for review!
@MrAliHasan good work! Please just fix the errors reported by …
The CI type check (Python 3.10) fails. It seems the newer version changed the return type annotation. One possible fix would be to narrow the type explicitly:

```python
result = await sq.add_request(request, forefront=forefront)
if result is None:
    raise RuntimeError("Unexpected None from add_request()")
return result
```

However, this case should never occur in practice, so I wanted to ask first: is there a preferred approach for handling this kind of situation? I also noticed that …
Hi @MrAliHasan, thank you for your inspiring work on this PR.
Make sure you are using … Regarding …
….14 wheels for brotlicffi.
Thanks for the guidance! I've pinned … Regarding #1775, understood. I'll keep the current …
This is wrong. You should have only run …
👍
@MrAliHasan there are still some unresolved comments, mainly #1762 (comment) - can you please take care of those? |
Yes, I'll work on it tomorrow. |
…ard, crawl-delay caching, revert uv.lock
Thanks for the follow-up @vdusek @janbuchar! All comments have been addressed.
```python
if purge_request_queue and isinstance(request_manager, RequestQueue):
    await request_manager.drop()
    self._request_manager = await RequestQueue.open(
        storage_client=self._service_locator.get_storage_client(),
        configuration=self._service_locator.get_configuration(),
    )
```
Even in the state before the change, this was a code smell - shouldn't we add a "purge_on_start_hook"-like abstract method to RequestManager and implement it in RequestQueue? Or should we just call .drop on request manager?
This is aimed mostly at @vdusek and @Pijukatel. We definitely don't need to resolve it in this PR if you guys don't see an obvious way out.
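The hook idea could take roughly this shape: each manager owns its purge logic, so BasicCrawler no longer needs the isinstance special case. This is a sketch under assumed names, not the crawlee API:

```python
from abc import ABC, abstractmethod


class PurgeableManager(ABC):
    """Hypothetical slice of the RequestManager interface: each
    implementation knows how to drop its own persisted state and
    hand back a fresh, equivalent manager."""

    @abstractmethod
    async def purge(self) -> "PurgeableManager":
        ...


class InMemoryManager(PurgeableManager):
    def __init__(self) -> None:
        self.requests: list[str] = []

    async def purge(self) -> "InMemoryManager":
        # Here purging just returns an empty instance; a persistent
        # queue would delete its storage and reopen itself instead.
        return InMemoryManager()
```

With this shape, the crawler calls `await manager.purge()` uniformly and never needs to know which backing store it is holding.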
Understood, happy to leave this for a follow-up.
…to mark_request_as_handled, fix docs
Added a …
janbuchar
left a comment
Two more issues. Also please resolve merge conflicts and revert changes to uv.lock.
@janbuchar
Conflicts resolved:

- src/crawlee/request_loaders/_request_manager.py: kept master's version (PR apify#1775 made the same fix on master with minor cosmetic differences).
- uv.lock: regenerated from master (pyproject.toml ended up identical to master after auto-merge — no PR-specific dependency changes).
…type Type the wrapped manager and the `request_manager_opener` callback with a shared `TRequestManager` TypeVar so callers can ensure `recreate_purged` reconstructs the throttler with the same backing-store implementation that was passed in. The opener is now required, eliminating the silent `RequestQueue.open` default that would have type-lied for any inner type other than `RequestQueue`. Addresses review comment from janbuchar on PR apify#1762. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_is_allowed_based_on_robots_txt_file` runs once per URL. Skip the ThrottlingRequestManager isinstance branch and the `set_crawl_delay` call entirely after the first URL from a given origin, instead of relying on a no-op early return inside the manager. Addresses review comment from vdusek/janbuchar on PR apify#1762. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
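The once-per-origin optimization described in this commit can be modeled in isolation. The recorder below stands in for the crawler, and `maybe_set_crawl_delay` is a hypothetical name for the guarded call:

```python
from urllib.parse import urlparse


class CrawlDelayRecorder:
    """Remember which origins have already had their robots.txt
    crawl-delay applied, and skip the manager call for subsequent
    URLs from the same origin."""

    def __init__(self) -> None:
        self._seen_origins: set[str] = set()
        self.calls: list[tuple[str, float]] = []

    def maybe_set_crawl_delay(self, url: str, delay: float) -> bool:
        parsed = urlparse(url)
        origin = f"{parsed.scheme}://{parsed.netloc}"
        if origin in self._seen_origins:
            return False  # already applied for this origin
        self._seen_origins.add(origin)
        self.calls.append((origin, delay))  # stand-in for set_crawl_delay(...)
        return True
```

Caching on the caller side avoids even entering the manager for the common repeat-origin case, rather than relying on a no-op early return inside it.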
Force-pushed from f50d68c to 6dbe696
Fixes #1437
Problem
When target websites return HTTP 429 (Too Many Requests), requests get retried without any per-domain delay — potentially making rate limiting worse.
Solution
Introduces the `ThrottlingRequestManager`, an opt-in request manager wrapper that enforces per-domain delays at the scheduling layer.

Key features:

- `record_domain_delay()` sets a per-domain `throttled_until` timestamp based on `Retry-After` headers
- `BasicCrawler` automatically calls `set_crawl_delay()` when `respect_robots_txt_file` is enabled and the request manager is a `ThrottlingRequestManager`
- A warning is emitted when `respect_robots_txt_file` is enabled but the request manager is not a `ThrottlingRequestManager`
- `fetch_next_request()` skips throttled domains, falls back to the inner queue, and sleeps only when all sub-queues are throttled and the inner queue is empty
- `recreate_purged()` handles queue reconstruction across crawler restarts

How it works

- When a domain is throttled (its `throttled_until` timestamp lies in the future), `fetch_next_request()` skips it and falls back to the inner queue
- `record_domain_delay()` updates per-domain backoff on HTTP 429 responses, respecting `Retry-After` headers
- `set_crawl_delay()` integrates robots.txt crawl-delay when enabled

Usage

…

Files changed

- `_throttling_request_manager.py`: new component
- `http.py`: `parse_retry_after_header` utility
- `_basic_crawler.py`: `recreate_purged()` integration, crawl-delay warning
- `_playwright_crawler.py`: `Retry-After` header handling
- `test_throttling_request_manager.py`: tests against `RequestQueue` with `MemoryStorageClient`

Tests

- Cover throttling, scheduling, `recreate_purged()`, and edge cases

Future work

This is a focused first step toward a more complete `RequestAnalyzer` that may include: …