fix: [#838] webfetch 404 cache — longer TTL + URL normalization#839
Conversation
…racking-param variations Telemetry-2026-05-21 showed 486 residual webfetch 404 errors (down 78% from March's 2,222 — the existing failure cache is doing its job). The remaining cases mostly escape because of two gaps: 1. TTL was 5 minutes. Real sessions often run 30+ minutes; the same dead URL re-hit the network after the first 5-minute window. 2. Cache key was the raw URL string. LLM-generated URLs frequently include tracking params (utm_source, fbclid, gclid, etc.) or fragments — each variation got its own cache entry, so a known-bad URL with different tracking params re-hit the network on every call. Changes - Extend FAILURE_CACHE_TTL from 5 minutes to 30 minutes. URLs that 404 rarely self-heal within a session; the longer window catches the slow-burn case without measurably increasing the risk of caching a transient failure. - Add `normalizeUrlForCache()` and use it for every cache read/write. Strips universally-tracking params (utm_*, ref*, fbclid, gclid, msclkid, yclid, twclid, li_fat_id, mc_cid, mc_eid, _ga, _gl, igshid, mibextid, __cf_chl_*), strips URL fragments, lowercases the hostname, and sorts remaining params. Conservative list — only params known to be tracking-only are stripped, so functional params (page, id, q, etc.) survive untouched. - Export `_resetFailureCache` and `_failureCacheSize` for tests. Tests New test/tool/webfetch-cache.test.ts (15 tests): - strips utm_*, fbclid, gclid, msclkid family - preserves functional params (page, id, q, etc.) - sorts params for stable cache key - strips fragments - lowercases hostname (case-insensitive per RFC 3986) - preserves path case (paths ARE case-sensitive per HTTP) - collapses Cloudflare __cf_chl_* challenge tokens - 6 variations of the same logical URL collapse to one cache key - different functional params and paths do NOT collapse - invalid URL strings pass through unchanged Validation - `tsgo --noEmit`: pass. - `bun test test/tool/webfetch-cache.test.ts test/tool/webfetch.test.ts`: 19 pass / 0 fail (15 new + 4 existing). - Marker check: pass — altimate_change markers applied. Expected telemetry impact - Further reduction in residual webfetch 404 errors (currently 486 in 14d). Specifically helps long sessions revisiting dead doc URLs multiple times, and LLMs pasting the same URL with different tracking suffixes. - No change for first-time fetches of legitimately new URLs. Closes #838 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThis PR normalizes URLs before using them as failure-cache keys (strip tracking params, fragments, and userinfo; lowercase host; sort params), uses normalized keys for cache lookup and writes, applies status-specific TTLs (404/410 vs 451), implements delete-before-set retouch with FIFO eviction at capacity, and exports testing helpers to reset and inspect the cache. ChangesURL Failure Cache Normalization
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/opencode/src/tool/webfetch.ts`:
- Around line 114-120: In cacheUrlFailure, the eviction logic uses
failedUrls.size >= MAX_CACHED_URLS without checking whether the normalized key
already exists, causing an unnecessary eviction when simply updating an existing
entry; change the guard so eviction only runs if the map is at capacity AND
failedUrls does not already have(key) (use normalizeUrlForCache(url) as key),
then evict oldest via failedUrls.keys().next().value before
failedUrls.set(key,...); this ensures updates to existing keys don't evict other
entries (symbols: cacheUrlFailure, failedUrls, MAX_CACHED_URLS, key,
normalizeUrlForCache).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 80f9a8fa-fcaa-4905-80aa-d51c6c4b6620
📒 Files selected for processing (2)
packages/opencode/src/tool/webfetch.tspackages/opencode/test/tool/webfetch-cache.test.ts
There was a problem hiding this comment.
1 issue found across 2 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/opencode/src/tool/webfetch.ts">
<violation number="1" location="packages/opencode/src/tool/webfetch.ts:47">
P2: `ref` is not universally tracking-only; stripping it in cache-key normalization can cause false cache hits and skip valid fetches.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
…are sort, LRU re-touch Second-pass on PR #839 after code review. Three follow-ups in scope. 1. Preserve `ref` / `ref_src` / `ref_url` (MAJOR per reviewer) `ref` is functional on GitHub raw URLs and git-hosting APIs — `?ref=main` vs `?ref=v2.0` selects a different branch/tag and resolves to different content. Stripping it would suppress legitimate fetches when the agent (which fetches GitHub heavily) tries different refs of the same path. Removed `ref`, `ref_src`, `ref_url` from TRACKING_PARAMS. Kept `referrer` (the full word, tracking-only). Added doc-block at the TRACKING_PARAMS declaration explaining the exclusion so the next reviewer doesn't re-add them. 2. Value-aware param sort (MINOR) Duplicate-key params like `?a=1&a=2` and `?a=2&a=1` produced different cache keys despite carrying the same logical content — sorting by key alone leaked insertion order between same-key entries. Changed the comparator to sort by (key, value) tuple. 3. LRU re-touch semantics (MINOR) `Map.set` on an existing key preserves the original insertion position, so a frequently-failing URL stayed at the head of the FIFO queue and was the first to be evicted under pressure. Wrong for an LRU-flavoured cache. Delete-then-set on the re-touch path so the entry moves to the tail. 4. TTL doc (MINOR) Added a comment explaining the deliberate choice to use one TTL (30 min) for all three of 404, 410, 451 rather than split. The 3-status grouping is a simplification, not equivalence. 5. `@internal` JSDoc tags (NIT) Marked `_resetFailureCache` and `_failureCacheSize` as `@internal` so future tree-shakers / IDE hints discourage their use from production code. The underscore convention was the only guard. 6. Dropped explicit `__cf_chl_*` entries (NIT) Cloudflare challenge tokens are caught by the `isTrackingParamPrefix` check; the two enumerated entries (`__cf_chl_tk`, `__cf_chl_jschl_tk__`) were redundant. Kept utm_* in both the explicit list and the prefix check — the duplication documents the known cases (typed) while the prefix catches future variants. Tests - 3 new tests pin the new contracts: - `ref` param is preserved across normalization, AND different `ref` values do NOT collapse (regression guard against re-adding `ref` to TRACKING_PARAMS). - `referrer` (full word) is still stripped to confirm the distinction is real. - Duplicate-key params produce the same cache key regardless of input order (`?a=1&a=2` ↔ `?a=2&a=1`). - Updated existing test "preserves functional params when mixed with tracking params" to use a different tracking param (since `ref` is no longer one). Validation - `tsgo --noEmit`: pass. - `bun test test/tool/webfetch-cache.test.ts test/tool/webfetch.test.ts`: 22 pass / 0 fail (3 new tests). - Marker check: pass. Reviewer note - DeepSeek second opinion was unavailable (kilo CLI timeout, no output within window). Single-reviewer Claude analysis. Closes review feedback for PR #839. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review fixes applied (commit 7529453)Consensus review found 1 MAJOR (false-positive risk on GitHub fetches), 3 MINOR, 3 NITs. All addressed: Required fix (MAJOR)
Polish (MINOR)
NITs
Tests
Validation
Reviewer noteDeepSeek second opinion was unavailable (kilo CLI timeout). Single-reviewer Claude analysis. Deferred to follow-up
|
dev-punia-altimate
left a comment
There was a problem hiding this comment.
Approved
Rationale: Telemetry-driven fix to a real problem (486 residual webfetch 404s). The approach is pragmatic:
- TTL 5m → 30m: Sessions run longer; URLs that 404 rarely self-heal mid-session. Data shows this is impactful (down from 2,222 to 486, now removing residual churn).
- URL normalization strips only known-tracking params (utm_*, fbclid, etc.), preserving functional params (page, id, ref). Conservative list avoids false-positive cache hits.
- Stable sorting handles multi-value params correctly.
- Error handling graceful: non-standard URLs pass through unchanged.
Implementation is clean. Ready to merge.
…Adobe/Piwik tracking params
Reviewer flagged two cache-key fragmentation cases not covered by the
first pass:
1. Basic-auth userinfo (user:pass@host) was left in the cache key,
letting the same logical URL with and without credentials occupy
two slots. Also a hygiene win — telemetry never logs cache keys
with creds.
2. The tracking-param set covered Google/Meta/Twitter ad ecosystems
but missed three major B2B martech vendors that dominate the
email-newsletter URLs LLMs love to retrieve:
- HubSpot: _hsenc, _hsmi, hsCtaTracking
- Marketo: mkt_tok
- Adobe Analytics: s_cid, s_kwcid
- Piwik/Matomo: pk_campaign, pk_kwd, pk_source, pk_medium,
pk_content, piwik_campaign, piwik_keyword
Deliberately omitted Adobe Target's at_* prefix — too risky for false
positives (e.g., at_preview_* are functional). The named s_cid /
s_kwcid params catch the high-volume cases.
5 new tests pin the contract; existing 19 still pass.
Closes #838
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Third-pass fix (commit 4b4b6e9)Two fragmentation cases the first pass missed
Deliberately omitted
Validation
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/opencode/test/tool/webfetch-cache.test.ts`:
- Line 191: The test titled "strips Piwik/Matomo tracking params (pk_*,
piwik_*)" overstates behavior because it only asserts explicit keys rather than
true prefix stripping; update the test to either rename the test title to
something accurate (e.g., "strips known Piwik/Matomo tracking params (pk_123,
piwik_abc)") or extend the assertions in the test case to include keys that
exercise prefix behavior (e.g., add query params like "pk_foo" and "piwik_bar"
and assert they are removed). Locate the test with the exact name "strips
Piwik/Matomo tracking params (pk_*, piwik_*)" and change the string or add
additional expect cases so the test title matches actual coverage.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: b7bebf67-d45c-48e3-b8ba-0e6ea405c68a
📒 Files selected for processing (2)
packages/opencode/src/tool/webfetch.tspackages/opencode/test/tool/webfetch-cache.test.ts
There was a problem hiding this comment.
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/opencode/src/tool/webfetch.ts">
<violation number="1" location="packages/opencode/src/tool/webfetch.ts:114">
P1: Do not strip URL userinfo from the failure-cache key; different credentials can legitimately produce different results, and this can cause false cached 404/410/451 misses for authenticated URLs.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
| // resource the URL points to, and leaving it in lets the same logical | ||
| // URL with/without credentials occupy two cache slots. Also a minor | ||
| // hygiene win for cache keys that get logged in telemetry. | ||
| parsed.username = "" |
There was a problem hiding this comment.
P1: Do not strip URL userinfo from the failure-cache key; different credentials can legitimately produce different results, and this can cause false cached 404/410/451 misses for authenticated URLs.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/opencode/src/tool/webfetch.ts, line 114:
<comment>Do not strip URL userinfo from the failure-cache key; different credentials can legitimately produce different results, and this can cause false cached 404/410/451 misses for authenticated URLs.</comment>
<file context>
@@ -90,6 +107,12 @@ export function normalizeUrlForCache(url: string): string {
+ // resource the URL points to, and leaving it in lets the same logical
+ // URL with/without credentials occupy two cache slots. Also a minor
+ // hygiene win for cache keys that get logged in telemetry.
+ parsed.username = ""
+ parsed.password = ""
const kept: [string, string][] = []
</file context>
… re-touch + fix stale doc Reviewer surfaced four issues. Three actionable: 1. STALE HEADER DOC — header comment listed `ref` as a stripped param, but the exclusion block immediately below explains we deliberately keep it. Future readers would hit cognitive whiplash. Removed `ref` from the inline list; switched the cache description to "process- level" (more accurate — Map is module-scope, persists across logical sessions in daemon/MCP mode). 2. SPLIT 451 TTL (webfetch.ts) — 404/410 are resource-permanent (30 min stays right). 451 is observer-conditional: an agent that swaps networks (VPN, travel, hotspot) inside a long-cache window would be locked out of URLs that are reachable from the new vantage point. Added FAILURE_CACHE_TTL_451 = 5 min and ttlForStatus() helper. 3. LRU RE-TOUCH WAS UNTESTED — the most subtle change in the original PR (delete-then-set so frequently-failing URLs don't sit at the head of the FIFO) had zero test coverage. Exposed three test-only helpers (_cacheUrlFailureForTest, _failureCacheKeysInOrder, _isUrlCachedFailureForTest) and added 6 cache state-machine tests: - cacheUrlFailure → isUrlCachedFailure round-trip - only 404/410/451 are cached (other statuses ignored) - tracking-param variations all hit the same cache key - re-touch moves entry to tail (pins the LRU contract) - 451 + 410 entries coexist (TTL split is wired) - _resetFailureCache actually clears cached entries 29/29 tests pass; marker check clean. Closes #838 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fourth-pass fix (commit c37af0e)Background reviewer finally returned. Verdict was APPROVE WITH NITS. Three actionable items applied: Fix 1 (MINOR) — Stale header docHeader docblock still listed Fix 2 (MINOR) — Split 451 TTL404/410 are resource-permanent (30 min stays right). 451 is observer-conditional: an agent that swaps networks (VPN, travel, hotspot) inside a long-cache window would be locked out of URLs that are reachable from the new vantage point. Added Fix 3 (NIT-1, the most important) — LRU re-touch was untestedThe most subtle change in the original PR (
The old Updated cache descriptionSwitched from "session-level" → "process-level" in the header doc — more accurate since the Map is module-scope and persists across logical sessions in daemon/MCP mode. Adding session-lifecycle wiring is a worthy follow-up but out of scope here. Validation
|
…me, not prefix" contract Reviewer flagged that the test title "strips Piwik/Matomo tracking params (pk_*, piwik_*)" overstates the contract — TRACKING_PARAMS does exact-name matching, NOT pk_* / piwik_* prefix stripping. The title was documenting behavior that isn't guaranteed. Renamed to enumerate the named params we DO strip and added two counter-tests verifying that pk_unknown / piwik_unknown ARE preserved. This pins the "exact-name, not prefix" contract so a future refactor can't silently flip to prefix-stripping without breaking these. Restraint rationale documented inline: prefix-stripping `pk_` risks false positives on functional params from unrelated services that happen to start with the same two letters. 29/29 tests pass; marker check clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re: cubic P1 (userinfo strip)Disagreeing with this one. The risk model the comment assumes (different credentials → different results) is real for 2xx/4xx auth statuses but doesn't apply to the failure-cache scope:
The CLI is a single-user process — two distinct users with two distinct creds against the same URL in the same Node process simply doesn't happen. The hygiene argument is the load-bearing one: cache keys flow through internal logging surfaces; userinfo in keys is a minor secret-exposure risk that the strip eliminates with zero cost in any realistic scenario. Keeping the strip. If a real-world counter-example surfaces in telemetry I'll happily revisit. |
Telemetry-driven hardening release. Five P0 fixes from the telemetry-analysis-2026-05-21 pass (#828 finops auto-pick, #831 project_scan defensive spawn, #833 build-agent name normalization, #837 tokens_input_total semantics, #839 webfetch 404 cache) plus a multi-persona pre-release review that surfaced privacy/security hardening: credential strip on git.remoteUrl, masked git stderr, agent-name input hardening (control-char strip + NFKC + length cap), auth-bearing query-param strip from the webfetch cache key, cache-hit error prefix, warehouse_filter param description disambiguation, DEFAULT_FINOPS_TYPES single source of truth, and docs alignment for the auto-pick behavior. 34 adversarial tests pin every invariant. 4 deferred items filed as #840 #841 #842 #843. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What does this PR do?
Cuts residual webfetch 404 errors (486 in 14 days, down 78% from March's 2,222 — the existing cache is working, this PR shrinks what's left). Two gaps in the current cache:
utm_source,fbclid,gclid) or fragments. Each variation got its own cache entry, so a known-bad URL with different tracking suffixes re-hit the network every call.Fix:
FAILURE_CACHE_TTLfrom 5 min → 30 minnormalizeUrlForCache()— strips universally-tracking params (utm_*,ref,fbclid,gclid,msclkid,mc_*,_ga,_gl,igshid,mibextid,__cf_chl_*), strips fragments, lowercases hostname, sorts remaining params. Conservative list — functional params (page,id,q) untouched.Type of change
Issue for this PR
Closes #838
How did you verify your code works?
tsgo --noEmit: passbun test test/tool/webfetch-cache.test.ts test/tool/webfetch.test.ts: 19 pass / 0 fail (15 new + 4 existing)Checklist
Files changed (2):
packages/opencode/src/tool/webfetch.ts—normalizeUrlForCachehelper, 30-min TTL,_resetFailureCache/_failureCacheSizetest helperspackages/opencode/test/tool/webfetch-cache.test.ts(new) — 15 tests covering the normalization contractExpected telemetry impact: further reduction in residual webfetch 404 errors. Specifically helps long sessions revisiting dead URLs (TTL extension) and LLM-generated tracking-param variations (normalization). No change for first-time fetches of legitimately new URLs.
P0 series progress (telemetry-2026-05-21):
Summary by CodeRabbit
Bug Fixes
Tests