Add browser-compatible headers to TempFileAPI URL downloads#34721

Draft
Copilot wants to merge 5 commits into main from copilot/add-browser-headers-to-tempfileapi

Conversation

Contributor

Copilot AI commented Feb 23, 2026

Many CDNs and Cloudflare-protected hosts reject bare server-side HTTP requests with 403/406. TempFileAPI.createTempFileFromUrl() and validUrl() made such bare requests with no headers.

Proposed Changes

  • Added getBrowserHeaders() method — builds header map on each call so Config overrides are effective at runtime without restart
  • All 5 headers configurable via Config; setting a key to blank disables that header (operators can suppress any header without a code change):
    • TEMP_FILE_URL_USER_AGENT → Chrome-compatible User-Agent
    • TEMP_FILE_URL_ACCEPT → */* (generic: the endpoint downloads any file type, not just images)
    • TEMP_FILE_URL_ACCEPT_LANGUAGE → en-US,en;q=0.9
    • TEMP_FILE_URL_ACCEPT_ENCODING → gzip, deflate (no br, since Apache HttpClient has no Brotli decoder)
    • TEMP_FILE_URL_CONNECTION → keep-alive
  • Sec-Fetch-Dest/Mode/Site intentionally omitted: these are browser-generated Fetch Metadata headers that servers use to judge the browser context a request came from; sending them from a server-side client is misleading and can cause rejections
  • Applied headers to both createTempFileFromUrl() and validUrl() via .setHeaders(getBrowserHeaders())
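
The "build on each call, blank disables" behavior described in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the class name BrowserHeadersSketch is hypothetical, a plain Map stands in for dotCMS Config lookups, and the User-Agent default is a placeholder rather than the real Chrome-compatible string.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the header-building logic; not the actual TempFileAPI code.
final class BrowserHeadersSketch {

    // {header name, config key, default value}. Keys and defaults follow the PR description,
    // except the User-Agent default, which is a placeholder for illustration.
    private static final String[][] HEADERS = {
        {"User-Agent", "TEMP_FILE_URL_USER_AGENT", "ExampleAgent/1.0"},
        {"Accept", "TEMP_FILE_URL_ACCEPT", "*/*"},
        {"Accept-Language", "TEMP_FILE_URL_ACCEPT_LANGUAGE", "en-US,en;q=0.9"},
        {"Accept-Encoding", "TEMP_FILE_URL_ACCEPT_ENCODING", "gzip, deflate"},
        {"Connection", "TEMP_FILE_URL_CONNECTION", "keep-alive"},
    };

    // Builds a fresh map on every call, so a changed override is picked up immediately.
    // The config parameter is an assumption standing in for Config.getStringProperty(key, default).
    static Map<String, String> getBrowserHeaders(Map<String, String> config) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String[] header : HEADERS) {
            String value = config.getOrDefault(header[1], header[2]);
            // A blank override disables the header instead of sending an empty value.
            if (value != null && !value.trim().isEmpty()) {
                headers.put(header[0], value);
            }
        }
        return headers;
    }
}
```

Calling getBrowserHeaders(Map.of()) yields all five defaults, while passing Map.of("TEMP_FILE_URL_CONNECTION", "") drops the Connection header and leaves the other four in place.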

Checklist

  • Tests
  • Translations
  • Security Implications Contemplated (add notes if applicable)

Additional Info

testBrowserHeaders_containsAllRequiredKeys: verifies all 5 headers are present
testBrowserHeaders_defaultValues: verifies Accept is */*, Connection is keep-alive, Accept-Encoding has no br, and Sec-Fetch-* headers are absent
testValidUrl_sendsBrowserHeaders: MockHttpServer wire test confirming every header in getBrowserHeaders() actually arrives at the remote server
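
For readers unfamiliar with wire-level header tests, the idea behind the last test can be sketched with JDK classes only. This is not the PR's test: it assumes the JDK's built-in com.sun.net.httpserver.HttpServer and java.net.http.HttpClient in place of MockHttpServer and CircuitBreakerUrl, and the class name WireHeaderCheckSketch is hypothetical. Connection is left out of the example because java.net.http.HttpClient treats it as a restricted header.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: sends one GET to a throwaway local server and reports what arrived.
final class WireHeaderCheckSketch {

    static Map<String, String> headersSeenByServer(Map<String, String> toSend) {
        Map<String, String> seen = new ConcurrentHashMap<>();
        HttpServer server;
        try {
            // Port 0 lets the OS pick a free port on the loopback interface.
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        server.createContext("/", exchange -> {
            // Record only the headers under test, exactly as the server received them.
            for (String name : toSend.keySet()) {
                String value = exchange.getRequestHeaders().getFirst(name);
                if (value != null) {
                    seen.put(name, value);
                }
            }
            exchange.sendResponseHeaders(204, -1); // 204 with no response body
            exchange.close();
        });
        server.start();
        try {
            HttpRequest.Builder request = HttpRequest.newBuilder(
                    URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/"));
            toSend.forEach(request::header);
            HttpClient client = HttpClient.newBuilder()
                    .version(HttpClient.Version.HTTP_1_1)
                    .build();
            client.send(request.GET().build(), HttpResponse.BodyHandlers.discarding());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            server.stop(0);
        }
        return seen;
    }
}
```

A test then asserts that the returned map equals the map that was sent, which checks the headers on the wire rather than only the in-memory map.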

Screenshots

Original Updated
** original screenshot ** ** updated screenshot **
Original prompt

This section details on the original issue you should resolve

<issue_title>Add browser-like headers to TempFileAPI.createTempFileFromUrl to fix remote image downloads</issue_title>
<issue_description>### Description

The TempFileAPI.createTempFileFromUrl() method in dotCMS/src/main/java/com/dotcms/rest/api/v1/temp/TempFileAPI.java (lines 218-254) makes HTTP GET requests to download remote files via CircuitBreakerUrl without setting any browser-like headers.

This causes ~80% of image URL downloads to fail because many image servers (Unsplash, Pexels, CDNs, stock photo sites, Cloudflare-protected sites) reject requests that don't include proper headers, returning 403 Forbidden or 406 Not Acceptable.

Current code (lines 243-245):

final CircuitBreakerUrl urlGetter =
    CircuitBreakerUrl.builder().setMethod(Method.GET).setUrl(finalUrl)
        .setTimeout(timeoutSeconds * 1000L).build();

Additional issue: Line 240 appends ?sourceIp=... to the URL, which can break signed URLs or CDN URLs that are sensitive to query parameter changes.
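
To illustrate why an appended query parameter can break a signed URL, here is a minimal sketch of one common style of URL signing, an HMAC over the exact path and query. The scheme, the class name SignedUrlSketch, and the parameter names are assumptions for illustration; real CDNs each define their own signing rules.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical signing scheme: the signature covers the exact path-and-query string.
final class SignedUrlSketch {

    static String sign(String secret, String pathAndQuery) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] digest = mac.doFinal(pathAndQuery.getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // True only when the signature matches the path and query that were actually requested.
    static boolean verify(String secret, String pathAndQuery, String signature) {
        byte[] expected = sign(secret, pathAndQuery).getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(expected, signature.getBytes(StandardCharsets.UTF_8));
    }
}
```

Under this scheme, a signature issued for /img.png?expires=1 verifies for that exact string but not for /img.png?expires=1&sourceIp=10.0.0.1, so a server that signs the full query would reject the modified request.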

CircuitBreakerUrl already supports custom headers via .setHeaders(), so the fix is straightforward.

Headers to Add

The following headers should be set to mimic a standard browser request and maximize download success:

| Header | Value | Why |
| --- | --- | --- |
| User-Agent | `Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36` | Most critical. CDNs and image servers check this header first and block requests without it or with generic/bot-like values. A realistic browser User-Agent is essential. |
| Accept | `image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8` | Tells the server what content types the client can handle. Some servers return 406 Not Acceptable without it. Using the browser-standard image Accept header maximizes compatibility. |
| Accept-Language | `en-US,en;q=0.9` | Some CDNs (especially Cloudflare) flag requests missing this header as bot traffic. |
| Accept-Encoding | `gzip, deflate, br` | Signals the client supports compression. Missing this can trigger bot detection on some WAFs. |
| Connection | `keep-alive` | Standard browser behavior. Some servers reject requests with Connection: close. |
| Sec-Fetch-Dest | `image` | Modern Fetch Metadata header. Cloudflare and other WAFs use these to distinguish real browser requests from bots. |
| Sec-Fetch-Mode | `no-cors` | Standard fetch mode for image requests. |
| Sec-Fetch-Site | `cross-site` | Indicates the request is cross-origin, which is the expected pattern for image downloads. |

Proposed fix:

final CircuitBreakerUrl urlGetter =
    CircuitBreakerUrl.builder().setMethod(Method.GET).setUrl(finalUrl)
        .setHeaders(ImmutableMap.<String, String>builder()
            .put("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
            .put("Accept", "image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8")
            .put("Accept-Language", "en-US,en;q=0.9")
            .put("Accept-Encoding", "gzip, deflate, br")
            .put("Connection", "keep-alive")
            .put("Sec-Fetch-Dest", "image")
            .put("Sec-Fetch-Mode", "no-cors")
            .put("Sec-Fetch-Site", "cross-site")
            .build())
        .setTimeout(timeoutSeconds * 1000L).build();

Affected endpoint: POST /api/v1/temp/byUrl

Acceptance Criteria

  • CircuitBreakerUrl request in createTempFileFromUrl includes all recommended browser-like headers (see table above)
  • Image URLs from common providers (Unsplash, Pexels, Pixabay, Cloudflare-protected CDNs) download successfully via /api/v1/temp/byUrl
  • The validUrl() method (line 263) also gets the same header treatment since it makes the same bare request
  • Evaluate whether appending sourceIp query parameter should be removed or made optional to avoid breaking signed URLs
  • Consider making headers configurable via Config.getStringProperty() for future flexibility
  • Existing tests still pass

Priority

Medium

Additional Context

  • File: dotCMS/src/main/java/com/dotcms/rest/api/v1/temp/TempFileAPI.java
  • CircuitBreakerUrl builder supports .setHeaders() — see usage example in dotCMS/src/main/java/com/dotcms/http/CircuitBreakerUrl.java:60-62
  • The validUrl() method (line 263) has the same issue — it also makes a bare GET with no headers
  • Reference: https://scrapfly.io/blog/posts/how-to-avoid-blocking-scraping-images
    </issue_description>

Comments on the Issue (you are @copilot in this section)


…age downloads

Co-authored-by: fmontes <751424+fmontes@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add browser-like headers to TempFileAPI.createTempFileFromUrl" to "Add browser-like headers to TempFileAPI URL requests to fix remote image downloads" Feb 23, 2026
Copilot AI requested a review from fmontes February 23, 2026 04:20
@fmontes fmontes requested a review from Copilot February 23, 2026 20:48
Co-authored-by: fmontes <751424+fmontes@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR updates TempFileAPI to send browser-like HTTP headers when validating and downloading remote URLs, improving compatibility with CDNs/WAF-protected image hosts that reject “bare” requests.

Changes:

  • Added a shared BROWSER_HEADERS header map (with partial configurability via Config).
  • Applied these headers to CircuitBreakerUrl requests in createTempFileFromUrl() and validUrl().

Comment on lines 83 to 96
static final Map<String, String> BROWSER_HEADERS = ImmutableMap.<String, String>builder()
        .put("User-Agent", Config.getStringProperty("TEMP_FILE_URL_USER_AGENT",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"))
        .put("Accept", Config.getStringProperty("TEMP_FILE_URL_ACCEPT",
                "image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8"))
        .put("Accept-Language", Config.getStringProperty("TEMP_FILE_URL_ACCEPT_LANGUAGE",
                "en-US,en;q=0.9"))
        .put("Accept-Encoding", Config.getStringProperty("TEMP_FILE_URL_ACCEPT_ENCODING",
                "gzip, deflate, br"))
        .put("Connection", "keep-alive")
        .put("Sec-Fetch-Dest", "image")
        .put("Sec-Fetch-Mode", "no-cors")
        .put("Sec-Fetch-Site", "cross-site")
        .build();

Copilot AI Feb 23, 2026

BROWSER_HEADERS is initialized once at class-load time, so the values read from Config.getStringProperty(...) won’t reflect later runtime config changes (system table updates / hot-reload). If these TEMP_FILE_URL_* settings are expected to be tunable without a restart, consider building the headers map on demand (e.g., a getBrowserHeaders() method) so each call re-reads Config.
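
The difference between the two approaches in this comment can be shown with a small sketch. It assumes System properties as a stand-in for dotCMS Config, and the class name HeaderReloadSketch is hypothetical; only the Accept-Language key and its default come from this PR.

```java
import java.util.Map;

// Hypothetical illustration: a static snapshot versus a value re-read on every call.
final class HeaderReloadSketch {

    private static final String KEY = "TEMP_FILE_URL_ACCEPT_LANGUAGE";
    private static final String DEFAULT = "en-US,en;q=0.9";

    // Evaluated once, when the class is initialized. Later config changes are not seen.
    static final Map<String, String> SNAPSHOT =
            Map.of("Accept-Language", System.getProperty(KEY, DEFAULT));

    // Evaluated on every call, so a changed setting takes effect without a restart.
    static Map<String, String> current() {
        return Map.of("Accept-Language", System.getProperty(KEY, DEFAULT));
    }
}
```

If the property is changed after the class has been initialized, SNAPSHOT keeps the old value while current() returns the new one. The trade-off is a small allocation per request in exchange for runtime tunability.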

Copilot uses AI. Check for mistakes.
Comment on lines 265 to 268
final CircuitBreakerUrl urlGetter =
    CircuitBreakerUrl.builder().setMethod(Method.GET).setUrl(finalUrl)
        .setHeaders(BROWSER_HEADERS)
        .setTimeout(timeoutSeconds * 1000L).build();

Copilot AI Feb 23, 2026

The new behavior (sending browser-like headers during URL validation/download) isn’t covered by tests here. Given there are existing integration tests for TempFile URL upload, consider adding a WireMock-based test that asserts the expected headers are present on the outbound request (and that the response body is written correctly).

        .put("Accept-Language", Config.getStringProperty("TEMP_FILE_URL_ACCEPT_LANGUAGE",
                "en-US,en;q=0.9"))
        .put("Accept-Encoding", Config.getStringProperty("TEMP_FILE_URL_ACCEPT_ENCODING",
                "gzip, deflate, br"))

Copilot AI Feb 23, 2026

Accept-Encoding default includes br, but CircuitBreakerUrl uses Apache HttpClient without any Brotli decoding support (no brotli dependency/interceptor in CircuitBreakerUrl.doOut). If a server responds with Content-Encoding: br (common for SVG/text), the downloaded temp file will contain compressed bytes rather than the original content. Consider removing br from the default (e.g., gzip, deflate), omitting Accept-Encoding entirely, or adding explicit Brotli decoding support before advertising it.

Suggested change
-                 "gzip, deflate, br"))
+                 "gzip, deflate"))
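
The failure mode behind this suggestion can be sketched with a small decoder: a client should advertise only the encodings it can undo. The class ContentEncodingSketch and its methods are hypothetical and are not part of CircuitBreakerUrl; the sketch uses java.util.zip, which covers gzip and deflate but has no Brotli decoder.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical decoder: undoes a response's Content-Encoding, or fails loudly if it cannot.
final class ContentEncodingSketch {

    static byte[] decode(String contentEncoding, byte[] body) {
        String encoding = contentEncoding == null
                ? "identity" : contentEncoding.trim().toLowerCase(Locale.ROOT);
        try {
            switch (encoding) {
                case "":
                case "identity":
                    return body;
                case "gzip":
                    try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
                        return in.readAllBytes();
                    }
                case "deflate":
                    // HTTP "deflate" is the zlib format, which InflaterInputStream reads by default.
                    try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(body))) {
                        return in.readAllBytes();
                    }
                default:
                    // "br" ends up here: writing the body as-is would store compressed bytes.
                    throw new IllegalArgumentException("No decoder for Content-Encoding: " + encoding);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper used only to produce sample compressed input.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

A gzip body round-trips through decode("gzip", ...) to the original bytes, while decode("br", ...) throws. A downloader that skipped this step would silently save the compressed bytes as the temp file.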

… br, add wire-level header test

Co-authored-by: fmontes <751424+fmontes@users.noreply.github.com>
@spbolton
Contributor

@fmontes — I had a gut feeling that this implementation could be problematic in the way it tries to spoof a browser identity, so I ran it by Claude with the full PR context. Here's what came back:

Concerns with the Browser-Spoofing Approach

1. Deceptive User-Agent — Legal/ToS Risk (Most Serious)

The hardcoded Chrome UA string (Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...) actively lies about the client's identity. This may violate the ToS of the very services we're trying to reach (Cloudflare, Unsplash, Pexels all explicitly prohibit automated scraping with spoofed UAs). It also exposes dotCMS operators to legal risk. A truthful but compatible UA like dotCMS/8.x (+https://dotcms.com) solves the practical problem without the deception.

2. The Sec-Fetch-* Headers Are Counterproductive

Sec-Fetch-Dest, Sec-Fetch-Mode, and Sec-Fetch-Site are browser-generated Fetch Metadata headers whose purpose is to let servers verify a request came from a real browser context. Sending them from a Java HTTP client:

  • Is immediately detectable as spoofed by any WAF doing real Fetch Metadata validation (no matching TLS fingerprint)
  • Sec-Fetch-Mode: no-cors is semantically incoherent for a server-side downloader
  • Sec-Fetch-Site: cross-site without a matching Origin/Referer is inconsistent and suspicious
  • Modern Cloudflare Bot Management checks JA3/JA4 TLS fingerprints alongside these headers — the mismatch still gets flagged

These headers add no practical benefit and increase detection risk with sophisticated WAFs.

3. Image-Only Accept Breaks Non-Image Downloads

TempFileAPI downloads arbitrary files — PDFs, videos, ZIP files, etc. The Accept: image/avif,image/webp,... header can cause non-image servers to return 406 Not Acceptable — the exact error this PR is trying to fix. Accept: */* is correct for a generic file downloader.

4. Non-Configurable Headers Can't Be Disabled

Connection, Sec-Fetch-Dest, Sec-Fetch-Mode, and Sec-Fetch-Site are hardcoded with no Config override path. If a downstream service rejects requests with invalid Fetch Metadata, operators have no way to disable those headers.

Suggested Minimal Fix

| Header | Value |
| --- | --- |
| User-Agent | `dotCMS/8.x (+https://dotcms.com)` |
| Accept | `*/*` |
| Accept-Language | `en-US,en;q=0.9` |
| Accept-Encoding | `gzip, deflate` |

Drop Sec-Fetch-* entirely. The real blocker for most CDNs is the missing/bot-like UA — an honest dotCMS UA resolves that without the ToS and detection risks.

@fmontes
Member

fmontes commented Feb 23, 2026

@spbolton all this is cool, but it's missing the important part... what options do I have? 😂

…configurable and disableable

Co-authored-by: fmontes <751424+fmontes@users.noreply.github.com>
Copilot AI changed the title from "Add browser-like headers to TempFileAPI URL requests to fix remote image downloads" to "Add browser-compatible headers to TempFileAPI URL downloads" Feb 23, 2026
Development

Successfully merging this pull request may close these issues.

Add browser-like headers to TempFileAPI.createTempFileFromUrl to fix remote image downloads

3 participants