fix(security): Add SSRF protection to LinkContentFetcher #10514

Closed
Mr-Neutr0n wants to merge 1 commit into deepset-ai:main from Mr-Neutr0n:security/ssrf-protection-link-content-fetcher

Conversation

@Mr-Neutr0n

Related Issue

Fixes #10513

Summary

This PR adds Server-Side Request Forgery (SSRF) protection to the LinkContentFetcher component by implementing URL validation that blocks requests to internal/private network resources.

Changes

New Features

  • Added is_safe_url() function in haystack/utils/url_validation.py to detect unsafe URLs
  • Added _is_private_ip() helper function to identify private/internal IP addresses
  • Added block_internal_urls parameter to LinkContentFetcher (default: True)

Blocked URL Types

The protection blocks requests to:

  • Private IP ranges: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
  • Loopback addresses: 127.0.0.1, localhost, ::1, plus the unspecified address 0.0.0.0
  • Link-local addresses: 169.254.0.0/16 (includes cloud metadata endpoints)
  • Reserved and multicast addresses
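
As a rough illustration of the categories listed above, here is a minimal sketch built on Python's standard `ipaddress` module. The function name `is_blocked_ip` is hypothetical and this is not the PR's actual `_is_private_ip()` implementation; it only handles IP literals, so hostnames such as localhost would need separate handling.

```python
import ipaddress


def is_blocked_ip(host: str) -> bool:
    """Return True if `host` is an IP literal in one of the blocked categories.

    Hypothetical helper for illustration only; not the code from this PR.
    Strings that are not IP literals (for example hostnames) return False.
    """
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal, e.g. "example.com"

    # Unwrap IPv4-mapped IPv6 addresses such as ::ffff:127.0.0.1 so the
    # embedded IPv4 address is the one that gets classified.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped

    return any(
        (
            ip.is_private,      # 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, ...
            ip.is_loopback,     # 127.0.0.0/8, ::1
            ip.is_link_local,   # 169.254.0.0/16, fe80::/10
            ip.is_reserved,
            ip.is_multicast,
            ip.is_unspecified,  # 0.0.0.0, ::
        )
    )
```

For example, `is_blocked_ip("169.254.169.254")` is true for the cloud metadata address, while `is_blocked_ip("8.8.8.8")` is false.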

DNS Rebinding Protection

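The implementation also resolves hostnames via DNS and validates the resulting IP addresses, so a public-looking hostname that points to an internal IP is blocked as well. This is intended to mitigate DNS rebinding attacks, where a hostname initially resolves to a safe IP but later resolves to an internal IP.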
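
The resolution step can be sketched roughly as follows, using only the standard library. The names `hostname_resolves_publicly` and `_is_internal` and the injectable `resolver` parameter are assumptions made for this example, not the PR's actual code.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def _is_internal(address: str) -> bool:
    """Classify a resolved address string (hypothetical helper)."""
    # Drop an IPv6 scope id such as "%eth0" before parsing.
    ip = ipaddress.ip_address(address.split("%", 1)[0])
    return any(
        (ip.is_private, ip.is_loopback, ip.is_link_local,
         ip.is_reserved, ip.is_multicast, ip.is_unspecified)
    )


def hostname_resolves_publicly(url: str, resolver=socket.getaddrinfo) -> bool:
    """Return True only if every address the URL's host resolves to is public.

    `resolver` defaults to socket.getaddrinfo and is injectable so the check
    can be exercised without network access.
    """
    host = urlsplit(url).hostname
    if not host:
        return False  # no hostname to check, treat as unsafe
    try:
        infos = resolver(host, None)
    except (socket.gaierror, UnicodeError):
        return False  # resolution failed, treat as unsafe
    # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
    # the address string is the first element of sockaddr.
    addresses = {info[4][0] for info in infos}
    return bool(addresses) and not any(_is_internal(a) for a in addresses)
```

A check like this only inspects the addresses returned at validation time. If the HTTP client performs its own lookup afterwards, it may receive a different address, so a complete defense would also need to pin the validated IP for the actual request.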

Usage

from haystack.components.fetchers import LinkContentFetcher

# SSRF protection enabled by default
fetcher = LinkContentFetcher()
# This will be blocked:
# fetcher.run(urls=["http://169.254.169.254/latest/meta-data/"])

# Opt-out for trusted internal use cases
fetcher = LinkContentFetcher(block_internal_urls=False)
# Now internal URLs are allowed

Test Plan

  • Added unit tests for _is_private_ip() function
  • Added unit tests for is_safe_url() function
  • Added unit tests for LinkContentFetcher SSRF protection (sync)
  • Added unit tests for LinkContentFetcher SSRF protection (async)
  • Tests cover all private IP ranges, localhost variants, and cloud metadata endpoints
  • Tests verify the opt-out mechanism works correctly

Breaking Changes

The API remains backwards compatible, but the default behavior changes: existing code that relies on fetching content from internal URLs will be blocked unless it opts out. Users who need to access internal resources can set block_internal_urls=False.

Security Impact

  • Prevents attackers from using Haystack pipelines to access internal network resources
  • Blocks access to cloud metadata services (credential theft prevention)
  • Prevents access to localhost services
  • Protection is enabled by default but can be disabled for trusted use cases

🤖 Generated with Claude Code

This commit addresses a Server-Side Request Forgery (SSRF) vulnerability
in the LinkContentFetcher component by implementing URL validation that
blocks requests to internal/private network resources.

Changes:
- Add `is_safe_url()` function in `url_validation.py` to detect unsafe URLs
- Add `_is_private_ip()` helper to identify private/internal IP addresses
- Add `block_internal_urls` parameter to LinkContentFetcher (default=True)
- Block requests to:
  - Private IP ranges (10.x, 172.16-31.x, 192.168.x)
  - Loopback addresses (127.0.0.1, localhost, ::1)
  - Link-local addresses (169.254.x.x) including cloud metadata endpoints
  - Reserved and multicast addresses
- Perform DNS resolution to prevent DNS rebinding attacks
- Add comprehensive test coverage for SSRF protection

Security Impact:
- Prevents attackers from using Haystack pipelines to access internal
  network resources, cloud metadata services, or localhost services
- Protection is enabled by default but can be disabled via
  `block_internal_urls=False` for trusted internal use cases

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Mr-Neutr0n requested a review from a team as a code owner on February 5, 2026 10:33
Mr-Neutr0n requested review from bogdankostic and removed the request for a team on February 5, 2026 10:33
@vercel

vercel bot commented Feb 5, 2026

@Mr-Neutr0n is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

@CLAassistant

CLAassistant commented Feb 5, 2026

CLA assistant check
All committers have signed the CLA.

github-actions bot added the topic:tests and type:documentation labels on Feb 5, 2026
@Mr-Neutr0n
Author

Friendly follow-up - is there anything I can improve in this PR? Happy to address any feedback. Thanks!

@julian-risch
Member

@Mr-Neutr0n Thank you for suggesting URL validation for the LinkContentFetcher component and for opening this pull request.
After internal discussions, we decided to keep the behavior of the LinkContentFetcher as is for now. We plan to extend the documentation to better explain the risks of passing user inputs to this component and how application developers can validate inputs prior to forwarding the inputs to the LinkContentFetcher. Haystack expects the application to handle any input validation/sanitization and detect any user-defined inputs with malicious intent before sending inputs to the framework.
As there are valid use cases where the LinkContentFetcher needs access to internal IP addresses, the suggested changes with the default setting of block_internal_urls=True would be a breaking change.

@Mr-Neutr0n
Author

Makes sense, thanks for taking the time to discuss it internally and getting back to me. I can see how defaulting to blocking internal URLs would be a breaking change for folks with valid internal network use cases.

Documenting the risks and recommending input validation at the application layer sounds like a reasonable approach. Closing this one out — cheers!


Labels

topic:tests, type:documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Extend documentation of LinkContentFetcher and explain risks of passing user-defined inputs

3 participants