crawl4ai version
0.7.4
Expected Behavior
When an HTML document contains a <base> tag, relative links should be resolved against the URL specified in the <base> tag's href attribute, as per HTML standards.
Current Behavior
Relative links are resolved against the base_url passed to generate_markdown (usually the page URL), ignoring the <base> tag present in the HTML content. This leads to incorrect URLs when the <base> tag modifies the base path for relative links.
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
1. Create an HTML string with a `<base>` tag pointing to a different directory/root than the page URL.
2. Add a relative link in the HTML.
3. Use `DefaultMarkdownGenerator` to convert the HTML to Markdown, passing the original page URL as `base_url`.
4. Observe that the link in the Markdown is incorrect (resolved against the page URL instead of the `<base>` tag's `href`).
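The difference between the two resolutions can be sketched offline with only the standard library, without crawl4ai. This is an illustration under assumptions, not crawl4ai's implementation: the helper `effective_base_url`, the class `BaseHrefFinder`, the shortened path `files/example.pdf`, and the `<base href="https://www.philippsburg.de/">` value are all hypothetical stand-ins chosen to mirror the affected page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefFinder(HTMLParser):
    """Records the href of the first <base> tag that has one (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # Only the first <base> element with an href counts; later ones are ignored.
        if tag == "base" and self.base_href is None:
            href = dict(attrs).get("href")
            if href:
                self.base_href = href


def effective_base_url(html, page_url):
    """Return the URL that relative links should be resolved against."""
    finder = BaseHrefFinder()
    finder.feed(html)
    if finder.base_href is None:
        return page_url
    # A <base href> may itself be relative, so resolve it against the page URL.
    return urljoin(page_url, finder.base_href)


page_url = "https://www.philippsburg.de/index.php/oeffentliche-bekanntmachungen.html"
# Assumed markup: the <base> value is a guess that mirrors the reported behavior.
html = (
    '<html><head><base href="https://www.philippsburg.de/"></head>'
    '<body><a href="files/example.pdf">PDF</a></body></html>'
)
relative_href = "files/example.pdf"

# Expected: resolve against the <base> tag.
resolved_with_base_tag = urljoin(effective_base_url(html, page_url), relative_href)
# Reported: resolve against the page URL only.
resolved_with_page_url = urljoin(page_url, relative_href)

print(resolved_with_base_tag)  # https://www.philippsburg.de/files/example.pdf
print(resolved_with_page_url)  # https://www.philippsburg.de/index.php/files/example.pdf
```

The second URL has the same extra `/index.php/` segment as the incorrect link reported in this issue. Note that if `<head>` is stripped before Markdown generation (for example by `css_selector`), the `<base>` value would need to be captured from the full document beforehand.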
Code snippets
import asyncio

from crawl4ai import AsyncWebCrawler


async def main():
    async with AsyncWebCrawler() as crawler:
        # Use css_selector="#main" to simulate extracting a specific element which might strip <head>
        result = await crawler.arun(
            url="https://www.philippsburg.de/index.php/oeffentliche-bekanntmachungen.html",
            css_selector="#main",  # also fails if this is not present
        )

        correct_url = "https://www.philippsburg.de/files/philippsburg/Oeffentliche%20Bekanntgaben/2025/Neufassung%20Hundesteuersatzung.pdf"
        incorrect_url = "https://www.philippsburg.de/index.php/files/philippsburg/Oeffentliche%20Bekanntgaben/2025/Neufassung%20Hundesteuersatzung.pdf"

        found_correct = correct_url in result.markdown
        found_incorrect = incorrect_url in result.markdown

        if found_correct:
            print("SUCCESS: Found correct URL.")
        if found_incorrect:
            print("ERROR: Found incorrect URL (relative link not resolved correctly).")


if __name__ == "__main__":
    asyncio.run(main())
OS
Windows
Python version
3.14.0
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response