[Bug]: the fit_html/cleaned_html/raw_html generator removes rowspan/colspan from tables #2007

@HW618

Description

crawl4ai version

v0.8.9

Expected Behavior

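I expected the rowspan and colspan attributes on table cells to be preserved in the generated output, since both are listed in keep_attrs.

Python version: 3.14.5
Here is my code: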

import asyncio

from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai import DefaultTableExtraction
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter

target_url = "https://en.wikipedia.org/wiki/List_of_prime_ministers_of_India"

# Browser configuration
browser_config = BrowserConfig(
    headless=True,
    user_agent_mode='random',
)

# Defined here but not passed to the markdown generator in this run
prune_filter = PruningContentFilter(
    threshold=0.8,
    threshold_type="dynamic",
)

# Crawler run configuration
run_config = CrawlerRunConfig(
    magic=True,
    markdown_generator=DefaultMarkdownGenerator(
        content_source="raw_html",
        options={
            'bypass_tables': True,
        },
    ),
    cache_mode=CacheMode.BYPASS,
    css_selector='table.wikitable',
    flatten_shadow_dom=True,
    keep_attrs=['rowspan', 'colspan'],
    table_extraction=DefaultTableExtraction(),
)


async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=target_url, config=run_config)
        print(result.markdown)
        print(result.markdown.fit_markdown)
        print(result.tables)
        # Save both markdown variants for inspection
        with open('test_raw.md', 'w') as f:
            f.write(result.markdown)
        with open('test_fit.md', 'w') as f:
            f.write(result.markdown.fit_markdown)


if __name__ == "__main__":
    asyncio.run(main())
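
For context on what these attributes carry, here is a minimal sketch of how rowspan and colspan determine the grid position of each cell. It uses only the standard library; `TableGridParser` and `expand_spans` are hypothetical names for illustration and are not part of crawl4ai. It assumes a single, well-formed table with explicit closing tags and no nested tables.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collects text, rowspan and colspan for each cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = {
                "text": "",
                "rowspan": int(a.get("rowspan") or 1),
                "colspan": int(a.get("colspan") or 1),
            }

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._cell["text"] = self._cell["text"].strip()
            self.rows[-1].append(self._cell)
            self._cell = None


def expand_spans(html):
    """Return a rectangular list of rows, repeating text across spanned cells."""
    parser = TableGridParser()
    parser.feed(html)
    parser.close()
    grid = {}  # (row, col) -> cell text
    for r, cells in enumerate(parser.rows):
        c = 0
        for cell in cells:
            # Skip columns already filled by a rowspan from an earlier row
            while (r, c) in grid:
                c += 1
            for dr in range(cell["rowspan"]):
                for dc in range(cell["colspan"]):
                    grid[(r + dr, c + dc)] = cell["text"]
            c += cell["colspan"]
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


sample = (
    '<table>'
    '<tr><th rowspan="2">No.</th><th colspan="2">Term</th></tr>'
    '<tr><td>Start</td><td>End</td></tr>'
    '<tr><td>1</td><td>1947</td><td>1964</td></tr>'
    '</table>'
)
for row in expand_spans(sample):
    print(row)
```

With the attributes stripped, the second row of the sample would contain only two cells and "Start" would land under "No.", which is the kind of misalignment that makes the missing attributes a problem for tables like the one on the target page.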

Current Behavior

I tried many times, but neither rowspan nor colspan appears anywhere in result.markdown.

test_raw.md
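
One way to narrow down where the attributes disappear is to count them in each HTML stage of the result. The following is a minimal sketch using only the standard library; `SpanAttrCounter` and `count_span_attrs` are hypothetical helpers, not part of crawl4ai, and the two sample strings stand in for real crawl output.

```python
from html.parser import HTMLParser


class SpanAttrCounter(HTMLParser):
    """Counts rowspan/colspan attributes on <td> and <th> cells."""

    def __init__(self):
        super().__init__()
        self.counts = {"rowspan": 0, "colspan": 0}

    def handle_starttag(self, tag, attrs):
        if tag not in ("td", "th"):
            return
        for name, _value in attrs:
            if name in self.counts:
                self.counts[name] += 1


def count_span_attrs(html):
    """Return how many table cells carry rowspan and colspan attributes."""
    parser = SpanAttrCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.counts)


# Stand-ins for HTML before and after processing (illustrative only)
sample_before = '<table><tr><th rowspan="2">No.</th><th colspan="2">Term</th></tr></table>'
sample_after = "<table><tr><th>No.</th><th>Term</th></tr></table>"

print(count_span_attrs(sample_before))
print(count_span_attrs(sample_after))
```

In the script above, the same helper could be applied to result.html and result.cleaned_html, assuming those fields are populated for this run, to see at which stage the counts drop to zero.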

Is this reproducible?

Yes

Inputs Causing the Bug

- url: "https://en.wikipedia.org/wiki/List_of_prime_ministers_of_India"
- browser_config:

# browser_config
browser_config = BrowserConfig(
    headless=True,
    user_agent_mode='random',
)

- crawler_config:

# CrawlerConfig
run_config = CrawlerRunConfig(
    magic=True,
    markdown_generator=DefaultMarkdownGenerator(
        content_source="raw_html",
        # content_filter=prune_filter,
        options={
            'bypass_tables': True,
        },
    ),
    cache_mode=CacheMode.BYPASS,
    css_selector='table.wikitable',
    flatten_shadow_dom=True,
    keep_attrs=['rowspan', 'colspan'],
    table_extraction=DefaultTableExtraction(),
)

Steps to Reproduce

Code snippets

OS

macOS

Python version

3.14.5

Browser

Chrome

Browser version

No response

Error logs & Screenshots (if applicable)

No response

Metadata

Assignees

No one assigned

Labels

🐞 Bug (Something isn't working), 🩺 Needs Triage (Needs attention of maintainers)
