fix: prevent PDF binary content from being included in scrape output#42

Open
dan-and wants to merge 1 commit into devflowinc:main from dan-and:pdf_detection

dan-and commented Sep 27, 2025

Add PDF detection to skip processing PDF files in fetch and playwright scrapers. This prevents raw PDF binary data from being dumped into HTML/markdown fields.

Fixes #28
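
For readers unfamiliar with the approach, here is a rough sketch of what PDF detection before content extraction might look like. It is not the code from this PR: the function names (`looks_like_pdf`, `build_scrape_output`), the output field names, and the choice of three signals (Content-Type header, `%PDF-` magic bytes, `.pdf` URL extension) are all assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import urlparse

# Every PDF file starts with this marker (e.g. b"%PDF-1.7").
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(url: str,
                   content_type: Optional[str] = None,
                   body_prefix: bytes = b"") -> bool:
    """Heuristic check (hypothetical helper): does this response look like a PDF?"""
    # Signal 1: the Content-Type header, ignoring parameters such as charset.
    if content_type:
        mime = content_type.split(";")[0].strip().lower()
        if mime == "application/pdf":
            return True
    # Signal 2: the magic bytes at the start of the body.
    if body_prefix[:1024].lstrip().startswith(PDF_MAGIC):
        return True
    # Signal 3: the URL path ends in .pdf (query string is not part of the path).
    return urlparse(url).path.lower().endswith(".pdf")


def build_scrape_output(url: str, content_type: Optional[str], body: bytes) -> dict:
    """Hypothetical output builder: leave text fields empty for PDFs."""
    if looks_like_pdf(url, content_type, body[:1024]):
        # Skip processing so raw binary never reaches the html/markdown fields.
        return {"url": url, "html": "", "markdown": "", "skipped": "pdf"}
    html = body.decode("utf-8", errors="replace")
    return {"url": url, "html": html, "markdown": "", "skipped": None}
```

Checking the header first is cheap, while the magic-byte check catches servers that send PDFs as `application/octet-stream`. The URL extension is the weakest signal and is only used as a fallback.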

Development

Successfully merging this pull request may close these issues.

[Bug] PDF Content Incorrectly Dumped into HTML/Markdown Fields During Web Craw