Add script to parse PDFs to huggingface dataset #55

fcogidi · 2026-01-19T23:22:51Z

PR Type

Feature

Short Description

Introduce a new script that utilizes OCR to parse PDF files and save the extracted text chunks into a HuggingFace dataset format.

Tests Added

No tests added in this update.

…dataset

Copilot

Pull request overview

This PR introduces a new script that processes PDF files using OCR and converts them into a HuggingFace dataset format. The script renders PDF pages as images, sends them to an OpenAI-compatible multimodal API for text extraction, and chunks the resulting text for RAG applications.

Changes:

Added pdf_to_hf_dataset.py script with comprehensive PDF-to-dataset conversion functionality
Added pymupdf dependency to support PDF rendering

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 2 comments.

File	Description
src/utils/data/pdf_to_hf_dataset.py	New script implementing PDF OCR, text chunking, and HuggingFace dataset creation with configurable parameters
pyproject.toml	Added pymupdf dependency for PDF processing

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/utils/data/pdf_to_hf_dataset.py

… script

Add script to parse PDF with OCR and save text chunks to HuggingFace …

971dde7

…dataset

fcogidi requested a review from Copilot January 19, 2026 23:23

fcogidi self-assigned this Jan 19, 2026

fcogidi added the enhancement New feature or request label Jan 19, 2026

Copilot AI reviewed Jan 19, 2026

View reviewed changes

src/utils/data/pdf_to_hf_dataset.py Show resolved Hide resolved

src/utils/data/pdf_to_hf_dataset.py Show resolved Hide resolved

Remove redundant option alias for model in PDF to HuggingFace dataset…

ca15e8b

… script

fcogidi merged commit add6257 into main Jan 19, 2026
4 checks passed

fcogidi deleted the fco/pdf_to_hf branch January 19, 2026 23:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add script to parse PDFs to huggingface dataset #55

Add script to parse PDFs to huggingface dataset #55

Uh oh!

fcogidi commented Jan 19, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Add script to parse PDFs to huggingface dataset #55

Add script to parse PDFs to huggingface dataset #55

Uh oh!

Conversation

fcogidi commented Jan 19, 2026

PR Type

Short Description

Tests Added

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants