Skip to content

Conversation

@fcogidi
Copy link
Collaborator

@fcogidi fcogidi commented Jan 19, 2026

PR Type

Feature

Short Description

Introduce a new script that utilizes OCR to parse PDF files and save the extracted text chunks into a HuggingFace dataset format.

Tests Added

No tests added in this update.

@fcogidi fcogidi requested a review from Copilot January 19, 2026 23:23
@fcogidi fcogidi self-assigned this Jan 19, 2026
@fcogidi fcogidi added the enhancement New feature or request label Jan 19, 2026
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new script that processes PDF files using OCR and converts them into a HuggingFace dataset format. The script renders PDF pages as images, sends them to an OpenAI-compatible multimodal API for text extraction, and chunks the resulting text for RAG applications.

Changes:

  • Added pdf_to_hf_dataset.py script with comprehensive PDF-to-dataset conversion functionality
  • Added pymupdf dependency to support PDF rendering

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 2 comments.

File Description
src/utils/data/pdf_to_hf_dataset.py New script implementing PDF OCR, text chunking, and HuggingFace dataset creation with configurable parameters
pyproject.toml Added pymupdf dependency for PDF processing

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@fcogidi fcogidi merged commit add6257 into main Jan 19, 2026
4 checks passed
@fcogidi fcogidi deleted the fco/pdf_to_hf branch January 19, 2026 23:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants