PDF Processing Toolkit

A modular command-line toolkit for processing PDF documents. Run main.py to access all features through a central menu.

Project Structure

project/
├── main.py                  ← entry point and main menu
└── modules/
    ├── __init__.py          ← marks modules/ as a Python package
    ├── pdf_scanner.py       ← scans a drive and copies matching PDFs
    ├── brand_reader.py      ← extracts brand name fields into Excel
    └── batch_print.py       ← batch prints PDFs to a physical printer

Requirements

Python

Python 3.10 or higher — required for the str | None type hint syntax used throughout the modules.

To check your version:

python --version

Python Dependencies

Install all required packages in one command:

pip install pypdf pdfplumber openpyxl pywin32

| Package | Required by | Purpose |
| --- | --- | --- |
| pypdf | pdf_scanner, brand_reader | Primary PDF text extraction |
| pdfplumber | pdf_scanner, brand_reader | Fallback extraction for complex layouts |
| openpyxl | brand_reader | Writing formatted Excel reports |
| pywin32 | batch_print | Windows printer spooler access |

Optional — OCR Support

Only required if your PDFs are scanned images rather than digitally created documents. The toolkit functions without OCR — it simply skips the OCR step.

pip install pytesseract pdf2image

You must also install the following external binaries:

Tesseract OCR
Download: https://github.com/UB-Mannheim/tesseract/wiki
Default path expected: C:\Program Files\Tesseract-OCR\tesseract.exe
Update TESSERACT_PATH in any module that uses OCR if installed elsewhere.

Poppler
Download: https://github.com/oschwartz10612/poppler-windows/releases
Default path expected: C:\poppler-26.02.0\Library\bin
Update POPPLER_PATH in any module that uses OCR if installed elsewhere.

Installation

Clone or download this repository into a local folder.

Install Python dependencies:

pip install pypdf pdfplumber openpyxl pywin32

If OCR is needed:
```
pip install pytesseract pdf2image
```

Then install the Tesseract and Poppler binaries (see links above) and update the path constants at the top of pdf_scanner.py and brand_reader.py.

For batch printing, install SumatraPDF:
Download: https://www.sumatrapdfreader.org/download-free-pdf-viewer
Update SUMATRA_PATH in batch_print.py to match your installation path.

Run the toolkit:
```
python main.py
```

How to Run

From inside the project/ folder:

python main.py

You will be presented with the following menu:

=======================================================
  PDF Processing Toolkit
=======================================================
  1. Scan drive and copy matching PDFs       (pdf_scanner)
  2. Extract brand names to Excel            (brand_reader)
  3. Batch print PDFs to printer             (batch_print)
  0. Exit
=======================================================

Select an option by typing the number and pressing Enter. After each operation completes, press Enter to return to the menu.

To pre-configure paths so you are not prompted at runtime, set SEARCH_ROOT and DEST_FOLDER at the top of pdf_scanner.py, or FOLDER_PATH in brand_reader.py. Leave them as empty strings "" to be prompted each time.
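
The "empty string means prompt" convention can be sketched as follows. This is only an illustration: the helper name `resolve_path` and its `ask` parameter are hypothetical and are not taken from the toolkit's source.

```python
from typing import Callable


def resolve_path(configured: str, label: str, ask: Callable[[str], str] = input) -> str:
    """Return the pre-configured path, or prompt the user when it is empty."""
    if configured.strip():
        return configured
    # Hypothetical prompt; also drops the quotes Windows adds with "Copy as path".
    return ask(f"{label}: ").strip().strip('"')
```

With `SEARCH_ROOT = ""` the call `resolve_path(SEARCH_ROOT, "Search root")` would prompt; with a non-empty value it returns that value unchanged.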

Modules

1. PDF Scanner (`modules/pdf_scanner.py`)

Recursively walks a drive or folder, identifies PDFs that match a configurable keyword and regex combination, and copies them to a destination folder.

Matching Logic

A PDF qualifies only when both of the following conditions are true:

At least one keyword from KEYWORDS is found anywhere in the extracted text (case-insensitive), AND
At least MATCH_THRESHOLD of the MATCHERS regex patterns also match

Pages are read one at a time and scanning stops the moment both conditions are satisfied — the remaining pages are never read.
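
A minimal sketch of this two-condition check with early exit. The shortened `KEYWORDS` and `MATCHERS` lists and the function name `pdf_qualifies` are illustrative assumptions; the actual expressions in pdf_scanner.py may differ, and the sketch assumes hits accumulate across pages.

```python
import re
from typing import Iterable

# Illustrative values only; the real lists live in pdf_scanner.py.
KEYWORDS = ["Certificate of Product Registration"]
MATCHERS = [
    re.compile(r"Brand\s+Name\s*:", re.IGNORECASE),
    re.compile(r"Registration\s+Number\s*:", re.IGNORECASE),
    re.compile(r"Manufacturer\s*:", re.IGNORECASE),
]
MATCH_THRESHOLD = 2


def pdf_qualifies(pages: Iterable[str]) -> bool:
    """Consume page texts lazily and stop as soon as both conditions hold."""
    keyword_found = False
    matched: set[int] = set()  # indexes of MATCHERS that have hit so far
    for text in pages:
        lowered = text.lower()
        if not keyword_found:
            keyword_found = any(k.lower() in lowered for k in KEYWORDS)
        for i, pattern in enumerate(MATCHERS):
            if i not in matched and pattern.search(text):
                matched.add(i)
        if keyword_found and len(matched) >= MATCH_THRESHOLD:
            return True  # later pages are never pulled from the iterator
    return False
```

Passing a generator that extracts one page per iteration keeps the early exit meaningful, because unread pages are never extracted.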

Default Keywords (OR logic — any one qualifies)

Certificate of Good Manufacturing Practice
Certificate of Product Registration
Certificate of Listing of Identical Drug Product

Default Regex Patterns (6 total, threshold: 2)

Brand Name:
Registration Number:
FDA Registration No.:
Valid Until <date>
Manufacturer:
Importer / Distributor:

To add keywords, append to the KEYWORDS list in pdf_scanner.py. To make matching stricter, raise MATCH_THRESHOLD.

Text Extraction Fallback Chain

pypdf  →  pdfplumber  →  OCR (Tesseract)

pypdf — fastest, works on standard digitally created PDFs
pdfplumber — slower, handles complex layouts, tables, and multi-column text
OCR — slowest, used only when extracted text is below TEXT_THRESHOLD characters

If a method returns sufficient text and the match conditions are met, the remaining methods are never attempted.
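
The chain can be sketched generically, with each extractor passed in as a callable. The function name `extract_with_fallback` is hypothetical, and the sketch covers only the text-length test, not the match conditions that the scanner also checks.

```python
from typing import Callable, Sequence

TEXT_THRESHOLD = 50  # mirrors the documented default


def extract_with_fallback(
    path: str,
    extractors: Sequence[Callable[[str], str]],
    threshold: int = TEXT_THRESHOLD,
) -> str:
    """Try each extractor in order; stop at the first that yields enough text."""
    best = ""
    for extract in extractors:
        try:
            text = extract(path) or ""
        except Exception:
            # A failing extractor should not abort the chain.
            continue
        if len(text.strip()) >= threshold:
            return text
        if len(text) > len(best):
            best = text  # keep the longest partial result as a last resort
    return best
```

In this shape the order of the `extractors` sequence encodes the pypdf, pdfplumber, OCR priority, and leaving the OCR callable out is enough to run without OCR.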

Drive Walk Behavior

Walks all subdirectories recursively regardless of nesting depth
Walk and processing run concurrently — the first file starts processing while the walker is still discovering new directories
The destination folder is automatically excluded from the walk to prevent re-processing already-copied files
Current directory being scanned is displayed on a single overwriting console line

Folders always skipped:

.* (any hidden folder)    $* (system folders)    ~* (temp folders)
Windows                   Program Files          Program Files (x86)
ProgramData               System Volume Information    winnt
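
One way to express the skip rules and the destination exclusion is to prune `os.walk` in place. This is a sketch under assumptions: `should_skip` and `iter_pdfs` are hypothetical names, and the concurrent walk-and-process behavior is not shown.

```python
import os
from typing import Iterator

# Lower-cased names from the list above.
SKIP_NAMES = {
    "windows", "program files", "program files (x86)",
    "programdata", "system volume information", "winnt",
}
SKIP_PREFIXES = (".", "$", "~")


def should_skip(dirname: str) -> bool:
    """True for hidden, system, temp, and well-known OS folders."""
    return dirname.startswith(SKIP_PREFIXES) or dirname.lower() in SKIP_NAMES


def iter_pdfs(root: str, dest: str) -> Iterator[str]:
    """Yield PDF paths under root, never descending into skipped folders or dest."""
    dest_abs = os.path.normcase(os.path.abspath(dest))
    for current, dirs, files in os.walk(root):
        # Pruning dirs in place stops os.walk from descending into them.
        dirs[:] = [
            d for d in dirs
            if not should_skip(d)
            and os.path.normcase(os.path.abspath(os.path.join(current, d))) != dest_abs
        ]
        for name in files:
            if name.lower().endswith(".pdf"):
                yield os.path.join(current, name)
```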

Duplicate Handling

Files are SHA-256 hashed before copying:

Files under 512 KB: full file hashed
Files over 512 KB: first + last 256 KB hashed (for speed)

If a matching file with identical content has already been copied in the current run (regardless of filename or location), it is skipped and logged under DUPLICATES in the output log.
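
The hashing rule above could look roughly like this. The function name `content_fingerprint` and the constants are assumptions for illustration. A file of exactly 512 KB yields the same digest on either branch, because its first and last 256 KB together are the whole file.

```python
import hashlib
import os

CHUNK = 256 * 1024           # 256 KB head and tail windows
FULL_HASH_LIMIT = 2 * CHUNK  # 512 KB


def content_fingerprint(path: str) -> str:
    """SHA-256 of the whole file if small, else of its first and last 256 KB."""
    size = os.path.getsize(path)
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        if size <= FULL_HASH_LIMIT:
            digest.update(fh.read())
        else:
            digest.update(fh.read(CHUNK))
            fh.seek(-CHUNK, os.SEEK_END)  # jump to the final 256 KB
            digest.update(fh.read(CHUNK))
    return digest.hexdigest()
```

A consequence of partial hashing is that two large files differing only in the middle produce the same fingerprint, which is the trade-off accepted for speed.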

Configuration

All settings are at the top of pdf_scanner.py:

| Setting | Default | Description |
| --- | --- | --- |
| `SEARCH_ROOT` | `""` | Drive or folder to scan. Empty = prompted at runtime |
| `DEST_FOLDER` | `""` | Destination for copied files. Empty = prompted |
| `MAX_WORKERS` | 4 | Parallel worker processes (bypasses GIL) |
| `MAX_PAGES` | 5 | Maximum pages to scan per PDF |
| `TEXT_THRESHOLD` | 50 | Minimum characters before trying next extractor |
| `FILE_TIMEOUT` | 30 | Seconds before abandoning a single file |
| `MIN_FILE_SIZE` | 1024 | Skip files smaller than this in bytes |
| `MAX_FILE_SIZE` | 0 | Skip files larger than this in bytes (0 = no limit) |
| `MOVE_FILES` | False | True = move files, False = copy files |
| `SKIP_DUPLICATES` | True | Skip files with identical content |
| `SKIP_HIDDEN` | True | Skip hidden and system folders |
| `MATCH_THRESHOLD` | 2 | Minimum regex pattern hits required |
| `OCR_DPI` | 150 | DPI for OCR image rendering |
| `TESSERACT_PATH` | — | Full path to `tesseract.exe` |
| `POPPLER_PATH` | — | Full path to Poppler `bin` folder |

Important: MOVE_FILES defaults to False. Always verify results in copy mode before switching to True. Moving files is irreversible.

Output

A scan_results.txt log is written to the destination folder containing:

Full source → destination path for every copied file
All skipped files (no match)
All duplicate files (same content, skipped)
All errors with error messages
Summary counts at the bottom

2. Brand Reader (`modules/brand_reader.py`)

Scans all PDFs in a single folder and extracts structured regulatory fields into a formatted Excel report.

Fields Extracted

| Field | Pattern matched |
| --- | --- |
| Brand Name | `Brand Name:` |
| Registration No. | `Registration Number:` / `FDA Registration No.:` |
| Valid Until | `valid until <date>` |
| Manufacturer | `Manufacturer:` / `Manufacturer Name and Address:` |
| Trader | `Trader:` |
| Importer | `Importer:` / `Importer / Distributor:` |
| Distributor | `Distributor:` |
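
As a rough illustration of label-based extraction for three of these fields, the sketch below captures whatever follows each label up to the end of the line. The regular expressions and the name `extract_fields` are hypothetical; the patterns in brand_reader.py may be written differently.

```python
import re

# Hypothetical patterns; the expressions in brand_reader.py may differ.
FIELD_PATTERNS = {
    "Brand Name": re.compile(r"Brand\s+Name\s*:\s*(.+)", re.IGNORECASE),
    "Registration No.": re.compile(
        r"(?:FDA\s+Registration\s+No\.|Registration\s+Number)\s*:\s*(.+)",
        re.IGNORECASE,
    ),
    "Manufacturer": re.compile(
        r"Manufacturer(?:\s+Name\s+and\s+Address)?\s*:\s*(.+)", re.IGNORECASE
    ),
}


def extract_fields(text: str) -> dict[str, str]:
    """Return the first captured value for each field, or "" when absent."""
    result = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1).strip() if match else ""
    return result
```

Because `.` does not match a newline by default, each capture stops at the end of the label's line; values that wrap onto a second line would need a more permissive pattern.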

Text Extraction Fallback Chain

Same as pdf_scanner: pypdf → pdfplumber → OCR (Tesseract).

Configuration

All settings are at the top of brand_reader.py:

| Setting | Default | Description |
| --- | --- | --- |
| `MAX_WORKERS` | 4 | Parallel worker threads |
| `MAX_PAGES` | 5 | Maximum pages to scan per PDF |
| `OCR_DPI` | 300 | DPI for OCR rendering |
| `TEXT_THRESHOLD` | 50 | Minimum characters before trying next extractor |
| `TESSERACT_PATH` | — | Full path to `tesseract.exe` |
| `POPPLER_PATH` | — | Full path to Poppler `bin` folder |

Output

A brand_results.xlsx file is written to the scanned folder. If the file already exists it is saved as brand_results(1).xlsx, brand_results(2).xlsx, and so on — existing files are never overwritten.
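
The non-overwriting naming scheme can be sketched as a small loop over candidate names. The helper name `next_free_path` is an assumption, not the function used in brand_reader.py.

```python
import os


def next_free_path(folder: str, stem: str = "brand_results", ext: str = ".xlsx") -> str:
    """Return stem.ext, or stem(1).ext, stem(2).ext, ... whichever is first unused."""
    candidate = os.path.join(folder, f"{stem}{ext}")
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(folder, f"{stem}({counter}){ext}")
        counter += 1
    return candidate
```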

The Excel report is divided into three labeled sections:

| Section | Contents |
| --- | --- |
| ✔ FOUND | Files where Brand Name was successfully extracted |
| ✘ NOT FOUND | Files processed but Brand Name was not found |
| ⚠ ERRORS | Files that could not be read or caused exceptions |

A summary row at the bottom shows total counts for each section.

3. Batch Print (`modules/batch_print.py`)

Sends all PDFs in a folder to a physical printer in natural sort order with a live dashboard showing real-time print queue state.

Windows only. This module requires pywin32 and the Windows print spooler. It will not run on macOS or Linux.

How It Works

PDFs are sorted in natural order (cert2.pdf before cert10.pdf)
Before sending each file, the module checks the spooler — if MAX_ACTIVE_JOBS is already in the queue, it waits
Each file is sent via SumatraPDF in silent mode
The spooler job ID is captured by comparing job lists before and after sending
Completed jobs are detected when their ID disappears from the active spooler
After the last file is sent, a drain loop waits up to DRAIN_TIMEOUT seconds for all remaining jobs to clear
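
Three of the steps above (natural sorting, job ID capture, and completion detection) reduce to small pure functions. The sketch below shows one possible shape; the function names are hypothetical and the spooler queries themselves (via pywin32) are left out, with job IDs represented as plain sets of integers.

```python
import re


def natural_key(name: str) -> list:
    """Sort key that orders 'cert2.pdf' before 'cert10.pdf'."""
    # re.split with a capturing group alternates text, digits, text, ...
    # so odd positions always hold digit runs.
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def new_job_ids(before: set[int], after: set[int]) -> set[int]:
    """Job IDs present after sending a file that were absent beforehand."""
    return after - before


def finished_jobs(tracked: set[int], active: set[int]) -> set[int]:
    """Tracked jobs that no longer appear in the active spooler queue."""
    return tracked - active
```

For example, `sorted(filenames, key=natural_key)` gives the print order, and `new_job_ids` applied to the queue snapshots taken around a send identifies the job that send created, provided nothing else printed in between.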

Configuration

All settings are at the top of batch_print.py:

| Setting | Default | Description |
| --- | --- | --- |
| `PRINTER_NAME` | FUJI XEROX DocuPrint M455 df | Exact printer name as shown in Windows |
| `SUMATRA_PATH` | — | Full path to `SumatraPDF.exe` |
| `MAX_ACTIVE_JOBS` | 2 | Maximum concurrent spooler jobs before waiting |

To find your exact printer name: open Control Panel → Devices and Printers and copy the name exactly as displayed, including spacing and capitalization.

Output

A print_history.txt log is written to the PDF source folder on completion, listing all printed files in order. Files that failed to send are tagged with [FAILED].

Adding a New Module

Create modules/your_module.py with a run() function:

def run(folder_path: str) -> None:
    # your logic here
    ...

Import it in main.py:

import modules.your_module as your_module

Add a menu label to the MENU list in main.py:

MENU = [
    ...
    "Your feature description     (your_module)",
]

Add a launcher function in main.py:

def launch_your_module():
    print("\n── Your Module ──────────────────────────────────────")
    folder = prompt_path("Enter folder path", must_exist=True)
    your_module.run(folder)

Append the launcher to the LAUNCHERS list:

LAUNCHERS = [
    ...
    launch_your_module,
]

Menu numbering updates automatically. MENU and LAUNCHERS must always have the same number of entries and be in the same order.
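
To illustrate why the two lists must stay aligned, here is a hedged sketch of how numbering and dispatch could be derived from list position. The helpers `render_menu` and `pick_launcher` are hypothetical and are not taken from main.py.

```python
from typing import Callable, Optional, Sequence


def render_menu(menu: Sequence[str]) -> str:
    """Number entries from 1 by position and append the fixed Exit entry."""
    lines = [f"  {i}. {label}" for i, label in enumerate(menu, start=1)]
    lines.append("  0. Exit")
    return "\n".join(lines)


def pick_launcher(
    choice: str,
    menu: Sequence[str],
    launchers: Sequence[Callable[[], None]],
) -> Optional[Callable[[], None]]:
    """Return the launcher for a typed choice, or None for exit/invalid input."""
    if len(menu) != len(launchers):
        raise ValueError(
            f"MENU has {len(menu)} entries but LAUNCHERS has {len(launchers)}"
        )
    try:
        index = int(choice)
    except ValueError:
        return None
    if 1 <= index <= len(launchers):
        return launchers[index - 1]  # same position as the menu label
    return None
```

A length check like the one in `pick_launcher` turns a forgotten LAUNCHERS entry into an immediate error instead of a menu item that launches the wrong module.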

Troubleshooting

ImportError: cannot import name 'pdf_scanner' from 'modules'

Use import modules.pdf_scanner as pdf_scanner instead of from modules import pdf_scanner. The __init__.py is intentionally empty, so modules must be imported by full path.

ModuleNotFoundError: No module named 'win32print'

Run pip install pywin32. This is required for batch_print only.

ModuleNotFoundError: No module named 'pytesseract'

OCR is optional. If it is not installed, the toolkit falls back to text-only extraction. Install with pip install pytesseract pdf2image only if your PDFs are scanned images.

PDF scanner finds no matches

Confirm your PDFs contain one of the three certificate title keywords
Lower MATCH_THRESHOLD to 1 in pdf_scanner.py temporarily to test keyword-only matching
If PDFs are scanned images, ensure OCR is installed and raise OCR_DPI in pdf_scanner.py from its default of 150 to at least 200

Brand reader returns empty fields

The field labels in the PDF must match the regex patterns (e.g. Brand Name:, Manufacturer:)
Check if the PDF is image-based — if so, OCR must be installed
Confirm OCR_DPI in brand_reader.py is still at least 300 (the default); lower values reduce accuracy on low-quality scans

Batch print sends jobs but dashboard shows no completion

Verify PRINTER_NAME matches exactly what appears in Control Panel → Devices and Printers
Some printer drivers clear jobs from the spooler immediately after accepting them — the drain loop may time out harmlessly; check the physical printer for output

Printer spooler is unreachable

safe_get_jobs() retries 3 times with a 5-second delay between attempts before returning an empty job list. If the printer is consistently unreachable, check that the print spooler service is running: open services.msc, find Print Spooler, and confirm its status is Running (shown as Started on older Windows versions).
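
The retry behavior described for safe_get_jobs() can be approximated as below. This is not the module's code: `get_jobs_with_retry` and its parameters are hypothetical, and the spooler query is passed in as a `fetch` callable so the sketch stays independent of pywin32.

```python
import time
from typing import Callable


def get_jobs_with_retry(
    fetch: Callable[[], list],
    attempts: int = 3,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Call fetch up to `attempts` times; return [] if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt < attempts:
                sleep(delay)  # pause between attempts, not after the last one
    return []
```

Returning an empty list on total failure keeps the caller's loop simple, at the cost of making "spooler unreachable" look the same as "queue empty".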

Notes

MOVE_FILES = False by default in pdf_scanner.py. Always verify results in copy mode first.
batch_print.py is Windows-only. It will not run on macOS or Linux.
OCR is optional across all modules. Missing pytesseract/pdf2image prints a warning but does not prevent the toolkit from running.
All state in batch_print is scoped to each run() call — running batch print twice in one session starts completely clean.
The destination folder in pdf_scanner is automatically excluded from the walk even when it is inside the search root, preventing an infinite copy loop.
