A modular command-line toolkit for processing PDF documents. Run main.py to access all features through a central menu.
- Project Structure
- Requirements
- Installation
- How to Run
- Modules
- Adding a New Module
- Troubleshooting
- Notes
project/
├── main.py ← entry point and main menu
└── modules/
    ├── __init__.py      ← marks modules/ as a Python package
    ├── pdf_scanner.py   ← scans a drive and copies matching PDFs
    ├── brand_reader.py  ← extracts brand name fields into Excel
    └── batch_print.py   ← batch prints PDFs to a physical printer
Python 3.10 or higher — required for the str | None type hint syntax used throughout the modules.
To check your version:
python --version
Install all required packages in one command:
pip install pypdf pdfplumber openpyxl pywin32
| Package | Required by | Purpose |
|---|---|---|
| pypdf | pdf_scanner, brand_reader | Primary PDF text extraction |
| pdfplumber | pdf_scanner, brand_reader | Fallback extraction for complex layouts |
| openpyxl | brand_reader | Writing formatted Excel reports |
| pywin32 | batch_print | Windows printer spooler access |
OCR support is optional. It is only needed if your PDFs are scanned images rather than digitally created documents. Without it the toolkit still runs and simply skips the OCR step.
pip install pytesseract pdf2image
You must also install the following external binaries:
Tesseract OCR
Download: https://github.com/UB-Mannheim/tesseract/wiki
Default path expected: C:\Program Files\Tesseract-OCR\tesseract.exe
Update TESSERACT_PATH in any module that uses OCR if installed elsewhere.
Poppler
Download: https://github.com/oschwartz10612/poppler-windows/releases
Default path expected: C:\poppler-26.02.0\Library\bin
Update POPPLER_PATH in any module that uses OCR if installed elsewhere.
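As a rough illustration of how the two path constants might be checked before OCR is attempted, the sketch below verifies that both binaries exist and that the optional packages import. The helper name `ocr_ready` is hypothetical and not part of the toolkit. The default values are copied from the paths listed above. Setting `pytesseract.pytesseract.tesseract_cmd` is the documented way to point pytesseract at a non-PATH executable, and pdf2image's `convert_from_path` accepts a `poppler_path` argument for the same purpose.

```python
from pathlib import Path

# Assumed defaults, copied from the paths listed above; adjust to your install.
TESSERACT_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
POPPLER_PATH = r"C:\poppler-26.02.0\Library\bin"


def ocr_ready(tesseract_path: str, poppler_path: str) -> bool:
    """Return True only if both binaries and both Python packages are usable."""
    if not Path(tesseract_path).is_file() or not Path(poppler_path).is_dir():
        return False
    try:
        import pytesseract
        import pdf2image  # noqa: F401  (import check only)
    except ImportError:
        return False
    # Point pytesseract at the configured executable.
    pytesseract.pytesseract.tesseract_cmd = tesseract_path
    return True


print("OCR available:", ocr_ready(TESSERACT_PATH, POPPLER_PATH))
```

If this prints `False`, the toolkit should still work on digitally created PDFs; only the OCR fallback is unavailable.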
- Clone or download this repository into a local folder.
- Install Python dependencies: `pip install pypdf pdfplumber openpyxl pywin32`
- If OCR is needed: `pip install pytesseract pdf2image`
  Then install the Tesseract and Poppler binaries (see links above) and update the path constants at the top of `pdf_scanner.py` and `brand_reader.py`.
- For batch printing, install SumatraPDF.
  Download: https://www.sumatrapdfreader.org/download-free-pdf-viewer
  Update `SUMATRA_PATH` in `batch_print.py` to match your installation path.
- Run the toolkit: `python main.py`
From inside the project/ folder:
python main.py
You will be presented with the following menu:
=======================================================
PDF Processing Toolkit
=======================================================
1. Scan drive and copy matching PDFs (pdf_scanner)
2. Extract brand names to Excel (brand_reader)
3. Batch print PDFs to printer (batch_print)
0. Exit
=======================================================
Select an option by typing the number and pressing Enter. After each operation completes, press Enter to return to the menu.
To pre-configure paths so you are not prompted at runtime, set SEARCH_ROOT and DEST_FOLDER at the top of pdf_scanner.py, or FOLDER_PATH in brand_reader.py. Leave them as empty strings "" to be prompted each time.
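The "empty string means prompt" convention could look roughly like the sketch below. `resolve_folder` is a hypothetical helper written for illustration; the toolkit's own prompting code may differ. The `ask` parameter defaults to `input` and exists only so the function can be exercised without a terminal.

```python
from pathlib import Path
from typing import Callable

SEARCH_ROOT = ""  # empty string means "ask at runtime"


def resolve_folder(configured: str, label: str,
                   ask: Callable[[str], str] = input) -> Path:
    """Use the configured path if set; otherwise prompt until a folder exists."""
    if configured:
        return Path(configured)
    while True:
        # Strip whitespace and the quotes Windows adds when a path is pasted.
        answer = ask(f"{label}: ").strip().strip('"')
        if answer and Path(answer).is_dir():
            return Path(answer)
        print(f"Not a folder: {answer!r}")
```

Calling `resolve_folder(SEARCH_ROOT, "Folder to scan")` would prompt only while `SEARCH_ROOT` is left empty.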
Recursively walks a drive or folder, identifies PDFs that match a configurable keyword and regex combination, and copies them to a destination folder.
A PDF qualifies only when both of the following conditions are true:
- At least one keyword from `KEYWORDS` is found anywhere in the extracted text (case-insensitive), AND
- At least `MATCH_THRESHOLD` of the `MATCHERS` regex patterns also match
Pages are read one at a time and scanning stops the moment both conditions are satisfied — the remaining pages are never read.
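The two-condition rule and the early stop can be sketched as follows. The keyword and pattern values here are illustrative stand-ins, not the lists shipped in `pdf_scanner.py`, and `pdf_qualifies` is a hypothetical name. Pages are passed as an iterable so that unread pages are genuinely never extracted.

```python
import re
from typing import Iterable

# Illustrative values only; the real lists live in pdf_scanner.py.
KEYWORDS = ["Certificate of Product Registration"]
MATCHERS = [
    re.compile(r"Brand\s+Name\s*:", re.IGNORECASE),
    re.compile(r"Registration\s+Number\s*:", re.IGNORECASE),
    re.compile(r"Manufacturer\s*:", re.IGNORECASE),
]
MATCH_THRESHOLD = 2


def pdf_qualifies(pages: Iterable[str]) -> bool:
    """Feed page texts one at a time; stop as soon as both conditions hold."""
    seen = ""
    for page_text in pages:
        seen += "\n" + page_text
        lowered = seen.lower()
        keyword_hit = any(k.lower() in lowered for k in KEYWORDS)
        pattern_hits = sum(1 for m in MATCHERS if m.search(seen))
        if keyword_hit and pattern_hits >= MATCH_THRESHOLD:
            return True  # later pages are never pulled from the iterable
    return False
```

Text is accumulated across pages so that a keyword on page 1 and the field labels on page 2 still count as one match.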
Keywords (`KEYWORDS`):
- Certificate of Good Manufacturing Practice
- Certificate of Product Registration
- Certificate of Listing of Identical Drug Product

Regex patterns (`MATCHERS`) look for:
- Brand Name:
- Registration Number:
- FDA Registration No.:
- Valid Until <date>
- Manufacturer:
- Importer / Distributor:
To add keywords, append to the KEYWORDS list in pdf_scanner.py.
To make matching stricter, raise MATCH_THRESHOLD.
pypdf → pdfplumber → OCR (Tesseract)
- pypdf: fastest, works on standard digitally created PDFs
- pdfplumber: slower, handles complex layouts, tables, and multi-column text
- OCR: slowest, used only when extracted text is below `TEXT_THRESHOLD` characters
If a method returns sufficient text and the match conditions are met, the remaining methods are never attempted.
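A simplified sketch of this fallback chain is shown below. It only checks text length against `TEXT_THRESHOLD` and leaves out the match-condition check, so it is an approximation of the behaviour described above rather than the module's code. `extract_text` and the `Extractor` alias are hypothetical names. In practice the list would hold one callable each for pypdf, pdfplumber, and OCR.

```python
from typing import Callable, Sequence

TEXT_THRESHOLD = 50  # minimum characters before the next extractor is tried

Extractor = Callable[[str], str]


def extract_text(pdf_path: str, extractors: Sequence[Extractor]) -> str:
    """Try extractors in order; return the first result that is long enough."""
    best = ""
    for extract in extractors:
        try:
            text = extract(pdf_path) or ""
        except Exception:
            text = ""  # a failing extractor just falls through to the next
        if len(text.strip()) >= TEXT_THRESHOLD:
            return text  # slower extractors are never called
        if len(text) > len(best):
            best = text
    return best  # nothing reached the threshold; return the longest attempt
```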
- Walks all subdirectories recursively regardless of nesting depth
- Walk and processing run concurrently — the first file starts processing while the walker is still discovering new directories
- The destination folder is automatically excluded from the walk to prevent re-processing already-copied files
- Current directory being scanned is displayed on a single overwriting console line
Folders always skipped:
.* (any hidden folder) $* (system folders) ~* (temp folders)
Windows Program Files Program Files (x86)
ProgramData System Volume Information winnt
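One common way to implement this kind of pruning is to edit the `dirnames` list that `os.walk` yields, which stops the walk from descending into the removed entries. The sketch below does that for the skip rules listed above and for the destination folder. `iter_pdfs` is a hypothetical name, and the case-insensitive name comparison is an assumption.

```python
import os
from pathlib import Path
from typing import Iterator

SKIP_NAMES = {"windows", "program files", "program files (x86)",
              "programdata", "system volume information", "winnt"}
SKIP_PREFIXES = (".", "$", "~")


def iter_pdfs(root: str, dest: str) -> Iterator[Path]:
    """Yield every .pdf under root, pruning skipped folders and dest itself."""
    dest_resolved = Path(dest).resolve()
    for current, dirnames, filenames in os.walk(root):
        # Editing dirnames in place tells os.walk not to descend into them.
        dirnames[:] = [
            d for d in dirnames
            if d.lower() not in SKIP_NAMES
            and not d.startswith(SKIP_PREFIXES)
            and (Path(current) / d).resolve() != dest_resolved
        ]
        for name in filenames:
            if name.lower().endswith(".pdf"):
                yield Path(current) / name
```

Because the function is a generator, files can be handed to workers while the walk is still in progress.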
Files are SHA-256 hashed before copying:
- Files under 512 KB: full file hashed
- Files over 512 KB: first + last 256 KB hashed (for speed)
If a matching file with identical content has already been copied in the current run (regardless of filename or location), it is skipped and logged under DUPLICATES in the output log.
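The hashing rule can be sketched as below. `content_fingerprint` is a hypothetical name, and treating a file of exactly 512 KB as "small" is an assumption, since the text above does not cover that case.

```python
import hashlib
import os

CHUNK = 256 * 1024            # 256 KB
FULL_HASH_LIMIT = 512 * 1024  # files up to this size are hashed in full


def content_fingerprint(path: str) -> str:
    """SHA-256 of the whole file, or of its first and last 256 KB if large."""
    size = os.path.getsize(path)
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        if size <= FULL_HASH_LIMIT:
            digest.update(fh.read())
        else:
            digest.update(fh.read(CHUNK))   # head
            fh.seek(-CHUNK, os.SEEK_END)
            digest.update(fh.read(CHUNK))   # tail
    return digest.hexdigest()
```

The speed-up has a cost: two large files that differ only in the middle produce the same fingerprint and would be treated as duplicates.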
All settings are at the top of pdf_scanner.py:
| Setting | Default | Description |
|---|---|---|
| `SEARCH_ROOT` | `""` | Drive or folder to scan. Empty = prompted at runtime |
| `DEST_FOLDER` | `""` | Destination for copied files. Empty = prompted |
| `MAX_WORKERS` | `4` | Parallel worker processes (bypasses GIL) |
| `MAX_PAGES` | `5` | Maximum pages to scan per PDF |
| `TEXT_THRESHOLD` | `50` | Minimum characters before trying next extractor |
| `FILE_TIMEOUT` | `30` | Seconds before abandoning a single file |
| `MIN_FILE_SIZE` | `1024` | Skip files smaller than this in bytes |
| `MAX_FILE_SIZE` | `0` | Skip files larger than this in bytes (0 = no limit) |
| `MOVE_FILES` | `False` | True = move files, False = copy files |
| `SKIP_DUPLICATES` | `True` | Skip files with identical content |
| `SKIP_HIDDEN` | `True` | Skip hidden and system folders |
| `MATCH_THRESHOLD` | `2` | Minimum regex pattern hits required |
| `OCR_DPI` | `150` | DPI for OCR image rendering |
| `TESSERACT_PATH` | — | Full path to tesseract.exe |
| `POPPLER_PATH` | — | Full path to Poppler bin folder |
Important: `MOVE_FILES` defaults to `False`. Always verify results in copy mode before switching to `True`. Moving files is irreversible.
A scan_results.txt log is written to the destination folder containing:
- Full source → destination path for every copied file
- All skipped files (no match)
- All duplicate files (same content, skipped)
- All errors with error messages
- Summary counts at the bottom
Scans all PDFs in a single folder and extracts structured regulatory fields into a formatted Excel report.
| Field | Pattern matched |
|---|---|
| Brand Name | Brand Name: |
| Registration No. | Registration Number: / FDA Registration No.: |
| Valid Until | valid until <date> |
| Manufacturer | Manufacturer: / Manufacturer Name and Address: |
| Trader | Trader: |
| Importer | Importer: / Importer / Distributor: |
| Distributor | Distributor: |
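To make the table concrete, here is a minimal sketch of label-based extraction for three of the fields. The regular expressions are hypothetical approximations written for this example, not the patterns used in `brand_reader.py`, and the sample values in the usage note are made up.

```python
import re
from typing import Optional

# Hypothetical patterns for three of the fields above; the real ones may differ.
FIELD_PATTERNS = {
    "Brand Name": re.compile(r"Brand\s+Name\s*:\s*(.+)", re.IGNORECASE),
    "Registration No.": re.compile(
        r"(?:FDA\s+)?Registration\s+(?:Number|No\.)\s*:\s*(.+)", re.IGNORECASE),
    "Valid Until": re.compile(r"valid\s+until\s*:?\s*(.+)", re.IGNORECASE),
}


def extract_fields(text: str) -> dict[str, Optional[str]]:
    """Return the first captured value per field, or None when absent."""
    result: dict[str, Optional[str]] = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1).strip() if match else None
    return result
```

For example, `extract_fields("Brand Name: Acme Relief")` fills in the brand name and leaves the other two fields as `None`.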
Same as pdf_scanner: pypdf → pdfplumber → OCR (Tesseract).
All settings are at the top of brand_reader.py:
| Setting | Default | Description |
|---|---|---|
| `MAX_WORKERS` | `4` | Parallel worker threads |
| `MAX_PAGES` | `5` | Maximum pages to scan per PDF |
| `OCR_DPI` | `300` | DPI for OCR rendering |
| `TEXT_THRESHOLD` | `50` | Minimum characters before trying next extractor |
| `TESSERACT_PATH` | — | Full path to tesseract.exe |
| `POPPLER_PATH` | — | Full path to Poppler bin folder |
A brand_results.xlsx file is written to the scanned folder. If the file already exists it is saved as brand_results(1).xlsx, brand_results(2).xlsx, and so on — existing files are never overwritten.
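The non-overwriting naming scheme can be sketched in a few lines. `next_free_path` is a hypothetical helper; the module's own implementation may differ.

```python
from pathlib import Path


def next_free_path(folder: str, stem: str = "brand_results",
                   suffix: str = ".xlsx") -> Path:
    """Return folder/stem.suffix, or stem(1), stem(2), ... if already taken."""
    candidate = Path(folder) / f"{stem}{suffix}"
    counter = 1
    while candidate.exists():
        candidate = Path(folder) / f"{stem}({counter}){suffix}"
        counter += 1
    return candidate
```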
The Excel report is divided into three labeled sections:
| Section | Contents |
|---|---|
| ✔ FOUND | Files where Brand Name was successfully extracted |
| ✘ NOT FOUND | Files processed but Brand Name was not found |
| ⚠ ERRORS | Files that could not be read or caused exceptions |
A summary row at the bottom shows total counts for each section.
Sends all PDFs in a folder to a physical printer in natural sort order with a live dashboard showing real-time print queue state.
Windows only. This module requires `pywin32` and the Windows print spooler. It will not run on macOS or Linux.
- PDFs are sorted in natural order (`cert2.pdf` before `cert10.pdf`)
- Before sending each file, the module checks the spooler; if `MAX_ACTIVE_JOBS` jobs are already in the queue, it waits
- Each file is sent via SumatraPDF in silent mode
- The spooler job ID is captured by comparing job lists before and after sending
- Completed jobs are detected when their ID disappears from the active spooler
- After the last file is sent, a drain loop waits up to `DRAIN_TIMEOUT` seconds for all remaining jobs to clear
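Two of the steps above are easy to show in isolation: natural sorting and finding the new job ID by comparing the spooler's job lists. Both functions below are hypothetical sketches with made-up names. In the real module the ID lists would presumably come from the Windows spooler via pywin32, which is left out here so the example runs on any platform.

```python
import re
from typing import Iterable, Optional


def natural_key(name: str) -> list:
    """Split 'cert10.pdf' into ['cert', 10, '.pdf'] so numbers compare as ints."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def new_job_id(before: Iterable[int], after: Iterable[int]) -> Optional[int]:
    """Return the job ID present after sending but not before, if exactly one."""
    added = set(after) - set(before)
    return added.pop() if len(added) == 1 else None
```

Sorting with `sorted(files, key=natural_key)` places `cert2.pdf` before `cert10.pdf`. `new_job_id` returns `None` when zero or several new jobs appear, because the job cannot then be identified unambiguously.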
All settings are at the top of batch_print.py:
| Setting | Default | Description |
|---|---|---|
| `PRINTER_NAME` | `FUJI XEROX DocuPrint M455 df` | Exact printer name as shown in Windows |
| `SUMATRA_PATH` | — | Full path to SumatraPDF.exe |
| `MAX_ACTIVE_JOBS` | `2` | Maximum concurrent spooler jobs before waiting |
To find your exact printer name: open Control Panel → Devices and Printers and copy the name exactly as displayed, including spacing and capitalization.
A print_history.txt log is written to the PDF source folder on completion, listing all printed files in order. Files that failed to send are tagged with [FAILED].
- Create `modules/your_module.py` with a `run()` function:

      def run(folder_path: str) -> None:
          # your logic here

- Import it in `main.py`:

      import modules.your_module as your_module

- Add a menu label to the `MENU` list in `main.py`:

      MENU = [
          ...
          "Your feature description (your_module)",
      ]

- Add a launcher function in `main.py`:

      def launch_your_module():
          print("\n── Your Module ──────────────────────────────────────")
          folder = prompt_path("Enter folder path", must_exist=True)
          your_module.run(folder)

- Append the launcher to the `LAUNCHERS` list:

      LAUNCHERS = [
          ...
          launch_your_module,
      ]
Menu numbering updates automatically. MENU and LAUNCHERS must always have the same number of entries and be in the same order.
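The reason the two lists must stay aligned is that the menu number is used directly as an index into `LAUNCHERS`. The sketch below shows one way that dispatch could work. `render_menu`, `dispatch`, and the two stub launchers are hypothetical and stand in for whatever `main.py` actually does.

```python
from typing import Callable


def launch_scanner() -> None:
    print("scanner would run here")


def launch_reader() -> None:
    print("reader would run here")


MENU = ["Scan drive and copy matching PDFs (pdf_scanner)",
        "Extract brand names to Excel (brand_reader)"]
LAUNCHERS: list[Callable[[], None]] = [launch_scanner, launch_reader]

# Fail fast at import time if the two lists drift apart.
assert len(MENU) == len(LAUNCHERS), "MENU and LAUNCHERS must stay in sync"


def render_menu() -> str:
    """Number the labels from 1; 0 is reserved for Exit."""
    lines = [f"  {i}. {label}" for i, label in enumerate(MENU, start=1)]
    lines.append("  0. Exit")
    return "\n".join(lines)


def dispatch(choice: str) -> bool:
    """Run the launcher for a 1-based choice; return False if invalid."""
    try:
        index = int(choice)
    except ValueError:
        return False
    if not 1 <= index <= len(LAUNCHERS):
        return False
    LAUNCHERS[index - 1]()
    return True
```

A length check like the `assert` above catches a missing entry, but not two entries in the wrong order, so the ordering still has to be maintained by hand.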
ImportError: cannot import name 'pdf_scanner' from 'modules'
Use import modules.pdf_scanner as pdf_scanner instead of from modules import pdf_scanner. The __init__.py is intentionally empty — modules must be imported by full path.
ModuleNotFoundError: No module named 'win32print'
Run pip install pywin32. This is required for batch_print only.
ModuleNotFoundError: No module named 'pytesseract'
OCR is optional. If not installed, the toolkit falls back to text-only extraction. Install with pip install pytesseract pdf2image only if your PDFs are scanned images.
PDF scanner finds no matches
- Confirm your PDFs contain one of the three certificate title keywords
- Lower `MATCH_THRESHOLD` to `1` in `pdf_scanner.py` temporarily to relax the regex requirement to a single pattern hit
- If PDFs are scanned images, ensure OCR is installed and `OCR_DPI` is at least 200
Brand reader returns empty fields
- The field labels in the PDF must match the regex patterns (e.g. `Brand Name:`, `Manufacturer:`)
- Check if the PDF is image-based; if so, OCR must be installed
- Keep `OCR_DPI` in `brand_reader.py` at `300` (the default) or higher for better accuracy on low-quality scans
Batch print sends jobs but dashboard shows no completion
- Verify `PRINTER_NAME` matches exactly what appears in Control Panel → Devices and Printers
- Some printer drivers clear jobs from the spooler immediately after accepting them. In that case the drain loop may time out harmlessly; check the physical printer for output
Printer spooler is unreachable
safe_get_jobs() will retry 3 times with a 5-second delay between attempts before returning an empty job list. If the printer is consistently unreachable, check that the print spooler service is running: services.msc → Print Spooler → Started.
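The retry behaviour described here follows a common pattern, sketched below under the assumption that "retry 3 times" means three attempts in total. `get_jobs_with_retry` is a hypothetical stand-in, not the module's `safe_get_jobs()`. The `fetch` callable represents whatever call queries the spooler, and `sleep` is injectable only so the delay can be skipped in tests.

```python
import time
from typing import Callable


def get_jobs_with_retry(fetch: Callable[[], list],
                        attempts: int = 3,
                        delay: float = 5.0,
                        sleep: Callable[[float], None] = time.sleep) -> list:
    """Call fetch() up to `attempts` times; return [] if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as exc:
            print(f"Spooler query failed (attempt {attempt}/{attempts}): {exc}")
            if attempt < attempts:
                sleep(delay)  # wait before the next attempt, not after the last
    return []
```

Returning an empty list rather than raising keeps the print loop alive, at the cost that an unreachable spooler looks the same as an empty queue.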
- `MOVE_FILES = False` by default in `pdf_scanner.py`. Always verify results in copy mode first.
- `batch_print.py` is Windows-only. It will not run on macOS or Linux.
- OCR is optional across all modules. Missing `pytesseract` / `pdf2image` prints a warning but does not prevent the toolkit from running.
- All state in `batch_print` is scoped to each `run()` call, so running batch print twice in one session starts completely clean.
- The destination folder in `pdf_scanner` is automatically excluded from the walk even when it is inside the search root, preventing an infinite copy loop.