A Python library for preprocessing scanned documents to improve OCR accuracy. Optimizes image quality through deskewing, denoising, contrast enhancement, and more.
Raw scanned documents often have issues that reduce OCR accuracy:
- Skewed pages from misaligned scanning
- Noise and artifacts from scanner sensors
- Low contrast making text hard to distinguish
- Blurry text from poor scan quality
This library applies a preprocessing pipeline that addresses these issues before the image reaches the OCR engine, which typically improves recognition accuracy.
- Deskewing - Automatically detect and correct page rotation
- Denoising - Remove scanner artifacts while preserving text
- Contrast Enhancement - CLAHE-based enhancement for better text visibility
- Sharpening - Enhance text edges for clearer recognition
- Color Preservation - Preserve signatures, stamps, and colored elements
- PDF Support - Process multi-page PDFs with single PDF output
- Configurable Pipeline - Use predefined pipelines or custom steps
- CLI Tool - Process files directly from command line
pip install ocr-preprocessor

Or install from source:

git clone https://github.com/kloia/ocr-preprocessor.git
cd ocr-preprocessor
pip install -e .

from ocr_preprocessor import OCRPreprocessor
processor = OCRPreprocessor()
# Process scanned image
with open("scanned_document.jpg", "rb") as f:
    processed = processor.process_image(f.read())
# Save result (ready for OCR)
with open("ready_for_ocr.png", "wb") as f:
    f.write(processed)

# Process an image
ocrprep scan.jpg -o processed.png
# Process PDF to PDF
ocrprep document.pdf -o processed.pdf
# Process with verbose output
ocrprep scan.jpg -o output.png --verbose
# Use fast pipeline
ocrprep scan.jpg --pipeline fast
# Custom steps only
ocrprep scan.jpg --steps normalize deskew contrast

The full preprocessing pipeline applies these steps in order:
| Step | Purpose | OCR Benefit |
|---|---|---|
| Normalize | Stabilize pixel intensities | Consistent input for OCR |
| Deskew | Correct page rotation | Better line detection |
| Resize | Scale to optimal resolution | Improved character recognition |
| Denoise | Remove scanner artifacts | Cleaner text boundaries |
| Contrast | CLAHE enhancement | Better text/background separation |
| Sharpen | Enhance text edges | Clearer character shapes |
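To illustrate how an ordered pipeline of this kind can be driven, here is a minimal pure-Python sketch. It is not the library's implementation: `run_pipeline`, `normalize`, and the list-of-rows image type are hypothetical names chosen for illustration, and the point at which the real library invokes its progress callback (before or after each step) is an assumption.

```python
from typing import Callable, Optional, Sequence

# Assumption for illustration: a grayscale image as rows of 0-255 integers.
Image = list[list[int]]
Step = tuple[str, Callable[[Image], Image]]
ProgressCallback = Callable[[int, int, str], None]


def normalize(image: Image) -> Image:
    """Min-max stretch pixel values to the full 0-255 range."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Flat image: nothing to stretch, return a copy unchanged.
        return [row[:] for row in image]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in image]


def run_pipeline(
    image: Image,
    steps: Sequence[Step],
    progress_callback: Optional[ProgressCallback] = None,
) -> Image:
    """Apply each (name, function) step in order, reporting progress first."""
    total = len(steps)
    for index, (name, func) in enumerate(steps, start=1):
        if progress_callback is not None:
            progress_callback(index, total, name)
        image = func(image)
    return image


if __name__ == "__main__":
    result = run_pipeline(
        [[50, 100], [150, 200]],
        [("normalize", normalize)],
        progress_callback=lambda cur, tot, name: print(f"Step {cur}/{tot}: {name}"),
    )
    print(result)
```

Keeping each step as a named callable is one way to make both the predefined pipelines and the custom step lists use the same loop.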
from ocr_preprocessor import OCRPreprocessor, Pipeline
processor = OCRPreprocessor()
# Full pipeline (default) - all steps
result = processor.process_image(image_bytes, pipeline=Pipeline.FULL)
# Minimal pipeline - for clean digital PDFs
result = processor.process_image(image_bytes, pipeline=Pipeline.MINIMAL)
# Fast pipeline - balance of speed and quality
result = processor.process_image(image_bytes, pipeline=Pipeline.FAST)

from ocr_preprocessor import OCRPreprocessor, ProcessingStep
processor = OCRPreprocessor()
# Only apply specific steps
result = processor.process_image(
    image_bytes,
    steps=[
        ProcessingStep.NORMALIZE,
        ProcessingStep.DESKEW,
        ProcessingStep.CONTRAST,
    ],
)

def on_progress(current: int, total: int, step_name: str) -> None:
    print(f"Step {current}/{total}: {step_name}")

result = processor.process_image(
    image_bytes,
    progress_callback=on_progress,
)
# Output:
# Step 1/6: normalize
# Step 2/6: deskew
# ...

from ocr_preprocessor import OCRPreprocessor, convert_pdf_all_pages, images_to_pdf
processor = OCRPreprocessor()
with open("document.pdf", "rb") as f:
    pdf_bytes = f.read()
# Convert pages to images
pages = convert_pdf_all_pages(pdf_bytes)
# Process each page
processed = [processor.process_image(p) for p in pages]
# Combine back to PDF
result_pdf = images_to_pdf(processed)
with open("processed.pdf", "wb") as f:
    f.write(result_pdf)

# Single command - outputs processed PDF
ocrprep document.pdf -o processed.pdf
# Output individual pages instead
ocrprep document.pdf --output ./pages/

processor = OCRPreprocessor(
    min_width=1000,  # Minimum output width (pixels)
    max_width=3000,  # Maximum output width (pixels)
)

| Method | Description |
|---|---|
| process_image(bytes, ...) -> bytes | Process image with configurable pipeline |
| process_image_minimal(bytes) -> bytes | Light preprocessing (for clean PDFs) |
| process(bytes, file_type) -> bytes | Auto-detect and process |
| process_array(ndarray) -> ndarray | Process numpy array directly |
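The table does not say how file types are recognized. One common approach is to inspect the leading "magic" bytes of the input, sketched below. `detect_file_type` is a hypothetical helper, not part of the library's API, and it raises a plain `ValueError` where the library may well use its own exception type.

```python
def detect_file_type(data: bytes) -> str:
    """Guess the file type from well-known file signatures (magic bytes)."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    # Assumption: anything else is treated as unsupported in this sketch.
    raise ValueError("unsupported or unrecognized file type")


if __name__ == "__main__":
    print(detect_file_type(b"%PDF-1.7\n..."))
```

Checking signatures rather than file extensions means renamed or extensionless uploads are still classified by their actual content.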
from ocr_preprocessor import ProcessingStep, Pipeline, OutputFormat
# Processing steps
ProcessingStep.NORMALIZE # Normalize pixel intensities
ProcessingStep.DESKEW # Fix rotation
ProcessingStep.RESIZE # Scale to target size
ProcessingStep.DENOISE # Remove noise
ProcessingStep.CONTRAST # CLAHE enhancement
ProcessingStep.SHARPEN # Sharpen edges
# Pipelines
Pipeline.FULL # All steps (default)
Pipeline.MINIMAL # Normalize + Resize + Light Contrast
Pipeline.FAST # Normalize + Resize + Contrast
# Output formats
OutputFormat.PNG # Lossless (default)
OutputFormat.JPEG # Lossy, smaller size

| Function | Description |
|---|---|
| convert_pdf_first_page(bytes, dpi=200) -> bytes | Convert first page only |
| convert_pdf_all_pages(bytes, dpi=200) -> list[bytes] | Convert all pages |
| convert_pdf_page(bytes, page_number, dpi=200) -> bytes | Convert specific page |
| get_pdf_page_count(bytes) -> int | Get page count |
| images_to_pdf(list[bytes]) -> bytes | Combine images into PDF |
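For large PDFs it may be preferable to convert and process one page at a time rather than holding every page image in memory. The sketch below shows that pattern with the helpers passed in as arguments, so it runs without the library installed. `iter_processed_pages` is a hypothetical name, and because the table does not state whether `page_number` is 0-based or 1-based, the starting index is an explicit `first_page` parameter (defaulting to 1 purely as an assumption).

```python
from typing import Callable, Iterator


def iter_processed_pages(
    pdf_bytes: bytes,
    page_count: Callable[[bytes], int],
    convert_page: Callable[[bytes, int], bytes],
    process_page: Callable[[bytes], bytes],
    first_page: int = 1,  # assumption: adjust if page numbers are 0-based
) -> Iterator[bytes]:
    """Yield processed page images one at a time instead of building a list."""
    total = page_count(pdf_bytes)
    for page_number in range(first_page, first_page + total):
        yield process_page(convert_page(pdf_bytes, page_number))


if __name__ == "__main__":
    # Stand-in functions so the sketch is runnable on its own.
    pages = iter_processed_pages(
        b"fake-pdf",
        page_count=lambda pdf: 3,
        convert_page=lambda pdf, n: f"page-{n}".encode(),
        process_page=lambda img: img.upper(),
    )
    print(list(pages))
```

With the library, the stand-ins would presumably be replaced by `get_pdf_page_count`, `convert_pdf_page`, and `processor.process_image`, after confirming the page numbering convention.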
from ocr_preprocessor import (
OCRPreprocessorError, # Base exception
ImageDecodeError, # Failed to decode image
ImageEncodeError, # Failed to encode image
PDFConversionError, # PDF conversion failed
PDFInfoError, # Cannot read PDF info
InvalidFileTypeError, # Unsupported file type
ProcessingStepError, # Processing step failed
)

The library automatically detects colored regions (signatures, stamps, highlights) using HSV saturation analysis. These areas receive lighter processing to preserve their appearance while text areas get stronger enhancement for better OCR.
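As a rough illustration of saturation-based detection, the sketch below marks pixels whose HSV saturation and brightness exceed fixed thresholds, using only the standard library's `colorsys`. The function name `colored_mask` and both threshold values are assumptions made for this example; the library's actual thresholds and region handling are not documented here.

```python
import colorsys

# Assumption for illustration: an RGB image as rows of (r, g, b) tuples, 0-255.
RGBImage = list[list[tuple[int, int, int]]]


def colored_mask(
    pixels: RGBImage,
    saturation_threshold: float = 0.3,  # illustrative value
    min_value: float = 0.2,             # ignore very dark pixels
) -> list[list[bool]]:
    """Return True for pixels that look colored rather than black/gray/white."""
    mask = []
    for row in pixels:
        mask_row = []
        for r, g, b in row:
            _, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            mask_row.append(s >= saturation_threshold and v >= min_value)
        mask.append(mask_row)
    return mask


if __name__ == "__main__":
    sample = [
        [(0, 0, 0), (255, 255, 255)],       # black text, white paper
        [(200, 30, 30), (120, 120, 125)],   # red stamp, gray scanner noise
    ]
    print(colored_mask(sample))
```

Black text and white paper both have near-zero saturation, so a saturation threshold separates them from ink in stamps or signatures, and a mask like this could then select where lighter processing is applied.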
After preprocessing, you can pass the output to any OCR engine:
from ocr_preprocessor import OCRPreprocessor
import pytesseract # or any other OCR library
processor = OCRPreprocessor()
with open("scan.jpg", "rb") as f:
    processed = processor.process_image(f.read())
# Save and OCR
with open("processed.png", "wb") as f:
    f.write(processed)
text = pytesseract.image_to_string("processed.png")

# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Lint
ruff check src/
# Type check
mypy src/

MIT License - see LICENSE for details.