OCR Preprocessor

A Python library for preprocessing scanned documents to improve OCR accuracy. It improves image quality through normalization, deskewing, resizing, denoising, contrast enhancement, and sharpening.

Why OCR Preprocessing?

Raw scanned documents often have issues that reduce OCR accuracy:

Skewed pages from misaligned scanning
Noise and artifacts from scanner sensors
Low contrast making text hard to distinguish
Blurry text from poor scan quality

This library applies a preprocessing pipeline that addresses these issues before the image reaches the OCR engine, which typically improves recognition accuracy.

Features

Deskewing - Automatically detect and correct page rotation
Denoising - Remove scanner artifacts while preserving text
Contrast Enhancement - CLAHE-based enhancement for better text visibility
Sharpening - Enhance text edges for clearer recognition
Color Preservation - Preserve signatures, stamps, and colored elements
PDF Support - Process multi-page PDFs into a single output PDF
Configurable Pipeline - Use predefined pipelines or custom steps
CLI Tool - Process files directly from the command line

Installation

pip install ocr-preprocessor

Or install from source:

git clone https://github.com/kloia/ocr-preprocessor.git
cd ocr-preprocessor
pip install -e .

Quick Start

Python API

from ocr_preprocessor import OCRPreprocessor

processor = OCRPreprocessor()

# Process scanned image
with open("scanned_document.jpg", "rb") as f:
    processed = processor.process_image(f.read())

# Save result (ready for OCR)
with open("ready_for_ocr.png", "wb") as f:
    f.write(processed)

CLI Usage

# Process an image
ocrprep scan.jpg -o processed.png

# Process PDF to PDF
ocrprep document.pdf -o processed.pdf

# Process with verbose output
ocrprep scan.jpg -o output.png --verbose

# Use fast pipeline
ocrprep scan.jpg --pipeline fast

# Custom steps only
ocrprep scan.jpg --steps normalize deskew contrast
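
For batch jobs, the CLI can also be driven from Python. The following is a minimal sketch that assumes only the flags shown in the examples above (`-o`, `--pipeline`, `--steps`, `--verbose`). The helper name `build_ocrprep_command` is hypothetical and not part of the library; it only builds the argument list and does not run anything.

```python
from typing import List, Optional, Sequence


def build_ocrprep_command(
    input_path: str,
    output_path: Optional[str] = None,
    pipeline: Optional[str] = None,
    steps: Optional[Sequence[str]] = None,
    verbose: bool = False,
) -> List[str]:
    """Build an argument list for the ocrprep CLI (hypothetical helper).

    Only flags that appear in the CLI examples are used. The input file
    comes first so that the multi-value --steps option cannot absorb it.
    """
    cmd = ["ocrprep", input_path]
    if output_path is not None:
        cmd += ["-o", output_path]
    if pipeline is not None:
        cmd += ["--pipeline", pipeline]
    if steps:
        cmd += ["--steps", *steps]
    if verbose:
        cmd.append("--verbose")
    return cmd
```

To execute a command built this way, pass the list to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids quoting problems with file names that contain spaces.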

Processing Pipeline

The full preprocessing pipeline applies these steps in order:

Step	Purpose	OCR Benefit
Normalize	Stabilize pixel intensities	Consistent input for OCR
Deskew	Correct page rotation	Better line detection
Resize	Scale to optimal resolution	Improved character recognition
Denoise	Remove scanner artifacts	Cleaner text boundaries
Contrast	CLAHE enhancement	Better text/background separation
Sharpen	Enhance text edges	Clearer character shapes

Predefined Pipelines

from ocr_preprocessor import OCRPreprocessor, Pipeline

processor = OCRPreprocessor()

# Full pipeline (default) - all steps
result = processor.process_image(image_bytes, pipeline=Pipeline.FULL)

# Minimal pipeline - for clean digital PDFs
result = processor.process_image(image_bytes, pipeline=Pipeline.MINIMAL)

# Fast pipeline - balance of speed and quality
result = processor.process_image(image_bytes, pipeline=Pipeline.FAST)

Custom Steps

from ocr_preprocessor import OCRPreprocessor, ProcessingStep

processor = OCRPreprocessor()

# Only apply specific steps
result = processor.process_image(
    image_bytes,
    steps=[
        ProcessingStep.NORMALIZE,
        ProcessingStep.DESKEW,
        ProcessingStep.CONTRAST
    ]
)

Progress Callback

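from ocr_preprocessor import OCRPreprocessor

processor = OCRPreprocessor()

# Called once per step with the step index, the step count, and the step name
def on_progress(current: int, total: int, step_name: str) -> None:
    print(f"Step {current}/{total}: {step_name}")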

result = processor.process_image(
    image_bytes,
    progress_callback=on_progress
)
# Output:
# Step 1/6: normalize
# Step 2/6: deskew
# ...
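
To make the callback contract concrete, here is a minimal sketch of how a pipeline runner could report progress. The runner `run_steps` and the placeholder steps are hypothetical and do not reflect the library's internals; only the callback signature `(current, total, step_name)` comes from the example above. Calling the callback before each step, rather than after, is also an assumption.

```python
from typing import Callable, List, Optional, Sequence, Tuple

ProgressCallback = Callable[[int, int, str], None]
Step = Tuple[str, Callable[[bytes], bytes]]


def run_steps(
    data: bytes,
    steps: Sequence[Step],
    progress_callback: Optional[ProgressCallback] = None,
) -> bytes:
    """Apply each step in order, reporting progress before it runs."""
    total = len(steps)
    for index, (name, func) in enumerate(steps, start=1):
        if progress_callback is not None:
            progress_callback(index, total, name)
        data = func(data)
    return data


# Placeholder steps that return their input unchanged (illustration only).
demo_steps: List[Step] = [
    ("normalize", lambda d: d),
    ("deskew", lambda d: d),
    ("contrast", lambda d: d),
]

messages: List[str] = []
result = run_steps(
    b"raw",
    demo_steps,
    lambda cur, tot, name: messages.append(f"Step {cur}/{tot}: {name}"),
)
```

Because the total is passed on every call, a callback can drive a progress bar without knowing which pipeline was selected.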

PDF Processing

Process PDF to PDF

from ocr_preprocessor import OCRPreprocessor, convert_pdf_all_pages, images_to_pdf

processor = OCRPreprocessor()

with open("document.pdf", "rb") as f:
    pdf_bytes = f.read()

# Convert pages to images
pages = convert_pdf_all_pages(pdf_bytes)

# Process each page
processed = [processor.process_image(p) for p in pages]

# Combine back to PDF
result_pdf = images_to_pdf(processed)

with open("processed.pdf", "wb") as f:
    f.write(result_pdf)

CLI: PDF to PDF

# Single command - outputs processed PDF
ocrprep document.pdf -o processed.pdf

# Output individual pages instead
ocrprep document.pdf --output ./pages/

API Reference

OCRPreprocessor

processor = OCRPreprocessor(
    min_width=1000,  # Minimum output width (pixels)
    max_width=3000   # Maximum output width (pixels)
)
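
One plausible reading of these bounds is that the resize step scales an image so that its width falls within `[min_width, max_width]` while keeping the aspect ratio. The sketch below illustrates that arithmetic only. The function `target_size` is hypothetical, and the library's actual resize policy may differ.

```python
from typing import Tuple


def target_size(
    width: int,
    height: int,
    min_width: int = 1000,
    max_width: int = 3000,
) -> Tuple[int, int]:
    """Clamp the width into [min_width, max_width], preserving aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if min_width > max_width:
        raise ValueError("min_width must not exceed max_width")
    clamped_width = min(max(width, min_width), max_width)
    scale = clamped_width / width
    # Keep at least one pixel of height for extremely wide inputs.
    return clamped_width, max(1, round(height * scale))
```

Under this assumption a 500x700 scan would be upscaled to 1000x1400, a 6000x4000 scan would be reduced to 3000x2000, and an image already inside the bounds would keep its size.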

Methods

Method	Description
`process_image(bytes, ...) -> bytes`	Process image with configurable pipeline
`process_image_minimal(bytes) -> bytes`	Light preprocessing (for clean PDFs)
`process(bytes, file_type) -> bytes`	Auto-detect and process
`process_array(ndarray) -> ndarray`	Process numpy array directly

Enums

from ocr_preprocessor import ProcessingStep, Pipeline, OutputFormat

# Processing steps
ProcessingStep.NORMALIZE  # Normalize pixel intensities
ProcessingStep.DESKEW     # Fix rotation
ProcessingStep.RESIZE     # Scale to target size
ProcessingStep.DENOISE    # Remove noise
ProcessingStep.CONTRAST   # CLAHE enhancement
ProcessingStep.SHARPEN    # Sharpen edges

# Pipelines
Pipeline.FULL     # All steps (default)
Pipeline.MINIMAL  # Normalize + Resize + Light Contrast
Pipeline.FAST     # Normalize + Resize + Contrast

# Output formats
OutputFormat.PNG   # Lossless (default)
OutputFormat.JPEG  # Lossy, smaller size
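
The relationship between pipelines and steps can be pictured as a lookup table. The sketch below mirrors the comments above using stand-in enums defined locally (`Step`, `Preset`) so that it runs without the library. It is not the library's implementation: the string values are assumed from the CLI step names, "Light Contrast" is modeled as the ordinary contrast step because the source does not say how the two differ, and the rule that explicit steps override the preset is an assumption.

```python
from enum import Enum
from typing import Dict, List, Optional, Sequence


class Step(Enum):
    # Declared in the order the full pipeline applies them.
    NORMALIZE = "normalize"
    DESKEW = "deskew"
    RESIZE = "resize"
    DENOISE = "denoise"
    CONTRAST = "contrast"
    SHARPEN = "sharpen"


class Preset(Enum):
    FULL = "full"
    MINIMAL = "minimal"
    FAST = "fast"


PRESET_STEPS: Dict[Preset, List[Step]] = {
    Preset.FULL: list(Step),  # enum iteration follows declaration order
    Preset.MINIMAL: [Step.NORMALIZE, Step.RESIZE, Step.CONTRAST],
    Preset.FAST: [Step.NORMALIZE, Step.RESIZE, Step.CONTRAST],
}


def resolve_steps(
    preset: Preset = Preset.FULL,
    steps: Optional[Sequence[Step]] = None,
) -> List[Step]:
    """Return the steps to run; explicit steps win over the preset."""
    if steps is not None:
        return list(steps)
    return list(PRESET_STEPS[preset])
```

Keeping the mapping in one table makes it easy to see at a glance which steps a preset skips, for example deskewing and denoising in the fast preset.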

PDF Functions

Function	Description
`convert_pdf_first_page(bytes, dpi=200) -> bytes`	Convert first page only
`convert_pdf_all_pages(bytes, dpi=200) -> list[bytes]`	Convert all pages
`convert_pdf_page(bytes, page_number, dpi=200) -> bytes`	Convert specific page
`get_pdf_page_count(bytes) -> int`	Get page count
`images_to_pdf(list[bytes]) -> bytes`	Combine images into PDF

Exceptions

from ocr_preprocessor import (
    OCRPreprocessorError,  # Base exception
    ImageDecodeError,      # Failed to decode image
    ImageEncodeError,      # Failed to encode image
    PDFConversionError,    # PDF conversion failed
    PDFInfoError,          # Cannot read PDF info
    InvalidFileTypeError,  # Unsupported file type
    ProcessingStepError,   # Processing step failed
)

Color Preservation

The library automatically detects colored regions (signatures, stamps, highlights) using HSV saturation analysis. These areas receive lighter processing to preserve their appearance while text areas get stronger enhancement for better OCR.
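
As a rough illustration of saturation-based detection, the sketch below flags pixels whose HSV saturation exceeds a threshold, using only the standard library's `colorsys` module. The function name `colored_mask`, the saturation threshold of 0.25, and the brightness floor of 0.2 are assumptions chosen for illustration; the library's actual detection logic and parameters are not documented here.

```python
import colorsys
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255


def colored_mask(
    pixels: Sequence[Sequence[Pixel]],
    saturation_threshold: float = 0.25,
    min_value: float = 0.2,
) -> List[List[bool]]:
    """Mark pixels that look colored rather than black, gray, or white.

    The brightness floor keeps very dark pixels out of the mask, since
    their saturation is numerically unstable (for example, (10, 0, 0)
    has saturation 1.0 but is visually black).
    """
    mask: List[List[bool]] = []
    for row in pixels:
        mask_row: List[bool] = []
        for r, g, b in row:
            _, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            mask_row.append(s > saturation_threshold and v > min_value)
        mask.append(mask_row)
    return mask


sample = [
    [(0, 0, 0), (255, 255, 255), (128, 128, 128)],    # black, white, gray
    [(200, 30, 30), (30, 30, 200), (250, 245, 240)],  # red stamp, blue ink, paper
]
mask = colored_mask(sample)
```

In this sample only the red and blue pixels are flagged. A real implementation would operate on whole arrays and would likely smooth or dilate the mask so that entire signatures or stamps are treated as one region.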

Integration with OCR Engines

After preprocessing, you can pass the output to any OCR engine:

from ocr_preprocessor import OCRPreprocessor
import pytesseract  # or any other OCR library

processor = OCRPreprocessor()

with open("scan.jpg", "rb") as f:
    processed = processor.process_image(f.read())

# Save and OCR
with open("processed.png", "wb") as f:
    f.write(processed)

text = pytesseract.image_to_string("processed.png")

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Lint
ruff check src/

# Type check
mypy src/

License

MIT License - see LICENSE for details.
