kloia/ocr-preprocessor

OCR Preprocessor

A Python library for preprocessing scanned documents to improve OCR accuracy. Optimizes image quality through deskewing, denoising, contrast enhancement, and more.

Why OCR Preprocessing?

Raw scanned documents often have issues that reduce OCR accuracy:

  • Skewed pages from misaligned scanning
  • Noise and artifacts from scanner sensors
  • Low contrast making text hard to distinguish
  • Blurry text from poor scan quality

This library applies a preprocessing pipeline that targets each of these issues before the image reaches the OCR engine, which typically improves recognition accuracy.

Features

  • Deskewing - Automatically detect and correct page rotation
  • Denoising - Remove scanner artifacts while preserving text
  • Contrast Enhancement - CLAHE-based enhancement for better text visibility
  • Sharpening - Enhance text edges for clearer recognition
  • Color Preservation - Preserve signatures, stamps, and colored elements
  • PDF Support - Process multi-page PDFs and write the result as a single PDF
  • Configurable Pipeline - Use predefined pipelines or custom steps
  • CLI Tool - Process files directly from command line

Installation

pip install ocr-preprocessor

Or install from source:

git clone https://github.com/kloia/ocr-preprocessor.git
cd ocr-preprocessor
pip install -e .

Quick Start

Python API

from ocr_preprocessor import OCRPreprocessor

processor = OCRPreprocessor()

# Process scanned image
with open("scanned_document.jpg", "rb") as f:
    processed = processor.process_image(f.read())

# Save result (ready for OCR)
with open("ready_for_ocr.png", "wb") as f:
    f.write(processed)

CLI Usage

# Process an image
ocrprep scan.jpg -o processed.png

# Process PDF to PDF
ocrprep document.pdf -o processed.pdf

# Process with verbose output
ocrprep scan.jpg -o output.png --verbose

# Use fast pipeline
ocrprep scan.jpg --pipeline fast

# Custom steps only
ocrprep scan.jpg --steps normalize deskew contrast

Processing Pipeline

The full preprocessing pipeline applies these steps in order:

| Step | Purpose | OCR Benefit |
|------|---------|-------------|
| Normalize | Stabilize pixel intensities | Consistent input for OCR |
| Deskew | Correct page rotation | Better line detection |
| Resize | Scale to optimal resolution | Improved character recognition |
| Denoise | Remove scanner artifacts | Cleaner text boundaries |
| Contrast | CLAHE enhancement | Better text/background separation |
| Sharpen | Enhance text edges | Clearer character shapes |
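The ordering above can be pictured as a chain of functions, each taking an image and returning a new one. The following is only a sketch of that idea, not the library's implementation: it uses nested lists as a stand-in for a grayscale image, and `normalize`, `sharpen`, and `run_steps` are hypothetical names. Min-max stretching is assumed for normalization, and a 1-D kernel is used for sharpening to keep the example short.

```python
from typing import Callable, List

Image = List[List[int]]  # grayscale rows of 0-255 ints (illustrative stand-in)


def normalize(img: Image) -> Image:
    """Min-max stretch: darkest pixel maps to 0, brightest to 255 (assumed method)."""
    flat = [p for row in img for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # uniform image: nothing to stretch
        return [row[:] for row in img]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in img]


def sharpen(img: Image) -> Image:
    """Sharpen each row with the 1-D kernel (-1, 3, -1), clamping to 0-255."""
    out = []
    for row in img:
        new_row = []
        for x, p in enumerate(row):
            left = row[max(x - 1, 0)]               # replicate the edge pixel
            right = row[min(x + 1, len(row) - 1)]
            new_row.append(max(0, min(255, 3 * p - left - right)))
        out.append(new_row)
    return out


def run_steps(img: Image, steps: List[Callable[[Image], Image]]) -> Image:
    """Apply the steps in order, feeding each result into the next step."""
    for step in steps:
        img = step(img)
    return img


# A faint edge (100 vs 150) becomes a full-contrast edge (0 vs 255).
result = run_steps([[100, 100, 150, 150]], [normalize, sharpen])
```

Order matters in a chain like this: sharpening before normalizing would amplify differences on the unstretched range and give a different result.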

Predefined Pipelines

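from ocr_preprocessor import OCRPreprocessor, Pipeline

processor = OCRPreprocessor()

# Raw bytes of the input image (the path is illustrative)
with open("scanned_document.jpg", "rb") as f:
    image_bytes = f.read()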

# Full pipeline (default) - all steps
result = processor.process_image(image_bytes, pipeline=Pipeline.FULL)

# Minimal pipeline - for clean digital PDFs
result = processor.process_image(image_bytes, pipeline=Pipeline.MINIMAL)

# Fast pipeline - balance of speed and quality
result = processor.process_image(image_bytes, pipeline=Pipeline.FAST)

Custom Steps

from ocr_preprocessor import OCRPreprocessor, ProcessingStep

processor = OCRPreprocessor()

# Only apply specific steps; image_bytes holds the raw bytes of the input image
result = processor.process_image(
    image_bytes,
    steps=[
        ProcessingStep.NORMALIZE,
        ProcessingStep.DESKEW,
        ProcessingStep.CONTRAST
    ]
)

Progress Callback

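# Assumes `processor` is an OCRPreprocessor instance and
# `image_bytes` holds the raw bytes of the input image
def on_progress(current: int, total: int, step_name: str) -> None: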
    print(f"Step {current}/{total}: {step_name}")

result = processor.process_image(
    image_bytes,
    progress_callback=on_progress
)
# Output:
# Step 1/6: normalize
# Step 2/6: deskew
# ...
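To make the callback contract concrete, here is a minimal sketch of how a step runner might drive a callback with the `(current, total, step_name)` signature shown above. `run_with_progress` is a hypothetical helper, not part of the library, and whether the real callback fires before or after each step is an assumption here (this sketch calls it before).

```python
from typing import Any, Callable, List, Optional, Tuple

ProgressCallback = Callable[[int, int, str], None]


def run_with_progress(
    data: Any,
    steps: List[Tuple[str, Callable[[Any], Any]]],
    progress_callback: Optional[ProgressCallback] = None,
) -> Any:
    """Run named steps in order, reporting 1-based progress before each one."""
    total = len(steps)
    for current, (name, step) in enumerate(steps, start=1):
        if progress_callback is not None:
            progress_callback(current, total, name)
        data = step(data)
    return data


# Record the callback invocations instead of printing them.
events: List[Tuple[int, int, str]] = []
output = run_with_progress(
    3,
    [("double", lambda x: x * 2), ("increment", lambda x: x + 1)],
    progress_callback=lambda current, total, name: events.append((current, total, name)),
)
```

Collecting events in a list like this is also a convenient way to feed a progress bar or a log line from the real `progress_callback` parameter.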

PDF Processing

Process PDF to PDF

from ocr_preprocessor import OCRPreprocessor, convert_pdf_all_pages, images_to_pdf

processor = OCRPreprocessor()

with open("document.pdf", "rb") as f:
    pdf_bytes = f.read()

# Convert pages to images
pages = convert_pdf_all_pages(pdf_bytes)

# Process each page
processed = [processor.process_image(p) for p in pages]

# Combine back to PDF
result_pdf = images_to_pdf(processed)

with open("processed.pdf", "wb") as f:
    f.write(result_pdf)
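With multi-page documents, a single bad page can abort the whole list comprehension above. One possible way to handle that is to fall back to the unprocessed page and record which pages failed. The helper below is a hypothetical sketch: `process_pages` is not part of the library, and it takes the per-page function as a parameter so it can wrap `processor.process_image` or anything else.

```python
from typing import Callable, List, Tuple


def process_pages(
    pages: List[bytes],
    process_page: Callable[[bytes], bytes],
) -> Tuple[List[bytes], List[int]]:
    """Process every page; keep the original bytes for pages that fail.

    Returns the page list (same length and order as the input) and the
    1-based numbers of the pages that could not be processed.
    """
    results: List[bytes] = []
    failed: List[int] = []
    for number, page in enumerate(pages, start=1):
        try:
            results.append(process_page(page))
        except Exception:  # in real use, narrow this to OCRPreprocessorError
            results.append(page)
            failed.append(number)
    return results, failed


# Stand-in page processor for demonstration: rejects empty pages.
def fake_process(page: bytes) -> bytes:
    if not page:
        raise ValueError("empty page")
    return page.upper()


processed_pages, failed_pages = process_pages([b"a", b"", b"c"], fake_process)
```

In the PDF example, `process_pages(pages, processor.process_image)` would slot in where the list comprehension is, and the first returned list can be passed to `images_to_pdf`.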

CLI: PDF to PDF

# Single command - outputs processed PDF
ocrprep document.pdf -o processed.pdf

# Output individual pages instead
ocrprep document.pdf --output ./pages/

API Reference

OCRPreprocessor

processor = OCRPreprocessor(
    min_width=1000,  # Minimum output width (pixels)
    max_width=3000   # Maximum output width (pixels)
)
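How `min_width` and `max_width` feed into the resize step is not spelled out here. One plausible reading, shown purely as an assumption, is that the image width is clamped into that range and the height is scaled by the same factor. `clamp_size` is a hypothetical helper for illustration.

```python
from typing import Tuple


def clamp_size(
    width: int,
    height: int,
    min_width: int = 1000,
    max_width: int = 3000,
) -> Tuple[int, int]:
    """Return (new_width, new_height) with the width clamped to
    [min_width, max_width] and the aspect ratio preserved."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target_width = min(max(width, min_width), max_width)
    scale = target_width / width
    return target_width, max(1, round(height * scale))


small = clamp_size(500, 700)      # upscaled to the minimum width
large = clamp_size(6000, 4000)    # downscaled to the maximum width
in_range = clamp_size(2000, 1000) # already in range, left unchanged
```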

Methods

| Method | Description |
|--------|-------------|
| process_image(bytes, ...) -> bytes | Process image with configurable pipeline |
| process_image_minimal(bytes) -> bytes | Light preprocessing (for clean PDFs) |
| process(bytes, file_type) -> bytes | Auto-detect and process |
| process_array(ndarray) -> ndarray | Process numpy array directly |

Enums

from ocr_preprocessor import ProcessingStep, Pipeline, OutputFormat

# Processing steps
ProcessingStep.NORMALIZE  # Normalize pixel intensities
ProcessingStep.DESKEW     # Fix rotation
ProcessingStep.RESIZE     # Scale to target size
ProcessingStep.DENOISE    # Remove noise
ProcessingStep.CONTRAST   # CLAHE enhancement
ProcessingStep.SHARPEN    # Sharpen edges

# Pipelines
Pipeline.FULL     # All steps (default)
Pipeline.MINIMAL  # Normalize + Resize + Light Contrast
Pipeline.FAST     # Normalize + Resize + Contrast

# Output formats
OutputFormat.PNG   # Lossless (default)
OutputFormat.JPEG  # Lossy, smaller size
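The comments on the pipelines above imply a mapping from each pipeline to a list of steps. The sketch below re-declares stand-in enums so it runs without the library. The member values, the `PIPELINE_STEPS` table, and `resolve_steps` are all hypothetical, and the rule that an explicit `steps` list takes precedence over `pipeline` is an assumption. The "Light Contrast" of `MINIMAL` is represented by the plain `CONTRAST` member, since how the lighter setting is parameterized is not stated.

```python
from enum import Enum
from typing import Dict, List, Optional, Sequence


class ProcessingStep(Enum):  # stand-in, declared in the documented full-pipeline order
    NORMALIZE = "normalize"
    DESKEW = "deskew"
    RESIZE = "resize"
    DENOISE = "denoise"
    CONTRAST = "contrast"
    SHARPEN = "sharpen"


class Pipeline(Enum):  # stand-in
    FULL = "full"
    MINIMAL = "minimal"
    FAST = "fast"


PIPELINE_STEPS: Dict[Pipeline, List[ProcessingStep]] = {
    Pipeline.FULL: list(ProcessingStep),
    Pipeline.MINIMAL: [ProcessingStep.NORMALIZE, ProcessingStep.RESIZE, ProcessingStep.CONTRAST],
    Pipeline.FAST: [ProcessingStep.NORMALIZE, ProcessingStep.RESIZE, ProcessingStep.CONTRAST],
}


def resolve_steps(
    pipeline: Pipeline = Pipeline.FULL,
    steps: Optional[Sequence[ProcessingStep]] = None,
) -> List[ProcessingStep]:
    """Pick the step list: explicit steps win (assumed), else the pipeline's steps."""
    if steps is not None:
        return list(steps)
    return list(PIPELINE_STEPS[pipeline])


fast_steps = [step.value for step in resolve_steps(Pipeline.FAST)]
```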

PDF Functions

| Function | Description |
|----------|-------------|
| convert_pdf_first_page(bytes, dpi=200) -> bytes | Convert first page only |
| convert_pdf_all_pages(bytes, dpi=200) -> list[bytes] | Convert all pages |
| convert_pdf_page(bytes, page_number, dpi=200) -> bytes | Convert specific page |
| get_pdf_page_count(bytes) -> int | Get page count |
| images_to_pdf(list[bytes]) -> bytes | Combine images into PDF |

Exceptions

from ocr_preprocessor import (
    OCRPreprocessorError,  # Base exception
    ImageDecodeError,      # Failed to decode image
    ImageEncodeError,      # Failed to encode image
    PDFConversionError,    # PDF conversion failed
    PDFInfoError,          # Cannot read PDF info
    InvalidFileTypeError,  # Unsupported file type
    ProcessingStepError,   # Processing step failed
)

Color Preservation

The library automatically detects colored regions (signatures, stamps, highlights) using HSV saturation analysis. These areas receive lighter processing to preserve their appearance while text areas get stronger enhancement for better OCR.
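The idea behind HSV saturation analysis can be sketched with the standard library alone: gray, black, and white pixels have near-zero saturation, while ink from stamps or colored pens does not. This is an illustration only, not the library's detector. `saturation_mask` and both thresholds are assumptions, and the extra brightness check is added here because saturation is unreliable for very dark pixels.

```python
import colorsys
from typing import List, Tuple

RGB = Tuple[int, int, int]


def saturation_mask(
    pixels: List[List[RGB]],
    min_saturation: float = 0.3,  # assumed threshold
    min_value: float = 0.2,       # assumed threshold; ignores near-black pixels
) -> List[List[bool]]:
    """Mark pixels that look colored: saturated enough and not too dark."""
    mask = []
    for row in pixels:
        mask_row = []
        for r, g, b in row:
            _hue, sat, val = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            mask_row.append(sat > min_saturation and val > min_value)
        mask.append(mask_row)
    return mask


row = [
    (0, 0, 0),        # black text
    (255, 255, 255),  # white paper
    (20, 40, 200),    # blue ink, e.g. a signature
    (10, 5, 5),       # near-black with a slight tint
]
mask = saturation_mask([row])
```

A mask like this could then be used to blend lightly processed pixels back into the colored regions while the rest of the page gets the full enhancement.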

Integration with OCR Engines

After preprocessing, you can pass the output to any OCR engine:

from ocr_preprocessor import OCRPreprocessor
import pytesseract  # or any other OCR library

processor = OCRPreprocessor()

with open("scan.jpg", "rb") as f:
    processed = processor.process_image(f.read())

# Save and OCR
with open("processed.png", "wb") as f:
    f.write(processed)

text = pytesseract.image_to_string("processed.png")

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Lint
ruff check src/

# Type check
mypy src/

License

MIT License - see LICENSE for details.
