A robust, professionally structured Python-based web application (Flask) that performs asynchronous Optical Character Recognition (OCR) on uploaded documents, allows text editing, and converts the final content into a secure, downloadable, multi-page PDF.
Marjory D. Marquez
The project has been significantly enhanced for better document handling:
- Asynchronous OCR: Uses Python Threading to handle long-running OCR tasks in the background, ensuring the web interface remains responsive and can be polled for job status.
- Language Detection & Correction: Automatically detects the source document's language and applies Spell Checking to the OCR output for superior accuracy.
- Robust Document Scanning: Supports single-page images (`PNG`, `JPG`) and multi-page documents (`PDF`, `TIFF`).
- Reliable OCR: Uses the powerful Tesseract engine for accurate text extraction.
- Zero-Dependency PDF/TIFF Rendering: Uses PyMuPDF (`fitz`) to reliably process multi-page PDF and TIFF files without requiring external dependencies like Ghostscript.
- Secure File Handling: Implemented in `security.py` to prevent directory traversal attacks and sanitize all filenames.
- Rate Limiting: Protects the resource-intensive `/upload` endpoint using Flask-Limiter to prevent abuse.
- Modular Architecture: Code is professionally organized into dedicated modules (`utils.py`, `security.py`) and uses a robust, environment-aware Configuration Class (`config.py`).
- Automatic Cleanup: Files in the `uploads` directory are automatically deleted after a set time (default: 1 hour) to manage disk space.
- Unit Testing & CI: Includes a comprehensive `test_ocr.py` file and a GitHub Actions workflow for Continuous Integration.
- Multi-Page PDF Generation: Converts the final, edited text, preserving page breaks, into a clean PDF using fpdf.
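
The asynchronous OCR feature listed above follows a common pattern: start a background thread per upload, record its state under a job id, and let the client poll that id. The sketch below illustrates the pattern with the standard library only. The names `start_ocr_job`, `get_job_status`, `fake_ocr`, and the in-memory `_jobs` store are hypothetical and are not taken from the project's code; the real application may track jobs differently.

```python
import threading
import time
import uuid

# Hypothetical in-memory job store; the real app may track jobs differently.
_jobs = {}
_jobs_lock = threading.Lock()


def fake_ocr(path):
    """Stand-in for the real OCR call; returns placeholder text."""
    time.sleep(0.1)  # simulate slow OCR work
    return f"text extracted from {path}"


def _run_job(job_id, path, ocr_func):
    """Worker body: run OCR and record the outcome under the job id."""
    try:
        update = {"status": "done", "text": ocr_func(path)}
    except Exception as exc:  # report any failure to the poller
        update = {"status": "error", "error": str(exc)}
    with _jobs_lock:
        _jobs[job_id].update(update)


def start_ocr_job(path, ocr_func=fake_ocr):
    """Register a job, start it on a daemon thread, and return its id at once."""
    job_id = uuid.uuid4().hex
    with _jobs_lock:
        _jobs[job_id] = {"status": "processing"}
    thread = threading.Thread(
        target=_run_job, args=(job_id, path, ocr_func), daemon=True
    )
    thread.start()
    return job_id


def get_job_status(job_id):
    """Return a copy of the job record, or an 'unknown' status for bad ids."""
    with _jobs_lock:
        job = _jobs.get(job_id)
        return dict(job) if job is not None else {"status": "unknown"}
```

In a Flask app, an upload route would typically return the job id immediately and a separate status route would return `get_job_status(job_id)` as JSON. An in-memory dictionary like this one is lost on restart and is not shared between worker processes, so it suits a single-process deployment only.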
- Python 3.8+
- Tesseract OCR: Must be installed on your system and accessible via the command line.
  - Installation: Follow the instructions for your OS (e.g., `sudo apt install tesseract-ocr` on Debian/Ubuntu).
  - Configuration: If Tesseract is not in your system PATH, you may need to set the `TESSERACT_CMD` environment variable.
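
One plausible way an application can honor `TESSERACT_CMD` is to prefer the explicit value and fall back to a PATH search. The sketch below shows that lookup with the standard library. The function name `resolve_tesseract_cmd` is hypothetical, and the precedence order is an assumption rather than the project's documented behavior.

```python
import os
import shutil


def resolve_tesseract_cmd(env=None):
    """Return the Tesseract executable to use, or None if none is found.

    Assumption: an explicit TESSERACT_CMD value wins over a PATH lookup.
    """
    env = os.environ if env is None else env
    explicit = env.get("TESSERACT_CMD", "").strip()
    if explicit:
        return explicit
    # Fall back to searching PATH for a `tesseract` binary.
    return shutil.which("tesseract", path=env.get("PATH"))


if __name__ == "__main__":
    cmd = resolve_tesseract_cmd()
    print(cmd or "Tesseract not found; install it or set TESSERACT_CMD")
```

If the project drives Tesseract through the `pytesseract` wrapper (an assumption), the resolved path would typically be assigned to `pytesseract.pytesseract.tesseract_cmd` before the first OCR call.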
All necessary Python libraries can be installed using pip from the project's `requirements.txt` file:

    pip install -r requirements.txt

## Setup and Run

    git clone https://github.com/YourUsername/ImageToPDF_App.git
    cd ImageToPDF_App

Create a file named .env in the root directory and set the following variables.
Crucial: Use a strong, random key for FLASK_SECRET_KEY in any non-development environment.
The `FLASK_ENV` variable controls which configuration settings (Development, Testing, or Production) are loaded via `config.py`.

    FLASK_ENV="development"
    FLASK_SECRET_KEY="A_VERY_LONG_RANDOM_KEY_HERE"

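
To make the role of `FLASK_ENV` concrete, here is a minimal sketch of how an environment-aware configuration module can select settings. The class names, the attributes, and the `get_config` helper are illustrative assumptions; the project's actual `config.py` may be organized differently. The sketch reads from the process environment, so it assumes the `.env` values have already been loaded.

```python
import os


class Config:
    """Settings shared by every environment (attribute names are illustrative)."""
    SECRET_KEY = os.environ.get("FLASK_SECRET_KEY", "dev-only-insecure-key")
    DEBUG = False
    TESTING = False


class DevelopmentConfig(Config):
    DEBUG = True


class TestingConfig(Config):
    TESTING = True


class ProductionConfig(Config):
    pass


# Maps FLASK_ENV values to configuration classes.
_CONFIGS = {
    "development": DevelopmentConfig,
    "testing": TestingConfig,
    "production": ProductionConfig,
}


def get_config(env_name=None):
    """Pick a config class from FLASK_ENV, defaulting to development."""
    name = (env_name or os.environ.get("FLASK_ENV", "development")).lower()
    try:
        return _CONFIGS[name]
    except KeyError:
        raise ValueError(f"Unknown FLASK_ENV: {name!r}") from None
```

A Flask app can load the selected class with `app.config.from_object(get_config())`. For the secret key itself, `python -c "import secrets; print(secrets.token_hex(32))"` prints a suitable random value.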
Ensure your virtual environment is active, then install the required Python packages:

    pip install -r requirements.txt

Open your web browser and navigate to:
Verify the project's integrity by running the unit tests:

    python test_ocr.py -v
