# zrsvn-rag-preprocessing

A system for preprocessing documents for RAG (Retrieval-Augmented Generation) applications.

The system implements a four-phase pipeline for preprocessing PDF documents:
- Phase 1: Parsing PDF documents and generating JSON structures
- Phase 2: Data preprocessing and insertion into a PostgreSQL database
- Phase 3: Generation of metadata (keywords, summaries, descriptions) using LLMs
- Phase 4: Generation of vector embeddings for semantic search
## Features

- Downloading PDF documents from S3/MinIO storage
- Extraction of text, images, and tables from PDFs using Docling
- Segmentation of text into optimal-length chunks
- LLM-generated keywords and summaries
- Language detection for text blocks
- Generation of 768-dimensional embeddings (using the BAAI/bge-m3 model by default)
- Hierarchical metadata structure (document → section → element)
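The repository's actual chunking logic is not shown here; as an illustration, a minimal sentence-boundary chunker might look like the following (the function name and `max_chars` parameter are hypothetical, not the repo's API):

```python
import re
from typing import List

def chunk_text(text: str, max_chars: int = 500) -> List[str]:
    """Split text into chunks of at most max_chars characters,
    breaking on sentence boundaries where possible (sketch only)."""
    # Naive sentence split on ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip() if current else sentence
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than `max_chars` becomes its own oversized chunk; a production chunker would also handle hard splits and overlap between chunks.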
## Requirements

- Python 3.8+
- PostgreSQL database with the schema shown below
- MinIO/S3 storage
- CUDA-capable GPU (recommended, for faster embedding generation)
- Azure OpenAI API access
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/gregorgatej/zrsvn-rag-preprocessing.git
   cd zrsvn-rag-preprocessing
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Create a `.env` file with the following variables:

   ```env
   S3_ACCESS_KEY=your_s3_access_key
   S3_SECRET_ACCESS_KEY=your_s3_secret_key
   POSTGRES_PASSWORD=your_postgres_password
   AZURE_OPENAI_API_KEY=your_azure_openai_key
   AZURE_OPENAI_ENDPOINT=your_azure_endpoint
   ```

4. Prepare the PostgreSQL database with the schema shown below.
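How the pipeline actually reads the `.env` file is not shown (python-dotenv is a common choice); a minimal stdlib-only loader for `KEY=VALUE` lines would look like this sketch:

```python
import os

def load_env(path: str = ".env") -> None:
    """Export KEY=VALUE lines from a .env file into os.environ.
    Existing environment variables are not overwritten (sketch only)."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blank lines, comments, and malformed lines
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```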
## Usage

Run all phases sequentially:

```bash
python pipeline/all_phases_flow.py
```

Or run individual phases:

```bash
python pipeline/phase1_flow.py
python pipeline/phase2_flow.py
python pipeline/phase3_flow.py
python pipeline/phase4_flow.py
```

## Database schema

The system creates a hierarchical data structure in the PostgreSQL database:
- `files` - basic document metadata
- `sections` - sections within documents
- `section_elements` - individual elements (paragraphs, images, tables)
- `text_chunks` - optimized text blocks for RAG
- `embeddings` - vector representations for semantic search
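The authoritative DDL is the dbdiagram design referenced below; purely to illustrate the parent-child chain (the column names here are assumed, and SQLite stands in for PostgreSQL, which would use a pgvector column on `embeddings`):

```python
import sqlite3

# Hypothetical, heavily simplified columns; see the repo's schema for the
# real PostgreSQL definitions.
SCHEMA = """
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    filename TEXT NOT NULL
);
CREATE TABLE sections (
    id INTEGER PRIMARY KEY,
    file_id INTEGER NOT NULL REFERENCES files(id),
    title TEXT
);
CREATE TABLE section_elements (
    id INTEGER PRIMARY KEY,
    section_id INTEGER NOT NULL REFERENCES sections(id),
    element_type TEXT  -- paragraph, image, or table
);
CREATE TABLE text_chunks (
    id INTEGER PRIMARY KEY,
    element_id INTEGER NOT NULL REFERENCES section_elements(id),
    content TEXT NOT NULL
);
CREATE TABLE embeddings (
    id INTEGER PRIMARY KEY,
    chunk_id INTEGER NOT NULL REFERENCES text_chunks(id),
    vector BLOB  -- a pgvector vector(768) column in PostgreSQL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```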
Diagram generated with dbdiagram.io
## Orchestration

The pipeline uses Prefect to manage data flows, with built-in support for:
- Automatic retries on errors (3 retries with a 2-second delay)
- Progress tracking via JSON files
- Sequential execution of phases
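In Prefect this retry policy is expressed as task options (e.g. `retries` and `retry_delay_seconds`). To show the behavior without a Prefect dependency, here is a dependency-free sketch of the same 3-retries/2-second policy (the decorator name is hypothetical):

```python
import time
from functools import wraps

def with_retries(retries: int = 3, delay: float = 2.0):
    """Retry a function up to `retries` extra times, sleeping `delay`
    seconds between attempts (mirrors the pipeline's 3x / 2 s policy)."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_exc = None
            for attempt in range(retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:  # real code would catch narrower errors
                    last_exc = exc
                    if attempt < retries:
                        time.sleep(delay)
            raise last_exc
        return wrapper
    return decorator
```

Applied to a flaky step, the wrapped function is attempted up to four times in total before the final exception propagates.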