gregorgatej/zrsvn-rag-preprocessing

ZRSVN RAG Preprocessing

System for preprocessing documents for RAG (Retrieval Augmented Generation) applications.

About the project

The system implements a four-phase pipeline for preprocessing PDF documents:

  • Phase 1: Parsing PDF documents and generating JSON structures
  • Phase 2: Data preprocessing and insertion into a PostgreSQL database
  • Phase 3: Generation of metadata (keywords, summaries, descriptions) using LLMs
  • Phase 4: Generation of vector embeddings for semantic search
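To illustrate what the Phase 4 embeddings are for, here is a minimal sketch of semantic search by cosine similarity. It is not taken from the repository: the function names `cosine_similarity` and `rank_chunks` are hypothetical, the vectors are toy 2-dimensional values, and a real deployment would more likely run the similarity search inside the database.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_chunks(
    query: Sequence[float],
    chunks: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[str]:
    """Return the ids of the top_k chunks most similar to the query vector."""
    scored = sorted(
        chunks.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:top_k]]
```

In this sketch the query text would first be embedded with the same model as the stored chunks, so that both vectors live in the same space.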

Features

  • Downloading PDF documents from S3/MinIO storage
  • Extraction of text, images, and tables from PDFs using Docling
  • Segmentation of text into optimal-length chunks
  • LLM-generated keywords and summaries
  • Detection of the language of text blocks
  • Generation of 768-dimensional embeddings (using the BAAI/bge-m3 model by default)
  • Hierarchical metadata structure (document → section → element)
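The text segmentation feature can be pictured with a simple word-based chunker. This is only a sketch under assumptions: the README does not say how "optimal length" is determined, so the `max_words` and `overlap` parameters and the `chunk_words` name are hypothetical and are not the repository's actual strategy.

```python
from typing import List


def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears with some context in the next chunk.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

A token-based limit (matching the embedding model's tokenizer) would usually be a better fit than a word count; words are used here only to keep the example dependency-free.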

Technical requirements

  • Python 3.8+
  • PostgreSQL database set up with the schema shown in the "DB schema" section
  • MinIO/S3 storage
  • Ideally a CUDA-compatible GPU (for faster embedding generation)
  • Azure OpenAI API access

Installation

  1. Clone the repository:
git clone https://github.com/gregorgatej/zrsvn-rag-preprocessing.git
cd zrsvn-rag-preprocessing
  2. Install dependencies:
pip install -r requirements.txt
  3. Create a .env file with the following variables:
S3_ACCESS_KEY=your_s3_access_key
S3_SECRET_ACCESS_KEY=your_s3_secret_key
POSTGRES_PASSWORD=your_postgres_password
AZURE_OPENAI_API_KEY=your_azure_openai_key
AZURE_OPENAI_ENDPOINT=your_azure_endpoint
  4. Prepare the PostgreSQL database using the schema shown in the "DB schema" section.
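Before running the pipeline it can help to check that the .env file defines every variable listed above. The following is a minimal, dependency-free sketch; the helper names `parse_env_text` and `missing_vars` are hypothetical, and the repository may load its configuration differently (for example through a dotenv library).

```python
from typing import Dict, List, Mapping

# Variable names taken from the installation step above.
REQUIRED_VARS = (
    "S3_ACCESS_KEY",
    "S3_SECRET_ACCESS_KEY",
    "POSTGRES_PASSWORD",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
)


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_vars(env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are absent or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Passing `os.environ` to `missing_vars` performs the same check against the current process environment.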

Usage

Running the full pipeline

python pipeline/all_phases_flow.py

Running individual phases

python pipeline/phase1_flow.py
python pipeline/phase2_flow.py
python pipeline/phase3_flow.py
python pipeline/phase4_flow.py

Data structure

The system creates a hierarchical data structure in the PostgreSQL database:

  • files - basic document metadata
  • sections - sections within documents
  • section_elements - individual elements (paragraphs, images, tables)
  • text_chunks - optimized text blocks for RAG
  • embeddings - vector representations for semantic search
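The hierarchy can be mirrored in memory with nested dataclasses. This is an illustrative sketch only: the class and field names are assumptions rather than the actual column names, and the embedding is folded into the chunk here even though the database keeps it in a separate embeddings table.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class TextChunk:
    text: str
    embedding: Optional[List[float]] = None  # filled in by the embedding phase


@dataclass
class SectionElement:
    kind: str  # e.g. "paragraph", "image" or "table"
    chunks: List[TextChunk] = field(default_factory=list)


@dataclass
class Section:
    title: str
    elements: List[SectionElement] = field(default_factory=list)


@dataclass
class File:
    name: str
    sections: List[Section] = field(default_factory=list)


def iter_chunks(doc: File) -> Iterator[Tuple[str, str, TextChunk]]:
    """Walk file -> section -> element -> chunk, yielding each chunk with its context."""
    for section in doc.sections:
        for element in section.elements:
            for chunk in element.chunks:
                yield section.title, element.kind, chunk
```

Keeping the section title and element type next to each chunk is what lets a retrieval step return context along with the matching text.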

DB schema

[Figure: database schema diagram, generated with dbdiagram.io]

Orchestration

The pipeline uses Prefect to manage data flows with built-in support for:

  • Automatic retries on failure (up to 3 retries, with a 2-second delay between attempts)
  • Progress tracking with JSON files
  • Sequential execution of phases
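The three behaviours above can be sketched in plain Python. This example deliberately does not use Prefect, so it runs without extra dependencies; in Prefect itself, retries are configured through the task decorator's `retries` and `retry_delay_seconds` parameters. The names `run_with_retries` and `run_phases` and the layout of the progress file are assumptions, not the repository's code.

```python
import json
import time
from pathlib import Path
from typing import Callable, Dict, Sequence, Tuple


def run_with_retries(fn: Callable[[], None], retries: int = 3, delay_s: float = 2.0) -> None:
    """Call fn; after a failure, retry up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            fn()
            return
        except Exception:
            if attempt == retries:
                raise  # out of retries: let the error surface
            time.sleep(delay_s)


def run_phases(
    phases: Sequence[Tuple[str, Callable[[], None]]],
    progress_path: Path,
    retries: int = 3,
    delay_s: float = 2.0,
) -> Dict[str, bool]:
    """Run phases in order, recording each completed phase in a JSON file."""
    done: Dict[str, bool] = (
        json.loads(progress_path.read_text()) if progress_path.exists() else {}
    )
    for name, fn in phases:
        if done.get(name):
            continue  # finished in an earlier run, skip it
        run_with_retries(fn, retries, delay_s)
        done[name] = True
        progress_path.write_text(json.dumps(done))
    return done
```

Writing the progress file after every phase means an interrupted run can resume from the first unfinished phase instead of starting over.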
