```bash
uv sync
uv run python main.py --prompt V4 --model "configs/llms/openai/G41-mini.yaml" --input "input/pt-BR/ptBR_Final_Data_classification.xlsx" --language pt-BR
```

Results will be saved under `output/classification/`.

This project provides an automated system for classifying life goals using Large Language Models (LLMs). It supports batch processing of Excel datasets, multiple prompt versions (V1–V6), multilingual configurations, and transparent logging of system prompts, user prompts, and classification outputs.
The system is designed for controlled experimentation on how prompt structure affects model behavior and classification outcomes.
```
├── main.py                   # Main classification script
├── run_all_combinations.sh   # Run all configuration combinations
├── evaluation_main.py        # Compare human and LLM classifications
├── kappa.py                  # Cohen's Kappa for agreement testing
├── Unified_Data.py           # Data merging utility
├── Age_Check.py              # Age validation
├── translate_EN.py           # English translation of goal texts
├── lifeproject/
│   ├── classifier_batched.py # Core batched LLM classification logic; builds the user prompt
│   ├── prompt_builder.py     # Builds system prompts
│   ├── llm.py                # LLM configuration and async client
│   └── config.py             # YAML-based model loader
├── configs/
│   ├── llms/
│   │   └── openai/
│   │       ├── G4O-mini.yaml
│   │       ├── G41-mini.yaml
│   │       └── G5.yaml
│   └── prompt/
│       ├── taskset/
│       ├── role.txt
│       ├── ...
│       ├── other prompt component.txt
│       ├── language_hint/
│       └── codebook/
├── input/                    # Input files for classification and evaluation
│   ├── pt-BR/
│   └── ZH/
├── output/
│   ├── classification/       # Classification results
│   ├── evaluation/           # Evaluation reports
│   ├── kappa/                # Kappa statistics
│   └── prompt/               # Saved prompts used in runs
├── requirements.txt
├── pyproject.toml
└── uv.lock
```

The project uses uv for dependency management.
Windows:

```bash
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```

Linux/macOS:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Install the dependencies:

```bash
uv sync
```

Or, for quick testing:

```bash
uv pip install -r requirements.in
```

Create a `.env` file:

```
OPENAI_API_KEY=your_api_key_here
```

Run a classification:

```bash
uv run python main.py \
  --prompt V5 \
  --model "configs/llms/openai/G41-mini.yaml" \
  --input "input/pt-BR/ptBR_Final_Data_classification.xlsx" \
  --language "configs/prompt/language_hint/pt-BR.txt"
```

Run all configuration combinations:

```bash
chmod +x run_all_combinations.sh
./run_all_combinations.sh
```

Run the evaluation and agreement analysis:

```bash
uv run python evaluation_main.py
uv run python kappa.py
```

Model YAML files are stored under `configs/llms/`. Each file defines:
- API endpoint and key
- Model name
- Temperature
- Token limits
- Pricing and concurrency
Switch models by passing a different YAML file to the `--model` argument.
More details in `configs/llms/README.md`.
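As an illustration of how such a file might be consumed, here is a minimal sketch of a YAML-based loader in the spirit of `LLMConfigManager.from_yaml()`; the field names (`model_name`, `temperature`, `max_tokens`, `max_concurrency`) are assumptions, not the project's actual schema:

```python
# Minimal sketch of a YAML-based model config loader.
# NOTE: field names below are illustrative assumptions, not the real schema.
from dataclasses import dataclass

import yaml


@dataclass
class LLMConfig:
    model_name: str
    temperature: float
    max_tokens: int
    max_concurrency: int

    @classmethod
    def from_yaml(cls, path: str) -> "LLMConfig":
        with open(path, encoding="utf-8") as f:
            raw = yaml.safe_load(f)
        return cls(
            model_name=raw["model_name"],
            temperature=raw.get("temperature", 0.0),
            max_tokens=raw.get("max_tokens", 1024),
            max_concurrency=raw.get("max_concurrency", 5),
        )
```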
Prompt files are under `configs/prompt/`. Switch prompt versions with the `--prompt` argument (e.g., `V1`–`V6`).
More details in `configs/prompt/README.md`.
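For intuition, a system prompt can be assembled by concatenating component files from this directory. The sketch below is a hypothetical version of what `prompt_builder.load_prompt_components()` might do; the file names, default arguments, and ordering are assumptions based on the `configs/prompt/` layout:

```python
# Hypothetical sketch: assemble a system prompt from component files.
# File names and ordering are assumptions based on the configs/prompt/ layout.
from pathlib import Path


def load_prompt_components(version: str = "V4",
                           language_hint: str = "language_hint/pt-BR.txt",
                           codebook: str = "codebook/codebook_en.txt") -> str:
    base = Path("configs/prompt")
    parts = [
        (base / "role.txt").read_text(encoding="utf-8"),
        (base / "taskset" / f"{version}.txt").read_text(encoding="utf-8"),
        (base / language_hint).read_text(encoding="utf-8"),
        (base / codebook).read_text(encoding="utf-8"),
    ]
    return "\n\n".join(parts)
```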
`main.py` controls data loading, prompt construction, model calls, and output saving.
Main workflow:
- Parse command-line arguments (paths for the input file, taskset, language, and model)
- Load environment variables (e.g., the API key)
- Load the model configuration with `LLMConfigManager.from_yaml()`
- Build the system prompt with `prompt_builder.load_prompt_components()`
- Read the input Excel data and extract goal texts (goal1–goal15)
- Call `get_batched_model_response()` for batch classification (see the sketch after this list)
- Save model outputs (classification + reasoning + token usage) as Excel/CSV files
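The batching step is the performance-critical part. Below is a minimal sketch of the underlying pattern, concurrency-limited async calls through the OpenAI client; it is not the project's actual `get_batched_model_response()` implementation, and the model name and concurrency cap are placeholders:

```python
# Sketch of concurrency-limited batch classification with an async client.
# Not the project's get_batched_model_response(); names are placeholders.
import asyncio

from openai import AsyncOpenAI


async def classify_goals(goals: list[str], system_prompt: str,
                         model: str = "gpt-4.1-mini",
                         max_concurrency: int = 5) -> list[str]:
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    sem = asyncio.Semaphore(max_concurrency)  # cap in-flight requests

    async def classify(goal: str) -> str:
        async with sem:
            resp = await client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": goal},
                ],
            )
            return resp.choices[0].message.content

    # Launch all requests; the semaphore limits how many run at once.
    return await asyncio.gather(*(classify(g) for g in goals))
```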
`evaluation_main.py` compares human and LLM classifications goal by goal and generates a mismatch report. Workflow:
- Load the data (human, LLM, and optional English translation)
- Reshape all datasets to long format with `load_and_reshape()`
- Merge by `id` and `loc` (same individual and goal)
- Normalize multi-label categories (e.g., "IR,WEC")
- Compute accuracy (exact matches)
- Generate a mismatch report for all differing cases (see the sketch after this list)
- Save evaluation results under `output/evaluation/`
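The core of this comparison is a long-format merge plus label normalization. The following pandas sketch illustrates the idea; the column names (`id`, `loc`, the label columns), file paths, and the order-insensitive normalization rule are assumptions, not the script's actual interface:

```python
# Illustrative pandas sketch of the evaluation merge and accuracy computation.
# Column names, paths, and normalization are assumptions.
import pandas as pd


def load_and_reshape(path: str, value_name: str) -> pd.DataFrame:
    # Wide -> long: one row per (id, goal position).
    df = pd.read_excel(path)
    return df.melt(id_vars="id", var_name="loc", value_name=value_name)


def normalize(label: str) -> str:
    # Make multi-label strings such as "WEC,IR" order-insensitive.
    return ",".join(sorted(str(label).replace(" ", "").split(",")))


human = load_and_reshape("human.xlsx", "human_label")
llm = load_and_reshape("llm.xlsx", "llm_label")
merged = human.merge(llm, on=["id", "loc"])  # same individual and goal

match = merged["human_label"].map(normalize) == merged["llm_label"].map(normalize)
accuracy = match.mean()        # share of exact matches
mismatches = merged[~match]    # basis for the mismatch report
```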
`kappa.py` calculates Cohen’s Kappa per category, the mean and weighted Kappa, and statistical significance. Workflow:
- Read both Excel files (human vs. LLM)
- Extract all `LPSgoalX_category` columns
- Identify all unique category labels
- Build binary matrices for each label (1 = present, 0 = absent; see the sketch after this list)
- Calculate Kappa, standard error, and p-value
- Compute the mean and weighted mean Kappa
- Save summarized results to `output/kappa/`
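Conceptually, each category becomes a binary rater decision that is scored separately. Here is a small sketch using scikit-learn's `cohen_kappa_score`; this is an assumption about the computation (the script may compute Kappa differently, and it additionally reports standard errors and p-values):

```python
# Sketch of per-category Cohen's Kappa over binary indicator vectors.
# Uses scikit-learn for the Kappa itself; kappa.py may compute it differently.
from sklearn.metrics import cohen_kappa_score


def per_category_kappa(human: list[set[str]], llm: list[set[str]]) -> dict[str, float]:
    # human / llm: one set of category labels per goal, e.g. [{"IR", "WEC"}, ...]
    categories = sorted(set().union(*human, *llm))
    results = {}
    for cat in categories:
        h = [int(cat in labels) for labels in human]  # 1 = present, 0 = absent
        m = [int(cat in labels) for labels in llm]
        results[cat] = cohen_kappa_score(h, m)
    return results
```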
- `main.py` — pipeline controller
- `config.py` + `llm.py` — model configuration and API management
- `prompt_builder.py` — system prompt construction
- `classifier_batched.py` — batch classification execution and user prompt construction
- `evaluation_main.py` + `kappa.py` — evaluation and agreement analysis
| Type | Directory |
|---|---|
| Classification | output/classification/ |
| Evaluation | output/evaluation/ |
| Kappa | output/kappa/ |
| Prompt | output/prompt/ |
- Default codebook: `configs/prompt/codebook/codebook_en.txt`
- Supports multilingual inputs (currently `pt-BR` and `zh-TW`)
FTOLP is a project by the ODISSEI Social Data Science (SoDa) team. Do you have questions, suggestions, or remarks on the technical implementation? Create an issue in the issue tracker or feel free to contact Qixiang Fang or Shiyu Dong.