Knowledge Graph Extraction Time Benchmarking

This directory provides scripts for benchmarking the time cost of the knowledge graph extraction and concept generation processes. Unlike the parallel_generation directory, which focuses on parallel processing, this directory is designed specifically for measuring and analyzing extraction performance.

Files

  • 1_slice_kg_extraction.py: Benchmarks the time cost of entity-event triple extraction from text documents with detailed timing metrics.

  • 2_concept_generation.py: Benchmarks the time cost of concept node generation and graph construction from extracted triples.

Purpose

This benchmark suite helps you:

  • Measure extraction speed for different LLM models
  • Compare performance across different hardware configurations
  • Optimize batch sizes for maximum throughput
  • Estimate processing time for large-scale datasets
  • Profile bottlenecks in the extraction pipeline

Quick Start

1. Triple Extraction Benchmark

Run entity-event extraction timing:

python 1_slice_kg_extraction.py \
    --shard 0 \
    --total_shards 1 \
    --port 8135

Key Parameters:

  • --shard: Which data shard to process (default: 0)
  • --total_shards: Total number of data shards (default: 1); a sketch of how sharding splits the input follows this list
  • --port: vLLM/SGLang server port (default: 8135)
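
How sharding splits the work: the sketch below illustrates how --shard and --total_shards typically partition the input files across independent runs. The helper name and file pattern are illustrative, not the script's actual internals.

# Illustrative only: how --shard / --total_shards typically partition input files.
import glob

def select_shard(data_directory, shard, total_shards):
    """Return the slice of input files this shard is responsible for."""
    all_files = sorted(glob.glob(f"{data_directory}/*.json"))
    # Round-robin split: shard k takes every total_shards-th file starting at k.
    return all_files[shard::total_shards]

# Example: shard 0 of 4 processes files 0, 4, 8, ...
files_for_this_run = select_shard("/data/AutoSchema/processed_data/cc_en_head", 0, 4)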

2. Concept Generation Benchmark

Run concept generation timing (the script accepts the same --shard, --total_shards, and --port parameters):

python 2_concept_generation.py \
    --shard 0 \
    --total_shards 1 \
    --port 8135

Timing Metrics

Entity-Event Extraction Time

The total extraction time is recorded in the last object of the output JSON file:

Location: output_dir/kg_extraction/xxx_1_in_1.json

Key: total_extraction_time_seconds

Example:

{
  "id": "doc_12345",
  "text": "...",
  "triples": [...],
  "total_extraction_time_seconds": 245.67
}
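
To read this value programmatically rather than eyeballing the file, a minimal sketch (assuming the file holds either a JSON array whose last element carries the timing key, or a single object, as in the example above):

# Assumption: the final object in the output JSON carries the timing key.
import json

with open("output_dir/kg_extraction/xxx_1_in_1.json") as f:
    data = json.load(f)

last = data[-1] if isinstance(data, list) else data
print("Extraction time (s):", last["total_extraction_time_seconds"])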

Concept Generation Time

The concept generation time is recorded in the last line of the logging file:

Location: output_dir/concepts/logging.txt

Format: Total concept generation time: xxx seconds

Example:

Processing concepts...
Creating CSV files...
Total concept generation time: 89.34 seconds
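
A minimal sketch for reading that value back, assuming the total appears on the last line of the log in the format noted above:

# Assumption: the last line of logging.txt reports the total time in seconds.
import re

with open("output_dir/concepts/logging.txt") as f:
    last_line = f.read().splitlines()[-1]

match = re.search(r"([\d.]+)\s*seconds", last_line)
if match:
    print("Concept generation time (s):", float(match.group(1)))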

Configuration

Benchmark Settings

Both scripts use benchmark=True and record=True in ProcessingConfig:

kg_extraction_config = ProcessingConfig(
    model_path=model_name,
    data_directory="/data/AutoSchema/processed_data/cc_en_head",
    filename_pattern=keyword,
    batch_size_triple=16,        # Extraction batch size
    batch_size_concept=64,       # Concept generation batch size
    output_directory=f'/data/AutoSchema/processed_data/cc_en_head/{model_name}',
    current_shard_triple=args.shard,
    total_shards_triple=args.total_shards,
    record=True,                 # Save detailed results
    max_new_tokens=8192,         # Max tokens (extraction: 8192, concept: 512)
    benchmark=True               # Enable timing metrics
)
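
The command-line flags documented in Quick Start feed this config; a minimal sketch of the expected argument parsing (defaults taken from the parameter list above; the actual scripts may differ):

# Sketch of the CLI surface described above; illustrative only.
import argparse

parser = argparse.ArgumentParser(description="KG extraction time benchmark")
parser.add_argument("--shard", type=int, default=0, help="Which data shard to process")
parser.add_argument("--total_shards", type=int, default=1, help="Total number of data shards")
parser.add_argument("--port", type=int, default=8135, help="vLLM/SGLang server port")
args = parser.parse_args()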

Benchmarking Workflow

Step-by-Step Process

  1. Start LLM Server

    # Example: vLLM server
    vllm serve Qwen/Qwen2.5-7B-Instruct --port 8135
  2. Run Triple Extraction Benchmark

    python 1_slice_kg_extraction.py --port 8135
  3. Check Extraction Time

    # View last object in JSON output
    tail -n 20 output_dir/kg_extraction/xxx_1_in_1.json | grep total_extraction_time_seconds
  4. Run Concept Generation Benchmark

    python 2_concept_generation.py --port 8135
  5. Check Concept Time

    # View last line of logging file
    tail -n 1 output_dir/concepts/logging.txt
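
From the recorded extraction time you can also derive a rough throughput figure. The sketch below assumes the output JSON is an array in which each object without the timing key corresponds to one processed document; treat the result as an estimate only:

# Rough docs/second estimate from the extraction benchmark output.
import json

with open("output_dir/kg_extraction/xxx_1_in_1.json") as f:
    data = json.load(f)

records = data if isinstance(data, list) else [data]
total_time = records[-1]["total_extraction_time_seconds"]
# Assumption: objects without the timing key are processed documents.
num_docs = sum(1 for r in records if "total_extraction_time_seconds" not in r)

if num_docs and total_time:
    print(f"{num_docs} docs in {total_time:.1f}s -> {num_docs / total_time:.2f} docs/s")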

Performance Analysis

Factors Affecting Speed

  1. Model Size: Larger models (70B) are slower but more accurate than smaller models (7B)
  2. Batch Size: Larger batches improve throughput but require more memory
  3. Max Tokens: Higher token limits allow more complex extractions but increase latency
  4. Hardware: GPU memory and compute capability directly impact speed
  5. Concurrency: max_workers parameter controls parallel API calls
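
To compare runs across these factors empirically, one option is to benchmark each configuration into its own output directory and tabulate the recorded times afterwards; the run names and paths below are illustrative:

# Compare total_extraction_time_seconds across several benchmark runs.
import json

runs = {
    "batch_16": "output_dir_b16/kg_extraction/xxx_1_in_1.json",
    "batch_64": "output_dir_b64/kg_extraction/xxx_1_in_1.json",
}

for name, path in runs.items():
    with open(path) as f:
        data = json.load(f)
    last = data[-1] if isinstance(data, list) else data
    print(f"{name}: {last['total_extraction_time_seconds']:.1f} s")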

Output Structure

After benchmarking, you'll find:

output_dir/
├── kg_extraction/
│   └── xxx_1_in_1.json          # Contains total_extraction_time_seconds
├── concepts/
│   ├── logging.txt               # Contains total concept generation time
│   ├── concept_nodes.csv
│   └── concept_edges.csv
└── graphml/
    └── knowledge_graph.graphml
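
As a final sanity check, the exported GraphML file can be loaded to confirm the graph was written correctly, for example with networkx (not required by the benchmark itself):

# Quick sanity check on the exported knowledge graph.
import networkx as nx

graph = nx.read_graphml("output_dir/graphml/knowledge_graph.graphml")
print(f"Nodes: {graph.number_of_nodes()}, Edges: {graph.number_of_edges()}")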