⚡️ Speed up function read_indexer_relationships by 35% #63

Open
codeflash-ai[bot] wants to merge 1 commit into main from
codeflash/optimize-read_indexer_relationships-mglo59iq

Conversation

@codeflash-ai codeflash-ai bot commented Oct 11, 2025

📄 35% (0.35x) speedup for read_indexer_relationships in graphrag/query/indexer_adapters.py

⏱️ Runtime: 22.5 milliseconds → 16.7 milliseconds (best of 30 runs)
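
For readers who want to reproduce a "best of N runs" figure locally, the following is a minimal sketch using only the standard library. Codeflash's own benchmarking harness is not shown in this PR, so the helper name `best_of` and the toy `workload` function are hypothetical stand-ins, not the tool's implementation.

```python
import timeit


def best_of(func, runs=30):
    """Return the fastest single-call wall time, in seconds, over `runs` repetitions."""
    # timeit.repeat returns one total time per repetition; with number=1
    # each entry is the duration of exactly one call to func.
    return min(timeit.repeat(func, repeat=runs, number=1))


def workload():
    # Hypothetical stand-in for a call such as read_indexer_relationships(df).
    return sum(i * i for i in range(10_000))


best = best_of(workload)
print(f"best of 30 runs: {best * 1e3:.3f} ms")
```

Taking the minimum rather than the mean reduces the influence of scheduler noise and other one-off slowdowns on the reported figure.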

📝 Explanation and details

The optimized code replaces the expensive _prepare_records() call with df.reset_index().itertuples(index=False, name="Row"). This gives a speedup of about 35% (22.5 ms to 16.7 ms) by avoiding the up-front conversion of the entire DataFrame into a list of dictionaries.

Key optimizations:

  1. Eliminated intermediate dict conversion: The original code called _prepare_records() which converted the entire DataFrame to a list of dictionaries using df.to_dict("records"). The optimized version uses itertuples() directly, avoiding this expensive conversion step.

  2. Faster row iteration: itertuples() yields named tuples, which are lighter-weight than dictionaries and fast to read by attribute. The code still converts each row to a dict via row._asdict(), because the existing utility functions expect dictionary-like objects, so one dict per row is still created.

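  3. Replaced list comprehension with explicit loop: the list comprehension became an explicit loop that calls rels.append(). The PR does not measure this change on its own; in CPython a list comprehension is generally at least as fast as an equivalent append loop, so the gain most likely comes from items 1 and 2.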
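
The two iteration strategies described above can be compared side by side. This is a minimal sketch, not the code in the PR: the body of `_prepare_records()` is not shown here, so `records_via_to_dict` is an assumption based on the description, and both helper names are hypothetical.

```python
from typing import Any, Dict, List

import pandas as pd


def records_via_to_dict(df: pd.DataFrame) -> List[Dict[str, Any]]:
    """Materialize every row as a dict up front (the pattern the PR replaces)."""
    return df.reset_index().to_dict("records")


def records_via_itertuples(df: pd.DataFrame) -> List[Dict[str, Any]]:
    """Iterate rows as named tuples and convert each one to a dict."""
    rows = df.reset_index().itertuples(index=False, name="Row")
    return [row._asdict() for row in rows]


df = pd.DataFrame({
    "id": ["r1", "r2"],
    "source": ["A", "B"],
    "target": ["B", "C"],
    "combined_degree": [5, 10],
})

# Both strategies produce the same records for this frame.
assert records_via_to_dict(df) == records_via_itertuples(df)
```

One caveat worth keeping in mind: itertuples() renames columns to positional names such as _1 when a column name is not a valid Python identifier, is repeated, or starts with an underscore. For such columns row._asdict() will not contain the original key, whereas to_dict("records") keeps it.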

The line profiler attributes most of the gain to the record preparation step, whose time falls from 84.6 ms to 40.1 ms (a 53% reduction in time). Across the generated tests the speedup ranges from 24% to 56%, depending on DataFrame size and contents: the large cases (500 and 1000 rows) improve by about 24-27%, while the small cases (one or two rows, or an empty frame) improve by about 37-56%. This pattern suggests that the change trims fixed per-call overhead more than per-row cost.
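
The percentages in this report follow two different conventions, which the short sketch below makes explicit. The function names are hypothetical; the input numbers are taken from this PR.

```python
def speedup_pct(before: float, after: float) -> float:
    """Speedup as reported in the headline and per-test comments: before/after - 1."""
    return (before / after - 1) * 100


def time_reduction_pct(before: float, after: float) -> float:
    """Share of the original time that was removed: 1 - after/before."""
    return (1 - after / before) * 100


# Headline runtime: 22.5 ms -> 16.7 ms
print(f"{speedup_pct(22.5, 16.7):.1f}% speedup")

# First generated test: 761 us -> 527 us
print(f"{speedup_pct(761, 527):.1f}% speedup")

# Line profiler, record preparation step: 84.6 ms -> 40.1 ms
print(f"{time_reduction_pct(84.6, 40.1):.1f}% less time")
```

The headline and the per-test comments use the first formula, while the profiler figure of 53% matches the second, so the two kinds of number should not be compared directly.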

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 20 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

import pandas as pd
import pytest
from graphrag.query.indexer_adapters import read_indexer_relationships


# Local stand-in listing the fields a relationship is expected to carry.
# It is not referenced by the tests below.
@dataclass
class Relationship:
    id: str
    short_id: Optional[str]
    source: str
    target: str
    description: Optional[str]
    description_embedding: Optional[List[float]]
    weight: Optional[float]
    text_unit_ids: Optional[List[str]]
    rank: Optional[int]
    attributes: Optional[Dict[str, Any]]

# -------------------- UNIT TESTS --------------------

# 1. Basic Test Cases

def test_basic_single_relationship():
    # Test with a single row, all fields present
    df = pd.DataFrame([{
        "id": "r1",
        "human_readable_id": "HR1",
        "source": "A",
        "target": "B",
        "description": "desc",
        "combined_degree": 5,
        "weight": 1.5,
        "text_unit_ids": ["t1", "t2"]
    }])
    codeflash_output = read_indexer_relationships(df); rels = codeflash_output # 761μs -> 527μs (44.4% faster)
    r = rels[0]

def test_basic_multiple_relationships():
    # Test with multiple rows
    df = pd.DataFrame([
        {
            "id": "r1",
            "human_readable_id": "HR1",
            "source": "A",
            "target": "B",
            "description": "desc1",
            "combined_degree": 5,
            "weight": 1.5,
            "text_unit_ids": ["t1", "t2"]
        },
        {
            "id": "r2",
            "human_readable_id": "HR2",
            "source": "B",
            "target": "C",
            "description": "desc2",
            "combined_degree": 10,
            "weight": 2.5,
            "text_unit_ids": ["t3"]
        }
    ])
    codeflash_output = read_indexer_relationships(df); rels = codeflash_output # 739μs -> 508μs (45.6% faster)


def test_basic_text_unit_ids_as_string():
    # Test with text_unit_ids as a string (should be converted to list)
    df = pd.DataFrame([{
        "id": "r1",
        "human_readable_id": "HR1",
        "source": "A",
        "target": "B",
        "description": "desc",
        "combined_degree": 5,
        "weight": 1.5,
        "text_unit_ids": "t1"
    }])
    codeflash_output = read_indexer_relationships(df); rels = codeflash_output # 761μs -> 531μs (43.4% faster)




def test_edge_empty_dataframe():
    # Empty dataframe should return empty list
    df = pd.DataFrame(columns=["id", "human_readable_id", "source", "target", "combined_degree"])
    codeflash_output = read_indexer_relationships(df); rels = codeflash_output # 602μs -> 428μs (40.5% faster)









def test_large_scale_many_relationships():
    # Test with 1000 relationships
    n = 1000
    df = pd.DataFrame({
        "id": [f"r{i}" for i in range(n)],
        "human_readable_id": [f"HR{i}" for i in range(n)],
        "source": [f"S{i}" for i in range(n)],
        "target": [f"T{i}" for i in range(n)],
        "description": [f"desc{i}" for i in range(n)],
        "combined_degree": [i for i in range(n)],
        "weight": [float(i) for i in range(n)],
        "text_unit_ids": [[f"t{i}a", f"t{i}b"] for i in range(n)]
    })
    codeflash_output = read_indexer_relationships(df); rels = codeflash_output # 4.64ms -> 3.66ms (26.9% faster)




#------------------------------------------------
import pandas as pd
import pytest
from graphrag.query.indexer_adapters import read_indexer_relationships

# ----------- UNIT TESTS ------------

# 1. Basic Test Cases
def test_basic_single_row_minimal():
    # Single row with all required columns, minimal values
    df = pd.DataFrame([{
        "id": "r1",
        "human_readable_id": "R1",
        "source": "A",
        "target": "B",
        "description": "desc",
        "combined_degree": 2
    }])
    codeflash_output = read_indexer_relationships(df); result = codeflash_output # 715μs -> 478μs (49.4% faster)
    rel = result[0]

def test_basic_multiple_rows():
    # Multiple rows, different values
    df = pd.DataFrame([
        {
            "id": "r1",
            "human_readable_id": "R1",
            "source": "A",
            "target": "B",
            "description": "desc1",
            "combined_degree": 1
        },
        {
            "id": "r2",
            "human_readable_id": "R2",
            "source": "B",
            "target": "C",
            "description": "desc2",
            "combined_degree": 2
        }
    ])
    codeflash_output = read_indexer_relationships(df); result = codeflash_output # 699μs -> 462μs (51.2% faster)

def test_basic_type_conversion():
    # Types: int for id, float for rank, should be converted
    df = pd.DataFrame([{
        "id": 123,
        "human_readable_id": 456,
        "source": "X",
        "target": "Y",
        "description": None,
        "combined_degree": 7.0
    }])
    codeflash_output = read_indexer_relationships(df); result = codeflash_output # 712μs -> 457μs (55.8% faster)
    rel = result[0]

def test_basic_missing_optional_columns():
    # Optional columns missing
    df = pd.DataFrame([{
        "id": "r1",
        "human_readable_id": "R1",
        "source": "A",
        "target": "B",
        "description": None,
        "combined_degree": None
    }])
    codeflash_output = read_indexer_relationships(df); result = codeflash_output # 628μs -> 451μs (39.2% faster)
    rel = result[0]

# 2. Edge Test Cases



def test_edge_unexpected_types():
    # Wrong type for rank (string instead of int/float)
    df = pd.DataFrame([{
        "id": "r1",
        "human_readable_id": "R1",
        "source": "A",
        "target": "B",
        "description": "desc",
        "combined_degree": "not_a_number"
    }])
    with pytest.raises(TypeError):
        read_indexer_relationships(df) # 657μs -> 473μs (38.7% faster)

def test_edge_extra_columns_ignored():
    # Extra columns should be ignored
    df = pd.DataFrame([{
        "id": "r1",
        "human_readable_id": "R1",
        "source": "A",
        "target": "B",
        "description": "desc",
        "combined_degree": 5,
        "extra_col": "should_be_ignored"
    }])
    codeflash_output = read_indexer_relationships(df); result = codeflash_output # 699μs -> 490μs (42.7% faster)
    rel = result[0]

def test_edge_empty_dataframe():
    # Empty dataframe should return empty list
    df = pd.DataFrame(columns=["id", "human_readable_id", "source", "target", "description", "combined_degree"])
    codeflash_output = read_indexer_relationships(df); result = codeflash_output # 612μs -> 445μs (37.5% faster)


def test_edge_rank_is_float():
    # Rank is float, should convert to int
    df = pd.DataFrame([{
        "id": "r1",
        "human_readable_id": "R1",
        "source": "A",
        "target": "B",
        "description": "desc",
        "combined_degree": 3.7
    }])
    codeflash_output = read_indexer_relationships(df); result = codeflash_output # 668μs -> 485μs (37.7% faster)

def test_edge_rank_is_none():
    # Rank is None, should be None
    df = pd.DataFrame([{
        "id": "r1",
        "human_readable_id": "R1",
        "source": "A",
        "target": "B",
        "description": "desc",
        "combined_degree": None
    }])
    codeflash_output = read_indexer_relationships(df); result = codeflash_output # 633μs -> 459μs (37.9% faster)

def test_edge_id_is_int():
    # id is int, should convert to str
    df = pd.DataFrame([{
        "id": 42,
        "human_readable_id": "R42",
        "source": "A",
        "target": "B",
        "description": "desc",
        "combined_degree": 1
    }])
    codeflash_output = read_indexer_relationships(df); result = codeflash_output # 668μs -> 439μs (52.2% faster)

def test_edge_human_readable_id_is_none():
    # human_readable_id is None, should be None
    df = pd.DataFrame([{
        "id": "r1",
        "human_readable_id": None,
        "source": "A",
        "target": "B",
        "description": "desc",
        "combined_degree": 1
    }])
    codeflash_output = read_indexer_relationships(df); result = codeflash_output # 692μs -> 458μs (50.9% faster)

# 3. Large Scale Test Cases

def test_large_scale_1000_rows():
    # 1000 rows, all valid, test scalability
    N = 1000
    df = pd.DataFrame({
        "id": [f"r{i}" for i in range(N)],
        "human_readable_id": [f"HR{i}" for i in range(N)],
        "source": [f"S{i%10}" for i in range(N)],
        "target": [f"T{i%10}" for i in range(N)],
        "description": [f"desc{i}" for i in range(N)],
        "combined_degree": [i for i in range(N)],
    })
    codeflash_output = read_indexer_relationships(df); result = codeflash_output # 3.87ms -> 3.05ms (26.9% faster)
    # Check a few random indices
    for idx in [0, 499, 999]:
        rel = result[idx]

def test_large_scale_varied_types():
    # 500 rows, alternating types for id and rank
    N = 500
    df = pd.DataFrame({
        "id": [i if i%2==0 else f"r{i}" for i in range(N)],
        "human_readable_id": [f"HR{i}" for i in range(N)],
        "source": ["A"]*N,
        "target": ["B"]*N,
        "description": [None if i%3==0 else f"desc{i}" for i in range(N)],
        "combined_degree": [float(i) if i%2==0 else i for i in range(N)],
    })
    codeflash_output = read_indexer_relationships(df); result = codeflash_output # 2.25ms -> 1.81ms (24.4% faster)
    for i in range(N):
        rel = result[i]
        # description is None every third row
        if i%3==0:
            pass
        else:
            pass


def test_large_scale_empty():
    # Empty dataframe with correct columns
    df = pd.DataFrame(columns=["id", "human_readable_id", "source", "target", "description", "combined_degree"])
    codeflash_output = read_indexer_relationships(df); result = codeflash_output # 619μs -> 451μs (37.3% faster)

To edit these changes, run `git checkout codeflash/optimize-read_indexer_relationships-mglo59iq`, commit your changes, and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 11, 2025 02:41
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 11, 2025