⚡️ Speed up function read_indexer_covariates by 96% #62

Open
codeflash-ai[bot] wants to merge 1 commit into main from codeflash/optimize-read_indexer_covariates-mglnynmo

@codeflash-ai codeflash-ai bot commented Oct 11, 2025

📄 96% (0.96x) speedup for read_indexer_covariates in graphrag/query/indexer_adapters.py

⏱️ Runtime: 49.5 milliseconds → 25.2 milliseconds (best of 86 runs)
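
The headline figure appears to be the time saved relative to the optimized runtime, that is (before - after) / after. That formula is an inference from the numbers reported here, not something the PR states, and `speedup_percent` is a hypothetical helper name. A minimal sketch under that assumption:

```python
def speedup_percent(before: float, after: float) -> float:
    """Relative speedup in percent, measured against the optimized runtime."""
    if after <= 0:
        raise ValueError("after must be positive")
    return (before - after) / after * 100.0


# Headline runtimes from this PR, in milliseconds.
print(f"{speedup_percent(49.5, 25.2):.0f}%")  # 96%

# One of the per-test timings reported below, in microseconds.
print(f"{speedup_percent(841, 349):.0f}%")  # 141%
```

Under this definition a 100% speedup means the runtime was halved, so 96% corresponds to the optimized code taking roughly half as long as the original.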

📝 Explanation and details

The optimization replaces the expensive _prepare_records() function call with a more efficient direct row iteration approach. The key improvement is eliminating the DataFrame-to-dict conversion bottleneck.

What changed:

  • Removed the _prepare_records(df) call that internally used df.reset_index().rename().to_dict("records")
  • Replaced it with direct df.itertuples(index=True, name=None) iteration
  • Added an inline _row_dict() function that converts each row tuple to a dictionary on-demand
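
The change described above can be sketched as follows. This is an illustration rather than the code in the PR: the function names `prepare_records_eager` and `iter_records_lazy` are hypothetical, and the `rename(columns={"index": "Index"})` argument is an assumption, since the description only shows `rename()` without arguments.

```python
from collections.abc import Iterator
from typing import Any

import pandas as pd


def prepare_records_eager(df: pd.DataFrame) -> list[dict[str, Any]]:
    # Assumed shape of the original helper: build every row dict up front.
    return df.reset_index().rename(columns={"index": "Index"}).to_dict("records")


def iter_records_lazy(df: pd.DataFrame) -> Iterator[dict[str, Any]]:
    # With index=True and name=None, itertuples yields plain tuples whose
    # first element is the index label, followed by the column values.
    columns = ["Index", *df.columns]

    def _row_dict(row: tuple) -> dict[str, Any]:
        return dict(zip(columns, row))

    for row in df.itertuples(index=True, name=None):
        yield _row_dict(row)


df = pd.DataFrame(
    {"id": ["1", "2"], "subject_id": ["subjA", "subjB"], "type": ["claim", "claim"]}
)

# Both paths should describe the same rows for this simple frame.
assert prepare_records_eager(df) == list(iter_records_lazy(df))
```

Passing `name=None` avoids constructing a namedtuple class per call, so each row is an ordinary tuple that can be zipped against a column list computed once.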

Why it's faster:

  • Avoids expensive DataFrame operations: The original code performed reset_index(), rename(), and to_dict("records") on the entire DataFrame upfront
  • Eliminates large intermediate data structures: to_dict("records") creates a full list of dictionaries in memory, while the optimized version processes one row at a time
  • Reduces memory allocations: itertuples() yields lightweight tuple objects instead of creating heavy dictionary objects for all rows at once
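
One rough way to check these claims locally is to time just the record-building step with `timeit`, taking the best of several repeats in the same spirit as the "best of 86 runs" figure. The helper names below are hypothetical, the `rename` argument is assumed, and the numbers will vary with pandas version and hardware, so treat this as a sketch and not as a reproduction of the PR's measurements.

```python
import timeit

import pandas as pd


def records_via_to_dict(df):
    # Assumed original path: reset the index, rename it, convert everything.
    return df.reset_index().rename(columns={"index": "Index"}).to_dict("records")


def records_via_itertuples(df):
    # Optimized path as described: one dict per plain row tuple.
    columns = ["Index", *df.columns]
    return [dict(zip(columns, row)) for row in df.itertuples(index=True, name=None)]


def best_of(func, df, repeat=5, number=20):
    # Minimum over several repeats, reported in seconds per call.
    return min(timeit.repeat(lambda: func(df), repeat=repeat, number=number)) / number


n = 1000
df = pd.DataFrame({
    "id": [str(i) for i in range(n)],
    "subject_id": [f"subj{i % 10}" for i in range(n)],
    "description": [f"Desc {i}" for i in range(n)],
})

for func in (records_via_to_dict, records_via_itertuples):
    print(f"{func.__name__}: {best_of(func, df) * 1e3:.3f} ms per call")
```

This isolates the conversion step only; the end-to-end timings in this PR also include constructing the `Covariate` objects.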

Performance characteristics:
Every generated test that runs to completion is 73-166% faster; the two tests that expect a KeyError are essentially unchanged (2-6% faster). By category:

  • Large datasets (1,000 rows): 73-84% faster, attributed to reduced memory pressure
  • Simple cases with all columns present: 141-154% faster
  • Edge cases with missing data: 127-166% faster

The line profiler confirms the bottleneck was in _prepare_records() (71.4% of original runtime), which is now replaced by the much faster itertuples() approach (43.6% of optimized runtime).

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 28 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
from collections.abc import Mapping
from typing import Any

import pandas as pd
# imports
import pytest  # used for our unit tests
from graphrag.query.indexer_adapters import read_indexer_covariates

# function to test
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License


# Minimal Covariate class for testing
class Covariate:
    def __init__(
        self,
        id,
        short_id,
        subject_id,
        covariate_type,
        text_unit_ids,
        attributes,
    ):
        self.id = id
        self.short_id = short_id
        self.subject_id = subject_id
        self.covariate_type = covariate_type
        self.text_unit_ids = text_unit_ids
        self.attributes = attributes

    def __eq__(self, other):
        if not isinstance(other, Covariate):
            return False
        return (
            self.id == other.id
            and self.short_id == other.short_id
            and self.subject_id == other.subject_id
            and self.covariate_type == other.covariate_type
            and self.text_unit_ids == other.text_unit_ids
            and self.attributes == other.attributes
        )

    def __repr__(self):
        return (
            f"Covariate(id={self.id!r}, short_id={self.short_id!r}, "
            f"subject_id={self.subject_id!r}, covariate_type={self.covariate_type!r}, "
            f"text_unit_ids={self.text_unit_ids!r}, attributes={self.attributes!r})"
        )

# ------------------ UNIT TESTS ------------------

# BASIC TEST CASES

def test_basic_single_row():
    # Test a simple DataFrame with all expected columns and one row
    df = pd.DataFrame([{
        "id": 1,
        "human_readable_id": "cov1",
        "subject_id": "subjA",
        "type": "claim",
        "object_id": "objX",
        "status": "active",
        "start_date": "2024-01-01",
        "end_date": "2024-12-31",
        "description": "A description",
    }])
    codeflash_output = read_indexer_covariates(df); covs = codeflash_output # 841μs -> 349μs (141% faster)
    cov = covs[0]

def test_basic_multiple_rows():
    # Test multiple rows with different values
    df = pd.DataFrame([
        {
            "id": 2,
            "human_readable_id": "cov2",
            "subject_id": "subjB",
            "type": "claim",
            "object_id": "objY",
            "status": "inactive",
            "start_date": "2023-01-01",
            "end_date": "2023-12-31",
            "description": "Desc 2",
        },
        {
            "id": 3,
            "human_readable_id": "cov3",
            "subject_id": "subjC",
            "type": "claim",
            "object_id": "objZ",
            "status": "active",
            "start_date": "2022-01-01",
            "end_date": "2022-12-31",
            "description": "Desc 3",
        }
    ])
    codeflash_output = read_indexer_covariates(df); covs = codeflash_output # 838μs -> 332μs (152% faster)
    ids = [cov.id for cov in covs]

def test_basic_missing_optional_columns():
    # Test missing optional columns: description and end_date
    df = pd.DataFrame([{
        "id": 4,
        "human_readable_id": "cov4",
        "subject_id": "subjD",
        "type": "claim",
        "object_id": "objW",
        "status": "pending",
        "start_date": "2025-01-01",
        # "end_date" missing
        # "description" missing
    }])
    codeflash_output = read_indexer_covariates(df); covs = codeflash_output # 790μs -> 296μs (166% faster)
    cov = covs[0]

# EDGE TEST CASES

def test_edge_empty_dataframe():
    # Test with empty DataFrame
    df = pd.DataFrame(columns=[
        "id", "human_readable_id", "subject_id", "type",
        "object_id", "status", "start_date", "end_date", "description"
    ])
    codeflash_output = read_indexer_covariates(df); covs = codeflash_output # 845μs -> 371μs (128% faster)

def test_edge_missing_required_column():
    # Test missing required column 'id'
    df = pd.DataFrame([{
        # "id" missing
        "human_readable_id": "cov5",
        "subject_id": "subjE",
        "type": "claim",
        "object_id": "objV",
        "status": "active",
        "start_date": "2026-01-01",
        "end_date": "2026-12-31",
        "description": "Desc 5",
    }])
    with pytest.raises(KeyError):
        read_indexer_covariates(df) # 37.3μs -> 36.6μs (1.94% faster)

def test_edge_id_as_non_str_type():
    # Test with id as float, should be cast to str
    df = pd.DataFrame([{
        "id": 6.0,
        "human_readable_id": "cov6",
        "subject_id": "subjF",
        "type": "claim",
        "object_id": "objU",
        "status": "active",
        "start_date": "2027-01-01",
        "end_date": "2027-12-31",
        "description": "Desc 6",
    }])
    codeflash_output = read_indexer_covariates(df); covs = codeflash_output # 831μs -> 342μs (143% faster)

def test_edge_none_values():
    # Test with None values in several columns
    df = pd.DataFrame([{
        "id": None,
        "human_readable_id": None,
        "subject_id": None,
        "type": None,
        "object_id": None,
        "status": None,
        "start_date": None,
        "end_date": None,
        "description": None,
    }])
    codeflash_output = read_indexer_covariates(df); covs = codeflash_output # 870μs -> 379μs (130% faster)
    cov = covs[0]

def test_edge_extra_columns():
    # Test with extra columns not used by the function
    df = pd.DataFrame([{
        "id": 7,
        "human_readable_id": "cov7",
        "subject_id": "subjG",
        "type": "claim",
        "object_id": "objT",
        "status": "inactive",
        "start_date": "2028-01-01",
        "end_date": "2028-12-31",
        "description": "Desc 7",
        "extra_col": "should be ignored",
    }])
    codeflash_output = read_indexer_covariates(df); covs = codeflash_output # 843μs -> 343μs (145% faster)
    cov = covs[0]

def test_edge_unexpected_types_in_attributes():
    # Test with unexpected types in attributes columns
    df = pd.DataFrame([{
        "id": 8,
        "human_readable_id": "cov8",
        "subject_id": "subjH",
        "type": "claim",
        "object_id": ["objS"],  # list instead of str
        "status": {"status": "active"},  # dict instead of str
        "start_date": 20290101,  # int instead of str
        "end_date": None,
        "description": 12345,  # int instead of str
    }])
    codeflash_output = read_indexer_covariates(df); covs = codeflash_output # 919μs -> 374μs (145% faster)
    cov = covs[0]

# LARGE SCALE TEST CASES

def test_large_scale_1000_rows():
    # Test with 1000 rows
    n = 1000
    df = pd.DataFrame({
        "id": [str(i) for i in range(n)],
        "human_readable_id": [f"cov{i}" for i in range(n)],
        "subject_id": [f"subj{i%10}" for i in range(n)],
        "type": ["claim"] * n,
        "object_id": [f"obj{i%5}" for i in range(n)],
        "status": ["active" if i%2==0 else "inactive" for i in range(n)],
        "start_date": [f"2024-01-{(i%31)+1:02d}" for i in range(n)],
        "end_date": [f"2024-12-{(i%31)+1:02d}" for i in range(n)],
        "description": [f"Desc {i}" for i in range(n)],
    })
    codeflash_output = read_indexer_covariates(df); covs = codeflash_output # 4.90ms -> 2.68ms (83.0% faster)
    # Spot check a few rows
    for i in [0, 499, 999]:
        cov = covs[i]

def test_large_scale_missing_some_columns():
    # Test with 1000 rows, some missing description and end_date
    n = 1000
    df = pd.DataFrame({
        "id": [str(i) for i in range(n)],
        "human_readable_id": [f"cov{i}" for i in range(n)],
        "subject_id": [f"subj{i%10}" for i in range(n)],
        "type": ["claim"] * n,
        "object_id": [f"obj{i%5}" for i in range(n)],
        "status": ["active" if i%2==0 else "inactive" for i in range(n)],
        "start_date": [f"2024-01-{(i%31)+1:02d}" for i in range(n)],
        # "end_date" missing
        # "description" missing
    })
    codeflash_output = read_indexer_covariates(df); covs = codeflash_output # 4.28ms -> 2.48ms (72.8% faster)
    for i in [0, 499, 999]:
        cov = covs[i]

def test_large_scale_all_none():
    # Test with 1000 rows, all columns None
    n = 1000
    df = pd.DataFrame({
        "id": [None]*n,
        "human_readable_id": [None]*n,
        "subject_id": [None]*n,
        "type": [None]*n,
        "object_id": [None]*n,
        "status": [None]*n,
        "start_date": [None]*n,
        "end_date": [None]*n,
        "description": [None]*n,
    })
    codeflash_output = read_indexer_covariates(df); covs = codeflash_output # 4.89ms -> 2.71ms (80.1% faster)
    for cov in covs[::100]:  # Spot check every 100th
        pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from collections.abc import Mapping
from typing import Any

import pandas as pd
# imports
import pytest  # used for our unit tests
from graphrag.query.indexer_adapters import read_indexer_covariates

# function to test
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License


# Minimal Covariate class for testing
class Covariate:
    def __init__(
        self,
        id,
        short_id,
        subject_id,
        covariate_type,
        text_unit_ids,
        attributes,
    ):
        self.id = id
        self.short_id = short_id
        self.subject_id = subject_id
        self.covariate_type = covariate_type
        self.text_unit_ids = text_unit_ids
        self.attributes = attributes

    def __eq__(self, other):
        # For test comparison
        if not isinstance(other, Covariate):
            return False
        return (
            self.id == other.id
            and self.short_id == other.short_id
            and self.subject_id == other.subject_id
            and self.covariate_type == other.covariate_type
            and self.text_unit_ids == other.text_unit_ids
            and self.attributes == other.attributes
        )

    def __repr__(self):
        return (
            f"Covariate(id={self.id!r}, short_id={self.short_id!r}, "
            f"subject_id={self.subject_id!r}, covariate_type={self.covariate_type!r}, "
            f"text_unit_ids={self.text_unit_ids!r}, attributes={self.attributes!r})"
        )

# =============================
# Unit tests for read_indexer_covariates
# =============================

# -------------------------------
# Basic Test Cases
# -------------------------------

def test_single_row_basic():
    # Basic: single row with all fields present
    df = pd.DataFrame([{
        "id": 123,
        "human_readable_id": "claim_1",
        "subject_id": "subj_a",
        "type": "claim_type",
        "object_id": "obj_1",
        "status": "active",
        "start_date": "2024-01-01",
        "end_date": "2024-12-31",
        "description": "desc here"
    }])
    codeflash_output = read_indexer_covariates(df); result = codeflash_output # 843μs -> 335μs (151% faster)
    cov = result[0]

def test_multiple_rows_basic():
    # Basic: multiple rows, all fields present
    df = pd.DataFrame([
        {
            "id": 1,
            "human_readable_id": "claim_1",
            "subject_id": "subj_a",
            "type": "t1",
            "object_id": "obj_1",
            "status": "active",
            "start_date": "2024-01-01",
            "end_date": "2024-12-31",
            "description": "desc1",
        },
        {
            "id": 2,
            "human_readable_id": "claim_2",
            "subject_id": "subj_b",
            "type": "t2",
            "object_id": "obj_2",
            "status": "inactive",
            "start_date": "2023-01-01",
            "end_date": "2023-12-31",
            "description": "desc2",
        },
    ])
    codeflash_output = read_indexer_covariates(df); result = codeflash_output # 822μs -> 328μs (151% faster)

def test_missing_optional_fields_basic():
    # Basic: missing optional fields in attributes (should be None)
    df = pd.DataFrame([{
        "id": "foo",
        "human_readable_id": "bar",
        "subject_id": "baz",
        "type": "typ",
        "object_id": None,
        "status": None,
        "start_date": None,
        "end_date": None,
        "description": None,
    }])
    codeflash_output = read_indexer_covariates(df); result = codeflash_output # 855μs -> 376μs (127% faster)
    cov = result[0]

def test_id_is_int_is_converted_to_str():
    # Basic: id column is int, should be converted to str
    df = pd.DataFrame([{
        "id": 42,
        "human_readable_id": "claim_42",
        "subject_id": "subj_x",
        "type": "type_x",
        "object_id": "obj_x",
        "status": "active",
        "start_date": "2022-01-01",
        "end_date": "2022-12-31",
        "description": "desc_x",
    }])
    codeflash_output = read_indexer_covariates(df); result = codeflash_output # 824μs -> 327μs (152% faster)
    assert result[0].id == "42"  # the int id is stringified, as the test comment states

# -------------------------------
# Edge Test Cases
# -------------------------------

def test_empty_dataframe():
    # Edge: empty dataframe should return empty list
    df = pd.DataFrame(columns=[
        "id", "human_readable_id", "subject_id", "type",
        "object_id", "status", "start_date", "end_date", "description"
    ])
    codeflash_output = read_indexer_covariates(df); result = codeflash_output # 837μs -> 363μs (131% faster)
    assert result == []  # no rows in, no covariates out


def test_missing_id_column():
    # Edge: missing id column (should raise KeyError)
    df = pd.DataFrame([{
        # "id" missing
        "human_readable_id": "hrid",
        "subject_id": "subj",
        "type": "typ",
        "object_id": "obj",
        "status": "stat",
        "start_date": "sd",
        "end_date": "ed",
        "description": "desc",
    }])
    with pytest.raises(KeyError):
        read_indexer_covariates(df) # 41.9μs -> 39.6μs (6.02% faster)



def test_extra_columns_are_ignored():
    # Edge: extra columns should be ignored
    df = pd.DataFrame([{
        "id": "abc",
        "human_readable_id": "hrid",
        "subject_id": "subj",
        "type": "typ",
        "object_id": "obj",
        "status": "stat",
        "start_date": "sd",
        "end_date": "ed",
        "description": "desc",
        "extra_col": "should_ignore",
    }])
    codeflash_output = read_indexer_covariates(df); result = codeflash_output # 904μs -> 420μs (115% faster)

def test_attributes_are_partial():
    # Edge: some attributes columns missing, should be present in dict as None
    df = pd.DataFrame([{
        "id": "abc",
        "human_readable_id": "hrid",
        "subject_id": "subj",
        "type": "typ",
        "object_id": "obj",
        # "status" missing
        "start_date": "sd",
        "end_date": "ed",
        "description": "desc",
    }])
    codeflash_output = read_indexer_covariates(df); result = codeflash_output # 839μs -> 361μs (132% faster)
    attrs = result[0].attributes

def test_id_is_none():
    # Edge: id is None, should be converted to "None"
    df = pd.DataFrame([{
        "id": None,
        "human_readable_id": "hrid",
        "subject_id": "subj",
        "type": "typ",
        "object_id": "obj",
        "status": "stat",
        "start_date": "sd",
        "end_date": "ed",
        "description": "desc",
    }])
    codeflash_output = read_indexer_covariates(df); result = codeflash_output # 868μs -> 373μs (133% faster)

def test_human_readable_id_is_none():
    # Edge: human_readable_id is None, should be None in Covariate
    df = pd.DataFrame([{
        "id": "abc",
        "human_readable_id": None,
        "subject_id": "subj",
        "type": "typ",
        "object_id": "obj",
        "status": "stat",
        "start_date": "sd",
        "end_date": "ed",
        "description": "desc",
    }])
    codeflash_output = read_indexer_covariates(df); result = codeflash_output # 863μs -> 369μs (134% faster)

def test_non_string_types_in_attributes():
    # Edge: attributes columns have non-string types
    df = pd.DataFrame([{
        "id": "abc",
        "human_readable_id": "hrid",
        "subject_id": "subj",
        "type": "typ",
        "object_id": 123,
        "status": True,
        "start_date": pd.Timestamp("2024-01-01"),
        "end_date": None,
        "description": 42.5,
    }])
    codeflash_output = read_indexer_covariates(df); result = codeflash_output # 1.00ms -> 394μs (154% faster)
    attrs = result[0].attributes

def test_dataframe_with_index():
    # Edge: DataFrame has a custom index, should not affect output
    df = pd.DataFrame([{
        "id": "abc",
        "human_readable_id": "hrid",
        "subject_id": "subj",
        "type": "typ",
        "object_id": "obj",
        "status": "stat",
        "start_date": "sd",
        "end_date": "ed",
        "description": "desc",
    }], index=[99])
    codeflash_output = read_indexer_covariates(df); result = codeflash_output # 864μs -> 374μs (131% faster)

# -------------------------------
# Large Scale Test Cases
# -------------------------------

def test_large_scale_1000_rows():
    # Large scale: 1000 rows
    n = 1000
    df = pd.DataFrame({
        "id": [str(i) for i in range(n)],
        "human_readable_id": [f"claim_{i}" for i in range(n)],
        "subject_id": [f"subj_{i%10}" for i in range(n)],
        "type": ["type_a"]*n,
        "object_id": [f"obj_{i%5}" for i in range(n)],
        "status": ["active" if i%2==0 else "inactive" for i in range(n)],
        "start_date": ["2024-01-01"]*n,
        "end_date": ["2024-12-31"]*n,
        "description": [f"desc_{i}" for i in range(n)],
    })
    codeflash_output = read_indexer_covariates(df); result = codeflash_output # 4.85ms -> 2.66ms (82.8% faster)

def test_large_scale_missing_some_attributes():
    # Large scale: 1000 rows, some attributes missing
    n = 1000
    df = pd.DataFrame({
        "id": [str(i) for i in range(n)],
        "human_readable_id": [f"claim_{i}" for i in range(n)],
        "subject_id": [f"subj_{i%10}" for i in range(n)],
        "type": ["type_a"]*n,
        "object_id": [f"obj_{i%5}" for i in range(n)],
        # "status" missing for all
        "start_date": ["2024-01-01"]*n,
        "end_date": ["2024-12-31"]*n,
        "description": [f"desc_{i}" for i in range(n)],
    })
    codeflash_output = read_indexer_covariates(df); result = codeflash_output # 4.54ms -> 2.57ms (77.1% faster)
    # All attributes["status"] should be None
    for cov in result:
        assert cov.attributes["status"] is None

def test_large_scale_all_none():
    # Large scale: all attribute columns are None
    n = 1000
    df = pd.DataFrame({
        "id": [str(i) for i in range(n)],
        "human_readable_id": [None]*n,
        "subject_id": [f"subj_{i%10}" for i in range(n)],
        "type": ["type_a"]*n,
        "object_id": [None]*n,
        "status": [None]*n,
        "start_date": [None]*n,
        "end_date": [None]*n,
        "description": [None]*n,
    })
    codeflash_output = read_indexer_covariates(df); result = codeflash_output # 4.76ms -> 2.59ms (83.5% faster)
    for cov in result:
        pass

def test_large_scale_with_custom_index():
    # Large scale: custom index
    n = 1000
    df = pd.DataFrame({
        "id": [str(i) for i in range(n)],
        "human_readable_id": [f"claim_{i}" for i in range(n)],
        "subject_id": [f"subj_{i%10}" for i in range(n)],
        "type": ["type_a"]*n,
        "object_id": [f"obj_{i%5}" for i in range(n)],
        "status": ["active" if i%2==0 else "inactive" for i in range(n)],
        "start_date": ["2024-01-01"]*n,
        "end_date": ["2024-12-31"]*n,
        "description": [f"desc_{i}" for i in range(n)],
    }, index=[i+1000 for i in range(n)])
    codeflash_output = read_indexer_covariates(df); result = codeflash_output # 4.85ms -> 2.67ms (81.5% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run git checkout codeflash/optimize-read_indexer_covariates-mglnynmo, commit your edits, and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 11, 2025 02:36
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 11, 2025