⚡️ Speed up function `read_indexer_reports` by 29% by codeflash-ai[bot] · Pull Request #64 · codeflash-ai/graphrag

codeflash-ai · 2025-10-11T02:47:10Z

📄 29% (0.29x) speedup for `read_indexer_reports` in `graphrag/query/indexer_adapters.py`

⏱️ Runtime : 79.0 milliseconds → 61.5 milliseconds (best of 46 runs)
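
The headline percentage can be reproduced from the two runtimes. The sketch below assumes the convention `(original - optimized) / optimized`; that formula is inferred from the per-test annotations further down and is not stated anywhere in this PR. The helper name `speedup` is hypothetical.

```python
def speedup(original_ms: float, optimized_ms: float) -> float:
    """Relative speedup, assuming (original - optimized) / optimized."""
    return (original_ms - optimized_ms) / optimized_ms


# Rounded runtimes quoted in this PR.
overall = speedup(79.0, 61.5)
print(f"{overall:.1%}")  # about 28.5% from the rounded figures
```

Because the quoted runtimes are rounded to one decimal place, the result can land on either 28% or 29% depending on where the rounding happens, which may explain why both figures appear in this PR.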

📝 Explanation and details

The optimized code achieves a speedup of roughly 28 to 29% (79.0 ms → 61.5 ms) through several changes to the DataFrame operations in `read_indexer_reports`:

**1. Streamlined Community Processing**

- **Original**: Used chained `.loc[:, "community"]` assignments followed by `groupby().agg().reset_index()` and `merge()` operations.
- **Optimized**: Combined `fillna` and `astype` into a single operation, then used `drop_duplicates(subset=["title"], keep="last")` with direct filtering via `isin()`.
- **Why faster**: Eliminates the groupby aggregation and the merge, replacing them with cheaper direct DataFrame filtering.
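
A minimal sketch of the two rollup strategies on toy data, to make the comparison concrete. This is not the code from the PR: the function names and sample frames are hypothetical, and the explicit `sort_values` is an assumption added here, since `keep="last"` only matches a per-title `max` when rows are ordered by `community`.

```python
import pandas as pd

# Toy data: title "A" appears in communities 1 and 3, so its rollup is 3.
nodes = pd.DataFrame({"title": ["A", "A", "B", "C"], "community": [1, 3, 2, 5]})
reports = pd.DataFrame({"id": ["r1", "r2", "r3", "r4"], "community": [1, 2, 3, 7]})


def rollup_groupby_merge(nodes: pd.DataFrame, reports: pd.DataFrame) -> pd.DataFrame:
    # Aggregate to the highest community per title, then inner-join the reports.
    top = nodes.groupby("title").agg({"community": "max"}).reset_index()
    keep = top[["community"]].drop_duplicates()
    return reports.merge(keep, on="community", how="inner")


def rollup_dedupe_isin(nodes: pd.DataFrame, reports: pd.DataFrame) -> pd.DataFrame:
    # After sorting, the last row per title carries the highest community.
    ordered = nodes.sort_values("community")
    top = ordered.drop_duplicates(subset=["title"], keep="last")
    # A boolean mask replaces the join.
    return reports[reports["community"].isin(top["community"])]


old = rollup_groupby_merge(nodes, reports)
new = rollup_dedupe_isin(nodes, reports)
print(old["id"].tolist(), new["id"].tolist())
```

Both functions keep the same reports here. If the input is not guaranteed to be ordered by `community` within each title, dropping the sort would change which row `keep="last"` retains, so that ordering is worth checking in review.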

**2. Reduced DataFrame Operations**

- **Original**: Multiple separate operations: `fillna(-1)`, `astype(int)`, `groupby`, `merge`, `drop_duplicates`.
- **Optimized**: Consolidated into fewer operations using vectorized pandas methods.
- **Why faster**: Fewer intermediate DataFrame copies and less overhead from chained operations.
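
As an illustration of the consolidation, the sketch below contrasts two `.loc` assignments with a single chained expression. The frame and variable names are hypothetical; only `fillna(-1)` and `astype(int)` come from the description above.

```python
import pandas as pd

nodes = pd.DataFrame({"community": [1.0, None, 3.0]})

# Two-step form: each statement writes back into the frame through .loc.
two_step = nodes.copy()
two_step.loc[:, "community"] = two_step["community"].fillna(-1)
two_step.loc[:, "community"] = two_step["community"].astype(int)

# Single chained expression: one new column, one assignment.
one_step = nodes.assign(community=nodes["community"].fillna(-1).astype(int))

print(one_step["community"].tolist())
```

The values agree in both forms. The dtype of the `.loc[:, ...]` variant can depend on the pandas version, because that style of assignment may write into the existing column rather than replace it, whereas the chained form always yields an integer column.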

**3. Optimized Embedding Logic**

- **Original**: Always called the expensive `embed_community_reports` function.
- **Optimized**: Added conditional logic to embed only missing values, using boolean indexing to target specific rows.
- **Why faster**: Avoids unnecessary embedding operations and reduces function call overhead.
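
The idea of embedding only the rows that lack a vector can be sketched as follows. Everything here is a stand-in: `embed_missing` and `fake_embed` are hypothetical names, and the real change works with graphrag's embedding model rather than a plain callable.

```python
import pandas as pd

calls = []  # records which texts were actually embedded


def fake_embed(text: str) -> list:
    # Hypothetical stand-in for a real (expensive) embedding call.
    calls.append(text)
    return [float(len(text))]


def embed_missing(df, embed, text_col="full_content",
                  embedding_col="full_content_embedding"):
    df = df.copy()
    if embedding_col in df.columns:
        embedded = df[embedding_col].tolist()
    else:
        embedded = [None] * len(df)
    # Boolean mask of rows that still need a vector.
    missing = pd.Series(embedded, index=df.index, dtype=object).isna()
    positions = missing.to_numpy().nonzero()[0]
    texts = df.loc[missing, text_col].tolist()
    for pos, text in zip(positions, texts):
        embedded[pos] = embed(text)
    df[embedding_col] = pd.Series(embedded, index=df.index, dtype=object)
    return df


reports = pd.DataFrame({
    "full_content": ["abc", "de"],
    "full_content_embedding": [[1.0, 2.0], None],
})
result = embed_missing(reports, fake_embed)
print(calls)
```

Only the second row reaches `fake_embed`; the first keeps its existing vector. When every row already has an embedding, the loop body never runs, which is the case the description above says benefits most.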

**4. Minor Loop Optimizations in `read_community_reports`**

- Added local variable caching for frequently accessed functions and objects to reduce attribute lookup overhead in tight loops.
- Split the comprehension into separate branches to avoid repeated conditional checks.
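
A small sketch of both loop-level techniques, under the assumption that the per-row conditional depends only on a function argument. The `Report` class and both builder functions are hypothetical and much simpler than `read_community_reports`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Report:  # hypothetical stand-in for the real report type
    id: str
    rank: Optional[float]


def build_checked_per_row(rows, rank_col=None):
    # The `if rank_col` test is re-evaluated for every row.
    return [
        Report(id=row["id"], rank=row.get(rank_col) if rank_col else None)
        for row in rows
    ]


def build_branch_hoisted(rows, rank_col=None):
    make = Report  # local alias avoids a global lookup on each iteration
    if rank_col:
        return [make(id=row["id"], rank=row.get(rank_col)) for row in rows]
    return [make(id=row["id"], rank=None) for row in rows]


rows = [{"id": "r1", "rank": 0.5}, {"id": "r2", "rank": 0.7}]
with_rank = build_branch_hoisted(rows, rank_col="rank")
without_rank = build_branch_hoisted(rows)
```

Both builders return the same objects; the second simply decides once, outside the comprehension, which shape of row to build. Gains from this kind of change are usually small, consistent with the "minor" label above.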

The optimizations are most effective for test cases with:

- **Large datasets** (17-30% improvement on 300-1000 record tests)
- **Non-dynamic community selection** scenarios (where the groupby optimization applies)
- **Cases with existing embeddings** (avoiding expensive re-embedding)

For dynamic selection cases the change is within measurement noise: the generated tests range from about 2.5% slower to 0.6% faster, because the rollup path is skipped. The small non-dynamic tests report the largest relative gains (roughly 36-51%), although each of those runs takes only 2 to 4 ms.

✅ Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 22 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 93.8% |

🌀 Generated Regression Tests and Runtime

from dataclasses import dataclass
from typing import Any, Dict, List, Optional

import numpy as np
import pandas as pd
import pytest

from graphrag.query.indexer_adapters import read_indexer_reports


@dataclass
class CommunityReport:
    id: str
    short_id: Optional[str]
    title: str
    community_id: str
    summary: str
    full_content: str
    rank: Optional[float]
    full_content_embedding: Optional[List[float]]
    attributes: Optional[Dict[str, Any]] = None

# ----------------------------
# Unit tests for read_indexer_reports
# ----------------------------

# ----------- BASIC TEST CASES ------------

def make_basic_reports_df():
    # Returns a simple DataFrame for basic tests
    return pd.DataFrame([
        {"id": "r1", "community": 1, "level": 1, "title": "A", "summary": "sumA", "full_content": "contentA", "rank": 10.0, "full_content_embedding": [1.0, 2.0]},
        {"id": "r2", "community": 2, "level": 2, "title": "B", "summary": "sumB", "full_content": "contentB", "rank": 20.0, "full_content_embedding": [3.0, 4.0]},
    ])

def make_basic_communities_df():
    # Returns a simple DataFrame for basic tests
    return pd.DataFrame([
        {"community": 1, "level": 1, "title": "A", "entity_ids": ["e1", "e2"]},
        {"community": 2, "level": 2, "title": "B", "entity_ids": ["e3"]},
    ])

def test_basic_returns_expected_reports():
    """Test that basic input returns correct CommunityReport objects."""
    reports_df = make_basic_reports_df()
    communities_df = make_basic_communities_df()
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=None); result = codeflash_output # 3.31ms -> 2.36ms (40.1% faster)
    # Should only include communities present in both, with correct rollup
    ids = [r.id for r in result]

def test_basic_with_community_level_filter():
    """Test that filtering by community_level works."""
    reports_df = make_basic_reports_df()
    communities_df = make_basic_communities_df()
    # Only level <= 1 should be included
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=1); result = codeflash_output # 3.62ms -> 2.65ms (36.3% faster)
    ids = [r.id for r in result]

def test_basic_dynamic_community_selection():
    """Test dynamic_community_selection disables rollup."""
    reports_df = make_basic_reports_df()
    communities_df = make_basic_communities_df()
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=None, dynamic_community_selection=True); result = codeflash_output # 1.56ms -> 1.59ms (1.90% slower)
    # Should not filter or rollup, so all reports remain
    ids = [r.id for r in result]


def test_empty_reports_df_returns_empty():
    """Test that empty reports dataframe returns empty result."""
    reports_df = pd.DataFrame(columns=["id", "community", "level", "title", "summary", "full_content", "rank", "full_content_embedding"])
    communities_df = make_basic_communities_df()
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=None); result = codeflash_output # 3.13ms -> 2.27ms (38.2% faster)

def test_empty_communities_df_returns_empty():
    """Test that empty communities dataframe returns empty result."""
    reports_df = make_basic_reports_df()
    communities_df = pd.DataFrame(columns=["community", "level", "title", "entity_ids"])
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=None); result = codeflash_output # 3.12ms -> 2.09ms (49.2% faster)




def test_duplicate_communities_rollup():
    """Test that duplicate communities are rolled up to max level."""
    reports_df = pd.DataFrame([
        {"id": "r1", "community": 1, "level": 1, "title": "A", "summary": "sumA", "full_content": "contentA", "rank": 10.0, "full_content_embedding": [1.0]},
        {"id": "r2", "community": 1, "level": 2, "title": "A", "summary": "sumB", "full_content": "contentB", "rank": 20.0, "full_content_embedding": [2.0]},
    ])
    communities_df = pd.DataFrame([
        {"community": 1, "level": 1, "title": "A", "entity_ids": ["e1"]},
        {"community": 1, "level": 2, "title": "A", "entity_ids": ["e2"]},
    ])
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=None); result = codeflash_output # 3.25ms -> 2.29ms (41.8% faster)
    # Only the highest level community should be kept
    ids = [r.id for r in result]

def test_explode_entity_ids_with_multiple_entities():
    """Test that explode works with multiple entities per community."""
    reports_df = pd.DataFrame([
        {"id": "r1", "community": 1, "level": 1, "title": "A", "summary": "sumA", "full_content": "contentA", "rank": 10.0, "full_content_embedding": [1.0]},
    ])
    communities_df = pd.DataFrame([
        {"community": 1, "level": 1, "title": "A", "entity_ids": ["e1", "e2", "e3"]},
    ])
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=None); result = codeflash_output # 3.28ms -> 2.35ms (39.6% faster)

def test_reports_with_missing_optional_columns():
    """Test that missing optional columns does not break function."""
    reports_df = pd.DataFrame([
        {"id": "r1", "community": 1, "level": 1, "title": "A", "summary": "sumA", "full_content": "abc"},
    ])
    communities_df = pd.DataFrame([
        {"community": 1, "level": 1, "title": "A", "entity_ids": ["e1"]},
    ])
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=None); result = codeflash_output # 3.10ms -> 2.15ms (44.4% faster)

# ----------- LARGE SCALE TEST CASES ------------

def test_large_scale_reports_and_communities():
    """Test with large number of reports and communities."""
    n = 500
    reports_df = pd.DataFrame([
        {"id": f"r{i}", "community": i % 10, "level": i % 5, "title": f"T{i%10}", "summary": f"sum{i}", "full_content": f"content{i}", "rank": float(i), "full_content_embedding": [float(i), float(i+1)]}
        for i in range(n)
    ])
    communities_df = pd.DataFrame([
        {"community": i, "level": i % 5, "title": f"T{i}", "entity_ids": [f"e{i}", f"e{i+1}"]}
        for i in range(10)
    ])
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=None); result = codeflash_output # 5.21ms -> 4.18ms (24.7% faster)
    # Should only include reports with communities present in communities_df
    communities_in_result = set(r.community_id for r in result)


def test_large_scale_with_community_level_filter():
    """Test large scale with community_level filter."""
    n = 300
    reports_df = pd.DataFrame([
        {"id": f"r{i}", "community": i % 10, "level": i % 5, "title": f"T{i%10}", "summary": f"sum{i}", "full_content": f"content{i}", "rank": float(i), "full_content_embedding": [float(i), float(i+1)]}
        for i in range(n)
    ])
    communities_df = pd.DataFrame([
        {"community": i, "level": i % 5, "title": f"T{i}", "entity_ids": [f"e{i}", f"e{i+1}"]}
        for i in range(10)
    ])
    # Only level <= 2 should be included
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=2); result = codeflash_output # 4.45ms -> 3.40ms (30.8% faster)

def test_large_scale_dynamic_selection():
    """Test large scale with dynamic_community_selection=True disables rollup."""
    n = 400
    reports_df = pd.DataFrame([
        {"id": f"r{i}", "community": i % 10, "level": i % 5, "title": f"T{i%10}", "summary": f"sum{i}", "full_content": f"content{i}", "rank": float(i), "full_content_embedding": [float(i), float(i+1)]}
        for i in range(n)
    ])
    communities_df = pd.DataFrame([
        {"community": i, "level": i % 5, "title": f"T{i}", "entity_ids": [f"e{i}", f"e{i+1}"]}
        for i in range(10)
    ])
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=None, dynamic_community_selection=True); result = codeflash_output # 3.04ms -> 3.02ms (0.559% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import sys

import numpy as np
import pandas as pd
import pytest

from graphrag.query.indexer_adapters import read_indexer_reports

# --- Unit Tests for read_indexer_reports ---

# Helper: Minimal CommunityReport for test validation
class CommunityReport:
    def __init__(
        self,
        id,
        short_id,
        title,
        community_id,
        summary,
        full_content,
        rank,
        full_content_embedding,
        attributes=None,
    ):
        self.id = id
        self.short_id = short_id
        self.title = title
        self.community_id = community_id
        self.summary = summary
        self.full_content = full_content
        self.rank = rank
        self.full_content_embedding = full_content_embedding
        self.attributes = attributes

    def __eq__(self, other):
        if not isinstance(other, CommunityReport):
            return False
        return (
            self.id == other.id
            and self.short_id == other.short_id
            and self.title == other.title
            and self.community_id == other.community_id
            and self.summary == other.summary
            and self.full_content == other.full_content
            and self.rank == other.rank
            and self.full_content_embedding == other.full_content_embedding
            and self.attributes == other.attributes
        )

    def __repr__(self):
        return (
            f"CommunityReport(id={self.id!r}, short_id={self.short_id!r}, title={self.title!r}, "
            f"community_id={self.community_id!r}, summary={self.summary!r}, full_content={self.full_content!r}, "
            f"rank={self.rank!r}, full_content_embedding={self.full_content_embedding!r}, attributes={self.attributes!r})"
        )

# Helper: Minimal EmbeddingModel for test validation
class DummyEmbeddingModel:
    def embed(self, text):
        # Returns a deterministic embedding for testing
        return [len(text), sum(ord(c) for c in text) % 100]

# Helper: Minimal GraphRagConfig for embedding model config
class DummyGraphRagConfig:
    def get_language_model_config(self, model_id):
        class DummyLMConfig:
            type = "dummy"
        return DummyLMConfig()

# Helper: Patch ModelManager for embedding
class DummyModelManager:
    def get_or_create_embedding_model(self, name, model_type, config):
        return DummyEmbeddingModel()

# ---- TESTS ----

# --- Basic Test Cases ---

def test_basic_single_report():
    # One report, one community, no filtering
    reports = pd.DataFrame([{
        "id": "r1",
        "community": 1,
        "level": 1,
        "title": "Title1",
        "summary": "Summary1",
        "full_content": "Content1",
        "rank": 0.5,
        "full_content_embedding": [1.0, 2.0],
    }])
    communities = pd.DataFrame([{
        "community": 1,
        "level": 1,
        "title": "Title1",
        "entity_ids": ["e1", "e2"],
    }])
    codeflash_output = read_indexer_reports(reports, communities, community_level=None); result = codeflash_output # 3.29ms -> 2.36ms (39.6% faster)

def test_basic_multi_report_multi_community():
    # Multiple reports, multiple communities
    reports = pd.DataFrame([
        {"id": "r1", "community": 1, "level": 1, "title": "A", "summary": "S1", "full_content": "C1", "rank": 0.5, "full_content_embedding": [1.0, 2.0]},
        {"id": "r2", "community": 2, "level": 2, "title": "B", "summary": "S2", "full_content": "C2", "rank": 0.7, "full_content_embedding": [3.0, 4.0]},
    ])
    communities = pd.DataFrame([
        {"community": 1, "level": 1, "title": "A", "entity_ids": ["e1"]},
        {"community": 2, "level": 2, "title": "B", "entity_ids": ["e2"]},
    ])
    codeflash_output = read_indexer_reports(reports, communities, community_level=None); result = codeflash_output # 3.19ms -> 2.26ms (41.0% faster)
    ids = set(r.id for r in result)

def test_basic_dynamic_community_selection_true():
    # Dynamic selection disables rollup
    reports = pd.DataFrame([
        {"id": "r1", "community": 1, "level": 1, "title": "A", "summary": "S1", "full_content": "C1", "rank": 0.5, "full_content_embedding": [1.0, 2.0]},
        {"id": "r2", "community": 2, "level": 2, "title": "B", "summary": "S2", "full_content": "C2", "rank": 0.7, "full_content_embedding": [3.0, 4.0]},
    ])
    communities = pd.DataFrame([
        {"community": 1, "level": 1, "title": "A", "entity_ids": ["e1"]},
        {"community": 2, "level": 2, "title": "B", "entity_ids": ["e2"]},
    ])
    codeflash_output = read_indexer_reports(reports, communities, community_level=None, dynamic_community_selection=True); result = codeflash_output # 1.46ms -> 1.50ms (2.45% slower)
    ids = set(r.id for r in result)


def test_edge_empty_reports_and_communities():
    # Both DataFrames empty
    reports = pd.DataFrame(columns=["id", "community", "level", "title", "summary", "full_content", "rank", "full_content_embedding"])
    communities = pd.DataFrame(columns=["community", "level", "title", "entity_ids"])
    codeflash_output = read_indexer_reports(reports, communities, community_level=None); result = codeflash_output # 3.00ms -> 1.99ms (51.0% faster)

def test_edge_missing_embedding_column_and_config_none():
    # Embedding column missing, but config is None, so no embedding is added
    reports = pd.DataFrame([
        {"id": "r1", "community": 1, "level": 1, "title": "A", "summary": "S1", "full_content": "C1", "rank": 0.5},
    ])
    communities = pd.DataFrame([
        {"community": 1, "level": 1, "title": "A", "entity_ids": ["e1"]},
    ])
    codeflash_output = read_indexer_reports(reports, communities, community_level=None); result = codeflash_output # 3.16ms -> 2.23ms (41.9% faster)


def test_edge_filtering_by_community_level():
    # Only reports/communities with level <= community_level are kept
    reports = pd.DataFrame([
        {"id": "r1", "community": 1, "level": 1, "title": "A", "summary": "S1", "full_content": "C1", "rank": 0.5, "full_content_embedding": [1.0, 2.0]},
        {"id": "r2", "community": 2, "level": 3, "title": "B", "summary": "S2", "full_content": "C2", "rank": 0.7, "full_content_embedding": [3.0, 4.0]},
    ])
    communities = pd.DataFrame([
        {"community": 1, "level": 1, "title": "A", "entity_ids": ["e1"]},
        {"community": 2, "level": 3, "title": "B", "entity_ids": ["e2"]},
    ])
    codeflash_output = read_indexer_reports(reports, communities, community_level=2); result = codeflash_output # 3.54ms -> 2.59ms (36.9% faster)

def test_edge_entity_ids_with_empty_list():
    # entity_ids is empty list, should not crash
    reports = pd.DataFrame([
        {"id": "r1", "community": 1, "level": 1, "title": "A", "summary": "S1", "full_content": "C1", "rank": 0.5, "full_content_embedding": [1.0, 2.0]},
    ])
    communities = pd.DataFrame([
        {"community": 1, "level": 1, "title": "A", "entity_ids": []},
    ])
    codeflash_output = read_indexer_reports(reports, communities, community_level=None); result = codeflash_output # 3.19ms -> 2.25ms (41.8% faster)

def test_edge_report_with_missing_optional_fields():
    # Some optional fields (rank, embedding) are missing
    reports = pd.DataFrame([
        {"id": "r1", "community": 1, "level": 1, "title": "A", "summary": "S1", "full_content": "C1"},
    ])
    communities = pd.DataFrame([
        {"community": 1, "level": 1, "title": "A", "entity_ids": ["e1"]},
    ])
    codeflash_output = read_indexer_reports(reports, communities, community_level=None); result = codeflash_output # 3.13ms -> 2.16ms (45.1% faster)


def test_large_scale_1000_reports():
    # 1000 reports, 10 communities, all levels 1-10
    n_reports = 1000
    n_communities = 10
    reports = pd.DataFrame({
        "id": [f"r{i}" for i in range(n_reports)],
        "community": [i % n_communities for i in range(n_reports)],
        "level": [i % 10 + 1 for i in range(n_reports)],
        "title": [f"Title{i%10}" for i in range(n_reports)],
        "summary": [f"Summary{i}" for i in range(n_reports)],
        "full_content": [f"Content{i}" for i in range(n_reports)],
        "rank": [float(i)/n_reports for i in range(n_reports)],
        "full_content_embedding": [[float(i), float(i+1)] for i in range(n_reports)],
    })
    communities = pd.DataFrame({
        "community": list(range(n_communities)),
        "level": [i+1 for i in range(n_communities)],
        "title": [f"Title{i}" for i in range(n_communities)],
        "entity_ids": [["e"+str(i*10+j) for j in range(10)] for i in range(n_communities)],
    })
    codeflash_output = read_indexer_reports(reports, communities, community_level=None); result = codeflash_output # 7.13ms -> 6.04ms (17.9% faster)

def test_large_scale_filtering():
    # 1000 reports, only those with level <= 5 should remain after filtering
    n_reports = 1000
    n_communities = 10
    reports = pd.DataFrame({
        "id": [f"r{i}" for i in range(n_reports)],
        "community": [i % n_communities for i in range(n_reports)],
        "level": [i % 10 + 1 for i in range(n_reports)],
        "title": [f"Title{i%10}" for i in range(n_reports)],
        "summary": [f"Summary{i}" for i in range(n_reports)],
        "full_content": [f"Content{i}" for i in range(n_reports)],
        "rank": [float(i)/n_reports for i in range(n_reports)],
        "full_content_embedding": [[float(i), float(i+1)] for i in range(n_reports)],
    })
    communities = pd.DataFrame({
        "community": list(range(n_communities)),
        "level": [i+1 for i in range(n_communities)],
        "title": [f"Title{i}" for i in range(n_communities)],
        "entity_ids": [["e"+str(i*10+j) for j in range(10)] for i in range(n_communities)],
    })
    codeflash_output = read_indexer_reports(reports, communities, community_level=5); result = codeflash_output # 5.60ms -> 4.54ms (23.5% faster)
    # Only reports with level <= 5 remain
    expected_count = sum(1 for i in range(n_reports) if (i % 10 + 1) <= 5)


def test_large_scale_dynamic_community_selection():
    # 1000 reports, dynamic_community_selection True
    n_reports = 1000
    n_communities = 10
    reports = pd.DataFrame({
        "id": [f"r{i}" for i in range(n_reports)],
        "community": [i % n_communities for i in range(n_reports)],
        "level": [i % 10 + 1 for i in range(n_reports)],
        "title": [f"Title{i%10}" for i in range(n_reports)],
        "summary": [f"Summary{i}" for i in range(n_reports)],
        "full_content": [f"Content{i}" for i in range(n_reports)],
        "rank": [float(i)/n_reports for i in range(n_reports)],
        "full_content_embedding": [[float(i), float(i+1)] for i in range(n_reports)],
    })
    communities = pd.DataFrame({
        "community": list(range(n_communities)),
        "level": [i+1 for i in range(n_communities)],
        "title": [f"Title{i}" for i in range(n_communities)],
        "entity_ids": [["e"+str(i*10+j) for j in range(10)] for i in range(n_communities)],
    })
    codeflash_output = read_indexer_reports(reports, communities, community_level=None, dynamic_community_selection=True); result = codeflash_output # 5.22ms -> 5.19ms (0.472% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-read_indexer_reports-mglocpqt` and push.


codeflash-ai bot requested a review from mashraf-222 October 11, 2025 02:47

codeflash-ai bot added the `⚡️ codeflash` label (Optimization PR opened by Codeflash AI) Oct 11, 2025

codeflash-ai[bot] wants to merge 1 commit into `main` from `codeflash/optimize-read_indexer_reports-mglocpqt`
