⚡️ Speed up function read_indexer_reports by 29% #64

Open
codeflash-ai[bot] wants to merge 1 commit into main from codeflash/optimize-read_indexer_reports-mglocpqt

codeflash-ai bot commented on Oct 11, 2025

📄 29% (0.29x) speedup for read_indexer_reports in graphrag/query/indexer_adapters.py

⏱️ Runtime : 79.0 milliseconds → 61.5 milliseconds (best of 46 runs)
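
As a quick check on the headline figure, the arithmetic can be sketched as below. This assumes the percentage is computed as (original - optimized) / optimized, which is approximately consistent with the per-test figures in this report but is not stated explicitly here. `speedup_pct` is a hypothetical helper, not part of Codeflash.

```python
def speedup_pct(original_ms: float, optimized_ms: float) -> float:
    """Relative speedup in percent, measured against the optimized runtime.

    Assumed formula: (original - optimized) / optimized * 100.
    """
    return (original_ms - optimized_ms) / optimized_ms * 100.0


# Headline runtimes from the report above.
headline = speedup_pct(79.0, 61.5)
print(f"{headline:.1f}%")  # prints 28.5%
```

Under this assumption the two runtimes give about 28.5%, which sits between the 28% and 29% figures quoted in this PR.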

📝 Explanation and details

The optimized code achieves a speedup of roughly 28-29% (79.0 ms down to 61.5 ms) through several key improvements to DataFrame operations in read_indexer_reports:

1. Streamlined Community Processing

  • Original: Used chained .loc[:, "community"] assignments followed by groupby().agg().reset_index() and merge() operations
  • Optimized: Combined fillna and astype into a single operation, then used drop_duplicates(subset=["title"], keep="last") with direct filtering via isin()
  • Why faster: Eliminates expensive groupby aggregation and merge operations, replacing them with more efficient direct DataFrame filtering
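
A minimal sketch of the two rollup strategies described above, on toy data. The frames, column values, and function names are illustrative assumptions and are not copied from graphrag. The `sort_values` call is also an assumption of this sketch: `drop_duplicates(..., keep="last")` keeps the last row in the current order, so it matches `groupby(...).max()` only when rows are ordered by `community` first.

```python
import pandas as pd

# Toy stand-ins for the exploded communities frame and the reports frame.
nodes = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "community": [1, 4, 2, None],
})
reports = pd.DataFrame({
    "id": ["r1", "r2", "r3", "r4"],
    "community": [1, 2, 4, 7],
})


def rollup_groupby_merge(nodes: pd.DataFrame, reports: pd.DataFrame) -> pd.DataFrame:
    """Baseline shape: max community per title, then an inner merge."""
    nodes = nodes.copy()
    nodes["community"] = nodes["community"].fillna(-1).astype(int)
    top = nodes.groupby("title").agg({"community": "max"}).reset_index()
    keep = top["community"].drop_duplicates()
    return reports.merge(keep, on="community", how="inner")


def rollup_dedupe_isin(nodes: pd.DataFrame, reports: pd.DataFrame) -> pd.DataFrame:
    """Alternative shape: sort, keep the last row per title, filter with isin."""
    community = nodes["community"].fillna(-1).astype(int)
    top = (
        nodes.assign(community=community)
        .sort_values("community")  # assumption: needed so "last" means "max"
        .drop_duplicates(subset=["title"], keep="last")
    )
    return reports[reports["community"].isin(top["community"])]


baseline = rollup_groupby_merge(nodes, reports)
candidate = rollup_dedupe_isin(nodes, reports)
```

Both functions select the same report rows on this input; comparing their outputs on real data is a reasonable way to confirm the rewrite preserves behavior.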

2. Reduced DataFrame Operations

  • Original: Multiple separate operations: fillna(-1), astype(int), groupby, merge, drop_duplicates
  • Optimized: Consolidated into fewer, more efficient operations using vectorized pandas methods
  • Why faster: Fewer intermediate DataFrame copies and less overhead from chained operations
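
To illustrate the consolidation, the sketch below normalizes a `community` column in two separate assignments and then in one chained expression. The toy frame is an assumption for illustration only.

```python
import pandas as pd

nodes = pd.DataFrame({"title": ["A", "B", "C"], "community": [1.0, None, 3.0]})

# Two-step form: each statement writes the column back before the next runs.
two_step = nodes.copy()
two_step.loc[:, "community"] = two_step["community"].fillna(-1)
two_step.loc[:, "community"] = two_step["community"].astype(int)

# Chained form: fill and cast in one expression, with a single assignment.
one_step = nodes.assign(community=nodes["community"].fillna(-1).astype(int))
```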

3. Optimized Embedding Logic

  • Original: Always called the expensive embed_community_reports function
  • Optimized: Added conditional logic to only embed missing values using boolean indexing to target specific rows
  • Why faster: Avoids unnecessary embedding operations and reduces function call overhead
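
The conditional embedding step might look roughly like the sketch below. `embed_missing` and `fake_embed` are hypothetical stand-ins (the real `embed_community_reports` and embedding model are not shown in this report). The point is only the boolean mask that restricts work to rows without an embedding.

```python
from typing import Callable, List

import pandas as pd

calls: List[str] = []  # records which texts were actually embedded


def fake_embed(text: str) -> List[float]:
    """Stand-in for a real (expensive) embedding call."""
    calls.append(text)
    return [float(len(text))]


def embed_missing(
    df: pd.DataFrame,
    embed: Callable[[str], List[float]],
    text_col: str = "full_content",
    embedding_col: str = "full_content_embedding",
) -> pd.DataFrame:
    """Embed only the rows whose embedding is absent."""
    out = df.copy()
    if embedding_col not in out.columns:
        out[embedding_col] = None
    missing = out[embedding_col].isna()  # boolean mask of rows to fill
    if missing.any():
        out.loc[missing, embedding_col] = out.loc[missing, text_col].map(embed)
    return out


reports = pd.DataFrame({
    "id": ["r1", "r2", "r3"],
    "full_content": ["alpha", "be", "gamma"],
    "full_content_embedding": [[0.1, 0.2], None, None],
})
result = embed_missing(reports, fake_embed)
```

Here `fake_embed` is called only for the two rows that lack an embedding; the row that already has one is left untouched.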

4. Minor Loop Optimizations in read_community_reports

  • Added local variable caching for frequently accessed functions and objects to reduce attribute lookup overhead in tight loops
  • Split the comprehension into separate branches to avoid repeated conditional checks
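
The loop-level changes can be sketched in plain Python as follows. `Report`, `build_reports_baseline`, and `build_reports_split` are hypothetical, simplified stand-ins rather than the actual `read_community_reports` code; they only show the local-alias and branch-hoisting patterns the bullets describe.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Report:
    id: str
    rank: Optional[float] = None


def build_reports_baseline(rows, rank_col=None):
    # The `rank_col` check is re-evaluated for every row.
    return [
        Report(id=str(row["id"]), rank=row[rank_col] if rank_col else None)
        for row in rows
    ]


def build_reports_split(rows, rank_col=None):
    make = Report  # function-level alias, intended to cut per-row name-lookup cost
    if rank_col:
        # Branch chosen once, outside the comprehension.
        return [make(id=str(row["id"]), rank=row[rank_col]) for row in rows]
    return [make(id=str(row["id"])) for row in rows]


rows = [{"id": 1, "rank": 0.5}, {"id": 2, "rank": 0.7}]
with_rank = build_reports_split(rows, rank_col="rank")
without_rank = build_reports_split(rows)
```

Both builders return equal lists for the same arguments; any gain from the split version comes from doing less work per row, so it is expected to be small.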

The optimizations are most effective for test cases with:

  • Non-dynamic community selection scenarios (where the groupby optimization applies)
  • Small inputs, where fixed per-call overhead dominates (36-51% faster in the generated tests with at most two rows)
  • Large datasets (18-31% faster in the 300-1000 record tests)
  • Cases with existing embeddings (avoiding expensive re-embedding)

For dynamic selection cases the difference is negligible (from 2.45% slower to 0.56% faster in the generated tests), because the rollup path is skipped.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 22 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 93.8% |

🌀 Generated Regression Tests and Runtime
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

import numpy as np
import pandas as pd
# imports
import pytest
from graphrag.query.indexer_adapters import read_indexer_reports


@dataclass
class CommunityReport:
    id: str
    short_id: Optional[str]
    title: str
    community_id: str
    summary: str
    full_content: str
    rank: Optional[float]
    full_content_embedding: Optional[List[float]]
    attributes: Optional[Dict[str, Any]] = None

# ----------------------------
# Unit tests for read_indexer_reports
# ----------------------------

# ----------- BASIC TEST CASES ------------

def make_basic_reports_df():
    # Returns a simple DataFrame for basic tests
    return pd.DataFrame([
        {"id": "r1", "community": 1, "level": 1, "title": "A", "summary": "sumA", "full_content": "contentA", "rank": 10.0, "full_content_embedding": [1.0, 2.0]},
        {"id": "r2", "community": 2, "level": 2, "title": "B", "summary": "sumB", "full_content": "contentB", "rank": 20.0, "full_content_embedding": [3.0, 4.0]},
    ])

def make_basic_communities_df():
    # Returns a simple DataFrame for basic tests
    return pd.DataFrame([
        {"community": 1, "level": 1, "title": "A", "entity_ids": ["e1", "e2"]},
        {"community": 2, "level": 2, "title": "B", "entity_ids": ["e3"]},
    ])

def test_basic_returns_expected_reports():
    """Test that basic input returns correct CommunityReport objects."""
    reports_df = make_basic_reports_df()
    communities_df = make_basic_communities_df()
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=None); result = codeflash_output # 3.31ms -> 2.36ms (40.1% faster)
    # Should only include communities present in both, with correct rollup
    ids = [r.id for r in result]

def test_basic_with_community_level_filter():
    """Test that filtering by community_level works."""
    reports_df = make_basic_reports_df()
    communities_df = make_basic_communities_df()
    # Only level <= 1 should be included
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=1); result = codeflash_output # 3.62ms -> 2.65ms (36.3% faster)
    ids = [r.id for r in result]

def test_basic_dynamic_community_selection():
    """Test dynamic_community_selection disables rollup."""
    reports_df = make_basic_reports_df()
    communities_df = make_basic_communities_df()
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=None, dynamic_community_selection=True); result = codeflash_output # 1.56ms -> 1.59ms (1.90% slower)
    # Should not filter or rollup, so all reports remain
    ids = [r.id for r in result]


def test_empty_reports_df_returns_empty():
    """Test that empty reports dataframe returns empty result."""
    reports_df = pd.DataFrame(columns=["id", "community", "level", "title", "summary", "full_content", "rank", "full_content_embedding"])
    communities_df = make_basic_communities_df()
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=None); result = codeflash_output # 3.13ms -> 2.27ms (38.2% faster)

def test_empty_communities_df_returns_empty():
    """Test that empty communities dataframe returns empty result."""
    reports_df = make_basic_reports_df()
    communities_df = pd.DataFrame(columns=["community", "level", "title", "entity_ids"])
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=None); result = codeflash_output # 3.12ms -> 2.09ms (49.2% faster)




def test_duplicate_communities_rollup():
    """Test that duplicate communities are rolled up to max level."""
    reports_df = pd.DataFrame([
        {"id": "r1", "community": 1, "level": 1, "title": "A", "summary": "sumA", "full_content": "contentA", "rank": 10.0, "full_content_embedding": [1.0]},
        {"id": "r2", "community": 1, "level": 2, "title": "A", "summary": "sumB", "full_content": "contentB", "rank": 20.0, "full_content_embedding": [2.0]},
    ])
    communities_df = pd.DataFrame([
        {"community": 1, "level": 1, "title": "A", "entity_ids": ["e1"]},
        {"community": 1, "level": 2, "title": "A", "entity_ids": ["e2"]},
    ])
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=None); result = codeflash_output # 3.25ms -> 2.29ms (41.8% faster)
    # Only the highest level community should be kept
    ids = [r.id for r in result]

def test_explode_entity_ids_with_multiple_entities():
    """Test that explode works with multiple entities per community."""
    reports_df = pd.DataFrame([
        {"id": "r1", "community": 1, "level": 1, "title": "A", "summary": "sumA", "full_content": "contentA", "rank": 10.0, "full_content_embedding": [1.0]},
    ])
    communities_df = pd.DataFrame([
        {"community": 1, "level": 1, "title": "A", "entity_ids": ["e1", "e2", "e3"]},
    ])
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=None); result = codeflash_output # 3.28ms -> 2.35ms (39.6% faster)

def test_reports_with_missing_optional_columns():
    """Test that missing optional columns does not break function."""
    reports_df = pd.DataFrame([
        {"id": "r1", "community": 1, "level": 1, "title": "A", "summary": "sumA", "full_content": "abc"},
    ])
    communities_df = pd.DataFrame([
        {"community": 1, "level": 1, "title": "A", "entity_ids": ["e1"]},
    ])
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=None); result = codeflash_output # 3.10ms -> 2.15ms (44.4% faster)

# ----------- LARGE SCALE TEST CASES ------------

def test_large_scale_reports_and_communities():
    """Test with large number of reports and communities."""
    n = 500
    reports_df = pd.DataFrame([
        {"id": f"r{i}", "community": i % 10, "level": i % 5, "title": f"T{i%10}", "summary": f"sum{i}", "full_content": f"content{i}", "rank": float(i), "full_content_embedding": [float(i), float(i+1)]}
        for i in range(n)
    ])
    communities_df = pd.DataFrame([
        {"community": i, "level": i % 5, "title": f"T{i}", "entity_ids": [f"e{i}", f"e{i+1}"]}
        for i in range(10)
    ])
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=None); result = codeflash_output # 5.21ms -> 4.18ms (24.7% faster)
    # Should only include reports with communities present in communities_df
    communities_in_result = set(r.community_id for r in result)


def test_large_scale_with_community_level_filter():
    """Test large scale with community_level filter."""
    n = 300
    reports_df = pd.DataFrame([
        {"id": f"r{i}", "community": i % 10, "level": i % 5, "title": f"T{i%10}", "summary": f"sum{i}", "full_content": f"content{i}", "rank": float(i), "full_content_embedding": [float(i), float(i+1)]}
        for i in range(n)
    ])
    communities_df = pd.DataFrame([
        {"community": i, "level": i % 5, "title": f"T{i}", "entity_ids": [f"e{i}", f"e{i+1}"]}
        for i in range(10)
    ])
    # Only level <= 2 should be included
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=2); result = codeflash_output # 4.45ms -> 3.40ms (30.8% faster)

def test_large_scale_dynamic_selection():
    """Test large scale with dynamic_community_selection=True disables rollup."""
    n = 400
    reports_df = pd.DataFrame([
        {"id": f"r{i}", "community": i % 10, "level": i % 5, "title": f"T{i%10}", "summary": f"sum{i}", "full_content": f"content{i}", "rank": float(i), "full_content_embedding": [float(i), float(i+1)]}
        for i in range(n)
    ])
    communities_df = pd.DataFrame([
        {"community": i, "level": i % 5, "title": f"T{i}", "entity_ids": [f"e{i}", f"e{i+1}"]}
        for i in range(10)
    ])
    codeflash_output = read_indexer_reports(reports_df, communities_df, community_level=None, dynamic_community_selection=True); result = codeflash_output # 3.04ms -> 3.02ms (0.559% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import sys

import numpy as np
import pandas as pd
# imports
import pytest
from graphrag.query.indexer_adapters import read_indexer_reports

# --- Unit Tests for read_indexer_reports ---

# Helper: Minimal CommunityReport for test validation
class CommunityReport:
    def __init__(
        self,
        id,
        short_id,
        title,
        community_id,
        summary,
        full_content,
        rank,
        full_content_embedding,
        attributes=None,
    ):
        self.id = id
        self.short_id = short_id
        self.title = title
        self.community_id = community_id
        self.summary = summary
        self.full_content = full_content
        self.rank = rank
        self.full_content_embedding = full_content_embedding
        self.attributes = attributes

    def __eq__(self, other):
        if not isinstance(other, CommunityReport):
            return False
        return (
            self.id == other.id
            and self.short_id == other.short_id
            and self.title == other.title
            and self.community_id == other.community_id
            and self.summary == other.summary
            and self.full_content == other.full_content
            and self.rank == other.rank
            and self.full_content_embedding == other.full_content_embedding
            and self.attributes == other.attributes
        )

    def __repr__(self):
        return (
            f"CommunityReport(id={self.id!r}, short_id={self.short_id!r}, title={self.title!r}, "
            f"community_id={self.community_id!r}, summary={self.summary!r}, full_content={self.full_content!r}, "
            f"rank={self.rank!r}, full_content_embedding={self.full_content_embedding!r}, attributes={self.attributes!r})"
        )

# Helper: Minimal EmbeddingModel for test validation
class DummyEmbeddingModel:
    def embed(self, text):
        # Returns a deterministic embedding for testing
        return [len(text), sum(ord(c) for c in text) % 100]

# Helper: Minimal GraphRagConfig for embedding model config
class DummyGraphRagConfig:
    def get_language_model_config(self, model_id):
        class DummyLMConfig:
            type = "dummy"
        return DummyLMConfig()

# Helper: Patch ModelManager for embedding
class DummyModelManager:
    def get_or_create_embedding_model(self, name, model_type, config):
        return DummyEmbeddingModel()

# ---- TESTS ----

# --- Basic Test Cases ---

def test_basic_single_report():
    # One report, one community, no filtering
    reports = pd.DataFrame([{
        "id": "r1",
        "community": 1,
        "level": 1,
        "title": "Title1",
        "summary": "Summary1",
        "full_content": "Content1",
        "rank": 0.5,
        "full_content_embedding": [1.0, 2.0],
    }])
    communities = pd.DataFrame([{
        "community": 1,
        "level": 1,
        "title": "Title1",
        "entity_ids": ["e1", "e2"],
    }])
    codeflash_output = read_indexer_reports(reports, communities, community_level=None); result = codeflash_output # 3.29ms -> 2.36ms (39.6% faster)

def test_basic_multi_report_multi_community():
    # Multiple reports, multiple communities
    reports = pd.DataFrame([
        {"id": "r1", "community": 1, "level": 1, "title": "A", "summary": "S1", "full_content": "C1", "rank": 0.5, "full_content_embedding": [1.0, 2.0]},
        {"id": "r2", "community": 2, "level": 2, "title": "B", "summary": "S2", "full_content": "C2", "rank": 0.7, "full_content_embedding": [3.0, 4.0]},
    ])
    communities = pd.DataFrame([
        {"community": 1, "level": 1, "title": "A", "entity_ids": ["e1"]},
        {"community": 2, "level": 2, "title": "B", "entity_ids": ["e2"]},
    ])
    codeflash_output = read_indexer_reports(reports, communities, community_level=None); result = codeflash_output # 3.19ms -> 2.26ms (41.0% faster)
    ids = set(r.id for r in result)

def test_basic_dynamic_community_selection_true():
    # Dynamic selection disables rollup
    reports = pd.DataFrame([
        {"id": "r1", "community": 1, "level": 1, "title": "A", "summary": "S1", "full_content": "C1", "rank": 0.5, "full_content_embedding": [1.0, 2.0]},
        {"id": "r2", "community": 2, "level": 2, "title": "B", "summary": "S2", "full_content": "C2", "rank": 0.7, "full_content_embedding": [3.0, 4.0]},
    ])
    communities = pd.DataFrame([
        {"community": 1, "level": 1, "title": "A", "entity_ids": ["e1"]},
        {"community": 2, "level": 2, "title": "B", "entity_ids": ["e2"]},
    ])
    codeflash_output = read_indexer_reports(reports, communities, community_level=None, dynamic_community_selection=True); result = codeflash_output # 1.46ms -> 1.50ms (2.45% slower)
    ids = set(r.id for r in result)


def test_edge_empty_reports_and_communities():
    # Both DataFrames empty
    reports = pd.DataFrame(columns=["id", "community", "level", "title", "summary", "full_content", "rank", "full_content_embedding"])
    communities = pd.DataFrame(columns=["community", "level", "title", "entity_ids"])
    codeflash_output = read_indexer_reports(reports, communities, community_level=None); result = codeflash_output # 3.00ms -> 1.99ms (51.0% faster)

def test_edge_missing_embedding_column_and_config_none():
    # Embedding column missing, but config is None, so no embedding is added
    reports = pd.DataFrame([
        {"id": "r1", "community": 1, "level": 1, "title": "A", "summary": "S1", "full_content": "C1", "rank": 0.5},
    ])
    communities = pd.DataFrame([
        {"community": 1, "level": 1, "title": "A", "entity_ids": ["e1"]},
    ])
    codeflash_output = read_indexer_reports(reports, communities, community_level=None); result = codeflash_output # 3.16ms -> 2.23ms (41.9% faster)


def test_edge_filtering_by_community_level():
    # Only reports/communities with level <= community_level are kept
    reports = pd.DataFrame([
        {"id": "r1", "community": 1, "level": 1, "title": "A", "summary": "S1", "full_content": "C1", "rank": 0.5, "full_content_embedding": [1.0, 2.0]},
        {"id": "r2", "community": 2, "level": 3, "title": "B", "summary": "S2", "full_content": "C2", "rank": 0.7, "full_content_embedding": [3.0, 4.0]},
    ])
    communities = pd.DataFrame([
        {"community": 1, "level": 1, "title": "A", "entity_ids": ["e1"]},
        {"community": 2, "level": 3, "title": "B", "entity_ids": ["e2"]},
    ])
    codeflash_output = read_indexer_reports(reports, communities, community_level=2); result = codeflash_output # 3.54ms -> 2.59ms (36.9% faster)

def test_edge_entity_ids_with_empty_list():
    # entity_ids is empty list, should not crash
    reports = pd.DataFrame([
        {"id": "r1", "community": 1, "level": 1, "title": "A", "summary": "S1", "full_content": "C1", "rank": 0.5, "full_content_embedding": [1.0, 2.0]},
    ])
    communities = pd.DataFrame([
        {"community": 1, "level": 1, "title": "A", "entity_ids": []},
    ])
    codeflash_output = read_indexer_reports(reports, communities, community_level=None); result = codeflash_output # 3.19ms -> 2.25ms (41.8% faster)

def test_edge_report_with_missing_optional_fields():
    # Some optional fields (rank, embedding) are missing
    reports = pd.DataFrame([
        {"id": "r1", "community": 1, "level": 1, "title": "A", "summary": "S1", "full_content": "C1"},
    ])
    communities = pd.DataFrame([
        {"community": 1, "level": 1, "title": "A", "entity_ids": ["e1"]},
    ])
    codeflash_output = read_indexer_reports(reports, communities, community_level=None); result = codeflash_output # 3.13ms -> 2.16ms (45.1% faster)


def test_large_scale_1000_reports():
    # 1000 reports, 10 communities, all levels 1-10
    n_reports = 1000
    n_communities = 10
    reports = pd.DataFrame({
        "id": [f"r{i}" for i in range(n_reports)],
        "community": [i % n_communities for i in range(n_reports)],
        "level": [i % 10 + 1 for i in range(n_reports)],
        "title": [f"Title{i%10}" for i in range(n_reports)],
        "summary": [f"Summary{i}" for i in range(n_reports)],
        "full_content": [f"Content{i}" for i in range(n_reports)],
        "rank": [float(i)/n_reports for i in range(n_reports)],
        "full_content_embedding": [[float(i), float(i+1)] for i in range(n_reports)],
    })
    communities = pd.DataFrame({
        "community": list(range(n_communities)),
        "level": [i+1 for i in range(n_communities)],
        "title": [f"Title{i}" for i in range(n_communities)],
        "entity_ids": [["e"+str(i*10+j) for j in range(10)] for i in range(n_communities)],
    })
    codeflash_output = read_indexer_reports(reports, communities, community_level=None); result = codeflash_output # 7.13ms -> 6.04ms (17.9% faster)

def test_large_scale_filtering():
    # 1000 reports, only those with level <= 5 should remain after filtering
    n_reports = 1000
    n_communities = 10
    reports = pd.DataFrame({
        "id": [f"r{i}" for i in range(n_reports)],
        "community": [i % n_communities for i in range(n_reports)],
        "level": [i % 10 + 1 for i in range(n_reports)],
        "title": [f"Title{i%10}" for i in range(n_reports)],
        "summary": [f"Summary{i}" for i in range(n_reports)],
        "full_content": [f"Content{i}" for i in range(n_reports)],
        "rank": [float(i)/n_reports for i in range(n_reports)],
        "full_content_embedding": [[float(i), float(i+1)] for i in range(n_reports)],
    })
    communities = pd.DataFrame({
        "community": list(range(n_communities)),
        "level": [i+1 for i in range(n_communities)],
        "title": [f"Title{i}" for i in range(n_communities)],
        "entity_ids": [["e"+str(i*10+j) for j in range(10)] for i in range(n_communities)],
    })
    codeflash_output = read_indexer_reports(reports, communities, community_level=5); result = codeflash_output # 5.60ms -> 4.54ms (23.5% faster)
    # Only reports with level <= 5 remain
    expected_count = sum(1 for i in range(n_reports) if (i % 10 + 1) <= 5)


def test_large_scale_dynamic_community_selection():
    # 1000 reports, dynamic_community_selection True
    n_reports = 1000
    n_communities = 10
    reports = pd.DataFrame({
        "id": [f"r{i}" for i in range(n_reports)],
        "community": [i % n_communities for i in range(n_reports)],
        "level": [i % 10 + 1 for i in range(n_reports)],
        "title": [f"Title{i%10}" for i in range(n_reports)],
        "summary": [f"Summary{i}" for i in range(n_reports)],
        "full_content": [f"Content{i}" for i in range(n_reports)],
        "rank": [float(i)/n_reports for i in range(n_reports)],
        "full_content_embedding": [[float(i), float(i+1)] for i in range(n_reports)],
    })
    communities = pd.DataFrame({
        "community": list(range(n_communities)),
        "level": [i+1 for i in range(n_communities)],
        "title": [f"Title{i}" for i in range(n_communities)],
        "entity_ids": [["e"+str(i*10+j) for j in range(10)] for i in range(n_communities)],
    })
    codeflash_output = read_indexer_reports(reports, communities, community_level=None, dynamic_community_selection=True); result = codeflash_output # 5.22ms -> 5.19ms (0.472% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, git checkout codeflash/optimize-read_indexer_reports-mglocpqt and push.

codeflash-ai bot requested a review from mashraf-222 on October 11, 2025, 02:47
codeflash-ai bot added the "⚡️ codeflash" label (Optimization PR opened by Codeflash AI) on Oct 11, 2025