⚡️ Speed up function embed_community_reports by 28% #66

Open
codeflash-ai[bot] wants to merge 1 commit into main from codeflash/optimize-embed_community_reports-mglotmub
codeflash-ai bot commented Oct 11, 2025

📄 28% (0.28x) speedup for embed_community_reports in graphrag/query/indexer_adapters.py

⏱️ Runtime: 5.77 milliseconds → 4.52 milliseconds (best of 377 runs)

📝 Explanation and details

The optimization replaces the pandas .apply() call and its lambda wrapper with a plain list comprehension over the column's values, yielding a 28% speedup overall (5.77 ms → 4.52 ms).

Key Changes:

  • Eliminated pandas .apply() overhead: The original code used reports_df.loc[:, source_col].apply(lambda x: embedder.embed(x)) which has significant pandas overhead for element-wise operations
  • Direct list comprehension: Replaced with src = reports_df[source_col].to_list() followed by embeddings = [embedder.embed(x) for x in src]
  • Reduced pandas Series operations: Converted to native Python list processing before assigning back to the DataFrame
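
The change described above can be sketched as two side-by-side functions. This is a reconstruction from the expressions quoted in this description, not the actual graphrag source: the function names, the `ReverseEmbedder` stand-in, and the error message text are assumptions made for illustration.

```python
import pandas as pd


class ReverseEmbedder:
    """Hypothetical stand-in for an embedding model (not part of graphrag)."""

    def embed(self, text):
        return None if text is None else list(str(text)[::-1])


def embed_reports_apply(reports_df, embedder, source_col="full_content",
                        embedding_col="full_content_embedding"):
    """Original approach as quoted in the PR: element-wise Series.apply()."""
    if source_col not in reports_df.columns:
        # Assumed message; the tests only establish that ValueError is raised.
        raise ValueError(f"Reports missing {source_col} column")
    if embedding_col not in reports_df.columns:
        reports_df[embedding_col] = reports_df.loc[:, source_col].apply(
            lambda x: embedder.embed(x)
        )
    return reports_df


def embed_reports_listcomp(reports_df, embedder, source_col="full_content",
                           embedding_col="full_content_embedding"):
    """Optimized approach: convert to a Python list, embed, assign back."""
    if source_col not in reports_df.columns:
        raise ValueError(f"Reports missing {source_col} column")
    if embedding_col not in reports_df.columns:
        src = reports_df[source_col].to_list()
        embeddings = [embedder.embed(x) for x in src]
        reports_df[embedding_col] = embeddings
    return reports_df


df = pd.DataFrame({"full_content": ["foo", "bar"]})
before = embed_reports_apply(df.copy(), ReverseEmbedder())
after = embed_reports_listcomp(df.copy(), ReverseEmbedder())
```

Both variants leave an existing embedding column untouched, which matches the behavior the regression tests exercise.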

Why This is Faster:

  1. Pandas .apply() overhead: Each .apply() call has internal pandas machinery that processes each element through the pandas Series infrastructure
  2. Lambda function overhead: The lambda is created once, but calling it for every row adds an extra Python function call on top of embedder.embed itself
  3. List comprehension efficiency: Native Python list comprehensions are highly optimized in CPython and avoid pandas' internal overhead
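
To see the per-element overhead in isolation, a micro-benchmark along these lines may help. It is a rough sketch, not the measurement Codeflash ran: `fake_embed` is a hypothetical, deliberately cheap function so that call overhead dominates, and the absolute numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd


def fake_embed(text):
    """Hypothetical cheap 'embedding' so that call overhead dominates."""
    return [len(text)]


series = pd.Series([f"row_{i}" for i in range(1000)])


def via_apply():
    # Element-wise apply through a lambda wrapper, as in the original code.
    return series.apply(lambda x: fake_embed(x))


def via_listcomp():
    # Convert once to a Python list, then loop natively.
    return [fake_embed(x) for x in series.to_list()]


apply_s = timeit.timeit(via_apply, number=50)
listcomp_s = timeit.timeit(via_listcomp, number=50)
print(f"apply: {apply_s:.4f}s  list comprehension: {listcomp_s:.4f}s")
```

With a real embedding model the embed call itself is likely to dominate, so the relative gain in production may be smaller than what a dummy embedder shows.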

Performance Characteristics:

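  • Largest relative gains on small DataFrames: empty, single-row, and few-row cases run roughly 36-40% faster, which suggests most of the saving is fixed pandas overhead
  • Still faster at scale: Large DataFrames (1000 rows) run roughly 19-22% faster
  • No change on the skip path: When the embedding column already exists nothing is recomputed, and timings stay within noise (about -2% to +7%)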
  • Edge cases preserved: Handles None values, mixed types, and empty DataFrames correctly while maintaining the performance benefit
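
The percentages in this report appear to be computed as original time divided by optimized time, minus one. That formula is an assumption inferred from the figures, not something stated here; the sketch below applies it to three of the reported timings.

```python
def speedup_percent(original_us: float, optimized_us: float) -> float:
    """Relative speedup in percent, assuming (original / optimized - 1) * 100."""
    return (original_us / optimized_us - 1.0) * 100.0


# Timings taken from this report, converted to microseconds.
overall = speedup_percent(5770.0, 4520.0)    # 5.77 ms -> 4.52 ms
single_row = speedup_percent(245.0, 178.0)   # single-row test
large = speedup_percent(568.0, 472.0)        # 1000-row test

print(f"overall {overall:.1f}%, single row {single_row:.1f}%, 1000 rows {large:.1f}%")
```

Small differences from the per-test figures in the listings are expected, since the displayed timings are rounded.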

The line profiler shows the bottleneck shifted from a single expensive .apply() operation (93.4% of time) to three more balanced operations: list conversion (9.8%), embedding computation (33.5%), and DataFrame assignment (49.3%).

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 27 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pandas as pd
# imports
import pytest  # used for our unit tests
from graphrag.language_model.protocol.base import EmbeddingModel
from graphrag.query.indexer_adapters import embed_community_reports

# ---- Test Infrastructure ----

class DummyEmbedder:
    """A dummy embedder that returns the input string reversed as a list of chars."""
    def __init__(self):
        self.calls = []

    def embed(self, text):
        self.calls.append(text)
        # For testing, return a deterministic embedding: reversed string as list of chars
        # If text is None, return None
        if text is None:
            return None
        return list(str(text)[::-1])

# ---- Unit Tests ----

# 1. Basic Test Cases

def test_basic_embedding_single_row():
    """Test embedding a single row dataframe with default columns."""
    df = pd.DataFrame({"full_content": ["hello world"]})
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder); result = codeflash_output # 245μs -> 178μs (37.7% faster)

def test_basic_embedding_multiple_rows():
    """Test embedding a dataframe with multiple rows."""
    df = pd.DataFrame({"full_content": ["foo", "bar", "baz"]})
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder); result = codeflash_output # 240μs -> 175μs (37.1% faster)

def test_basic_embedding_custom_columns():
    """Test embedding with custom source and embedding column names."""
    df = pd.DataFrame({"text": ["alpha", "beta"]})
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(
        df.copy(), embedder, source_col="text", embedding_col="text_emb"
    ); result = codeflash_output # 239μs -> 172μs (39.1% faster)

def test_embedding_col_already_exists():
    """Test that if embedding_col already exists, it's not overwritten or recomputed."""
    df = pd.DataFrame({
        "full_content": ["foo", "bar"],
        "full_content_embedding": ["should stay", "untouched"]
    })
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder); result = codeflash_output # 18.0μs -> 17.7μs (1.38% faster)

# 2. Edge Test Cases

def test_missing_source_column_raises():
    """Test that missing source_col raises ValueError."""
    df = pd.DataFrame({"other_col": ["data"]})
    embedder = DummyEmbedder()
    with pytest.raises(ValueError) as excinfo:
        embed_community_reports(df, embedder) # 18.3μs -> 17.5μs (4.11% faster)

def test_empty_dataframe():
    """Test embedding on an empty dataframe (should add embedding_col, but no calls)."""
    df = pd.DataFrame({"full_content": []})
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder); result = codeflash_output # 243μs -> 173μs (40.5% faster)

def test_null_values_in_source_column():
    """Test that null values in source_col are passed to embedder (which returns None)."""
    df = pd.DataFrame({"full_content": ["abc", None, "xyz"]})
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder); result = codeflash_output # 241μs -> 177μs (36.1% faster)

def test_source_column_with_non_string_types():
    """Test that non-string types in source_col are passed to the embedder (DummyEmbedder stringifies them)."""
    df = pd.DataFrame({"full_content": [123, 4.56, True, None]})
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder); result = codeflash_output # 241μs -> 176μs (36.7% faster)

def test_embedding_col_same_as_source_col():
    """Test that if embedding_col is the same as source_col, the column already exists and is left unchanged."""
    df = pd.DataFrame({"foo": ["bar", "baz"]})
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder, source_col="foo", embedding_col="foo"); result = codeflash_output # 18.3μs -> 17.1μs (6.87% faster)

def test_embedding_col_exists_but_is_nan():
    """Test that embedding_col exists but is all NaN (should not recompute)."""
    df = pd.DataFrame({
        "full_content": ["abc", "def"],
        "full_content_embedding": [float('nan'), float('nan')]
    })
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder); result = codeflash_output # 16.0μs -> 16.3μs (1.68% slower)

# 3. Large Scale Test Cases

def test_large_dataframe_embedding():
    """Test embedding a large dataframe (up to 1000 rows) for performance and correctness."""
    N = 1000
    texts = [f"row_{i}" for i in range(N)]
    df = pd.DataFrame({"full_content": texts})
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder); result = codeflash_output # 568μs -> 472μs (20.5% faster)
    # Spot check a few rows
    for i in [0, 10, 999]:
        expected = list(texts[i][::-1])
        assert result["full_content_embedding"].iloc[i] == expected

def test_large_dataframe_embedding_col_exists():
    """Test large dataframe where embedding_col already exists (should not call embedder)."""
    N = 500
    df = pd.DataFrame({
        "full_content": [f"row_{i}" for i in range(N)],
        "full_content_embedding": ["exists"] * N
    })
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder); result = codeflash_output # 19.6μs -> 19.2μs (1.70% faster)

def test_large_dataframe_with_nulls():
    """Test large dataframe with many nulls in source_col."""
    N = 1000
    # Alternate between valid string and None
    texts = [f"row_{i}" if i % 2 == 0 else None for i in range(N)]
    df = pd.DataFrame({"full_content": texts})
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder); result = codeflash_output # 467μs -> 382μs (22.1% faster)
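    # Check a few embeddings; null rows are skipped because their result depends on the embedder
    for i in [0, 1, 998, 999]:
        if texts[i] is not None:
            assert result["full_content_embedding"].iloc[i] == list(texts[i][::-1])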
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pandas as pd
# imports
import pytest  # used for our unit tests
from graphrag.language_model.protocol.base import EmbeddingModel
from graphrag.query.indexer_adapters import embed_community_reports

# ---- Mock EmbeddingModel for testing ----

class DummyEmbedder:
    """A dummy embedder that returns a list [length, first char, last char] ([0] for an empty string) for testability."""
    def embed(self, text):
        if text is None:
            return None
        s = str(text)
        if len(s) == 0:
            return [0]
        return [len(s), s[0], s[-1]]

# ---- Unit Tests ----

# Basic Test Cases

def test_basic_embedding_single_row():
    """Test embedding a single row DataFrame with default column names."""
    df = pd.DataFrame({"full_content": ["hello world"]})
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder); result = codeflash_output # 254μs -> 184μs (38.0% faster)

def test_basic_embedding_multiple_rows():
    """Test embedding multiple rows with varying content."""
    df = pd.DataFrame({"full_content": ["abc", "de", ""]})
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder); result = codeflash_output # 240μs -> 174μs (37.6% faster)

def test_basic_custom_column_names():
    """Test embedding with custom source and embedding column names."""
    df = pd.DataFrame({"text": ["xyz"]})
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(
        df.copy(), embedder, source_col="text", embedding_col="text_emb"
    ); result = codeflash_output # 240μs -> 170μs (40.4% faster)

def test_embedding_col_already_exists():
    """If embedding column already exists, function should not overwrite it."""
    df = pd.DataFrame({
        "full_content": ["abc"],
        "full_content_embedding": ["should_stay"]
    })
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder); result = codeflash_output # 18.0μs -> 17.4μs (3.59% faster)

# Edge Test Cases

def test_missing_source_column_raises():
    """If the source column is missing, should raise ValueError."""
    df = pd.DataFrame({"other_col": ["hello"]})
    embedder = DummyEmbedder()
    with pytest.raises(ValueError) as excinfo:
        embed_community_reports(df, embedder) # 18.0μs -> 17.7μs (1.32% faster)

def test_empty_dataframe():
    """Embedding an empty DataFrame should return an empty DataFrame with the embedding column."""
    df = pd.DataFrame({"full_content": []})
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder); result = codeflash_output # 239μs -> 176μs (35.9% faster)

def test_none_values_in_source_column():
    """Rows with None in the source column should get None as embedding."""
    df = pd.DataFrame({"full_content": ["abc", None, ""]})
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder); result = codeflash_output # 242μs -> 174μs (38.8% faster)

def test_non_string_values_in_source_column():
    """Non-string values in source column are passed to the embedder (DummyEmbedder converts them with str)."""
    df = pd.DataFrame({"full_content": [123, 45.6, True, None]})
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder); result = codeflash_output # 241μs -> 175μs (37.7% faster)

def test_embedding_col_same_as_source_col():
    """If embedding_col is the same as source_col, the column already exists and should be left unchanged."""
    df = pd.DataFrame({"foo": ["bar", "baz"]})
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder, source_col="foo", embedding_col="foo"); result = codeflash_output # 18.2μs -> 17.2μs (5.66% faster)

def test_embedding_col_exists_but_diff_type():
    """If embedding_col exists but is not the result of embedding, don't overwrite."""
    df = pd.DataFrame({"full_content": ["abc"], "full_content_embedding": [123]})
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder); result = codeflash_output # 15.9μs -> 16.0μs (0.723% slower)

# Large Scale Test Cases

def test_large_scale_embedding():
    """Test embedding with a large DataFrame (1000 rows)."""
    n = 1000
    df = pd.DataFrame({"full_content": ["row" + str(i) for i in range(n)]})
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder); result = codeflash_output # 456μs -> 384μs (18.8% faster)
    # Spot check a few rows
    for i in [0, 10, 500, 999]:
        s = "row" + str(i)
        expected = [len(s), s[0], s[-1]]
        assert result["full_content_embedding"].iloc[i] == expected

def test_large_scale_empty_strings():
    """Test embedding with a large DataFrame of empty strings."""
    n = 1000
    df = pd.DataFrame({"full_content": [""] * n})
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder); result = codeflash_output # 406μs -> 341μs (18.8% faster)

def test_large_scale_none_values():
    """Test embedding with a large DataFrame of None values."""
    n = 1000
    df = pd.DataFrame({"full_content": [None] * n})
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder); result = codeflash_output # 342μs -> 285μs (19.8% faster)

def test_large_scale_mixed_types():
    """Test embedding with a large DataFrame of mixed types."""
    n = 1000
    mixed = ["abc", 123, None, "", True, 45.6] * (n // 6)
    df = pd.DataFrame({"full_content": mixed})
    embedder = DummyEmbedder()
    codeflash_output = embed_community_reports(df.copy(), embedder); result = codeflash_output # 462μs -> 386μs (19.7% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
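
The assertions are not shown in the listings above. A minimal sketch of the kind of equivalence check the `codeflash_output` comment describes might look like the following. Everything here is illustrative: `RecordingEmbedder`, the two function names, and the case names are hypothetical, and the two implementations are reconstructed from the expressions quoted in the explanation rather than copied from graphrag.

```python
import pandas as pd

COL = "full_content_embedding"


class RecordingEmbedder:
    """Hypothetical embedder that records every input it receives."""

    def __init__(self):
        self.calls = []

    def embed(self, text):
        self.calls.append(text)
        return None if text is None else list(str(text)[::-1])


def embed_with_apply(df, embedder, source_col="full_content", embedding_col=COL):
    """Reconstructed original: Series.apply with a lambda wrapper."""
    if embedding_col not in df.columns:
        df[embedding_col] = df.loc[:, source_col].apply(lambda x: embedder.embed(x))
    return df


def embed_with_listcomp(df, embedder, source_col="full_content", embedding_col=COL):
    """Reconstructed optimization: list comprehension over to_list()."""
    if embedding_col not in df.columns:
        df[embedding_col] = [embedder.embed(x) for x in df[source_col].to_list()]
    return df


cases = {
    "strings": pd.DataFrame({"full_content": ["foo", "bar", "baz"]}),
    "with_none": pd.DataFrame({"full_content": ["abc", None, "xyz"]}),
    "empty": pd.DataFrame({"full_content": []}),
    "already_embedded": pd.DataFrame({"full_content": ["foo"], COL: ["keep"]}),
}

# For each case: (outputs match, embed calls in original, embed calls in optimized)
results = {}
for name, frame in cases.items():
    old_emb, new_emb = RecordingEmbedder(), RecordingEmbedder()
    old = embed_with_apply(frame.copy(), old_emb)
    new = embed_with_listcomp(frame.copy(), new_emb)
    results[name] = (
        old[COL].tolist() == new[COL].tolist(),
        len(old_emb.calls),
        len(new_emb.calls),
    )
```

Comparing the call counts as well as the outputs confirms that neither variant invokes the embedder when the embedding column already exists.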

To edit these changes git checkout codeflash/optimize-embed_community_reports-mglotmub and push.

codeflash-ai bot requested a review from mashraf-222 October 11, 2025 03:00
codeflash-ai bot added the ⚡️ codeflash label (Optimization PR opened by Codeflash AI) Oct 11, 2025