⚡️ Speed up function `_group_and_resolve_entities` by 299% #60

Open

codeflash-ai[bot] wants to merge 1 commit into main from codeflash/optimize-_group_and_resolve_entities-mglmouew
Conversation

codeflash-ai[bot] commented on Oct 11, 2025

📄 **299%** (2.99x) speedup for `_group_and_resolve_entities` in `graphrag/index/update/entities.py`

⏱️ Runtime: 384 milliseconds → 96.1 milliseconds (best of 13 runs)

📝 Explanation and details

The optimization achieves a **299% speedup** by eliminating the expensive pandas `groupby.agg()` operation with Python lambdas, which was consuming 86.5% of the original runtime (879.9 ms out of 1017 ms).

**Key optimizations:**

1. **Replaced pandas `groupby.agg()` with manual grouping**: Instead of using pandas' slow lambda functions in `.agg()`, the code now uses `np.unique()` with `return_inverse=True` to efficiently group rows by title, then applies operations directly on numpy arrays.

2. **Eliminated lambda overhead**: The original lambdas, such as `lambda x: list(x.astype(str))` and `lambda x: list(itertools.chain(*x.tolist()))`, were called for each group. The optimized version precomputes string conversions and list flattening using vectorized operations and list comprehensions.

3. **Direct DataFrame construction**: Rather than creating an intermediate grouped object and resetting the index, the code builds the result dictionary directly and constructs the final DataFrame in one step.

4. **Removed the `itertools` import**: No longer needed, since list flattening is handled with nested list comprehensions.
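The manual-grouping idea in point 1 can be sketched on toy data (a simplified illustration with assumed column names `title`, `id`, and `text_unit_ids`, mirroring the tests below — not the actual graphrag patch):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the combined entities; "title" is the grouping key.
df = pd.DataFrame({
    "title": ["A", "B", "A", "C"],
    "id": [1, 2, 3, 4],
    "text_unit_ids": [[10], [20], [11], [30]],
})

# Slow pattern: per-group Python lambdas inside groupby.agg().
slow = (
    df.groupby("title", sort=True)
    .agg({
        "id": lambda x: list(x.astype(str)),
        "text_unit_ids": lambda x: [tid for lst in x for tid in lst],
    })
    .reset_index()
)

# Fast pattern: np.unique(return_inverse=True) yields, for each row, the index
# of its group, so one pass over the rows buckets values without per-group lambdas.
titles = df["title"].to_numpy()
unique_titles, inverse = np.unique(titles, return_inverse=True)
ids = df["id"].astype(str).to_numpy()        # precompute string conversion once
text_units = df["text_unit_ids"].to_numpy()

grouped_ids = [[] for _ in unique_titles]
grouped_text_units = [[] for _ in unique_titles]
for row, group in enumerate(inverse):
    grouped_ids[group].append(ids[row])
    grouped_text_units[group].extend(text_units[row])

fast = pd.DataFrame({
    "title": unique_titles,
    "id": grouped_ids,
    "text_unit_ids": grouped_text_units,
})

print(fast)
```

Both paths produce the same grouped rows (titles A, B, C with merged ids and flattened `text_unit_ids`); the fast path just avoids invoking Python callables through the pandas groupby machinery.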

**Performance characteristics by test case:**

- **Basic cases** (2-4 entities): 63-69% faster, showing consistent overhead reduction
- **Large scale, no overlap** (1000 entities): 517-598% faster, demonstrating excellent scaling for non-overlapping data
- **Large scale, with overlap** (1000 entities): 378-491% faster, still highly effective when merging is required

The optimization is particularly effective for larger datasets, where the groupby operation becomes the dominant bottleneck, while maintaining identical functionality and output format.
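The flattening swap mentioned in point 4 is a standard equivalence: for a list of lists, a nested comprehension produces the same result as `itertools.chain`, without the import:

```python
import itertools

groups = [[101], [201, 202], [], [301]]

# Original approach: chain the sub-lists together.
chained = list(itertools.chain(*groups))

# Equivalent nested list comprehension, no import required.
flattened = [item for sub in groups for item in sub]

print(chained)  # [101, 201, 202, 301]
```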

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 26 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
from typing import List

import numpy as np
import pandas as pd
# imports
import pytest  # used for our unit tests
from graphrag.index.update.entities import _group_and_resolve_entities

# For testing, define ENTITIES_FINAL_COLUMNS as used in the function
ENTITIES_FINAL_COLUMNS = [
    "id",
    "title",
    "type",
    "human_readable_id",
    "description",
    "text_unit_ids",
    "degree",
    "x",
    "y",
    "frequency",
]

# ------------------- UNIT TESTS -------------------

# Helper to create a DataFrame for entities
def make_entities_df(
    ids: List[int],
    titles: List[str],
    types: List[str],
    human_readable_ids: List[int],
    descriptions: List[str],
    text_unit_ids: List[List[int]],
    degrees: List[int],
    xs: List[float],
    ys: List[float]
) -> pd.DataFrame:
    return pd.DataFrame({
        "id": ids,
        "title": titles,
        "type": types,
        "human_readable_id": human_readable_ids,
        "description": descriptions,
        "text_unit_ids": text_unit_ids,
        "degree": degrees,
        "x": xs,
        "y": ys,
    })

# ------------------- BASIC TEST CASES -------------------

def test_basic_no_overlap():
    """Test with two dataframes with no overlapping titles."""
    df_a = make_entities_df(
        [1], ["A"], ["type1"], [0], ["descA"], [[101, 102]], [1], [0.0], [0.0]
    )
    df_b = make_entities_df(
        [2], ["B"], ["type2"], [1], ["descB"], [[201]], [2], [1.0], [1.0]
    )
    result, id_map = _group_and_resolve_entities(df_a, df_b) # 3.96ms -> 2.40ms (64.6% faster)

def test_basic_with_overlap():
    """Test with overlapping title between A and B."""
    df_a = make_entities_df(
        [1], ["A"], ["type1"], [0], ["descA"], [[101]], [1], [0.0], [0.0]
    )
    df_b = make_entities_df(
        [2], ["A"], ["type1"], [1], ["descB"], [[201, 202]], [2], [1.0], [1.0]
    )
    result, id_map = _group_and_resolve_entities(df_a, df_b) # 3.80ms -> 2.26ms (68.2% faster)

def test_basic_multiple_rows():
    """Test with multiple rows, some overlapping, some not."""
    df_a = make_entities_df(
        [1, 3], ["A", "C"], ["type1", "type3"], [0, 2], ["descA", "descC"], [[101], [301]], [1, 3], [0.0, 3.0], [0.0, 3.0]
    )
    df_b = make_entities_df(
        [2, 4], ["A", "B"], ["type1", "type2"], [1, 3], ["descB", "descD"], [[201], [401, 402]], [2, 4], [1.0, 4.0], [1.0, 4.0]
    )
    result, id_map = _group_and_resolve_entities(df_a, df_b) # 4.02ms -> 2.40ms (67.0% faster)
    # Check text_unit_ids for A
    a_row = result[result["title"] == "A"].iloc[0]
    # B and C have correct text_unit_ids
    b_row = result[result["title"] == "B"].iloc[0]
    c_row = result[result["title"] == "C"].iloc[0]

# ------------------- EDGE TEST CASES -------------------


def test_empty_delta_entities():
    """Test with non-empty old_entities_df and empty delta_entities_df."""
    df_a = make_entities_df(
        [1], ["A"], ["type1"], [0], ["descA"], [[101]], [1], [0.0], [0.0]
    )
    df_b = make_entities_df([], [], [], [], [], [], [], [], [])
    result, id_map = _group_and_resolve_entities(df_a, df_b) # 3.83ms -> 2.35ms (63.1% faster)


def test_duplicate_titles_in_delta():
    """Test delta_entities_df with duplicate titles (should merge)."""
    df_a = make_entities_df(
        [1], ["A"], ["type1"], [0], ["descA"], [[101]], [1], [0.0], [0.0]
    )
    df_b = make_entities_df(
        [2, 3], ["A", "A"], ["type1", "type1"], [1, 2], ["descB", "descC"], [[201], [202]], [2, 2], [1.0, 1.0], [1.0, 1.0]
    )
    result, id_map = _group_and_resolve_entities(df_a, df_b) # 3.86ms -> 2.39ms (61.7% faster)
    descs = result["description"].iloc[0]
    text_ids = result["text_unit_ids"].iloc[0]

def test_duplicate_titles_in_old():
    """Test old_entities_df with duplicate titles (should merge)."""
    df_a = make_entities_df(
        [1, 2], ["A", "A"], ["type1", "type1"], [0, 1], ["descA", "descB"], [[101], [102]], [1, 1], [0.0, 0.0], [0.0, 0.0]
    )
    df_b = make_entities_df(
        [3], ["A"], ["type1"], [2], ["descC"], [[201]], [2], [1.0], [1.0]
    )
    result, id_map = _group_and_resolve_entities(df_a, df_b) # 3.90ms -> 2.35ms (65.6% faster)
    descs = result["description"].iloc[0]
    text_ids = result["text_unit_ids"].iloc[0]

def test_non_integer_ids():
    """Test with string ids and human_readable_ids."""
    df_a = pd.DataFrame({
        "id": ["a1"], "title": ["A"], "type": ["type1"], "human_readable_id": [0],
        "description": ["descA"], "text_unit_ids": [[101]], "degree": [1], "x": [0.0], "y": [0.0]
    })
    df_b = pd.DataFrame({
        "id": ["b2"], "title": ["A"], "type": ["type1"], "human_readable_id": [1],
        "description": ["descB"], "text_unit_ids": [[201, 202]], "degree": [2], "x": [1.0], "y": [1.0]
    })
    result, id_map = _group_and_resolve_entities(df_a, df_b) # 3.73ms -> 2.21ms (68.8% faster)

def test_nan_in_fields():
    """Test with NaN values in some fields."""
    df_a = pd.DataFrame({
        "id": [1], "title": ["A"], "type": ["type1"], "human_readable_id": [0],
        "description": [np.nan], "text_unit_ids": [[101]], "degree": [1], "x": [0.0], "y": [0.0]
    })
    df_b = pd.DataFrame({
        "id": [2], "title": ["A"], "type": ["type1"], "human_readable_id": [1],
        "description": ["descB"], "text_unit_ids": [[201]], "degree": [2], "x": [1.0], "y": [1.0]
    })
    result, id_map = _group_and_resolve_entities(df_a, df_b) # 3.86ms -> 2.33ms (65.6% faster)
    # NaN should be converted to 'nan' string in description
    descs = result["description"].iloc[0]

def test_empty_text_unit_ids():
    """Test with empty lists in text_unit_ids."""
    df_a = make_entities_df(
        [1], ["A"], ["type1"], [0], ["descA"], [[]], [1], [0.0], [0.0]
    )
    df_b = make_entities_df(
        [2], ["A"], ["type1"], [1], ["descB"], [[201]], [2], [1.0], [1.0]
    )
    result, id_map = _group_and_resolve_entities(df_a, df_b) # 3.81ms -> 2.27ms (68.3% faster)

def test_human_readable_id_increment():
    """Test that human_readable_id for delta starts at max of old + 1."""
    df_a = make_entities_df(
        [1, 2], ["A", "B"], ["type1", "type2"], [0, 2], ["descA", "descB"], [[101], [201]], [1, 2], [0.0, 1.0], [0.0, 1.0]
    )
    df_b = make_entities_df(
        [3, 4], ["C", "D"], ["type3", "type4"], [0, 0], ["descC", "descD"], [[301], [401]], [3, 4], [2.0, 3.0], [2.0, 3.0]
    )
    result, id_map = _group_and_resolve_entities(df_a, df_b) # 4.07ms -> 2.40ms (69.3% faster)
    # human_readable_id for C should be 3, for D should be 4
    c_id = result[result["title"] == "C"]["human_readable_id"].iloc[0]
    d_id = result[result["title"] == "D"]["human_readable_id"].iloc[0]

# ------------------- LARGE SCALE TEST CASES -------------------

def test_large_scale_no_overlap():
    """Test with 500 old and 500 delta, no overlapping titles."""
    n = 500
    df_a = make_entities_df(
        list(range(n)),
        [f"A{i}" for i in range(n)],
        ["typeA"] * n,
        list(range(n)),
        [f"descA{i}" for i in range(n)],
        [[i] for i in range(n)],
        [1] * n,
        [float(i) for i in range(n)],
        [float(i) for i in range(n)],
    )
    df_b = make_entities_df(
        list(range(n, 2*n)),
        [f"B{i}" for i in range(n)],
        ["typeB"] * n,
        list(range(n, 2*n)),
        [f"descB{i}" for i in range(n)],
        [[1000 + i] for i in range(n)],
        [2] * n,
        [float(i) for i in range(n, 2*n)],
        [float(i) for i in range(n, 2*n)],
    )
    result, id_map = _group_and_resolve_entities(df_a, df_b) # 43.4ms -> 7.02ms (518% faster)

def test_large_scale_all_overlap():
    """Test with 500 old and 500 delta, all titles overlap."""
    n = 500
    df_a = make_entities_df(
        list(range(n)),
        [f"A{i}" for i in range(n)],
        ["typeA"] * n,
        list(range(n)),
        [f"descA{i}" for i in range(n)],
        [[i] for i in range(n)],
        [1] * n,
        [float(i) for i in range(n)],
        [float(i) for i in range(n)],
    )
    df_b = make_entities_df(
        list(range(n, 2*n)),
        [f"A{i}" for i in range(n)],
        ["typeA"] * n,
        list(range(n, 2*n)),
        [f"descB{i}" for i in range(n)],
        [[1000 + i] for i in range(n)],
        [2] * n,
        [float(i) for i in range(n, 2*n)],
        [float(i) for i in range(n, 2*n)],
    )
    result, id_map = _group_and_resolve_entities(df_a, df_b) # 23.9ms -> 5.00ms (378% faster)
    # id_map should map all B ids to A ids
    for i in range(n):
        # text_unit_ids merged
        row = result[result["title"] == f"A{i}"].iloc[0]

def test_large_scale_some_overlap():
    """Test with 400 overlap, 100 unique in each."""
    n = 500
    overlap = 400
    df_a = make_entities_df(
        list(range(n)),
        [f"T{i}" for i in range(n)],
        ["typeA"] * n,
        list(range(n)),
        [f"descA{i}" for i in range(n)],
        [[i] for i in range(n)],
        [1] * n,
        [float(i) for i in range(n)],
        [float(i) for i in range(n)],
    )
    df_b = make_entities_df(
        list(range(n, n+overlap)) + list(range(n+overlap, n+overlap+100)),
        [f"T{i}" for i in range(overlap)] + [f"U{i}" for i in range(100)],
        ["typeB"] * (overlap + 100),
        list(range(n, n+overlap+100)),
        [f"descB{i}" for i in range(overlap)] + [f"descU{i}" for i in range(100)],
        [[1000 + i] for i in range(overlap)] + [[2000 + i] for i in range(100)],
        [2] * (overlap + 100),
        [float(i) for i in range(n, n+overlap+100)],
        [float(i) for i in range(n, n+overlap+100)],
    )
    result, id_map = _group_and_resolve_entities(df_a, df_b) # 27.9ms -> 5.50ms (408% faster)
    # id_map should map overlap B ids to A ids
    for i in range(overlap):
        row = result[result["title"] == f"T{i}"].iloc[0]
    # Unique B's present
    for i in range(100):
        title = f"U{i}"
        row = result[result["title"] == title].iloc[0]

def test_large_scale_duplicate_titles_in_both():
    """Test with duplicate titles in both old and delta, large scale."""
    n = 200
    # Each title appears twice in both A and B
    ids_a = list(range(n)) + list(range(n, 2*n))
    titles = [f"T{i}" for i in range(n)] * 2
    types = ["typeA"] * (2*n)
    hr_ids = list(range(n)) + list(range(n, 2*n))
    descs = [f"descA{i}" for i in range(n)] + [f"descA{i}b" for i in range(n)]
    text_ids = [[i] for i in range(n)] + [[i+100] for i in range(n)]
    degrees = [1] * (2*n)
    xs = [float(i) for i in range(n)] + [float(i) for i in range(n, 2*n)]
    ys = xs
    df_a = make_entities_df(ids_a, titles, types, hr_ids, descs, text_ids, degrees, xs, ys)

    ids_b = list(range(2*n, 3*n)) + list(range(3*n, 4*n))
    titles_b = [f"T{i}" for i in range(n)] * 2
    types_b = ["typeB"] * (2*n)
    hr_ids_b = list(range(2*n, 3*n)) + list(range(3*n, 4*n))
    descs_b = [f"descB{i}" for i in range(n)] + [f"descB{i}b" for i in range(n)]
    text_ids_b = [[i+200] for i in range(n)] + [[i+300] for i in range(n)]
    degrees_b = [2] * (2*n)
    xs_b = [float(i) for i in range(2*n, 3*n)] + [float(i) for i in range(3*n, 4*n)]
    ys_b = xs_b
    df_b = make_entities_df(ids_b, titles_b, types_b, hr_ids_b, descs_b, text_ids_b, degrees_b, xs_b, ys_b)

    result, id_map = _group_and_resolve_entities(df_a, df_b) # 12.3ms -> 3.82ms (222% faster)
    # Each title's text_unit_ids should be of length 4 (2 from A, 2 from B)
    for i in range(n):
        row = result[result["title"] == f"T{i}"].iloc[0]
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------

import numpy as np
import pandas as pd
# imports
import pytest  # used for our unit tests
from graphrag.index.update.entities import _group_and_resolve_entities

# function to test
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License


# For testing, define ENTITIES_FINAL_COLUMNS as in the original code
ENTITIES_FINAL_COLUMNS = [
    "id",
    "title",
    "type",
    "human_readable_id",
    "description",
    "text_unit_ids",
    "degree",
    "x",
    "y",
    "frequency",
]

# unit tests

# Helper function to create a DataFrame for entities
def make_entities_df(entities):
    """
    Helper to create a DataFrame with all required columns.
    entities: list of dicts with keys matching columns (missing keys get default values)
    """
    defaults = {
        "id": None,
        "title": "",
        "type": "",
        "human_readable_id": 0,
        "description": "",
        "text_unit_ids": [],
        "degree": 0,
        "x": 0.0,
        "y": 0.0,
        "frequency": 0,
    }
    rows = []
    for ent in entities:
        row = defaults.copy()
        row.update(ent)
        # Copy list to avoid shared references
        row["text_unit_ids"] = list(row["text_unit_ids"])
        rows.append(row)
    df = pd.DataFrame(rows)
    # Ensure columns order
    df = df[ENTITIES_FINAL_COLUMNS]
    return df

# ---------------- BASIC TEST CASES ----------------

def test_basic_no_overlap():
    """Test with two DataFrames with no overlapping titles."""
    df_a = make_entities_df([
        {"id": 1, "title": "Apple", "type": "Fruit", "human_readable_id": 0, "description": "A fruit", "text_unit_ids": [1], "degree": 1, "x": 0.1, "y": 0.2},
    ])
    df_b = make_entities_df([
        {"id": 2, "title": "Banana", "type": "Fruit", "description": "A yellow fruit", "text_unit_ids": [2], "degree": 2, "x": 0.3, "y": 0.4},
    ])
    resolved, id_mapping = _group_and_resolve_entities(df_a, df_b) # 3.91ms -> 2.34ms (66.5% faster)
    # human_readable_id of Banana should be 1 (max of df_a + 1)
    banana_hrid = resolved[resolved["title"] == "Banana"]["human_readable_id"].iloc[0]

def test_basic_with_overlap():
    """Test with overlapping titles, should map ids and merge text_unit_ids/descriptions."""
    df_a = make_entities_df([
        {"id": 10, "title": "Car", "type": "Vehicle", "human_readable_id": 5, "description": "A vehicle", "text_unit_ids": [100], "degree": 1, "x": 1.0, "y": 2.0},
        {"id": 11, "title": "Bike", "type": "Vehicle", "human_readable_id": 6, "description": "A two-wheeler", "text_unit_ids": [101], "degree": 2, "x": 3.0, "y": 4.0},
    ])
    df_b = make_entities_df([
        {"id": 20, "title": "Car", "type": "Vehicle", "description": "Automobile", "text_unit_ids": [102], "degree": 3, "x": 5.0, "y": 6.0},
        {"id": 21, "title": "Plane", "type": "Vehicle", "description": "Flies", "text_unit_ids": [103], "degree": 4, "x": 7.0, "y": 8.0},
    ])
    resolved, id_mapping = _group_and_resolve_entities(df_a, df_b) # 3.95ms -> 2.33ms (69.3% faster)
    # Car's text_unit_ids should include both 100 and 102
    car_row = resolved[resolved["title"] == "Car"].iloc[0]
    # Plane's human_readable_id should be 7 (max human_readable_id in df_a is 6, so the first new entity gets 7)
    plane_row = resolved[resolved["title"] == "Plane"].iloc[0]



def test_basic_multiple_overlap():
    """Test with multiple overlapping titles, ensure all mappings and merges are correct."""
    df_a = make_entities_df([
        {"id": 1, "title": "X", "type": "T", "human_readable_id": 0, "description": "descX", "text_unit_ids": [1], "degree": 1, "x": 0.0, "y": 0.0},
        {"id": 2, "title": "Y", "type": "T", "human_readable_id": 1, "description": "descY", "text_unit_ids": [2], "degree": 2, "x": 1.0, "y": 1.0},
    ])
    df_b = make_entities_df([
        {"id": 3, "title": "Y", "type": "T", "description": "descY2", "text_unit_ids": [3], "degree": 3, "x": 2.0, "y": 2.0},
        {"id": 4, "title": "Z", "type": "T", "description": "descZ", "text_unit_ids": [4], "degree": 4, "x": 3.0, "y": 3.0},
    ])
    resolved, id_mapping = _group_and_resolve_entities(df_a, df_b) # 3.95ms -> 2.35ms (68.1% faster)
    y_row = resolved[resolved["title"] == "Y"].iloc[0]
    z_row = resolved[resolved["title"] == "Z"].iloc[0]

# ---------------- EDGE TEST CASES ----------------


def test_edge_all_overlap():
    """Test where all entities overlap by title."""
    df_a = make_entities_df([
        {"id": 1, "title": "Alpha", "type": "T", "human_readable_id": 0, "description": "descA", "text_unit_ids": [1], "degree": 1, "x": 0.0, "y": 0.0},
        {"id": 2, "title": "Beta", "type": "T", "human_readable_id": 1, "description": "descB", "text_unit_ids": [2], "degree": 2, "x": 1.0, "y": 1.0},
    ])
    df_b = make_entities_df([
        {"id": 3, "title": "Alpha", "type": "T", "description": "descA2", "text_unit_ids": [3], "degree": 3, "x": 2.0, "y": 2.0},
        {"id": 4, "title": "Beta", "type": "T", "description": "descB2", "text_unit_ids": [4], "degree": 4, "x": 3.0, "y": 3.0},
    ])
    resolved, id_mapping = _group_and_resolve_entities(df_a, df_b) # 3.79ms -> 2.23ms (69.4% faster)
    alpha_row = resolved[resolved["title"] == "Alpha"].iloc[0]
    beta_row = resolved[resolved["title"] == "Beta"].iloc[0]

def test_edge_duplicate_titles_in_one_df():
    """Test where one DataFrame has duplicate titles (should not happen, but test for robustness)."""
    df_a = make_entities_df([
        {"id": 1, "title": "Dup", "type": "T", "human_readable_id": 0, "description": "desc1", "text_unit_ids": [1], "degree": 1, "x": 0.0, "y": 0.0},
        {"id": 2, "title": "Dup", "type": "T", "human_readable_id": 1, "description": "desc2", "text_unit_ids": [2], "degree": 2, "x": 1.0, "y": 1.0},
    ])
    df_b = make_entities_df([
        {"id": 3, "title": "Dup", "type": "T", "description": "desc3", "text_unit_ids": [3], "degree": 3, "x": 2.0, "y": 2.0},
    ])
    resolved, id_mapping = _group_and_resolve_entities(df_a, df_b) # 3.82ms -> 2.32ms (65.1% faster)
    dup_row = resolved.iloc[0]


def test_edge_non_list_text_unit_ids():
    """Test where text_unit_ids is not a list (should still work if iterable)."""
    df_a = make_entities_df([
        {"id": 1, "title": "A", "type": "T", "human_readable_id": 0, "description": "desc", "text_unit_ids": [1], "degree": 1, "x": 0.0, "y": 0.0},
    ])
    # Use tuple instead of list
    df_b = make_entities_df([
        {"id": 2, "title": "A", "type": "T", "description": "desc2", "text_unit_ids": (2,), "degree": 2, "x": 1.0, "y": 1.0},
    ])
    resolved, id_mapping = _group_and_resolve_entities(df_a, df_b) # 3.72ms -> 2.21ms (68.2% faster)
    row = resolved.iloc[0]

def test_edge_nan_description_and_text_unit_ids():
    """Test with NaN in description and text_unit_ids."""
    df_a = make_entities_df([
        {"id": 1, "title": "A", "type": "T", "human_readable_id": 0, "description": np.nan, "text_unit_ids": [1], "degree": 1, "x": 0.0, "y": 0.0},
    ])
    df_b = make_entities_df([
        {"id": 2, "title": "A", "type": "T", "description": "desc2", "text_unit_ids": [2], "degree": 2, "x": 1.0, "y": 1.0},
    ])
    resolved, id_mapping = _group_and_resolve_entities(df_a, df_b) # 3.78ms -> 2.26ms (67.3% faster)
    row = resolved.iloc[0]

def test_edge_empty_text_unit_ids():
    """Test with empty text_unit_ids in one of the entities."""
    df_a = make_entities_df([
        {"id": 1, "title": "A", "type": "T", "human_readable_id": 0, "description": "desc", "text_unit_ids": [], "degree": 1, "x": 0.0, "y": 0.0},
    ])
    df_b = make_entities_df([
        {"id": 2, "title": "A", "type": "T", "description": "desc2", "text_unit_ids": [2], "degree": 2, "x": 1.0, "y": 1.0},
    ])
    resolved, id_mapping = _group_and_resolve_entities(df_a, df_b) # 3.72ms -> 2.19ms (69.5% faster)
    row = resolved.iloc[0]

# ---------------- LARGE SCALE TEST CASES ----------------

def test_large_scale_no_overlap():
    """Test with 500 entities in each DataFrame, no overlap."""
    n = 500
    df_a = make_entities_df([
        {"id": i, "title": f"T{i}", "type": "TypeA", "human_readable_id": i, "description": f"desc{i}", "text_unit_ids": [i], "degree": i, "x": float(i), "y": float(i+1)}
        for i in range(n)
    ])
    df_b = make_entities_df([
        {"id": i+n, "title": f"T{i+n}", "type": "TypeB", "description": f"desc{i+n}", "text_unit_ids": [i+n], "degree": i+n, "x": float(i+n), "y": float(i+n+1)}
        for i in range(n)
    ])
    resolved, id_mapping = _group_and_resolve_entities(df_a, df_b) # 43.4ms -> 7.04ms (517% faster)
    # Check human_readable_id for some random index
    idx = np.random.randint(0, n)
    title = f"T{idx+n}"
    row = resolved[resolved["title"] == title].iloc[0]

def test_large_scale_all_overlap():
    """Test with 1000 entities, all overlapping titles."""
    n = 1000
    df_a = make_entities_df([
        {"id": i, "title": f"T{i}", "type": "TypeA", "human_readable_id": i, "description": f"descA{i}", "text_unit_ids": [i], "degree": i, "x": float(i), "y": float(i+1)}
        for i in range(n)
    ])
    df_b = make_entities_df([
        {"id": i+n, "title": f"T{i}", "type": "TypeB", "description": f"descB{i}", "text_unit_ids": [i+n], "degree": i+n, "x": float(i+n), "y": float(i+n+1)}
        for i in range(n)
    ])
    resolved, id_mapping = _group_and_resolve_entities(df_a, df_b) # 44.2ms -> 7.91ms (459% faster)
    # All ids from df_b should map to df_a's ids
    expected_mapping = {i+n: i for i in range(n)}
    # Check merged text_unit_ids and descriptions for several random indices
    for idx in [0, n//2, n-1]:
        row = resolved[resolved["title"] == f"T{idx}"].iloc[0]

def test_large_scale_some_overlap():
    """Test with 500 overlap, 250 unique to each."""
    n = 500
    df_a = make_entities_df([
        {"id": i, "title": f"T{i}", "type": "TypeA", "human_readable_id": i, "description": f"descA{i}", "text_unit_ids": [i], "degree": i, "x": float(i), "y": float(i+1)}
        for i in range(750)
    ])
    df_b = make_entities_df([
        {"id": i+1000, "title": f"T{i}", "type": "TypeB", "description": f"descB{i}", "text_unit_ids": [i+1000], "degree": i+1000, "x": float(i+1000), "y": float(i+1001)}
        for i in range(250, 1000)
    ])
    resolved, id_mapping = _group_and_resolve_entities(df_a, df_b) # 44.0ms -> 7.45ms (491% faster)
    # id_mapping: 250-749 in df_b map to 250-749 in df_a
    expected_mapping = {i+1000: i for i in range(250, 750)}
    for k, v in expected_mapping.items():
        pass
    # Check that a unique title in df_b is present and correct
    row = resolved[resolved["title"] == "T999"].iloc[0]
    # Check that a unique title in df_a is present and correct
    row = resolved[resolved["title"] == "T0"].iloc[0]
    # Check merged text_unit_ids for an overlapping title
    row = resolved[resolved["title"] == "T500"].iloc[0]

def test_large_scale_performance():
    """Ensure function completes quickly on large input."""
    n = 900
    df_a = make_entities_df([
        {"id": i, "title": f"T{i}", "type": "A", "human_readable_id": i, "description": f"desc{i}", "text_unit_ids": [i], "degree": i, "x": float(i), "y": float(i+1)}
        for i in range(n)
    ])
    df_b = make_entities_df([
        {"id": i+n, "title": f"T{i+n}", "type": "B", "description": f"desc{i+n}", "text_unit_ids": [i+n], "degree": i+n, "x": float(i+n), "y": float(i+n+1)}
        for i in range(n)
    ])
    import time
    start = time.time()
    resolved, id_mapping = _group_and_resolve_entities(df_a, df_b) # 75.2ms -> 10.8ms (598% faster)
    elapsed = time.time() - start
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-_group_and_resolve_entities-mglmouew` and push.

Codeflash

codeflash-ai[bot] requested a review from mashraf-222 on October 11, 2025 02:00
codeflash-ai[bot] added the ⚡️ codeflash label (Optimization PR opened by Codeflash AI) on Oct 11, 2025