⚡️ Speed up function _rank_report_context by 17% #68

Open
codeflash-ai[bot] wants to merge 1 commit into main from codeflash/optimize-_rank_report_context-mglpgdxa

Conversation

@codeflash-ai codeflash-ai bot commented Oct 11, 2025

📄 17% (0.17x) speedup for _rank_report_context in graphrag/query/context_builder/community_context.py

⏱️ Runtime : 14.9 milliseconds → 12.8 milliseconds (best of 202 runs)

📝 Explanation and details

The optimized code introduces three key performance optimizations:

1. Conditional Type Conversion with Copy Avoidance

  • Added if report_df[column].dtype != float: checks before astype() calls
  • Used astype(float, copy=False) instead of astype(float)
  • This avoids unnecessary type conversions when columns are already float and prevents memory copies when conversion is needed
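The first optimization can be sketched in isolation. This is a minimal illustration of the pattern described above, not the project's exact source; the helper name `coerce_to_float` is hypothetical.

```python
import pandas as pd

def coerce_to_float(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Convert a column to float only when it is not already float."""
    # Guard: skip astype() entirely for float columns, avoiding a
    # redundant conversion pass over the data.
    if df[column].dtype != float:
        # copy=False lets pandas reuse the underlying buffer when no
        # actual copy is required for the conversion.
        df[column] = df[column].astype(float, copy=False)
    return df

# String column: conversion happens once.
df = coerce_to_float(pd.DataFrame({"rank": ["1", "2"]}), "rank")
# Float column: the guard skips astype() altogether.
df = coerce_to_float(df, "rank")
```

On a second call the dtype check short-circuits, which is where the profiler savings reported below come from.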

2. Skip Sorting for Single-Row DataFrames

  • Added if len(report_df) > 1: check before sort_values()
  • Sorting a single row is a no-op that still incurs pandas overhead
  • This optimization is particularly effective for small datasets
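The second optimization is the same kind of guard, shown here as a standalone sketch (the helper name `sort_if_needed` is illustrative, not from the PR):

```python
import pandas as pd

def sort_if_needed(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Sort descending by the given columns, skipping trivial frames."""
    # sort_values() on a 0- or 1-row frame cannot change row order,
    # yet it still pays the full pandas sorting overhead; guard it away.
    if len(df) > 1:
        df = df.sort_values(by=columns, ascending=False)
    return df

out = sort_if_needed(pd.DataFrame({"x": [1, 3, 2]}), ["x"])
single = sort_if_needed(pd.DataFrame({"x": [5]}), ["x"])  # returned as-is
```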

3. Performance Impact Analysis
From the line profiler results:

  • Type conversion time reduced significantly (27.7% → 10.6% for weight column, 14.6% → 4.6% for rank column)
  • The dtype != float checks themselves take minimal time (10.6% and 4.6%, respectively)
  • Sorting remains the dominant cost (57.4%) but only runs when necessary

The optimizations are most effective for:

  • Empty/single-row DataFrames: 95-123% faster (from annotated tests)
  • Large DataFrames with existing float columns: 35-37% faster
  • DataFrames with NaN values: 44% faster (likely due to reduced memory operations)

These micro-optimizations compound to achieve a 16% overall speedup by eliminating redundant operations without changing the function's behavior or API.

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 34 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pandas as pd
# imports
import pytest  # used for our unit tests
from graphrag.query.context_builder.community_context import \
    _rank_report_context

# unit tests

# 1. Basic Test Cases

def test_basic_sort_by_weight_and_rank():
    # Test that sorting by both columns works as expected
    df = pd.DataFrame({
        "occurrence weight": [1, 3, 2, 3],
        "rank": [10, 5, 20, 1],
        "other": ["a", "b", "c", "d"]
    })
    codeflash_output = _rank_report_context(df.copy()); result = codeflash_output # 628μs -> 638μs (1.46% slower)
    # Should be sorted by weight descending, then rank descending
    expected = pd.DataFrame({
        "occurrence weight": [3, 3, 2, 1],
        "rank": [5, 1, 20, 10],
        "other": ["b", "d", "c", "a"]
    }, index=[1, 3, 2, 0])

def test_basic_sort_by_weight_only():
    # Test that sorting by weight only works
    df = pd.DataFrame({
        "occurrence weight": [1, 5, 3],
        "other": ["a", "b", "c"]
    })
    codeflash_output = _rank_report_context(df.copy(), weight_column="occurrence weight", rank_column=None); result = codeflash_output # 270μs -> 279μs (3.12% slower)
    expected = pd.DataFrame({
        "occurrence weight": [5, 3, 1],
        "other": ["b", "c", "a"]
    }, index=[1, 2, 0])

def test_basic_sort_by_rank_only():
    # Test that sorting by rank only works
    df = pd.DataFrame({
        "rank": [10, 5, 20],
        "other": ["a", "b", "c"]
    })
    codeflash_output = _rank_report_context(df.copy(), weight_column=None, rank_column="rank"); result = codeflash_output # 262μs -> 267μs (1.97% slower)
    expected = pd.DataFrame({
        "rank": [20, 10, 5],
        "other": ["c", "a", "b"]
    }, index=[2, 0, 1])

def test_basic_no_sort_columns():
    # Test that if both columns are None, the DataFrame is unchanged
    df = pd.DataFrame({
        "foo": [1, 2, 3],
        "bar": [4, 5, 6]
    })
    codeflash_output = _rank_report_context(df.copy(), weight_column=None, rank_column=None); result = codeflash_output # 899ns -> 948ns (5.17% slower)

# 2. Edge Test Cases

def test_empty_dataframe():
    # Should handle empty DataFrame without error
    df = pd.DataFrame(columns=["occurrence weight", "rank"])
    codeflash_output = _rank_report_context(df.copy()); result = codeflash_output # 522μs -> 255μs (105% faster)

def test_single_row_dataframe():
    # Sorting a single row should return the same row
    df = pd.DataFrame({"occurrence weight": [7], "rank": [1], "foo": ["bar"]})
    codeflash_output = _rank_report_context(df.copy()); result = codeflash_output # 555μs -> 249μs (123% faster)

def test_missing_weight_column():
    # Should raise KeyError if weight_column is specified but not present
    df = pd.DataFrame({"rank": [1, 2, 3]})
    with pytest.raises(KeyError):
        _rank_report_context(df.copy(), weight_column="occurrence weight", rank_column="rank") # 35.2μs -> 34.9μs (0.831% faster)

def test_missing_rank_column():
    # Should raise KeyError if rank_column is specified but not present
    df = pd.DataFrame({"occurrence weight": [1, 2, 3]})
    with pytest.raises(KeyError):
        _rank_report_context(df.copy(), weight_column="occurrence weight", rank_column="rank") # 150μs -> 144μs (4.50% faster)

def test_non_numeric_weight_column():
    # Should raise ValueError if weight_column cannot be converted to float
    df = pd.DataFrame({"occurrence weight": ["a", "b", "c"], "rank": [1, 2, 3]})
    with pytest.raises(ValueError):
        _rank_report_context(df.copy()) # 90.2μs -> 92.7μs (2.72% slower)

def test_non_numeric_rank_column():
    # Should raise ValueError if rank_column cannot be converted to float
    df = pd.DataFrame({"occurrence weight": [1, 2, 3], "rank": ["x", "y", "z"]})
    with pytest.raises(ValueError):
        _rank_report_context(df.copy()) # 173μs -> 181μs (4.27% slower)

def test_nan_values_in_columns():
    # NaN values should be sorted last (since float('nan') is not greater than any number)
    df = pd.DataFrame({
        "occurrence weight": [3, float('nan'), 2],
        "rank": [1, 2, float('nan')],
        "foo": ["a", "b", "c"]
    })
    codeflash_output = _rank_report_context(df.copy()); result = codeflash_output # 612μs -> 424μs (44.4% faster)

def test_duplicate_rows():
    # Should preserve all rows, even if they are duplicates
    df = pd.DataFrame({
        "occurrence weight": [2, 2, 1],
        "rank": [5, 5, 10],
        "foo": ["a", "a", "b"]
    })
    codeflash_output = _rank_report_context(df.copy()); result = codeflash_output # 576μs -> 586μs (1.73% slower)

def test_column_names_with_spaces():
    # Should work with column names with spaces
    df = pd.DataFrame({
        "occurrence weight": [1, 2],
        "rank": [2, 1]
    })
    codeflash_output = _rank_report_context(df.copy(), weight_column="occurrence weight", rank_column="rank"); result = codeflash_output # 587μs -> 603μs (2.64% slower)

def test_column_names_are_none():
    # Should not fail if both weight_column and rank_column are None
    df = pd.DataFrame({"foo": [1, 2], "bar": [3, 4]})
    codeflash_output = _rank_report_context(df.copy(), weight_column=None, rank_column=None); result = codeflash_output # 952ns -> 936ns (1.71% faster)

def test_weight_and_rank_column_same():
    # Should work if weight_column and rank_column are the same column
    df = pd.DataFrame({
        "score": [1, 3, 2],
        "other": ["a", "b", "c"]
    })
    codeflash_output = _rank_report_context(df.copy(), weight_column="score", rank_column="score"); result = codeflash_output # 546μs -> 497μs (9.87% faster)
    expected = pd.DataFrame({
        "score": [3, 2, 1],
        "other": ["b", "c", "a"]
    }, index=[1, 2, 0])

# 3. Large Scale Test Cases

def test_large_dataframe_sorting():
    # Test with a large DataFrame (1000 rows)
    import random
    random.seed(0)
    size = 1000
    weights = [random.uniform(0, 1000) for _ in range(size)]
    ranks = [random.uniform(0, 1000) for _ in range(size)]
    df = pd.DataFrame({
        "occurrence weight": weights,
        "rank": ranks,
        "foo": list(range(size))
    })
    codeflash_output = _rank_report_context(df.copy()); result = codeflash_output # 756μs -> 559μs (35.3% faster)
    # The first row should have the highest weight, and if there are ties, the highest rank
    sorted_df = df.sort_values(by=["occurrence weight", "rank"], ascending=False)

def test_large_dataframe_with_nans():
    # Test with a large DataFrame and some NaN values
    import numpy as np
    size = 500
    weights = [float(i) for i in range(size)]
    ranks = [float(size - i) for i in range(size)]
    # Insert NaNs at fixed positions
    weights[100] = float('nan')
    ranks[200] = float('nan')
    df = pd.DataFrame({
        "occurrence weight": weights,
        "rank": ranks
    })
    codeflash_output = _rank_report_context(df.copy()); result = codeflash_output # 670μs -> 467μs (43.5% faster)
    # The row with NaN in rank (but valid weight) should be before the NaN in weight
    nan_rank_idx = result[result["rank"].isna() & result["occurrence weight"].notna()].index
    nan_weight_idx = result[result["occurrence weight"].isna()].index

def test_large_dataframe_no_sort_columns():
    # Large DataFrame with no sort columns should be unchanged
    size = 1000
    df = pd.DataFrame({
        "foo": list(range(size)),
        "bar": list(reversed(range(size)))
    })
    codeflash_output = _rank_report_context(df.copy(), weight_column=None, rank_column=None); result = codeflash_output # 981ns -> 967ns (1.45% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pandas as pd
# imports
import pytest  # used for our unit tests
from graphrag.query.context_builder.community_context import \
    _rank_report_context

# unit tests

# -------- BASIC TEST CASES --------

def test_basic_sort_by_weight_and_rank():
    # Simple DataFrame with both columns
    df = pd.DataFrame({
        "occurrence weight": [2, 1, 3],
        "rank": [1, 2, 1],
        "other": ["a", "b", "c"]
    })
    codeflash_output = _rank_report_context(df.copy()); result = codeflash_output # 632μs -> 647μs (2.35% slower)
    # Should sort by occurrence weight DESC, then rank DESC
    expected = pd.DataFrame({
        "occurrence weight": [3, 2, 1],
        "rank": [1, 1, 2],
        "other": ["c", "a", "b"]
    }, index=[2, 0, 1])

def test_basic_sort_by_weight_only():
    # Only weight column present
    df = pd.DataFrame({
        "occurrence weight": [2, 3, 1],
        "other": ["x", "y", "z"]
    })
    codeflash_output = _rank_report_context(df.copy(), rank_column=None); result = codeflash_output # 269μs -> 281μs (3.95% slower)
    expected = pd.DataFrame({
        "occurrence weight": [3, 2, 1],
        "other": ["y", "x", "z"]
    }, index=[1, 0, 2])

def test_basic_sort_by_rank_only():
    # Only rank column present
    df = pd.DataFrame({
        "rank": [10, 20, 15],
        "other": ["foo", "bar", "baz"]
    })
    codeflash_output = _rank_report_context(df.copy(), weight_column=None); result = codeflash_output # 262μs -> 268μs (2.19% slower)
    expected = pd.DataFrame({
        "rank": [20, 15, 10],
        "other": ["bar", "baz", "foo"]
    }, index=[1, 2, 0])

def test_basic_no_sort_columns():
    # Neither column present: should return unchanged
    df = pd.DataFrame({
        "other": [1, 2, 3]
    })
    codeflash_output = _rank_report_context(df.copy(), weight_column=None, rank_column=None); result = codeflash_output # 876ns -> 958ns (8.56% slower)

def test_basic_non_numeric_columns():
    # Columns with string numbers: should be converted to float
    df = pd.DataFrame({
        "occurrence weight": ["2", "1", "3"],
        "rank": ["1", "2", "1"],
        "other": ["a", "b", "c"]
    })
    codeflash_output = _rank_report_context(df.copy()); result = codeflash_output # 660μs -> 668μs (1.15% slower)
    expected = pd.DataFrame({
        "occurrence weight": [3.0, 2.0, 1.0],
        "rank": [1.0, 1.0, 2.0],
        "other": ["c", "a", "b"]
    }, index=[2, 0, 1])

# -------- EDGE TEST CASES --------

def test_edge_empty_dataframe():
    # Empty DataFrame: should return unchanged
    df = pd.DataFrame(columns=["occurrence weight", "rank", "other"])
    codeflash_output = _rank_report_context(df.copy()); result = codeflash_output # 553μs -> 283μs (95.5% faster)

def test_edge_missing_sort_column():
    # DataFrame missing one of the sort columns: should raise KeyError
    df = pd.DataFrame({
        "occurrence weight": [1, 2, 3],
        "other": ["x", "y", "z"]
    })
    with pytest.raises(KeyError):
        _rank_report_context(df.copy(), weight_column="occurrence weight", rank_column="rank") # 146μs -> 143μs (2.06% faster)

def test_edge_nan_values():
    # DataFrame with NaN values: should sort with NaNs at the end
    df = pd.DataFrame({
        "occurrence weight": [2, None, 3],
        "rank": [1, 2, None],
        "other": ["a", "b", "c"]
    })
    codeflash_output = _rank_report_context(df.copy()); result = codeflash_output # 608μs -> 421μs (44.5% faster)
    # The row with None/NaN should be last after sorting
    expected = pd.DataFrame({
        "occurrence weight": [3.0, 2.0, None],
        "rank": [None, 1.0, 2.0],
        "other": ["c", "a", "b"]
    }, index=[2, 0, 1])

def test_edge_all_equal_values():
    # All values equal: original order should be preserved
    df = pd.DataFrame({
        "occurrence weight": [1, 1, 1],
        "rank": [1, 1, 1],
        "other": ["x", "y", "z"]
    })
    codeflash_output = _rank_report_context(df.copy()); result = codeflash_output # 573μs -> 586μs (2.22% slower)

def test_edge_duplicate_values():
    # Duplicate values in sort columns
    df = pd.DataFrame({
        "occurrence weight": [2, 2, 1],
        "rank": [1, 2, 1],
        "other": ["a", "b", "c"]
    })
    codeflash_output = _rank_report_context(df.copy()); result = codeflash_output # 609μs -> 623μs (2.15% slower)
    expected = pd.DataFrame({
        "occurrence weight": [2, 2, 1],
        "rank": [2, 1, 1],
        "other": ["b", "a", "c"]
    }, index=[1, 0, 2])

def test_edge_custom_column_names():
    # Custom column names for sorting
    df = pd.DataFrame({
        "weight": [5, 3, 7],
        "score": [10, 20, 15],
        "other": ["foo", "bar", "baz"]
    })
    codeflash_output = _rank_report_context(df.copy(), weight_column="weight", rank_column="score"); result = codeflash_output # 618μs -> 631μs (2.04% slower)
    expected = pd.DataFrame({
        "weight": [7, 5, 3],
        "score": [15, 10, 20],
        "other": ["baz", "foo", "bar"]
    }, index=[2, 0, 1])

def test_edge_column_dtype_conversion():
    # Columns with mixed types: should convert to float
    df = pd.DataFrame({
        "occurrence weight": [1, "2", 3.0],
        "rank": ["1", 2, 1.0],
        "other": ["a", "b", "c"]
    })
    codeflash_output = _rank_report_context(df.copy()); result = codeflash_output # 655μs -> 659μs (0.549% slower)
    expected = pd.DataFrame({
        "occurrence weight": [3.0, 2.0, 1.0],
        "rank": [1.0, 2.0, 1.0],
        "other": ["c", "b", "a"]
    }, index=[2, 1, 0])

# -------- LARGE SCALE TEST CASES --------

def test_large_scale_sorting():
    # Large DataFrame with random values
    import random
    random.seed(42)
    size = 1000
    weights = [random.uniform(0, 1000) for _ in range(size)]
    ranks = [random.uniform(0, 1000) for _ in range(size)]
    df = pd.DataFrame({
        "occurrence weight": weights,
        "rank": ranks,
        "other": [str(i) for i in range(size)]
    })
    codeflash_output = _rank_report_context(df.copy()); result = codeflash_output # 763μs -> 558μs (36.8% faster)
    # Should be sorted by weight DESC, then rank DESC
    sorted_idx = sorted(range(size), key=lambda i: (weights[i], ranks[i]), reverse=True)
    expected = df.iloc[sorted_idx].reset_index(drop=True)

def test_large_scale_all_equal():
    # Large DataFrame with all equal values
    size = 1000
    df = pd.DataFrame({
        "occurrence weight": [1] * size,
        "rank": [1] * size,
        "other": [str(i) for i in range(size)]
    })
    codeflash_output = _rank_report_context(df.copy()); result = codeflash_output # 604μs -> 614μs (1.71% slower)

def test_large_scale_nan_values():
    # Large DataFrame with some NaN values
    import numpy as np
    size = 1000
    weights = [float(i) if i % 10 != 0 else None for i in range(size)]
    ranks = [float(i) if i % 15 != 0 else None for i in range(size)]
    df = pd.DataFrame({
        "occurrence weight": weights,
        "rank": ranks,
        "other": [str(i) for i in range(size)]
    })
    codeflash_output = _rank_report_context(df.copy()); result = codeflash_output # 745μs -> 542μs (37.4% faster)
    # NaN values should be sorted to the end
    # Confirm that the first row has the highest non-NaN weight and rank
    first_valid_idx = max(i for i in range(size) if weights[i] is not None and ranks[i] is not None)

def test_large_scale_custom_column_names():
    # Large DataFrame with custom column names
    import random
    random.seed(123)
    size = 1000
    weights = [random.uniform(0, 1000) for _ in range(size)]
    ranks = [random.uniform(0, 1000) for _ in range(size)]
    df = pd.DataFrame({
        "w": weights,
        "r": ranks,
        "other": [str(i) for i in range(size)]
    })
    codeflash_output = _rank_report_context(df.copy(), weight_column="w", rank_column="r"); result = codeflash_output # 751μs -> 551μs (36.1% faster)
    sorted_idx = sorted(range(size), key=lambda i: (weights[i], ranks[i]), reverse=True)
    expected = df.iloc[sorted_idx].reset_index(drop=True)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-_rank_report_context-mglpgdxa` and push.

Codeflash

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 11, 2025 03:18
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 11, 2025
