@codeflash-ai codeflash-ai bot commented Feb 2, 2026

⚡️ This pull request contains optimizations for PR #1199

If you approve this dependent PR, these changes will be merged into the original PR branch omni-java.

This PR will be automatically closed if the original PR is merged.


📄 12% (0.12x) speedup for _extract_class_body_context in codeflash/languages/java/context.py

⏱️ Runtime: 93.6 microseconds → 83.6 microseconds (best of 163 runs)

📝 Explanation and details

The optimized code reduces runtime by about 11% (93.6μs → 83.6μs), which corresponds to the 12% speedup reported above, through two key changes:

1. Caching child.type in a local variable

child_type = child.type  # Cache the attribute access
if child_type in ("{", "}", ";", ","):

In the loop over body_node.children, child.type was accessed 3-4 times per iteration. By storing it once in child_type, we eliminate repeated attribute lookups on the Node object, which are more expensive than local variable access in Python.
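The effect of caching the attribute can be sketched in isolation. The snippet below is a minimal illustration, not the code from `context.py`: the function names `count_uncached` and `count_cached`, the counting logic, and the `SimpleNamespace` stand-in nodes are all hypothetical, chosen only to show that both loop shapes produce the same result while the second reads `child.type` once per iteration.

```python
from types import SimpleNamespace

SKIPPED_TOKENS = ("{", "}", ";", ",")


def count_uncached(children):
    """Read child.type again for every comparison."""
    fields = constructors = 0
    for child in children:
        if child.type in SKIPPED_TOKENS:
            continue
        if child.type == "field_declaration":
            fields += 1
        elif child.type == "constructor_declaration":
            constructors += 1
    return fields, constructors


def count_cached(children):
    """Read child.type once per iteration, then compare the local variable."""
    fields = constructors = 0
    for child in children:
        child_type = child.type  # single attribute lookup
        if child_type in SKIPPED_TOKENS:
            continue
        if child_type == "field_declaration":
            fields += 1
        elif child_type == "constructor_declaration":
            constructors += 1
    return fields, constructors


# Hypothetical stand-in nodes; a real Node object exposes more than `type`.
children = [
    SimpleNamespace(type=t)
    for t in ("{", "field_declaration", ";", "constructor_declaration", "field_declaration", "}")
]
print(count_uncached(children) == count_cached(children))
```

The saving per iteration is small, so it is likely to matter only when the loop runs over many children.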

2. Replacing append("".join(...)) with extend(...)
Original:

field_lines = lines[javadoc_start : end_line + 1]
field_parts.append("".join(field_lines))  # Join then append

Optimized:

field_parts.extend(lines[javadoc_start : end_line + 1])  # Directly extend

This eliminates the intermediate joined strings created inside the loop. Instead of building a joined string for each field or constructor and appending it to the list, the loop extends the list with the raw line slices. The final "".join(field_parts) then performs a single join over all accumulated lines. Because the final separator is the empty string, the output is unchanged; in the 200-field test the change coincides with a 16.5% speedup.
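A small self-contained sketch of why the two accumulation styles are interchangeable here. The helper names `collect_with_append` and `collect_with_extend` and the `spans` argument are assumptions made for illustration; they do not appear in the PR.

```python
def collect_with_append(lines, spans):
    """Join each slice first, then append the joined string."""
    parts = []
    for start, end in spans:
        parts.append("".join(lines[start : end + 1]))
    return "".join(parts)


def collect_with_extend(lines, spans):
    """Extend with the raw lines and defer joining to the end."""
    parts = []
    for start, end in spans:
        parts.extend(lines[start : end + 1])
    return "".join(parts)


lines = [
    "/** f1 javadoc */\n",
    "private int f1;\n",
    "public C() {}\n",
    "private String f2;\n",
]
# Inclusive (start_line, end_line) pairs for two hypothetical fields.
field_spans = [(0, 1), (3, 3)]

print(collect_with_append(lines, field_spans) == collect_with_extend(lines, field_spans))
```

The equivalence relies on the final separator being the empty string. With a non-empty separator such as "\n", the extend variant would place the separator between every line rather than between every field, so the two versions would no longer produce the same text.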

Performance impact by test case:

  • Large-scale test (200 fields): 16.5% faster (71.6μs → 61.5μs) - the extend optimization scales particularly well with many fields
  • Multiple mixed fields/constructors: 4.86% faster - benefits from both optimizations
  • Basic single-field tests: within noise or slightly slower (from 0.4% faster to 5.8% slower) - with only one or two children there is little repeated work to save, so the gains do not show up at this scale

The optimization is most effective when processing Java files with many field declarations or constructors, which is common in real-world codebases. The deferred string joining pattern is a classic Python performance technique that reduces memory allocations and intermediate object creation.
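To check how the deferred join scales on a given machine, a best-of-N timing loop along the following lines can be used. This is only a rough sketch under stated assumptions: `best_of`, `join_per_item`, and `join_once` are hypothetical helpers, the workloads are simplified stand-ins for the real function, and the numbers it prints will differ from the figures reported in this PR.

```python
import timeit


def best_of(func, repeat=5, number=200):
    """Return the fastest per-call time in seconds across `repeat` batches."""
    batches = timeit.repeat(func, repeat=repeat, number=number)
    return min(batches) / number


def join_per_item(lines):
    """Build one joined string per item, then join again at the end."""
    parts = []
    for line in lines:
        parts.append("".join([line]))
    return "".join(parts)


def join_once(lines):
    """Accumulate raw lines and join a single time at the end."""
    parts = []
    for line in lines:
        parts.extend([line])
    return "".join(parts)


if __name__ == "__main__":
    for n in (1, 200):
        lines = [f"private int f{i};\n" for i in range(n)]
        before = best_of(lambda: join_per_item(lines))
        after = best_of(lambda: join_once(lines))
        print(f"n={n}: {before * 1e6:.2f}us -> {after * 1e6:.2f}us")
```

Taking the minimum across batches reduces the influence of scheduling noise, which is particularly relevant for the single-element cases where differences of a few percent are hard to distinguish from jitter.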

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 6 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 85.7% |

🌀 Generated Regression Tests
from types import SimpleNamespace  # lightweight container for Node-like objects

# imports
import pytest  # used for our unit tests
from codeflash.languages.java.context import _extract_class_body_context

# Helper utilities used by tests -------------------------------------------------

def _make_source_from_lines(lines):
    """
    Given a list of lines (each a string, typically ending with '\n'),
    produce:
      - source_bytes: concatenation encoded as utf8
      - offsets: list of byte offsets for the start of each line
    This lets tests set start_byte/end_byte precisely to match slices in the function.
    """
    offsets = []
    pos = 0
    for ln in lines:
        offsets.append(pos)
        # measure bytes length to support non-ascii safely
        pos += len(ln.encode("utf8"))
    source_bytes = "".join(lines).encode("utf8")
    return source_bytes, offsets

def _make_node(node_type, start_line, end_line, start_byte, end_byte, prev_named_sibling=None):
    """
    Create a small object that mimics the attributes accessed by the function:
      - type
      - start_point (tuple where [0] is the start line index)
      - end_point   (tuple where [0] is the end line index)
      - start_byte  (byte offset)
      - end_byte    (byte offset)
      - prev_named_sibling (object or None)
    We use SimpleNamespace rather than defining new classes to be minimal and explicit.
    """
    return SimpleNamespace(
        type=node_type,
        start_point=(start_line, 0),
        end_point=(end_line, 0),
        start_byte=start_byte,
        end_byte=end_byte,
        prev_named_sibling=prev_named_sibling,
    )

def test_basic_field_with_javadoc_included():
    # A class body with a block_comment that is a Javadoc immediately preceding a field_declaration.
    # Expectation: the returned fields string should include both the Javadoc and the field line.
    lines = [
        "/** This is javadoc for x */\n",  # line 0: block_comment (javadoc)
        "private int x;\n",               # line 1: field_declaration
        "}\n",                            # line 2: closing brace (irrelevant)
    ]
    source_bytes, offsets = _make_source_from_lines(lines)

    # Create the javadoc block_comment node covering line 0 bytes
    block_comment_node = _make_node(
        node_type="block_comment",
        start_line=0,
        end_line=0,
        start_byte=offsets[0],
        end_byte=offsets[0] + len(lines[0].encode("utf8")),
    )

    # Create the field node which has prev_named_sibling pointing to the javadoc node
    field_node = _make_node(
        node_type="field_declaration",
        start_line=1,
        end_line=1,
        start_byte=offsets[1],
        end_byte=offsets[1] + len(lines[1].encode("utf8")),
        prev_named_sibling=block_comment_node,
    )

    # body_node children list -- order doesn't need to reflect prev_named_sibling pointers,
    # because the function uses child.prev_named_sibling directly.
    body_node = SimpleNamespace(children=[block_comment_node, field_node])

    fields_code, constructors_code = _extract_class_body_context(
        body_node, source_bytes, lines, target_method_name="irrelevant"
    ) # 4.54μs -> 4.78μs (5.00% slower)

    # The javadoc line and the field line should be returned intact and concatenated.
    expected_fields = lines[0] + lines[1]
    assert fields_code == expected_fields  # assumed check; the listing omits the original assertion

def test_constructor_with_non_javadoc_block_comment_not_included():
    # If the preceding block_comment does NOT start with '/**', it should not be considered Javadoc
    # and therefore should NOT be included in the constructor output.
    lines = [
        "/* regular comment */\n",  # line 0: not a Javadoc (starts with '/*', not '/**')
        "public MyClass() {\n",     # line 1: constructor start
        "}\n",                      # line 2: constructor end
    ]
    source_bytes, offsets = _make_source_from_lines(lines)

    non_javadoc_comment = _make_node(
        node_type="block_comment",
        start_line=0,
        end_line=0,
        start_byte=offsets[0],
        end_byte=offsets[0] + len(lines[0].encode("utf8")),
    )

    constructor_node = _make_node(
        node_type="constructor_declaration",
        start_line=1,
        end_line=2,
        start_byte=offsets[1],
        end_byte=offsets[2] + len(lines[2].encode("utf8")),
        prev_named_sibling=non_javadoc_comment,  # present but not a Javadoc
    )

    body_node = SimpleNamespace(children=[non_javadoc_comment, constructor_node])

    fields_code, constructors_code = _extract_class_body_context(
        body_node, source_bytes, lines, target_method_name="ctor"
    ) # 4.28μs -> 4.26μs (0.446% faster)

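    # Because the comment was not Javadoc, only the constructor lines should be included,
    # not the comment. The constructor spans lines[1:2+1].
    expected_constructor = "".join(lines[1 : 2 + 1])
    assert constructors_code == expected_constructor  # assumed check; the listing omits the original assertion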

def test_enum_constant_and_semicolon_children_ignored_for_class():
    # For a "class" type the enum_constant branch should NOT produce output.
    # Also tokens like "{" "}" ";" "," should be skipped entirely.
    lines = [
        "SOME_CONSTANT,\n",  # line 0: would be an enum constant if type_kind == "enum"
        ";\n",               # line 1: semicolon token that should be ignored
        "private int y;\n",  # line 2: a real field we include
    ]
    source_bytes, offsets = _make_source_from_lines(lines)

    enum_node = _make_node(
        node_type="enum_constant",
        start_line=0,
        end_line=0,
        start_byte=offsets[0],
        end_byte=offsets[0] + len(lines[0].encode("utf8")),
    )

    semicolon_node = _make_node(
        node_type=";", start_line=1, end_line=1, start_byte=offsets[1], end_byte=offsets[1] + len(lines[1].encode("utf8"))
    )

    # A normal field_declaration (no preceding javadoc)
    field_node = _make_node(
        node_type="field_declaration",
        start_line=2,
        end_line=2,
        start_byte=offsets[2],
        end_byte=offsets[2] + len(lines[2].encode("utf8")),
    )

    body_node = SimpleNamespace(children=[enum_node, semicolon_node, field_node])

    fields_code, constructors_code = _extract_class_body_context(
        body_node, source_bytes, lines, target_method_name="tm"
    ) # 3.10μs -> 3.12μs (0.674% slower)

def test_multiple_mixed_fields_and_constructors_order_preserved():
    # Mixed children: field, constructor, field. Ensure order of concatenation preserves
    # the order the function iterates children, and that constructors and fields are separated.
    lines = [
        "/** f1 javadoc */\n",  # 0
        "private int f1;\n",    # 1
        "/** c1 javadoc */\n",  # 2
        "public C() {}\n",      # 3
        "private String f2;\n", # 4
    ]
    source_bytes, offsets = _make_source_from_lines(lines)

    # nodes with appropriate prev_named_sibling links for Javadocs
    block_javadoc_f1 = _make_node("block_comment", 0, 0, offsets[0], offsets[0] + len(lines[0].encode("utf8")))
    field_f1 = _make_node("field_declaration", 1, 1, offsets[1], offsets[1] + len(lines[1].encode("utf8")), prev_named_sibling=block_javadoc_f1)

    block_javadoc_c1 = _make_node("block_comment", 2, 2, offsets[2], offsets[2] + len(lines[2].encode("utf8")))
    ctor_c1 = _make_node("constructor_declaration", 3, 3, offsets[3], offsets[3] + len(lines[3].encode("utf8")), prev_named_sibling=block_javadoc_c1)

    field_f2 = _make_node("field_declaration", 4, 4, offsets[4], offsets[4] + len(lines[4].encode("utf8")))

    # Intentionally interleave nodes in the children list as they might appear in a real AST
    body_node = SimpleNamespace(children=[block_javadoc_f1, field_f1, block_javadoc_c1, ctor_c1, field_f2])

    fields_code, constructors_code = _extract_class_body_context(
        body_node, source_bytes, lines, target_method_name="whatever"
    ) # 6.49μs -> 6.19μs (4.86% faster)

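    expected_fields = lines[0] + lines[1] + lines[4]  # f1 javadoc + f1 + f2 (no javadoc for f2)
    expected_constructors = lines[2] + lines[3]      # c1 javadoc + constructor
    # Assumed checks; the listing omits the original assertions.
    assert fields_code == expected_fields
    assert constructors_code == expected_constructors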

def test_large_number_of_fields_scalability():
    # Large-scale test: create 200 field_declaration children, each on its own line.
    # This exercises accumulation across many children.
    N = 200
    # Build lines where each is a distinct field declaration line.
    lines = [f"private int f{i};\n" for i in range(N)]
    # Add a closing brace to mimic a class body end (not strictly needed)
    lines.append("}\n")
    source_bytes, offsets = _make_source_from_lines(lines)

    children = []
    for i in range(N):
        # Each field is on line i, no prev_named_sibling
        node = _make_node(
            node_type="field_declaration",
            start_line=i,
            end_line=i,
            start_byte=offsets[i],
            end_byte=offsets[i] + len(lines[i].encode("utf8")),
        )
        children.append(node)

    body_node = SimpleNamespace(children=children)

    fields_code, constructors_code = _extract_class_body_context(
        body_node, source_bytes, lines, target_method_name="none"
    ) # 71.6μs -> 61.5μs (16.5% faster)

    # Expect the concatenation of all N field lines in order, and no constructors.
    expected_fields = "".join(lines[0:N])
    assert fields_code == expected_fields  # assumed check; the listing omits the original assertion

def test_prev_named_sibling_when_absent_or_different_type():
    # If prev_named_sibling exists but is not a block_comment, it should be ignored.
    # Also if prev_named_sibling is None, behavior should be correct.
    lines = [
        "/* not a block_comment node in AST sense but present */\n",  # line 0 (we'll attach as prev_named_sibling but it has wrong type)
        "private int a;\n",  # line 1 field
        "private int b;\n",  # line 2 field with no prev_named_sibling
    ]
    source_bytes, offsets = _make_source_from_lines(lines)

    # prev_named_sibling of non-block-comment type (e.g., a different AST node type)
    other_node = _make_node("some_other_node", 0, 0, offsets[0], offsets[0] + len(lines[0].encode("utf8")))

    field_a = _make_node("field_declaration", 1, 1, offsets[1], offsets[1] + len(lines[1].encode("utf8")), prev_named_sibling=other_node)
    field_b = _make_node("field_declaration", 2, 2, offsets[2], offsets[2] + len(lines[2].encode("utf8")))

    body_node = SimpleNamespace(children=[other_node, field_a, field_b])

    fields_code, constructors_code = _extract_class_body_context(body_node, source_bytes, lines, target_method_name="x") # 3.57μs -> 3.79μs (5.81% slower)

    # Since other_node.type != "block_comment", the code should not treat it as a Javadoc.
    expected = lines[1] + lines[2]
    assert fields_code == expected  # assumed check; the listing omits the original assertion
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-pr1199-2026-02-02T00.48.34 and push.

@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Feb 2, 2026