feat: zero-config Java projects + smart ReplayHelper for end-to-end optimization #1880

Draft
misrasaurabh1 wants to merge 14 commits into main from java-config-redesign
Conversation

@misrasaurabh1 misrasaurabh1 commented Mar 20, 2026

Summary

Eliminates codeflash.toml for Java projects and fixes the complete trace → optimize pipeline to work end-to-end on real Java projects (validated on aerospike-client-java).

Zero-config Java support

  • Auto-detect Java projects from pom.xml / build.gradle — no config file needed
  • Read custom settings from pom.xml <properties> or gradle.properties (codeflash.* keys)
  • Multi-module Maven scanning: parses each module's <sourceDirectory> / <testSourceDirectory>, picks module with most Java files as source root
  • Deleted all codeflash.toml files
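As a rough illustration of the property lookup described above, the sketch below reads codeflash.* entries from a pom.xml <properties> block. The function name read_codeflash_properties and the sample property values are hypothetical, not taken from this PR; only the codeflash.* key prefix comes from the description.

```python
import xml.etree.ElementTree as ET

# Namespace that most real-world pom.xml files declare on <project>.
POM_NS = "{http://maven.apache.org/POM/4.0.0}"


def read_codeflash_properties(pom_text: str) -> dict[str, str]:
    """Return {key: value} for every <codeflash.*> entry under <properties>.

    Hypothetical helper for illustration; the PR's real parser may differ.
    """
    root = ET.fromstring(pom_text)
    # Some poms omit the namespace, so detect it from the root tag.
    ns = POM_NS if root.tag.startswith(POM_NS) else ""
    props = root.find(f"{ns}properties")
    if props is None:
        return {}
    found: dict[str, str] = {}
    for child in props:
        name = child.tag.removeprefix(ns)
        if name.startswith("codeflash."):
            found[name.removeprefix("codeflash.")] = (child.text or "").strip()
    return found


# Sample input with made-up paths.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <properties>
    <maven.compiler.source>17</maven.compiler.source>
    <codeflash.moduleRoot>client/src</codeflash.moduleRoot>
    <codeflash.testsRoot>test/src</codeflash.testsRoot>
  </properties>
</project>"""

settings = read_codeflash_properties(SAMPLE_POM)
```

Reading with ElementTree is safe here because nothing is written back; the formatting concerns raised later in the review apply only to the write path.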

Smart ReplayHelper (behavior + performance parity)

  • ReplayHelper.replay() now reads CODEFLASH_MODE env var and produces the same output as existing test instrumentation
  • Behavior mode: captures return value via Kryo, writes to SQLite test_results table for correctness comparison
  • Performance mode: runs inner loop for JIT warmup, prints timing markers matching the optimizer's expected format
  • No mode: just invokes (trace-only or manual testing)

Bug fixes

  • JFR parser: normalize / to . in class names (JVM internal format vs. Java package format)
  • Graceful timeout: send SIGTERM before SIGKILL so JFR can dump recording and shutdown hooks run
  • TracingTransformer: remove isRecording() check that prevented instrumenting classes loaded during serialization (was causing 3 captures instead of 10,000+)
  • Replay test generator: JUnit 4 support (org.junit.Test vs org.junit.jupiter.api.Test), detect from project build config
  • Overloaded methods: global counter per method name to avoid duplicate replay test method names
  • Instrumentation: fix _add_behavior_instrumentation for compact @Test lines (annotation + signature on same line)
  • project_root: use build root directory (not sub-module) for multi-module Maven projects
  • optimize subparser: add_help=False so -h in Java commands isn't intercepted as --help
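The graceful-timeout fix in the list above follows a common pattern: ask the process to exit, wait a bounded time, then force-kill. A minimal sketch, assuming the traced JVM is managed through subprocess.Popen; the helper name stop_gracefully and the five-second grace period are illustrative choices, not values from the PR.

```python
import subprocess
import sys


def stop_gracefully(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Ask the process to exit; escalate to a hard kill only if it does not.

    Hypothetical helper: on POSIX, terminate() sends SIGTERM, which lets a JVM
    run shutdown hooks (and lets JFR dump its recording) before exiting.
    """
    proc.terminate()
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; nothing in the child runs after this
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for a long-running traced process.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit code:", stop_gracefully(child))
```

The grace period is a trade-off: too short and the recording may be truncated, too long and a hung process delays the whole run.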

Validated end-to-end on aerospike-client-java

  • 10,500+ invocations traced across 282 methods
  • 41 functions ranked by JFR CPU profiling data
  • 55 replay test files generated (JUnit 4 compatible)
  • Replay tests compile, run, and pass (129 tests for Crypto.computeDigest)
  • Behavior baseline established with timing data (4.81ms over 119 loops)
  • Candidates correctly verified and rejected when behavior doesn't match

Test plan

  • 33 config detection tests (build tool, source/test root, Maven/Gradle properties, multi-module)
  • 13 JFR parser tests (normalization, filtering, ranking, timeout, project_root)
  • 10 replay test generation tests (JUnit 4/5, overloads, instrumentation)
  • 8 tracer e2e tests (agent capture, replay generation, orchestration)
  • 6 integration tests (full pipeline: discover → rank → compile)
  • 2 replay test discovery tests
  • Full optimizer pipeline on aerospike benchmark: trace → discover → rank → optimize → verify

🤖 Generated with Claude Code

misrasaurabh1 and others added 7 commits March 19, 2026 19:11
…iles

Java projects no longer need a standalone config file. Codeflash reads
config from pom.xml <properties> or gradle.properties, and auto-detects
source/test roots from build tool conventions.

Changes:
- Add parse_java_project_config() to read codeflash.* properties from
  pom.xml and gradle.properties
- Add multi-module Maven scanning: parses each module's pom.xml for
  <sourceDirectory> and <testSourceDirectory>, picks module with most
  Java files as source root, identifies test modules by name
- Route Java projects through build-file detection in config_parser.py
  before falling back to pyproject.toml
- Detect Java language from pom.xml/build.gradle presence (no config needed)
- Fix project_root for multi-module projects (was resolving to sub-module)
- Fix JFR parser / separators (JVM uses com/example, normalized to com.example)
- Fix graceful timeout (SIGTERM before SIGKILL for JFR dump + shutdown hooks)
- Remove isRecording() check from TracingTransformer (was preventing class
  instrumentation for classes loaded during serialization)
- Delete all codeflash.toml files from fixtures and code_to_optimize
- Add 33 config detection tests
- Update docs for zero-config Java setup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replay tests call helper.replay() via reflection, not the target function
directly. The behavior instrumentation can't wrap indirect calls and
produces malformed output (code emitted outside class body) for large
replay test files. For replay tests, just rename the class without
adding instrumentation — JUnit pass/fail results verify correctness.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Detect test framework from project build config and generate replay
tests with appropriate imports (org.junit.Test for JUnit 4,
org.junit.jupiter.api.Test for JUnit 5). Fixes compilation failures
on projects using JUnit 4 (like aerospike-client-java).

Also passes test_framework through run_java_tracer to
generate_replay_tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ay tests

Use a global counter per method name across all descriptors to generate
unique test method names. Previously, overloaded methods (same name,
different descriptor) would generate duplicate replay_methodName_N
methods, causing compilation errors.
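
A small sketch of the naming scheme this commit describes, assuming captures arrive as (method name, JVM descriptor) pairs. The function name assign_replay_test_names is hypothetical; the replay_methodName_N pattern is the one quoted above.

```python
from collections import defaultdict


def assign_replay_test_names(captures: list[tuple[str, str]]) -> list[str]:
    """Map (method_name, jvm_descriptor) captures to unique test method names.

    The counter is keyed by method name only, so overloads that differ just
    in their descriptor still draw from the same sequence of suffixes.
    """
    next_index: defaultdict[str, int] = defaultdict(int)
    names = []
    for method_name, _descriptor in captures:
        names.append(f"replay_{method_name}_{next_index[method_name]}")
        next_index[method_name] += 1
    return names


# Two overloads of get() plus one put(); descriptors are made up.
example = assign_replay_test_names(
    [("get", "(I)V"), ("get", "(J)V"), ("put", "(I)V"), ("get", "(I)V")]
)
```

Keying the counter by (name, descriptor) instead would restart at 0 for each overload, which is the duplicate-name failure the commit fixes.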

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…on skip

10 new tests covering:
- JUnit 5 replay test generation (imports, class visibility)
- JUnit 4 replay test generation (imports, public methods, @AfterClass)
- Overloaded method handling (no duplicate test method names)
- Instrumentation skip for replay tests (behavior + perf mode)
- Regular tests still get instrumented normally

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…solution

13 new tests covering:
- JFR class name normalization (/ to . conversion)
- Package-based sample filtering
- Addressable time calculation from JFR samples
- Method ranking order and format
- Graceful timeout (SIGTERM before SIGKILL)
- Multi-module project root detection (Path not str)
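
The first two items above (class name normalization and package-based filtering) can be sketched as follows. Both function names are hypothetical, and the real parser may filter differently; the point is that filtering against dotted package names only works after the slash form has been normalized.

```python
def normalize_class_name(jvm_name: str) -> str:
    """Convert JVM internal form (com/example/Foo) to Java form (com.example.Foo)."""
    return jvm_name.replace("/", ".")


def in_packages(class_name: str, packages: list[str]) -> bool:
    """True if the class belongs to one of the given packages or a subpackage.

    Matching on "package." rather than a bare prefix avoids treating
    com.examplefoo.Bar as part of com.example.
    """
    name = normalize_class_name(class_name)
    return any(name.startswith(pkg + ".") for pkg in packages)


# Class name in the slash form described above for JFR output.
keep = in_packages("com/aerospike/client/util/Crypto", ["com.aerospike"])
```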

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The behavior instrumentation was producing malformed output for compact
@Test lines (annotation + method signature on same line, common in
replay tests). The method signature collection loop would skip past the
opening brace and consume subsequent methods' content.

Fix: detect when the @Test annotation line already contains { and treat
it as both annotation and method signature, avoiding the separate
signature search that was over-consuming lines.
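
The fix described here can be sketched as a line-based lookup. This is a simplified illustration, not the body of _add_behavior_instrumentation: the function name find_signature_span is hypothetical, and the brace check is naive (it would misfire on an annotation whose own arguments contain a brace).

```python
def find_signature_span(lines: list[str], annotation_idx: int) -> tuple[int, int]:
    """Return (first, last) line indexes of the method signature for a @Test line.

    Simplified sketch: a "{" on the annotation line means the signature
    shares that line, so no further lines are consumed.
    """
    if "{" in lines[annotation_idx]:
        # Compact form: "@Test public void replay_x_0() throws Exception {"
        return annotation_idx, annotation_idx
    # Multi-line form: the signature starts on the next line and ends at
    # the first line that contains the opening brace.
    for idx in range(annotation_idx + 1, len(lines)):
        if "{" in lines[idx]:
            return annotation_idx + 1, idx
    raise ValueError("no opening brace found after @Test annotation")


compact = [
    "@Test public void replay_a_0() throws Exception {",
    "    helper.replay(0);",
    "}",
    "@Test public void replay_a_1() throws Exception {",
]
span = find_signature_span(compact, 0)
```

Without the first branch, the search would start at the line after the annotation and stop at the next method's brace, which is the over-consumption described above.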

Reverted the instrumentation skip for replay tests — they now get
properly instrumented for both behavior capture and performance timing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

claude bot commented Mar 20, 2026

Claude finished @misrasaurabh1's task in 6s


PR Review Summary

  • Triage PR size — LARGE (1940 additions, 346 deletions, 10 production Python files + Java runtime)
  • Run lint/typecheck — prek passes; mypy errors in config_parser.py and config_writer.py are pre-existing
  • Resolve stale threads — no unresolved threads found
  • Review code
  • Duplicate detection
  • Test coverage

Prek Checks

All clean. ✅

Code Review

Bug: has_existing_config() always returns True for any Java project

detector.py:904-906 returns (True, "pom.xml") for any project that has a pom.xml or build.gradle — even before codeflash has been configured. Previously the equivalent Java check looked for [tool.codeflash] in codeflash.toml. The new check completely removed the presence validation:

for build_file in ("pom.xml", "build.gradle", "build.gradle.kts"):
    if (project_root / build_file).exists():
        return True, build_file  # ← always True for any Java project!

Consequence: codeflash init on a fresh Java project would say "config already exists" and skip initialization.
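
One possible shape for a stricter check, sketched under assumptions: it treats a project as configured only when a build file actually contains a codeflash.* key. The function name, the (False, None) return for the unconfigured case, and the regex are all hypothetical, and a commented-out property would still match.

```python
import re
from pathlib import Path
from typing import Optional

# Matches <codeflash.someKey> in pom.xml or "codeflash.someKey=" in
# gradle.properties. Deliberately loose; illustration only.
_CODEFLASH_KEY = re.compile(
    r"<codeflash\.[\w.-]+>|^\s*codeflash\.[\w.-]+\s*[=:]", re.MULTILINE
)


def has_existing_java_config(project_root: Path) -> tuple[bool, Optional[str]]:
    """Report a config only if a codeflash.* key is present, not just a build file."""
    for name in ("pom.xml", "gradle.properties"):
        path = project_root / name
        if path.exists() and _CODEFLASH_KEY.search(path.read_text(encoding="utf-8")):
            return True, name
    return False, None
```

Whether a zero-config project should count as "configured" at all is a design question for the PR author; this sketch only shows how the presence validation could be restored.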

Bug: _write_maven_properties() destroys pom.xml formatting

config_writer.py:132-136 uses ET.parse() + tree.write() to modify pom.xml. This is destructive: it strips all XML comments, namespace declarations, and reformats indentation. A user's well-maintained pom.xml with comments explaining each dependency would be silently mangled. For Maven specifically, losing the xmlns namespace prefix declarations can also break mvn parsing.

The source-code.md rules confirm: use libcst for code modification to preserve formatting. For XML, the equivalent is a text/regex-based approach rather than parse-and-serialize.
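
A minimal sketch of the text-based approach suggested above, assuming the pom.xml already has a <properties> block whose closing tag sits on its own line. The function name upsert_pom_property is hypothetical. It edits only the first matching block, so a pom with profile-level <properties> would need more care.

```python
import re
from xml.sax.saxutils import escape


def upsert_pom_property(pom_text: str, key: str, value: str) -> str:
    """Set <key>value</key> inside <properties> by editing text, not re-serializing.

    Everything outside the touched line (comments, namespaces, indentation)
    is returned byte-for-byte unchanged.
    """
    element = f"<{key}>{escape(value)}</{key}>"
    existing = re.compile(rf"<{re.escape(key)}>.*?</{re.escape(key)}>", re.DOTALL)
    if existing.search(pom_text):
        # A callable replacement avoids backslash interpretation in the value.
        return existing.sub(lambda _match: element, pom_text, count=1)
    closing = re.search(r"^([ \t]*)</properties>", pom_text, re.MULTILINE)
    if closing is None:
        raise ValueError("pom.xml has no <properties> block on its own line")
    indent = closing.group(1) + "  "
    insert_at = closing.start()
    return pom_text[:insert_at] + f"{indent}{element}\n" + pom_text[insert_at:]
```

The same idea applies to removal: delete the matching line by text rather than round-tripping the whole document through a parser.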

Bug: Write/remove priority mismatch for Java config

_write_java_build_config() (line 119) writes to pom.xml first when it exists. _remove_java_build_config() (line 1224) tries gradle.properties first. On a Maven project with both files, config written to pom.xml won't be cleaned up by remove.

Design: add_help=False breaks codeflash optimize --help

cli.py:382 disables help for the entire optimize subparser to prevent -h from being intercepted when it appears in a Java command like java -h. But this also silently disables codeflash optimize --help for users. A cleaner fix would be to require users to separate their Java command from codeflash flags with --, or only suppress help when the language is Java.
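
The -- separator option mentioned above could look roughly like this. It is a sketch, not the codeflash CLI: the --timeout flag is a made-up stand-in for whatever options the optimize subparser really defines, and both function names are hypothetical.

```python
import argparse


def split_at_double_dash(argv: list[str]) -> tuple[list[str], list[str]]:
    """Split argv into (tool flags, wrapped command) at the first "--"."""
    if "--" in argv:
        index = argv.index("--")
        return argv[:index], argv[index + 1:]
    return argv, []


def parse_optimize_args(argv: list[str]) -> argparse.Namespace:
    """Parse only the flags before "--"; pass the rest through untouched."""
    own_flags, wrapped_command = split_at_double_dash(argv)
    # add_help keeps its default (True), so "optimize --help" still works.
    parser = argparse.ArgumentParser(prog="codeflash optimize")
    parser.add_argument("--timeout", type=int, default=None)  # hypothetical flag
    args = parser.parse_args(own_flags)
    args.command = wrapped_command
    return args


# The -h after "--" belongs to the java command and never reaches argparse.
parsed = parse_optimize_args(["--timeout", "30", "--", "java", "-h"])
```

The cost is a small change in how users invoke the tool; the benefit is that help output and argument validation stay intact for every language.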

Design: _try_parse_java_build_config() takes priority over JS/Python config

config_parser.py:453-457 runs Java config detection before find_package_json(). In a full-stack monorepo where the parent directory has a pom.xml and a subdirectory has the JS project with package.json, running codeflash from the JS subdirectory would incorrectly load Java config. The CWD-walk in _try_parse_java_build_config() will find the parent pom.xml and return Java config even though the user is working in a JS project.
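
One way to avoid that ordering problem is to let the nearest marker file win regardless of language. The sketch below is an assumption about how such a lookup could work, not the logic in config_parser.py; the marker table and the function name detect_nearest_project are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Marker file -> language. When one directory holds several markers, the
# first entry here wins; that tie-break is an arbitrary choice.
MARKERS = {
    "pyproject.toml": "python",
    "package.json": "javascript",
    "pom.xml": "java",
    "build.gradle": "java",
    "build.gradle.kts": "java",
}


def detect_nearest_project(start: Path) -> Optional[tuple[Path, str]]:
    """Walk upward from start and return the first directory with any marker."""
    for directory in (start, *start.parents):
        for marker, language in MARKERS.items():
            if (directory / marker).exists():
                return directory, language
    return None
```

With this ordering, running from a JS subdirectory that has its own package.json resolves to JavaScript even when a parent directory contains a pom.xml.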

Accidental binary file committed

codeflash/languages/java/resources/codeflash-runtime-1.0.0.jar is a binary file that changed in this PR. If this is intentional (new runtime built from the ReplayHelper.java changes), it should be noted explicitly. Binary files in resources/ that are auto-generated from source-controlled Java code could get out of sync.

Duplicate Detection

MEDIUM confidence: The key_map dict (kebab-case → camelCase) is defined identically in both _write_maven_properties() (line 142) and _write_gradle_properties() (line 176) in config_writer.py. This should be a module-level constant shared by both functions.

_JAVA_CONFIG_KEY_MAP = {
    "module-root": "moduleRoot",
    "tests-root": "testsRoot",
    ...
}

No other duplicates found across language modules.

Test Coverage

  • New _write_maven_properties(), _write_gradle_properties(), _write_java_build_config(), _remove_java_build_config() functions in config_writer.py have no tests. These are risky file-mutation operations that should be covered.
  • Updated has_existing_config() in detector.py has existing tests in test_detector.py, but no test for the new Java behavior (i.e., that a fresh project with pom.xml is handled correctly). Given the false-positive bug above, a test should be added.
  • 783/791 tests pass locally; 8 failures are in integration tests requiring Java runtime (expected in this environment).

Last updated: 2026-03-20T07:05:00Z

ReplayHelper now reads CODEFLASH_MODE env var and produces the same
output as the existing test instrumentation:

- Behavior mode: captures return value via Kryo serialization, writes
  to SQLite (test_results table) for correctness comparison, prints
  start/end timing markers
- Performance mode: runs inner loop for JIT warmup, prints timing
  markers for each iteration matching the expected format
- No mode: just invokes the method (trace-only or manual testing)

This achieves feature parity with the existing test instrumentation
for replay tests, which call functions via reflection and can't be
wrapped by text-level instrumentation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@misrasaurabh1 misrasaurabh1 changed the title from "Java config redesign + bugfixs for Tracer" to "feat: zero-config Java projects + smart ReplayHelper for end-to-end optimization" Mar 20, 2026
…ay tests + speedups

- Trigger on any codeflash/** or tests/** changes (not just java subset)
- Validate replay test files are discovered per-function
- Already validates: replay test generation, global discovery count,
  optimization success, and minimum speedup percentage

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the workflow-modified label (This PR modifies GitHub Actions workflows) Mar 20, 2026
misrasaurabh1 and others added 5 commits March 19, 2026 22:40
The refactored Java project_root handling moved args.tests_root
resolution after the project_root_from_module_root call, which passed
a string instead of a Path. Restore the original order: resolve
tests_root to Path first, then set test_project_root, then override
both for Java multi-module projects.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use Path comparisons instead of forward-slash substring matching
- Avoid parse_args() in test (reads stdin on Windows) — use Namespace directly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use print(flush=True) instead of logging.info for subprocess output so
CI logs show progress in real-time instead of buffering until completion.
Also set PYTHONUNBUFFERED=1 for the subprocess.
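
A sketch of the streaming pattern this commit describes, assuming the child is launched with subprocess.Popen; the function name run_streaming is hypothetical.

```python
import os
import subprocess
import sys


def run_streaming(cmd: list[str]) -> int:
    """Run cmd and echo its output line by line as it is produced."""
    # PYTHONUNBUFFERED only affects Python children; other programs
    # control their own output buffering.
    env = {**os.environ, "PYTHONUNBUFFERED": "1"}
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout
        text=True,
        env=env,
    )
    assert proc.stdout is not None
    for line in proc.stdout:
        # flush=True pushes each line to the CI log immediately.
        print(line, end="", flush=True)
    return proc.wait()


if __name__ == "__main__":
    run_streaming([sys.executable, "-c", "print('tracing...')"])
```

Both halves matter: the environment variable stops the child from buffering, and flush=True stops the parent from buffering what it relays.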

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_write_gradle_properties

Co-authored-by: Saurabh Misra <undefined@users.noreply.github.com>
…ions harder

- Set jdk.ExecutionSample#period=1ms (default was 10ms) so JFR captures
  samples from shorter-running programs
- Workload.main now runs 1000 rounds with larger inputs so JFR can
  capture method-level CPU samples (repeatString with O(n²) concat
  dominates ~75% of samples)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>