
Add Iterative Predictor for Improved SWE-bench Issue Resolution #1397

@Jerryguan777

Description

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

How would you describe the priority of this feature request?

Medium

Please provide a clear description of the problem this feature solves

The current full predictor in the swe_bench evaluation example uses a one-shot generation approach, which lacks the robustness needed for complex SWE-bench tasks. In my evaluation of 8 instances across major SWE-bench projects (including SymPy, Astropy, Django, and Matplotlib), the current success rate is 0%.

The core limitations identified are:

  • Generates fixes without running tests to validate them
  • Lacks feedback loops to refine solutions based on execution errors
  • Cannot recover from failures or adjust strategies
  • Relies on static code analysis without dynamic execution feedback
Reproduction with the current full predictor:

```shell
nat eval --config_file examples/evaluation_and_profiling/swe_bench/configs/config_full.yml
```

```
=== EVALUATION SUMMARY ===
Workflow Status: COMPLETED (workflow_output.json)
Total Runtime: 132.62s

Per evaluator results:
| Evaluator   |   Avg Score | Output File           |
|-------------|-------------|-----------------------|
| swe_bench   |           0 | swe_bench_output.json |
```

Describe your ideal solution

I propose implementing an Iterative Predictor that introduces a dynamic feedback loop into the SWE-bench resolution process. This feature will transition the agent from a one-shot model to a reason-action-observation model.

Key Components of the Solution:

  • Step-by-step execution: Executes commands incrementally and observes results
  • Test-driven validation: Runs tests after each fix attempt and uses failure signals to guide refinement
  • Error recovery: Handles failures gracefully with retry mechanisms and strategy adjustments
  • Dynamic feedback: Uses runtime errors, test outputs, and execution results instead of static analysis
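The feedback loop described by these components could be sketched roughly as below. This is a minimal illustration, not the proposed implementation: `generate_fix`, `apply_fix`, and `run_tests` are hypothetical placeholders standing in for the LLM call, the patch application step, and the test harness.

```python
# Sketch of a reason-action-observation loop for iterative patch refinement.
# All helper callables here are hypothetical placeholders, injected as arguments.

MAX_ATTEMPTS = 5


def iterative_resolve(issue, run_tests, generate_fix, apply_fix):
    """Refine a fix until tests pass or the attempt budget is exhausted."""
    feedback = None  # observation carried over from the previous attempt
    for _attempt in range(MAX_ATTEMPTS):
        patch = generate_fix(issue, feedback)  # reason: propose a fix
        apply_fix(patch)                       # action: apply it to the repo
        result = run_tests()                   # observe: run the test suite
        if result.passed:
            return patch                       # test-driven validation succeeded
        feedback = result.output               # feed errors into the next attempt
    return None  # could not resolve within the attempt budget
```

Bounding the loop with an attempt budget gives the error-recovery behavior a predictable cost ceiling, which matters when evaluating many instances in one run.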

Additional context

I plan to implement this iterative predictor by extending the SweBenchPredictorBase class and reusing the existing environment interaction logic to ensure consistency with the current framework. Once the implementation is verified, I will submit a PR for review.
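As a rough structural sketch of how this could slot into the predictor hierarchy: the base-class name comes from the issue text, but the `predict` method name, constructor signature, and the `_propose`/`_observe` helpers are assumptions for illustration, not the toolkit's actual API.

```python
# Hypothetical structure only; a stand-in base class is defined so the
# sketch is self-contained. The real SweBenchPredictorBase API may differ.

class SweBenchPredictorBase:
    """Stand-in for the toolkit's actual predictor base class."""

    def predict(self, instance):
        raise NotImplementedError


class IterativeSweBenchPredictor(SweBenchPredictorBase):
    """One-shot generation replaced by a bounded refine-and-retest loop."""

    def __init__(self, max_attempts=5):
        self.max_attempts = max_attempts

    def predict(self, instance):
        feedback = None
        for _ in range(self.max_attempts):
            patch = self._propose(instance, feedback)
            feedback = self._observe(patch)  # apply patch, run tests
            if feedback is None:             # no errors: tests passed
                return patch
        return None                          # budget exhausted

    def _propose(self, instance, feedback):
        # placeholder: call the LLM with the instance and prior feedback
        raise NotImplementedError

    def _observe(self, patch):
        # placeholder: apply the patch, run tests, return error text or None
        raise NotImplementedError
```

Keeping environment interaction inside `_observe` is what would let the subclass reuse the existing interaction logic rather than duplicating it.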

Code of Conduct

  • I agree to follow this project's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request
