
MLflow Agent Skill Update and Add Ability to Generate Test Cases for GEPA based on User Instruction#293

Open
auschoi96 wants to merge 12 commits into databricks-solutions:main from auschoi96:main
Conversation

@auschoi96
Collaborator

@auschoi96 auschoi96 commented Mar 11, 2026

Summary

Forrest had a great implementation in MLflow that evaluated the skill's performance using Claude Code or another coding agent. See the FR here: mlflow/mlflow#21255

I took this design, along with the best-practice articles from Anthropic referenced in the FR, and used it as the core evaluation metric. This does increase eval and optimization time, but the evaluation is realistic and of high quality.

I also fixed an issue where the tools weren't being called during GEPA optimization.

After some initial testing and discussion internally, we decided that more test cases are needed, especially as resources update. For example, if the Zerobus SDK updates and has breaking changes, we want to make sure that's captured in the skill. Or, if a user only wants serverless compute to be used, they can run this to make sure the skill prioritizes it.

This should generate new test_cases in the ground_truth.yaml and help make GEPA more accurate.
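For illustration, a generated entry might look like the sketch below. This is a hypothetical example: the field names (`prompt`, `focus`, `expected_behavior`) and the overall shape are assumptions, not the repo's actual `ground_truth.yaml` schema.

```yaml
# Hypothetical sketch of a generated test case. Field names are assumptions;
# check the repo's ground_truth.yaml for the real schema.
test_cases:
  - prompt: "Ingest events into a Delta table using the Zerobus SDK"
    focus: "ensure the latest databricks zerobus sdk is being used"
    expected_behavior: >
      The solution uses the current Zerobus SDK version and avoids
      deprecated APIs.
```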

What's in the PR

This PR addresses the following:

  1. Addition of a --focus flag that users can pass multiple times to generate test cases
  2. Changes to the judges for more targeted evaluation
  3. Changes to the weights GEPA uses to decide what is important
  4. Adjustment and removal of static judges in favor of Forrest's implementation
  5. Updated --agent-eval-full to run the full eval and optimization using Claude Code rather than a single LLM call

Test Plan

You can run the following commands to test the new flags and optimizations. You will need to set the correct env variables according to the .test/README.md:
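As a sketch, the environment setup typically looks like the following. The variable names here are assumptions based on the standard Databricks SDK authentication variables, and the values are placeholders; check .test/README.md for the exact names this repo requires.

```shell
# Assumed env vars per standard Databricks SDK auth; the repo's
# .test/README.md is authoritative. Values below are placeholders.
export DATABRICKS_HOST="https://example.cloud.databricks.com"
export DATABRICKS_TOKEN="dapi-REDACTED"
```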

uv run python .test/scripts/optimize.py databricks-zerobus-ingest --reflection-lm databricks/gepa-fallbacks --judge-model databricks/gepa-fallbacks --preset quick --agent-eval-full --mlflow-experiment "/Users/austin.choi@databricks.com/GenAI/mlflow updates/AC updates dc-assistant-agent_experiment"

Here is an example of using --focus to generate more test cases, which will aid the optimization:

uv run python .test/scripts/optimize.py databricks-zerobus-ingest --reflection-lm databricks/gepa-fallbacks --judge-model databricks/gepa-fallbacks --preset quick --agent-eval --mlflow-experiment "/Users/austin.choi@databricks.com/GenAI/mlflow updates/AC updates dc-assistant-agent_experiment" --focus "ensure the latest databricks-sdk is being used like 0.97.0" --focus "ensure the latest databricks zerobus sdk is being used, like 1.1.0" --focus "make sure compatibility across runtimes"


  • Ensured new test cases are generated
  • Ensured the new judges are being used
  • Ensured the CLI command above works

@auschoi96 auschoi96 marked this pull request as ready for review March 13, 2026 14:58
@auschoi96 auschoi96 requested a review from calreynolds March 13, 2026 15:09
…lflow/mlflow#21255 updated the evaluation so that it's actually evaluating Claude Code's ability to use the skill, i.e. we are evaluating the performance of the skill itself.

This design includes a skill folder with criteria added as general-quality, sql-correctness, and tool-selection. For more information on the design, check out the PR.

Additionally, added a fix where the MCP and tools were not added to the Claude Code configuration of the GEPA optimizer, so the tools weren't actually being called.
@auschoi96 auschoi96 changed the title Add Ability to Generate Test Cases for GEPA based on User Instruction MLflow Agent Skill Update and Add Ability to Generate Test Cases for GEPA based on User Instruction Mar 14, 2026

readme and technical updates
