
MLflow Agent Skill Update and Add Ability to Generate Test Cases for GEPA based on User Instruction#293

Open
auschoi96 wants to merge 12 commits into databricks-solutions:main from auschoi96:main
Conversation

@auschoi96
Collaborator

@auschoi96 auschoi96 commented Mar 11, 2026

Summary

Forrest had a great implementation in MLflow that evaluated the skill's performance using Claude Code or another coding agent. See the FR here: mlflow/mlflow#21255

I took this design, along with the best-practice articles from Anthropic referenced in the FR, and used it as the core evaluation metric. This does increase eval and optimization time, but the evaluation is realistic and of high quality.

I also fixed an issue where the tools weren't being called during GEPA optimization.

After some initial testing and discussion internally, we decided that more test cases are needed, especially as resources update. For example, if the Zerobus SDK updates and has breaking changes, we want to make sure that's captured in the skill. Or, if a user only wants serverless compute to be used, they can run this to make sure the skill prioritizes it.

This should generate new test_cases in the ground_truth.yaml and help make GEPA more accurate.
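For illustration, a generated entry might look like the sketch below. This is a hypothetical example: the field names (`prompt`, `focus`, `expected_behavior`) and the overall shape are assumptions, not the repo's actual `ground_truth.yaml` schema.

```yaml
# Hypothetical sketch of a generated test case. Field names are assumptions;
# check the repo's ground_truth.yaml for the real schema.
test_cases:
  - prompt: "Ingest events into a Delta table using the Zerobus SDK"
    focus: "ensure the latest databricks zerobus sdk is being used"
    expected_behavior: >
      The solution uses the current Zerobus SDK version and avoids
      deprecated APIs.
```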

What's in the PR

This PR addresses the following:

  1. Addition of a --focus flag that users can pass multiple times to generate test cases
  2. Changes to the judges for more targeted evaluation
  3. Changes to the weights GEPA uses to decide what is important
  4. Adjustment and removal of static judges in favor of Forrest's implementation
  5. Updated --agent-eval-full to run the full eval and optimization using Claude Code rather than a single LLM call

Test Plan

You can run the following commands to test the new flags and optimizations. You will need to set the correct env variables according to the .test/README.md:
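As a sketch, the environment setup typically looks like the following. The variable names here are assumptions based on the standard Databricks SDK authentication variables, and the values are placeholders; check .test/README.md for the exact names this repo requires.

```shell
# Assumed env vars per standard Databricks SDK auth; the repo's
# .test/README.md is authoritative. Values below are placeholders.
export DATABRICKS_HOST="https://example.cloud.databricks.com"
export DATABRICKS_TOKEN="dapi-REDACTED"
```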

uv run python .test/scripts/optimize.py databricks-zerobus-ingest --reflection-lm databricks/gepa-fallbacks --judge-model databricks/gepa-fallbacks --preset quick --agent-eval-full --mlflow-experiment "/Users/austin.choi@databricks.com/GenAI/mlflow updates/AC updates dc-assistant-agent_experiment"

Here is an example of using --focus to generate more test cases, which will aid the optimization:

uv run python .test/scripts/optimize.py databricks-zerobus-ingest --reflection-lm databricks/gepa-fallbacks --judge-model databricks/gepa-fallbacks --preset quick --agent-eval --mlflow-experiment "/Users/austin.choi@databricks.com/GenAI/mlflow updates/AC updates dc-assistant-agent_experiment" --focus "ensure the latest databricks-sdk is being used like 0.97.0" --focus "ensure the latest databricks zerobus sdk is being used, like 1.1.0" --focus "make sure compatibility across runtimes"


  • Ensured new test cases are generated
  • Ensured the new judges are being used
  • Ensured the CLI command above works

@auschoi96 auschoi96 marked this pull request as ready for review March 13, 2026 14:58
@auschoi96 auschoi96 requested a review from calreynolds March 13, 2026 15:09
…lflow/mlflow#21255 updated the evaluation so that it's actually evaluating Claude Code's ability to use the skill, i.e. we are evaluating the performance of the skill itself.

This design includes a skill folder with criteria added as general-quality, sql-correctness, and tool-selection. For more information on the design, check out the PR.

Additionally, added a fix where the MCP and tools were not added to the Claude Code configuration of the GEPA optimizer, so the tools weren't actually being called.
@auschoi96 auschoi96 changed the title Add Ability to Generate Test Cases for GEPA based on User Instruction MLflow Agent Skill Update and Add Ability to Generate Test Cases for GEPA based on User Instruction Mar 14, 2026

readme and technical updates
