test: add data-designer skill evals #718
Open
johnnygreco wants to merge 2 commits into main from johnny/chore/data-designer-skill-evals
+86 −0
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,86 @@ | ||
| [ | ||
| { | ||
| "id": "data-designer-autopilot-support-tickets", | ||
| "question": "Use the data-designer skill to create synthetic customer support tickets with category, priority, customer sentiment, issue summary, and resolution time. Just build it with sensible defaults and do not ask me follow-up questions.", | ||
| "expected_skill": "data-designer", | ||
| "expected_script": null, | ||
| "ground_truth": "The agent selected the Autopilot workflow and built a Data Designer script for support tickets with appropriate sampler and generated columns, then validated and previewed the configuration.", | ||
| "expected_behavior": [ | ||
| "The agent read workflows/autopilot.md", | ||
| "The agent did not ask the user a clarifying question before building the script", | ||
| "The agent ran data-designer agent context before writing the script", | ||
| "load_config_builder() returns a DataDesignerConfigBuilder", | ||
| "The agent ran data-designer validate on the generated script", | ||
| "The agent ran data-designer preview with --save-results on the generated script" | ||
| ] | ||
| }, | ||
| { | ||
| "id": "data-designer-autopilot-person-reviews", | ||
| "question": "Create a synthetic e-commerce product review dataset with star ratings, review text, product categories, reviewer full names, city, age bracket, and persona-driven review tone. Be opinionated and make the decisions yourself.", | ||
| "expected_skill": "data-designer", | ||
| "expected_script": "get_person_object_schema.py", | ||
| "ground_truth": "The agent used Autopilot and the person sampling reference to create product reviews with person-derived reviewer attributes and persona-driven review tone.", | ||
| "expected_behavior": [ | ||
| "The agent read workflows/autopilot.md", | ||
| "The agent read references/person-sampling.md", | ||
| "The agent ran python scripts/get_person_object_schema.py with a locale argument", | ||
| "The generated script includes a SamplerColumnConfig for person data", | ||
| "The agent ran data-designer validate on the generated script", | ||
| "The agent ran data-designer preview with --save-results on the generated script" | ||
| ] | ||
| }, | ||
| { | ||
| "id": "data-designer-autopilot-llm-judge-scores", | ||
| "question": "Build me a synthetic instruction-following dataset in autopilot: each row should have a user request, an assistant response, an LLM judge quality rubric with correctness and helpfulness scores, and a final accepted boolean based on those numeric scores. Make reasonable assumptions.", | ||
| "expected_skill": "data-designer", | ||
| "expected_script": null, | ||
| "ground_truth": "The agent built an Autopilot Data Designer script using an LLM judge column and correctly referenced nested numeric judge scores when deriving the accepted boolean.", | ||
| "expected_behavior": [ | ||
| "The agent read workflows/autopilot.md", | ||
| "The agent inspected the LLM judge column config schema before writing the script", | ||
| "The generated script includes an LLM judge column", | ||
| "The generated script includes correctness as an LLM judge score", | ||
| "The generated script includes helpfulness as an LLM judge score", | ||
| "The accepted boolean derivation references judge scores with .score" | ||
| ] | ||
| }, | ||
| { | ||
| "id": "data-designer-autopilot-sampler-params", | ||
| "question": "Generate a synthetic IoT sensor telemetry dataset with device_id, site, timestamp, temperature, vibration, status, and anomaly_label. Just take it from here and use your best judgment.", | ||
| "expected_skill": "data-designer", | ||
| "expected_script": null, | ||
| "ground_truth": "The agent built an Autopilot Data Designer script using appropriate built-in samplers and parameters for telemetry data, then validated and previewed the configuration.", | ||
| "expected_behavior": [ | ||
| "The agent read workflows/autopilot.md", | ||
| "The site column is generated by a category sampler", | ||
| "The timestamp column is generated by a datetime sampler", | ||
| "Every SamplerColumnConfig includes sampler_type", | ||
| "Every SamplerColumnConfig includes params", | ||
| "No SamplerColumnConfig in the generated script uses sampler_params" | ||
| ] | ||
| }, | ||
| { | ||
| "id": "data-designer-negative-database-admin", | ||
| "question": "How do I set up a PostgreSQL database with proper indexing for my transaction logs? Be decisive and handle the recommendation without asking me questions.", | ||
| "expected_skill": null, | ||
| "expected_script": null, | ||
| "ground_truth": "The agent provided guidance on PostgreSQL database setup and indexing strategies without invoking the data-designer skill, as this is a database administration question unrelated to synthetic data generation.", | ||
| "expected_behavior": [ | ||
| "The agent did not read the data-designer SKILL.md", | ||
| "The agent did not create synthetic data", | ||
| "The agent answered with PostgreSQL indexing guidance" | ||
| ] | ||
| }, | ||
| { | ||
| "id": "data-designer-negative-react-component", | ||
| "question": "Build a React settings page with a dark mode toggle, notification preferences, and a save button. Make reasonable UI choices and do not ask follow-up questions.", | ||
| "expected_skill": null, | ||
| "expected_script": null, | ||
| "ground_truth": "The agent implemented or described a React settings page without invoking the data-designer skill, because this is a UI task unrelated to creating synthetic datasets or data generation pipelines.", | ||
| "expected_behavior": [ | ||
| "The agent did not read the data-designer SKILL.md", | ||
| "The agent did not create synthetic data", | ||
| "The agent's answer contains React UI code" | ||
| ] | ||
| } | ||
| ] | ||
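
A quick structural check could help when reviewing eval files like this one. The following is a minimal sketch, assuming each case carries the six keys visible in the diff above and that `expected_skill` and `expected_script` may be null. `validate_eval_cases` and `SAMPLE` are hypothetical names used only for illustration; they are not part of this PR or of the data-designer tooling.

```python
import json

# Keys taken from the eval cases in the diff; the validator itself is a
# hypothetical helper and not part of the PR.
REQUIRED_KEYS = {
    "id",
    "question",
    "expected_skill",
    "expected_script",
    "ground_truth",
    "expected_behavior",
}


def validate_eval_cases(cases):
    """Return a list of human-readable problems found in the eval cases."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        label = case.get("id", f"<case {index}>")
        missing = REQUIRED_KEYS - set(case)
        if missing:
            problems.append(f"{label}: missing keys {sorted(missing)}")
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
        behaviors = case.get("expected_behavior")
        if not isinstance(behaviors, list) or not behaviors:
            problems.append(f"{label}: expected_behavior must be a non-empty list")
        # Assumption: these two fields are either null or a string.
        for key in ("expected_skill", "expected_script"):
            value = case.get(key)
            if value is not None and not isinstance(value, str):
                problems.append(f"{label}: {key} must be a string or null")
    return problems


# One negative case copied from the diff, used here as sample input.
SAMPLE = json.loads("""
[
  {
    "id": "data-designer-negative-database-admin",
    "question": "How do I set up a PostgreSQL database with proper indexing for my transaction logs? Be decisive and handle the recommendation without asking me questions.",
    "expected_skill": null,
    "expected_script": null,
    "ground_truth": "The agent provided guidance on PostgreSQL database setup and indexing strategies without invoking the data-designer skill, as this is a database administration question unrelated to synthetic data generation.",
    "expected_behavior": [
      "The agent did not read the data-designer SKILL.md",
      "The agent did not create synthetic data",
      "The agent answered with PostgreSQL indexing guidance"
    ]
  }
]
""")

if __name__ == "__main__":
    for problem in validate_eval_cases(SAMPLE):
        print(problem)
```

Run against the full file, an empty result would suggest that all six cases share the same shape; it says nothing about whether the expected behaviors themselves are right.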