
Implement automated eval test suite for Angular Skills #17007

Open

Copilot wants to merge 20 commits into master from copilot/implement-automated-eval-test-suite
Conversation


Copilot AI commented Mar 8, 2026

  • Scaffold evals/ directory structure with package.json and README.md
  • Create grid-basic-setup task (igniteui-angular-grids skill)
  • Create component-combo-reactive-form task (igniteui-angular-components skill)
  • Create theming-palette-generation task (igniteui-angular-theming skill)
  • Create GitHub Actions workflow (skill-eval.yml)
  • Commit baseline results JSON for regression comparison
  • Replace skill-eval package dependency with self-contained local runner (run-eval.sh)
  • Update CI workflow to use local runner
  • Fix result JSON format so CI summary comment shows pass rates instead of N/A
  • Tighten combo grader to require entry-point imports (reject root barrel)
  • Enforce core() before theme() ordering in theming grader
  • Update README to match actual grader behavior
  • Add copilot-cli agent mode to run-eval.sh (--agent copilot)
  • Add gemini-cli agent mode to run-eval.sh (--agent gemini)
  • Add eval-config.json for config-based agent/provider selection
  • Add multi-trial support (--trials N) for statistical robustness
  • CI workflow always runs against both Copilot and Gemini agents in parallel
  • Combined PR summary comment with separate sections for validation, Copilot, and Gemini results
  • Add dedicated prompt.md agent prompts for each task (concise, actionable prompts sent to CLI agents)
  • Switch CI workflow from direct bash commands to npm script wrappers
  • Clean up README: document prompt.md vs instruction.md distinction, remove obsolete examples, correct grader descriptions
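The core()-before-theme() ordering check mentioned in the change list could be implemented as a small deterministic grader. A sketch, where the file layout, the sample Sass content, and the mixin names (`@include core`, `@include theme`) are illustrative assumptions; a real grader would point at the agent's workspace file instead of generating a sample itself:

```shell
#!/usr/bin/env bash
# Hypothetical ordering grader: core() must be included before theme().
set -euo pipefail

# Sample styles file standing in for the agent's output.
styles=$(mktemp)
cat > "$styles" <<'SCSS'
@use "igniteui-angular/theming" as *;
@include core();
@include theme($light-material-palette);
SCSS

# First line number on which each include appears.
core_line=$(grep -n '@include core' "$styles" | head -1 | cut -d: -f1)
theme_line=$(grep -n '@include theme' "$styles" | head -1 | cut -d: -f1)

if [[ -n "$core_line" && -n "$theme_line" ]] && (( core_line < theme_line )); then
  echo "ordering: PASS"
else
  echo "ordering: FAIL"
fi
```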
Original prompt

This section details the original issue you should resolve

<issue_title>Implement Automated Eval Test Suite for the Angular Skills</issue_title>
<issue_description>We have three Skills (igniteui-angular-components, igniteui-angular-grids, igniteui-angular-theming) that teach coding agents how to correctly select, configure, and compose Ignite UI for Angular components. As these skills grow in complexity and more developers rely on them, silent regressions become a real risk: rewording a step, reordering routing logic, or removing a "verify" clause can quietly degrade agent behavior with no signal until a user reports a wrong output.

This work item establishes a structured eval process for these skills, directly inspired by Minko Gechev's Skill Eval framework and extended with patterns from Anthropic's agent eval research and the Skills Best Practices guide.

Goals

  • Produce a measurable, repeatable quality score for each skill.
  • Detect regressions automatically when a skill file is modified in a PR.
  • Provide a feedback loop during skill authoring (edit → eval → score delta).
  • Establish pass/fail thresholds that gate merges to main.

Approach

Tooling: Adopt the skill-eval TypeScript framework as the eval runner. It supports Docker-isolated agent execution, deterministic shell graders, LLM rubric graders, multi-trial runs, and JSON result persistence — all the properties needed here.

Task Structure

Create an evals/ directory at the repo root. Each eval task is a self-contained directory:

Example:

evals/
├── tasks/
│   ├── grid-basic-setup/
│   │   ├── task.toml               # timeouts, grader weights, trial count
│   │   ├── instruction.md          # what the agent is asked to do
│   │   ├── environment/Dockerfile  # clean Angular project baseline
│   │   ├── tests/test.sh           # deterministic grader (file checks, compile, lint)
│   │   ├── prompts/quality.md      # LLM rubric grader questions
│   │   ├── solution/solve.sh       # reference solution for baseline
│   │   └── skills/                 # symlinks or copies of the skills under test
│   │       └── igniteui-angular-grids/SKILL.md
│   ├── grid-sorting-remote-data/
│   ├── grid-hierarchical-setup/
│   ├── grid-pivot-config/
│   ├── component-combo-reactive-form/
│   ├── component-date-picker-validation/
│   ├── component-dialog-service/
│   ├── theming-palette-generation/
│   ├── theming-component-override/
│   └── skill-routing-intent-detection/  # tests the SKILL.md router logic itself
├── package.json
└── README.md
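The thread doesn't show a task.toml, but per the tree above it holds timeouts, grader weights, and trial count. A plausible sketch, loosely modeled on the skill-eval reference repo; all field names are assumptions, not a published schema:

```toml
# Hypothetical task.toml for grid-basic-setup; the schema is an assumption.
[task]
id = "grid-basic-setup"
skill = "igniteui-angular-grids"
timeout_seconds = 600
trials = 5

# Grader weights: deterministic shell checks plus an LLM rubric pass.
[graders.deterministic]
script = "tests/test.sh"
weight = 0.7

[graders.rubric]
prompt = "prompts/quality.md"
weight = 0.3
```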

Tasks to Implement (per Skill)

igniteui-angular-grids skill (highest priority — most complex routing)

| Task ID | Instruction given to agent | Deterministic check | LLM rubric check |
| --- | --- | --- | --- |
| grid-basic-setup | "Add a data grid showing employee data with sorting and pagination" | Project compiles; `<igx-grid>` present in template; correct module imported | Did agent choose IgxGrid (not Tree/Hierarchical) for flat data? Did it configure `[data]` binding correctly? |
| grid-tree-vs-flat | "Display department data with nested child rows" | `<igx-tree-grid>` present; childDataKey configured | Did skill routing correctly select Tree Grid over flat Grid? |
| grid-hierarchical-setup | "Build a master-detail grid where clicking a row expands child orders" | `<igx-hierarchical-grid>` + `<igx-row-island>` present | Did agent configure load-on-demand vs inline data correctly based on instructions? |
| grid-remote-filtering | "Add server-side filtering and sorting to the grid" | `[filterMode]="'externalFilterMode'"` set; remote service stub present | Did agent wire onDataPreLoad/sortingExpressionsChange instead of local filtering? |
| grid-pivot-config | "Create a pivot grid with row/column/value dimensions" | `<igx-pivot-grid>` + IgxPivotConfiguration present | Did agent define rows, columns, values correctly vs a flat grid with groupBy? |
| grid-state-persistence | "Persist grid sorting and filtering state to localStorage" | IgxGridStateDirective present; serialize/restore calls present | Did agent use the state directive vs manually serializing expressions? |
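To make the deterministic column concrete, a tests/test.sh for grid-basic-setup might look like the sketch below. The file paths, the sample markup, and the IGX_GRID_DIRECTIVES import are illustrative assumptions, and the generated sample files stand in for the agent's actual workspace:

```shell
#!/usr/bin/env bash
# Hypothetical deterministic grader sketch for grid-basic-setup.
set -euo pipefail

# Sample agent output; a real run grades the task workspace instead.
workdir=$(mktemp -d)
cat > "$workdir/app.component.html" <<'HTML'
<igx-grid [data]="employees">
  <igx-column field="name" [sortable]="true"></igx-column>
  <igx-paginator></igx-paginator>
</igx-grid>
HTML
cat > "$workdir/app.component.ts" <<'TS'
import { IGX_GRID_DIRECTIVES } from 'igniteui-angular';
TS

# Count how many of the deterministic checks pass.
score=0
if grep -q '<igx-grid' "$workdir/app.component.html"; then score=$((score + 1)); fi
if grep -q '\[data\]=' "$workdir/app.component.html"; then score=$((score + 1)); fi
if grep -q "igniteui-angular" "$workdir/app.component.ts"; then score=$((score + 1)); fi

echo "checks passed: $score/3"
```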

igniteui-angular-components skill

| Task ID | Instruction | Deterministic check | LLM rubric check |
| --- | --- | --- | --- |
| component-combo-reactive-form | "Add a multi-select combo bound to a reactive form control" | `<igx-combo>` present; `[formControlName]` wired; module imported | Did agent use IgxCombo (not IgxSelect or native `<select>`) for multi-select? |
| component-date-picker-validation | "Add a date picker with min/max date validation" | `<igx-date-picker>` present; minValue/maxValue i... | |
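The change list above mentions tightening the combo grader to require entry-point imports and reject the root barrel. A minimal sketch of that check; the deep import path 'igniteui-angular/combo' is an assumption used for illustration, and the sample file stands in for the agent's output:

```shell
#!/usr/bin/env bash
# Hypothetical import check: accept deep entry-point imports, reject the
# root barrel. Paths and import specifiers are illustrative assumptions.
set -euo pipefail

src=$(mktemp)
cat > "$src" <<'TS'
import { IgxComboComponent } from 'igniteui-angular/combo';
TS

# The trailing quote in the pattern means only the exact root barrel
# specifier matches, not deep paths like 'igniteui-angular/combo'.
if grep -q "from 'igniteui-angular'" "$src"; then
  verdict="FAIL"
else
  verdict="PASS"
fi
echo "combo imports: $verdict"
```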


Add automated eval test suite for Angular Skills using the skill-eval framework:
- grid-basic-setup task (igniteui-angular-grids skill)
- component-combo-reactive-form task (igniteui-angular-components skill)
- theming-palette-generation task (igniteui-angular-theming skill)
- GitHub Actions workflow for CI integration
- Baseline results JSON for regression comparison
- README with local run and task authoring docs

Co-authored-by: zdrawku <11193764+zdrawku@users.noreply.github.com>
Copilot AI and others added 2 commits March 8, 2026 07:13
Co-authored-by: zdrawku <11193764+zdrawku@users.noreply.github.com>
Co-authored-by: zdrawku <11193764+zdrawku@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Implement automated eval test suite for the Angular skills" to "Implement automated eval test suite for Angular Skills" on Mar 8, 2026
@github-actions

github-actions bot commented Mar 9, 2026

📊 Skill Eval Results

| Task | Pass Rate | pass@5 | Status |
| --- | --- | --- | --- |
| baseline | N/A | N/A | |

Thresholds

  • pass@5 ≥ 80% — merge gate passed
  • ⚠️ pass@5 ≥ 60% — needs investigation
  • pass@5 < 60% — blocks merge for affected skill

@kdinev

kdinev commented Mar 9, 2026

@copilot skill-eval is not a package. It's an example repository of how to build the evaluation infrastructure.

skill-eval is a reference repo, not an installable npm package.
Replaced with a local run-eval.sh script that executes reference
solutions and deterministic graders directly.

Co-authored-by: kdinev <1472513+kdinev@users.noreply.github.com>

Copilot AI commented Mar 9, 2026

@copilot skill-eval is not a package. It's an example repository of how to build the evaluation infrastructure.

Removed skill-eval as a dependency and replaced all npx skill-eval invocations with a self-contained run-eval.sh shell script that runs reference solutions and deterministic graders locally. No external packages needed. Fixed in 6e7b838.

@github-actions

github-actions bot commented Mar 9, 2026

📊 Skill Eval Results

| Task | Pass Rate | pass@5 | Status |
| --- | --- | --- | --- |
| baseline | N/A | N/A | |
| component-combo-reactive-form | N/A | N/A | |
| grid-basic-setup | N/A | N/A | |
| theming-palette-generation | N/A | N/A | |

Thresholds

  • pass@5 ≥ 80% — merge gate passed
  • ⚠️ pass@5 ≥ 60% — needs investigation
  • pass@5 < 60% — blocks merge for affected skill

@kdinev

kdinev commented Mar 10, 2026

@copilot The skill tests passed, so the automation looks good, but the github-actions bot report in the PR shows N/A for the pass rates

…scores

The workflow summary comment reads passRate and passAtK from per-task
JSON files. The runner was only writing reward/status, causing N/A.

Also skip baseline.json in the summary since it uses a different schema.

Co-authored-by: kdinev <1472513+kdinev@users.noreply.github.com>

Copilot AI commented Mar 10, 2026

@copilot The skill tests passed, so the automation looks good, but the github-actions bot report in the PR shows N/A for the pass rates

Two issues caused the N/A values: (1) run-eval.sh was writing reward/status fields but the summary script expected passRate/passAtK, and (2) baseline.json was being read by the summary script despite having a different schema with null values. Fixed both in b2047d8 — the runner now emits passRate and passAtK in each result JSON, and the summary script skips baseline.json.
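Given that description, a per-task result file might look roughly like the following; only the passRate and passAtK fields are confirmed by the thread, the remaining fields and values are assumptions:

```json
{
  "task": "grid-basic-setup",
  "agent": "reference",
  "trials": 5,
  "passRate": 1.0,
  "passAtK": 1.0,
  "status": "pass"
}
```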

@github-actions

📊 Skill Eval Results

| Task | Pass Rate | pass@5 | Status |
| --- | --- | --- | --- |
| component-combo-reactive-form | 100% | 100% | |
| grid-basic-setup | 100% | 100% | |
| theming-palette-generation | 100% | 100% | |

Thresholds

  • pass@5 ≥ 80% — merge gate passed
  • ⚠️ pass@5 ≥ 60% — needs investigation
  • pass@5 < 60% — blocks merge for affected skill

@kdinev kdinev marked this pull request as ready for review March 10, 2026 08:28
@kdinev kdinev requested a review from Copilot March 10, 2026 08:28
Removed installation of eval dependencies from workflow.
@kdinev

kdinev commented Mar 10, 2026

@copilot Verifications need to be performed against a copilot agent. Use the copilot cli for this. Also add an option to verify against gemini, with the gemini-cli. The switch should be config-based. I will provide the gemini API key as a secret

Copilot AI and others added 2 commits March 10, 2026 09:32
Co-authored-by: kdinev <1472513+kdinev@users.noreply.github.com>
Co-authored-by: kdinev <1472513+kdinev@users.noreply.github.com>

Copilot AI commented Mar 10, 2026

@copilot Verifications need to be performed against a copilot agent. Use the copilot cli for this. Also add an option to verify against gemini, with the gemini-cli. The switch should be...

Implemented in b181ca0 and 665264b.

The eval runner now supports agent-based verification via --agent flag:

# Run with GitHub Copilot CLI
bash run-eval.sh --all --agent copilot

# Run with Google Gemini CLI
bash run-eval.sh grid-basic-setup --agent gemini

# Multi-trial for statistical robustness
bash run-eval.sh --all --agent copilot --trials 3

Agent configuration is in evals/eval-config.json — change defaultAgent to switch between copilot and gemini, and customize CLI flags/timeouts there.
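The thread doesn't show eval-config.json itself; a plausible shape consistent with that description (a defaultAgent switch plus per-agent CLI flags and timeouts; all field names and flag values are assumptions):

```json
{
  "defaultAgent": "copilot",
  "trials": 1,
  "agents": {
    "copilot": {
      "command": "copilot",
      "flags": [],
      "timeoutSeconds": 600
    },
    "gemini": {
      "command": "gemini",
      "flags": [],
      "timeoutSeconds": 600,
      "apiKeyEnv": "GEMINI_API_KEY"
    }
  }
}
```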

The CI workflow now has a workflow_dispatch trigger so you can manually run agent-based evals from the Actions tab — select the agent and trial count. The GEMINI_API_KEY secret needs to be added to the repo for Gemini runs.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • www.npmjs.com
    • Triggering command: /home/REDACTED/work/_temp/ghcca-node/node/bin/node /home/REDACTED/work/_temp/ghcca-node/node/bin/node --enable-source-maps /home/REDACTED/work/_temp/copilot-developer-action-main/dist/index.js (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

@github-actions

📊 Skill Eval Results

| Task | Agent | Pass Rate | pass@k | Status |
| --- | --- | --- | --- | --- |
| component-combo-reactive-form | reference | 100% | 100% | |
| grid-basic-setup | reference | 100% | 100% | |
| theming-palette-generation | reference | 100% | 100% | |

Thresholds

  • pass@k ≥ 80% — merge gate passed
  • ⚠️ pass@k ≥ 60% — needs investigation
  • pass@k < 60% — blocks merge for affected skill

@kdinev

kdinev commented Mar 10, 2026

@copilot Rework the action to not trigger based on input for an agent, but to trigger against both copilot and gemini and to report results for both.

Co-authored-by: kdinev <1472513+kdinev@users.noreply.github.com>

Copilot AI commented Mar 10, 2026

@copilot Rework the action to not trigger based on input for an agent, but to trigger against both copilot and gemini and to report results for both.

Reworked in 1330989. The workflow no longer uses input-based agent selection — it always runs three parallel jobs: grader validation, Copilot eval, and Gemini eval. The summary comment now has separate sections for each:

  • Grader Validation — reference solution pass rates
  • Agent Evaluation — combined table with both Copilot and Gemini results per task

GEMINI_API_KEY still needs to be added as a repository secret for the Gemini job to authenticate.
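The resulting job layout might be sketched like this; job names, checkout steps, and exact run commands are assumptions, not the actual workflow contents (only the run-eval.sh --all --agent syntax appears earlier in the thread):

```yaml
# Hypothetical sketch of the three-job skill-eval.yml layout.
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Grader validation against the reference solutions.
      - run: bash evals/run-eval.sh --all
  agent-eval:
    strategy:
      matrix:
        agent: [copilot, gemini]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Copilot and Gemini evals run as parallel matrix jobs.
      - run: bash evals/run-eval.sh --all --agent ${{ matrix.agent }}
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
```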

@kdinev

kdinev commented Mar 10, 2026

@copilot Seems to me that there are instructions, but no actual test prompts for the agents. Review the whole implementation because there have been multiple iterations and changes. In the workflows, don't use the bash commands directly, instead call them through npm commands. Clean up everything that is obsolete and add missing items if you find such. I have provided the GEMINI_API_KEY secret, so the gemini workflow should run fine.

@github-actions

📊 Skill Eval Results

Grader Validation (reference solutions)

| Task | Pass Rate | Status |
| --- | --- | --- |
| component-combo-reactive-form | 100% | |
| grid-basic-setup | 100% | |
| theming-palette-generation | 100% | |

Agent Evaluation

| Task | Agent | Pass Rate | pass@k | Status |
| --- | --- | --- | --- | --- |
| component-combo-reactive-form | copilot | Error | Error | |
| grid-basic-setup | copilot | Error | Error | |
| theming-palette-generation | copilot | 0% | 0% | |
| component-combo-reactive-form | gemini | Error | Error | |
| grid-basic-setup | gemini | Error | Error | |
| theming-palette-generation | gemini | 0% | 0% | |

Thresholds

  • pass@k ≥ 80% — merge gate passed
  • ⚠️ pass@k ≥ 60% — needs investigation
  • pass@k < 60% — blocks merge for affected skill

Co-authored-by: kdinev <1472513+kdinev@users.noreply.github.com>

Copilot AI commented Mar 10, 2026

@copilot Seems to me that there are instructions, but no actual test prompts for the agents. Review the whole implementation because there have been multiple iterations and changes. In the workflows, ...

Fixed in a9da524:

  1. Agent prompts — Added dedicated prompt.md files for each task with concise, actionable prompts sent to the agent CLI. The runner now prefers prompt.md over instruction.md when running in agent mode. instruction.md remains as the detailed human-readable task description.

  2. npm scripts in CI — All three workflow jobs now use npm run validate, npm run agent:copilot, and npm run agent:gemini instead of direct bash invocations.

  3. Cleanup — README updated to document the prompt.md vs instruction.md distinction, removed obsolete bare bash run-eval.sh examples, corrected grader descriptions (file/import/ordering checks, not build verification). All graders still validate at 5/5.

@github-actions

📊 Skill Eval Results

Grader Validation (reference solutions)

| Task | Pass Rate | Status |
| --- | --- | --- |
| component-combo-reactive-form | 100% | |
| grid-basic-setup | 100% | |
| theming-palette-generation | 100% | |

Agent Evaluation

| Task | Agent | Pass Rate | pass@k | Status |
| --- | --- | --- | --- | --- |
| component-combo-reactive-form | copilot | Error | Error | |
| grid-basic-setup | copilot | Error | Error | |
| theming-palette-generation | copilot | 0% | 0% | |
| component-combo-reactive-form | gemini | Error | Error | |
| grid-basic-setup | gemini | Error | Error | |
| theming-palette-generation | gemini | 0% | 0% | |

Thresholds

  • pass@k ≥ 80% — merge gate passed
  • ⚠️ pass@k ≥ 60% — needs investigation
  • pass@k < 60% — blocks merge for affected skill

Development

Successfully merging this pull request may close these issues.

Implement Automated Eval Test Suite for the Angular Skills

4 participants