
Implement automated eval test suite for Angular Skills #17007

Open

Copilot wants to merge 20 commits into master from copilot/implement-automated-eval-test-suite
Conversation


Copilot AI commented Mar 8, 2026

  • Scaffold evals/ directory structure with package.json and README.md
  • Create grid-basic-setup task (igniteui-angular-grids skill)
  • Create component-combo-reactive-form task (igniteui-angular-components skill)
  • Create theming-palette-generation task (igniteui-angular-theming skill)
  • Create GitHub Actions workflow (skill-eval.yml)
  • Commit baseline results JSON for regression comparison
  • Replace skill-eval package dependency with self-contained local runner (run-eval.sh)
  • Update CI workflow to use local runner
  • Fix result JSON format so CI summary comment shows pass rates instead of N/A
  • Tighten combo grader to require entry-point imports (reject root barrel)
  • Enforce core() before theme() ordering in theming grader
  • Update README to match actual grader behavior
  • Add copilot-cli agent mode to run-eval.sh (--agent copilot)
  • Add gemini-cli agent mode to run-eval.sh (--agent gemini)
  • Add eval-config.json for config-based agent/provider selection
  • Add multi-trial support (--trials N) for statistical robustness
  • CI workflow always runs against both Copilot and Gemini agents in parallel
  • Combined PR summary comment with separate sections for validation, Copilot, and Gemini results
  • Add dedicated prompt.md agent prompts for each task (concise, actionable prompts sent to CLI agents)
  • Switch CI workflow from direct bash commands to npm script wrappers
  • Clean up README: document prompt.md vs instruction.md distinction, remove obsolete examples, correct grader descriptions
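The core()-before-theme() ordering check mentioned in the change list could be implemented as a small deterministic grader. A sketch, where the file layout, the sample Sass content, and the mixin names (`@include core`, `@include theme`) are illustrative assumptions; a real grader would point at the agent's workspace file instead of generating a sample itself:

```shell
#!/usr/bin/env bash
# Hypothetical ordering grader: core() must be included before theme().
set -euo pipefail

# Sample styles file standing in for the agent's output.
styles=$(mktemp)
cat > "$styles" <<'SCSS'
@use "igniteui-angular/theming" as *;
@include core();
@include theme($light-material-palette);
SCSS

# First line number on which each include appears.
core_line=$(grep -n '@include core' "$styles" | head -1 | cut -d: -f1)
theme_line=$(grep -n '@include theme' "$styles" | head -1 | cut -d: -f1)

if [[ -n "$core_line" && -n "$theme_line" ]] && (( core_line < theme_line )); then
  echo "ordering: PASS"
else
  echo "ordering: FAIL"
fi
```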
Original prompt

This section details the original issue you should resolve

<issue_title>Implement Automated Eval Test Suite for the Angular Skills</issue_title>
<issue_description>We have three Skills (igniteui-angular-components, igniteui-angular-grids, igniteui-angular-theming) that teach coding agents how to correctly select, configure, and compose Ignite UI for Angular components. As these skills grow in complexity and more developers rely on them, silent regressions become a real risk: rewording a step, reordering routing logic, or removing a "verify" clause can quietly degrade agent behavior with no signal until a user reports a wrong output.

This work item establishes a structured eval process for these skills, directly inspired by Minko Gechev's Skill Eval framework and extended with patterns from Anthropic's agent eval research and the Skills Best Practices guide.

Goals

  • Produce a measurable, repeatable quality score for each skill.
  • Detect regressions automatically when a skill file is modified in a PR.
  • Provide a feedback loop during skill authoring (edit → eval → score delta).
  • Establish pass/fail thresholds that gate merges to main.

Approach

Tooling: Adopt the skill-eval TypeScript framework as the eval runner. It supports Docker-isolated agent execution, deterministic shell graders, LLM rubric graders, multi-trial runs, and JSON result persistence — all the properties needed here.

Task Structure

Create an evals/ directory at the repo root. Each eval task is a self-contained directory:

Example:

evals/
├── tasks/
│   ├── grid-basic-setup/
│   │   ├── task.toml               # timeouts, grader weights, trial count
│   │   ├── instruction.md          # what the agent is asked to do
│   │   ├── environment/Dockerfile  # clean Angular project baseline
│   │   ├── tests/test.sh           # deterministic grader (file checks, compile, lint)
│   │   ├── prompts/quality.md      # LLM rubric grader questions
│   │   ├── solution/solve.sh       # reference solution for baseline
│   │   └── skills/                 # symlinks or copies of the skills under test
│   │       └── igniteui-angular-grids/SKILL.md
│   ├── grid-sorting-remote-data/
│   ├── grid-hierarchical-setup/
│   ├── grid-pivot-config/
│   ├── component-combo-reactive-form/
│   ├── component-date-picker-validation/
│   ├── component-dialog-service/
│   ├── theming-palette-generation/
│   ├── theming-component-override/
│   └── skill-routing-intent-detection/  # tests the SKILL.md router logic itself
├── package.json
└── README.md
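The thread doesn't show a task.toml, but per the tree above it holds timeouts, grader weights, and trial count. A plausible sketch, loosely modeled on the skill-eval reference repo; all field names are assumptions, not a published schema:

```toml
# Hypothetical task.toml for grid-basic-setup; the schema is an assumption.
[task]
id = "grid-basic-setup"
skill = "igniteui-angular-grids"
timeout_seconds = 600
trials = 5

# Grader weights: deterministic shell checks plus an LLM rubric pass.
[graders.deterministic]
script = "tests/test.sh"
weight = 0.7

[graders.rubric]
prompt = "prompts/quality.md"
weight = 0.3
```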

Tasks to Implement (per Skill)

igniteui-angular-grids skill (highest priority — most complex routing)

| Task ID | Instruction given to agent | Deterministic check | LLM rubric check |
| --- | --- | --- | --- |
| grid-basic-setup | "Add a data grid showing employee data with sorting and pagination" | Project compiles; `<igx-grid>` present in template; correct module imported | Did agent choose IgxGrid (not Tree/Hierarchical) for flat data? Did it configure `[data]` binding correctly? |
| grid-tree-vs-flat | "Display department data with nested child rows" | `<igx-tree-grid>` present; childDataKey configured | Did skill routing correctly select Tree Grid over flat Grid? |
| grid-hierarchical-setup | "Build a master-detail grid where clicking a row expands child orders" | `<igx-hierarchical-grid>` + `<igx-row-island>` present | Did agent configure load-on-demand vs inline data correctly based on instructions? |
| grid-remote-filtering | "Add server-side filtering and sorting to the grid" | `[filterMode]="'externalFilterMode'"` set; remote service stub present | Did agent wire onDataPreLoad/sortingExpressionsChange instead of local filtering? |
| grid-pivot-config | "Create a pivot grid with row/column/value dimensions" | `<igx-pivot-grid>` + IgxPivotConfiguration present | Did agent define rows, columns, values correctly vs a flat grid with groupBy? |
| grid-state-persistence | "Persist grid sorting and filtering state to localStorage" | IgxGridStateDirective present; serialize/restore calls present | Did agent use the state directive vs manually serializing expressions? |
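To make the deterministic column concrete, a tests/test.sh for grid-basic-setup might look like the sketch below. The file paths, the sample markup, and the IGX_GRID_DIRECTIVES import are illustrative assumptions, and the generated sample files stand in for the agent's actual workspace:

```shell
#!/usr/bin/env bash
# Hypothetical deterministic grader sketch for grid-basic-setup.
set -euo pipefail

# Sample agent output; a real run grades the task workspace instead.
workdir=$(mktemp -d)
cat > "$workdir/app.component.html" <<'HTML'
<igx-grid [data]="employees">
  <igx-column field="name" [sortable]="true"></igx-column>
  <igx-paginator></igx-paginator>
</igx-grid>
HTML
cat > "$workdir/app.component.ts" <<'TS'
import { IGX_GRID_DIRECTIVES } from 'igniteui-angular';
TS

# Count how many of the deterministic checks pass.
score=0
if grep -q '<igx-grid' "$workdir/app.component.html"; then score=$((score + 1)); fi
if grep -q '\[data\]=' "$workdir/app.component.html"; then score=$((score + 1)); fi
if grep -q "igniteui-angular" "$workdir/app.component.ts"; then score=$((score + 1)); fi

echo "checks passed: $score/3"
```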

igniteui-angular-components skill

| Task ID | Instruction | Deterministic check | LLM rubric check |
| --- | --- | --- | --- |
| component-combo-reactive-form | "Add a multi-select combo bound to a reactive form control" | `<igx-combo>` present; `[formControlName]` wired; module imported | Did agent use IgxCombo (not IgxSelect or native `<select>`) for multi-select? |
| component-date-picker-validation | "Add a date picker with min/max date validation" | `<igx-date-picker>` present; minValue/maxValue i... | |
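The change list above mentions tightening the combo grader to require entry-point imports and reject the root barrel. A minimal sketch of that check; the deep import path 'igniteui-angular/combo' is an assumption used for illustration, and the sample file stands in for the agent's output:

```shell
#!/usr/bin/env bash
# Hypothetical import check: accept deep entry-point imports, reject the
# root barrel. Paths and import specifiers are illustrative assumptions.
set -euo pipefail

src=$(mktemp)
cat > "$src" <<'TS'
import { IgxComboComponent } from 'igniteui-angular/combo';
TS

# The trailing quote in the pattern means only the exact root barrel
# specifier matches, not deep paths like 'igniteui-angular/combo'.
if grep -q "from 'igniteui-angular'" "$src"; then
  verdict="FAIL"
else
  verdict="PASS"
fi
echo "combo imports: $verdict"
```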


Add automated eval test suite for Angular Skills using the skill-eval framework:
- grid-basic-setup task (igniteui-angular-grids skill)
- component-combo-reactive-form task (igniteui-angular-components skill)
- theming-palette-generation task (igniteui-angular-theming skill)
- GitHub Actions workflow for CI integration
- Baseline results JSON for regression comparison
- README with local run and task authoring docs

Co-authored-by: zdrawku <11193764+zdrawku@users.noreply.github.com>
Copilot AI and others added 2 commits March 8, 2026 07:13
Co-authored-by: zdrawku <11193764+zdrawku@users.noreply.github.com>
Co-authored-by: zdrawku <11193764+zdrawku@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Implement automated eval test suite for the Angular skills" to "Implement automated eval test suite for Angular Skills" on Mar 8, 2026
@github-actions

github-actions bot commented Mar 9, 2026

📊 Skill Eval Results

| Task | Pass Rate | pass@5 | Status |
| --- | --- | --- | --- |
| baseline | N/A | N/A | |

Thresholds

  • pass@5 ≥ 80% — merge gate passed
  • ⚠️ pass@5 ≥ 60% — needs investigation
  • pass@5 < 60% — blocks merge for affected skill

@kdinev

kdinev commented Mar 9, 2026

@copilot skill-eval is not a package. It's an example repository of how to build the evaluation infrastructure.

skill-eval is a reference repo, not an installable npm package.
Replaced with a local run-eval.sh script that executes reference
solutions and deterministic graders directly.

Co-authored-by: kdinev <1472513+kdinev@users.noreply.github.com>

Copilot AI commented Mar 9, 2026

@copilot skill-eval is not a package. It's an example repository of how to build the evaluation infrastructure.

Removed skill-eval as a dependency and replaced all npx skill-eval invocations with a self-contained run-eval.sh shell script that runs reference solutions and deterministic graders locally. No external packages needed. Fixed in 6e7b838.

@github-actions

github-actions bot commented Mar 9, 2026

📊 Skill Eval Results

| Task | Pass Rate | pass@5 | Status |
| --- | --- | --- | --- |
| baseline | N/A | N/A | |
| component-combo-reactive-form | N/A | N/A | |
| grid-basic-setup | N/A | N/A | |
| theming-palette-generation | N/A | N/A | |

Thresholds

  • pass@5 ≥ 80% — merge gate passed
  • ⚠️ pass@5 ≥ 60% — needs investigation
  • pass@5 < 60% — blocks merge for affected skill

@kdinev

kdinev commented Mar 10, 2026

@copilot The skill tests passed, so the automation looks good, but the github-actions bot report in the PR shows N/A for the pass rates

…scores

The workflow summary comment reads passRate and passAtK from per-task
JSON files. The runner was only writing reward/status, causing N/A.

Also skip baseline.json in the summary since it uses a different schema.

Co-authored-by: kdinev <1472513+kdinev@users.noreply.github.com>

Copilot AI commented Mar 10, 2026

@copilot The skill tests passed, so the automation looks good, but the github-actions bot report in the PR shows N/A for the pass rates

Two issues caused the N/A values: (1) run-eval.sh was writing reward/status fields but the summary script expected passRate/passAtK, and (2) baseline.json was being read by the summary script despite having a different schema with null values. Fixed both in b2047d8 — the runner now emits passRate and passAtK in each result JSON, and the summary script skips baseline.json.
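Given that description, a per-task result file might look roughly like the following; only the passRate and passAtK fields are confirmed by the thread, the remaining fields and values are assumptions:

```json
{
  "task": "grid-basic-setup",
  "agent": "reference",
  "trials": 5,
  "passRate": 1.0,
  "passAtK": 1.0,
  "status": "pass"
}
```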

@github-actions

📊 Skill Eval Results

| Task | Pass Rate | pass@5 | Status |
| --- | --- | --- | --- |
| component-combo-reactive-form | 100% | 100% | |
| grid-basic-setup | 100% | 100% | |
| theming-palette-generation | 100% | 100% | |

Thresholds

  • pass@5 ≥ 80% — merge gate passed
  • ⚠️ pass@5 ≥ 60% — needs investigation
  • pass@5 < 60% — blocks merge for affected skill

@kdinev kdinev marked this pull request as ready for review March 10, 2026 08:28
@kdinev kdinev requested a review from Copilot March 10, 2026 08:28
Removed installation of eval dependencies from workflow.
@kdinev

kdinev commented Mar 10, 2026

@copilot Verifications need to be performed against a copilot agent. Use the copilot cli for this. Also add an option to verify against gemini, with the gemini-cli. The switch should be config-based. I will provide the gemini API key as a secret

Copilot AI and others added 2 commits March 10, 2026 09:32
Co-authored-by: kdinev <1472513+kdinev@users.noreply.github.com>
Co-authored-by: kdinev <1472513+kdinev@users.noreply.github.com>

Copilot AI commented Mar 10, 2026

@copilot Verifications need to be performed against a copilot agent. Use the copilot cli for this. Also add an option to verify against gemini, with the gemini-cli. The switch should be...

Implemented in b181ca0 and 665264b.

The eval runner now supports agent-based verification via --agent flag:

# Run with GitHub Copilot CLI
bash run-eval.sh --all --agent copilot

# Run with Google Gemini CLI
bash run-eval.sh grid-basic-setup --agent gemini

# Multi-trial for statistical robustness
bash run-eval.sh --all --agent copilot --trials 3

Agent configuration is in evals/eval-config.json — change defaultAgent to switch between copilot and gemini, and customize CLI flags/timeouts there.
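The thread doesn't show eval-config.json itself; a plausible shape consistent with that description (a defaultAgent switch plus per-agent CLI flags and timeouts; all field names and flag values are assumptions):

```json
{
  "defaultAgent": "copilot",
  "trials": 1,
  "agents": {
    "copilot": {
      "command": "copilot",
      "flags": [],
      "timeoutSeconds": 600
    },
    "gemini": {
      "command": "gemini",
      "flags": [],
      "timeoutSeconds": 600,
      "apiKeyEnv": "GEMINI_API_KEY"
    }
  }
}
```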

The CI workflow now has a workflow_dispatch trigger so you can manually run agent-based evals from the Actions tab — select the agent and trial count. The GEMINI_API_KEY secret needs to be added to the repo for Gemini runs.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • www.npmjs.com
    • Triggering command: /home/REDACTED/work/_temp/ghcca-node/node/bin/node /home/REDACTED/work/_temp/ghcca-node/node/bin/node --enable-source-maps /home/REDACTED/work/_temp/copilot-developer-action-main/dist/index.js (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

@github-actions

📊 Skill Eval Results

| Task | Agent | Pass Rate | pass@k | Status |
| --- | --- | --- | --- | --- |
| component-combo-reactive-form | reference | 100% | 100% | |
| grid-basic-setup | reference | 100% | 100% | |
| theming-palette-generation | reference | 100% | 100% | |

Thresholds

  • pass@k ≥ 80% — merge gate passed
  • ⚠️ pass@k ≥ 60% — needs investigation
  • pass@k < 60% — blocks merge for affected skill

@kdinev

kdinev commented Mar 10, 2026

@copilot Rework the action to not trigger based on input for an agent, but to trigger against both copilot and gemini and to report results for both.

Co-authored-by: kdinev <1472513+kdinev@users.noreply.github.com>

Copilot AI commented Mar 10, 2026

@copilot Rework the action to not trigger based on input for an agent, but to trigger against both copilot and gemini and to report results for both.

Reworked in 1330989. The workflow no longer uses input-based agent selection — it always runs three parallel jobs: grader validation, Copilot eval, and Gemini eval. The summary comment now has separate sections for each:

  • Grader Validation — reference solution pass rates
  • Agent Evaluation — combined table with both Copilot and Gemini results per task

GEMINI_API_KEY still needs to be added as a repository secret for the Gemini job to authenticate.
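The resulting job layout might be sketched like this; job names, checkout steps, and exact run commands are assumptions, not the actual workflow contents (only the run-eval.sh --all --agent syntax appears earlier in the thread):

```yaml
# Hypothetical sketch of the three-job skill-eval.yml layout.
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Grader validation against the reference solutions.
      - run: bash evals/run-eval.sh --all
  agent-eval:
    strategy:
      matrix:
        agent: [copilot, gemini]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Copilot and Gemini evals run as parallel matrix jobs.
      - run: bash evals/run-eval.sh --all --agent ${{ matrix.agent }}
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
```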

@kdinev

kdinev commented Mar 10, 2026

@copilot Seems to me that there are instructions, but no actual test prompts for the agents. Review the whole implementation because there have been multiple iterations and changes. In the workflows, don't use the bash commands directly, instead call them through npm commands. Clean up everything that is obsolete and add missing items if you find such. I have provided the GEMINI_API_KEY secret, so the gemini workflow should run fine.

@github-actions

📊 Skill Eval Results

Grader Validation (reference solutions)

| Task | Pass Rate | Status |
| --- | --- | --- |
| component-combo-reactive-form | 100% | |
| grid-basic-setup | 100% | |
| theming-palette-generation | 100% | |

Agent Evaluation

| Task | Agent | Pass Rate | pass@k | Status |
| --- | --- | --- | --- | --- |
| component-combo-reactive-form | copilot | Error | Error | |
| grid-basic-setup | copilot | Error | Error | |
| theming-palette-generation | copilot | 0% | 0% | |
| component-combo-reactive-form | gemini | Error | Error | |
| grid-basic-setup | gemini | Error | Error | |
| theming-palette-generation | gemini | 0% | 0% | |

Thresholds

  • pass@k ≥ 80% — merge gate passed
  • ⚠️ pass@k ≥ 60% — needs investigation
  • pass@k < 60% — blocks merge for affected skill

Co-authored-by: kdinev <1472513+kdinev@users.noreply.github.com>

Copilot AI commented Mar 10, 2026

@copilot Seems to me that there are instructions, but no actual test prompts for the agents. Review the whole implementation because there have been multiple iterations and changes. In the workflows, ...

Fixed in a9da524:

  1. Agent prompts — Added dedicated prompt.md files for each task with concise, actionable prompts sent to the agent CLI. The runner now prefers prompt.md over instruction.md when running in agent mode. instruction.md remains as the detailed human-readable task description.

  2. npm scripts in CI — All three workflow jobs now use npm run validate, npm run agent:copilot, and npm run agent:gemini instead of direct bash invocations.

  3. Cleanup — README updated to document the prompt.md vs instruction.md distinction, removed obsolete bare bash run-eval.sh examples, corrected grader descriptions (file/import/ordering checks, not build verification). All graders still validate at 5/5.

@github-actions

📊 Skill Eval Results

Grader Validation (reference solutions)

| Task | Pass Rate | Status |
| --- | --- | --- |
| component-combo-reactive-form | 100% | |
| grid-basic-setup | 100% | |
| theming-palette-generation | 100% | |

Agent Evaluation

| Task | Agent | Pass Rate | pass@k | Status |
| --- | --- | --- | --- | --- |
| component-combo-reactive-form | copilot | Error | Error | |
| grid-basic-setup | copilot | Error | Error | |
| theming-palette-generation | copilot | 0% | 0% | |
| component-combo-reactive-form | gemini | Error | Error | |
| grid-basic-setup | gemini | Error | Error | |
| theming-palette-generation | gemini | 0% | 0% | |

Thresholds

  • pass@k ≥ 80% — merge gate passed
  • ⚠️ pass@k ≥ 60% — needs investigation
  • pass@k < 60% — blocks merge for affected skill

Development

Successfully merging this pull request may close these issues.

Implement Automated Eval Test Suite for the Angular Skills

4 participants