Add ci-pipeline-monitor Copilot CLI skill #125809
JulieLeeMSFT wants to merge 50 commits into dotnet:main from
Conversation
Automates monitoring of 20+ CI stress/PGO test pipelines on Azure DevOps, triaging failures, and generating weekly reports with GitHub issue tracking. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Adds a new ci-pipeline-monitor Copilot CLI skill under .github/skills/ to collect Azure DevOps pipeline/test data into a SQLite DB, download Helix logs, generate a standardized markdown report, and validate results before publishing.
Changes:
- Introduces Python scripts to (1) create/populate monitor.db, (2) extract failed tests via the AzDO Test Results API, (3) download Helix console logs, (4) generate a report, and (5) validate DB/report/log consistency.
- Adds templates and reference docs that define the workflow, report/debug-log formats, and “verbatim copy-paste” rules for failure text.
- Adds a pipeline definition inventory (pipelines.md) and ignores generated artifacts via .gitignore.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| .github/skills/ci-pipeline-monitor/scripts/setup_and_fetch_builds.py | Creates schema and populates pipelines with latest build results via AzDO Builds API. |
| .github/skills/ci-pipeline-monitor/scripts/extract_failed_tests.py | Queries AzDO Test Results API for failing tests and inserts rows into test_results. |
| .github/skills/ci-pipeline-monitor/scripts/fetch_helix_logs.py | Downloads Helix console logs and updates test_results with exit_code/console_log_path. |
| .github/skills/ci-pipeline-monitor/scripts/generate_report.py | Formats a report from DB contents into logs/ following the template. |
| .github/skills/ci-pipeline-monitor/scripts/validate_results.py | Performs DB/report/log validation checks, including GitHub search verification for “NEW” failures. |
| .github/skills/ci-pipeline-monitor/report-template.md | Defines the expected report structure/fields. |
| .github/skills/ci-pipeline-monitor/log-template.md | Defines the expected debug log structure/steps. |
| .github/skills/ci-pipeline-monitor/references/triage-workflow.md | Detailed triage workflow instructions for the LLM/operator. |
| .github/skills/ci-pipeline-monitor/references/verbatim-rules.md | Rules for verbatim extraction of error messages/stack traces from logs. |
| .github/skills/ci-pipeline-monitor/references/validation-checks.md | Documents the validation step and what checks are performed. |
| .github/skills/ci-pipeline-monitor/pipelines.md | Lists monitored pipelines and cached definition IDs. |
| .github/skills/ci-pipeline-monitor/SKILL.md | Skill definition and end-to-end workflow documentation. |
| .github/skills/ci-pipeline-monitor/README.md | User-facing setup and usage instructions for the skill. |
| .github/skills/ci-pipeline-monitor/.gitignore | Ignores generated DB/logs/JSON outputs. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
```python
        f"Error getting token from az cli (exit code {result.returncode}): {result.stderr.strip()}",
        file=sys.stderr,
    )
    return ""
token = result.stdout.strip()
```
This paged AzDO results request also lacks a timeout / exception handling. Consider adding a timeout and handling RequestException (e.g., retry or break out with a warning) so one stalled request doesn’t block the full extraction.
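A minimal sketch of the suggested guard, assuming the script uses stdlib urllib; the timeout, retry count, and backoff are illustrative, and the `opener`/`sleep` parameters exist only to make the retry logic testable:

```python
import time
import urllib.error
import urllib.request

def fetch_page(url, timeout=30, max_retries=3,
               opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch one page of AzDO results, retrying on transient errors
    instead of letting a stalled request block the full extraction."""
    for attempt in range(1, max_retries + 1):
        try:
            with opener(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError) as exc:
            print(f"warning: attempt {attempt}/{max_retries} for {url} failed: {exc}")
            if attempt == max_retries:
                return None  # give up on this page; caller moves on with a warning
            sleep(2 ** attempt)  # simple exponential backoff between retries
```

The caller can treat a `None` return as "skip this page and continue with the remaining pipelines".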
…available. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Remove all gh CLI usage (subprocess calls, auth login fallback) - Use unauthenticated GitHub Search API (urllib) for check 16d - Add rate-limit sleep between requests (10 req/min) to handle >10 NEW failures - Skip sleep after last request - Remove unused subprocess import - Update README auth section to match Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Clarify that these fields are initially populated from the AzDO Test Results API and may be enriched or overwritten by the agent with verbatim console-log snippets during triage. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Match the report-template.md spec: the validation warning banner should appear after the 'End of report' footer, not before it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Count pipelines with 0 test failures as 'monitored' not 'skipped'. Only private pipelines are truly skipped. Fixes header showing '11 monitored' instead of '21 monitored'. - Cap test names per failure group in Pipeline Summary at 5, with '... and N more' for the rest. Prevents 500+ line reports from mass failures like the 3152 ARM64 HW intrinsics regression. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Match fetch_helix_logs.py behavior: collect all exit codes from the console log and prefer the first non-zero value. The previous logic took the last match, which was always XUnitLogChecker's 'exit code 0', causing false mismatches every run. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
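The "prefer the first non-zero exit code" rule can be sketched as follows; the regex is an assumption about the console-log format, not the exact pattern used by fetch_helix_logs.py:

```python
import re

# Assumed marker format, e.g. "exit code: 139" or "exit code 0".
EXIT_CODE_RE = re.compile(r"exit code[:\s]+(-?\d+)", re.IGNORECASE)

def pick_exit_code(console_log: str):
    """Collect all exit codes from the log and return the first non-zero
    one; fall back to 0 if only zeroes were seen, or None if no exit code
    appears at all. Taking the last match instead would always hit
    XUnitLogChecker's trailing 'exit code 0'."""
    codes = [int(m.group(1)) for m in EXIT_CODE_RE.finditer(console_log)]
    if not codes:
        return None
    for code in codes:
        if code != 0:
            return code
    return 0
```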
```
| 5. Generate Report | Generate markdown report from DB | Script (`generate_report.py`) | None (reads DB only) |
| 6. Validate Report | Validate DB completeness and report accuracy | Script (`validate_results.py`) | GitHub Search API (unauthenticated spot-checks) |
```
The step ordering here contradicts SKILL.md: this table says Step 5 generates the report and Step 6 validates, but SKILL.md requires validating the DB before generating the report (and generate_report.py assumes a clean DB). Please swap these rows (or otherwise align README step numbering/description with SKILL.md) so users run validation before report generation.
```diff
-| 5. Generate Report | Generate markdown report from DB | Script (`generate_report.py`) | None (reads DB only) |
-| 6. Validate Report | Validate DB completeness and report accuracy | Script (`validate_results.py`) | GitHub Search API (unauthenticated spot-checks) |
+| 5. Validate Report | Validate DB completeness and report accuracy | Script (`validate_results.py`) | GitHub Search API (unauthenticated spot-checks) |
+| 6. Generate Report | Generate markdown report from DB | Script (`generate_report.py`) | None (reads DB only) |
```
```python
total = cur.execute("SELECT COUNT(*) FROM pipelines").fetchone()[0]
# Only private/skip pipelines are truly skipped; pipelines with
# 0 test failures are still "monitored" even though they got
# re-marked as skipped by extract_failed_tests.py.
truly_skipped = cur.execute(
    "SELECT COUNT(*) FROM pipelines WHERE result = 'skipped' AND skip_reason = 'private'"
).fetchone()[0]
monitored = total - truly_skipped
skipped = truly_skipped
```
The header/summary counts treat all non-private result='skipped' pipelines as “monitored” (skip_reason != 'private'), but setup_and_fetch_builds.py also marks intentionally skipped pipelines (skip, missing_def_id, no builds found, etc.) as skipped. Those should likely count as skipped/not-monitored; otherwise the report overstates monitored pipelines and understates skipped. Consider defining monitored pipelines as those not intentionally skipped (e.g., skip_reason IS NULL OR skip_reason == 'Build reported failed but Test Results API returned 0 test method failures'), and count other skip_reasons as skipped.
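A hedged sketch of that suggestion, using assumed skip_reason values ('private', 'skip', 'missing_def_id') and column names taken from the review comment; the real set of intentional skip reasons may differ:

```python
import sqlite3

# Assumed skip_reason values that mean "intentionally not monitored".
INTENTIONAL_SKIPS = ("private", "skip", "missing_def_id")

def count_monitored(conn: sqlite3.Connection):
    """Return (monitored, skipped), counting every intentional skip_reason
    as skipped rather than only 'private'. Pipelines re-marked as skipped
    for '0 test failures' stay in the monitored count."""
    cur = conn.cursor()
    total = cur.execute("SELECT COUNT(*) FROM pipelines").fetchone()[0]
    placeholders = ",".join("?" for _ in INTENTIONAL_SKIPS)
    skipped = cur.execute(
        f"SELECT COUNT(*) FROM pipelines "
        f"WHERE result = 'skipped' AND skip_reason IN ({placeholders})",
        INTENTIONAL_SKIPS,
    ).fetchone()[0]
    return total - skipped, skipped
```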
```python
).fetchone()[0]
# Pipelines re-marked as skipped (0 test failures) are still monitored
skipped_0_tests = cur.execute(
    "SELECT COUNT(*) FROM pipelines WHERE result = 'skipped' AND skip_reason != 'private'"
```
Similarly, skipped_0_tests is computed as all skipped pipelines with skip_reason != 'private', which includes pipelines intentionally skipped in pipelines.md (notes include "skip") or missing def IDs/builds. That inflates the “monitored” denominator in the Pipeline Summary. Consider narrowing this to only the specific “0 test failures” skip_reason set by extract_failed_tests.py (or otherwise explicitly exclude known intentional skip reasons).
```diff
-    "SELECT COUNT(*) FROM pipelines WHERE result = 'skipped' AND skip_reason != 'private'"
+    "SELECT COUNT(*) FROM pipelines WHERE result = 'skipped' AND skip_reason = '0 test failures'"
```
```
Fallback — Build Log Parsing (if needed)
================================================================================
[HH:mm:ss] Test Results API failed for <pipeline> — falling back to log parsing
[HH:mm:ss] GET get_build_log(project="public", buildId=<buildId>)
           → <N> log entries
[HH:mm:ss] Candidate logs (lineCount 100-2000): <list of logIds>
[HH:mm:ss] GET get_build_log_by_id(project="public", buildId=<buildId>, logId=<logId>, startLine=0, endLine=1)
           → "##[section]Starting: <step name>"
           → Match: YES/NO
[HH:mm:ss] GET get_build_log_by_id(project="public", buildId=<buildId>, logId=<logId>)
           → Full log (<N> lines), found <M> errors
================================================================================
```
The debug log template still references get_builds, get_build_log, and get_build_log_by_id calls (MCP-style), but SKILL.md bans the ado-pipelines_* tools and the actual implementation fetches builds via setup_and_fetch_builds.py (direct HTTP). Please update this template to log the real script invocations / REST URLs (and remove the “Fallback — Build Log Parsing” section if those APIs/tools are intentionally disallowed).
```diff
-Fallback — Build Log Parsing (if needed)
-================================================================================
-[HH:mm:ss] Test Results API failed for <pipeline> — falling back to log parsing
-[HH:mm:ss] GET get_build_log(project="public", buildId=<buildId>)
-           → <N> log entries
-[HH:mm:ss] Candidate logs (lineCount 100-2000): <list of logIds>
-[HH:mm:ss] GET get_build_log_by_id(project="public", buildId=<buildId>, logId=<logId>, startLine=0, endLine=1)
-           → "##[section]Starting: <step name>"
-           → Match: YES/NO
-[HH:mm:ss] GET get_build_log_by_id(project="public", buildId=<buildId>, logId=<logId>)
-           → Full log (<N> lines), found <M> errors
-================================================================================
```
Replace 'List ALL failing tests' with cap-at-5 rule in both the report template and the generated report notes section. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Failed tests block in each failure's body also listed all unique test names uncapped. Apply the same cap-at-5 rule. Update template line 119 to match. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
```python
def _header(self, out):
    cur = self.conn.cursor()
    total = cur.execute("SELECT COUNT(*) FROM pipelines").fetchone()[0]
    # Only private/skip pipelines are truly skipped; pipelines with
    # 0 test failures are still "monitored" even though they got
    # re-marked as skipped by extract_failed_tests.py.
    truly_skipped = cur.execute(
        "SELECT COUNT(*) FROM pipelines WHERE result = 'skipped' AND skip_reason = 'private'"
    ).fetchone()[0]
    monitored = total - truly_skipped
    skipped = truly_skipped
    passed = cur.execute(
        "SELECT COUNT(*) FROM pipelines WHERE result = 'succeeded'"
    ).fetchone()[0]
    failed = monitored - passed
```
_header() / _pipeline_summary() treat only skip_reason = 'private' as “truly skipped”, so pipelines intentionally marked skip (and missing_def_id) are counted as monitored. That makes the PASS/FAIL/monitored totals inaccurate and can classify skipped pipelines as failures via failed = monitored - passed. Consider counting skip_reason IN ('private','skip','missing_def_id') (and any other intentional opt-out reasons) as skipped, while keeping the “0 test failures” case separately monitored as intended.
```python
# Compute shared_paths: console_log_path values used by multiple test_results rows
shared_paths = set()
path_counts = {}
for r in rows:
    p = r["console_log_path"]
    path_counts[p] = path_counts.get(p, 0) + 1
for p, c in path_counts.items():
    if c > 1:
        shared_paths.add(p)
skipped_shared = 0
for r in rows:
    log_path = r["console_log_path"]
    if log_path in shared_paths:
        skipped_shared += 1
        continue  # on-disk file was overwritten — can't validate
    if not os.path.isfile(log_path):
```
The exit_code matches console log check skips any console_log_path used by multiple test_results rows, assuming the file was overwritten. But fetch_helix_logs.py intentionally deduplicates by URL and updates multiple rows to point at the same on-disk log file, so this will commonly skip valid rows and reduce validation coverage. Consider validating once per unique console_log_path (or per unique URL) and applying the result to all associated rows, rather than skipping.
```python
conn = sqlite3.connect(args.db)
insert_into_db(args.db, all_failures)
```
main() opens a SQLite connection (conn = sqlite3.connect(args.db)) before calling insert_into_db(), but insert_into_db() opens and commits via a separate connection. Keeping two connections open unnecessarily can be confusing and may contribute to locking/isolation edge cases. Consider either passing the existing conn into insert_into_db() or moving the sqlite3.connect() call until after the inserts are complete (right before the pipeline UPDATEs).
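A sketch of the suggested refactor, with the caveat that the real insert_into_db() takes a DB path and a richer row shape; the trimmed table and columns here are placeholders:

```python
import sqlite3

def insert_into_db(conn: sqlite3.Connection, failures) -> None:
    """Insert failures using the caller's existing connection rather than
    opening a second connection to the same SQLite file."""
    with conn:  # commits on success, rolls back on exception
        conn.executemany(
            "INSERT INTO test_results (test_name, error_message) VALUES (?, ?)",
            failures,
        )
```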
```diff
-conn = sqlite3.connect(args.db)
-insert_into_db(args.db, all_failures)
+insert_into_db(args.db, all_failures)
+conn = sqlite3.connect(args.db)
```
Pipelines where the build failed but the Test Results API returned 0 test failures are now shown as INCONCLUSIVE (⚠️) instead of SKIPPED. This fixes the PASS+FAIL != monitored math (7+4 != 21) by adding INCONCLUSIVE as a third category (7+4+10 = 21). - ⚠️ INCONCLUSIVE: build failed, no test failures detected - ⏭️ SKIPPED: private pipeline (intentional opt-out) - Update report template to match Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Introduce explicit 'inconclusive' result for builds that failed but had 0 test failures from the Test Results API. Previously these were re-marked as 'skipped', conflating them with private pipelines. - setup_and_fetch_builds.py: store all non-succeeded builds as 'failed' (including canceled). Only pipelines never fetched are 'skipped'. - extract_failed_tests.py: refine to 'inconclusive' when 0 test failures found (instead of re-marking as 'skipped'). - generate_report.py: query result directly — no skip_reason string matching. Show ⚠️ for inconclusive, ⏭️ for skipped. - validate_results.py: update queries for new result values. - SKILL.md: update schema docs. Report now shows: 7 PASS, 4 FAIL, 10 INCONCLUSIVE (of 21 monitored) with P+F+I = M. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tems fetch_helix_logs.py: - Timeout (Command timed out): set exit_code = NULL (process killed, no meaningful exit code) - Coreclr multi-test (App Exit Code 100 + Command exited with 0): set exit_code = -1 (work item passed but has individual test failures — agent should classify from error messages, not exit code) - All other cases: keep first-non-zero logic validate_results.py: - Remove shared_paths skip — validate once per unique log file path - Skip rows with NULL or -1 exit codes (no meaningful value to check) - Mirror fetch_helix_logs.py logic for timeout and coreclr detection Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
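The exit-code special cases in this commit can be sketched as below; the marker strings ('Command timed out', 'App Exit Code 100', 'Command exited with 0') and the regex are assumptions about the actual Helix console-log text:

```python
import re

def classify_exit_code(log: str):
    """Apply the special-case rules first, then the first-non-zero rule."""
    if "Command timed out" in log:
        return None  # process was killed; no meaningful exit code
    if "App Exit Code 100" in log and "Command exited with 0" in log:
        return -1    # coreclr multi-test work item: classify from error messages
    codes = [int(m.group(1))
             for m in re.finditer(r"exit code[:\s]+(-?\d+)", log, re.IGNORECASE)]
    for code in codes:
        if code != 0:
            return code  # first non-zero wins
    return 0 if codes else None
```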
Pass the existing connection into insert_into_db() instead of opening a second connection to the same SQLite DB. Avoids unnecessary dual connections and potential locking issues. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace MCP-style get_builds() calls with actual script invocation (setup_and_fetch_builds.py) - Remove 'Fallback — Build Log Parsing' section that referenced banned get_build_log/get_build_log_by_id tools Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add Step 0 Prerequisites (references/prerequisites.md) with Python/ADO_TOKEN setup - Add temp/ to .gitignore; prohibit ad-hoc scripts in scripts/ - Allow up to 3 validation retries while failure count decreases - Clarify WARN message: 'Validation error persists after retry' - Redesign Action Items: Issues to File, High Impact Failures, Needs Review - Sync log-template.md with new retry policy and token flow - Fix stale --validation-warnings help text in generate_report.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
```sql
CREATE TABLE IF NOT EXISTS failures (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    scope TEXT,
    test_name TEXT NOT NULL,
    work_item TEXT,
    failure_category TEXT,
    exit_codes TEXT,
    failing_since_date TEXT,
    failing_since_build TEXT,
    console_log_url TEXT,
    source_test_result_id INTEGER,
    error_message TEXT,
    stack_trace TEXT,
    summary TEXT,
    analysis TEXT,
    github_issue_number INTEGER,
    github_issue_url TEXT,
    github_issue_state TEXT,
    github_issue_assigned TEXT,
    labels TEXT,
    milestone TEXT DEFAULT '11.0.0',
    FOREIGN KEY (source_test_result_id) REFERENCES test_results(id)
);
```
SCHEMA creates failures with a FOREIGN KEY referencing test_results(id) before test_results is created. In SQLite this can cause executescript() to fail during DB setup. Create test_results before failures, or drop this FK constraint (and rely on validator) / add it via a follow-up ALTER TABLE pattern.
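One safe ordering, sketched with trimmed column lists; note that SQLite only enforces the constraint when `PRAGMA foreign_keys` is ON, which is off by default:

```python
import sqlite3

# Trimmed sketch: create the referenced table (test_results) before the
# table that holds the foreign key (failures).
SCHEMA = """
CREATE TABLE IF NOT EXISTS test_results (
    id INTEGER PRIMARY KEY,
    test_name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS failures (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    source_test_result_id INTEGER,
    FOREIGN KEY (source_test_result_id) REFERENCES test_results(id)
);
"""

def setup_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FKs off by default
    conn.executescript(SCHEMA)
    return conn
```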
```
are verified against GitHub Search API (via `gh api`) using the full
test name to confirm no matching issue exists, every line in
error_message and stack_trace appears as a complete line in the console
log (catches mid-line truncation).
```
This doc says NEW failures are verified against GitHub Search API “via gh api”, but validate_results.py actually uses direct urllib requests to https://api.github.com/search/issues (and does not require gh). Please update the documentation to reflect the actual implementation to avoid confusing operators.
```diff
-are verified against GitHub Search API (via `gh api`) using the full
-test name to confirm no matching issue exists, every line in
-error_message and stack_trace appears as a complete line in the console
-log (catches mid-line truncation).
+are verified against the GitHub Search API using the full test name to
+confirm no matching issue exists, every line in error_message and
+stack_trace appears as a complete line in the console log (catches
+mid-line truncation).
```
```python
    "SELECT name FROM pipelines WHERE result = 'skipped' ORDER BY name"
):
    out.append(f" ⏭️ {p['name']}: SKIPPED (private)")

out.append("")
out.append("Notes:")
out.append("- ✅ = all tests passed")
out.append("- ❌ = one or more test failures")
out.append("- ⚠️ = inconclusive (build failed but no test failures detected via Test Results API)")
out.append("- ⏭️ = skipped (private pipeline)")
```
Skipped pipelines are always rendered as SKIPPED (private), but setup_and_fetch_builds.py uses skip_reason for several non-private cases (e.g., skip, missing_def_id, no builds found). Consider selecting and printing skip_reason here so the report correctly explains why each pipeline was skipped.
```diff
-        "SELECT name FROM pipelines WHERE result = 'skipped' ORDER BY name"
-    ):
-        out.append(f" ⏭️ {p['name']}: SKIPPED (private)")
-    out.append("")
-    out.append("Notes:")
-    out.append("- ✅ = all tests passed")
-    out.append("- ❌ = one or more test failures")
-    out.append("- ⚠️ = inconclusive (build failed but no test failures detected via Test Results API)")
-    out.append("- ⏭️ = skipped (private pipeline)")
+        "SELECT name, skip_reason FROM pipelines WHERE result = 'skipped' ORDER BY name"
+    ):
+        reason = p["skip_reason"] or "private"
+        out.append(f" ⏭️ {p['name']}: SKIPPED ({reason})")
+    out.append("")
+    out.append("Notes:")
+    out.append("- ✅ = all tests passed")
+    out.append("- ❌ = one or more test failures")
+    out.append("- ⚠️ = inconclusive (build failed but no test failures detected via Test Results API)")
+    out.append("- ⏭️ = skipped (for example: private pipeline, intentional skip, missing definition, or no builds found)")
```
Automates monitoring of CI test pipelines on Azure DevOps, triaging failures, and generating weekly reports with GitHub issue tracking.
It currently monitors only the 20+ JIT stress/PGO pipelines; coverage should expand to other pipelines over time.