feat: add windows-log-analysis Copilot skill (LLM sub-skills) #8214

Open
timmy-wright wants to merge 8 commits into main from timmy/windows-log-analysis-skill
Conversation

timmy-wright (Contributor) commented Apr 1, 2026

feat: add windows-log-analysis Copilot skill (LLM sub-skills)

Summary

Adds a Copilot CLI skill for diagnosing Windows AKS node issues from log bundles produced by collect-windows-logs.ps1. This skill uses LLM sub-skill markdown files that instruct AI agents how to analyze each log category.

It also adds a save-markdown-to-disk skill, because my agent repeatedly ran into problems writing markdown reports to disk.

Why LLM sub-skills instead of scripts?

  • Resilient to format changes — LLM reads raw files instead of brittle regex/column parsing
  • Discovers novel issues — not limited to hard-coded patterns
  • Parallel dispatch — each sub-skill runs as an independent sub-agent
  • Domain knowledge preserved — HCS error codes, HNS failure modes, CSE error codes, known bugs from GitHub issues all encoded as analyst instructions

Architecture

SKILL.md (orchestrator)
├── common-reference.md (encoding, thresholds, error codes, dispatch guidance)
└── sub-skills/
    ├── analyze-containers.md    # Pod restarts, crash-loops, readiness
    ├── analyze-termination.md   # Stuck Terminating pods, zombie HCS, Defender file locks
    ├── analyze-images.md        # Dangling images, mutable tags, GC failures, snapshot bloat
    ├── analyze-disk.md          # C: drive free space trends
    ├── analyze-hcs.md           # Host Compute Service: lifecycle tracking, error codes, vmcompute health
    ├── analyze-hns.md           # Host Network Service: endpoints, LBs, CNI, DNS, WFP/VFP
    ├── analyze-kubeproxy.md     # kube-proxy: HNS policy sync, DSR, port range conflicts, SNAT
    ├── analyze-kubelet.md       # Node conditions, lease renewal, clock skew, cert rotation
    ├── analyze-memory.md        # Physical RAM, pagefile, OOM, process memory
    ├── analyze-crashes.md       # WER reports, minidumps, BSODs, unexpected reboots
    ├── analyze-csi.md           # CSI proxy, SMB/Azure Files mounts, Azure Disk
    ├── analyze-gmsa.md          # gMSA/CCG authentication, Kerberos, credential specs
    ├── analyze-gpu.md           # nvidia-smi, DirectX device plugin, Xid errors
    ├── analyze-bootstrap.md     # CSE flow, WINDOWS_CSE_ERROR codes (0-83), bootstrap config
    ├── analyze-extensions.md    # Azure VM extension execution errors
    └── analyze-services.md      # Windows service health, node versions, OS info

What the skill detects

| Sub-Skill | Key Detections |
|-----------|----------------|
| containers | Crash-looping containers (≥10 restarts), pods not Ready |
| termination | Zombie HCS containers, orphaned shims, containerd reinstall without drain, Defender file lock interference |
| images | Dangling images from mutable tags, containerd GC failure (k8s#116020), snapshot accumulation |
| disk | C: drive free space with cross-snapshot trend analysis |
| hcs | vmcompute memory/handle leaks, HCS operation duration degradation, creation storms, 30+ error codes |
| hns | Endpoint leaks (IP exhaustion), LB count drops after HNS reset, stale LB rules, WFP filter accumulation |
| kubeproxy | DSR degraded policies, excluded port range conflicts with NodePort, stale LB rules, SNAT exhaustion |
| kubelet | NotReady/DiskPressure/MemoryPressure, lease renewal failures, clock skew, certificate rotation |
| memory | Physical RAM exhaustion, pagefile misconfiguration, per-process working set analysis |
| crashes | Application crashes (kubelet, containerd, shim), BSODs, WER correlation with service events |
| csi | CSI proxy crashes, stale SMB global mappings, credential rotation failures, named pipe version mismatches |
| gmsa | CCG plugin errors, Kerberos ticket failures, domain controller connectivity, credential spec validation |
| gpu | nvidia-smi parsing, Xid error classification, ECC memory errors, DirectX device plugin scheduling |
| bootstrap | CSE execution timeline, 83 WINDOWS_CSE_ERROR codes, bootstrap config validation, service startup ordering |
| extensions | VM extension exit codes with curl progress false-positive filtering |
| services | 12 critical AKS service health checks, service PID cross-reference, start type validation |
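
As an illustration of the cross-snapshot trend analysis mentioned for `disk`, here is a minimal Python sketch. It is not code from this PR (the skill itself is markdown instructions for an agent). The `Snapshot` type, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Snapshot:
    taken_at: datetime  # hypothetical: parsed from a snapshot's <ts> prefix
    free_bytes: int     # hypothetical: C: drive free space at that time


def free_space_trend(snapshots: list[Snapshot]) -> float:
    """Average change in free bytes per hour between the first and last snapshot."""
    ordered = sorted(snapshots, key=lambda s: s.taken_at)
    if len(ordered) < 2:
        raise ValueError("need at least two snapshots to compute a trend")
    hours = (ordered[-1].taken_at - ordered[0].taken_at).total_seconds() / 3600
    if hours <= 0:
        raise ValueError("snapshots must span a positive time interval")
    return (ordered[-1].free_bytes - ordered[0].free_bytes) / hours


GIB = 1024 ** 3
samples = [
    Snapshot(datetime(2026, 4, 1, 0, 0), 40 * GIB),
    Snapshot(datetime(2026, 4, 1, 6, 0), 34 * GIB),
    Snapshot(datetime(2026, 4, 1, 12, 0), 28 * GIB),
]
rate = free_space_trend(samples)  # negative means free space is shrinking
```

A negative rate across several snapshots is the kind of signal that would point toward the image and log-rotation checks, whereas a single snapshot only supports a point-in-time threshold check.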

Orchestrator features

  • Symptom-based dispatch: common-reference.md includes a dispatch table so agents pick the right 3-5 sub-skills instead of running all 16
  • Synthesis decision tree: SKILL.md provides a full decision tree for combining findings across sub-skills
  • 17 root cause chains: maps symptoms → checks → root causes (e.g., disk pressure → images → mutable tags)
  • Timeline correlation: instructions for building cross-sub-skill event timelines from anchor events
  • Consistent structure: all sub-skills use identical sections: Purpose, Input Files, Analysis Steps, Findings Format, Known Patterns, Cross-References
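
The timeline-correlation idea can be sketched roughly as follows. This is an illustrative assumption about what "building a timeline from an anchor event" could look like, not the procedure written in SKILL.md. The `Event` type, the five-minute window, and the sample events are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    when: datetime
    source: str   # hypothetical: the sub-skill that reported the event
    message: str


def correlate(events, anchor, window=timedelta(minutes=5)):
    """Return the events within `window` of the anchor event, oldest first."""
    nearby = [e for e in events if abs(e.when - anchor.when) <= window]
    return sorted(nearby, key=lambda e: e.when)


anchor = Event(datetime(2026, 4, 1, 10, 0), "analyze-kubelet", "node NotReady")
events = [
    anchor,
    Event(datetime(2026, 4, 1, 9, 58), "analyze-hns", "HNS reset"),
    Event(datetime(2026, 4, 1, 8, 0), "analyze-disk", "free space low"),
]
timeline = correlate(events, anchor)
```

Here the HNS event two minutes before the anchor lands in the timeline, while the disk event two hours earlier is excluded. A real analysis would likely need a wider or symptom-specific window.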

Key research that informed the sub-skills

Files changed

  • .github/skills/windows-log-analysis/SKILL.md — orchestrator with decision tree and root cause chains
  • .github/skills/windows-log-analysis/sub-skills/*.md — 16 sub-skills + common reference (3,337 lines total)
  • .github/skills/windows-log-analysis/.gitignore

Copilot AI left a comment

Pull request overview

Adds a new GitHub Copilot skill under .github/skills/windows-log-analysis/ to help diagnose Windows AKS node issues from log bundles produced by staging/cse/windows/debug/collect-windows-logs.ps1, including an accompanying Python analyzer script.

Changes:

  • Introduces SKILL.md with a log-bundle reference guide and troubleshooting playbooks.
  • Adds analyze-windows-logs.py to scan multi-snapshot bundles, trend key metrics, and emit prioritized findings.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.

| File | Description |
|------|-------------|
| `.github/skills/windows-log-analysis/SKILL.md` | Skill definition and reference guide for interpreting collected Windows node logs |
| `.github/skills/windows-log-analysis/analyze-windows-logs.py` | Python 3 analyzer for automated triage of collected Windows log bundles |

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

timmy-wright changed the title from "Add windows-log-analysis Copilot skill" to "feat: add windows-log-analysis Copilot skill" on Apr 1, 2026
Copilot AI review requested due to automatic review settings April 1, 2026 01:04
Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.

Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.

Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.

timmy-wright force-pushed the timmy/windows-log-analysis-skill branch from b63d0ec to b6818e8 on April 2, 2026 09:16
Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.

| `kubelet.log` | UTF-8 | Kubelet stdout logs (if present) |
| `kubelet.err.log` | UTF-8 | Kubelet stderr logs (if present) |
| `<ts>-cri-containerd-pods.txt` | UTF-16-LE with BOM | `crictl pods` — cross-reference pod state |
| `*_services.csv` | UTF-8 | Service status timeline used for kubelet crash/restart and clock skew checks |

Copilot AI Apr 2, 2026

*_services.csv is exported by collect-windows-logs.ps1 via Export-Csv without -Encoding, which defaults to UTF-16LE on Windows PowerShell. Marking it as UTF-8 here will cause parsers to mis-decode the file; update the encoding (and ideally the pattern to <ts>_services.csv for consistency with other entries).

Suggested change
Before:
| `*_services.csv` | UTF-8 | Service status timeline used for kubelet crash/restart and clock skew checks |
After:
| `<ts>_services.csv` | UTF-16-LE with BOM | Service status timeline used for kubelet crash/restart and clock skew checks |
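
To make the mis-decoding risk concrete, here is a minimal Python sketch of reading such a CSV without hard-coding the encoding. It is an illustration only, not part of the PR. The function name, the column names in the sample, and the handling of a leading `#TYPE` line (which Windows PowerShell 5.1 `Export-Csv` writes unless `-NoTypeInformation` is passed) are assumptions about what a real bundle might contain.

```python
import csv
import io


def read_services_csv(raw: bytes) -> list[dict[str, str]]:
    """Decode a services CSV by its BOM and parse it into row dictionaries."""
    # "utf-16" consumes a leading BOM; "utf-8-sig" does the same for UTF-8.
    if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
        text = raw.decode("utf-16")
    else:
        text = raw.decode("utf-8-sig")
    # Assumption: skip a "#TYPE ..." header line if the exporter wrote one.
    if text.startswith("#TYPE"):
        text = text.split("\n", 1)[1] if "\n" in text else ""
    # newline="" lets the csv module handle newlines inside quoted fields.
    return list(csv.DictReader(io.StringIO(text, newline="")))


# Hypothetical sample: one record whose Message field spans two lines.
sample = (
    '#TYPE Demo\r\n'
    '"TimeCreated","Message"\r\n'
    '"2026-04-01 01:00:00","line one\r\nline two"\r\n'
)
rows = read_services_csv(sample.encode("utf-16"))
```

Sniffing the BOM rather than trusting the documented encoding means the same reader works whichever encoding the collector actually produces.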

Comment on lines +337 to +340
| `bootstrap-config` | analyze-bootstrap |
| `*-ccg-*.evtx` or CCG event logs | analyze-gmsa |
| `gmsa-*.log` or gMSA credential spec files | analyze-gmsa |
| `kubectl-describe-nodes.log` | analyze-gpu, analyze-kubelet |

Copilot AI Apr 2, 2026

File-dispatch mapping seems incomplete: kubectl-describe-nodes.log is consumed by analyze-kubelet.md (node conditions/taints/events) as well as GPU analysis, but the table only routes it to analyze-gpu. Add analyze-kubelet here to avoid skipping kubelet triage when this file is present.
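
For illustration, the file-dispatch table can be treated as a glob-to-sub-skill mapping. The Python sketch below uses only rows quoted in this review thread, writes `<ts>` as a `*` glob, and uses a hypothetical timestamp format in the sample filename. It is not how the skill is implemented; the skill expresses this mapping as markdown guidance for the agent.

```python
from fnmatch import fnmatch

# Rows quoted in this review thread; "<ts>" is written here as a "*" glob.
FILE_DISPATCH = {
    "bootstrap-config": ["analyze-bootstrap"],
    "kubectl-describe-nodes.log": ["analyze-gpu", "analyze-kubelet"],
    "*-hnsdiag-list.txt": ["analyze-hns", "analyze-kubeproxy"],
    "*-containerd-info.txt": ["analyze-hcs"],
    "*-containerd-toml.txt": ["analyze-hcs", "analyze-images"],
}


def sub_skills_for(filenames):
    """Return the sorted set of sub-skills whose input files are present."""
    selected = set()
    for name in filenames:
        for pattern, skills in FILE_DISPATCH.items():
            if fnmatch(name, pattern):
                selected.update(skills)
    return sorted(selected)


# Hypothetical bundle listing; the timestamp prefix format is an assumption.
bundle = ["kubectl-describe-nodes.log", "20260401-010000-hnsdiag-list.txt"]
picked = sub_skills_for(bundle)
```

With this shape, a row that omits a consumer (such as `analyze-kubelet` for `kubectl-describe-nodes.log`) silently drops that sub-skill from the result, which is the failure mode the comment describes.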

timmy-wright and others added 2 commits April 7, 2026 09:57
…e.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.

Comment on lines +24 to +72
**WINDOWS_CSE_ERROR codes** (from AgentBaker `windowscsehelper.ps1`):

| Code | Name | Meaning |
|------|------|---------|
| 0 | SUCCESS | CSE completed successfully |
| 1 | UNKNOWN | Unexpected error in catch block |
| 2 | DOWNLOAD_FILE_WITH_RETRY | File download failed after retries |
| 3 | INVOKE_EXECUTABLE | Executable invocation failed |
| 4 | FILE_NOT_EXIST | Required file missing |
| 5 | CHECK_API_SERVER_CONNECTIVITY | Cannot reach API server |
| 6 | PAUSE_IMAGE_NOT_EXIST | Pause container image missing |
| 7 | GET_SUBNET_PREFIX | Failed to get subnet prefix |
| 8 | GENERATE_TOKEN_FOR_ARM | ARM token generation failed |
| 9 | NETWORK_INTERFACES_NOT_EXIST | No network interfaces found |
| 10 | NETWORK_ADAPTER_NOT_EXIST | Network adapter missing |
| 11 | MANAGEMENT_IP_NOT_EXIST | Management IP not found |
| 12 | CALICO_SERVICE_ACCOUNT_NOT_EXIST | Calico SA missing |
| 13 | CONTAINERD_NOT_INSTALLED | containerd binary not found |
| 14 | CONTAINERD_NOT_RUNNING | containerd service not running |
| 15 | OPENSSH_NOT_INSTALLED | OpenSSH not installed |
| 16 | OPENSSH_FIREWALL_NOT_CONFIGURED | OpenSSH firewall rule missing |
| 17 | INVALID_PARAMETER_IN_AZURE_CONFIG | Bad azure.json parameter |
| 19 | GET_CA_CERTIFICATES | CA cert retrieval failed |
| 20 | DOWNLOAD_CA_CERTIFICATES | CA cert download failed |
| 21 | EMPTY_CA_CERTIFICATES | CA certs empty |
| 22 | ENABLE_SECURE_TLS | Secure TLS enablement failed |
| 23–28 | GMSA_* | gMSA setup failures |
| 29 | NOT_FOUND_MANAGEMENT_IP | Management IP lookup failed |
| 30 | NOT_FOUND_BUILD_NUMBER | Windows build number not found |
| 31 | NOT_FOUND_PROVISIONING_SCRIPTS | Provisioning scripts missing |
| 32 | START_NODE_RESET_SCRIPT_TASK | Node reset task failed to start |
| 33–40 | DOWNLOAD_*_PACKAGE | Package download failures (CSE, K8s, CNI, HNS, Calico, gMSA, CSI proxy, containerd) |
| 41 | SET_TCP_DYNAMIC_PORT_RANGE | TCP port range configuration failed |
| 43 | PULL_PAUSE_IMAGE | Pause image pull failed |
| 45 | CONTAINERD_BINARY_EXIST | containerd binary check failed |
| 46–48 | SET_*_PORT_RANGE | Port range exclusion failures |
| 49 | NO_CUSTOM_DATA_BIN | CustomData.bin missing (very early failure) |
| 50 | NO_CSE_RESULT_LOG | CSE did not produce result log |
| 52 | RESIZE_OS_DRIVE | OS drive resize failed |
| 53–61 | GPU_* | GPU driver installation failures |
| 62 | UPDATING_KUBE_CLUSTER_CONFIG | Kube cluster config update failed |
| 64 | GET_CONTAINERD_VERSION | containerd version detection failed |
| 65–67 | CREDENTIAL_PROVIDER_* | Credential provider install/config failures |
| 68 | ADJUST_PAGEFILE_SIZE | Pagefile resize failed |
| 70–71 | SECURE_TLS_BOOTSTRAP_* | Secure TLS bootstrap client failures |
| 72 | CILIUM_NETWORKING_INSTALL_FAILED | Cilium install failed |
| 73 | EXTRACT_ZIP | Zip extraction failed |
| 74–75 | LOAD/PARSE_METADATA | Metadata failures |
| 76–83 | ORAS_* | Network-isolated cluster artifact pull failures |

Copilot AI Apr 6, 2026

This section says the table is sourced from windowscsehelper.ps1 and later references a “full code table”, but the table omits several defined codes (e.g., 18, 42, 44, 51, 63, 69). To avoid misdiagnosis, either (a) include the missing codes/ranges, or (b) label this as a partial list of common codes and link readers to parts/windows/windowscsehelper.ps1 for the authoritative set.
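
A small Python sketch shows why the gaps matter for any lookup built on this table. It copies a handful of codes and ranges from the table above and reports codes outside the table explicitly instead of guessing. The function name and the wording of the fallback message are hypothetical.

```python
# Single codes copied from the table above (a subset, for brevity).
CSE_CODES = {
    0: "SUCCESS",
    1: "UNKNOWN",
    5: "CHECK_API_SERVER_CONNECTIVITY",
    13: "CONTAINERD_NOT_INSTALLED",
    14: "CONTAINERD_NOT_RUNNING",
    49: "NO_CUSTOM_DATA_BIN",
    50: "NO_CSE_RESULT_LOG",
}

# Ranges copied from the table above; range() excludes its upper bound.
CSE_RANGES = [
    (range(23, 29), "GMSA_*"),
    (range(53, 62), "GPU_*"),
    (range(76, 84), "ORAS_*"),
]


def describe_cse_code(code: int) -> str:
    """Name a WINDOWS_CSE_ERROR code, or say plainly that it is not listed."""
    if code in CSE_CODES:
        return CSE_CODES[code]
    for span, name in CSE_RANGES:
        if code in span:
            return name
    return f"code {code} is not in this partial list; check windowscsehelper.ps1"
```

For example, `describe_cse_code(18)` falls through to the fallback message, so an unlisted code is surfaced as unknown rather than being mapped to a neighbouring entry.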

timmy-wright and others added 4 commits April 9, 2026 12:09
- Fix HCS error code 0xC0370103/0x8037011F mapping: separate into two
  rows with clarifying note on HRESULT/NTSTATUS pairing uncertainty
- Add HNS Error Codes section to common-reference.md (0x1392, 0x490,
  0x57, 0x5) with note that no official HNS error reference exists
- Add single vs. multi-snapshot guidance to common-reference.md
- Add wcifs.sys kernel file handle leak pattern to analyze-hcs.md
- Document HCSSHIM_TIMEOUT_* env vars in analyze-hcs.md
- Add deployment context caveat to HCS container churn threshold
- Add prominent Windows-specific note to analyze-kubelet.md: kubelet
  eviction is NOT implemented on Windows; DiskPressure/MemoryPressure
  won't auto-evict pods
- Add containerfs.inodesFree log spam as known noise (k8s#130142)
- Add note: no confirmed kubelet auto-restart watchdog on Windows
- Add Windows CRI named pipe path to analyze-kubelet.md
- Add container log rotation broken on Windows (containerd#7075) to
  analyze-containers.md and analyze-disk.md
- Add CNI logs in System32 warning to analyze-hns.md (containerd#4928)
- Note no official HNS error code reference in analyze-hns.md
- Add explicit note to analyze-gmsa.md: kubelet logs will NOT contain
  gMSA/Kerberos errors, only CCG evtx logs will
- Add explicit 3-skill quick triage path to SKILL.md
- Reference save-markdown-to-disk skill in SKILL.md for report output

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ons.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…e.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.

Comment on lines +9 to +13
| File Pattern | Encoding | Contents |
|-------------|----------|----------|
| `kubectl-describe-nodes.log` | UTF-8 | `kubectl describe node` output |
| `<ts>-aks-info.log` | UTF-16-LE with BOM | `kubectl describe node` + node YAML (allocatable, capacity, conditions) |
| `kubelet.log` | UTF-8 | Kubelet stdout logs (if present) |

Copilot AI Apr 9, 2026

kubectl-describe-nodes.log is marked as UTF-8, but in collect-windows-logs.ps1 it is produced via PowerShell redirection (kubectl ... > file), which writes UTF-16LE (“Unicode”) by default on Windows PowerShell 5.1. Please update the expected encoding (or note that it may be UTF-16LE) so the analysis doesn’t mis-decode the file.
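
One way to avoid depending on the documented encoding at all is to pick the decoder from the file's byte-order mark. The Python sketch below is illustrative rather than part of the skill, and `decode_log` is a hypothetical helper name. It assumes UTF-16 files carry a BOM, as the tables in this PR describe, and falls back to UTF-8 otherwise.

```python
import codecs


def decode_log(raw: bytes) -> str:
    """Decode log bytes using the BOM if present, else fall back to UTF-8."""
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        # The "utf-16" codec reads the BOM to choose endianness and strips it.
        return raw.decode("utf-16")
    # "utf-8-sig" strips a UTF-8 BOM when there is one and is plain UTF-8 otherwise.
    return raw.decode("utf-8-sig", errors="replace")


line = "Name: node1\n"
from_utf16 = decode_log(line.encode("utf-16"))
from_utf8 = decode_log(line.encode("utf-8"))
```

Both calls return the same text, so an "UTF-8 or UTF-16-LE with BOM" entry in the input tables can be handled by a single code path.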

Comment on lines +331 to +334
| `<ts>-hnsdiag-list.txt` | analyze-hns, analyze-kubeproxy |
| `<ts>-aks-info.log` | analyze-bootstrap, analyze-memory, analyze-gpu |
| `<ts>-containerd-info.txt` | analyze-hcs |
| `<ts>-containerd-toml.txt` | analyze-hcs, analyze-images |

Copilot AI Apr 9, 2026

This skill references <ts>-aks-info.log, but collect-windows-logs.ps1 does not generate any *-aks-info.log file (it generates kubectl-describe-nodes.log / kubectl-get-nodes.log instead). Either update the collector to emit this file, or update the skill docs/file discovery table to use the actual bundle filenames.

| `<ts>_services.csv` | UTF-16-LE with BOM, CSV with embedded newlines | Service Control Manager event log |
| `silconfig.log` | UTF-16-LE with BOM or UTF-8 | Software Inventory Logging configuration |
| `processes.txt` | UTF-16-LE with BOM | Running processes with PIDs |
| `kubectl-get-nodes.log` | UTF-8 | `kubectl get nodes -o wide` output |

Copilot AI Apr 9, 2026

kubectl-get-nodes.log is marked as UTF-8, but it is produced by collect-windows-logs.ps1 via PowerShell redirection (kubectl ... > file), which writes UTF-16LE by default on Windows PowerShell 5.1. Please update the encoding guidance (or mention it may be UTF-16LE).

Suggested change
Before:
| `kubectl-get-nodes.log` | UTF-8 | `kubectl get nodes -o wide` output |
After:
| `kubectl-get-nodes.log` | UTF-16-LE with BOM or UTF-8 | `kubectl get nodes -o wide` output; often UTF-16-LE when collected via Windows PowerShell 5.1 redirection (`>`) |

Comment on lines +11 to +16
| `available-memory.txt` | UTF-16-LE with BOM | Available physical RAM at collection time |
| `processes.txt` | UTF-16-LE with BOM | `Get-Process` snapshot — per-process memory usage |
| `<ts>_pagefile.txt` | UTF-16-LE with BOM | Pagefile configuration and usage (size, auto-managed, peak) |
| `<ts>_services.csv` | UTF-16-LE with BOM, CSV with embedded newlines | Event ID 2004 = low memory condition |
| `<ts>-aks-info.log` | UTF-16-LE with BOM | Node YAML with allocatable memory |

Copilot AI Apr 9, 2026

This references <ts>-aks-info.log, but the log bundle produced by collect-windows-logs.ps1 doesn’t include such a file. Consider switching this input to kubectl-describe-nodes.log (which the collector does generate) or documenting how <ts>-aks-info.log is expected to be produced.

|-------------|----------|----------|
| `windowsnodereset.log` | UTF-8 or UTF-16-LE with BOM | Node reset/reimage flow log — full provisioning timeline |
| `bootstrap-config` | UTF-8 or UTF-16-LE with BOM | Bootstrap parameters passed to CSE (JSON or key-value) |
| `<ts>-aks-info.log` | UTF-16-LE with BOM | `kubectl describe node` + node YAML, component versions |

Copilot AI Apr 9, 2026

This references <ts>-aks-info.log, but the Windows bundle collector in this repo (collect-windows-logs.ps1) does not generate it. Either add it to the collector, or update this skill to rely on the existing kubectl-describe-nodes.log/kubectl-get-nodes.log outputs to extract versions and node YAML.

Suggested change
Before:
| `<ts>-aks-info.log` | UTF-16-LE with BOM | `kubectl describe node` + node YAML, component versions |
After:
| `kubectl-describe-nodes.log` | UTF-8 or UTF-16-LE with BOM | `kubectl describe node` output for node conditions, taints, addresses, and event history |
| `kubectl-get-nodes.log` | UTF-8 or UTF-16-LE with BOM | `kubectl get nodes` output (including wide/YAML forms when present) for Kubernetes versions and node object details |

| File Pattern | Encoding | Contents |
|-------------|----------|----------|
| `*-nvidia-smi.txt` or `*nvidia-smi*` | UTF-8 or UTF-16-LE with BOM | `nvidia-smi` output — GPU inventory, utilization, temperature, errors |
| `kubectl-describe-nodes.log` | UTF-8 | `kubectl describe node` — resource capacity/allocatable including GPU |

Copilot AI Apr 9, 2026

kubectl-describe-nodes.log is marked as UTF-8, but it’s produced via PowerShell redirection in collect-windows-logs.ps1 (Windows PowerShell 5.1), which writes UTF-16LE by default. Update the encoding expectations so GPU analysis can actually parse the file in real bundles.

Suggested change
Before:
| `kubectl-describe-nodes.log` | UTF-8 | `kubectl describe node` — resource capacity/allocatable including GPU |
After:
| `kubectl-describe-nodes.log` | UTF-16-LE with BOM | `kubectl describe node` — resource capacity/allocatable including GPU |


Labels

ignore-for-release (This pull request will not be included within official release notes)
windows

2 participants