From 758fa7ba5732d824181631f9deaca97a14c9ebd7 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Gustavo=20Andr=C3=A9=20dos=20Santos=20Lopes?= Date: Thu, 4 Jun 2026 11:26:34 +0200 Subject: [PATCH 1/2] Move check ci procedure to new /check-ci skill --- .claude/ci/index.md | 134 ++++--------------------- .claude/skills/check-ci/SKILL.md | 164 +++++++++++++++++++++++++++++++ 2 files changed, 183 insertions(+), 115 deletions(-) create mode 100644 .claude/skills/check-ci/SKILL.md diff --git a/.claude/ci/index.md b/.claude/ci/index.md index 55e3160823c..1a9a9c6b482 100644 --- a/.claude/ci/index.md +++ b/.claude/ci/index.md @@ -181,121 +181,25 @@ curl -s -H "PRIVATE-TOKEN: $GITLAB_PERSONAL_ACCESS_TOKEN" \ ### Checking CI (Gitlab) -Use `.claude/ci/check-ci` to follow a pipeline until all jobs complete. -When invoked with `--commit` (or defaulting to HEAD), it monitors both the GitLab -pipeline **and** any GitHub Actions workflows for the same commit. GitHub monitoring -requires `ddtool auth github login --org DataDog` to be configured; if the token is -unavailable, a warning is printed and only GitLab is monitored. - -Results are written to `/tmp/gitlab_/`: -- `success.txt` — `\t` per line -- `failure.txt` — same format for failed jobs (GitLab and GitHub; GitHub entries are prefixed `[GH]`) -- `fail_logs/.log` — full job trace for each GitLab failure -- `gh_fail_logs/gh_.log` — log for each GitHub Actions job failure - -Exit codes: 0 = all passed, 1 = failures or threshold reached. - -#### Invocation pattern - -Available options: `--commit ` OR `--pipeline ` (GitLab only, skips GitHub), -`--discovery-timeout ` (default 60), `--poll-interval ` (default 60), -`--max-failures ` (default 50), `--timeout ` (default 7200 = 2 h), -`--list-jobs` (see below). - -##### `--list-jobs` - -Prints all jobs grouped by pipeline with their status, then exits -immediately — does not monitor or download logs. Shows both GitLab pipelines -and GitHub Actions workflow runs. 
Useful for a quick snapshot of what ran and -what failed: - -```bash -.claude/ci/check-ci --commit HEAD --list-jobs -``` - -Output format: - -``` -Pipeline 105413994 (status: failed): - failed test_extension_ci: [7.2] - success compile extension: debug [8.3] - ... - -GitHub Actions run 12345678 'Profiling ASAN Tests' (status: completed, conclusion: failure): - failure prof-asan (8.5, ubuntu-8-core-latest) - success prof-asan (8.3, ubuntu-8-core-latest) - ... -``` - -#### Monitor CI - -If --list-jobs is not passed, check-ci will run until all monitored pipelines -finish, until a timeout, or until the maximum number of failures is reached. - -**Step 1 — Start check-ci in the background (Bash tool, -`run_in_background: true`):** - -```bash -PYTHONUNBUFFERED=1 .claude/ci/check-ci [OPTIONS] -``` - -Do NOT add `&` or `mktemp` — run the command directly and let -`run_in_background: true` handle backgrounding. `PYTHONUNBUFFERED=1` -is required so Python flushes stdout into the task output file. -The Bash tool returns immediately with a line like: -``` -Output is being written to: /path/to/tasks/.output -``` -Note that path — it is the output file for the next step. - -**Step 2 — Run ci-watch in the background (Bash tool, -`run_in_background: true`):** - -```bash -.claude/ci/ci-watch [--start-offset N] OUTPUT_FILE -``` - -**`OUTPUT_FILE` must be the output file from a `check-ci` process** — not an -arbitrary background task. `ci-watch` parses `check-ci`'s structured -`FAILED:` / `SUCCESS:` lines and exits silently on anything else. - -`ci-watch` tails the output file and exits when there is something to -act on. Run it with `run_in_background: true` — you will be notified -when it completes. While it runs, you can do other work. - -Exit codes: -- 0 — all pipelines completed (no failures) -- 1 — one or more FAILED: lines detected -- 2 — stale: no new output for 5 minutes -- 3 — check-ci timed out - -On exit, ci-watch always prints `RESUME_OFFSET: `. 
Record this -value — pass it as `--start-offset N` when re-running ci-watch to -skip already-processed content and wait for further failures. - -When ci-watch completes, immediately call the `speak_when_done` MCP tool: -- "All CI jobs passed" if exit 0. -- " CI jobs failed" if exit 1 (count is - `grep "^FAILED:" OUTPUT_FILE | wc -l`). -- "CI monitor timed out" if exit 2 or 3. - -**Step 3 — Act on the result** - -Choose mong these actions, as appropriate: - -- **Just report:** summarise the result to the user and stop. -- **Investigate failures:** read `fail_logs/.log` under the - output directory for each failed job and diagnose the root cause. -- **Wait for more failures:** if check-ci is still running and you want - to keep watching after investigating, re-run ci-watch with - `--start-offset ` (back to Step 2). -- **Kill check-ci:** if you want to stop monitoring entirely, kill it - by its task ID or PID (noted from Step 1). -- **Push fixes**: if a) the user asked you to (NOT OTHERWISE), AND b) - you have made changes to fix the CI failures AND c) the current - branch has an upstream branch, then commit and push. Then go back to - Step 1. If any of the three preconditions don't match, stop and - report the results (and your findings, if any). +Use the `/check-ci` skill — it encapsulates the full procedure: starting +`check-ci` and `ci-watch` in the background, speaking the result, and +investigating failures. See +[`.claude/skills/check-ci/SKILL.md`](../skills/check-ci/SKILL.md). + +Quick reference for the underlying tools: + +- `check-ci` options: `--commit ` OR `--pipeline ` (GitLab only, + skips GitHub), `--discovery-timeout ` (default 60), + `--poll-interval ` (default 60), `--max-failures ` (default 50), + `--timeout ` (default 7200 = 2 h), `--list-jobs`. +- When `--commit` is used, both GitLab and GitHub Actions are monitored. 
+ GitHub monitoring requires `ddtool auth github login --org DataDog`; if + unavailable, a warning is printed and only GitLab is monitored. +- Results land in `/tmp/gitlab_/`: `success.txt`, + `failure.txt` (GitHub entries prefixed `[GH]`), + `fail_logs/.log`, `gh_fail_logs/gh_.log`. +- `--list-jobs` prints a grouped job table (GitLab + GitHub Actions) and + exits immediately — useful for a quick snapshot without monitoring. ### Downloading artifacts diff --git a/.claude/skills/check-ci/SKILL.md b/.claude/skills/check-ci/SKILL.md new file mode 100644 index 00000000000..66389b92b5d --- /dev/null +++ b/.claude/skills/check-ci/SKILL.md @@ -0,0 +1,164 @@ +--- +name: check-ci +description: >- + Monitor GitLab CI and GitHub Actions for this repo: start check-ci, tail + results with ci-watch, investigate failures, and report. Use when the user + asks to check, watch, or monitor CI, or to see whether a pipeline passed. +argument-hint: "[--commit | --pipeline ] [--list-jobs]" +allowed-tools: Bash Read Grep Glob Agent TaskCreate TaskUpdate TaskStop mcp__speak_when_done__speak +effort: high +--- + +# Check CI + +Monitor GitLab CI and GitHub Actions until all jobs finish, then investigate +any failures and report results. + +When `--commit` is used (or defaulting to HEAD), both GitLab pipelines and +GitHub Actions workflow runs are monitored. GitHub monitoring requires +`ddtool auth github login --org DataDog`; if unavailable, a warning is +printed and only GitLab is monitored. `--pipeline ` is GitLab-only and +skips GitHub. + +## Input + +`$ARGUMENTS` may contain any combination of: +- `--commit ` — git ref to resolve (default: HEAD); monitors GitLab + + GitHub +- `--pipeline ` — specific GitLab pipeline ID (skips GitHub monitoring) +- `--list-jobs` — quick snapshot mode (no monitoring) + +If no `--commit` or `--pipeline` is given, default to `--commit HEAD`. 
+ +## Quick mode — `--list-jobs` + +Run synchronously and exit immediately: + +```bash +.claude/ci/check-ci --commit --list-jobs +``` + +Prints all jobs grouped by pipeline (GitLab) and workflow run (GitHub +Actions) with their status. Print the table to the user and stop. Do not +continue to the monitoring steps. + +## Full monitoring mode + +### Step 1 — Start check-ci in the background + +```bash +PYTHONUNBUFFERED=1 .claude/ci/check-ci [OPTIONS] +``` + +- Use `run_in_background: true` in Bash tool invocation. Do NOT append `&` or + redirect output. +- The Bash tool returns immediately with an output file path like + `/path/to/tasks/.output` ("Output is being written to ..." in the tool + invocation output). Note this path — it is required in Step 2. This file path + will be referred to as `OUTPUT_FILE` henceforth. +- Default options if the user provided none: `--commit HEAD`. +- You may also pass `--max-failures 50` (default) and + `--timeout 7200` (default, 2 h). + +### Step 2 — Start ci-watch in the background + +```bash +.claude/ci/ci-watch [--start-offset N] OUTPUT_FILE +``` + +- `OUTPUT_FILE` must be the output file from the check-ci task above. +- Use `run_in_background: true`. +- You are notified when ci-watch exits through a task notification. While it + runs, you may do other work. +- On exit, ci-watch always prints `RESUME_OFFSET: `. Record it for re-runs. 
+ +ci-watch exit codes: +| Code | Meaning | +|------|---------| +| 0 | All pipelines completed — no failures | +| 1 | One or more `FAILED:` lines detected | +| 2 | Stale — no new output for 5 minutes | +| 3 | check-ci timed out | + +### Step 3 — Speak and act on the result + +**Immediately after ci-watch exits**, call +`mcp__speak_when_done__speak(message="...")` (the first time, you'll need to +invoke `ToolSearch("select:mcp__speak_when_done__speak")`): +- Exit 0 → "All CI jobs passed" +- Exit 1 → " CI jobs failed" (count with `grep "^FAILED:" OUTPUT_FILE | wc + -l`) +- Exit 2 or 3 → "CI monitor timed out" + +Then choose the appropriate action: + +#### All jobs passed (exit 0) + +Report success to the user and stop. + +#### Failures detected (exit 1) + +1. List the failed jobs: + ```bash + grep "^FAILED:" OUTPUT_FILE + ``` + The output directory is `/tmp/gitlab_/`. Logs are at: + - `fail_logs/.log` — GitLab job traces + - `gh_fail_logs/gh_.log` — GitHub Actions job logs + GitHub entries in `failure.txt` are prefixed `[GH]`. + +2. Read each failure log and diagnose the root cause. Look for: + - Compile errors or linker failures + - Test assertion failures (include the failing test name and diff) + - Infrastructure/flakiness signals (timeout, network, Docker pull failures, + OOM) — mark these as flaky rather than real failures. + + You don't need to go through all of them if it becomes evident that it's + unnecessary. + +3. Report findings grouped by root cause. + +4. **Fix and push only when all three conditions hold:** + a. The user explicitly asked you to fix CI failures. + b. You have made changes to address the failures. + c. The current branch has an upstream remote branch. + If any condition is missing, stop and report instead. + + When all three hold: commit the fix, push, then go back to Step 1 + to re-monitor. + + If possible, before attempting a fix, try to reproduce the failure locally. + Check @.claude/ci/index.md for instructions. 
Then attempt your fix and rerun + to confirm the fix resolves the problem. + +#### Stale or timed out (exit 2 or 3) + +Re-run ci-watch with `--start-offset ` (Step 2) to +resume watching from where you left off. If check-ci itself has also +exited, restart from Step 1. + +#### Keep watching (user wants to continue after investigation) + +Re-run ci-watch with `--start-offset ` (back to Step 2). + +## Downloading artifacts + +Use `tooling/bin/download-artifacts` to fetch build outputs from CI jobs +(e.g., compiled extensions, SSI loader, datadog-setup.php). Useful when +investigating a failure that produced an artifact worth inspecting locally. + +## Rules + +- Never push unless the user explicitly asked for it. See the global + instruction "Do not push to git remotes unless explicitly asked to." +- Flaky jobs (known to be intermittent, unrelated to the current + changes) should be noted but not treated as real failures requiring + a fix. However, to confirm that a test is flaky, you should look for + similar failures in the merge base. +- `GITLAB_PERSONAL_ACCESS_TOKEN` is already set in the environment — + do not re-export it. +- Raw job logs can also be fetched directly. 
For GitLab: + ```bash + curl -s -H "PRIVATE-TOKEN: $GITLAB_PERSONAL_ACCESS_TOKEN" \ + "https://gitlab.ddbuild.io/api/v4/projects/355/jobs//trace" + ``` From e3e161c7df1bd96b994b30d6293f3416e2d11743 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Gustavo=20Andr=C3=A9=20dos=20Santos=20Lopes?= Date: Tue, 9 Jun 2026 11:42:36 +0100 Subject: [PATCH 2/2] /crash-analysis: adapt to 1.8 schema --- .claude/skills/crash-analysis/SKILL.md | 102 ++++++++++++++++++++----- 1 file changed, 85 insertions(+), 17 deletions(-) diff --git a/.claude/skills/crash-analysis/SKILL.md b/.claude/skills/crash-analysis/SKILL.md index 4d59ccdd447..cdb6703c78b 100644 --- a/.claude/skills/crash-analysis/SKILL.md +++ b/.claude/skills/crash-analysis/SKILL.md @@ -13,6 +13,9 @@ effort: max Systematically analyze a dd-trace-php crash report to identify the root cause. +> **Schema note:** The event file you receive is the **backend-enriched** form, +> not the one libdatadog emits. + ## Input The user provides a crash event JSON file (path via `$ARGUMENTS`, or pasted @@ -25,23 +28,58 @@ file. Do all extractions in parallel where possible. 
| Field | Command | |-------|---------| -| Library version | `jq -c .metadata.library_version $EVENT` | +| Library version | `jq -r '.tracer_version // (.library_version \| "\(.major).\(.minor).\(.patch)")' $EVENT` | | Signal | `jq -c .sig_info $EVENT` | -| Native stacktrace | `jq -c .error.stack.frames $EVENT` | +| Error summary | `jq -r '"\(.error.type // "") \(.error.message // "")"' $EVENT` | +| Crash diagnosis | `jq -c .crash_diagnosis $EVENT` | +| Native stacktrace | `jq -c '.error.stack.frames' $EVENT` | | PHP stacktrace | `jq -c .experimental.runtime_stack.frames $EVENT` | | Mapped files | `jq -c '.files["/proc/self/maps"]' $EVENT` | | Registers | `jq -c '.ucontext // .experimental.ucontext' $EVENT \| .claude/parse_ucontext.py` | -| PHP version | `jq -r '.language_version // (.metadata.tags[] \| select(startswith("runtime_version:")) \| split(":")[1])' $EVENT` | +| PHP version | `jq -r '.language_version // .runtime_version // (.metadata.tags[] \| select(startswith("runtime_version:")) \| split(":")[1])' $EVENT` | +| OS / arch | `jq -r '(.os_info.os_type // .host.os) + " " + (.os_info.architecture // .host.arch // "unknown") + " (kernel " + (.host.version // "?") + ")"' $EVENT` | -> **Note:** `parse_ucontext.py` only supports amd64. On aarch64, skip this step -> and read register values directly from `.ucontext` (or `.experimental.ucontext` -> in older events). +> **Note:** `parse_ucontext.py` only supports amd64. Check `.ucontext.arch` +> first; on aarch64, skip the script and read register values directly from +> `.ucontext.raw`. -From the mapped files, determine: -- **Products loaded**: look for `ddtrace.so`, `ddappsec.so`, `datadog-profiling.so` -- **SSI mode**: check for `libdatadog_php.so` and `dd_library_loader.so` — if present, the process is running the SSI (Single-Step Instrumentation) package. See [SSI architecture](#ssi-architecture) below. 
-- **OS/arch**: architecture (x86_64 or aarch64) -- **libc**: GNU (`ld-linux-x86-64.so`) or musl (`ld-musl-x86-64.so`) +### crash_diagnosis (schema 1.8+) + +`crash_diagnosis` is computed **server-side by the Datadog errors-worker** +(DataDog/dd-source, +`domains/evp-workers/apps/errors-worker/src/crashtracking/`), not by +libdatadog. It consumes `sig_info`, `ucontext.registers`, and `/proc/self/maps` +from the event; if any is absent, the field is omitted. Use it to confirm (not +skip!) manual triage steps: + +| Field | Meaning | +|-------|---------| +| `category` | Crash category — see enum below | +| `summary` | One-line human-readable description | +| `details` | Extended description with signal/address details and analysis rationale | +| `crashLocation` | Optional. The memory mapping containing the instruction pointer at crash time | +| `crashLocation.path` | Binary where the crashing instruction lives | +| `crashLocation.offsetInMapping` | Offset within that binary's mapped region (hex) | +| `crashLocation.permissions` | Mapping permissions (`r-xp` = executable code) | +| `faultAddressMapped` | Optional. `true` = fault address is in a mapped region; `false` = not mapped (wild pointer); absent if `si_addr` unavailable | +| `faultAddressMapping` | If `faultAddressMapped` is `true`, the mapping containing the fault address | +| `nullRegisters` | Registers whose value was < 0x1000 (null page threshold) at crash time — these are the likely null pointer sources for a `NullPointerDereference` | +| `stackPointerValid` | Optional. 
`false` = SP is outside the `[stack]` mapping; stack is corrupt; makes native stacktrace unreliable | + +#### DiagnosisCategory enum (complete) + +| Value | Signal | Condition | +|-------|--------|-----------| +| `NullPointerDereference` | SIGSEGV/SEGV_MAPERR | fault addr < 0x1000 (null page) | +| `StackOverflow` | SIGSEGV/SEGV_MAPERR | fault addr within 8 KB of stack guard page | +| `UseAfterFree` | SIGSEGV/SEGV_MAPERR | fault addr within 1 MB past heap end | +| `WildPointer` | SIGSEGV/SEGV_MAPERR | unmapped address, no recognizable pattern | +| `WriteToReadOnly` | SIGSEGV/SEGV_ACCERR | faulting mapping is non-writable | +| `ExecuteNonExecutable` | SIGSEGV/SEGV_ACCERR or SIGILL | fault addr == IP and mapping is non-executable | +| `MisalignedAccess` | SIGBUS/BUS_ADRALN | misaligned memory access (BUS_ADRALN only — BUS_ADRERR, e.g. file-mapped access beyond EOF, maps to `Unknown`) | +| `IllegalInstruction` | SIGILL | invalid opcode in executable region | +| `IntentionalAbort` | SIGABRT | assert(), panic!(), or allocator corruption | +| `Unknown` | any | no pattern matched | ### SSI architecture @@ -76,16 +114,17 @@ crash and understand context: | `profiler_unwinding` | `0` | `counters.rs` | Nonzero = profiler was unwinding the stack at crash time. | | `profiler_serializing` | `0` | `counters.rs` | Nonzero = profiler was serializing data at crash time. | | `si_signo` | `11` | `sig_info.rs` | Raw signal number (`11` = `SIGSEGV`). | -| `si_signo_human_readable` | `sigsegv` | `sig_info.rs` | Signal name (`SIGSEGV`, `SIGBUS`, `SIGILL`, `SIGFPE`, …). Older versions may be lowercase. | +| `si_signo_human_readable` | `SIGSEGV` | `sig_info.rs` | Signal name (`SIGSEGV`, `SIGBUS`, `SIGILL`, `SIGFPE`, …). Always uppercase. | | `si_code` | `1` | `sig_info.rs` | Raw signal code; meaning is signal-dependent. | -| `si_code_human_readable` | `segv_maperr` | `sig_info.rs` | Signal code name (`SEGV_MAPERR`, `SEGV_ACCERR`, `BUS_ADRALN`, `ILL_ILLOPC`, …). 
| +| `si_code_human_readable` | `SEGV_MAPERR` | `sig_info.rs` | Signal code name (`SEGV_MAPERR`, `SEGV_ACCERR`, `BUS_ADRALN`, `ILL_ILLOPC`, …). | | `si_addr` | `0x00007ff894af86c8` | `sig_info.rs` | Fault address from `siginfo_t.si_addr`. | | `is_crash` | `true` | `errors_intake.rs` / `sidecar.c` | Always `true` for crash reports. | | `incomplete` | `false` | `errors_intake.rs` | `true` = stack trace is truncated / could not fully unwind. | -| `data_schema_version` | `1.4` | `errors_intake.rs` | JSON schema version; current is `1.5`. | +| `language` | `php` | `sidecar.c` | Language identifier pushed as `language:php`. | +| `runtime` | `php` | `sidecar.c` | Runtime identifier pushed as `runtime:php`. | +| `data_schema_version` | `1.8` | `errors_intake.rs` | JSON schema version; current is `1.8`. | | `uuid` | `2f530826-…` | `errors_intake.rs` | RFC 4122 UUID shared between crash ping and crash report. | | `version` | `1.16.0` | `sidecar.c` | Service version from `DD_VERSION` or the active APM span. | -| `source` | `php` | `sidecar.c` | Language/runtime identifier (`"php"`). | | `team` | `telemetry-and-analytics` | Datadog backend | Internal routing tag injected by the intake pipeline. Not from PHP code. | | `instrumented_service` | `web.request` | Datadog Agent/backend | Resource/span type at crash time. Not from PHP code. | | `datacenter` | `us1.prod.dog` | Datadog backend | Intake datacenter/region tag. Not from PHP code. | @@ -96,7 +135,17 @@ Check whether any profiler counter (`profiler_collecting_sample`, `profiler_unwinding`, `profiler_serializing`) is nonzero — this attributes the crash to profiler activity. -Print the triage summary before continuing. +From the mapped files, determine: +- **Products loaded**: look for `ddtrace.so`, `ddappsec.so`, + `datadog-profiling.so` +- **SSI mode**: check for `libdatadog_php.so` and `dd_library_loader.so` — if + present, the process is running the SSI (Single-Step Instrumentation) + package. 
See [SSI architecture](#ssi-architecture) below. +- **OS/arch**: prefer `os_info.architecture` (schema 1.8+) over `host.arch` + (which may be empty), but fall back to reading the mapped ld-linux file name +- **libc**: GNU (`ld-linux-x86-64.so`) or musl (`ld-musl-x86-64.so`) + +Finally, print the triage summary before continuing. ## Phase 2 — Stacktrace correlation @@ -104,6 +153,12 @@ Checkout the matching version tag in a worktree (tags are like `1.16.0`). For PHP source, use the `php-src` repository next to this checkout; PHP tags are like `PHP-8.1.33`. +PHP runtime frames (`experimental.runtime_stack.frames`, format: `"Datadog +Runtime Callback 1.0"`) contain: +- `file` / `function` / `line` — source location +- `type_name` — class name when the frame is a method call (e.g. + `"Couchbase\\Collection"`) + > **Note:** Ondřej Surý packages for Debian may be slightly modified relative to > upstream PHP. If discrepancies appear, use `apt-get source` inside an > appropriate Docker container to obtain the exact source. @@ -132,6 +187,18 @@ If frames land in unknown binaries, note them but focus on Datadog frames first. **If you can identify the root cause at this point, stop and report.** Only continue to Phase 3/4 if the analysis is ambiguous or low-confidence. +Note: the authoritative native stacktrace for the crashing thread is +**`.error.stack.frames`** (format: `"Datadog Crashtracker 1.0"`, always +populated when a stack could be captured). The crashing thread name is in +`.error.thread_name`. + +`error.threads` is a per-thread snapshot array present in schema 1.8+. Each +element carries a `crashed` boolean flag, `name`, `state`, and `stack.{frames, +incomplete}`. In practice, `crashed` is often `false` on every thread and +`stack.frames` is `null` with `incomplete: true` — the per-thread stacks are +frequently unavailable. Use them as supplementary context only; do not rely on +them as the primary frame source. 
+ ## Phase 3 — Binary verification (if needed) If the stacktrace correlation is ambiguous or the crash is in Datadog code: @@ -148,7 +215,8 @@ If the stacktrace correlation is ambiguous or the crash is in Datadog code: .claude/dd_php_release_url '' '' '' '' ``` Both print a temp directory with the extracted package. Use the version - exactly as it appears in `metadata.library_version`. + exactly as it appears in `tracer_version` (or reconstructed from + `library_version`). 2. Verify the binary matches the crash by comparing: - Size of first mapped region (from `/proc/self/maps`) vs. `p_memsz` of the