134 changes: 19 additions & 115 deletions .claude/ci/index.md
@@ -181,121 +181,25 @@ curl -s -H "PRIVATE-TOKEN: $GITLAB_PERSONAL_ACCESS_TOKEN" \

### Checking CI (GitLab)

Use `.claude/ci/check-ci` to follow a pipeline until all jobs complete.
When invoked with `--commit` (or defaulting to HEAD), it monitors both the GitLab
pipeline **and** any GitHub Actions workflows for the same commit. GitHub monitoring
requires `ddtool auth github login --org DataDog` to be configured; if the token is
unavailable, a warning is printed and only GitLab is monitored.

Results are written to `/tmp/gitlab_<pipeline_id>/`:
- `success.txt` — `<job_id>\t<job_name>` per line
- `failure.txt` — same format for failed jobs (GitLab and GitHub; GitHub entries are prefixed `[GH]`)
- `fail_logs/<job_id>.log` — full job trace for each GitLab failure
- `gh_fail_logs/gh_<job_id>.log` — log for each GitHub Actions job failure

Exit codes: 0 = all passed, 1 = failures or threshold reached.

#### Invocation pattern

Available options: `--commit <ref>` OR `--pipeline <id>` (GitLab only, skips GitHub),
`--discovery-timeout <s>` (default 60), `--poll-interval <s>` (default 60),
`--max-failures <n>` (default 50), `--timeout <s>` (default 7200 = 2 h),
`--list-jobs` (see below).

##### `--list-jobs`

Prints all jobs grouped by pipeline with their status, then exits
immediately — does not monitor or download logs. Shows both GitLab pipelines
and GitHub Actions workflow runs. Useful for a quick snapshot of what ran and
what failed:

```bash
.claude/ci/check-ci --commit HEAD --list-jobs
```

Output format:

```
Pipeline 105413994 (status: failed):
failed test_extension_ci: [7.2]
success compile extension: debug [8.3]
...

GitHub Actions run 12345678 'Profiling ASAN Tests' (status: completed, conclusion: failure):
failure prof-asan (8.5, ubuntu-8-core-latest)
success prof-asan (8.3, ubuntu-8-core-latest)
...
```

#### Monitor CI

If --list-jobs is not passed, check-ci will run until all monitored pipelines
finish, until a timeout, or until the maximum number of failures is reached.

**Step 1 — Start check-ci in the background (Bash tool,
`run_in_background: true`):**

```bash
PYTHONUNBUFFERED=1 .claude/ci/check-ci [OPTIONS]
```

Do NOT add `&` or `mktemp` — run the command directly and let
`run_in_background: true` handle backgrounding. `PYTHONUNBUFFERED=1`
is required so Python flushes stdout into the task output file.
The Bash tool returns immediately with a line like:
```
Output is being written to: /path/to/tasks/<id>.output
```
Note that path — it is the output file for the next step.

**Step 2 — Run ci-watch in the background (Bash tool,
`run_in_background: true`):**

```bash
.claude/ci/ci-watch [--start-offset N] OUTPUT_FILE
```

**`OUTPUT_FILE` must be the output file from a `check-ci` process** — not an
arbitrary background task. `ci-watch` parses `check-ci`'s structured
`FAILED:` / `SUCCESS:` lines and exits silently on anything else.

`ci-watch` tails the output file and exits when there is something to
act on. Run it with `run_in_background: true` — you will be notified
when it completes. While it runs, you can do other work.

Exit codes:
- 0 — all pipelines completed (no failures)
- 1 — one or more FAILED: lines detected
- 2 — stale: no new output for 5 minutes
- 3 — check-ci timed out

On exit, ci-watch always prints `RESUME_OFFSET: <N>`. Record this
value — pass it as `--start-offset N` when re-running ci-watch to
skip already-processed content and wait for further failures.

When ci-watch completes, immediately call the `speak_when_done` MCP tool:
- "All CI jobs passed" if exit 0.
- "<N> CI jobs failed" if exit 1 (count is
`grep "^FAILED:" OUTPUT_FILE | wc -l`).
- "CI monitor timed out" if exit 2 or 3.

**Step 3 — Act on the result**

Choose among these actions, as appropriate:

- **Just report:** summarise the result to the user and stop.
- **Investigate failures:** read `fail_logs/<job_id>.log` under the
output directory for each failed job and diagnose the root cause.
- **Wait for more failures:** if check-ci is still running and you want
to keep watching after investigating, re-run ci-watch with
`--start-offset <RESUME_OFFSET>` (back to Step 2).
- **Kill check-ci:** if you want to stop monitoring entirely, kill it
by its task ID or PID (noted from Step 1).
- **Push fixes:** if a) the user asked you to (NOT OTHERWISE), AND b)
you have made changes to fix the CI failures, AND c) the current
branch has an upstream branch, then commit and push. Then go back to
Step 1. If any of the three preconditions is not met, stop and
report the results (and your findings, if any).
Use the `/check-ci` skill — it encapsulates the full procedure: starting
`check-ci` and `ci-watch` in the background, speaking the result, and
investigating failures. See
[`.claude/skills/check-ci/SKILL.md`](../skills/check-ci/SKILL.md).

Quick reference for the underlying tools:

- `check-ci` options: `--commit <ref>` OR `--pipeline <id>` (GitLab only,
skips GitHub), `--discovery-timeout <s>` (default 60),
`--poll-interval <s>` (default 60), `--max-failures <n>` (default 50),
`--timeout <s>` (default 7200 = 2 h), `--list-jobs`.
- When `--commit` is used, both GitLab and GitHub Actions are monitored.
GitHub monitoring requires `ddtool auth github login --org DataDog`; if
unavailable, a warning is printed and only GitLab is monitored.
- Results land in `/tmp/gitlab_<pipeline_id>/`: `success.txt`,
`failure.txt` (GitHub entries prefixed `[GH]`),
`fail_logs/<job_id>.log`, `gh_fail_logs/gh_<job_id>.log`.
- `--list-jobs` prints a grouped job table (GitLab + GitHub Actions) and
exits immediately — useful for a quick snapshot without monitoring.
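
The result layout above can be walked with a short helper. This is only a sketch: `list_fail_logs` is a hypothetical name, the tab-separated `<job_id>\t<job_name>` line format is taken from the earlier description of `failure.txt`, and the exact position of the `[GH]` prefix within a line is an assumption, so the function accepts it on either field.

```bash
# Sketch (hypothetical helper): print "<log path>\t<job name>" for each
# entry in failure.txt under a results directory.
# Assumptions: lines are "<job_id>\t<job_name>"; GitHub entries carry a
# "[GH]" prefix on the ID or on the name (its exact position is not
# confirmed, so both are accepted).
list_fail_logs() {
    local dir=$1 id name
    [ -f "$dir/failure.txt" ] || return 0
    while IFS=$'\t' read -r id name; do
        [ -n "$id" ] || continue
        if [[ $id == '[GH]'* || $name == '[GH]'* ]]; then
            id=${id#'[GH]'}   # drop the prefix if it sits on the ID
            id=${id# }        # tolerate one space after the prefix
            printf '%s\t%s\n' "$dir/gh_fail_logs/gh_${id}.log" "$name"
        else
            printf '%s\t%s\n' "$dir/fail_logs/${id}.log" "$name"
        fi
    done < "$dir/failure.txt"
}
```

Calling `list_fail_logs /tmp/gitlab_<pipeline_id>` would then give one candidate log path per failed job; check that each path exists before reading it.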

### Downloading artifacts

164 changes: 164 additions & 0 deletions .claude/skills/check-ci/SKILL.md
@@ -0,0 +1,164 @@
---
name: check-ci
description: >-
Monitor GitLab CI and GitHub Actions for this repo: start check-ci, tail
results with ci-watch, investigate failures, and report. Use when the user
asks to check, watch, or monitor CI, or to see whether a pipeline passed.
argument-hint: "[--commit <ref> | --pipeline <id>] [--list-jobs]"
allowed-tools: Bash Read Grep Glob Agent TaskCreate TaskUpdate TaskStop mcp__speak_when_done__speak
effort: high
---

# Check CI

Monitor GitLab CI and GitHub Actions until all jobs finish, then investigate
any failures and report results.

When `--commit` is used (or defaulting to HEAD), both GitLab pipelines and
GitHub Actions workflow runs are monitored. GitHub monitoring requires
`ddtool auth github login --org DataDog`; if unavailable, a warning is
printed and only GitLab is monitored. `--pipeline <id>` is GitLab-only and
skips GitHub.

## Input

`$ARGUMENTS` may contain any combination of:
- `--commit <ref>` — git ref to resolve (default: HEAD); monitors GitLab +
GitHub
- `--pipeline <id>` — specific GitLab pipeline ID (skips GitHub monitoring)
- `--list-jobs` — quick snapshot mode (no monitoring)

If no `--commit` or `--pipeline` is given, default to `--commit HEAD`.
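
The defaulting rule can be sketched as a small wrapper. `check_ci_args` is a hypothetical name, and the sketch assumes the options arrive as separate words (`--commit <ref>`, not `--commit=<ref>`), as in the list above.

```bash
# Sketch (hypothetical helper): print the check-ci arguments one per
# line, appending "--commit HEAD" when neither --commit nor --pipeline
# was supplied.
check_ci_args() {
    local arg has_target=0
    for arg in "$@"; do
        case $arg in
            --commit|--pipeline) has_target=1 ;;
        esac
    done
    if [ "$has_target" -eq 0 ]; then
        set -- "$@" --commit HEAD
    fi
    printf '%s\n' "$@"
}
```

For example, `check_ci_args --list-jobs` yields `--list-jobs`, `--commit`, `HEAD`, while `check_ci_args --pipeline 42` is passed through unchanged.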

## Quick mode — `--list-jobs`

Run synchronously and exit immediately:

```bash
.claude/ci/check-ci --commit <ref> --list-jobs
```

Prints all jobs grouped by pipeline (GitLab) and workflow run (GitHub
Actions) with their status. Print the table to the user and stop. Do not
continue to the monitoring steps.

## Full monitoring mode

### Step 1 — Start check-ci in the background

```bash
PYTHONUNBUFFERED=1 .claude/ci/check-ci [OPTIONS]
```

- Use `run_in_background: true` in the Bash tool invocation. Do NOT append `&`
or redirect output.
- The Bash tool returns immediately with an output file path like
`/path/to/tasks/<id>.output` ("Output is being written to ..." in the tool
invocation output). Note this path — it is required in Step 2. This file path
will be referred to as `OUTPUT_FILE` henceforth.
- Default options if the user provided none: `--commit HEAD`.
- You may also pass `--max-failures 50` (default) and
`--timeout 7200` (default, 2 h).

### Step 2 — Start ci-watch in the background

```bash
.claude/ci/ci-watch [--start-offset N] OUTPUT_FILE
```

- `OUTPUT_FILE` must be the output file from the check-ci task above.
- Use `run_in_background: true`.
- A task notification tells you when ci-watch exits. While it runs, you may do
other work.
- On exit, ci-watch always prints `RESUME_OFFSET: <N>`. Record it for re-runs.

ci-watch exit codes:
| Code | Meaning |
|------|---------|
| 0 | All pipelines completed — no failures |
| 1 | One or more `FAILED:` lines detected |
| 2 | Stale — no new output for 5 minutes |
| 3 | check-ci timed out |

### Step 3 — Speak and act on the result

**Immediately after ci-watch exits**, call
`mcp__speak_when_done__speak(message="...")` with the message matching the
exit code (the first time, you'll need to invoke
`ToolSearch("select:mcp__speak_when_done__speak")` first):
- Exit 0 → "All CI jobs passed"
- Exit 1 → "<N> CI jobs failed" (count with
`grep "^FAILED:" OUTPUT_FILE | wc -l`)
- Exit 2 or 3 → "CI monitor timed out"
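
The mapping from exit code to spoken message can be sketched as a shell function. `ci_watch_message` is a hypothetical helper, not part of the repo; it uses `grep -c`, which counts matching lines the same way as the `grep | wc -l` pipeline.

```bash
# Sketch (hypothetical helper): print the message to speak for a given
# ci-watch exit code. The second argument is check-ci's OUTPUT_FILE.
ci_watch_message() {
    local code=$1 output_file=$2 n
    case $code in
        0) echo "All CI jobs passed" ;;
        1) # grep -c exits non-zero on zero matches, hence "|| true"
           n=$(grep -c '^FAILED:' "$output_file" || true)
           echo "$n CI jobs failed" ;;
        2|3) echo "CI monitor timed out" ;;
        *) echo "Unexpected ci-watch exit code: $code" >&2
           return 1 ;;
    esac
}
```

The printed string is what would be passed as the `message` argument.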

Then choose the appropriate action:

#### All jobs passed (exit 0)

Report success to the user and stop.

#### Failures detected (exit 1)

1. List the failed jobs:
```bash
grep "^FAILED:" OUTPUT_FILE
```
The output directory is `/tmp/gitlab_<pipeline_id>/`. Logs are at:
- `fail_logs/<job_id>.log` — GitLab job traces
- `gh_fail_logs/gh_<job_id>.log` — GitHub Actions job logs
GitHub entries in `failure.txt` are prefixed `[GH]`.

2. Read each failure log and diagnose the root cause. Look for:
- Compile errors or linker failures
- Test assertion failures (include the failing test name and diff)
- Infrastructure/flakiness signals (timeout, network, Docker pull failures,
OOM) — mark these as flaky rather than real failures.

You don't need to go through all of them if it becomes evident that doing
so is unnecessary.

3. Report findings grouped by root cause.

4. **Fix and push only when all three conditions hold:**
a. The user explicitly asked you to fix CI failures.
b. You have made changes to address the failures.
c. The current branch has an upstream remote branch.
If any condition is missing, stop and report instead.

When all three hold: commit the fix, push, then go back to Step 1
to re-monitor.

If possible, before attempting a fix, try to reproduce the failure locally.
Check @.claude/ci/index.md for instructions. Then attempt your fix and rerun
to confirm the fix resolves the problem.
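
Of the three push conditions in step 4, only (c) can be checked mechanically. One way to test it, as a sketch with a hypothetical function name, is to ask git to resolve the upstream of the current branch:

```bash
# Sketch (hypothetical helper): succeed only when the current branch has
# an upstream configured. '@{u}' is git's shorthand for the upstream of
# the current branch; rev-parse fails when none is set.
has_upstream() {
    git rev-parse --abbrev-ref --symbolic-full-name '@{u}' >/dev/null 2>&1
}
```

A guard such as `has_upstream || echo "no upstream; report instead of pushing"` covers condition (c) only. Conditions (a) and (b) still have to be judged from the conversation and the working tree.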

#### Stale or timed out (exit 2 or 3)

Re-run ci-watch with `--start-offset <RESUME_OFFSET>` (Step 2) to
resume watching from where you left off. If check-ci itself has also
exited, restart from Step 1.
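
Extracting the offset for the re-run can be sketched as follows. `resume_offset` is a hypothetical helper; it assumes the `RESUME_OFFSET: <N>` marker starts its own line in ci-watch's captured output and that `N` is a non-negative integer.

```bash
# Sketch (hypothetical helper): read ci-watch's captured output on stdin
# and print the N from the last "RESUME_OFFSET: <N>" line, or nothing if
# no such line is present.
resume_offset() {
    sed -n 's/^RESUME_OFFSET: \([0-9][0-9]*\).*$/\1/p' | tail -n 1
}

# Example use (CI_WATCH_OUTPUT is a placeholder for wherever the
# previous ci-watch run's output was captured):
#   offset=$(resume_offset < "$CI_WATCH_OUTPUT")
#   .claude/ci/ci-watch --start-offset "$offset" "$OUTPUT_FILE"
```

If the function prints nothing, no offset was recorded and ci-watch should be started without `--start-offset`.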

#### Keep watching (user wants to continue after investigation)

Re-run ci-watch with `--start-offset <RESUME_OFFSET>` (back to Step 2).

## Downloading artifacts

Use `tooling/bin/download-artifacts` to fetch build outputs from CI jobs
(e.g., compiled extensions, SSI loader, datadog-setup.php). Useful when
investigating a failure that produced an artifact worth inspecting locally.

## Rules

- Never push unless the user explicitly asked for it. See the global
instruction "Do not push to git remotes unless explicitly asked to."
- Flaky jobs (known to be intermittent, unrelated to the current
changes) should be noted but not treated as real failures requiring
a fix. However, to confirm that a failure is flaky rather than caused
by the current changes, look for similar failures in the merge base.
- `GITLAB_PERSONAL_ACCESS_TOKEN` is already set in the environment —
do not re-export it.
- Raw job logs can also be fetched directly. For GitLab:
```bash
curl -s -H "PRIVATE-TOKEN: $GITLAB_PERSONAL_ACCESS_TOKEN" \
"https://gitlab.ddbuild.io/api/v4/projects/355/jobs/<JOB_ID>/trace"
```