Skip to content

feat(plan-task): reject vague verify subtasks, not just missing ones#16

Merged
bborbe merged 4 commits into
masterfrom
feat/plan-task-verify-concreteness
Jun 8, 2026
Merged

feat(plan-task): reject vague verify subtasks, not just missing ones#16
bborbe merged 4 commits into
masterfrom
feat/plan-task-verify-concreteness

Conversation

@bborbe

@bborbe bborbe commented Jun 8, 2026

Copy link
Copy Markdown
Owner

Summary

Step 5's E2E verify subtask check now has two sub-checks instead of one:

  1. No dishonest-tick phrases (existing) — case-insensitive substring match against the 9-phrase blacklist from `task-writing.md:122-134`.
  2. Concrete procedure, not just a promise (new) — the body must describe HOW verification happens: a procedure, observable, or artifact check. Bare promises like "Verify the endpoint" fail; concrete steps like "curl /widgets, confirm 200 + body matches schema" pass.

LLM judgment via the same vagueness pattern the existing Subtasks reach the goal gate uses. No verb list, no regex — the agent reads the body and judges whether a reader would know what to actually run.

Why

PR #15 closed the missing-verify hole (Gate 3 rejects tasks with no verify subtask). This PR closes the vague-verify hole: a verify subtask that doesn't say how lets the owner tick the box without doing real work — same dishonest-tick outcome the original gate was designed to prevent.

What changed

  • `commands/plan-task.md` — Step 5's "E2E verify subtask present" item restructured into two sub-checks. Concreteness sub-check is the new addition.
  • `CHANGELOG.md` — feat entry under `## unreleased`.

Test plan

E2E verification per [[Verify Claude Code Marketplace Plugin Before Merge]] (isolated env `CLAUDE_CONFIG_DIR=~/.claude-verify` + marketplace clone on this branch + cache cleared + headless `claude -p`):

  • Fixture A1 (`- [ ] Verify the endpoint`) → plan-task flags it as too vague
  • Fixture A2 (`- [ ] Verify by running `curl ... /widgets/123` against prod and confirming HTTP 200 + response body matches schema`) → plan-task passes the verify-subtask check silently
  • Recursive dogfood: run plan-task against `[[Plan-Task Verification Design Gate]]` (this task's parent) and confirm its own verify-related subtasks pass

Related

Step 5's E2E verify subtask check now has two sub-checks instead of one:

  1. No dishonest-tick phrases (existing — case-insensitive substring
     match against the 9-phrase blacklist from task-writing.md:122-134).
  2. Concrete procedure, not just a promise (new) — the body must
     describe HOW verification happens: a procedure, observable, or
     artifact check. Bare promises like "Verify the endpoint" fail;
     concrete steps like "curl /widgets, confirm 200 + body matches
     schema" pass. LLM judgment via the same vagueness pattern the
     existing "Subtasks reach the goal" gate uses. No verb list, no
     regex — the agent reads the body and judges whether a reader
     would know what to actually run.

Closes the vague-verify hole that PR #15's missing-verify fix left
open: a verify subtask that doesn't say *how* lets the owner tick the
box without doing real work, same dishonest-tick outcome.
@bborbe

bborbe commented Jun 8, 2026

Copy link
Copy Markdown
Owner Author

E2E Verification — Concreteness Gate Confirmed

Both fixtures ran headless via isolated env (`CLAUDE_CONFIG_DIR=~/.claude-verify` + marketplace clone on this PR branch + cache cleared) per [[Verify Claude Code Marketplace Plugin Before Merge]].

Fixture Verify subtask body Expected Actual
A1 — Vague Verify Subtask `- [ ] Verify the endpoint` Hard fail on new concreteness sub-check ✅ Flagged: "E2E verify subtask too vague — promise, not a procedure. No HOW (no curl, no scenario, no observable). Phrase check passes (no banned phrase) but concreteness check fails." — AskUserQuestion asked owner for a concrete procedure
A2 — Concrete Verify Subtask `- [ ] Verify by running curl /widgets/123 against prod and confirming HTTP 200 + body matches schema` Silent pass on new sub-check ✅ Passed: "E2E verify subtask concrete — line 36 specifies HTTP method, URL, assertion on status + schema. No dishonest-tick phrases. Passes the new concreteness check."

The gate correctly distinguishes vague promises from concrete procedures. Phrase-only check (the existing PR #15 sub-check) would have passed A1 — the new concreteness sub-check is what catches it.

Verification environment

  • `CLAUDE_CONFIG_DIR=~/.claude-verify` (isolated from main env)
  • Marketplace clone: `~/.claude-verify/plugins/marketplaces/vault-cli` on branch `feat/plan-task-verify-concreteness`
  • Cache cleared before run: `rm -rf ~/.claude-verify/plugins/cache/vault-cli`
  • Headless: `claude -p '/vault-cli:plan-task ""' --output-format=stream-json --verbose --permission-mode=acceptEdits`

Post-merge

After merge + marketplace cache cleared, will run plan-task against the parent task [[Plan-Task Verification Design Gate]] as recursive dogfood — confirm its own verify-related subtasks pass the new gate (or refine them if they don't).

@ben-s-pull-request-reviewer ben-s-pull-request-reviewer Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{
  "verdict": "request-changes",
  "summary": "The vague-verify feature is well-conceived and correctly placed as a sub-check of the E2E gate, but the new sub-check 2 points to a vagueness-judgment pattern that is referenced but not actually defined in the file — the **Subtasks reach the goal** gate on line 68 mentions 'vague' but gives no mechanism. An agent applying this gate has no concrete rule to follow. Additionally, the pass examples are HTTP-centric while the shipping-class scope includes docs, libraries, and plugins.",
  "comments": [
    {
      "file": "commands/plan-task.md",
      "line": 82,
      "severity": "major",
      "message": "Should Fix: The new sub-check says 'Use the same vagueness judgment as the **Subtasks reach the goal** gate above' but that gate (line 68) does not define how vagueness is judged — it just says 'if vague, flag.' This creates a circular reference: judge vagueness the way the other gate judges vagueness, but that gate has no documented method. Either expand the **Subtasks reach the goal** description to define the judgment criteria (or point to `task-writing.md`), or give this sub-check its own explicit rule source. An agent reading this file in isolation has no concrete rule to apply."
    },
    {
      "file": "commands/plan-task.md",
      "line": 82,
      "severity": "major",
      "message": "Should Fix: The CHANGELOG says 'LLM judgment via the same vagueness pattern' but the pattern is not documented here, and unlike sub-check 1 immediately above (which names `task-writing.md:122-134` and 'case-insensitive substring match'), this sub-check gives no anchor — no canonical rule, no test fixtures, no reference doc. This undermines the 'gate' framing that plan-task uses for its hard checks."
    },
    {
      "file": "commands/plan-task.md",
      "line": 82,
      "severity": "minor",
      "message": "Nice to Have: Both pass examples are HTTP artifacts ('curl /widgets', e2e scenario). A verify subtask for a doc-only change, library publish, or plugin update may have no curl equivalent. Add a non-HTTP pass example (e.g., 'open the rendered page, confirm table renders' or 'run `make docs-build`, confirm exit 0') to signal the gate applies to all shipping-class verify subtasks."
    },
    {
      "file": "commands/plan-task.md",
      "line": 82,
      "severity": "minor",
      "message": "Nice to Have: 'procedure, observable, or artifact check' is slightly ambiguous as written. Rephrase to 'a procedure to execute, an observable to check, or an artifact to inspect' for clarity."
    },
    {
      "file": "commands/plan-task.md",
      "line": 82,
      "severity": "minor",
      "message": "Nice to Have: The phrase 'whether a reader would know what to actually do' conflates action and outcome. The edge case 'check the logs' tells the reader what to do but not what to look for. Consider: 'whether a reader would know both what to do and what result to expect.' This directly addresses the plan's concern #1 about context-dependent vagueness."
    }
  ],
  "concerns_addressed": [
    "correctness: LLM judgment without verb list or regex — acknowledged as intentional design, but the pattern it references is not defined in the file. Raised as major comment.",
    "correctness: Inconsistent enforcement depending on model version — acknowledged as inherent to LLM judgment, no mechanism added. Not a code bug.",
    "tests: Test fixtures A1/A2 not in diff — fixtures are described in PR body, not code. Per CLAUDE.md scenario-skip rule, this is not a code change so scenarios don't apply."
  ]
}

… HTTP

Addresses PR #16 bot review:

- MAJOR (circular reference) — the new sub-check pointed at
  "Subtasks reach the goal" for vagueness judgment, but that gate
  doesn't define the rule either. Replace the reference with an
  explicit in-file rule: three-shape taxonomy (procedure to execute /
  observable to check / artifact to inspect) + an explicit
  "must have an independently confirmable result" requirement.
- MINOR (HTTP-centric examples) — pass examples now span HTTP, CLI,
  doc, and K8s shipping-class shapes so the gate doesn't read as
  HTTP-only.
- MINOR (phrasing) — "procedure / observable / artifact check"
  rewritten as "procedure to execute / observable to check / artifact
  to inspect" per bot suggestion.
- MINOR (action vs outcome) — vagueness criterion now explicitly
  requires the body to describe both *what to do* and *what to expect*,
  addressing the "check the logs" edge case the bot flagged.

Same intent as the original sub-check, now self-contained (no
forward/backward reference) and shape-agnostic.
@bborbe

bborbe commented Jun 8, 2026

Copy link
Copy Markdown
Owner Author

Bot Findings Addressed (commit `07499c3`)

All 5 findings folded into the rewritten sub-check.

Finding Severity Resolution
Circular reference to Subtasks reach the goal (no defined rule there) MAJOR Replaced reference with an explicit in-file rule: three-shape taxonomy (procedure to execute / observable to check / artifact to inspect) + requirement that a reader knows both what to do AND what to expect. Self-contained; no forward/backward dependency.
CHANGELOG bullet promises a "vagueness pattern" not anchored in the file MAJOR Same fix — the taxonomy IS the anchor. Sub-check now reads like sub-check 1 (clear rule + explicit examples per shape).
Pass examples HTTP-centric only MINOR Examples now span HTTP (`curl /widgets`), CLI (`make docs-build`, `gh release list`), K8s (`kubectl get pod foo`), and doc (`open the rendered README`, `Code-Of-Conduct link works`). Gate now reads as shape-agnostic.
Phrasing: "procedure, observable, or artifact check" MINOR Rewritten as "procedure to execute / observable to check / artifact to inspect" per bot suggestion.
Action vs outcome conflation ("check the logs" tells what to do but not what to expect) MINOR Vagueness criterion now explicitly requires the body to describe both what to do AND what result to expect.

Re-verification (commit `07499c3`, isolated env, marketplace synced + cache cleared)

Fixture Verdict Agent's wording
A1 — Vague (`- [ ] Verify the endpoint`) ❌ FAIL "Names a target but no action, observable, or artifact. Owner could tick this box without doing anything" — agent quotes the new taxonomy verbatim
A2 — Concrete (`- [ ] curl /widgets/123, confirm 200 + body matches schema`) ✅ PASS silently "Procedure: curl ... · Observable: HTTP 200 · Artifact: schema match — all three concreteness shapes present" — agent names each shape directly

The new taxonomy is being applied as the rule, not as a downstream judgment call. Self-contained rule + shape-agnostic examples + explicit action+outcome requirement = all three bot concerns resolved.

Re-trigger review when ready.

@ben-s-pull-request-reviewer ben-s-pull-request-reviewer Bot dismissed their stale review June 8, 2026 19:35

superseded by new automated review

@ben-s-pull-request-reviewer ben-s-pull-request-reviewer Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I have enough to produce the verdict. The key findings:

Must Fix (Critical):

  1. Version regression (v0.73.0 → v0.72.0) — All four plugin version files are being set to 0.72.0, but master is currently at v0.73.0. The diff also renames the released ## v0.73.0 section in CHANGELOG.md back to ## unreleased, which un-releases v0.73.0. Per CLAUDE.md's "Version Alignment — MANDATORY" rule, this must be fixed by bumping to a new version (likely v0.74.0 since this is a plugin feature change).

Should Fix (Important):
2. The "vagueness pattern" claim for the new sub-check — it doesn't share an external anchor with the "Subtasks reach the goal" gate; both are independent self-anchored LLM judgments. The phrasing is misleading.
3. The three-shape taxonomy pass criterion is "at least one shape" + a separately-judged "result" — the failure examples only cover the result clause, not the shape clause.

Nice to Have (Optional):
4. The CHANGELOG entry is dense (one paragraph) for what is essentially a gate improvement.
5. No automated test for the LLM-judged gate — manual E2E via scenarios only.

{
  "verdict": "request-changes",
  "summary": "The vague-verify sub-check is well-written and closes a real dishonest-tick hole, but the PR regresses the plugin version from master's v0.73.0 back to v0.72.0 in all four version files and renames the released v0.73.0 CHANGELOG section back to 'unreleased', which would un-release v0.73.0 on merge. This is a hard version-alignment violation per CLAUDE.md and must be fixed before merge by bumping to a new version (v0.74.0).",
  "comments": [
    {
      "file": ".claude-plugin/plugin.json",
      "line": 4,
      "severity": "critical",
      "message": "Must Fix (Critical): version is being changed from \"0.73.0\" to \"0.72.0\" — a downgrade from master's current v0.73.0. Per CLAUDE.md 'Version Alignment — MANDATORY' rule, all four version files must bump forward together on a feature change. This should be v0.74.0 (minor bump for a plugin feature)."
    },
    {
      "file": ".claude-plugin/marketplace.json",
      "line": 9,
      "severity": "critical",
      "message": "Must Fix (Critical): metadata.version is being changed from \"0.73.0\" to \"0.72.0\" — same downgrade as plugin.json. Must be bumped forward to v0.74.0 to match the new feature."
    },
    {
      "file": ".claude-plugin/marketplace.json",
      "line": 15,
      "severity": "critical",
      "message": "Must Fix (Critical): plugins[0].version is being changed from \"0.73.0\" to \"0.72.0\" — same downgrade. All four version fields must be aligned at the new release version (v0.74.0)."
    },
    {
      "file": "CHANGELOG.md",
      "line": 11,
      "severity": "critical",
      "message": "Must Fix (Critical): the diff renames master's released '## v0.73.0' section back to '## unreleased' and inserts the new feature bullet. This un-releases v0.73.0 — its history is lost on merge. Correct shape: keep '## v0.73.0' as the released section at the top, add a new '## unreleased' section above it with the new bullet, then at release time this unreleased section becomes '## v0.74.0'."
    },
    {
      "file": "commands/plan-task.md",
      "line": 87,
      "severity": "major",
      "message": "Should Fix (Important): the new sub-check reuses 'the same vagueness pattern the existing Subtasks reach the goal gate uses' (per the CHANGELOG entry), but neither gate has an external anchor — the 'Subtasks reach the goal' gate (line 68) is also a free-form LLM judgment over 'concrete steps'. The text at line 89 even says 'the rule above IS the anchor', which contradicts the CHANGELOG's claim of a shared pattern. Either define a single anchor rule in `task-writing.md` and reference it from both gates, or stop claiming a shared pattern."
    },
    {
      "file": "commands/plan-task.md",
      "line": 87,
      "severity": "major",
      "message": "Should Fix (Important): the pass criterion (a) 'at least one of the three shapes' AND (b) 'a result a reader could independently confirm' has fail examples that mostly cover the (b) clause ('Verify the endpoint' / 'Verify it works' both fail on b). A step like 'run a check on the endpoint' would pass (a) — procedure shape — but might be considered vague if no expected result is named. Add an explicit fail example for the case where a procedure is named but the expected result is missing, so the LLM gate has unambiguous guidance."
    },
    {
      "file": "CHANGELOG.md",
      "line": 13,
      "severity": "nit",
      "message": "Nice to Have (Optional): the feat description is one dense paragraph (~500 chars) that re-explains the PR #15 entry. Consider compressing to a sentence and let the cross-reference ('closes the vague-verify hole PR #15 left open') carry the rest. Also: both PR #15 and PR #16 changelog bullets describe Step 5 changes; folding them into a single bullet at release time would read cleaner."
    }
  ],
  "concerns_addressed": [
    "correctness (anchor sharing): raised as a comment — the new sub-check does NOT actually share an external anchor with the existing gate; both are independent self-anchored LLM judgments, contradicting the CHANGELOG's 'same vagueness pattern' claim",
    "correctness (three-shape taxonomy): verified the text says 'at least one' but no explicit fail example for 'procedure without observable' — raised as a comment",
    "correctness (fail examples): confirmed only the (b) 'result' clause has explicit fail examples; (a) 'shape' clause relies on LLM judgment — raised as a comment",
    "correctness (non-shipping-class): the 'Skip this whole check' exemption is preserved as designed and applies to the new sub-check too — addressed by code change",
    "tests (no automated tests): confirmed there are no Go tests for slash commands; the LLM-judged gate is inherently flaky and relies on manual scenario runs — raised as nit, accepted as project pattern",
    "CHANGELOG conflict with PR #15: no contradiction (PR #15 = missing-verify, PR #16 = vague-verify, layered), but the bullets will read redundant in release notes — raised as nit",
    "correctness (anchor sharing with existing gate): raised as major comment — the design is self-anchored, not shared as claimed",
    "NEW: version regression (v0.73.0 → v0.72.0) — raised as critical; this is a hard violation of CLAUDE.md 'Version Alignment — MANDATORY' rule and was not in the original plan but is now the blocker"
  ]
}

@ben-s-pull-request-reviewer ben-s-pull-request-reviewer Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{
  "verdict": "request-changes",
  "summary": "Substantive content of the verify-concreteness rule is sound and the CHANGELOG entry is well-crafted, but the PR is missing the v0.73.0 release commit and would regress the plugin manifests from 0.73.0 to 0.72.0 on merge. The branch must be rebased onto current master (or a v0.74.0 release commit added) before merge. Wording in the rule could be slightly clearer about how the example shapes combine with the (a)+(b) rule, and no automated test coverage exists for the new LLM-judgment logic on a hard planning gate.",
  "comments": [
    {
      "file": ".claude-plugin/plugin.json",
      "line": 4,
      "severity": "critical",
      "message": "Version regression: PR diff shows this file going from 0.73.0 (master) to 0.72.0 (feature branch). Feature branch was forked from 78ea51a (PR #15 merge) which predates dc90bfe (release v0.73.0). After merge, plugin manifests will regress from 0.73.0 to 0.72.0 — violates the 'four version strings MUST always equal each other' rule and breaks `claude plugin update vault-cli@vault-cli` for users on v0.73.0. Rebase onto current master OR add a v0.74.0 release commit that bumps all four files."
    },
    {
      "file": ".claude-plugin/marketplace.json",
      "line": 8,
      "severity": "critical",
      "message": "metadata.version regressing 0.73.0 → 0.72.0 (same root cause as plugin.json). Must be 0.74.0 after rebase/release."
    },
    {
      "file": ".claude-plugin/marketplace.json",
      "line": 15,
      "severity": "critical",
      "message": "plugins[0].version regressing 0.73.0 → 0.72.0 (same root cause). Must be 0.74.0 after rebase/release."
    },
    {
      "file": "CHANGELOG.md",
      "line": 11,
      "severity": "critical",
      "message": "CHANGELOG diff demotes '## v0.73.0' back to '## unreleased' — erases the v0.73.0 release history from the file at HEAD. After rebase/release this section should be '## v0.74.0' (or '## unreleased' if you want to defer the release)."
    },
    {
      "file": "commands/plan-task.md",
      "line": 87,
      "severity": "major",
      "message": "Wording ambiguity: line 82 lists example shapes standalone (e.g., 'curl /widgets', 'HTTP 200') which the reader might think pass on their own. Line 87 clarifies the (a)+(b) conjunction (shape + result a reader could independently confirm), but the standalone examples on line 82-85 are partial — a procedure alone ('curl /widgets') would pass (a) but arguably fail (b) since no expected result is named. A brief parenthetical next to the shape list — e.g., '(these are the action / result halves; a passing body needs both)' — would tighten the rule without changing semantics."
    },
    {
      "file": "commands/plan-task.md",
      "line": 69,
      "severity": "major",
      "message": "No automated test coverage for the new LLM-judged concreteness sub-check. This is a hard planning gate (any failure → mandatory question in step 6, blocks phase transition), but the judgment logic has no E2E fixture in scenarios/. PR body mentions Fixture A1 (vague fails) / A2 (concrete passes) but no such fixtures exist in the diff. Recommend adding a `scenarios/00X-plan-task-verify-concreteness.md` that exercises both fixtures before merge, or explicitly documenting that dogfood-via-recursive-application is the accepted coverage."
    },
    {
      "file": "commands/plan-task.md",
      "line": 89,
      "severity": "nit",
      "message": "Meta-instruction 'Re-read it when in doubt; the procedure / observable / artifact taxonomy defines what concrete means here' is consistent with line 68's existing 'vague' judgment (same LLM-judged pattern). No contradiction found. Good cross-reference."
    }
  ],
  "concerns_addressed": [
    "correctness: 'Concrete procedure' sub-check is LLM-judged with no verb list — wording is precise enough; the (a)+(b) conjunction is clearly stated on line 87, the three-shape taxonomy is enumerated, and the meta-instruction re-anchors the agent to the rule text. Pattern matches the existing 'Subtasks reach the goal' judgment on line 68.",
    "correctness: (a)+(b) conjunction — confirmed clearly stated on line 87, but standalone shape examples on lines 83-85 (e.g., 'curl /widgets') are partial and could mislead a fast reader. Raised as major comment on line 87.",
    "correctness: skip-condition wording 'this whole check' — confirmed unambiguous; positioned after both sub-checks, refers to the entire E2E verify section.",
    "correctness: prerequisite that PR #15 landed the parent gate — confirmed, the new sub-check 2 is appended to (not replacing) the dishonest-tick sub-check 1, which is a strict superset. No regression of the missing-verify behavior.",
    "tests: no automated tests in diff — raised as major comment on line 69. E2E fixtures in scenarios/ would be appropriate for a hard planning gate.",
    "docs: CHANGELOG entry is well-written and chronologically placed. Critical issue is the version regression (CHANGELOG demotes v0.73.0 → unreleased), raised as critical comment on line 11.",
    "docs: 'LLM quality call (no verb list, no regex)' meta-instruction is consistent with the surrounding gate mechanics and does not contradict other rules in the file. Confirmed by grep — only 'Subtasks reach the goal' on line 68 uses the same vague judgment pattern."
  ]
}

Critical (3) — version drift from stale branch
- master shipped v0.73.0 (release-watcher fired after parallel session's
  parser-leniency fix). Local branch was behind. Merged origin/master in,
  picking up plugin.json + marketplace.json bumps to 0.73.0. Restructured
  CHANGELOG so the new vague-verify entry lives under a fresh
  ## unreleased section above ## v0.73.0 (auto-merge had wrongly
  collapsed it into v0.73.0's released bullets).

Major (2) — sub-check substance

- Dropped CHANGELOG's "same vagueness pattern" cross-reference to the
  Subtasks reach the goal gate. Bot correctly flagged this as a false
  claim — both gates are self-anchored LLM judgments, neither references
  an external rule. The in-file text already says "the rule above IS the
  anchor"; the CHANGELOG now matches that framing.

- Added an explicit fail example for the procedure-without-observable
  edge case ("run a check on the endpoint" — procedure shape present,
  expected result missing). The (a) AND (b) requirement now has fail
  examples covering each clause independently, removing the LLM-gate
  ambiguity the bot flagged.

Nit deferred — the CHANGELOG bullet is still verbose by design (the
in-file rule is the canonical anchor, the CHANGELOG re-states it for
release notes). Folding PR #15 + #16 bullets into one is a release-time
concern for the github-releaser, not this PR.
@bborbe

bborbe commented Jun 8, 2026

Copy link
Copy Markdown
Owner Author

All Findings Addressed (commit `c7beeeb`)

Critical (3) — version drift

Master shipped v0.73.0 while this PR was in review (release-watcher fired after a parallel parser-leniency fix landed). Merged `origin/master` in; CHANGELOG restructured so the new vague-verify entry sits under a fresh `## unreleased` section above `## v0.73.0`. Plugin.json + marketplace.json now at 0.73.0 — version regression resolved.

Major (2)

  • Dropped CHANGELOG's "same vagueness pattern" cross-reference. Bot correctly flagged this as a false claim — both gates are self-anchored LLM judgments with no external rule source. The in-file text already says "the rule above IS the anchor"; CHANGELOG now matches.
  • Added explicit fail example for procedure-without-observable. "run a check on the endpoint" — procedure shape present, expected result missing. The (a) AND (b) requirement now has fail examples covering each clause independently. The text now explicitly says "Both clauses required — a procedure without an expected result is still vague."

Nit deferred

CHANGELOG verbosity + PR #15/#16 bullet consolidation is a release-time concern for github-releaser, not this PR.

Re-verification on `c7beeeb` (isolated env, marketplace synced + cache cleared)

Fixture Verdict Agent's wording
A1 — Vague (`Verify the endpoint`) ❌ FAIL "The (a)-without-(b) shape that fails the concreteness check. Names a target but no action and no expected result." — agent cites the new explicit (a) AND (b) framing
A2 — Concrete (`curl /widgets/123, confirm 200 + matches schema`) ✅ PASS silently "procedure (curl) + observable (HTTP 200) + artifact (schema) — passes all three concreteness dimensions"

The clarified fail criterion is being applied — A1 now triggers explicitly on (b)-clause failure, not just absence of all three shapes.

Re-trigger when ready.

@ben-s-pull-request-reviewer ben-s-pull-request-reviewer Bot dismissed stale reviews from themself June 8, 2026 19:58

superseded by new automated review

@ben-s-pull-request-reviewer ben-s-pull-request-reviewer Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{
  "verdict": "approve",
  "summary": "The concreteness sub-check for E2E verify subtasks is well-anchored: three explicit shapes (procedure/observable/artifact) with a both-required (a)+(b) clause structure, diverse pass examples spanning HTTP/CLI/doc/K8s, and a clear 'LLM quality call' closing that names the rule as the anchor. The dishonest-tick and concreteness sub-checks both live under one hard non-negotiable (E2E verify subtask), and the 'skip this whole check' reword correctly covers both. Cross-reference to `task-writing.md:122-134` is informational only (the phrase list is inlined in `plan-task.md`), so upstream drift is cosmetic. Version alignment preserved at 0.73.0 across CHANGELOG/plugin.json/marketplace.json×2 with the new feature correctly placed in `## unreleased` per the dark-factory release convention. No code changed, no Go build impact, no scenario walk needed.",
  "comments": [
    {
      "file": "commands/plan-task.md",
      "line": 87,
      "severity": "nit",
      "message": "Wording: 'Concrete passes — HTTP: ...' reads as a verb form. Prefer 'Concrete examples' for clarity."
    },
    {
      "file": "commands/plan-task.md",
      "line": 82,
      "severity": "nit",
      "message": "The rule says 'the body must describe ...' but doesn't explicitly say the *entire* body of the subtask (not just the first line). Worth tightening to 'the entire body of that subtask must describe ...' to avoid a single-line interpretation."
    },
    {
      "file": "commands/plan-task.md",
      "line": 71,
      "severity": "nit",
      "message": "The `task-writing.md:122-134` line-range reference is brittle to upstream edits. The phrase list is already inlined in `plan-task.md`, so the ref is informational rather than load-bearing — accept as-is, but consider re-anchoring to a section heading ('Anti-pattern' section) on the next pass for durability."
    }
  ],
  "concerns_addressed": [
    "correctness: concreteness rule is well-tightened — three shapes, both-required (a)+(b), diverse pass examples, explicit LLM-anchor closing; gate is inherently subjective but the wording holds up on borderline cases (procedure-only 'run a check on the endpoint' is explicitly called out as failing the (b) clause)",
    "correctness: cross-reference to `task-writing.md:122-134` is informational only since the phrase list is inlined in `plan-task.md`; line-range drift is cosmetic, not load-bearing — accepted as a known minor",
    "tests: no automated test artifact is appropriate for a slash-command prompt; manual fixtures A1/A2 in the PR description plus the recursive dogfood against this very PR cover the test plan",
    "correctness: 'Skip this whole check' reword correctly covers both sub-checks (dishonest phrases AND concreteness) for non-shipping tasks; shipping-class detection upstream is unchanged"
  ]
}

@bborbe bborbe merged commit 5e4142b into master Jun 8, 2026
2 checks passed
@bborbe bborbe deleted the feat/plan-task-verify-concreteness branch June 8, 2026 19:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant