Skip to content

Add diagram anchor integrity check (EDUENG-613)#23138

Open
ebembi-crdb wants to merge 2 commits intomainfrom
edueng-613-validate-diagram-anchors
Open

Add diagram anchor integrity check (EDUENG-613)#23138
ebembi-crdb wants to merge 2 commits intomainfrom
edueng-613-validate-diagram-anchors

Conversation

@ebembi-crdb
Copy link
Copy Markdown
Contributor

Summary

Adds a script and GitHub Actions workflow that catch broken sql-grammar.html#anchor references in SQL diagram files before they block production builds.

How it works:

  • Scans doc files for {% remote_include ... crdb_branch_name .../grammar_svg/DIAGRAM.html %} tags
  • Fetches each referenced diagram HTML from cockroachdb/generated-diagrams
  • Verifies every sql-grammar.html#ANCHOR link inside it resolves against stmt_block.html on the same branch

Files added:

  • .github/scripts/validate_diagram_anchors.py — stdlib-only Python script, no pip installs required
  • .github/workflows/validate-diagram-anchors.yml — triggers on *.md changes in PRs (changed files only) and runs a full scan daily at 07:15 UTC

On PR failure: posts/updates a bot comment with the broken diagram, branch, and anchor, plus which doc files reference it. Blocks merge.
On scheduled failure: opens a GitHub issue (or updates an existing open one) with label sql-diagram-validation.

Context

Addresses EDUENG-613, part of the follow-up to the production build outage on 2026-01-29 discussed in #docs-site-status.

Root cause of that outage: show_statement_hints.html was added with a link to sql-grammar.html#opt_with_show_hints_options, but that anchor didn't yet exist in stmt_block.html on release-26.1. This check would have caught it before merge.

The branch existence check (EDUENG-614) is in a separate PR: #23137

Test plan

  • Manually run python .github/scripts/validate_diagram_anchors.py src/current/v26.1/show-statement-hints.md from repo root
  • Verify the workflow triggers on a .md change in a test PR and only scans changed files
  • Confirm the daily schedule appears in the Actions tab after merge

Adds a script and daily workflow that scan doc files for remote_include tags
pulling from cockroachdb/generated-diagrams grammar_svg, fetch each referenced
diagram HTML, and verify that every sql-grammar.html#ANCHOR link inside it
resolves against stmt_block.html on the same branch.

This catches the exact failure that blocked production builds on 2026-01-29:
show_statement_hints.html referenced sql-grammar.html#opt_with_show_hints_options
but that anchor did not yet exist in stmt_block.html on release-26.1.

Files added:
- .github/scripts/validate_diagram_anchors.py
- .github/workflows/validate-diagram-anchors.yml
@ebembi-crdb ebembi-crdb requested a review from a team as a code owner March 26, 2026 14:03
@netlify
Copy link
Copy Markdown

netlify bot commented Mar 26, 2026

Deploy Preview for cockroachdb-api-docs canceled.

Name Link
🔨 Latest commit 4c90fa9
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-api-docs/deploys/69cbd0ba0df7c200087200ac

@github-actions
Copy link
Copy Markdown

Files changed:

  • .github/scripts/validate_diagram_anchors.py
  • .github/workflows/validate-diagram-anchors.yml

@netlify
Copy link
Copy Markdown

netlify bot commented Mar 26, 2026

Deploy Preview for cockroachdb-interactivetutorials-docs canceled.

Name Link
🔨 Latest commit 4c90fa9
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-interactivetutorials-docs/deploys/69cbd0ba259e1100086b2dce

@netlify
Copy link
Copy Markdown

netlify bot commented Mar 26, 2026

Deploy Preview for cockroachdb-docs canceled.

Name Link
🔨 Latest commit 4c90fa9
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-docs/deploys/69cbd0badfcae80007dd6e24

import urllib.request
from pathlib import Path
from typing import Optional

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script fetches diagram and stmt_block.html via raw.githubusercontent.com and adds Authorization: Bearer if GITHUB_TOKEN is present. The raw host (raw.githubusercontent.com) does not use GitHub API auth tokens the same way the API does; adding the header is benign but may not help for private repos or rate limits. For authenticated/robust fetches:

Use the GitHub REST API endpoint:

GET /repos/{owner}/{repo}/contents/{path}?ref={branch}

which returns base64-encoded content and works with Authorization: Bearer <GITHUB_TOKEN>.

Or check that the raw URLs will not be rate-limited for your usage pattern (daily + PR checks is probably fine).

# ---------------------------------------------------------------------------
# Parsing helpers
# ---------------------------------------------------------------------------

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_stmt_block_anchors currently uses re.findall(r'\bid="'["']', content) which works in most cases but is fragile for edge HTML (e.g., ids broken across attributes/newlines, or presence of HTML comments/inline scripts). Since you explicitly avoid external deps, consider using Python’s stdlib html.parser to reliably collect id attributes:

…traction

- Replace raw.githubusercontent.com fetch with GitHub Contents API
  (GET /repos/.../contents/...?ref=) so that GITHUB_TOKEN properly
  raises the rate limit and works for private repos; add >1 MB fallback
  via download_url
- Replace fragile id= regex with stdlib html.parser (_IDCollector) for
  robust attribute extraction across edge HTML
- URL-encode path and ref components in the new fetch helper
Copy link
Copy Markdown
Contributor

@mohini-crl mohini-crl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants