From 892da55d4e126f079eed2eb4319375042ec14e57 Mon Sep 17 00:00:00 2001
From: Kuba Sunderland-Ober
Date: Sun, 17 May 2026 20:10:36 +0200
Subject: [PATCH 1/3] Add a python-based link checker that works a wee bit faster than Lychee.

---
 docs/check.bat                              |  14 +-
 docs/lychee.bat                             |  45 ++
 .../lcheck-fixture/site/assets/script.js    |   1 +
 .../lcheck-fixture/site/assets/style.css    |   1 +
 .../site/good-dir-noindex/other.html        |   2 +
 .../lcheck-fixture/site/good-dir/index.html |   3 +
 .../lcheck-fixture/site/good-fallback.html  |   2 +
 .../lcheck-fixture/site/good-fragments.html |   4 +
 experiments/lcheck-fixture/site/good.html   |   3 +
 experiments/lcheck-fixture/site/index.html  |  63 +++
 .../lcheck-fixture/site/subdir/image.png    |   1 +
 .../lcheck-fixture/site/subdir/nested.html  |   4 +
 requirements.txt                            |   1 +
 scripts/check_links.py                      | 401 ++++++++++++++++++
 14 files changed, 541 insertions(+), 4 deletions(-)
 create mode 100644 docs/lychee.bat
 create mode 100644 experiments/lcheck-fixture/site/assets/script.js
 create mode 100644 experiments/lcheck-fixture/site/assets/style.css
 create mode 100644 experiments/lcheck-fixture/site/good-dir-noindex/other.html
 create mode 100644 experiments/lcheck-fixture/site/good-dir/index.html
 create mode 100644 experiments/lcheck-fixture/site/good-fallback.html
 create mode 100644 experiments/lcheck-fixture/site/good-fragments.html
 create mode 100644 experiments/lcheck-fixture/site/good.html
 create mode 100644 experiments/lcheck-fixture/site/index.html
 create mode 100644 experiments/lcheck-fixture/site/subdir/image.png
 create mode 100644 experiments/lcheck-fixture/site/subdir/nested.html
 create mode 100644 requirements.txt
 create mode 100644 scripts/check_links.py

diff --git a/docs/check.bat b/docs/check.bat
index 6d19e3bb..094bbce5 100644
--- a/docs/check.bat
+++ b/docs/check.bat
@@ -1,6 +1,12 @@
-@rem Use lychee to check the links in both build outputs, then scan
+@rem Run the Python-based link checker on both build outputs, then scan
 @rem _site-offline/ for live-site links that survived offlinify.
 @rem
+@rem Same arguments as lychee.bat -- only the executable differs. The Python
+@rem script is faster on this workload (~25x on Windows) and a bit stricter:
+@rem it flags
+
+
+
+

Fixture

+ +

Plain hrefs

+OK file +BROKEN: missing file + +

Fallback extension (only good-fallback.html exists)

+OK via fallback .html +BROKEN: no missing-fallback.html + +

Trailing slash forces directory resolution

+BROKEN: good.html is a file, not a dir +BROKEN: good-fallback resolves to .html via fallback, not via dir + +

Directory with index.html (both forms should be OK)

+OK dir with index +OK dir with index (no slash) + +

Directory without index.html, but `.` in --index-files

+OK: accepts dir itself +OK: same, via dir fallback + +

Missing dirs

+BROKEN: dir does not exist +BROKEN: not a file, not a dir, no fallback + +

Fragments

+OK: same-page fragment +BROKEN: same-page bad fragment +OK: cross-page known fragment +BROKEN: cross-page bad fragment +OK: dir-link with fragment +BROKEN: dir-link bad fragment + +

Relative and absolute paths

+OK relative subdir +OK absolute via root-dir +BROKEN absolute + +

External and skipped schemes

+SKIP: http +SKIP: mailto +SKIP: js +SKIP: tel + +

Images

+OK +BROKEN + +

Image srcset

+srcset mixed
+
+
diff --git a/experiments/lcheck-fixture/site/subdir/image.png b/experiments/lcheck-fixture/site/subdir/image.png
new file mode 100644
index 00000000..7838a96f
--- /dev/null
+++ b/experiments/lcheck-fixture/site/subdir/image.png
@@ -0,0 +1 @@
+fake png
\ No newline at end of file
diff --git a/experiments/lcheck-fixture/site/subdir/nested.html b/experiments/lcheck-fixture/site/subdir/nested.html
new file mode 100644
index 00000000..1c22b823
--- /dev/null
+++ b/experiments/lcheck-fixture/site/subdir/nested.html
@@ -0,0 +1,4 @@
+
+nested
+up and over
+BROKEN: missing from subdir
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 00000000..4eb0354a
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1 @@
+selectolax>=0.4
diff --git a/scripts/check_links.py b/scripts/check_links.py
new file mode 100644
index 00000000..cea09f4f
--- /dev/null
+++ b/scripts/check_links.py
@@ -0,0 +1,401 @@
+"""
+Offline link checker for static sites.
+
+CLI mirrors the subset of lychee flags used by docs/check.bat, so that an
+invocation like
+
+    python scripts/check_links.py --offline --include-fragments
+        --fallback-extensions html --index-files "index.html,."
+        --root-dir docs/_site docs/_site
+
+produces the same correctness verdict as the equivalent lychee call (only
+faster and a bit stricter -- see "Differences from lychee" below).
+
+Why this exists: lychee's offline pipeline funnels every link occurrence
+through an async channel before its dedup cache short-circuits the work.
+On this site (~733k occurrences, ~12k unique targets) that fixed-per-
+occurrence overhead is ~50s on Windows. This script dedupes (target, frag)
+up front, so the filesystem and fragment checks run once per unique target.
+
+Online (network) link checking is not implemented. --offline is therefore
+required; the script exits non-zero if it is absent.
+
+Differences from lychee (correctness):
+  * Trailing slash on a file-shaped URL ('foo.html/') is reported broken,
+    where lychee normalises and accepts. Catches authoring mistakes.
+  *