docs(huggingface): document safetensors header padding bug + repair utility #809
Open
lockewerks wants to merge 2 commits into
…ders

The SafeTensorsWriter in vendor/ruvector/.../export.js zero-initialises its output buffer and then copies the JSON header in without overwriting the padding zone, so the bytes between the JSON's last '}' and the declared 8-byte-aligned header length are left as 0x00 instead of the spec-required 0x20 (space). Strict readers (the Rust safetensors crate, Candle, and the safetensors.torch.load_file Python helper that wraps the Rust binding) reject the file with 'trailing characters at line 1 column N+1'. This is why model.safetensors at huggingface.co/ruvnet/wifi-densepose-pretrained currently fails to load anywhere outside our hand-rolled JS / Python parsers (both of which strip trailing NULs before json.loads).

The utility opens a .safetensors file, locates the header zone, detects NUL padding, and rewrites just the padding bytes with 0x20. Declared header length, JSON content, and every tensor byte are preserved; only the padding bytes flip from NUL to space, so the SHA-256 of the tensor data is unchanged. Idempotent (a clean file reports 'already clean' and exits 0 without rewriting), supports --dry-run, accepts multiple paths.
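The repair step described in this commit message can be sketched roughly as follows. This is a minimal illustration operating on in-memory bytes, not the code in scripts/fix-safetensors-header.py; the function name `repad_header` and the synthetic one-tensor file are assumptions made for the example.

```python
import json
import struct


def repad_header(data: bytes) -> tuple[bytes, int]:
    """Replace trailing NUL padding in a safetensors header with spaces.

    Returns (patched_bytes, number_of_bytes_changed). Hypothetical helper,
    not the implementation shipped in scripts/fix-safetensors-header.py.
    """
    if len(data) < 8:
        raise ValueError("file too short for the 8-byte length prefix")
    # The file starts with a little-endian u64 giving the header length.
    (header_len,) = struct.unpack("<Q", data[:8])
    end = 8 + header_len
    if end > len(data):
        raise ValueError("declared header length exceeds file size")
    header = data[8:end]
    # Only the run of NULs at the very end of the header zone is padding.
    json_part = header.rstrip(b"\x00")
    n_fixed = len(header) - len(json_part)
    if n_fixed == 0:
        return data, 0  # already clean, nothing to rewrite
    patched = json_part + b" " * n_fixed
    # The length prefix and the tensor bytes are copied through untouched.
    return data[:8] + patched + data[end:], n_fixed


# Synthetic file (an assumption for illustration): one 4-byte tensor and
# three NUL padding bytes after the JSON.
body = json.dumps({"w": {"dtype": "F32", "shape": [1], "data_offsets": [0, 4]}}).encode()
header = body + b"\x00" * 3
broken = struct.pack("<Q", len(header)) + header + b"\x01\x02\x03\x04"

fixed, n_fixed = repad_header(broken)
again, n_again = repad_header(fixed)  # second pass should change nothing
```

Because the patched header has the same length as the original, the length prefix and every tensor offset stay valid, which is what keeps the operation safe to run in place and idempotent.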
The model.safetensors file currently published at huggingface.co/ruvnet/wifi-densepose-pretrained has a malformed header: the 8-byte u64 declares 1464 header bytes, the JSON document ends at byte 1461, and the last 3 bytes of the header zone are literal 0x00 padding instead of the spec-required 0x20 spaces. Strict safetensors readers (Rust safetensors crate, Candle, safetensors.torch.load_file) reject with 'SafetensorError: trailing characters at line 1 column 1462'.

This commit:

- adds docs/huggingface/SAFETENSORS-HEADER-BUG.md with byte-level evidence, spec citation, source-of-bug location (the SafeTensorsWriter in vendor/ruvector/.../export.js, a separate repo at ruvnet/ruvector), list of three trainer scripts that go through this path (train-wiflow.js, train-ruvllm.js, train-camera-free.js), table of affected vs lenient consumers, 10-line strict-reader repro that reproduces the exact error class against a synthetic file, proposed upstream fix (0x20 padding or no padding), and a follow-ups checklist including the need to re-train/re-export and re-upload the HF artifact
- flags the bundle as needing republish under [Unreleased] in CHANGELOG.md
- updates the HF model section of docs/user-guide.md so the load example now patches the header with scripts/fix-safetensors-header.py before calling safetensors.torch.load_file (which would otherwise crash on the current bundle), and flips the Python/PyTorch row of the consumer-status table from 'Works' to 'Broken header — strict readers reject; patch with scripts/fix-safetensors-header.py'
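The byte layout described above (1464 declared bytes, JSON ending at 1461, three NUL bytes) can be reproduced on a synthetic header. The sketch below is an illustration only: `describe_header` is a hypothetical helper, the header content is filler sized to match the reported offsets, and Python's `json` module stands in for the strict Rust parser as an analogous reader that rejects non-whitespace trailing bytes.

```python
import json
import struct


def describe_header(data: bytes) -> dict:
    """Report declared header length, JSON length, and padding byte counts."""
    (declared,) = struct.unpack("<Q", data[:8])
    header = data[8:8 + declared]
    # Everything after the last non-padding byte is treated as padding.
    json_len = len(header.rstrip(b"\x00 "))
    padding = header[json_len:]
    return {
        "declared_len": declared,
        "json_len": json_len,
        "nul_padding": padding.count(b"\x00"),
        "space_padding": padding.count(b" "),
    }


# Synthetic header (not the published file): filler metadata sized so the
# JSON is 1461 bytes, followed by 3 NUL bytes, for 1464 declared bytes.
empty = json.dumps({"__metadata__": {"note": ""}})
body = json.dumps({"__metadata__": {"note": "x" * (1461 - len(empty))}}).encode()
header = body + b"\x00" * 3
data = struct.pack("<Q", len(header)) + header

info = describe_header(data)

# A strict JSON parse of the whole header zone fails on the NUL bytes,
# because NUL is not JSON whitespace.
try:
    json.loads(header.decode("utf-8"))
    strict_error = None
except json.JSONDecodeError as exc:
    strict_error = exc
```

On this synthetic input Python's parser raises an "Extra data" error at column 1462, the same position the Rust reader reports for the published bundle; with the three NULs replaced by spaces the same parse succeeds.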
Summary
The `model.safetensors` file currently published at https://huggingface.co/ruvnet/wifi-densepose-pretrained has a malformed header that strict safetensors readers (the Rust `safetensors` crate, Candle, and `safetensors.torch.load_file` via its Rust binding) reject with `SafetensorError: trailing characters at line 1 column 1462`.

Lenient readers (custom JSON parsers that strip trailing NULs, e.g. `scripts/export-onnx.py` and the JS `SafeTensorsReader` in `vendor/ruvector`) accept it, which is why this hasn't surfaced until someone tried loading the bundle from strict Rust.

Root cause
`SafeTensorsWriter.build()` in `vendor/ruvector/npm/packages/ruvllm/src/export.js:95-105` (vendored from `ruvnet/ruvector`) zero-initialises its output buffer and copies the JSON header in without overwriting the padding zone, so the bytes between the end of the JSON and the declared header length are left as `\x00` instead of the spec-required `0x20` (space).

Three trainers consume this writer: `train-wiflow.js:933`, `train-ruvllm.js:1541`, `train-camera-free.js:2276`. The Python writer in `scripts/train-count.py::write_safetensors` is unaffected (it sizes the JSON exactly with no padding, which is spec-compliant). The HF publisher (`scripts/publish-huggingface.py`) only uploads bytes; it doesn't touch them.

What this PR ships
This PR cannot fix the upstream `ruvnet/ruvector` writer (separate repo, vendored as a submodule here). Instead:

- `docs/huggingface/SAFETENSORS-HEADER-BUG.md`: full byte-level analysis (offsets, declared header length 1464 vs JSON termination at 1461, three NUL padding bytes), spec citation, repro snippet using `safetensors::SafeTensors::deserialize`, follow-up plan
- `scripts/fix-safetensors-header.py`: user-side repair utility. Opens a `.safetensors` file, locates the header zone via the leading u64 length prefix, detects NUL bytes after the closing JSON brace, rewrites those bytes in-place as ASCII spaces. Tensor data byte-preserved. Idempotent. Supports `--dry-run` and multiple file arguments.
- `CHANGELOG.md`: `[Unreleased]` entry under Known Issues + Added blocks
- `docs/user-guide.md`: Hugging Face model section now warns about the broken header and points readers at the repair utility until the upstream republish lands

Follow-ups (separate PRs needed)
- `ruvnet/ruvector`: fix `SafeTensorsWriter.build()` to pad with `0x20` (or drop alignment padding entirely). One-line change.
- Update the `vendor/ruvector` submodule pointer once the fix lands.
- Re-export `model.safetensors` with the fixed writer and re-upload to `ruvnet/wifi-densepose-pretrained`. Consumers that pinned a `revision=` will keep pulling the broken file until then.
- Add a check to `scripts/publish-huggingface.py` that strict-loads every `.safetensors` before upload to prevent regression.

Verification
- `cargo build --workspace --no-default-features`: clean (warnings only)
- `python -m py_compile scripts/fix-safetensors-header.py`: OK
- Synthetic repro: pre-fix, `safetensors.safe_open` raises `SafetensorError: trailing characters at line 1 column N+1` (same class as the published bundle's `column 1462`); post-fix the same file loads cleanly via `safe_open`. Declared header length, file size, and tensor payload all bit-preserved. Second run reports `already clean` (idempotent).

Test plan
- `python scripts/fix-safetensors-header.py --dry-run <path>` reports the patch without writing
- `python scripts/fix-safetensors-header.py <path>` patches in place, exits 0
- Second run reports `already clean`, exits 0
- Patched file loads via `safetensors.torch.load_file` without error
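
The invariants claimed under Verification (declared header length, file size, and tensor payload preserved, NUL padding gone) could be checked mechanically along these lines. This is a sketch under stated assumptions: `split_file` and `check_repair` are hypothetical names, and the before/after bytes are built by hand rather than produced by the repair utility.

```python
import hashlib
import struct


def split_file(data: bytes) -> tuple[bytes, bytes, bytes]:
    """Split a safetensors file into (length prefix, header zone, payload)."""
    (declared,) = struct.unpack("<Q", data[:8])
    return data[:8], data[8:8 + declared], data[8 + declared:]


def check_repair(before: bytes, after: bytes) -> list[str]:
    """Return a list of violated invariants; an empty list means the repair is sound."""
    prefix_b, header_b, payload_b = split_file(before)
    prefix_a, header_a, payload_a = split_file(after)
    problems = []
    if len(after) != len(before):
        problems.append("file size changed")
    if prefix_a != prefix_b:
        problems.append("declared header length changed")
    if hashlib.sha256(payload_a).digest() != hashlib.sha256(payload_b).digest():
        problems.append("tensor payload changed")
    # Compare the JSON text with any padding (NUL or space) stripped.
    if header_a.rstrip(b"\x00 ") != header_b.rstrip(b"\x00 "):
        problems.append("JSON content changed")
    padding_a = header_a[len(header_a.rstrip(b"\x00 ")):]
    if b"\x00" in padding_a:
        problems.append("NUL padding remains")
    return problems


# Hand-built example files (assumptions for illustration).
json_part = b'{"w":{"dtype":"F32","shape":[1],"data_offsets":[0,4]}}'
payload = b"\x01\x02\x03\x04"
prefix = struct.pack("<Q", len(json_part) + 3)
before = prefix + json_part + b"\x00" * 3 + payload
after = prefix + json_part + b" " * 3 + payload

ok = check_repair(before, after)
bad = check_repair(before, after[:-1] + b"\xff")  # one payload byte flipped
unpatched = check_repair(before, before)          # nothing was repaired
```

A check like this complements, rather than replaces, loading the patched file with a strict reader, since it only compares bytes and does not validate the header against the tensor data.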