docs(huggingface): document safetensors header padding bug + repair utility #809

Open
lockewerks wants to merge 2 commits into ruvnet:main from lockewerks:fix/safetensors-header-padding

Summary

The model.safetensors file currently published at https://huggingface.co/ruvnet/wifi-densepose-pretrained has a malformed header that strict safetensors readers (Rust safetensors crate, Candle, safetensors.torch.load_file via its Rust binding) reject:

SafetensorError: trailing characters at line 1 column 1462

Lenient readers (custom JSON parsers that strip trailing NULs, e.g. scripts/export-onnx.py and the JS SafeTensorsReader in vendor/ruvector) accept it, which is why this hasn't surfaced until someone tried loading the bundle from strict Rust.
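
To make the error column concrete, the header layout described in this PR (a little-endian u64 length prefix, the JSON document, then padding up to the declared length) can be inspected with a few lines of standard-library Python. This is a minimal sketch against a synthetic blob sized to match the offsets reported here (declared length 1464, JSON ending at byte 1461). `describe_header` and the `{"k": ...}` filler document are hypothetical names used for illustration only; they are not part of the repair utility.

```python
import struct


def describe_header(blob: bytes) -> dict:
    """Summarise the header zone of a safetensors byte string."""
    (declared_len,) = struct.unpack("<Q", blob[:8])  # little-endian u64 prefix
    header = blob[8 : 8 + declared_len]
    json_len = header.rfind(b"}") + 1                # bytes up to and including the last '}'
    padding = header[json_len:]
    return {
        "declared_len": declared_len,
        "json_len": json_len,
        "padding": padding,
        # 1-based position of the first padding byte that is not a space, if any
        "first_bad_column": next(
            (json_len + i + 1 for i, b in enumerate(padding) if b != 0x20), None
        ),
    }


# Synthetic header (assumption): 1461 bytes of JSON followed by three NUL bytes.
filler = 1461 - len(b'{"k":""}')
body = b'{"k":"' + b"x" * filler + b'"}'
blob = struct.pack("<Q", 1464) + body + b"\x00" * 3

info = describe_header(blob)
print(info["declared_len"], info["json_len"], info["first_bad_column"])
# prints: 1464 1461 1462
```

The first non-space byte after the JSON sits at position 1462, which lines up with the column in the reported error if that column is a 1-based byte position within the header.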

Root cause

SafeTensorsWriter.build() in vendor/ruvector/npm/packages/ruvllm/src/export.js:95-105 (vendored from ruvnet/ruvector):

  • Zero-initialises the header buffer
  • Copies in the JSON
  • Pads to the next 8-byte alignment boundary
  • Never overwrites the padding zone, so it stays \x00 instead of the spec-required 0x20 (space)
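
The four steps above can be sketched in Python to show why the fill byte of the buffer is the whole bug. This is an illustrative port under stated assumptions, not the vendored export.js code: `build_header` and its `pad_byte` parameter are hypothetical, and the one-tensor metadata is made up for the example.

```python
import json
import struct


def build_header(metadata: dict, pad_byte: bytes) -> bytes:
    """Return the 8-byte length prefix followed by the padded JSON header."""
    body = json.dumps(metadata, separators=(",", ":")).encode("utf-8")
    padded_len = (len(body) + 7) // 8 * 8   # round up to the next 8-byte boundary
    buf = bytearray(pad_byte * padded_len)  # pre-filled buffer
    buf[: len(body)] = body                 # copy the JSON in; the tail keeps the fill byte
    return struct.pack("<Q", padded_len) + bytes(buf)


# Hypothetical one-tensor header, used only to exercise the two fill bytes.
meta = {"w": {"dtype": "F32", "shape": [1], "data_offsets": [0, 4]}}
buggy = build_header(meta, b"\x00")  # zero-initialised, as described above
fixed = build_header(meta, b" ")     # space-filled padding

print(json.loads(fixed[8:]) == meta)  # trailing spaces are valid JSON whitespace
try:
    json.loads(buggy[8:])
except json.JSONDecodeError as exc:
    print("rejected:", exc.msg)       # trailing NULs are not
```

Both outputs have the same length and the same declared header size; only the padding bytes differ, which is why a fix in the writer can be as small as changing the fill value.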

Three trainers consume this writer: train-wiflow.js:933, train-ruvllm.js:1541, train-camera-free.js:2276. The Python writer in scripts/train-count.py::write_safetensors is unaffected (it sizes the JSON exactly with no padding, which is spec-compliant). The HF publisher (scripts/publish-huggingface.py) only uploads bytes; it doesn't touch them.

What this PR ships

This PR cannot fix the upstream ruvnet/ruvector writer (separate repo, vendored as a submodule here). Instead:

  • docs/huggingface/SAFETENSORS-HEADER-BUG.md — full byte-level analysis (offsets, declared header length 1464 vs JSON termination at 1461, three NUL padding bytes), spec citation, repro snippet using safetensors::SafeTensors::deserialize, follow-up plan
  • scripts/fix-safetensors-header.py — user-side repair utility. Opens a .safetensors file, locates the header zone via the leading u64 length prefix, detects NUL bytes after the closing JSON brace, rewrites those bytes in-place as ASCII spaces. Tensor data byte-preserved. Idempotent. Supports --dry-run and multiple file arguments.
  • CHANGELOG.md: [Unreleased] entry under the Known Issues and Added blocks
  • docs/user-guide.md — Hugging Face model section now warns about the broken header and points readers at the repair utility until the upstream republish lands
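
The in-place repair described for scripts/fix-safetensors-header.py can be sketched roughly as follows. This is not the script shipped in this PR: `repair_header_padding` and its `dry_run` parameter are hypothetical stand-ins, and the sketch assumes the header JSON ends at the last `}` in the header zone, which holds when everything after it is padding.

```python
import struct


def repair_header_padding(path: str, dry_run: bool = False) -> int:
    """Rewrite NUL padding after the header JSON as ASCII spaces.

    Returns the number of NUL bytes found; 0 means the file was already clean.
    """
    with open(path, "rb" if dry_run else "r+b") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # leading u64 length prefix
        header = f.read(header_len)
        if len(header) != header_len:
            raise ValueError("file is shorter than its declared header length")

        json_end = header.rfind(b"}") + 1               # one past the closing brace
        if json_end == 0:
            raise ValueError("no closing brace found in header")
        tail = header[json_end:]
        if tail.strip(b"\x00 "):
            raise ValueError("unexpected bytes after the header JSON")

        nul_count = tail.count(b"\x00")
        if nul_count and not dry_run:
            f.seek(8 + json_end)                        # skip prefix and JSON
            f.write(tail.replace(b"\x00", b" "))        # same length, so nothing shifts
        return nul_count
```

Because the replacement has exactly the same length as the padding it overwrites, the declared header length, the file size, and every tensor offset stay valid, and a second call finds no NUL bytes and writes nothing.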

Follow-ups (separate PRs needed)

  1. In ruvnet/ruvector: fix SafeTensorsWriter.build() to pad with 0x20 (or drop alignment padding entirely). One-line change.
  2. In this repo: bump the vendor/ruvector submodule pointer once the fix lands.
  3. Re-export model.safetensors with the fixed writer and re-upload to ruvnet/wifi-densepose-pretrained. Consumers that pinned a revision= to a commit from before the re-upload will keep pulling the broken file until they move the pin.
  4. Add a release-time check to scripts/publish-huggingface.py that strict-loads every .safetensors before upload to prevent regression.
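
For follow-up 4, a release-time gate might look roughly like the sketch below. It is an assumption-laden stand-in: the real check would ideally load each file with an actual strict safetensors reader, whereas this version only validates the header with Python's standard `json` module (which also rejects trailing NULs). `strict_header_error`, `find_bad_files`, and the `MAX_HEADER_LEN` cap are hypothetical names and values, not existing code in scripts/publish-huggingface.py.

```python
import json
import struct

MAX_HEADER_LEN = 100_000_000  # sanity cap chosen for this sketch (assumption)


def strict_header_error(path):
    """Return None if the header parses as strict JSON, else a short reason."""
    with open(path, "rb") as f:
        prefix = f.read(8)
        if len(prefix) != 8:
            return "file is shorter than the 8-byte length prefix"
        (header_len,) = struct.unpack("<Q", prefix)
        if header_len > MAX_HEADER_LEN:
            return "declared header length is implausibly large"
        header = f.read(header_len)
    if len(header) != header_len:
        return "file is shorter than its declared header length"
    if not header.startswith(b"{"):
        return "header does not start with '{'"
    try:
        json.loads(header.decode("utf-8"))  # trailing spaces pass, trailing NULs fail
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        return f"header is not strict JSON: {exc}"
    return None


def find_bad_files(paths):
    """Map each failing path to its reason; an empty dict means safe to upload."""
    failures = {}
    for path in paths:
        reason = strict_header_error(path)
        if reason is not None:
            failures[str(path)] = reason
    return failures
```

A publisher could call `find_bad_files` on every .safetensors path it is about to upload and abort with a non-zero exit code if the result is non-empty.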

Verification

  • cargo build --workspace --no-default-features — clean (warnings only)
  • python -m py_compile scripts/fix-safetensors-header.py — OK
  • End-to-end self-test of the utility against a synthetic NUL-padded file: pre-fix, safetensors.safe_open raises SafetensorError: trailing characters at line 1 column N+1 (same class as the published bundle's column 1462); post-fix, the same file loads cleanly via safe_open. Declared header length, file size, and tensor payload are all bit-preserved. A second run reports already clean (idempotent).
  • Strict-reader repro in the doc (the 10-line script under "Repro") was executed and produces the documented error.

Test plan

  • python scripts/fix-safetensors-header.py --dry-run <path> reports the patch without writing
  • python scripts/fix-safetensors-header.py <path> patches in place, exits 0
  • Re-running on the already-patched file reports already clean, exits 0
  • Patched file loads via safetensors.torch.load_file without error

…ders

The SafeTensorsWriter in vendor/ruvector/.../export.js zero-initialises its
output buffer and then copies the JSON header in without overwriting the
padding zone, so the bytes between the JSON's last '}' and the declared
8-byte-aligned header length are left as 0x00 instead of the spec-required
0x20 (space). Strict readers — the Rust safetensors crate, Candle, and
the safetensors.torch.load_file Python helper that wraps the Rust binding —
reject the file with 'trailing characters at line 1 column N+1'. This is
why model.safetensors at huggingface.co/ruvnet/wifi-densepose-pretrained
currently fails to load anywhere outside our hand-rolled JS / Python
parsers (both of which strip trailing NULs before json.loads).

The utility opens a .safetensors file, locates the header zone, detects
NUL padding, and rewrites just the padding bytes with 0x20. Declared
header length, JSON content, and every tensor byte are preserved — only
the padding bytes flip from NUL to space, so the SHA-256 of the tensor
data is unchanged. Idempotent (a clean file reports 'already clean' and
exits 0 without rewriting), supports --dry-run, accepts multiple paths.

The model.safetensors file currently published at
huggingface.co/ruvnet/wifi-densepose-pretrained has a malformed header:
the 8-byte u64 declares 1464 header bytes, the JSON document ends at
byte 1461, and the last 3 bytes of the header zone are literal 0x00
padding instead of the spec-required 0x20 spaces. Strict safetensors
readers — Rust safetensors crate, Candle, safetensors.torch.load_file —
reject with 'SafetensorError: trailing characters at line 1 column 1462'.

This commit:
- adds docs/huggingface/SAFETENSORS-HEADER-BUG.md with byte-level
  evidence, spec citation, source-of-bug location (the SafeTensorsWriter
  in vendor/ruvector/.../export.js — separate repo at ruvnet/ruvector),
  list of three trainer scripts that go through this path
  (train-wiflow.js, train-ruvllm.js, train-camera-free.js), table of
  affected vs lenient consumers, 10-line strict-reader repro that
  reproduces the exact error class against a synthetic file, proposed
  upstream fix (0x20 padding or no padding), and a follow-ups checklist
  including the need to re-train/re-export and re-upload the HF artifact
- flags the bundle as needing republish under [Unreleased] in CHANGELOG.md
- updates the HF model section of docs/user-guide.md so the load example
  now patches the header with scripts/fix-safetensors-header.py before
  calling safetensors.torch.load_file (which would otherwise crash on
  the current bundle), and flips the Python/PyTorch row of the
  consumer-status table from 'Works' to 'Broken header — strict readers
  reject; patch with scripts/fix-safetensors-header.py'