Read the paper (preprint)
A linear probe on frozen mid-layer activations detects transformer errors that output confidence misses. Whether training preserves this signal depends on architecture and training recipe, not scale.
At a 10% flag rate, 5 to 9% of confident model errors are invisible to the output distribution. Confidence thresholds miss them. Calibrated probabilities miss them. A trained predictor on the full output representation misses them. They reach users undetected.
A single dot product on frozen mid-layer activations catches them. No fine-tuning, no task-specific data. A probe trained on Wikipedia reads the same failure signal zero-shot on medical licensing questions and reading comprehension.
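As a concrete sketch of what "a single dot product" means here: assuming a probe direction w fitted by ridge regression of frozen mid-layer activations against per-token loss (the variable names and the ridge recipe below are illustrative, not the repo's exact training code), scoring a new token is one dot product:

```python
import numpy as np

def fit_probe(acts: np.ndarray, token_loss: np.ndarray, l2: float = 1.0) -> np.ndarray:
    """Fit a linear probe direction on frozen mid-layer activations.

    acts: (n_tokens, d_model) activations from the chosen mid layer.
    token_loss: (n_tokens,) per-token loss on the probe-training corpus.
    Ridge regression is an illustrative choice; only the linearity matters here.
    """
    X = acts - acts.mean(axis=0)
    y = token_loss - token_loss.mean()
    return np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ y)

def probe_scores(acts: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Score tokens at deployment: a single dot product per token, no fine-tuning."""
    return acts @ w
```

A probe fitted once on Wikipedia activations is applied unchanged to other domains; flagging the top-scoring tokens at a fixed flag rate gives numbers like those above.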
Which model you deploy determines whether this signal exists at all. Some architectures undergo observability collapse: the readable mid-layer signal falls from +0.21 to +0.10 and stays there. No layer recovers it. A nonlinear probe does not recover it. The information is not preserved in linearly readable form. Checkpoint dynamics show the collapse is training-emergent: both matched-width Pythia configurations form the signal by the earliest measured checkpoint, but further training erases it in the (24L, 16H) class while the healthy configuration recovers it.
Both panels use the same protocol, the same token budget per hidden dimension, and the same shaded detection band. Left panel: Llama 3.2 under a cross-recipe split, where 1B preserves the signal and 3B and 8B do not. Right panel: Pythia under held-recipe training, where three of nine configurations collapse. All three are 24 layers, 16 heads. The replication spans a 3.5x parameter gap, two Pile variants, and two hidden dimensions. Six other Pythia depths are healthy. No intermediate values appear.
The code, data, and analysis behind the paper. Every number in the PDF traces to a committed JSON in results/ through an automated verification pipeline.
git clone https://github.com/tmcarmichael/nn-observability
cd nn-observability
uv sync # or: pip install -e .
uv run pytest tests/ -q # 410 tests, CPU only, schema + property + smoke
uv run python analysis/run_all.py # permutation test, mixed-effects, variance decomposition

Half to two-thirds of what standard probes measure is confidence in disguise. Raw probe-loss correlation on GPT-2 124M is +0.55; after controlling for max softmax and activation norm, +0.28 survives. Four hand-designed activation statistics that show strong raw correlation all collapse to near zero under the same controls.
The signal that survives is real, linear, and output-independent. Twenty probe initializations converge to the same direction within 0.001. A nonlinear MLP is statistically equivalent. A 512-unit output predictor absorbs no more than a 64-unit bottleneck. The information exists in the model's hidden layers, and the output layer does not preserve it. Output-independence grows with scale, from 34% at GPT-2 124M to 60% at GPT-2 XL.
Scale does not predict whether the signal is present. Configuration does. At matched 3B scale, Qwen produces +0.263 and Llama produces +0.091, a 2.9x gap with non-overlapping per-probe-seed distributions. Within Llama 3.2, the signal is present at 1B and absent at 3B and 8B. Under Pythia's held-recipe training, both 24-layer, 16-head configurations collapse to ~+0.10, with a third replication on the deduplicated Pile variant. Across 16 cross-family models, family membership explains 92% of the variance at permutation p = 0.006.
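The family-effect p-value comes from analysis/permutation_test.py; the sketch below shows the generic shape of such a test, assuming a one-way F statistic over per-model pcorr values with family labels shuffled under the null (the repo's exact statistic and model scope may differ):

```python
import numpy as np

def family_f(values: np.ndarray, families: np.ndarray) -> float:
    """One-way F statistic: between-family variance over within-family variance."""
    grand = values.mean()
    ss_between = ss_within = 0.0
    labels = np.unique(families)
    for fam in labels:
        v = values[families == fam]
        ss_between += len(v) * (v.mean() - grand) ** 2
        ss_within += ((v - v.mean()) ** 2).sum()
    df_b, df_w = len(labels) - 1, len(values) - len(labels)
    return (ss_between / df_b) / (ss_within / df_w)

def permutation_p(values, families, n_perm=10_000, seed=0) -> float:
    """Fraction of family-label shuffles whose F meets or beats the observed F."""
    rng = np.random.default_rng(seed)
    values, families = np.asarray(values, float), np.asarray(families)
    observed = family_f(values, families)
    hits = sum(family_f(values, rng.permutation(families)) >= observed
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)
```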
Every number in the paper traces to a committed JSON through an automated verification pipeline. Pick a claim and verify it:
| Paper claim | Value | Command | Source |
|---|---|---|---|
| Cross-family permutation F (family effect) | p = 0.006 | uv run python analysis/permutation_test.py | 13-model scope in analysis/load_results.py |
| Llama 1B partial correlation | +0.286 | uv run python analysis/load_results.py | results/llama1b_v3_results.json |
| Exclusive catch rate at 20% flag rate | 12-15% | uv run python analysis/exclusive_catch_rates.py | results/transformer_observe.json key 6a |
| Model | Family | Params | pcorr | OC residual |
|---|---|---|---|---|
| Gemma 3 1B* | Gemma | 1B | +0.388* | +0.307 |
| Mistral 7B | Mistral | 7B | +0.313 | +0.156 |
| Phi-3 Mini | Phi | 3.8B | +0.300 | +0.144 |
| GPT-2 XL | GPT-2 | 1.5B | +0.290 | +0.174 |
| Llama 1B | Llama | 1.2B | +0.286 | +0.120 |
| Qwen 7B | Qwen | 7B | +0.255 | +0.137 |
| **Llama 8B** | Llama | 8B | +0.093 | -0.007 |
| **Llama 3B** | Llama | 3B | +0.091 | +0.031 |
Sorted by signal strength. Every row except the bold Llama entries produces observability above +0.19. Gemma 3 1B* has anomalous representation geometry: a random untrained probe achieves +0.213, so the high score reflects this artifact rather than stronger observability. Within Llama, the signal is present at 1B and absent at 3B and 8B. Same lab, same training pipeline, different architectural configuration.
pcorr: partial Spearman correlation between probe scores and per-token loss, controlling for max softmax probability and activation norm. OC residual: the additional partial correlation after also controlling for a trained MLP on the last-layer activations. All values are 7-seed means on WikiText-103, evaluated at each model's peak layer with matched token budget per hidden dimension. GPT-2 family uses 3 seeds. The full 13-model table with standard deviations, seed agreement, and random head baselines is in the paper.
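For readers who want the pcorr definition as code: a minimal sketch of partial Spearman correlation, assuming the standard rank-then-residualize construction (rank all four variables, regress the ranked controls out of both the ranked probe scores and the ranked loss, correlate the residuals). The repo's implementation may differ in detail.

```python
import numpy as np
from scipy.stats import rankdata, pearsonr

def _residualize(y: np.ndarray, controls: np.ndarray) -> np.ndarray:
    """Remove the least-squares fit of the controls (plus intercept) from y."""
    X = np.column_stack([np.ones(len(y)), controls])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def partial_spearman(probe_scores, token_loss, max_softmax, act_norm) -> float:
    """Partial Spearman correlation of probe scores with per-token loss,
    controlling for max softmax probability and activation norm."""
    ranks = [rankdata(v) for v in (probe_scores, token_loss, max_softmax, act_norm)]
    controls = np.column_stack(ranks[2:])
    return pearsonr(_residualize(ranks[0], controls),
                    _residualize(ranks[1], controls))[0]
```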
Across 25 models spanning seven families, the output-controlled residual tracks partial correlation with slope 0.88. Collapse points sit near the origin. A monitoring tool that reads the mid-layer signal exposes information not recovered by the tested output-side predictors, and this surplus vanishes at exactly the configurations where the partial correlation collapses.
Three documented boundaries that any deployment should respect.
Fluent factual errors. TruthfulQA isolates the subset of confidently wrong answers where the model asserts a smooth falsehood. The observer scores at or near chance on this subset, with AUC between 0.499 and 0.568 across three production instruct models. Activation monitoring catches token-level prediction failures, not learned falsehoods.
Architectures where the signal collapsed. Llama 3B and 8B, and the Pythia 24-layer, 16-head configurations. On these, the linear probe scores at the detection floor, and a nonlinear probe tuned on held-out data does not cross it. Whether a deployed model is observable in this sense is a property of the architecture, not of better tooling.
Adversarial evasion. McGuinness et al. (2025) show that activation monitors can be evaded under training pressure. The observer has not been tested against adaptive attacks. A PC1 cosine of 0.002 indicates the observer direction does not lie on a dominant variance axis, but the threat model still applies.
The observer's value is the complementary catch: errors that confidence marks as correct. Confidence has higher single-signal precision at every flag rate. Use both.
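A sketch of how the complementary catch can be measured at a fixed flag rate, with hypothetical arrays for confidence, probe scores, and error labels (the committed numbers come from analysis/exclusive_catch_rates.py, which this does not reproduce exactly):

```python
import numpy as np

def flags_at_rate(scores: np.ndarray, flag_rate: float) -> np.ndarray:
    """Flag the top flag_rate fraction of tokens by score."""
    return scores >= np.quantile(scores, 1.0 - flag_rate)

def exclusive_catch_rate(probe_scores, confidence, is_error, flag_rate=0.10) -> float:
    """Fraction of errors the probe flags that the confidence signal misses,
    with both signals held to the same flag rate."""
    probe_flag = flags_at_rate(probe_scores, flag_rate)
    conf_flag = flags_at_rate(-confidence, flag_rate)  # low confidence gets flagged
    errors = np.asarray(is_error, bool)
    return (probe_flag & ~conf_flag & errors).sum() / errors.sum()

# "Use both": in deployment, route a token to review if either signal flags it.
```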
pip install -e ".[transformer]" # or: uv sync --extra transformer
python scripts/run_model.py \
--model Qwen/Qwen2.5-7B \
--output qwen7b_results.json

This runs the full protocol: layer sweep, 7-seed evaluation, output-controlled residual, cross-domain transfer, control sensitivity, and flagging analysis. Output is a self-contained JSON with provenance metadata.
To add the result to the analysis scope, validate the JSON and add one line to analysis/load_results.py:
just validate-results # check required fields

# In analysis/load_results.py, add to the appropriate family list:
QWEN_MODELS = [
    ...
    ("qwen7b_v3_results.json", 7.0, "Qwen 7B"),   # existing
    ("your_model_results.json", 7.0, "Your 7B"),  # new entry
]

Then uv run python analysis/run_all.py includes the new model in every statistical test. See analysis/README.md for the full schema and checklist.
src/ Core library (probe, observer, experiment engine)
scripts/ GPU experiment launchers (run_model.py is the entry point)
analysis/ CPU statistical analysis (permutation test, mixed-effects, schema validation)
results/ All result JSONs (committed, reproducible, schema-validated)
figures/ Shared matplotlib style and save helper
tests/ Schema, metrics, analysis smoke, probe-sync drift guards
Full directory map and script descriptions in analysis/README.md and results/README.md.
The analysis package is the stable public API (v3.x). Install the repo as a package (uv sync or pip install -e .) and import directly:
from analysis import load_all_models, load_model_means, family_f_stat, validate_all

Nine exported functions cover data loading, statistical primitives, and schema validation. See analysis/__init__.py for the full list with descriptions.
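A hypothetical usage sketch; the call forms below are assumptions, and the authoritative signatures and docstrings live in analysis/__init__.py:

```python
# Assumed call forms, for illustration only; see analysis/__init__.py for the real API.
from analysis import load_all_models, family_f_stat, validate_all

validate_all()                # schema-check the committed result JSONs
models = load_all_models()    # load the cross-family scope used in the paper
print(family_f_stat(models))  # family-effect statistic behind the permutation test
```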
| I want to | Start here |
|---|---|
| Read the paper | Zenodo pre-print |
| Run the tests | uv run pytest tests/ -q |
| Run the full analysis pipeline | uv run python analysis/run_all.py |
| Reproduce a specific paper number | "Reproduce a paper number" table above |
| See the raw experimental data | results/*.json (every paper number traces here) |
| Walk through the analysis pipeline | notebooks/walkthrough_analysis.ipynb (CPU-only, no GPU) |
| Use the analysis library in your own code | analysis/__init__.py (public API, stable across v3.x) |
| Add my own model to the cross-family scope | "Run it on your model" section above, then analysis/README.md |
| Understand the result-JSON schema | analysis/load_results.py and results/README.md |
| Look at how a specific number was produced | notebooks/README.md (per-model run history) |
Cite the paper and the code separately. Both share a Zenodo concept DOI that resolves to the latest version; pin to a specific version DOI from the Zenodo record for reproducibility.
@article{carmichael2026observability,
title={Architecture Determines Observability in Transformers},
author={Carmichael, Thomas},
year={2026},
journal={Zenodo pre-print},
doi={10.5281/zenodo.19435674},
url={https://doi.org/10.5281/zenodo.19435674},
note={v3.3.0}
}
@software{carmichael2026code,
title={nn-observability: code for ``Architecture Determines Observability in Transformers''},
author={Carmichael, Thomas},
year={2026},
version={3.3.0},
doi={10.5281/zenodo.19435674},
url={https://github.com/tmcarmichael/nn-observability}
}
