Architecture Determines Observability in Transformers

A linear probe on frozen mid-layer activations detects transformer errors that output confidence misses. Whether training preserves this signal depends on architecture and training recipe, not scale.

5 to 9% of confident model errors at 10% flag rate are invisible to the output distribution. Confidence thresholds miss them. Calibrated probabilities miss them. A trained predictor on the full output representation misses them. They reach users undetected.

A single dot product on frozen mid-layer activations catches them. No fine-tuning, no task-specific data. A probe trained on Wikipedia reads the same failure signal zero-shot on medical licensing questions and reading comprehension.
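
As a rough illustration of what "a single dot product" means at inference time, the sketch below scores each token with one inner product and flags the top 10%. It is not the repo's implementation: `probe_scores` and `flag_top_fraction` are hypothetical helper names, and both the activations and the probe direction are random placeholders standing in for frozen mid-layer activations and a trained direction.

```python
import numpy as np

def probe_scores(activations, direction, bias=0.0):
    """Score each token with one dot product: score_t = <h_t, w> + b."""
    return activations @ direction + bias

def flag_top_fraction(scores, flag_rate):
    """Boolean mask selecting the highest-scoring fraction of tokens."""
    k = max(1, int(round(flag_rate * len(scores))))
    threshold = np.partition(scores, -k)[-k]  # k-th largest score
    return scores >= threshold

rng = np.random.default_rng(0)
hidden = rng.normal(size=(1000, 64))  # placeholder for frozen mid-layer activations
w = rng.normal(size=64)               # placeholder for a trained probe direction
w /= np.linalg.norm(w)

scores = probe_scores(hidden, w)
flags = flag_top_fraction(scores, flag_rate=0.10)
print(scores.shape, int(flags.sum()))
```

How the direction is trained is not shown here; the point is only that scoring adds one vector product per token and needs no change to the model's weights.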

Which model you deploy determines whether this signal exists at all. Some architectures undergo observability collapse: the mid-layer readable signal falls from +0.21 to +0.10 and stays there. No layer recovers it. A nonlinear probe does not recover it. The information is not preserved in linearly readable form. Checkpoint dynamics show this is training-emergent: both matched-width Pythia configurations form the signal at the earliest measured checkpoint, but training erases it in the (24L, 16H) class while the healthy configuration recovers.

Figure: both panels use the same protocol, the same token budget per hidden dimension, and the same shaded detection band. Left panel: Llama 3.2 under a cross-recipe split, where 1B preserves the signal and 3B and 8B do not. Right panel: Pythia under held-recipe training, where three of nine configurations collapse. All three are 24 layers, 16 heads. The replication spans a 3.5x parameter gap, two Pile variants, and two hidden dimensions. Six other Pythia depths are healthy. No intermediate values appear.

What this repo contains

The code, data, and analysis behind the paper. Every number in the PDF traces to a committed JSON in results/ through an automated verification pipeline.

git clone https://github.com/tmcarmichael/nn-observability
cd nn-observability
uv sync                             # or: pip install -e .

uv run pytest tests/ -q             # 410 tests, CPU only, schema + property + smoke
uv run python analysis/run_all.py   # permutation test, mixed-effects, variance decomposition

The finding

Half to two-thirds of what standard probes measure is confidence in disguise. Raw probe-loss correlation on GPT-2 124M is +0.55. After controlling for max softmax and activation norm, +0.28 survives. Four hand-designed activation statistics that show strong raw correlation all collapse to near zero under the same controls.

The signal that survives is real, linear, and output-independent. Twenty probe initializations converge to the same direction within 0.001. A nonlinear MLP is statistically equivalent. A 512-unit output predictor absorbs no more than a 64-unit bottleneck. The information exists in the model's hidden layers, and the output layer does not preserve it. Output-independence grows with scale, from 34% at GPT-2 124M to 60% at GPT-2 XL.

Scale does not predict whether the signal is present. Configuration does. At matched 3B scale, Qwen produces +0.263 and Llama produces +0.091, a 2.9x gap with non-overlapping per-probe-seed distributions. Within Llama 3.2, the signal is present at 1B and absent at 3B and 8B. Under Pythia's held-recipe training, both 24-layer, 16-head configurations collapse to ~+0.10, with a third replication on the deduplicated Pile variant. Across 16 cross-family models, family membership explains 92% of the variance at permutation p = 0.006.
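
For readers unfamiliar with the statistic behind "family membership explains 92% of the variance at permutation p = 0.006", here is a minimal sketch of a one-way F statistic, eta squared, and a label-permutation test. It is an assumption about the general technique, not a copy of `analysis/permutation_test.py`. The function names are hypothetical and the toy values are made up for illustration; they are not results from the paper.

```python
import numpy as np

def _sums_of_squares(values, labels):
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    groups = [values[labels == g] for g in np.unique(labels)]
    grand = values.mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return ss_between, ss_within, len(groups), len(values)

def f_statistic(values, labels):
    """One-way ANOVA F: between-group over within-group mean square."""
    ssb, ssw, n_groups, n = _sums_of_squares(values, labels)
    return (ssb / (n_groups - 1)) / (ssw / (n - n_groups))

def eta_squared(values, labels):
    """Share of total variance attributable to group membership."""
    ssb, ssw, _, _ = _sums_of_squares(values, labels)
    return ssb / (ssb + ssw)

def permutation_p(values, labels, n_perm=2000, seed=0):
    """Fraction of label shuffles whose F is at least the observed F."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = f_statistic(values, labels)
    hits = sum(
        f_statistic(values, rng.permutation(labels)) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)

# Toy data, invented for illustration only.
toy_values = [0.30, 0.29, 0.31, 0.10, 0.09, 0.11, 0.26, 0.25, 0.27]
toy_labels = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
print(eta_squared(toy_values, toy_labels), permutation_p(toy_values, toy_labels))
```

The `+ 1` in numerator and denominator keeps the estimated p-value strictly positive, a common convention for Monte Carlo permutation tests.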

Reproduce a paper number

Every number in the paper traces to a committed JSON through an automated verification pipeline. Pick a claim and verify it:

Paper claim	Value	Command	Source
Cross-family permutation F (family effect)	p = 0.006	`uv run python analysis/permutation_test.py`	13-model scope in `analysis/load_results.py`
Llama 1B partial correlation	+0.286	`uv run python analysis/load_results.py`	`results/llama1b_v3_results.json`
Exclusive catch rate at 20% flag rate	12-15%	`uv run python analysis/exclusive_catch_rates.py`	`results/transformer_observe.json` key `6a`

The cross-family comparison

Model	Family	Params	pcorr	OC residual
Gemma 3 1B*	Gemma	1B	+0.388*	+0.307
Mistral 7B	Mistral	7B	+0.313	+0.156
Phi-3 Mini	Phi	3.8B	+0.300	+0.144
GPT-2 XL	GPT-2	1.5B	+0.290	+0.174
Llama 1B	Llama	1.2B	+0.286	+0.120
Qwen 7B	Qwen	7B	+0.255	+0.137
Llama 3B	Llama	3B	+0.091	+0.031
Llama 8B	Llama	8B	+0.093	-0.007

Sorted by signal strength, with the two collapsed Llama rows (3B, 8B) listed last in size order. Every row except Llama 3B and Llama 8B produces observability above +0.19. Gemma 3 1B* has anomalous representation geometry: a random untrained probe achieves +0.213, so the high score reflects this artifact rather than stronger observability. Within Llama, the signal is present at 1B and absent at 3B and 8B. Same lab, same training pipeline, different architectural configuration.

pcorr: partial Spearman correlation between probe scores and per-token loss, controlling for max softmax probability and activation norm. OC residual: the additional partial correlation after also controlling for a trained MLP on the last-layer activations. Values are 7-seed means on WikiText-103, except for the GPT-2 family, which uses 3 seeds. Each model is evaluated at its peak layer with matched token budget per hidden dimension. The full 13-model table with standard deviations, seed agreement, and random head baselines is in the paper.
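
One common way to compute a partial Spearman correlation is to rank-transform every variable, regress the control ranks out of both the probe-score ranks and the loss ranks, and correlate the residuals. The sketch below follows that recipe. It is an assumption about the general technique, not the repo's exact code; `average_ranks` and `partial_spearman` are hypothetical names, and the data is synthetic.

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]

def partial_spearman(x, y, controls):
    """Correlate rank residuals of x and y after regressing out the controls."""
    rx, ry = average_ranks(x), average_ranks(y)
    design = np.column_stack(
        [np.ones(len(rx))] + [average_ranks(c) for c in controls]
    )
    res_x = rx - design @ np.linalg.lstsq(design, rx, rcond=None)[0]
    res_y = ry - design @ np.linalg.lstsq(design, ry, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])

# Synthetic stand-ins: loss depends on confidence and on a hidden factor.
rng = np.random.default_rng(0)
n = 5000
max_softmax = rng.normal(size=n)
act_norm = rng.normal(size=n)
hidden_factor = rng.normal(size=n)
loss = -max_softmax + 0.6 * hidden_factor + rng.normal(size=n)
probe = 0.8 * hidden_factor - 0.5 * max_softmax + 0.5 * rng.normal(size=n)
confidence_only = -max_softmax + 0.5 * rng.normal(size=n)

print(partial_spearman(probe, loss, [max_softmax, act_norm]))
print(partial_spearman(confidence_only, loss, [max_softmax, act_norm]))
```

In this toy setup the first score keeps a clearly positive partial correlation because it reads the hidden factor, while the second, which only restates confidence, drops to roughly zero once confidence is controlled.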

Across 25 models spanning seven families, the output-controlled residual tracks partial correlation with slope 0.88. Collapse points sit near the origin. A monitoring tool that reads the mid-layer signal exposes information not recovered by the tested output-side predictors, and this surplus vanishes at exactly the configurations where the partial correlation collapses.
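
The relationship can be inspected on the eight rows shown in the table above. The sketch below fits a least-squares line of OC residual against pcorr and computes the per-model ratio. Two caveats: the 0.88 slope is reported over 25 models, so a fit on these eight rows need not reproduce it, and reading the ratio OC residual / pcorr as the "output-independent share" is an assumption made here, although for GPT-2 XL it gives 0.174 / 0.290 ≈ 0.60, in line with the 60% quoted earlier.

```python
import numpy as np

# (pcorr, OC residual) copied from the cross-family table above.
rows = {
    "Gemma 3 1B": (0.388, 0.307),
    "Mistral 7B": (0.313, 0.156),
    "Phi-3 Mini": (0.300, 0.144),
    "GPT-2 XL": (0.290, 0.174),
    "Llama 1B": (0.286, 0.120),
    "Qwen 7B": (0.255, 0.137),
    "Llama 3B": (0.091, 0.031),
    "Llama 8B": (0.093, -0.007),
}

pcorr = np.array([v[0] for v in rows.values()])
oc_residual = np.array([v[1] for v in rows.values()])

# Least-squares line: oc_residual ~ slope * pcorr + intercept.
slope, intercept = np.polyfit(pcorr, oc_residual, deg=1)

# Ratio of surviving signal to total signal (interpretation is an assumption).
share = {name: oc / pc for name, (pc, oc) in rows.items()}

print(f"slope on these 8 rows: {slope:.3f}, intercept: {intercept:.3f}")
for name, value in share.items():
    print(f"{name:12s} {value:+.2f}")
```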

What the observer does not catch

Three documented boundaries that any deployment should respect.

Fluent factual errors. TruthfulQA isolates the subset of confidently wrong answers where the model asserts a smooth falsehood. The observer scores near chance on this subset, with AUC 0.499 to 0.568 across three production instruct models. Activation monitoring catches token-level prediction failures, not learned falsehoods.

Architectures where the signal collapsed. Llama 3B and 8B, and the Pythia 24-layer, 16-head configurations. On these, the linear probe scores at the detection floor and a held-out-tuned nonlinear probe does not cross it. Whether a deployed model is observable in this sense is a property of the architecture, not of better tooling.

Adversarial evasion. McGuinness et al. (2025) show that activation monitors can be evaded under training pressure. The observer has not been tested against adaptive attacks. PC1 cosine of 0.002 indicates the observer direction is not on a dominant variance axis, but the threat model still applies.
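
A PC1 cosine of this kind can be computed by taking the first principal component of the centered activations and measuring its cosine with the probe direction. The sketch below is a minimal version under stated assumptions: `pc1_cosine` is a hypothetical name, the absolute value is taken because the sign of a principal component is arbitrary (whether the paper does the same is not stated), and the activations are synthetic.

```python
import numpy as np

def pc1_cosine(activations, direction):
    """|cosine| between a direction and the top principal component."""
    centered = activations - activations.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pc1 = vt[0]
    cosine = pc1 @ direction / (np.linalg.norm(pc1) * np.linalg.norm(direction))
    return float(abs(cosine))

# Synthetic activations whose variance is dominated by coordinate 0.
rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 16))
acts[:, 0] *= 10.0

along_pc1 = np.eye(16)[0]   # direction aligned with the dominant axis
off_axis = np.eye(16)[1]    # direction orthogonal to it

print(pc1_cosine(acts, along_pc1), pc1_cosine(acts, off_axis))
```

A value near zero, as in the second call, says only that the direction is not the dominant variance axis. It says nothing about robustness to an adaptive attacker.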

The observer's value is the complementary catch: errors confidence marks correct. Confidence has higher single-signal precision at every flag rate. Use both.
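
To make "complementary catch" concrete, the sketch below counts errors that the observer flags and a confidence threshold does not, at a shared flag rate. The definition used here (share of all errors flagged by the observer only) is an assumption for illustration and may differ from the one in `analysis/exclusive_catch_rates.py`. The function names and the ten-token example are hypothetical.

```python
import numpy as np

def top_fraction(scores, rate):
    """Boolean mask over the highest-scoring `rate` fraction of items."""
    scores = np.asarray(scores, dtype=float)
    k = max(1, int(round(rate * len(scores))))
    mask = np.zeros(len(scores), dtype=bool)
    mask[np.argsort(scores)[-k:]] = True
    return mask

def exclusive_catch_rate(observer_score, confidence, is_error, rate):
    """Share of errors flagged by the observer but not by low confidence."""
    is_error = np.asarray(is_error, dtype=bool)
    observer_flag = top_fraction(observer_score, rate)
    confidence_flag = top_fraction(-np.asarray(confidence, dtype=float), rate)
    exclusive = observer_flag & ~confidence_flag & is_error
    return exclusive.sum() / max(1, is_error.sum())

# Ten made-up tokens; token 0 is a confident error only the observer flags.
confidence = [0.99, 0.40, 0.95, 0.30, 0.90, 0.85, 0.97, 0.92, 0.88, 0.96]
observer = [0.90, 0.20, 0.10, 0.80, 0.15, 0.05, 0.12, 0.11, 0.13, 0.14]
is_error = [True, True, False, True, False, False, False, False, False, False]

rate = exclusive_catch_rate(observer, confidence, is_error, rate=0.2)
both = top_fraction(observer, 0.2) | top_fraction(-np.asarray(confidence), 0.2)
print(rate, int((both & np.asarray(is_error)).sum()))
```

In this toy case confidence catches tokens 1 and 3, the observer catches tokens 0 and 3, and the union covers all three errors. Note that the union flags more tokens than either signal alone, so a fair comparison should account for the larger flag budget.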

Run it on your model

pip install -e ".[transformer]"   # or: uv sync --extra transformer

python scripts/run_model.py \
  --model Qwen/Qwen2.5-7B \
  --output qwen7b_results.json

This runs the full protocol: layer sweep, 7-seed evaluation, output-controlled residual, cross-domain transfer, control sensitivity, and flagging analysis. Output is a self-contained JSON with provenance metadata.

To add the result to the analysis scope, validate the JSON and add one line to analysis/load_results.py:

just validate-results                          # check required fields

# In analysis/load_results.py, add to the appropriate family list:
QWEN_MODELS = [
    ...
    ("qwen7b_v3_results.json", 7.0, "Qwen 7B"),   # existing
    ("your_model_results.json", 7.0, "Your 7B"),   # new entry
]

Then `uv run python analysis/run_all.py` includes the new model in every statistical test. See `analysis/README.md` for the full schema and checklist.

Repository structure

src/                  Core library (probe, observer, experiment engine)
scripts/              GPU experiment launchers (run_model.py is the entry point)
analysis/             CPU statistical analysis (permutation test, mixed-effects, schema validation)
results/              All result JSONs (committed, reproducible, schema-validated)
figures/              Shared matplotlib style and save helper
tests/                Schema, metrics, analysis smoke, probe-sync drift guards

The full directory map and script descriptions are in analysis/README.md and results/README.md.

Using the analysis library

The analysis package is the stable public API (v3.x). Install the repo as a package (uv sync or pip install -e .) and import directly:

from analysis import load_all_models, load_model_means, family_f_stat, validate_all

Nine exported functions cover data loading, statistical primitives, and schema validation. See analysis/__init__.py for the full list with descriptions.

Where to find what

I want to	Start here
Read the paper	Zenodo pre-print (https://doi.org/10.5281/zenodo.19435674)
Run the tests	`uv run pytest tests/ -q`
Run the full analysis pipeline	`uv run python analysis/run_all.py`
Reproduce a specific paper number	"Reproduce a paper number" table above
See the raw experimental data	`results/*.json` (every paper number traces here)
Walk through the analysis pipeline	`notebooks/walkthrough_analysis.ipynb` (CPU-only, no GPU)
Use the analysis library in your own code	`analysis/__init__.py` (public API, stable across v3.x)
Add my own model to the cross-family scope	"Run it on your model" section above, then `analysis/README.md`
Understand the result-JSON schema	`analysis/load_results.py` and `results/README.md`
Look at how a specific number was produced	`notebooks/README.md` (per-model run history)

Citation

Cite the paper and the code separately. Both share a Zenodo concept DOI that resolves to the latest version; pin to a specific version DOI from the Zenodo record for reproducibility.

@article{carmichael2026observability,
  title={Architecture Determines Observability in Transformers},
  author={Carmichael, Thomas},
  year={2026},
  journal={Zenodo pre-print},
  doi={10.5281/zenodo.19435674},
  url={https://doi.org/10.5281/zenodo.19435674},
  note={v3.3.0}
}

@software{carmichael2026code,
  title={nn-observability: code for ``Architecture Determines Observability in Transformers''},
  author={Carmichael, Thomas},
  year={2026},
  version={3.3.0},
  doi={10.5281/zenodo.19435674},
  url={https://github.com/tmcarmichael/nn-observability}
}

License

MIT License
