The first real-time AI safety layer: interrupting harmful content mid-stream, before it ever reaches your users.

Every major LLM deployment today shares a dangerous blind spot: safety happens after the model speaks. Post-hoc guardrails, output filters, and content classifiers all operate on the finished response. By then, the harm has already been generated, streamed, and delivered.
LSM breaks this paradigm entirely.
Large Supervisor Models are the first architecture purpose-built to intercept harmful content as it is being generated: token by token, in real time, before a single harmful word ever reaches the user. This is not a wrapper. It is not a filter. It is a parallel co-processor that watches the LLM's output stream continuously and fires an interrupt signal the instant harm begins to emerge.

The results speak for themselves: 93.6% accuracy and a 90.75% F1 score on held-out test data, delivered with near-zero latency overhead while running silently alongside any LLM.
Traditional pipeline (broken):
```
User ──▶ LLM ──▶ [harmful content generated] ──▶ [full response] ──▶ Filter ──▶ User
                                                                        ▲
                                                                        Too late. Harm already done.
```
Existing approaches share three fatal flaws:

- They're reactive. They analyze what was already said, not what's being said.
- They're slow. Running a full classifier over a complete response adds latency and misses partial-stream attacks.
- They're blind to the middle. A response that starts safe and turns harmful halfway through defeats post-processing entirely.

LSM solves all three, simultaneously.
A Large Supervisor Model is a lightweight, purpose-built transformer that runs in parallel with any LLM, reading the token output stream directly in real time and intervening the moment it detects something harmful. It doesn't wait. It doesn't post-process. It watches every token as it arrives and acts instantly.

Think of it as a co-pilot that never blinks.
```
LLM ──── token stream ────▶ LSM ──── ABSTAIN / INTERRUPT ──▶ Client
                             ▲
                             Running in parallel,
                             always watching
```
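For intuition, here is a minimal sketch of that supervision loop in Python. The `LSMSupervisor` class, its `classify` method, and the `Verdict` type are illustrative placeholders, not a published API:

```python
# Minimal sketch of the supervision loop.
# LSMSupervisor, Verdict, and classify() are hypothetical placeholders.
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Verdict:
    kind: str          # "abstain" or "interrupt"
    confidence: float


class LSMSupervisor:
    def classify(self, text_so_far: str) -> Verdict:
        """Score the partial response generated so far (stand-in for the model)."""
        raise NotImplementedError


def supervise(token_stream: Iterable[str], lsm: LSMSupervisor) -> Iterator[dict]:
    """Forward tokens to the client until the supervisor fires an interrupt."""
    seen: list[str] = []
    for token in token_stream:
        seen.append(token)
        verdict = lsm.classify("".join(seen))
        if verdict.kind == "interrupt":
            # Stop forwarding and emit a structured interrupt instead.
            yield {
                "type": "interrupt",
                "confidence": verdict.confidence,
                "last_tokens": "".join(seen[-8:]),
            }
            return
        yield {"type": "token", "text": token}
```

In the real system the supervisor runs alongside the LLM and consumes tokens from a queue; the sequential loop above is only meant to show the control flow.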
Evaluated on a held-out test set spanning self-harm, hate speech, dangerous instructions, harassment, and safe content, LSM achieves state-of-the-art performance:

| Metric | Score |
|---|---|
| Accuracy | 93.61% |
| Precision | 88.36% |
| Recall | 93.27% |
| F1 Score | 90.75% |
These numbers reflect a model that:
- Rarely misses real harm (93.3% recall: almost all harmful content is caught)
- Rarely cries wolf (88.4% precision: false positives are kept low)
- Generalizes well (results are on held-out test data, not training data)

The high recall is the critical metric for safety: missing harm is worse than over-flagging, and LSM is tuned to prioritize catching real threats while maintaining strong precision to avoid disrupting legitimate use.
| Feature | Description |
|---|---|
| Real-time interruption | Fires an interrupt signal mid-stream the moment harmful content is detected |
| Silent feedback | Logs, reports, or flags content in the background without disrupting normal responses |
| Lightweight by design | Small and fast; built to shadow any LLM without meaningful overhead |
| OpenAI streaming compatible | Plugs directly into OpenAI API streaming output (see the sketch below this table) |
| No re-encoding overhead | Encodes only new tokens incrementally; never reprocesses the full context |
| Queue-based processing | A token queue ensures no output is missed, even when the LSM is briefly busy |
| Tool call awareness | Intentionally skips tool calls and tool responses; only supervises model-generated text |
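As a rough illustration of the streaming integration and tool-call awareness listed above, the sketch below wraps the OpenAI streaming chat-completions API. The `supervisor` object and its `classify` method are hypothetical stand-ins for the LSM, the model name is just an example, and unlike the real system this sketch simply re-scores the accumulated prefix rather than encoding only new tokens:

```python
# Sketch: supervising an OpenAI streaming response as deltas arrive.
# `supervisor.classify(text)` is a hypothetical stand-in for the LSM model.
from openai import OpenAI

client = OpenAI()


def stream_with_supervision(messages, supervisor, model="gpt-4o-mini"):
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    buffer = ""
    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.tool_calls:      # tool calls are intentionally not supervised
            continue
        if not delta.content:     # skip empty deltas (role markers, finish chunks)
            continue
        buffer += delta.content   # accumulate the partial response for context
        verdict = supervisor.classify(buffer)
        if verdict.kind == "interrupt":
            yield {
                "type": "interrupt",
                "confidence": verdict.confidence,
                "last_tokens": buffer[-40:],
            }
            return
        yield {"type": "token", "text": delta.content}
```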
LSM emits one of two signals:

- ABSTAIN → Content is safe. Pass through to client.
- INTERRUPT → Content is harmful. Stop the stream. Notify the client.
When LSM fires, it sends a structured interrupt to the client so the partial response can be cleared immediately:
```json
{
  "type": "interrupt",
  "confidence": 0.97,
  "last_tokens": "You have no purpose to live"
}
```

This structured payload lets your application immediately clear the partial streamed response, display a safe fallback message, and log the event, all without the user ever seeing the harmful content that triggered the interrupt.
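On the receiving side, a client could handle this payload roughly as sketched below; the `ui` object and its methods are illustrative placeholders for whatever your application already provides:

```python
# Sketch of client-side handling for the interrupt payload shown above.
# `ui.clear_partial_response`, `ui.render`, and `ui.append` are placeholders.
import logging

audit_log = logging.getLogger("lsm.interrupts")


def handle_event(event: dict, ui) -> None:
    if event["type"] == "interrupt":
        ui.clear_partial_response()  # wipe the partially streamed text
        ui.render("Sorry, I can't continue with that response.")  # safe fallback
        audit_log.warning(
            "LSM interrupt (confidence=%.2f): %r",
            event["confidence"],
            event["last_tokens"],
        )
    elif event["type"] == "token":
        ui.append(event["text"])     # normal streaming path
```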
LSM uses a single fine-tuned transformer that reads the LLM's output stream directly: no intermediate representations, no separate classifier stage. Every token is evaluated in context as it arrives.

```
LLM output stream ──▶ Transformer (LSM) ──▶ ABSTAIN / INTERRUPT
                            ▲
                            Reads stream directly,
                            token by token, in real time
```

This direct stream analysis is what gives LSM its edge: the transformer understands nuance and context, catching harmful content that pattern-matching approaches miss, including content that only becomes harmful as a sentence unfolds.

When confidence is high, LSM fires an immediate interrupt. When it is lower, the signal feeds back as a training example, continuously improving detection over time.
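A minimal sketch of that two-tier behavior; the 0.9 cut-off and the training-queue shape are assumptions for illustration, not documented values:

```python
# Illustrative confidence routing: high-confidence detections interrupt,
# lower-confidence ones are queued as future training examples.
INTERRUPT_THRESHOLD = 0.9  # assumed value, not a documented setting


def route(verdict, text_so_far: str, training_queue: list) -> str:
    if verdict.kind == "interrupt":
        if verdict.confidence >= INTERRUPT_THRESHOLD:
            return "interrupt"  # fire the interrupt immediately
        # Feed the lower-confidence signal back as a labeled training example.
        training_queue.append({
            "llm_output_text": text_so_far,
            "lsm_output": {"type": "interrupt", "confidence": verdict.confidence},
        })
    return "abstain"
```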
LSM is trained on examples mapping LLM output text to LSM output labels:
```json
[
  {
    "llm_output_text": "You have no purpose to live",
    "lsm_output": { "type": "interrupt", "reason": "self_harm", "confidence": 0.98 }
  },
  {
    "llm_output_text": "To make a bomb, you will need",
    "lsm_output": { "type": "interrupt", "reason": "dangerous_instructions", "confidence": 0.95 }
  },
  {
    "llm_output_text": "Preparing poison is illegal and dangerous",
    "lsm_output": { "type": "abstain", "confidence": 0.91 }
  },
  {
    "llm_output_text": "The capital of France is Paris.",
    "lsm_output": { "type": "abstain", "confidence": 0.99 }
  }
]
```

Training data is bootstrapped with an LLM and then supplemented with manually authored harmful examples, because LLMs themselves often refuse to generate the most dangerous content needed for robust training.
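To show how such examples might be consumed, here is a short sketch that flattens them into (text, label) pairs for a binary interrupt/abstain classifier; the filename and the 0/1 label mapping are assumptions:

```python
# Sketch: loading the training JSON above into (text, label) pairs.
# The filename and the 0/1 label mapping are illustrative assumptions.
import json

LABELS = {"abstain": 0, "interrupt": 1}

with open("lsm_training_data.json") as f:
    examples = json.load(f)

pairs = [(ex["llm_output_text"], LABELS[ex["lsm_output"]["type"]]) for ex in examples]
# e.g. ("You have no purpose to live", 1), ("The capital of France is Paris.", 0)
```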
LSM is calibrated to catch real harm, not to be paranoid. It targets:

- Self-harm & suicidal ideation: "you have no purpose to live"
- Hate speech: racially or socially targeted harmful statements
- Dangerous instructions: bombs, poisons, bioweapons
- Harassment & personal attacks: direct verbal abuse

It deliberately does not flag:

- Factual discussion of why things are dangerous
- Educational or legal context around harmful topics
- Tool calls or API responses

This distinction is central to LSM's design philosophy and is reflected in its precision score. A safety system that over-flags is one that gets turned off.
Evaluation is done by:

- Generating or manually authoring harmful prompts
- Producing multiple harmful completions per prompt (supplemented with manual examples)
- Measuring precision and recall on the transformer model directly
- Evaluating against held-out test data the model has never seen
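The metrics themselves can be computed in the standard way; a minimal sketch, assuming binary interrupt/abstain labels and a hypothetical `model.predict` that returns 1 for interrupt and 0 for abstain:

```python
# Sketch of the held-out evaluation. `model.predict(text)` is a placeholder.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def evaluate(model, test_pairs):
    texts, y_true = zip(*test_pairs)               # held-out (text, label) pairs
    y_pred = [model.predict(text) for text in texts]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }
```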
LSM is not designed to stop adversarial jailbreakers. It is designed to make the general public safe: quietly, invisibly, and in real time.

The goal is not adversarial robustness against motivated attackers; that is a different (and harder) problem. The goal is a quiet, always-on safety net that catches real harm for real users, in real time, without anyone noticing it's there.

Safety that is intrusive fails because it gets disabled. Safety that is invisible succeeds because it is never in the way.

Built for safety that doesn't slow you down.
