Pro-GenAI/Large-Supervisor-Models

πŸ›‘οΈ LaSuMo: Large Supervisor Models (LSMs)

🚨 The first real-time AI safety layer β€” interrupting harmful content mid-stream, before it ever reaches your users.

Banner image


⚑ A Breakthrough in AI Safety

Every major LLM deployment today shares a dangerous blind spot: safety happens after the model speaks. Post-hoc guardrails, output filters, and content classifiers all operate on the finished response β€” but by then, the harm has already been generated, streamed, and delivered. πŸ’€

LSM breaks this paradigm entirely.

Large Supervisor Models are the first architecture purpose-built to intercept harmful content as it is being generated β€” token by token, in real time, before a single harmful word ever reaches the user. This is not a wrapper. It is not a filter. It is a parallel co-processor that watches the LLM's output stream continuously and fires an interrupt signal the instant harm begins to emerge. πŸ”₯

The results speak for themselves: 93.61% accuracy and a 90.75% F1 score on held-out test data — delivered with near-zero latency overhead, running silently alongside any LLM. 📊✅


😰 The Problem Every AI Deployment Has

Traditional pipeline (broken):

User ──▢ LLM ──── [harmful content generated] ──── [full response] ──▢ Filter ──▢ User
                                                                           β–²
                                                  ❌ Too late. Harm already done.

Existing approaches share three fatal flaws:

  1. πŸ” They're reactive. They analyze what was already said, not what's being said.
  2. 🐒 They're slow. Running a full classifier over a complete response adds latency and misses partial stream attacks.
  3. πŸ™ˆ They're blind to the middle. A response that starts safe and turns harmful halfway through defeats post-processing entirely.

LSM solves all three β€” simultaneously. πŸ’₯


πŸ€” What is an LSM?

A Large Supervisor Model is a lightweight, purpose-built transformer that runs in parallel with any LLM β€” reading the token output stream directly in real time and intervening the moment it detects something harmful. It doesn't wait. It doesn't post-process. It watches every token as it arrives and acts instantly. πŸ‘οΈ

Think of it as a co-pilot that never blinks. πŸ§‘β€βœˆοΈ

LLM ──── token stream ────▢ LSM ──── βœ… ABSTAIN / πŸ›‘ INTERRUPT ──▢ Client
                                              β–²
                                    πŸ”„ Running in parallel,
                                    πŸ‘€ always watching

πŸ“Š Benchmark Results

Evaluated on a held-out test set spanning self-harm, hate speech, dangerous instructions, harassment, and safe content β€” LSM achieves state-of-the-art performance:

Metric          Score
🎯 Accuracy     93.61%
🔬 Precision    88.36%
🕵️ Recall       93.27%
⚖️ F1 Score     90.75%

These numbers reflect a model that:

  • βœ… Rarely misses real harm (93.3% recall β€” almost all harmful content is caught)
  • 🚫 Rarely cries wolf (88.4% precision β€” false positives are kept low)
  • 🌍 Generalizes well (results are on held-out test data, not training data)

The high recall is the critical metric for safety: missing harm is worse than over-flagging, and LSM is tuned to prioritize catching real threats while maintaining strong precision to avoid disrupting legitimate use. πŸ†
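For reference, these four figures combine in the standard way from confusion-matrix counts. A minimal helper (not part of the repository) makes the relationship explicit:

```python
def prf(tp: int, fp: int, fn: int, tn: int):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy
```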


✨ Key Features

  • ⚡ Real-time interruption: fires an interrupt signal mid-stream the moment harmful content is detected
  • 🔇 Silent feedback: logs, reports, or flags content in the background without disrupting normal responses
  • 🪶 Lightweight by design: small and fast, built to shadow any LLM without meaningful overhead
  • 🔌 OpenAI streaming compatible: plugs directly into OpenAI API streaming output
  • 🚫 No re-encoding overhead: encodes only new tokens incrementally and never reprocesses the full context
  • 🧵 Queue-based processing: a token queue ensures no output is missed, even when the LSM is briefly busy
  • 🔒 Tool call awareness: intentionally skips tool calls and tool responses, supervising only model-generated text

πŸ”§ How It Works

The Two Output States

βœ… ABSTAIN      β†’  Content is safe. Pass through to client.
πŸ›‘ INTERRUPT    β†’  Content is harmful. Stop the stream. Notify the client.

🚨 Interrupt Signal Format

When LSM fires, it sends a structured interrupt to the client so the partial response can be cleared immediately:

{
  "type": "interrupt",
  "confidence": 0.97,
  "last_tokens": "You have no purpose to live"
}

This structured payload lets your application immediately 🧹 clear the partial streamed response, πŸ’¬ display a safe fallback message, and πŸ“ log the event β€” all without the user ever seeing the harmful content that triggered the interrupt.
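A client-side handler for that payload might look like the sketch below. The fallback message text and the function name are assumptions; only the payload shape (`type`, `confidence`, `last_tokens`) comes from the format above.

```python
import json

FALLBACK = "Sorry, I can't continue with that response."  # assumed fallback text

def handle_event(raw: str, buffer: list):
    """Process one streamed event; on interrupt, clear the buffer and return a fallback."""
    event = json.loads(raw)
    if event.get("type") == "interrupt":
        buffer.clear()   # 🧹 drop the partial streamed response
        return FALLBACK  # 💬 safe message to display instead
    return None
```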


πŸ—οΈ Architecture: Transformer, Watching Every Token

LSM uses a single fine-tuned transformer that reads the LLM's output stream directly β€” no intermediate representations, no separate classifier stage. Every token is evaluated in context as it arrives. 🧠

LLM output stream ──▢ πŸ€– Transformer (LSM) ──▢ βœ… ABSTAIN / πŸ›‘ INTERRUPT
                              β–²
                   Reads stream directly,
                   token by token, in real time

This direct stream analysis is what gives LSM its edge: the transformer understands nuance and context, catching harmful content that pattern-matching approaches miss β€” including content that only becomes harmful as a sentence unfolds. 🎯

When confidence is high, LSM fires an immediate interrupt. When it is lower, the signal feeds back as a training example β€” continuously improving detection over time. πŸ“ˆ
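That confidence split can be sketched as a simple threshold gate. The threshold value and names here are assumptions for illustration; the repository does not publish the actual cutoff.

```python
INTERRUPT_THRESHOLD = 0.9  # assumed value, not stated by the repository

def act_on_score(score: float, context: str, training_log: list) -> str:
    """High-confidence harm fires an interrupt; lower scores become training examples."""
    if score >= INTERRUPT_THRESHOLD:
        return "interrupt"
    # Lower-confidence signals feed back into the training set 📈
    training_log.append({"llm_output_text": context, "score": score})
    return "abstain"
```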


πŸ—‚οΈ Training Data Design

LSM is trained on examples mapping LLM output text to LSM output labels:

[
  {
    "llm_output_text": "You have no purpose to live",
    "lsm_output": { "type": "interrupt", "reason": "self_harm", "confidence": 0.98 }
  },
  {
    "llm_output_text": "To make a bomb, you will need",
    "lsm_output": { "type": "interrupt", "reason": "dangerous_instructions", "confidence": 0.95 }
  },
  {
    "llm_output_text": "Preparing poison is illegal and dangerous",
    "lsm_output": { "type": "abstain", "confidence": 0.91 }
  },
  {
    "llm_output_text": "The capital of France is Paris.",
    "lsm_output": { "type": "abstain", "confidence": 0.99 }
  }
]

Training data is πŸ€– bootstrapped with an LLM and then supplemented with manually authored harmful examples β€” because LLMs themselves often refuse to generate the most dangerous content needed for robust training. πŸ§ͺ


🎯 Detection Scope

LSM is calibrated to catch real harm β€” not to be paranoid. It targets:

  • πŸ’” Self-harm & suicidal ideation β€” "you have no purpose to live"
  • 🀬 Hate speech β€” racially or socially targeted harmful statements
  • πŸ’£ Dangerous instructions β€” bombs, poisons, bioweapons
  • 😑 Harassment & personal attacks β€” direct verbal abuse

It deliberately does not flag: πŸ™…

  • πŸ“š Factual discussion of why things are dangerous
  • βš–οΈ Educational or legal context around harmful topics
  • πŸ”§ Tool calls or API responses

This distinction is central to LSM's design philosophy β€” and is reflected in its precision score. A safety system that over-flags is one that gets turned off. πŸ”΄


πŸ”¬ Evaluation Methodology

Evaluation is done by:

  1. 🧨 Generating or manually authoring harmful prompts
  2. πŸ“ Producing multiple harmful completions per prompt (supplemented with manual examples)
  3. πŸ“ Measuring precision and recall on the transformer model directly
  4. πŸ† Evaluating against held-out test data the model has never seen

πŸ’­ Design Philosophy

🌍 LSM is not designed to stop adversarial jailbreakers. It is designed to make the general public safe β€” quietly, invisibly, and in real time.

The goal is not adversarial robustness against motivated attackers β€” that is a different (and harder) problem. The goal is a quiet, always-on safety net that catches real harm for real users, in real time, without anyone noticing it's there. πŸ•΅οΈβ€β™‚οΈ

πŸ”΄ Safety that is intrusive fails because it gets disabled. 🟒 Safety that is invisible succeeds because it is never in the way.


πŸš€ Get Started

Model


πŸ›‘οΈ Built for safety that doesn't slow you down.
