1,658 changes: 1,658 additions & 0 deletions docs/user_guide/13_llm_router.ipynb


385 changes: 385 additions & 0 deletions redisvl/extensions/llm_router/DESIGN.md
@@ -0,0 +1,385 @@
# LLM Router Extension - Design Document

## Overview

The LLM Router is an extension to RedisVL that provides intelligent, cost-optimized LLM model selection using semantic routing. Instead of routing queries to topics (like SemanticRouter), it routes queries to **model tiers** - selecting the cheapest LLM capable of handling each task.

## Problem Statement

### The LLM Cost Problem
Modern applications often default to using the most capable (and expensive) LLM for all queries, even when simpler models would suffice:
- "Hello, how are you?" -> Claude Opus 4.5 ($5/M tokens)
- "Hello, how are you?" -> GPT-4.1 Nano ($0.10/M tokens)

### Existing Solutions and Their Limitations

**RouteLLM** (LMSys):
- Binary classification only (strong vs. weak model); no support for more than two tiers
- Requires training data or preference matrices

**NVIDIA LLM Router Blueprint**:
- Complexity classification approach (simple/moderate/complex)
- Provides the taxonomy basis but no open-source Redis-native implementation

**RouterArena / Bloom's Taxonomy Approach**:
- Maps query complexity to Bloom's cognitive levels
- Informs our tier design but lacks production routing infrastructure

**OpenRouter Auto-Router**:
- Black box routing decisions
- Data flows through third-party servers
- No transparency into why a model was selected
- Can't self-host or customize

**NotDiamond**:
- Proprietary ML model for routing
- Requires API calls for every routing decision
- No local/offline capability

**FrugalGPT**:
- Sequential cascade approach (try cheap first, escalate)
- Higher latency due to serial model calls

## Solution: Semantic Model Tier Routing

Repurpose RedisVL's battle-tested SemanticRouter for model selection:

```
SemanticRouter -> LLMRouter
-----------------------------------------
Route -> ModelTier
route.name -> tier.name (simple/standard/expert)
route.references -> tier.references (task complexity examples)
route.metadata -> tier.metadata (cost, capabilities)
RouteMatch -> LLMRouteMatch (includes model string)
```
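
In schema terms, a minimal sketch of the two central models (field names inferred from the API examples later in this document; the authoritative definitions live in `schema.py`):

```python
from typing import Any, Dict, List, Optional

from pydantic import BaseModel


class ModelTier(BaseModel):
    """One routing target. A sketch of the fields used throughout this doc."""

    name: str                        # "simple" / "standard" / "expert"
    model: str                       # LiteLLM-format string, e.g. "openai/gpt-4.1-nano"
    references: List[str]            # task-complexity example phrases
    metadata: Dict[str, Any] = {}    # e.g. {"cost_per_1k_input": 0.003}
    distance_threshold: float = 0.5  # max vector distance to match this tier


class LLMRouteMatch(BaseModel):
    """Routing result. `tier` and `model` are None when nothing matches."""

    tier: Optional[str] = None
    model: Optional[str] = None
    distance: Optional[float] = None  # assumed field, not shown in the examples
```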

### Architecture

```
+---------------------------------------------------------------+
| LLMRouter |
+---------------------------------------------------------------+
| +-------------+ +-------------+ +-------------+ |
| | Simple | | Standard | | Expert | |
| | Tier | | Tier | | Tier | |
| +-------------+ +-------------+ +-------------+ |
| | gpt-4.1-nano| | sonnet 4.5 | | opus 4.5 | |
| | $0.10/M | | $3/M | | $5/M | |
| | threshold: | | threshold: | | threshold: | |
| | 0.5 | | 0.6 | | 0.7 | |
| +-------------+ +-------------+ +-------------+ |
| | | | |
| +----------------+----------------+ |
| v |
| +------------------------+ |
| | Redis Vector Index | |
| | (reference phrases) | |
| +------------------------+ |
+---------------------------------------------------------------+
|
v
+-------------+
| Query |
| "analyze |
| this..." |
+-------------+
|
v
+-------------+
| LiteLLM |
| (optional) |
+-------------+
```

## Key Design Decisions

### 1. Model Tiers, Not Individual Models

Routes map to **tiers** (simple, standard, expert) rather than specific models. This provides:
- Abstraction from model churn (swap haiku -> gemini-flash without changing routes; see the sketch below)
- Clear mental model for users
- Easy cost optimization within tiers
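
As a concrete example of the model-churn point, retiring a model is a one-field edit to the tier definition (a sketch; the replacement model string is illustrative):

```python
# Swap the backing model for the "standard" tier. References and
# threshold are untouched, so routing behavior is unchanged.
standard = ModelTier(
    name="standard",
    model="gemini/gemini-2.5-flash",  # illustrative replacement for sonnet
    references=[
        "analyze this code for bugs",
        "explain how neural networks learn",
    ],
    distance_threshold=0.6,
)
```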

### 2. Bloom's Taxonomy-Grounded Tiers

The default pretrained config maps tiers to Bloom's Taxonomy cognitive levels:
- **Simple** (Remember/Understand): Factual recall, greetings, format conversion
- **Standard** (Apply/Analyze): Code explanation, summarization, moderate analysis
- **Expert** (Evaluate/Create): Research, architecture, formal reasoning

This is informed by RouterArena's finding that cognitive complexity correlates with model capability requirements.

### 3. LiteLLM-Compatible Model Strings

Tier model identifiers use LiteLLM format (`provider/model`):
```python
ModelTier(
name="standard",
model="anthropic/claude-sonnet-4-5", # Works directly with LiteLLM
...
)
```

### 4. Per-Tier Distance Thresholds

Each tier has its own `distance_threshold` (the maximum vector distance at which a query can still match that tier), allowing fine-grained control:
```python
simple_tier = ModelTier(..., distance_threshold=0.5) # Strict match
expert_tier = ModelTier(..., distance_threshold=0.7) # Looser match
```

### 5. Cost-Aware Routing

When `cost_optimization=True`, the router adds a cost penalty to distances:
```python
adjusted_distance = distance + (cost_per_1k * cost_weight)
```
This prefers cheaper tiers when semantic distances are close.
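
A worked example with assumed numbers (`cost_weight` here is illustrative, not the library default):

```python
cost_weight = 10.0  # illustrative weight

# Expert tier: slightly closer semantically, but expensive.
expert_adjusted = 0.42 + (0.005 * cost_weight)   # 0.47

# Simple tier: slightly farther, but nearly free.
simple_adjusted = 0.45 + (0.0001 * cost_weight)  # 0.451

# 0.451 < 0.47, so the simple tier wins even though the raw
# semantic distance favored the expert tier.
```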

### 6. Pretrained Configs with Embedded Vectors

The built-in `default.json` provides a ready-to-use 3-tier configuration:
```python
# Instant setup - no embedding model needed at load time
router = LLMRouter.from_pretrained("default", redis_client=client)
```

The pretrained config includes pre-computed embeddings from
`sentence-transformers/all-mpnet-base-v2`, with 18 reference phrases per tier
covering the Bloom's Taxonomy spectrum.

> **Collaborator:** Would langcache-embed be a fit here, potentially?

Custom configs can also be exported and shared:
```python
# Export (one-time, with embedding model)
router.export_with_embeddings("my_router.json")

# Import (no embedding needed)
router = LLMRouter.from_pretrained("my_router.json", redis_client=client)
```
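
The exported file has roughly the following shape, shown here as a Python dict (a sketch; the exact field names are defined by the `Pretrained*` schemas in `schema.py`):

```python
# Rough shape of an exported pretrained config: one embedded vector per
# reference phrase (768 dims for all-mpnet-base-v2), truncated here.
pretrained_config = {
    "name": "default",
    "vectorizer": "sentence-transformers/all-mpnet-base-v2",
    "tiers": [
        {
            "name": "simple",
            "model": "openai/gpt-4.1-nano",
            "distance_threshold": 0.5,
            "references": [
                {"text": "hello", "vector": [0.013, -0.027]},  # truncated
            ],
        },
    ],
}
```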

### 7. Async Support

`AsyncLLMRouter` provides the same functionality using async I/O. Since
`__init__` cannot be async, it uses a `create()` classmethod factory:

```python
router = await AsyncLLMRouter.create(
name="my-router",
tiers=tiers,
redis_client=async_client,
)
match = await router.route("hello")
```

Key async method mapping:

| Sync (`LLMRouter`) | Async (`AsyncLLMRouter`) |
|---------------------|--------------------------|
| `__init__()` | `await create()` |
| `from_existing()` | `await from_existing()` |
| `route()` | `await route()` |
| `route_many()` | `await route_many()` |
| `add_tier()` | `await add_tier()` |
| `remove_tier()` | `await remove_tier()` |
| `from_dict()` | `await from_dict()` |
| `from_pretrained()` | `await from_pretrained()` |
| `delete()` | `await delete()` |
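
A minimal sketch of the factory pattern behind `create()` (illustrative; the real method also builds the vector index and loads reference embeddings, and `_async_setup` is a hypothetical helper name):

```python
class AsyncLLMRouter:
    @classmethod
    async def create(cls, name: str, tiers: list, **kwargs) -> "AsyncLLMRouter":
        # __new__ bypasses __init__, so no synchronous setup runs here.
        self = cls.__new__(cls)
        self.name = name
        self.tiers = tiers
        # The async work __init__ could not await: index creation,
        # writing reference vectors, etc. (hypothetical helper).
        await self._async_setup(**kwargs)
        return self

    async def _async_setup(self, **kwargs) -> None:
        ...  # create index, load references
```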

## Module Structure

```
redisvl/extensions/llm_router/
+-- __init__.py # Public exports (LLMRouter, AsyncLLMRouter, schemas)
+-- DESIGN.md # This document
+-- schema.py # Pydantic models
| +-- ModelTier # Tier definition
| +-- LLMRouteMatch # Routing result
| +-- RoutingConfig # Router configuration
| +-- Pretrained* # Export/import schemas
+-- router.py # LLMRouter + AsyncLLMRouter implementations
+-- pretrained/
    +-- __init__.py    # Pretrained loader (get_pretrained_path)
    +-- default.json   # Standard 3-tier config (simple/standard/expert)
```

## API Examples

### Basic Usage

```python
from redisvl.extensions.llm_router import LLMRouter, ModelTier

tiers = [
ModelTier(
name="simple",
model="openai/gpt-4.1-nano",
references=[
"hello", "hi there", "thanks", "goodbye",
"what time is it?", "how are you?",
],
metadata={"cost_per_1k_input": 0.0001},
distance_threshold=0.5,
),
ModelTier(
name="standard",
model="anthropic/claude-sonnet-4-5",
references=[
"analyze this code for bugs",
"explain how neural networks learn",
"compare and contrast these approaches",
],
metadata={"cost_per_1k_input": 0.003},
distance_threshold=0.6,
),
ModelTier(
name="expert",
model="anthropic/claude-opus-4-5",
references=[
"prove this mathematical theorem",
"architect a distributed system",
"write a research paper analyzing",
],
metadata={"cost_per_1k_input": 0.005},
distance_threshold=0.7,
),
]

router = LLMRouter(
name="my-llm-router",
tiers=tiers,
redis_url="redis://localhost:6379",
)

# Route a query
query = "hello, how's it going?"
match = router.route(query)
print(match.tier)   # "simple"
print(match.model)  # "openai/gpt-4.1-nano"

# Use with LiteLLM (optional integration)
from litellm import completion
response = completion(model=match.model, messages=[{"role": "user", "content": query}])
```
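
For batches, `route_many()` (listed in the async method table above) routes several queries in one call; a sketch, assuming it returns one match per query in input order:

```python
queries = ["hi there", "prove this mathematical theorem"]
matches = router.route_many(queries)
for q, m in zip(queries, matches):
    print(f"{q!r} -> {m.tier} ({m.model})")
```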

### Cost-Optimized Routing

```python
router = LLMRouter(
name="cost-aware-router",
tiers=tiers,
cost_optimization=True, # Prefer cheaper tiers when distances are close
redis_url="redis://localhost:6379",
)
```

### Pretrained Router

```python
# Load without needing an embedding model for the references
router = LLMRouter.from_pretrained(
"default", # Built-in config, or path to JSON
redis_client=client,
)
```

### Async Usage

```python
from redisvl.extensions.llm_router import AsyncLLMRouter

router = await AsyncLLMRouter.create(
name="my-async-router",
tiers=tiers,
redis_url="redis://localhost:6379",
)

match = await router.route("explain how garbage collection works")
print(match.model) # "anthropic/claude-sonnet-4-5"

# Or load from pretrained
router = await AsyncLLMRouter.from_pretrained("default", redis_client=client)

await router.delete()
```

## Comparison with SemanticRouter

| Feature | SemanticRouter | LLMRouter |
|---------|---------------|-----------|
| Purpose | Topic classification | Model selection |
| Output | Route name | Model string + metadata |
| Cost awareness | No | Yes |
| Pretrained configs | No | Yes |
| Per-route thresholds | Yes | Yes |
| LiteLLM integration | No | Yes (model strings) |
| Async support | No | Yes (`AsyncLLMRouter`) |

## Testing
> **Collaborator:** This document is useful for LLM development and to an extent as documentation, but parts of it like this testing list might be a bit over-the-top 😅


### Unit Tests (`tests/unit/test_llm_router_schema.py`)
- Schema validation
- Pydantic model behavior
- Threshold bounds
- Empty/invalid inputs

### Integration Tests (`tests/integration/test_llm_router.py`)
- Router initialization
- Routing accuracy
- Cost optimization behavior
- Serialization (dict, YAML, JSON)
- Pretrained import/export
- Pretrained config loading (`from_pretrained("default")`)
- Tier management (add, remove, update)
- Persistence (from_existing)

### Async Integration Tests (`tests/integration/test_async_llm_router.py`)
- Mirrors all sync tests with `AsyncLLMRouter`
- Uses `async_client` fixture and async skip helpers
- Tests `create()` factory, async routing, serialization, tier management
- Pretrained config loading

Run tests:
```bash
uv run pytest tests/unit/test_llm_router_schema.py -v
uv run pytest tests/integration/test_llm_router.py -v
uv run pytest tests/integration/test_async_llm_router.py -v
```

## Future Enhancements

### 1. `complete()` Method
Direct LiteLLM integration for one-liner usage:
```python
response = router.complete("analyze this code", messages=[...])
```
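
One possible shape, wrapping `litellm.completion` (a sketch, not a committed API):

```python
from litellm import completion

class LLMRouter:
    # ...existing router internals...

    def complete(self, query: str, messages: list | None = None, **kwargs):
        """Sketch: route the query, then call the selected model via LiteLLM."""
        match = self.route(query)
        if match.model is None:
            raise ValueError(f"no tier matched: {query!r}")
        return completion(
            model=match.model,
            messages=messages or [{"role": "user", "content": query}],
            **kwargs,
        )
```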

### 2. Capability Filtering
Filter tiers by capability before routing:
```python
match = router.route("generate an image", capabilities=["vision"])
```

### 3. Budget Constraints
Enforce cost limits:
```python
router = LLMRouter(..., max_cost_per_1k=0.01) # Never select opus
```

### 4. Fallback Chains
Define fallback order when primary tier unavailable:
```python
tier = ModelTier(..., fallback=["standard", "simple"])
```

## References

- [RedisVL SemanticRouter](https://docs.redisvl.com/en/latest/user_guide/semantic_router.html)
- [LiteLLM Model List](https://docs.litellm.ai/docs/providers)
- [RouteLLM](https://github.com/lm-sys/RouteLLM) - LMSys binary router framework
- [NVIDIA LLM Router Blueprint](https://build.nvidia.com/blueprints/llm-router) - Complexity-based routing
- [RouterArena / Bloom's Taxonomy](https://arxiv.org/abs/2412.06644) - Cognitive complexity for routing
- [FrugalGPT](https://arxiv.org/abs/2305.05176) - Cost-efficient LLM strategies
- [OpenRouter](https://openrouter.ai/) - Auto-routing concept
- [NotDiamond](https://notdiamond.ai/) - ML-based model routing
- [Unify.ai](https://unify.ai/) - Quality-cost tradeoff routing