feat: LLM Router extension for cost-optimized model selection #476
Open: bsbodden wants to merge 11 commits into main from llm-router
0b105e7 feat(llm-router): Initial LLMRouter extension (bsbodden)
d1791af test(llm-router): Add unit tests for schema validation (bsbodden)
8a603c7 fix(llm-router): fix Pydantic initialization and test assertions (bsbodden)
1b7b0e1 test(llm-router): simplify test assertions for semantic matching (bsbodden)
91e8c99 docs(llm-router): add comprehensive DESIGN.md (bsbodden)
967db0e feat(llm-router): add pretrained default config with pre-computed emb… (bsbodden)
7045745 feat(llm-router): add AsyncLLMRouter and update exports (bsbodden)
7f4f9de test(llm-router): add async and pretrained integration tests (bsbodden)
fda6eb6 docs(llm-router): add user guide notebook and update DESIGN.md (bsbodden)
e5facb6 fix(llm-router): prioritize redis_client over default redis_url in cl… (bsbodden)
e1cd469 fix(llm-router): add threshold validation and empty-tiers guard (bsbodden)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,385 @@ | ||
| # LLM Router Extension - Design Document | ||
|
|
||
| ## Overview | ||
|
|
||
| The LLM Router is an extension to RedisVL that provides intelligent, cost-optimized LLM model selection using semantic routing. Instead of routing queries to topics (like SemanticRouter), it routes queries to **model tiers** - selecting the cheapest LLM capable of handling each task. | ||
|
|
||
| ## Problem Statement | ||
|
|
||
| ### The LLM Cost Problem | ||
| Modern applications often default to using the most capable (and expensive) LLM for all queries, even when simpler models would suffice: | ||
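| - "Hello, how are you?" -> Claude Opus 4.5 ($5/M input tokens): what many applications send by default | ||
| - "Hello, how are you?" -> GPT-4.1 Nano ($0.10/M input tokens): sufficient for this query at 1/50 of the price | ||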
|
|
||
| ### Existing Solutions and Their Limitations | ||
|
|
||
| **RouteLLM** (LMSys): | ||
| - Binary classification only (strong vs weak model) | ||
| - No support for >2 tiers | ||
| - Requires training data or preference matrices | ||
|
|
||
| **NVIDIA LLM Router Blueprint**: | ||
| - Complexity classification approach (simple/moderate/complex) | ||
| - Provides the taxonomy basis but no open-source Redis-native implementation | ||
|
|
||
| **RouterArena / Bloom's Taxonomy Approach**: | ||
| - Maps query complexity to Bloom's cognitive levels | ||
| - Informs our tier design but lacks production routing infrastructure | ||
|
|
||
| **OpenRouter Auto-Router**: | ||
| - Black box routing decisions | ||
| - Data flows through third-party servers | ||
| - No transparency into why a model was selected | ||
| - Can't self-host or customize | ||
|
|
||
| **NotDiamond**: | ||
| - Proprietary ML model for routing | ||
| - Requires API calls for every routing decision | ||
| - No local/offline capability | ||
|
|
||
| **FrugalGPT**: | ||
| - Sequential cascade approach (try cheap first, escalate) | ||
| - Higher latency due to serial model calls | ||
|
|
||
| ## Solution: Semantic Model Tier Routing | ||
|
|
||
| Repurpose RedisVL's battle-tested SemanticRouter for model selection: | ||
|
|
||
| ``` | ||
| SemanticRouter -> LLMRouter | ||
| ----------------------------------------- | ||
| Route -> ModelTier | ||
| route.name -> tier.name (simple/standard/expert) | ||
| route.references -> tier.references (task complexity examples) | ||
| route.metadata -> tier.metadata (cost, capabilities) | ||
| RouteMatch -> LLMRouteMatch (includes model string) | ||
| ``` | ||
|
|
||
| ### Architecture | ||
|
|
||
| ``` | ||
| +---------------------------------------------------------------+ | ||
| | LLMRouter | | ||
| +---------------------------------------------------------------+ | ||
| | +-------------+ +-------------+ +-------------+ | | ||
| | | Simple | | Standard | | Expert | | | ||
| | | Tier | | Tier | | Tier | | | ||
| | +-------------+ +-------------+ +-------------+ | | ||
| | | gpt-4.1-nano| | sonnet 4.5 | | opus 4.5 | | | ||
| | | $0.10/M | | $3/M | | $5/M | | | ||
| | | threshold: | | threshold: | | threshold: | | | ||
| | | 0.5 | | 0.6 | | 0.7 | | | ||
| | +-------------+ +-------------+ +-------------+ | | ||
| | | | | | | ||
| | +----------------+----------------+ | | ||
| | v | | ||
| | +------------------------+ | | ||
| | | Redis Vector Index | | | ||
| | | (reference phrases) | | | ||
| | +------------------------+ | | ||
| +---------------------------------------------------------------+ | ||
| | | ||
| v | ||
| +-------------+ | ||
| | Query | | ||
| | "analyze | | ||
| | this..." | | ||
| +-------------+ | ||
| | | ||
| v | ||
| +-------------+ | ||
| | LiteLLM | | ||
| | (optional) | | ||
| +-------------+ | ||
| ``` | ||
|
|
||
| ## Key Design Decisions | ||
|
|
||
| ### 1. Model Tiers, Not Individual Models | ||
|
|
||
| Routes map to **tiers** (simple, standard, expert) rather than specific models. This provides: | ||
| - Abstraction from model churn (swap haiku -> gemini-flash without changing routes) | ||
| - Clear mental model for users | ||
| - Easy cost optimization within tiers | ||
|
|
||
| ### 2. Bloom's Taxonomy-Grounded Tiers | ||
|
|
||
| The default pretrained config maps tiers to Bloom's Taxonomy cognitive levels: | ||
| - **Simple** (Remember/Understand): Factual recall, greetings, format conversion | ||
| - **Standard** (Apply/Analyze): Code explanation, summarization, moderate analysis | ||
| - **Expert** (Evaluate/Create): Research, architecture, formal reasoning | ||
|
|
||
| This is informed by RouterArena's finding that cognitive complexity correlates with model capability requirements. | ||
|
|
||
| ### 3. LiteLLM-Compatible Model Strings | ||
|
|
||
| Tier model identifiers use LiteLLM format (`provider/model`): | ||
| ```python | ||
| ModelTier( | ||
| name="standard", | ||
| model="anthropic/claude-sonnet-4-5", # Works directly with LiteLLM | ||
| ... | ||
| ) | ||
| ``` | ||
|
|
||
| ### 4. Per-Tier Distance Thresholds | ||
|
|
||
| Each tier has its own `distance_threshold`, allowing fine-grained control: | ||
| ```python | ||
| simple_tier = ModelTier(..., distance_threshold=0.5) # Strict match | ||
| expert_tier = ModelTier(..., distance_threshold=0.7) # Looser match | ||
| ``` | ||
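As a rough illustration of what per-tier thresholds imply, the sketch below filters candidate tiers by their own threshold and then picks the nearest survivor. It does not call RedisVL: `match_tier` and the sample distances are hypothetical, the `standard` threshold of 0.6 is taken from the architecture diagram, and the real router may combine these steps differently.

```python
from typing import Dict, Optional

# Thresholds from this document: a lower value means a stricter match.
THRESHOLDS = {"simple": 0.5, "standard": 0.6, "expert": 0.7}


def match_tier(distances: Dict[str, float]) -> Optional[str]:
    """Return the nearest tier whose distance is within its own threshold.

    Hypothetical helper: returns None when no tier qualifies.
    """
    eligible = {
        tier: dist for tier, dist in distances.items() if dist <= THRESHOLDS[tier]
    }
    if not eligible:
        return None
    return min(eligible, key=eligible.get)


# "simple" is nearer but misses its strict 0.5 threshold, so "expert" matches.
print(match_tier({"simple": 0.55, "expert": 0.65}))  # expert
# Nothing is within threshold, so there is no match.
print(match_tier({"simple": 0.80, "standard": 0.90, "expert": 0.75}))  # None
```

A stricter threshold on the cheap tier keeps borderline queries from being sent to a model that may be too weak for them.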
|
|
||
| ### 5. Cost-Aware Routing | ||
|
|
||
| When `cost_optimization=True`, the router adds a cost penalty to distances: | ||
| ```python | ||
| adjusted_distance = distance + (cost_per_1k * cost_weight) | ||
| ``` | ||
| This prefers cheaper tiers when semantic distances are close. | ||
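To see how the penalty can shift a close decision, here is a minimal standalone sketch of the formula above. It does not call RedisVL: `pick_tier`, the `cost_weight` values, and the sample distances are assumptions chosen for illustration, while the costs are the per-1k input prices from the example tiers in this document.

```python
from typing import Dict, Tuple

# Per-1k input-token costs taken from the example tiers in this document.
COST_PER_1K = {"simple": 0.0001, "standard": 0.003, "expert": 0.005}


def pick_tier(distances: Dict[str, float], cost_weight: float = 0.0) -> Tuple[str, float]:
    """Return the (tier, adjusted_distance) pair with the smallest adjusted distance.

    Hypothetical helper applying: adjusted = distance + cost_per_1k * cost_weight.
    """
    adjusted = {
        tier: dist + COST_PER_1K[tier] * cost_weight
        for tier, dist in distances.items()
    }
    best = min(adjusted, key=adjusted.get)
    return best, adjusted[best]


# Made-up distances where "expert" wins narrowly on similarity alone.
sample = {"standard": 0.42, "expert": 0.41}

print(pick_tier(sample, cost_weight=0.0)[0])   # expert
print(pick_tier(sample, cost_weight=10.0)[0])  # standard
```

With a weight of 10, the penalties are 0.03 for `standard` and 0.05 for `expert`, which is enough to flip a 0.01 gap but not a large one: a query at distance 0.30 from `expert` would still route there.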
|
|
||
| ### 6. Pretrained Configs with Embedded Vectors | ||
|
|
||
| The built-in `default.json` provides a ready-to-use 3-tier configuration: | ||
| ```python | ||
| # Instant setup - no embedding model needed at load time | ||
| router = LLMRouter.from_pretrained("default", redis_client=client) | ||
| ``` | ||
|
|
||
| The pretrained config includes pre-computed embeddings from | ||
| `sentence-transformers/all-mpnet-base-v2`, with 18 reference phrases per tier | ||
| covering the Bloom's Taxonomy spectrum. | ||
|
|
||
| Custom configs can also be exported and shared: | ||
| ```python | ||
| # Export (one-time, with embedding model) | ||
| router.export_with_embeddings("my_router.json") | ||
|
|
||
| # Import (no embedding needed) | ||
| router = LLMRouter.from_pretrained("my_router.json", redis_client=client) | ||
| ``` | ||
|
|
||
| ### 7. Async Support | ||
|
|
||
| `AsyncLLMRouter` provides the same functionality using async I/O. Since | ||
| `__init__` cannot be async, it uses a `create()` classmethod factory: | ||
|
|
||
| ```python | ||
| router = await AsyncLLMRouter.create( | ||
| name="my-router", | ||
| tiers=tiers, | ||
| redis_client=async_client, | ||
| ) | ||
| match = await router.route("hello") | ||
| ``` | ||
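The factory pattern itself can be sketched without Redis. In the example below, `AsyncResource` and `_setup` are hypothetical stand-ins rather than RedisVL classes: the constructor only stores state, and `create()` awaits the setup step before returning a ready instance.

```python
import asyncio


class AsyncResource:
    def __init__(self, name: str) -> None:
        # Synchronous constructor: stores state only and performs no I/O.
        self.name = name
        self.ready = False

    @classmethod
    async def create(cls, name: str) -> "AsyncResource":
        """Build an instance and await its asynchronous setup."""
        instance = cls(name)
        await instance._setup()
        return instance

    async def _setup(self) -> None:
        # Stand-in for awaited work such as creating an index.
        await asyncio.sleep(0)
        self.ready = True


async def main() -> AsyncResource:
    return await AsyncResource.create("my-router")


resource = asyncio.run(main())
print(resource.ready)  # True
```

Top-level `await`, as used in the snippets in this document, works in notebooks and async REPLs; in a plain script the calls need to run inside a coroutine started with `asyncio.run()`, as shown here.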
|
|
||
| Key async method mapping: | ||
|
|
||
| | Sync (`LLMRouter`) | Async (`AsyncLLMRouter`) | | ||
| |---------------------|--------------------------| | ||
| | `__init__()` | `await create()` | | ||
| | `from_existing()` | `await from_existing()` | | ||
| | `route()` | `await route()` | | ||
| | `route_many()` | `await route_many()` | | ||
| | `add_tier()` | `await add_tier()` | | ||
| | `remove_tier()` | `await remove_tier()` | | ||
| | `from_dict()` | `await from_dict()` | | ||
| | `from_pretrained()` | `await from_pretrained()` | | ||
| | `delete()` | `await delete()` | | ||
|
|
||
| ## Module Structure | ||
|
|
||
| ``` | ||
| redisvl/extensions/llm_router/ | ||
| +-- __init__.py # Public exports (LLMRouter, AsyncLLMRouter, schemas) | ||
| +-- DESIGN.md # This document | ||
| +-- schema.py # Pydantic models | ||
| | +-- ModelTier # Tier definition | ||
| | +-- LLMRouteMatch # Routing result | ||
| | +-- RoutingConfig # Router configuration | ||
| | +-- Pretrained* # Export/import schemas | ||
| +-- router.py # LLMRouter + AsyncLLMRouter implementations | ||
| +-- pretrained/ | ||
| +-- __init__.py # Pretrained loader (get_pretrained_path) | ||
| +-- default.json # Standard 3-tier config (simple/standard/expert) | ||
| ``` | ||
|
|
||
| ## API Examples | ||
|
|
||
| ### Basic Usage | ||
|
|
||
| ```python | ||
| from redisvl.extensions.llm_router import LLMRouter, ModelTier | ||
|
|
||
| tiers = [ | ||
| ModelTier( | ||
| name="simple", | ||
| model="openai/gpt-4.1-nano", | ||
| references=[ | ||
| "hello", "hi there", "thanks", "goodbye", | ||
| "what time is it?", "how are you?", | ||
| ], | ||
| metadata={"cost_per_1k_input": 0.0001}, | ||
| distance_threshold=0.5, | ||
| ), | ||
| ModelTier( | ||
| name="standard", | ||
| model="anthropic/claude-sonnet-4-5", | ||
| references=[ | ||
| "analyze this code for bugs", | ||
| "explain how neural networks learn", | ||
| "compare and contrast these approaches", | ||
| ], | ||
| metadata={"cost_per_1k_input": 0.003}, | ||
| distance_threshold=0.6, | ||
| ), | ||
| ModelTier( | ||
| name="expert", | ||
| model="anthropic/claude-opus-4-5", | ||
| references=[ | ||
| "prove this mathematical theorem", | ||
| "architect a distributed system", | ||
| "write a research paper analyzing", | ||
| ], | ||
| metadata={"cost_per_1k_input": 0.005}, | ||
| distance_threshold=0.7, | ||
| ), | ||
| ] | ||
|
|
||
| router = LLMRouter( | ||
| name="my-llm-router", | ||
| tiers=tiers, | ||
| redis_url="redis://localhost:6379", | ||
| ) | ||
|
|
||
| # Route a query | ||
| query = "hello, how's it going?" | ||
| match = router.route(query) | ||
| print(match.tier) # "simple" | ||
| print(match.model) # "openai/gpt-4.1-nano" | ||
|
|
||
| # Use with LiteLLM (optional integration) | ||
| from litellm import completion | ||
| response = completion(model=match.model, messages=[{"role": "user", "content": query}]) | ||
| ``` | ||
|
|
||
| ### Cost-Optimized Routing | ||
|
|
||
| ```python | ||
| router = LLMRouter( | ||
| name="cost-aware-router", | ||
| tiers=tiers, | ||
| cost_optimization=True, # Prefer cheaper tiers when distances are close | ||
| redis_url="redis://localhost:6379", | ||
| ) | ||
| ``` | ||
|
|
||
| ### Pretrained Router | ||
|
|
||
| ```python | ||
| # Load without needing an embedding model for the references | ||
| router = LLMRouter.from_pretrained( | ||
| "default", # Built-in config, or path to JSON | ||
| redis_client=client, | ||
| ) | ||
| ``` | ||
|
|
||
| ### Async Usage | ||
|
|
||
| ```python | ||
| from redisvl.extensions.llm_router import AsyncLLMRouter | ||
|
|
||
| router = await AsyncLLMRouter.create( | ||
| name="my-async-router", | ||
| tiers=tiers, | ||
| redis_url="redis://localhost:6379", | ||
| ) | ||
|
|
||
| match = await router.route("explain how garbage collection works") | ||
| print(match.model) # "anthropic/claude-sonnet-4-5" | ||
|
|
||
| # Or load from pretrained | ||
| router = await AsyncLLMRouter.from_pretrained("default", redis_client=async_client) | ||
|
|
||
| await router.delete() | ||
| ``` | ||
|
|
||
| ## Comparison with SemanticRouter | ||
|
|
||
| | Feature | SemanticRouter | LLMRouter | | ||
| |---------|---------------|-----------| | ||
| | Purpose | Topic classification | Model selection | | ||
| | Output | Route name | Model string + metadata | | ||
| | Cost awareness | No | Yes | | ||
| | Pretrained configs | No | Yes | | ||
| | Per-route thresholds | Yes | Yes | | ||
| | LiteLLM integration | No | Yes (model strings) | | ||
| | Async support | No | Yes (`AsyncLLMRouter`) | | ||
|
|
||
| ## Testing | ||
|
Collaborator:
This document is useful for LLM development and to an extent as documentation, but parts of it like this testing list might be a bit over-the-top 😅
|
||
|
|
||
| ### Unit Tests (`tests/unit/test_llm_router_schema.py`) | ||
| - Schema validation | ||
| - Pydantic model behavior | ||
| - Threshold bounds | ||
| - Empty/invalid inputs | ||
|
|
||
| ### Integration Tests (`tests/integration/test_llm_router.py`) | ||
| - Router initialization | ||
| - Routing accuracy | ||
| - Cost optimization behavior | ||
| - Serialization (dict, YAML, JSON) | ||
| - Pretrained import/export | ||
| - Pretrained config loading (`from_pretrained("default")`) | ||
| - Tier management (add, remove, update) | ||
| - Persistence (from_existing) | ||
|
|
||
| ### Async Integration Tests (`tests/integration/test_async_llm_router.py`) | ||
| - Mirrors all sync tests with `AsyncLLMRouter` | ||
| - Uses `async_client` fixture and async skip helpers | ||
| - Tests `create()` factory, async routing, serialization, tier management | ||
| - Pretrained config loading | ||
|
|
||
| Run tests: | ||
| ```bash | ||
| uv run pytest tests/unit/test_llm_router_schema.py -v | ||
| uv run pytest tests/integration/test_llm_router.py -v | ||
| uv run pytest tests/integration/test_async_llm_router.py -v | ||
| ``` | ||
|
|
||
| ## Future Enhancements | ||
|
|
||
| ### 1. `complete()` Method | ||
| Direct LiteLLM integration for one-liner usage: | ||
| ```python | ||
| response = router.complete("analyze this code", messages=[...]) | ||
| ``` | ||
|
|
||
| ### 2. Capability Filtering | ||
| Filter tiers by capability before routing: | ||
| ```python | ||
| match = router.route("describe this image", capabilities=["vision"]) | ||
| ``` | ||
|
|
||
| ### 3. Budget Constraints | ||
| Enforce cost limits: | ||
| ```python | ||
| router = LLMRouter(..., max_cost_per_1k=0.004) # Never select opus ($0.005 per 1k input tokens) | ||
| ``` | ||
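Since this feature is only proposed, the following is a sketch of one way a cost cap could be applied, not an existing API. `tiers_within_budget` is a hypothetical helper, and the costs are the per-1k input prices from the example tiers in this document.

```python
from typing import Dict, List

# Per-1k input-token costs taken from the example tiers in this document.
COST_PER_1K_INPUT = {"simple": 0.0001, "standard": 0.003, "expert": 0.005}


def tiers_within_budget(costs: Dict[str, float], max_cost_per_1k: float) -> List[str]:
    """Return the tier names whose cost does not exceed the cap, cheapest first."""
    allowed = [name for name, cost in costs.items() if cost <= max_cost_per_1k]
    return sorted(allowed, key=lambda name: costs[name])


print(tiers_within_budget(COST_PER_1K_INPUT, 0.004))  # ['simple', 'standard']
```

With these input prices, the cap has to sit below 0.005 to exclude the expert tier.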
|
|
||
| ### 4. Fallback Chains | ||
| Define the fallback order to use when the primary tier is unavailable: | ||
| ```python | ||
| tier = ModelTier(..., fallback=["standard", "simple"]) | ||
| ``` | ||
|
|
||
| ## References | ||
|
|
||
| - [RedisVL SemanticRouter](https://docs.redisvl.com/en/latest/user_guide/semantic_router.html) | ||
| - [LiteLLM Model List](https://docs.litellm.ai/docs/providers) | ||
| - [RouteLLM](https://github.com/lm-sys/RouteLLM) - LMSys binary router framework | ||
| - [NVIDIA LLM Router Blueprint](https://build.nvidia.com/blueprints/llm-router) - Complexity-based routing | ||
| - [RouterArena / Bloom's Taxonomy](https://arxiv.org/abs/2412.06644) - Cognitive complexity for routing | ||
| - [FrugalGPT](https://arxiv.org/abs/2305.05176) - Cost-efficient LLM strategies | ||
| - [OpenRouter](https://openrouter.ai/) - Auto-routing concept | ||
| - [NotDiamond](https://notdiamond.ai/) - ML-based model routing | ||
| - [Unify.ai](https://unify.ai/) - Quality-cost tradeoff routing | ||
Would langcache-embed be a fit here, potentially?