[runtime][vm] Add LoRA adapter metadata to paged KV cache#18890
MagellaX wants to merge 3 commits into apache:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request establishes foundational runtime support for LoRA adapter metadata within the paged KV cache. It provides the necessary plumbing to track and manage LoRA adapter IDs at the sequence level, which is a building block for future multi-LoRA serving capabilities. The changes are focused purely on runtime state management and do not yet include frontend request handling, LoRA execution in model operations, or specialized batching logic.
Activity
@babusid lmk your thoughts!
Code Review
This pull request introduces support for LoRA adapter metadata within the paged KV cache. The changes are well-structured, adding a lora_adapter_id to the Sequence struct, exposing new FFI functions to manage these IDs, and ensuring correct propagation during sequence forking. The implementation is clean and includes a comprehensive set of tests covering the new functionality. I have one minor suggestion to improve code reuse and simplify the implementation.
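To make the reviewed change concrete, here is a minimal, self-contained sketch of the mechanism described above: a per-sequence LoRA adapter id stored alongside KV-cache bookkeeping, with FFI-style set/get accessors and fork-time inheritance. All names here (`KVState`, `kNoAdapter`, the method names) are illustrative assumptions, not TVM's actual API.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Sentinel meaning "no LoRA adapter attached" (assumed convention).
constexpr int64_t kNoAdapter = -1;

struct Sequence {
  int64_t seq_id;
  int64_t lora_adapter_id = kNoAdapter;  // metadata only; no weights live here
};

class KVState {
 public:
  void AddSequence(int64_t seq_id) { seqs_[seq_id] = Sequence{seq_id}; }

  // FFI-style setter/getter for a sequence's adapter id.
  void SetLoRAAdapter(int64_t seq_id, int64_t adapter_id) {
    seqs_.at(seq_id).lora_adapter_id = adapter_id;
  }
  int64_t GetLoRAAdapter(int64_t seq_id) const {
    return seqs_.at(seq_id).lora_adapter_id;
  }

  // A forked child inherits the parent's adapter id, mirroring the
  // ForkSequence propagation described in the review above.
  void ForkSequence(int64_t parent_id, int64_t child_id) {
    seqs_[child_id] = Sequence{child_id, seqs_.at(parent_id).lora_adapter_id};
  }

 private:
  std::unordered_map<int64_t, Sequence> seqs_;
};
```

The point of the sketch is the invariant: adapter identity is sequence-level state that survives forking, while adapter weights stay out of this structure entirely.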
I've been mulling over what the best architecture for multi-LoRA support might look like, and I don't think the original breakdown I gave you was the best. Sorry about that. While the S-LoRA paper does use paged memory blocks to handle the adapter weights, TVM's construct is explicitly a KV cache.

I'm wondering if we're not better served by implementing a separate memory pool construct, in the same vein as PagedKVCache.cc or RNNState.cc, designed specifically for holding LoRA adapter weights at runtime. That said, by restructuring the efforts, we might be able to keep things strictly downstream (not completely sure on this last point). Let me discuss a bit more internally with @MasterJH5574 and others, and see if I can better formalize an idea / design.
Yeah, after digging through S-LoRA/Punica again and also looking at how the actual serving stacks handle this, I think you're probably right. The consistent pattern seems to be: adapter identity needs to flow with the request for cache correctness, but the adapter weights themselves live in a separate adapter cache / pool, not inside the KV-cache object. S-LoRA's unified paging reads more like a lower-level allocator/kernel strategy than the right API boundary. TRT-LLM is especially explicit about this, since it keeps a LoRA cache distinct from the KV cache, and vLLM / SGLang also seem to converge on separate adapter management, with adapter ids still affecting cache correctness. So my current leaning is to avoid pushing actual LoRA weights into the KV-cache object.
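The "adapter ids still affecting cache correctness" point above can be illustrated with a toy prefix cache: when reusing cached KV state, the lookup key must include the adapter id, because the same token prefix run under different adapters produces different KV values. The key layout and `-1` miss convention are illustrative assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Key = (adapter id, token prefix). Including the adapter id prevents
// incorrectly sharing a cached prefix across adapters.
using PrefixKey = std::pair<int64_t, std::string>;

class PrefixCache {
 public:
  void Put(int64_t adapter_id, const std::string& prefix, int64_t kv_block) {
    cache_[{adapter_id, prefix}] = kv_block;
  }
  // Returns -1 on miss; an identical prefix under a different adapter misses.
  int64_t Lookup(int64_t adapter_id, const std::string& prefix) const {
    auto it = cache_.find({adapter_id, prefix});
    return it == cache_.end() ? -1 : it->second;
  }

 private:
  std::map<PrefixKey, int64_t> cache_;
};
```

This is why adapter identity has to travel with the sequence metadata even if the weights live elsewhere: dropping it from the key silently serves one adapter's KV state to another.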
@MagellaX Thank you so much for the contributions! My overall read is that we probably need to first establish an end-to-end LoRA serving flow, with runnable tests and real commands, before upstreaming parts. The main reason is that we don't want to iterate on the implementation too many times in the mainline repo without seeing end-to-end effects.
Yeah, totally agree with that! I think the main useful outcome of this PR was clarifying the runtime boundary a bit, but I agree the next step should be downstream first, not upstream first. I'll focus on getting a minimal end-to-end LoRA serving path working in MLC with real runnable commands, tests, and a clear single-adapter flow, and then only upstream the smallest TVM pieces that are actually required by that working path. That should make the design much easier to evaluate and avoid churning upstream abstractions too early.
Summary
This PR adds minimal runtime support for LoRA adapter metadata in paged KV cache state.
This is intended as a small runtime building block for downstream multi-LoRA serving work discussed here:
Changes
- Added lora_adapter_id to Sequence
- New FFI functions to set/get a sequence's adapter id
- ForkSequence inherits the parent adapter id
- Adapter ids are visible during BeginForward

Scope
This is runtime-state plumbing only.
This PR does not yet add:
- frontend request handling
- LoRA execution in model operations
- specialized batching logic
Validation
Validated on a fresh apache/tvm base with a local source build with Modal...