Optimize multimodal resource allocation with concurrency and improved batch RPC #1017
Open
dyyoungg wants to merge 9 commits into ModelTC:main from
Conversation
Summary
This PR introduces a comprehensive performance overhaul of the multimodal resource allocation pipeline. It refactors both the httpserver.manager and the cache server (CacheServer) to replace sequential, "chatty" operations with a concurrent, batched approach. This significantly reduces latency and improves throughput, especially for requests with a large number of multimodal items.

Problem
The original implementation was inefficient due to two primary bottlenecks:

- In httpserver.manager, I/O operations (reading files) and CPU-bound tasks (calculating MD5s, create_shm) for each multimodal item were executed one after another.
- The original exposed_alloc function signature was alloc(self, md5sum_list: list[str], token_num_list: list[int]). Although it handled lists, rpyc serializes each argument (md5sum_list and token_num_list) independently, which introduces significant per-call overhead.

Solution (v2 Implementation)
✅ Concurrent Processing

The httpserver.manager now uses a ThreadPoolExecutor to concurrently read item data and calculate MD5 sums, fully leveraging available CPU cores, and uses asyncio.gather to expedite resource setup.
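The concurrent read-and-hash step described above could look roughly like the following stdlib sketch (prepare_items and _read_and_md5 are illustrative names, not the PR's actual helpers):

```python
import asyncio
import hashlib
from concurrent.futures import ThreadPoolExecutor

def _read_and_md5(data: bytes) -> str:
    # CPU-bound per-item step: hash the raw item bytes
    # (file reads would happen here in the real pipeline).
    return hashlib.md5(data).hexdigest()

async def prepare_items(items: list[bytes], workers: int = 4) -> list[str]:
    # Fan the per-item work out to a thread pool instead of looping
    # sequentially, then gather so the event loop is never blocked.
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [loop.run_in_executor(pool, _read_and_md5, item) for item in items]
        return await asyncio.gather(*futures)

md5s = asyncio.run(prepare_items([b"image-0", b"image-1"]))
```

asyncio.gather preserves input order, so the results line up with the original item list even though the hashing ran concurrently.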
✅ New Batched RPC Interface

New exposed_*_v2 methods have been added to the CacheServer, including alloc_v2, release_v2, set_items_data_v2, get_items_data_v2, set_items_embed_v2, and get_items_embed_v2.
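On the wire, the idea is to hand the RPC layer one pre-serialized bytes blob instead of two lists it would otherwise serialize element by element. A hypothetical client-side wrapper (call_alloc_v2 and the dict layout are assumptions, not the PR's exact protocol):

```python
import pickle

def call_alloc_v2(conn, md5sum_list: list[str], token_num_list: list[int]) -> list:
    # Pack the whole batch into one opaque bytes argument so the call
    # crosses the wire as a single round trip, rather than letting the
    # RPC layer serialize md5sum_list and token_num_list separately.
    request_blob = pickle.dumps({"md5sums": md5sum_list, "token_nums": token_num_list})
    response_blob = conn.root.alloc_v2(request_blob)
    # The server answers with one serialized blob for the whole batch.
    return pickle.loads(response_blob)
```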
✅ Server-Side Batch Handling

The CacheServer's new v2 endpoints deserialize the request blob, process the batch of items internally, and return a single serialized response. This makes the server-side logic more efficient and cohesive.
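Server-side, a v2 endpoint in this style unpacks once, loops in-process, and repacks once. A minimal sketch assuming a pickle-based blob format (the real CacheServer's storage and return values differ):

```python
import pickle

class CacheServerSketch:
    # Illustrative shape of a batched v2 endpoint, not the PR's actual class.
    def __init__(self):
        self._allocated: dict[str, int] = {}

    def exposed_alloc_v2(self, request_blob: bytes) -> bytes:
        # One deserialization for the whole batch...
        request = pickle.loads(request_blob)
        results = []
        for md5sum, token_num in zip(request["md5sums"], request["token_nums"]):
            # ...every item handled locally, with no per-item network traffic...
            self._allocated[md5sum] = token_num
            results.append(True)
        # ...and one serialized response back to the caller.
        return pickle.dumps(results)
```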
✅ Feature Toggle

Added --enable_concurrent_alloc and --concurrent_alloc_workers parameters to control the new concurrent allocation behavior, allowing a gradual rollout and an easy fallback to the original implementation if needed. In audioserver/visualserver.manager, get_items_embed now defaults to the v2 implementation to reduce time.
Performance Evaluation

I evaluated performance during inference with our internal LLaVA-like model, using images of the same size (644*364).

Testing Environment: concurrent_alloc_workers=4

The reported values are averages and may fluctuate slightly, but not significantly.
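For reproducing this kind of measurement, a generic timing harness (not the benchmark that produced the numbers above) might look like:

```python
import time

def mean_latency(fn, requests, repeats: int = 3) -> float:
    # Run the whole request set several times and average the wall-clock
    # time, since single-run numbers fluctuate.
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        for request in requests:
            fn(request)
        samples.append(time.perf_counter() - start)
    return sum(samples) / len(samples)
```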