This document describes how to use MooncakeStore as the KV Cache storage backend for FastDeploy, enabling Global Cache Pooling across multiple inference instances.
Global Cache Pooling allows multiple FastDeploy instances to share KV Cache through a distributed storage layer. This enables:
- Cross-instance cache reuse: KV Cache computed by one instance can be reused by another
- PD Disaggregation optimization: Prefill and Decode instances can share cache seamlessly
- Reduced computation: Avoid redundant prefix computation across requests
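Cross-instance reuse is possible because cached blocks are addressed by the content of the token prefix, not by any instance-local state. The sketch below is only an illustration of content-addressed block keys (it is not FastDeploy's or Mooncake's actual hashing scheme): any two instances that process the same prefix derive the same keys, so one can fetch blocks the other has already stored.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (illustrative; real block sizes differ)

def block_keys(token_ids):
    """Derive one content-addressed key per full KV block, chaining each
    block's tokens with the previous block's key so that a key identifies
    the entire prefix up to that block."""
    keys, parent = [], b""
    usable = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, usable, BLOCK_SIZE):
        digest = hashlib.sha256(
            parent + repr(token_ids[i:i + BLOCK_SIZE]).encode()
        ).hexdigest()
        keys.append(digest)
        parent = digest.encode()
    return keys

# Two prompts sharing the first block: the first key matches, so a second
# instance can reuse that block from the shared pool; keys diverge after.
keys_a = block_keys([1, 5, 9, 2, 7, 7, 3, 4, 10])
keys_b = block_keys([1, 5, 9, 2, 7, 7, 3, 8, 11])
```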
```
┌─────────────────────────────────────────────────────────────────┐
│                     Mooncake Master Server                      │
│                (Metadata & Coordination Service)                │
└────────────────────────────┬────────────────────────────────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
         ▼                   ▼                   ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│   FastDeploy    │ │   FastDeploy    │ │   FastDeploy    │
│   Instance P    │ │   Instance D    │ │   Instance X    │
│   (Prefill)     │ │   (Decode)      │ │  (Standalone)   │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
                             │
                    ┌────────▼────────┐
                    │  MooncakeStore  │
                    │   (Shared KV    │
                    │   Cache Pool)   │
                    └─────────────────┘
```
Ready-to-use example scripts are available in examples/cache_storage/.
| Script | Scenario | Description |
|---|---|---|
| `run.sh` | Multi-Instance | Two standalone instances sharing cache |
| `run_03b_pd_storage.sh` | PD Disaggregation | P+D instances with global cache pooling |
- NVIDIA GPU with CUDA support
- RDMA network (recommended for production) or TCP
- Python 3.8+
- CUDA 11.8+
- FastDeploy (see installation below)
Refer to NVIDIA CUDA GPU Installation for FastDeploy installation.
```bash
# Option 1: Install from PyPI
pip install fastdeploy-gpu

# Option 2: Build from source
bash build.sh
pip install ./dist/fastdeploy*.whl
```

MooncakeStore is automatically installed when you install FastDeploy.
MooncakeStore can be configured in two ways: via a `mooncake_config.json` configuration file, or via environment variables.

Create a `mooncake_config.json` file:
```json
{
    "metadata_server": "http://0.0.0.0:15002/metadata",
    "master_server_addr": "0.0.0.0:15001",
    "global_segment_size": 1000000000,
    "local_buffer_size": 134217728,
    "protocol": "rdma",
    "rdma_devices": ""
}
```

Set the `MOONCAKE_CONFIG_PATH` environment variable to enable the configuration:

```bash
export MOONCAKE_CONFIG_PATH=path/to/mooncake_config.json
```

Configuration parameters:
| Parameter | Description | Default |
|---|---|---|
| `metadata_server` | HTTP metadata server URL | Required |
| `master_server_addr` | Master server address | Required |
| `global_segment_size` | Memory each tensor-parallel (TP) process contributes to the global shared pool (bytes) | 1 GB |
| `local_buffer_size` | Local buffer size for data transfer (bytes) | 128 MB |
| `protocol` | Transfer protocol: `rdma` or `tcp` | `rdma` |
| `rdma_devices` | RDMA device names (comma-separated) | Auto-detect |
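A parsed config can be sanity-checked against the table above before launch. A minimal sketch (the key names and defaults come from the table; `validate_config` itself is a hypothetical helper, not part of FastDeploy or Mooncake):

```python
REQUIRED = ("metadata_server", "master_server_addr")
DEFAULTS = {
    "global_segment_size": 1_000_000_000,  # ~1 GB
    "local_buffer_size": 134_217_728,      # 128 MB
    "protocol": "rdma",
    "rdma_devices": "",                    # empty string -> auto-detect
}

def validate_config(cfg: dict) -> dict:
    """Fill in documented defaults and fail fast on missing required keys."""
    missing = [k for k in REQUIRED if k not in cfg]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    out = {**DEFAULTS, **cfg}
    if out["protocol"] not in ("rdma", "tcp"):
        raise ValueError(f"unsupported protocol: {out['protocol']}")
    return out
```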
Mooncake can also be configured via environment variables:
| Variable | Description |
|---|---|
| `MOONCAKE_MASTER_SERVER_ADDR` | Master server address (e.g., `10.0.0.1:15001`) |
| `MOONCAKE_METADATA_SERVER` | Metadata server URL |
| `MOONCAKE_GLOBAL_SEGMENT_SIZE` | Memory each tensor-parallel (TP) process contributes to the global shared pool (bytes) |
| `MOONCAKE_LOCAL_BUFFER_SIZE` | Local buffer size (bytes) |
| `MOONCAKE_PROTOCOL` | Transfer protocol (`rdma` or `tcp`) |
| `MOONCAKE_RDMA_DEVICES` | RDMA device names |
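A launcher script can overlay these variables on top of file-based settings. The sketch below is hypothetical, and the precedence it implements (environment over file) is an assumption; confirm the actual precedence for your deployment:

```python
import os

# Map config-file keys to their MOONCAKE_* environment variable names
ENV_MAP = {
    "master_server_addr": "MOONCAKE_MASTER_SERVER_ADDR",
    "metadata_server": "MOONCAKE_METADATA_SERVER",
    "global_segment_size": "MOONCAKE_GLOBAL_SEGMENT_SIZE",
    "local_buffer_size": "MOONCAKE_LOCAL_BUFFER_SIZE",
    "protocol": "MOONCAKE_PROTOCOL",
    "rdma_devices": "MOONCAKE_RDMA_DEVICES",
}

def resolve(file_cfg: dict, env=os.environ) -> dict:
    """Overlay any set MOONCAKE_* environment variables on file settings
    (assumed precedence: environment wins)."""
    cfg = dict(file_cfg)
    for key, var in ENV_MAP.items():
        if var in env:
            cfg[key] = env[var]
    return cfg
```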
Run multiple FastDeploy instances sharing a global KV Cache pool.
Step 1: Start Mooncake Master
```bash
mooncake_master \
    --port=15001 \
    --enable_http_metadata_server=true \
    --http_metadata_server_host=0.0.0.0 \
    --http_metadata_server_port=15002 \
    --metrics_port=15003
```

Step 2: Start FastDeploy Instances
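Before launching the instances, it can help to confirm that the master's ports are accepting connections. A minimal sketch using only the Python standard library (15001/15002 are the master and metadata ports from Step 1):

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until a TCP port accepts connections, or give up after `timeout`."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)
    return False

# Usage before starting instances:
#   wait_for_port("127.0.0.1", 15001)   # master server
#   wait_for_port("127.0.0.1", 15002)   # HTTP metadata server
```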
Instance 0:
```bash
export MOONCAKE_CONFIG_PATH="./mooncake_config.json"
export CUDA_VISIBLE_DEVICES=0

python -m fastdeploy.entrypoints.openai.api_server \
    --model ${MODEL_NAME} \
    --port 52700 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --kvcache-storage-backend mooncake
```

Instance 1:
```bash
export MOONCAKE_CONFIG_PATH="./mooncake_config.json"
export CUDA_VISIBLE_DEVICES=1

python -m fastdeploy.entrypoints.openai.api_server \
    --model ${MODEL_NAME} \
    --port 52800 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --kvcache-storage-backend mooncake
```

Step 3: Test Cache Reuse
Send the same prompt to both instances. The second instance should reuse the KV Cache computed by the first instance.
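This round trip can also be scripted with the standard library alone. The helper below is hypothetical (ports 52700/52800 are the instance ports from Step 2):

```python
import json
import urllib.request

def chat_payload(content: str, max_tokens: int = 50) -> bytes:
    """Build an OpenAI-style chat completion request body."""
    return json.dumps({
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }).encode()

def send_chat(port: int, content: str) -> dict:
    """POST the prompt to a local FastDeploy instance and return the reply."""
    req = urllib.request.Request(
        f"http://0.0.0.0:{port}/v1/chat/completions",
        data=chat_payload(content),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage: send the identical prompt to both instances; the second request
# should then show a cache hit in that instance's logs.
#   send_chat(52700, "Hello, world!")
#   send_chat(52800, "Hello, world!")
```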
```bash
# Request to Instance 0
curl -X POST "http://0.0.0.0:52700/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello, world!"}], "max_tokens": 50}'

# Request to Instance 1 (should hit cached KV)
curl -X POST "http://0.0.0.0:52800/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello, world!"}], "max_tokens": 50}'
```

This scenario combines PD Disaggregation with Global Cache Pooling, enabling:
- Prefill instances to read Decode instances' output cache
- Optimal multi-turn conversation performance
Architecture:
```
┌──────────────────────────────────────────┐
│                  Router                  │
│             (Load Balancer)              │
└─────────────────┬────────────────────────┘
                  │
  ┌───────────────┴───────────────┐
  │                               │
  ▼                               ▼
┌─────────────┐                 ┌─────────────┐
│   Prefill   │                 │   Decode    │
│  Instance   │◄───────────────►│  Instance   │
│             │   KV Transfer   │             │
└──────┬──────┘                 └──────┬──────┘
       │                               │
       └───────────────┬───────────────┘
                       │
              ┌────────▼────────┐
              │  MooncakeStore  │
              │ (Global Cache)  │
              └─────────────────┘
```
Step 1: Start Mooncake Master
```bash
mooncake_master \
    --port=15001 \
    --enable_http_metadata_server=true \
    --http_metadata_server_host=0.0.0.0 \
    --http_metadata_server_port=15002
```

Step 2: Start Router
```bash
python -m fastdeploy.router.launch \
    --port 52700 \
    --splitwise
```

Step 3: Start Prefill Instance
```bash
export MOONCAKE_MASTER_SERVER_ADDR="127.0.0.1:15001"
export MOONCAKE_METADATA_SERVER="http://127.0.0.1:15002/metadata"
export MOONCAKE_PROTOCOL="rdma"
export CUDA_VISIBLE_DEVICES=0

python -m fastdeploy.entrypoints.openai.api_server \
    --model ${MODEL_NAME} \
    --port 52400 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --splitwise-role prefill \
    --cache-transfer-protocol rdma \
    --router "0.0.0.0:52700" \
    --kvcache-storage-backend mooncake
```

Step 4: Start Decode Instance
```bash
export MOONCAKE_MASTER_SERVER_ADDR="127.0.0.1:15001"
export MOONCAKE_METADATA_SERVER="http://127.0.0.1:15002/metadata"
export MOONCAKE_PROTOCOL="rdma"
export CUDA_VISIBLE_DEVICES=1

python -m fastdeploy.entrypoints.openai.api_server \
    --model ${MODEL_NAME} \
    --port 52500 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --splitwise-role decode \
    --cache-transfer-protocol rdma \
    --router "0.0.0.0:52700" \
    --enable-output-caching \
    --kvcache-storage-backend mooncake
```

Step 5: Send Requests via Router
```bash
curl -X POST "http://0.0.0.0:52700/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}'
```

| Parameter | Description |
|---|---|
| `--kvcache-storage-backend mooncake` | Enable Mooncake as the KV Cache storage backend |
| `--enable-output-caching` | Enable output token caching (recommended for Decode instances) |
| `--cache-transfer-protocol rdma` | Use RDMA for KV transfer between P and D |
| `--splitwise-role prefill/decode` | Set the instance role in PD disaggregation |
| `--router` | Router address for PD disaggregation |
To verify cache hits in the logs:

```bash
# For multi-instance scenario
grep -E "storage_cache_token_num" log_*/api_server.log

# For PD disaggregation scenario
grep -E "storage_cache_token_num" log_prefill/api_server.log
```

If `storage_cache_token_num > 0`, the instance successfully read cached KV blocks from the global pool.
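To consume the counter programmatically rather than via `grep`, a small parser works. The log fragment shape assumed here (`storage_cache_token_num: N`) may differ between FastDeploy versions, so adjust the pattern to what your logs actually show:

```python
import re

# Assumed log fragment shape: "... storage_cache_token_num: 128 ..."
_PATTERN = re.compile(r"storage_cache_token_num\D*(\d+)")

def cached_tokens(line: str):
    """Return the storage_cache_token_num value in a log line, or None."""
    m = _PATTERN.search(line)
    return int(m.group(1)) if m else None

def hit_lines(lines):
    """Keep only lines that report a non-zero cache read."""
    return [l for l in lines if (cached_tokens(l) or 0) > 0]
```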
```bash
# Check master status
curl http://localhost:15002/metadata

# Check metrics (if metrics_port is configured)
curl http://localhost:15003/metrics
```

1. Port Already in Use
```bash
# Check port usage
ss -ltn | grep 15001

# Kill existing process
kill -9 $(lsof -t -i:15001)
```

2. RDMA Connection Failed
- Verify RDMA devices: `ibv_devices`
- Check RDMA network: `ibv_devinfo`
- Fall back to TCP: set `MOONCAKE_PROTOCOL=tcp`
3. Cache Not Being Shared
- Verify all instances connect to the same Mooncake master
- Check that the metadata server URL is consistent across instances
- Verify `global_segment_size` is large enough
4. Permission Denied on /dev/shm
```bash
# Clean up stale shared memory files
find /dev/shm -type f -print0 | xargs -0 rm -f
```

Enable debug logging:

```bash
export FD_DEBUG=1
```