Skip to content

Latest commit

 

History

History
368 lines (277 loc) · 12.1 KB

File metadata and controls

368 lines (277 loc) · 12.1 KB

中文文档 | English

Global Cache Pooling

This document describes how to use MooncakeStore as the KV Cache storage backend for FastDeploy, enabling Global Cache Pooling across multiple inference instances.

Overview

What is Global Cache Pooling?

Global Cache Pooling allows multiple FastDeploy instances to share KV Cache through a distributed storage layer. This enables:

  • Cross-instance cache reuse: KV Cache computed by one instance can be reused by another
  • PD Disaggregation optimization: Prefill and Decode instances can share cache seamlessly
  • Reduced computation: Avoid redundant prefix computation across requests

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     Mooncake Master Server                       │
│              (Metadata & Coordination Service)                   │
└────────────────────────────┬────────────────────────────────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
         ▼                   ▼                   ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│  FastDeploy     │ │  FastDeploy     │ │  FastDeploy     │
│  Instance P     │ │  Instance D     │ │  Instance X     │
│  (Prefill)      │ │  (Decode)       │ │  (Standalone)   │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
                             │
                    ┌────────▼────────┐
                    │  MooncakeStore  │
                    │  (Shared KV     │
                    │   Cache Pool)   │
                    └─────────────────┘

Example Scripts

Ready-to-use example scripts are available in examples/cache_storage/.

Script Scenario Description
run.sh Multi-Instance Two standalone instances sharing cache
run_03b_pd_storage.sh PD Disaggregation P+D instances with global cache pooling

Prerequisites

Hardware Requirements

  • NVIDIA GPU with CUDA support
  • RDMA network (recommended for production) or TCP

Software Requirements

  • Python 3.8+
  • CUDA 11.8+
  • FastDeploy (see installation below)

Installation

Refer to NVIDIA CUDA GPU Installation for FastDeploy installation.

# Option 1: Install from PyPI
pip install fastdeploy-gpu

# Option 2: Build from source
bash build.sh
pip install ./dist/fastdeploy*.whl

MooncakeStore is automatically installed when you install FastDeploy.

Configuration

We support two ways to configure MooncakeStore: via configuration file mooncake_config.json or via environment variables.

Mooncake Configuration File

Create a mooncake_config.json file:

{
    "metadata_server": "http://0.0.0.0:15002/metadata",
    "master_server_addr": "0.0.0.0:15001",
    "global_segment_size": 1000000000,
    "local_buffer_size": 134217728,
    "protocol": "rdma",
    "rdma_devices": ""
}

Set the MOONCAKE_CONFIG_PATH environment variable to enable the configuration:

export MOONCAKE_CONFIG_PATH=path/to/mooncake_config.json

Configuration parameters:

Parameter Description Default
metadata_server HTTP metadata server URL Required
master_server_addr Master server address Required
global_segment_size Memory space each TP process shares to global shared memory (bytes) 1GB
local_buffer_size Local buffer size for data transfer (bytes) 128MB
protocol Transfer protocol: rdma or tcp rdma
rdma_devices RDMA device names (comma-separated) Auto-detect

Environment Variables

Mooncake can also be configured via environment variables:

Variable Description
MOONCAKE_MASTER_SERVER_ADDR Master server address (e.g., 10.0.0.1:15001)
MOONCAKE_METADATA_SERVER Metadata server URL
MOONCAKE_GLOBAL_SEGMENT_SIZE Memory space each TP process shares to global shared memory (bytes)
MOONCAKE_LOCAL_BUFFER_SIZE Local buffer size (bytes)
MOONCAKE_PROTOCOL Transfer protocol (rdma or tcp)
MOONCAKE_RDMA_DEVICES RDMA device names

Usage Scenarios

Scenario 1: Multi-Instance Cache Sharing

Run multiple FastDeploy instances sharing a global KV Cache pool.

Step 1: Start Mooncake Master

mooncake_master \
    --port=15001 \
    --enable_http_metadata_server=true \
    --http_metadata_server_host=0.0.0.0 \
    --http_metadata_server_port=15002 \
    --metrics_port=15003

Step 2: Start FastDeploy Instances

Instance 0:

export MOONCAKE_CONFIG_PATH="./mooncake_config.json"
export CUDA_VISIBLE_DEVICES=0

python -m fastdeploy.entrypoints.openai.api_server \
       --model ${MODEL_NAME} \
       --port 52700 \
       --max-model-len 32768 \
       --max-num-seqs 32 \
       --kvcache-storage-backend mooncake

Instance 1:

export MOONCAKE_CONFIG_PATH="./mooncake_config.json"
export CUDA_VISIBLE_DEVICES=1

python -m fastdeploy.entrypoints.openai.api_server \
       --model ${MODEL_NAME} \
       --port 52800 \
       --max-model-len 32768 \
       --max-num-seqs 32 \
       --kvcache-storage-backend mooncake

Step 3: Test Cache Reuse

Send the same prompt to both instances. The second instance should reuse the KV Cache computed by the first instance.

# Request to Instance 0
curl -X POST "http://0.0.0.0:52700/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello, world!"}], "max_tokens": 50}'

# Request to Instance 1 (should hit cached KV)
curl -X POST "http://0.0.0.0:52800/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello, world!"}], "max_tokens": 50}'

Scenario 2: PD Disaggregation with Global Cache

This scenario combines PD Disaggregation with Global Cache Pooling, enabling:

  • Prefill instances to read Decode instances' output cache
  • Optimal multi-turn conversation performance

Architecture:

         ┌──────────────────────────────────────────┐
         │              Router                       │
         │         (Load Balancer)                   │
         └─────────────────┬────────────────────────┘
                           │
           ┌───────────────┴───────────────┐
           │                               │
           ▼                               ▼
    ┌─────────────┐                 ┌─────────────┐
    │   Prefill   │                 │   Decode    │
    │  Instance   │◄───────────────►│  Instance   │
    │             │   KV Transfer   │             │
    └──────┬──────┘                 └──────┬──────┘
           │                               │
           └───────────────┬───────────────┘
                           │
                  ┌────────▼────────┐
                  │  MooncakeStore  │
                  │  (Global Cache) │
                  └─────────────────┘

Step 1: Start Mooncake Master

mooncake_master \
    --port=15001 \
    --enable_http_metadata_server=true \
    --http_metadata_server_host=0.0.0.0 \
    --http_metadata_server_port=15002

Step 2: Start Router

python -m fastdeploy.router.launch \
    --port 52700 \
    --splitwise

Step 3: Start Prefill Instance

export MOONCAKE_MASTER_SERVER_ADDR="127.0.0.1:15001"
export MOONCAKE_METADATA_SERVER="http://127.0.0.1:15002/metadata"
export MOONCAKE_PROTOCOL="rdma"
export CUDA_VISIBLE_DEVICES=0

python -m fastdeploy.entrypoints.openai.api_server \
    --model ${MODEL_NAME} \
    --port 52400 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --splitwise-role prefill \
    --cache-transfer-protocol rdma \
    --router "0.0.0.0:52700" \
    --kvcache-storage-backend mooncake

Step 4: Start Decode Instance

export MOONCAKE_MASTER_SERVER_ADDR="127.0.0.1:15001"
export MOONCAKE_METADATA_SERVER="http://127.0.0.1:15002/metadata"
export MOONCAKE_PROTOCOL="rdma"
export CUDA_VISIBLE_DEVICES=1

python -m fastdeploy.entrypoints.openai.api_server \
    --model ${MODEL_NAME} \
    --port 52500 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --splitwise-role decode \
    --cache-transfer-protocol rdma \
    --router "0.0.0.0:52700" \
    --enable-output-caching \
    --kvcache-storage-backend mooncake

Step 5: Send Requests via Router

curl -X POST "http://0.0.0.0:52700/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}'

FastDeploy Parameters for Mooncake

Parameter Description
--kvcache-storage-backend mooncake Enable Mooncake as KV Cache storage backend
--enable-output-caching Enable output token caching (Decode instance recommended)
--cache-transfer-protocol rdma Use RDMA for KV transfer between P and D
--splitwise-role prefill/decode Set instance role in PD disaggregation
--router Router address for PD disaggregation

Verification

Check Cache Hit

To verify cache hit in logs:

# For multi-instance scenario
grep -E "storage_cache_token_num" log_*/api_server.log

# For PD disaggregation scenario
grep -E "storage_cache_token_num" log_prefill/api_server.log

If storage_cache_token_num > 0, the instance successfully read cached KV blocks from the global pool.

Monitor Mooncake Master

# Check master status
curl http://localhost:15002/metadata

# Check metrics (if metrics_port is configured)
curl http://localhost:15003/metrics

Troubleshooting

Common Issues

1. Port Already in Use

# Check port usage
ss -ltn | grep 15001

# Kill existing process
kill -9 $(lsof -t -i:15001)

2. RDMA Connection Failed

  • Verify RDMA devices: ibv_devices
  • Check RDMA network: ibv_devinfo
  • Fallback to TCP: Set MOONCAKE_PROTOCOL=tcp

3. Cache Not Being Shared

  • Verify all instances connect to the same Mooncake master
  • Check metadata server URL is consistent
  • Verify global_segment_size is large enough

4. Permission Denied on /dev/shm

# Clean up stale shared memory files
find /dev/shm -type f -print0 | xargs -0 rm -f

Debug Mode

Enable debug logging:

export FD_DEBUG=1

More Resources