[BUG] "backend:cudaMallocAsync" cause OOM #381

@yodiaditya

Description

OS

Linux

GPU Library

CUDA 12.x

Python version

3.12

Describe the bug

torch.OutOfMemoryError: Allocation on device is raised on 2x RTX 4090, caused by this line in main.py:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:cudaMallocAsync"

Related to an open PyTorch ticket: pytorch/pytorch#124351
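
As a possible workaround (not the project's fix), the allocator backend can be forced back to PyTorch's native caching allocator before torch is imported. A minimal sketch, assuming the override only matters for local testing and that the line in main.py is changed accordingly:

import os

# Hypothetical workaround: prefer the native caching allocator unless the user
# has already chosen a backend. The variable must be set before the first CUDA
# allocation, i.e. before importing torch.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "backend:native")

import torch  # imported after the env var so the setting takes effect

print(torch.cuda.get_allocator_backend())  # expected to print "native"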

Trace

Traceback (most recent call last):                                                                        
  File "~/tabbyAPI/main.py", line 181, in <module>                                     
    entrypoint()                                                                                          
  File "~/tabbyAPI/main.py", line 177, in entrypoint                                   
    asyncio.run(entrypoint_async())                                                                       
  File "~/envs/tabby/lib/python3.11/asyncio/runners.py", line 190, in run              
    return runner.run(main)                                                                               
           ^^^^^^^^^^^^^^^^                                                                               
  File "~/envs/tabby/lib/python3.11/asyncio/runners.py", line 118, in run              
    return self._loop.run_until_complete(task)                                                            
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                            
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete                               
  File "~/tabbyAPI/main.py", line 61, in entrypoint_async                              
    await model.load_model(                                                                               
  File "~/tabbyAPI/common/model.py", line 226, in load_model                           
    async for _ in load_model_gen(model_path, **kwargs):                                                  
  File "~/tabbyAPI/common/model.py", line 202, in load_model_gen                       
    async for module, modules in load_status:                                                             
  File "~/tabbyAPI/backends/exllamav2/model.py", line 491, in load_gen                 
    async for value in iterate_in_threadpool(model_load_generator):                                       
  File "~/tabbyAPI/common/concurrency.py", line 30, in iterate_in_threadpool           
    yield await asyncio.to_thread(gen_next, generator)                                                    
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                    
  File "~/envs/tabby/lib/python3.11/asyncio/threads.py", line 25, in to_thread         
    return await loop.run_in_executor(None, func_call)                                                    
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                    
  File "~/envs/tabby/lib/python3.11/concurrent/futures/thread.py", line 58, in run     
    result = self.fn(*self.args, **self.kwargs)                                                           
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                           
  File "~/tabbyAPI/common/concurrency.py", line 20, in gen_next                        
    return next(generator)                                                                                
           ^^^^^^^^^^^^^^^                                                                                
  File "~/envs/tabby/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 59,
 in generator_context                                                                                     
    response = gen.send(request)                                                                          
               ^^^^^^^^^^^^^^^^^                                                                          
  File "~/tabbyAPI/backends/exllamav2/model.py", line 608, in load_model_sync     
    for value in self.model.load_tp_gen(       
  File "~/envs/tabby/lib/python3.11/site-packages/exllamav2/model.py", line 460, in loa
d_tp_gen                                                                                                  
    module.tp_split(BROADCAST_VC)
  File "~/envs/tabby/lib/python3.11/site-packages/exllamav2/linear.py", line 576, in tp
_split                                                                                                    
    "q_weight": safe_move_tensor(self.q_tensors["q_weight"][:, a:b], idx).contiguous()

Reproduction steps

The sample config below breaks on commit 113643c0df73a52685c7fc54768307c6e06a051b, but works on the previous version.

# Sample YAML file for configuration.
# Comment and uncomment values as needed.
# Every value has a default within the application.
# This file serves as a drop-in for config.yml

# Unless specified in the comments, DO NOT put these options in quotes!
# You can use https://www.yamllint.com/ if you want to check your YAML formatting.

# Options for networking
network:
  # The IP to host on (default: 127.0.0.1).
  # Use 0.0.0.0 to expose on all network adapters.
  host: 127.0.0.1

  # The port to host on (default: 5000).
  port: 8000

  # Disable HTTP token authentication with requests.
  # WARNING: This will make your instance vulnerable!
  # Turn on this option if you are ONLY connecting from localhost.
  disable_auth: false

  # Disable fetching external content in response to requests, such as images from URLs.
  disable_fetch_requests: false

  # Send tracebacks over the API (default: False).
  # NOTE: Only enable this for debug purposes.
  send_tracebacks: false

  # Select API servers to enable (default: ["OAI"]).
  # Possible values: OAI, Kobold.
  api_servers: ["OAI"]

# Options for logging
logging:
  # Enable prompt logging (default: False).
  log_prompt: false

  # Enable generation parameter logging (default: False).
  log_generation_params: false

  # Enable request logging (default: False).
  # NOTE: Only use this for debugging!
  log_requests: false

# Options for model overrides and loading
# Please read the comments to understand how arguments are handled
# between initial and API loads
model:
  # Directory to look for models (default: models).
  # Windows users, do NOT put this path in quotes!
  model_dir: /model

  # Allow direct loading of models from a completion or chat completion request (default: False).
  # This method of loading is strict by default.
  # Enable dummy models to add exceptions for invalid model names.
  inline_model_loading: false

  # Sends dummy model names when the models endpoint is queried. (default: False)
  # Enable this if the client is looking for specific OAI models.
  use_dummy_models: false

  # A list of fake model names that are sent via the /v1/models endpoint. (default: ["gpt-3.5-turbo"])
  # Also used as bypasses for strict mode if inline_model_loading is true.
  dummy_model_names: ["gpt-3.5-turbo"]

  # An initial model to load.
  # Make sure the model is located in the model directory!
  # REQUIRED: This must be filled out to load a model on startup.
  model_name: model_llama3370Binstruct_bullerwins

  # Names of args to use as a fallback for API load requests (default: []).
  # For example, if you always want cache_mode to be Q4 instead of on the initial model load, add "cache_mode" to this array.
  # Example: ['max_seq_len', 'cache_mode'].
  use_as_default: ['max_seq_len', 'cache_mode', 'chunk_size']

  # Backend to use for this model (auto-detect if not specified)
  # Options: exllamav2, exllamav3
  backend:

  # Max sequence length (default: 4096).
  # Set to -1 to fetch from the model's config.json
  max_seq_len: 16384

  # Load model with tensor parallelism.
  # Falls back to autosplit if GPU split isn't provided.
  # This ignores the gpu_split_auto value.
  tensor_parallel: true

  # Sets a backend type for tensor parallelism. (default: native).
  # Options: native, nccl
  # Native is recommended for PCIe GPUs
  # NCCL is recommended for NVLink.
  tensor_parallel_backend: nccl

  # Automatically allocate resources to GPUs (default: True).
  # Not parsed for single GPU users.
  gpu_split_auto: false

  # Reserve VRAM used for autosplit loading (default: 96 MB on GPU 0).
  # Represented as an array of MB per GPU.
  autosplit_reserve: [0]

  # An integer array of GBs of VRAM to split between GPUs (default: []).
  # Used with tensor parallelism.
  gpu_split: [25, 25]

  # NOTE: If a model has YaRN rope scaling, it will automatically be enabled by ExLlama.
  # rope_scale and rope_alpha settings won't apply in this case.

  # Rope scale (default: 1.0).
  # Same as compress_pos_emb.
  # Use if the model was trained on long context with rope.
  # Leave blank to pull the value from the model.
  rope_scale: 1.0

  # Rope alpha (default: None).
  # Same as alpha_value. Set to "auto" to auto-calculate.
  # Leaving this value blank will either pull from the model or auto-calculate.
  rope_alpha:

  # Enable different cache modes for VRAM savings (default: FP16).
  # Possible values for exllamav2: 'FP16', 'Q8', 'Q6', 'Q4'.
  # For exllamav3, specify the pair k_bits,v_bits where k_bits and v_bits are integers from 2-8 (i.e. 8,8).
  cache_mode: Q8

  # Size of the prompt cache to allocate (default: max_seq_len).
  # Must be a multiple of 256 and can't be less than max_seq_len.
  # For CFG, set this to 2 * max_seq_len.
  cache_size:

  # Chunk size for prompt ingestion (default: 2048).
  # A lower value reduces VRAM usage but decreases ingestion speed.
  # NOTE: Effects vary depending on the model.
  # An ideal value is between 512 and 4096.
  chunk_size: 4096

  # Set the maximum number of prompts to process at one time (default: None/Automatic).
  # Automatically calculated if left blank.
  # NOTE: Only available for Nvidia ampere (30 series) and above GPUs.
  max_batch_size:

  # Set the prompt template for this model. (default: None)
  # If empty, attempts to look for the model's chat template.
  # If a model contains multiple templates in its tokenizer_config.json,
  # set prompt_template to the name of the template you want to use.
  # NOTE: Only works with chat completion message lists!
  prompt_template:

  # Enables vision support if the model supports it. (default: False)
  vision: false

# Options for draft models (speculative decoding)
# This will use more VRAM!
draft_model:
  # Directory to look for draft models (default: models)
  draft_model_dir: models

  # An initial draft model to load.
  # Ensure the model is in the model directory.
  draft_model_name:

  # Rope scale for draft models (default: 1.0).
  # Same as compress_pos_emb.
  # Use if the draft model was trained on long context with rope.
  draft_rope_scale: 1.0

  # Rope alpha for draft models (default: None).
  # Same as alpha_value. Set to "auto" to auto-calculate.
  # Leaving this value blank will either pull from the model or auto-calculate.
  draft_rope_alpha:

  # Cache mode for draft models to save VRAM (default: FP16).
  # Possible values for exllamav2: 'FP16', 'Q8', 'Q6', 'Q4'.
  # For exllamav3, specify the pair k_bits,v_bits where k_bits and v_bits are integers from 2-8 (i.e. 8,8).
  draft_cache_mode: FP16

  # An integer array of GBs of VRAM to split between GPUs (default: []).
  # If this isn't filled in, the draft model is autosplit.
  draft_gpu_split: []

# Options for Sampling
sampling:
  # Select a sampler override preset (default: None).
  # Find this in the sampler-overrides folder.
  # This overrides default fallbacks for sampler values that are passed to the API.
  # NOTE: safe_defaults is noob friendly and provides fallbacks for frontends that don't send sampling parameters.
  # Remove this for any advanced usage.
  override_preset: safe_defaults

# Options for Loras
lora:
  # Directory to look for LoRAs (default: loras).
  lora_dir: loras

  # List of LoRAs to load and associated scaling factors (default scale: 1.0).
  # For the YAML file, add each entry as a YAML list:
  # - name: lora1
  #   scaling: 1.0
  loras:

# Options for embedding models and loading.
# NOTE: Embeddings requires the "extras" feature to be installed
# Install it via "pip install .[extras]"
embeddings:
  # Directory to look for embedding models (default: models).
  embedding_model_dir: models

  # Device to load embedding models on (default: cpu).
  # Possible values: cpu, auto, cuda.
  # NOTE: It's recommended to load embedding models on the CPU.
  # If using an AMD GPU, set this value to 'cuda'.
  embeddings_device: cpu

  # An initial embedding model to load on the infinity backend.
  embedding_model_name:

# Options for development and experimentation
developer:
  # Skip Exllamav2 version check (default: False).
  # WARNING: It's highly recommended to update your dependencies rather than enabling this flag.
  unsafe_launch: false

  # Disable API request streaming (default: False).
  disable_request_streaming: false

  # Set process to use a higher priority.
  # For realtime process priority, run as administrator or sudo.
  # Otherwise, the priority will be set to high.
  realtime_process_priority: false
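
Not part of the report, but a quick check that can confirm which allocator backend is actually in effect and how much free VRAM each GPU has before the tensor-parallel load. torch.cuda.get_allocator_backend() and torch.cuda.mem_get_info() are standard PyTorch calls; the rest is just an illustrative snippet:

import os
import torch

# Print the allocator backend PyTorch picked up and the free/total VRAM per
# GPU, to confirm whether "backend:cudaMallocAsync" is actually in effect.
print("PYTORCH_CUDA_ALLOC_CONF =", os.environ.get("PYTORCH_CUDA_ALLOC_CONF"))
print("allocator backend       =", torch.cuda.get_allocator_backend())

for idx in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(idx)
    print(f"cuda:{idx}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")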

Expected behavior

It should work on the latest commit using the same config.

Logs

No response

Additional context

No response

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I have read the disclaimer, and this issue is related to a code bug. If I have a question, I will use the Discord server.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.
