Add audio modality support for CLAP evaluation #148

Open
JeniaJitsev wants to merge 26 commits into LAION-AI:main from Spatenfe:audio_benchmarks

Conversation

@JeniaJitsev

Summary

This PR integrates audio modality support into CLIP Benchmark, enabling standardized evaluation of CLAP (Contrastive Language-Audio Pretraining) models alongside existing image-language CLIP models. It supports both the original LAION-CLAP (v1) pretrained checkpoints and models trained with the most recent open_clip CLAP implementation (v2).

Key features

  • Audio modality in all evaluation tasks: zero-shot classification, zero-shot retrieval, and linear probe — fully integrated into the existing CLI and evaluation pipeline
  • Two CLAP model loaders:
    • clap — loads old LAION-CLAP v1 pretrained checkpoints (e.g. 630k-audioset-best) into the new open_clip architecture via state_dict key remapping (see the sketch after this list). No dependency on the old laion_clap package — only open_clip is needed
    • clap_v2 — loads checkpoints from recent open_clip CLAP training directly (no key remapping)
  • Fusion model support: both loaders auto-detect AFF-2D fusion checkpoints from state_dict keys and create the model with the correct architecture
  • Audio WebDataset support: audio samples (WAV/FLAC/MP3) loaded from WebDataset tars via librosa, with proper padding/truncation and mel spectrogram computation for fusion models
  • --modality CLI flag: explicit image/audio selection with auto-detection from the loaded model type
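
For reference, a minimal sketch of the state_dict remapping and fusion auto-detection used by the clap loader. The prefix pairs are the ones listed in this PR; the helper names and surrounding loading details are illustrative, not the actual clap.py code:

```python
import torch

# Prefix remapping from old LAION-CLAP v1 checkpoints to the open_clip CLAP
# module layout (pairs as described in this PR).
KEY_REMAP = {
    "audio_branch.": "audio.encoder.",
    "audio_projection.": "audio.proj.",
    "text_branch.": "text.transformer.",
    "text_projection.": "text.proj.",
    "logit_scale_a": "logit_scale",
}


def remap_clap_v1_state_dict(state_dict):
    """Rename v1 parameter keys so they match the open_clip CLAP architecture."""
    remapped = {}
    for key, value in state_dict.items():
        new_key = key
        for old_prefix, new_prefix in KEY_REMAP.items():
            if new_key.startswith(old_prefix):
                new_key = new_prefix + new_key[len(old_prefix):]
                break
        remapped[new_key] = value
    return remapped


def is_fusion_checkpoint(state_dict):
    """AFF-2D fusion checkpoints carry fusion-specific parameter names."""
    return any("fusion_model" in k or "mel_conv2d" in k for k in state_dict)


# Illustrative usage:
# ckpt = torch.load("630k-audioset-best.pt", map_location="cpu")
# sd = ckpt.get("state_dict", ckpt)
# sd = {k.replace("module.", "", 1): v for k, v in sd.items()}  # strip DDP prefix
# enable_fusion = is_fusion_checkpoint(sd)
# model.load_state_dict(remap_clap_v1_state_dict(sd), strict=False)
```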

Evaluation results

Zero-shot classification

Results using CLAP (HTSAT-tiny) with LAION-CLAP v1 pretrained checkpoints:

| Dataset | Fusion | Acc@1 | Acc@5 | mAP |
|---|---|---|---|---|
| ESC-50 | | 92.50% | 99.50% | - |
| ESC-50 | | 92.75% | 98.75% | - |
| ESC-50 (no overlap) | | 91.03% | 99.35% | - |
| ESC-50 (no overlap) | | 89.65% | 98.79% | - |
| UrbanSound8K | | 80.65% | 96.54% | - |
| UrbanSound8K | | 76.94% | 97.13% | - |
| UrbanSound8K (no overlap) | | 75.60% | 97.02% | - |
| UrbanSound8K (no overlap) | | 75.95% | 98.40% | - |
| GTZAN | | 53.65% | 74.97% | - |
| GTZAN | | 34.53% | 67.87% | - |
| FSD50K | | - | - | 55.97% |
| FSD50K | | - | - | 56.90% |

Zero-shot retrieval

| Dataset | Fusion | Text R@1 | Text R@5 | Audio R@1 | Audio R@5 |
|---|---|---|---|---|---|
| AudioCaps | | 41.7% | 77.5% | 30.8% | 64.9% |
| AudioCaps | | 42.0% | 74.8% | 31.1% | 67.1% |
| Clotho | | 17.7% | 43.0% | 14.9% | 37.9% |
| Clotho | | 18.6% | 42.0% | 14.0% | 34.8% |

Linear probe

| Dataset | Fusion | Acc@1 | mAP | Gain |
|---|---|---|---|---|
| ESC-50 | | 97.00% | - | +4.50% |
| ESC-50 | | 95.75% | - | +3.00% |
| UrbanSound8K | | 88.89% | - | +8.24% |
| UrbanSound8K | | 88.29% | - | +11.35% |
| FSD50K | | - | 67.52% | +11.55% |
| FSD50K | | - | 68.06% | +11.16% |

Fusion models use 630k-audioset-fusion-best.pt; standard models use 630k-audioset-best.pt, from lukewys/laion_clap.

Changes by file

| File | Description |
|---|---|
| clip_benchmark/models/clap.py | CLAP v1 loader — remaps old laion_clap state_dict keys to open_clip format, auto-detects fusion, fixes text projection shape mismatch. Downloads checkpoints from HuggingFace. |
| clip_benchmark/models/clap_v2.py | CLAP v2 loader — direct checkpoint loading for models trained with recent open_clip. Fusion auto-detection. |
| clip_benchmark/models/__init__.py | Registers clap and clap_v2 model types with graceful ImportError fallback |
| clip_benchmark/cli.py | Adds --modality flag (image/audio/auto), passes audio_loader and modality through the evaluation pipeline |
| clip_benchmark/datasets/builder.py | Audio WebDataset support: audio file decoding, audio_loader integration, mixed .txt/.json caption format handling for retrieval |
| clip_benchmark/metrics/zeroshot_classification.py | Audio modality support in classification, .float() cast before normalization for numerical stability |
| clip_benchmark/metrics/zeroshot_retrieval.py | Audio modality support in retrieval, .float() cast before normalization |
| clip_benchmark/metrics/linear_probe.py | Audio modality support in linear probe |
| clip_benchmark/metrics/utils.py | Shared utility functions for audio data handling |
| AUDIO_README.md | Documentation for audio benchmarks with usage examples and results |
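
The .float() casts noted above for the two metrics modules follow the numerical-stability fix described in the commit message further down; a minimal sketch of the pattern, with illustrative names:

```python
import torch
import torch.nn.functional as F


def cosine_logits(audio_features: torch.Tensor, text_features: torch.Tensor) -> torch.Tensor:
    # Features produced under autocast may arrive in float16; cosine similarities
    # between near-collinear embeddings lose discriminability at that precision,
    # so cast to float32 before normalizing and scoring.
    audio_features = F.normalize(audio_features.float(), dim=-1)
    text_features = F.normalize(text_features.float(), dim=-1)
    return audio_features @ text_features.t()
```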

Usage examples

# Zero-shot classification with old LAION-CLAP checkpoint (auto-downloaded)
clip_benchmark eval \
    --model_type clap \
    --model CLAP-HTSAT-tiny-Roberta-base \
    --pretrained 630k-audioset-best \
    --dataset wds/UrbanSounds8k_no_overlap \
    --dataset_root /path/to/UrbanSounds8k_no_overlap \
    --task zeroshot_classification \
    --modality audio \
    --no_amp

# Zero-shot retrieval with own training checkpoint
clip_benchmark eval \
    --model_type clap_v2 \
    --model HTSAT-tiny-Roberta-base \
    --pretrained /path/to/epoch_45.pt \
    --dataset wds/audiocaps \
    --dataset_root /path/to/audiocaps \
    --task zeroshot_retrieval \
    --modality audio \
    --no_amp

Dependencies

  • open_clip (for model architecture and tokenizer)
  • librosa (audio decoding)
  • torchaudio (mel spectrogram computation for fusion models)
  • huggingface_hub (optional, for downloading old LAION-CLAP checkpoints)

No dependency on the old laion_clap package.
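
For illustration, a minimal sketch of the kind of decode step the audio WebDataset path performs with librosa. The function name, sample rate, and clip length are assumed defaults, not necessarily the PR's exact values:

```python
import io

import librosa
import numpy as np


def decode_audio(raw_bytes: bytes, sample_rate: int = 48000, duration_s: int = 10) -> np.ndarray:
    """Decode WAV/FLAC/MP3 bytes from a WebDataset sample into a fixed-length
    mono waveform, truncating long clips and zero-padding short ones."""
    waveform, _ = librosa.load(io.BytesIO(raw_bytes), sr=sample_rate, mono=True)
    target_len = sample_rate * duration_s
    if len(waveform) > target_len:
        waveform = waveform[:target_len]
    else:
        waveform = np.pad(waveform, (0, target_len - len(waveform)))
    return waveform
```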

Test plan

  • Zero-shot classification on UrbanSound8K with non-fusion and fusion LAION-CLAP checkpoints
  • Zero-shot retrieval on AudioCaps and Clotho with non-fusion and fusion checkpoints
  • Linear probe on ESC-50, UrbanSound8K, FSD50K
  • Results consistent with published LAION-CLAP numbers
  • Fusion auto-detection from checkpoint state_dict keys
  • clap_v2 loader verified with own training checkpoints (strict load)
  • Image modality regression test (existing CLIP functionality unaffected)

With assistance by Claude Code Opus 4.6

Spatenfe and others added 25 commits December 28, 2025 21:02
Remove the hard dependency on the old laion_clap package in clap.py
and rewrite it to load old LAION-CLAP pretrained checkpoints directly
into the new open_clip CLAP architecture via state_dict key remapping.
This makes the eval pipeline fully self-contained with only open_clip
as the model backend.

clap.py changes (old LAION-CLAP checkpoint loader):
- Remove `import laion_clap` and `from transformers import RobertaTokenizer`
- Add state_dict key remapping: audio_branch->audio.encoder,
  audio_projection->audio.proj, text_branch->text.transformer,
  text_projection->text.proj, logit_scale_a->logit_scale
- Auto-detect fusion from checkpoint keys (fusion_model, mel_conv2d)
- Fix text projection shape mismatch (old: 768->512 Linear+ReLU+Linear
  with bias; new: 768->640 Linear+GELU+Linear without bias)
- Add FusionAudioLoader for fusion checkpoints (4-channel mel_fusion:
  global resized + 3 deterministic local chunks); see the sketch after this list
- Use open_clip.get_tokenizer() instead of RobertaTokenizer directly
- Support both HuggingFace checkpoint names (630k-best,
  630k-fusion-best, etc.) and local file paths
- Tested on both non-fusion (630k-audioset-best) and fusion
  (630k-audioset-fusion-best) old checkpoints
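
A minimal sketch of the 4-channel mel_fusion construction described above. The spectrogram settings (n_fft, hop length, mel bins, chunk length) are illustrative defaults; the actual FusionAudioLoader may use different values:

```python
import torch
import torch.nn.functional as F
import torchaudio


def mel_fusion_4ch(waveform: torch.Tensor, sample_rate: int = 48000,
                   n_mels: int = 64, chunk_frames: int = 1024) -> torch.Tensor:
    """Build a (4, chunk_frames, n_mels) mel_fusion tensor: one globally
    resized view plus three deterministic local chunks (front/middle/back)."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=480, n_mels=n_mels
    )(waveform)                      # (n_mels, T)
    mel = mel.t()                    # (T, n_mels), time-major frames
    total = mel.shape[0]
    if total <= chunk_frames:
        # Short clip: zero-pad and repeat the same view in all four channels.
        mel = torch.cat([mel, torch.zeros(chunk_frames - total, n_mels)], dim=0)
        return mel.unsqueeze(0).repeat(4, 1, 1)
    # Global view: resize the full spectrogram down to chunk_frames.
    global_view = F.interpolate(
        mel.t().unsqueeze(0).unsqueeze(0), size=(n_mels, chunk_frames),
        mode="bilinear", align_corners=False,
    ).squeeze(0).squeeze(0).t()
    # Three deterministic local chunks: front, middle, back.
    starts = [0, (total - chunk_frames) // 2, total - chunk_frames]
    chunks = [mel[s:s + chunk_frames] for s in starts]
    return torch.stack([global_view] + chunks)
```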

clap_v2.py (new open_clip CLAP checkpoint loader):
- New file for loading our own open_clip CLAP training checkpoints
- Direct state_dict loading (no key remapping needed)
- Auto-detects fusion from checkpoint keys and creates model with
  enable_fusion=True when detected
- Supports both non-fusion (waveform input) and fusion (mel_fusion)
- Verified with real training checkpoints (strict=True load)
- Verified with synthetic fusion round-trip test

__init__.py:
- Register clap and clap_v2 model types with try/except ImportError
  guards (graceful fallback if open_clip not installed)

Eval pipeline fixes (classification + retrieval):
- Disable torch.autocast — float16 precision on GH200 destroys cosine
  similarity discriminability for CLAP models (acc drops to random
  chance). The --no_amp CLI flag was already added but the metrics code
  still used autocast internally.
- Cast features to float32 before F.normalize and cosine similarity
- Handle non-tensor targets in classification (torch.tensor conversion)

builder.py:
- Support both .txt (newline-separated) and .json ({"text": [...]})
  caption formats in retrieval WebDatasets (audiocaps uses .json)
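
A minimal sketch of this caption handling; the key names and helper are illustrative, not the actual builder.py code:

```python
import json


def decode_captions(sample: dict) -> list:
    """Return the caption list for a retrieval sample, whichever caption
    format the WebDataset shard uses."""
    if "json" in sample:                              # audiocaps-style: {"text": [...]}
        return list(json.loads(sample["json"])["text"])
    if "txt" in sample:                               # newline-separated captions
        return sample["txt"].decode("utf-8").splitlines()
    raise KeyError("sample has neither .json nor .txt captions")
```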

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@JeniaJitsev

@mehdidc please have a look

Use getattr() with safe defaults for audio-specific CLI args
(modality, dump_classnames, dump_templates) so that callers
without these attributes — including the existing unit tests —
continue to work unchanged.

Fixes test_clip_benchmark.py::test_base AttributeError.
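
A minimal sketch of the pattern; the default values shown are assumptions, not necessarily the actual defaults used in the PR:

```python
def resolve_audio_args(args):
    # Audio-specific args may be absent when the evaluation is driven
    # programmatically (e.g. from the existing unit tests), so read them
    # defensively instead of accessing attributes directly.
    modality = getattr(args, "modality", "auto")
    dump_classnames = getattr(args, "dump_classnames", False)
    dump_templates = getattr(args, "dump_templates", False)
    return modality, dump_classnames, dump_templates
```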

With assistance by Claude Code Opus 4.6
JeniaJitsev requested a review from mehdidc on February 20, 2026 at 18:22