Add audio modality support for CLAP evaluation #148

Open
JeniaJitsev wants to merge 26 commits into LAION-AI:main from Spatenfe:audio_benchmarks

Conversation

@JeniaJitsev

Summary

This PR integrates audio modality support into CLIP Benchmark, enabling standardized evaluation of CLAP (Contrastive Language-Audio Pretraining) models alongside existing image-language CLIP models. It supports both the original LAION-CLAP (v1) pretrained checkpoints and models trained with the most recent open_clip CLAP implementation (v2).

Key features

  • Audio modality in all evaluation tasks: zero-shot classification, zero-shot retrieval, and linear probe — fully integrated into the existing CLI and evaluation pipeline
  • Two CLAP model loaders:
    • clap — loads old LAION-CLAP v1 pretrained checkpoints (e.g. 630k-audioset-best) into the new open_clip architecture via state_dict key remapping (see the sketch after this list). No dependency on the old laion_clap package — only open_clip is needed
    • clap_v2 — loads checkpoints from recent open_clip CLAP training directly (no key remapping)
  • Fusion model support: both loaders auto-detect AFF-2D fusion checkpoints from state_dict keys and create the model with the correct architecture
  • Audio WebDataset support: audio samples (WAV/FLAC/MP3) loaded from WebDataset tars via librosa, with proper padding/truncation and mel spectrogram computation for fusion models
  • --modality CLI flag: explicit image/audio selection with auto-detection from the loaded model type
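
For reference, a minimal sketch of the state_dict remapping and fusion auto-detection used by the clap loader. The prefix pairs are the ones listed in this PR; the helper names and surrounding loading details are illustrative, not the actual clap.py code:

```python
import torch

# Prefix remapping from old LAION-CLAP v1 checkpoints to the open_clip CLAP
# module layout (pairs as described in this PR).
KEY_REMAP = {
    "audio_branch.": "audio.encoder.",
    "audio_projection.": "audio.proj.",
    "text_branch.": "text.transformer.",
    "text_projection.": "text.proj.",
    "logit_scale_a": "logit_scale",
}


def remap_clap_v1_state_dict(state_dict):
    """Rename v1 parameter keys so they match the open_clip CLAP architecture."""
    remapped = {}
    for key, value in state_dict.items():
        new_key = key
        for old_prefix, new_prefix in KEY_REMAP.items():
            if new_key.startswith(old_prefix):
                new_key = new_prefix + new_key[len(old_prefix):]
                break
        remapped[new_key] = value
    return remapped


def is_fusion_checkpoint(state_dict):
    """AFF-2D fusion checkpoints carry fusion-specific parameter names."""
    return any("fusion_model" in k or "mel_conv2d" in k for k in state_dict)


# Illustrative usage:
# ckpt = torch.load("630k-audioset-best.pt", map_location="cpu")
# sd = ckpt.get("state_dict", ckpt)
# sd = {k.replace("module.", "", 1): v for k, v in sd.items()}  # strip DDP prefix
# enable_fusion = is_fusion_checkpoint(sd)
# model.load_state_dict(remap_clap_v1_state_dict(sd), strict=False)
```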

Evaluation results

Zero-shot classification

Results using CLAP (HTSAT-tiny) with LAION-CLAP v1 pretrained checkpoints:

| Dataset | Fusion | Acc@1 | Acc@5 | mAP |
|---|---|---|---|---|
| ESC-50 | | 92.50% | 99.50% | - |
| ESC-50 | | 92.75% | 98.75% | - |
| ESC-50 (no overlap) | | 91.03% | 99.35% | - |
| ESC-50 (no overlap) | | 89.65% | 98.79% | - |
| UrbanSound8K | | 80.65% | 96.54% | - |
| UrbanSound8K | | 76.94% | 97.13% | - |
| UrbanSound8K (no overlap) | | 75.60% | 97.02% | - |
| UrbanSound8K (no overlap) | | 75.95% | 98.40% | - |
| GTZAN | | 53.65% | 74.97% | - |
| GTZAN | | 34.53% | 67.87% | - |
| FSD50K | | - | - | 55.97% |
| FSD50K | | - | - | 56.90% |

Zero-shot retrieval

| Dataset | Fusion | Text R@1 | Text R@5 | Audio R@1 | Audio R@5 |
|---|---|---|---|---|---|
| AudioCaps | | 41.7% | 77.5% | 30.8% | 64.9% |
| AudioCaps | | 42.0% | 74.8% | 31.1% | 67.1% |
| Clotho | | 17.7% | 43.0% | 14.9% | 37.9% |
| Clotho | | 18.6% | 42.0% | 14.0% | 34.8% |

Linear probe

| Dataset | Fusion | Acc@1 | mAP | Gain |
|---|---|---|---|---|
| ESC-50 | | 97.00% | - | +4.50% |
| ESC-50 | | 95.75% | - | +3.00% |
| UrbanSound8K | | 88.89% | - | +8.24% |
| UrbanSound8K | | 88.29% | - | +11.35% |
| FSD50K | | - | 67.52% | +11.55% |
| FSD50K | | - | 68.06% | +11.16% |

Fusion models use 630k-audioset-fusion-best.pt; standard models use 630k-audioset-best.pt, from lukewys/laion_clap.

Changes by file

| File | Description |
|---|---|
| clip_benchmark/models/clap.py | CLAP v1 loader — remaps old laion_clap state_dict keys to open_clip format, auto-detects fusion, fixes text projection shape mismatch. Downloads checkpoints from HuggingFace. |
| clip_benchmark/models/clap_v2.py | CLAP v2 loader — direct checkpoint loading for models trained with recent open_clip. Fusion auto-detection. |
| clip_benchmark/models/__init__.py | Registers clap and clap_v2 model types with graceful ImportError fallback |
| clip_benchmark/cli.py | Adds --modality flag (image/audio/auto), passes audio_loader and modality through the evaluation pipeline |
| clip_benchmark/datasets/builder.py | Audio WebDataset support: audio file decoding, audio_loader integration, mixed .txt/.json caption format handling for retrieval |
| clip_benchmark/metrics/zeroshot_classification.py | Audio modality support in classification, .float() cast before normalization for numerical stability |
| clip_benchmark/metrics/zeroshot_retrieval.py | Audio modality support in retrieval, .float() cast before normalization |
| clip_benchmark/metrics/linear_probe.py | Audio modality support in linear probe |
| clip_benchmark/metrics/utils.py | Shared utility functions for audio data handling |
| AUDIO_README.md | Documentation for audio benchmarks with usage examples and results |
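
The .float() casts noted above for the two metrics modules follow the numerical-stability fix described in the commit message further down; a minimal sketch of the pattern, with illustrative names:

```python
import torch
import torch.nn.functional as F


def cosine_logits(audio_features: torch.Tensor, text_features: torch.Tensor) -> torch.Tensor:
    # Features produced under autocast may arrive in float16; cosine similarities
    # between near-collinear embeddings lose discriminability at that precision,
    # so cast to float32 before normalizing and scoring.
    audio_features = F.normalize(audio_features.float(), dim=-1)
    text_features = F.normalize(text_features.float(), dim=-1)
    return audio_features @ text_features.t()
```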

Usage examples

# Zero-shot classification with old LAION-CLAP checkpoint (auto-downloaded)
clip_benchmark eval \
    --model_type clap \
    --model CLAP-HTSAT-tiny-Roberta-base \
    --pretrained 630k-audioset-best \
    --dataset wds/UrbanSounds8k_no_overlap \
    --dataset_root /path/to/UrbanSounds8k_no_overlap \
    --task zeroshot_classification \
    --modality audio \
    --no_amp

# Zero-shot retrieval with own training checkpoint
clip_benchmark eval \
    --model_type clap_v2 \
    --model HTSAT-tiny-Roberta-base \
    --pretrained /path/to/epoch_45.pt \
    --dataset wds/audiocaps \
    --dataset_root /path/to/audiocaps \
    --task zeroshot_retrieval \
    --modality audio \
    --no_amp

Dependencies

  • open_clip (for model architecture and tokenizer)
  • librosa (audio decoding)
  • torchaudio (mel spectrogram computation for fusion models)
  • huggingface_hub (optional, for downloading old LAION-CLAP checkpoints)

No dependency on the old laion_clap package.
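
For illustration, a minimal sketch of the kind of decode step the audio WebDataset path performs with librosa. The function name, sample rate, and clip length are assumed defaults, not necessarily the PR's exact values:

```python
import io

import librosa
import numpy as np


def decode_audio(raw_bytes: bytes, sample_rate: int = 48000, duration_s: int = 10) -> np.ndarray:
    """Decode WAV/FLAC/MP3 bytes from a WebDataset sample into a fixed-length
    mono waveform, truncating long clips and zero-padding short ones."""
    waveform, _ = librosa.load(io.BytesIO(raw_bytes), sr=sample_rate, mono=True)
    target_len = sample_rate * duration_s
    if len(waveform) > target_len:
        waveform = waveform[:target_len]
    else:
        waveform = np.pad(waveform, (0, target_len - len(waveform)))
    return waveform
```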

Test plan

  • Zero-shot classification on UrbanSound8K with non-fusion and fusion LAION-CLAP checkpoints
  • Zero-shot retrieval on AudioCaps and Clotho with non-fusion and fusion checkpoints
  • Linear probe on ESC-50, UrbanSound8K, FSD50K
  • Results consistent with published LAION-CLAP numbers
  • Fusion auto-detection from checkpoint state_dict keys
  • clap_v2 loader verified with own training checkpoints (strict load)
  • Image modality regression test (existing CLIP functionality unaffected)

With assistance by Claude Code Opus 4.6

Spatenfe and others added 25 commits December 28, 2025 21:02
Remove the hard dependency on the old laion_clap package in clap.py
and rewrite it to load old LAION-CLAP pretrained checkpoints directly
into the new open_clip CLAP architecture via state_dict key remapping.
This makes the eval pipeline fully self-contained with only open_clip
as the model backend.

clap.py changes (old LAION-CLAP checkpoint loader):
- Remove `import laion_clap` and `from transformers import RobertaTokenizer`
- Add state_dict key remapping: audio_branch->audio.encoder,
  audio_projection->audio.proj, text_branch->text.transformer,
  text_projection->text.proj, logit_scale_a->logit_scale
- Auto-detect fusion from checkpoint keys (fusion_model, mel_conv2d)
- Fix text projection shape mismatch (old: 768->512 Linear+ReLU+Linear
  with bias; new: 768->640 Linear+GELU+Linear without bias)
- Add FusionAudioLoader for fusion checkpoints (4-channel mel_fusion:
  global resized + 3 deterministic local chunks); see the sketch after this list
- Use open_clip.get_tokenizer() instead of RobertaTokenizer directly
- Support both HuggingFace checkpoint names (630k-best,
  630k-fusion-best, etc.) and local file paths
- Tested on both non-fusion (630k-audioset-best) and fusion
  (630k-audioset-fusion-best) old checkpoints
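
A minimal sketch of the 4-channel mel_fusion construction described above. The spectrogram settings (n_fft, hop length, mel bins, chunk length) are illustrative defaults; the actual FusionAudioLoader may use different values:

```python
import torch
import torch.nn.functional as F
import torchaudio


def mel_fusion_4ch(waveform: torch.Tensor, sample_rate: int = 48000,
                   n_mels: int = 64, chunk_frames: int = 1024) -> torch.Tensor:
    """Build a (4, chunk_frames, n_mels) mel_fusion tensor: one globally
    resized view plus three deterministic local chunks (front/middle/back)."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=480, n_mels=n_mels
    )(waveform)                      # (n_mels, T)
    mel = mel.t()                    # (T, n_mels), time-major frames
    total = mel.shape[0]
    if total <= chunk_frames:
        # Short clip: zero-pad and repeat the same view in all four channels.
        mel = torch.cat([mel, torch.zeros(chunk_frames - total, n_mels)], dim=0)
        return mel.unsqueeze(0).repeat(4, 1, 1)
    # Global view: resize the full spectrogram down to chunk_frames.
    global_view = F.interpolate(
        mel.t().unsqueeze(0).unsqueeze(0), size=(n_mels, chunk_frames),
        mode="bilinear", align_corners=False,
    ).squeeze(0).squeeze(0).t()
    # Three deterministic local chunks: front, middle, back.
    starts = [0, (total - chunk_frames) // 2, total - chunk_frames]
    chunks = [mel[s:s + chunk_frames] for s in starts]
    return torch.stack([global_view] + chunks)
```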

clap_v2.py (new open_clip CLAP checkpoint loader):
- New file for loading our own open_clip CLAP training checkpoints
- Direct state_dict loading (no key remapping needed)
- Auto-detects fusion from checkpoint keys and creates model with
  enable_fusion=True when detected
- Supports both non-fusion (waveform input) and fusion (mel_fusion)
- Verified with real training checkpoints (strict=True load)
- Verified with synthetic fusion round-trip test

__init__.py:
- Register clap and clap_v2 model types with try/except ImportError
  guards (graceful fallback if open_clip not installed)

Eval pipeline fixes (classification + retrieval):
- Disable torch.autocast — float16 precision on GH200 destroys cosine
  similarity discriminability for CLAP models (acc drops to random
  chance). The --no_amp CLI flag was already added but the metrics code
  still used autocast internally.
- Cast features to float32 before F.normalize and cosine similarity
- Handle non-tensor targets in classification (torch.tensor conversion)

builder.py:
- Support both .txt (newline-separated) and .json ({"text": [...]})
  caption formats in retrieval WebDatasets (audiocaps uses .json)
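
A minimal sketch of this caption handling; the key names and helper are illustrative, not the actual builder.py code:

```python
import json


def decode_captions(sample: dict) -> list:
    """Return the caption list for a retrieval sample, whichever caption
    format the WebDataset shard uses."""
    if "json" in sample:                              # audiocaps-style: {"text": [...]}
        return list(json.loads(sample["json"])["text"])
    if "txt" in sample:                               # newline-separated captions
        return sample["txt"].decode("utf-8").splitlines()
    raise KeyError("sample has neither .json nor .txt captions")
```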

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@JeniaJitsev

@mehdidc please have a look

Use getattr() with safe defaults for audio-specific CLI args
(modality, dump_classnames, dump_templates) so that callers
without these attributes — including the existing unit tests —
continue to work unchanged.

Fixes test_clip_benchmark.py::test_base AttributeError.
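
A minimal sketch of the pattern; the default values shown are assumptions, not necessarily the actual defaults used in the PR:

```python
def resolve_audio_args(args):
    # Audio-specific args may be absent when the evaluation is driven
    # programmatically (e.g. from the existing unit tests), so read them
    # defensively instead of accessing attributes directly.
    modality = getattr(args, "modality", "auto")
    dump_classnames = getattr(args, "dump_classnames", False)
    dump_templates = getattr(args, "dump_templates", False)
    return modality, dump_classnames, dump_templates
```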

With assistance by Claude Code Opus 4.6
JeniaJitsev requested a review from mehdidc on February 20, 2026 at 18:22