Add audio modality support for CLAP evaluation #148
Open
JeniaJitsev wants to merge 26 commits into LAION-AI:main
Conversation
…enchmark into audio_benchmarks
Remove the hard dependency on the old laion_clap package in clap.py
and rewrite it to load old LAION-CLAP pretrained checkpoints directly
into the new open_clip CLAP architecture via state_dict key remapping.
This makes the eval pipeline fully self-contained with only open_clip
as the model backend.
clap.py changes (old LAION-CLAP checkpoint loader):
- Remove `import laion_clap` and `from transformers import RobertaTokenizer`
- Add state_dict key remapping (see the sketch after this list):
audio_branch->audio.encoder, audio_projection->audio.proj,
text_branch->text.transformer, text_projection->text.proj,
logit_scale_a->logit_scale
- Auto-detect fusion from checkpoint keys (fusion_model, mel_conv2d)
- Fix text projection shape mismatch (old: 768->512 Linear+ReLU+Linear
with bias; new: 768->640 Linear+GELU+Linear without bias)
- Add FusionAudioLoader for fusion checkpoints (4-channel mel_fusion:
global resized + 3 deterministic local chunks)
- Use open_clip.get_tokenizer() instead of RobertaTokenizer directly
- Support both HuggingFace checkpoint names (630k-best,
630k-fusion-best, etc.) and local file paths
- Tested on both non-fusion (630k-audioset-best) and fusion
(630k-audioset-fusion-best) old checkpoints
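A minimal sketch of the key remapping, assuming a plain prefix rewrite; the helper name and checkpoint filename are illustrative, and the prefixes are the ones listed above:

```python
import torch

# Old LAION-CLAP v1 key prefixes -> new open_clip CLAP layout
# (prefixes from the commit message; exact module paths may differ).
KEY_REMAP = {
    "audio_branch.": "audio.encoder.",
    "audio_projection.": "audio.proj.",
    "text_branch.": "text.transformer.",
    "text_projection.": "text.proj.",
    "logit_scale_a": "logit_scale",
}

def remap_state_dict(old_sd):
    new_sd = {}
    for key, value in old_sd.items():
        for old, new in KEY_REMAP.items():
            if key.startswith(old):
                key = new + key[len(old):]
                break
        new_sd[key] = value
    return new_sd

ckpt = torch.load("630k-audioset-best.pt", map_location="cpu")  # illustrative path
sd = remap_state_dict(ckpt.get("state_dict", ckpt))
```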
clap_v2.py (new open_clip CLAP checkpoint loader):
- New file for loading our own open_clip CLAP training checkpoints
- Direct state_dict loading (no key remapping needed)
- Auto-detects fusion from checkpoint keys and creates the model with
enable_fusion=True when detected (sketched after this list)
- Supports both non-fusion (waveform input) and fusion (mel_fusion)
- Verified with real training checkpoints (strict=True load)
- Verified with synthetic fusion round-trip test
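A sketch of the fusion auto-detection used by both loaders, assuming the key substrings named above; the function name is illustrative:

```python
def checkpoint_uses_fusion(state_dict):
    # Fusion checkpoints carry extra parameters (fusion_model.*, mel_conv2d.*)
    # that non-fusion checkpoints do not have.
    return any("fusion_model" in k or "mel_conv2d" in k for k in state_dict)

# enable_fusion = checkpoint_uses_fusion(sd)  # then passed to model creation
```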
__init__.py:
- Register clap and clap_v2 model types with try/except ImportError
guards (graceful fallback if open_clip not installed)
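A sketch of the guarded registration, assuming a registry-style __init__.py; the registry and loader names are illustrative:

```python
# clip_benchmark/models/__init__.py (illustrative shape)
MODEL_TYPES = {}  # existing model-type registry; name is an assumption

# Guarded imports: if open_clip is missing, the audio model types are
# simply not registered and the image-only benchmark keeps working.
try:
    from .clap import load_clap
    MODEL_TYPES["clap"] = load_clap
except ImportError:
    pass

try:
    from .clap_v2 import load_clap_v2
    MODEL_TYPES["clap_v2"] = load_clap_v2
except ImportError:
    pass
```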
Eval pipeline fixes (classification + retrieval):
- Disable torch.autocast — float16 precision on GH200 destroys cosine
similarity discriminability for CLAP models (accuracy drops to random
chance). The --no_amp CLI flag was already added, but the metrics code
still used autocast internally.
- Cast features to float32 before F.normalize and cosine similarity
(see the sketch after this list)
- Handle non-tensor targets in classification (torch.tensor conversion)
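A minimal sketch of the float32 cast described above; the function name and logit scale value are illustrative:

```python
import torch.nn.functional as F

def cosine_logits(audio_features, text_features, logit_scale=100.0):
    # Cast to float32 first: in float16, nearby CLAP embeddings produce
    # cosine similarities too close together to rank reliably.
    a = F.normalize(audio_features.float(), dim=-1)
    t = F.normalize(text_features.float(), dim=-1)
    return logit_scale * a @ t.t()
```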
builder.py:
- Support both .txt (newline-separated) and .json ({"text": [...]})
caption formats in retrieval WebDatasets (audiocaps uses .json)
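A sketch of handling the two caption formats, assuming raw WebDataset sample bytes; the function name is illustrative:

```python
import json

def extract_captions(sample):
    # .txt shards: one caption per line; .json shards (e.g. audiocaps):
    # an object of the form {"text": [...]}.
    if "txt" in sample:
        return sample["txt"].decode("utf-8").splitlines()
    if "json" in sample:
        return json.loads(sample["json"])["text"]
    raise KeyError("sample has neither .txt nor .json captions")
```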
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mehdidc please have a look
Use getattr() with safe defaults for audio-specific CLI args (modality, dump_classnames, dump_templates) so that callers without these attributes — including the existing unit tests — continue to work unchanged. Fixes test_clip_benchmark.py::test_base AttributeError. With assistance by Claude Code Opus 4.6
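A sketch of the defensive reads, with the default values here being assumptions:

```python
# Older callers (including the unit tests) pass an args namespace without
# the audio-specific attributes, so read them with safe fallbacks.
modality = getattr(args, "modality", "auto")
dump_classnames = getattr(args, "dump_classnames", False)
dump_templates = getattr(args, "dump_templates", False)
```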
Summary
This PR integrates audio modality support into CLIP Benchmark, enabling standardized evaluation of CLAP (Contrastive Language-Audio Pretraining) models alongside existing image-language CLIP models. It supports both the original LAION-CLAP (v1) pretrained checkpoints and models trained with the most recent open_clip CLAP implementation (v2).
Key features
- `clap` — loads old LAION-CLAP v1 pretrained checkpoints (e.g. `630k-audioset-best`) into the new open_clip architecture via state_dict key remapping. No dependency on the old `laion_clap` package — only `open_clip` is needed
- `clap_v2` — loads checkpoints from recent open_clip CLAP training directly (no key remapping)
- Audio decoding via `librosa`, with proper padding/truncation and mel spectrogram computation for fusion models (sketched below)
- `--modality` CLI flag: explicit `image`/`audio` selection with auto-detection from the loaded model type
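A minimal sketch of fixed-length decoding in the spirit of the `librosa` item above; the sample rate and clip length are assumptions (LAION-CLAP commonly uses 48 kHz, 10 s clips):

```python
import librosa
import numpy as np

def load_fixed_length(path, sample_rate=48_000, clip_seconds=10.0):
    # Decode and resample to mono, then pad with zeros or truncate
    # so every clip has the same number of samples.
    wav, _ = librosa.load(path, sr=sample_rate, mono=True)
    target = int(sample_rate * clip_seconds)
    if len(wav) < target:
        wav = np.pad(wav, (0, target - len(wav)))
    return wav[:target]
```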
Evaluation results

Zero-shot classification
Results using CLAP (HTSAT-tiny) with LAION-CLAP v1 pretrained checkpoints:
Zero-shot retrieval
Linear probe
Changes by file
- `clip_benchmark/models/clap.py` — remaps `laion_clap` state_dict keys to open_clip format, auto-detects fusion, fixes the text projection shape mismatch. Downloads checkpoints from HuggingFace.
- `clip_benchmark/models/clap_v2.py` — loads new open_clip CLAP training checkpoints directly, no key remapping
- `clip_benchmark/models/__init__.py` — registers the `clap` and `clap_v2` model types with graceful ImportError fallback
- `clip_benchmark/cli.py` — adds the `--modality` flag (image/audio/auto), passes `audio_loader` and `modality` through the evaluation pipeline
- `clip_benchmark/datasets/builder.py` — `audio_loader` integration, mixed `.txt`/`.json` caption format handling for retrieval
- `clip_benchmark/metrics/zeroshot_classification.py` — `.float()` cast before normalization for numerical stability
- `clip_benchmark/metrics/zeroshot_retrieval.py` — `.float()` cast before normalization
- `clip_benchmark/metrics/linear_probe.py`
- `clip_benchmark/metrics/utils.py`
- `AUDIO_README.md`

Usage examples
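A hypothetical invocation, assuming the existing `clip_benchmark eval` CLI plus the flags added in this PR; the dataset name and checkpoint id are illustrative:

```bash
# Zero-shot audio-text retrieval with an old LAION-CLAP v1 checkpoint.
# --model_type clap and --modality audio come from this PR; the dataset
# and checkpoint names here are examples, not an exhaustive list.
clip_benchmark eval \
    --model_type clap \
    --pretrained 630k-audioset-best \
    --modality audio \
    --dataset wds/audiocaps \
    --task zeroshot_retrieval
```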
Dependencies
- `open_clip` (for model architecture and tokenizer)
- `librosa` (audio decoding)
- `torchaudio` (mel spectrogram computation for fusion models)
- `huggingface_hub` (optional, for downloading old LAION-CLAP checkpoints)

No dependency on the old `laion_clap` package.

Test plan
- `clap_v2` loader verified with our own training checkpoints (strict load)

With assistance by Claude Code Opus 4.6