If you use this code in your research, please cite our ICML 2025 paper:
```bibtex
@inproceedings{
zhang2025lemon,
title={{LEM}oN: Label Error Detection using Multimodal Neighbors},
author={Haoran Zhang and Aparna Balagopalan and Nassim Oufattole and Hyewon Jeong and Yan Wu and Jiacheng Zhu and Marzyeh Ghassemi},
booktitle={Forty-second International Conference on Machine Learning},
year={2025}
}
```
Run the following commands to clone this repo and create the Conda environment:
```bash
git clone git@github.com:MLforHealth/LEMoN.git
cd LEMoN
conda env create -f environment.yml
conda activate lemon
```
CIFAR-10 and CIFAR-100 are downloaded automatically by the codebase. To preprocess the remaining datasets, follow the instructions in `DataSources.md`.
To run a single evaluation, call `run_lemon.py` with the appropriate arguments, for example:
```bash
python -m run_lemon \
    --output_dir /output/dir \
    --dataset mscoco \
    --noise_type cat \
    --noise_level 0.4
```
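At a high level, LEMoN scores each image–caption pair by its CLIP embedding distance, adjusted by distances to each sample's multimodal nearest neighbors. The sketch below is an illustrative simplification of that idea, not the scoring function implemented in `run_lemon.py`; `image_emb` and `text_emb` are assumed to be L2-normalized CLIP embeddings of the dataset's images and captions.

```python
import torch

def multimodal_neighbor_score(image_emb, text_emb, k=10):
    """Toy multimodal-neighbor score: higher = likelier label error."""
    # Base term: cosine distance between each image and its own caption.
    base = 1.0 - (image_emb * text_emb).sum(dim=-1)                       # (N,)

    # Neighbor term: find each image's k nearest images, then measure how
    # far its caption is, on average, from those neighbors' captions.
    img_sim = image_emb @ image_emb.T                                     # (N, N)
    img_sim.fill_diagonal_(-float("inf"))                                 # exclude self-matches
    nbr_idx = img_sim.topk(k, dim=-1).indices                             # (N, k)
    cap_dist = 1.0 - (text_emb.unsqueeze(1) * text_emb[nbr_idx]).sum(-1)  # (N, k)
    nbr_term = cap_dist.mean(dim=-1)

    # The paper combines several such terms with tuned weights; equal
    # weighting here is purely for illustration.
    return base + nbr_term
```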
To reproduce the experiments in the paper, which involve training a grid of models with different hyperparameters, use `sweep.py` as follows:
```bash
python sweep.py launch \
    --experiment {experiment_name} \
    --output_dir {output_root} \
    --command_launcher {launcher}
```
where:
- `experiment_name` corresponds to experiments defined as classes in `experiments.py`.
- `output_root` is a directory where experimental results will be stored.
- `launcher` is a string corresponding to a launcher defined in `launchers.py` (i.e. `slurm` or `local`).
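For example, to launch the main sweep from the paper (used for Tables 2 and 3 below) on a Slurm cluster:

```bash
python sweep.py launch \
    --experiment lemon_all \
    --output_dir /output/root \
    --command_launcher slurm
```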
After the `lemon_all` experiment has finished running, create Tables 2 and 3 by running `notebooks/agg_results.ipynb` and `notebooks/hparam_drop.ipynb`.
The CLIP Logits baseline in our paper uses temperature = 1 for logit scaling, consistent with the mathematical formulation described in Feng et al. We note that Liang et al., who concurrently proposed this baseline, specify the temperature as a hyperparameter; in their updated version (which we observed after our publication), it is set to 0.01.
We have since evaluated the baseline across multiple temperature values (0.01, 0.015, 0.07, and 1) and verified that our method consistently outperforms it on three of the four classification datasets. However, the magnitude of our method's AUROC gain over the best baseline varies, from 0.7% for label error detection at a temperature of 0.01 (still significant) to over 3% at temperatures of 0.07 or 1. The degree of improvement is therefore somewhat sensitive to the temperature used in this baseline.
Downstream filtering performance is stable across temperatures, with little impact on the observed trends in most datasets, even at a temperature of 0.01. Note that CLIP Logits relies on a pre-defined set of classification labels and therefore does not apply as a baseline for captioning datasets.
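To make the role of the temperature concrete, here is a minimal sketch of temperature-scaled CLIP logits for label error scoring. The function and variable names (`clip_logit_error_scores`, `image_emb`, `class_emb`, `labels`) are our own illustrative assumptions, not identifiers from this codebase: smaller temperatures sharpen the softmax over class similarities, which changes how confidently mismatched labels are flagged.

```python
import torch

def clip_logit_error_scores(image_emb, class_emb, labels, temperature=1.0):
    """Score label errors from temperature-scaled CLIP logits.

    image_emb: (N, d) L2-normalized image embeddings
    class_emb: (C, d) L2-normalized text embeddings of the class prompts
    labels:    (N,) observed (possibly noisy) integer labels
    """
    logits = image_emb @ class_emb.T                     # cosine similarities (N, C)
    probs = torch.softmax(logits / temperature, dim=-1)  # smaller T => sharper softmax
    # Higher score = lower probability assigned to the observed label,
    # i.e. a likelier label error.
    return 1.0 - probs[torch.arange(len(labels)), labels]
```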