This directory contains example scripts demonstrating how to use TextAttack-Multilabel for generating multilabel adversarial examples.
Run the complete workflow with built-in sample data:
```bash
# Quick demo (5 samples, fast)
python example_toxic_adv_examples/run_end_to_end_demo.py --quick

# Full demo (10 samples)
python example_toxic_adv_examples/run_end_to_end_demo.py

# Custom configuration
python example_toxic_adv_examples/run_end_to_end_demo.py \
    --num-samples 20 \
    --wir-method gradient \
    --recipe-type transform
```

What this does:
- ✅ Creates sample benign and toxic texts
- ✅ Loads the Detoxify model
- ✅ Runs multilabel adversarial attacks
- ✅ Analyzes attack success rates
- ✅ Saves results with statistics
- ✅ No data download needed!
Run attacks on the Jigsaw Toxic Comments dataset:
```bash
# 1. Download the data (requires Kaggle API)
python example_toxic_adv_examples/download_data.py

# 2. Run attacks with configuration
python example_toxic_adv_examples/run_multilabel_tae_main.py \
    --config example_toxic_adv_examples/config/attack_config.yaml \
    --attack benign
```

| File | Description | Use Case |
|---|---|---|
| `run_end_to_end_demo.py` | Complete workflow with built-in data | Quick testing, demos, learning |
| `run_multilabel_tae_main.py` | Production script with config files | Production attacks on real data |
| `download_data.py` | Download Jigsaw dataset from Kaggle | Get real toxicity data |
| `baseline_multiclass_toxic_adv_example_attack.py` | Baseline single-label attacks | Comparison/benchmarking |
| `config/` | Configuration files | Customize attack parameters |
```bash
# Quick demo (5 samples, unk method)
python example_toxic_adv_examples/run_end_to_end_demo.py --quick

# Standard demo (10 samples)
python example_toxic_adv_examples/run_end_to_end_demo.py

# More samples
python example_toxic_adv_examples/run_end_to_end_demo.py --num-samples 50

# Use gradient-based word importance ranking
python example_toxic_adv_examples/run_end_to_end_demo.py --wir-method gradient

# Use transform recipe instead of target recipe
python example_toxic_adv_examples/run_end_to_end_demo.py --recipe-type transform

# Attack only benign samples
python example_toxic_adv_examples/run_end_to_end_demo.py --no-attack-toxic

# Attack only toxic samples
python example_toxic_adv_examples/run_end_to_end_demo.py --no-attack-benign
```

Available WIR methods:

- `unk` - Unknown token replacement (fastest)
- `delete` - Word deletion importance
- `weighted-saliency` - Gradient-weighted saliency
- `gradient` - Pure gradient-based (slowest, most effective)
- `random` - Random word selection (baseline)
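The core idea behind the `unk` method can be sketched in a few lines: mask each word with an unknown token, re-score the text, and rank words by how much the score moves. Below is a minimal, self-contained illustration; `toy_score` and `unk_word_importance` are hypothetical names standing in for the victim model and the package's actual search-method internals:

```python
def toy_score(words):
    # Hypothetical scorer: fraction of "bad" words. A real attack would
    # query the victim model (e.g. Detoxify) here instead.
    bad = {"awful", "terrible"}
    return sum(w in bad for w in words) / max(len(words), 1)

def unk_word_importance(words, score_fn, unk_token="[UNK]"):
    """Rank word indices by how much masking each word changes the score."""
    base = score_fn(words)
    importances = []
    for i in range(len(words)):
        masked = words[:i] + [unk_token] + words[i + 1:]
        importances.append((abs(base - score_fn(masked)), i))
    # Largest score change first = most important word first
    return [i for _, i in sorted(importances, reverse=True)]

order = unk_word_importance("this is awful".split(), toy_score)
print(order)  # index 2 ("awful") ranks first
```

The other methods differ only in how importance is estimated (deletion, gradients, saliency weighting, or random order); the greedy search then perturbs words in that order.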
```text
======================================================================
TextAttack-Multilabel End-to-End Demo
======================================================================
Start time: 2024-01-15 14:23:45

======================================================================
Step 1: Creating Sample Data
======================================================================
✅ Created 10 benign samples
✅ Created 10 toxic samples

======================================================================
Step 2: Loading Model
======================================================================
ℹ Using device: cuda
ℹ Loading Detoxify model...
✅ Detoxify model loaded successfully

======================================================================
Step 3: Building Attack Recipe
======================================================================
ℹ Attack type: maximize
ℹ WIR method: unk
ℹ Recipe: target
ℹ Goal: Maximize all toxic labels (make benign → toxic)
✅ Attack recipe built successfully
ℹ   - Goal function: MultilabelClassificationGoalFunction
ℹ   - Search method: GreedyWordSwapWIRTruncated
ℹ   - Constraints: 5 active
ℹ   - Transformation: CompositeTransformation

...

======================================================================
Final Summary
======================================================================
Overall Results:
  Benign → Toxic Attack:
    Success rate: 80.0%
    Successful: 8/10
  Toxic → Benign Attack:
    Success rate: 70.0%
    Successful: 7/10

✅ End-to-end demo completed successfully!
```
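The summary's success rates are simply successful/attempted. They can be reproduced from a results DataFrame; this sketch builds a toy frame mirroring the 8/10 benign-to-toxic outcome above (a real run would load the saved `attack_*.parquet` file instead):

```python
import pandas as pd

# Toy results mirroring the demo's 8 successful / 2 failed attacks
df = pd.DataFrame({"result_type": ["Successful"] * 8 + ["Failed"] * 2})

successful = int((df["result_type"] == "Successful").sum())
rate = 100.0 * successful / len(df)
print(f"Success rate: {rate:.1f}% ({successful}/{len(df)})")  # 80.0% (8/10)
```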
For running attacks on real datasets with full configuration control.

- Download the data:

  ```bash
  python example_toxic_adv_examples/download_data.py
  ```

- Set up Kaggle API credentials (required for the data download):

  ```bash
  export KAGGLE_USERNAME="your_username"
  export KAGGLE_KEY="your_api_key"
  ```
```bash
# Attack benign samples (make them toxic)
python example_toxic_adv_examples/run_multilabel_tae_main.py \
    --config example_toxic_adv_examples/config/attack_config.yaml \
    --attack benign

# Attack toxic samples (make them benign)
python example_toxic_adv_examples/run_multilabel_tae_main.py \
    --config example_toxic_adv_examples/config/attack_config.yaml \
    --attack toxic

# Attack both
python example_toxic_adv_examples/run_multilabel_tae_main.py \
    --config example_toxic_adv_examples/config/attack_config.yaml \
    --attack both

# Override data path
python example_toxic_adv_examples/run_multilabel_tae_main.py \
    --config example_toxic_adv_examples/config/attack_config.yaml \
    --attack benign \
    --data path/to/your/data.csv
```

Edit `config/attack_config.yaml` to customize:
- Model: Detoxify variant or custom HuggingFace model
- Dataset: Jigsaw or custom dataset
- Attack: WIR method, target scores, constraints
- Output: Format (parquet/csv), save location
Example config:
```yaml
defaults:
  model:
    type: "detoxify"
    variant: "original"
  dataset:
    name: "jigsaw_toxic_comments"
    sample_size: 500
  attack:
    wir_method: "gradient"
    constraints:
      pos_constraint: true
      sbert_constraint: false
```

Benign → Toxic (Maximize):
- Goal: ALL toxic labels > `target_score` (default 0.5)
- Example: `[0.1, 0.2, 0.3] → [0.6, 0.7, 0.8]` ✅ Success

Toxic → Benign (Minimize):

- Goal: ALL toxic labels < `target_score` (default 0.5)
- Example: `[0.8, 0.7, 0.9] → [0.3, 0.2, 0.1]` ✅ Success
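The two success criteria above reduce to an all-labels threshold check. A minimal sketch (the function names are illustrative, not the package's API; label ordering follows the examples above):

```python
def maximize_success(scores, target_score=0.5):
    """Benign -> Toxic: every toxic label must exceed the target."""
    return all(s > target_score for s in scores)

def minimize_success(scores, target_score=0.5):
    """Toxic -> Benign: every toxic label must fall below the target."""
    return all(s < target_score for s in scores)

print(maximize_success([0.6, 0.7, 0.8]))  # True  (example above succeeds)
print(maximize_success([0.6, 0.4, 0.8]))  # False (one label still below target)
print(minimize_success([0.3, 0.2, 0.1]))  # True
```

Note that *every* label must cross the threshold, which is what makes the multilabel goal strictly harder than a single-label flip.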
Results are saved in the `results/` directory:

- `attack_*.parquet` - Main results file:
  - Original text
  - Perturbed text
  - Original predictions
  - Perturbed predictions
  - Number of queries
  - Attack success status
- `attack_*.summary.txt` - Statistics summary:
  - Total samples
  - Success/fail/skip counts
  - Average queries
  - Average words changed
```python
import pandas as pd

# Load results
df = pd.read_parquet('results/attack_benign_20240115_142345.parquet')

# View successful attacks
successful = df[df['result_type'] == 'Successful']

# Analyze query efficiency
print(f"Avg queries: {df['num_queries'].mean()}")

# Look at perturbations
for idx, row in successful.head(5).iterrows():
    print(f"Original: {row['original_text']}")
    print(f"Perturbed: {row['perturbed_text_clean']}")
    print(f"Queries: {row['num_queries']}\n")
```

Target recipe

Best for: Most scenarios, good balance
```bash
python example_toxic_adv_examples/run_end_to_end_demo.py --recipe-type target
```

Features:

- Composite transformations (multiple perturbation types)
- Character swaps, homoglyphs, word substitutions
- Higher success rate

Transform recipe

Best for: Specific transformation types

```bash
python example_toxic_adv_examples/run_end_to_end_demo.py --recipe-type transform
```

Features:

- Single transformation method
- Options: GloVe embeddings, MLM, WordNet
- More interpretable perturbations
Issue: `ModuleNotFoundError: No module named 'detoxify'`

```bash
# Solution: Install detoxify
pip install detoxify
```

Issue: CUDA out of memory

```bash
# Solution: Use CPU or reduce batch size
# The script auto-detects the device, but you can force CPU mode
CUDA_VISIBLE_DEVICES="" python example_toxic_adv_examples/run_end_to_end_demo.py
```

Issue: `FileNotFoundError`: Data file not found

```bash
# Solution: Download the data first
python example_toxic_adv_examples/download_data.py
```

Issue: Attack runs very slowly

```bash
# Solution: Use a faster WIR method
python example_toxic_adv_examples/run_end_to_end_demo.py --wir-method unk

# Or reduce the number of samples
python example_toxic_adv_examples/run_end_to_end_demo.py --num-samples 5
```
1. Run the quick demo to see the workflow:

   ```bash
   python example_toxic_adv_examples/run_end_to_end_demo.py --quick
   ```

2. Try different WIR methods to compare effectiveness:

   ```bash
   for method in unk delete gradient; do
       python example_toxic_adv_examples/run_end_to_end_demo.py --wir-method $method
   done
   ```

3. Experiment with real data using the production script
4. Analyze results to understand attack patterns
5. Customize attacks by modifying configuration files
- Start small: Use `--quick` mode first to verify setup
- GPU recommended: Attacks run 10-50x faster on GPU
- WIR method matters: `gradient` is most effective but slowest
- Check constraints: Adjust POS/SBERT constraints for quality vs. success rate
- Save results: All outputs include timestamps for versioning
- Main README: `../README.md`
- Package documentation: `../textattack_multilabel/`
- TextAttack documentation: https://textattack.readthedocs.io/
- Research paper: [ACL 2023 Multilabel Attacks]
Found issues or have improvements? Please open an issue or PR in the main repository!