Skip to content

Conversation

@gonzalobenegas
Copy link
Member

No description provided.

@gonzalobenegas gonzalobenegas merged commit ea43515 into main Dec 18, 2025
6 checks passed
@gonzalobenegas gonzalobenegas deleted the new-dataset branch December 18, 2025 21:31
@claude
Copy link

claude bot commented Dec 18, 2025

Code Review - PR #18: Add experiment config

I've reviewed this PR against the CLAUDE.md guidelines. Here's my feedback:

🔴 Critical Issues

1. Missing soft_masked_weight Parameter

Location: Both experiment configs
Issue: The configs don't specify soft_masked_weight in the data section, but the model config (clm_transformer_base.yaml:3) references ${data.soft_masked_weight}.

Impact: This will cause a Hydra interpolation error at runtime since the key is missing.

Fix: Add to both configs:

data:
  _target_: glm_experiments.data.lm_datamodule.CLMDataModule
  dataset_name: data/gonzalobenegas/genomes-v2-genome_set-animals-intervals-v1_512_256
  per_device_batch_size: 256
  data_augmentation: false
  soft_masked_weight: 0.01  # Required - referenced by model config

2. Missing evals Parameter

Location: Both experiment configs
Issue: Similarly, the model config references ${data.evals} (line 4) but it's not defined.

Fix: Add to both configs:

data:
  evals: null  # Or specify eval datasets if needed

⚠️ Medium Priority Issues

3. Duplicate WandB Name

Location: clm_transformer_base_new_dataset_vertebrates.yaml:12
Issue: Both configs use the same wandb name experiment-clm-transformer-base-new-dataset, making it hard to distinguish runs.

Fix:

logger:
  wandb:
    name: experiment-clm-transformer-base-new-dataset-vertebrates  # Add '-vertebrates'

4. Incorrect Documentation Comment

Location: clm_transformer_base_new_dataset_vertebrates.yaml:4
Issue: Comment says experiment=clm_transformer_base_new_dataset but should be experiment=clm_transformer_base_new_dataset_vertebrates.

Fix:

# to execute this experiment run:
# python glm_experiments/train.py experiment=clm_transformer_base_new_dataset_vertebrates

📝 Recommendations

5. Missing Dataset Validation

Concern: The README.md is updated to include download instructions for the new dataset, but there's no verification that:

  • The dataset exists at the specified HuggingFace path
  • The dataset has the expected structure (columns, format)
  • The dataset is compatible with CLMDataModule

Suggestion: Add a simple test or at least manually verify the dataset can be loaded.

6. Inconsistent Pattern with Base Config

Observation: clm_transformer_base.yaml doesn't explicitly override data_augmentation: false, but these new configs do. This is fine, but worth noting for consistency.

✅ Positive Observations

  • Proper use of Hydra _target_ instantiation pattern
  • Correct scheduler configuration with interpolation (${trainer.max_steps})
  • Appropriate tags for experiment tracking
  • Follows existing experiment config structure

🧪 Testing Recommendation

Since this PR was merged, I recommend:

  1. Running both experiment configs to verify they don't crash on startup
  2. Checking that the Hydra interpolations resolve correctly
  3. Verifying the datasets can be loaded

Command to test:

# Dry-run to check config resolution
uv run python glm_experiments/train.py experiment=clm_transformer_base_new_dataset --cfg job

# Check vertebrates config
uv run python glm_experiments/train.py experiment=clm_transformer_base_new_dataset_vertebrates --cfg job

Summary

Severity: Medium-High - The missing required parameters will cause runtime failures.

Action Required: Either:

  1. Create a follow-up PR to fix the missing parameters, OR
  2. Verify that subsequent commits already addressed these issues

Review conducted following CLAUDE.md standards for configuration management and Hydra patterns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants