Automatic generation of RL training environments using Claude Code subagents, based on the DeepSeek-V3.2 methodology.
This project replicates DeepSeek's approach to automatically synthesizing "Hard to Solve, Easy to Verify" constraint satisfaction environments for reinforcement learning. Instead of expensive API calls, we use Claude Code subagents (free!) to generate environments in parallel.
- ✅ Zero Cost: Uses Claude Code subagents instead of paid APIs
- ✅ Parallel Generation: Spawn multiple subagents simultaneously
- ✅ Real Data: Environments built from actual internet sources (D&D 5e, MTG, Pokemon, etc.)
- ✅ Multiple Formats: Export to RL training, benchmarks, or fine-tuning datasets
- ✅ Quality Filtered: Pass@K evaluation ensures proper difficulty
Each environment follows a 6-step synthesis workflow:
1. Generate Database → Real data from internet (WebSearch/WebFetch)
2. Synthesize Tools → Python functions that query database
3. Create Task → Base task + solution + verifier
4. Scale Difficulty → Progressive constraints (5 levels)
5. Augment Toolset → Add tools as needed for harder levels
6. Create Metadata → Document sources and environment details
Core Philosophy: "Hard to Solve, Easy to Verify"
- Solutions can ONLY call tool functions (no direct database access)
- Verification is simple constraint checking
- This combination yields a strong, reliable RL training signal
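As a concrete illustration of this split, a solution only assembles its answer through tool calls, and the verifier independently re-checks the constraints. The sketch below is hypothetical: the solve() function and its output fields simply mirror the tools.py and verifier.py examples shown later in this README.

```python
# Hypothetical solution for an easy RPG task. It may ONLY call tools.py
# functions; it never reads database.json directly.
from tools import get_class_by_id, get_available_spells
from verifier import verify

def solve() -> dict:
    wizard = get_class_by_id("wizard")                    # tool call, not a DB read
    spells = get_available_spells("wizard", max_level=3)  # tool call
    return {
        "class": wizard["id"],
        "primary_stat": {"intelligence": 18},
        "spells": [s["id"] for s in spells if s["name"] == "Fireball"],
    }

result = verify(solve())  # cheap, deterministic constraint checking
print(result["passed"], result["violations"], result["score"])
```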
auto-rlenvs/
├── README.md
├── plan.md # Detailed implementation plan
├── tasks.md # 50+ CSP problem templates
├── deepseek.md # Original DeepSeek paper
│
├── scripts/ # Benchmark runner scripts
│ └── run_pc_part_picker_benchmark.sh # Run Harbor benchmarks on Daytona
│
├── src/
│ ├── prompts/
│ │ ├── synthesis_prompt.md # Main 6-step workflow template
│ │ └── category_hints/ # Category-specific guidance
│ │ ├── rpg_builds.md
│ │ ├── deck_building.md
│ │ ├── pokemon_teams.md
│ │ ├── chess_problems.md
│ │ └── pc_part_picker.md
│ │
│ ├── utils/ # Helper utilities (TODO)
│ │ ├── file_manager.py
│ │ └── validator.py
│ │
│ └── output/ # Export formatters (TODO)
│ ├── rl_formatter.py
│ ├── benchmark_formatter.py
│ └── finetune_formatter.py
│
├── data/
│ ├── envs/ # Generated environments
│ │ └── {env_id}/
│ │ ├── database.db # Real data (SQLite)
│ │ ├── tools.py # Query functions
│ │ ├── tasks.json # ALL task definitions
│ │ ├── verifier.py # Constraint checker
│ │ ├── environment.json # Metadata + sources
│ │ ├── harbor-tasks/ # Harbor task directories (50 separate tasks)
│ │ │ ├── task-1/
│ │ │ │ ├── instruction.md
│ │ │ │ ├── task.toml
│ │ │ │ ├── environment/ # Dockerfile, database.db, tools.py, etc.
│ │ │ │ └── tests/test.sh
│ │ │ ├── task-2/
│ │ │ ...
│ │ │ └── task-N/
│ │ ├── HARBOR_STRUCTURE.md # Harbor format documentation
│ │ ├── run_benchmark_full.sh
│ │ └── run_benchmark_quick.sh
│ │
│ └── outputs/ # Exported datasets
│ ├── rl_training/
│ ├── benchmarks/
│ └── finetune/
│
└── configs/ # Configuration files (TODO)
└── synthesis_config.yaml
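The utilities under src/utils/ are marked TODO above; as a rough sketch of what validator.py could eventually check for each generated environment (file names taken from the data/envs/ layout above):

```python
# Hypothetical sketch of src/utils/validator.py (still a TODO):
# verify that a generated environment directory contains the expected files.
from pathlib import Path

REQUIRED = ["tools.py", "tasks.json", "verifier.py", "environment.json"]

def missing_files(env_dir: str) -> list[str]:
    """Return the names of required files missing from one environment."""
    root = Path(env_dir)
    missing = [name for name in REQUIRED if not (root / name).exists()]
    # The database may be stored as SQLite or JSON depending on the generator run.
    if not any((root / db).exists() for db in ("database.db", "database.json")):
        missing.append("database.db / database.json")
    return missing

if __name__ == "__main__":
    print(missing_files("data/envs/rpg_001"))
```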
You (User)
↓
Claude Code (Main Process)
├─ Reads prompt templates
├─ Spawns N subagents in parallel
└─ Aggregates results
Subagent 1 → Generates rpg_001/
Subagent 2 → Generates rpg_002/
Subagent 3 → Generates rpg_003/
...
↓
Complete Environments Ready for RL Training
# You ask Claude Code to generate environments
"Generate 5 RPG character building environments"
# Claude Code:
# 1. Loads synthesis_prompt.md + rpg_builds.md
# 2. Creates 5 customized prompts
# 3. Spawns 5 Task subagents in parallel
# 4. Each subagent independently:
# - Searches internet for real D&D 5e data
# - Creates database.json with real spells/equipment
# - Writes tools.py with query functions
# - Designs 5 difficulty levels in tasks.json
# - Writes verifier.py for constraint checking
# - Saves to data/envs/{env_id}/
# 5. Aggregates and reports results
# Result: 5 complete environments in ~15 minutes

Each environment contains:
Real data from authoritative internet sources:
{
"classes": [
{"id": "wizard", "name": "Wizard", "hit_die": 6, "armor_proficiency": ["light"], ...}
],
"spells": [
{"id": "fireball", "name": "Fireball", "level": 3, "school": "evocation", ...}
],
"equipment": [
{"id": "staff", "name": "Quarterstaff", "cost": 2, "damage": "1d6", ...}
]
}

Query functions (solutions can ONLY use these):
import json
from pathlib import Path

DB_PATH = Path(__file__).parent / 'database.json'
with open(DB_PATH) as f:
    db = json.load(f)

def get_class_by_id(class_id: str) -> dict:
    """Get class details by ID."""
    return next((c for c in db['classes'] if c['id'] == class_id), None)

def get_available_spells(class_id: str, max_level: int) -> list:
    """Get spells available to a class up to given level."""
    return [s for s in db['spells']
            if class_id in s['class_access'] and s['level'] <= max_level]
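
# (Illustrative only, not part of the generated file: higher difficulty levels
#  add cost constraints, so an augmented toolset might include something like this.)
def get_equipment_under_cost(max_cost: int) -> list:
    """Hypothetical tool: equipment items costing at most max_cost gold."""
    return [e for e in db['equipment'] if e['cost'] <= max_cost]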
# ... 8-10 total functions

5 difficulty levels with progressive constraints:
{
"tasks": [
{
"level": 1,
"prompt": "Build a Level 5 Wizard with Intelligence >= 18 and Fireball",
"constraints": [...]
},
{
"level": 2,
"prompt": "Build a Level 5 Wizard with Int >= 18, Fireball, and equipment cost < 1000 gold",
"constraints": [...]
},
// ... levels 3-5
]
}

Simple constraint checking (easy to verify):
def verify(output: dict) -> dict:
    """Verify solution satisfies all constraints."""
    violations = []

    # Check class
    if output.get('class') != 'wizard':
        violations.append("Must be Wizard")

    # Check Intelligence
    if output.get('primary_stat', {}).get('intelligence', 0) < 18:
        violations.append("Intelligence must be >= 18")

    # ... more checks

    return {
        "passed": len(violations) == 0,
        "violations": violations,
        "score": 1.0 if len(violations) == 0 else 0.0
    }

Metadata and data provenance:
{
"env_id": "rpg_001",
"category": "rpg_builds",
"description": "D&D 5e Wizard character building with spell selection and equipment constraints",
"difficulty_range": [1, 5],
"num_tools": 10,
"database_tables": ["classes", "spells", "equipment", "skills"],
"data_sources": [
"https://www.dndbeyond.com/classes/wizard",
"https://roll20.net/compendium/dnd5e/Spells"
],
"created_at": "2025-12-10",
"pass_rate": null
}

Currently supported categories:
- RPG Builds: D&D 5e character optimization
  - Class selection, spell choices, equipment management
  - Constraints: stats, costs, specializations, combat optimization
- Deck Building: Magic: The Gathering style deck construction
  - Card selection, mana curve, archetype requirements
  - Constraints: deck size, rarity limits, synergy requirements
- Pokemon Teams: Competitive Pokemon team composition
  - Type coverage, role distribution, stat optimization
  - Constraints: tier limits, weakness coverage, synergy chains
- Chess Problems: Tactical puzzles and mate-in-N problems
  - Piece placement, checkmate patterns, endgame studies
  - Constraints: material limits, forced sequences, positional requirements
- PC Part Picker: Computer hardware compatibility and optimization
  - CPU, GPU, motherboard, RAM, storage, PSU, case selection
  - Constraints: socket compatibility, physical dimensions, power requirements, budget limits, color themes
  - Real-world data from PCPartPicker, manufacturer specs
1. Ask Claude Code to generate environments:
   "Generate 5 RPG character building environments"
2. Claude Code will:
   - Load prompt templates
   - Spawn 5 subagents in parallel
   - Each generates a complete environment
   - Report results after ~15 minutes
3. Review generated environments:
   "Show me what's in data/envs/rpg_001/"
4. Generate more or export:
   "Generate 10 Pokemon team environments"
   "Export all environments to RL training format"
Custom categories:
"Generate 5 environments for [your category]"
Quality filtering:
"Run pass@100 evaluation on all environments and filter out any with pass rate > 90% or < 5%"
Export formats:
"Export environments to GRPO RL training format"
"Export environments as evaluation benchmarks"
"Export environments as supervised fine-tuning data"
{"prompt": "Build a Wizard...", "tools": [...], "verifier": "def verify...", "difficulty": 1}
{"prompt": "Build a Wizard...", "tools": [...], "verifier": "def verify...", "difficulty": 2}
...

Benchmark format:
{
"name": "Gaming Logic Benchmark v1",
"categories": {"rpg_builds": 50, "deck_building": 25, ...},
"tasks": [
{"id": "rpg_001_level_1", "prompt": "...", "tools": [...], "verifier": "..."}
]
}{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "tool_calls": [...]}]}Environments are filtered using pass@K evaluation:
pass@100 = (successful solutions) / 100 attempts
Keep if: 0.01 < pass@100 < 0.95
Reasoning:
- pass@100 = 0%: Impossible or too hard (no learning signal)
- pass@100 = 100%: Too easy (no challenge)
- pass@100 = 5-50%: Perfect for RL (hard but learnable)
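A minimal sketch of this filter, assuming the K attempts for an environment have already been scored with its verify() function:

```python
# Hypothetical pass@K filter using the 0.01 < pass@K < 0.95 band from above.
def pass_at_k(verifier_outputs: list[dict]) -> float:
    """verifier_outputs: one verify() result per sampled attempt (K total)."""
    return sum(1 for r in verifier_outputs if r["passed"]) / len(verifier_outputs)

def keep_environment(verifier_outputs: list[dict],
                     low: float = 0.01, high: float = 0.95) -> bool:
    rate = pass_at_k(verifier_outputs)
    return low < rate < high

# Example: 7 successes in 100 attempts -> pass@100 = 0.07 -> environment is kept.
attempts = [{"passed": i < 7} for i in range(100)]
print(pass_at_k(attempts), keep_environment(attempts))
```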
All environments include citations to source data:
{
"data_sources": [
"https://www.dndbeyond.com/sources/basic-rules",
"https://roll20.net/compendium/dnd5e/Spells",
"https://5e.tools/"
],
"data_source_description": "Classes from D&D Basic Rules. Spells from Roll20. Equipment from 5e.tools."
}

This ensures:
- Data is verifiable and accurate
- Environments are grounded in real game systems
- Legal compliance with open game licenses
Generation Speed:
- 1 environment: ~10-15 minutes (single subagent)
- 5 environments: ~15 minutes (5 parallel subagents)
- 10 environments: ~20 minutes (batched execution)
- 100 environments: ~2-3 hours (10 subagents × 10 batches)
Cost:
- Synthesis: $0 (Claude Code subagents are free)
- Quality filtering: $0 (use subagents for evaluation)
- Total: $0 for unlimited environments
Scalability:
- Local: 5-10 parallel subagents (limited by machine resources)
- Future (Daytona): 50+ parallel instances for massive scale
| Aspect | DeepSeek Method | Our Implementation |
|---|---|---|
| Agent | API calls (Claude Opus) | Claude Code subagents |
| Cost | ~$165 for 1000 envs | $0 (free subagents) |
| Data Source | LLM-synthesized | Real internet data |
| Parallelization | API rate limited | Unlimited subagents |
| Interactivity | Batch processing | Interactive with Claude Code |
- ✅ Prompt templates with 6-step workflow
- ✅ Category hints (RPG, TCG, Pokemon, Chess)
- ✅ Subagent-based parallel generation
- ✅ Real data collection via WebSearch/WebFetch
- ⬜ File manager utilities
- ⬜ Validation helpers
- ⬜ Automated orchestrator script
- ⬜ Configuration system
- ⬜ RL training format exporters
- ⬜ Benchmark format exporters
- ⬜ Fine-tuning format exporters
- ⬜ Pass@K quality filtering
- ⬜ Additional categories (from tasks.md)
- ⬜ Daytona integration for massive parallelization
- ⬜ Dataset versioning and management
- ⬜ Evaluation benchmarks
Task (RPG Builds, Level 3):
"Build a Level 5 Wizard specialized in Evocation school. Must have:
- Intelligence >= 18
- Fireball spell
- Equipment cost < 1000 gold
- At least 3 Evocation spells"
Why it's good for RL:
- Hard to solve: Large search space (25 spells × 20 equipment items)
- Easy to verify: Simple constraint checks
- Progressive difficulty: Level 1 (easy) → Level 5 (hard)
Task (Pokemon Teams, Level 4):
"Build a 6-Pokemon OU tier team with:
- Balanced type coverage (no double weaknesses)
- 1 physical sweeper + 1 special wall + 1 support
- Synergistic abilities (e.g., Rain Dance + Swift Swim)
- Average speed > 95"
Why it's good for RL:
- Hard to solve: Combinatorial optimization (30 pokemon × 6 slots)
- Easy to verify: Type chart + stat calculations
- Real-world grounding: Actual competitive Pokemon data
Task (PC Part Picker, Level 3):
"Build an all-white gaming PC under $1500 with:
- Intel i5-14600K (overclockable)
- RTX 4070 GPU (304mm length)
- 32GB DDR5 RAM
- 1TB NVMe SSD
- Case must fit GPU (check length constraint)
- Since CPU is K-series, must use Z790 chipset for overclocking
- ALL visible components must be white color"
Why it's good for RL:
- Hard to solve: Multi-dimensional constraints (compatibility, budget, physical fit, aesthetics)
- Easy to verify: Socket matching, dimension checks, price sum, color matching
- Real-world grounding: Actual PC components with real specs and prices
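For this category the verifier again reduces to a handful of numeric and string comparisons over the chosen parts. A hedged sketch, with illustrative field names (price, socket, length_mm, etc.) that are not the environment's actual schema:

```python
# Illustrative constraint checks for the all-white PC build task above
# (field names are hypothetical, not the real database schema).
def verify_pc_build(build: dict) -> dict:
    violations = []
    if sum(p["price"] for p in build["parts"]) > 1500:
        violations.append("Total cost must be under $1500")
    if build["cpu"]["socket"] != build["motherboard"]["socket"]:
        violations.append("CPU and motherboard sockets must match")
    if build["gpu"]["length_mm"] > build["case"]["max_gpu_length_mm"]:
        violations.append("GPU does not fit the case")
    if build["cpu"]["name"].endswith("K") and build["motherboard"]["chipset"] != "Z790":
        violations.append("K-series CPU requires a Z790 chipset for overclocking")
    if any(p["color"].lower() != "white" for p in build["visible_parts"]):
        violations.append("All visible components must be white")
    return {"passed": not violations, "violations": violations}
```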
Want to add new categories? Create a new file in src/prompts/category_hints/:
# Your Category - Category Hints
## Database Requirements
[Describe tables needed]
## Suggested Constraints
[Describe difficulty progression]
## Example Task Progression
[Show 5 levels of tasks]
## Tool Suggestions
[List essential query functions]

Then generate environments:
"Generate 5 environments for [your category]"
MIT License - See LICENSE file for details
- DeepSeek-V3.2 Paper: arXiv:2512.02556
- Claude Code: https://claude.com/claude-code
- D&D 5e SRD: https://www.dndbeyond.com/sources/basic-rules
If you use this project in research, please cite:
@software{auto_rl_envs_2025,
title={Auto RL Environments: Automated Generation of RL Training Environments},
author={Your Name},
year={2025},
url={https://github.com/yourusername/auto-rlenvs}
}

We have 4 complete Harbor environments ready for benchmarking:
- pc_part_picker_001 - Computer hardware builds (50 tasks, difficulty 2-6/5)
- garden_design_001 - Landscape design with plants (50 tasks, difficulty 1-5/5)
- pharmaceutical_001 - Drug regimen planning (50 tasks, difficulty 1-5/5)
- chemical_synthesis_001 - Multi-step synthesis routes (50 tasks, difficulty 1-6/5)
Run benchmarks on any environment with a single command:
# List all available environments
./scripts/list_environments.sh
# Gemini 3 Pro Preview
./scripts/run_benchmark_gemini.sh ENV_NAME MODE
# Claude Opus 4.5 (via Bedrock)
./scripts/run_benchmark_opus.sh ENV_NAME MODE

For Gemini:
export GEMINI_API_KEY=<your-key>
export DAYTONA_API_KEY=<your-key>

For Claude Opus (Bedrock):
export AWS_ACCESS_KEY_ID=<your-key>
export AWS_SECRET_ACCESS_KEY=<your-secret>
export AWS_REGION=us-east-1 # or your preferred region
export DAYTONA_API_KEY=<your-key>

PC Part Picker (harbor-tasks structure):
# Quick benchmark (10 representative tasks)
./scripts/run_benchmark_gemini.sh pc_part_picker_001 quick
# Full benchmark (all 50 tasks)
./scripts/run_benchmark_opus.sh pc_part_picker_001 full
# Budget builds only (tasks 1-10)
./scripts/run_benchmark_gemini.sh pc_part_picker_001 budget
# Single task
./scripts/run_benchmark_opus.sh pc_part_picker_001 single 27

Garden Design (harbor-tasks structure):
# Quick benchmark
./scripts/run_benchmark_gemini.sh garden_design_001 quick
# All 50 landscape design tasks
./scripts/run_benchmark_opus.sh garden_design_001 full

Pharmaceutical (job_examples structure):
# Quick benchmark (easy cases)
./scripts/run_benchmark_gemini.sh pharmaceutical_001 quick
# Medium challenge (moderate complexity)
./scripts/run_benchmark_opus.sh pharmaceutical_001 medium
# Hard challenge (expert cases)
./scripts/run_benchmark_gemini.sh pharmaceutical_001 hard
# All tasks
./scripts/run_benchmark_opus.sh pharmaceutical_001 full

Chemical Synthesis (job_examples structure):
# Quick test (simple reactions)
./scripts/run_benchmark_gemini.sh chemical_synthesis_001 quick
# Medium benchmark
./scripts/run_benchmark_opus.sh chemical_synthesis_001 medium
# Hard tasks (expert synthesis)
./scripts/run_benchmark_gemini.sh chemical_synthesis_001 hard
# Green chemistry challenge (specific config)
./scripts/run_benchmark_opus.sh chemical_synthesis_001 green_chemistry_challenge.yaml

For harbor-tasks environments (pc_part_picker, garden_design):
- `quick` - 10 representative tasks (~5-10 min)
- `full` - All 50 tasks (~25-50 min)
- `budget` - First 10 tasks only
- `single N` - Run specific task number
For job_examples environments (pharmaceutical, chemical_synthesis):
- `quick` - Quick test/benchmark
- `medium` - Medium difficulty challenge
- `hard` - Hard tasks
- `full` - All tasks
- Or specify exact YAML filename
All benchmarks run on Daytona cloud sandboxes with 10 concurrent workers for parallel execution.
Expected completion times:
- Quick (10 tasks): ~5-15 minutes
- Medium (20 tasks): ~10-25 minutes
- Full (50 tasks): ~25-60 minutes
Results are saved to data/envs/ENV_NAME/jobs/
For detailed guides, see environment-specific documentation:
- `data/envs/pc_part_picker_001/BENCHMARK_GUIDE.md`
- `data/envs/garden_design_001/ENVIRONMENT_GUIDE.md`
- `data/envs/pharmaceutical_001/BENCHMARK_GUIDE.md`
- `data/envs/chemical_synthesis_001/BENCHMARK_GUIDE.md`
# List environments
./scripts/list_environments.sh
# Run benchmarks (Gemini)
./scripts/run_benchmark_gemini.sh pc_part_picker_001 quick
./scripts/run_benchmark_gemini.sh garden_design_001 full
./scripts/run_benchmark_gemini.sh pharmaceutical_001 medium
./scripts/run_benchmark_gemini.sh chemical_synthesis_001 hard
# Run benchmarks (Opus)
./scripts/run_benchmark_opus.sh pc_part_picker_001 full
./scripts/run_benchmark_opus.sh garden_design_001 quick
./scripts/run_benchmark_opus.sh pharmaceutical_001 quick
./scripts/run_benchmark_opus.sh chemical_synthesis_001 medium
# Generate new environments
"Generate 5 RPG environments"
"Generate 10 Pokemon environments"
"Generate 5 PC Part Picker environments"
"Generate 10 Chess problem environments"
# Inspect
"Show me rpg_001"
"Show me pc_part_picker_001"
"List all generated environments"
# Export
"Export to RL training format"
"Export to benchmark format"
"Export to fine-tuning format"Status: ✅ Core synthesis working | ⏳ Utilities in progress | 📋 Export planned
Getting Started: Just ask Claude Code to "Generate 5 RPG environments" and watch the magic happen!