Auto RL Environments

Automatic generation of RL training environments using Claude Code subagents, based on the DeepSeek-V3.2 methodology.

Overview

This project replicates DeepSeek's approach to automatically synthesizing "Hard to Solve, Easy to Verify" constraint satisfaction environments for reinforcement learning. Instead of expensive API calls, we use Claude Code subagents (free!) to generate environments in parallel.

Key Features

  • ✅ Zero Cost: Uses Claude Code subagents instead of paid APIs
  • ✅ Parallel Generation: Spawn multiple subagents simultaneously
  • ✅ Real Data: Environments built from actual internet sources (D&D 5e, MTG, Pokemon, etc.)
  • ✅ Multiple Formats: Export to RL training, benchmarks, or fine-tuning datasets
  • ✅ Quality Filtered: Pass@K evaluation ensures proper difficulty

The DeepSeek Methodology

Each environment follows a 6-step synthesis workflow:

1. Generate Database    → Real data from internet (WebSearch/WebFetch)
2. Synthesize Tools     → Python functions that query database
3. Create Task          → Base task + solution + verifier
4. Scale Difficulty     → Progressive constraints (5 levels)
5. Augment Toolset      → Add tools as needed for harder levels
6. Create Metadata      → Document sources and environment details

Core Philosophy: "Hard to Solve, Easy to Verify"

  • Solutions can ONLY call tool functions (no direct database access)
  • Verification is simple constraint checking
  • Creates a clean RL training signal (see the reward sketch below)
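
In practice, the policy's output is scored by nothing more than the environment's verifier. A minimal sketch of that reward hook (illustrative only; it assumes the verify() interface documented under verifier.py below):

# Illustrative reward hook, not part of the repo: the verifier's result
# maps directly onto a scalar RL reward.
from verifier import verify  # each environment ships its own verifier.py

def reward(solution: dict) -> float:
    result = verify(solution)  # {"passed": bool, "violations": [...], "score": float}
    return result["score"]     # 1.0 only when every constraint is satisfied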

Project Structure

auto-rlenvs/
├── README.md
├── plan.md                            # Detailed implementation plan
├── tasks.md                           # 50+ CSP problem templates
├── deepseek.md                        # Original DeepSeek paper
│
├── scripts/                           # Benchmark runner scripts
│   └── run_pc_part_picker_benchmark.sh  # Run Harbor benchmarks on Daytona
│
├── src/
│   ├── prompts/
│   │   ├── synthesis_prompt.md        # Main 6-step workflow template
│   │   └── category_hints/            # Category-specific guidance
│   │       ├── rpg_builds.md
│   │       ├── deck_building.md
│   │       ├── pokemon_teams.md
│   │       ├── chess_problems.md
│   │       └── pc_part_picker.md
│   │
│   ├── utils/                         # Helper utilities (TODO)
│   │   ├── file_manager.py
│   │   └── validator.py
│   │
│   └── output/                        # Export formatters (TODO)
│       ├── rl_formatter.py
│       ├── benchmark_formatter.py
│       └── finetune_formatter.py
│
├── data/
│   ├── envs/                          # Generated environments
│   │   └── {env_id}/
│   │       ├── database.db            # Real data (SQLite)
│   │       ├── tools.py               # Query functions
│   │       ├── tasks.json             # ALL task definitions
│   │       ├── verifier.py            # Constraint checker
│   │       ├── environment.json       # Metadata + sources
│   │       ├── harbor-tasks/          # Harbor task directories (50 separate tasks)
│   │       │   ├── task-1/
│   │       │   │   ├── instruction.md
│   │       │   │   ├── task.toml
│   │       │   │   ├── environment/   # Dockerfile, database.db, tools.py, etc.
│   │       │   │   └── tests/test.sh
│   │       │   ├── task-2/
│   │       │   ...
│   │       │   └── task-N/
│   │       ├── HARBOR_STRUCTURE.md    # Harbor format documentation
│   │       ├── run_benchmark_full.sh
│   │       └── run_benchmark_quick.sh
│   │
│   └── outputs/                       # Exported datasets
│       ├── rl_training/
│       ├── benchmarks/
│       └── finetune/
│
└── configs/                           # Configuration files (TODO)
    └── synthesis_config.yaml

How It Works

Architecture

You (User)
  ↓
Claude Code (Main Process)
  ├─ Reads prompt templates
  ├─ Spawns N subagents in parallel
  └─ Aggregates results

Subagent 1 → Generates rpg_001/
Subagent 2 → Generates rpg_002/
Subagent 3 → Generates rpg_003/
...
  ↓
Complete Environments Ready for RL Training

Workflow Example

# You ask Claude Code to generate environments
"Generate 5 RPG character building environments"

# Claude Code:
# 1. Loads synthesis_prompt.md + rpg_builds.md
# 2. Creates 5 customized prompts
# 3. Spawns 5 Task subagents in parallel
# 4. Each subagent independently:
#    - Searches internet for real D&D 5e data
#    - Creates database.json with real spells/equipment
#    - Writes tools.py with query functions
#    - Designs 5 difficulty levels in tasks.json
#    - Writes verifier.py for constraint checking
#    - Saves to data/envs/{env_id}/
# 5. Aggregates and reports results

# Result: 5 complete environments in ~15 minutes

Generated Environment Structure

Each environment contains:

1. database.json

Real data from authoritative internet sources:

{
  "classes": [
    {"id": "wizard", "name": "Wizard", "hit_die": 6, "armor_proficiency": ["light"], ...}
  ],
  "spells": [
    {"id": "fireball", "name": "Fireball", "level": 3, "school": "evocation", ...}
  ],
  "equipment": [
    {"id": "staff", "name": "Quarterstaff", "cost": 2, "damage": "1d6", ...}
  ]
}

2. tools.py

Query functions (solutions can ONLY use these):

import json
from pathlib import Path

DB_PATH = Path(__file__).parent / 'database.json'
with open(DB_PATH) as f:
    db = json.load(f)

def get_class_by_id(class_id: str) -> dict | None:
    """Get class details by ID, or None if the ID is unknown."""
    return next((c for c in db['classes'] if c['id'] == class_id), None)

def get_available_spells(class_id: str, max_level: int) -> list:
    """Get spells available to a class up to given level."""
    return [s for s in db['spells']
            if class_id in s['class_access'] and s['level'] <= max_level]

# ... 8-10 total functions
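
A solution is then assembled entirely from these calls. Here is a sketch for the Level 1 task shown in tasks.json below; the output fields mirror what verifier.py checks, and the snippet is illustrative rather than generated repo code:

# Illustrative solution that uses only tool functions (no direct database access).
from tools import get_class_by_id, get_available_spells

wizard = get_class_by_id('wizard')
spells = get_available_spells('wizard', max_level=3)
fireball = next(s for s in spells if s['id'] == 'fireball')

solution = {
    "class": wizard['id'],
    "spells": [fireball['id']],
    "primary_stat": {"intelligence": 18},
}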

3. tasks.json

5 difficulty levels with progressive constraints:

{
  "tasks": [
    {
      "level": 1,
      "prompt": "Build a Level 5 Wizard with Intelligence >= 18 and Fireball",
      "constraints": [...]
    },
    {
      "level": 2,
      "prompt": "Build a Level 5 Wizard with Int >= 18, Fireball, and equipment cost < 1000 gold",
      "constraints": [...]
    },
    // ... levels 3-5
  ]
}

4. verifier.py

Simple constraint checking (easy to verify):

def verify(output: dict) -> dict:
    """Verify solution satisfies all constraints."""
    violations = []

    # Check class
    if output.get('class') != 'wizard':
        violations.append("Must be Wizard")

    # Check Intelligence
    if output.get('primary_stat', {}).get('intelligence', 0) < 18:
        violations.append("Intelligence must be >= 18")

    # ... more checks

    return {
        "passed": len(violations) == 0,
        "violations": violations,
        "score": 1.0 if len(violations) == 0 else 0.0
    }

5. environment.json

Metadata and data provenance:

{
  "env_id": "rpg_001",
  "category": "rpg_builds",
  "description": "D&D 5e Wizard character building with spell selection and equipment constraints",
  "difficulty_range": [1, 5],
  "num_tools": 10,
  "database_tables": ["classes", "spells", "equipment", "skills"],
  "data_sources": [
    "https://www.dndbeyond.com/classes/wizard",
    "https://roll20.net/compendium/dnd5e/Spells"
  ],
  "created_at": "2025-12-10",
  "pass_rate": null
}

Categories

Currently supported categories:

1. RPG Character Builds

  • D&D 5e character optimization
  • Class selection, spell choices, equipment management
  • Constraints: stats, costs, specializations, combat optimization

2. TCG Deck Building

  • Magic: The Gathering style deck construction
  • Card selection, mana curve, archetype requirements
  • Constraints: deck size, rarity limits, synergy requirements

3. Pokemon Team Building

  • Competitive Pokemon team composition
  • Type coverage, role distribution, stat optimization
  • Constraints: tier limits, weakness coverage, synergy chains

4. Chess Problems

  • Tactical puzzles and mate-in-N problems
  • Piece placement, checkmate patterns, endgame studies
  • Constraints: material limits, forced sequences, positional requirements

5. PC Part Picker

  • Computer hardware compatibility and optimization
  • CPU, GPU, motherboard, RAM, storage, PSU, case selection
  • Constraints: socket compatibility, physical dimensions, power requirements, budget limits, color themes
  • Real-world data from PCPartPicker, manufacturer specs

Usage

Quick Start

  1. Ask Claude Code to generate environments:

    "Generate 5 RPG character building environments"
    
  2. Claude Code will:

    • Load prompt templates
    • Spawn 5 subagents in parallel
    • Each subagent generates a complete environment
    • Report results after ~15 minutes
  3. Review generated environments:

    "Show me what's in data/envs/rpg_001/"
    
  4. Generate more or export:

    "Generate 10 Pokemon team environments"
    "Export all environments to RL training format"
    

Advanced Usage

Custom categories:

"Generate 5 environments for [your category]"

Quality filtering:

"Run pass@100 evaluation on all environments and filter out any with pass rate > 90% or < 5%"

Export formats:

"Export environments to GRPO RL training format"
"Export environments as evaluation benchmarks"
"Export environments as supervised fine-tuning data"

Output Formats

RL Training (GRPO Format)

{"prompt": "Build a Wizard...", "tools": [...], "verifier": "def verify...", "difficulty": 1}
{"prompt": "Build a Wizard...", "tools": [...], "verifier": "def verify...", "difficulty": 2}
...

Benchmark Format

{
  "name": "Gaming Logic Benchmark v1",
  "categories": {"rpg_builds": 50, "deck_building": 25, ...},
  "tasks": [
    {"id": "rpg_001_level_1", "prompt": "...", "tools": [...], "verifier": "..."}
  ]
}

Fine-Tuning Format (OpenAI Messages)

{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "tool_calls": [...]}]}

Quality Filtering

Environments are filtered using pass@K evaluation:

pass@100 = (successful solutions) / 100 attempts

Keep if: 0.01 < pass@100 < 0.95

Reasoning:
- pass@100 = 0%: Impossible or too hard (no learning signal)
- pass@100 = 100%: Too easy (no challenge)
- pass@100 = 5-50%: Perfect for RL (hard but learnable)
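
A sketch of the filtering step, assuming each of the K sampled solutions is scored with the environment's verify() function:

# Illustrative pass@K estimation and keep/drop decision.
def pass_at_k(attempt_outputs: list, verify, k: int = 100) -> float:
    """Fraction of the first k sampled solutions that satisfy every constraint."""
    attempts = attempt_outputs[:k]
    passed = sum(1 for output in attempts if verify(output)["passed"])
    return passed / len(attempts)

def keep_environment(rate: float) -> bool:
    # Keep if 0.01 < pass@100 < 0.95, per the thresholds above.
    return 0.01 < rate < 0.95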

Data Provenance

All environments include citations to source data:

{
  "data_sources": [
    "https://www.dndbeyond.com/sources/basic-rules",
    "https://roll20.net/compendium/dnd5e/Spells",
    "https://5e.tools/"
  ],
  "data_source_description": "Classes from D&D Basic Rules. Spells from Roll20. Equipment from 5e.tools."
}

This ensures:

  • Data is verifiable and accurate
  • Environments are grounded in real game systems
  • Legal compliance with open game licenses

Performance

Generation Speed:

  • 1 environment: ~10-15 minutes (single subagent)
  • 5 environments: ~15 minutes (5 parallel subagents)
  • 10 environments: ~20 minutes (batched execution)
  • 100 environments: ~2-3 hours (10 subagents × 10 batches)

Cost:

  • Synthesis: $0 (Claude Code subagents are free)
  • Quality filtering: $0 (use subagents for evaluation)
  • Total: $0 for unlimited environments

Scalability:

  • Local: 5-10 parallel subagents (limited by machine resources)
  • Future (Daytona): 50+ parallel instances for massive scale

Comparison to DeepSeek

Aspect            DeepSeek Method            Our Implementation
Agent             API calls (Claude Opus)    Claude Code subagents
Cost              ~$165 for 1000 envs        $0 (free subagents)
Data Source       LLM-synthesized            Real internet data
Parallelization   API rate limited           Unlimited subagents
Interactivity     Batch processing           Interactive with Claude Code

Roadmap

Phase 1: Core Generation (Complete)

  • ✅ Prompt templates with 6-step workflow
  • ✅ Category hints (RPG, TCG, Pokemon, Chess)
  • ✅ Subagent-based parallel generation
  • ✅ Real data collection via WebSearch/WebFetch

Phase 2: Utilities & Automation (TODO)

  • ⬜ File manager utilities
  • ⬜ Validation helpers
  • ⬜ Automated orchestrator script
  • ⬜ Configuration system

Phase 3: Export & Quality (TODO)

  • ⬜ RL training format exporters
  • ⬜ Benchmark format exporters
  • ⬜ Fine-tuning format exporters
  • ⬜ Pass@K quality filtering

Phase 4: Scale & Expand (TODO)

  • ⬜ Additional categories (from tasks.md)
  • ⬜ Daytona integration for massive parallelization
  • ⬜ Dataset versioning and management
  • ⬜ Evaluation benchmarks

Examples

Example 1: RPG Environment

Task (Level 3):

"Build a Level 5 Wizard specialized in Evocation school. Must have:
- Intelligence >= 18
- Fireball spell
- Equipment cost < 1000 gold
- At least 3 Evocation spells"

Why it's good for RL:

  • Hard to solve: Large search space (25 spells × 20 equipment items)
  • Easy to verify: Simple constraint checks
  • Progressive difficulty: Level 1 (easy) → Level 5 (hard)

Example 2: Pokemon Team

Task (Level 4):

"Build a 6-Pokemon OU tier team with:
- Balanced type coverage (no double weaknesses)
- 1 physical sweeper + 1 special wall + 1 support
- Synergistic abilities (e.g., Rain Dance + Swift Swim)
- Average speed > 95"

Why it's good for RL:

  • Hard to solve: Combinatorial optimization (30 Pokemon × 6 slots)
  • Easy to verify: Type chart + stat calculations
  • Real-world grounding: Actual competitive Pokemon data

Example 3: PC Part Picker

Task (Level 3):

"Build an all-white gaming PC under $1500 with:
- Intel i5-14600K (overclockable)
- RTX 4070 GPU (304mm length)
- 32GB DDR5 RAM
- 1TB NVMe SSD
- Case must fit GPU (check length constraint)
- Since CPU is K-series, must use Z790 chipset for overclocking
- ALL visible components must be white color"

Why it's good for RL:

  • Hard to solve: Multi-dimensional constraints (compatibility, budget, physical fit, aesthetics)
  • Easy to verify: Socket matching, dimension checks, price sum, color matching (see the sketch below)
  • Real-world grounding: Actual PC components with real specs and prices
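
A hedged sketch of those checks; the field names (price, socket, length_mm, color) are illustrative and not the generated environment's actual schema:

# Hypothetical constraint checks for the Level 3 build above (illustrative field names).
def verify_build(build: dict) -> list:
    parts = build["parts"]  # e.g. {"cpu": {...}, "gpu": {...}, "case": {...}, ...}
    violations = []
    if sum(p["price"] for p in parts.values()) > 1500:
        violations.append("Total price must be under $1500")
    if parts["cpu"]["socket"] != parts["motherboard"]["socket"]:
        violations.append("CPU socket must match the motherboard")
    if parts["gpu"]["length_mm"] > parts["case"]["max_gpu_length_mm"]:
        violations.append("GPU is too long for the case")
    if any(p["color"] != "white" for p in parts.values()):
        violations.append("All visible components must be white")
    return violations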

Contributing

Want to add new categories? Create a new file in src/prompts/category_hints/:

# Your Category - Category Hints

## Database Requirements
[Describe tables needed]

## Suggested Constraints
[Describe difficulty progression]

## Example Task Progression
[Show 5 levels of tasks]

## Tool Suggestions
[List essential query functions]

Then generate environments:

"Generate 5 environments for [your category]"

License

MIT License - See LICENSE file for details

Citation

If you use this project in research, please cite:

@software{auto_rl_envs_2025,
  title={Auto RL Environments: Automated Generation of RL Training Environments},
  author={Your Name},
  year={2025},
  url={https://github.com/yourusername/auto-rlenvs}
}

Running Benchmarks

Available Environments

We have 4 complete Harbor environments ready for benchmarking:

  1. pc_part_picker_001 - Computer hardware builds (50 tasks, difficulty 2-6/5)
  2. garden_design_001 - Landscape design with plants (50 tasks, difficulty 1-5/5)
  3. pharmaceutical_001 - Drug regimen planning (50 tasks, difficulty 1-5/5)
  4. chemical_synthesis_001 - Multi-step synthesis routes (50 tasks, difficulty 1-6/5)

Unified Benchmark Scripts

Run benchmarks on any environment with a single command:

# List all available environments
./scripts/list_environments.sh

# Gemini 3 Pro Preview
./scripts/run_benchmark_gemini.sh ENV_NAME MODE

# Claude Opus 4.5 (via Bedrock)
./scripts/run_benchmark_opus.sh ENV_NAME MODE

Setup

For Gemini:

export GEMINI_API_KEY=<your-key>
export DAYTONA_API_KEY=<your-key>

For Claude Opus (Bedrock):

export AWS_ACCESS_KEY_ID=<your-key>
export AWS_SECRET_ACCESS_KEY=<your-secret>
export AWS_REGION=us-east-1  # or your preferred region
export DAYTONA_API_KEY=<your-key>

Usage Examples

PC Part Picker (harbor-tasks structure):

# Quick benchmark (10 representative tasks)
./scripts/run_benchmark_gemini.sh pc_part_picker_001 quick

# Full benchmark (all 50 tasks)
./scripts/run_benchmark_opus.sh pc_part_picker_001 full

# Budget builds only (tasks 1-10)
./scripts/run_benchmark_gemini.sh pc_part_picker_001 budget

# Single task
./scripts/run_benchmark_opus.sh pc_part_picker_001 single 27

Garden Design (harbor-tasks structure):

# Quick benchmark
./scripts/run_benchmark_gemini.sh garden_design_001 quick

# All 50 landscape design tasks
./scripts/run_benchmark_opus.sh garden_design_001 full

Pharmaceutical (job_examples structure):

# Quick benchmark (easy cases)
./scripts/run_benchmark_gemini.sh pharmaceutical_001 quick

# Medium challenge (moderate complexity)
./scripts/run_benchmark_opus.sh pharmaceutical_001 medium

# Hard challenge (expert cases)
./scripts/run_benchmark_gemini.sh pharmaceutical_001 hard

# All tasks
./scripts/run_benchmark_opus.sh pharmaceutical_001 full

Chemical Synthesis (job_examples structure):

# Quick test (simple reactions)
./scripts/run_benchmark_gemini.sh chemical_synthesis_001 quick

# Medium benchmark
./scripts/run_benchmark_opus.sh chemical_synthesis_001 medium

# Hard tasks (expert synthesis)
./scripts/run_benchmark_gemini.sh chemical_synthesis_001 hard

# Green chemistry challenge (specific config)
./scripts/run_benchmark_opus.sh chemical_synthesis_001 green_chemistry_challenge.yaml

Modes

For harbor-tasks environments (pc_part_picker, garden_design):

  • quick - 10 representative tasks (~5-10 min)
  • full - All 50 tasks (~25-50 min)
  • budget - First 10 tasks only
  • single N - Run specific task number

For job_examples environments (pharmaceutical, chemical_synthesis):

  • quick - Quick test/benchmark
  • medium - Medium difficulty challenge
  • hard - Hard tasks
  • full - All tasks
  • Or specify exact YAML filename

Performance

All benchmarks run on Daytona cloud sandboxes with 10 concurrent workers for parallel execution.

Expected completion times:

  • Quick (10 tasks): ~5-15 minutes
  • Medium (20 tasks): ~10-25 minutes
  • Full (50 tasks): ~25-60 minutes

Results are saved to data/envs/ENV_NAME/jobs/

For detailed guides, see environment-specific documentation:

  • data/envs/pc_part_picker_001/BENCHMARK_GUIDE.md
  • data/envs/garden_design_001/ENVIRONMENT_GUIDE.md
  • data/envs/pharmaceutical_001/BENCHMARK_GUIDE.md
  • data/envs/chemical_synthesis_001/BENCHMARK_GUIDE.md

Quick Reference

# List environments
./scripts/list_environments.sh

# Run benchmarks (Gemini)
./scripts/run_benchmark_gemini.sh pc_part_picker_001 quick
./scripts/run_benchmark_gemini.sh garden_design_001 full
./scripts/run_benchmark_gemini.sh pharmaceutical_001 medium
./scripts/run_benchmark_gemini.sh chemical_synthesis_001 hard

# Run benchmarks (Opus)
./scripts/run_benchmark_opus.sh pc_part_picker_001 full
./scripts/run_benchmark_opus.sh garden_design_001 quick
./scripts/run_benchmark_opus.sh pharmaceutical_001 quick
./scripts/run_benchmark_opus.sh chemical_synthesis_001 medium

# Generate new environments
"Generate 5 RPG environments"
"Generate 10 Pokemon environments"
"Generate 5 PC Part Picker environments"
"Generate 10 Chess problem environments"

# Inspect
"Show me rpg_001"
"Show me pc_part_picker_001"
"List all generated environments"

# Export
"Export to RL training format"
"Export to benchmark format"
"Export to fine-tuning format"

Status: ✅ Core synthesis working | ⏳ Utilities in progress | 📋 Export planned

Getting Started: Just ask Claude Code to "Generate 5 RPG environments" and watch the magic happen!
