diff --git a/README.md b/README.md index dd00115f1..a20d1665a 100644 --- a/README.md +++ b/README.md @@ -288,12 +288,72 @@ prompt: num_top_programs: 3 # Performance examples num_diverse_programs: 2 # Creative inspiration include_artifacts: true # Execution feedback + + # Template customization + template_dir: null # Directory for custom prompt templates + use_template_stochasticity: true # Enable random variations in prompts + template_variations: {} # Define variation placeholders ``` Sample configuration files are available in the `configs/` directory: - `default_config.yaml`: Comprehensive configuration with all available options - `island_config_example.yaml`: Advanced island-based evolution setup +### Template Customization + +OpenEvolve supports advanced prompt template customization to increase diversity in code evolution: + +#### Custom Templates with `template_dir` + +You can override the default prompt templates by providing custom ones: + +```yaml +prompt: + template_dir: "path/to/your/templates" +``` + +Create `.txt` files in your template directory with these names: +- `diff_user.txt` - Template for diff-based evolution +- `full_rewrite_user.txt` - Template for full code rewrites +- `evolution_history.txt` - Format for presenting evolution history +- `top_program.txt` - Format for top-performing programs +- `previous_attempt.txt` - Format for previous attempts + +See these directories for complete examples of custom templates: +- `examples/lm_eval/prompts/` - Custom templates for evaluation tasks +- `examples/llm_prompt_optimization/templates/` - Templates for evolving prompts instead of code + +#### Template Variations with Stochasticity + +To add randomness to your prompts and prevent getting stuck in local optima: + +1. **Enable stochasticity** in your config: +```yaml +prompt: + use_template_stochasticity: true + template_variations: + greeting: + - "Let's improve this code." + - "Time to enhance this program." + - "Here's how we can optimize:" + analysis_intro: + - "Current metrics show" + - "Performance analysis indicates" + - "The evaluation reveals" +``` + +2. **Use variation placeholders** in your custom templates: +``` +# custom_template.txt +{greeting} +{analysis_intro} the following results: +{metrics} +``` + +The system will randomly select one variation for each placeholder during prompt generation, creating diverse prompts that can lead to more creative code evolutions. + +**Note**: The default templates don't include variation placeholders, so you'll need to create custom templates to use this feature effectively. + ### Feature Dimensions in MAP-Elites Feature dimensions control how programs are organized in the MAP-Elites quality-diversity grid: @@ -425,8 +485,12 @@ Demonstrates integration with [optillm](https://github.com/codelion/optillm) for - **Mixture of Agents (MoA)**: Multi-response synthesis for improved accuracy - **Local model optimization**: Enhanced reasoning with smaller models -#### [LLM Prompt Optimization](examples/llm_prompt_optimazation/) -Evolving prompts themselves for better LLM performance, demonstrating self-improving AI systems. +#### [LLM Prompt Optimization](examples/llm_prompt_optimization/) +Evolving prompts for better LLM performance on HuggingFace datasets. 
Features: +- Custom templates for evolving prompts instead of code +- Two-stage cascading evaluation for efficiency +- Support for any HuggingFace dataset +- Automatic prompt improvement through evolution ### Systems & Performance Optimization diff --git a/examples/llm_prompt_optimazation/README.md b/examples/llm_prompt_optimazation/README.md deleted file mode 100644 index c207a0084..000000000 --- a/examples/llm_prompt_optimazation/README.md +++ /dev/null @@ -1,184 +0,0 @@ -# Evolving Better Prompts with OpenEvolve 🧠✨ - -This example shows how to use **OpenEvolve** to automatically optimize prompts for **Large Language Models (LLMs)**. Whether you're working on classification, summarization, generation, or code tasks, OpenEvolve helps you find high-performing prompts using **evolutionary search**. For this example we'll use syntihetic data for sentiment analysis task, but you can adapt it to your own datasets and tasks. - ---- - -## 🎯 What Is Prompt Optimization? - -Prompt engineering is key to getting reliable outputs from LLMs—but finding the right prompt manually can be slow and inconsistent. - -OpenEvolve automates this by: - -* Generating and evolving prompt variations -* Testing them against your task and metrics -* Selecting the best prompts through generations - -You start with a simple prompt and let OpenEvolve evolve it into something smarter and more effective. - ---- - -## 🚀 Getting Started - -### 1. Install Dependencies - -```bash -cd examples/llm_prompt_optimazation -pip install -r requirements.txt -sh run.sh -``` - -### 2. Add Your models - -1. Update your `config.yaml`: - -```yaml -llm: - primary_model: "llm_name" - api_base: "llm_server_url" - api_key: "your_api_key_here" -``` - -2. Update your task-model in `evaluator.py`: - -```python -TASK_MODEL_NAME = "task_llm_name" -TASK_MODEL_URL = "task_llm_server_url" -TASK_MODEL_API_KEY = "your_api_key_here" -SAMPLE_SIZE = 25 # Number of samples to use for evaluation -MAX_RETRIES = 3 # Number of retries for LLM calls - -``` - -### 3. Run OpenEvolve - -```bash -sh run.sh -``` - ---- - -## 🔧 How to Adapt This Template - -### 1. Replace the Dataset - -Edit `data.json` to match your use case: - -```json -[ - { - "id": 1, - "input": "Your input here", - "expected_output": "Target output" - } -] -``` - -### 2. Customize the Evaluator - -In `evaluator.py`, define how to evaluate a prompt: - -* Load your data -* Call the LLM using the prompt -* Measure output quality (accuracy, score, etc.) - -### 3. Write Your Initial Prompt - -Create a basic starting prompt in `initial_prompt.txt`: - -``` -# EVOLVE-BLOCK-START -Your task prompt using {input_text} as a placeholder. -# EVOLVE-BLOCK-END -``` - -This is the part OpenEvolve will improve over time. -Good to add the name of your task in 'initial_prompt.txt' header to help the model understand the context. - ---- - -## ⚙️ Key Config Options (`config.yaml`) - -```yaml -llm: - primary_model: "gpt-4o" # or your preferred model - secondary_model: "gpt-3.5" # optional for diversity - temperature: 0.9 - max_tokens: 2048 - -database: - population_size: 40 - max_iterations: 15 - elite_selection_ratio: 0.25 - -evaluator: - timeout: 45 - parallel_evaluations: 3 - use_llm_feedback: true -``` - ---- - -## 📈 Example Output - -OpenEvolve evolves prompts like this: - -**Initial Prompt:** - -``` -Please analyze the sentiment of the following sentence and provide a sentiment score: - -"{input_text}" - -Rate the sentiment on a scale from 0.0 to 10.0. 
- -Score: -``` - -**Evolved Prompt:** - -``` -Please analyze the sentiment of the following sentence and provide a sentiment score using the following guidelines: -- 0.0-2.9: Strongly negative sentiment (e.g., expresses anger, sadness, or despair) -- 3.0-6.9: Neutral or mixed sentiment (e.g., factual statements, ambiguous content) -- 7.0-10.0: Strongly positive sentiment (e.g., expresses joy, satisfaction, or hope) - -"{input_text}" - -Rate the sentiment on a scale from 0.0 to 10.0: -- 0.0-2.9: Strongly negative (e.g., "This product is terrible") -- 3.0-6.9: Neutral/mixed (e.g., "The sky is blue today") -- 7.0-10.0: Strongly positive (e.g., "This is amazing!") - -Provide only the numeric score (e.g., "8.5") without any additional text: - -Score: -``` - -**Result**: Improved accuracy and output consistency. - ---- - -## 🔍 Where to Use This - -OpenEvolve could be addapted on many tasks: - -* **Text Classification**: Spam detection, intent recognition -* **Content Generation**: Social media posts, product descriptions -* **Question Answering & Summarization** -* **Code Tasks**: Review, generation, completion -* **Structured Output**: JSON, table filling, data extraction - ---- - -## ✅ Best Practices - -* Start with a basic but relevant prompt -* Use good-quality data and clear evaluation metrics -* Run multiple evolutions for better results -* Validate on held-out data before deployment - ---- - -**Ready to discover better prompts?** -Use this template to evolve prompts for any LLM task—automatically. diff --git a/examples/llm_prompt_optimazation/best_program.txt b/examples/llm_prompt_optimazation/best_program.txt deleted file mode 100644 index 601c29da2..000000000 --- a/examples/llm_prompt_optimazation/best_program.txt +++ /dev/null @@ -1,19 +0,0 @@ -"""Sentiment analysis prompt example for OpenEvolve""" - -# EVOLVE-BLOCK-START -Please analyze the sentiment of the following sentence and provide a sentiment score using the following guidelines: -- 0.0-2.9: Strongly negative sentiment (e.g., expresses anger, sadness, or despair) -- 3.0-6.9: Neutral or mixed sentiment (e.g., factual statements, ambiguous content) -- 7.0-10.0: Strongly positive sentiment (e.g., expresses joy, satisfaction, or hope) - -"{input_text}" - -Rate the sentiment on a scale from 0.0 to 10.0: -- 0.0-2.9: Strongly negative (e.g., "This product is terrible") -- 3.0-6.9: Neutral/mixed (e.g., "The sky is blue today") -- 7.0-10.0: Strongly positive (e.g., "This is amazing!") - -Provide only the numeric score (e.g., "8.5") without any additional text: - -Score: -# EVOLVE-BLOCK-END diff --git a/examples/llm_prompt_optimazation/config.yaml b/examples/llm_prompt_optimazation/config.yaml deleted file mode 100644 index 57483c1aa..000000000 --- a/examples/llm_prompt_optimazation/config.yaml +++ /dev/null @@ -1,58 +0,0 @@ -# Configuration for prompt optimization -max_iterations: 30 -checkpoint_interval: 10 -log_level: "INFO" - -# LLM configuration -llm: - primary_model: "qwen3-32b-fp8" - api_base: "http://localhost:1234/v1" - api_key: "your_api_key_here" - temperature: 0.9 - top_p: 0.95 - max_tokens: 2048 - -# Prompt configuration -prompt: - system_message: | - You are an expert prompt engineer. Your task is to revise an existing prompt designed for large language models (LLMs), without being explicitly told what the task is. - - Your improvements should: - - * Infer the intended task and expected output format based on the structure and language of the original prompt. 
- * Clarify vague instructions, eliminate ambiguity, and improve overall interpretability for the LLM. - * Strengthen alignment between the prompt and the desired task outcome, ensuring more consistent and accurate responses. - * Improve robustness against edge cases or unclear input phrasing. - * If helpful, include formatting instructions, boundary conditions, or illustrative examples that reinforce the LLM's expected behavior. - * Avoid adding unnecessary verbosity or assumptions not grounded in the original prompt. - - You will receive a prompt that uses the following structure: - - ```python - prompt.format(input_text=some_text) - ``` - - The revised prompt should maintain the same input interface but be more effective, reliable, and production-ready for LLM use. - - Return only the improved prompt text. Do not include explanations or additional comments. Your output should be a clean, high-quality replacement that enhances clarity, consistency, and LLM performance. - - num_top_programs: 8 - use_template_stochasticity: true - -# Database configuration -database: - population_size: 40 - archive_size: 20 - num_islands: 3 - elite_selection_ratio: 0.25 - exploitation_ratio: 0.65 - -# Evaluator configuration -evaluator: - timeout: 45 - use_llm_feedback: true - -# Evolution settings -diff_based_evolution: true -allow_full_rewrites: true -diversity_threshold: 0.1 diff --git a/examples/llm_prompt_optimazation/data.json b/examples/llm_prompt_optimazation/data.json deleted file mode 100644 index 9fcdc621e..000000000 --- a/examples/llm_prompt_optimazation/data.json +++ /dev/null @@ -1,510 +0,0 @@ -{ - "book_reviews": [ - { - "id": 1, - "text": "This book was absolutely phenomenal! The writing was masterful and the plot kept me captivated from start to finish.", - "sentiment_score": 9.5 - }, - { - "id": 2, - "text": "I was really disappointed with this novel. The story dragged on and the characters felt flat and uninteresting.", - "sentiment_score": 2.5 - }, - { - "id": 3, - "text": "An incredible literary masterpiece! Brilliant prose and outstanding character development throughout.", - "sentiment_score": 9.8 - }, - { - "id": 4, - "text": "This was one of the worst books I've ever read. Terrible pacing and a completely incoherent storyline.", - "sentiment_score": 0.5 - }, - { - "id": 5, - "text": "A true work of art. Every page was beautifully crafted and emotionally resonant.", - "sentiment_score": 10.0 - }, - { - "id": 6, - "text": "Completely underwhelming. I expected so much more but was left feeling bored and frustrated.", - "sentiment_score": 2.0 - }, - { - "id": 7, - "text": "Incredible storytelling with rich world-building. This book exceeded all my expectations.", - "sentiment_score": 9.2 - }, - { - "id": 8, - "text": "A waste of time and money. Poor writing, bad plot, and overall just a terrible reading experience.", - "sentiment_score": 0.8 - }, - { - "id": 9, - "text": "Outstanding narrative and compelling characters. This book will stay with me for a long time.", - "sentiment_score": 9.0 - }, - { - "id": 10, - "text": "Disappointing and predictable. The book felt like a cheap imitation of much better novels.", - "sentiment_score": 2.8 - }, - { - "id": 11, - "text": "The book was decent. Some chapters were good, others not so much. Overall an average read.", - "sentiment_score": 5.0 - }, - { - "id": 12, - "text": "Not the best novel ever written, but certainly readable. 
Has its moments of brilliance.", - "sentiment_score": 6.5 - }, - { - "id": 13, - "text": "Pretty good book with solid writing and an interesting premise. Worth reading if you have time.", - "sentiment_score": 7.2 - }, - { - "id": 14, - "text": "The book had potential but fell short in execution. Some good ideas but poorly implemented.", - "sentiment_score": 4.0 - }, - { - "id": 15, - "text": "A truly exceptional piece of literature that pushes the boundaries of storytelling. Pure genius!", - "sentiment_score": 10.0 - }, - { - "id": 16, - "text": "Absolutely terrible in every possible way. I want my money and time back. Avoid at all costs.", - "sentiment_score": 0.0 - }, - { - "id": 17, - "text": "Surprisingly good! Exceeded my expectations with clever plot twists and strong character arcs.", - "sentiment_score": 7.8 - }, - { - "id": 18, - "text": "Mediocre at best. Nothing particularly wrong with it, but nothing special either.", - "sentiment_score": 4.5 - }, - { - "id": 19, - "text": "A delightful surprise! Charming prose and a heartwarming story that left me smiling.", - "sentiment_score": 8.5 - }, - { - "id": 20, - "text": "Painfully slow and pretentious. The author seemed more interested in showing off than telling a story.", - "sentiment_score": 1.2 - }, - { - "id": 21, - "text": "An engaging thriller that kept me on the edge of my seat. Well-crafted suspense and believable characters.", - "sentiment_score": 8.3 - }, - { - "id": 22, - "text": "The romance was sweet but the plot was lacking. Some beautiful moments but overall forgettable.", - "sentiment_score": 5.5 - }, - { - "id": 23, - "text": "Brilliant science fiction with thought-provoking themes. The author's imagination is truly remarkable.", - "sentiment_score": 9.1 - }, - { - "id": 24, - "text": "Confusing and poorly structured. I struggled to follow the narrative and lost interest quickly.", - "sentiment_score": 2.3 - }, - { - "id": 25, - "text": "A masterful blend of history and fiction. Thoroughly researched and beautifully written.", - "sentiment_score": 8.9 - }, - { - "id": 26, - "text": "The characters felt one-dimensional and the dialogue was stilted. Not the author's best work.", - "sentiment_score": 3.2 - }, - { - "id": 27, - "text": "Captivating from the first page to the last. A true page-turner with excellent pacing.", - "sentiment_score": 8.7 - }, - { - "id": 28, - "text": "Boring and repetitive. The same themes rehashed over and over without any fresh perspective.", - "sentiment_score": 2.1 - }, - { - "id": 29, - "text": "A profound exploration of human nature. Deep, meaningful, and beautifully executed.", - "sentiment_score": 9.4 - }, - { - "id": 30, - "text": "The plot had too many holes and the ending was unsatisfying. Left me with more questions than answers.", - "sentiment_score": 3.5 - }, - { - "id": 31, - "text": "Solid character development and a compelling mystery. Kept me guessing until the very end.", - "sentiment_score": 7.6 - }, - { - "id": 32, - "text": "The writing style was difficult to follow and the story seemed to go nowhere. A disappointing read.", - "sentiment_score": 2.7 - }, - { - "id": 33, - "text": "Excellent world-building and imaginative storytelling. A fantasy epic that delivers on all fronts.", - "sentiment_score": 8.8 - }, - { - "id": 34, - "text": "The humor fell flat and the characters were annoying rather than endearing. Not my cup of tea.", - "sentiment_score": 3.0 - }, - { - "id": 35, - "text": "A gripping psychological thriller with complex characters and unexpected twists. 
Highly recommended.", - "sentiment_score": 8.4 - }, - { - "id": 36, - "text": "The book was okay but nothing groundbreaking. Decent enough to finish but not memorable.", - "sentiment_score": 5.2 - }, - { - "id": 37, - "text": "Beautifully written prose that flows like poetry. A literary gem that touched my soul.", - "sentiment_score": 9.3 - }, - { - "id": 38, - "text": "Too much exposition and not enough action. The story moved at a snail's pace throughout.", - "sentiment_score": 3.8 - }, - { - "id": 39, - "text": "An inspiring tale of resilience and hope. The characters' journeys were both realistic and uplifting.", - "sentiment_score": 8.1 - }, - { - "id": 40, - "text": "Clichéd and predictable. I saw every plot twist coming from miles away. Very disappointing.", - "sentiment_score": 2.4 - }, - { - "id": 41, - "text": "A thought-provoking exploration of social issues wrapped in an entertaining narrative.", - "sentiment_score": 7.9 - }, - { - "id": 42, - "text": "The book started strong but lost momentum halfway through. The ending felt rushed and unsatisfying.", - "sentiment_score": 4.3 - }, - { - "id": 43, - "text": "Exceptional character depth and emotional resonance. A story that will haunt you long after reading.", - "sentiment_score": 9.6 - }, - { - "id": 44, - "text": "Poorly edited with numerous grammatical errors. The story couldn't overcome the technical flaws.", - "sentiment_score": 1.8 - }, - { - "id": 45, - "text": "A delightful coming-of-age story with authentic characters and relatable struggles.", - "sentiment_score": 7.4 - }, - { - "id": 46, - "text": "The premise was interesting but the execution was lacking. Felt like a missed opportunity.", - "sentiment_score": 4.1 - }, - { - "id": 47, - "text": "Absolutely riveting! Could not put it down once I started. A masterclass in suspenseful storytelling.", - "sentiment_score": 9.7 - }, - { - "id": 48, - "text": "Overly complicated and pretentious. The author tried too hard to be clever and it backfired.", - "sentiment_score": 2.2 - }, - { - "id": 49, - "text": "A heartwarming family saga with memorable characters and beautiful storytelling.", - "sentiment_score": 8.2 - }, - { - "id": 50, - "text": "The dialogue was unrealistic and the plot was full of convenient coincidences. Hard to believe.", - "sentiment_score": 3.3 - }, - { - "id": 51, - "text": "An ambitious epic that mostly succeeds in its grand vision. Some pacing issues but overall impressive.", - "sentiment_score": 7.7 - }, - { - "id": 52, - "text": "Dull and lifeless. The characters had no personality and the story lacked any real conflict.", - "sentiment_score": 2.6 - }, - { - "id": 53, - "text": "A beautiful meditation on love, loss, and redemption. Emotionally powerful and deeply moving.", - "sentiment_score": 8.9 - }, - { - "id": 54, - "text": "The book felt incomplete, like the author ran out of ideas halfway through. Very unsatisfying.", - "sentiment_score": 3.4 - }, - { - "id": 55, - "text": "Clever and witty with sharp social commentary. An entertaining read that also makes you think.", - "sentiment_score": 7.8 - }, - { - "id": 56, - "text": "Repetitive and boring. The same points made over and over without adding anything new.", - "sentiment_score": 2.9 - }, - { - "id": 57, - "text": "A stunning work of historical fiction that brings the past to life with vivid detail.", - "sentiment_score": 8.6 - }, - { - "id": 58, - "text": "The mystery was easy to solve and the red herrings were obvious. 
Not very engaging.", - "sentiment_score": 3.7 - }, - { - "id": 59, - "text": "Outstanding world-building and character development. A fantasy series starter that promises great things.", - "sentiment_score": 8.3 - }, - { - "id": 60, - "text": "Too many subplots that went nowhere. The main story got lost in all the unnecessary complexity.", - "sentiment_score": 3.6 - }, - { - "id": 61, - "text": "A perfectly crafted thriller with tight pacing and genuine surprises. Everything a good book should be.", - "sentiment_score": 9.0 - }, - { - "id": 62, - "text": "The writing was awkward and the story felt forced. Could have used more time in development.", - "sentiment_score": 2.8 - }, - { - "id": 63, - "text": "An enchanting tale that captures the magic of childhood while addressing serious themes.", - "sentiment_score": 7.9 - }, - { - "id": 64, - "text": "The book was reasonably entertaining but nothing I hadn't seen before. Average in every way.", - "sentiment_score": 5.0 - }, - { - "id": 65, - "text": "Brilliant use of multiple perspectives to tell a complex story. Masterfully woven narrative threads.", - "sentiment_score": 9.2 - }, - { - "id": 66, - "text": "The pacing was all wrong - too slow in places, too rushed in others. Needed better editing.", - "sentiment_score": 3.9 - }, - { - "id": 67, - "text": "A touching story of friendship and loyalty that resonated deeply with me. Highly recommended.", - "sentiment_score": 8.0 - }, - { - "id": 68, - "text": "Confusing timeline and unclear motivations made this a frustrating read. Lost potential.", - "sentiment_score": 3.1 - }, - { - "id": 69, - "text": "Exceptional prose and a story that stays with you. A modern classic in the making.", - "sentiment_score": 9.5 - }, - { - "id": 70, - "text": "The book tried to do too much and ended up accomplishing very little. Unfocused and scattered.", - "sentiment_score": 2.5 - }, - { - "id": 71, - "text": "A solid mystery with well-developed characters and a satisfying resolution. Good entertainment.", - "sentiment_score": 7.3 - }, - { - "id": 72, - "text": "Derivative and unoriginal. Felt like I'd read this exact story multiple times before.", - "sentiment_score": 2.0 - }, - { - "id": 73, - "text": "Beautiful, lyrical writing that creates an immersive reading experience. A true work of art.", - "sentiment_score": 8.8 - }, - { - "id": 74, - "text": "The book was readable but forgettable. Nothing particularly good or bad about it.", - "sentiment_score": 5.1 - }, - { - "id": 75, - "text": "An epic adventure with memorable characters and breathtaking scope. Fantasy at its finest.", - "sentiment_score": 9.1 - }, - { - "id": 76, - "text": "Poor character development and a weak plot made this a chore to finish. Very disappointing.", - "sentiment_score": 1.9 - }, - { - "id": 77, - "text": "A compelling drama with realistic characters facing believable challenges. Well worth reading.", - "sentiment_score": 7.6 - }, - { - "id": 78, - "text": "The book meandered without purpose and the ending came out of nowhere. Poorly structured.", - "sentiment_score": 3.2 - }, - { - "id": 79, - "text": "Absolutely captivating! A page-turner that combines great writing with an irresistible plot.", - "sentiment_score": 8.7 - }, - { - "id": 80, - "text": "Too many clichés and stereotypes. 
The author relied on tired tropes instead of original ideas.", - "sentiment_score": 2.3 - }, - { - "id": 81, - "text": "A thoughtful exploration of complex themes with nuanced characters and elegant prose.", - "sentiment_score": 8.4 - }, - { - "id": 82, - "text": "The story had potential but was ruined by poor execution and sloppy writing. What a waste.", - "sentiment_score": 2.7 - }, - { - "id": 83, - "text": "An outstanding debut novel that announces the arrival of a major new talent. Brilliant work.", - "sentiment_score": 9.3 - }, - { - "id": 84, - "text": "Bland and uninspiring. The characters were flat and the story lacked any real emotion.", - "sentiment_score": 2.1 - }, - { - "id": 85, - "text": "A gripping tale of survival and redemption that kept me reading late into the night.", - "sentiment_score": 8.1 - }, - { - "id": 86, - "text": "The book was okay for what it was, but it didn't really grab me. Decent but unremarkable.", - "sentiment_score": 4.8 - }, - { - "id": 87, - "text": "Masterful storytelling with rich imagery and profound insights into the human condition.", - "sentiment_score": 9.4 - }, - { - "id": 88, - "text": "Choppy writing and an incoherent plot made this difficult to follow and even harder to enjoy.", - "sentiment_score": 1.7 - }, - { - "id": 89, - "text": "A delightful romantic comedy with sparkling dialogue and charming characters. Pure enjoyment.", - "sentiment_score": 7.8 - }, - { - "id": 90, - "text": "The book started promisingly but quickly devolved into nonsense. Very disappointing conclusion.", - "sentiment_score": 3.0 - }, - { - "id": 91, - "text": "An intelligent and well-researched novel that educates as much as it entertains. Excellent work.", - "sentiment_score": 8.2 - }, - { - "id": 92, - "text": "Boring and predictable with cardboard characters and a paint-by-numbers plot. Skip this one.", - "sentiment_score": 1.4 - }, - { - "id": 93, - "text": "A powerful and moving story that tackles difficult subjects with sensitivity and grace.", - "sentiment_score": 8.9 - }, - { - "id": 94, - "text": "The author clearly didn't know how to end the story. The conclusion was abrupt and unsatisfying.", - "sentiment_score": 3.5 - }, - { - "id": 95, - "text": "Extraordinary! A once-in-a-generation masterpiece that redefines what literature can achieve.", - "sentiment_score": 10.0 - }, - { - "id": 96, - "text": "Terrible pacing and wooden dialogue made this one of the worst books I've read this year.", - "sentiment_score": 0.9 - }, - { - "id": 97, - "text": "A satisfying read with good character arcs and a well-constructed plot. Solid entertainment.", - "sentiment_score": 7.1 - }, - { - "id": 98, - "text": "The book felt like a rough draft that was published too early. Needed much more work.", - "sentiment_score": 2.4 - }, - { - "id": 99, - "text": "Brilliant, innovative, and utterly engaging. A book that changes how you think about storytelling.", - "sentiment_score": 9.8 - }, - { - "id": 100, - "text": "Completely unreadable. 
Poor grammar, worse plotting, and characters with no redeeming qualities.", - "sentiment_score": 0.2 - } - ], - "metadata": { - "description": "Synthesised book review sentiment analysis dataset", - "total_reviews": 100, - "sentiment_scale": "0.0 (extremely negative) to 10.0 (extremely positive)", - "created": "2025-07-01" - } -} \ No newline at end of file diff --git a/examples/llm_prompt_optimazation/evaluator.py b/examples/llm_prompt_optimazation/evaluator.py deleted file mode 100644 index 6a816f15b..000000000 --- a/examples/llm_prompt_optimazation/evaluator.py +++ /dev/null @@ -1,196 +0,0 @@ -""" -Evaluator for the prompt optimization task. -""" - -import re -import traceback -import json -import os -import time -from openai import OpenAI -from tqdm import tqdm - -TASK_MODEL_NAME = "meta-llama-3.1-8b-instruct@q8_0" -TASK_MODEL_URL = "http://localhost:1234/v1" -TASK_MODEL_API_KEY = "your_api_key_here" -SAMPLE_SIZE = 25 # Number of samples to use for evaluation -MAX_RETRIES = 3 # Number of retries for LLM calls - - -def load_dataset(data_file_path): - """ - Load the book review dataset from JSON file. - - Args: - data_file_path: Path to the JSON data file - - Returns: - List of review dictionaries with 'text' and 'label' keys - """ - try: - with open(data_file_path, 'r', encoding='utf-8') as f: - data = json.load(f) - - # Convert the data structure to match the expected format - reviews = [] - for review in data.get('book_reviews', []): - reviews.append({ - 'text': review['text'], - 'label': review['sentiment_score'] - }) - - print(f"Successfully loaded {len(reviews)} book reviews from dataset") - return reviews - - except Exception as e: - print(f"Error loading dataset from {data_file_path}: {e}") - traceback.print_exc() - return [] - -# Load dataset from JSON file -data_file_path = os.path.join(os.path.dirname(__file__), "data.json") -ds = load_dataset(data_file_path) - -if not ds: - raise ValueError("Failed to load dataset or dataset is empty") - -def evaluate(prompt_path): - """ - Evaluate the program by run the LLM model on a benchmarck dataset. 
- - Args: - program_path: Path to the program file - - Returns: - Dictionary of metrics - """ - print('-' * 80) - print("Starting evaluation...") - print('-' * 80) - try: - # Initialize OpenAI test_model with error handling - try: - test_model = OpenAI( - base_url=TASK_MODEL_URL, - api_key=TASK_MODEL_API_KEY - ) - print(f"Initialized OpenAI test_model with model: {TASK_MODEL_NAME}") - except Exception as e: - print(f"Error initializing OpenAI test_model: {e}") - test_model = None - - # Use a subset for faster evaluation during evolution (can be configured) - eval_sample_size = min(SAMPLE_SIZE, len(ds)) - ds_sample = ds[:eval_sample_size] - print(f"Using {len(ds_sample)} samples from {len(ds)} total reviews for evaluation") - - # load the prompt from the file - with open(prompt_path, "r") as f: - prompt = f.read() - - # extract the prompt between the markers - prompt_match = re.search(r"EVOLVE-BLOCK-START(.*)EVOLVE-BLOCK-END", prompt, re.DOTALL) - if prompt_match: - prompt = prompt_match.group(1).strip() - else: - raise ValueError("No EVOLVE-BLOCK found in the prompt file") - - total_score = 0.0 - total_examples = 0 - individual_scores = [] - - print(f"Evaluating with prompt:\n{prompt}\n") - for example in tqdm(ds_sample, desc="Evaluating examples", unit="example"): - total_examples += 1 - input_text = example["text"] - expected_score = example["label"] - - # Prepare the message for the LLM - messages = [ - {"role": "user", "content": prompt.format(input_text=input_text)} - ] - - # Call the LLM with retry logic - max_retries = MAX_RETRIES - for attempt in range(max_retries): - try: - response = test_model.chat.completions.create( - model=TASK_MODEL_NAME, - messages=messages - ) - break - except Exception as e: - if attempt == max_retries - 1: - print(f"Failed to get response after {max_retries} attempts: {e}") - raise e - time.sleep(1) # Brief pause before retry - - output_text = response.choices[0].message.content.strip() - - # Extract numerical score from the response - try: - # Try to extract a number between 0 and 10 - score_match = re.search(r'(\d+(?:\.\d+)?)', output_text) - if score_match: - predicted_score = float(score_match.group(1)) - - # Ensure score is within valid range (0-10) - predicted_score = max(0.0, min(10.0, predicted_score)) - else: - predicted_score = 5.0 # Default to neutral - - # Calculate accuracy based on how close the prediction is to the expected score - # Using 1 - (absolute difference / 10), so perfect match = 1.0, worst case = 0.0 - accuracy = 1.0 - (abs(predicted_score - expected_score) / 10.0) - individual_scores.append(accuracy) - total_score += accuracy - - except Exception as e: - print(f"Error processing response '{output_text}': {e}") - individual_scores.append(0.0) # Score 0 for failed predictions - # Calculate comprehensive metrics - average_score = total_score / total_examples if total_examples > 0 else 0.0 - min_score = min(individual_scores) if individual_scores else 0.0 - max_score = max(individual_scores) if individual_scores else 0.0 - - # Calculate additional metrics - std_dev = 0.0 - if len(individual_scores) > 1: - mean = sum(individual_scores) / len(individual_scores) - variance = sum((x - mean) ** 2 for x in individual_scores) / len(individual_scores) - std_dev = variance ** 0.5 - - # Count high-accuracy predictions (>0.8 accuracy) - high_accuracy_count = sum(1 for score in individual_scores if score > 0.8) - high_accuracy_rate = high_accuracy_count / len(individual_scores) if individual_scores else 0.0 - - print(f"Total examples: 
{total_examples}") - print(f"Average accuracy: {average_score:.3f}") - print(f"Standard deviation: {std_dev:.3f}") - print(f"Min accuracy: {min_score:.3f}") - print(f"Max accuracy: {max_score:.3f}") - print(f"High accuracy rate (>0.8): {high_accuracy_rate:.3f}") - print('-' * 80) - return { - "score": average_score, - "total_examples": total_examples, - "individual_scores": individual_scores, - "min_score": min_score, - "max_score": max_score, - "std_dev": std_dev, - "high_accuracy_rate": high_accuracy_rate - } - - except Exception as e: - print(f"Evaluation failed completely: {str(e)}") - traceback.print_exc() - print('-' * 80) - return { - "score": 0.0, - "total_examples": 0, - "individual_scores": [], - "min_score": 0.0, - "max_score": 0.0, - "std_dev": 0.0, - "high_accuracy_rate": 0.0 - } diff --git a/examples/llm_prompt_optimazation/initial_prompt.txt b/examples/llm_prompt_optimazation/initial_prompt.txt deleted file mode 100644 index 6f12bf353..000000000 --- a/examples/llm_prompt_optimazation/initial_prompt.txt +++ /dev/null @@ -1,11 +0,0 @@ -"""Sentiment analysis prompt example for OpenEvolve""" - -# EVOLVE-BLOCK-START -Please analyze the sentiment of the following sentence and provide a sentiment score: - -"{input_text}" - -Rate the sentiment on a scale from 0.0 to 10.0. - -Score: -# EVOLVE-BLOCK-END diff --git a/examples/llm_prompt_optimazation/requirements.txt b/examples/llm_prompt_optimazation/requirements.txt deleted file mode 100644 index 01354db40..000000000 --- a/examples/llm_prompt_optimazation/requirements.txt +++ /dev/null @@ -1,2 +0,0 @@ -openai -tqdm \ No newline at end of file diff --git a/examples/llm_prompt_optimazation/run.sh b/examples/llm_prompt_optimazation/run.sh deleted file mode 100644 index 7226a0b82..000000000 --- a/examples/llm_prompt_optimazation/run.sh +++ /dev/null @@ -1,4 +0,0 @@ - python ../../openevolve-run.py \ - examples/llm_prompt_optimazation/initial_prompt.txt \ - examples/llm_prompt_optimazation/evaluator.py \ - --config examples/llm_prompt_optimazation/config.yaml \ No newline at end of file diff --git a/examples/llm_prompt_optimization/README.md b/examples/llm_prompt_optimization/README.md new file mode 100644 index 000000000..77ff57311 --- /dev/null +++ b/examples/llm_prompt_optimization/README.md @@ -0,0 +1,254 @@ +# LLM Prompt Optimization with OpenEvolve 🚀 + +This example demonstrates how to use OpenEvolve to automatically optimize prompts for Large Language Models. The system uses evolutionary search to discover high-performing prompts by testing them against ground truth data from various datasets. + +## 🎯 Overview + +OpenEvolve automatically: +- Loads datasets from various sources +- Evolves prompts through multiple generations +- Uses cascading evaluation for efficiency +- Finds optimal prompts for your specific task and model + +**Key Feature**: The evaluator automatically matches prompt files with dataset configurations using a naming convention (`xxx_prompt.txt` → `xxx_prompt_dataset.yaml`), making it easy to manage multiple benchmark tasks. + +## 🚀 Quick Start + +### 1. Install Dependencies + +```bash +cd examples/llm_prompt_optimization +pip install -r requirements.txt +``` + +### 2. Configure Your Model + +Update `config.yaml` with your LLM settings: + +```yaml +llm: + api_base: "https://openrouter.ai/api/v1" + api_key: "your_api_key_here" + models: + - name: "google/gemini-2.5-flash" # Or any OpenAI-compatible model + weight: 1.0 +``` + +### 3. 
Set Up Your Dataset and Prompt
+
+This example uses a naming convention to match prompts with their dataset configurations:
+- For a prompt file `xxx_prompt.txt`, create a matching `xxx_prompt_dataset.yaml`
+- For example: `emotion_prompt.txt` uses `emotion_prompt_dataset.yaml`
+
+Create your dataset configuration file (e.g., `emotion_prompt_dataset.yaml`):
+
+```yaml
+# Dataset configuration
+dataset_name: "dair-ai/emotion" # Dataset identifier
+input_field: "text" # Field containing input data
+target_field: "label" # Field containing ground truth
+split: "test" # Dataset split to use
+
+# Evaluation samples
+max_samples: 200 # Number of samples to evaluate
+```
+
+Create your initial prompt file (e.g., `emotion_prompt.txt`):
+
+```
+Classify the emotion expressed in the following text.
+
+Text: "{input_text}"
+
+Emotion (0-5):
+```
+
+### 4. Run OpenEvolve
+
+Use the provided `run_evolution.sh` script to ensure the correct dataset is used:
+
+```bash
+# For emotion classification benchmark
+./run_evolution.sh emotion_prompt.txt --iterations 50
+
+# For IMDB sentiment analysis
+./run_evolution.sh initial_prompt.txt --iterations 50
+
+# With custom iterations and checkpoint
+./run_evolution.sh emotion_prompt.txt --iterations 100 --checkpoint-interval 20
+```
+
+The script automatically:
+- Sets the `OPENEVOLVE_PROMPT` environment variable so the evaluator knows which dataset to use
+- Passes all additional arguments to OpenEvolve
+- Ensures the correct `_dataset.yaml` file is matched with your prompt
+
+**Note**: If you prefer to run OpenEvolve directly, set the environment variable first:
+```bash
+export OPENEVOLVE_PROMPT=emotion_prompt.txt
+python ../../openevolve-run.py emotion_prompt.txt evaluator.py --config config.yaml --iterations 50
+```
+
+## 📊 Supported Datasets
+
+This optimizer works with a wide variety of datasets. Included examples:
+
+- **IMDB Sentiment**: `initial_prompt.txt` + `initial_prompt_dataset.yaml` (binary classification)
+- **Emotion**: `emotion_prompt.txt` + `emotion_prompt_dataset.yaml` (6-class, benchmark against DSPy)
+- **GSM8K**: `gsm8k_prompt.txt` + `gsm8k_prompt_dataset.yaml` (grade school math, DSPy achieves 97.1%)
+
+### Creating New Tasks
+
+To add a new dataset:
+1. Create `yourtask_prompt.txt` with the initial prompt
+2. Create `yourtask_prompt_dataset.yaml` with the dataset configuration
+3. Run: `./run_evolution.sh yourtask_prompt.txt --iterations 50`
+
+**Note**: If you call OpenEvolve directly without the wrapper script, the evaluator will look for a default `dataset_config.yaml` file.
+
+### Common Dataset Configurations
+
+### Sentiment Analysis
+```yaml
+dataset_name: "stanfordnlp/imdb"
+input_field: "text"
+target_field: "label" # 0 or 1
+```
+
+### Question Answering
+```yaml
+dataset_name: "squad"
+input_field: "question"
+target_field: "answers" # Dict with 'text' field
+```
+
+### Text Classification
+```yaml
+dataset_name: "ag_news"
+input_field: "text"
+target_field: "label" # 0-3 for categories
+```
+
+### Summarization
+```yaml
+dataset_name: "xsum"
+input_field: "document"
+target_field: "summary"
+```
+
+## ⚙️ How It Works
+
+### Cascading Evaluation
+
+The evaluator uses a two-stage cascading evaluation (`cascade_evaluation: true` in `config.yaml`):
+
+1. **Load Dataset**: Downloads the specified dataset
+2. **Stage 1**: Tests the prompt on roughly 10% of `max_samples` (at least 10 examples)
+3. **Stage 2**: If Stage 1 clears the cascade threshold, re-evaluates on all `max_samples` examples
+4. **Calculate Accuracy**: Compares LLM outputs to ground truth labels
+
+### Evolution Process
+
+1. OpenEvolve starts with your initial prompt
+2. The LLM generates variations based on performance feedback
+3. Each variant is tested using cascading evaluation
+4. Best performers are kept and evolved further
+5. Process continues for specified iterations
+
+### 🎭 Custom Templates for Prompt Evolution
+
+By default, OpenEvolve is designed for code evolution. To make it work properly for prompt evolution, this example includes custom templates in the `templates/` directory:
+
+- **`full_rewrite_user.txt`**: Replaces the default code evolution template with prompt-specific language
+
+This ensures the LLM understands it should evolve the prompt text itself, not generate code. The configuration automatically uses these templates via:
+
+```yaml
+prompt:
+  template_dir: "templates" # Use custom templates for prompt evolution
+```
+
+## 🎯 Configuration Options
+
+### Evaluation Configuration
+
+In `config.yaml`:
+```yaml
+evaluator:
+  parallel_evaluations: 4 # Run 4 evaluations in parallel
+  cascade_evaluation: true # Two-stage cascading evaluation
+  cascade_thresholds: [0.9] # Stage 1 must score 0.9 to proceed to Stage 2
+```
+
+### Sample Size
+
+Adjust in the matching `*_dataset.yaml` file (e.g., `emotion_prompt_dataset.yaml`):
+```yaml
+max_samples: 50 # Number of samples to evaluate
+```
+
+## 📈 Example Results
+
+Starting prompt:
+```
+Analyze the sentiment: "{input_text}"
+```
+
+Evolved prompt after 100 iterations:
+```
+Analyze the sentiment of the following text. Determine if the overall emotional tone is positive or negative.
+
+Text: "{input_text}"
+
+Response: Provide only a single digit - either 1 for positive sentiment or 0 for negative sentiment. Do not include any explanation or additional text.
+```
+
+Accuracy improvement: 72% → 94%
+
+## 🔧 Advanced Usage
+
+### Custom Evaluation Metrics
+
+The evaluator extracts predictions and compares them to ground truth. For classification tasks, it looks for:
+- Exact number matches (0, 1, etc.)
+- Keywords (positive/negative, yes/no)
+- Custom patterns you define
+
+### Different Task Types
+
+While the default setup is for classification, you can modify the evaluator for:
+- **Regression**: Compare numeric outputs
+- **Generation**: Use BLEU/ROUGE scores
+- **Extraction**: Check if key information is present
+
+## 🐛 Troubleshooting
+
+### Dataset Not Found
+- Check the exact dataset name and source
+- Some datasets require acceptance of terms
+
+### Low Stage 1 Accuracy
+- Your initial prompt may be too vague
+- Check if the output format matches expectations
+- Verify the dataset fields are correct
+
+### API Errors
+- Ensure your API key is valid
+- Check rate limits
+- Verify the model name is correct
+
+## 🚀 Tips for Best Results
+
+1. **Start Simple**: Begin with a clear, working prompt
+2. **Clear Output Format**: Specify exactly what output you expect
+3. **Appropriate Samples**: More samples = better evaluation but slower
+4. **Multiple Runs**: Evolution has randomness; try multiple runs
+5. **Monitor Progress**: Check intermediate best_program.txt files
+
+## 📚 Next Steps
+
+- Try different datasets and benchmarks
+- Experiment with different models
+- Adjust evolution parameters in config.yaml
+- Create task-specific evaluation metrics
+
+Happy prompt evolving! 
🧬✨ \ No newline at end of file diff --git a/examples/llm_prompt_optimization/config.yaml b/examples/llm_prompt_optimization/config.yaml new file mode 100644 index 000000000..da644f77c --- /dev/null +++ b/examples/llm_prompt_optimization/config.yaml @@ -0,0 +1,74 @@ +# Configuration for HuggingFace prompt optimization +# Based on optimized settings from config2.yaml + +# General settings +max_iterations: 50 +checkpoint_interval: 10 +log_level: "INFO" +diff_based_evolution: false # Full rewrite mode (best for prompt optimization) +max_code_length: 10000 +language: "text" # Explicitly set language to text for prompt evolution + +# LLM Configuration +llm: + api_base: "https://generativelanguage.googleapis.com/v1beta/openai/" + models: + - name: "gemini-2.5-flash-lite" # Using Gemini 2.5 Flash Lite + weight: 1.0 + + temperature: 0.4 # Optimal from experiments + max_tokens: 16000 # Optimal context + timeout: 150 + retries: 3 + +# Prompt Configuration - Optimal settings discovered +prompt: + template_dir: "templates" # Use custom templates for prompt evolution + num_top_programs: 3 # Best balance + num_diverse_programs: 2 # Best balance + include_artifacts: true # +20.7% improvement when enabled + + # System message for prompt evolution + system_message: | + You are an expert prompt engineer. Your task is to revise an existing prompt designed for large language models (LLMs), without being explicitly told what the task is. + + Your improvements should: + + * Infer the intended task and expected output format based on the structure and language of the original prompt. + * Clarify vague instructions, eliminate ambiguity, and improve overall interpretability for the LLM. + * Strengthen alignment between the prompt and the desired task outcome, ensuring more consistent and accurate responses. + * Improve robustness against edge cases or unclear input phrasing. + * If helpful, include formatting instructions, boundary conditions, or illustrative examples that reinforce the LLM's expected behavior. + * Avoid adding unnecessary verbosity or assumptions not grounded in the original prompt. + + The revised prompt should maintain the same input interface but be more effective, reliable, and production-ready for LLM use. + + Return only the improved prompt text. Do not include explanations or additional comments. Your output should be a clean, high-quality replacement that enhances clarity, consistency, and LLM performance. 
+ +# Database Configuration +database: + population_size: 1000 + archive_size: 100 + num_islands: 4 + + # Feature dimensions for MAP-Elites + # Using custom features returned by the evaluator + feature_dimensions: ["prompt_length", "reasoning_strategy"] + feature_bins: 10 # 10x10 grid = 100 cells + + # Selection parameters - Optimal ratios from testing + elite_selection_ratio: 0.1 # 10% elite selection + exploration_ratio: 0.3 # 30% exploration + exploitation_ratio: 0.6 # 60% exploitation + + # Migration parameters - Optimal settings + migration_interval: 10 + migration_rate: 0.1 + +# Evaluator Configuration +evaluator: + timeout: 1800 + max_retries: 3 + parallel_evaluations: 4 + cascade_evaluation: true # Two-stage cascading evaluation + cascade_thresholds: [0.9] # Stage 1 must achieve 90% accuracy to proceed to stage 2 \ No newline at end of file diff --git a/examples/llm_prompt_optimization/dataset_config.yaml b/examples/llm_prompt_optimization/dataset_config.yaml new file mode 100644 index 000000000..08ea83cbf --- /dev/null +++ b/examples/llm_prompt_optimization/dataset_config.yaml @@ -0,0 +1,9 @@ +# Default dataset configuration (fallback when not using run_evolution.sh) +# This is used when OpenEvolve is called directly without setting OPENEVOLVE_PROMPT +dataset_name: "stanfordnlp/imdb" +input_field: "text" +target_field: "label" # 0 or 1 +split: "test" + +# Evaluation samples +max_samples: 50 \ No newline at end of file diff --git a/examples/llm_prompt_optimization/emotion_prompt.txt b/examples/llm_prompt_optimization/emotion_prompt.txt new file mode 100644 index 000000000..a947907ac --- /dev/null +++ b/examples/llm_prompt_optimization/emotion_prompt.txt @@ -0,0 +1,11 @@ +Classify the emotion in the following text. Choose exactly one emotion from this list: +- sadness +- joy +- love +- anger +- fear +- surprise + +Text: "{input_text}" + +Emotion (respond with one word only): \ No newline at end of file diff --git a/examples/llm_prompt_optimization/emotion_prompt_dataset.yaml b/examples/llm_prompt_optimization/emotion_prompt_dataset.yaml new file mode 100644 index 000000000..46a2d5375 --- /dev/null +++ b/examples/llm_prompt_optimization/emotion_prompt_dataset.yaml @@ -0,0 +1,18 @@ +# HuggingFace dataset configuration for emotion classification +# This is a standard benchmark used by DSPy and others +dataset_name: "dair-ai/emotion" +input_field: "text" +target_field: "label" # 0-5: sadness, joy, love, anger, fear, surprise +split: "test" + +# Evaluation samples +max_samples: 200 # Larger sample for 6-class problem + +# Labels mapping for reference +label_names: + 0: "sadness" + 1: "joy" + 2: "love" + 3: "anger" + 4: "fear" + 5: "surprise" \ No newline at end of file diff --git a/examples/llm_prompt_optimization/evaluator.py b/examples/llm_prompt_optimization/evaluator.py new file mode 100644 index 000000000..49fad99ba --- /dev/null +++ b/examples/llm_prompt_optimization/evaluator.py @@ -0,0 +1,446 @@ +""" +Evaluator for HuggingFace dataset-based prompt optimization. 
+""" + +import re +import traceback +import yaml +import os +import time +from openai import OpenAI +from tqdm import tqdm +from datasets import load_dataset + +# Read config.yaml to get model settings +with open(os.path.join(os.path.dirname(__file__), "config.yaml"), 'r') as f: + config = yaml.safe_load(f) + +# Get model settings from config +llm_config = config.get('llm', {}) +api_base = llm_config.get('api_base', 'http://localhost:1234/v1') + +# Handle both single model and model list configurations +models = llm_config.get('models', []) +if models: + # Use first model from list + TASK_MODEL_NAME = models[0].get('name', 'default-model') +else: + # Fallback to direct model specification + TASK_MODEL_NAME = llm_config.get('primary_model', 'default-model') + +# Get evaluator settings +evaluator_config = config.get('evaluator', {}) +MAX_RETRIES = evaluator_config.get('max_retries', 3) + +# Get max_tokens from LLM config +MAX_TOKENS = llm_config.get('max_tokens', 16000) +print(f"Using max_tokens: {MAX_TOKENS}") + +# Initialize OpenAI client once for all evaluations +test_model = OpenAI(base_url=api_base) +print(f"Initialized OpenAI client with model: {TASK_MODEL_NAME}") + +# Determine which dataset to use based on the OPENEVOLVE_PROMPT environment variable +import sys +prompt_file = os.environ.get('OPENEVOLVE_PROMPT') +if not prompt_file: + # Default to a generic dataset config if not using the wrapper script + evaluator_dir = os.path.dirname(os.path.abspath(__file__)) + DATASET_CONFIG_PATH = os.path.join(evaluator_dir, 'dataset_config.yaml') + print("Warning: OPENEVOLVE_PROMPT not set. Using default dataset_config.yaml") +else: + basename = os.path.basename(prompt_file) + dataset_filename = basename.replace('_prompt.txt', '_prompt_dataset.yaml').replace('.txt', '_dataset.yaml') + evaluator_dir = os.path.dirname(os.path.abspath(__file__)) + DATASET_CONFIG_PATH = os.path.join(evaluator_dir, dataset_filename) + print(f"Dataset configuration: {dataset_filename}") + + +def calculate_prompt_features(prompt): + """ + Calculate custom features for MAP-Elites binning + + Returns: + tuple: (prompt_length, reasoning_strategy) - both in range 0-9 + """ + # Feature 1: Prompt length bin (0-9) + length = len(prompt) + if length < 100: + prompt_length = 0 # Minimal + elif length < 200: + prompt_length = 1 # Very short + elif length < 400: + prompt_length = 2 # Short + elif length < 600: + prompt_length = 3 # Medium-short + elif length < 900: + prompt_length = 4 # Medium + elif length < 1200: + prompt_length = 5 # Medium-long + elif length < 1600: + prompt_length = 6 # Long + elif length < 2000: + prompt_length = 7 # Very long + elif length < 2500: + prompt_length = 8 # Extensive + else: + prompt_length = 9 # Very extensive + + # Feature 2: Reasoning strategy (0-9) + prompt_lower = prompt.lower() + + # Check for few-shot examples + has_example = ('example' in prompt_lower or + prompt.count('####') >= 4 or + bool(re.search(r'problem:.*?solution:', prompt_lower, re.DOTALL))) + + # Check for Chain-of-Thought (CoT) indicators + has_cot = ('step by step' in prompt_lower or + 'step-by-step' in prompt_lower or + any(phrase in prompt_lower for phrase in ['think through', 'reasoning', 'explain your']) or + bool(re.search(r'(first|then|next|finally)', prompt_lower))) + + # Assign reasoning strategy bins + if has_example: + # Few-shot examples (bins 7-9) + if has_cot: + reasoning_strategy = 9 # Few-shot + CoT (most sophisticated) + elif length > 1500: + reasoning_strategy = 8 # Extensive few-shot + else: + 
reasoning_strategy = 7 # Basic few-shot + elif has_cot: + # Chain-of-thought (bins 4-6) + if 'must' in prompt_lower or 'exactly' in prompt_lower: + reasoning_strategy = 6 # Strict CoT + elif length > 500: + reasoning_strategy = 5 # Detailed CoT + else: + reasoning_strategy = 4 # Basic CoT + else: + # Basic prompts (bins 0-3) + if length < 100: + reasoning_strategy = 0 # Minimal + elif 'solve' in prompt_lower or 'calculate' in prompt_lower: + reasoning_strategy = 2 # Direct instruction + else: + reasoning_strategy = 1 # Simple prompt + + return prompt_length, reasoning_strategy + + +def load_prompt_config(prompt_path): + """Load the prompt from text file and dataset config from matching _dataset.yaml file.""" + # Load prompt from text file + with open(prompt_path, 'r') as f: + prompt = f.read().strip() + + # Load the configuration (already determined from environment variable) + if not os.path.exists(DATASET_CONFIG_PATH): + raise FileNotFoundError(f"Dataset configuration not found: {DATASET_CONFIG_PATH}") + + with open(DATASET_CONFIG_PATH, 'r') as f: + config = yaml.safe_load(f) + + return config, prompt + +def load_hf_dataset(config): + """Load HuggingFace dataset based on configuration.""" + dataset_name = config['dataset_name'] + dataset_config = config.get('dataset_config', None) + split = config.get('split', 'test') + + print(f"Loading dataset: {dataset_name}") + + try: + # Try to load the specified split + if dataset_config: + dataset = load_dataset(dataset_name, dataset_config, split=split) + else: + dataset = load_dataset(dataset_name, split=split) + except: + # Fallback to train split if test is not available + print(f"Split '{split}' not found, falling back to 'train'") + if dataset_config: + dataset = load_dataset(dataset_name, dataset_config, split='train') + else: + dataset = load_dataset(dataset_name, split='train') + + print(f"Dataset loaded with {len(dataset)} examples") + return dataset + +def evaluate_prompt(prompt, dataset, config, num_samples): + """Evaluate a prompt on a subset of the dataset.""" + input_field = config['input_field'] + target_field = config['target_field'] + + # Check dataset type + dataset_name = config.get('dataset_name', '').lower() + is_emotion = 'emotion' in dataset_name + is_gsm8k = 'gsm8k' in dataset_name + + # Sample from dataset + samples = dataset.select(range(min(num_samples, len(dataset)))) + + correct = 0 + total = 0 + + for example in tqdm(samples, desc=f"Evaluating {num_samples} samples"): + input_text = example[input_field] + expected = example[target_field] + + # Prepare the message for the LLM + messages = [ + {"role": "user", "content": prompt.format(input_text=input_text)} + ] + + # Call the LLM with retry logic + for attempt in range(MAX_RETRIES): + try: + # Use max_tokens from config + response = test_model.chat.completions.create( + model=TASK_MODEL_NAME, + messages=messages, + temperature=0.1, # Low temperature for consistent results + max_tokens=MAX_TOKENS + ) + break + except Exception as e: + if attempt == MAX_RETRIES - 1: + print(f"Failed to get response after {MAX_RETRIES} attempts: {e}") + raise e + time.sleep(1) + + # Handle potential None response + if not response: + print(f"Warning: No response object from LLM") + total += 1 # Count as incorrect + continue + + if not response.choices: + print(f"Warning: No choices in response from LLM") + total += 1 # Count as incorrect + continue + + if not response.choices[0].message: + print(f"Warning: No message in response choice") + total += 1 # Count as incorrect + continue + + 
output_text = response.choices[0].message.content + if output_text is None: + print(f"Warning: None content in LLM response") + print(f"Full response: {response}") + total += 1 # Count as incorrect + continue + + output_text = output_text.strip() + + # Extract prediction from output + try: + if is_gsm8k: + # For GSM8K, extract the numeric answer after #### + # First, extract the expected answer from the ground truth + expected_answer = expected.split('####')[-1].strip() + try: + expected_number = float(expected_answer.replace(',', '')) + except: + print(f"Warning: Could not parse expected answer: {expected_answer}") + total += 1 + continue + + # Extract prediction from model output + prediction = None + if '####' in output_text: + predicted_answer = output_text.split('####')[-1].strip() + # Extract just the number, removing any extra text like $ signs + import re + numbers = re.findall(r'-?\$?[\d,]+\.?\d*', predicted_answer) + if numbers: + try: + # Remove $ and , from the number + number_str = numbers[0].replace('$', '').replace(',', '') + prediction = float(number_str) + except: + pass + + # If we found a prediction, check if it matches + if prediction is not None: + # Check if answers match (with small tolerance for floats) + if abs(prediction - expected_number) < 0.001: + correct += 1 + + total += 1 + continue # Skip the general case to avoid double counting + + elif is_emotion: + # For emotion classification (0-5) + numbers = re.findall(r'\b[0-5]\b', output_text) + if numbers: + prediction = int(numbers[-1]) # Use the last number found + else: + # Try to infer from emotion keywords + output_lower = output_text.lower() + emotion_map = { + 'sadness': 0, 'sad': 0, + 'joy': 1, 'happy': 1, 'happiness': 1, + 'love': 2, + 'anger': 3, 'angry': 3, + 'fear': 4, 'afraid': 4, 'scared': 4, + 'surprise': 5, 'surprised': 5 + } + prediction = -1 + for emotion, label in emotion_map.items(): + if emotion in output_lower: + prediction = label + break + else: + # For sentiment classification (0-1) + numbers = re.findall(r'\b[01]\b', output_text) + if numbers: + prediction = int(numbers[-1]) # Use the last number found + else: + # Try to infer from keywords + output_lower = output_text.lower() + if 'positive' in output_lower: + prediction = 1 + elif 'negative' in output_lower: + prediction = 0 + else: + prediction = -1 # Invalid prediction + + if prediction == expected: + correct += 1 + + total += 1 + + except Exception as e: + print(f"Error parsing response '{output_text}': {e}") + total += 1 # Count as incorrect + + accuracy = correct / total if total > 0 else 0.0 + return accuracy, correct, total + +def evaluate_stage1(prompt_path): + """ + Stage 1 evaluation: Quick evaluation with 10% of samples + + Args: + prompt_path: Path to the prompt file + + Returns: + Dictionary with combined_score metric + """ + print('-' * 80) + print("Starting Stage 1 evaluation...") + print('-' * 80) + + try: + # Load prompt configuration + config, prompt = load_prompt_config(prompt_path) + print(f"Loaded prompt configuration") + + # Load dataset + dataset = load_hf_dataset(config) + + # Get number of samples from config + num_samples = config.get('max_samples', 50) + stage1_samples = max(10, int(num_samples * 0.1)) + + print(f"Stage 1: Evaluating {stage1_samples} samples...") + + # Run evaluation + accuracy, correct, total = evaluate_prompt( + prompt, dataset, config, stage1_samples + ) + + print(f"Stage 1 accuracy: {accuracy:.3f} ({correct}/{total})") + print('-' * 80) + + # Calculate custom features + prompt_length, 
reasoning_strategy = calculate_prompt_features(prompt) + print(f"Prompt features - Length bin: {prompt_length}, Reasoning bin: {reasoning_strategy}") + + return { + "combined_score": accuracy, + "prompt_length": prompt_length, + "reasoning_strategy": reasoning_strategy + } + + except Exception as e: + print(f"Stage 1 evaluation failed: {str(e)}") + traceback.print_exc() + print('-' * 80) + return { + "combined_score": 0.0, + "error": str(e) + } + + +def evaluate_stage2(prompt_path): + """ + Stage 2 evaluation: Full evaluation with all samples + + Args: + prompt_path: Path to the prompt file + + Returns: + Dictionary with combined_score metric + """ + print('-' * 80) + print("Starting Stage 2 evaluation...") + print('-' * 80) + + try: + # Load prompt configuration + config, prompt = load_prompt_config(prompt_path) + print(f"Loaded prompt configuration") + + # Load dataset + dataset = load_hf_dataset(config) + + # Get number of samples from config + num_samples = config.get('max_samples', 50) + + print(f"Stage 2: Evaluating all {num_samples} samples...") + + # Run evaluation + accuracy, correct, total = evaluate_prompt( + prompt, dataset, config, num_samples + ) + + print(f"Stage 2 accuracy: {accuracy:.3f} ({correct}/{total})") + print('-' * 80) + + # Calculate custom features + prompt_length, reasoning_strategy = calculate_prompt_features(prompt) + print(f"Prompt features - Length bin: {prompt_length}, Reasoning bin: {reasoning_strategy}") + + return { + "combined_score": accuracy, + "prompt_length": prompt_length, + "reasoning_strategy": reasoning_strategy + } + + except Exception as e: + print(f"Stage 2 evaluation failed: {str(e)}") + traceback.print_exc() + print('-' * 80) + return { + "combined_score": 0.0, + "error": str(e) + } + + +def evaluate(prompt_path): + """ + Main evaluation function - for backwards compatibility + Calls evaluate_stage2 for full evaluation + + Args: + prompt_path: Path to the prompt file + + Returns: + Dictionary with combined_score metric + """ + return evaluate_stage2(prompt_path) \ No newline at end of file diff --git a/examples/llm_prompt_optimization/gsm8k_prompt.txt b/examples/llm_prompt_optimization/gsm8k_prompt.txt new file mode 100644 index 000000000..476efed05 --- /dev/null +++ b/examples/llm_prompt_optimization/gsm8k_prompt.txt @@ -0,0 +1,5 @@ +Solve the following grade school math problem step by step. + +Problem: {input_text} + +Show your work and reasoning for each step. After solving, provide your final numeric answer after "####". \ No newline at end of file diff --git a/examples/llm_prompt_optimization/gsm8k_prompt_dataset.yaml b/examples/llm_prompt_optimization/gsm8k_prompt_dataset.yaml new file mode 100644 index 000000000..db28e49eb --- /dev/null +++ b/examples/llm_prompt_optimization/gsm8k_prompt_dataset.yaml @@ -0,0 +1,14 @@ +# HuggingFace dataset configuration for GSM8K (Grade School Math) +# DSPy achieved 97.1% accuracy with GPT-4 on this benchmark +dataset_name: "openai/gsm8k" +dataset_config: "main" # GSM8K requires config name +input_field: "question" +target_field: "answer" # Contains step-by-step solution ending with #### followed by the numeric answer +split: "test" + +# Evaluation samples +max_samples: 200 # Start with subset, full test set has 1,319 problems + +# Note: The answer field contains the full solution with the format: +# "Step 1 explanation... Step 2... 
#### numeric_answer" +# The evaluator will need to extract the number after #### \ No newline at end of file diff --git a/examples/llm_prompt_optimization/initial_prompt.txt b/examples/llm_prompt_optimization/initial_prompt.txt new file mode 100644 index 000000000..ab329a63f --- /dev/null +++ b/examples/llm_prompt_optimization/initial_prompt.txt @@ -0,0 +1,5 @@ +Analyze the sentiment of the following text and classify it as positive (1) or negative (0). + +Text: "{input_text}" + +Label: \ No newline at end of file diff --git a/examples/llm_prompt_optimization/initial_prompt_dataset.yaml b/examples/llm_prompt_optimization/initial_prompt_dataset.yaml new file mode 100644 index 000000000..8bf503ae3 --- /dev/null +++ b/examples/llm_prompt_optimization/initial_prompt_dataset.yaml @@ -0,0 +1,8 @@ +# HuggingFace dataset configuration +dataset_name: "stanfordnlp/imdb" +input_field: "text" +target_field: "label" +split: "test" # Will fallback to train if not available + +# Evaluation samples +max_samples: 50 # Number of samples to evaluate \ No newline at end of file diff --git a/examples/llm_prompt_optimization/requirements.txt b/examples/llm_prompt_optimization/requirements.txt new file mode 100644 index 000000000..b72f54907 --- /dev/null +++ b/examples/llm_prompt_optimization/requirements.txt @@ -0,0 +1,4 @@ +openai +tqdm +datasets +pyyaml \ No newline at end of file diff --git a/examples/llm_prompt_optimization/run_evolution.sh b/examples/llm_prompt_optimization/run_evolution.sh new file mode 100644 index 000000000..2d7daa4c6 --- /dev/null +++ b/examples/llm_prompt_optimization/run_evolution.sh @@ -0,0 +1,17 @@ +#!/bin/bash +# Wrapper script to run OpenEvolve with the correct dataset + +if [ $# -lt 1 ]; then + echo "Usage: $0 [additional_args...]" + echo "Example: $0 emotion_prompt.txt --iterations 50" + exit 1 +fi + +PROMPT_FILE=$1 +shift # Remove first argument + +# Set the environment variable for the evaluator +export OPENEVOLVE_PROMPT=$PROMPT_FILE + +# Run OpenEvolve +python ../../openevolve-run.py "$PROMPT_FILE" evaluator.py --config config.yaml "$@" \ No newline at end of file diff --git a/examples/llm_prompt_optimization/templates/full_rewrite_user.txt b/examples/llm_prompt_optimization/templates/full_rewrite_user.txt new file mode 100644 index 000000000..216844a48 --- /dev/null +++ b/examples/llm_prompt_optimization/templates/full_rewrite_user.txt @@ -0,0 +1,20 @@ +# Current Prompt Information +- Current performance metrics: {metrics} +- Areas identified for improvement: {improvement_areas} + +{artifacts} + +# Prompt Evolution History +{evolution_history} + +# Current Prompt +{current_program} + +# Task +Rewrite the prompt to improve its performance on the specified metrics. +Provide the complete new prompt text. + +IMPORTANT: Make sure your rewritten prompt maintains the same input placeholder ({{input_text}}) +but with improved instructions for better LLM performance. 
+ +Your improved prompt: \ No newline at end of file diff --git a/openevolve/database.py b/openevolve/database.py index 0d2ba6fd4..740768839 100644 --- a/openevolve/database.py +++ b/openevolve/database.py @@ -8,6 +8,7 @@ import os import random import time +import uuid from dataclasses import asdict, dataclass, field, fields # FileLock removed - no longer needed with threaded parallel processing @@ -998,12 +999,29 @@ def _sample_exploration_parent(self) -> Program: if not current_island_programs: # If current island is empty, initialize with best program or random program if self.best_program_id and self.best_program_id in self.programs: - # Clone best program to current island + # Create a copy of best program for the empty island (don't reuse same ID) best_program = self.programs[self.best_program_id] - self.islands[self.current_island].add(self.best_program_id) - best_program.metadata["island"] = self.current_island - logger.debug(f"Initialized empty island {self.current_island} with best program") - return best_program + copy_program = Program( + id=str(uuid.uuid4()), + code=best_program.code, + language=best_program.language, + parent_id=best_program.id, + generation=best_program.generation, + timestamp=time.time(), + iteration_found=self.last_iteration, + metrics=best_program.metrics.copy(), + complexity=best_program.complexity, + diversity=best_program.diversity, + metadata={"island": self.current_island}, + artifacts_json=best_program.artifacts_json, + artifact_dir=best_program.artifact_dir, + ) + self.programs[copy_program.id] = copy_program + self.islands[self.current_island].add(copy_program.id) + logger.debug( + f"Initialized empty island {self.current_island} with copy of best program" + ) + return copy_program else: # Use any available program return next(iter(self.programs.values())) @@ -1026,10 +1044,29 @@ def _sample_exploration_parent(self) -> Program: f"Island {self.current_island} has no valid programs after cleanup, reinitializing" ) if self.best_program_id and self.best_program_id in self.programs: + # Create a copy of best program for the empty island (don't reuse same ID) best_program = self.programs[self.best_program_id] - self.islands[self.current_island].add(self.best_program_id) - best_program.metadata["island"] = self.current_island - return best_program + copy_program = Program( + id=str(uuid.uuid4()), + code=best_program.code, + language=best_program.language, + parent_id=best_program.id, + generation=best_program.generation, + timestamp=time.time(), + iteration_found=self.last_iteration, + metrics=best_program.metrics.copy(), + complexity=best_program.complexity, + diversity=best_program.diversity, + metadata={"island": self.current_island}, + artifacts_json=best_program.artifacts_json, + artifact_dir=best_program.artifact_dir, + ) + self.programs[copy_program.id] = copy_program + self.islands[self.current_island].add(copy_program.id) + logger.debug( + f"Reinitialized empty island {self.current_island} with copy of best program" + ) + return copy_program else: return next(iter(self.programs.values())) @@ -1347,6 +1384,26 @@ def migrate_programs(self) -> None: target_islands = [(i + 1) % len(self.islands), (i - 1) % len(self.islands)] for migrant in migrants: + # Prevent re-migration of already migrated programs to avoid exponential duplication. 
+ # Analysis of actual evolution runs shows this causes severe issues: + # - Program cb5d07f2 had 183 descendant copies by iteration 850 + # - Program 5645fbd2 had 31 descendant copies + # - IDs grow exponentially: program_migrant_2_migrant_3_migrant_4_migrant_0... + # + # This is particularly problematic for OpenEvolve's MAP-Elites + Island hybrid architecture: + # 1. All copies have identical code → same complexity/diversity/performance scores + # 2. They all map to the SAME MAP-Elites cell → only 1 survives, rest discarded + # 3. Wastes computation evaluating hundreds of identical programs + # 4. Reduces actual diversity as islands fill with duplicates + # + # By preventing already-migrated programs from migrating again, we ensure: + # - Each program migrates at most once per lineage + # - True diversity is maintained between islands + # - Computational resources aren't wasted on duplicates + # - Aligns with MAP-Elites' one-program-per-cell principle + if migrant.metadata.get("migrant", False): + continue + for target_island in target_islands: # Create a copy for migration (to avoid removing from source) migrant_copy = Program( diff --git a/openevolve/evaluator.py b/openevolve/evaluator.py index 25d880987..80bcac333 100644 --- a/openevolve/evaluator.py +++ b/openevolve/evaluator.py @@ -644,6 +644,9 @@ def _create_cascade_error_context(self, stage: str, error: Exception) -> dict: def _passes_threshold(self, metrics: Dict[str, float], threshold: float) -> bool: """ Check if metrics pass a threshold + + Uses 'combined_score' if available (for consistency with evolution), + otherwise falls back to averaging all numeric metrics except 'error' Args: metrics: Dictionary of metric name to score @@ -655,7 +658,14 @@ def _passes_threshold(self, metrics: Dict[str, float], threshold: float) -> bool if not metrics: return False - # Calculate average score, skipping non-numeric values and 'error' key + # Use combined_score if available - this is what evolution uses + if "combined_score" in metrics: + score = metrics.get("combined_score") + if isinstance(score, (int, float)): + return float(score) >= threshold + + # Fallback: average all numeric metrics except 'error' + # This maintains backward compatibility valid_metrics = [] for name, value in metrics.items(): # Skip 'error' keys and ensure values are numeric diff --git a/tests/test_database.py b/tests/test_database.py index 0d17f8961..cd11a7e26 100644 --- a/tests/test_database.py +++ b/tests/test_database.py @@ -3,6 +3,7 @@ """ import unittest +import uuid from openevolve.config import Config from openevolve.database import Program, ProgramDatabase @@ -457,6 +458,183 @@ def test_diversity_feature_integration(self): self.assertGreaterEqual(coord, 0) self.assertLess(coord, self.db.feature_bins) + def test_migration_prevents_re_migration(self): + """Test that programs marked as migrants don't migrate again""" + # Create database with multiple islands + config = Config() + config.database.in_memory = True + config.database.num_islands = 3 + config.database.migration_interval = 1 # Migrate every generation + multi_db = ProgramDatabase(config.database) + + # Add programs to each island (avoid "migrant" in original IDs) + for i in range(3): + program = Program( + id=f"test_prog_{i}", + code=f"def test_{i}(): return {i}", + language="python", + metrics={"score": 0.5 + i * 0.1}, + ) + multi_db.add(program, target_island=i) + + # Manually mark one as a migrant + migrant_program = multi_db.get("test_prog_0") + migrant_program.metadata["migrant"] = True + + # 
Store original ID + original_id = migrant_program.id + + # Count initial programs with "_migrant_" pattern (created by migration) + initial_migrant_count = sum(1 for pid in multi_db.programs if "_migrant_" in pid) + self.assertEqual(initial_migrant_count, 0) # Should be none initially + + # Run migration + multi_db.island_generations[0] = config.database.migration_interval + multi_db.island_generations[1] = config.database.migration_interval + multi_db.island_generations[2] = config.database.migration_interval + multi_db.migrate_programs() + + # Check that the migrant program wasn't re-migrated + # It should still exist with the same ID (not a new migrant ID) + still_exists = multi_db.get(original_id) + self.assertIsNotNone(still_exists) + + # Count new programs created by migration (identified by "_migrant_" pattern) + new_migrant_ids = [pid for pid in multi_db.programs if "_migrant_" in pid] + + # Each non-migrant program (2 of them) migrates to 2 adjacent islands + # So we expect 2 * 2 = 4 new migrant programs + # The already-marked migrant (test_prog_0) should NOT create any new copies + self.assertEqual(len(new_migrant_ids), 4) + + # Verify the already-migrant program didn't create new copies + migrant_descendants = [pid for pid in new_migrant_ids if original_id in pid] + self.assertEqual(len(migrant_descendants), 0, + f"Program {original_id} should not have created migrant copies") + + def test_empty_island_initialization_creates_copies(self): + """Test that empty islands are initialized with copies, not shared references""" + # Create database with multiple islands + config = Config() + config.database.in_memory = True + config.database.num_islands = 3 + # Force exploration mode to test empty island handling + config.database.exploration_ratio = 1.0 + config.database.exploitation_ratio = 0.0 + multi_db = ProgramDatabase(config.database) + + # Add a single program to island 1 + program = Program( + id="original_program", + code="def original(): return 42", + language="python", + metrics={"score": 0.9, "combined_score": 0.9}, + ) + multi_db.add(program, target_island=1) + + # Make it the best program + multi_db.best_program_id = "original_program" + + # Switch to empty island 0 and sample + multi_db.set_current_island(0) + sampled_parent, _ = multi_db.sample() + + # The sampled program should be a copy, not the original + self.assertNotEqual(sampled_parent.id, "original_program") + self.assertEqual(sampled_parent.code, program.code) # Same code + self.assertEqual(sampled_parent.parent_id, "original_program") # Parent is the original + + # Check island membership + self.assertIn("original_program", multi_db.islands[1]) + self.assertNotIn("original_program", multi_db.islands[0]) + self.assertIn(sampled_parent.id, multi_db.islands[0]) + + # Run validation - should not raise any errors + multi_db._validate_migration_results() + + def test_no_program_assigned_to_multiple_islands(self): + """Test that programs are never assigned to multiple islands""" + # Create database with multiple islands + config = Config() + config.database.in_memory = True + config.database.num_islands = 4 + multi_db = ProgramDatabase(config.database) + + # Add programs to different islands + program_ids = [] + for i in range(4): + program = Program( + id=f"island_test_{i}", + code=f"def test_{i}(): return {i}", + language="python", + metrics={"score": 0.5 + i * 0.1, "combined_score": 0.5 + i * 0.1}, + ) + multi_db.add(program, target_island=i) + program_ids.append(program.id) + + # Make the best program from island 
3 + multi_db.best_program_id = "island_test_3" + + # Sample from empty islands - this should create copies + for empty_island in range(4): + if len(multi_db.islands[empty_island]) == 0: + multi_db.set_current_island(empty_island) + parent, _ = multi_db.sample() + + # Check that no program ID appears in multiple islands + all_island_programs = {} + for island_idx, island_programs in enumerate(multi_db.islands): + for program_id in island_programs: + if program_id in all_island_programs: + self.fail( + f"Program {program_id} found in both island {all_island_programs[program_id]} " + f"and island {island_idx}" + ) + all_island_programs[program_id] = island_idx + + # Run validation - should not raise any errors + multi_db._validate_migration_results() + + def test_migration_validation_passes(self): + """Test that migration validation passes after our fixes""" + # Create database with multiple islands + config = Config() + config.database.in_memory = True + config.database.num_islands = 3 + config.database.migration_interval = 1 + multi_db = ProgramDatabase(config.database) + + # Add programs and run several migration cycles + for i in range(6): + program = Program( + id=f"test_program_{i}", + code=f"def test_{i}(): return {i * 2}", + language="python", + metrics={"score": 0.4 + i * 0.1, "combined_score": 0.4 + i * 0.1}, + ) + multi_db.add(program, target_island=i % 3) + + # Run multiple migration cycles + for cycle in range(3): + # Increment generations to trigger migration + for island in range(3): + multi_db.island_generations[island] += 1 + + # Migrate programs + multi_db.migrate_programs() + + # Validation should pass without warnings + multi_db._validate_migration_results() + + # Verify no program has exponential ID growth + for program_id in multi_db.programs: + # Count occurrences of "migrant" in ID + migrant_count = program_id.count("migrant") + self.assertLessEqual( + migrant_count, 1, + f"Program ID {program_id} has been migrated multiple times" + ) + if __name__ == "__main__": unittest.main()
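
For reference, the cascade-threshold change in `openevolve/evaluator.py` interacts directly with the `combined_score` returned by the example evaluator above: when `combined_score` is present, it alone decides whether a program advances to the next evaluation stage, and the old average-of-metrics behaviour only applies as a fallback. The snippet below is a minimal standalone sketch of that rule, not the library code itself; the function name `passes_threshold` and the metric values are illustrative.

```python
# Minimal standalone sketch of the updated cascade-threshold rule; the real
# implementation lives in Evaluator._passes_threshold.
from typing import Any, Dict


def passes_threshold(metrics: Dict[str, Any], threshold: float) -> bool:
    if not metrics:
        return False
    # Prefer the same score that drives evolution.
    score = metrics.get("combined_score")
    if isinstance(score, (int, float)):
        return float(score) >= threshold
    # Fallback: average all numeric metrics except 'error'.
    numeric = [
        float(v)
        for k, v in metrics.items()
        if k != "error" and isinstance(v, (int, float))
    ]
    return bool(numeric) and (sum(numeric) / len(numeric)) >= threshold


# Stage 1 above returns feature bins alongside the accuracy; with the
# combined_score shortcut, only the accuracy decides promotion to stage 2.
assert passes_threshold({"combined_score": 0.62, "prompt_length": 3, "reasoning_strategy": 7}, 0.5)
assert not passes_threshold({"combined_score": 0.40, "prompt_length": 9, "reasoning_strategy": 2}, 0.5)
```

Without the `combined_score` preference, the second dictionary would pass almost any threshold because the integer feature bins inflate the mean ((0.40 + 9 + 2) / 3 ≈ 3.8), which is exactly the inconsistency this change removes. The new database tests can be exercised directly, e.g. `python tests/test_database.py` from the repository root with the package importable, thanks to the `unittest.main()` entry point.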