diff --git a/README.md b/README.md index dd00115f1..a20d1665a 100644 --- a/README.md +++ b/README.md @@ -288,12 +288,72 @@ prompt: num_top_programs: 3 # Performance examples num_diverse_programs: 2 # Creative inspiration include_artifacts: true # Execution feedback + + # Template customization + template_dir: null # Directory for custom prompt templates + use_template_stochasticity: true # Enable random variations in prompts + template_variations: {} # Define variation placeholders ``` Sample configuration files are available in the `configs/` directory: - `default_config.yaml`: Comprehensive configuration with all available options - `island_config_example.yaml`: Advanced island-based evolution setup +### Template Customization + +OpenEvolve supports advanced prompt template customization to increase diversity in code evolution: + +#### Custom Templates with `template_dir` + +You can override the default prompt templates by providing custom ones: + +```yaml +prompt: + template_dir: "path/to/your/templates" +``` + +Create `.txt` files in your template directory with these names: +- `diff_user.txt` - Template for diff-based evolution +- `full_rewrite_user.txt` - Template for full code rewrites +- `evolution_history.txt` - Format for presenting evolution history +- `top_program.txt` - Format for top-performing programs +- `previous_attempt.txt` - Format for previous attempts + +See these directories for complete examples of custom templates: +- `examples/lm_eval/prompts/` - Custom templates for evaluation tasks +- `examples/llm_prompt_optimization/templates/` - Templates for evolving prompts instead of code + +#### Template Variations with Stochasticity + +To add randomness to your prompts and prevent getting stuck in local optima: + +1. **Enable stochasticity** in your config: +```yaml +prompt: + use_template_stochasticity: true + template_variations: + greeting: + - "Let's improve this code." + - "Time to enhance this program." + - "Here's how we can optimize:" + analysis_intro: + - "Current metrics show" + - "Performance analysis indicates" + - "The evaluation reveals" +``` + +2. **Use variation placeholders** in your custom templates: +``` +# custom_template.txt +{greeting} +{analysis_intro} the following results: +{metrics} +``` + +The system will randomly select one variation for each placeholder during prompt generation, creating diverse prompts that can lead to more creative code evolutions. + +**Note**: The default templates don't include variation placeholders, so you'll need to create custom templates to use this feature effectively. + ### Feature Dimensions in MAP-Elites Feature dimensions control how programs are organized in the MAP-Elites quality-diversity grid: @@ -425,8 +485,12 @@ Demonstrates integration with [optillm](https://github.com/codelion/optillm) for - **Mixture of Agents (MoA)**: Multi-response synthesis for improved accuracy - **Local model optimization**: Enhanced reasoning with smaller models -#### [LLM Prompt Optimization](examples/llm_prompt_optimazation/) -Evolving prompts themselves for better LLM performance, demonstrating self-improving AI systems. +#### [LLM Prompt Optimization](examples/llm_prompt_optimization/) +Evolving prompts for better LLM performance on HuggingFace datasets. 
Features: +- Custom templates for evolving prompts instead of code +- Two-stage cascading evaluation for efficiency +- Support for any HuggingFace dataset +- Automatic prompt improvement through evolution ### Systems & Performance Optimization diff --git a/examples/llm_prompt_optimazation/README.md b/examples/llm_prompt_optimazation/README.md deleted file mode 100644 index c207a0084..000000000 --- a/examples/llm_prompt_optimazation/README.md +++ /dev/null @@ -1,184 +0,0 @@ -# Evolving Better Prompts with OpenEvolve 🧠✨ - -This example shows how to use **OpenEvolve** to automatically optimize prompts for **Large Language Models (LLMs)**. Whether you're working on classification, summarization, generation, or code tasks, OpenEvolve helps you find high-performing prompts using **evolutionary search**. For this example we'll use syntihetic data for sentiment analysis task, but you can adapt it to your own datasets and tasks. - ---- - -## 🎯 What Is Prompt Optimization? - -Prompt engineering is key to getting reliable outputs from LLMs—but finding the right prompt manually can be slow and inconsistent. - -OpenEvolve automates this by: - -* Generating and evolving prompt variations -* Testing them against your task and metrics -* Selecting the best prompts through generations - -You start with a simple prompt and let OpenEvolve evolve it into something smarter and more effective. - ---- - -## 🚀 Getting Started - -### 1. Install Dependencies - -```bash -cd examples/llm_prompt_optimazation -pip install -r requirements.txt -sh run.sh -``` - -### 2. Add Your models - -1. Update your `config.yaml`: - -```yaml -llm: - primary_model: "llm_name" - api_base: "llm_server_url" - api_key: "your_api_key_here" -``` - -2. Update your task-model in `evaluator.py`: - -```python -TASK_MODEL_NAME = "task_llm_name" -TASK_MODEL_URL = "task_llm_server_url" -TASK_MODEL_API_KEY = "your_api_key_here" -SAMPLE_SIZE = 25 # Number of samples to use for evaluation -MAX_RETRIES = 3 # Number of retries for LLM calls - -``` - -### 3. Run OpenEvolve - -```bash -sh run.sh -``` - ---- - -## 🔧 How to Adapt This Template - -### 1. Replace the Dataset - -Edit `data.json` to match your use case: - -```json -[ - { - "id": 1, - "input": "Your input here", - "expected_output": "Target output" - } -] -``` - -### 2. Customize the Evaluator - -In `evaluator.py`, define how to evaluate a prompt: - -* Load your data -* Call the LLM using the prompt -* Measure output quality (accuracy, score, etc.) - -### 3. Write Your Initial Prompt - -Create a basic starting prompt in `initial_prompt.txt`: - -``` -# EVOLVE-BLOCK-START -Your task prompt using {input_text} as a placeholder. -# EVOLVE-BLOCK-END -``` - -This is the part OpenEvolve will improve over time. -Good to add the name of your task in 'initial_prompt.txt' header to help the model understand the context. - ---- - -## ⚙️ Key Config Options (`config.yaml`) - -```yaml -llm: - primary_model: "gpt-4o" # or your preferred model - secondary_model: "gpt-3.5" # optional for diversity - temperature: 0.9 - max_tokens: 2048 - -database: - population_size: 40 - max_iterations: 15 - elite_selection_ratio: 0.25 - -evaluator: - timeout: 45 - parallel_evaluations: 3 - use_llm_feedback: true -``` - ---- - -## 📈 Example Output - -OpenEvolve evolves prompts like this: - -**Initial Prompt:** - -``` -Please analyze the sentiment of the following sentence and provide a sentiment score: - -"{input_text}" - -Rate the sentiment on a scale from 0.0 to 10.0. 
- -Score: -``` - -**Evolved Prompt:** - -``` -Please analyze the sentiment of the following sentence and provide a sentiment score using the following guidelines: -- 0.0-2.9: Strongly negative sentiment (e.g., expresses anger, sadness, or despair) -- 3.0-6.9: Neutral or mixed sentiment (e.g., factual statements, ambiguous content) -- 7.0-10.0: Strongly positive sentiment (e.g., expresses joy, satisfaction, or hope) - -"{input_text}" - -Rate the sentiment on a scale from 0.0 to 10.0: -- 0.0-2.9: Strongly negative (e.g., "This product is terrible") -- 3.0-6.9: Neutral/mixed (e.g., "The sky is blue today") -- 7.0-10.0: Strongly positive (e.g., "This is amazing!") - -Provide only the numeric score (e.g., "8.5") without any additional text: - -Score: -``` - -**Result**: Improved accuracy and output consistency. - ---- - -## 🔍 Where to Use This - -OpenEvolve could be addapted on many tasks: - -* **Text Classification**: Spam detection, intent recognition -* **Content Generation**: Social media posts, product descriptions -* **Question Answering & Summarization** -* **Code Tasks**: Review, generation, completion -* **Structured Output**: JSON, table filling, data extraction - ---- - -## ✅ Best Practices - -* Start with a basic but relevant prompt -* Use good-quality data and clear evaluation metrics -* Run multiple evolutions for better results -* Validate on held-out data before deployment - ---- - -**Ready to discover better prompts?** -Use this template to evolve prompts for any LLM task—automatically. diff --git a/examples/llm_prompt_optimazation/best_program.txt b/examples/llm_prompt_optimazation/best_program.txt deleted file mode 100644 index 601c29da2..000000000 --- a/examples/llm_prompt_optimazation/best_program.txt +++ /dev/null @@ -1,19 +0,0 @@ -"""Sentiment analysis prompt example for OpenEvolve""" - -# EVOLVE-BLOCK-START -Please analyze the sentiment of the following sentence and provide a sentiment score using the following guidelines: -- 0.0-2.9: Strongly negative sentiment (e.g., expresses anger, sadness, or despair) -- 3.0-6.9: Neutral or mixed sentiment (e.g., factual statements, ambiguous content) -- 7.0-10.0: Strongly positive sentiment (e.g., expresses joy, satisfaction, or hope) - -"{input_text}" - -Rate the sentiment on a scale from 0.0 to 10.0: -- 0.0-2.9: Strongly negative (e.g., "This product is terrible") -- 3.0-6.9: Neutral/mixed (e.g., "The sky is blue today") -- 7.0-10.0: Strongly positive (e.g., "This is amazing!") - -Provide only the numeric score (e.g., "8.5") without any additional text: - -Score: -# EVOLVE-BLOCK-END diff --git a/examples/llm_prompt_optimazation/config.yaml b/examples/llm_prompt_optimazation/config.yaml deleted file mode 100644 index 57483c1aa..000000000 --- a/examples/llm_prompt_optimazation/config.yaml +++ /dev/null @@ -1,58 +0,0 @@ -# Configuration for prompt optimization -max_iterations: 30 -checkpoint_interval: 10 -log_level: "INFO" - -# LLM configuration -llm: - primary_model: "qwen3-32b-fp8" - api_base: "http://localhost:1234/v1" - api_key: "your_api_key_here" - temperature: 0.9 - top_p: 0.95 - max_tokens: 2048 - -# Prompt configuration -prompt: - system_message: | - You are an expert prompt engineer. Your task is to revise an existing prompt designed for large language models (LLMs), without being explicitly told what the task is. - - Your improvements should: - - * Infer the intended task and expected output format based on the structure and language of the original prompt. 
- * Clarify vague instructions, eliminate ambiguity, and improve overall interpretability for the LLM. - * Strengthen alignment between the prompt and the desired task outcome, ensuring more consistent and accurate responses. - * Improve robustness against edge cases or unclear input phrasing. - * If helpful, include formatting instructions, boundary conditions, or illustrative examples that reinforce the LLM's expected behavior. - * Avoid adding unnecessary verbosity or assumptions not grounded in the original prompt. - - You will receive a prompt that uses the following structure: - - ```python - prompt.format(input_text=some_text) - ``` - - The revised prompt should maintain the same input interface but be more effective, reliable, and production-ready for LLM use. - - Return only the improved prompt text. Do not include explanations or additional comments. Your output should be a clean, high-quality replacement that enhances clarity, consistency, and LLM performance. - - num_top_programs: 8 - use_template_stochasticity: true - -# Database configuration -database: - population_size: 40 - archive_size: 20 - num_islands: 3 - elite_selection_ratio: 0.25 - exploitation_ratio: 0.65 - -# Evaluator configuration -evaluator: - timeout: 45 - use_llm_feedback: true - -# Evolution settings -diff_based_evolution: true -allow_full_rewrites: true -diversity_threshold: 0.1 diff --git a/examples/llm_prompt_optimazation/data.json b/examples/llm_prompt_optimazation/data.json deleted file mode 100644 index 9fcdc621e..000000000 --- a/examples/llm_prompt_optimazation/data.json +++ /dev/null @@ -1,510 +0,0 @@ -{ - "book_reviews": [ - { - "id": 1, - "text": "This book was absolutely phenomenal! The writing was masterful and the plot kept me captivated from start to finish.", - "sentiment_score": 9.5 - }, - { - "id": 2, - "text": "I was really disappointed with this novel. The story dragged on and the characters felt flat and uninteresting.", - "sentiment_score": 2.5 - }, - { - "id": 3, - "text": "An incredible literary masterpiece! Brilliant prose and outstanding character development throughout.", - "sentiment_score": 9.8 - }, - { - "id": 4, - "text": "This was one of the worst books I've ever read. Terrible pacing and a completely incoherent storyline.", - "sentiment_score": 0.5 - }, - { - "id": 5, - "text": "A true work of art. Every page was beautifully crafted and emotionally resonant.", - "sentiment_score": 10.0 - }, - { - "id": 6, - "text": "Completely underwhelming. I expected so much more but was left feeling bored and frustrated.", - "sentiment_score": 2.0 - }, - { - "id": 7, - "text": "Incredible storytelling with rich world-building. This book exceeded all my expectations.", - "sentiment_score": 9.2 - }, - { - "id": 8, - "text": "A waste of time and money. Poor writing, bad plot, and overall just a terrible reading experience.", - "sentiment_score": 0.8 - }, - { - "id": 9, - "text": "Outstanding narrative and compelling characters. This book will stay with me for a long time.", - "sentiment_score": 9.0 - }, - { - "id": 10, - "text": "Disappointing and predictable. The book felt like a cheap imitation of much better novels.", - "sentiment_score": 2.8 - }, - { - "id": 11, - "text": "The book was decent. Some chapters were good, others not so much. Overall an average read.", - "sentiment_score": 5.0 - }, - { - "id": 12, - "text": "Not the best novel ever written, but certainly readable. 
Has its moments of brilliance.", - "sentiment_score": 6.5 - }, - { - "id": 13, - "text": "Pretty good book with solid writing and an interesting premise. Worth reading if you have time.", - "sentiment_score": 7.2 - }, - { - "id": 14, - "text": "The book had potential but fell short in execution. Some good ideas but poorly implemented.", - "sentiment_score": 4.0 - }, - { - "id": 15, - "text": "A truly exceptional piece of literature that pushes the boundaries of storytelling. Pure genius!", - "sentiment_score": 10.0 - }, - { - "id": 16, - "text": "Absolutely terrible in every possible way. I want my money and time back. Avoid at all costs.", - "sentiment_score": 0.0 - }, - { - "id": 17, - "text": "Surprisingly good! Exceeded my expectations with clever plot twists and strong character arcs.", - "sentiment_score": 7.8 - }, - { - "id": 18, - "text": "Mediocre at best. Nothing particularly wrong with it, but nothing special either.", - "sentiment_score": 4.5 - }, - { - "id": 19, - "text": "A delightful surprise! Charming prose and a heartwarming story that left me smiling.", - "sentiment_score": 8.5 - }, - { - "id": 20, - "text": "Painfully slow and pretentious. The author seemed more interested in showing off than telling a story.", - "sentiment_score": 1.2 - }, - { - "id": 21, - "text": "An engaging thriller that kept me on the edge of my seat. Well-crafted suspense and believable characters.", - "sentiment_score": 8.3 - }, - { - "id": 22, - "text": "The romance was sweet but the plot was lacking. Some beautiful moments but overall forgettable.", - "sentiment_score": 5.5 - }, - { - "id": 23, - "text": "Brilliant science fiction with thought-provoking themes. The author's imagination is truly remarkable.", - "sentiment_score": 9.1 - }, - { - "id": 24, - "text": "Confusing and poorly structured. I struggled to follow the narrative and lost interest quickly.", - "sentiment_score": 2.3 - }, - { - "id": 25, - "text": "A masterful blend of history and fiction. Thoroughly researched and beautifully written.", - "sentiment_score": 8.9 - }, - { - "id": 26, - "text": "The characters felt one-dimensional and the dialogue was stilted. Not the author's best work.", - "sentiment_score": 3.2 - }, - { - "id": 27, - "text": "Captivating from the first page to the last. A true page-turner with excellent pacing.", - "sentiment_score": 8.7 - }, - { - "id": 28, - "text": "Boring and repetitive. The same themes rehashed over and over without any fresh perspective.", - "sentiment_score": 2.1 - }, - { - "id": 29, - "text": "A profound exploration of human nature. Deep, meaningful, and beautifully executed.", - "sentiment_score": 9.4 - }, - { - "id": 30, - "text": "The plot had too many holes and the ending was unsatisfying. Left me with more questions than answers.", - "sentiment_score": 3.5 - }, - { - "id": 31, - "text": "Solid character development and a compelling mystery. Kept me guessing until the very end.", - "sentiment_score": 7.6 - }, - { - "id": 32, - "text": "The writing style was difficult to follow and the story seemed to go nowhere. A disappointing read.", - "sentiment_score": 2.7 - }, - { - "id": 33, - "text": "Excellent world-building and imaginative storytelling. A fantasy epic that delivers on all fronts.", - "sentiment_score": 8.8 - }, - { - "id": 34, - "text": "The humor fell flat and the characters were annoying rather than endearing. Not my cup of tea.", - "sentiment_score": 3.0 - }, - { - "id": 35, - "text": "A gripping psychological thriller with complex characters and unexpected twists. 
Highly recommended.", - "sentiment_score": 8.4 - }, - { - "id": 36, - "text": "The book was okay but nothing groundbreaking. Decent enough to finish but not memorable.", - "sentiment_score": 5.2 - }, - { - "id": 37, - "text": "Beautifully written prose that flows like poetry. A literary gem that touched my soul.", - "sentiment_score": 9.3 - }, - { - "id": 38, - "text": "Too much exposition and not enough action. The story moved at a snail's pace throughout.", - "sentiment_score": 3.8 - }, - { - "id": 39, - "text": "An inspiring tale of resilience and hope. The characters' journeys were both realistic and uplifting.", - "sentiment_score": 8.1 - }, - { - "id": 40, - "text": "Clichéd and predictable. I saw every plot twist coming from miles away. Very disappointing.", - "sentiment_score": 2.4 - }, - { - "id": 41, - "text": "A thought-provoking exploration of social issues wrapped in an entertaining narrative.", - "sentiment_score": 7.9 - }, - { - "id": 42, - "text": "The book started strong but lost momentum halfway through. The ending felt rushed and unsatisfying.", - "sentiment_score": 4.3 - }, - { - "id": 43, - "text": "Exceptional character depth and emotional resonance. A story that will haunt you long after reading.", - "sentiment_score": 9.6 - }, - { - "id": 44, - "text": "Poorly edited with numerous grammatical errors. The story couldn't overcome the technical flaws.", - "sentiment_score": 1.8 - }, - { - "id": 45, - "text": "A delightful coming-of-age story with authentic characters and relatable struggles.", - "sentiment_score": 7.4 - }, - { - "id": 46, - "text": "The premise was interesting but the execution was lacking. Felt like a missed opportunity.", - "sentiment_score": 4.1 - }, - { - "id": 47, - "text": "Absolutely riveting! Could not put it down once I started. A masterclass in suspenseful storytelling.", - "sentiment_score": 9.7 - }, - { - "id": 48, - "text": "Overly complicated and pretentious. The author tried too hard to be clever and it backfired.", - "sentiment_score": 2.2 - }, - { - "id": 49, - "text": "A heartwarming family saga with memorable characters and beautiful storytelling.", - "sentiment_score": 8.2 - }, - { - "id": 50, - "text": "The dialogue was unrealistic and the plot was full of convenient coincidences. Hard to believe.", - "sentiment_score": 3.3 - }, - { - "id": 51, - "text": "An ambitious epic that mostly succeeds in its grand vision. Some pacing issues but overall impressive.", - "sentiment_score": 7.7 - }, - { - "id": 52, - "text": "Dull and lifeless. The characters had no personality and the story lacked any real conflict.", - "sentiment_score": 2.6 - }, - { - "id": 53, - "text": "A beautiful meditation on love, loss, and redemption. Emotionally powerful and deeply moving.", - "sentiment_score": 8.9 - }, - { - "id": 54, - "text": "The book felt incomplete, like the author ran out of ideas halfway through. Very unsatisfying.", - "sentiment_score": 3.4 - }, - { - "id": 55, - "text": "Clever and witty with sharp social commentary. An entertaining read that also makes you think.", - "sentiment_score": 7.8 - }, - { - "id": 56, - "text": "Repetitive and boring. The same points made over and over without adding anything new.", - "sentiment_score": 2.9 - }, - { - "id": 57, - "text": "A stunning work of historical fiction that brings the past to life with vivid detail.", - "sentiment_score": 8.6 - }, - { - "id": 58, - "text": "The mystery was easy to solve and the red herrings were obvious. 
Not very engaging.", - "sentiment_score": 3.7 - }, - { - "id": 59, - "text": "Outstanding world-building and character development. A fantasy series starter that promises great things.", - "sentiment_score": 8.3 - }, - { - "id": 60, - "text": "Too many subplots that went nowhere. The main story got lost in all the unnecessary complexity.", - "sentiment_score": 3.6 - }, - { - "id": 61, - "text": "A perfectly crafted thriller with tight pacing and genuine surprises. Everything a good book should be.", - "sentiment_score": 9.0 - }, - { - "id": 62, - "text": "The writing was awkward and the story felt forced. Could have used more time in development.", - "sentiment_score": 2.8 - }, - { - "id": 63, - "text": "An enchanting tale that captures the magic of childhood while addressing serious themes.", - "sentiment_score": 7.9 - }, - { - "id": 64, - "text": "The book was reasonably entertaining but nothing I hadn't seen before. Average in every way.", - "sentiment_score": 5.0 - }, - { - "id": 65, - "text": "Brilliant use of multiple perspectives to tell a complex story. Masterfully woven narrative threads.", - "sentiment_score": 9.2 - }, - { - "id": 66, - "text": "The pacing was all wrong - too slow in places, too rushed in others. Needed better editing.", - "sentiment_score": 3.9 - }, - { - "id": 67, - "text": "A touching story of friendship and loyalty that resonated deeply with me. Highly recommended.", - "sentiment_score": 8.0 - }, - { - "id": 68, - "text": "Confusing timeline and unclear motivations made this a frustrating read. Lost potential.", - "sentiment_score": 3.1 - }, - { - "id": 69, - "text": "Exceptional prose and a story that stays with you. A modern classic in the making.", - "sentiment_score": 9.5 - }, - { - "id": 70, - "text": "The book tried to do too much and ended up accomplishing very little. Unfocused and scattered.", - "sentiment_score": 2.5 - }, - { - "id": 71, - "text": "A solid mystery with well-developed characters and a satisfying resolution. Good entertainment.", - "sentiment_score": 7.3 - }, - { - "id": 72, - "text": "Derivative and unoriginal. Felt like I'd read this exact story multiple times before.", - "sentiment_score": 2.0 - }, - { - "id": 73, - "text": "Beautiful, lyrical writing that creates an immersive reading experience. A true work of art.", - "sentiment_score": 8.8 - }, - { - "id": 74, - "text": "The book was readable but forgettable. Nothing particularly good or bad about it.", - "sentiment_score": 5.1 - }, - { - "id": 75, - "text": "An epic adventure with memorable characters and breathtaking scope. Fantasy at its finest.", - "sentiment_score": 9.1 - }, - { - "id": 76, - "text": "Poor character development and a weak plot made this a chore to finish. Very disappointing.", - "sentiment_score": 1.9 - }, - { - "id": 77, - "text": "A compelling drama with realistic characters facing believable challenges. Well worth reading.", - "sentiment_score": 7.6 - }, - { - "id": 78, - "text": "The book meandered without purpose and the ending came out of nowhere. Poorly structured.", - "sentiment_score": 3.2 - }, - { - "id": 79, - "text": "Absolutely captivating! A page-turner that combines great writing with an irresistible plot.", - "sentiment_score": 8.7 - }, - { - "id": 80, - "text": "Too many clichés and stereotypes. 
The author relied on tired tropes instead of original ideas.", - "sentiment_score": 2.3 - }, - { - "id": 81, - "text": "A thoughtful exploration of complex themes with nuanced characters and elegant prose.", - "sentiment_score": 8.4 - }, - { - "id": 82, - "text": "The story had potential but was ruined by poor execution and sloppy writing. What a waste.", - "sentiment_score": 2.7 - }, - { - "id": 83, - "text": "An outstanding debut novel that announces the arrival of a major new talent. Brilliant work.", - "sentiment_score": 9.3 - }, - { - "id": 84, - "text": "Bland and uninspiring. The characters were flat and the story lacked any real emotion.", - "sentiment_score": 2.1 - }, - { - "id": 85, - "text": "A gripping tale of survival and redemption that kept me reading late into the night.", - "sentiment_score": 8.1 - }, - { - "id": 86, - "text": "The book was okay for what it was, but it didn't really grab me. Decent but unremarkable.", - "sentiment_score": 4.8 - }, - { - "id": 87, - "text": "Masterful storytelling with rich imagery and profound insights into the human condition.", - "sentiment_score": 9.4 - }, - { - "id": 88, - "text": "Choppy writing and an incoherent plot made this difficult to follow and even harder to enjoy.", - "sentiment_score": 1.7 - }, - { - "id": 89, - "text": "A delightful romantic comedy with sparkling dialogue and charming characters. Pure enjoyment.", - "sentiment_score": 7.8 - }, - { - "id": 90, - "text": "The book started promisingly but quickly devolved into nonsense. Very disappointing conclusion.", - "sentiment_score": 3.0 - }, - { - "id": 91, - "text": "An intelligent and well-researched novel that educates as much as it entertains. Excellent work.", - "sentiment_score": 8.2 - }, - { - "id": 92, - "text": "Boring and predictable with cardboard characters and a paint-by-numbers plot. Skip this one.", - "sentiment_score": 1.4 - }, - { - "id": 93, - "text": "A powerful and moving story that tackles difficult subjects with sensitivity and grace.", - "sentiment_score": 8.9 - }, - { - "id": 94, - "text": "The author clearly didn't know how to end the story. The conclusion was abrupt and unsatisfying.", - "sentiment_score": 3.5 - }, - { - "id": 95, - "text": "Extraordinary! A once-in-a-generation masterpiece that redefines what literature can achieve.", - "sentiment_score": 10.0 - }, - { - "id": 96, - "text": "Terrible pacing and wooden dialogue made this one of the worst books I've read this year.", - "sentiment_score": 0.9 - }, - { - "id": 97, - "text": "A satisfying read with good character arcs and a well-constructed plot. Solid entertainment.", - "sentiment_score": 7.1 - }, - { - "id": 98, - "text": "The book felt like a rough draft that was published too early. Needed much more work.", - "sentiment_score": 2.4 - }, - { - "id": 99, - "text": "Brilliant, innovative, and utterly engaging. A book that changes how you think about storytelling.", - "sentiment_score": 9.8 - }, - { - "id": 100, - "text": "Completely unreadable. 
Poor grammar, worse plotting, and characters with no redeeming qualities.", - "sentiment_score": 0.2 - } - ], - "metadata": { - "description": "Synthesised book review sentiment analysis dataset", - "total_reviews": 100, - "sentiment_scale": "0.0 (extremely negative) to 10.0 (extremely positive)", - "created": "2025-07-01" - } -} \ No newline at end of file diff --git a/examples/llm_prompt_optimazation/evaluator.py b/examples/llm_prompt_optimazation/evaluator.py deleted file mode 100644 index 6a816f15b..000000000 --- a/examples/llm_prompt_optimazation/evaluator.py +++ /dev/null @@ -1,196 +0,0 @@ -""" -Evaluator for the prompt optimization task. -""" - -import re -import traceback -import json -import os -import time -from openai import OpenAI -from tqdm import tqdm - -TASK_MODEL_NAME = "meta-llama-3.1-8b-instruct@q8_0" -TASK_MODEL_URL = "http://localhost:1234/v1" -TASK_MODEL_API_KEY = "your_api_key_here" -SAMPLE_SIZE = 25 # Number of samples to use for evaluation -MAX_RETRIES = 3 # Number of retries for LLM calls - - -def load_dataset(data_file_path): - """ - Load the book review dataset from JSON file. - - Args: - data_file_path: Path to the JSON data file - - Returns: - List of review dictionaries with 'text' and 'label' keys - """ - try: - with open(data_file_path, 'r', encoding='utf-8') as f: - data = json.load(f) - - # Convert the data structure to match the expected format - reviews = [] - for review in data.get('book_reviews', []): - reviews.append({ - 'text': review['text'], - 'label': review['sentiment_score'] - }) - - print(f"Successfully loaded {len(reviews)} book reviews from dataset") - return reviews - - except Exception as e: - print(f"Error loading dataset from {data_file_path}: {e}") - traceback.print_exc() - return [] - -# Load dataset from JSON file -data_file_path = os.path.join(os.path.dirname(__file__), "data.json") -ds = load_dataset(data_file_path) - -if not ds: - raise ValueError("Failed to load dataset or dataset is empty") - -def evaluate(prompt_path): - """ - Evaluate the program by run the LLM model on a benchmarck dataset. 
- - Args: - program_path: Path to the program file - - Returns: - Dictionary of metrics - """ - print('-' * 80) - print("Starting evaluation...") - print('-' * 80) - try: - # Initialize OpenAI test_model with error handling - try: - test_model = OpenAI( - base_url=TASK_MODEL_URL, - api_key=TASK_MODEL_API_KEY - ) - print(f"Initialized OpenAI test_model with model: {TASK_MODEL_NAME}") - except Exception as e: - print(f"Error initializing OpenAI test_model: {e}") - test_model = None - - # Use a subset for faster evaluation during evolution (can be configured) - eval_sample_size = min(SAMPLE_SIZE, len(ds)) - ds_sample = ds[:eval_sample_size] - print(f"Using {len(ds_sample)} samples from {len(ds)} total reviews for evaluation") - - # load the prompt from the file - with open(prompt_path, "r") as f: - prompt = f.read() - - # extract the prompt between the markers - prompt_match = re.search(r"EVOLVE-BLOCK-START(.*)EVOLVE-BLOCK-END", prompt, re.DOTALL) - if prompt_match: - prompt = prompt_match.group(1).strip() - else: - raise ValueError("No EVOLVE-BLOCK found in the prompt file") - - total_score = 0.0 - total_examples = 0 - individual_scores = [] - - print(f"Evaluating with prompt:\n{prompt}\n") - for example in tqdm(ds_sample, desc="Evaluating examples", unit="example"): - total_examples += 1 - input_text = example["text"] - expected_score = example["label"] - - # Prepare the message for the LLM - messages = [ - {"role": "user", "content": prompt.format(input_text=input_text)} - ] - - # Call the LLM with retry logic - max_retries = MAX_RETRIES - for attempt in range(max_retries): - try: - response = test_model.chat.completions.create( - model=TASK_MODEL_NAME, - messages=messages - ) - break - except Exception as e: - if attempt == max_retries - 1: - print(f"Failed to get response after {max_retries} attempts: {e}") - raise e - time.sleep(1) # Brief pause before retry - - output_text = response.choices[0].message.content.strip() - - # Extract numerical score from the response - try: - # Try to extract a number between 0 and 10 - score_match = re.search(r'(\d+(?:\.\d+)?)', output_text) - if score_match: - predicted_score = float(score_match.group(1)) - - # Ensure score is within valid range (0-10) - predicted_score = max(0.0, min(10.0, predicted_score)) - else: - predicted_score = 5.0 # Default to neutral - - # Calculate accuracy based on how close the prediction is to the expected score - # Using 1 - (absolute difference / 10), so perfect match = 1.0, worst case = 0.0 - accuracy = 1.0 - (abs(predicted_score - expected_score) / 10.0) - individual_scores.append(accuracy) - total_score += accuracy - - except Exception as e: - print(f"Error processing response '{output_text}': {e}") - individual_scores.append(0.0) # Score 0 for failed predictions - # Calculate comprehensive metrics - average_score = total_score / total_examples if total_examples > 0 else 0.0 - min_score = min(individual_scores) if individual_scores else 0.0 - max_score = max(individual_scores) if individual_scores else 0.0 - - # Calculate additional metrics - std_dev = 0.0 - if len(individual_scores) > 1: - mean = sum(individual_scores) / len(individual_scores) - variance = sum((x - mean) ** 2 for x in individual_scores) / len(individual_scores) - std_dev = variance ** 0.5 - - # Count high-accuracy predictions (>0.8 accuracy) - high_accuracy_count = sum(1 for score in individual_scores if score > 0.8) - high_accuracy_rate = high_accuracy_count / len(individual_scores) if individual_scores else 0.0 - - print(f"Total examples: 
{total_examples}") - print(f"Average accuracy: {average_score:.3f}") - print(f"Standard deviation: {std_dev:.3f}") - print(f"Min accuracy: {min_score:.3f}") - print(f"Max accuracy: {max_score:.3f}") - print(f"High accuracy rate (>0.8): {high_accuracy_rate:.3f}") - print('-' * 80) - return { - "score": average_score, - "total_examples": total_examples, - "individual_scores": individual_scores, - "min_score": min_score, - "max_score": max_score, - "std_dev": std_dev, - "high_accuracy_rate": high_accuracy_rate - } - - except Exception as e: - print(f"Evaluation failed completely: {str(e)}") - traceback.print_exc() - print('-' * 80) - return { - "score": 0.0, - "total_examples": 0, - "individual_scores": [], - "min_score": 0.0, - "max_score": 0.0, - "std_dev": 0.0, - "high_accuracy_rate": 0.0 - } diff --git a/examples/llm_prompt_optimazation/initial_prompt.txt b/examples/llm_prompt_optimazation/initial_prompt.txt deleted file mode 100644 index 6f12bf353..000000000 --- a/examples/llm_prompt_optimazation/initial_prompt.txt +++ /dev/null @@ -1,11 +0,0 @@ -"""Sentiment analysis prompt example for OpenEvolve""" - -# EVOLVE-BLOCK-START -Please analyze the sentiment of the following sentence and provide a sentiment score: - -"{input_text}" - -Rate the sentiment on a scale from 0.0 to 10.0. - -Score: -# EVOLVE-BLOCK-END diff --git a/examples/llm_prompt_optimazation/requirements.txt b/examples/llm_prompt_optimazation/requirements.txt deleted file mode 100644 index 01354db40..000000000 --- a/examples/llm_prompt_optimazation/requirements.txt +++ /dev/null @@ -1,2 +0,0 @@ -openai -tqdm \ No newline at end of file diff --git a/examples/llm_prompt_optimazation/run.sh b/examples/llm_prompt_optimazation/run.sh deleted file mode 100644 index 7226a0b82..000000000 --- a/examples/llm_prompt_optimazation/run.sh +++ /dev/null @@ -1,4 +0,0 @@ - python ../../openevolve-run.py \ - examples/llm_prompt_optimazation/initial_prompt.txt \ - examples/llm_prompt_optimazation/evaluator.py \ - --config examples/llm_prompt_optimazation/config.yaml \ No newline at end of file diff --git a/examples/llm_prompt_optimization/README.md b/examples/llm_prompt_optimization/README.md new file mode 100644 index 000000000..77ff57311 --- /dev/null +++ b/examples/llm_prompt_optimization/README.md @@ -0,0 +1,254 @@ +# LLM Prompt Optimization with OpenEvolve 🚀 + +This example demonstrates how to use OpenEvolve to automatically optimize prompts for Large Language Models. The system uses evolutionary search to discover high-performing prompts by testing them against ground truth data from various datasets. + +## 🎯 Overview + +OpenEvolve automatically: +- Loads datasets from various sources +- Evolves prompts through multiple generations +- Uses cascading evaluation for efficiency +- Finds optimal prompts for your specific task and model + +**Key Feature**: The evaluator automatically matches prompt files with dataset configurations using a naming convention (`xxx_prompt.txt` → `xxx_prompt_dataset.yaml`), making it easy to manage multiple benchmark tasks. + +## 🚀 Quick Start + +### 1. Install Dependencies + +```bash +cd examples/llm_prompt_optimization +pip install -r requirements.txt +``` + +### 2. Configure Your Model + +Update `config.yaml` with your LLM settings: + +```yaml +llm: + api_base: "https://openrouter.ai/api/v1" + api_key: "your_api_key_here" + models: + - name: "google/gemini-2.5-flash" # Or any OpenAI-compatible model + weight: 1.0 +``` + +### 3. 
Set Up Your Dataset and Prompt
+
+This example uses a naming convention to match prompts with their dataset configurations:
+- For a prompt file `xxx_prompt.txt`, create a matching `xxx_prompt_dataset.yaml`
+- For example: `emotion_prompt.txt` uses `emotion_prompt_dataset.yaml`
+
+Create your dataset configuration file (e.g., `emotion_prompt_dataset.yaml`):
+
+```yaml
+# Dataset configuration
+dataset_name: "dair-ai/emotion" # Dataset identifier
+input_field: "text" # Field containing input data
+target_field: "label" # Field containing ground truth
+split: "test" # Dataset split to use
+
+# Evaluation samples
+max_samples: 200 # Number of samples to evaluate
+```
+
+Create your initial prompt file (e.g., `emotion_prompt.txt`):
+
+```
+Classify the emotion expressed in the following text.
+
+Text: "{input_text}"
+
+Emotion (0-5):
+```
+
+### 4. Run OpenEvolve
+
+Use the provided `run_evolution.sh` script to ensure the correct dataset is used:
+
+```bash
+# For emotion classification benchmark
+./run_evolution.sh emotion_prompt.txt --iterations 50
+
+# For IMDB sentiment analysis
+./run_evolution.sh initial_prompt.txt --iterations 50
+
+# With custom iterations and checkpoint
+./run_evolution.sh emotion_prompt.txt --iterations 100 --checkpoint-interval 20
+```
+
+The script automatically:
+- Sets the `OPENEVOLVE_PROMPT` environment variable so the evaluator knows which dataset to use
+- Passes all additional arguments to OpenEvolve
+- Ensures the correct `_dataset.yaml` file is matched with your prompt
+
+**Note**: If you prefer to run OpenEvolve directly, set the environment variable first:
+```bash
+export OPENEVOLVE_PROMPT=emotion_prompt.txt
+python ../../openevolve-run.py emotion_prompt.txt evaluator.py --config config.yaml --iterations 50
+```
+
+## 📊 Supported Datasets
+
+This optimizer works with a wide variety of datasets. Included examples:
+
+- **IMDB Sentiment**: `initial_prompt.txt` + `initial_prompt_dataset.yaml` (binary classification)
+- **Emotion**: `emotion_prompt.txt` + `emotion_prompt_dataset.yaml` (6-class, benchmark against DSPy)
+- **GSM8K**: `gsm8k_prompt.txt` + `gsm8k_prompt_dataset.yaml` (grade school math, DSPy achieves 97.1%)
+
+### Creating New Tasks
+
+To add a new dataset:
+1. Create `yourtask_prompt.txt` with the initial prompt
+2. Create `yourtask_prompt_dataset.yaml` with the dataset configuration
+3. Run: `./run_evolution.sh yourtask_prompt.txt --iterations 50`
+
+**Note**: If you call OpenEvolve directly without the wrapper script, the evaluator will look for a default `dataset_config.yaml` file.
+
+### Common Dataset Configurations
+
+### Sentiment Analysis
+```yaml
+dataset_name: "stanfordnlp/imdb"
+input_field: "text"
+target_field: "label" # 0 or 1
+```
+
+### Question Answering
+```yaml
+dataset_name: "squad"
+input_field: "question"
+target_field: "answers" # Dict with 'text' field
+```
+
+### Text Classification
+```yaml
+dataset_name: "ag_news"
+input_field: "text"
+target_field: "label" # 0-3 for categories
+```
+
+### Summarization
+```yaml
+dataset_name: "xsum"
+input_field: "document"
+target_field: "summary"
+```
+
+## ⚙️ How It Works
+
+### Cascading Evaluation
+
+The evaluator uses a two-stage cascading evaluation (`cascade_evaluation: true` in `config.yaml`):
+
+1. **Load Dataset**: Downloads the specified dataset
+2. **Stage 1**: Tests the prompt on roughly 10% of `max_samples` (at least 10 examples)
+3. **Stage 2**: If Stage 1 clears the cascade threshold, re-evaluates on all `max_samples` examples
+4. **Calculate Accuracy**: Compares LLM outputs to ground truth labels
+
+### Evolution Process
+
+1. OpenEvolve starts with your initial prompt
+2. The LLM generates variations based on performance feedback
+3. Each variant is tested using cascading evaluation
+4. Best performers are kept and evolved further
+5. Process continues for specified iterations
+
+### 🎭 Custom Templates for Prompt Evolution
+
+By default, OpenEvolve is designed for code evolution. To make it work properly for prompt evolution, this example includes custom templates in the `templates/` directory:
+
+- **`full_rewrite_user.txt`**: Replaces the default code evolution template with prompt-specific language
+
+This ensures the LLM understands it should evolve the prompt text itself, not generate code. The configuration automatically uses these templates via:
+
+```yaml
+prompt:
+  template_dir: "templates" # Use custom templates for prompt evolution
+```
+
+## 🎯 Configuration Options
+
+### Evaluation Configuration
+
+In `config.yaml`:
+```yaml
+evaluator:
+  parallel_evaluations: 4 # Run 4 evaluations in parallel
+  cascade_evaluation: true # Two-stage cascading evaluation
+  cascade_thresholds: [0.9] # Stage 1 must score 0.9 to proceed to Stage 2
+```
+
+### Sample Size
+
+Adjust in the matching `*_dataset.yaml` file (e.g., `emotion_prompt_dataset.yaml`):
+```yaml
+max_samples: 50 # Number of samples to evaluate
+```
+
+## 📈 Example Results
+
+Starting prompt:
+```
+Analyze the sentiment: "{input_text}"
+```
+
+Evolved prompt after 100 iterations:
+```
+Analyze the sentiment of the following text. Determine if the overall emotional tone is positive or negative.
+
+Text: "{input_text}"
+
+Response: Provide only a single digit - either 1 for positive sentiment or 0 for negative sentiment. Do not include any explanation or additional text.
+```
+
+Accuracy improvement: 72% → 94%
+
+## 🔧 Advanced Usage
+
+### Custom Evaluation Metrics
+
+The evaluator extracts predictions and compares them to ground truth. For classification tasks, it looks for:
+- Exact number matches (0, 1, etc.)
+- Keywords (positive/negative, yes/no)
+- Custom patterns you define
+
+### Different Task Types
+
+While the default setup is for classification, you can modify the evaluator for:
+- **Regression**: Compare numeric outputs
+- **Generation**: Use BLEU/ROUGE scores
+- **Extraction**: Check if key information is present
+
+## 🐛 Troubleshooting
+
+### Dataset Not Found
+- Check the exact dataset name and source
+- Some datasets require acceptance of terms
+
+### Low Stage 1 Accuracy
+- Your initial prompt may be too vague
+- Check if the output format matches expectations
+- Verify the dataset fields are correct
+
+### API Errors
+- Ensure your API key is valid
+- Check rate limits
+- Verify the model name is correct
+
+## 🚀 Tips for Best Results
+
+1. **Start Simple**: Begin with a clear, working prompt
+2. **Clear Output Format**: Specify exactly what output you expect
+3. **Appropriate Samples**: More samples = better evaluation but slower
+4. **Multiple Runs**: Evolution has randomness; try multiple runs
+5. **Monitor Progress**: Check intermediate best_program.txt files
+
+## 📚 Next Steps
+
+- Try different datasets and benchmarks
+- Experiment with different models
+- Adjust evolution parameters in config.yaml
+- Create task-specific evaluation metrics
+
+Happy prompt evolving! 
🧬✨ \ No newline at end of file diff --git a/examples/llm_prompt_optimization/config.yaml b/examples/llm_prompt_optimization/config.yaml new file mode 100644 index 000000000..da644f77c --- /dev/null +++ b/examples/llm_prompt_optimization/config.yaml @@ -0,0 +1,74 @@ +# Configuration for HuggingFace prompt optimization +# Based on optimized settings from config2.yaml + +# General settings +max_iterations: 50 +checkpoint_interval: 10 +log_level: "INFO" +diff_based_evolution: false # Full rewrite mode (best for prompt optimization) +max_code_length: 10000 +language: "text" # Explicitly set language to text for prompt evolution + +# LLM Configuration +llm: + api_base: "https://generativelanguage.googleapis.com/v1beta/openai/" + models: + - name: "gemini-2.5-flash-lite" # Using Gemini 2.5 Flash Lite + weight: 1.0 + + temperature: 0.4 # Optimal from experiments + max_tokens: 16000 # Optimal context + timeout: 150 + retries: 3 + +# Prompt Configuration - Optimal settings discovered +prompt: + template_dir: "templates" # Use custom templates for prompt evolution + num_top_programs: 3 # Best balance + num_diverse_programs: 2 # Best balance + include_artifacts: true # +20.7% improvement when enabled + + # System message for prompt evolution + system_message: | + You are an expert prompt engineer. Your task is to revise an existing prompt designed for large language models (LLMs), without being explicitly told what the task is. + + Your improvements should: + + * Infer the intended task and expected output format based on the structure and language of the original prompt. + * Clarify vague instructions, eliminate ambiguity, and improve overall interpretability for the LLM. + * Strengthen alignment between the prompt and the desired task outcome, ensuring more consistent and accurate responses. + * Improve robustness against edge cases or unclear input phrasing. + * If helpful, include formatting instructions, boundary conditions, or illustrative examples that reinforce the LLM's expected behavior. + * Avoid adding unnecessary verbosity or assumptions not grounded in the original prompt. + + The revised prompt should maintain the same input interface but be more effective, reliable, and production-ready for LLM use. + + Return only the improved prompt text. Do not include explanations or additional comments. Your output should be a clean, high-quality replacement that enhances clarity, consistency, and LLM performance. 
+ +# Database Configuration +database: + population_size: 1000 + archive_size: 100 + num_islands: 4 + + # Feature dimensions for MAP-Elites + # Using custom features returned by the evaluator + feature_dimensions: ["prompt_length", "reasoning_strategy"] + feature_bins: 10 # 10x10 grid = 100 cells + + # Selection parameters - Optimal ratios from testing + elite_selection_ratio: 0.1 # 10% elite selection + exploration_ratio: 0.3 # 30% exploration + exploitation_ratio: 0.6 # 60% exploitation + + # Migration parameters - Optimal settings + migration_interval: 10 + migration_rate: 0.1 + +# Evaluator Configuration +evaluator: + timeout: 1800 + max_retries: 3 + parallel_evaluations: 4 + cascade_evaluation: true # Two-stage cascading evaluation + cascade_thresholds: [0.9] # Stage 1 must achieve 90% accuracy to proceed to stage 2 \ No newline at end of file diff --git a/examples/llm_prompt_optimization/dataset_config.yaml b/examples/llm_prompt_optimization/dataset_config.yaml new file mode 100644 index 000000000..08ea83cbf --- /dev/null +++ b/examples/llm_prompt_optimization/dataset_config.yaml @@ -0,0 +1,9 @@ +# Default dataset configuration (fallback when not using run_evolution.sh) +# This is used when OpenEvolve is called directly without setting OPENEVOLVE_PROMPT +dataset_name: "stanfordnlp/imdb" +input_field: "text" +target_field: "label" # 0 or 1 +split: "test" + +# Evaluation samples +max_samples: 50 \ No newline at end of file diff --git a/examples/llm_prompt_optimization/emotion_prompt.txt b/examples/llm_prompt_optimization/emotion_prompt.txt new file mode 100644 index 000000000..a947907ac --- /dev/null +++ b/examples/llm_prompt_optimization/emotion_prompt.txt @@ -0,0 +1,11 @@ +Classify the emotion in the following text. Choose exactly one emotion from this list: +- sadness +- joy +- love +- anger +- fear +- surprise + +Text: "{input_text}" + +Emotion (respond with one word only): \ No newline at end of file diff --git a/examples/llm_prompt_optimization/emotion_prompt_dataset.yaml b/examples/llm_prompt_optimization/emotion_prompt_dataset.yaml new file mode 100644 index 000000000..46a2d5375 --- /dev/null +++ b/examples/llm_prompt_optimization/emotion_prompt_dataset.yaml @@ -0,0 +1,18 @@ +# HuggingFace dataset configuration for emotion classification +# This is a standard benchmark used by DSPy and others +dataset_name: "dair-ai/emotion" +input_field: "text" +target_field: "label" # 0-5: sadness, joy, love, anger, fear, surprise +split: "test" + +# Evaluation samples +max_samples: 200 # Larger sample for 6-class problem + +# Labels mapping for reference +label_names: + 0: "sadness" + 1: "joy" + 2: "love" + 3: "anger" + 4: "fear" + 5: "surprise" \ No newline at end of file diff --git a/examples/llm_prompt_optimization/evaluator.py b/examples/llm_prompt_optimization/evaluator.py new file mode 100644 index 000000000..49fad99ba --- /dev/null +++ b/examples/llm_prompt_optimization/evaluator.py @@ -0,0 +1,446 @@ +""" +Evaluator for HuggingFace dataset-based prompt optimization. 
+""" + +import re +import traceback +import yaml +import os +import time +from openai import OpenAI +from tqdm import tqdm +from datasets import load_dataset + +# Read config.yaml to get model settings +with open(os.path.join(os.path.dirname(__file__), "config.yaml"), 'r') as f: + config = yaml.safe_load(f) + +# Get model settings from config +llm_config = config.get('llm', {}) +api_base = llm_config.get('api_base', 'http://localhost:1234/v1') + +# Handle both single model and model list configurations +models = llm_config.get('models', []) +if models: + # Use first model from list + TASK_MODEL_NAME = models[0].get('name', 'default-model') +else: + # Fallback to direct model specification + TASK_MODEL_NAME = llm_config.get('primary_model', 'default-model') + +# Get evaluator settings +evaluator_config = config.get('evaluator', {}) +MAX_RETRIES = evaluator_config.get('max_retries', 3) + +# Get max_tokens from LLM config +MAX_TOKENS = llm_config.get('max_tokens', 16000) +print(f"Using max_tokens: {MAX_TOKENS}") + +# Initialize OpenAI client once for all evaluations +test_model = OpenAI(base_url=api_base) +print(f"Initialized OpenAI client with model: {TASK_MODEL_NAME}") + +# Determine which dataset to use based on the OPENEVOLVE_PROMPT environment variable +import sys +prompt_file = os.environ.get('OPENEVOLVE_PROMPT') +if not prompt_file: + # Default to a generic dataset config if not using the wrapper script + evaluator_dir = os.path.dirname(os.path.abspath(__file__)) + DATASET_CONFIG_PATH = os.path.join(evaluator_dir, 'dataset_config.yaml') + print("Warning: OPENEVOLVE_PROMPT not set. Using default dataset_config.yaml") +else: + basename = os.path.basename(prompt_file) + dataset_filename = basename.replace('_prompt.txt', '_prompt_dataset.yaml').replace('.txt', '_dataset.yaml') + evaluator_dir = os.path.dirname(os.path.abspath(__file__)) + DATASET_CONFIG_PATH = os.path.join(evaluator_dir, dataset_filename) + print(f"Dataset configuration: {dataset_filename}") + + +def calculate_prompt_features(prompt): + """ + Calculate custom features for MAP-Elites binning + + Returns: + tuple: (prompt_length, reasoning_strategy) - both in range 0-9 + """ + # Feature 1: Prompt length bin (0-9) + length = len(prompt) + if length < 100: + prompt_length = 0 # Minimal + elif length < 200: + prompt_length = 1 # Very short + elif length < 400: + prompt_length = 2 # Short + elif length < 600: + prompt_length = 3 # Medium-short + elif length < 900: + prompt_length = 4 # Medium + elif length < 1200: + prompt_length = 5 # Medium-long + elif length < 1600: + prompt_length = 6 # Long + elif length < 2000: + prompt_length = 7 # Very long + elif length < 2500: + prompt_length = 8 # Extensive + else: + prompt_length = 9 # Very extensive + + # Feature 2: Reasoning strategy (0-9) + prompt_lower = prompt.lower() + + # Check for few-shot examples + has_example = ('example' in prompt_lower or + prompt.count('####') >= 4 or + bool(re.search(r'problem:.*?solution:', prompt_lower, re.DOTALL))) + + # Check for Chain-of-Thought (CoT) indicators + has_cot = ('step by step' in prompt_lower or + 'step-by-step' in prompt_lower or + any(phrase in prompt_lower for phrase in ['think through', 'reasoning', 'explain your']) or + bool(re.search(r'(first|then|next|finally)', prompt_lower))) + + # Assign reasoning strategy bins + if has_example: + # Few-shot examples (bins 7-9) + if has_cot: + reasoning_strategy = 9 # Few-shot + CoT (most sophisticated) + elif length > 1500: + reasoning_strategy = 8 # Extensive few-shot + else: + 
reasoning_strategy = 7 # Basic few-shot + elif has_cot: + # Chain-of-thought (bins 4-6) + if 'must' in prompt_lower or 'exactly' in prompt_lower: + reasoning_strategy = 6 # Strict CoT + elif length > 500: + reasoning_strategy = 5 # Detailed CoT + else: + reasoning_strategy = 4 # Basic CoT + else: + # Basic prompts (bins 0-3) + if length < 100: + reasoning_strategy = 0 # Minimal + elif 'solve' in prompt_lower or 'calculate' in prompt_lower: + reasoning_strategy = 2 # Direct instruction + else: + reasoning_strategy = 1 # Simple prompt + + return prompt_length, reasoning_strategy + + +def load_prompt_config(prompt_path): + """Load the prompt from text file and dataset config from matching _dataset.yaml file.""" + # Load prompt from text file + with open(prompt_path, 'r') as f: + prompt = f.read().strip() + + # Load the configuration (already determined from environment variable) + if not os.path.exists(DATASET_CONFIG_PATH): + raise FileNotFoundError(f"Dataset configuration not found: {DATASET_CONFIG_PATH}") + + with open(DATASET_CONFIG_PATH, 'r') as f: + config = yaml.safe_load(f) + + return config, prompt + +def load_hf_dataset(config): + """Load HuggingFace dataset based on configuration.""" + dataset_name = config['dataset_name'] + dataset_config = config.get('dataset_config', None) + split = config.get('split', 'test') + + print(f"Loading dataset: {dataset_name}") + + try: + # Try to load the specified split + if dataset_config: + dataset = load_dataset(dataset_name, dataset_config, split=split) + else: + dataset = load_dataset(dataset_name, split=split) + except: + # Fallback to train split if test is not available + print(f"Split '{split}' not found, falling back to 'train'") + if dataset_config: + dataset = load_dataset(dataset_name, dataset_config, split='train') + else: + dataset = load_dataset(dataset_name, split='train') + + print(f"Dataset loaded with {len(dataset)} examples") + return dataset + +def evaluate_prompt(prompt, dataset, config, num_samples): + """Evaluate a prompt on a subset of the dataset.""" + input_field = config['input_field'] + target_field = config['target_field'] + + # Check dataset type + dataset_name = config.get('dataset_name', '').lower() + is_emotion = 'emotion' in dataset_name + is_gsm8k = 'gsm8k' in dataset_name + + # Sample from dataset + samples = dataset.select(range(min(num_samples, len(dataset)))) + + correct = 0 + total = 0 + + for example in tqdm(samples, desc=f"Evaluating {num_samples} samples"): + input_text = example[input_field] + expected = example[target_field] + + # Prepare the message for the LLM + messages = [ + {"role": "user", "content": prompt.format(input_text=input_text)} + ] + + # Call the LLM with retry logic + for attempt in range(MAX_RETRIES): + try: + # Use max_tokens from config + response = test_model.chat.completions.create( + model=TASK_MODEL_NAME, + messages=messages, + temperature=0.1, # Low temperature for consistent results + max_tokens=MAX_TOKENS + ) + break + except Exception as e: + if attempt == MAX_RETRIES - 1: + print(f"Failed to get response after {MAX_RETRIES} attempts: {e}") + raise e + time.sleep(1) + + # Handle potential None response + if not response: + print(f"Warning: No response object from LLM") + total += 1 # Count as incorrect + continue + + if not response.choices: + print(f"Warning: No choices in response from LLM") + total += 1 # Count as incorrect + continue + + if not response.choices[0].message: + print(f"Warning: No message in response choice") + total += 1 # Count as incorrect + continue + + 
output_text = response.choices[0].message.content + if output_text is None: + print(f"Warning: None content in LLM response") + print(f"Full response: {response}") + total += 1 # Count as incorrect + continue + + output_text = output_text.strip() + + # Extract prediction from output + try: + if is_gsm8k: + # For GSM8K, extract the numeric answer after #### + # First, extract the expected answer from the ground truth + expected_answer = expected.split('####')[-1].strip() + try: + expected_number = float(expected_answer.replace(',', '')) + except: + print(f"Warning: Could not parse expected answer: {expected_answer}") + total += 1 + continue + + # Extract prediction from model output + prediction = None + if '####' in output_text: + predicted_answer = output_text.split('####')[-1].strip() + # Extract just the number, removing any extra text like $ signs + import re + numbers = re.findall(r'-?\$?[\d,]+\.?\d*', predicted_answer) + if numbers: + try: + # Remove $ and , from the number + number_str = numbers[0].replace('$', '').replace(',', '') + prediction = float(number_str) + except: + pass + + # If we found a prediction, check if it matches + if prediction is not None: + # Check if answers match (with small tolerance for floats) + if abs(prediction - expected_number) < 0.001: + correct += 1 + + total += 1 + continue # Skip the general case to avoid double counting + + elif is_emotion: + # For emotion classification (0-5) + numbers = re.findall(r'\b[0-5]\b', output_text) + if numbers: + prediction = int(numbers[-1]) # Use the last number found + else: + # Try to infer from emotion keywords + output_lower = output_text.lower() + emotion_map = { + 'sadness': 0, 'sad': 0, + 'joy': 1, 'happy': 1, 'happiness': 1, + 'love': 2, + 'anger': 3, 'angry': 3, + 'fear': 4, 'afraid': 4, 'scared': 4, + 'surprise': 5, 'surprised': 5 + } + prediction = -1 + for emotion, label in emotion_map.items(): + if emotion in output_lower: + prediction = label + break + else: + # For sentiment classification (0-1) + numbers = re.findall(r'\b[01]\b', output_text) + if numbers: + prediction = int(numbers[-1]) # Use the last number found + else: + # Try to infer from keywords + output_lower = output_text.lower() + if 'positive' in output_lower: + prediction = 1 + elif 'negative' in output_lower: + prediction = 0 + else: + prediction = -1 # Invalid prediction + + if prediction == expected: + correct += 1 + + total += 1 + + except Exception as e: + print(f"Error parsing response '{output_text}': {e}") + total += 1 # Count as incorrect + + accuracy = correct / total if total > 0 else 0.0 + return accuracy, correct, total + +def evaluate_stage1(prompt_path): + """ + Stage 1 evaluation: Quick evaluation with 10% of samples + + Args: + prompt_path: Path to the prompt file + + Returns: + Dictionary with combined_score metric + """ + print('-' * 80) + print("Starting Stage 1 evaluation...") + print('-' * 80) + + try: + # Load prompt configuration + config, prompt = load_prompt_config(prompt_path) + print(f"Loaded prompt configuration") + + # Load dataset + dataset = load_hf_dataset(config) + + # Get number of samples from config + num_samples = config.get('max_samples', 50) + stage1_samples = max(10, int(num_samples * 0.1)) + + print(f"Stage 1: Evaluating {stage1_samples} samples...") + + # Run evaluation + accuracy, correct, total = evaluate_prompt( + prompt, dataset, config, stage1_samples + ) + + print(f"Stage 1 accuracy: {accuracy:.3f} ({correct}/{total})") + print('-' * 80) + + # Calculate custom features + prompt_length, 
reasoning_strategy = calculate_prompt_features(prompt) + print(f"Prompt features - Length bin: {prompt_length}, Reasoning bin: {reasoning_strategy}") + + return { + "combined_score": accuracy, + "prompt_length": prompt_length, + "reasoning_strategy": reasoning_strategy + } + + except Exception as e: + print(f"Stage 1 evaluation failed: {str(e)}") + traceback.print_exc() + print('-' * 80) + return { + "combined_score": 0.0, + "error": str(e) + } + + +def evaluate_stage2(prompt_path): + """ + Stage 2 evaluation: Full evaluation with all samples + + Args: + prompt_path: Path to the prompt file + + Returns: + Dictionary with combined_score metric + """ + print('-' * 80) + print("Starting Stage 2 evaluation...") + print('-' * 80) + + try: + # Load prompt configuration + config, prompt = load_prompt_config(prompt_path) + print(f"Loaded prompt configuration") + + # Load dataset + dataset = load_hf_dataset(config) + + # Get number of samples from config + num_samples = config.get('max_samples', 50) + + print(f"Stage 2: Evaluating all {num_samples} samples...") + + # Run evaluation + accuracy, correct, total = evaluate_prompt( + prompt, dataset, config, num_samples + ) + + print(f"Stage 2 accuracy: {accuracy:.3f} ({correct}/{total})") + print('-' * 80) + + # Calculate custom features + prompt_length, reasoning_strategy = calculate_prompt_features(prompt) + print(f"Prompt features - Length bin: {prompt_length}, Reasoning bin: {reasoning_strategy}") + + return { + "combined_score": accuracy, + "prompt_length": prompt_length, + "reasoning_strategy": reasoning_strategy + } + + except Exception as e: + print(f"Stage 2 evaluation failed: {str(e)}") + traceback.print_exc() + print('-' * 80) + return { + "combined_score": 0.0, + "error": str(e) + } + + +def evaluate(prompt_path): + """ + Main evaluation function - for backwards compatibility + Calls evaluate_stage2 for full evaluation + + Args: + prompt_path: Path to the prompt file + + Returns: + Dictionary with combined_score metric + """ + return evaluate_stage2(prompt_path) \ No newline at end of file diff --git a/examples/llm_prompt_optimization/gsm8k_prompt.txt b/examples/llm_prompt_optimization/gsm8k_prompt.txt new file mode 100644 index 000000000..476efed05 --- /dev/null +++ b/examples/llm_prompt_optimization/gsm8k_prompt.txt @@ -0,0 +1,5 @@ +Solve the following grade school math problem step by step. + +Problem: {input_text} + +Show your work and reasoning for each step. After solving, provide your final numeric answer after "####". \ No newline at end of file diff --git a/examples/llm_prompt_optimization/gsm8k_prompt_dataset.yaml b/examples/llm_prompt_optimization/gsm8k_prompt_dataset.yaml new file mode 100644 index 000000000..db28e49eb --- /dev/null +++ b/examples/llm_prompt_optimization/gsm8k_prompt_dataset.yaml @@ -0,0 +1,14 @@ +# HuggingFace dataset configuration for GSM8K (Grade School Math) +# DSPy achieved 97.1% accuracy with GPT-4 on this benchmark +dataset_name: "openai/gsm8k" +dataset_config: "main" # GSM8K requires config name +input_field: "question" +target_field: "answer" # Contains step-by-step solution ending with #### followed by the numeric answer +split: "test" + +# Evaluation samples +max_samples: 200 # Start with subset, full test set has 1,319 problems + +# Note: The answer field contains the full solution with the format: +# "Step 1 explanation... Step 2... 
#### numeric_answer" +# The evaluator will need to extract the number after #### \ No newline at end of file diff --git a/examples/llm_prompt_optimization/initial_prompt.txt b/examples/llm_prompt_optimization/initial_prompt.txt new file mode 100644 index 000000000..ab329a63f --- /dev/null +++ b/examples/llm_prompt_optimization/initial_prompt.txt @@ -0,0 +1,5 @@ +Analyze the sentiment of the following text and classify it as positive (1) or negative (0). + +Text: "{input_text}" + +Label: \ No newline at end of file diff --git a/examples/llm_prompt_optimization/initial_prompt_dataset.yaml b/examples/llm_prompt_optimization/initial_prompt_dataset.yaml new file mode 100644 index 000000000..8bf503ae3 --- /dev/null +++ b/examples/llm_prompt_optimization/initial_prompt_dataset.yaml @@ -0,0 +1,8 @@ +# HuggingFace dataset configuration +dataset_name: "stanfordnlp/imdb" +input_field: "text" +target_field: "label" +split: "test" # Will fallback to train if not available + +# Evaluation samples +max_samples: 50 # Number of samples to evaluate \ No newline at end of file diff --git a/examples/llm_prompt_optimization/requirements.txt b/examples/llm_prompt_optimization/requirements.txt new file mode 100644 index 000000000..b72f54907 --- /dev/null +++ b/examples/llm_prompt_optimization/requirements.txt @@ -0,0 +1,4 @@ +openai +tqdm +datasets +pyyaml \ No newline at end of file diff --git a/examples/llm_prompt_optimization/run_evolution.sh b/examples/llm_prompt_optimization/run_evolution.sh new file mode 100644 index 000000000..2d7daa4c6 --- /dev/null +++ b/examples/llm_prompt_optimization/run_evolution.sh @@ -0,0 +1,17 @@ +#!/bin/bash +# Wrapper script to run OpenEvolve with the correct dataset + +if [ $# -lt 1 ]; then + echo "Usage: $0 [additional_args...]" + echo "Example: $0 emotion_prompt.txt --iterations 50" + exit 1 +fi + +PROMPT_FILE=$1 +shift # Remove first argument + +# Set the environment variable for the evaluator +export OPENEVOLVE_PROMPT=$PROMPT_FILE + +# Run OpenEvolve +python ../../openevolve-run.py "$PROMPT_FILE" evaluator.py --config config.yaml "$@" \ No newline at end of file diff --git a/examples/llm_prompt_optimization/templates/full_rewrite_user.txt b/examples/llm_prompt_optimization/templates/full_rewrite_user.txt new file mode 100644 index 000000000..216844a48 --- /dev/null +++ b/examples/llm_prompt_optimization/templates/full_rewrite_user.txt @@ -0,0 +1,20 @@ +# Current Prompt Information +- Current performance metrics: {metrics} +- Areas identified for improvement: {improvement_areas} + +{artifacts} + +# Prompt Evolution History +{evolution_history} + +# Current Prompt +{current_program} + +# Task +Rewrite the prompt to improve its performance on the specified metrics. +Provide the complete new prompt text. + +IMPORTANT: Make sure your rewritten prompt maintains the same input placeholder ({{input_text}}) +but with improved instructions for better LLM performance. 
+ +Your improved prompt: \ No newline at end of file diff --git a/openevolve/database.py b/openevolve/database.py index 0d2ba6fd4..740768839 100644 --- a/openevolve/database.py +++ b/openevolve/database.py @@ -8,6 +8,7 @@ import os import random import time +import uuid from dataclasses import asdict, dataclass, field, fields # FileLock removed - no longer needed with threaded parallel processing @@ -998,12 +999,29 @@ def _sample_exploration_parent(self) -> Program: if not current_island_programs: # If current island is empty, initialize with best program or random program if self.best_program_id and self.best_program_id in self.programs: - # Clone best program to current island + # Create a copy of best program for the empty island (don't reuse same ID) best_program = self.programs[self.best_program_id] - self.islands[self.current_island].add(self.best_program_id) - best_program.metadata["island"] = self.current_island - logger.debug(f"Initialized empty island {self.current_island} with best program") - return best_program + copy_program = Program( + id=str(uuid.uuid4()), + code=best_program.code, + language=best_program.language, + parent_id=best_program.id, + generation=best_program.generation, + timestamp=time.time(), + iteration_found=self.last_iteration, + metrics=best_program.metrics.copy(), + complexity=best_program.complexity, + diversity=best_program.diversity, + metadata={"island": self.current_island}, + artifacts_json=best_program.artifacts_json, + artifact_dir=best_program.artifact_dir, + ) + self.programs[copy_program.id] = copy_program + self.islands[self.current_island].add(copy_program.id) + logger.debug( + f"Initialized empty island {self.current_island} with copy of best program" + ) + return copy_program else: # Use any available program return next(iter(self.programs.values())) @@ -1026,10 +1044,29 @@ def _sample_exploration_parent(self) -> Program: f"Island {self.current_island} has no valid programs after cleanup, reinitializing" ) if self.best_program_id and self.best_program_id in self.programs: + # Create a copy of best program for the empty island (don't reuse same ID) best_program = self.programs[self.best_program_id] - self.islands[self.current_island].add(self.best_program_id) - best_program.metadata["island"] = self.current_island - return best_program + copy_program = Program( + id=str(uuid.uuid4()), + code=best_program.code, + language=best_program.language, + parent_id=best_program.id, + generation=best_program.generation, + timestamp=time.time(), + iteration_found=self.last_iteration, + metrics=best_program.metrics.copy(), + complexity=best_program.complexity, + diversity=best_program.diversity, + metadata={"island": self.current_island}, + artifacts_json=best_program.artifacts_json, + artifact_dir=best_program.artifact_dir, + ) + self.programs[copy_program.id] = copy_program + self.islands[self.current_island].add(copy_program.id) + logger.debug( + f"Reinitialized empty island {self.current_island} with copy of best program" + ) + return copy_program else: return next(iter(self.programs.values())) @@ -1347,6 +1384,26 @@ def migrate_programs(self) -> None: target_islands = [(i + 1) % len(self.islands), (i - 1) % len(self.islands)] for migrant in migrants: + # Prevent re-migration of already migrated programs to avoid exponential duplication. 
+ # Analysis of actual evolution runs shows this causes severe issues: + # - Program cb5d07f2 had 183 descendant copies by iteration 850 + # - Program 5645fbd2 had 31 descendant copies + # - IDs grow exponentially: program_migrant_2_migrant_3_migrant_4_migrant_0... + # + # This is particularly problematic for OpenEvolve's MAP-Elites + Island hybrid architecture: + # 1. All copies have identical code → same complexity/diversity/performance scores + # 2. They all map to the SAME MAP-Elites cell → only 1 survives, rest discarded + # 3. Wastes computation evaluating hundreds of identical programs + # 4. Reduces actual diversity as islands fill with duplicates + # + # By preventing already-migrated programs from migrating again, we ensure: + # - Each program migrates at most once per lineage + # - True diversity is maintained between islands + # - Computational resources aren't wasted on duplicates + # - Aligns with MAP-Elites' one-program-per-cell principle + if migrant.metadata.get("migrant", False): + continue + for target_island in target_islands: # Create a copy for migration (to avoid removing from source) migrant_copy = Program( diff --git a/openevolve/evaluator.py b/openevolve/evaluator.py index 25d880987..80bcac333 100644 --- a/openevolve/evaluator.py +++ b/openevolve/evaluator.py @@ -644,6 +644,9 @@ def _create_cascade_error_context(self, stage: str, error: Exception) -> dict: def _passes_threshold(self, metrics: Dict[str, float], threshold: float) -> bool: """ Check if metrics pass a threshold + + Uses 'combined_score' if available (for consistency with evolution), + otherwise falls back to averaging all numeric metrics except 'error' Args: metrics: Dictionary of metric name to score @@ -655,7 +658,14 @@ def _passes_threshold(self, metrics: Dict[str, float], threshold: float) -> bool if not metrics: return False - # Calculate average score, skipping non-numeric values and 'error' key + # Use combined_score if available - this is what evolution uses + if "combined_score" in metrics: + score = metrics.get("combined_score") + if isinstance(score, (int, float)): + return float(score) >= threshold + + # Fallback: average all numeric metrics except 'error' + # This maintains backward compatibility valid_metrics = [] for name, value in metrics.items(): # Skip 'error' keys and ensure values are numeric diff --git a/tests/test_database.py b/tests/test_database.py index 0d17f8961..cd11a7e26 100644 --- a/tests/test_database.py +++ b/tests/test_database.py @@ -3,6 +3,7 @@ """ import unittest +import uuid from openevolve.config import Config from openevolve.database import Program, ProgramDatabase @@ -457,6 +458,183 @@ def test_diversity_feature_integration(self): self.assertGreaterEqual(coord, 0) self.assertLess(coord, self.db.feature_bins) + def test_migration_prevents_re_migration(self): + """Test that programs marked as migrants don't migrate again""" + # Create database with multiple islands + config = Config() + config.database.in_memory = True + config.database.num_islands = 3 + config.database.migration_interval = 1 # Migrate every generation + multi_db = ProgramDatabase(config.database) + + # Add programs to each island (avoid "migrant" in original IDs) + for i in range(3): + program = Program( + id=f"test_prog_{i}", + code=f"def test_{i}(): return {i}", + language="python", + metrics={"score": 0.5 + i * 0.1}, + ) + multi_db.add(program, target_island=i) + + # Manually mark one as a migrant + migrant_program = multi_db.get("test_prog_0") + migrant_program.metadata["migrant"] = True + + # 
Store original ID + original_id = migrant_program.id + + # Count initial programs with "_migrant_" pattern (created by migration) + initial_migrant_count = sum(1 for pid in multi_db.programs if "_migrant_" in pid) + self.assertEqual(initial_migrant_count, 0) # Should be none initially + + # Run migration + multi_db.island_generations[0] = config.database.migration_interval + multi_db.island_generations[1] = config.database.migration_interval + multi_db.island_generations[2] = config.database.migration_interval + multi_db.migrate_programs() + + # Check that the migrant program wasn't re-migrated + # It should still exist with the same ID (not a new migrant ID) + still_exists = multi_db.get(original_id) + self.assertIsNotNone(still_exists) + + # Count new programs created by migration (identified by "_migrant_" pattern) + new_migrant_ids = [pid for pid in multi_db.programs if "_migrant_" in pid] + + # Each non-migrant program (2 of them) migrates to 2 adjacent islands + # So we expect 2 * 2 = 4 new migrant programs + # The already-marked migrant (test_prog_0) should NOT create any new copies + self.assertEqual(len(new_migrant_ids), 4) + + # Verify the already-migrant program didn't create new copies + migrant_descendants = [pid for pid in new_migrant_ids if original_id in pid] + self.assertEqual(len(migrant_descendants), 0, + f"Program {original_id} should not have created migrant copies") + + def test_empty_island_initialization_creates_copies(self): + """Test that empty islands are initialized with copies, not shared references""" + # Create database with multiple islands + config = Config() + config.database.in_memory = True + config.database.num_islands = 3 + # Force exploration mode to test empty island handling + config.database.exploration_ratio = 1.0 + config.database.exploitation_ratio = 0.0 + multi_db = ProgramDatabase(config.database) + + # Add a single program to island 1 + program = Program( + id="original_program", + code="def original(): return 42", + language="python", + metrics={"score": 0.9, "combined_score": 0.9}, + ) + multi_db.add(program, target_island=1) + + # Make it the best program + multi_db.best_program_id = "original_program" + + # Switch to empty island 0 and sample + multi_db.set_current_island(0) + sampled_parent, _ = multi_db.sample() + + # The sampled program should be a copy, not the original + self.assertNotEqual(sampled_parent.id, "original_program") + self.assertEqual(sampled_parent.code, program.code) # Same code + self.assertEqual(sampled_parent.parent_id, "original_program") # Parent is the original + + # Check island membership + self.assertIn("original_program", multi_db.islands[1]) + self.assertNotIn("original_program", multi_db.islands[0]) + self.assertIn(sampled_parent.id, multi_db.islands[0]) + + # Run validation - should not raise any errors + multi_db._validate_migration_results() + + def test_no_program_assigned_to_multiple_islands(self): + """Test that programs are never assigned to multiple islands""" + # Create database with multiple islands + config = Config() + config.database.in_memory = True + config.database.num_islands = 4 + multi_db = ProgramDatabase(config.database) + + # Add programs to different islands + program_ids = [] + for i in range(4): + program = Program( + id=f"island_test_{i}", + code=f"def test_{i}(): return {i}", + language="python", + metrics={"score": 0.5 + i * 0.1, "combined_score": 0.5 + i * 0.1}, + ) + multi_db.add(program, target_island=i) + program_ids.append(program.id) + + # Make the best program from island 
3 + multi_db.best_program_id = "island_test_3" + + # Sample from empty islands - this should create copies + for empty_island in range(4): + if len(multi_db.islands[empty_island]) == 0: + multi_db.set_current_island(empty_island) + parent, _ = multi_db.sample() + + # Check that no program ID appears in multiple islands + all_island_programs = {} + for island_idx, island_programs in enumerate(multi_db.islands): + for program_id in island_programs: + if program_id in all_island_programs: + self.fail( + f"Program {program_id} found in both island {all_island_programs[program_id]} " + f"and island {island_idx}" + ) + all_island_programs[program_id] = island_idx + + # Run validation - should not raise any errors + multi_db._validate_migration_results() + + def test_migration_validation_passes(self): + """Test that migration validation passes after our fixes""" + # Create database with multiple islands + config = Config() + config.database.in_memory = True + config.database.num_islands = 3 + config.database.migration_interval = 1 + multi_db = ProgramDatabase(config.database) + + # Add programs and run several migration cycles + for i in range(6): + program = Program( + id=f"test_program_{i}", + code=f"def test_{i}(): return {i * 2}", + language="python", + metrics={"score": 0.4 + i * 0.1, "combined_score": 0.4 + i * 0.1}, + ) + multi_db.add(program, target_island=i % 3) + + # Run multiple migration cycles + for cycle in range(3): + # Increment generations to trigger migration + for island in range(3): + multi_db.island_generations[island] += 1 + + # Migrate programs + multi_db.migrate_programs() + + # Validation should pass without warnings + multi_db._validate_migration_results() + + # Verify no program has exponential ID growth + for program_id in multi_db.programs: + # Count occurrences of "migrant" in ID + migrant_count = program_id.count("migrant") + self.assertLessEqual( + migrant_count, 1, + f"Program ID {program_id} has been migrated multiple times" + ) + if __name__ == "__main__": unittest.main()
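
For reference, the cascade-threshold change in `openevolve/evaluator.py` interacts directly with the `combined_score` returned by the example evaluator above: when `combined_score` is present, it alone decides whether a program advances to the next evaluation stage, and the old average-of-metrics behaviour only applies as a fallback. The snippet below is a minimal standalone sketch of that rule, not the library code itself; the function name `passes_threshold` and the metric values are illustrative.

```python
# Minimal standalone sketch of the updated cascade-threshold rule; the real
# implementation lives in Evaluator._passes_threshold.
from typing import Any, Dict


def passes_threshold(metrics: Dict[str, Any], threshold: float) -> bool:
    if not metrics:
        return False
    # Prefer the same score that drives evolution.
    score = metrics.get("combined_score")
    if isinstance(score, (int, float)):
        return float(score) >= threshold
    # Fallback: average all numeric metrics except 'error'.
    numeric = [
        float(v)
        for k, v in metrics.items()
        if k != "error" and isinstance(v, (int, float))
    ]
    return bool(numeric) and (sum(numeric) / len(numeric)) >= threshold


# Stage 1 above returns feature bins alongside the accuracy; with the
# combined_score shortcut, only the accuracy decides promotion to stage 2.
assert passes_threshold({"combined_score": 0.62, "prompt_length": 3, "reasoning_strategy": 7}, 0.5)
assert not passes_threshold({"combined_score": 0.40, "prompt_length": 9, "reasoning_strategy": 2}, 0.5)
```

Without the `combined_score` preference, the second dictionary would pass almost any threshold because the integer feature bins inflate the mean ((0.40 + 9 + 2) / 3 ≈ 3.8), which is exactly the inconsistency this change removes. The new database tests can be exercised directly, e.g. `python tests/test_database.py` from the repository root with the package importable, thanks to the `unittest.main()` entry point.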