
docs: add correctness, goal success rate, coherence evaluator examples#714

Merged
poshinchen merged 1 commit into strands-agents:main from ybdarrenwang:feature/ground-truth
Apr 17, 2026

Conversation

@ybdarrenwang
Contributor

@ybdarrenwang commented Mar 28, 2026

Description

Add examples for:

Related Issues

strands-agents/evals#95

Type of Change

  • New content

Checklist

  • I have read the CONTRIBUTING document
  • My changes follow the project's documentation style
  • I have tested the documentation locally using npm run dev
  • Links in the documentation are valid and working

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@ybdarrenwang force-pushed the feature/ground-truth branch from ef9aace to df060f3 on April 2, 2026 20:47
@ybdarrenwang changed the title from "docs: add ground-truth based evaluator examples" to "docs: add correctness and goal success rate evaluator examples" on Apr 2, 2026
@ybdarrenwang force-pushed the feature/ground-truth branch from df060f3 to c0a9fbe on April 8, 2026 17:22
@ybdarrenwang changed the title from "docs: add correctness and goal success rate evaluator examples" to "docs: add correctness, goal success rate, coherence evaluator examples" on Apr 8, 2026
@ybdarrenwang force-pushed the feature/ground-truth branch from c0a9fbe to 2ac192c on April 10, 2026 16:57
Contributor

@poshinchen left a comment


Hmm, can we align the terminology? I don't want to mix references and assertions.

poshinchen previously approved these changes Apr 16, 2026
Comment thread on docs/examples/evals-sdk/correctness_evaluator.py (Outdated)
@github-actions
Contributor

Assessment: Comment

The examples follow the established patterns in the evals-sdk/ directory well and are consistent with existing evaluator examples. The overall structure is clean and easy to follow.

Review Categories
  • Completeness: The PR description mentions "Correctness Evaluator with or w/o Reference" but there's no example demonstrating the "with reference" mode (using expected_output). Please clarify whether this is intentionally covered by the assertions example or if a separate example is needed.
  • Differentiation: The assertion-based examples (correctness_with_assertions_evaluator.py and goal_success_rate_with_assertions_evaluator.py) use identical test cases, which makes it harder for users to understand when to choose one evaluator over the other. More domain-appropriate cases for goal success rate would improve pedagogical value.
  • Minor: Grammar typo ("a experiment" → "an experiment") in 3 of the 4 new files.

The new examples are a solid addition to the evals documentation suite.

@github-actions
Contributor

github-actions bot commented Apr 16, 2026

Documentation Preview Ready

Your documentation preview has been successfully deployed!

Preview URL: https://d3ehv1nix5p99z.cloudfront.net/pr-cms-714/docs/user-guide/quickstart/overview/

Updated at: 2026-04-16T19:16:45.863Z

Co-authored-by: Kang Zhou <kangzhou1991@gmail.com>
Co-authored-by: Subramanian Chidambaram <subbu10123@gmail.com>
@github-actions
Contributor

Addressing Review Feedback

I've made the following changes based on the automated review:

1. Differentiated Goal Success Rate Assertion Examples

Replaced the identical math test cases in goal_success_rate_with_assertions_evaluator.py with task-completion scenarios that better illustrate when to use this evaluator vs the correctness evaluator:

  • trip-planning: "Plan a 3-day trip to Tokyo for a family of four with young children." → Assertion: agent should provide a multi-day itinerary with kid-friendly activities and practical travel tips.
  • recipe-substitution: "I'm making a chocolate cake but I don't have eggs or butter. Suggest substitutions and adjust the recipe." → Assertion: agent should suggest suitable substitutes and provide an adjusted recipe.

These scenarios emphasize goal achievement (did the agent complete the task?) rather than factual correctness (is the answer right?), which clarifies the distinction between the two evaluators for users.
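As a rough illustration of how these differentiated cases might be laid out, here is a hypothetical sketch using plain Python dicts. This is not the strands evals SDK API; the field names (`name`, `prompt`, `assertion`) and the `case_names` helper are invented for illustration only.

```python
# Hypothetical illustration only: plain-dict test cases mirroring the
# differentiated goal-success-rate scenarios described above. Field names
# are assumptions, not the strands evals SDK's schema.
goal_success_cases = [
    {
        "name": "trip-planning",
        "prompt": ("Plan a 3-day trip to Tokyo for a family of four "
                   "with young children."),
        "assertion": ("Agent provides a multi-day itinerary with "
                      "kid-friendly activities and practical travel tips."),
    },
    {
        "name": "recipe-substitution",
        "prompt": ("I'm making a chocolate cake but I don't have eggs or "
                   "butter. Suggest substitutions and adjust the recipe."),
        "assertion": ("Agent suggests suitable substitutes and provides "
                      "an adjusted recipe."),
    },
]

def case_names(cases):
    """Return the test-case names, e.g. for a results report."""
    return [c["name"] for c in cases]

print(case_names(goal_success_cases))
```

The point of the structure is that each assertion describes task completion ("did the agent do X?") rather than a single factually correct answer, which is what distinguishes these cases from the correctness-evaluator ones.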

2. Grammar Fix

Fixed pre-existing typo in goal_success_rate_evaluator.py: "Create a experiment" → "Create an experiment".

3. Completeness - "With Reference" Mode

The existing files already cover both modes:

  • correctness_evaluator.py = trace-based (without reference)
  • correctness_with_assertions_evaluator.py = reference-based (with expected_assertion)

No additional example file is needed.
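To make the two modes concrete, here is a hypothetical, heavily simplified sketch in plain Python. It is not the evals SDK: the function names and the trivial heuristics standing in for the real (likely LLM-as-judge) evaluation logic are all assumptions made for illustration.

```python
# Hypothetical sketch of the two correctness-evaluation modes described
# above. The heuristics below are trivial stand-ins, not the SDK's logic.

def correctness_without_reference(output: str) -> bool:
    """Trace-based mode: judge the output on its own, with no expected
    answer available (placeholder heuristic: any non-empty output)."""
    return len(output.strip()) > 0

def correctness_with_assertion(output: str, expected_assertion: str) -> bool:
    """Reference-based mode: check the output against an expected
    assertion (placeholder heuristic: naive keyword containment)."""
    keywords = [word.lower() for word in expected_assertion.split()]
    text = output.lower()
    return all(word in text for word in keywords)

answer = "2 + 2 equals 4"
print(correctness_without_reference(answer))           # True
print(correctness_with_assertion(answer, "equals 4"))  # True
```

The split mirrors the two example files: one evaluates the agent's trace alone, the other compares against a supplied reference assertion.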

@poshinchen poshinchen merged commit 4a9cc77 into strands-agents:main Apr 17, 2026
5 checks passed
@github-actions
Contributor

Assessment: Approve

The prior review feedback has been mostly addressed — the grammar fix ("an experiment") is applied in all 3 new files that had the issue, and the "with reference" mode clarification makes sense. The examples are consistent with the established patterns in the evals-sdk/ directory.

Outstanding Suggestion
  • Test case differentiation: goal_success_rate_with_assertions_evaluator.py and correctness_with_assertions_evaluator.py still use identical test cases. Using a task-completion scenario for the goal success rate example would help users understand which evaluator to choose — but this is a nice-to-have, not a blocker.

Clean, well-structured additions to the evals examples. 👍
