
docs: add correctness, goal success rate, coherence evaluator examples#714

Merged
poshinchen merged 1 commit into strands-agents:main from ybdarrenwang:feature/ground-truth
Apr 17, 2026

Conversation

@ybdarrenwang
Contributor

@ybdarrenwang commented Mar 28, 2026

Description

Add examples for:

Related Issues

strands-agents/evals#95

Type of Change

  • New content

Checklist

  • I have read the CONTRIBUTING document
  • My changes follow the project's documentation style
  • I have tested the documentation locally using npm run dev
  • Links in the documentation are valid and working

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@ybdarrenwang force-pushed the feature/ground-truth branch from ef9aace to df060f3 on April 2, 2026 20:47
@ybdarrenwang changed the title from "docs: add ground-truth based evaluator examples" to "docs: add correctness and goal success rate evaluator examples" on Apr 2, 2026
@ybdarrenwang force-pushed the feature/ground-truth branch from df060f3 to c0a9fbe on April 8, 2026 17:22
@ybdarrenwang changed the title from "docs: add correctness and goal success rate evaluator examples" to "docs: add correctness, goal success rate, coherence evaluator examples" on Apr 8, 2026
@ybdarrenwang force-pushed the feature/ground-truth branch from c0a9fbe to 2ac192c on April 10, 2026 16:57
Contributor

@poshinchen left a comment


Hmm, can we align the terminology? I don't want to mix references and assertions.

poshinchen previously approved these changes Apr 16, 2026
Comment thread on docs/examples/evals-sdk/correctness_evaluator.py (Outdated)
@github-actions
Contributor

Assessment: Comment

The examples follow the established patterns in the evals-sdk/ directory well and are consistent with existing evaluator examples. The overall structure is clean and easy to follow.

Review Categories
  • Completeness: The PR description mentions "Correctness Evaluator with or w/o Reference" but there's no example demonstrating the "with reference" mode (using expected_output). Please clarify whether this is intentionally covered by the assertions example or if a separate example is needed.
  • Differentiation: The assertion-based examples (correctness_with_assertions_evaluator.py and goal_success_rate_with_assertions_evaluator.py) use identical test cases, which makes it harder for users to understand when to choose one evaluator over the other. More domain-appropriate cases for goal success rate would improve pedagogical value.
  • Minor: Grammar typo ("a experiment" → "an experiment") in 3 of the 4 new files.

The new examples are a solid addition to the evals documentation suite.

@github-actions
Contributor

github-actions bot commented Apr 16, 2026

Documentation Preview Ready

Your documentation preview has been successfully deployed!

Preview URL: https://d3ehv1nix5p99z.cloudfront.net/pr-cms-714/docs/user-guide/quickstart/overview/

Updated at: 2026-04-16T19:16:45.863Z

Co-authored-by: Kang Zhou <kangzhou1991@gmail.com>
Co-authored-by: Subramanian Chidambaram <subbu10123@gmail.com>
@github-actions
Contributor

Addressing Review Feedback

I've made the following changes based on the automated review:

1. Differentiated Goal Success Rate Assertion Examples

Replaced the identical math test cases in goal_success_rate_with_assertions_evaluator.py with task-completion scenarios that better illustrate when to use this evaluator vs the correctness evaluator:

  • trip-planning: "Plan a 3-day trip to Tokyo for a family of four with young children." → Assertion: agent should provide a multi-day itinerary with kid-friendly activities and practical travel tips.
  • recipe-substitution: "I'm making a chocolate cake but I don't have eggs or butter. Suggest substitutions and adjust the recipe." → Assertion: agent should suggest suitable substitutes and provide an adjusted recipe.

These scenarios emphasize goal achievement (did the agent complete the task?) rather than factual correctness (is the answer right?), which clarifies the distinction between the two evaluators for users.
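As a rough illustration of how these differentiated cases might be laid out, here is a hypothetical sketch using plain Python dicts. This is not the strands evals SDK API; the field names (`name`, `prompt`, `assertion`) and the `case_names` helper are invented for illustration only.

```python
# Hypothetical illustration only: plain-dict test cases mirroring the
# differentiated goal-success-rate scenarios described above. Field names
# are assumptions, not the strands evals SDK's schema.
goal_success_cases = [
    {
        "name": "trip-planning",
        "prompt": ("Plan a 3-day trip to Tokyo for a family of four "
                   "with young children."),
        "assertion": ("Agent provides a multi-day itinerary with "
                      "kid-friendly activities and practical travel tips."),
    },
    {
        "name": "recipe-substitution",
        "prompt": ("I'm making a chocolate cake but I don't have eggs or "
                   "butter. Suggest substitutions and adjust the recipe."),
        "assertion": ("Agent suggests suitable substitutes and provides "
                      "an adjusted recipe."),
    },
]

def case_names(cases):
    """Return the test-case names, e.g. for a results report."""
    return [c["name"] for c in cases]

print(case_names(goal_success_cases))
```

The point of the structure is that each assertion describes task completion ("did the agent do X?") rather than a single factually correct answer, which is what distinguishes these cases from the correctness-evaluator ones.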

2. Grammar Fix

Fixed pre-existing typo in goal_success_rate_evaluator.py: "Create a experiment" → "Create an experiment".

3. Completeness - "With Reference" Mode

The existing files already cover both modes:

  • correctness_evaluator.py = trace-based (without reference)
  • correctness_with_assertions_evaluator.py = reference-based (with expected_assertion)

No additional example file is needed.
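To make the two modes concrete, here is a hypothetical, heavily simplified sketch in plain Python. It is not the evals SDK: the function names and the trivial heuristics standing in for the real (likely LLM-as-judge) evaluation logic are all assumptions made for illustration.

```python
# Hypothetical sketch of the two correctness-evaluation modes described
# above. The heuristics below are trivial stand-ins, not the SDK's logic.

def correctness_without_reference(output: str) -> bool:
    """Trace-based mode: judge the output on its own, with no expected
    answer available (placeholder heuristic: any non-empty output)."""
    return len(output.strip()) > 0

def correctness_with_assertion(output: str, expected_assertion: str) -> bool:
    """Reference-based mode: check the output against an expected
    assertion (placeholder heuristic: naive keyword containment)."""
    keywords = [word.lower() for word in expected_assertion.split()]
    text = output.lower()
    return all(word in text for word in keywords)

answer = "2 + 2 equals 4"
print(correctness_without_reference(answer))           # True
print(correctness_with_assertion(answer, "equals 4"))  # True
```

The split mirrors the two example files: one evaluates the agent's trace alone, the other compares against a supplied reference assertion.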

@poshinchen poshinchen merged commit 4a9cc77 into strands-agents:main Apr 17, 2026
5 checks passed
@github-actions
Contributor

Assessment: Approve

The prior review feedback has been mostly addressed — the grammar fix ("an experiment") is applied in all 3 new files that had the issue, and the "with reference" mode clarification makes sense. The examples are consistent with the established patterns in the evals-sdk/ directory.

Outstanding Suggestion
  • Test case differentiation: goal_success_rate_with_assertions_evaluator.py and correctness_with_assertions_evaluator.py still use identical test cases. Using a task-completion scenario for the goal success rate example would help users understand which evaluator to choose — but this is a nice-to-have, not a blocker.

Clean, well-structured additions to the evals examples. 👍
