Add RLVR grader framework, CLI, Lambda deployment, and BFCL tool-calling evaluation by mccartnick · Pull Request #2 · aws/llm-eval-kit

mccartnick · 2026-03-19T03:09:00Z

feat: Add RLVR grader framework, CLI, Lambda deployment, and BFCL tool-calling evaluation

Issue #, if available: N/A

Description of changes:

Extends llm-eval-kit (v1.0 → v1.1.0) with a grader framework for building and deploying reward functions for Reinforcement Learning with Verifiable Rewards (RLVR) on Amazon Bedrock. All existing SageMaker pre/post processor functionality, tests, and documentation are preserved — this is purely additive.

Grader framework (src/llm_eval_kit/graders/)

Grader ABC, @grader decorator, and GraderRegistry for defining and registering evaluation functions
Built-in graders: exact_match, string_similarity (Levenshtein + token F1), tool_call (BFCL-style AST comparison with type coercion and weighted scoring across function name, param names, and param values)

Data models (src/llm_eval_kit/models/)

Pydantic v2 boundary models: Message, EvalSample, EvalDataset, EvaluateResult, MetricResult
EvalDataset with JSONL serialization and iteration support

Dataset loaders (src/llm_eval_kit/datasets/)

load_jsonl — generic JSONL
load_bfcl — BFCL-specific field mapping
load_huggingface — pull from HuggingFace Hub with configurable column mapping, data_files selection, and nested list unwrapping for BFCL format

Evaluation pipeline (src/llm_eval_kit/execution/)

EvalPipeline — runs a grader over a dataset, collects per-sample results, and produces an EvalReport with summary statistics and optional JSONL output

Lambda deployment (src/llm_eval_kit/deploy/)

deploy_grader — packages a grader + llm_eval_kit + dependencies (pydantic, etc.) into a zip, creates/updates a Lambda function via boto3
deploy_reward_function — lightweight deployer for standalone zero-dependency reward function files
YAML-based config with env var overrides and credential chain support
Prefers uv for dependency installation with pip fallback

CLI (src/llm_eval_kit/cli/)

llm-eval-kit evaluate — run a grader over a dataset (JSONL or BFCL format)
llm-eval-kit list-graders — show registered graders
llm-eval-kit validate — schema-check a dataset file
llm-eval-kit deploy — deploy a grader as a Lambda reward function

Documentation (docs/)

graders.md, datasets.md, deploy.md, cli.md
Updated root README.md with RLVR section and doc links table

Other

pyproject.toml — added [datasets], [deploy] optional deps, CLI entry point, requires-python >= 3.10
.gitignore — added for the repo
uv as the recommended package manager throughout docs and deployment code

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

add cli, rft graders and lambda deployment

e9159c2

mccartnick requested a review from tang-ti March 19, 2026 03:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add RLVR grader framework, CLI, Lambda deployment, and BFCL tool-calling evaluation#2

Add RLVR grader framework, CLI, Lambda deployment, and BFCL tool-calling evaluation#2
mccartnick wants to merge 1 commit intomainfrom
feature/rft-graders

mccartnick commented Mar 19, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mccartnick commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

feat: Add RLVR grader framework, CLI, Lambda deployment, and BFCL tool-calling evaluation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mccartnick commented Mar 19, 2026 •

edited

Loading