Skip to content

Add RLVR grader framework, CLI, Lambda deployment, and BFCL tool-calling evaluation#2

Open
mccartnick wants to merge 1 commit intomainfrom
feature/rft-graders
Open

Add RLVR grader framework, CLI, Lambda deployment, and BFCL tool-calling evaluation#2
mccartnick wants to merge 1 commit intomainfrom
feature/rft-graders

Conversation

@mccartnick
Copy link
Collaborator

@mccartnick mccartnick commented Mar 19, 2026

feat: Add RLVR grader framework, CLI, Lambda deployment, and BFCL tool-calling evaluation

Issue #, if available: N/A

Description of changes:

Extends llm-eval-kit (v1.0 → v1.1.0) with a grader framework for building and deploying reward functions for Reinforcement Learning with Verifiable Rewards (RLVR) on Amazon Bedrock. All existing SageMaker pre/post processor functionality, tests, and documentation are preserved — this is purely additive.

Grader framework (src/llm_eval_kit/graders/)

  • Grader ABC, @grader decorator, and GraderRegistry for defining and registering evaluation functions
  • Built-in graders: exact_match, string_similarity (Levenshtein + token F1), tool_call (BFCL-style AST comparison with type coercion and weighted scoring across function name, param names, and param values)

Data models (src/llm_eval_kit/models/)

  • Pydantic v2 boundary models: Message, EvalSample, EvalDataset, EvaluateResult, MetricResult
  • EvalDataset with JSONL serialization and iteration support

Dataset loaders (src/llm_eval_kit/datasets/)

  • load_jsonl — generic JSONL
  • load_bfcl — BFCL-specific field mapping
  • load_huggingface — pull from HuggingFace Hub with configurable column mapping, data_files selection, and nested list unwrapping for BFCL format

Evaluation pipeline (src/llm_eval_kit/execution/)

  • EvalPipeline — runs a grader over a dataset, collects per-sample results, and produces an EvalReport with summary statistics and optional JSONL output

Lambda deployment (src/llm_eval_kit/deploy/)

  • deploy_grader — packages a grader + llm_eval_kit + dependencies (pydantic, etc.) into a zip, creates/updates a Lambda function via boto3
  • deploy_reward_function — lightweight deployer for standalone zero-dependency reward function files
  • YAML-based config with env var overrides and credential chain support
  • Prefers uv for dependency installation with pip fallback

CLI (src/llm_eval_kit/cli/)

  • llm-eval-kit evaluate — run a grader over a dataset (JSONL or BFCL format)
  • llm-eval-kit list-graders — show registered graders
  • llm-eval-kit validate — schema-check a dataset file
  • llm-eval-kit deploy — deploy a grader as a Lambda reward function

Documentation (docs/)

  • graders.md, datasets.md, deploy.md, cli.md
  • Updated root README.md with RLVR section and doc links table

Other

  • pyproject.toml — added [datasets], [deploy] optional deps, CLI entry point, requires-python >= 3.10
  • .gitignore — added for the repo
  • uv as the recommended package manager throughout docs and deployment code

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@mccartnick mccartnick requested a review from tang-ti March 19, 2026 03:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant