feat: add eval framework for installer agent testing #36
Open
nicknisi wants to merge 12 commits into main from prompt-improvements
+3,565
−31
Introduces a structured evaluation system to validate the WorkOS installer agent against framework fixtures. Phase 1 includes:
- Core types and interfaces for grading results
- File and build graders with pattern matching
- Next.js-specific grader checking AuthKit integration
- Fixture manager for temp dir setup/cleanup
- Eval runner orchestrating fixture → agent → grade flow
- CLI entry point with --framework and --verbose flags
- Minimal Next.js 14 App Router fixture

The agent executor is stubbed to validate framework structure first. Run with: pnpm eval
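To make the "core types" and "file grader with pattern matching" items concrete, here is a minimal sketch of how a grading result could be shaped and produced. The names `GradeCheck`, `GradeResult`, `gradeFileContent`, and `combineChecks` are hypothetical and chosen for illustration; the actual types in this PR may differ.

```typescript
// Hypothetical shapes for illustration; the PR's real type names may differ.
interface GradeCheck {
  name: string;
  passed: boolean;
  expected?: string;
  actual?: string;
}

interface GradeResult {
  passed: boolean;
  checks: GradeCheck[];
}

// Grade one file: it must exist and its content must match every pattern.
// Patterns should not use the `g` flag, because RegExp.test is stateful with it.
function gradeFileContent(
  fileName: string,
  content: string | undefined,
  patterns: RegExp[],
): GradeCheck[] {
  if (content === undefined) {
    return [
      { name: `${fileName} exists`, passed: false, expected: "file present", actual: "file missing" },
    ];
  }
  return patterns.map((pattern) => {
    const passed = pattern.test(content);
    return {
      name: `${fileName} matches ${pattern.source}`,
      passed,
      expected: pattern.source,
      actual: passed ? "match" : "no match",
    };
  });
}

// A run passes only if every individual check passed.
function combineChecks(checks: GradeCheck[]): GradeResult {
  return { passed: checks.every((check) => check.passed), checks };
}
```

Keeping `expected` and `actual` on each check is one way to support verbose failure output without re-running the grader.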
Add CLI with filtering (--framework, --state, --json), matrix reporter, and graders for all 5 frameworks. Create fixtures for fresh, existing, and existing-auth0 states across Next.js, React SPA, React Router, TanStack Start, and Vanilla JS.
- Add history.ts for results persistence with compare functionality
- Extend CLI with --debug, --keep-on-fail, --retry, --no-retry flags
- Add history and compare subcommands (pnpm eval:history, eval:compare)
- Implement retry loop in runner for handling LLM non-determinism
- Add verbose failure output with expected/actual values
- Create README documentation for eval framework usage
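The retry loop mentioned above could look roughly like the following. This is a sketch under assumptions, not the runner's actual code: `runWithRetry`, `AttemptOutcome`, and the meaning of `maxRetries` (extra attempts after the first, so `0` would correspond to `--no-retry`) are all hypothetical.

```typescript
// Hypothetical outcome shape: anything with a pass/fail flag.
interface AttemptOutcome {
  passed: boolean;
}

// Run one scenario, retrying failed attempts up to `maxRetries` extra times.
// Retrying is a pragmatic response to LLM non-determinism: a scenario counts
// as passed if any attempt passes.
async function runWithRetry<T extends AttemptOutcome>(
  runAttempt: (attemptNumber: number) => Promise<T>,
  maxRetries: number,
): Promise<{ outcome: T; attempts: number }> {
  let attempts = 1;
  let outcome = await runAttempt(attempts);
  while (!outcome.passed && attempts <= maxRetries) {
    attempts += 1;
    outcome = await runAttempt(attempts);
  }
  return { outcome, attempts };
}
```

Returning the attempt count alongside the outcome lets a reporter distinguish a clean pass from a pass that needed retries.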
Replace stub implementation with real agent execution:
- Add env-loader for credentials from .env.local
- Configure SDK with direct auth mode (bypasses gateway)
- Capture tool calls and output from message stream
- Add ToolCall interface to types
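As an illustration of capturing tool calls and output from a message stream, the sketch below walks an async stream and separates tool invocations from text. The SDK's real message types are not shown in this conversation, so `AgentMessage`, the fields on `ToolCall`, and `collectRun` are assumptions made up for this example.

```typescript
// Hypothetical shapes: the real SDK message types are not shown in this PR.
interface ToolCall {
  name: string;
  input: unknown;
}

type AgentMessage =
  | { type: "text"; text: string }
  | { type: "tool_use"; name: string; input: unknown };

interface AgentRun {
  toolCalls: ToolCall[];
  output: string;
}

// Drain the stream once, recording tool calls in order and joining text output.
async function collectRun(stream: AsyncIterable<AgentMessage>): Promise<AgentRun> {
  const toolCalls: ToolCall[] = [];
  const textParts: string[] = [];
  for await (const message of stream) {
    if (message.type === "tool_use") {
      toolCalls.push({ name: message.name, input: message.input });
    } else {
      textParts.push(message.text);
    }
  }
  return { toolCalls, output: textParts.join("\n") };
}
```

Recording calls in arrival order preserves the sequence a verbose run prints (`Tool: Glob`, `Tool: Read`, and so on), which can help when debugging a failed eval.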
- Use glob + content matching for callback route (path is configurable)
- Remove process.env.WORKOS_ check (SDK abstracts env access)
- Add checkFileWithPattern helper for flexible file discovery
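The idea behind glob + content matching can be sketched as follows. To stay self-contained this version works on an in-memory file list rather than the fixture directory, and it implements only a small glob subset (`**/`, `**`, `*`). The signature of `checkFileWithPattern` here is an assumption, as are `SourceFile`, `PatternCheck`, and `globToRegExp`; the helper in the PR may look different.

```typescript
// Hypothetical in-memory file representation used to keep the sketch self-contained.
interface SourceFile {
  path: string;
  content: string;
}

interface PatternCheck {
  passed: boolean;
  matchedPath?: string;
}

// Convert a small glob subset to a RegExp:
//   "**/" matches zero or more directories, "**" matches anything,
//   "*" matches within one path segment. Other characters are literal.
function globToRegExp(glob: string): RegExp {
  let source = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.startsWith("**/", i)) {
      source += "(?:.*/)?";
      i += 3;
    } else if (glob.startsWith("**", i)) {
      source += ".*";
      i += 2;
    } else if (glob.charAt(i) === "*") {
      source += "[^/]*";
      i += 1;
    } else {
      source += glob.charAt(i).replace(/[.+?^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp(`^${source}$`);
}

// Pass if any file whose path matches the glob also matches the content pattern.
function checkFileWithPattern(
  files: SourceFile[],
  glob: string,
  contentPattern: RegExp,
): PatternCheck {
  const pathPattern = globToRegExp(glob);
  const match = files.find(
    (file) => pathPattern.test(file.path) && contentPattern.test(file.content),
  );
  return match ? { passed: true, matchedPath: match.path } : { passed: false };
}
```

Matching on content instead of a fixed path means the grader still passes when the callback route lives at a non-default location.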
Example run:
> tsx tests/evals/index.ts --framework=nextjs --verbose --keepOnFail --state=fresh
Running: nextjs/fresh
Installing dependencies...
[dotenv@17.2.3] injecting env (2) from .env.local -- tip: ⚙️ load multiple .env files with { path: ['.env.local', '.env'] }
Initializing agent for nextjs...
Tool: Skill
Agent: I'll start by fetching the SDK documentation from the README, which is the source of truth....
Tool: WebFetch
Agent: Now let me validate the project structure and check the existing environment variables....
Tool: Glob
Tool: Read
Tool: Read
Tool: Glob
Agent: Let me also check the existing app structure:...
Tool: Glob
Tool: Read
Tool: Read
Tool: TodoWrite
Agent: Now let me proceed with the integration. First, I'll add the missing environment variable and instal...
Tool: Edit
Tool: Bash
Agent: Let me check if there's a pnpm or yarn lockfile to determine the correct package manager:...
Tool: Bash
Tool: Bash
Tool: Bash
Tool: TodoWrite
Agent: Now let me verify the SDK is installed and create the middleware. Since this is Next.js 14, I'll cre...
Tool: Bash
Agent: Now let me create the middleware, callback route, update layout, and update the page:...
Tool: Write
Tool: Bash
Tool: Write
Tool: TodoWrite
Agent: Now let me update the layout with AuthKitProvider and the page with auth UI:...
Tool: Write
Tool: Write
Tool: TodoWrite
Agent: Now let me run the verification checklist and build:...
Tool: Bash
Tool: Bash
Tool: Bash
Agent: All verification checks pass. Now let's run the build:...
Tool: Bash
Agent: Let me check the correct import paths from the SDK:...
Tool: Bash
Tool: Bash
Tool: Bash
Agent: I see - the SDK doesn't have `getUser` and `AuthKitProvider` in the main exports. Let me check the c...
Tool: Bash
Tool: Bash
Agent: I see - `AuthKitProvider` is in `@workos-inc/authkit-nextjs/components`, and I need to use `withAuth...
Tool: Bash
Tool: Bash
Tool: Bash
Agent: Now I understand. The SDK uses `withAuth` instead of `getUser`. Let me update the files with the cor...
Tool: Write
Tool: Write
Agent: Now let's run the build again:...
Tool: Bash
Agent: The build succeeded. Let me update the todo list to show completion:...
Tool: TodoWrite
Agent: I've successfully integrated WorkOS AuthKit into your Next.js application. Here's a summary of what ...
✓ PASSED
Results saved to: /Users/nicknisi/Developer/cli/prompt-improvements/tests/eval-results/2026-02-01T05-39-28-884Z.json
┌─────────────────┬─────────┬──────────┬───────────────┐
│ Framework │ Fresh │ Existing │ Existing+Auth │
├─────────────────┼─────────┼──────────┼───────────────┤
│ nextjs │ ✓ │ - │ - │
│ react │ - │ - │ - │
│ react-router │ - │ - │ - │
│ tanstack-start │ - │ - │ - │
│ vanilla-js │ - │ - │ - │
└─────────────────┴─────────┴──────────┴───────────────┘
Results: 1/1 passed (100.0%)
pnpm eval --framework=nextjs --verbose --keepOnFail --state=fresh 66.79s user 23.09s system 32% cpu 4:39.04 total
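For reference, the summary line and matrix cells in the run above could be derived from per-scenario records along these lines. This is only a sketch: `RunRecord`, `summarize`, and `matrixCell` are hypothetical names, and the `✗` symbol for a failed cell is an assumption because the example run shows only `✓` and `-`.

```typescript
// Hypothetical record of one framework/state scenario.
interface RunRecord {
  framework: string;
  state: "fresh" | "existing" | "existing-auth0";
  passed: boolean;
}

// Build the "Results: X/Y passed (Z%)" line from all executed scenarios.
function summarize(records: RunRecord[]): string {
  const passed = records.filter((record) => record.passed).length;
  const percent = records.length === 0 ? 0 : (passed / records.length) * 100;
  return `Results: ${passed}/${records.length} passed (${percent.toFixed(1)}%)`;
}

// One matrix cell: "-" when the scenario was not run, otherwise pass/fail.
// The failure symbol is an assumption; the example run never shows one.
function matrixCell(
  records: RunRecord[],
  framework: string,
  state: RunRecord["state"],
): string {
  const record = records.find((r) => r.framework === framework && r.state === state);
  if (record === undefined) return "-";
  return record.passed ? "✓" : "✗";
}
```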
- Grader: support src/ directory (v1.132+) in addition to app/
- Grader: check for authkitMiddleware instead of createServerFn
- Grader: fix package name to @workos/authkit-tanstack-react-start
- Grader: remove AuthKitProvider requirement (optional for server-only)
- Grader: support both flat and nested route patterns for callback
- Skill: add directory detection guidance (src/ vs app/)
- Skill: fix handleAuth() → handleCallbackRoute()
- Skill: add SDK exports reference section
- Remove callback component check (SDK handles OAuth internally)
- Use glob pattern to find useAuth anywhere in src/**/*.tsx
- Support both Vite (main.tsx) and CRA (index.tsx) entry points
- Add comprehensive header documenting SDK patterns
- Fix package name: @workos-inc/authkit-react-router (was @workos-inc/authkit)
- Use glob patterns instead of hardcoded file paths
- Check for authLoader in callback routes (flexible location)
- Check for authkitLoader in route files for auth state
- Remove unnecessary ProtectedRoute.tsx/auth.ts checks (SDK has ensureSignedIn)
- Support both app/ and src/ directory structures
- Remove callback.html/callback.js checks (SDK handles OAuth internally)
- Remove auth.js with getAuthorizationUrl (old pattern)
- Check for createClient from @workos-inc/authkit-js or CDN WorkOS.createClient
- Check for auth methods (signIn, signOut, getUser, getAccessToken)
- Support both bundled (ESM import) and CDN (script tag) patterns
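A rough sketch of how the Vanilla JS checks above could be expressed with regular expressions follows. The function names `detectClientPattern` and `findAuthMethods` and the exact regexes are assumptions for illustration; only the package name, `WorkOS.createClient`, and the four method names come from the commit message.

```typescript
// Bundled pattern: a named ESM import of createClient from @workos-inc/authkit-js.
const ESM_CREATE_CLIENT =
  /import\s*\{[^}]*\bcreateClient\b[^}]*\}\s*from\s*["']@workos-inc\/authkit-js["']/;

// CDN pattern: a call on the global WorkOS object loaded via a script tag.
const CDN_CREATE_CLIENT = /\bWorkOS\.createClient\s*\(/;

type ClientPattern = "bundled" | "cdn" | "none";

// Classify how (or whether) a source file creates the AuthKit client.
function detectClientPattern(source: string): ClientPattern {
  if (ESM_CREATE_CLIENT.test(source)) return "bundled";
  if (CDN_CREATE_CLIENT.test(source)) return "cdn";
  return "none";
}

const AUTH_METHODS = ["signIn", "signOut", "getUser", "getAccessToken"] as const;

// List which of the expected auth methods are called somewhere in the source.
function findAuthMethods(source: string): string[] {
  return AUTH_METHODS.filter((method) => new RegExp(`\\.${method}\\s*\\(`).test(source));
}
```

Accepting either pattern keeps the grader from failing a correct integration just because the agent chose a script tag over a bundler.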
Summary
Why
Need automated testing to validate installer agent behavior across different project configurations before releases.
Notes
pnpm eval --framework=nextjs --state=fresh
ANTHROPIC_API_KEY in .env.local