Amazon Nova 2 Lite Reasoning Evaluation

A comprehensive evaluation of the Amazon Nova model family's reasoning capabilities using LLM-as-a-Judge methodology, with GPT OSS 20B as the evaluator. You can configure the setup to use an evaluator LLM of your choice.

Overview

This evaluation compares five Amazon Nova models across complex customer support scenarios to assess their reasoning, problem-solving, and communication capabilities:

  • Amazon Nova Lite v1.0 (amazon.nova-lite-v1:0)
  • Amazon Nova Lite v2.0 (amazon.nova-2-lite-v1:0)
  • Amazon Nova Micro v1.0 (amazon.nova-micro-v1:0)
  • Amazon Nova Pro v1.0 (amazon.nova-pro-v1:0)
  • Amazon Nova Premier (us.amazon.nova-premier-v1:0)

Note: If you see errors during model invocation, try the inference-profile form of the model ID, i.e. us.<model id> or global.<model id>.

Note: "Nova 2 Lite" and "Nova Lite 2.0" refer to the same model here and in the notebook.

GPT OSS 20B is used as the evaluator (judge) model throughout.
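To confirm model access before running the full notebook, here is a minimal invocation sketch (the prompt text is illustrative; swap in the inference-profile form of the model ID if direct invocation fails, as noted above):

import boto3

# Bedrock Runtime client in a Region where the Nova models are enabled
client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Fall back to "us.amazon.nova-2-lite-v1:0" or "global.amazon.nova-2-lite-v1:0" on invocation errors
MODEL_ID = "amazon.nova-2-lite-v1:0"

response = client.converse(
    modelId=MODEL_ID,
    messages=[{"role": "user", "content": [{"text": "A customer reports a three-week late delivery. Draft a support reply."}]}],
    inferenceConfig={"maxTokens": 1024, "temperature": 0.3},
)
print(response["output"]["message"]["content"][0]["text"])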

Key Features

  • Optimized Prompts: Uses AWS Bedrock Prompt Optimizer API to generate model-specific optimized prompts for fair comparisons
  • Consistency Metrics: Consistency scoring on a 0-100 scale, derived from the Coefficient of Variation (CV), quantifies response reliability beyond average performance
  • Content Filtering Awareness: Built-in guardrail detection identifies when models refuse responses due to safety filters
  • Diagnostic Analysis: Automated anomaly detection for evaluation failures and content blocking patterns

Evaluation Results

Note: This evaluation tests the models' ability to apply inference and analytical thinking to customer support scenarios, not their extended-thinking capabilities. If you are interested in evaluating extended thinking, see https://docs.aws.amazon.com/nova/latest/userguide/extended-thinking.html

Overall Performance Rankings

Model Comparison Chart

Final Rankings with Statistical Rigor:

(Note: owing to run-to-run statistical variation, your numbers may differ slightly from those reported here and shown in the images.)

| Rank | Model | Overall Score | SE | 95% CI | Consistency (CV-based) | CV% |
|------|-------|---------------|------|--------|------------------------|--------|
| 1 | Nova Lite 2.0 | 9.42/10 | 0.08 | [9.28, 9.57] | 94.45/100 | 5.55% |
| 2 | Nova Lite 1.0 | 8.65/10 | 0.09 | [8.48, 8.82] | 93.05/100 | 6.95% |
| 3 | Nova Pro | 8.53/10 | 0.12 | [8.30, 8.76] | 90.46/100 | 9.54% |
| 4 | Nova Micro | 7.70/10 | 0.32 | [7.08, 8.32] | 71.37/100 | 28.63% |
| 5 | Nova Premier | 7.16/10 | 0.38 | [6.41, 7.91] | 62.96/100 | 37.04% |

Key Insights:

  • Nova Lite 2.0 leads with highest overall score (9.42/10) AND highest consistency (94.45/100)
  • Statistical significance: Nova Lite 2.0's 95% CI [9.28, 9.57] does not overlap with any other model's CI, confirming its superior performance
  • Nova Premier has lowest consistency (62.96/100) and highest coefficient of variation (37.04%) due to content filtering
  • Precision matters: Nova Lite 2.0's low standard error (0.08) indicates highly reliable estimates
  • Sample size: n=49 successful evaluations per model (98% success rate overall)

Detailed Performance Breakdown

Radar Chart Comparison

Table 1: Overall Model Performance Summary

| Metric | Nova Lite 2.0 | Nova Lite 1.0 | Nova Pro | Nova Micro | Nova Premier |
|--------|---------------|---------------|----------|------------|--------------|
| Overall Score | 9.42 | 8.65 | 8.53 | 7.70 | 7.16 |
| Standard Error (SE) | 0.08 | 0.09 | 0.12 | 0.32 | 0.38 |
| 95% Confidence Interval | [9.28, 9.57] | [8.48, 8.82] | [8.30, 8.76] | [7.08, 8.32] | [6.41, 7.91] |
| Consistency Score (CV-based) | 94.45 | 93.05 | 90.46 | 71.37 | 62.96 |
| Coefficient of Variation | 5.55% | 6.95% | 9.54% | 28.63% | 37.04% |
| Consistency Rating | Good | Good | Good | Poor | Poor |
| Sample Size | n=49 | n=49 | n=49 | n=49 | n=49 |

Table 2: Dimension-Level Performance (Mean Scores ± 95% CI)

| Dimension | Nova Lite 2.0 | Nova Lite 1.0 | Nova Pro | Nova Micro | Nova Premier |
|-----------|---------------|---------------|----------|------------|--------------|
| Problem Identification | 9.63 ± 0.27 | 8.57 ± 0.46 | 8.16 ± 0.44 | 7.59 ± 0.74 | 6.94 ± 0.82 |
| Solution Completeness | 9.59 ± 0.23 | 8.08 ± 0.32 | 8.04 ± 0.42 | 6.78 ± 0.65 | 6.33 ± 0.69 |
| Policy Adherence | 8.82 ± 0.54 | 7.76 ± 0.59 | 7.55 ± 0.64 | 7.02 ± 0.69 | 6.37 ± 0.81 |
| Factual Accuracy | 9.55 ± 0.26 | 9.18 ± 0.30 | 9.10 ± 0.28 | 8.08 ± 0.74 | 8.00 ± 0.89 |
| Empathy & Tone | 8.98 ± 0.33 | 8.57 ± 0.34 | 8.08 ± 0.36 | 7.55 ± 0.65 | 7.10 ± 0.79 |
| Communication Clarity | 9.76 ± 0.19 | 9.14 ± 0.28 | 8.94 ± 0.28 | 8.04 ± 0.69 | 7.63 ± 0.85 |
| Logical Coherence | 9.71 ± 0.35 | 9.67 ± 0.29 | 9.92 ± 0.11 | 8.98 ± 0.74 | 8.16 ± 0.91 |
| Practical Utility | 9.35 ± 0.27 | 8.24 ± 0.22 | 8.45 ± 0.24 | 7.55 ± 0.62 | 6.78 ± 0.70 |

Table 3: Scenario-Level Performance (Mean Scores)

| Scenario | Nova Lite 2.0 | Nova Lite 1.0 | Nova Pro | Nova Micro | Nova Premier |
|----------|---------------|---------------|----------|------------|--------------|
| Account Security Concern | 9.25 | 7.95 | 7.65 | 6.90 | 2.00 |
| Angry Customer Complaint | 9.95 | 9.50 | 9.30 | 8.35 | 8.20 |
| Billing Dispute | 9.15 | 8.75 | 8.60 | 8.85 | 8.20 |
| Product Defect Report | 9.25 | 8.90 | 7.70 | 8.00 | 8.75 |
| Software Technical Problem | 10.00 | 8.20 | 8.55 | 8.75 | 8.60 |

Nova Lite 2.0 (Winner)

Overall Score: 9.42/10 (SE: 0.08, 95% CI: [9.28, 9.57]) | Consistency Score: 94.45/100 (Good)

Key Strengths: Nova Lite 2.0 demonstrates exceptional performance across all dimensions with remarkably low coefficient of variation (5.55%), indicating consistent high-quality responses. It excels particularly in communication clarity (9.76 ± 0.19), logical coherence (9.71 ± 0.35), and problem identification (9.63 ± 0.27). Its high consistency score (94.45/100) and narrow confidence intervals make it ideal for production deployments where reliability is critical.

Why LLM-as-a-Judge is the Most Reliable Evaluation Method

The Limitations of Traditional ML Metrics

Traditional machine learning evaluation metrics like BLEU, ROUGE, perplexity, and F1 scores fall short when evaluating large language models for several critical reasons:

  1. Surface-Level Matching: Metrics like BLEU and ROUGE measure n-gram overlap between generated and reference texts. A response can have perfect word overlap but completely miss the point, or conversely, express the same idea with different words and score poorly.

  2. No Semantic Understanding: Traditional metrics cannot assess whether a response actually addresses the user's concern, follows company policies, or demonstrates appropriate empathy. They measure form, not function.

  3. Single Reference Limitation: Most traditional metrics require reference answers, but in open-ended tasks like customer support, there are infinite valid responses. Creating comprehensive reference sets is impractical.

  4. Context Blindness: Metrics like perplexity measure statistical likelihood but ignore whether the response is contextually appropriate, factually accurate, or practically useful.

The LLM-as-a-Judge Advantage

LLM-as-a-Judge methodology addresses these limitations through:

1. Holistic Evaluation

Our evaluation framework assesses eight critical dimensions:

  • Problem Identification: Understanding all issues in the scenario
  • Solution Completeness: Addressing every problem with actionable solutions
  • Policy Adherence: Following stated guidelines and best practices
  • Factual Accuracy: Providing correct technical information
  • Empathy & Tone: Demonstrating appropriate emotional intelligence
  • Communication Clarity: Delivering clear, well-structured responses
  • Logical Coherence: Maintaining sound reasoning without contradictions
  • Practical Utility: Offering genuinely helpful solutions

Traditional metrics cannot measure any of these dimensions effectively.
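To illustrate how such a rubric can be handed to a judge model, here is a minimal prompt-builder sketch; the dimension names mirror the list above, but the wording is illustrative rather than the exact prompt used in the notebook:

DIMENSIONS = [
    "Problem Identification", "Solution Completeness", "Policy Adherence",
    "Factual Accuracy", "Empathy & Tone", "Communication Clarity",
    "Logical Coherence", "Practical Utility",
]

def build_judge_prompt(scenario: str, model_response: str) -> str:
    """Assemble an evaluation prompt asking the judge to assess every dimension with justification."""
    criteria = "\n".join(f"- {d}" for d in DIMENSIONS)
    return (
        "You are evaluating a customer-support response against the scenario below.\n\n"
        f"Scenario:\n{scenario}\n\n"
        f"Response to evaluate:\n{model_response}\n\n"
        "Assess each of the following dimensions and justify each assessment:\n"
        f"{criteria}\n"
    )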

2. Evidence of Reliability

Consistency Across Scenarios: Nova Lite 2.0's low scenario-to-scenario variation (±0.39) demonstrates the evaluator's ability to consistently identify quality. If the evaluation were unreliable, we would see random score distributions rather than clear performance patterns.

Discriminative Power: The evaluation successfully differentiates between models with both quality and consistency metrics:

| Model | Overall Score | Consistency Score | Pattern |
|-------|---------------|-------------------|---------|
| Nova Lite 2.0 | 9.42/10 (CV: 5.55%) | 94.45/100 | Clear leader in both quality and reliability |
| Nova Lite 1.0 | 8.65/10 (CV: 6.95%) | 93.05/100 | Strong performer with good consistency |
| Nova Pro | 8.53/10 (CV: 9.54%) | 90.46/100 | Solid performance with good consistency |
| Nova Micro | 7.70/10 (CV: 28.63%) | 71.37/100 | Good average quality but high variability |
| Nova Premier | 7.16/10 (CV: 37.04%) | 62.96/100 | Impacted by content filtering and high variability |

This granularity—measuring both average performance and consistency across 8 dimensions—is challenging to realize with aggregate metrics like BLEU scores. The consistency scores reveal that Nova Pro and Premier, despite reasonable average scores, show high variability that could impact production reliability.

3. Research-Backed Validation

Recent research validates LLM-as-a-Judge methodology:

High Correlation with Human Judgment: Studies show that advanced LLMs like GPT-4 achieve 80-85% agreement with human evaluators on open-ended tasks, compared to 40-50% for traditional metrics (Zheng et al., 2023, "Judging LLM-as-a-Judge").

Multi-Dimensional Assessment: LLM judges can evaluate multiple quality dimensions simultaneously, while traditional metrics require separate calculations for each aspect and often miss critical dimensions entirely.

Contextual Understanding: LLM evaluators understand nuance, sarcasm, implied meaning, and cultural context—capabilities that traditional metrics completely lack.

Scalability: While human evaluation is the gold standard, it's expensive and slow. LLM-as-a-Judge provides near-human-level assessment at machine speed and scale.

4. Methodological Rigor

Our evaluation ensures reliability through:

Structured Evaluation Prompts: Clear criteria and scoring rubrics reduce evaluator bias and ensure consistent assessment across all responses.

Multiple Runs: Each model-scenario combination is evaluated 10 times (250 total evaluations), allowing us to measure consistency and reduce random variation.

Diverse Scenarios: Five distinct customer support scenarios test different capabilities:

  • Angry Customer Complaint (emotional intelligence)
  • Software Technical Problem (technical accuracy)
  • Billing Dispute (policy adherence)
  • Product Defect Report (problem-solving)
  • Account Security Concern (urgency and thoroughness)

Quantitative Scoring: Each dimension receives a 1-10 score with written justification, providing both numerical comparability and qualitative insights.

Comparison with Alternative Approaches

| Evaluation Method | Semantic Understanding | Multi-Dimensional | Contextual Awareness | Scalability | Cost |
|-------------------|------------------------|-------------------|----------------------|-------------|------|
| LLM-as-a-Judge | ✅ Excellent | ✅ Yes (8 dimensions) | ✅ Full context | ✅ High | 💰 Low |
| Human Evaluation | ✅ Excellent | ✅ Yes | ✅ Full context | ❌ Low | 💰💰💰 High |
| BLEU/ROUGE | ❌ None | ❌ No | ❌ None | ✅ High | 💰 Very Low |
| Perplexity | ❌ Statistical only | ❌ No | ❌ None | ✅ High | 💰 Very Low |
| F1/Accuracy | ⚠️ Limited | ⚠️ Single dimension | ❌ None | ✅ High | 💰 Low |

Key Findings

  1. Nova Lite 2.0 is the Clear Winner: With a 9.42/10 score (95% CI: [9.28, 9.57]) and exceptional consistency (94.45/100, CV: 5.55%), Nova Lite 2.0 outperforms all other models across both quality and reliability metrics. The non-overlapping confidence intervals confirm statistically significant superiority.

  2. Consistency Matters as Much as Quality: Nova Lite 2.0's low coefficient of variation (5.55%) indicates reliable performance across diverse scenarios. In contrast, Nova Premier has much lower consistency (62.96/100) and high CV (37.04%), making it less predictable for production use. The CV-based consistency metric provides an intuitive 0-100 scale for reliability assessment.

  3. Statistical Significance Confirmed: The 95% confidence intervals demonstrate clear performance separation:

    • Nova Lite 2.0: [9.28, 9.57] - No overlap with any other model
    • Nova Lite 1.0: [8.48, 8.82] - Significantly better than Micro and Premier
    • Nova Pro: [8.30, 8.76] - Significantly better than Micro and Premier
    • Nova Micro: [7.08, 8.32] - Wide CI indicates high variability
    • Nova Premier: [6.41, 7.91] - Widest CI reflects content filtering impact
  4. Model Size ≠ Performance: Nova Lite 2.0 outperforms the larger Nova Pro (8.53/10) and flagship Nova Premier (7.16/10), demonstrating that architectural improvements and training methodology matter more than raw parameter count.

  5. Content Filtering Impact: Nova Premier's lower scores and highest CV (37.04%) are partially due to content filtering, not poor reasoning. This reflects conservative guardrails designed for enterprise compliance.

  6. Optimized Prompts Improve Performance: Using AWS Bedrock Prompt Optimizer API to create model-specific prompts ensures fair comparisons and maximizes each model's capabilities.

  7. Power Analysis Validates Sample Size: With n=49 successful evaluations per model, the evaluation achieves sufficient statistical power to detect meaningful differences, with narrow confidence intervals confirming precise estimates.

Evaluation Methodology

Test Scenarios

Five realistic customer support scenarios covering:

  • Emotional situations (angry customers)
  • Technical problems (app crashes)
  • Financial disputes (billing issues)
  • Product quality (defects)
  • Security incidents (account compromise)

Evaluation Dimensions

Each response is scored across eight dimensions using a categorical-first approach:

  1. Problem Identification
  2. Solution Completeness
  3. Policy Adherence
  4. Factual Accuracy
  5. Empathy & Tone
  6. Communication Clarity
  7. Logical Coherence
  8. Practical Utility

Categorical-First Scoring Approach

This evaluation uses a research-backed categorical scoring system where the evaluator first assigns a semantic category, then maps it to a fixed numerical score:

| Category | Fixed Score | Meaning |
|----------|-------------|---------|
| EXCELLENT | 10 | Comprehensive, professional, exceeds expectations |
| GOOD | 8 | Solid performance with minor room for improvement |
| ADEQUATE | 6 | Meets basic requirements but has notable gaps |
| POOR | 4 | Significant issues requiring major improvements |
| FAILING | 2 | Critical failures, does not meet requirements |

Why this approach?

  • More consistent evaluations: Discrete categories reduce ambiguity compared to continuous 1-10 scales
  • Better interpretability: Semantic labels (EXCELLENT, GOOD) are clearer than arbitrary numbers
  • Reduced evaluator bias: Clear decision boundaries between categories
  • Simpler validation: One category = one score, eliminating score-category mismatches

Evaluation process:

  1. Evaluator (GPT OSS 20B) reads the model response
  2. Assigns a category label for each of 8 dimensions
  3. Fixed score is automatically determined by category
  4. Provides detailed reasoning for the categorization
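A minimal sketch of step 3, the category-to-fixed-score mapping; the judge-output shape shown in the docstring is an assumption for illustration, not the notebook's exact schema:

CATEGORY_SCORES = {"EXCELLENT": 10, "GOOD": 8, "ADEQUATE": 6, "POOR": 4, "FAILING": 2}

def score_dimensions(judge_output: dict) -> dict:
    """Map the judge's category label for each dimension to its fixed numeric score.

    judge_output is assumed to look like:
    {"Problem Identification": {"category": "EXCELLENT", "justification": "..."}, ...}
    """
    scored = {}
    for dimension, verdict in judge_output.items():
        category = verdict["category"].strip().upper()
        if category not in CATEGORY_SCORES:
            raise ValueError(f"Unexpected category {category!r} for {dimension}")
        scored[dimension] = {"score": CATEGORY_SCORES[category],
                             "category": category,
                             "justification": verdict.get("justification", "")}
    return scored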

Statistical Rigor

This evaluation employs rigorous statistical methods following best practices outlined in "Adding Error Bars to Evals" (Miller, 2024, Anthropic):

  • Multiple runs: 10 evaluations per model-scenario combination (250 total evaluations, 98% success rate)
  • Standard errors (SE): Precision of mean estimates calculated using Central Limit Theorem (SE = √(Var/n))
  • 95% confidence intervals (CI): Uncertainty ranges for all scores (CI = mean ± 1.96×SE) enable significance testing
  • Paired differences analysis: Model comparisons account for correlation in scenario difficulty, reducing variance by 30-50%
  • Power analysis: Minimum detectable effect (MDE) calculation justifies sample size adequacy
  • CV-based consistency: Normalized reliability metric on a 0-100 scale using the Coefficient of Variation (Consistency = 100 - CV%, where CV = (σ/μ)×100); see the code sketch at the end of this section
  • Error bars on charts: Visualizations show 95% CI for transparent uncertainty communication
  • Zero-exclusion averaging: Failed evaluations excluded from statistics to prevent artificial deflation
  • Diagnostic analysis: Automated anomaly detection identifies evaluation failures and content blocking patterns

Statistical Interpretation:

  • Coefficient of Variation (CV): Measures relative variability; lower CV% indicates more consistent performance
    • CV < 10%: Good consistency (Nova Lite 2.0: 5.55%, Nova Lite 1.0: 6.95%, Nova Pro: 9.54%)
    • CV > 20%: Poor consistency (Nova Micro: 28.63%, Nova Premier: 37.04%)
  • Non-overlapping CIs: Indicate statistically significant differences between models
  • Standard Error: Quantifies precision of estimates; smaller SE means more reliable scores

Citation: Miller, E. (2024). Adding Error Bars to Evals. Anthropic. https://www.anthropic.com/research/error-bars-evals
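The headline statistics follow directly from these formulas; here is a minimal NumPy sketch (the per-run scores in the usage line are made up for illustration):

import numpy as np

def summarize(scores: np.ndarray) -> dict:
    """Compute mean, SE, 95% CI, CV%, and CV-based consistency for one model's per-run scores."""
    scores = scores[scores > 0]                   # zero-exclusion: drop failed evaluations
    n = len(scores)
    mean = scores.mean()
    se = scores.std(ddof=1) / np.sqrt(n)          # SE = sqrt(Var / n)
    ci_95 = (mean - 1.96 * se, mean + 1.96 * se)  # CI = mean ± 1.96 × SE
    cv_pct = scores.std(ddof=1) / mean * 100      # CV = (σ / μ) × 100
    return {"n": n, "mean": mean, "se": se, "ci_95": ci_95,
            "cv_pct": cv_pct, "consistency": 100 - cv_pct}

# Illustrative usage with made-up per-run overall scores
print(summarize(np.array([9.4, 9.5, 9.3, 9.6, 9.4, 9.2, 9.5, 9.4, 9.3, 9.6])))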

Evaluator Model

GPT OSS 20B (openai.gpt-oss-20b-1:0) serves as the judge, providing:

  • Consistent evaluation criteria
  • Detailed scoring with justifications
  • Multi-dimensional assessment
  • Scalable evaluation at machine speed

Explainability

  • The evaluator is asked to justify the score it assigns on each dimension
  • The evaluator is also asked to provide an overall score with justification

Sample explanations provided by the evaluator for Nova Lite 2.0 and Nova Micro on the Product Defect Report scenario, Problem Identification dimension:

Nova Lite 2.0 – Score: 9. The response clearly identifies all key issues: a three-week delayed delivery, a rude and unhelpful customer service interaction, the customer's loyalty of five years, and the request for a refund. It directly quotes the customer's concerns and acknowledges each one.

Nova Micro – Score: 5. The response correctly identifies the left earpiece not producing sound, but it omits the product's age (one month), the $200 price point, and the fact that the customer has the receipt and original packaging. These omissions mean the issue is not fully acknowledged.

Content Filtering & Guardrails

Nova Premier has stricter content filtering compared to other Nova models, which is by design for enterprise safety and compliance:

  • Higher refusal rates: Premier may block responses for scenarios it deems sensitive (security, billing disputes, etc.)
  • Not a quality issue: This reflects conservative guardrails, not poor reasoning capability
  • Production consideration: Premier's guardrails are a feature for regulated industries but may not be suitable for all evaluation scenarios

The notebook includes diagnostic tools to identify and report blocked responses, preventing misinterpretation of content filtering as poor performance.
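As a simplified sketch of how a blocked response can be flagged: the stopReason values below come from the Bedrock Converse API, while the refusal-phrase list is a hypothetical heuristic rather than the notebook's exact diagnostic.

REFUSAL_MARKERS = ("i can't help with", "i cannot assist with", "unable to provide a response")

def looks_blocked(converse_response: dict) -> bool:
    """Heuristically flag responses blocked by guardrails or content filtering."""
    # The Converse API reports interventions via stopReason
    if converse_response.get("stopReason") in ("guardrail_intervened", "content_filtered"):
        return True
    # Fall back to a phrase heuristic for soft refusals (hypothetical marker list)
    text = converse_response["output"]["message"]["content"][0]["text"].lower()
    return any(marker in text for marker in REFUSAL_MARKERS)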

Running the Evaluation

Prerequisites

  • AWS credentials with access to Amazon Bedrock
  • Access to Nova models and GPT OSS 20B evaluator

Get Started

git clone git@github.com:aws-samples/sample-amazon-nova-reasoning-eval.git
cd sample-amazon-nova-reasoning-eval 

Step 1: Generate Optimized Prompts (Optional but Recommended)

The evaluation uses AWS Bedrock Prompt Optimizer API to create model-specific optimized prompts:

# Generate optimized prompts for all models
python prompt_optimizer.py

This creates optimized_prompts_*.json files for each model. The optimizer:

  • Analyzes raw prompts and improves structure, clarity, and organization
  • Creates model-specific optimizations tailored to each Nova model
  • Typically increases prompt length 2-3x with better detail and formatting

Note: Pre-generated optimized prompts are included in the repository, so this step is optional unless you modify the test scenarios.
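For reference, a minimal sketch of the underlying optimize_prompt call on the bedrock-agent-runtime client; the event-stream parsing follows the documented response shape but should be verified against your SDK version, and prompt_optimizer.py itself may be structured differently:

import boto3

# Prompt optimization is exposed through the bedrock-agent-runtime client
client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

def optimize(raw_prompt: str, target_model_id: str) -> str:
    """Return the optimized text for one prompt/model pair (event field names assumed from the docs)."""
    response = client.optimize_prompt(
        input={"textPrompt": {"text": raw_prompt}},
        targetModelId=target_model_id,
    )
    optimized = raw_prompt
    for event in response["optimizedPrompt"]:
        if "optimizedPromptEvent" in event:
            optimized = event["optimizedPromptEvent"]["optimizedPrompt"]["textPrompt"]["text"]
    return optimized

# Example: optimize("You are a customer support agent handling a billing dispute.", "amazon.nova-2-lite-v1:0")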

Step 2: Run the Evaluation

# Install dependencies
pip install boto3 pandas matplotlib seaborn numpy

# Open and run the notebook
jupyter notebook nova_lite_reasoning_evals.ipynb

The notebook will:

  1. Load optimized prompts for each model
  2. Run 10 evaluations per model-scenario combination (250 total)
  3. Calculate scores, consistency metrics, and detect anomalies
  4. Generate visualizations and export results
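For orientation, here is a condensed sketch of that loop; invoke_model and judge_response are hypothetical stand-ins for the notebook's helpers, stubbed so the sketch runs on its own:

import itertools
import random

MODELS = ["amazon.nova-lite-v1:0", "amazon.nova-2-lite-v1:0", "amazon.nova-micro-v1:0",
          "amazon.nova-pro-v1:0", "us.amazon.nova-premier-v1:0"]
SCENARIOS = ["account_security", "angry_customer", "billing_dispute",
             "product_defect", "software_problem"]
RUNS_PER_PAIR = 10  # 5 models x 5 scenarios x 10 runs = 250 evaluations

def invoke_model(model_id: str, scenario: str) -> str:
    """Hypothetical stand-in for the helper that sends the optimized prompt via the Converse API."""
    return f"response from {model_id} for {scenario}"

def judge_response(scenario: str, answer: str) -> dict:
    """Hypothetical stand-in for the GPT OSS 20B judging step; returns an overall score."""
    return {"overall": round(random.uniform(6.0, 10.0), 2)}

results = [
    {"model": m, "scenario": s, "run": r, **judge_response(s, invoke_model(m, s))}
    for m, s, r in itertools.product(MODELS, SCENARIOS, range(RUNS_PER_PAIR))
]
print(len(results))  # 250 records ready for scoring and consistency analysis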

Files

Core Evaluation Files

  • nova_lite_reasoning_evals.ipynb - Complete evaluation notebook with consistency metrics and diagnostics
  • prompt_optimizer.py - Script to generate optimized prompts using AWS Bedrock Prompt Optimizer API
  • RAW_TEST_SCENARIOS.md - Original unoptimized test scenarios for reference

Optimized Prompts (Generated by prompt_optimizer.py)

  • optimized_prompts_amazon_nova-lite-v1_0.json - Nova Lite 1.0 optimized prompts
  • optimized_prompts_amazon_nova-2-lite-v1_0.json - Nova Lite 2.0 optimized prompts
  • optimized_prompts_amazon_nova-micro-v1_0.json - Nova Micro optimized prompts
  • optimized_prompts_amazon_nova-pro-v1_0.json - Nova Pro optimized prompts
  • optimized_prompts_amazon_nova-premier-v1_0.json - Nova Premier optimized prompts

Results & Visualizations (Generated by running the notebook)

  • nova_5_models_comparison_chart.png - Overall performance comparison bar chart
  • nova_5_models_radar_chart.png - Multi-dimensional performance radar chart
  • nova_lite_comparison_*.csv - Raw evaluation data with all scores
  • nova_lite_comparison_*.json - Detailed results with evaluator justifications

Documentation

  • README.md - This file

Conclusion

This evaluation demonstrates that Amazon Nova Lite 2.0 represents a significant advancement in reasoning capabilities, achieving the highest scores across all evaluation dimensions with exceptional consistency (94.45/100 consistency score, CV: 5.55%). The combination of high quality (9.42/10, 95% CI: [9.28, 9.57]) and low variability makes it the most reliable choice for production deployments.

The rigorous statistical methodology, following "Adding Error Bars to Evals" (Miller, 2024, Anthropic), provides quantified uncertainty through standard errors, confidence intervals, and paired differences analysis. The non-overlapping confidence intervals between Nova Lite 2.0 and all other models confirm statistically significant performance superiority, not just numerical differences.

The LLM-as-a-Judge methodology provides reliable, comprehensive assessment that traditional metrics cannot match, offering insights into model performance that directly correlate with real-world utility. The addition of CV-based consistency scoring (0-100 scale), power analysis, and diagnostic tools ensures that model selection considers both average performance and reliability—critical factors for production AI systems.

Recommendation: For customer support and similar reasoning-intensive applications, Nova Lite 2.0 offers the best balance of quality, consistency, and cost-effectiveness among the Nova model family. Its narrow confidence intervals (±0.15 margin of error) and high consistency score (94.45/100) provide confidence in reliable production performance.

Statistical Note: All results include 95% confidence intervals and consistency metrics. The ± values in dimension scores represent the margin of error (half the width of the 95% CI), enabling transparent assessment of estimate precision and statistical significance.
