An early-stage benchmark for evaluating AI assistant capabilities across real-world task execution and knowledge retrieval scenarios.
Task Arena is designed to test AI assistants on practical, real-world tasks that users frequently request. This is an early evaluation that provides a starting point for assessment, though it is not yet comprehensive. The benchmark currently focuses on a limited set of integrations, actions, and data sources, and will be expanded over time to cover a broader range of real-world scenarios.
The benchmark consists of two primary evaluation datasets:
- Action Dataset: Tests an AI assistant's ability to understand and execute complex, multi-step tasks
- Retrieval Dataset: Tests an AI assistant's ability to accurately retrieve and synthesize information from a knowledge base
| Dataset | Samples | Description |
|---|---|---|
| Action | 51 | Real-world task prompts covering email management, calendar scheduling, document creation, research, and more |
| Retrieval | 52 | Question-answer pairs testing knowledge retrieval, understanding, and synthesis capabilities |
The Action dataset contains real-world task prompts that AI assistants commonly encounter. These tasks test:
- Email Management: Reading, replying, drafting, and organizing emails
- Calendar Operations: Checking availability, scheduling meetings, sending invites
- Document Creation: Creating Google Docs, Sheets, and presentations with specific content
- Research Tasks: Gathering information and organizing it into structured documents
- File Management: Sharing files, organizing documents, extracting information
- Multi-step Workflows: Complex tasks requiring multiple actions in sequence
```json
[
  {
    "prompt": "Catch me up on my latest emails please."
  },
  {
    "prompt": "Can you please schedule the meeting between me and Tejas based on his and my availability? Send over the invite and set up the meeting please (Google Meet) for when I'm available."
  }
]
```

Each entry contains:
- `prompt`: The task description given to the AI assistant

The Retrieval dataset tests an AI assistant's ability to answer questions accurately based on a knowledge base. Questions cover:
- Technical Specifications: Architecture decisions, implementation details
- Product Features: Capabilities, configurations, limitations
- Versioning: Changes across different versions, deprecations
- Best Practices: Recommended approaches, performance targets
- Cross-referencing: Questions requiring synthesis of information from multiple sources
```json
[
  {
    "prompt": "How does our desktop app offline queue work?",
    "expected_response": "Actions are saved locally in SQLite with WAL mode. When offline, actions queue up (max 5,000 items). On reconnect, they sync with exponential backoff + jitter, using idempotency keys. For conflicts, the Final spec uses operational transform (CRDT-based), not last-writer-wins - that was only in v1. Conflict UI shows both versions and lets users pick or merge.",
    "assistant_response": ""
  }
]
```

Each entry contains:
- `prompt`: The question asked to the AI assistant
- `expected_response`: The correct, detailed answer
- `assistant_response`: Field for storing the AI's actual response

```python
import json

# Load Action dataset
with open('datasets/action.json', 'r') as f:
    action_tasks = json.load(f)

# Load Retrieval dataset
with open('datasets/retrieval.json', 'r') as f:
    retrieval_tasks = json.load(f)

print(f"Action tasks: {len(action_tasks)}")
print(f"Retrieval tasks: {len(retrieval_tasks)}")
```

```javascript
import { readFileSync } from 'fs';

const actionTasks = JSON.parse(
  readFileSync('datasets/action.json', 'utf-8')
);
const retrievalTasks = JSON.parse(
  readFileSync('datasets/retrieval.json', 'utf-8')
);

console.log(`Action tasks: ${actionTasks.length}`);
console.log(`Retrieval tasks: ${retrievalTasks.length}`);
```

For each task in the Action dataset:
- Present the `prompt` to your AI assistant
- Observe the assistant's actions and responses
- Manually evaluate whether the task was completed successfully and correctly
- Record the evaluation result (see the sketch below)
Success criteria:
- Task completed as requested
- No errors or incorrect actions
- Appropriate verification steps taken
- Reasonable handling of edge cases
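A minimal bookkeeping sketch for this manual loop, assuming you present each prompt to your assistant yourself and enter a pass/fail verdict by hand; the interactive loop and the `action_results.json` output file are illustrative choices, not part of the benchmark:

```python
import json

# Load the Action dataset (same file as in the loading example above).
with open('datasets/action.json', 'r') as f:
    action_tasks = json.load(f)

action_results = []
for i, task in enumerate(action_tasks, start=1):
    # Present the prompt to your assistant outside this script, observe what it
    # does, then record a manual pass/fail verdict against the success criteria.
    print(f"[{i}/{len(action_tasks)}] {task['prompt']}")
    verdict = input("Task completed successfully and correctly? [y/n] ")
    action_results.append(verdict.strip().lower() == 'y')

# Persist the verdicts for scoring later (file name is arbitrary).
with open('action_results.json', 'w') as f:
    json.dump(action_results, f)
```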
For each question in the Retrieval dataset:
- Present the `prompt` to your AI assistant with access to your knowledge base
- Capture the assistant's response in the `assistant_response` field
- Compare the response against the `expected_response`
- Record the evaluation result based on accuracy (see the sketch below)
Evaluation criteria:
- Factual correctness
- Completeness (includes all key information)
- Accuracy of technical details
- Proper handling of versioning/contradictions
- Appropriate level of detail
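A corresponding bookkeeping sketch for retrieval, assuming an `ask_assistant` function of your own that queries your assistant with knowledge-base access (a placeholder here) and manual grading against the criteria above; the output file names are arbitrary:

```python
import json

def ask_assistant(prompt: str) -> str:
    """Placeholder: call your own assistant / knowledge-base setup here."""
    raise NotImplementedError

with open('datasets/retrieval.json', 'r') as f:
    retrieval_tasks = json.load(f)

retrieval_results = []
for task in retrieval_tasks:
    # Capture the assistant's answer in the assistant_response field.
    task['assistant_response'] = ask_assistant(task['prompt'])

    # Manual grading: compare against the expected_response using the criteria above.
    print("Q:       ", task['prompt'])
    print("Expected:", task['expected_response'])
    print("Got:     ", task['assistant_response'])
    verdict = input("Accurate and complete? [y/n] ")
    retrieval_results.append(verdict.strip().lower() == 'y')

# Save the filled-in dataset and the verdicts (file names are arbitrary).
with open('retrieval_with_responses.json', 'w') as f:
    json.dump(retrieval_tasks, f, indent=2)
with open('retrieval_results.json', 'w') as f:
    json.dump(retrieval_results, f)
```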
After evaluating the datasets and recording results in your own tracking system:
```python
import json

def evaluate_dataset(results):
    """
    results: list of evaluation results (True/False or 1/0)
    """
    correct = sum(results)
    total = len(results)
    accuracy = (correct / total) * 100
    return {
        'correct': correct,
        'total': total,
        'accuracy': accuracy
    }

# Example: After manual evaluation
action_results = [True, False, True, True, ...]     # Your evaluation results
retrieval_results = [True, True, False, True, ...]  # Your evaluation results

action_score = evaluate_dataset(action_results)
retrieval_score = evaluate_dataset(retrieval_results)

print(f"Action Dataset - Accuracy: {action_score['accuracy']:.2f}%")
print(f"Retrieval Dataset - Accuracy: {retrieval_score['accuracy']:.2f}%")
```

Task Arena is ideal for:
- AI Assistant Benchmarking: Compare different AI assistants on real-world tasks
- Model Evaluation: Test new language models or agent architectures
- Capability Assessment: Identify strengths and weaknesses in task execution
- Progress Tracking: Monitor improvements over time
- Research: Study AI assistant behavior on practical tasks
- Product Development: Validate AI features before release
We welcome contributions to expand and improve the Task Arena benchmark! Please consider:
- Adding new diverse task prompts to the Action dataset
- Submitting new question-answer pairs for the Retrieval dataset
- Reporting issues with existing prompts or expected responses
- Suggesting new evaluation categories or metrics
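New entries follow the schemas shown above. As a purely illustrative sketch (the example prompts and the append-to-file workflow are assumptions, not a required contribution process):

```python
import json

# Hypothetical new entries in the documented formats.
new_action_task = {
    "prompt": "Draft a reply to the most recent email from my manager and save it as a draft."
}
new_retrieval_pair = {
    "prompt": "What is the maximum number of items the desktop app's offline queue can hold?",
    "expected_response": "The offline queue holds at most 5,000 items.",
    "assistant_response": ""
}

# Append to the dataset files before opening a pull request.
for path, entry in [('datasets/action.json', new_action_task),
                    ('datasets/retrieval.json', new_retrieval_pair)]:
    with open(path, 'r') as f:
        entries = json.load(f)
    entries.append(entry)
    with open(path, 'w') as f:
        json.dump(entries, f, indent=2)
```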
If you use Task Arena in your research or product evaluation, please cite:
```bibtex
@misc{taskarena2025,
  title={Task Arena: A Benchmark for Real-World AI Assistant Evaluation},
  year={2025},
  url={https://github.com/yourusername/task-arena-benchmark}
}
```

This benchmark is released under the MIT License. See LICENSE for details.
Task Arena was created to address the need for practical, real-world evaluation of AI assistants beyond traditional benchmarks. The tasks and questions are derived from actual user interactions and production scenarios.
This is an early evaluation and will continue to evolve. Future versions will include:
- Broader Integration Coverage: Additional integrations beyond the current scope (more productivity tools, development platforms, communication apps)
- Expanded Action Types: More diverse task categories and complexity levels
- Richer Data Sources: More comprehensive knowledge bases and retrieval scenarios
- Domain-Specific Tasks: Specialized evaluations for different industries and use cases