An early-stage benchmark for evaluating AI assistant capabilities across real-world task execution and knowledge retrieval scenarios.
Task Arena is designed to test AI assistants on practical, real-world tasks that users frequently request. This is an early evaluation that provides a starting point for assessment, though it is not yet comprehensive. The benchmark currently focuses on a limited set of integrations, actions, and data sources, and will be expanded over time to cover a broader range of real-world scenarios.
The benchmark consists of two primary evaluation datasets:
- Action Dataset: Tests an AI assistant's ability to understand and execute complex, multi-step tasks
- Retrieval Dataset: Tests an AI assistant's ability to accurately retrieve and synthesize information from a knowledge base
| Dataset | Samples | Description |
|---|---|---|
| Action | 51 | Real-world task prompts covering email management, calendar scheduling, document creation, research, and more |
| Retrieval | 52 | Question-answer pairs testing knowledge retrieval, understanding, and synthesis capabilities |
The Action dataset contains real-world task prompts that AI assistants commonly encounter. These tasks test:
- Email Management: Reading, replying, drafting, and organizing emails
- Calendar Operations: Checking availability, scheduling meetings, sending invites
- Document Creation: Creating Google Docs, Sheets, and presentations with specific content
- Research Tasks: Gathering information and organizing it into structured documents
- File Management: Sharing files, organizing documents, extracting information
- Multi-step Workflows: Complex tasks requiring multiple actions in sequence
```json
[
  {
    "prompt": "Catch me up on my latest emails please."
  },
  {
    "prompt": "Can you please schedule the meeting between me and Tejas based on his and my availability? Send over the invite and set up the meeting please (Google Meet) for when I'm available."
  }
]
```

Each entry contains:
- `prompt`: The task description given to the AI assistant

The Retrieval dataset tests an AI assistant's ability to answer questions accurately based on a knowledge base. Questions cover:
- Technical Specifications: Architecture decisions, implementation details
- Product Features: Capabilities, configurations, limitations
- Versioning: Changes across different versions, deprecations
- Best Practices: Recommended approaches, performance targets
- Cross-referencing: Questions requiring synthesis of information from multiple sources
```json
[
  {
    "prompt": "How does our desktop app offline queue work?",
    "expected_response": "Actions are saved locally in SQLite with WAL mode. When offline, actions queue up (max 5,000 items). On reconnect, they sync with exponential backoff + jitter, using idempotency keys. For conflicts, the Final spec uses operational transform (CRDT-based), not last-writer-wins - that was only in v1. Conflict UI shows both versions and lets users pick or merge.",
    "assistant_response": ""
  }
]
```

Each entry contains:
- `prompt`: The question asked to the AI assistant
- `expected_response`: The correct, detailed answer
- `assistant_response`: Field for storing the AI's actual response

```python
import json

# Load Action dataset
with open('datasets/action.json', 'r') as f:
    action_tasks = json.load(f)

# Load Retrieval dataset
with open('datasets/retrieval.json', 'r') as f:
    retrieval_tasks = json.load(f)

print(f"Action tasks: {len(action_tasks)}")
print(f"Retrieval tasks: {len(retrieval_tasks)}")
```

```javascript
import { readFileSync } from 'fs';

const actionTasks = JSON.parse(
  readFileSync('datasets/action.json', 'utf-8')
);
const retrievalTasks = JSON.parse(
  readFileSync('datasets/retrieval.json', 'utf-8')
);

console.log(`Action tasks: ${actionTasks.length}`);
console.log(`Retrieval tasks: ${retrievalTasks.length}`);
```

For each task in the Action dataset:
- Present the `prompt` to your AI assistant
- Observe the assistant's actions and responses
- Manually evaluate whether the task was completed successfully and correctly
- Record the evaluation result (see the sketch below)
Success criteria:
- Task completed as requested
- No errors or incorrect actions
- Appropriate verification steps taken
- Reasonable handling of edge cases
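A minimal bookkeeping sketch for this manual loop, assuming you present each prompt to your assistant yourself and enter a pass/fail verdict by hand; the interactive loop and the `action_results.json` output file are illustrative choices, not part of the benchmark:

```python
import json

# Load the Action dataset (same file as in the loading example above).
with open('datasets/action.json', 'r') as f:
    action_tasks = json.load(f)

action_results = []
for i, task in enumerate(action_tasks, start=1):
    # Present the prompt to your assistant outside this script, observe what it
    # does, then record a manual pass/fail verdict against the success criteria.
    print(f"[{i}/{len(action_tasks)}] {task['prompt']}")
    verdict = input("Task completed successfully and correctly? [y/n] ")
    action_results.append(verdict.strip().lower() == 'y')

# Persist the verdicts for scoring later (file name is arbitrary).
with open('action_results.json', 'w') as f:
    json.dump(action_results, f)
```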
For each question in the Retrieval dataset:
- Present the `prompt` to your AI assistant with access to your knowledge base
- Capture the assistant's response in the `assistant_response` field
- Compare the response against the `expected_response`
- Record the evaluation result based on accuracy (see the sketch below)
Evaluation criteria:
- Factual correctness
- Completeness (includes all key information)
- Accuracy of technical details
- Proper handling of versioning/contradictions
- Appropriate level of detail
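A corresponding bookkeeping sketch for retrieval, assuming an `ask_assistant` function of your own that queries your assistant with knowledge-base access (a placeholder here) and manual grading against the criteria above; the output file names are arbitrary:

```python
import json

def ask_assistant(prompt: str) -> str:
    """Placeholder: call your own assistant / knowledge-base setup here."""
    raise NotImplementedError

with open('datasets/retrieval.json', 'r') as f:
    retrieval_tasks = json.load(f)

retrieval_results = []
for task in retrieval_tasks:
    # Capture the assistant's answer in the assistant_response field.
    task['assistant_response'] = ask_assistant(task['prompt'])

    # Manual grading: compare against the expected_response using the criteria above.
    print("Q:       ", task['prompt'])
    print("Expected:", task['expected_response'])
    print("Got:     ", task['assistant_response'])
    verdict = input("Accurate and complete? [y/n] ")
    retrieval_results.append(verdict.strip().lower() == 'y')

# Save the filled-in dataset and the verdicts (file names are arbitrary).
with open('retrieval_with_responses.json', 'w') as f:
    json.dump(retrieval_tasks, f, indent=2)
with open('retrieval_results.json', 'w') as f:
    json.dump(retrieval_results, f)
```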
After evaluating the datasets and recording results in your own tracking system:
```python
import json

def evaluate_dataset(results):
    """
    results: list of evaluation results (True/False or 1/0)
    """
    correct = sum(results)
    total = len(results)
    accuracy = (correct / total) * 100
    return {
        'correct': correct,
        'total': total,
        'accuracy': accuracy
    }

# Example: After manual evaluation
action_results = [True, False, True, True, ...]     # Your evaluation results
retrieval_results = [True, True, False, True, ...]  # Your evaluation results

action_score = evaluate_dataset(action_results)
retrieval_score = evaluate_dataset(retrieval_results)

print(f"Action Dataset - Accuracy: {action_score['accuracy']:.2f}%")
print(f"Retrieval Dataset - Accuracy: {retrieval_score['accuracy']:.2f}%")
```

Task Arena is ideal for:
- AI Assistant Benchmarking: Compare different AI assistants on real-world tasks
- Model Evaluation: Test new language models or agent architectures
- Capability Assessment: Identify strengths and weaknesses in task execution
- Progress Tracking: Monitor improvements over time
- Research: Study AI assistant behavior on practical tasks
- Product Development: Validate AI features before release
We welcome contributions to expand and improve the Task Arena benchmark! Please consider:
- Adding new diverse task prompts to the Action dataset
- Submitting new question-answer pairs for the Retrieval dataset
- Reporting issues with existing prompts or expected responses
- Suggesting new evaluation categories or metrics
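New entries follow the schemas shown above. As a purely illustrative sketch (the example prompts and the append-to-file workflow are assumptions, not a required contribution process):

```python
import json

# Hypothetical new entries in the documented formats.
new_action_task = {
    "prompt": "Draft a reply to the most recent email from my manager and save it as a draft."
}
new_retrieval_pair = {
    "prompt": "What is the maximum number of items the desktop app's offline queue can hold?",
    "expected_response": "The offline queue holds at most 5,000 items.",
    "assistant_response": ""
}

# Append to the dataset files before opening a pull request.
for path, entry in [('datasets/action.json', new_action_task),
                    ('datasets/retrieval.json', new_retrieval_pair)]:
    with open(path, 'r') as f:
        entries = json.load(f)
    entries.append(entry)
    with open(path, 'w') as f:
        json.dump(entries, f, indent=2)
```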
If you use Task Arena in your research or product evaluation, please cite:
```bibtex
@misc{taskarena2025,
  title={Task Arena: A Benchmark for Real-World AI Assistant Evaluation},
  year={2025},
  url={https://github.com/yourusername/task-arena-benchmark}
}
```

This benchmark is released under the MIT License. See LICENSE for details.
Task Arena was created to address the need for practical, real-world evaluation of AI assistants beyond traditional benchmarks. The tasks and questions are derived from actual user interactions and production scenarios.
This is an early evaluation and will continue to evolve. Future versions will include:
- Broader Integration Coverage: Additional integrations beyond the current scope (more productivity tools, development platforms, communication apps)
- Expanded Action Types: More diverse task categories and complexity levels
- Richer Data Sources: More comprehensive knowledge bases and retrieval scenarios
- Domain-Specific Tasks: Specialized evaluations for different industries and use cases