Task Arena Benchmark

An early-stage benchmark for evaluating AI assistant capabilities across real-world task execution and knowledge retrieval scenarios.

Overview

Task Arena is designed to test AI assistants on practical, real-world tasks that users frequently request. It is an early evaluation intended as a starting point rather than a comprehensive assessment: the benchmark currently covers a limited set of integrations, actions, and data sources, and will be expanded over time to span a broader range of real-world scenarios.

The benchmark consists of two primary evaluation datasets:

  1. Action Dataset: Tests an AI assistant's ability to understand and execute complex, multi-step tasks
  2. Retrieval Dataset: Tests an AI assistant's ability to accurately retrieve and synthesize information from a knowledge base

Dataset Statistics

Dataset     Samples   Description
Action      51        Real-world task prompts covering email management, calendar scheduling, document creation, research, and more
Retrieval   52        Question-answer pairs testing knowledge retrieval, understanding, and synthesis capabilities

Dataset Details

Action Dataset (datasets/action.json)

The Action dataset contains real-world task prompts that AI assistants commonly encounter. These tasks test:

  • Email Management: Reading, replying, drafting, and organizing emails
  • Calendar Operations: Checking availability, scheduling meetings, sending invites
  • Document Creation: Creating Google Docs, Sheets, presentations with specific content
  • Research Tasks: Gathering information and organizing it into structured documents
  • File Management: Sharing files, organizing documents, extracting information
  • Multi-step Workflows: Complex tasks requiring multiple actions in sequence

Format

[
  {
    "prompt": "Catch me up on my latest emails please."
  },
  {
    "prompt": "Can you please schedule the meeting between me and Tejas based on his and my availability? Send over the invite and set up the meeting please (Google Meet) for when I'm available."
  }
]

Each entry contains:

  • prompt: The task description given to the AI assistant

Retrieval Dataset (datasets/retrieval.json)

The Retrieval dataset tests an AI assistant's ability to answer questions accurately based on a knowledge base. Questions cover:

  • Technical Specifications: Architecture decisions, implementation details
  • Product Features: Capabilities, configurations, limitations
  • Versioning: Changes across different versions, deprecations
  • Best Practices: Recommended approaches, performance targets
  • Cross-referencing: Questions requiring synthesis of information from multiple sources

Format

[
  {
    "prompt": "How does our desktop app offline queue work?",
    "expected_response": "Actions are saved locally in SQLite with WAL mode. When offline, actions queue up (max 5,000 items). On reconnect, they sync with exponential backoff + jitter, using idempotency keys. For conflicts, the Final spec uses operational transform (CRDT-based), not last-writer-wins - that was only in v1. Conflict UI shows both versions and lets users pick or merge.",
    "assistant_response": ""
  }
]

Each entry contains:

  • prompt: The question posed to the AI assistant
  • expected_response: The correct, detailed answer
  • assistant_response: Field for storing the AI's actual response
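
For reference, these record shapes could be written as Python TypedDicts. This is only an illustration of the JSON structure above, not something the benchmark itself ships:

from typing import TypedDict

class ActionTask(TypedDict):
    """One record in datasets/action.json."""
    prompt: str

class RetrievalTask(TypedDict):
    """One record in datasets/retrieval.json."""
    prompt: str
    expected_response: str
    assistant_response: str  # left empty until you record your assistant's answer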

Usage

Loading the Datasets

Python

import json

# Load Action dataset
with open('datasets/action.json', 'r') as f:
    action_tasks = json.load(f)

# Load Retrieval dataset
with open('datasets/retrieval.json', 'r') as f:
    retrieval_tasks = json.load(f)

print(f"Action tasks: {len(action_tasks)}")
print(f"Retrieval tasks: {len(retrieval_tasks)}")

JavaScript/TypeScript

import { readFileSync } from 'fs';

const actionTasks = JSON.parse(
  readFileSync('datasets/action.json', 'utf-8')
);

const retrievalTasks = JSON.parse(
  readFileSync('datasets/retrieval.json', 'utf-8')
);

console.log(`Action tasks: ${actionTasks.length}`);
console.log(`Retrieval tasks: ${retrievalTasks.length}`);

Evaluation Guidelines

Action Dataset Evaluation

For each task in the action dataset:

  1. Present the prompt to your AI assistant
  2. Observe the assistant's actions and responses
  3. Manually evaluate whether the task was completed successfully and correctly
  4. Record the evaluation result

Success criteria:

  • Task completed as requested
  • No errors or incorrect actions
  • Appropriate verification steps taken
  • Reasonable handling of edge cases
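
A minimal harness for this loop might look like the following sketch. Here run_assistant is a hypothetical hook for whatever assistant you are testing, and the pass/fail judgment is still entered by hand:

import json

def run_assistant(prompt: str) -> str:
    """Hypothetical hook: send the prompt to the assistant under test
    and return a transcript of what it did."""
    raise NotImplementedError

# Load the action tasks
with open('datasets/action.json', 'r') as f:
    action_tasks = json.load(f)

# Run each task and record a manual pass/fail verdict
action_results = []
for task in action_tasks:
    print(task['prompt'])
    print(run_assistant(task['prompt']))
    verdict = input('Task completed correctly? [y/n] ').strip().lower() == 'y'
    action_results.append(verdict)

# Persist results for the scoring step below
with open('action_results.json', 'w') as f:
    json.dump(action_results, f)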

Retrieval Dataset Evaluation

For each question in the retrieval dataset:

  1. Present the prompt to your AI assistant with access to your knowledge base
  2. Capture the assistant's response in the assistant_response field
  3. Compare the response against the expected_response
  4. Record the evaluation result based on accuracy

Evaluation criteria:

  • Factual correctness
  • Completeness (includes all key information)
  • Accuracy of technical details
  • Proper handling of versioning/contradictions
  • Appropriate level of detail
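
One way to capture responses is to fill assistant_response in place and write the file back out for grading. A minimal sketch, again assuming a hypothetical ask_assistant hook:

import json

def ask_assistant(prompt: str) -> str:
    """Hypothetical hook: query the assistant under test (with access to
    your knowledge base) and return its answer as text."""
    raise NotImplementedError

with open('datasets/retrieval.json', 'r') as f:
    retrieval_tasks = json.load(f)

# Fill in the assistant_response field for every question
for task in retrieval_tasks:
    task['assistant_response'] = ask_assistant(task['prompt'])

# Write a copy with responses so they can be compared against expected_response
with open('datasets/retrieval_with_responses.json', 'w') as f:
    json.dump(retrieval_tasks, f, indent=2)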

Computing Scores

After evaluating the datasets and recording results in your own tracking system:

def evaluate_dataset(results):
    """
    results: list of evaluation results (True/False or 1/0)
    """
    correct = sum(results)
    total = len(results)
    accuracy = (correct / total) * 100 if total else 0.0
    return {
        'correct': correct,
        'total': total,
        'accuracy': accuracy
    }

# Example: After manual evaluation
action_results = [True, False, True, True, ...]  # Your evaluation results
retrieval_results = [True, True, False, True, ...]  # Your evaluation results

action_score = evaluate_dataset(action_results)
retrieval_score = evaluate_dataset(retrieval_results)

print(f"Action Dataset - Accuracy: {action_score['accuracy']:.2f}%")
print(f"Retrieval Dataset - Accuracy: {retrieval_score['accuracy']:.2f}%")

Use Cases

Task Arena is ideal for:

  • AI Assistant Benchmarking: Compare different AI assistants on real-world tasks
  • Model Evaluation: Test new language models or agent architectures
  • Capability Assessment: Identify strengths and weaknesses in task execution
  • Progress Tracking: Monitor improvements over time
  • Research: Study AI assistant behavior on practical tasks
  • Product Development: Validate AI features before release

Contributing

We welcome contributions to expand and improve the Task Arena benchmark! Please consider:

  • Adding new diverse task prompts to the Action dataset
  • Submitting new question-answer pairs for the Retrieval dataset
  • Reporting issues with existing prompts or expected responses
  • Suggesting new evaluation categories or metrics

Citation

If you use Task Arena in your research or product evaluation, please cite:

@misc{taskarena2025,
  title={Task Arena: A Benchmark for Real-World AI Assistant Evaluation},
  year={2025},
  url={https://github.com/dimensionhq/task-arena}
}

License

This benchmark is released under the MIT License. See LICENSE for details.

Acknowledgments

Task Arena was created to address the need for practical, real-world evaluation of AI assistants beyond traditional benchmarks. The tasks and questions are derived from actual user interactions and production scenarios.

Future Directions

This is an early evaluation and will continue to evolve. Future versions will include:

  • Broader Integration Coverage: Additional integrations beyond the current scope (more productivity tools, development platforms, communication apps)
  • Expanded Action Types: More diverse task categories and complexity levels
  • Richer Data Sources: More comprehensive knowledge bases and retrieval scenarios
  • Domain-Specific Tasks: Specialized evaluations for different industries and use cases
