Unstructured Data Analysis using LLMs: A Comprehensive Benchmark [Experiments & Analysis]

BIT-DataLab/UDA-Bench

UDA-Bench: Unstructured Data Analysis Benchmark

🎯 Project Overview

The explosion of unstructured data holds immense analytical value. By leveraging large language models (LLMs) to extract table-like attributes from unstructured data, researchers are building LLM-powered systems that analyze documents as if querying a database. These unstructured data analysis (UDA) systems differ widely in query interfaces, optimizations, and operators, making it unclear which works best in which scenario. However, no existing benchmark offers high-quality, large-scale, diverse datasets and rich query workloads to rigorously evaluate them. We present UDA-Bench, a comprehensive UDA benchmark that addresses this need. We curate 6 datasets from different domains and, with the help of 30 graduate students, manually construct a relational database view for each. These relational databases serve as ground truth for evaluating any UDA system, regardless of its interface. We further design diverse queries over the database schemas that exercise various analytical operators with different selectivities and complexities. Using this benchmark, we conduct an in-depth analysis of key UDA components (query interface, optimization, operator design, and data processing) and run exhaustive experiments to evaluate systems and techniques along these dimensions. Our main contributions are: (1) a comprehensive benchmark for rigorous UDA evaluation, and (2) a deeper understanding of the strengths and limitations of current systems, paving the way for future work in UDA.

To help users quickly grasp each dataset's schema, attributes, data distribution, and query workload, we provide an interactive visualization interface. It lets users browse relational schemas, inspect attribute metadata, view example documents, and explore the query taxonomy in a single, easy-to-use place. Please Click Here!

Figure 1: System architecture showing the query interface, logical optimization, physical optimization, and unstructured data processing pipeline.

📈 Dataset Statistics

| Dataset | # Attributes | # Files | Tokens (Max / Min / Avg) | Multi-modal |
|---|---|---|---|---|
| Art | 19 | 1,000 | 1,665 / 619 / 789 | ✓ |
| CSPaper | 20 | 200 | 107,710 / 5,325 / 29,951 | ✓ |
| Player | 28 | 225 | 51,378 / 73 / 8,047 | ✗ |
| Legal | 19 | 566 | 45,437 / 340 / 5,609 | ✗ |
| Finance | 30 | 100 | 838,418 / 7,162 / 130,633 | ✗ |
| Healthcare | 51 | 100,000 | 63,234 / 2,759 / 10,649 | ✗ |

💾 Data Access

Download

Due to the large size of our datasets, we provide access through download links rather than storing them directly in the repository.

Dataset Downloads

| Dataset | Size | Download Link | Ground Truth |
|---|---|---|---|
| Art | ~379MB | Download Art Dataset | Download Ground Truth |
| CSPaper | ~678.3MB | Download CSPaper Dataset | Download Ground Truth |
| Player | ~2.43MB | Download Player Dataset | Download Ground Truth |
| Legal | ~304MB | Download Legal Dataset | Download Ground Truth |
| Finance | ~413.6MB | Download Finance Dataset | Download Ground Truth |
| Healthcare | ~1.7GB | Download Healthcare Dataset | Download Ground Truth |

📚 Dataset Details

🎨 Art Dataset

  • Source: WikiArt.org
  • Content: Artists and their artworks spanning from the 19th to 21st centuries
  • Characteristics: Multimodal dataset containing biographical information, artistic movements, lists of representative works, and images of those works

🧾 CSPaper Dataset

  • Source: Computer science publications (curated collection of CS papers)
  • Content: Attributes extracted from each paper, such as title, authors, baselines, and performance
  • Characteristics: 200 research papers crawled from arXiv and annotated with key attributes, including authors, baselines and their performance, and the modalities of the experimental datasets. Some papers report the performance of all baselines in the main text, while others report only the best-performing baselines and leave the remaining results in tables or figures, producing a mixed-modal analysis scenario.

πŸ€ Player Dataset

  • Source: Wikipedia
  • Content: NBA players, teams, and team owners from the 20th century to the present, covering basic information and statistics
  • Characteristics: Relatively simple structure, with attributes such as player personal honors, team founding year, and owner nationality

βš–οΈ Legal Dataset

  • Source: AustLII
  • Content: 570 professional legal cases from Australia, spanning 2006 to 2009
  • Characteristics: Domain-specific dataset covering different case types, such as criminal and administrative cases; extracting attributes requires semantic reasoning

💰 Finance Dataset

  • Source: Enterprise RAG Challenge
  • Content: Annual and quarterly financial reports published in 2022 by 100 listed companies worldwide
  • Characteristics: Extremely long documents (averaging 130,633 tokens) that mix content types; attributes include company name, net profit, and total assets

πŸ₯ Healthcare Dataset

  • Source: MMedC
  • Content: A large collection of healthcare documents published since 2020
  • Characteristics: The largest dataset in the benchmark, covering drugs, diseases, medical institutions, news, interviews, and other healthcare information

πŸ“ File Structure

unstractured_analysis_benchmark/
├── README.md         # Project documentation
├── img/              # Project-related images
├── Queries/          # Benchmark queries
├── systems/          # Evaluation systems
│   ├── evaporate/    # Evaporate system adaptation
│   ├── palimpzest/   # Palimpzest system adaptation
│   ├── lotus/        # LOTUS system wrapper
│   ├── docetl/       # DocETL system usage examples
│   ├── quest/        # QUEST system extension
│   ├── zendb/        # ZenDB system implementation
│   └── uqe/          # UQE system implementation
└── evaluation/       # Evaluation scripts
    ├── evaluate.py
    ├── evaluate_healthcare.py
    ├── evaluate_agg.py
    └── attr_types.json

🔧 Benchmark Construction Process

Figure 2: Benchmark Construction Process

1. 📥 Data Collection and Preprocessing

  • Collect data from the original sources
  • Use the MinerU toolkit to parse complex formats such as PDF
  • Organize each dataset into JSON format, where each object corresponds to one unstructured document
  • For the Healthcare and Player datasets, divide documents into multiple related domains
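
The preprocessing output can be pictured as follows. This is a sketch only: the field names (`doc_id`, `text`, `domain`) are illustrative assumptions, not the benchmark's actual schema; check the downloaded files for the real layout.

```python
# Illustrative shape of a preprocessed dataset file: a JSON array where
# each object corresponds to one unstructured document. Field names here
# are assumptions for illustration.
import json

documents = [
    {
        "doc_id": "legal_0001",                 # hypothetical document identifier
        "text": "IN THE SUPREME COURT OF ...",  # parsed plain text (e.g. via MinerU)
        "domain": "criminal",                   # domain split (Healthcare/Player only)
    },
]

serialized = json.dumps(documents, indent=2)
parsed = json.loads(serialized)
print(parsed[0]["doc_id"])  # legal_0001
```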

2. 🏷️ Attribute Identification

  • Hire 6 Ph.D. students from different majors to carefully read the documents
  • Identify significant attributes with varying extraction difficulty
  • Example: judge names in the Legal dataset are easy to identify, while case numbers require full-text search and reasoning

3. ✅ Ground Truth Labeling

  • A total of 30 graduate students participated in labeling, spending approximately 4,000 person-hours
  • Use multiple LLMs (Deepseek-V3, GPT-4.1, Claude-sonnet-4) for cross-validation
  • Adopt a semi-automated iterative labeling strategy for the large-scale datasets
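
The cross-validation step above can be sketched as a simple agreement check: a label on which all models agree is accepted, while any disagreement is routed to a human annotator. This is a minimal sketch of the idea, not the authors' exact pipeline; the model names come from the bullet above, and the dictionaries stand in for real LLM outputs.

```python
# Sketch of cross-validating one extracted attribute across multiple LLMs.
# Accept a label only when all models agree; otherwise flag it for humans.

def cross_validate(labels_by_model):
    """labels_by_model maps model name -> extracted value for one attribute.

    Returns (label, needs_human_review).
    """
    values = set(labels_by_model.values())
    if len(values) == 1:
        return values.pop(), False  # unanimous: accept automatically
    return None, True               # disagreement: send to an annotator

# Unanimous labels are accepted; a disagreement is flagged for review.
agreed = cross_validate(
    {"Deepseek-V3": "Smith", "GPT-4.1": "Smith", "Claude-sonnet-4": "Smith"})
print(agreed)  # ('Smith', False)
```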
Figure 3: Query Categories

4. 🔍 Query Construction

  • Experts design query templates based on real-world scenarios
  • Support both SQL-like queries and Python code interfaces
  • A total of 608 queries, divided into 5 major categories and 42 sub-categories
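
To illustrate the two interfaces, a benchmark-style query might look like the following. The table and attribute names (`legal_cases`, `case_type`, `judge`) are hypothetical, chosen only to evoke the Legal dataset; the actual benchmark queries live in the Queries/ directory.

```python
# The same hypothetical analysis expressed in both query interfaces.

# SQL-like interface: systems such as QUEST, ZenDB, and UQE accept
# queries in this style.
sql_query = """
SELECT judge, COUNT(*) AS n_cases
FROM legal_cases
WHERE case_type = 'criminal'
GROUP BY judge
"""

# Python code interface: the same analysis over extracted attribute records.
def run_query(records):
    """Count criminal cases per judge over a list of extracted-attribute dicts."""
    counts = {}
    for r in records:
        if r.get("case_type") == "criminal":
            counts[r["judge"]] = counts.get(r["judge"], 0) + 1
    return counts

records = [
    {"judge": "Smith", "case_type": "criminal"},
    {"judge": "Smith", "case_type": "administrative"},
    {"judge": "Jones", "case_type": "criminal"},
]
print(run_query(records))  # {'Smith': 1, 'Jones': 1}
```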

🚀 Usage Instructions

Quick Start Examples

  1. 📥 Download Datasets: Use the provided download links to obtain the datasets you need
  2. 📂 Extract Files: Unzip the downloaded files to your local directory
  3. 💻 Load Data into System: Load the JSON data into your analysis system
  4. 🔍 Execute Queries: Run the benchmark queries from the Queries/ directory
  5. 📊 Compare Results: Compare your results with the ground truth CSV files
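
The steps above can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the file paths, attribute names, and the exact-match scoring below are illustrative, not the benchmark's official evaluation (see evaluation/evaluate.py for that).

```python
# Quick-start sketch: load a downloaded dataset, then compare extracted
# attributes against the ground truth CSV. Paths and attribute names are
# assumptions for illustration.
import csv
import json

def load_documents(path):
    """Each dataset is JSON; each object is one unstructured document."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def load_ground_truth(path):
    """Ground truth is a relational table stored as CSV."""
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f))

def accuracy(predicted, truth, key):
    """Fraction of rows whose extracted attribute exactly matches the truth."""
    if not truth:
        return 0.0
    hits = sum(1 for p, t in zip(predicted, truth) if p.get(key) == t.get(key))
    return hits / len(truth)

# In-memory demo with a hypothetical "judge" attribute:
pred = [{"judge": "Smith"}, {"judge": "Jones"}]
truth = [{"judge": "Smith"}, {"judge": "Brown"}]
print(accuracy(pred, truth, "judge"))  # 0.5
```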

🧪 Systems for Evaluation

Our benchmark evaluates 7 existing unstructured data analysis systems:

| System | Open Source | Repository | Modifications |
|---|---|---|---|
| 📋 Evaporate | ✅ | GitHub | Adaptation |
| 🐍 Palimpzest (PZ) | ✅ | GitHub | Adaptation |
| 🌸 LOTUS | ✅ | GitHub | Adaptation |
| 🤖 DocETL | ✅ | GitHub | Direct Usage |
| ❓ QUEST | ✅ | GitHub | Adaptation |
| 🎯 ZenDB | ❌ | Paper | Implementation |
| 🔍 UQE | ❌ | Paper | Implementation |

System Descriptions:

Evaporate: A system that extracts structured tables from documents and then executes SQL queries on the resulting tables.

Palimpzest (PZ): Provides Python API-based operators for unstructured data processing. We convert each SQL query into the corresponding PZ code, execute it, and obtain the results.

LOTUS: Provides an open-source Python library for AI-based data processing with indexing, extraction, filtering, and joining capabilities. We use its interface to execute queries.

DocETL: An agentic query rewriting and evaluation system for complex document processing. We directly use the DocETL library to execute queries without any modifications.

QUEST: A query engine for unstructured databases that accepts a subset of standard SQL syntax. We directly use their code to execute queries.

ZenDB: A system that constructs semantic hierarchical trees to identify relevant document sections. We implement their SHT chunking and filter reordering strategies.

UQE: A query engine for unstructured databases that supports SQL-like query syntax with sampling-based aggregation capabilities. We implement its filter and aggregate operators, as well as logical optimizations.

System Capabilities Comparison

| System | Query Interface | Chunking | Embedding | Multi-modal | Extract | Filter | Join | Aggregate | Logical Opt. | Physical Opt. |
|---|---|---|---|---|---|---|---|---|---|---|
| Evaporate | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Palimpzest | Code | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| LOTUS | Code | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| DocETL | Code | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| ZenDB | SQL-like | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
| QUEST | SQL-like | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
| UQE | SQL-like | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ |

Table 1: Overview of existing unstructured data analysis systems and their capabilities.

🤝 Contributing

We welcome issue reports, feature requests, and code contributions. Please follow the project's coding standards and testing requirements.

📧 Contact

For questions or suggestions, please contact us through:

  • Submit GitHub Issues
  • Send email to: [Email to be added]

Last updated: 2025
