Unstructured Data Analysis using LLMs: A Comprehensive Benchmark [Experiments & Analysis]

BIT-DataLab/UDA-Bench

UDA-Bench: Unstructured Data Analysis Benchmark

🎯 Project Overview

The explosion of unstructured data holds immense analytical value. By leveraging large language models (LLMs) to extract table-like attributes from unstructured data, researchers are building LLM-powered systems that analyze documents as if querying a database. These unstructured data analysis (UDA) systems differ widely in query interfaces, optimizations, and operators, making it unclear which works best in which scenario. However, no existing benchmark offers high-quality, large-scale, diverse datasets and rich query workloads to rigorously evaluate them. We present UDA-Bench, a comprehensive UDA benchmark that addresses this need. We curate 6 datasets from different domains and, with the help of 30 graduate students, manually construct a relational database view for each. These relational databases serve as ground truth for evaluating any UDA system, regardless of its interface. We further design diverse queries over the database schemas that exercise various analytical operators with different selectivities and complexities. Using this benchmark, we conduct an in-depth analysis of key UDA components (query interface, optimization, operator design, and data processing) and run exhaustive experiments to evaluate systems and techniques along these dimensions. Our main contributions are: (1) a comprehensive benchmark for rigorous UDA evaluation, and (2) a deeper understanding of the strengths and limitations of current systems, paving the way for future work in UDA.

To help users quickly grasp each dataset's schema, attributes, data distribution, and query workload, we provide an interactive visualization interface. It lets users browse relational schemas, inspect attribute metadata, view example documents, and explore the query taxonomy in a single, easy-to-use place. Please Click Here!

Figure 1: System architecture showing the query interface, logical optimization, physical optimization, and unstructured data processing pipeline.

📈 Dataset Statistics

| Dataset | # Attributes | # Files | Tokens (Max / Min / Avg) | Multi-modal |
|---|---|---|---|---|
| Art | 19 | 1,000 | 1,665 / 619 / 789 | ✓ |
| CSPaper | 20 | 200 | 107,710 / 5,325 / 29,951 | ✓ |
| Player | 28 | 225 | 51,378 / 73 / 8,047 | ✗ |
| Legal | 19 | 566 | 45,437 / 340 / 5,609 | ✗ |
| Finance | 30 | 100 | 838,418 / 7,162 / 130,633 | ✗ |
| Healthcare | 51 | 100,000 | 63,234 / 2,759 / 10,649 | ✗ |

💾 Data Access

Download

Due to the large size of our datasets, we provide access through download links rather than storing them directly in the repository.

Dataset Downloads

| Dataset | Size | Download Link | Ground Truth |
|---|---|---|---|
| Art | ~379MB | Download Art Dataset | Download Ground Truth |
| CSPaper | ~678.3MB | Download CSPaper Dataset | Download Ground Truth |
| Player | ~2.43MB | Download Player Dataset | Download Ground Truth |
| Legal | ~304MB | Download Legal Dataset | Download Ground Truth |
| Finance | ~413.6MB | Download Finance Dataset | Download Ground Truth |
| Healthcare | ~1.7GB | Download Healthcare Dataset | Download Ground Truth |

📚 Dataset Details

🎨 Art Dataset

  • Source: WikiArt.org
  • Content: Artists and their artworks spanning from the 19th to 21st centuries
  • Characteristics: Multimodal dataset containing biographical information, artistic movements, lists of representative works, and images of those works

🧾 CSPaper Dataset

  • Source: Computer science publications (curated collection of CS papers)
  • Content: Attributes extracted from each paper, such as title, authors, baselines, and performance
  • Characteristics: 200 research papers crawled from arXiv and annotated with key attributes, including authors, baselines and their performance, and the modalities of the experimental datasets. Some papers report the performance of all baselines in the main text, while others report only the best-performing baselines and leave the remaining results in tables or figures, producing a mixed-modal analysis scenario.

πŸ€ Player Dataset

  • Source: Wikipedia
  • Content: NBA players, teams, and team owners from the 20th century to the present, covering basic information and statistics
  • Characteristics: Relatively simple structure, with attributes such as player personal honors, team founding year, and owner nationality

βš–οΈ Legal Dataset

  • Source: AustLII
  • Content: 570 professional legal cases from Australia, spanning 2006 to 2009
  • Characteristics: Domain-specific dataset covering different case types, such as criminal and administrative cases; extracting attributes requires semantic reasoning

💰 Finance Dataset

  • Source: Enterprise RAG Challenge
  • Content: Annual and quarterly financial reports published in 2022 by 100 listed companies worldwide
  • Characteristics: Extremely long documents (averaging 130,633 tokens) that mix content types; attributes include company name, net profit, and total assets

πŸ₯ Healthcare Dataset

  • Source: MMedC
  • Content: A large collection of healthcare documents published since 2020
  • Characteristics: The largest dataset in the benchmark, covering drugs, diseases, medical institutions, news, interviews, and other healthcare information

πŸ“ File Structure

unstractured_analysis_benchmark/
├── README.md         # Project documentation
├── img/              # Project-related images
├── Queries/          # Benchmark queries
├── systems/          # Evaluation systems
│   ├── evaporate/    # Evaporate system adaptation
│   ├── palimpzest/   # Palimpzest system adaptation
│   ├── lotus/        # LOTUS system wrapper
│   ├── docetl/       # DocETL system usage examples
│   ├── quest/        # QUEST system extension
│   ├── zendb/        # ZenDB system implementation
│   └── uqe/          # UQE system implementation
└── evaluation/       # Evaluation scripts
    ├── evaluate.py
    ├── evaluate_healthcare.py
    ├── evaluate_agg.py
    └── attr_types.json

🔧 Benchmark Construction Process

Figure 2: Benchmark Construction Process

1. 📥 Data Collection and Preprocessing

  • Collect data from the original sources
  • Use the MinerU toolkit to parse complex formats such as PDF
  • Organize each dataset into JSON format, where each object corresponds to one unstructured document
  • For the Healthcare and Player datasets, divide documents into multiple related domains
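
The preprocessing output can be pictured as follows. This is a sketch only: the field names (`doc_id`, `text`, `domain`) are illustrative assumptions, not the benchmark's actual schema; check the downloaded files for the real layout.

```python
# Illustrative shape of a preprocessed dataset file: a JSON array where
# each object corresponds to one unstructured document. Field names here
# are assumptions for illustration.
import json

documents = [
    {
        "doc_id": "legal_0001",                 # hypothetical document identifier
        "text": "IN THE SUPREME COURT OF ...",  # parsed plain text (e.g. via MinerU)
        "domain": "criminal",                   # domain split (Healthcare/Player only)
    },
]

serialized = json.dumps(documents, indent=2)
parsed = json.loads(serialized)
print(parsed[0]["doc_id"])  # legal_0001
```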

2. 🏷️ Attribute Identification

  • Hire 6 Ph.D. students from different majors to carefully read the documents
  • Identify significant attributes with varying extraction difficulty
  • Example: judge names in the Legal dataset are easy to identify, while case numbers require full-text search and reasoning

3. ✅ Ground Truth Labeling

  • A total of 30 graduate students participated in labeling, spending approximately 4,000 person-hours
  • Use multiple LLMs (Deepseek-V3, GPT-4.1, Claude-sonnet-4) for cross-validation
  • Adopt a semi-automated iterative labeling strategy for the large-scale datasets
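
The cross-validation step above can be sketched as a simple agreement check: a label on which all models agree is accepted, while any disagreement is routed to a human annotator. This is a minimal sketch of the idea, not the authors' exact pipeline; the model names come from the bullet above, and the dictionaries stand in for real LLM outputs.

```python
# Sketch of cross-validating one extracted attribute across multiple LLMs.
# Accept a label only when all models agree; otherwise flag it for humans.

def cross_validate(labels_by_model):
    """labels_by_model maps model name -> extracted value for one attribute.

    Returns (label, needs_human_review).
    """
    values = set(labels_by_model.values())
    if len(values) == 1:
        return values.pop(), False  # unanimous: accept automatically
    return None, True               # disagreement: send to an annotator

# Unanimous labels are accepted; a disagreement is flagged for review.
agreed = cross_validate(
    {"Deepseek-V3": "Smith", "GPT-4.1": "Smith", "Claude-sonnet-4": "Smith"})
print(agreed)  # ('Smith', False)
```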
Figure 3: Query Categories

4. 🔍 Query Construction

  • Experts design query templates based on real-world scenarios
  • Support both SQL-like queries and Python code interfaces
  • A total of 608 queries, divided into 5 major categories and 42 sub-categories
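
To illustrate the two interfaces, a benchmark-style query might look like the following. The table and attribute names (`legal_cases`, `case_type`, `judge`) are hypothetical, chosen only to evoke the Legal dataset; the actual benchmark queries live in the Queries/ directory.

```python
# The same hypothetical analysis expressed in both query interfaces.

# SQL-like interface: systems such as QUEST, ZenDB, and UQE accept
# queries in this style.
sql_query = """
SELECT judge, COUNT(*) AS n_cases
FROM legal_cases
WHERE case_type = 'criminal'
GROUP BY judge
"""

# Python code interface: the same analysis over extracted attribute records.
def run_query(records):
    """Count criminal cases per judge over a list of extracted-attribute dicts."""
    counts = {}
    for r in records:
        if r.get("case_type") == "criminal":
            counts[r["judge"]] = counts.get(r["judge"], 0) + 1
    return counts

records = [
    {"judge": "Smith", "case_type": "criminal"},
    {"judge": "Smith", "case_type": "administrative"},
    {"judge": "Jones", "case_type": "criminal"},
]
print(run_query(records))  # {'Smith': 1, 'Jones': 1}
```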

🚀 Usage Instructions

Quick Start Examples

  1. 📥 Download Datasets: Use the provided download links to obtain the datasets you need
  2. 📂 Extract Files: Unzip the downloaded files to your local directory
  3. 💻 Load Data into System: Load the JSON data into your analysis system
  4. 🔍 Execute Queries: Run the benchmark queries from the Queries/ directory
  5. 📊 Compare Results: Compare your results with the ground truth CSV files
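
The steps above can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the file paths, attribute names, and the exact-match scoring below are illustrative, not the benchmark's official evaluation (see evaluation/evaluate.py for that).

```python
# Quick-start sketch: load a downloaded dataset, then compare extracted
# attributes against the ground truth CSV. Paths and attribute names are
# assumptions for illustration.
import csv
import json

def load_documents(path):
    """Each dataset is JSON; each object is one unstructured document."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def load_ground_truth(path):
    """Ground truth is a relational table stored as CSV."""
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f))

def accuracy(predicted, truth, key):
    """Fraction of rows whose extracted attribute exactly matches the truth."""
    if not truth:
        return 0.0
    hits = sum(1 for p, t in zip(predicted, truth) if p.get(key) == t.get(key))
    return hits / len(truth)

# In-memory demo with a hypothetical "judge" attribute:
pred = [{"judge": "Smith"}, {"judge": "Jones"}]
truth = [{"judge": "Smith"}, {"judge": "Brown"}]
print(accuracy(pred, truth, "judge"))  # 0.5
```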

🧪 Systems for Evaluation

Our benchmark evaluates 7 existing unstructured data analysis systems:

| System | Open Source | Repository | Modifications |
|---|---|---|---|
| 📋 Evaporate | ✅ | GitHub | Adaptation |
| 🐍 Palimpzest (PZ) | ✅ | GitHub | Adaptation |
| 🌸 LOTUS | ✅ | GitHub | Adaptation |
| 🤖 DocETL | ✅ | GitHub | Direct Usage |
| ❓ QUEST | ✅ | GitHub | Adaptation |
| 🎯 ZenDB | ❌ | Paper | Implementation |
| 🔍 UQE | ❌ | Paper | Implementation |

System Descriptions:

Evaporate: A system that extracts structured tables from documents and then executes SQL queries on the resulting tables.

Palimpzest (PZ): Provides Python API-based operators for unstructured data processing. We convert each SQL query into the corresponding PZ code, execute it, and obtain the results.

LOTUS: Provides an open-source Python library for AI-based data processing with indexing, extraction, filtering, and joining capabilities. We use its interface to execute queries.

DocETL: An agentic query rewriting and evaluation system for complex document processing. We directly use the DocETL library to execute queries without any modifications.

QUEST: A query engine for unstructured databases that accepts a subset of standard SQL syntax. We directly use their code to execute queries.

ZenDB: A system that constructs semantic hierarchical trees to identify relevant document sections. We implement their SHT chunking and filter reordering strategies.

UQE: A query engine for unstructured databases that supports SQL-like query syntax with sampling-based aggregation capabilities. We implement its filter and aggregate operators, as well as logical optimizations.

System Capabilities Comparison

| System | Query Interface | Chunking | Embedding | Multi-modal | Extract | Filter | Join | Aggregate | Logical Opt. | Physical Opt. |
|---|---|---|---|---|---|---|---|---|---|---|
| Evaporate | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Palimpzest | Code | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| LOTUS | Code | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| DocETL | Code | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| ZenDB | SQL-like | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
| QUEST | SQL-like | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
| UQE | SQL-like | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ |

Table 1: Overview of existing unstructured data analysis systems and their capabilities.

🤝 Contributing

We welcome issue reports, feature requests, and code contributions. Please follow the project's coding standards and testing requirements.

📧 Contact

For questions or suggestions, please contact us through:

  • Submit GitHub Issues
  • Send email to: [Email to be added]

Last updated: 2025
