Speed up logical overwrites by pruning manifests in Snapshot._manifests #3039

@gabeiglio

Feature Request / Improvement

We’ve observed a large performance gap between the Python and Java implementations for logical (metadata-only) overwrites. Profiling shows that most of the time is spent in snapshot.py (_manifests), where manifests are not pruned when computing _existing_manifests and _deleted_entries, so manifests that cannot be affected by the overwrite are still processed.
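To illustrate the idea, here is a minimal, self-contained sketch of partition-based manifest pruning. It is not the PyIceberg implementation: `ManifestSummary`, `may_contain`, and `split_manifests` are hypothetical names, and the single integer partition range per manifest is a simplifying assumption standing in for the partition bounds recorded for each manifest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestSummary:
    """Hypothetical stand-in for a manifest and its partition bounds."""
    path: str
    lower: int  # smallest partition value among the manifest's entries
    upper: int  # largest partition value among the manifest's entries


def may_contain(manifest: ManifestSummary, partition_value: int) -> bool:
    """Return True if the manifest could hold files for the given partition."""
    return manifest.lower <= partition_value <= manifest.upper


def split_manifests(manifests, partition_value):
    """Separate manifests that must be opened from those that can be kept unread."""
    to_scan, untouched = [], []
    for m in manifests:
        (to_scan if may_contain(m, partition_value) else untouched).append(m)
    return to_scan, untouched


manifests = [
    ManifestSummary("m0.avro", 0, 9),
    ManifestSummary("m1.avro", 10, 19),
    ManifestSummary("m2.avro", 20, 29),
]

# Overwriting partition 12 only requires reading the entries of m1.avro;
# the other two manifests can be carried over without being opened.
to_scan, untouched = split_manifests(manifests, partition_value=12)
print([m.path for m in to_scan])
print([m.path for m in untouched])
```

In this sketch the cost of an overwrite scales with the number of manifests whose bounds overlap the overwritten partition rather than with the total number of manifests in the snapshot.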

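After adding manifest pruning, we see the following benchmark results over 100 overwrite iterations, with average time dropping by roughly 2.3x for the same-partition case and 2.5x for random partitions: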

| Scenario | Avg (s) | Min (s) | Max (s) |
| --- | --- | --- | --- |
| Current PyIceberg – same partition | 1.15 | 0.78 | 1.51 |
| Current PyIceberg – random partitions | 0.96 | 0.77 | 1.26 |
| Pruning PyIceberg – same partition | 0.50 | 0.28 | 0.78 |
| Pruning PyIceberg – random partitions | 0.38 | 0.27 | 0.49 |

Benchmark script: https://gist.github.com/gabeiglio/0092970c144228ef6d333a873dc1d316

Here is the PR for the optimization
