Improve ZIP store performance and filesystem listing #70

Open
kulvait wants to merge 2 commits into zarr-developers:main from kulvait:zip-clean
@kulvait kulvait commented Apr 1, 2026

The motivation for this work was that, on large ZIP stores, indexing was very slow, and so was opening arrays afterwards.
So I replaced Apache Commons ZipArchiveInputStream with the random-access Java class ZipFile.
The change in FilesystemStore.java is more cosmetic, but it also slightly improves performance.

  • Replace stream-based ZIP access with random-access ZipFile for faster and more efficient reads
  • Optimize ZIP store initialization by reading the ZIP index only, avoiding full stream traversal
  • Introduce improved caching of directory structure and file sizes using synchronized maps
  • Simplify internal logic and remove unnecessary dependencies
  • Improve filesystem listing performance
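
To illustrate the index-only initialization, here is a minimal sketch, not the code in this PR. It assumes the random-access class is `java.util.zip.ZipFile`, which reads the ZIP central directory when the file is opened, so entry names and sizes are available without decompressing any data. The class name `ZipIndexSketch` and the fields `fileSizes` and `directories` are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Collections;
import java.util.Enumeration;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration; not the PR's ReadOnlyZipStore.
class ZipIndexSketch {
    // Entry name -> uncompressed size, for file entries only.
    final Map<String, Long> fileSizes = Collections.synchronizedMap(new HashMap<>());
    // Directory paths without a trailing slash; "" stands for the root.
    final Set<String> directories = Collections.synchronizedSet(new HashSet<>());

    ZipIndexSketch(Path zipPath) throws IOException {
        // Opening a ZipFile reads the central directory; entry data is not traversed.
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String path = entry.getName();
                if (entry.isDirectory()) {
                    // Directory entry names end with '/'; drop it for lookups.
                    path = path.substring(0, path.length() - 1);
                    directories.add(path);
                } else {
                    fileSizes.put(path, entry.getSize());
                }
                // Register every parent directory, since many tools
                // do not write explicit directory entries.
                for (int i = path.lastIndexOf('/'); i >= 0; i = path.lastIndexOf('/', i - 1)) {
                    directories.add(path.substring(0, i));
                }
                directories.add("");
            }
        }
    }
}
```

The synchronized wrappers only make individual map and set operations thread-safe; compound operations would still need external locking.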

Details:

  • Removed dependency on Apache Commons ZipArchiveInputStream and related classes in ReadOnlyZipStore
  • Added efficient map-based caching of directories and file sizes (async-friendly)
  • Normalized entry names for consistent lookup
  • Reduced redundant computations (e.g. cached entry sizes)
  • Simplified stream handling and chunk calculation

These changes significantly improve performance, reduce complexity, and make the ZIP store implementation more maintainable.
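
For comparison with stream-based access, a random-access read of a single entry might look like the sketch below. It assumes `java.util.zip.ZipFile` and Java 9+ (for `InputStream.readAllBytes()`); the class `ZipReadSketch` and the method `readEntry` are hypothetical names, and a real store would presumably keep the `ZipFile` open rather than reopening it per read.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration; not the PR's implementation.
class ZipReadSketch {
    // Returns the entry's bytes, or null if no such file entry exists.
    static byte[] readEntry(Path zipPath, String entryName) throws IOException {
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            // getEntry() looks the name up in the central directory,
            // so earlier entries are never decompressed or skipped over.
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null || entry.isDirectory()) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return in.readAllBytes();
            }
        }
    }
}
```

With a stream such as ZipArchiveInputStream, reaching the same entry requires iterating over every entry that precedes it in the archive, which is what makes lookups slow on large stores.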

kulvait added 2 commits April 1, 2026 18:34
… and directory existence

- Problem:
  * Some tests failed because ReadOnlyZipStore could not locate ZIP entries when the store
    was created by simply zipping a directory. Tools differ: some produce entry names with
    a leading slash, others without. This caused getInputStream(), read(), and getSize() to return null.
  * testExists() failed for root keys because exists() included directories, but by design it should only test file existence.

- Solution:
  * Added resolvePathWithLeadingSlashFromKeys() to try a secondary lookup with a leading slash.
    Primary lookup uses the standard key without leading slash; secondary is only for compatibility.
  * Modified exists(String[] keys) to check only fileSizeIndex, ignoring directories, to match the intended design.

- Effect:
  * ReadOnlyZipStoreTest passes regardless of whether ZIP entries have leading slashes or not.
  * Test logic now clearly distinguishes between file existence and directory entries.
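
The leading-slash fallback and the file-only existence check described under Solution could be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `LookupSketch` and `resolveEntryName` are hypothetical names (loosely modeled on resolvePathWithLeadingSlashFromKeys()), keys are assumed to be joined with `/`, and whether exists() also uses the fallback is an assumption. Only `fileSizeIndex` is a name taken from the commit message.

```java
import java.util.Map;

// Hypothetical helper for illustration; not the PR's implementation.
class LookupSketch {
    // Entry name -> size, containing file entries only (no directories).
    private final Map<String, Long> fileSizeIndex;

    LookupSketch(Map<String, Long> fileSizeIndex) {
        this.fileSizeIndex = fileSizeIndex;
    }

    // Returns the entry name as stored in the archive, or null if absent.
    String resolveEntryName(String[] keys) {
        // Primary lookup: the standard key without a leading slash.
        String primary = String.join("/", keys);
        if (fileSizeIndex.containsKey(primary)) {
            return primary;
        }
        // Secondary lookup: compatibility with tools that write "/a/b".
        String secondary = "/" + primary;
        return fileSizeIndex.containsKey(secondary) ? secondary : null;
    }

    // True only for files; directories are never in fileSizeIndex.
    boolean exists(String[] keys) {
        return resolveEntryName(keys) != null;
    }
}
```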