
Add zip bomb protection to ZipConverter #1628

Open
mvanhorn wants to merge 2 commits into microsoft:main from mvanhorn:fix/zip-bomb-dos-protection

Conversation

@mvanhorn

Problem

ZipConverter reads each file in a ZIP archive by calling zipObj.read(name) without checking the decompressed size first. A malicious zip bomb (a small ZIP that decompresses to gigabytes) can crash the process with an out-of-memory error.

Reported in #1514.

Fix

Added three pre-read safety checks in ZipConverter.convert():

  1. Per-file size limit (100 MB) - checks ZipInfo.file_size before reading
  2. Decompression ratio limit (100:1) - compares file_size / compress_size to detect suspiciously high compression ratios typical of zip bombs
  3. Cumulative size limit (500 MB) - tracks total decompressed bytes across all files in the archive

Files exceeding any limit are skipped with a logging.warning() message rather than raising an exception, so the rest of the archive is still processed (graceful degradation).
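The three checks could be sketched roughly like this (a hypothetical sketch, not the PR's actual code: the constant names follow the description above, but the `DocumentConverter` plumbing and markdown assembly are omitted):

```python
import logging
import zipfile

MAX_DECOMPRESSED_FILE_SIZE = 100 * 1024 * 1024   # 100 MB per file
MAX_DECOMPRESSION_RATIO = 100                     # 100:1
MAX_TOTAL_DECOMPRESSED_SIZE = 500 * 1024 * 1024  # 500 MB per archive

def safe_entries(zip_path):
    """Yield (name, data) only for entries that pass all three checks."""
    total = 0
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # 1. Per-file size limit, checked before any bytes are read
            if info.file_size > MAX_DECOMPRESSED_FILE_SIZE:
                logging.warning("Skipping %s: exceeds per-file size limit", info.filename)
                continue
            # 2. Decompression ratio limit (guard against compress_size == 0)
            if info.compress_size > 0 and info.file_size / info.compress_size > MAX_DECOMPRESSION_RATIO:
                logging.warning("Skipping %s: suspicious compression ratio", info.filename)
                continue
            # 3. Cumulative limit across the whole archive
            if total + info.file_size > MAX_TOTAL_DECOMPRESSED_SIZE:
                logging.warning("Stopping extraction: cumulative size limit reached")
                break
            total += info.file_size
            yield info.filename, zf.read(info.filename)
```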

Testing

Added test_zip_security.py with four tests:

  • Normal ZIP converts successfully (no regression)
  • Oversized individual file is skipped
  • High decompression ratio is detected and skipped
  • Cumulative size limit stops extraction
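For reference, the ratio test can be built entirely in memory; this is a hedged sketch of the idea, not the actual contents of test_zip_security.py (`make_high_ratio_zip` and `header_ratio` are illustrative names):

```python
import io
import zipfile

def make_high_ratio_zip() -> bytes:
    """Build a ZIP whose single entry has a very high compression ratio."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # 10 MB of zeros deflates to a few KB: ratio far above 100:1
        zf.writestr("bomb.txt", b"\x00" * (10 * 1024 * 1024))
    return buf.getvalue()

def header_ratio(data: bytes) -> float:
    """Header-level ratio, as the PR's check computes it."""
    info = zipfile.ZipFile(io.BytesIO(data)).infolist()[0]
    return info.file_size / info.compress_size
```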

All limits are defined as module-level constants (MAX_DECOMPRESSED_FILE_SIZE, MAX_DECOMPRESSION_RATIO, MAX_TOTAL_DECOMPRESSED_SIZE) for easy adjustment.

This contribution was developed with AI assistance (Claude Code).

Check ZipInfo.file_size and decompression ratio before reading each
entry. Skip files that exceed the per-file size limit (100 MB),
decompression ratio limit (100:1), or cumulative total (500 MB).
Exceeding limits logs a warning instead of crashing.

Fixes microsoft#1514

@VANDRANKI VANDRANKI left a comment


The three-check approach (per-file size, ratio, cumulative total) covers the main zip bomb attack surface and the test suite is thorough. A few things worth addressing.

Blocker: limits are not configurable

MAX_DECOMPRESSED_FILE_SIZE, MAX_DECOMPRESSION_RATIO, and MAX_TOTAL_DECOMPRESSED_SIZE are module-level constants with no way to override them. This will be a problem for legitimate use cases: scientific datasets, legal document archives, and financial filings commonly contain files well over 100 MB. A user converting a valid large ZIP has no way to raise the limit short of monkey-patching the module.

These should be constructor parameters on ZipConverter with the current values as defaults:

class ZipConverter(DocumentConverter):
    def __init__(
        self,
        max_file_size: int = MAX_DECOMPRESSED_FILE_SIZE,
        max_ratio: int = MAX_DECOMPRESSION_RATIO,
        max_total_size: int = MAX_TOTAL_DECOMPRESSED_SIZE,
    ):
        self.max_file_size = max_file_size
        self.max_ratio = max_ratio
        self.max_total_size = max_total_size

This keeps the safe defaults for everyone while giving power users an escape hatch.

Question: total_decompressed excludes skipped files

The cumulative counter only increments for files that pass the per-file size check. Files that are skipped due to size do not contribute to total_decompressed. This means an archive with 1000 files each at 99 MB (just under the per-file limit, all individually passing) would correctly accumulate toward the total cap. But an archive with 1000 files each at 101 MB (all individually skipped) would leave total_decompressed at zero, and the total check would never trigger. The per-file skip already handles this correctly by logging and continuing, so this is not a critical gap, but it is worth confirming the intended behavior in a comment.
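The asymmetry is easy to see in a toy model of the accounting (hypothetical `total_counted`, with units instead of bytes; this only mirrors the counting logic described above, not the converter itself):

```python
MAX_FILE = 100   # per-file cap, in arbitrary units
MAX_TOTAL = 500  # cumulative cap

def total_counted(sizes):
    """Replicate the PR's accounting: only files that pass the per-file
    check advance the cumulative total."""
    total = 0
    for s in sizes:
        if s > MAX_FILE:
            continue  # skipped files do not count toward the total
        if total + s > MAX_TOTAL:
            break     # cumulative cap stops extraction
        total += s
    return total

# 1000 files of 99 units: the cumulative cap triggers after 5 files.
# 1000 files of 101 units: every file is skipped and total stays 0,
# so the cumulative check never fires.
```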

Note: ZipInfo.file_size can be spoofed

All three checks use info.file_size from the local file header. In a crafted ZIP, this value can be set to anything (the test's _make_zip_with_spoofed_size helper demonstrates exactly this). A genuine zip bomb can report file_size=100 in the header while the decompressed content is gigabytes. The current implementation defends against careless or accidental large files but not against a deliberately crafted bomb with falsified headers.

True protection requires streaming the decompressed bytes with a running counter rather than trusting the header. This is significantly more complex and may be out of scope for this PR, but it is worth documenting explicitly as a known limitation so users understand what the protection covers.

Suggestion: output should note truncation

When the cumulative limit triggers a break, the function returns partial markdown with no indication in the output that extraction was stopped early. The caller gets an incomplete result with no way to detect it programmatically (the log warning goes to a logger, not the output). Adding a note in the markdown output (e.g., `Note: extraction stopped at cumulative size limit. Some files were not converted.`) would make the truncation visible to whoever reads the output.


The structure is solid and the test coverage for the happy paths and header-level protection is good. Making the limits configurable is the most important change before merge.

- Make zip bomb limits configurable via ZipConverter constructor
  (max_file_size, max_ratio, max_total_size) with safe defaults
- Add truncation note in markdown output when cumulative limit
  stops extraction early
- Document ZipInfo.file_size spoofing as a known limitation
- Clarify that skipped files do not count toward cumulative total

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mvanhorn
Author

mvanhorn commented Apr 5, 2026

Addressed in 59c5907:

  • Limits are now constructor parameters (max_file_size, max_ratio, max_total_size) with the current values as defaults, so users can override them for large legitimate archives.
  • Cumulative limit now appends a truncation note in the markdown output so callers can detect incomplete extraction without checking logs.
  • Documented the file_size header spoofing limitation in the class docstring - streaming decompression with a running counter would be the next step for adversarial protection.
  • Added an inline comment clarifying that skipped files intentionally do not count toward total_decompressed.

