
Add zip bomb protection to ZipConverter #1628

Open
mvanhorn wants to merge 2 commits into microsoft:main from mvanhorn:fix/zip-bomb-dos-protection

Conversation

@mvanhorn

Problem

ZipConverter reads each file in a ZIP archive by calling zipObj.read(name) without checking the decompressed size first. A malicious zip bomb (a small ZIP that decompresses to gigabytes) can crash the process with an out-of-memory error.

Reported in #1514.

Fix

Added three pre-read safety checks in ZipConverter.convert():

  1. Per-file size limit (100 MB) - checks ZipInfo.file_size before reading
  2. Decompression ratio limit (100:1) - compares file_size / compress_size to detect suspiciously high compression ratios typical of zip bombs
  3. Cumulative size limit (500 MB) - tracks total decompressed bytes across all files in the archive

Files exceeding any limit are skipped with a logging.warning() message rather than raising an exception, so the rest of the archive is still processed (graceful degradation).
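The three checks could be sketched roughly like this (a hypothetical sketch, not the PR's actual code: the constant names follow the description above, but the `DocumentConverter` plumbing and markdown assembly are omitted):

```python
import logging
import zipfile

MAX_DECOMPRESSED_FILE_SIZE = 100 * 1024 * 1024   # 100 MB per file
MAX_DECOMPRESSION_RATIO = 100                     # 100:1
MAX_TOTAL_DECOMPRESSED_SIZE = 500 * 1024 * 1024  # 500 MB per archive

def safe_entries(zip_path):
    """Yield (name, data) only for entries that pass all three checks."""
    total = 0
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # 1. Per-file size limit, checked before any bytes are read
            if info.file_size > MAX_DECOMPRESSED_FILE_SIZE:
                logging.warning("Skipping %s: exceeds per-file size limit", info.filename)
                continue
            # 2. Decompression ratio limit (guard against compress_size == 0)
            if info.compress_size > 0 and info.file_size / info.compress_size > MAX_DECOMPRESSION_RATIO:
                logging.warning("Skipping %s: suspicious compression ratio", info.filename)
                continue
            # 3. Cumulative limit across the whole archive
            if total + info.file_size > MAX_TOTAL_DECOMPRESSED_SIZE:
                logging.warning("Stopping extraction: cumulative size limit reached")
                break
            total += info.file_size
            yield info.filename, zf.read(info.filename)
```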

Testing

Added test_zip_security.py with four tests:

  • Normal ZIP converts successfully (no regression)
  • Oversized individual file is skipped
  • High decompression ratio is detected and skipped
  • Cumulative size limit stops extraction
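For reference, the ratio test can be built entirely in memory; this is a hedged sketch of the idea, not the actual contents of test_zip_security.py (`make_high_ratio_zip` and `header_ratio` are illustrative names):

```python
import io
import zipfile

def make_high_ratio_zip() -> bytes:
    """Build a ZIP whose single entry has a very high compression ratio."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # 10 MB of zeros deflates to a few KB: ratio far above 100:1
        zf.writestr("bomb.txt", b"\x00" * (10 * 1024 * 1024))
    return buf.getvalue()

def header_ratio(data: bytes) -> float:
    """Header-level ratio, as the PR's check computes it."""
    info = zipfile.ZipFile(io.BytesIO(data)).infolist()[0]
    return info.file_size / info.compress_size
```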

All limits are defined as module-level constants (MAX_DECOMPRESSED_FILE_SIZE, MAX_DECOMPRESSION_RATIO, MAX_TOTAL_DECOMPRESSED_SIZE) for easy adjustment.

This contribution was developed with AI assistance (Claude Code).

Check ZipInfo.file_size and decompression ratio before reading each
entry. Skip files that exceed the per-file size limit (100 MB),
decompression ratio limit (100:1), or cumulative total (500 MB).
Exceeding limits logs a warning instead of crashing.

Fixes microsoft#1514

@VANDRANKI VANDRANKI left a comment


The three-check approach (per-file size, ratio, cumulative total) covers the main zip bomb attack surface and the test suite is thorough. A few things worth addressing.

Blocker: limits are not configurable

MAX_DECOMPRESSED_FILE_SIZE, MAX_DECOMPRESSION_RATIO, and MAX_TOTAL_DECOMPRESSED_SIZE are module-level constants with no way to override them. This will be a problem for legitimate use cases: scientific datasets, legal document archives, and financial filings commonly contain files well over 100 MB. A user converting a valid large ZIP has no way to raise the limit short of monkey-patching the module.

These should be constructor parameters on ZipConverter with the current values as defaults:

class ZipConverter(DocumentConverter):
    def __init__(
        self,
        max_file_size: int = MAX_DECOMPRESSED_FILE_SIZE,
        max_ratio: int = MAX_DECOMPRESSION_RATIO,
        max_total_size: int = MAX_TOTAL_DECOMPRESSED_SIZE,
    ):
        self.max_file_size = max_file_size
        self.max_ratio = max_ratio
        self.max_total_size = max_total_size

This keeps the safe defaults for everyone while giving power users an escape hatch.

Question: total_decompressed excludes skipped files

The cumulative counter only increments for files that pass the per-file size check. Files that are skipped due to size do not contribute to total_decompressed. This means an archive with 1000 files each at 99 MB (just under the per-file limit, all individually passing) would correctly accumulate toward the total cap. But an archive with 1000 files each at 101 MB (all individually skipped) would leave total_decompressed at zero, and the total check would never trigger. The per-file skip already handles this correctly by logging and continuing, so this is not a critical gap, but it is worth confirming the intended behavior in a comment.
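The asymmetry is easy to see in a toy model of the accounting (hypothetical `total_counted`, with units instead of bytes; this only mirrors the counting logic described above, not the converter itself):

```python
MAX_FILE = 100   # per-file cap, in arbitrary units
MAX_TOTAL = 500  # cumulative cap

def total_counted(sizes):
    """Replicate the PR's accounting: only files that pass the per-file
    check advance the cumulative total."""
    total = 0
    for s in sizes:
        if s > MAX_FILE:
            continue  # skipped files do not count toward the total
        if total + s > MAX_TOTAL:
            break     # cumulative cap stops extraction
        total += s
    return total

# 1000 files of 99 units: the cumulative cap triggers after 5 files.
# 1000 files of 101 units: every file is skipped and total stays 0,
# so the cumulative check never fires.
```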

Note: ZipInfo.file_size can be spoofed

All three checks use info.file_size from the local file header. In a crafted ZIP, this value can be set to anything (the test's _make_zip_with_spoofed_size helper demonstrates exactly this). A genuine zip bomb can report file_size=100 in the header while the decompressed content is gigabytes. The current implementation defends against careless or accidental large files but not against a deliberately crafted bomb with falsified headers.

True protection requires streaming the decompressed bytes with a running counter rather than trusting the header. This is significantly more complex and may be out of scope for this PR, but it is worth documenting explicitly as a known limitation so users understand what the protection covers.

Suggestion: output should note truncation

When the cumulative limit triggers a break, the function returns partial markdown with no indication in the output that extraction was stopped early. The caller gets an incomplete result with no way to detect it programmatically (the log warning goes to a logger, not the output). Adding a note in the markdown output (e.g., `Note: extraction stopped at cumulative size limit. Some files were not converted.`) would make the truncation visible to whoever reads the output.


The structure is solid and the test coverage for the happy paths and header-level protection is good. Making the limits configurable is the most important change before merge.

- Make zip bomb limits configurable via ZipConverter constructor
  (max_file_size, max_ratio, max_total_size) with safe defaults
- Add truncation note in markdown output when cumulative limit
  stops extraction early
- Document ZipInfo.file_size spoofing as a known limitation
- Clarify that skipped files do not count toward cumulative total

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mvanhorn
Author

mvanhorn commented Apr 5, 2026

Addressed in 59c5907:

  • Limits are now constructor parameters (max_file_size, max_ratio, max_total_size) with the current values as defaults, so users can override them for large legitimate archives.
  • Cumulative limit now appends a truncation note in the markdown output so callers can detect incomplete extraction without checking logs.
  • Documented the file_size header spoofing limitation in the class docstring - streaming decompression with a running counter would be the next step for adversarial protection.
  • Added an inline comment clarifying that skipped files intentionally do not count toward total_decompressed.

