Add zip bomb protection to ZipConverter #1628
mvanhorn wants to merge 2 commits into microsoft:main
Conversation
Check ZipInfo.file_size and decompression ratio before reading each entry. Skip files that exceed the per-file size limit (100 MB), decompression ratio limit (100:1), or cumulative total (500 MB). Exceeding limits logs a warning instead of crashing. Fixes microsoft#1514
VANDRANKI left a comment
The three-check approach (per-file size, ratio, cumulative total) covers the main zip bomb attack surface and the test suite is thorough. A few things worth addressing.
Blocker: limits are not configurable
MAX_DECOMPRESSED_FILE_SIZE, MAX_DECOMPRESSION_RATIO, and MAX_TOTAL_DECOMPRESSED_SIZE are module-level constants with no way to override them. This will be a problem for legitimate use cases: scientific datasets, legal document archives, and financial filings commonly contain files well over 100 MB. A user converting a valid large ZIP has no way to raise the limit short of monkey-patching the module.
These should be constructor parameters on ZipConverter with the current values as defaults:
```python
class ZipConverter(DocumentConverter):
    def __init__(
        self,
        max_file_size: int = MAX_DECOMPRESSED_FILE_SIZE,
        max_ratio: int = MAX_DECOMPRESSION_RATIO,
        max_total_size: int = MAX_TOTAL_DECOMPRESSED_SIZE,
    ):
        ...
```

This keeps the safe defaults for everyone while giving power users an escape hatch.
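For completeness, a self-contained usage sketch of the proposed signature (the base class and body here are stubs standing in for the real converter, just to show the override):

```python
class DocumentConverter:  # stub standing in for markitdown's base class
    pass

# Current module-level defaults from the PR
MAX_DECOMPRESSED_FILE_SIZE = 100 * 1024 * 1024   # 100 MB per file
MAX_DECOMPRESSION_RATIO = 100                    # 100:1
MAX_TOTAL_DECOMPRESSED_SIZE = 500 * 1024 * 1024  # 500 MB cumulative

class ZipConverter(DocumentConverter):
    def __init__(
        self,
        max_file_size: int = MAX_DECOMPRESSED_FILE_SIZE,
        max_ratio: int = MAX_DECOMPRESSION_RATIO,
        max_total_size: int = MAX_TOTAL_DECOMPRESSED_SIZE,
    ):
        self.max_file_size = max_file_size
        self.max_ratio = max_ratio
        self.max_total_size = max_total_size

# A user with a legitimately large archive can raise one cap
# without touching the others:
conv = ZipConverter(max_file_size=1024 * 1024 * 1024)  # 1 GB
```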
Question: total_decompressed excludes skipped files
The cumulative counter only increments for files that pass the per-file size check. Files that are skipped due to size do not contribute to total_decompressed. This means an archive with 1000 files each at 99 MB (just under the per-file limit, all individually passing) would correctly accumulate toward the total cap. But an archive with 1000 files each at 101 MB (all individually skipped) would leave total_decompressed at zero, and the total check would never trigger. The per-file skip already handles this correctly by logging and continuing, so this is not a critical gap, but it is worth confirming the intended behavior in a comment.
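If the intent is for oversized-but-skipped files to still count toward the cap, one way to close the gap is to add the claimed size to the running total before deciding whether to skip. A minimal sketch (illustrative names, not the PR's code):

```python
MAX_FILE = 100 * 1024 * 1024    # per-file limit (100 MB)
MAX_TOTAL = 500 * 1024 * 1024   # cumulative limit (500 MB)

def plan_extraction(entries):
    """entries: iterable of (name, file_size) pairs from ZipInfo headers.

    Returns the names that are safe to read. The claimed size of every
    entry counts toward the total, even entries skipped for being too
    large, so an archive of all-oversized files still trips the cap.
    """
    total = 0
    to_read = []
    for name, file_size in entries:
        total += file_size          # count even files we will skip
        if total > MAX_TOTAL:
            break                   # cumulative cap triggers regardless
        if file_size > MAX_FILE:
            continue                # skip, but its size stays in the total
        to_read.append(name)
    return to_read
```

With this ordering, the 1000-files-at-101-MB archive stops after a handful of entries instead of leaving `total_decompressed` at zero.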
Note: ZipInfo.file_size can be spoofed
All three checks use info.file_size from the local file header. In a crafted ZIP, this value can be set to anything (the test's _make_zip_with_spoofed_size helper demonstrates exactly this). A genuine zip bomb can report file_size=100 in the header while the decompressed content is gigabytes. The current implementation defends against careless or accidental large files but not against a deliberately crafted bomb with falsified headers.
True protection requires streaming the decompressed bytes with a running counter rather than trusting the header. This is significantly more complex and may be out of scope for this PR, but it is worth documenting explicitly as a known limitation so users understand what the protection covers.
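As a rough illustration of that streaming approach, a sketch using only the stdlib `zipfile` API (names and limits are illustrative, not the PR's code):

```python
import io
import zipfile

CHUNK = 64 * 1024
MAX_FILE = 100 * 1024 * 1024  # 100 MB

def read_bounded(zf: zipfile.ZipFile, name: str) -> bytes:
    """Read one entry while counting the actual decompressed bytes,
    so a spoofed file_size header cannot bypass the limit."""
    out = io.BytesIO()
    with zf.open(name) as fh:  # decompresses incrementally as we read
        while True:
            chunk = fh.read(CHUNK)
            if not chunk:
                break
            if out.tell() + len(chunk) > MAX_FILE:
                raise ValueError(f"{name} exceeds decompressed size limit")
            out.write(chunk)
    return out.getvalue()
```

The header value can still be used as a cheap first-pass filter; the running counter is what makes the limit trustworthy.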
Suggestion: output should note truncation
When the cumulative limit triggers a break, the function returns partial markdown with no indication in the output that extraction was stopped early. The caller gets an incomplete result with no way to detect it programmatically (the log warning goes to a logger, not the output). Adding a note in the markdown output (e.g., `Note: extraction stopped at cumulative size limit. Some files were not converted.`) would make the truncation visible to whoever reads the output.
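A minimal sketch of such a marker (hypothetical helper, not the PR's code):

```python
TRUNCATION_NOTE = (
    "\n\nNote: extraction stopped at cumulative size limit. "
    "Some files were not converted.\n"
)

def finalize_markdown(md_parts, truncated: bool) -> str:
    """Join per-file markdown sections, appending a visible note
    when extraction stopped early at the cumulative limit."""
    text = "\n".join(md_parts)
    if truncated:
        text += TRUNCATION_NOTE  # readable in output, greppable by callers
    return text
```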
The structure is solid and the test coverage for the happy paths and header-level protection is good. Making the limits configurable is the most important change before merge.
- Make zip bomb limits configurable via ZipConverter constructor (max_file_size, max_ratio, max_total_size) with safe defaults
- Add truncation note in markdown output when cumulative limit stops extraction early
- Document ZipInfo.file_size spoofing as a known limitation
- Clarify that skipped files do not count toward cumulative total

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addressed in 59c5907:
Problem
ZipConverter reads each file in a ZIP archive by calling zipObj.read(name) without checking the decompressed size first. A malicious zip bomb (a small ZIP that decompresses to gigabytes) can crash the process with an out-of-memory error. Reported in #1514.
Fix
Added three pre-read safety checks in ZipConverter.convert():

- ZipInfo.file_size is compared against the per-file limit before reading each entry
- file_size / compress_size is computed to detect suspiciously high compression ratios typical of zip bombs
- a running total of claimed decompressed sizes is compared against the cumulative limit

Files exceeding any limit are skipped with a logging.warning() message rather than raising an exception, so the rest of the archive is still processed (graceful degradation).

Testing
Added test_zip_security.py with four tests.

All limits are defined as module-level constants (MAX_DECOMPRESSED_FILE_SIZE, MAX_DECOMPRESSION_RATIO, MAX_TOTAL_DECOMPRESSED_SIZE) for easy adjustment.

This contribution was developed with AI assistance (Claude Code).