fix: repair docx zip header casing #1909

Open
he-yufeng wants to merge 1 commit into microsoft:main from he-yufeng:fix/docx-zip-case-mismatch

@he-yufeng

Fixes #1812.

Some DOCX producers write ZIP entries whose central-directory name differs from the local file header name only by path casing. Python's zipfile raises BadZipFile when it opens such a member, even though Word and common zip tools tolerate these files.
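The failure can be reproduced with a small synthetic archive. The sketch below is illustrative only: the helper names `build_case_mismatched_zip` and `try_read` are hypothetical and not part of this patch, and the archive is a stand-in for a real DOCX. It relies on the fact that the first occurrence of a member name in a freshly written archive is the one in the local file header.

```python
import io
import zipfile


def build_case_mismatched_zip() -> bytes:
    """Build a tiny archive whose local header name differs by case only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", b"<w:document/>")
    raw = buf.getvalue()
    # The local file header precedes the central directory, so replacing
    # only the first occurrence changes the local name and nothing else.
    return raw.replace(b"word/document.xml", b"Word/Document.xml", 1)


def try_read(data: bytes, member: str) -> str:
    """Return "ok" if the member can be read, else the BadZipFile message."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            zf.read(member)
        return "ok"
    except zipfile.BadZipFile as exc:
        return f"BadZipFile: {exc}"


if __name__ == "__main__":
    print(try_read(build_case_mismatched_zip(), "word/document.xml"))
```

Opening the archive succeeds because only the central directory is parsed at that point; the name comparison happens when the member itself is opened.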

This patch repairs that narrow case in memory before the existing DOCX pre-processing step reads each member. The central-directory name remains authoritative, and the patch only applies when the local and central names have the same byte length and differ only by case.
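A minimal sketch of the repair rule described above, not the code in this patch: the function name `repair_case_mismatched_names` is hypothetical, and the sketch assumes a plain archive with no data prepended before the first local header. It overwrites the local header name with the central-directory name only when both have the same byte length and differ only in ASCII case.

```python
import io
import struct
import zipfile

# Fixed 30-byte part of a ZIP local file header: signature, five 16-bit
# fields, three 32-bit fields, then file-name length and extra-field length.
LOCAL_HEADER = struct.Struct("<4s5H3I2H")
LOCAL_SIGNATURE = b"PK\x03\x04"
UTF8_FLAG = 0x800  # general-purpose bit 11: name is UTF-8 encoded


def repair_case_mismatched_names(data: bytes) -> bytes:
    """Return a copy of `data` with case-only local name mismatches fixed."""
    buf = bytearray(data)
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        # Reading the central directory does not touch local headers.
        entries = [
            (zi.header_offset, zi.orig_filename, zi.flag_bits)
            for zi in zf.infolist()
        ]

    for offset, name, flag_bits in entries:
        fields = LOCAL_HEADER.unpack_from(buf, offset)
        if fields[0] != LOCAL_SIGNATURE:
            continue  # not a local header; leave the bytes alone
        name_len = fields[9]
        start = offset + LOCAL_HEADER.size
        local = bytes(buf[start:start + name_len])
        central = name.encode("utf-8" if flag_bits & UTF8_FLAG else "cp437")
        # Narrow case only: same byte length, equal after ASCII case folding.
        if (
            local != central
            and len(local) == len(central)
            and local.lower() == central.lower()
        ):
            buf[start:start + name_len] = central  # central name wins
    return bytes(buf)
```

Because the replacement has the same byte length, no offsets or sizes elsewhere in the archive need to change. Mismatches that go beyond ASCII casing are left untouched, so zipfile still rejects them.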

Validation:

  • .venv\Scripts\python.exe -m pytest packages\markitdown\tests\test_module_misc.py::test_docx_preprocess_repairs_case_mismatched_zip_names -q
  • .venv\Scripts\python.exe -m pytest packages\markitdown\tests\test_module_misc.py::test_docx_comments packages\markitdown\tests\test_module_misc.py::test_docx_equations packages\markitdown\tests\test_module_misc.py::test_docx_preprocess_repairs_case_mismatched_zip_names -q
  • .venv\Scripts\python.exe -m py_compile packages\markitdown\src\markitdown\converter_utils\docx\pre_process.py packages\markitdown\tests\test_module_misc.py
  • .venv\Scripts\black.exe --check packages\markitdown\src\markitdown\converter_utils\docx\pre_process.py packages\markitdown\tests\test_module_misc.py
  • git diff --check

Development

Successfully merging this pull request may close these issues.

BadZipFile crash on .docx files with case-mismatched zip entry names