diff --git a/peps/pep-0784.rst b/peps/pep-0784.rst index fef83d8fce7..ac8c43e7d3c 100644 --- a/peps/pep-0784.rst +++ b/peps/pep-0784.rst @@ -2,50 +2,55 @@ PEP: 784 Title: Adding Zstandard to the standard library Author: Emma Harper Smith Sponsor: Gregory P. Smith +Discussions-To: https://discuss.python.org/t/87377 Status: Draft Type: Standards Track Created: 06-Apr-2025 Python-Version: 3.14 +Post-History: + `07-Apr-2025 `__, + Abstract ======== -`Zstandard `_ is a widely adopted, mature, -and highly efficient compression standard. This PEP proposes adding a new -module to the Python standard library containing a Python wrapper around Meta's -``zstd`` library, the default implementation. Additionally, to avoid name -collisions with packages on PyPI and to present a unified interface to Python -users, compression modules in the standard library will be moved under a -``compression.*`` namespace package. +`Zstandard`_ is a widely adopted, mature, and highly efficient compression +standard. This PEP proposes adding a new module to the Python standard library +containing a Python wrapper around Meta's |zstd| library, the default +implementation. Additionally, to avoid name collisions with packages on PyPI +and to present a unified interface to Python users, compression modules in the +standard library will be moved under a ``compression.*`` package. + +.. |zstd| replace:: ``zstd`` +.. _zstd: https://facebook.github.io/zstd/ +.. _Zstandard: https://facebook.github.io/zstd/ + Motivation ========== -CPython has modules for several different compression formats, such as `zlib -(DEFLATE) `_, -`bzip2 `_, -and `lzma `_, each widely used. -Including popular compression algorithms matches Python's "batteries included" -philosophy of incorporating widely useful standards and utilities. The last -compression module added to the language was ``lzma``, added in Python 3.3. +CPython has modules for several different compression formats, such as +:mod:`zlib (DEFLATE) `, :mod:`bzip2 `, and :mod:`lzma `, +each widely used. Including popular compression algorithms matches Python's +"batteries included" philosophy of incorporating widely useful standards and +utilities. :mod:`!lzma` is the most recent such module, added in Python 3.3. -Since then, Zstandard has become the modern de facto preferred compression +Since then, Zstandard has become the modern *de facto* preferred compression library for both high performance compression and decompression attaining high compression ratios at reasonable CPU and memory cost. Zstandard achieves a much higher compression ratio than bzip2 or zlib (DEFLATE) while decompressing significantly faster than LZMA. -Zstandard has seen `widespread adoption in many different areas of computing -`_. The numerous hardware -implementations demonstrate long-term commitment to Zstandard and an -expectation that Zstandard will stay the de facto choice for compression for -years to come. Zstandard compression is also implemented in both the ZFS and -Btrfs filesystems. +Zstandard has seen `widespread adoption`_ in many different areas of computing. +The numerous hardware implementations demonstrate long-term commitment to +Zstandard and an expectation that Zstandard will stay the *de facto* choice for +compression for years to come. Zstandard compression is also implemented in +both the ZFS_ and Btrfs_ filesystems. Zstandard's highly efficient compression has supplanted other modern -compression formats, such as brotli, lzo, and ucl due to its highly efficient -compression. While `LZ4 `_ is still used in very high -throughput scenarios, Zstandard can also be used in some of these contexts. +compression formats, such as brotli_, lzo_, and ucl_ due to its highly +efficient compression. While `LZ4`_ is still used in very high throughput +scenarios, Zstandard can also be used in some of these contexts. While inclusion of LZ4 is out of scope, it would be a compelling future addition to the ``compression`` namespace introduced by this PEP. @@ -53,24 +58,22 @@ There are several bindings to Zstandard for Python available on PyPI, each with different APIs and choices of how to bind the ``zstd`` library. One goal with introducing an official module in the standard library is to reduce confusion for Python users who want simple compression/decompression APIs for Zstandard. -The existing packages can continue providing extended APIs and bindings for -other Python implementations such as PyPy or integrate features from newer -Zstandard versions. +The existing packages can continue providing extended APIs or integrate +features from newer Zstandard versions. Another reason to add Zstandard support to the standard library is to resolve -a long standing `open issue `_ -requesting Zstandard support in the ``tarfile`` module. This issue has the 5th -most "thumbs up" of open issues on the CPython tracker, and has garnered a -significant amount of discussion and interest. Additionally, the `ZIP format -standardizes a Zstandard compression format ID -`_, -and integration with ``zipfile`` would allow opening ZIP archives using -Zstandard compression. The reference implementation for this PEP contains -integration with the ``zipfile``, ``tarfile``, and ``shutil`` modules. +a long standing open issue (`python/cpython#81276`_) requesting Zstandard +support in the :mod:`tarfile` module. This issue has the 5th most "thumbs up" +of open issues on the CPython tracker, and has garnered a significant amount of +discussion and interest. Additionally, the ZIP format standardizes a +`Zstandard compression format ID`_, and integration with the :mod:`zipfile` +module would allow opening ZIP archives using Zstandard compression. The +reference implementation for this PEP contains integration with the +:mod:`!zipfile`, :mod:`!tarfile`, and :mod:`shutil` modules. Zstandard compression could also be used to make Python wheel packages smaller and significantly faster to install. Anaconda found a sizeable speedup when -adopting Zstandard for the conda package format +adopting Zstandard for the conda package format: .. epigraph:: @@ -78,21 +81,33 @@ adopting Zstandard for the conda package format [...] We see approximately a 2.5x overall speedup, almost all thanks to the dramatically faster extraction speed of the zstd compression used in the new file format. - -- `Anaconda blog on Zstandard adoption `_ + -- `Anaconda blog on Zstandard adoption`_ -`According to lzbench `_, -a comprehensive benchmark of many different compression libraries and formats, Zstandard has a significantly higher compression ratio compared to wheel's -existing zlib-based compression. While this PEP does *not* prescribe any -changes to the wheel format or other packaging standards, having Zstandard -bindings in the standard library would enable a future PEP to improve the user -experience for Python wheel packages. +existing zlib-based compression, `according to lzbench`_, a comprehensive +benchmark of many different compression libraries and formats. +While this PEP does *not* prescribe any changes to the wheel format or other +packaging standards, having Zstandard bindings in the standard library would +enable a future PEP to improve the user experience for Python wheel packages. + +.. _widespread adoption: https://facebook.github.io/zstd/#references +.. _ZFS: https://en.wikipedia.org/wiki/ZFS +.. _Btrfs: https://btrfs.readthedocs.io/ +.. _brotli: https://brotli.org/ +.. _lzo: https://www.oberhumer.com/opensource/lzo/ +.. _ucl: https://www.oberhumer.com/opensource/ucl/ +.. _LZ4: https://lz4.org/ +.. _python/cpython#81276: https://github.com/python/cpython/issues/81276 +.. _Zstandard compression format ID: https://pkwaredownloads.blob.core.windows.net/pkware-general/Documentation/APPNOTE-6.3.8.TXT +.. _according to lzbench: https://github.com/inikep/lzbench#benchmarks +.. _Anaconda blog on Zstandard adoption: https://www.anaconda.com/blog/how-we-made-conda-faster-4-7 + Rationale ========= -Introduction of a ``compression`` namespace -------------------------------------------- +Introduction of a ``compression`` package +----------------------------------------- Both the ``zstd`` and ``zstandard`` import names are claimed by projects on PyPI. To avoid breaking users of one of the existing bindings, this PEP @@ -130,13 +145,17 @@ name otherwise. Implementation based on ``pyzstd`` ---------------------------------- -The implementation for this PEP is based on the `pyzstd project `_. -This project was chosen as the code was `originally written to be upstreamed `_ -to CPython by Ma Lin, who also wrote the `output buffer implementation used in -the standard library today `_. +The implementation for this PEP is based on the `pyzstd project`_. +This project was chosen as the code was `originally written to be upstreamed`_ +to CPython by Ma Lin, who also wrote the `output buffer implementation`_ used in +the standard library today. The project has since been taken over by Rogdham and is published to PyPI. The APIs in ``pyzstd`` are similar to the APIs for other compression modules in the -standard library such as ``bz2`` and ``lzma``. +standard library such as :mod:`!bz2` and :mod:`!lzma`. + +.. _pyzstd project: https://github.com/Rogdham/pyzstd +.. _originally written to be upstreamed: https://github.com/python/cpython/issues/81276#issuecomment-1093824963 +.. _output buffer implementation: https://github.com/python/cpython/commit/f9bedb630e8a0b7d94e1c7e609b20dfaa2b22231 Minimum supported Zstandard version ----------------------------------- @@ -149,13 +168,14 @@ compatibility with existing LTS Linux distributions, but a newer Zstandard version could likely be chosen given that newer Python releases are generally packaged as part of newer distribution releases. + Specification ============= The ``compression`` namespace ----------------------------- -A new namespace package for compression modules will be introduced named +A new namespace for compression modules will be introduced named ``compression``. The top-level module for this package will be empty to begin with, but a standard API for interacting with compression routines may be added in the future to the toplevel. @@ -167,17 +187,18 @@ A new module, ``compression.zstd`` will be introduced with Zstandard compression APIs that match other compression modules in the standard library, namely -* ``compress`` / ``decompress`` - APIs for one-shot compression/decompression -* ``ZstdFile`` / ``open`` - APIs for interacting with streams and file-like - objects -* ``ZstdCompressor`` / ``ZstdDecompressor`` - APIs for incremental compression/ - decompression +* :func:`!compress` / :func:`!decompress` - APIs for one-shot compression + or decompression +* :class:`!ZstdFile` / :func:`!open` - APIs for interacting with streams + and file-like objects +* :class:`!ZstdCompressor` / :class:`!ZstdDecompressor` - APIs for incremental + compression or decompression -It will also contain some Zstandard-specific functionality +It will also contain some Zstandard-specific functionality: -* ``ZstdDict`` / ``train_dict`` / ``finalize_dict`` - APIs for interacting with - Zstandard dictionaries, which are useful for compressing many small chunks of - similar data +* :class:`!ZstdDict` / :func:`!train_dict` / :func:`!finalize_dict` - APIs for + interacting with Zstandard dictionaries, which are useful for compressing + many small chunks of similar data ``libzstd`` optional dependency ------------------------------- @@ -222,11 +243,12 @@ Backwards Compatibility The main compatibility concern is usage of existing standard library compression APIs with the existing import names. These names will be -deprecated in 3.19 and will be removed in 3.24. Given the long coexistance of +deprecated in 3.19 and will be removed in 3.24. Given the long coexistence of the modules and a 5 year deprecation period, most users will likely migrate to the new import names well before then. Additionally, a libCST codemod can be provided to automatically rewrite imports, easing the migration. + Security Implications ===================== @@ -241,6 +263,7 @@ Taking on a new dependency also always has security risks, but the ``zstd`` library is mature, fuzzed on each commit, and `participates in Meta's bug bounty program `_. + How to Teach This ================= @@ -248,6 +271,7 @@ Documentation for the new module is in the reference implementation branch. The documentation for other modules will be updated to discuss the deprecation of their existing import names, and how to migrate. + Reference Implementation ======================== @@ -258,6 +282,7 @@ integration added. It also contains the re-exports of other compression modules. Deprecations for the existing import names will be added once a decision is reached regarding the open issues. + Rejected Ideas ============== @@ -273,6 +298,7 @@ import name ``lz4``. Instead of solving this issue for each compression format, it is better to solve it once and for all by using the already-claimed ``compression`` namespace. + Copyright =========