diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 0c90dfc9aa7..f041ebaef93 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -661,6 +661,7 @@ peps/pep-0779.rst @Yhg1s @colesbury @mpage peps/pep-0780.rst @lysnikolaou peps/pep-0781.rst @methane peps/pep-0782.rst @vstinner +peps/pep-0784.rst @gpshead # ... peps/pep-0789.rst @njsmith # ... diff --git a/peps/pep-0784.rst b/peps/pep-0784.rst new file mode 100644 index 00000000000..fef83d8fce7 --- /dev/null +++ b/peps/pep-0784.rst @@ -0,0 +1,280 @@ +PEP: 784 +Title: Adding Zstandard to the standard library +Author: Emma Harper Smith +Sponsor: Gregory P. Smith +Status: Draft +Type: Standards Track +Created: 06-Apr-2025 +Python-Version: 3.14 + +Abstract +======== + +`Zstandard `_ is a widely adopted, mature, +and highly efficient compression standard. This PEP proposes adding a new +module to the Python standard library containing a Python wrapper around Meta's +``zstd`` library, the default implementation. Additionally, to avoid name +collisions with packages on PyPI and to present a unified interface to Python +users, compression modules in the standard library will be moved under a +``compression.*`` namespace package. + +Motivation +========== + +CPython has modules for several different compression formats, such as `zlib +(DEFLATE) `_, +`bzip2 `_, +and `lzma `_, each widely used. +Including popular compression algorithms matches Python's "batteries included" +philosophy of incorporating widely useful standards and utilities. The last +compression module added to the language was ``lzma``, added in Python 3.3. + +Since then, Zstandard has become the modern de facto preferred compression +library for both high performance compression and decompression attaining high +compression ratios at reasonable CPU and memory cost. Zstandard achieves a much +higher compression ratio than bzip2 or zlib (DEFLATE) while decompressing +significantly faster than LZMA. 
+ +Zstandard has seen `widespread adoption in many different areas of computing +`_. The numerous hardware +implementations demonstrate long-term commitment to Zstandard and an +expectation that Zstandard will stay the de facto choice for compression for +years to come. Zstandard compression is also implemented in both the ZFS and +Btrfs filesystems. + +Zstandard has supplanted other modern +compression formats, such as brotli, lzo, and ucl, due to its highly efficient +compression. While `LZ4 `_ is still used in very high +throughput scenarios, Zstandard can also be used in some of these contexts. +While inclusion of LZ4 is out of scope, it would be a compelling future +addition to the ``compression`` namespace introduced by this PEP. + +There are several bindings to Zstandard for Python available on PyPI, each with +different APIs and choices of how to bind the ``zstd`` library. One goal with +introducing an official module in the standard library is to reduce confusion +for Python users who want simple compression/decompression APIs for Zstandard. +The existing packages can continue providing extended APIs and bindings for +other Python implementations such as PyPy or integrate features from newer +Zstandard versions. + +Another reason to add Zstandard support to the standard library is to resolve +a long-standing `open issue `_ +requesting Zstandard support in the ``tarfile`` module. This issue has the 5th +most "thumbs up" of open issues on the CPython tracker, and has garnered a +significant amount of discussion and interest. Additionally, the `ZIP format +standardizes a Zstandard compression format ID +`_, +and integration with ``zipfile`` would allow opening ZIP archives using +Zstandard compression. The reference implementation for this PEP contains +integration with the ``zipfile``, ``tarfile``, and ``shutil`` modules.
+ +Zstandard compression could also be used to make Python wheel packages smaller +and significantly faster to install. Anaconda found a sizeable speedup when +adopting Zstandard for the conda package format: + +.. epigraph:: + + Conda's download sizes are reduced ~30-40%, and extraction is dramatically faster. + [...] + We see approximately a 2.5x overall speedup, almost all thanks to the dramatically faster extraction speed of the zstd compression used in the new file format. + + -- `Anaconda blog on Zstandard adoption `_ + +`According to lzbench `_, +a comprehensive benchmark of many different compression libraries and formats, +Zstandard has a significantly higher compression ratio compared to wheel's +existing zlib-based compression. While this PEP does *not* prescribe any +changes to the wheel format or other packaging standards, having Zstandard +bindings in the standard library would enable a future PEP to improve the user +experience for Python wheel packages. + +Rationale +========= + +Introduction of a ``compression`` namespace +------------------------------------------- + +Both the ``zstd`` and ``zstandard`` import names are claimed by projects on +PyPI. To avoid breaking users of one of the existing bindings, this PEP +proposes introducing a new namespace for compression libraries, +``compression``. This name is already reserved on PyPI for use in the +standard library. The new Zstandard module will be ``compression.zstd``. +Other compression modules will be re-exported to the ``compression`` namespace +and their current import names will be deprecated. + +Providing a common namespace for compression modules has several advantages. +First, it reduces user confusion about where to find compression modules. +Second, the top-level ``compression`` module could provide information on which +compression formats are available, similar to ``hashlib``'s +``algorithms_available``.
If :pep:`775` is accepted, a +``compression.algorithms_guaranteed`` could be provided as well, listing +``zlib``. Finally, a ``compression`` namespace prevents future issues with +merging other compression formats into the standard library. New compression +formats will likely be published to PyPI prior to integration into +CPython. Therefore, any new compression format import name will likely already +be claimed by the time a module would be considered for inclusion in CPython. +Putting compression modules under a package prefix prevents issues with +potential future name clashes. + +Code that needs to remain compatible across Python versions may use the +following pattern:: + + try: + from compression.lzma import LZMAFile + except ImportError: + from lzma import LZMAFile + +This will use the newer import name when available and fall back to the old +name otherwise. + +Implementation based on ``pyzstd`` +---------------------------------- + +The implementation for this PEP is based on the `pyzstd project `_. +This project was chosen as the code was `originally written to be upstreamed `_ +to CPython by Ma Lin, who also wrote the `output buffer implementation used in +the standard library today `_. +The project has since been taken over by Rogdham and is published to PyPI. The +APIs in ``pyzstd`` are similar to the APIs for other compression modules in the +standard library such as ``bz2`` and ``lzma``. + +Minimum supported Zstandard version +----------------------------------- + +The minimum supported Zstandard version is v1.4.5, released in May 2020. +This version was chosen based on reviewing the versions of +Zstandard available in a number of Linux distribution package repositories, +including LTS releases.
This version choice is rather conservative to maximize +compatibility with existing LTS Linux distributions, but a newer Zstandard +version could likely be chosen given that newer Python releases are generally +packaged as part of newer distribution releases. + +Specification +============= + +The ``compression`` namespace +----------------------------- + +A new namespace package for compression modules will be introduced named +``compression``. The top-level module for this package will be empty to begin +with, but a standard API for interacting with compression routines may be +added to the top level in the future. + +The ``compression.zstd`` module +------------------------------- + +A new module, ``compression.zstd``, will be introduced with Zstandard +compression APIs that match other compression modules in the standard library, +namely: + +* ``compress`` / ``decompress`` - APIs for one-shot compression/decompression +* ``ZstdFile`` / ``open`` - APIs for interacting with streams and file-like + objects +* ``ZstdCompressor`` / ``ZstdDecompressor`` - APIs for incremental compression/ + decompression + +It will also contain some Zstandard-specific functionality: + +* ``ZstdDict`` / ``train_dict`` / ``finalize_dict`` - APIs for interacting with + Zstandard dictionaries, which are useful for compressing many small chunks of + similar data + +``libzstd`` optional dependency +------------------------------- + +The ``libzstd`` library will become an optional dependency of CPython. If the +library is not available, the ``compression.zstd`` module will be unavailable. +This is handled automatically on Unix platforms as part of the normal build +environment detection. + +On Windows, ``libzstd`` will be added to +`the source dependencies `_ +used to build libraries CPython depends on for Windows.
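Because ``libzstd`` is optional, code that wants Zstandard may need to cope with ``compression.zstd`` being absent. The following is a minimal sketch of one way to do that, not part of the proposal itself: it assumes the one-shot ``compress`` / ``decompress`` functions listed in the specification, and the helper names ``compress_tagged`` and ``decompress_tagged`` are hypothetical. It falls back to ``zlib`` and records which format was used so the payload can be decoded later.

```python
import zlib

# ``compression.zstd`` is the module name proposed by this PEP. It is absent on
# older Pythons and on builds without libzstd, so the import is guarded.
try:
    from compression import zstd
except ImportError:
    zstd = None


def compress_tagged(data: bytes) -> tuple[str, bytes]:
    """Hypothetical helper: use Zstandard when available, otherwise zlib."""
    if zstd is not None:
        return "zstd", zstd.compress(data)
    return "zlib", zlib.compress(data)


def decompress_tagged(fmt: str, payload: bytes) -> bytes:
    """Hypothetical helper: invert compress_tagged using the recorded format."""
    if fmt == "zstd":
        if zstd is None:
            raise RuntimeError("payload needs Zstandard, which this build lacks")
        return zstd.decompress(payload)
    if fmt == "zlib":
        return zlib.decompress(payload)
    raise ValueError(f"unknown format: {fmt!r}")
```

Storing the format name alongside the payload keeps data readable on any build that has the matching module, at the cost of a small amount of bookkeeping.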
+ +Other compression modules +------------------------- + +New import names ``compression.lzma``, ``compression.bz2``, and +``compression.zlib`` will be introduced in Python 3.14 re-exporting the +contents of the existing ``lzma``, ``bz2``, and ``zlib`` modules respectively. + +The ``_compression`` module, given that it is marked private, will be +immediately renamed to ``compression._common.streams``. The new name was +selected due to the current contents of the module being I/O related helpers +for stream APIs (e.g. ``LZMAFile``) in standard library compression modules. + +Compression module migration timeline +------------------------------------- + +Existing modules will emit a ``DeprecationWarning`` starting with the first +Python release after the last version without the ``compression`` module +reaches end of life. For example, if the ``compression`` namespace is introduced in 3.14, +then the ``DeprecationWarnings`` would be emitted in 3.19, the next release +after 3.13 reaches end of life. Following the standard deprecation timeline +specified in :pep:`387`, in Python 3.24 the existing modules will be removed +and code must use the ``compression`` sub-modules. The documentation for these +modules will be updated to discuss the planned deprecation and removal +timelines. + + +Backwards Compatibility +======================= + +The main compatibility concern is usage of existing standard library +compression APIs with the existing import names. These names will be +deprecated in 3.19 and will be removed in 3.24. Given the long coexistence of +the modules and a 5-year deprecation period, most users will likely migrate to +the new import names well before then. Additionally, a libCST codemod can be +provided to automatically rewrite imports, easing the migration. + +Security Implications +===================== + +As with any new C code, especially code operating on potentially untrusted user +input, there are risks of memory safety issues.
The author plans on +contributing integration with libfuzzer to enable fuzzing the ``_zstd`` code +and ensure it is robust. Furthermore, there are a number of tests that exercise +the compression and decompression routines. These tests pass without error when +compiled with AddressSanitizer. + +Taking on a new dependency also always has security risks, but the ``zstd`` +library is mature, fuzzed on each commit, and `participates in Meta's bug bounty +program `_. + +How to Teach This +================= + +Documentation for the new module is in the reference implementation branch. The +documentation for other modules will be updated to discuss the deprecation of +their existing import names, and how to migrate. + +Reference Implementation +======================== + +The `reference implementation `_ +contains the ``_zstd`` C code, the ``compression.zstd`` code, modifications to +``tarfile``, ``shutil``, and ``zipfile``, and tests for each new API and +integration added. It also contains the re-exports of other compression +modules. Deprecations for the existing import names will be added once a +decision is reached regarding the open issues. + +Rejected Ideas +============== + +Name the module ``libzstd`` and do not make a new ``compression`` namespace +--------------------------------------------------------------------------- + +One option instead of making a new ``compression`` namespace would be to find +a different name, such as ``libzstd``, as the import name. However, the issue +of existing import names is likely to persist for future compression formats +added to the standard library. LZ4, a common high speed compression format, +has `a package on PyPI `_, ``lz4``, with the +import name ``lz4``. Instead of solving this issue for each compression format, +it is better to solve it once and for all by using the already-claimed +``compression`` namespace. 
+ +Copyright +========= + +This document is placed in the public domain or under the +CC0-1.0-Universal license, whichever is more permissive.