|
| 1 | +PEP: 9999 |
| 2 | +Title: JSON Package Metadata |
| 3 | +Author: Emma Harper Smith <emma@python.org> |
| 4 | +PEP-Delegate: Paul Moore |
| 5 | +Discussions-To: Pending |
| 6 | +Status: Draft |
| 7 | +Type: Standards Track |
| 8 | +Topic: Packaging |
| 9 | +Created: 2025-12-09 |
| 10 | +Post-History: Pending |
| 11 | + |
| 12 | + |
| 13 | +Abstract |
| 14 | +======== |
| 15 | + |
| 16 | +Python package metadata ("core metadata") was first defined in :pep:`241` to |
| 17 | +use :rfc:`822` email headers to encode information about packages. This was |
| 18 | +reasonable at the time: email messages were the only widely used, standardized |
| 19 | +text format with a parser in the standard library. However, |
| 20 | +issues with handling different encodings, differing handling of line breaks, |
| 21 | +and other differences between implementations have caused numerous packaging |
| 22 | +bugs. To resolve these issues, this PEP proposes introducing a |
| 23 | +`JavaScript Object Notation (JSON) <https://www.json.org/json-en.html>`_ |
| 24 | +encoded file containing core metadata in Python packages. |
| 25 | + |
| 26 | + |
| 27 | +Motivation |
| 28 | +========== |
| 29 | + |
| 30 | +The email message format has a number of complexities and limitations which |
| 31 | +reduce its utility as a portable textual interchange format for packaging |
| 32 | +metadata. Because the :mod:`email` module requires configuration changes to |
| 33 | +generate valid core metadata, many projects do not use the |
| 34 | +:mod:`!email` module and instead generate core metadata in a custom manner. |
| 35 | +There are many pitfalls with generating email headers that these custom |
| 36 | +generators can hit. First, core metadata fields may contain newlines in the |
| 37 | +value of fields. These newlines must be handled properly so that the value |
| 38 | +can be "unfolded" from multiple lines per :rfc:`822`. Improperly escaped newlines can lead to generating |
| 39 | +invalid core metadata. Second, as discussed in the core metadata |
| 40 | +specifications: |
| 41 | + |
| 42 | +.. epigraph:: |
| 43 | + The standard file format for metadata (including in wheels and installed |
| 44 | + projects) is based on the format of email headers. However, email formats |
| 45 | + have been revised several times, and exactly which email RFC applies to |
| 46 | + packaging metadata is not specified. In the absence of a precise |
| 47 | + definition, the practical standard is set by what the standard library |
| 48 | + :mod:`email.parser` module can parse using the |
| 49 | + :attr:`email.policy.compat32` policy. |
| 50 | + |
| 51 | +Since no specific email RFC is selected, the current core metadata |
| 52 | +specification is ambiguous about whether a given core metadata document is valid. |
| 53 | +:rfc:`822` is the only email standard to be explicitly listed in a PEP. |
| 54 | +However, the core metadata specifications also require that core metadata be |
| 55 | +encoded using UTF-8 when written to a file. This makes the core metadata |
| 56 | +de facto follow :rfc:`6532`, which specifies internationalization of email |
| 57 | +headers. This has practical interoperability concerns. Until a few years ago, |
| 58 | +it was unspecified how to handle non-ASCII encoded content in core metadata, |
| 59 | +causing confusion about how to properly encode non-ASCII emails in core |
| 60 | +metadata. Third, the current format is difficult to properly validate and |
| 61 | +parse. Many tools do not check for issues with the output of the :mod:`!email` |
| 62 | +parser. If a document is malformed, the :mod:`!email` module may still parse it |
| 63 | +without error as a valid email message. Furthermore, due to limitations |
| 64 | +in the email format, fields like ``Project-URL`` must create custom encodings |
| 65 | +of nested key-value items, further complicating parsing. Finally, the lack of |
| 66 | +a schema makes it difficult to validate the contents of email message encoded |
| 67 | +metadata. While introducing a specification for the current format has been |
| 68 | +`discussed previously <https://discuss.python.org/t/python-metadata-format-specification-and-implementation/7550>`_, |
| 69 | +no progress has been made, and converting to JSON was a suggested resolution |
| 70 | +to the issues raised. |
| 71 | + |
| 72 | + |
| 73 | +Rationale |
| 74 | +========= |
| 75 | + |
| 76 | +Introducing a new core metadata file with a well-specified format will greatly |
| 77 | +ease generating, parsing, and validating metadata. JSON is a natural choice for |
| 78 | +storing package core metadata. It is easily machine readable and writable, is |
| 79 | +understandable to humans, and is well supported across many languages. |
| 80 | +Furthermore, :pep:`566` already specifies a canonicalization of email formatted |
| 81 | +core metadata to JSON. JSON is also a frequently used format for data |
| 82 | +interchange on the web. For discussion of other formats considered, please |
| 83 | +refer to the rejected ideas section. |
| 84 | + |
| 85 | +To maintain backwards compatibility, the JSON metadata file MUST be generated |
| 86 | +alongside the existing email formatted metadata file. This ensures that tools |
| 87 | +that do not support the new format can still read package metadata for new |
| 88 | +packages. |
| 89 | + |
| 90 | +The JSON formatted metadata file must be semantically equivalent to the email |
| 91 | +encoded file. This ensures that the metadata is unambiguous between the two |
| 92 | +formats, and tools may read either when both are present. To maintain |
| 93 | +performance, this equivalence is not required to be verified by installers, |
| 94 | +though other tools may do so. Some tools may choose to make the check dependent |
| 95 | +on a configuration flag. |
| 96 | + |
| 97 | +Package indexes SHOULD check that the metadata files are semantically |
| 98 | +equivalent when the package is added to the index. This is a low-cost, one-time |
| 99 | +check that ensures users of the index are served valid packages. |
| 100 | + |
| 101 | + |
| 102 | +Specification |
| 103 | +============= |
| 104 | + |
| 105 | +JSON Format Core Metadata File |
| 106 | +------------------------------ |
| 107 | + |
| 108 | +A new optional file ``METADATA.json`` shall be introduced as a metadata file |
| 109 | +for Python packages. If generated, the ``METADATA.json`` file MUST be placed in |
| 110 | +the same directory as the current email formatted ``METADATA`` or ``PKG-INFO`` |
| 111 | +file. |
| 112 | + |
| 113 | +For wheels, this means that ``METADATA.json`` MUST be located in the |
| 114 | +``.dist-info`` directory. The wheel format minor version will be incremented to |
| 115 | +indicate the change in the format. |
| 116 | + |
| 117 | +For source distribution packages, the ``METADATA.json`` file MUST be located |
| 118 | +in the root directory of the project sources. Tools that prefer the JSON |
| 119 | +formatted metadata file MUST check for the existence of a ``METADATA.json`` |
| 120 | +in the source distribution before reading the file. |
| 121 | + |
| 122 | +The semantic contents of the ``METADATA`` and ``METADATA.json`` files MUST be |
| 123 | +equivalent if ``METADATA.json`` is present. Installers MAY verify this |
| 124 | +information. Public package indexes SHOULD verify the files are semantically |
| 125 | +equivalent. |
| 126 | + |
| 127 | +Conversion to JSON Encoding |
| 128 | +--------------------------- |
| 129 | + |
| 130 | +Conversion from the current email format for core metadata to JSON should |
| 131 | +follow the process described in :pep:`566`, with the following modification: |
| 132 | +the ``Project-URL`` entries should be converted into an object with keys |
| 133 | +containing the labels and values containing the URLs from the original email |
| 134 | +value. The overall process thus becomes: |
| 135 | + |
| 136 | +#. The original key-value format should be read with |
| 137 | + ``email.parser.HeaderParser``; |
| 138 | +#. All transformed keys should be reduced to lower case. Hyphens should be |
| 139 | + replaced with underscores, but otherwise should retain all other characters; |
| 140 | +#. The transformed value for any field marked with "(Multiple-use)" should be a |
| 141 | + single list containing all the original values for the given key; |
| 142 | +#. The ``Keywords`` field should be converted to a list by splitting the |
| 143 | + original value on commas; |
| 144 | +#. The ``Project-URL`` field should be converted into a JSON object with keys |
| 145 | + containing the labels and values containing the URLs from the original email |
| 146 | + value; |
| 147 | +#. The message body, if present, should be set to the value of the |
| 148 | + ``description`` key; |
| 149 | +#. The result should be stored as a string-keyed dictionary. |
| 150 | + |
| 151 | +One edge case in the above conversion is that the ``Project-URL`` label is |
| 152 | +"free text, with a maximum length of 32 characters." Because the label and URL |
| 153 | +are separated by a comma, a label containing a comma cannot be decoded |
| 154 | +unambiguously. Therefore this PEP requires that the ``Project-URL`` label be |
| 155 | +any text *except* the comma (``,``) character. This allows for unambiguous |
| 156 | +parsing of ``Project-URL`` entries by splitting on the left-most comma. |
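
The conversion steps above could be sketched roughly as follows. This is an illustration only, not the reference implementation: the function name `core_metadata_to_dict` and the `MULTIPLE_USE` set are hypothetical, the set lists multiple-use fields for illustration and may not match every metadata version, and stripping whitespace around keywords and labels is an assumption the steps do not state.

```python
from email.parser import HeaderParser

# Hypothetical helper set: multiple-use fields (already normalized), for illustration.
MULTIPLE_USE = {
    "classifier", "dynamic", "license_file", "obsoletes_dist", "platform",
    "provides_dist", "provides_extra", "requires_dist", "requires_external",
    "supported_platform",
}


def core_metadata_to_dict(text):
    """Convert email-formatted core metadata into a string-keyed dictionary."""
    msg = HeaderParser().parsestr(text)
    result = {}
    for raw_key in msg.keys():
        # Lower-case the key and replace hyphens with underscores.
        key = raw_key.lower().replace("-", "_")
        if key in result:
            continue  # repeated header, already collected via get_all()
        values = msg.get_all(raw_key)
        if key == "project_url":
            # Split each entry on the left-most comma: "label, url".
            urls = {}
            for value in values:
                label, _, url = value.partition(",")
                urls[label.strip()] = url.strip()
            result[key] = urls
        elif key == "keywords":
            # Assumption: surrounding whitespace is stripped from each keyword.
            result[key] = [kw.strip() for kw in values[0].split(",")]
        elif key in MULTIPLE_USE:
            result[key] = values
        else:
            result[key] = values[0]
    body = msg.get_payload()
    if body:
        # The message body becomes the "description" key.
        result["description"] = body
    return result
```

Passing the returned dictionary to `json.dumps` would then produce the contents of a `METADATA.json` file under these assumptions.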
| 157 | + |
| 158 | +JSON Schema for Core Metadata |
| 159 | +----------------------------- |
| 160 | + |
| 161 | +To enable verification of JSON encoded core metadata, a |
| 162 | +`JSON schema <https://json-schema.org/>`_ for core metadata has been produced. |
| 163 | +This schema will be updated with each revision to the core metadata |
| 164 | +specification. The schema is available in |
| 165 | +:ref:`9999-core-metadata-json-schema`. |
| 166 | + |
| 167 | +TODO: where should the schema be served/what should the $id be? |
| 168 | + |
| 169 | +Serving METADATA.json in the Simple Repository API |
| 170 | +-------------------------------------------------- |
| 171 | + |
| 172 | +:pep:`658` introduced a means of serving package metadata in the Simple |
| 173 | +Repository API. The JSON encoded version of the package metadata may also be |
| 174 | +served, via the following modifications to the Simple Repository API: |
| 175 | + |
| 176 | +A new attribute ``data-dist-info-metadata-json`` may be added to anchor tags |
| 177 | +in the Simple API. This attribute should have a value containing the hash |
| 178 | +information for the ``METADATA.json`` file in the same format as |
| 179 | +``data-dist-info-metadata``. If ``data-dist-info-metadata-json`` is present, |
| 180 | +the repository MUST serve the JSON encoded metadata file at the |
| 181 | +distribution's path with ``.metadata.json`` appended to it. For example, if a |
| 182 | +distribution is served at ``/simple/foo-1.0-py3-none-any.whl``, the JSON |
| 183 | +encoded core metadata file MUST be served at |
| 184 | +``/simple/foo-1.0-py3-none-any.whl.metadata.json``. |
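
As a small illustration of the rule above, a client might derive the JSON metadata URL from a distribution URL like this. The helper name `json_metadata_url` is hypothetical, and dropping any URL fragment (such as a hash fragment on the anchor's `href`) before appending the suffix is an assumption, not something this section specifies.

```python
from urllib.parse import urlsplit, urlunsplit


def json_metadata_url(distribution_url):
    """Append ".metadata.json" to the path of a distribution URL."""
    parts = urlsplit(distribution_url)
    # Assumption: a fragment belongs to the distribution link, so it is dropped.
    return urlunsplit(
        parts._replace(path=parts.path + ".metadata.json", fragment="")
    )
```

For the example path in the text, `json_metadata_url("/simple/foo-1.0-py3-none-any.whl")` yields `/simple/foo-1.0-py3-none-any.whl.metadata.json`.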
| 185 | + |
| 186 | +Deprecation of the ``METADATA`` and ``PKG-INFO`` Files |
| 187 | +------------------------------------------------------ |
| 188 | + |
| 189 | +The ``METADATA`` and ``PKG-INFO`` files are now deprecated. This means that a |
| 190 | +future PEP may make the ``METADATA`` and ``PKG-INFO`` files optional and |
| 191 | +require ``METADATA.json`` to be present. Please see the next section for |
| 192 | +caveats to that change. |
| 193 | + |
| 194 | +Despite the ``METADATA`` and ``PKG-INFO`` files being deprecated, new core |
| 195 | +metadata revisions should be implemented for both JSON and email to ensure that |
| 196 | +they may remain semantically equivalent. |
| 197 | + |
| 198 | +Backwards Compatibility |
| 199 | +======================= |
| 200 | + |
| 201 | +The specification for ``METADATA.json`` is designed such that the new format is |
| 202 | +completely backwards compatible. Existing tools may read metadata from the |
| 203 | +existing email formatted files, and new tools may take advantage of the new |
| 204 | +format. |
| 205 | + |
| 206 | +A future major revision of the wheel specification may make the ``METADATA`` |
| 207 | +and ``PKG-INFO`` files optional and make the ``METADATA.json`` file required. |
| 208 | +Note that tools will need to maintain parsing of email metadata indefinitely to |
| 209 | +support parsing metadata for old packages which only have the ``METADATA`` or |
| 210 | +``PKG-INFO`` files. |
| 211 | + |
| 212 | + |
| 213 | +Security Implications |
| 214 | +===================== |
| 215 | + |
| 216 | +One attack vector with JSON encoded core metadata is a JSON payload |
| 217 | +designed to consume excessive memory or CPU resources in a denial of service |
| 218 | +attack. While this attack is not likely to affect users, who can cancel |
| 219 | +resource-intensive operations, it may be an issue for package indexes. |
| 220 | + |
| 221 | +There are several mitigations that can be made to prevent this: |
| 222 | + |
| 223 | +#. The length of the JSON payload can be restricted to a reasonable size. |
| 224 | +#. The reader may configure a :class:`~json.JSONDecoder` to skip parsing :class:`int` |
| 225 | + and :class:`float` values, avoiding attacks that exploit quadratic-time |
| 226 | + number parsing. |
| 227 | +#. I plan to contribute a change to the :class:`~json.JSONDecoder` in Python |
| 228 | + 3.15+ that will allow it to be configured to restrict the nesting of JSON |
| 229 | + payloads to a reasonable depth. |
| 230 | + |
| 231 | +With these mitigations in place, concerns about denial of service attacks with |
| 232 | +JSON encoded core metadata are minimal. |
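
The first two mitigations listed above might look something like the sketch below. The function name `load_metadata_json` and the `MAX_METADATA_BYTES` limit are hypothetical; this PEP does not specify a size limit, and the value here is chosen only for illustration.

```python
import json

# Hypothetical limit; the PEP does not define a maximum payload size.
MAX_METADATA_BYTES = 1_000_000


def _keep_as_text(literal):
    """Return the numeric literal unchanged instead of converting it."""
    return literal


def load_metadata_json(data):
    """Decode a METADATA.json payload (bytes) with basic hardening."""
    # Mitigation 1: reject oversized payloads before decoding.
    if len(data) > MAX_METADATA_BYTES:
        raise ValueError("METADATA.json payload exceeds the size limit")
    # Mitigation 2: leave int and float literals as strings.
    decoder = json.JSONDecoder(
        parse_int=_keep_as_text,
        parse_float=_keep_as_text,
    )
    return decoder.decode(data.decode("utf-8"))
```

With this decoder any numeric literal in the payload is returned as a string, so a caller that expects numbers would need to convert them explicitly.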
| 233 | + |
| 234 | + |
| 235 | +Reference Implementation |
| 236 | +======================== |
| 237 | + |
| 238 | +A reference implementation of the JSON schema for JSON core metadata is |
| 239 | +available in :ref:`9999-core-metadata-json-schema`. |
| 240 | + |
| 241 | +Furthermore, a reference implementation in the ``packaging`` library `is |
| 242 | +available |
| 243 | +<https://github.com/wheelnext/packaging/tree/PEP-9999-JSON-metadata>`__. |
| 244 | + |
| 245 | + |
| 246 | +Rejected Ideas |
| 247 | +============== |
| 248 | + |
| 249 | +Using Another File Format (TOML, YAML, etc.) |
| 250 | +-------------------------------------------- |
| 251 | + |
| 252 | +While TOML or another format could be used for the new core metadata file |
| 253 | +format, JSON has been chosen for a few reasons: |
| 254 | + |
| 255 | +#. Core metadata is mostly meant as a machine interchange format to be used by |
| 256 | + tools and services which wish to interoperate. Therefore the |
| 257 | + human-readability of TOML is not an important consideration in this |
| 258 | + selection. |
| 259 | +#. JSON parsers are implemented in many languages' standard libraries and the |
| 260 | + :mod:`json` module has been part of Python's standard library for a very |
| 261 | + long time. |
| 262 | +#. JSON is fast to parse and emit. |
| 263 | +#. JSON schemas are JSON native and commonly used. |
| 264 | + |
| 265 | + |
| 266 | +Open Issues |
| 267 | +=========== |
| 268 | + |
| 269 | +Where Should the JSON Schema be Served? |
| 270 | +--------------------------------------- |
| 271 | + |
| 272 | +Where should the standard JSON Schema be served? Some options would be |
| 273 | +packaging.python.org, pypi.org, python.org, or pypa.org. |
| 274 | + |
| 275 | +My first choice would be packaging.python.org, but I am open to other options. |
| 276 | + |
| 277 | +Should we also update the ``WHEEL`` metadata file format to be JSON encoded? |
| 278 | +---------------------------------------------------------------------------- |
| 279 | + |
| 280 | +The ``WHEEL`` metadata file format is also an email formatted file. This means |
| 281 | +that it is subject to the same parsing and validation issues as the |
| 282 | +``METADATA`` and ``PKG-INFO`` files. However, the ``WHEEL`` file is part of the |
| 283 | +initial wheel format version check done by installers. Changing the file format |
| 284 | +might harm backwards compatibility by making old installers unable to read new |
| 285 | +metadata. |
| 286 | + |
| 287 | +I think it could make sense to introduce a ``WHEEL.json`` file. Then a future |
| 288 | +wheel major version could remove the ``WHEEL`` file and require the |
| 289 | +``WHEEL.json`` file instead. |
| 290 | + |
| 291 | + |
| 292 | +Copyright |
| 293 | +========= |
| 294 | + |
| 295 | +This document is placed in the public domain or under the |
| 296 | +CC0-1.0-Universal license, whichever is more permissive. |
| 297 | + |