diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index a48e9dbc1b0..a5516e95e20 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -574,7 +574,7 @@ peps/pep-0690.rst @warsaw peps/pep-0691.rst @dstufft peps/pep-0692.rst @jellezijlstra peps/pep-0693.rst @Yhg1s -peps/pep-0694.rst @dstufft +peps/pep-0694.rst @dstufft @warsaw peps/pep-0695.rst @gvanrossum peps/pep-0696.rst @jellezijlstra peps/pep-0697.rst @encukou diff --git a/peps/pep-0694.rst b/peps/pep-0694.rst index 30e3a32b8bd..6dc0ccacc1b 100644 --- a/peps/pep-0694.rst +++ b/peps/pep-0694.rst @@ -1,6 +1,6 @@ PEP: 694 -Title: Upload 2.0 API for Python Package Repositories -Author: Donald Stufft +Title: Upload 2.0 API for Python Package Indexes +Author: Barry Warsaw , Donald Stufft Discussions-To: https://discuss.python.org/t/pep-694-upload-2-0-api-for-python-package-repositories/16879 Status: Draft Type: Standards Track @@ -13,147 +13,182 @@ Post-History: `27-Jun-2022 `__; -- It is a fully synchronous API, which means that we're forced to have a single - request being held open for potentially a long time, both for the upload itself, - and then while the repository processes the uploaded file to determine success - or failure. +* artifacts which can be overwritten and replaced, until a session is published; -- It does not support any mechanism for resuming an upload, with the largest file - size on PyPI being just under 1GB in size, that's a lot of wasted bandwidth if - a large file has a network blip towards the end of an upload. +* asynchronous and "chunked", resumable file uploads, for more efficient use of network bandwidth; -- It treats a single file as the atomic unit of operation, which can be problematic - when a release might have multiple binary wheels which can cause people to get - different versions while the files are uploading, and if the sdist happens to - not go last, possibly some hard to build packages are attempting to be built - from source. +* detailed status on the state of artifact uploads; -- It has very limited support for communicating back to the user, with no support - for multiple errors, warnings, deprecations, etc. It is limited entirely to the - HTTP status code and reason phrase, of which the reason phrase has been - deprecated since HTTP/2 (:rfc:`RFC 7540 <7540#section-8.1.2.4>`). +* new project creation without requiring the uploading of an artifact. -- The metadata for a release/file is submitted alongside the file, however this - metadata is famously unreliable, and most installers instead choose to download - the entire file and read that in part due to that unreliability. +Once this new upload API is adopted, the existing legacy API can be deprecated, however this PEP +does not propose a deprecation schedule for the legacy API. -- There is no mechanism for allowing a repository to do any sort of sanity - checks before bandwidth starts getting expended on an upload, whereas a lot - of the cases of invalid metadata or incorrect permissions could be checked - prior to upload. -- It has no support for "staging" a draft release prior to publishing it to the - repository. +Rationale +========= + +There is currently no standardized API for uploading files to a Python package index such as +PyPI. Instead, everyone has been forced to reverse engineer the existing `"legacy" +`__ API. + +The legacy API, while functional, leaks implementation details of the original PyPI code base, +which has been faithfully replicated in the new code base and alternative implementations. 
+
+In addition, there are a number of major issues with the legacy API:
+
+* It is fully synchronous, which forces requests to be held open both for the upload itself, and
+  while the index processes the uploaded file to determine success or failure.
+
+* It does not support any mechanism for resuming an upload. With the largest default file size on
+  PyPI being around 1GB, requiring the entire upload to complete successfully means bandwidth is
+  wasted when such uploads experience a network interruption while the request is in progress.
+
+* The atomic unit of operation is a single file. This is problematic when a release logically
+  includes an sdist and multiple binary wheels, leading to race conditions where consumers get
+  different versions of the package if they are unlucky enough to require a package before their
+  platform's wheel has completely uploaded. If the release uploads its sdist first, this may also
+  manifest in some consumers seeing only the sdist, triggering a local build from source.
+
+* Status reporting is very limited. There's no support for reporting multiple errors, warnings,
+  deprecations, etc. Status is limited to the HTTP status code and reason phrase, of which the
+  reason phrase has been deprecated since HTTP/2 (:rfc:`RFC 7540 <7540#section-8.1.2.4>`).
+
+* Metadata for a release is submitted alongside the file. However, as this metadata is famously
+  unreliable, most installers instead choose to download the entire file and read the metadata from
+  there.
+
+* There is no mechanism for allowing an index to do any sort of sanity checks before bandwidth gets
+  expended on an upload. Many cases of invalid metadata or incorrect permissions could be checked
+  prior to uploading files.
+
+* There is no support for "staging" a release prior to publishing it to the index.
-- It has no support for creating new projects, without uploading a file.
+* Creation of new projects requires the uploading of at least one file, leading to "stub" uploads
+  to claim a project namespace.

-This PEP proposes a new API for uploads, and deprecates the existing non standard
-API.
+The new upload API proposed in this PEP solves all of these problems, providing for a much more
+flexible, bandwidth-friendly approach, with better error reporting, a better release testing
+experience, and atomic and simultaneous publishing of all release artifacts.


-Status Quo
+Legacy API
 ==========

-This does not attempt to be a fully exhaustive documentation of the current API, but
-give a high level overview of the existing API.
+The following is an overview of the legacy API. For the detailed description, consult the
+`PyPI user guide documentation `__.

 Endpoint
 --------

-The existing upload API (and the now removed register API) lives at an url, currently
-``https://upload.pypi.org/legacy/``, and to communicate which specific API you want
-to call, you add a ``:action`` url parameter with a value of ``file_upload``. The values
-of ``submit``, ``submit_pkg_info``, and ``doc_upload`` also used to be supported, but
-no longer are.
+The existing upload API lives at a base URL. For PyPI, that URL is currently
+``https://upload.pypi.org/legacy/``. Clients performing uploads specify the API they want to call
+by adding an ``:action`` URL parameter with a value of ``file_upload``. [#fn-action]_

-It also has a ``protocol_version`` parameter, in theory to allow new versions of the
-API to be written, but in practice that has never happened, and the value is always
-``1``. 
+The legacy API also has a ``protocol_version`` parameter, in theory allowing new versions of the API
+to be defined. In practice this has never happened, and the value is always ``1``.

-So in practice, on PyPI, the endpoint is
+Thus, the effective upload API on PyPI is:
 ``https://upload.pypi.org/legacy/?:action=file_upload&protocol_version=1``.

-
 Encoding
 --------

-The data to be submitted is submitted as a ``POST`` request with the content type
-of ``multipart/form-data``. This is due to the historical nature, that this API
-was not actually designed as an API, but rather was a form on the initial PyPI
-implementation, then client code was written to programmatically submit that form.
+The data is submitted in a ``POST`` request with the content type of
+``multipart/form-data``. This reflects the legacy API's historical nature: it was originally
+designed not as an API, but rather as a web form on the initial PyPI implementation, with client
+code written to programmatically submit that form.

 Content
 -------

-Roughly speaking, the metadata contained within the package is submitted as parts
-where the content-disposition is ``form-data``, and the name is the name of the
-field. The names of these various pieces of metadata are not documented, and they
-sometimes, but not always match the names used in the ``METADATA`` files. The casing
-rarely matches though, but overall the ``METADATA`` to ``form-data`` conversion is
-extremely inconsistent.
+Roughly speaking, the metadata contained within the package is submitted as parts where the content
+disposition is ``form-data``, and the metadata key is the name of the field. The names of these
+various pieces of metadata are not documented, and they sometimes, but not always, match the names
+used in the ``METADATA`` files for package artifacts. The case rarely matches, and the ``form-data``
+to ``METADATA`` conversion is inconsistent.

-The file itself is then sent as a ``application/octet-stream`` part with the name
-of ``content``, and if there is a PGP signature attached, then it will be included
-as a ``application/octet-stream`` part with the name of ``gpg_signature``.
+The upload artifact file itself is sent as an ``application/octet-stream`` part with the name of
+``content``, and if there is a PGP signature attached, then it will be included as an
+``application/octet-stream`` part with the name of ``gpg_signature``.

-Specification
-=============
+Authentication
+--------------

-This PEP traces the root cause of most of the issues with the existing API to be
-roughly two things:
+Upload authentication is also not standardized. On PyPI, authentication is through `API tokens
+`__ or `Trusted Publisher (OpenID Connect)
+`__. Other indexes may support different authentication
+methods.

-- The metadata is submitted alongside the file, rather than being parsed from the
-  file itself.
+.. _spec:
+
+Upload 2.0 API Specification
+============================

-  - This is actually fine if used as a pre-check, but it should be validated
-    against the actual ``METADATA`` or similar files within the distribution.
+This PEP draws inspiration from the `Resumable Uploads for HTTP `_ internet draft;
+however, there are significant differences. This is largely due to the unique nature of Python
+package releases (i.e. metadata, multiple related artifacts, etc.), and the support for an upload
+session and release stages. Where it makes sense to adopt details of the draft, this PEP does so. 
-- It supports a single request, using nothing but form data, that either succeeds
-  or fails, and everything is done and contained within that single request.
+This PEP traces the root cause of most of the issues with the existing API to be roughly two things:

-We then propose a multi-request workflow, that essentially boils down to:
+- The metadata is submitted alongside the file, rather than being parsed from the
+  file itself. [#fn-metadata]_

-1. Initiate an upload session.
-2. Upload the file(s) as part of the upload session.
-3. Complete the upload session.
-4. (Optional) Check the status of an upload session.
+- It supports only a single request, using only form data, that either succeeds or fails, and all
+  actions are atomic within that single request.

-All URLs described here will be relative to the root endpoint, which may be
-located anywhere within the url structure of a domain. So it could be at
-``https://upload.example.com/``, or ``https://example.com/upload/``.
+To address these issues, this PEP proposes a multi-request workflow, which at a high level involves
+these steps:
+
+#. Initiate an upload session, creating a release stage.
+#. Upload the file(s) to that stage as part of the upload session.
+#. Complete the upload session, publishing or discarding the stage.
+#. Optionally check the status of an upload session.

 Versioning
 ----------

-This PEP uses the same ``MAJOR.MINOR`` versioning system as used in :pep:`691`,
-but it is otherwise independently versioned. The existing API is considered by
-this spec to be version ``1.0``, but it otherwise does not attempt to modify
-that API in any way.
+This PEP uses the same ``MAJOR.MINOR`` versioning system as used in :pep:`691`, but it is otherwise
+independently versioned. The legacy API is considered by this PEP to be version ``1.0``, but this
+PEP does not modify the legacy API in any way.

+The API proposed in this PEP therefore has the version number ``2.0``.
+
+
+Root Endpoint
+-------------
+
+All URLs described here are relative to the "root endpoint", which may be located anywhere within
+the URL structure of a domain. For example, the root endpoint could be
+``https://upload.example.com/``, or ``https://example.com/upload/``.

-Endpoints
----------
+Specifically for PyPI, this PEP proposes to implement the root endpoint at
+``https://upload.pypi.org/2.0``. This root URL will be considered provisional while the feature is
+being tested, and will be blessed as permanent after sufficient testing with live projects.
+
+
+.. _session-create:

 Create an Upload Session
 ~~~~~~~~~~~~~~~~~~~~~~~~

-To create a new upload session, you can send a ``POST`` request to ``/``,
-with a payload that looks like:
+A release starts by creating a new upload session. To create the session, a client submits a
+``POST`` request to the root URL, with a payload that looks like:

 .. code-block:: json

@@ -162,23 +197,49 @@ with a payload that looks like:
       "api-version": "2.0"
     },
     "name": "foo",
-    "version": "1.0"
+    "version": "1.0",
+    "nonce": ""
   }

-This currently has three keys, ``meta``, ``name``, and ``version``.
+The request includes the following top-level keys:
+
+``meta`` (**required**)
+  Describes information about the payload itself. Currently, the only defined sub-key is
+  ``api-version``, the value of which must be the string ``"2.0"``.
+
+``name`` (**required**)
+  The name of the project that this session is attempting to release a new version of. 
+
+``version`` (**required**)
+  The version of the project that this session is attempting to add files to.
+
+``nonce`` (**optional**)
+  An additional client-side string input to the :ref:`"session token" `
+  algorithm. Details are provided below, but if this key is omitted, it is equivalent
+  to passing the empty string.

-The ``meta`` key is included in all payloads, and it describes information about the
-payload itself.
+Upon successful session creation, the server returns a ``201 Created`` response. If an error
+occurs, the appropriate ``4xx`` code will be returned, as described in the :ref:`session-errors`
+section.

-The ``name`` key is the name of the project that this session is attempting to
-add files to.
+If a session is created for a project which has no previous release, then the index **MAY** reserve
+the project name before the session is published, however it **MUST NOT** be possible to navigate to
+that project using the "regular" (i.e. :ref:`unstaged `) access protocols, *until*
+the stage is published. If this first-release stage gets canceled, then the index **SHOULD** delete
+the project record, as if it were never uploaded.

-The ``version`` key is the version of the project that this session is attepmting to
-add files to.
+The session is owned by the user that created it, and all subsequent requests **MUST** be performed
+with the same credentials, otherwise a ``403 Forbidden`` will be returned on those subsequent
+requests.

-If creating the session was successful, then the server must return a response
-that looks like:
+
+.. _session-response:
+
+Response Body
++++++++++++++
+
+The successful response includes the following JSON content:

 .. code-block:: json

@@ -186,11 +247,12 @@ that looks like:
     "meta": {
       "api-version": "2.0"
     },
-    "urls": {
+    "links": {
+      "stage": "...",
       "upload": "...",
-      "draft": "...",
-      "publish": "..."
+      "session": "..."
     },
+    "session-token": "",
     "valid-for": 604800,
     "status": "pending",
     "files": {},
@@ -200,74 +262,104 @@
 }

-Besides the ``meta`` key, this response has five keys, ``urls``, ``valid-for``,
-``status``, ``files``, and ``notices``.
+Besides the ``meta`` key, which has the same format as the request JSON, the success response has
+the following keys:

-The ``urls`` key is a dictionary mapping identifiers to related URLs to this
-session.
+``links``
+  A dictionary mapping :ref:`keys to URLs ` related to this session, the details of
+  which are provided below.

-The ``valid-for`` key is an integer representing how long, in seconds, until the
-server itself will expire this session (and thus all of the URLs contained in it).
-The session **SHOULD** live at least this much longer unless the client itself
-has canceled the session. Servers **MAY** choose to *increase* this time, but should
-never *decrease* it, except naturally through the passage of time.
+``session-token``
+  If the index supports :ref:`previewing staged releases `, this key will contain
+  the unique :ref:`"session token" ` that can be provided to installers in order to
+  preview the staged release before it's published. If the index does *not* support stage
+  previewing, this key **MUST** be omitted.

-The ``status`` key is a string that contains one of ``pending``, ``published``,
-``errored``, or ``canceled``, this string represents the overall status of
-the session. 
+
+``valid-for``
+  An integer representing how long, in seconds, until the server itself will expire this session,
+  and thus all of its content, including any uploaded files and the URL links related to the
+  session. This value is roughly relative to the time at which the session was created or
+  :ref:`extended `. The session **SHOULD** live at least this much longer
+  unless the client itself has canceled or published the session. Servers **MAY** choose to
+  *increase* this time, but should never *decrease* it, except naturally through the passage of
+  time. Clients can query the :ref:`session status ` to get time remaining in the
+  session.

-The ``files`` key is a mapping containing the filenames that have been uploaded
-to this session, to a mapping containing details about each file.
+``status``
+  A string that contains one of ``pending``, ``published``, ``error``, or ``canceled``,
+  representing the overall :ref:`status of the session `.

-The ``notices`` key is an optional key that points to an array of notices that
-the server wishes to communicate to the end user that are not specific to any
-one file.
+``files``
+  A mapping from the filenames that have been uploaded in this session to a mapping containing
+  details about each :ref:`file referenced in this session `.

-For each filename in ``files`` the mapping has three keys, ``status``, ``url``,
-and ``notices``.
+``notices``
+  An optional key that points to an array of human-readable informational notices that the server
+  wishes to communicate to the end user. These notices are specific to the overall session, not
+  to any particular file in the session.

-The ``status`` key is the same as the top level ``status`` key, except that it
-indicates the status of a specific file.
+.. _session-links:

-The ``url`` key is the *absolute* URL that the client should upload that specific
-file to (or use to delete that file).
+Session Links
++++++++++++++

-The ``notices`` key is an optional key, that is an array of notices that the server
-wishes to communicate to the end user that are specific to this file.
+For the ``links`` key in the success JSON, the following sub-keys are valid:

-The required response code to a successful creation of the session is a
-``201 Created`` response and it **MUST** include a ``Location`` header that is the
-URL for this session, which may be used to check its status or cancel it.
+``upload``
+  The endpoint that clients will use to initiate :ref:`uploads ` for each file to
+  be included in this session.

-For the ``urls`` key, there are currently three keys that may appear:
+``stage``
+  The endpoint where this staged release can be :ref:`previewed ` prior to
+  publishing the session. This can be used to download and verify the not-yet-public files. If
+  the index does not support previewing staged releases, this key **MUST** be omitted.

-The ``upload`` key, which is the upload endpoint for this session to initiate
-a file upload.
+``session``
+  The endpoint where actions for this session can be performed, including :ref:`publishing this
+  session `, :ref:`canceling and discarding the session `,
+  :ref:`querying the current session status `, and :ref:`requesting an extension
+  of the session lifetime ` (*if* the server supports it).

-The ``draft`` key, which is the repository URL that these files are available at
-prior to publishing.

-The ``publish`` key, which is the endpoint to trigger publishing the session.
+.. 
_session-files:
+
+Session Files
++++++++++++++
+
+The ``files`` key contains a mapping from the names of the files uploaded in this session to a
+sub-mapping with the following keys:
+
+``status``
+  A string with the same values and semantics as the :ref:`session status key `,
+  except that it indicates the status of the specific referenced file.
+
+``link``
+  The *absolute* URL that the client should use to reference this specific file. This URL is used
+  to retrieve, replace, or delete the :ref:`referenced file `. If a ``nonce`` was
+  provided, this URL **MUST** be obfuscated with a non-guessable token as described in the
+  :ref:`session token ` section.
+
+``notices``
+  An optional key with a similar format and semantics as the ``notices`` session key, except that
+  these notices are specific to the referenced file.

-In addition to the above, if a second session is created for the same name+version
-pair, then the upload server **MUST** return the already existing session rather
-than creating a new, empty one.
+If a second session is created for the same name-version pair while a session for that pair is in
+the ``pending`` state, then the server **MUST** return the JSON status response for the already
+existing session, along with a ``200 OK`` status code, rather than creating a new, empty session.

-Upload Each File
-~~~~~~~~~~~~~~~~
+.. _file-uploads:

-Once you have initiated an upload session for one or more files, then you have
-to actually upload each of those files.
+File Upload
+~~~~~~~~~~~

-There is no set endpoint for actually uploading the file, that is given to the
-client by the server as part of the creation of the upload session, and clients
-**MUST NOT** assume that there is any commonality to what those URLs look like from
-one session to the next.
+After creating the session, the ``upload`` endpoint from the response's :ref:`session links
+` mapping is used to begin the upload of new files into that session. Clients
+**MUST** use the provided ``upload`` URL and **MUST NOT** assume there is any pattern or commonality
+to those URLs from one session to the next.

-To initiate a file upload, a client sends a ``POST`` request to the upload URL
-in the session, with a request body that looks like:
+To initiate a file upload, a client first sends a ``POST`` request to the ``upload`` URL. The
+request body has the following JSON format:

 .. code-block:: json

@@ -282,212 +374,434 @@ in the session, with a request body that looks like:
 }

-Besides the standard ``meta`` key, this currently has 4 keys:
+Besides the standard ``meta`` key, the request JSON has the following additional keys:

-- ``filename``: The filename of the file being uploaded.
+``filename`` (**required**)
+  The name of the file being uploaded.

-- ``size``: The size, in bytes, of the file that is being uploaded.
+``size`` (**required**)
+  The size in bytes of the file being uploaded.

-- ``hashes``: A mapping of hash names to hex encoded digests, each of these digests
-  are the digests of that file, when hashed by the hash identified in the name.

-  By default, any hash algorithm available via `hashlib
-  `_ (specifically any that can
-  be passed to ``hashlib.new()`` and do not require additional parameters) can
-  be used as a key for the hashes dictionary. At least one secure algorithm from
-  ``hashlib.algorithms_guaranteed`` **MUST** always be included. At the time
-  of this PEP, ``sha256`` specifically is recommended.

-  Multiple hashes may be passed at a time, but all hashes must be valid for the
-  file. 
-- ``metadata``: An optional key that is a string containing the file's
-  `core metadata `_.
+``hashes`` (**required**)
+  A mapping of hash names to hex-encoded digests. Each of these digests is a checksum of the
+  file being uploaded when hashed by the algorithm identified in the name.

-Servers **MAY** use the data provided in this response to do some sanity checking
-prior to allowing the file to be uploaded, which may include but is not limited
-to:
+  By default, any hash algorithm available in `hashlib
+  `_ can be used as a key for the hashes
+  dictionary [#fn-hash]_. At least one secure algorithm from ``hashlib.algorithms_guaranteed``
+  **MUST** always be included. This PEP specifically recommends ``sha256``.

-- Checking if the ``filename`` already exists.
-- Checking if the ``size`` would invalidate some quota.
-- Checking if the contents of the ``metadata``, if provided, are valid.
+  Multiple hashes may be passed at a time, but all hashes provided **MUST** be valid for the file.

-If the server determines that the client should attempt the upload, it will return
-a ``201 Created`` response, with an empty body, and a ``Location`` header pointing
-to the URL that the file itself should be uploaded to.
+``metadata`` (**optional**)
+  If given, this is a string value containing the file's `core metadata
+  `_.

-At this point, the status of the session should show the filename, with the above url
-included in it.
+Servers **MAY** use the data provided in this request to do some sanity checking prior to allowing
+the file to be uploaded. These checks may include, but are not limited to:

+- checking if the ``filename`` already exists in a published release;

-Upload Data
-+++++++++++
+- checking if the ``size`` would exceed any project or file quota;

-To upload the file, a client has two choices, they may upload the file as either
-a single chunk, or as multiple chunks. Either option is acceptable, but it is
-recommended that most clients should choose to upload each file as a single chunk
-as that requires fewer requests and typically has better performance.
+- checking if the contents of the ``metadata``, if provided, are valid.

-However for particularly large files, uploading within a single request may result
-in timeouts, so larger files may need to be uploaded in multiple chunks.
+If the server determines that the upload should proceed, it will return a ``201 Created`` response,
+with an empty body, and a ``Location`` header pointing to the URL that the file content should be
+uploaded to. The :ref:`status ` of the session will also include the filename in
+the ``files`` mapping, with the above ``Location`` URL included under the ``link`` sub-key.

-In either case, the client must generate a unique token (or nonce) for each upload
-attempt for a file, and **MUST** include that token in each request in the ``Upload-Token``
-header. The ``Upload-Token`` is a binary blob encoded using base64 surrounded by
-a ``:`` on either side. Clients **SHOULD** use at least 32 bytes of cryptographically
-random data. You can generate it using the following:
+.. IMPORTANT::

-.. code-block:: python
+   The `IETF draft `_ calls this the URL of the `upload resource
+   `_, and this PEP uses that nomenclature as well.

-   import base64
-   import secrets
+.. _ietf-upload-resource: https://www.ietf.org/archive/id/draft-ietf-httpbis-resumable-upload-05.html#name-upload-creation-2

-   header = ":" + base64.b64encode(secrets.token_bytes(32)).decode() + ":"
+
+
+.. 
_upload-contents:
+
+Upload File Contents
+++++++++++++++++++++
+
+The actual file contents are uploaded by issuing a ``POST`` request to the upload resource URL
+[#fn-location]_. The client may either upload the entire file in a single request, or it may opt
+for a "chunked" upload where the file contents are split across multiple requests, as described
+below.
+
+.. IMPORTANT::

-The one time that it is permissible to omit the ``Upload-Token`` from an upload
-request is when a client wishes to opt out of the resumable or chunked file upload
-feature completely. In that case, they **MAY** omit the ``Upload-Token``, and the
-file must be successfully uploaded in a single HTTP request, and if it fails, the
-entire file must be resent in another single HTTP request.
+   The protocol defined in this PEP differs from the `IETF draft `_ in a few ways:

-To upload in a single chunk, a client sends a ``POST`` request to the URL from the
-session response for that filename. The client **MUST** include a ``Content-Length``
-header that is equal to the size of the file in bytes, and this **MUST** match the
-size given in the original session creation.
+   * For chunked uploads, the `second and subsequent chunks `_ are uploaded
+     using ``POST`` requests instead of ``PATCH`` requests. Similarly, this PEP uses
+     ``application/octet-stream`` for the ``Content-Type`` headers for all chunks.

-As an example, if uploading a 100,000 byte file, you would send headers like::
+   * No ``Upload-Draft-Interop-Version`` header is required.
+
+   * Some of the server responses are different.
+
+.. _ietf-upload-append: https://www.ietf.org/archive/id/draft-ietf-httpbis-resumable-upload-05.html#name-upload-append-2
+
+
+When uploading the entire file in a single request, the request **MUST** include the following
+headers (e.g. for a 100,000 byte file):
+
+.. code-block:: email
+
+   Content-Length: 100000
+   Content-Type: application/octet-stream
+   Upload-Length: 100000
+   Upload-Complete: ?1
+
+The body of this request contains all 100,000 bytes of the unencoded raw binary data.
+
+``Content-Length``
+  The number of file bytes contained in the body of *this* request.
+
+``Content-Type``
+  **MUST** be ``application/octet-stream``.
+
+``Upload-Length``
+  Indicates the total number of bytes that will be uploaded for this file. For single-request
+  uploads this will always be equal to ``Content-Length``, but these values will likely differ for
+  chunked uploads. This value **MUST** equal the number of bytes given in the ``size`` field of
+  the file upload initiation request.
+
+``Upload-Complete``
+  A flag indicating whether more chunks are coming for this file. For single-request uploads, the
+  value of this header **MUST** be ``?1``.
+
+If the upload completes successfully, the server **MUST** respond with a ``201 Created`` status.
+The response body has no content.
+
+If this single-request upload fails, the entire file must be resent in another single HTTP request.
+This is the recommended format for file uploads, since it requires the fewest requests.

-   Content-Length: 100000
-   Upload-Token: :nYuc7Lg2/Lv9S4EYoT9WE6nwFZgN/TcUXyk9wtwoABg=: 
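+As an illustration only, here is a minimal, hypothetical client-side sketch of a single-request
+upload using the third-party ``requests`` library. The authorization header is a placeholder, not
+part of this specification; the upload resource URL comes from the ``Location`` header returned
+when the upload was initiated:
+
+.. code-block:: python
+
+   import requests
+
+   def upload_in_one_request(upload_url: str, path: str) -> None:
+       # Read the whole artifact; a chunked upload would instead slice this
+       # into multiple POSTs, as described below.
+       with open(path, "rb") as f:
+           data = f.read()
+       response = requests.post(
+           upload_url,
+           data=data,  # requests derives Content-Length from the body
+           headers={
+               "Authorization": "token <api-token>",  # placeholder credentials
+               "Content-Type": "application/octet-stream",
+               "Upload-Length": str(len(data)),
+               "Upload-Complete": "?1",  # no further chunks are coming
+           },
+       )
+       # A 201 Created response means the file is now staged in the session.
+       response.raise_for_status()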
+Clients can opt to upload the file in multiple chunks. Because the upload resource URL provided in
+the metadata response will be unique per file, clients **MUST** use the given upload resource URL
+for all chunks. Clients upload file chunks by sending multiple ``POST`` requests to this URL, with
+one request per chunk.

-If the upload completes successfully, the server **MUST** respond with a
-``201 Created`` status. At this point this file **MUST** not be present in the
-repository, but merely staged until the upload session has completed.
+For chunked uploads, the ``Content-Length`` is equal to the size in bytes of the chunk that is
+currently being sent. The client **MUST** include an ``Upload-Offset`` header which indicates the
+byte offset at which the content included in this chunk's request starts, and an ``Upload-Complete``
+header with the value ``?0``. For the first chunk, the ``Upload-Offset`` header **MUST** be set to
+``0``. As with single-request uploads, the ``Content-Type`` header is ``application/octet-stream``
+and the body is the raw, unencoded bytes of the chunk.

-To upload in multiple chunks, a client sends multiple ``POST`` requests to the same
-URL as before, one for each chunk.
+For example, if uploading a 100,000 byte file in 1000 byte chunks, the first chunk's request headers
+would be:

-This time however, the ``Content-Length`` is equal to the size, in bytes, of the
-chunk that they are sending. In addition, the client **MUST** include a
-``Upload-Offset`` header which indicates a byte offset that the content included
-in this request starts at and a ``Upload-Incomplete`` header set to ``1``.
+.. code-block:: email

-As an example, if uploading a 100,000 byte file in 1000 byte chunks, and this chunk
-represents bytes 1001 through 2000, you would send headers like::
+   Content-Length: 1000
+   Content-Type: application/octet-stream
+   Upload-Offset: 0
+   Upload-Length: 100000
+   Upload-Complete: ?0
+
+For the second chunk, representing bytes 1000 through 1999, include the following headers:
+
+.. code-block:: email

    Content-Length: 1000
-   Upload-Token: :nYuc7Lg2/Lv9S4EYoT9WE6nwFZgN/TcUXyk9wtwoABg=:
-   Upload-Offset: 1001
-   Upload-Incomplete: 1
+   Content-Type: application/octet-stream
+   Upload-Offset: 1000
+   Upload-Length: 100000
+   Upload-Complete: ?0

-However, the **final** chunk of data omits the ``Upload-Incomplete`` header, since
-at that point the upload is no longer incomplete.
+These requests would continue sequentially until the last chunk is ready to be uploaded.

-For each successful chunk, the server **MUST** respond with a ``202 Accepted``
-header, except for the final chunk, which **MUST** be a ``201 Created``.
+For each successful chunk, the server **MUST** respond with a ``202 Accepted`` status, except for
+the final chunk, which **MUST** be a ``201 Created``, and as with non-chunked uploads, the body of
+these responses has no content.

.. _complete-the-upload:

-The following constraints are placed on uploads regardless of whether they are
-single chunk or multiple chunks:

-- A client **MUST NOT** perform multiple ``POST`` requests in parallel for the
-  same file to avoid race conditions and data loss or corruption. The server
-  **MAY** terminate any ongoing ``POST`` request that utilizes the same
-  ``Upload-Token``.
-- If the offset provided in ``Upload-Offset`` is not ``0`` or the next chunk
-  in an incomplete upload, then the server **MUST** respond with a 409 Conflict.
-- Once an upload has started with a specific token, you may not use another token
-  for that file without deleting the in progress upload. 
-- Once a file has uploaded successfully, you may initiate another upload for
-  that file, and doing so will replace that file.
+The final chunk of data **MUST** include the ``Upload-Complete: ?1`` header, since at that point the
+entire file has been uploaded.

+With both chunked and non-chunked uploads, once completed successfully, the file **MUST NOT** be
+publicly visible in the repository, but merely staged until the upload session is :ref:`completed
+`. If the server supports :ref:`previews `, the file **MUST** be
+visible at the ``stage`` :ref:`URL `. Partially uploaded chunked files **SHOULD
+NOT** be visible at the ``stage`` URL.

-Resume Upload
-+++++++++++++
+The following constraints are placed on uploads regardless of whether they are single chunk or
+multiple chunks:
+
+- A client **MUST NOT** perform multiple ``POST`` requests in parallel for the same file to avoid
+  race conditions and data loss or corruption.

-To resume an upload, you first have to know how much of the data the server has
-already received, regardless of if you were originally uploading the file as
-a single chunk, or in multiple chunks.
+- If the offset provided in ``Upload-Offset`` is neither ``0`` nor the byte offset of the next
+  chunk in an incomplete upload, then the server **MUST** respond with a ``409 Conflict``. This
+  means that a client **MUST NOT** upload chunks out of order.

-To get the status of an individual upload, a client can make a ``HEAD`` request
-with their existing ``Upload-Token`` to the same URL they were uploading to.
+- Once a file upload has completed successfully, you may initiate another upload for that file,
+  which, **once completed**, will replace that file. This is possible until the entire session is
+  completed, at which point no further file uploads (either creating or replacing a session file)
+  are accepted. That is, once a session is published, the files included in that release are
+  immutable [#fn-immutable]_.

-The server **MUST** respond back with a ``204 No Content`` response, with an
-``Upload-Offset`` header that indicates what offset the client should continue
-uploading from. If the server has not received any data, then this would be ``0``,
-if it has received 1007 bytes then it would be ``1007``.

-Once the client has retrieved the offset that they need to start from, they can
-upload the rest of the file as described above, either in a single request
-containing all of the remaining data or in multiple chunks.
+Resume an Upload
+++++++++++++++++

+To resume an upload, you first have to know how much of the file's contents the server has already
+received. If this is not already known, a client can make a ``HEAD`` request to the upload resource
+URL.

+The server **MUST** respond with a ``204 No Content`` response, with an ``Upload-Offset`` header
+that indicates what offset the client should continue uploading from. If the server has not received
+any data, then this would be ``0``; if it has received 1007 bytes, then it would be ``1007``. For
+this example, the full response headers would look like:

+.. code-block:: email
+
+   Upload-Offset: 1007
+   Upload-Complete: ?0
+   Cache-Control: no-store
+
+
+Once the client has retrieved the offset that they need to start from, they can upload the rest of
+the file as described above, either in a single request containing all of the remaining bytes, or in
+multiple chunks as per the above protocol.
+
+
+.. 
_cancel-an-upload:
+
+Canceling an In-Progress Upload
++++++++++++++++++++++++++++++++

-If a client wishes to cancel an upload of a specific file, for instance because
-they need to upload a different file, they may do so by issuing a ``DELETE``
-request to the file upload URL with the ``Upload-Token`` used to upload the
-file in the first place.
+If a client wishes to cancel an upload of a specific file, for instance because they need to upload
+a different file, they may do so by issuing a ``DELETE`` request to the upload resource URL of the
+file whose upload they want to cancel.
+
+A successful cancellation request **MUST** respond with a ``204 No Content``.

-A successful cancellation request **MUST** response with a ``204 No Content``.
+After deletion, a client **MUST NOT** assume that the previous upload resource URL can be reused.

-Delete an uploaded File
-+++++++++++++++++++++++
+Delete a Partial or Fully Uploaded File
++++++++++++++++++++++++++++++++++++++++

-Already uploaded files may be deleted by issuing a ``DELETE`` request to the file
-upload URL without the ``Upload-Token``.
+Similarly, for files which have already been completely uploaded, clients can delete the file by
+issuing a ``DELETE`` request to the upload resource URL.

 A successful deletion request **MUST** respond with a ``204 No Content``.

+After deletion, a client **MUST NOT** assume that the previous upload resource URL can be reused.
+
+
+Replacing a Partially or Fully Uploaded File
+++++++++++++++++++++++++++++++++++++++++++++
+
+To replace a session file, the file upload **MUST** have been previously completed or deleted. It
+is not possible to replace a file while its upload is still in progress. Clients have two options
+to replace an incomplete upload:
+
+- :ref:`Cancel the in-progress upload ` by issuing a ``DELETE`` to the upload
+  resource URL for the file they want to replace. After this, the new file upload can be initiated
+  by beginning the entire :ref:`file upload ` sequence over again. This means
+  providing the metadata request again to retrieve a new upload resource URL. Clients **MUST NOT**
+  assume that the previous upload resource URL can be reused after deletion.
+
+- :ref:`Complete the in-progress upload ` by uploading a zero-length chunk
+  providing the ``Upload-Complete: ?1`` header. This effectively truncates and completes the
+  in-progress upload, after which point the new upload can commence. In this case, clients
+  **SHOULD** reuse the previous upload resource URL and do not need to begin the entire :ref:`file
+  upload ` sequence over again.
+
+
+.. _session-status:

 Session Status
 ~~~~~~~~~~~~~~

-Similarly to file upload, the session URL is provided in the response to
-creating the upload session, and clients **MUST NOT** assume that there is any
-commonality to what those URLs look like from one session to the next.
+At any time, a client can query the status of the session by issuing a ``GET`` request to the
+``session`` :ref:`link ` given in the :ref:`session creation response body
+`.
+
+The server will respond to this ``GET`` request with the same :ref:`response `
+that the client got when it initially created the upload session, except with any changes to
+``status``, ``valid-for``, or ``files`` reflected.

-To check the status of a session, clients issue a ``GET`` request to the
-session URL, to which the server will respond with the same response that
-they got when they initially created the upload session, except with any
-changes to ``status``, ``valid-for``, or updated ``files`` reflected. 
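+As an illustration only, here is a minimal, hypothetical sketch of creating a session and then
+checking its status with the third-party ``requests`` library. The root endpoint and authorization
+header are placeholders, not part of this specification:
+
+.. code-block:: python
+
+   import requests
+
+   ROOT = "https://upload.example.com/"  # hypothetical root endpoint
+   HEADERS = {
+       "Authorization": "token <api-token>",  # placeholder credentials
+       "Content-Type": "application/vnd.pypi.upload.v2+json",
+   }
+
+   # Create (or re-join) the upload session for foo 1.0.
+   created = requests.post(
+       ROOT,
+       json={"meta": {"api-version": "2.0"}, "name": "foo", "version": "1.0"},
+       headers=HEADERS,
+   )
+   created.raise_for_status()
+   session_url = created.json()["links"]["session"]
+
+   # Query the session; the response mirrors the creation response, with any
+   # changes to "status", "valid-for", and "files" reflected.
+   status = requests.get(session_url, headers=HEADERS).json()
+   print(status["status"], status["valid-for"])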
+.. _session-extension:
+
+Session Extension
+~~~~~~~~~~~~~~~~~
+
+Servers **MAY** allow clients to extend sessions, but the overall lifetime and number of extensions
+allowed are left to the server. To extend a session, a client issues a ``POST`` request to the
+``session`` :ref:`link ` given in the :ref:`session creation response body
+`.
+
+The JSON body of this request looks like:
+
+.. code-block:: json
+
+   {
+     "meta": {
+       "api-version": "2.0"
+     },
+     ":action": "extend",
+     "extend-for": 3600
+   }
+
+The number of seconds specified is just a suggestion to the server for the number of additional
+seconds to extend the current session. For example, if the client wants to extend the current
+session for another hour, ``extend-for`` would be ``3600``. Upon successful extension, the server
+will respond with the same :ref:`response ` that the client got when it initially
+created the upload session, except with any changes to ``status``, ``valid-for``, or ``files``
+reflected.
+
+If the server refuses to extend the session for the requested number of seconds, it still returns a
+success response, and the ``valid-for`` key will simply include the number of seconds remaining in
+the current session.
+
+
+.. _session-cancellation:

 Session Cancellation
 ~~~~~~~~~~~~~~~~~~~~

-To cancel an upload session, a client issues a ``DELETE`` request to the
-same session URL as before. At which point the server marks the session as
-canceled, **MAY** purge any data that was uploaded as part of that session,
-and future attempts to access that session URL or any of the file upload URLs
-**MAY** return a ``404 Not Found``.
+To cancel an entire session, a client issues a ``DELETE`` request to the ``session`` :ref:`link
+` given in the :ref:`session creation response body `. The server
+then marks the session as canceled, and **SHOULD** purge any data that was uploaded as part of that
+session. Future attempts to access that session URL or any of the session's upload resource URLs
+**MUST** return a ``404 Not Found``.
+
+To prevent dangling sessions, servers may also choose to cancel timed-out sessions of their own
+accord. It is recommended that servers expunge their sessions after no less than a week, but each
+server may choose its own schedule. Servers **MAY** support client-directed :ref:`session
+extensions `.

-To prevent a lot of dangling sessions, servers may also choose to cancel a
-session on their own accord. It is recommended that servers expunge their
-sessions after no less than a week, but each server may choose their own
-schedule.

+.. _publish-session:

 Session Completion
 ~~~~~~~~~~~~~~~~~~

-To complete a session, and publish the files that have been included in it,
-a client **MUST** send a ``POST`` request to the ``publish`` url in the
-session status payload.
+To complete a session and publish the files that have been included in it, a client issues a
+``POST`` request to the ``session`` :ref:`link ` given in the :ref:`session creation
+response body `.
+
+The JSON body of this request looks like:
+
+.. code-block:: json
+
+   {
+     "meta": {
+       "api-version": "2.0"
+     },
+     ":action": "publish"
+   }
+
+If the server is able to immediately complete the session, it may do so and return a ``201 Created``
+response. If it is unable to immediately complete the session (for instance, if it needs to do
+processing that may take longer than reasonable in a single HTTP request), then it may return a
+``202 Accepted`` response. 
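+Continuing the earlier hypothetical sketch (``session_url`` and ``HEADERS`` are the placeholders
+defined there), publishing a session might look like:
+
+.. code-block:: python
+
+   import requests
+
+   # Publish the staged release; 201 means the session completed immediately,
+   # 202 means the server is still processing the request.
+   response = requests.post(
+       session_url,
+       json={"meta": {"api-version": "2.0"}, ":action": "publish"},
+       headers=HEADERS,
+   )
+   assert response.status_code in (201, 202)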
+
+Whether it returns ``201 Created`` or ``202 Accepted``, the server should include a ``Location``
+header pointing back to the session status URL, and if the server returned a ``202 Accepted``, the
+client may poll that URL to watch for the status to change.
+
+If a session is published that has no staged files, the operation is effectively a no-op, except
+where a new project name is being reserved. In this case, the new project is created, reserved, and
+owned by the user that created the session.
+
+
+.. _session-token:
+
+Session Token
+~~~~~~~~~~~~~
+
+When creating a session, clients can provide a ``nonce`` in the :ref:`initial session creation
+request `. This nonce is a string with arbitrary content. The ``nonce`` is
+optional, and if omitted, is equivalent to providing an empty string.
+
+In order to support previewing of staged uploads, the package ``name`` and ``version``, along with
+this ``nonce``, are used as input into a hashing algorithm to produce a unique "session token".
+This session token is valid for the life of the session (i.e., until it is completed, either by
+cancellation or publishing), and can be provided to supporting installers to gain access to the
+staged release.
+
+The use of the ``nonce`` allows clients to decide whether they want to obscure the visibility of
+their staged releases or not, and there can be good reasons for either choice. For example, if a CI
+system wants to upload some wheels for a new release, and wants to allow independent validation of a
+stage before it's published, the client may opt not to include a nonce. On the other hand, if a
+client would like to pre-seed a release which it publishes atomically at the time of a public
+announcement, that client will likely opt for providing a nonce.
+
+The `SHA256 algorithm `_ is used to
+turn these inputs into a unique token, in the order ``name``, ``version``, ``nonce``, using the
+following Python code as an example:
+
+.. code-block:: python
+
+   from hashlib import sha256
+
+   def gentoken(name: bytes, version: bytes, nonce: bytes = b''):
+       h = sha256()
+       h.update(name)
+       h.update(version)
+       h.update(nonce)
+       return h.hexdigest()
+
+It should be evident that if no ``nonce`` is provided in the :ref:`session creation request
+`, then the preview token is easily guessable from the package name and version
+number alone. Clients can elect to omit the ``nonce`` (or set it to the empty string themselves) if
+they want to allow previewing from anybody without access to the preview token. By providing a
+non-empty ``nonce``, clients can elect for security-through-obscurity, but this does not protect
+staged files behind any kind of authentication.
+
+
+.. _staged-preview:
+
+Stage Previews
+~~~~~~~~~~~~~~
+
+The ability to preview staged releases before they are published is an important feature of this
+PEP, enabling an additional level of last-mile testing before the release is available to the
+public. Indexes **MAY** provide this functionality through the URL provided in the ``stage``
+sub-key of the :ref:`links key ` returned when the session is created. The ``stage``
+URL can be passed to installers such as ``pip`` by setting the `--extra-index-url
+`__ flag to this value.
+Multiple stages can even be previewed by repeating this flag with multiple values. 
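+For example, assuming a hypothetical stage URL handed back in the ``stage`` link, a staged release
+could be smoke-tested in a fresh virtual environment with something like:
+
+.. code-block:: console
+
+   $ pip install --extra-index-url <stage-url> foo==1.0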
+
+In the future, it may be valuable to include something like a ``Stage-Token`` header to the `Simple
+Repository API `_
+requests or the :pep:`691` JSON-based Simple API, with the value from the ``session-token`` sub-key
+of the JSON response to the session creation request. Multiple ``Stage-Token`` headers could be
+allowed, and installers could support enabling stage previews by adding a ``--staged `` or
+similarly named option to set the ``Stage-Token`` header at the command line. This feature is not
+currently supported, nor is it proposed by this PEP, though it could be proposed by a separate PEP
+in the future.

-If the server is able to immediately complete the session, it may do so
-and return a ``201 Created`` response. If it is unable to immediately
-complete the session (for instance, if it needs to do processing that may
-take longer than reasonable in a single HTTP request), then it may return
-a ``202 Accepted`` response.
+In either case, the index will return views that expose the staged releases to the installer tool,
+making them available to download and install into virtual environments built for that last-mile
+testing. The former option allows existing installers to preview staged releases with no changes,
+although perhaps in a less user-friendly way. The latter option can be a better user experience,
+but the details of this are left to installer tool maintainers.

-In either case, the server should include a ``Location`` header pointing
-back to the session status url, and if the server returned a ``202 Accepted``,
-the client may poll that URL to watch for the status to change.

+.. _session-errors:

 Errors
 ------

-All Error responses that contain a body will have a body that looks like:
+All error responses that contain content will have a body that looks like:

 .. code-block:: json

@@ -504,71 +818,60 @@ All Error responses that contain a body will have a body that looks like:
   ]
 }

-Besides the standard ``meta`` key, this has two top level keys, ``message``
-and ``errors``.
+Besides the standard ``meta`` key, this has the following top level keys:

-The ``message`` key is a singular message that encapsulates all errors that
-may have happened on this request.
+``message``
+  A singular message that encapsulates all errors that may have happened on this
+  request.

-The ``errors`` key is an array of specific errors, each of which contains
-a ``source`` key, which is a string that indicates what the source of the
-error is, and a ``message`` key for that specific error.
+``errors``
+  An array of specific errors, each of which contains a ``source`` key, which is a string that
+  indicates what the source of the error is, and a ``message`` key for that specific error.

-The ``message`` and ``source`` strings do not have any specific meaning, and
-are intended for human interpretation to figure out what the underlying issue
-was.
+The ``message`` and ``source`` strings do not have any specific meaning, and are intended for human
+interpretation to aid in diagnosing the underlying issue.


-Content-Types
+Content Types
 -------------

-Like :pep:`691`, this PEP proposes that all requests and responses from the
-Upload API will have a standard content type that describes what the content
-is, what version of the API it represents, and what serialization format has
-been used. 
+Like :pep:`691`, this PEP proposes that all requests and responses from this upload API will have a
+standard content type that describes what the content is, what version of the API it represents, and
+what serialization format has been used.

-The structure of this content type will be:
+This standard request content type applies to all requests *except* for :ref:`file upload requests
+`, which, since they contain only binary data, always use ``application/octet-stream``.

-.. code-block:: text
-
-    application/vnd.pypi.upload.$version+format
-
-Since only major versions should be disruptive to systems attempting to
-understand one of these API content bodies, only the major version will be
-included in the content type, and will be prefixed with a ``v`` to clarify
-that it is a version number.
+The structure of the ``Content-Type`` header for all other requests is:

-Unlike :pep:`691`, this PEP does not change the existing ``1.0`` API in any
-way, so servers will be required to host the new API described in this PEP at
-a different endpoint than the existing upload API.
-
-Which means that for the new 2.0 API, the content types would be:
+.. code-block:: text

-- **JSON:** ``application/vnd.pypi.upload.v2+json``
+   application/vnd.pypi.upload.$version+$format

-In addition to the above, a special "meta" version is supported named ``latest``,
-whose purpose is to allow clients to request the absolute latest version, without
-having to know ahead of time what that version is. It is recommended however,
-that clients be explicit about what versions they support.
+Since minor API version differences should never be disruptive, only the major version is included
+in the content type; the version number is prefixed with a ``v``.

-These content types **DO NOT** apply to the file uploads themselves, only to the
-other API requests/responses in the upload API. The files themselves should use
-the ``application/octet-stream`` content-type.
+Unlike :pep:`691`, this PEP does not change the existing *legacy* ``1.0`` upload API in any way, so
+servers are required to host the new API described in this PEP at a different endpoint than the
+existing upload API.

+Since JSON is the only request format defined in this PEP, all non-file-upload requests defined in
+this PEP **MUST** include a ``Content-Type`` header value of:

-Version + Format Selection
---------------------------
+- ``application/vnd.pypi.upload.v2+json``.

-Again similar to :pep:`691`, this PEP standardizes on using server-driven
-content negotiation to allow clients to request different versions or
-serialization formats, which includes the ``format`` url parameter.
+As with :pep:`691`, a special "meta" version is supported named ``latest``, the purpose of which is
+to allow clients to request the latest version implemented by the server, without having to know
+ahead of time what that version is. It is recommended, however, that clients be explicit about what
+versions they support.

-Since this PEP expects the existing legacy ``1.0`` upload API to exist at a
-different endpoint, and it currently only provides for JSON serialization, this
-mechanism is not particularly useful, and clients only have a single version and
-serialization they can request. However clients **SHOULD** be setup to handle
-content negotiation gracefully in the case that additional formats or versions
-are added in the future. 
+Similar to :pep:`691`, this PEP also standardizes on using server-driven content negotiation to
+allow clients to request different versions or serialization formats, which includes the ``format``
+part of the content type. However, since this PEP expects the existing legacy ``1.0`` upload API to
+exist at a different endpoint, and this PEP currently only provides for JSON serialization, this
+mechanism is not particularly useful. Clients only have a single version and serialization they can
+request. However, clients **SHOULD** be prepared to handle content negotiation gracefully in the
+case that additional formats or versions are added in the future.


 FAQ
 ===

 Does this mean PyPI is planning to drop support for the existing upload API?
 ----------------------------------------------------------------------------

-At this time PyPI does not have any specific plans to drop support for the
-existing upload API.
+At this time PyPI does not have any specific plans to drop support for the existing upload API.

-Unlike with :pep:`691` there are wide benefits to doing so, so it is likely
-that we will want to drop support for it at some point in the future, but
-until this API is implemented, and receiving broad use it would be premature
-to make any plans for actually dropping support for it.
+Unlike with :pep:`691`, there are significant benefits to doing so, so it is likely that support for
+the legacy upload API will be (responsibly) deprecated and removed at some point in the future.
+Such future deprecation planning is explicitly out of scope for *this* PEP.


 Is this Resumable Upload protocol based on anything?
 ----------------------------------------------------

 Yes!

-It's actually the protocol specified in an
-`Active Internet-Draft `_,
-where the authors took what they learned implementing `tus `_
-to provide the idea of resumable uploads in a wholly generic, standards based
-way.
-
-The only deviation we've made from that spec is that we don't use the
-``104 Upload Resumption Supported`` informational response in the first
-``POST`` request. This decision was made for a few reasons:
-
-- The ``104 Upload Resumption Supported`` is the only part of that draft
-  which does not rely entirely on things that are already supported in the
-  existing standards, since it was adding a new informational status.
-- Many clients and web frameworks don't support ``1xx`` informational
-  responses in a very good way, if at all, adding it would complicate
-  implementation for very little benefit.
-- The purpose of the ``104 Upload Resumption Supported`` support is to allow
-  clients to determine that an arbitrary endpoint that they're interacting
-  with supports resumable uploads. Since this PEP is mandating support for
-  that in servers, clients can just assume that the server they are
+It's actually based on the protocol specified in an `active internet draft `_, where the
+authors took what they learned implementing `tus `_ to provide the idea of
+resumable uploads in a wholly generic, standards-based way.
+
+.. _ietf-draft: https://www.ietf.org/archive/id/draft-ietf-httpbis-resumable-upload-05.html
+
+This PEP deviates from that spec in several ways, as described in the body of the proposal. These
+decisions were made for a few reasons:
+
+- The ``104 Upload Resumption Supported`` is the only part of that draft which does not rely
+  entirely on things that are already supported in the existing standards, since it was adding a new
+  informational status. 
+
+- Many clients and web frameworks don't support ``1xx`` informational responses in a very good way,
+  if at all; adding it would complicate implementation for very little benefit.
+
+- The purpose of the ``104 Upload Resumption Supported`` support is to allow clients to determine
+  that an arbitrary endpoint that they're interacting with supports resumable uploads. Since this
+  PEP is mandating support for that in servers, clients can just assume that the server they are
   interacting with supports it, which makes using it unneeded.

-- In theory, if the support for ``1xx`` responses got resolved and the draft
-  gets accepted with it in, we can add that in at a later date without
-  changing the overall flow of the API.
-
-There is a risk that the above draft doesn't get accepted, but even if it
-does not, that doesn't actually affect us. It would just mean that our
-support for resumable uploads is an application specific protocol, but is
-still wholly standards compliant.

+- In theory, if support for ``1xx`` responses gets resolved and the draft is accepted with it
+  included, we can add it at a later date without changing the overall flow of the API.
+
+
+Can I use the upload 2.0 API to reserve a project name?
+-------------------------------------------------------
+
+Yes! If you're not ready to upload files to make a release, you can still reserve a project
+name (assuming, of course, that the name isn't already taken).
+
+To do this, :ref:`create a new session `, then :ref:`publish the session
+` without uploading any files. While the ``version`` key is required in the JSON
+body of the create session request, you can simply use the placeholder version number ``"0.0.0"``.
+
+The user that created the session will become the owner of the new project.


 Open Questions
 ==============

-
 Multipart Uploads vs tus
 ------------------------

-This PEP currently bases the actual uploading of files on an internet draft
-from tus.io that supports resumable file uploads.
+This PEP currently bases the actual uploading of files on an `internet draft `_
+(originally designed by `tus.io `__) that supports resumable file uploads.

 That protocol requires a few things:

-- That the client selects a secure ``Upload-Token`` that they use to identify
-  uploading a single file.
-- That if clients don't upload the entire file in one shot, that they have
-  to submit the chunks serially, and in the correct order, with all but the
-  final chunk having a ``Upload-Incomplete: 1`` header.
-- Resumption of an upload is essentially just querying the server to see how
-  much data they've gotten, then sending the remaining bytes (either as a single
-  request, or in chunks).
-- The upload implicitly is completed when the server successfully gets all of
-  the data from the client.
-
-This has one big benefit, that if a client doesn't care about resuming their
-download, the work to support, from a client side, resumable uploads is able
-to be completely ignored. They can just ``POST`` the file to the URL, and if
-it doesn't succeed, they can just ``POST`` the whole file again.
-
-The other benefit is that even if you do want to support resumption, you can
-still just ``POST`` the file, and unless you *need* to resume the download,
-that's all you have to do.
-
-Another, possibly theoretical, benefit is that for hashing the uploaded files,
-the serial chunks requirement means that the server can maintain hashing state
-between requests, update it for each request, then write that file back to
-storage.
-Unfortunately this isn't actually possible to do with Python's hashlib,
-though there are some libraries like `Rehash `_
-that implement it, but they don't support every hash that hashlib does
-(specifically not blake2 or sha3 at the time of writing).
-
-We might also need to reconstitute the download for processing anyways to do
-things like extract metadata, etc from it, which would make it a moot point.
-
-The downside is that there is no ability to parallelize the upload of a single
-file because each chunk has to be submitted serially.
-
-AWS S3 has a similar API (and most blob stores have copied it either wholesale
-or something like it) which they call multipart uploading.
+- That if clients don't upload the entire file in one shot, they have to submit the chunks
+  serially, and in the correct order, with all but the final chunk having an ``Upload-Complete: ?0``
+  header.
+
+- Resumption of an upload is essentially just querying the server to see how much data it has
+  already received, then sending the remaining bytes (either as a single request, or in chunks).
+
+- The upload is implicitly completed when the server successfully gets all of the data from the
+  client.
+
+This has the benefit that if a client doesn't care about resuming their upload, it can essentially
+ignore the protocol. Clients can just ``POST`` the file to the file upload URL, and if it doesn't
+succeed, they can just ``POST`` the whole file again.
+
+The other benefit is that even if clients do want to support resumption, unless they *need* to
+resume the upload, they can still just ``POST`` the file.
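+
+In concrete terms, the two styles of client might look like the following non-normative sketch.
+The URL is a placeholder, and the ``Upload-Offset``/``Upload-Complete`` headers and the use of
+``HEAD``/``PATCH`` are assumptions carried over from the draft, not necessarily this PEP's final
+wording:
+
+.. code-block:: python
+
+   import urllib.request
+
+   # Placeholder URL; in this PEP the file upload URL comes from the session.
+   FILE_URL = "https://upload.example.invalid/2.0/session/file"
+
+   def upload_simple(data: bytes) -> None:
+       # A client that ignores resumption: POST the whole file, and on
+       # failure just POST the whole file again.
+       request = urllib.request.Request(
+           FILE_URL,
+           data=data,
+           method="POST",
+           headers={"Content-Type": "application/octet-stream"},
+       )
+       urllib.request.urlopen(request)
+
+   def resume_upload(data: bytes) -> None:
+       # A client that resumes: ask the server how many bytes it already
+       # has, then send only the remainder as the final chunk.
+       head = urllib.request.Request(FILE_URL, method="HEAD")
+       with urllib.request.urlopen(head) as response:
+           offset = int(response.headers["Upload-Offset"])
+       request = urllib.request.Request(
+           FILE_URL,
+           data=data[offset:],
+           method="PATCH",
+           headers={
+               "Content-Type": "application/octet-stream",
+               "Upload-Offset": str(offset),
+               "Upload-Complete": "?1",  # this is the final chunk
+           },
+       )
+       urllib.request.urlopen(request)
+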
+Another, possibly theoretical benefit is that for hashing the uploaded files, the serial chunks
+requirement means that the server can maintain hashing state between requests, update it for each
+request, then write that file back to storage. Unfortunately this isn't actually possible to do with
+Python's `hashlib `__ standard library module.
+There are some third party libraries, such as `Rehash
+`__, that do implement the necessary APIs, but they don't
+support every hash that ``hashlib`` does (e.g. ``blake2`` or ``sha3`` at the time of writing).
+
+We might also need to reconstitute the uploaded file anyway to do things like extract metadata from
+it, which would make this a moot point.
+
+The downside is that there is no ability to parallelize the upload of a single file because each
+chunk has to be submitted serially.
+
+AWS S3 has a similar API, which it calls multipart uploading, and most blob stores have copied it
+either wholesale or in something very close to it.

 The basic flow for a multipart upload is:

-1. Initiate a Multipart Upload to get an Upload ID.
-2. Break your file up into chunks, and upload each one of them individually.
-3. Once all chunks have been uploaded, finalize the upload.
-
-   This is the step where any errors would occur.
-
-It does not directly support resuming an upload, but it allows clients to
-control the "blast radius" of failure by adjusting the size of each part
-they upload, and if any of the parts fail, they only have to resend those
-specific parts.
-
-This has a big benefit in that it allows parallelization in uploading files,
-allowing clients to maximize their bandwidth using multiple threads to send
-the data.
-
-We wouldn't need an explicit step (1), because our session would implicitly
-initiate a multipart upload for each file.
-
-It does have its own downsides:
-
-- Clients have to do more work on every request to have something resembling
-  resumable uploads. They would *have* to break the file up into multiple parts
-  rather than just making a single POST request, and only needing to deal
-  with the complexity if something fails.
-
-- Clients that don't care about resumption at all still have to deal with
-  the third explicit step, though they could just upload the file all as a
-  single part.
-
-  - S3 works around this by having another API for one shot uploads, but
-    I'd rather not have two different APIs for uploading the same file.
-
-- Verifying hashes gets somewhat more complicated. AWS implements hashing
-  multipart uploads by hashing each part, then the overall hash is just a
-  hash of those hashes, not of the content itself. We need to know the
-  actual hash of the file itself for PyPI, so we would have to reconstitute
-  the file and read its content and hash it once it's been fully uploaded,
-  though we could still use the hash of hashes trick for checksumming the
-  upload itself.
-
-  - See above about whether this is actually a downside in practice, or
-    if it's just in theory.
-
-I lean towards the tus style resumable uploads as I think they're simpler
-to use and to implement, and the main downside is that we possibly leave
-some multi-threaded performance on the table, which I think that I'm
-personally fine with?
-
-I guess one additional benefit of the S3 style multi part uploads is that
-you don't have to try and do any sort of protection against parallel uploads,
-since they're just supported. That alone might erase most of the server side
-implementation simplification.
+#. Initiate a multipart upload to get an upload ID.
+#. Break your file up into chunks, and upload each one of them individually.
+#. Once all chunks have been uploaded, finalize the upload. This is the step where any errors would
+   occur.
+
+Such multipart uploads do not directly support resuming an upload, but they allow clients to control
+the "blast radius" of failure by adjusting the size of each part they upload; if any of the parts
+fail, they only have to resend those specific parts. The big benefit is the parallelism this allows
+when uploading a single file: clients can maximize their bandwidth by using multiple threads to send
+the file data.
+
+We wouldn't need an explicit step (1), because our session would implicitly initiate a multipart
+upload for each file.
+
+There are downsides to this though:
+
+- Clients have to do more work on every request to have something resembling resumable uploads. They
+  would *have* to break the file up into multiple parts rather than just making a single ``POST``
+  request and dealing with the added complexity only when something fails.
+
+- Clients that don't care about resumption at all still have to deal with the third explicit step,
+  though they could just upload the file all as a single part. (S3 works around this by having
+  another API for one shot uploads, but the PEP authors place a high value on having a single API
+  for uploading any individual file.)
+
+- Verifying hashes gets somewhat more complicated. AWS implements hashing multipart uploads by
+  hashing each part, then the overall hash is just a hash of those hashes, not of the content
+  itself. Since PyPI needs to know the actual hash of the file itself anyway, we would have to
+  reconstitute the file, read its content, and hash it once it's been fully uploaded, though it
+  could still use the hash of hashes trick for checksumming the upload itself, as the sketch below
+  illustrates.
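+
+To make that concrete, the following non-normative sketch contrasts the S3 style "hash of hashes"
+with the true content hash that an index needs (the part size and the choice of SHA-256 here are
+arbitrary):
+
+.. code-block:: python
+
+   import hashlib
+
+   PART_SIZE = 8 * 1024 * 1024  # arbitrary example part size
+
+   def upload_hashes(data: bytes) -> tuple[str, str]:
+       parts = [data[i:i + PART_SIZE] for i in range(0, len(data), PART_SIZE)]
+       # S3-style checksum: hash each part, then hash the concatenated
+       # per-part digests; this never hashes the file contents end-to-end.
+       hash_of_hashes = hashlib.sha256(
+           b"".join(hashlib.sha256(part).digest() for part in parts)
+       ).hexdigest()
+       # The hash the index actually needs: a digest of the full file
+       # content, which requires reconstituting the file after upload.
+       content_hash = hashlib.sha256(data).hexdigest()
+       return hash_of_hashes, content_hash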
+
+The PEP authors lean towards ``tus`` style resumable uploads, due to them being simpler to use,
+easier to implement, and more consistent, with the main downside being that multi-threaded
+performance is theoretically left on the table.
+
+One other possible benefit of the S3 style multipart uploads is that you don't have to implement any
+sort of protection against parallel uploads, since they're supported natively. That alone might
+erase most of the server side implementation simplification.
+
+.. rubric:: Footnotes
+
+.. [#fn-action] Obsolete ``:action`` values ``submit``, ``submit_pkg_info``, and ``doc_upload`` are
+                no longer supported.
+
+.. [#fn-metadata] This would be fine if used as a pre-check, but the parallel metadata should be
+                  validated against the actual ``METADATA`` or similar files within the
+                  distribution.
+
+.. [#fn-hash] Specifically any hash algorithm name that `can be passed to
+              `_ ``hashlib.new()`` and
+              which does not require additional parameters.
+
+.. [#fn-immutable] Published files may still be yanked (i.e. :pep:`592`) or `deleted
+                   `__ as normal.
+
+.. [#fn-location] Or the URL given in the ``Location`` header in the response to the file upload
+                  initiation request, i.e. the metadata upload request; both of these links **MUST**
+                  be the same.
+

 Copyright
 =========
diff --git a/peps/pep-0768.rst b/peps/pep-0768.rst
index 7dbab864a71..113db96096d 100644
--- a/peps/pep-0768.rst
+++ b/peps/pep-0768.rst
@@ -147,7 +147,7 @@ provides Python code to be executed when the interpreter reaches a safe point.

 The value for ``MAX_SCRIPT_SIZE`` will be a trade-off between binary size and
 how big debugging scripts can be. As most of the logic should be in libraries
-and arbitrary code can be executed with very short ammount of Python we are
+and arbitrary code can be executed with a small amount of Python, we are
 proposing to start with 4kb initially. This value can be extended in the future
 if we ever need to.