Complete this phase of the rewrite. One more pass to go!

warsaw · warsaw · commit 70b1bf79da92 · 2024-12-04T13:49:08.000-08:00
diff --git a/peps/pep-0694.rst b/peps/pep-0694.rst
@@ -415,6 +415,8 @@ to.  The :ref:`status <session-status>` of the session will also include the fil
 ``files`` mapping, with the above ``Location`` URL included in under the ``link`` sub-key.
 
 
+.. _upload-contents:
+
 Upload File Contents
 ++++++++++++++++++++
 
@@ -772,57 +774,46 @@ The ``message`` and ``source`` strings do not have any specific meaning, and are
 interpretation to aid in diagnosing underlying issue.
 
 
-**XXX REWRITTEN TO HERE**
-
 Content Types
 -------------
 
 Like :pep:`691`, this PEP proposes that all requests and responses from this upload API will have a
 standard content type that describes what the content is, what version of the API it represents, and
 what serialization format has been used.
 
-The structure of this content type will be:
-
-.. code-block:: text
-
-    application/vnd.pypi.upload.$version+format
+This standard request content type applies to all requests *except* for :ref:`file upload requests
+<upload-contents>`, which contain only binary data and therefore use ``application/octet-stream``.
 
-Since only major versions should be disruptive to systems attempting to
-understand one of these API content bodies, only the major version will be
-included in the content type, and will be prefixed with a ``v`` to clarify
-that it is a version number.
+The structure of the ``Content-Type`` header for all other requests is:
 
-Unlike :pep:`691`, this PEP does not change the existing ``1.0`` API in any
-way, so servers will be required to host the new API described in this PEP at
-a different endpoint than the existing upload API.
-
-Thus for the new 2.0 API, the content type would be:
+.. code-block:: text
 
-- **JSON:** ``application/vnd.pypi.upload.v2+json``
+    application/vnd.pypi.upload.$version+$format
 
-In addition to the above, a special "meta" version is supported named ``latest``,
-whose purpose is to allow clients to request the absolute latest version, without
-having to know ahead of time what that version is. It is recommended however,
-that clients be explicit about what versions they support.
+Since minor API version differences should never be disruptive, only the major version is included
+in the content type; the version number is prefixed with a ``v``.
 
-These content types **DO NOT** apply to the file uploads themselves, only to the
-other API requests/responses in the upload API. The files themselves should use
-the ``application/octet-stream`` content type.
+Unlike :pep:`691`, this PEP does not change the existing *legacy* ``1.0`` upload API in any way, so
+servers are required to host the new API described in this PEP at a different endpoint than the
+existing upload API.
 
+Since JSON is the only request format defined in this PEP, all non-file-upload requests
+defined in this PEP **MUST** include a ``Content-Type`` header value of:
 
-Version + Format Selection
---------------------------
+- ``application/vnd.pypi.upload.v2+json``.
 
-Again, similar to :pep:`691`, this PEP standardizes on using server-driven
-content negotiation to allow clients to request different versions or
-serialization formats, which includes the ``format`` URL parameter.
+As with :pep:`691`, a special "meta" version is supported named ``latest``, the purpose of which is
+to allow clients to request the latest version implemented by the server, without having to know
+ahead of time what that version is.  It is recommended, however, that clients be explicit about what
+versions they support.
 
-Since this PEP expects the existing legacy ``1.0`` upload API to exist at a
-different endpoint, and it currently only provides for JSON serialization, this
-mechanism is not particularly useful, and clients only have a single version and
-serialization they can request. However clients **SHOULD** be setup to handle
-content negotiation gracefully in the case that additional formats or versions
-are added in the future.
+Similar to :pep:`691`, this PEP also standardizes on using server-driven content negotiation to
+allow clients to request different versions or serialization formats, which includes the ``format``
+part of the content type.  However, since this PEP expects the existing legacy ``1.0`` upload API to
+exist at a different endpoint, and this PEP currently only provides for JSON serialization, this
+mechanism is not particularly useful.  Clients only have a single version and serialization they can
+request.  However, clients **SHOULD** be prepared to handle content negotiation gracefully in the case
+that additional formats or versions are added in the future.
 
 
 FAQ
@@ -831,49 +822,47 @@ FAQ
 Does this mean PyPI is planning to drop support for the existing upload API?
 ----------------------------------------------------------------------------
 
-At this time PyPI does not have any specific plans to drop support for the
-existing upload API.
+At this time PyPI does not have any specific plans to drop support for the existing upload API.
 
-Unlike with :pep:`691` there are wide benefits to doing so, so it is likely
-that we will want to drop support for it at some point in the future, but
-until this API is implemented, and receiving broad use it would be premature
-to make any plans for actually dropping support for it.
+Unlike with :pep:`691`, there are significant benefits to doing so, so it is likely that support for
+the legacy upload API will be (responsibly) deprecated and removed at some point in the future.  Such
+future deprecation planning is explicitly out of scope for *this* PEP.
 
 
 Is this Resumable Upload protocol based on anything?
 ----------------------------------------------------
 
 Yes!
 
-It's actually the protocol specified in an
-`Active Internet-Draft <https://datatracker.ietf.org/doc/draft-tus-httpbis-resumable-uploads-protocol/>`_,
-where the authors took what they learned implementing `tus <https://tus.io/>`_
-to provide the idea of resumable uploads in a wholly generic, standards based
-way.
-
-The only deviation we've made from that spec is that we don't use the
-``104 Upload Resumption Supported`` informational response in the first
-``POST`` request. This decision was made for a few reasons:
-
-- The ``104 Upload Resumption Supported`` is the only part of that draft
-  which does not rely entirely on things that are already supported in the
-  existing standards, since it was adding a new informational status.
-- Many clients and web frameworks don't support ``1xx`` informational
-  responses in a very good way, if at all, adding it would complicate
-  implementation for very little benefit.
-- The purpose of the ``104 Upload Resumption Supported`` support is to allow
-  clients to determine that an arbitrary endpoint that they're interacting
-  with supports resumable uploads. Since this PEP is mandating support for
-  that in servers, clients can just assume that the server they are
+It's actually the protocol specified in an `Active Internet-Draft <ietf-draft_>`_, where the authors
+took what they learned implementing `tus <https://tus.io/>`_ to provide the idea of resumable
+uploads in a wholly generic, standards-based way.
+
+.. _ietf-draft: https://datatracker.ietf.org/doc/draft-ietf-httpbis-resumable-upload/
+
+The only deviation we've made from that spec is that we don't use the ``104 Upload Resumption
+Supported`` informational response in the first ``POST`` request. This decision was made for a few
+reasons:
+
+- The ``104 Upload Resumption Supported`` response is the only part of that draft which does not rely
+  entirely on things that are already supported in the existing standards, since it adds a new
+  informational status.
+
+- Many clients and web frameworks support ``1xx`` informational responses poorly, if at all, so
+  adding it would complicate implementation for very little benefit.
+
+- The purpose of the ``104 Upload Resumption Supported`` support is to allow clients to determine
+  that an arbitrary endpoint that they're interacting with supports resumable uploads. Since this
+  PEP is mandating support for that in servers, clients can just assume that the server they are
   interacting with supports it, which makes using it unneeded.
-- In theory, if the support for ``1xx`` responses got resolved and the draft
-  gets accepted with it in, we can add that in at a later date without
-  changing the overall flow of the API.
 
-There is a risk that the above draft doesn't get accepted, but even if it
-does not, that doesn't actually affect us. It would just mean that our
-support for resumable uploads is an application specific protocol, but is
-still wholly standards compliant.
+- In theory, if support for ``1xx`` responses gets resolved and the draft is accepted with it
+  in, we can add that in at a later date without changing the overall flow of the API.
+
+There is a risk that the above draft doesn't get accepted, but even if it does not, that doesn't
+actually affect us. It would just mean that our support for resumable uploads is an application
+specific protocol, but is still wholly standards compliant.
+
 
 Can I use the upload 2.0 API to reserve a project name?
 -------------------------------------------------------
@@ -891,105 +880,91 @@ The user that created the session will become the owner of the new project.
 Open Questions
 ==============
 
-
 Multipart Uploads vs tus
 ------------------------
 
-This PEP currently bases the actual uploading of files on an internet draft
-from ``tus.io`` that supports resumable file uploads.
+This PEP currently bases the actual uploading of files on an internet draft from ``tus.io`` that
+supports resumable file uploads.
 
 That protocol requires a few things:
 
-- That the client selects a secure ``Upload-Token`` that they use to identify
-  uploading a single file.
-- That if clients don't upload the entire file in one shot, that they have
-  to submit the chunks serially, and in the correct order, with all but the
-  final chunk having a ``Upload-Incomplete: 1`` header.
-- Resumption of an upload is essentially just querying the server to see how
-  much data they've gotten, then sending the remaining bytes (either as a single
-  request, or in chunks).
-- The upload implicitly is completed when the server successfully gets all of
-  the data from the client.
-
-This has one big benefit, that if a client doesn't care about resuming their
-download, the work to support, from a client side, resumable uploads is able
-to be completely ignored. They can just ``POST`` the file to the URL, and if
-it doesn't succeed, they can just ``POST`` the whole file again.
-
-The other benefit is that even if you do want to support resumption, you can
-still just ``POST`` the file, and unless you *need* to resume the download,
-that's all you have to do.
-
-Another, possibly theoretical benefit is that for hashing the uploaded files,
-the serial chunks requirement means that the server can maintain hashing state
-between requests, update it for each request, then write that file back to
-storage. Unfortunately this isn't actually possible to do with Python's hashlib,
-though there are some libraries like `Rehash <https://github.com/kislyuk/rehash>`_
-that implement it, but they don't support every hash that hashlib does
-(specifically not blake2 or sha3 at the time of writing).
-
-We might also need to reconstitute the download for processing anyways to do
-things like extract metadata, etc from it, which would make it a moot point.
-
-The downside is that there is no ability to parallelize the upload of a single
-file because each chunk has to be submitted serially.
-
-AWS S3 has a similar API (and most blob stores have copied it either wholesale
-or something like it) which they call multipart uploading.
+- That the client selects a secure ``Upload-Token`` that they use to identify uploading a single
+  file.
+
+- That if clients don't upload the entire file in one shot, they have to submit the chunks
+  serially, and in the correct order, with all but the final chunk having an ``Upload-Incomplete: 1``
+  header.
+
+- Resumption of an upload is essentially just querying the server to see how much data they've
+  gotten, then sending the remaining bytes (either as a single request, or in chunks).
+
+- The upload implicitly is completed when the server successfully gets all of the data from the
+  client.
+
+This has the benefit that if a client doesn't care about resuming their upload, it can essentially
+ignore the protocol.  Clients can just ``POST`` the file to the file upload URL, and if it doesn't
+succeed, they can just ``POST`` the whole file again.
+
+The other benefit is that even if clients do want to support resumption, unless they *need* to
+resume the upload, they can still just ``POST`` the file.
+
+Another, possibly theoretical benefit is that for hashing the uploaded files, the serial chunks
+requirement means that the server can maintain hashing state between requests, update it for each
+request, then write that file back to storage. Unfortunately this isn't actually possible to do with
+Python's `hashlib <https://docs.python.org/3/library/hashlib.html>`__ standard library module.
+There are some third-party libraries, such as `Rehash
+<https://rehash.readthedocs.io/en/latest/>`__, that do implement the necessary APIs, but they don't
+support every hash that ``hashlib`` does (e.g. ``blake2`` or ``sha3`` at the time of writing).
+
+We might also need to reconstitute the upload for processing anyway to do things like extract
+metadata from it, which would make it a moot point.
+
+The downside is that there is no ability to parallelize the upload of a single file because each
+chunk has to be submitted serially.
+
+AWS S3 has a similar API, which they call multipart uploading, and most blob stores have either
+copied it wholesale or offer something like it.
 
 The basic flow for a multipart upload is:
 
-1. Initiate a Multipart Upload to get an Upload ID.
-2. Break your file up into chunks, and upload each one of them individually.
-3. Once all chunks have been uploaded, finalize the upload.
-   - This is the step where any errors would occur.
-
-It does not directly support resuming an upload, but it allows clients to
-control the "blast radius" of failure by adjusting the size of each part
-they upload, and if any of the parts fail, they only have to resend those
-specific parts.
-
-This has a big benefit in that it allows parallelization in uploading files,
-allowing clients to maximize their bandwidth using multiple threads to send
-the data.
-
-We wouldn't need an explicit step (1), because our session would implicitly
-initiate a multipart upload for each file.
-
-It does have its own downsides:
-
-- Clients have to do more work on every request to have something resembling
-  resumable uploads. They would *have* to break the file up into multiple parts
-  rather than just making a single POST request, and only needing to deal
-  with the complexity if something fails.
-
-- Clients that don't care about resumption at all still have to deal with
-  the third explicit step, though they could just upload the file all as a
-  single part.
-
-  - S3 works around this by having another API for one shot uploads, but
-    I'd rather not have two different APIs for uploading the same file.
-
-- Verifying hashes gets somewhat more complicated. AWS implements hashing
-  multipart uploads by hashing each part, then the overall hash is just a
-  hash of those hashes, not of the content itself. We need to know the
-  actual hash of the file itself for PyPI, so we would have to reconstitute
-  the file and read its content and hash it once it's been fully uploaded,
-  though we could still use the hash of hashes trick for checksumming the
-  upload itself.
-
-  - See above about whether this is actually a downside in practice, or
-    if it's just in theory.
-
-I lean towards the ``tus`` style resumable uploads as I think they're simpler
-to use and to implement, and the main downside is that we possibly leave
-some multi-threaded performance on the table, which I think that I'm
-personally fine with?
-
-I guess one additional benefit of the S3 style multi part uploads is that
-you don't have to try and do any sort of protection against parallel uploads,
-since they're just supported. That alone might erase most of the server side
-implementation simplification.
+#. Initiate a multipart upload to get an upload ID.
+#. Break your file up into chunks, and upload each one of them individually.
+#. Once all chunks have been uploaded, finalize the upload. This is the step where any errors would
+   occur.
+
+Such multipart uploads do not directly support resuming an upload, but they allow clients to control
+the "blast radius" of failure by adjusting the size of each part they upload, and if any of the
+parts fail, they only have to resend those specific parts.  A further benefit is that they allow more
+parallelism when uploading a single file, letting clients maximize their bandwidth by using
+multiple threads to send the file data.
+
+We wouldn't need an explicit step (1), because our session would implicitly initiate a multipart
+upload for each file.
+
+There are downsides to this though:
+
+- Clients have to do more work on every request to have something resembling resumable uploads. They
+  would *have* to break the file up into multiple parts rather than just making a single POST
+  request, and only needing to deal with the complexity if something fails.
+
+- Clients that don't care about resumption at all still have to deal with the third explicit step,
+  though they could just upload the file all as a single part. (S3 works around this by having
+  another API for one shot uploads, but the PEP authors place a high value on having a single API
+  for uploading any individual file.)
+
+- Verifying hashes gets somewhat more complicated. AWS implements hashing multipart uploads by
+  hashing each part, then the overall hash is just a hash of those hashes, not of the content
+  itself.  Since PyPI needs to know the actual hash of the file itself anyway, we would have to
+  reconstitute the file, read its content, and hash it once it's been fully uploaded, though we
+  could still use the hash of hashes trick for checksumming the upload itself.
+
+The PEP authors lean towards ``tus`` style resumable uploads, due to them being simpler to use,
+easier to implement, and more consistent, with the main downside being that multi-threaded
+performance is theoretically left on the table.
+
+One other possible benefit of the S3 style multipart uploads is that you don't have to try and do
+any sort of protection against parallel uploads, since they're just supported. That alone might
+erase most of the server side implementation simplification.
 
 .. rubric:: Footnotes