Commit 70b1bf7

Complete this phase of the rewrite. One more pass to go!
1 parent b3de19a commit 70b1bf7

1 file changed

+133
-158
lines changed

peps/pep-0694.rst

Lines changed: 133 additions & 158 deletions
@@ -415,6 +415,8 @@ to. The :ref:`status <session-status>` of the session will also include the fil
415415
``files`` mapping, with the above ``Location`` URL included under the ``link`` sub-key.
416416

417417

418+
.. _upload-contents:
419+
418420
Upload File Contents
419421
++++++++++++++++++++
420422

@@ -772,57 +774,46 @@ The ``message`` and ``source`` strings do not have any specific meaning, and are
772774
interpretation to aid in diagnosing the underlying issue.
773775

774776

775-
**XXX REWRITTEN TO HERE**
776-
777777
Content Types
778778
-------------
779779

780780
Like :pep:`691`, this PEP proposes that all requests and responses from this upload API will have a
781781
standard content type that describes what the content is, what version of the API it represents, and
782782
what serialization format has been used.
783783

784-
The structure of this content type will be:
785-
786-
.. code-block:: text
787-
788-
application/vnd.pypi.upload.$version+format
784+
This standard request content type applies to all requests *except* for :ref:`file upload requests
785+
<upload-contents>`, which, since they contain only binary data, use ``application/octet-stream``.
789786

790-
Since only major versions should be disruptive to systems attempting to
791-
understand one of these API content bodies, only the major version will be
792-
included in the content type, and will be prefixed with a ``v`` to clarify
793-
that it is a version number.
787+
The structure of the ``Content-Type`` header for all other requests is:
794788

795-
Unlike :pep:`691`, this PEP does not change the existing ``1.0`` API in any
796-
way, so servers will be required to host the new API described in this PEP at
797-
a different endpoint than the existing upload API.
798-
799-
Thus for the new 2.0 API, the content type would be:
789+
.. code-block:: text
800790
801-
- **JSON:** ``application/vnd.pypi.upload.v2+json``
791+
application/vnd.pypi.upload.$version+$format
802792
803-
In addition to the above, a special "meta" version is supported named ``latest``,
804-
whose purpose is to allow clients to request the absolute latest version, without
805-
having to know ahead of time what that version is. It is recommended however,
806-
that clients be explicit about what versions they support.
793+
Since minor API version differences should never be disruptive, only the major version is included
794+
in the content type; the version number is prefixed with a ``v``.
807795

808-
These content types **DO NOT** apply to the file uploads themselves, only to the
809-
other API requests/responses in the upload API. The files themselves should use
810-
the ``application/octet-stream`` content type.
796+
Unlike :pep:`691`, this PEP does not change the existing *legacy* ``1.0`` upload API in any way, so
797+
servers are required to host the new API described in this PEP at a different endpoint than the
798+
existing upload API.
811799

800+
Since JSON is the only request format defined in this PEP, all non-file-upload requests
801+
defined in this PEP **MUST** include a ``Content-Type`` header value of:
812802

813-
Version + Format Selection
814-
--------------------------
803+
- ``application/vnd.pypi.upload.v2+json``.
815804

816-
Again, similar to :pep:`691`, this PEP standardizes on using server-driven
817-
content negotiation to allow clients to request different versions or
818-
serialization formats, which includes the ``format`` URL parameter.
805+
As with :pep:`691`, a special "meta" version is supported named ``latest``, the purpose of which is
806+
to allow clients to request the latest version implemented by the server, without having to know
807+
ahead of time what that version is. It is recommended, however, that clients be explicit about what
808+
versions they support.
819809

820-
Since this PEP expects the existing legacy ``1.0`` upload API to exist at a
821-
different endpoint, and it currently only provides for JSON serialization, this
822-
mechanism is not particularly useful, and clients only have a single version and
823-
serialization they can request. However clients **SHOULD** be setup to handle
824-
content negotiation gracefully in the case that additional formats or versions
825-
are added in the future.
810+
Similar to :pep:`691`, this PEP also standardizes on using server-driven content negotiation to
811+
allow clients to request different versions or serialization formats, which includes the ``format``
812+
part of the content type. However, since this PEP expects the existing legacy ``1.0`` upload API to
813+
exist at a different endpoint, and this PEP currently only provides for JSON serialization, this
814+
mechanism is not particularly useful. Clients only have a single version and serialization they can
815+
request. However, clients **SHOULD** be prepared to handle content negotiation gracefully in the case
816+
that additional formats or versions are added in the future.
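
As a minimal sketch of the content type structure described above, the following Python builds and
parses ``application/vnd.pypi.upload.$version+$format`` values. The function names, the regular
expression, and the ``SUPPORTED`` set are illustrative assumptions and are not defined by this PEP.

```python
import re
from typing import Optional, Tuple

# Assumed pattern for: application/vnd.pypi.upload.$version+$format
# where $version is "v" plus a major version number, or the meta version "latest".
_CONTENT_TYPE_RE = re.compile(
    r"^application/vnd\.pypi\.upload\.(?P<version>v[0-9]+|latest)\+(?P<format>[a-z]+)$"
)

# Assumption: the only version/format pair currently defined is v2 + JSON.
SUPPORTED = {("v2", "json")}


def build_content_type(version: str = "v2", fmt: str = "json") -> str:
    """Build a Content-Type value for a non-file-upload request."""
    return f"application/vnd.pypi.upload.{version}+{fmt}"


def parse_content_type(value: str) -> Optional[Tuple[str, str]]:
    """Return (version, format), or None if the value does not match the structure."""
    match = _CONTENT_TYPE_RE.match(value.strip().lower())
    if match is None:
        return None
    return match["version"], match["format"]


def is_supported(value: str) -> bool:
    """Check a content type against the hypothetical SUPPORTED set."""
    parsed = parse_content_type(value)
    if parsed is None:
        return False
    version, fmt = parsed
    if version == "latest":
        # "latest" resolves to whatever version is implemented for that format.
        return any(supported_fmt == fmt for _, supported_fmt in SUPPORTED)
    return parsed in SUPPORTED
```

A client prepared for future versions or formats could use a check like ``is_supported()`` on a
response's ``Content-Type`` before attempting to decode the body.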
826817

827818

828819
FAQ
@@ -831,49 +822,47 @@ FAQ
831822
Does this mean PyPI is planning to drop support for the existing upload API?
832823
----------------------------------------------------------------------------
833824

834-
At this time PyPI does not have any specific plans to drop support for the
835-
existing upload API.
825+
At this time PyPI does not have any specific plans to drop support for the existing upload API.
836826

837-
Unlike with :pep:`691` there are wide benefits to doing so, so it is likely
838-
that we will want to drop support for it at some point in the future, but
839-
until this API is implemented, and receiving broad use it would be premature
840-
to make any plans for actually dropping support for it.
827+
Unlike with :pep:`691` there are significant benefits to doing so, so it is likely that support for
828+
the legacy upload API will be (responsibly) deprecated and removed at some point in the future. Such
829+
future deprecation planning is explicitly out of scope for *this* PEP.
841830

842831

843832
Is this Resumable Upload protocol based on anything?
844833
----------------------------------------------------
845834

846835
Yes!
847836

848-
It's actually the protocol specified in an
849-
`Active Internet-Draft <https://datatracker.ietf.org/doc/draft-tus-httpbis-resumable-uploads-protocol/>`_,
850-
where the authors took what they learned implementing `tus <https://tus.io/>`_
851-
to provide the idea of resumable uploads in a wholly generic, standards based
852-
way.
853-
854-
The only deviation we've made from that spec is that we don't use the
855-
``104 Upload Resumption Supported`` informational response in the first
856-
``POST`` request. This decision was made for a few reasons:
857-
858-
- The ``104 Upload Resumption Supported`` is the only part of that draft
859-
which does not rely entirely on things that are already supported in the
860-
existing standards, since it was adding a new informational status.
861-
- Many clients and web frameworks don't support ``1xx`` informational
862-
responses in a very good way, if at all, adding it would complicate
863-
implementation for very little benefit.
864-
- The purpose of the ``104 Upload Resumption Supported`` support is to allow
865-
clients to determine that an arbitrary endpoint that they're interacting
866-
with supports resumable uploads. Since this PEP is mandating support for
867-
that in servers, clients can just assume that the server they are
837+
It's actually the protocol specified in an `Active Internet-Draft <ietf-draft>`_, where the authors
838+
took what they learned implementing `tus <https://tus.io/>`_ to provide the idea of resumable
839+
uploads in a wholly generic, standards based way.
840+
841+
.. _ietf-draft: https://datatracker.ietf.org/doc/draft-ietf-httpbis-resumable-upload/
842+
843+
The only deviation we've made from that spec is that we don't use the ``104 Upload Resumption
844+
Supported`` informational response in the first ``POST`` request. This decision was made for a few
845+
reasons:
846+
847+
- The ``104 Upload Resumption Supported`` is the only part of that draft which does not rely
848+
entirely on things that are already supported in the existing standards, since it was adding a new
849+
informational status.
850+
851+
- Many clients and web frameworks don't support ``1xx`` informational responses in a very good way,
852+
if at all, so adding it would complicate implementation for very little benefit.
853+
854+
- The purpose of the ``104 Upload Resumption Supported`` support is to allow clients to determine
855+
that an arbitrary endpoint that they're interacting with supports resumable uploads. Since this
856+
PEP is mandating support for that in servers, clients can just assume that the server they are
868857
interacting with supports it, which makes using it unneeded.
869-
- In theory, if the support for ``1xx`` responses got resolved and the draft
870-
gets accepted with it in, we can add that in at a later date without
871-
changing the overall flow of the API.
872858

873-
There is a risk that the above draft doesn't get accepted, but even if it
874-
does not, that doesn't actually affect us. It would just mean that our
875-
support for resumable uploads is an application specific protocol, but is
876-
still wholly standards compliant.
859+
- In theory, if the support for ``1xx`` responses gets resolved and the draft gets accepted with it
860+
in, we can add that in at a later date without changing the overall flow of the API.
861+
862+
There is a risk that the above draft doesn't get accepted, but even if it does not, that doesn't
863+
actually affect us. It would just mean that our support for resumable uploads is an application
864+
specific protocol, but is still wholly standards compliant.
865+
877866

878867
Can I use the upload 2.0 API to reserve a project name?
879868
-------------------------------------------------------
@@ -891,105 +880,91 @@ The user that created the session will become the owner of the new project.
891880
Open Questions
892881
==============
893882

894-
895883
Multipart Uploads vs tus
896884
------------------------
897885

898-
This PEP currently bases the actual uploading of files on an internet draft
899-
from ``tus.io`` that supports resumable file uploads.
886+
This PEP currently bases the actual uploading of files on an internet draft from ``tus.io`` that
887+
supports resumable file uploads.
900888

901889
That protocol requires a few things:
902890

903-
- That the client selects a secure ``Upload-Token`` that they use to identify
904-
uploading a single file.
905-
- That if clients don't upload the entire file in one shot, that they have
906-
to submit the chunks serially, and in the correct order, with all but the
907-
final chunk having a ``Upload-Incomplete: 1`` header.
908-
- Resumption of an upload is essentially just querying the server to see how
909-
much data they've gotten, then sending the remaining bytes (either as a single
910-
request, or in chunks).
911-
- The upload implicitly is completed when the server successfully gets all of
912-
the data from the client.
913-
914-
This has one big benefit, that if a client doesn't care about resuming their
915-
download, the work to support, from a client side, resumable uploads is able
916-
to be completely ignored. They can just ``POST`` the file to the URL, and if
917-
it doesn't succeed, they can just ``POST`` the whole file again.
918-
919-
The other benefit is that even if you do want to support resumption, you can
920-
still just ``POST`` the file, and unless you *need* to resume the download,
921-
that's all you have to do.
922-
923-
Another, possibly theoretical benefit is that for hashing the uploaded files,
924-
the serial chunks requirement means that the server can maintain hashing state
925-
between requests, update it for each request, then write that file back to
926-
storage. Unfortunately this isn't actually possible to do with Python's hashlib,
927-
though there are some libraries like `Rehash <https://github.com/kislyuk/rehash>`_
928-
that implement it, but they don't support every hash that hashlib does
929-
(specifically not blake2 or sha3 at the time of writing).
930-
931-
We might also need to reconstitute the download for processing anyways to do
932-
things like extract metadata, etc from it, which would make it a moot point.
933-
934-
The downside is that there is no ability to parallelize the upload of a single
935-
file because each chunk has to be submitted serially.
936-
937-
AWS S3 has a similar API (and most blob stores have copied it either wholesale
938-
or something like it) which they call multipart uploading.
891+
- That the client selects a secure ``Upload-Token`` that they use to identify uploading a single
892+
file.
893+
894+
- That if clients don't upload the entire file in one shot, that they have to submit the chunks
895+
serially, and in the correct order, with all but the final chunk having an ``Upload-Incomplete: 1``
896+
header.
897+
898+
- Resumption of an upload is essentially just querying the server to see how much data they've
899+
gotten, then sending the remaining bytes (either as a single request, or in chunks).
900+
901+
- The upload implicitly is completed when the server successfully gets all of the data from the
902+
client.
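
The serial chunking requirement above can be sketched as follows. This is only an illustration of
the ordering and of which requests carry ``Upload-Incomplete: 1``. The helper names ``plan_chunks``
and ``resume_from`` are hypothetical, no network requests are made, and the ``Upload-Token``
handling is omitted.

```python
from typing import Dict, Iterator, Tuple

Chunk = Tuple[int, bytes, Dict[str, str]]


def plan_chunks(data: bytes, chunk_size: int) -> Iterator[Chunk]:
    """Yield (offset, chunk, headers) in the order the chunks must be sent."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = len(data)
    offset = 0
    while True:
        chunk = data[offset:offset + chunk_size]
        is_last = offset + len(chunk) >= total
        headers = {"Content-Type": "application/octet-stream"}
        if not is_last:
            # Every chunk except the final one is marked as incomplete.
            headers["Upload-Incomplete"] = "1"
        yield offset, chunk, headers
        offset += len(chunk)
        if is_last:
            break


def resume_from(data: bytes, server_offset: int, chunk_size: int) -> Iterator[Chunk]:
    """Plan the remaining chunks once the server reports how many bytes it has."""
    for offset, chunk, headers in plan_chunks(data[server_offset:], chunk_size):
        yield server_offset + offset, chunk, headers
```

A client that does not care about resumption corresponds to calling ``plan_chunks`` with a
``chunk_size`` at least as large as the file, which yields a single request with no
``Upload-Incomplete`` header.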
903+
904+
This has the benefit that if a client doesn't care about resuming their upload, it can essentially
905+
ignore the protocol. Clients can just ``POST`` the file to the file upload URL, and if it doesn't
906+
succeed, they can just ``POST`` the whole file again.
907+
908+
The other benefit is that even if clients do want to support resumption, unless they *need* to
909+
resume the upload, they can still just ``POST`` the file.
910+
911+
Another, possibly theoretical benefit is that for hashing the uploaded files, the serial chunks
912+
requirement means that the server can maintain hashing state between requests, update it for each
913+
request, then write that file back to storage. Unfortunately this isn't actually possible to do with
914+
Python's `hashlib <https://docs.python.org/3/library/hashlib.html>`__ standard library module.
915+
There are some third-party libraries, such as `Rehash
916+
<https://rehash.readthedocs.io/en/latest/>`__ that do implement the necessary APIs, but they don't
917+
support every hash that ``hashlib`` does (e.g. ``blake2`` or ``sha3`` at the time of writing).
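
To illustrate the point about hashing state, here is a small sketch using only the standard
library. Within a single process, ``hashlib`` can be fed serial chunks incrementally. The second
function probes whether a partially updated hasher can be serialized with ``pickle`` (one possible
way of carrying state between requests, chosen here purely as an example), which is expected to
fail. Both function names are illustrative.

```python
import hashlib
import pickle
from typing import Iterable


def hash_serial_chunks(chunks: Iterable[bytes]) -> str:
    """Hash chunks in order; equivalent to hashing the concatenated content."""
    hasher = hashlib.sha256()
    for chunk in chunks:
        hasher.update(chunk)
    return hasher.hexdigest()


def can_persist_hasher_state() -> bool:
    """Return True only if a partially updated hasher can be pickled."""
    hasher = hashlib.sha256(b"partial upload")
    try:
        pickle.dumps(hasher)
    except Exception:
        # hashlib objects do not expose their internal state for serialization.
        return False
    return True
```

The incremental approach works as long as the hasher object stays alive, which is generally not
the case across separate HTTP requests handled by different workers.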
918+
919+
We might also need to reconstitute the upload for processing anyway to do things like extract
920+
metadata, etc from it, which would make it a moot point.
921+
922+
The downside is that there is no ability to parallelize the upload of a single file because each
923+
chunk has to be submitted serially.
924+
925+
AWS S3 has a similar API, which it calls multipart uploading, and most blob stores have either copied
926+
it wholesale or implemented something like it.
939927

940928
The basic flow for a multipart upload is:
941929

942-
1. Initiate a Multipart Upload to get an Upload ID.
943-
2. Break your file up into chunks, and upload each one of them individually.
944-
3. Once all chunks have been uploaded, finalize the upload.
945-
- This is the step where any errors would occur.
946-
947-
It does not directly support resuming an upload, but it allows clients to
948-
control the "blast radius" of failure by adjusting the size of each part
949-
they upload, and if any of the parts fail, they only have to resend those
950-
specific parts.
951-
952-
This has a big benefit in that it allows parallelization in uploading files,
953-
allowing clients to maximize their bandwidth using multiple threads to send
954-
the data.
955-
956-
We wouldn't need an explicit step (1), because our session would implicitly
957-
initiate a multipart upload for each file.
958-
959-
It does have its own downsides:
960-
961-
- Clients have to do more work on every request to have something resembling
962-
resumable uploads. They would *have* to break the file up into multiple parts
963-
rather than just making a single POST request, and only needing to deal
964-
with the complexity if something fails.
965-
966-
- Clients that don't care about resumption at all still have to deal with
967-
the third explicit step, though they could just upload the file all as a
968-
single part.
969-
970-
- S3 works around this by having another API for one shot uploads, but
971-
I'd rather not have two different APIs for uploading the same file.
972-
973-
- Verifying hashes gets somewhat more complicated. AWS implements hashing
974-
multipart uploads by hashing each part, then the overall hash is just a
975-
hash of those hashes, not of the content itself. We need to know the
976-
actual hash of the file itself for PyPI, so we would have to reconstitute
977-
the file and read its content and hash it once it's been fully uploaded,
978-
though we could still use the hash of hashes trick for checksumming the
979-
upload itself.
980-
981-
- See above about whether this is actually a downside in practice, or
982-
if it's just in theory.
983-
984-
I lean towards the ``tus`` style resumable uploads as I think they're simpler
985-
to use and to implement, and the main downside is that we possibly leave
986-
some multi-threaded performance on the table, which I think that I'm
987-
personally fine with?
988-
989-
I guess one additional benefit of the S3 style multi part uploads is that
990-
you don't have to try and do any sort of protection against parallel uploads,
991-
since they're just supported. That alone might erase most of the server side
992-
implementation simplification.
930+
#. Initiate a multipart upload to get an upload ID.
931+
#. Break your file up into chunks, and upload each one of them individually.
932+
#. Once all chunks have been uploaded, finalize the upload. This is the step where any errors would
933+
occur.
934+
935+
Such multipart uploads do not directly support resuming an upload, but they allow clients to control
936+
the "blast radius" of failure by adjusting the size of each part they upload, and if any of the
937+
parts fail, they only have to resend those specific parts. A further benefit is that it allows for more
938+
parallelism when uploading a single file, allowing clients to maximize their bandwidth using
939+
multiple threads to send the file data.
940+
941+
We wouldn't need an explicit step (1), because our session would implicitly initiate a multipart
942+
upload for each file.
943+
944+
There are downsides to this though:
945+
946+
- Clients have to do more work on every request to have something resembling resumable uploads. They
947+
would *have* to break the file up into multiple parts rather than just making a single POST
948+
request, and only needing to deal with the complexity if something fails.
949+
950+
- Clients that don't care about resumption at all still have to deal with the third explicit step,
951+
though they could just upload the file all as a single part. (S3 works around this by having
952+
another API for one shot uploads, but the PEP authors place a high value on having a single API
953+
for uploading any individual file.)
954+
955+
- Verifying hashes gets somewhat more complicated. AWS implements hashing multipart uploads by
956+
hashing each part, then the overall hash is just a hash of those hashes, not of the content
957+
itself. Since PyPI needs to know the actual hash of the file itself anyway, we would have to
958+
reconstitute the file, read its content, and hash it once it's been fully uploaded, though we
959+
could still use the hash of hashes trick for checksumming the upload itself.
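
The difference between a hash of hashes and a hash of the content can be sketched as below. This
uses SHA-256 throughout for simplicity and is not a reproduction of AWS's exact checksum algorithm.
The function names are illustrative.

```python
import hashlib
from typing import Sequence


def content_hash(parts: Sequence[bytes]) -> str:
    """Hash of the reconstituted file, i.e. of all parts concatenated in order."""
    hasher = hashlib.sha256()
    for part in parts:
        hasher.update(part)
    return hasher.hexdigest()


def hash_of_hashes(parts: Sequence[bytes]) -> str:
    """Hash each part on its own, then hash the concatenated part digests."""
    outer = hashlib.sha256()
    for part in parts:
        outer.update(hashlib.sha256(part).digest())
    return outer.hexdigest()
```

Because ``hash_of_hashes`` depends on where the part boundaries fall, it can serve as a checksum
for a particular multipart upload, but it cannot stand in for the hash of the file itself.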
960+
961+
The PEP authors lean towards ``tus`` style resumable uploads, due to them being simpler to use,
962+
easier to implement, and more consistent, with the main downside being that multi-threaded
963+
performance is theoretically left on the table.
964+
965+
One other possible benefit of the S3 style multipart uploads is that you don't have to try and do
966+
any sort of protection against parallel uploads, since they're just supported. That alone might
967+
erase most of the server side implementation simplification.
993968

994969
.. rubric:: Footnotes
995970
