@@ -415,6 +415,8 @@ to. The :ref:`status <session-status>` of the session will also include the fil
415415``files`` mapping, with the above ``Location`` URL included under the ``link`` sub-key.
416416
417417
418+ .. _upload-contents :
419+
418420Upload File Contents
419421++++++++++++++++++++
420422
@@ -772,57 +774,46 @@ The ``message`` and ``source`` strings do not have any specific meaning, and are
772774interpretation to aid in diagnosing underlying issue.
773775
774776
775- **XXX REWRITTEN TO HERE **
776-
777777Content Types
778778-------------
779779
780780Like :pep: `691 `, this PEP proposes that all requests and responses from this upload API will have a
781781standard content type that describes what the content is, what version of the API it represents, and
782782what serialization format has been used.
783783
784- The structure of this content type will be:
785-
786- .. code-block :: text
787-
788- application/vnd.pypi.upload.$version+format
784+ This standard request content type applies to all requests *except* for :ref:`file upload requests
785+ <upload-contents>` which, since they contain only binary data, use ``application/octet-stream``.
789786
790- Since only major versions should be disruptive to systems attempting to
791- understand one of these API content bodies, only the major version will be
792- included in the content type, and will be prefixed with a ``v `` to clarify
793- that it is a version number.
787+ The structure of the ``Content-Type `` header for all other requests is:
794788
795- Unlike :pep: `691 `, this PEP does not change the existing ``1.0 `` API in any
796- way, so servers will be required to host the new API described in this PEP at
797- a different endpoint than the existing upload API.
798-
799- Thus for the new 2.0 API, the content type would be:
789+ .. code-block :: text
800790
801- - ** JSON: ** `` application/vnd.pypi.upload.v2+json ``
791+ application/vnd.pypi.upload.$version+$format
802792
803- In addition to the above, a special "meta" version is supported named ``latest ``,
804- whose purpose is to allow clients to request the absolute latest version, without
805- having to know ahead of time what that version is. It is recommended however,
806- that clients be explicit about what versions they support.
793+ Since minor API version differences should never be disruptive, only the major version is included
794+ in the content type; the version number is prefixed with a ``v ``.
807795
808- These content types ** DO NOT ** apply to the file uploads themselves, only to the
809- other API requests/responses in the upload API. The files themselves should use
810- the `` application/octet-stream `` content type .
796+ Unlike :pep:`691`, this PEP does not change the existing *legacy* ``1.0`` upload API in any way, so
797+ servers are required to host the new API described in this PEP at a different endpoint than the
798+ existing upload API .
811799
800+ Since JSON is the only request format defined in this PEP, all non-file-upload requests defined in
801+ this PEP **MUST** include a ``Content-Type`` header value of:
812802
813- Version + Format Selection
814- --------------------------
803+ - ``application/vnd.pypi.upload.v2+json ``.
815804
816- Again, similar to :pep: `691 `, this PEP standardizes on using server-driven
817- content negotiation to allow clients to request different versions or
818- serialization formats, which includes the ``format `` URL parameter.
805+ As with :pep: `691 `, a special "meta" version is supported named ``latest ``, the purpose of which is
806+ to allow clients to request the latest version implemented by the server, without having to know
807+ ahead of time what that version is. It is recommended, however, that clients be explicit about
808+ which versions they support.
819809
820- Since this PEP expects the existing legacy ``1.0 `` upload API to exist at a
821- different endpoint, and it currently only provides for JSON serialization, this
822- mechanism is not particularly useful, and clients only have a single version and
823- serialization they can request. However clients **SHOULD ** be setup to handle
824- content negotiation gracefully in the case that additional formats or versions
825- are added in the future.
810+ Similar to :pep: `691 `, this PEP also standardizes on using server-driven content negotiation to
811+ allow clients to request different versions or serialization formats, which includes the ``format ``
812+ part of the content type. However, since this PEP expects the existing legacy ``1.0`` upload API to
813+ exist at a different endpoint, and this PEP currently only provides for JSON serialization, this
814+ mechanism is not particularly useful: clients only have a single version and serialization they can
815+ request. Even so, clients **SHOULD** be prepared to handle content negotiation gracefully in the
816+ case that additional formats or versions are added in the future.
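
As an illustration of the content type structure described above, here is a minimal Python sketch that builds and parses such values. The helper names ``build_content_type`` and ``parse_content_type`` are hypothetical and not part of this PEP, and the sketch assumes an exact, lowercase match with no media type parameters, which a real server may be more lenient about.

```python
import re

# Matches application/vnd.pypi.upload.$version+$format, where $version is
# either "v" followed by a major version number or the "latest" meta version.
_CONTENT_TYPE_RE = re.compile(
    r"^application/vnd\.pypi\.upload\.(?P<version>v[0-9]+|latest)\+(?P<format>[a-z]+)$"
)


def build_content_type(version: str = "v2", fmt: str = "json") -> str:
    """Build the Content-Type value for a non-file-upload request."""
    return f"application/vnd.pypi.upload.{version}+{fmt}"


def parse_content_type(value: str) -> tuple[str, str]:
    """Split a content type into (version, format).

    Raises ValueError for anything that does not follow the structure,
    including application/octet-stream used by file uploads.
    """
    match = _CONTENT_TYPE_RE.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported content type: {value!r}")
    return match["version"], match["format"]
```

A client that is explicit about the version it supports would send ``build_content_type("v2", "json")`` rather than relying on ``latest``.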
826817
827818
828819FAQ
@@ -831,49 +822,47 @@ FAQ
831822Does this mean PyPI is planning to drop support for the existing upload API?
832823----------------------------------------------------------------------------
833824
834- At this time PyPI does not have any specific plans to drop support for the
835- existing upload API.
825+ At this time PyPI does not have any specific plans to drop support for the existing upload API.
836826
837- Unlike with :pep: `691 ` there are wide benefits to doing so, so it is likely
838- that we will want to drop support for it at some point in the future, but
839- until this API is implemented, and receiving broad use it would be premature
840- to make any plans for actually dropping support for it.
827+ Unlike with :pep:`691`, there are significant benefits to doing so, so it is likely that support for
828+ the legacy upload API will be (responsibly) deprecated and removed at some point in the future. Such
829+ future deprecation planning is explicitly out of scope for *this* PEP.
841830
842831
843832Is this Resumable Upload protocol based on anything?
844833----------------------------------------------------
845834
846835Yes!
847836
848- It's actually the protocol specified in an
849- ` Active Internet-Draft <https://datatracker.ietf.org/doc/draft-tus-httpbis-resumable-uploads-protocol/ >`_,
850- where the authors took what they learned implementing ` tus < https://tus.io/ >`_
851- to provide the idea of resumable uploads in a wholly generic, standards based
852- way.
853-
854- The only deviation we've made from that spec is that we don't use the
855- `` 104 Upload Resumption Supported `` informational response in the first
856- `` POST `` request. This decision was made for a few reasons:
857-
858- - The ``104 Upload Resumption Supported `` is the only part of that draft
859- which does not rely entirely on things that are already supported in the
860- existing standards, since it was adding a new informational status.
861- - Many clients and web frameworks don't support `` 1xx `` informational
862- responses in a very good way, if at all, adding it would complicate
863- implementation for very little benefit.
864- - The purpose of the `` 104 Upload Resumption Supported `` support is to allow
865- clients to determine that an arbitrary endpoint that they're interacting
866- with supports resumable uploads. Since this PEP is mandating support for
867- that in servers, clients can just assume that the server they are
837+ It's actually the protocol specified in an `Active Internet-Draft <ietf-draft_>`_, where the authors
838+ took what they learned implementing `tus <https://tus.io/>`_ to provide the idea of resumable
839+ uploads in a wholly generic, standards-based way.
840+
841+ .. _ietf-draft: https://datatracker.ietf.org/doc/draft-ietf-httpbis-resumable-upload/
842+
843+ The only deviation we've made from that spec is that we don't use the `` 104 Upload Resumption
844+ Supported `` informational response in the first `` POST `` request. This decision was made for a few
845+ reasons:
846+
847+ - The ``104 Upload Resumption Supported `` is the only part of that draft which does not rely
848+ entirely on things that are already supported in the existing standards, since it was adding a new
849+ informational status.
850+
851+ - Many clients and web frameworks don't support ``1xx`` informational responses well, if at all;
852+ adding it would complicate implementation for very little benefit.
853+
854+ - The purpose of the ``104 Upload Resumption Supported`` response is to allow clients to determine
855+ that an arbitrary endpoint that they're interacting with supports resumable uploads. Since this
856+ PEP is mandating support for that in servers, clients can just assume that the server they are
868857 interacting with supports it, which makes using it unneeded.
869- - In theory, if the support for ``1xx `` responses got resolved and the draft
870- gets accepted with it in, we can add that in at a later date without
871- changing the overall flow of the API.
872858
873- There is a risk that the above draft doesn't get accepted, but even if it
874- does not, that doesn't actually affect us. It would just mean that our
875- support for resumable uploads is an application specific protocol, but is
876- still wholly standards compliant.
859+ - In theory, if support for ``1xx`` responses is resolved and the draft is accepted with it
860+ in, we can add that in at a later date without changing the overall flow of the API.
861+
862+ There is a risk that the above draft doesn't get accepted, but even if it does not, that doesn't
863+ actually affect us. It would just mean that our support for resumable uploads is an
864+ application-specific protocol, but is still wholly standards-compliant.
865+
877866
878867Can I use the upload 2.0 API to reserve a project name?
879868-------------------------------------------------------
@@ -891,105 +880,91 @@ The user that created the session will become the owner of the new project.
891880Open Questions
892881==============
893882
894-
895883Multipart Uploads vs tus
896884------------------------
897885
898- This PEP currently bases the actual uploading of files on an internet draft
899- from `` tus.io `` that supports resumable file uploads.
886+ This PEP currently bases the actual uploading of files on an internet draft from `` tus.io `` that
887+ supports resumable file uploads.
900888
901889That protocol requires a few things:
902890
903- - That the client selects a secure ``Upload-Token `` that they use to identify
904- uploading a single file.
905- - That if clients don't upload the entire file in one shot, that they have
906- to submit the chunks serially, and in the correct order, with all but the
907- final chunk having a ``Upload-Incomplete: 1 `` header.
908- - Resumption of an upload is essentially just querying the server to see how
909- much data they've gotten, then sending the remaining bytes (either as a single
910- request, or in chunks).
911- - The upload implicitly is completed when the server successfully gets all of
912- the data from the client.
913-
914- This has one big benefit, that if a client doesn't care about resuming their
915- download, the work to support, from a client side, resumable uploads is able
916- to be completely ignored. They can just `` POST `` the file to the URL, and if
917- it doesn't succeed, they can just ``POST `` the whole file again.
918-
919- The other benefit is that even if you do want to support resumption, you can
920- still just `` POST `` the file, and unless you *need * to resume the download,
921- that's all you have to do .
922-
923- Another, possibly theoretical benefit is that for hashing the uploaded files,
924- the serial chunks requirement means that the server can maintain hashing state
925- between requests, update it for each request, then write that file back to
926- storage. Unfortunately this isn't actually possible to do with Python's hashlib,
927- though there are some libraries like ` Rehash < https://github.com/kislyuk/rehash >`_
928- that implement it , but they don't support every hash that hashlib does
929- (specifically not blake2 or sha3 at the time of writing).
930-
931- We might also need to reconstitute the download for processing anyways to do
932- things like extract metadata, etc from it, which would make it a moot point.
933-
934- The downside is that there is no ability to parallelize the upload of a single
935- file because each chunk has to be submitted serially.
936-
937- AWS S3 has a similar API ( and most blob stores have copied it either wholesale
938- or something like it) which they call multipart uploading.
891+ - That the client selects a secure ``Upload-Token `` that they use to identify uploading a single
892+ file.
893+
894+ - That if clients don't upload the entire file in one shot, that they have to submit the chunks
895+ serially, and in the correct order, with all but the final chunk having an ``Upload-Incomplete: 1``
896+ header.
897+
898+ - Resumption of an upload is essentially just querying the server to see how much data they've
899+ gotten, then sending the remaining bytes (either as a single request, or in chunks).
900+
901+ - The upload is implicitly completed when the server successfully gets all of the data from the
902+ client.
903+
904+ This has the benefit that if a client doesn't care about resuming their upload, it can essentially
905+ ignore the protocol. Clients can just ``POST`` the file to the file upload URL, and if it doesn't
906+ succeed, they can just ``POST`` the whole file again.
907+
908+ The other benefit is that even if clients do want to support resumption, unless they *need* to
909+ resume the upload, they can still just ``POST`` the file.
910+
911+ Another, possibly theoretical benefit is that for hashing the uploaded files, the serial chunks
912+ requirement means that the server can maintain hashing state between requests, update it for each
913+ request, then write that file back to storage. Unfortunately this isn't actually possible to do with
914+ Python's `hashlib <https://docs.python.org/3/library/hashlib.html>`__ standard library module.
915+ There are some third-party libraries, such as `Rehash
916+ <https://rehash.readthedocs.io/en/latest/>`__, that do implement the necessary APIs, but they don't
917+ support every hash that ``hashlib`` does (e.g. ``blake2`` or ``sha3`` at the time of writing).
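
To make the hashing point concrete, the following Python sketch shows the in-process half of the idea: feeding serial chunks, in order, into a single ``hashlib`` object yields the same digest as hashing the whole file. The function name ``hash_serial_chunks`` is hypothetical. What the sketch cannot show is the part noted above as problematic, namely saving the hasher's internal state between separate requests.

```python
import hashlib
from typing import Iterable


def hash_serial_chunks(chunks: Iterable[bytes], algorithm: str = "sha256") -> str:
    """Hash chunks in arrival order using one incrementally updated hasher."""
    hasher = hashlib.new(algorithm)
    for chunk in chunks:
        # Order matters: this only works because chunks arrive serially.
        hasher.update(chunk)
    return hasher.hexdigest()
```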
918+
919+ We might also need to reconstitute the uploaded file for processing anyway, to do things like
920+ extract metadata from it, which would make it a moot point.
921+
922+ The downside is that there is no ability to parallelize the upload of a single file because each
923+ chunk has to be submitted serially.
924+
925+ AWS S3 has a similar API, which it calls multipart uploading, and most blob stores have copied it
926+ either wholesale or in some similar form.
939927
940928The basic flow for a multipart upload is:
941929
942- 1. Initiate a Multipart Upload to get an Upload ID.
943- 2. Break your file up into chunks, and upload each one of them individually.
944- 3. Once all chunks have been uploaded, finalize the upload.
945- - This is the step where any errors would occur.
946-
947- It does not directly support resuming an upload, but it allows clients to
948- control the "blast radius" of failure by adjusting the size of each part
949- they upload, and if any of the parts fail, they only have to resend those
950- specific parts.
951-
952- This has a big benefit in that it allows parallelization in uploading files,
953- allowing clients to maximize their bandwidth using multiple threads to send
954- the data.
955-
956- We wouldn't need an explicit step (1), because our session would implicitly
957- initiate a multipart upload for each file.
958-
959- It does have its own downsides:
960-
961- - Clients have to do more work on every request to have something resembling
962- resumable uploads. They would *have * to break the file up into multiple parts
963- rather than just making a single POST request, and only needing to deal
964- with the complexity if something fails.
965-
966- - Clients that don't care about resumption at all still have to deal with
967- the third explicit step, though they could just upload the file all as a
968- single part.
969-
970- - S3 works around this by having another API for one shot uploads, but
971- I'd rather not have two different APIs for uploading the same file.
972-
973- - Verifying hashes gets somewhat more complicated. AWS implements hashing
974- multipart uploads by hashing each part, then the overall hash is just a
975- hash of those hashes, not of the content itself. We need to know the
976- actual hash of the file itself for PyPI, so we would have to reconstitute
977- the file and read its content and hash it once it's been fully uploaded,
978- though we could still use the hash of hashes trick for checksumming the
979- upload itself.
980-
981- - See above about whether this is actually a downside in practice, or
982- if it's just in theory.
983-
984- I lean towards the ``tus `` style resumable uploads as I think they're simpler
985- to use and to implement, and the main downside is that we possibly leave
986- some multi-threaded performance on the table, which I think that I'm
987- personally fine with?
988-
989- I guess one additional benefit of the S3 style multi part uploads is that
990- you don't have to try and do any sort of protection against parallel uploads,
991- since they're just supported. That alone might erase most of the server side
992- implementation simplification.
930+ #. Initiate a multipart upload to get an upload ID.
931+ #. Break your file up into chunks, and upload each one of them individually.
932+ #. Once all chunks have been uploaded, finalize the upload. This is the step where any errors would
933+ occur.
934+
935+ Such multipart uploads do not directly support resuming an upload, but they allow clients to control
936+ the "blast radius" of failure by adjusting the size of each part they upload, and if any of the
937+ parts fail, they only have to resend those specific parts. This approach also allows for more
938+ parallelism when uploading a single file, allowing clients to maximize their bandwidth using
939+ multiple threads to send the file data.
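
A rough Python sketch of that multipart pattern follows. It is not part of this PEP's proposed API: ``split_parts`` and ``upload_parts`` are hypothetical helpers, ``upload_part`` is a caller-supplied callable standing in for whatever request sends one part, and failures are assumed to surface as ``OSError``.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def split_parts(data: bytes, part_size: int) -> list[bytes]:
    """Break the file contents into parts of at most part_size bytes."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]


def upload_parts(
    parts: list[bytes],
    upload_part: Callable[[int, bytes], None],
    max_workers: int = 4,
) -> list[int]:
    """Upload all parts in parallel and return the indexes that failed."""

    def attempt(item: tuple[int, bytes]) -> Optional[int]:
        index, part = item
        try:
            upload_part(index, part)
        except OSError:
            # Only this part needs to be resent, not the whole file.
            return index
        return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(attempt, enumerate(parts)))
    return [index for index in results if index is not None]
```

The returned list is what bounds the "blast radius": a client would resend only those indexes before finalizing the upload.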
940+
941+ We wouldn't need an explicit step (1), because our session would implicitly initiate a multipart
942+ upload for each file.
943+
944+ There are downsides to this though:
945+
946+ - Clients have to do more work on every request to have something resembling resumable uploads. They
947+ would *have * to break the file up into multiple parts rather than just making a single POST
948+ request, and only needing to deal with the complexity if something fails.
949+
950+ - Clients that don't care about resumption at all still have to deal with the third explicit step,
951+ though they could just upload the file all as a single part. (S3 works around this by having
952+ another API for one shot uploads, but the PEP authors place a high value on having a single API
953+ for uploading any individual file.)
954+
955+ - Verifying hashes gets somewhat more complicated. AWS implements hashing multipart uploads by
956+ hashing each part, then the overall hash is just a hash of those hashes, not of the content
957+ itself. Since PyPI needs to know the actual hash of the file itself anyway, we would have to
958+ reconstitute the file, read its content, and hash it once it's been fully uploaded, though we
959+ could still use the hash of hashes trick for checksumming the upload itself.
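
The difference between a hash of part hashes and a hash of the content can be illustrated with a short Python sketch. This is a simplified stand-in for the scheme described above, not the exact algorithm any blob store uses, and both function names are hypothetical.

```python
import hashlib
from typing import Iterable


def hash_of_part_hashes(parts: Iterable[bytes]) -> str:
    """Hash each part, then hash the concatenation of those digests."""
    outer = hashlib.sha256()
    for part in parts:
        outer.update(hashlib.sha256(part).digest())
    return outer.hexdigest()


def content_hash(parts: Iterable[bytes]) -> str:
    """Hash the reconstituted file contents, independent of part boundaries."""
    hasher = hashlib.sha256()
    for part in parts:
        hasher.update(part)
    return hasher.hexdigest()
```

Because the hash of hashes depends on where the part boundaries fall, it can checksum a particular upload but cannot replace the hash of the file itself.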
960+
961+ The PEP authors lean towards ``tus`` style resumable uploads, due to them being simpler to use,
962+ easier to implement, and more consistent, with the main downside being that multi-threaded
963+ performance is theoretically left on the table.
964+
965+ One other possible benefit of the S3 style multipart uploads is that you don't have to try and do
966+ any sort of protection against parallel uploads, since they're just supported. That alone might
967+ erase most of the server side implementation simplification.
993968
994969.. rubric :: Footnotes
995970