diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 449127b2a4f..542b0bdfb4a 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -642,6 +642,7 @@ peps/pep-0761.rst @sethmlarson @hugovk peps/pep-0762.rst @pablogsal @ambv @lysnikolaou @emilyemorehouse peps/pep-0763.rst @dstufft peps/pep-0765.rst @iritkatriel @ncoghlan +peps/pep-0766.rst @warsaw # ... peps/pep-0777.rst @warsaw # ... diff --git a/peps/pep-0766.rst b/peps/pep-0766.rst new file mode 100644 index 00000000000..b1e22e78c92 --- /dev/null +++ b/peps/pep-0766.rst @@ -0,0 +1,445 @@ +PEP: 766 +Title: Explicit Priority Choices Among Multiple Indexes +Author: Michael Sarahan +Sponsor: Barry Warsaw +PEP-Delegate: Paul Moore +Discussions-To: https://discuss.python.org/t/pep-for-handling-multiple-indexes-index-priority/71589 +Status: Draft +Type: Informational +Topic: Packaging +Created: 18-Nov-2024 +Post-History: `18-Nov-2024 `__, + +Abstract +======== + +Package resolution is a key part of the Python user experience as the means of +extending Python's core functionality. The experience of package resolution is +mostly taken for granted until someone encounters a situation where the package +installer does something they don't expect. The installer behavior with +multiple indexes has been `a common source of unexpected behavior +`__. Through its ubiquity, pip has +long defined the standard expected behavior across other tools in the ecosystem, +but Python installers are diverging with respect to how they handle multiple +indexes. At the core of this divergence is whether index contents are combined +before resolving distributions, or each index is handled individually in order. +pip merges all indexes before matching distributions, while uv matches +distributions on one index before moving on to the next. Each approach has +advantages and disadvantages. 
This PEP aims to describe each of these +behaviors, which are referred to as “version priority” and “index priority” +respectively, so that community discussions and troubleshooting can share a +common vocabulary, and so that tools can implement predictable behavior based on +these descriptions. + +Motivation +========== + +Python package users frequently find themselves in need of specifying an index +or package source other than PyPI. There are many reasons for external indexes +to exist: + +- File size/quota limitations on PyPI +- Implementation variants, such as `different GPU library builds in PyTorch `__ +- `Local builds of packages shared internally at an organization `__ +- `Situations where a local package has remote dependencies + `__, and the user wishes to prioritize + local packages over remote dependencies, while still falling back to remote + dependencies where needed + +In most of these cases, it is not desirable to completely forgo PyPI. Instead, +users generally want PyPI to still be a source of packages, but a lower priority +source. Unfortunately, `pip's current design precludes this concept of priority `__. +Some Python installer tools have developed alternative ways to handle multiple +indexes that incorporate mechanisms to express index priority, such as `uv +`__ +and `PDM +`__. + +The innovation and the potential for customization are exciting, but they come at +the risk of further fragmenting the Python packaging ecosystem, which is already +perceived as one of Python's weak points. The motivation of this PEP is to encourage +installers to provide more insight into how they handle multiple indexes, and to +provide a vocabulary that can be common to the broader community. + +Specification +============= + +“Version priority” +------------------ + +This behavior is characterized by the installer always getting the +"best" version of a package, regardless of the index that it comes +from. 
"Best" is defined by the installer's algorithm for optimizing +the various traits of a package, also factoring in user input (such as +preferring only binaries, or no binaries). While installers may differ +in their optimization criteria and user options, the general trait that +all version priority installers share is that the index +contents are collated prior to candidate selection. + +Version priority is most useful when all configured indexes are equally trusted +and well-behaved regarding the distribution interchangeability assumption. +Mirrors are especially well-behaved in this regard. That interchangeability +assumption is what makes comparing distributions of a given package meaningful. +Without it, the installer is no longer comparing “apples to apples.” In +practice, it is common for different indexes to have files that have different +contents than other indexes, such as builds for special hardware, or differing +metadata for the same package. Version priority behavior can lead to +undesirable, unexpected outcomes in these cases, and this is where `users +generally look for some kind of index priority +`__. Additionally, when there is a +difference in trust among indexes, version priority does not provide a way to +prefer more trusted indexes over less trusted indexes. This has been exploited by +dependency confusion attacks, and :pep:`708` was proposed as a way of +hard-coding a notion of trusted external indexes into the index. + +The "version priority" name is new, and introduction of new terms should always +be minimized. This PEP looks toward the uv project, which refers to `its implementation of the version priority +behavior `__ +as “``unsafe-best-match``.” Naming is really hard here. On one hand, it +isn’t accurate to call pip’s default behavior intrinsically “unsafe.” +The addition of possibly malicious indexes is what +introduces concern with this behavior. 
:pep:`708` added a way to restrict +installers from drawing packages from unexpected, potentially insecure +indexes. On the other hand, the term “best-match” is technically +correct, but also misleading. The “best match” varies by user and by +application. “Best” is technically correct in the sense that it is a +global optimum according to the match criteria specified above, but that +is not necessarily what is “best” in a user’s eyes. “Version priority” +is a proposed term that avoids the concerns with the uv terminology, +while approximating the behavior in the most user-identifiable way that +packages are compared. + +“Index priority” +---------------- + +In index priority, the resolver finds candidates for each index, one at a time. +The resolver proceeds to subsequent indexes only if the current package request +has no viable candidates. Index priority does not combine indexes into one +global, flat namespace. Because indexes are searched in order, the package from +an earlier index will be preferred over a package from a later index, +regardless of whether the later index had a better match with the installer's +optimization criteria. For a given installer, the optimization criteria and +selection algorithm should be the same for both index priority and version +priority. It is only the treatment of multiple indexes that differs: all +together for version priority, and individually for index priority. + +The order of specification of indexes determines their priority in the +finding process. As a result, the way that installers load the index +configuration must be predictable and reproducible. This PEP does not prescribe +any particular mechanism, other than to say that installers should provide +a way of ordering their collection of sources. Installers should also +ideally provide optional debugging output that provides insight into +which index is being considered. 
+ +Each package’s finder should start at the beginning of the list of indexes, so each +package starts over with the index list. In other words, if one package has no +valid candidates on the first index, but finds a hit on the second index, +subsequent packages should still start their search on the first index, rather than +starting on the second. + +One desirable behavior that the index priority strategy implies is that +there are no “surprise” updates, where a version bump on a +lower-priority index wins out over a curated, approved higher-priority +index. This is related to the security improvement of :pep:`708`, where +packages can restrict the external indexes that distributions can come +from, but index priority is more configurable by end users. The package installs are +only expected to change when either the higher-priority index or the +index priority configuration change. This stability and predictability +makes it more viable to configure indexes as a more persistent property of an +environment, rather than a one-off argument for one install command. + +Cache keys +~~~~~~~~~~ + +Because index priority is acknowledging the possibility that different indexes +may have different content for a given package, caching and lockfiles should now +include the index from which distributions were downloaded. Without this +aspect, it is possible that after changing the list of configured indexes, the +cache or lockfile could provide a similarly-named distribution from a +lower-priority index. If every index follows the recommended behavior of +providing identical files across indexes for a given filename, this is not an +issue. However, that recommendation is not readily enforceable, and augmenting +the cache key with origin index would be a wise defensive change. 
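The defensive cache key described above could look something like the following minimal sketch. The ``CacheKey`` type, the ``make_cache_key`` helper, and the example index URLs are all hypothetical names chosen for illustration; they are not part of pip, uv, or any other installer.

```python
from typing import NamedTuple


class CacheKey(NamedTuple):
    """Hypothetical cache key; the field names are illustrative only."""

    index_url: str  # origin index the distribution was downloaded from
    filename: str   # distribution filename, e.g. a wheel name


def make_cache_key(index_url, filename):
    # Strip trailing slashes so one index always maps to the same key.
    return CacheKey(index_url.rstrip("/"), filename)


# Two indexes serve a file with the same name but different contents.
wheel = "demo-1.0-py3-none-any.whl"
internal = make_cache_key("https://internal.example/simple/", wheel)
public = make_cache_key("https://public.example/simple", wheel)

cache = {}
cache[internal] = b"internal build"
cache[public] = b"public build"
```

Because the origin index is part of the key, the two same-named files occupy separate cache entries, so reordering the configured indexes cannot silently return the file from the wrong one.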
+ +Ways that a request falls through to a lower priority index +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +- Package name is not present at all in higher priority index +- All distributions from higher priority index filtered out due to + version specifier, compatible Python version, platform tag, yanking or otherwise +- A denylist configuration for the installer specifies that a particular package + name should be ignored on a given index +- A higher priority index is unreachable (e.g. blocked by firewall + rules, temporarily unavailable due to maintenance, other miscellaneous + and temporary networking issues). This is a less clear-cut detail that + should be controllable by users. On one hand, this behavior would lead + to less predictable, likely unreproducible results by unexpectedly + falling through to lower priority indexes. On the other hand, graceful + fallback may be more valuable to some users, especially if they can + safely assume that all of their indexes are equally trusted. pip’s + behavior today is graceful fallback: you see warnings if an index is + having connection issues, but the installation will proceed with any + other available indexes. Because index priority can convey different trust + levels between indexes, installers that implement index priority should + default to raising errors and aborting on network issues. Installers may + choose to provide a flag to allow fall-through to lower-priority indexes in + case of network error. + +Treatment within a given index follows existing behavior, but stops at +the bounds of one index and moves on to the next index only after all +priority preferences within the one index are exhausted. This means that +existing priorities among the unified collection of packages apply to +each index individually before falling through to a lower priority +index. 
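The difference between the two strategies, including fall-through, can be sketched roughly as follows. This is a toy model under stated assumptions, not any installer's actual algorithm: versions are plain tuples, each index is a hypothetical ``(url, {name: [versions]})`` pair, ``is_viable`` stands in for all filtering (version specifiers, platform tags, yanking), and the denylist and network-error cases are not modeled.

```python
def version_priority(indexes, name, is_viable):
    """Collate candidates from every index, then pick the highest version."""
    pool = [
        (version, url)
        for url, packages in indexes
        for version in packages.get(name, [])
        if is_viable(version)
    ]
    return max(pool, default=None)


def index_priority(indexes, name, is_viable):
    """Search indexes in order; fall through only when none are viable."""
    for url, packages in indexes:
        viable = [v for v in packages.get(name, []) if is_viable(v)]
        if viable:
            return (max(viable), url)
    return None


INTERNAL = "https://internal.example/simple"
PUBLIC = "https://public.example/simple"
indexes = [
    (INTERNAL, {"demo": [(1, 0)]}),
    (PUBLIC, {"demo": [(1, 0), (2, 0)], "other": [(0, 5)]}),
]


def accept_all(version):
    return True


best_overall = version_priority(indexes, "demo", accept_all)
best_by_index = index_priority(indexes, "demo", accept_all)
# "other" is absent from the first index, so the request falls through.
fallthrough = index_priority(indexes, "other", accept_all)
# Every candidate on the first index is filtered out, so it falls through.
filtered = index_priority(indexes, "demo", lambda v: v >= (2, 0))
```

Each call to ``index_priority`` begins again at the first index, so a fall-through for one package does not affect where the search for the next package starts.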
+ +There are tradeoffs to make at every level of the optimization criteria: + +- version: index priority will use an older version from a higher-priority index + even if a newer version is available on another index. +- wheel vs sdist: Should the installer use an sdist from a higher-priority + index before trying a wheel from a lower-priority index? +- more platform-specific wheels before less specific ones: Should the + installer use less specific wheels from higher-priority indexes + before using more specific wheels from lower priority indexes? +- flags such as pip's ``--prefer-binary``: Should the installer use an sdist from a higher + priority index before considering wheels on a lower priority index? + +Installers are free to implement these priorities in different ways for +themselves, but they should document their optimization criteria and how they +handle fall-through to lower-priority indexes. For example, an installer could +say that ``--prefer-binary`` should not install an sdist unless it had iterated +through all configured indexes and found no installable binary candidates. + +Mirroring +~~~~~~~~~ + +As described thus far, the index priority scheme breaks the use case of more +than one index url serving the same content. Such mirrors may be used with the +intent of ameliorating network issues or otherwise improving reliability. One +approach that installers could take to preserve mirroring functionality while +adding index priority would be to add a notion of user-definable index groups, +where each index in the group is assumed to be equivalent. This is related to +`Poetry's notion of package sources +`__, except that this would allow +arbitrary numbers of prioritizable groups, and that this would assume members of +a group to be mirrors. Within each group, content could be combined, or each +member could be fetched concurrently. The fastest responding index would then +represent the group. 
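One way to picture the index group idea is the sketch below, which combines content within a group and applies priority between groups. The ``resolve_with_groups`` function, the data layout, and the URLs are assumptions made for illustration only; concurrent fetching and the "fastest responder" variant are not modeled.

```python
def resolve_with_groups(groups, name):
    """Return (version, url) from the first group that offers the package."""
    for group in groups:
        # Members of a group are assumed to be mirrors of each other, so
        # their contents are combined before the best version is chosen.
        pool = [
            (version, url)
            for url, packages in group
            for version in packages.get(name, [])
        ]
        if pool:
            return max(pool)
    return None


MIRROR_A = "https://mirror-a.example/simple"
MIRROR_B = "https://mirror-b.example/simple"
PUBLIC = "https://public.example/simple"

groups = [
    # Higher-priority group: two mirrors, one slightly behind the other.
    [(MIRROR_A, {"demo": [(1, 0)]}), (MIRROR_B, {"demo": [(1, 0), (1, 1)]})],
    # Lower-priority group: consulted only when the first group has nothing.
    [(PUBLIC, {"demo": [(2, 0)], "other": [(0, 5)]})],
]

from_mirrors = resolve_with_groups(groups, "demo")
from_public = resolve_with_groups(groups, "other")
```

In this model a lagging mirror does not hide the newer file held by its sibling, while the lower-priority group still cannot override anything that the mirror group provides.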
+ +Backwards Compatibility +======================= + +This PEP does not prescribe any changes as mandatory for any installer, +so it only introduces compatibility concerns if tools choose to adopt an +index behavior other than the behavior(s) they currently implement. + +This PEP’s language does not quite align with existing tools, including +pip and uv. Either this PEP’s language can change during review of this PEP, or if +this PEP’s language is preferred, other projects could conform to it. +The only goal of proposing these terms is to create a central, common vocabulary +that makes it easier for users to learn about other installers. + +As some tools rely on one or the other behavior, there are some possible +issues that may emerge, where tailoring available resources/packages for +a particular behavior may detract from the user experience for people +who rely on the other behavior. + +- Different indexes may have different metadata. For example, one cannot assume + that the metadata for package “something” on index “A” has the same dependencies + as “something” on index “B”. This breaks fundamental assumptions of version + priority, but index priority can handle this. When an installer falls through to a + lower-priority index in the search order, it implies refreshing the package metadata + from the new index. This is both an improvement and a complication. It is a + complication in the sense that a cached metadata entry must be keyed by both + package name and index url, instead of just package name. It is a potential + improvement in that different implementation variants of a package can differ in + dependencies as long as their distributions are separated into different indexes. + +- Users may not get updates as they expect when using index priority, because some higher priority + index has not updated/synchronized with PyPI to get the latest + packages. If the higher priority index has a valid candidate, newer + packages will not be found. 
This will need to be communicated + verbosely, because it is counter to pip’s well-established behavior. + +- By adding index priority, an installer will improve the predictability of + which index will be selected, and index hosts may abuse this as a way of having + similarly named files that have different contents. With version priority, + this violates the key package interchangeability assumption, and insanity will ensue. + Index priority would be more workable, but the situation still + has great potential for confusion. It would be helpful to develop tools that + support installers in identifying these confusing issues. These tools could + operate independently of the installer process, as a means of validating the + sanity of a set of indexes. Depending on the time cost of these tools, the + installers could run them as part of their process. Users could, of course, + ignore the recommendations at their own risk. + +Security Implications +===================== + +Index priority creates a mechanism for users to explicitly specify a trust +hierarchy among their indexes. As such, it limits the potential for dependency +confusion attacks. Index priority was rejected by :pep:`708` as a solution for +dependency confusion attacks. This PEP requests that the rejection be +reconsidered, with index priority serving a different purpose. This PEP is +primarily motivated by the desire to support implementation variants, which is +the subject of `another discussion that hopefully leads to a PEP +`__. +It is not mutually exclusive with :pep:`708`, nor does it suggest reverting or +withdrawing :pep:`708`. It is an answer to `how we could allow users to choose +which index to use at a more fine grained level than “per install”. +`__ + +For a more thorough discussion of the :pep:`708` rejection of index +priority, please see the `discuss.python.org thread for this PEP +`__. 
+ +How to Teach This +================= + +At the outset, the goal is not to convert pip or any other tool to +change its default priority behavior. The best way to teach is perhaps +to watch message boards, GitHub issue trackers and chat channels, +keeping an eye out for problems that index priority could help solve. +There are `several `__ +`long-standing `__ +`discussions `__ +`that `__ +`would `__ be good places to +start advertising the concepts. The topics of the two officially +supported behaviors need documentation, and we, the authors of this +PEP, would develop these as part of the review period of this PEP. +These docs would likely consist of additions across several +indexes, cross-linking the concepts between installers. At a +minimum, we expect to add to the +`PyPUG `__ and to `pip’s +documentation `__. + +It will be important for installers to advertise the active behavior, especially in +error messaging, and that will create opportunities to point users to +resources about these behaviors. + +uv users are already experiencing index priority. uv `documents this +behavior `__ +well, but it is always possible to `improve the +discoverability `__ of that +documentation from the command line, `where users will actually +encounter the unexpected +behavior `__. + +Reference Implementation +======================== + +The uv project demonstrates index priority with its default behavior. uv +is implemented in Rust, though, so if a reference implementation for a Python-based tool +is necessary, we, the authors of this PEP, will provide one. For pip in +particular, we see the implementation plan as something like: + +- For users who don’t use ``--extra-index-url`` or ``--find-links``, + there will be no change, and no migration is necessary. +- pip users would be able to opt in to the index priority behavior with a + new config setting in the CLI and in ``pip.conf``. This proposal does not + recommend any strategy as the default for any installer. 
It only + recommends documenting the strategies that a tool provides. +- Enable extra info-level output for any pip operation where more than + one index is used. In this output, state the current strategy setting, + and a terse summary of implied behavior, as well as a link to docs + that describe the different options +- Add debugging output that verbosely identifies the index being used at + each step, including where the file is in the configuration hierarchy, + and where it is being included (via config file, env var, or CLI + flag). +- Plumb tracking of which index gets used for which + package/distribution through the entire pip install process. Store + this information so that it is available to tools like ``pip freeze`` +- Supplement :pep:`751` (lockfiles) with capture of index where a + package/distribution came from + +Rejected Ideas +============== + +- Tell users to set up a proxy/mirror, such as `devpi `__ + or `Artifactory `__ that + serves local files if present, and forwards to another server (PyPI) + if no local files match + + This matches the behavior of this proposal very closely, except that + this method requires hosting some server, and may be inaccessible or + not configurable to users in some environments. It is also important + to consider that for an organization that operates its own index + (for overcoming PyPI size restrictions, for example), this does not + solve the need for ``--extra-index-url`` or proxy/mirror for end + users. That is, organizations get no improvement from this approach + unless they proxy/mirror PyPI as a whole, and get users to configure + their proxy/mirror as their sole index. + +- Are build tags and/or local version specifiers enough? + + Build tags and local version specifiers will take precedence over + packages without those tags and/or local version specifiers. 
In a pool + of packages, builds that have these additions hosted on a server other + than PyPI will take priority over packages on PyPI, where build tags + are rarely used and local version specifiers are forbidden. This approach is + viable when package providers want to provide their own local + override, such as `HPC maintainers who provide optimized builds for + their + users `__. + It is less viable in some ways, such as build tags not showing up in + ``pip freeze`` metadata, and `local version specifiers not being + allowed on + PyPI `__. + There is also significant work entailed in building and maintaining + package collections with local build tag variants. + + https://discuss.python.org/t/dependency-notation-including-the-index-url/5659/21 + +- What about :pep:`708`? Isn’t that + enough? + + :pep:`708` is aimed specifically at addressing dependency confusion + attacks, and doesn’t address the potential for implementation variants + among indexes. It is a way of filtering external URLs and encoding an + allow-list for external indexes in index metadata. It does not change + the lack of priority or preference among channels that currently + exists. + +- `Namespacing `__ + + Namespacing is a means of specifying a package such that the Python + usage of the package does not change, but the package installation + restricts where the package comes from. :pep:`752` recently proposed a way to + multiplex a package’s owners in a flat package namespace (e.g. + PyPI) by reserving prefixes as grouping elements. `NPM’s concept + of “scopes” `__ has + been raised as another good example of how this might look. This PEP + differs in that it is targeted at multiple indexes, not a flat package + namespace. 
The net effect is roughly the same in terms of predictably + choosing a particular package source, except that the namespacing + approach relies more on naming packages with these namespace prefixes, + whereas this PEP would be less granular, pulling in packages on + whatever higher-priority index the user specifies. The namespacing + approach relies on all configured indexes treating a given namespace + similarly, which leaves the usual concern that not all configured + indexes are trusted equally. The namespace idea is not incompatible + with this PEP, but it also does not improve expression of trust of + indexes in the way that this PEP does. + +Open Issues +=========== + +[Any points that are still being decided/discussed.] + +Acknowledgements +================ + +This work was supported financially by NVIDIA through employment of the author. +NVIDIA teammates dramatically improved this PEP with their +input. Astral Software pioneered the behaviors of index priority and thus laid the +foundation of this document. The pip authors deserve great praise for their +consistent direction and patient communication of the version priority behavior, +especially in the face of contentious security concerns. + +Copyright +========= + +This document is placed in the public domain or under the +CC0-1.0-Universal license, whichever is more permissive.