Commit d5d4d2c

[3.13] gh-144148: Update the urllib.parse documentation (GH-144497) (GH-144507) (GH-144509)
(cherry picked from commit 2fb9cde)

Document urlsplit() as the main parsing function and urlparse() as an obsolete variant.
(cherry picked from commit 67ddba9)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
1 parent 65acd01 commit d5d4d2c
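
For orientation, the practical difference between the two functions named in the commit message can be sketched as follows. This is a minimal illustration, not part of the commit: the URL is a made-up example, and the `http` scheme is chosen because `urlparse()` only separates parameters for schemes it treats as using them.

```python
from urllib.parse import urlparse, urlsplit

# Hypothetical URL with a ";params" suffix on the last path segment.
url = "http://example.com/path;params?query=1#frag"

split = urlsplit(url)    # 5-item SplitResult: params stay inside the path
parsed = urlparse(url)   # 6-item ParseResult: params become a separate field

print(len(split), split.path)
print(len(parsed), parsed.path, parsed.params)
```

Both results share the scheme, netloc, query and fragment values; only the handling of the path differs.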

3 files changed: +71 -117 lines changed

Doc/library/urllib.parse.rst

Lines changed: 65 additions & 113 deletions
@@ -50,11 +50,12 @@ URL Parsing
5050
The URL parsing functions focus on splitting a URL string into its components,
5151
or on combining URL components into a URL string.
5252

53-
.. function:: urlparse(urlstring, scheme='', allow_fragments=True)
53+
.. function:: urlsplit(urlstring, scheme=None, allow_fragments=True)
5454

55-
Parse a URL into six components, returning a 6-item :term:`named tuple`. This
56-
corresponds to the general structure of a URL:
57-
``scheme://netloc/path;parameters?query#fragment``.
55+
Parse a URL into five components, returning a 5-item :term:`named tuple`
56+
:class:`SplitResult` or :class:`SplitResultBytes`.
57+
This corresponds to the general structure of a URL:
58+
``scheme://netloc/path?query#fragment``.
5859
Each tuple item is a string, possibly empty. The components are not broken up
5960
into smaller parts (for example, the network location is a single string), and %
6061
escapes are not expanded. The delimiters as shown above are not part of the
@@ -64,15 +65,15 @@ or on combining URL components into a URL string.
6465
.. doctest::
6566
:options: +NORMALIZE_WHITESPACE
6667

67-
>>> from urllib.parse import urlparse
68-
>>> urlparse("scheme://netloc/path;parameters?query#fragment")
69-
ParseResult(scheme='scheme', netloc='netloc', path='/path;parameters', params='',
68+
>>> from urllib.parse import urlsplit
69+
>>> urlsplit("scheme://netloc/path?query#fragment")
70+
SplitResult(scheme='scheme', netloc='netloc', path='/path',
7071
query='query', fragment='fragment')
71-
>>> o = urlparse("http://docs.python.org:80/3/library/urllib.parse.html?"
72+
>>> o = urlsplit("http://docs.python.org:80/3/library/urllib.parse.html?"
7273
... "highlight=params#url-parsing")
7374
>>> o
74-
ParseResult(scheme='http', netloc='docs.python.org:80',
75-
path='/3/library/urllib.parse.html', params='',
75+
SplitResult(scheme='http', netloc='docs.python.org:80',
76+
path='/3/library/urllib.parse.html',
7677
query='highlight=params', fragment='url-parsing')
7778
>>> o.scheme
7879
'http'
@@ -85,23 +86,23 @@ or on combining URL components into a URL string.
8586
>>> o._replace(fragment="").geturl()
8687
'http://docs.python.org:80/3/library/urllib.parse.html?highlight=params'
8788

88-
Following the syntax specifications in :rfc:`1808`, urlparse recognizes
89+
Following the syntax specifications in :rfc:`1808`, :func:`!urlsplit` recognizes
8990
a netloc only if it is properly introduced by '//'. Otherwise the
9091
input is presumed to be a relative URL and thus to start with
9192
a path component.
9293

9394
.. doctest::
9495
:options: +NORMALIZE_WHITESPACE
9596

96-
>>> from urllib.parse import urlparse
97-
>>> urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
98-
ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
99-
params='', query='', fragment='')
100-
>>> urlparse('www.cwi.nl/%7Eguido/Python.html')
101-
ParseResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html',
102-
params='', query='', fragment='')
103-
>>> urlparse('help/Python.html')
104-
ParseResult(scheme='', netloc='', path='help/Python.html', params='',
97+
>>> from urllib.parse import urlsplit
98+
>>> urlsplit('//www.cwi.nl:80/%7Eguido/Python.html')
99+
SplitResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
100+
query='', fragment='')
101+
>>> urlsplit('www.cwi.nl/%7Eguido/Python.html')
102+
SplitResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html',
103+
query='', fragment='')
104+
>>> urlsplit('help/Python.html')
105+
SplitResult(scheme='', netloc='', path='help/Python.html',
105106
query='', fragment='')
106107

107108
The *scheme* argument gives the default addressing scheme, to be
@@ -126,12 +127,9 @@ or on combining URL components into a URL string.
126127
+------------------+-------+-------------------------+------------------------+
127128
| :attr:`path` | 2 | Hierarchical path | empty string |
128129
+------------------+-------+-------------------------+------------------------+
129-
| :attr:`params` | 3 | Parameters for last | empty string |
130-
| | | path element | |
131-
+------------------+-------+-------------------------+------------------------+
132-
| :attr:`query` | 4 | Query component | empty string |
130+
| :attr:`query` | 3 | Query component | empty string |
133131
+------------------+-------+-------------------------+------------------------+
134-
| :attr:`fragment` | 5 | Fragment identifier | empty string |
132+
| :attr:`fragment` | 4 | Fragment identifier | empty string |
135133
+------------------+-------+-------------------------+------------------------+
136134
| :attr:`username` | | User name | :const:`None` |
137135
+------------------+-------+-------------------------+------------------------+
@@ -155,26 +153,30 @@ or on combining URL components into a URL string.
155153
``#``, ``@``, or ``:`` will raise a :exc:`ValueError`. If the URL is
156154
decomposed before parsing, no error will be raised.
157155

156+
Following some of the `WHATWG spec`_ that updates :rfc:`3986`, leading C0
157+
control and space characters are stripped from the URL. ``\n``,
158+
``\r`` and tab ``\t`` characters are removed from the URL at any position.
159+
158160
As is the case with all named tuples, the subclass has a few additional methods
159161
and attributes that are particularly useful. One such method is :meth:`_replace`.
160-
The :meth:`_replace` method will return a new ParseResult object replacing specified
161-
fields with new values.
162+
The :meth:`_replace` method will return a new :class:`SplitResult` object
163+
replacing specified fields with new values.
162164

163165
.. doctest::
164166
:options: +NORMALIZE_WHITESPACE
165167

166-
>>> from urllib.parse import urlparse
167-
>>> u = urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
168+
>>> from urllib.parse import urlsplit
169+
>>> u = urlsplit('//www.cwi.nl:80/%7Eguido/Python.html')
168170
>>> u
169-
ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
170-
params='', query='', fragment='')
171+
SplitResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
172+
query='', fragment='')
171173
>>> u._replace(scheme='http')
172-
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
173-
params='', query='', fragment='')
174+
SplitResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
175+
query='', fragment='')
174176

175177
.. warning::
176178

177-
:func:`urlparse` does not perform validation. See :ref:`URL parsing
179+
:func:`urlsplit` does not perform validation. See :ref:`URL parsing
178180
security <url-parsing-security>` for details.
179181

180182
.. versionchanged:: 3.2
@@ -193,6 +195,14 @@ or on combining URL components into a URL string.
193195
Characters that affect netloc parsing under NFKC normalization will
194196
now raise :exc:`ValueError`.
195197

198+
.. versionchanged:: 3.10
199+
ASCII newline and tab characters are stripped from the URL.
200+
201+
.. versionchanged:: 3.12
202+
Leading WHATWG C0 control and space characters are stripped from the URL.
203+
204+
.. _WHATWG spec: https://url.spec.whatwg.org/#concept-basic-url-parser
205+
196206

197207
.. function:: parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace', max_num_fields=None, separator='&')
198208

@@ -283,93 +293,35 @@ or on combining URL components into a URL string.
283293
separator key, with ``&`` as the default separator.
284294

285295

286-
.. function:: urlunparse(parts)
296+
.. function:: urlunsplit(parts)
287297

288-
Construct a URL from a tuple as returned by ``urlparse()``. The *parts*
289-
argument can be any six-item iterable. This may result in a slightly
298+
Construct a URL from a tuple as returned by ``urlsplit()``. The *parts*
299+
argument can be any five-item iterable. This may result in a slightly
290300
different, but equivalent URL, if the URL that was parsed originally had
291301
unnecessary delimiters (for example, a ``?`` with an empty query; the RFC
292302
states that these are equivalent).
293303

294304

295-
.. function:: urlsplit(urlstring, scheme='', allow_fragments=True)
296-
297-
This is similar to :func:`urlparse`, but does not split the params from the URL.
298-
This should generally be used instead of :func:`urlparse` if the more recent URL
299-
syntax allowing parameters to be applied to each segment of the *path* portion
300-
of the URL (see :rfc:`2396`) is wanted. A separate function is needed to
301-
separate the path segments and parameters. This function returns a 5-item
302-
:term:`named tuple`::
303-
304-
(addressing scheme, network location, path, query, fragment identifier).
305-
306-
The return value is a :term:`named tuple`, its items can be accessed by index
307-
or as named attributes:
308-
309-
+------------------+-------+-------------------------+----------------------+
310-
| Attribute | Index | Value | Value if not present |
311-
+==================+=======+=========================+======================+
312-
| :attr:`scheme` | 0 | URL scheme specifier | *scheme* parameter |
313-
+------------------+-------+-------------------------+----------------------+
314-
| :attr:`netloc` | 1 | Network location part | empty string |
315-
+------------------+-------+-------------------------+----------------------+
316-
| :attr:`path` | 2 | Hierarchical path | empty string |
317-
+------------------+-------+-------------------------+----------------------+
318-
| :attr:`query` | 3 | Query component | empty string |
319-
+------------------+-------+-------------------------+----------------------+
320-
| :attr:`fragment` | 4 | Fragment identifier | empty string |
321-
+------------------+-------+-------------------------+----------------------+
322-
| :attr:`username` | | User name | :const:`None` |
323-
+------------------+-------+-------------------------+----------------------+
324-
| :attr:`password` | | Password | :const:`None` |
325-
+------------------+-------+-------------------------+----------------------+
326-
| :attr:`hostname` | | Host name (lower case) | :const:`None` |
327-
+------------------+-------+-------------------------+----------------------+
328-
| :attr:`port` | | Port number as integer, | :const:`None` |
329-
| | | if present | |
330-
+------------------+-------+-------------------------+----------------------+
331-
332-
Reading the :attr:`port` attribute will raise a :exc:`ValueError` if
333-
an invalid port is specified in the URL. See section
334-
:ref:`urlparse-result-object` for more information on the result object.
335-
336-
Unmatched square brackets in the :attr:`netloc` attribute will raise a
337-
:exc:`ValueError`.
338-
339-
Characters in the :attr:`netloc` attribute that decompose under NFKC
340-
normalization (as used by the IDNA encoding) into any of ``/``, ``?``,
341-
``#``, ``@``, or ``:`` will raise a :exc:`ValueError`. If the URL is
342-
decomposed before parsing, no error will be raised.
343-
344-
Following some of the `WHATWG spec`_ that updates RFC 3986, leading C0
345-
control and space characters are stripped from the URL. ``\n``,
346-
``\r`` and tab ``\t`` characters are removed from the URL at any position.
347-
348-
.. warning::
349-
350-
:func:`urlsplit` does not perform validation. See :ref:`URL parsing
351-
security <url-parsing-security>` for details.
305+
.. function:: urlparse(urlstring, scheme=None, allow_fragments=True)
352306

353-
.. versionchanged:: 3.6
354-
Out-of-range port numbers now raise :exc:`ValueError`, instead of
355-
returning :const:`None`.
307+
This is similar to :func:`urlsplit`, but additionally splits the *path*
308+
component into *path* and *params*.
309+
This function returns a 6-item :term:`named tuple` :class:`ParseResult`
310+
or :class:`ParseResultBytes`.
311+
Its items are the same as for the :func:`!urlsplit` result, except that
312+
*params* is inserted at index 3, between *path* and *query*.
356313

357-
.. versionchanged:: 3.8
358-
Characters that affect netloc parsing under NFKC normalization will
359-
now raise :exc:`ValueError`.
314+
This function is based on obsoleted :rfc:`1738` and :rfc:`1808`, which
315+
listed *params* as the main URL component.
316+
The more recent URL syntax allows parameters to be applied to each segment
317+
of the *path* portion of the URL (see :rfc:`3986`).
318+
:func:`urlsplit` should generally be used instead of :func:`urlparse`.
319+
A separate function is needed to separate the path segments and parameters.
360320

361-
.. versionchanged:: 3.10
362-
ASCII newline and tab characters are stripped from the URL.
363-
364-
.. versionchanged:: 3.12
365-
Leading WHATWG C0 control and space characters are stripped from the URL.
366-
367-
.. _WHATWG spec: https://url.spec.whatwg.org/#concept-basic-url-parser
368-
369-
.. function:: urlunsplit(parts)
321+
.. function:: urlunparse(parts)
370322

371-
Combine the elements of a tuple as returned by :func:`urlsplit` into a
372-
complete URL as a string. The *parts* argument can be any five-item
323+
Combine the elements of a tuple as returned by :func:`urlparse` into a
324+
complete URL as a string. The *parts* argument can be any six-item
373325
iterable. This may result in a slightly different, but equivalent URL, if the
374326
URL that was parsed originally had unnecessary delimiters (for example, a ?
375327
with an empty query; the RFC states that these are equivalent).
@@ -387,7 +339,7 @@ or on combining URL components into a URL string.
387339
'http://www.cwi.nl/%7Eguido/FAQ.html'
388340

389341
The *allow_fragments* argument has the same meaning and default as for
390-
:func:`urlparse`.
342+
:func:`urlsplit`.
391343

392344
.. note::
393345

@@ -527,7 +479,7 @@ individual URL quoting functions.
527479
Structured Parse Results
528480
------------------------
529481

530-
The result objects from the :func:`urlparse`, :func:`urlsplit` and
482+
The result objects from the :func:`urlsplit`, :func:`urlparse` and
531483
:func:`urldefrag` functions are subclasses of the :class:`tuple` type.
532484
These subclasses add the attributes listed in the documentation for
533485
those functions, the encoding and decoding support described in the
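
The character stripping described in the `urlsplit()` documentation above can be checked with a short sketch. The input strings are hypothetical. According to the `versionchanged` notes in this diff, tab and newline removal applies from Python 3.10, and stripping of leading C0 control and space characters from 3.12, so the second call may behave differently on older interpreters.

```python
from urllib.parse import urlsplit

# Hypothetical input with an embedded tab and a trailing newline.
messy = "http://example.com/a\tb?x=1\n"
result = urlsplit(messy)
print(result.path, result.query)

# Hypothetical input with a leading space; per the notes above this is
# stripped before parsing on Python 3.12 and later.
leading = urlsplit(" http://example.com/")
print(repr(leading.scheme), repr(leading.netloc))
```

Because this stripping is not validation, callers who need to reject such input still have to check it themselves before parsing.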

Doc/library/venv.rst

Lines changed: 2 additions & 2 deletions
@@ -542,7 +542,7 @@ subclass which installs setuptools and pip into a created virtual environment::
542542
from subprocess import Popen, PIPE
543543
import sys
544544
from threading import Thread
545-
from urllib.parse import urlparse
545+
from urllib.parse import urlsplit
546546
from urllib.request import urlretrieve
547547
import venv
548548

@@ -613,7 +613,7 @@ subclass which installs setuptools and pip into a created virtual environment::
613613
stream.close()
614614

615615
def install_script(self, context, name, url):
616-
_, _, path, _, _, _ = urlparse(url)
616+
_, _, path, _, _ = urlsplit(url)
617617
fn = os.path.split(path)[-1]
618618
binpath = context.bin_path
619619
distpath = os.path.join(binpath, fn)
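
The unpacking change in `install_script` follows from the tuple length: `urlsplit()` returns five items where `urlparse()` returned six. A minimal sketch of the same file-name extraction, assuming a made-up download URL, shows that attribute access gives the same result without depending on the item count:

```python
import os
from urllib.parse import urlsplit

# Hypothetical script URL, used only for illustration.
url = "https://example.com/downloads/get-pip.py?version=1"

# Positional unpacking, as in the updated venv example (five items).
_, _, path, _, _ = urlsplit(url)
fn_by_position = os.path.split(path)[-1]

# Attribute access avoids counting tuple items altogether.
fn_by_name = os.path.split(urlsplit(url).path)[-1]

print(fn_by_position, fn_by_name)
```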

Lib/urllib/parse.py

Lines changed: 4 additions & 2 deletions
@@ -1,6 +1,6 @@
11
"""Parse (absolute and relative) URLs.
22
3-
urlparse module is based upon the following RFC specifications.
3+
urllib.parse module is based upon the following RFC specifications.
44
55
RFC 3986 (STD66): "Uniform Resource Identifiers" by T. Berners-Lee, R. Fielding
66
and L. Masinter, January 2005.
@@ -20,7 +20,7 @@
2020
McCahill, December 1994
2121
2222
RFC 3986 is considered the current standard and any future changes to
23-
urlparse module should conform with it. The urlparse module is
23+
urllib.parse module should conform with it. The urllib.parse module is
2424
currently not entirely compliant with this RFC due to de facto
2525
scenarios for parsing, and for backward compatibility purposes, some
2626
parsing quirks from older RFCs are retained. The testcases in
@@ -390,6 +390,8 @@ def urlparse(url, scheme='', allow_fragments=True):
390390
path or query.
391391
392392
Note that % escapes are not expanded.
393+
394+
urlsplit() should generally be used instead of urlparse().
393395
"""
394396
url, scheme, _coerce_result = _coerce_args(url, scheme)
395397
splitresult = urlsplit(url, scheme, allow_fragments)
