Commit 67ddba9

gh-144148: Update the urllib.parse documentation (GH-144497)
Document urlsplit() as the main parsing function and urlparse() as an obsolete variant.
1 parent d5cb9f6 commit 67ddba9

5 files changed

+109
-161
lines changed

Doc/library/urllib.parse.rst

Lines changed: 81 additions & 135 deletions
@@ -50,11 +50,12 @@ URL Parsing
5050
The URL parsing functions focus on splitting a URL string into its components,
5151
or on combining URL components into a URL string.
5252

53-
.. function:: urlparse(urlstring, scheme=None, allow_fragments=True, *, missing_as_none=False)
53+
.. function:: urlsplit(urlstring, scheme=None, allow_fragments=True, *, missing_as_none=False)
5454

55-
Parse a URL into six components, returning a 6-item :term:`named tuple`. This
56-
corresponds to the general structure of a URL:
57-
``scheme://netloc/path;parameters?query#fragment``.
55+
Parse a URL into five components, returning a 5-item :term:`named tuple`
56+
:class:`SplitResult` or :class:`SplitResultBytes`.
57+
This corresponds to the general structure of a URL:
58+
``scheme://netloc/path?query#fragment``.
5859
Each tuple item is a string, possibly empty, or ``None`` if
5960
*missing_as_none* is true.
6061
Undefined components are represented by an empty string (by default) or
@@ -68,15 +69,15 @@ or on combining URL components into a URL string.
6869
.. doctest::
6970
:options: +NORMALIZE_WHITESPACE
7071

71-
>>> from urllib.parse import urlparse
72-
>>> urlparse("scheme://netloc/path;parameters?query#fragment")
73-
ParseResult(scheme='scheme', netloc='netloc', path='/path;parameters', params='',
72+
>>> from urllib.parse import urlsplit
73+
>>> urlsplit("scheme://netloc/path?query#fragment")
74+
SplitResult(scheme='scheme', netloc='netloc', path='/path',
7475
query='query', fragment='fragment')
75-
>>> o = urlparse("http://docs.python.org:80/3/library/urllib.parse.html?"
76+
>>> o = urlsplit("http://docs.python.org:80/3/library/urllib.parse.html?"
7677
... "highlight=params#url-parsing")
7778
>>> o
78-
ParseResult(scheme='http', netloc='docs.python.org:80',
79-
path='/3/library/urllib.parse.html', params='',
79+
SplitResult(scheme='http', netloc='docs.python.org:80',
80+
path='/3/library/urllib.parse.html',
8081
query='highlight=params', fragment='url-parsing')
8182
>>> o.scheme
8283
'http'
@@ -88,42 +89,42 @@ or on combining URL components into a URL string.
8889
80
8990
>>> o._replace(fragment="").geturl()
9091
'http://docs.python.org:80/3/library/urllib.parse.html?highlight=params'
91-
>>> urlparse("http://docs.python.org?")
92-
ParseResult(scheme='http', netloc='docs.python.org',
93-
path='', params='', query='', fragment='')
94-
>>> urlparse("http://docs.python.org?", missing_as_none=True)
95-
ParseResult(scheme='http', netloc='docs.python.org',
96-
path='', params=None, query='', fragment=None)
97-
98-
Following the syntax specifications in :rfc:`1808`, urlparse recognizes
92+
>>> urlsplit("http://docs.python.org?")
93+
SplitResult(scheme='http', netloc='docs.python.org', path='',
94+
query='', fragment='')
95+
>>> urlsplit("http://docs.python.org?", missing_as_none=True)
96+
SplitResult(scheme='http', netloc='docs.python.org', path='',
97+
query='', fragment=None)
98+
99+
Following the syntax specifications in :rfc:`1808`, :func:`!urlsplit` recognizes
99100
a netloc only if it is properly introduced by '//'. Otherwise the
100101
input is presumed to be a relative URL and thus to start with
101102
a path component.
102103

103104
.. doctest::
104105
:options: +NORMALIZE_WHITESPACE
105106

106-
>>> from urllib.parse import urlparse
107-
>>> urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
108-
ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
109-
params='', query='', fragment='')
110-
>>> urlparse('www.cwi.nl/%7Eguido/Python.html')
111-
ParseResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html',
112-
params='', query='', fragment='')
113-
>>> urlparse('help/Python.html')
114-
ParseResult(scheme='', netloc='', path='help/Python.html',
115-
params='', query='', fragment='')
116-
>>> urlparse('help/Python.html', missing_as_none=True)
117-
ParseResult(scheme=None, netloc=None, path='help/Python.html',
118-
params=None, query=None, fragment=None)
107+
>>> from urllib.parse import urlsplit
108+
>>> urlsplit('//www.cwi.nl:80/%7Eguido/Python.html')
109+
SplitResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
110+
query='', fragment='')
111+
>>> urlsplit('www.cwi.nl/%7Eguido/Python.html')
112+
SplitResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html',
113+
query='', fragment='')
114+
>>> urlsplit('help/Python.html')
115+
SplitResult(scheme='', netloc='', path='help/Python.html',
116+
query='', fragment='')
117+
>>> urlsplit('help/Python.html', missing_as_none=True)
118+
SplitResult(scheme=None, netloc=None, path='help/Python.html',
119+
query=None, fragment=None)
119120

120121
The *scheme* argument gives the default addressing scheme, to be
121122
used only if the URL does not specify one. It should be the same type
122123
(text or bytes) as *urlstring* or ``None``, except that the ``''`` is
123124
always allowed, and is automatically converted to ``b''`` if appropriate.
124125

125126
If the *allow_fragments* argument is false, fragment identifiers are not
126-
recognized. Instead, they are parsed as part of the path, parameters
127+
recognized. Instead, they are parsed as part of the path
127128
or query component, and :attr:`fragment` is set to ``None`` or the empty
128129
string (depending on the value of *missing_as_none*) in the return value.
129130

@@ -140,12 +141,9 @@ or on combining URL components into a URL string.
140141
+------------------+-------+-------------------------+-------------------------------+
141142
| :attr:`path` | 2 | Hierarchical path | empty string |
142143
+------------------+-------+-------------------------+-------------------------------+
143-
| :attr:`params` | 3 | Parameters for last | ``None`` or empty string [1]_ |
144-
| | | path element | |
145-
+------------------+-------+-------------------------+-------------------------------+
146-
| :attr:`query` | 4 | Query component | ``None`` or empty string [1]_ |
144+
| :attr:`query` | 3 | Query component | ``None`` or empty string [1]_ |
147145
+------------------+-------+-------------------------+-------------------------------+
148-
| :attr:`fragment` | 5 | Fragment identifier | ``None`` or empty string [1]_ |
146+
| :attr:`fragment` | 4 | Fragment identifier | ``None`` or empty string [1]_ |
149147
+------------------+-------+-------------------------+-------------------------------+
150148
| :attr:`username` | | User name | ``None`` |
151149
+------------------+-------+-------------------------+-------------------------------+
@@ -171,26 +169,30 @@ or on combining URL components into a URL string.
171169
``#``, ``@``, or ``:`` will raise a :exc:`ValueError`. If the URL is
172170
decomposed before parsing, no error will be raised.
173171

172+
Following some of the `WHATWG spec`_ that updates :rfc:`3986`, leading C0
173+
control and space characters are stripped from the URL. ``\n``,
174+
``\r`` and tab ``\t`` characters are removed from the URL at any position.
175+
174176
As is the case with all named tuples, the subclass has a few additional methods
175177
and attributes that are particularly useful. One such method is :meth:`_replace`.
176-
The :meth:`_replace` method will return a new ParseResult object replacing specified
177-
fields with new values.
178+
The :meth:`_replace` method will return a new :class:`SplitResult` object
179+
replacing specified fields with new values.
178180

179181
.. doctest::
180182
:options: +NORMALIZE_WHITESPACE
181183

182-
>>> from urllib.parse import urlparse
183-
>>> u = urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
184+
>>> from urllib.parse import urlsplit
185+
>>> u = urlsplit('//www.cwi.nl:80/%7Eguido/Python.html')
184186
>>> u
185-
ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
186-
params='', query='', fragment='')
187+
SplitResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
188+
query='', fragment='')
187189
>>> u._replace(scheme='http')
188-
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
189-
params='', query='', fragment='')
190+
SplitResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
191+
query='', fragment='')
190192

191193
.. warning::
192194

193-
:func:`urlparse` does not perform validation. See :ref:`URL parsing
195+
:func:`urlsplit` does not perform validation. See :ref:`URL parsing
194196
security <url-parsing-security>` for details.
195197

196198
.. versionchanged:: 3.2
@@ -209,9 +211,17 @@ or on combining URL components into a URL string.
209211
Characters that affect netloc parsing under NFKC normalization will
210212
now raise :exc:`ValueError`.
211213

214+
.. versionchanged:: 3.10
215+
ASCII newline and tab characters are stripped from the URL.
216+
217+
.. versionchanged:: 3.12
218+
Leading WHATWG C0 control and space characters are stripped from the URL.
219+
212220
.. versionchanged:: next
213221
Added the *missing_as_none* parameter.
214222

223+
.. _WHATWG spec: https://url.spec.whatwg.org/#concept-basic-url-parser
224+
215225

216226
.. function:: parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace', max_num_fields=None, separator='&')
217227

@@ -306,11 +316,11 @@ or on combining URL components into a URL string.
306316
separator key, with ``&`` as the default separator.
307317

308318

309-
.. function:: urlunparse(parts)
310-
urlunparse(parts, *, keep_empty)
319+
.. function:: urlunsplit(parts)
320+
urlunsplit(parts, *, keep_empty)
311321
312-
Construct a URL from a tuple as returned by ``urlparse()``. The *parts*
313-
argument can be any six-item iterable.
322+
Construct a URL from a tuple as returned by :func:`urlsplit`. The *parts*
323+
argument can be any five-item iterable.
314324

315325
This may result in a slightly different, but equivalent URL, if the
316326
URL that was parsed originally had unnecessary delimiters (for example,
@@ -321,97 +331,33 @@ or on combining URL components into a URL string.
321331
This allows rebuilding a URL that was parsed with option
322332
``missing_as_none=True``.
323333
By default, *keep_empty* is true if *parts* is the result of the
324-
:func:`urlparse` call with ``missing_as_none=True``.
334+
:func:`urlsplit` call with ``missing_as_none=True``.
325335

326336
.. versionchanged:: next
327337
Added the *keep_empty* parameter.
328338

329339

330-
.. function:: urlsplit(urlstring, scheme=None, allow_fragments=True, *, missing_as_none=False)
331-
332-
This is similar to :func:`urlparse`, but does not split the params from the URL.
333-
This should generally be used instead of :func:`urlparse` if the more recent URL
334-
syntax allowing parameters to be applied to each segment of the *path* portion
335-
of the URL (see :rfc:`2396`) is wanted. A separate function is needed to
336-
separate the path segments and parameters. This function returns a 5-item
337-
:term:`named tuple`::
338-
339-
(addressing scheme, network location, path, query, fragment identifier).
340-
341-
The return value is a :term:`named tuple`, its items can be accessed by index
342-
or as named attributes:
343-
344-
+------------------+-------+-------------------------+-------------------------------+
345-
| Attribute | Index | Value | Value if not present |
346-
+==================+=======+=========================+===============================+
347-
| :attr:`scheme` | 0 | URL scheme specifier | *scheme* parameter or |
348-
| | | | empty string [1]_ |
349-
+------------------+-------+-------------------------+-------------------------------+
350-
| :attr:`netloc` | 1 | Network location part | ``None`` or empty string [2]_ |
351-
+------------------+-------+-------------------------+-------------------------------+
352-
| :attr:`path` | 2 | Hierarchical path | empty string |
353-
+------------------+-------+-------------------------+-------------------------------+
354-
| :attr:`query` | 3 | Query component | ``None`` or empty string [2]_ |
355-
+------------------+-------+-------------------------+-------------------------------+
356-
| :attr:`fragment` | 4 | Fragment identifier | ``None`` or empty string [2]_ |
357-
+------------------+-------+-------------------------+-------------------------------+
358-
| :attr:`username` | | User name | ``None`` |
359-
+------------------+-------+-------------------------+-------------------------------+
360-
| :attr:`password` | | Password | ``None`` |
361-
+------------------+-------+-------------------------+-------------------------------+
362-
| :attr:`hostname` | | Host name (lower case) | ``None`` |
363-
+------------------+-------+-------------------------+-------------------------------+
364-
| :attr:`port` | | Port number as integer, | ``None`` |
365-
| | | if present | |
366-
+------------------+-------+-------------------------+-------------------------------+
367-
368-
.. [2] Depending on the value of the *missing_as_none* argument.
369-
370-
Reading the :attr:`port` attribute will raise a :exc:`ValueError` if
371-
an invalid port is specified in the URL. See section
372-
:ref:`urlparse-result-object` for more information on the result object.
373-
374-
Unmatched square brackets in the :attr:`netloc` attribute will raise a
375-
:exc:`ValueError`.
376-
377-
Characters in the :attr:`netloc` attribute that decompose under NFKC
378-
normalization (as used by the IDNA encoding) into any of ``/``, ``?``,
379-
``#``, ``@``, or ``:`` will raise a :exc:`ValueError`. If the URL is
380-
decomposed before parsing, no error will be raised.
381-
382-
Following some of the `WHATWG spec`_ that updates RFC 3986, leading C0
383-
control and space characters are stripped from the URL. ``\n``,
384-
``\r`` and tab ``\t`` characters are removed from the URL at any position.
385-
386-
.. warning::
387-
388-
:func:`urlsplit` does not perform validation. See :ref:`URL parsing
389-
security <url-parsing-security>` for details.
390-
391-
.. versionchanged:: 3.6
392-
Out-of-range port numbers now raise :exc:`ValueError`, instead of
393-
returning ``None``.
394-
395-
.. versionchanged:: 3.8
396-
Characters that affect netloc parsing under NFKC normalization will
397-
now raise :exc:`ValueError`.
398-
399-
.. versionchanged:: 3.10
400-
ASCII newline and tab characters are stripped from the URL.
401-
402-
.. versionchanged:: 3.12
403-
Leading WHATWG C0 control and space characters are stripped from the URL.
340+
.. function:: urlparse(urlstring, scheme=None, allow_fragments=True, *, missing_as_none=False)
404341

405-
.. versionchanged:: next
406-
Added the *missing_as_none* parameter.
342+
This is similar to :func:`urlsplit`, but additionally splits the *path*
343+
component on *path* and *params*.
344+
This function returns a 6-item :term:`named tuple` :class:`ParseResult`
345+
or :class:`ParseResultBytes`.
346+
Its items are the same as for the :func:`!urlsplit` result, except that
347+
*params* is inserted at index 3, between *path* and *query*.
407348

408-
.. _WHATWG spec: https://url.spec.whatwg.org/#concept-basic-url-parser
349+
This function is based on obsoleted :rfc:`1738` and :rfc:`1808`, which
350+
listed *params* as the main URL component.
351+
The more recent URL syntax allows parameters to be applied to each segment
352+
of the *path* portion of the URL (see :rfc:`3986`).
353+
:func:`urlsplit` should generally be used instead of :func:`urlparse`.
354+
A separate function is needed to separate the path segments and parameters.
409355

410-
.. function:: urlunsplit(parts)
411-
urlunsplit(parts, *, keep_empty)
356+
.. function:: urlunparse(parts)
357+
urlunparse(parts, *, keep_empty)
412358
413-
Combine the elements of a tuple as returned by :func:`urlsplit` into a
414-
complete URL as a string. The *parts* argument can be any five-item
359+
Combine the elements of a tuple as returned by :func:`urlparse` into a
360+
complete URL as a string. The *parts* argument can be any six-item
415361
iterable.
416362

417363
This may result in a slightly different, but equivalent URL, if the
@@ -423,7 +369,7 @@ or on combining URL components into a URL string.
423369
This allows rebuilding a URL that was parsed with option
424370
``missing_as_none=True``.
425371
By default, *keep_empty* is true if *parts* is the result of the
426-
:func:`urlsplit` call with ``missing_as_none=True``.
372+
:func:`urlparse` call with ``missing_as_none=True``.
427373

428374
.. versionchanged:: next
429375
Added the *keep_empty* parameter.
@@ -441,7 +387,7 @@ or on combining URL components into a URL string.
441387
'http://www.cwi.nl/%7Eguido/FAQ.html'
442388

443389
The *allow_fragments* argument has the same meaning and default as for
444-
:func:`urlparse`.
390+
:func:`urlsplit`.
445391

446392
.. note::
447393

@@ -587,7 +533,7 @@ individual URL quoting functions.
587533
Structured Parse Results
588534
------------------------
589535

590-
The result objects from the :func:`urlparse`, :func:`urlsplit` and
536+
The result objects from the :func:`urlsplit`, :func:`urlparse` and
591537
:func:`urldefrag` functions are subclasses of the :class:`tuple` type.
592538
These subclasses add the attributes listed in the documentation for
593539
those functions, the encoding and decoding support described in the
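
Illustration (not part of the commit): a minimal sketch of how the two functions reordered by this change differ on a URL that carries path parameters. The URL and the helper ``split_segment_params`` are hypothetical and exist only for this example. The helper shows one possible way to do the per-segment split that the new text says needs a separate function. It is not an API of ``urllib.parse``.

```python
from urllib.parse import urlparse, urlsplit, urlunsplit

# Hypothetical URL with parameters attached to two different path segments.
URL = "http://example.com/a;x=1/b;y=2?q=1#frag"


def split_segment_params(path):
    """Hypothetical helper: return (segment, params) pairs for a path.

    Each segment is split at its first ';'. Segments without parameters
    get an empty params string.
    """
    pairs = []
    for segment in path.split("/"):
        name, _, params = segment.partition(";")
        pairs.append((name, params))
    return pairs


split = urlsplit(URL)    # 5-item SplitResult, path left intact
parsed = urlparse(URL)   # 6-item ParseResult, params cut from the last segment only

print(split.path)                        # /a;x=1/b;y=2
print(parsed.path, parsed.params)        # /a;x=1/b y=2
print(split_segment_params(split.path))  # [('', ''), ('a', 'x=1'), ('b', 'y=2')]
print(urlunsplit(split) == URL)          # True
```

Note that ``urlparse()`` only separates the parameters of the last path segment, so ``;x=1`` stays inside ``path``. This is one reason the updated text recommends ``urlsplit()`` together with explicit per-segment handling.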

Doc/library/venv.rst

Lines changed: 2 additions & 2 deletions
@@ -550,7 +550,7 @@ subclass which installs setuptools and pip into a created virtual environment::
550550
from subprocess import Popen, PIPE
551551
import sys
552552
from threading import Thread
553-
from urllib.parse import urlparse
553+
from urllib.parse import urlsplit
554554
from urllib.request import urlretrieve
555555
import venv
556556

@@ -621,7 +621,7 @@ subclass which installs setuptools and pip into a created virtual environment::
621621
stream.close()
622622

623623
def install_script(self, context, name, url):
624-
_, _, path, _, _, _ = urlparse(url)
624+
_, _, path, _, _ = urlsplit(url)
625625
fn = os.path.split(path)[-1]
626626
binpath = context.bin_path
627627
distpath = os.path.join(binpath, fn)
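
Illustration (not part of the commit): a minimal sketch of the unpacking change in the hunk above, pulled out into a standalone function. The function name ``script_filename`` and the example URL are assumptions made for this sketch. The venv example itself keeps this logic inside ``install_script``.

```python
import os
from urllib.parse import urlsplit


def script_filename(url):
    """Return the last path component of *url* (hypothetical helper).

    urlsplit() returns five items, so the unpacking uses five names;
    the old urlparse() call needed six.
    """
    _, _, path, _, _ = urlsplit(url)
    return os.path.split(path)[-1]


# Hypothetical download URL, used only to exercise the helper.
print(script_filename("https://example.com/downloads/get-pip.py"))  # get-pip.py
```

Accessing ``urlsplit(url).path`` by name would be an equivalent alternative that does not depend on the tuple length.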

Doc/whatsnew/3.15.rst

Lines changed: 4 additions & 4 deletions
@@ -968,10 +968,10 @@ unittest
968968
urllib.parse
969969
------------
970970

971-
* Add the *missing_as_none* parameter to :func:`~urllib.parse.urlparse`,
972-
:func:`~urllib.parse.urlsplit` and :func:`~urllib.parse.urldefrag` functions.
973-
Add the *keep_empty* parameter to :func:`~urllib.parse.urlunparse` and
974-
:func:`~urllib.parse.urlunsplit` functions.
971+
* Add the *missing_as_none* parameter to :func:`~urllib.parse.urlsplit`,
972+
:func:`~urllib.parse.urlparse` and :func:`~urllib.parse.urldefrag` functions.
973+
Add the *keep_empty* parameter to :func:`~urllib.parse.urlunsplit` and
974+
:func:`~urllib.parse.urlunparse` functions.
975975
This allows to distinguish between empty and not defined URI components
976976
and preserve empty components.
977977
(Contributed by Serhiy Storchaka in :gh:`67041`.)
