6 changes: 6 additions & 0 deletions Doc/deprecations/pending-removal-in-3.17.rst
@@ -23,6 +23,12 @@ Pending removal in Python 3.17
(Contributed by Shantanu Jain in :gh:`91896`.)


* :mod:`encodings`:

- Passing non-ascii *encoding* names to :func:`encodings.normalize_encoding`
is deprecated and scheduled for removal in Python 3.17.
(Contributed by Stan Ulbrych in :gh:`136702`)

* :mod:`typing`:

- Before Python 3.14, old-style unions were implemented using the private class
3 changes: 2 additions & 1 deletion Lib/email/_header_value_parser.py
@@ -796,13 +796,14 @@ def params(self):
value = urllib.parse.unquote(value, encoding='latin-1')
else:
try:
charset = utils._sanitize_charset_name(charset, 'ascii')
value = value.decode(charset, 'surrogateescape')
except (LookupError, UnicodeEncodeError):
# XXX: there should really be a custom defect for
# unknown character set to make it easy to find,
# because otherwise unknown charset is a silent
# failure.
- value = value.decode('us-ascii', 'surrogateescape')
+ value = value.decode('ascii', 'surrogateescape')
if utils._has_surrogates(value):
param.defects.append(errors.UndecodableBytesDefect())
value_parts.append(value)
11 changes: 10 additions & 1 deletion Lib/email/utils.py
@@ -446,8 +446,16 @@ def decode_params(params):
new_params.append((name, '"%s"' % value))
return new_params

_SANITIZE_TABLE = str.maketrans({i: None for i in range(128, 65536)})

def _sanitize_charset_name(charset, fallback_charset):
    if not charset:
        return charset
    sanitized = charset.translate(_SANITIZE_TABLE)
    return sanitized if sanitized else fallback_charset

Member

What is the trigger for this change? Do I actually have a test that uses a non-ascii charset name? If I did it should be an error case, since non-ascii is not permitted in charset names per the RFCs. I'm surprised I don't appear to be registering a defect for that, though I didn't go through the code enough to be sure I don't ;)

Regardless, it isn't clear to me that 'sanitizing' is a useful operation. It isn't likely to produce a valid charset name; we should just be falling back to ascii at that point. What led you to choose this approach?

Member Author

This is currently done by normalize_encoding.
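
For illustration, the stripping under discussion can be sketched as a standalone snippet. The table and function are copied from the helper proposed in the diff above, renamed without the leading underscore so the snippet does not suggest an existing `email.utils` API; the sample charset names are made up for the demo.

```python
# Standalone copy of the helper proposed in this PR (an assumption for the
# demo: it is not an existing public email.utils function).
SANITIZE_TABLE = str.maketrans({i: None for i in range(128, 65536)})


def sanitize_charset_name(charset, fallback_charset):
    # Empty or None charset names are passed through unchanged.
    if not charset:
        return charset
    # Drop every character whose code point is in the range 128..65535.
    sanitized = charset.translate(SANITIZE_TABLE)
    # If nothing is left after stripping, use the caller-supplied fallback.
    return sanitized if sanitized else fallback_charset


# Made-up sample names: plain ASCII, mixed, only non-ASCII, and empty.
for name in ("utf-8", "utf\xe9-8", "\xe9\u20ac", ""):
    print(repr(name), "->", repr(sanitize_charset_name(name, "ascii")))
```

The mixed name keeps only its ASCII characters, while a name made up entirely of non-ASCII characters collapses to the fallback.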

Member

@bitdancer Nov 1, 2025

OK. email doesn't call lookup directly and no tests fail without the changes.

I presume you did this to preserve backward compatibility. Unless I'm missing something, I don't think we should bother to do that. Given a non-ascii charset name, there are two possible outcomes from the current code: the name after sanitizing is not a valid codec name, or it is. If it is valid after sanitizing, there are two cases: the sanitized name results in successful decoding, or it does not. It is only the first of these second two cases that would be affected by the post-deprecation change.

How often would that case occur in reality? I would guess it would be a vanishingly small number of cases, if it ever occurs at all.

I think it will be better to remove the changes to the email package from this PR. If anyone sees the deprecation warning maybe they'll open an issue, but I'm betting nobody ever sees it from the email package. The behavior after the deprecation is over is the behavior we want: if the codec name contains non-ascii it is not a valid codec name, so any non-ascii in the text being decoded using that charset name will ultimately get turned into the 'unknown character' glyph when decoded by the email package.
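
The case analysis in this comment can be sketched roughly as follows. This is a simplified model, not code from the email package: `classify` is a hypothetical helper, and stripping is approximated by filtering with `str.isascii()` rather than by calling `encodings.normalize_encoding`.

```python
import codecs


def classify(charset, payload):
    """Classify a charset name per the case analysis (hypothetical helper)."""
    if charset.isascii():
        return "unaffected"
    # Approximate the current behaviour: drop the non-ASCII characters.
    stripped = "".join(ch for ch in charset if ch.isascii())
    if not stripped:
        return "not-a-codec"
    try:
        codecs.lookup(stripped)
    except LookupError:
        # Stripped name is unknown: the caller falls back either way.
        return "not-a-codec"
    try:
        payload.decode(stripped)
    except (LookupError, UnicodeDecodeError):
        # Stripped name is a codec, but it cannot decode this payload.
        return "decode-fails"
    # The only case whose outcome changes once non-ASCII names are rejected.
    return "affected"


print(classify("utf-8", b"abc"))
print(classify("nosuch\xe9codec", b"abc"))
print(classify("asc\xe9ii", b"\xff"))
print(classify("utf\xe9-8", b"caf\xc3\xa9"))
```

Only the last call lands in the "affected" bucket: a non-ASCII name that happens to strip down to a real codec which also decodes the bytes successfully.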

Member Author

@StanFromIreland Nov 1, 2025

> I presume you did this to preserve backward compatibility.

Yes, I'm no email expert and I did not dig into the specifications, so I did this to not change any behaviour. I can remove it.

Member

What's the conclusion here? I still see the email package changes in place, but they look pretty harmless to me.

def collapse_rfc2231_value(value, errors='replace',
- fallback_charset='us-ascii'):
+ fallback_charset='ascii'):
if not isinstance(value, tuple) or len(value) != 3:
return unquote(value)
# While value comes to us as a unicode string, we need it to be a bytes
@@ -458,6 +466,7 @@ def collapse_rfc2231_value(value, errors='replace',
# Issue 17369: if charset/lang is None, decode_rfc2231 couldn't parse
# the value, so use the fallback_charset.
charset = fallback_charset
charset = _sanitize_charset_name(charset, fallback_charset)
rawbytes = bytes(text, 'raw-unicode-escape')
try:
return str(rawbytes, charset, errors)
8 changes: 7 additions & 1 deletion Lib/encodings/__init__.py
@@ -26,7 +26,7 @@
(c) Copyright CNRI, All Rights Reserved. NO WARRANTY.
"""#"
"""

import codecs
import sys
@@ -56,6 +56,12 @@ def normalize_encoding(encoding):
    if isinstance(encoding, bytes):
        encoding = str(encoding, "ascii")

    if not encoding.isascii():
        import warnings
        warnings.warn(
            "Support for non-ascii encoding names will be removed in 3.17",
            DeprecationWarning, stacklevel=2)

    return _normalize_encoding(encoding)

def search_function(encoding):
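
As a rough sketch of how a caller might observe the new warning, the snippet below uses a stand-in function that mirrors only the check added in this hunk. `normalize_with_warning` and `collect_warnings` are hypothetical names; the stand-in returns the name unchanged instead of normalizing it, so the demo does not depend on which Python version runs it.

```python
import warnings


def normalize_with_warning(encoding):
    """Stand-in mirroring the check added in the diff (hypothetical name)."""
    if isinstance(encoding, bytes):
        encoding = str(encoding, "ascii")
    if not encoding.isascii():
        warnings.warn(
            "Support for non-ascii encoding names will be removed in 3.17",
            DeprecationWarning, stacklevel=2)
    # The real function would normalize here; the stand-in does not.
    return encoding


def collect_warnings(name):
    """Call the stand-in and return the categories of any warnings emitted."""
    with warnings.catch_warnings(record=True) as caught:
        # Make sure the warning is recorded even if it was shown before.
        warnings.simplefilter("always")
        normalize_with_warning(name)
    return [w.category for w in caught]


print(collect_warnings("utf-8"))
print(collect_warnings("utf\xe9-8"))
```

Callers that want to avoid the warning altogether can check `name.isascii()` themselves before passing the name on.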
10 changes: 7 additions & 3 deletions Lib/test/test_codecs.py
@@ -3886,22 +3886,26 @@ def search_function(encoding):
self.assertEqual(codecs.lookup('TEST.AAA 8'), ('test.aaa-8', 2, 3, 4))
self.assertEqual(codecs.lookup('TEST.AAA---8'), ('test.aaa---8', 2, 3, 4))
self.assertEqual(codecs.lookup('TEST.AAA 8'), ('test.aaa---8', 2, 3, 4))
self.assertEqual(codecs.lookup('TEST.AAA\xe9\u20ac-8'), ('test.aaa\xe9\u20ac-8', 2, 3, 4))
self.assertEqual(codecs.lookup('TEST.AAA.8'), ('test.aaa.8', 2, 3, 4))
self.assertEqual(codecs.lookup('TEST.AAA...8'), ('test.aaa...8', 2, 3, 4))
with self.assertWarns(DeprecationWarning):
    self.assertEqual(codecs.lookup('TEST.AAA\xe9\u20ac-8'), ('test.aaa\xe9\u20ac-8', 2, 3, 4))

def test_encodings_normalize_encoding(self):
# encodings.normalize_encoding() ignores non-ASCII characters.
normalize = encodings.normalize_encoding
self.assertEqual(normalize('utf_8'), 'utf_8')
self.assertEqual(normalize('utf\xE9\u20AC\U0010ffff-8'), 'utf_8')
self.assertEqual(normalize('utf 8'), 'utf_8')
# encodings.normalize_encoding() doesn't convert
# characters to lower case.
self.assertEqual(normalize('UTF 8'), 'UTF_8')
self.assertEqual(normalize('utf.8'), 'utf.8')
self.assertEqual(normalize('utf...8'), 'utf...8')

# Non-ASCII *encoding* is deprecated.
with self.assertWarnsRegex(DeprecationWarning,
        "Support for non-ascii encoding names will be removed in 3.17"):
    self.assertEqual(normalize('utf\xE9\u20AC\U0010ffff-8'), 'utf_8')


if __name__ == "__main__":
unittest.main()
@@ -0,0 +1,3 @@
:mod:`encodings`: Deprecate passing a non-ascii *encoding* name to
:func:`encodings.normalize_encoding` and schedule removal of support for
Python 3.17.