@@ -516,36 +516,21 @@ characters:
516516Non-ASCII characters in names
517517-----------------------------
518518
519- Python identifiers may contain all sorts of characters.
520- For example, `` ř_1 ``, `` 蛇 ``, or `` साँप `` are valid identifiers.
521- However, `` r〰2 ``, `` € ``, or `` 🐍 `` are not .
522- Additionally, some variations are considered equivalent: for example ,
523- ``fi `` (2 letters) and `` fi `` (1 ligature) .
519+ Names that contain non-ASCII characters need additional normalization
520+ and validation beyond the rules and grammar explained
521+ :ref:`above <identifiers>`.
522+ For example, ``ř_1``, ``蛇``, or ``साँप`` are valid names, but ``r〰2``,
523+ ``€``, or ``🐍`` are not.
524524
525-
526- A :ref: `name token <identifiers >` that only contains ASCII characters
527- (``A-Z ``, ``a-z ``, ``_ `` and ``0-9 ``) is always valid, and distinct from
528- different ASCII-only names.
529- The rules are somewhat more complicated when using non-ASCII characters.
530-
531- Informally, all names must be composed of letters, digits, numbers and
532- underscores, and cannot start with a digit.
533-
534-
535-
536-
537- Besides ``A-Z ``, ``a-z ``, ``_ `` and ``0-9 ``, names can use characters
538- from outside the ASCII range.
539-
540- , as detailed in this section.
525+ This section explains the exact rules.
541526
542527All names are converted into the `normalization form `_ NFKC while parsing.
543528This means that, for example, some typographic variants of characters are
544- converted to their "basic" form. For example, ``nᵘₘᵇₑʳ `` normalizes to
545- ``number ``, so Python treats them as the same name::
529+ converted to their "basic" form. For example, ``fiⁿₐˡᵢᶻₐᵗᵢᵒₙ `` normalizes to
530+ ``finalization ``, so Python treats them as the same name::
546531
547- >>> nᵘₘᵇₑʳ = 3
548- >>> number
532+ >>> fiⁿₐˡᵢᶻₐᵗᵢᵒₙ = 3
533+ >>> finalization
549534 3
550535
551536.. note ::
@@ -554,15 +539,26 @@ converted to their "basic" form. For example, ``nᵘₘᵇₑʳ`` normalizes to
554539 Run-time functions that take names as *strings * generally do not normalize
555540 their arguments.
556541 For example, the variable defined above is accessible at run time in the
557- :func: `globals ` dictionary as ``globals()["number"] `` but not
558- ``globals()["nᵘₘᵇₑʳ"] ``.
542+ :func: `globals ` dictionary as ``globals()["finalization"] `` but not
543+ ``globals()["fiⁿₐˡᵢᶻₐᵗᵢᵒₙ"] ``.
544+
545+ Similarly to how ASCII-only names must contain only letters, digits and
546+ the underscore, and cannot start with a digit, a valid name must
547+ start with a character in the "letter-like" set ``xid_start``,
548+ and the remaining characters must be in the "letter- and digit-like" set
549+ ``xid_continue``.
550+
551+ These sets are based on the *XID_Start* and *XID_Continue* sets as defined by the
552+ Unicode standard annex `UAX-31`_.
553+ Python's ``xid_start`` additionally includes the underscore (``_``).
554+ Note that Python does not necessarily conform to `UAX-31`_.
559555
560- Similarly to how ASCII-only names must contain only letters, numbers and
561- the underscore, and cannot start with a digit, the normalized name must
556+ A non-normative listing of characters in the *XID_Start* and *XID_Continue*
557+ sets as defined by Unicode is available in the `DerivedCoreProperties.txt`_
558+ file in the Unicode Character Database.
559+ For reference, the construction rules for the ``xid_*`` sets are given below.
562560
563- The first character of a normalized identifier must be "letter-like".
564- Formally, this means it must belong to the set ``id_start ``,
565- which is defined as the union of:
561+ The set ``id_start`` is defined as the union of:
566562
567563* Unicode category ``<Lu> `` - uppercase letters (includes ``A `` to ``Z ``)
568564* Unicode category ``<Ll> `` - lowercase letters (includes ``a `` to ``z ``)
@@ -574,9 +570,11 @@ which is defined as the union of:
574570* ``<Other_ID_Start> `` - an explicit set of characters in `PropList.txt `_
575571 to support backwards compatibility
576572
577- The remaining characters must be "letter-like" or "digit-like".
578- Formally, they must belong to the set ``id_continue ``, which is defined as
579- the union of:
573+ The set ``xid_start`` then closes this set under NFKC normalization, by
574+ removing all characters whose normalization is not of the form
575+ ``id_start id_continue*``.
576+
577+ The set ``id_continue`` is defined as the union of:
580578
581579* ``id_start `` (see above)
582580* Unicode category ``<Nd> `` - decimal numbers (includes ``0 `` to ``9 ``)
@@ -586,25 +584,21 @@ the union of:
586584* ``<Other_ID_Continue> `` - another explicit set of characters in
587585 `PropList.txt `_ to support backwards compatibility
588586
587+ Again, ``xid_continue`` closes this set under NFKC normalization.
588+
589589Unicode categories use the version of the Unicode Character Database as
590590included in the :mod: `unicodedata ` module.
591591
592- The ``id_start `` and ``id_continue `` sets are based on the Unicode standard
593- annex `UAX-31 `_. See also :pep: `3131 ` for further details.
594- Note that Python does not necessarily conform to `UAX-31 `_.
595-
596- A non-normative listing of all valid identifier characters as defined by
597- Unicode is available in the `DerivedCoreProperties.txt `_ file in the Unicode
598- Character Database.
599- The properties *ID_Start * and *ID_Continue * are very similar to Python's
600- ``id_start `` and ``id_continue `` sets; the properties *XID_Start * and
601- *XID_Continue * play similar roles for identifiers before NFKC normalization.
602-
603592.. _UAX-31: https://www.unicode.org/reports/tr31/
604593.. _PropList.txt: https://www.unicode.org/Public/17.0.0/ucd/PropList.txt
605594.. _DerivedCoreProperties.txt: https://www.unicode.org/Public/17.0.0/ucd/DerivedCoreProperties.txt
606595.. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms
607596
597+ .. seealso::
598+
599+    * :pep:`3131` -- Supporting Non-ASCII Identifiers
600+    * :pep:`672` -- Unicode-related Security Considerations for Python
601+
608602
609603.. _literals :
610604
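
The behavior described in the changed text can be checked with a short script. This is a minimal sketch, not part of the patch. The helpers `normalize_name` and `looks_like_name` are hypothetical names introduced here for illustration. The sketch assumes that `str.isidentifier()` applies the same start/continue character sets the text describes, and it does not check for reserved keywords.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Return the NFKC form of a name, the form the text says the parser uses."""
    return unicodedata.normalize("NFKC", name)


def looks_like_name(name: str) -> bool:
    """Hypothetical helper: normalize first, then test with str.isidentifier().

    Keywords such as ``if`` also pass this test; they are not filtered here.
    """
    return normalize_name(name).isidentifier()


# U+FB01 is the one-character "fi" ligature; NFKC expands it to two letters.
ligature_name = "\ufb01nal"
print(normalize_name(ligature_name) == "final")

# Example names taken from the documentation text in the diff.
for candidate in ["ř_1", "蛇", "साँप", "r〰2", "€", "🐍"]:
    print(repr(candidate), looks_like_name(candidate))

# Names in source code are normalized at compile time, but string keys used
# at run time are not, so only the normalized key is found in the namespace.
namespace = {}
exec(ligature_name + " = 3", namespace)
print("final" in namespace, ligature_name in namespace)
```

The last three lines mirror the note about `globals()`: the assignment is stored under the normalized key, so a lookup with the unnormalized string fails.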