Commit 43f6091

Reword to use XID_Start and XID_Continue
1 parent 2e7f7c0 commit 43f6091

1 file changed

+40
-46
lines changed

Doc/reference/lexical_analysis.rst

Lines changed: 40 additions & 46 deletions
@@ -516,36 +516,21 @@ characters:
516516
Non-ASCII characters in names
517517
-----------------------------
518518

519-
Python identifiers may contain all sorts of characters.
520-
For example, ``ř_1``, ````, or ``साँप`` are valid identifiers.
521-
However, ``r〰2``, ````, or ``🐍`` are not.
522-
Additionally, some variations are considered equivalent: for example,
523-
``fi`` (2 letters) and ``ﬁ`` (1 ligature).
519+
Names that contain non-ASCII characters need additional normalization
520+
and validation beyond the rules and grammar explained
521+
:ref:`above <identifiers>`.
522+
For example, ``ř_1``, ````, or ``साँप`` are valid names, but ``r〰2``,
523+
````, or ``🐍`` are not.
524524

525-
526-
A :ref:`name token <identifiers>` that only contains ASCII characters
527-
(``A-Z``, ``a-z``, ``_`` and ``0-9``) is always valid, and distinct from
528-
different ASCII-only names.
529-
The rules are somewhat more complicated when using non-ASCII characters.
530-
531-
Informally, all names must be composed of letters, digits, numbers and
532-
underscores, and cannot start with a digit.
533-
534-
535-
536-
537-
Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can use characters
538-
from outside the ASCII range.
539-
540-
, as detailed in this section.
525+
This section explains the exact rules.
541526

542527
All names are converted into the `normalization form`_ NFKC while parsing.
543528
This means that, for example, some typographic variants of characters are
544-
converted to their "basic" form. For example, ``nᵘₘᵇₑʳ`` normalizes to
545-
``number``, so Python treats them as the same name::
529+
converted to their "basic" form. For example, ``fiⁿₐˡᵢᶻₐᵗᵢᵒₙ`` normalizes to
530+
``finalization``, so Python treats them as the same name::
546531

547-
>>> nᵘₘᵇₑʳ = 3
548-
>>> number
532+
>>> fiⁿₐˡᵢᶻₐᵗᵢᵒₙ = 3
533+
>>> finalization
549534
3
550535

551536
.. note::
@@ -554,15 +539,26 @@ converted to their "basic" form. For example, ``nᵘₘᵇₑʳ`` normalizes to
554539
Run-time functions that take names as *strings* generally do not normalize
555540
their arguments.
556541
For example, the variable defined above is accessible at run time in the
557-
:func:`globals` dictionary as ``globals()["number"]`` but not
558-
``globals()["nᵘₘᵇₑʳ"]``.
542+
:func:`globals` dictionary as ``globals()["finalization"]`` but not
543+
``globals()["fiⁿₐˡᵢᶻₐᵗᵢᵒₙ"]``.
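The difference between parse-time normalization and run-time string lookup can be checked with a short sketch. The name ``ﬁle`` (spelled with the U+FB01 ligature) is a hypothetical example chosen here for illustration; it is not taken from the text above.

```python
import unicodedata

# Hypothetical name: "file" spelled with the single-character ligature U+FB01.
fancy = "\ufb01le"
normalized = unicodedata.normalize("NFKC", fancy)
print(normalized)

# Compile an assignment that uses the un-normalized spelling, then inspect
# which key actually ends up in the namespace dictionary.
namespace = {}
exec(f"{fancy} = 3", namespace)
print("file" in namespace)   # the parser stored the NFKC form
print(fancy in namespace)    # the raw spelling is not a key
```

String-based lookups such as ``namespace[fancy]`` therefore fail unless the caller normalizes the string first, for instance with ``unicodedata.normalize("NFKC", ...)``.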
544+
545+
Similarly to how ASCII-only names must contain only letters, digits and
546+
the underscore, and cannot start with a digit, a valid name must
547+
start with a character in the "letter-like" set ``xid_start``,
548+
and the remaining characters must be in the "letter- and digit-like" set
549+
``xid_continue``.
550+
551+
These sets are based on the *XID_Start* and *XID_Continue* sets as defined by the
552+
Unicode standard annex `UAX-31`_.
553+
Python's ``xid_start`` additionally includes the underscore (``_``).
554+
Note that Python does not necessarily conform to `UAX-31`_.
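As a rough way to try these rules out, :meth:`str.isidentifier` reports whether a string is a valid name. The sketch below applies it to some of the examples from this section, plus two ASCII names (``_private`` and ``1st``) added here purely for illustration.

```python
# Candidate names: the first three should be accepted, the rest rejected.
candidates = ["ř_1", "साँप", "_private", "r〰2", "🐍", "1st"]

# Map each candidate to whether Python accepts it as a name.
results = {name: name.isidentifier() for name in candidates}
for name, ok in results.items():
    print(repr(name), ok)
```

Note that ``str.isidentifier`` also returns true for keywords such as ``class``; :func:`keyword.iskeyword` can be used to rule those out.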
559555

560-
Similarly to how ASCII-only names must contain only letters, numbers and
561-
the underscore, and cannot start with a digit, the normalized name must
556+
A non-normative listing of characters in the *XID_Start* and *XID_Continue*
557+
sets as defined by Unicode is available in the `DerivedCoreProperties.txt`_
558+
file in the Unicode Character Database.
559+
For reference, the construction rules for the ``xid_*`` sets are given below.
562560

563-
The first character of a normalized identifier must be "letter-like".
564-
Formally, this means it must belong to the set ``id_start``,
565-
which is defined as the union of:
561+
The set ``id_start`` is defined as the union of:
566562

567563
* Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``)
568564
* Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``)
@@ -574,9 +570,11 @@ which is defined as the union of:
574570
* ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_
575571
to support backwards compatibility
576572

577-
The remaining characters must be "letter-like" or "digit-like".
578-
Formally, they must belong to the set ``id_continue``, which is defined as
579-
the union of:
573+
The set ``xid_start`` then closes this set under NFKC normalization, by
574+
removing all characters whose normalization is not of the form
575+
``id_start id_continue*``.
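The effect of this closure can be seen on a single character. The sketch below uses U+037A (GREEK YPOGEGRAMMENI) as an illustrative case, chosen here rather than taken from the text: its general category places it among the letters, yet its NFKC form begins with a space, so it cannot start a name.

```python
import unicodedata

ch = "\u037a"  # GREEK YPOGEGRAMMENI, chosen for illustration

# General category of the character itself.
print(unicodedata.category(ch))

# NFKC turns it into a space followed by a combining mark, which is not
# of the form ``id_start id_continue*``.
normalized = unicodedata.normalize("NFKC", ch)
print([f"U+{ord(c):04X}" for c in normalized])

# As a result, the character is rejected as the start of a name.
print(ch.isidentifier())
```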
576+
577+
The set ``id_continue`` is defined as the union of:
580578

581579
* ``id_start`` (see above)
582580
* Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``)
@@ -586,25 +584,21 @@ the union of:
586584
* ``<Other_ID_Continue>`` - another explicit set of characters in
587585
`PropList.txt`_ to support backwards compatibility
588586

587+
Again, ``xid_continue`` closes this set under NFKC normalization.
588+
589589
Unicode categories use the version of the Unicode Character Database as
590590
included in the :mod:`unicodedata` module.
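To see which database version a given interpreter uses, and which category it assigns to a character, the :mod:`unicodedata` module can be queried directly. This is a small sketch; the four sample characters are chosen here for illustration.

```python
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter.
print(unicodedata.unidata_version)

# General categories of a few ASCII characters mentioned in this section.
categories = {ch: unicodedata.category(ch) for ch in "Aa_0"}
for ch, cat in categories.items():
    print(repr(ch), cat)
```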
591591

592-
The ``id_start`` and ``id_continue`` sets are based on the Unicode standard
593-
annex `UAX-31`_. See also :pep:`3131` for further details.
594-
Note that Python does not necessarily conform to `UAX-31`_.
595-
596-
A non-normative listing of all valid identifier characters as defined by
597-
Unicode is available in the `DerivedCoreProperties.txt`_ file in the Unicode
598-
Character Database.
599-
The properties *ID_Start* and *ID_Continue* are very similar to Python's
600-
``id_start`` and ``id_continue`` sets; the properties *XID_Start* and
601-
*XID_Continue* play similar roles for identifiers before NFKC normalization.
602-
603592
.. _UAX-31: https://www.unicode.org/reports/tr31/
604593
.. _PropList.txt: https://www.unicode.org/Public/17.0.0/ucd/PropList.txt
605594
.. _DerivedCoreProperties.txt: https://www.unicode.org/Public/17.0.0/ucd/DerivedCoreProperties.txt
606595
.. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms
607596

597+
.. seealso::
598+
599+
* :pep:`3131` -- Supporting Non-ASCII Identifiers
600+
* :pep:`672` -- Unicode-related Security Considerations for Python
601+
608602

609603
.. _literals:
610604
