Commit e44fa66

committed
WIP
1 parent 80ad85c commit e44fa66

3 files changed

+76
-55
lines changed

Doc/reference/grammar.rst

Lines changed: 1 addition & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -10,11 +10,8 @@ error recovery.
1010

1111
The notation used here is the same as in the preceding docs,
1212
and is described in the :ref:`notation <notation>` section,
13-
except for a few extra complications:
13+
except for an extra complication:
1414

15-
* ``&e``: a positive lookahead (that is, ``e`` is required to match but
16-
not consumed)
17-
* ``!e``: a negative lookahead (that is, ``e`` is required *not* to match)
1815
* ``~`` ("cut"): commit to the current alternative and fail the rule
1916
even if this fails to parse
2017

Doc/reference/introduction.rst

Lines changed: 12 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -145,15 +145,23 @@ The definition to the right of the colon uses the following syntax elements:
145145
* ``e?``: A question mark has exactly the same meaning as square brackets:
146146
the preceding item is optional.
147147
* ``(e)``: Parentheses are used for grouping.
148+
149+
The following notation is only used in
150+
:ref:`lexical definitions <notation-lexical-vs-syntactic>`.
151+
148152
* ``"a"..."z"``: Two literal characters separated by three dots mean a choice
149153
of any single character in the given (inclusive) range of ASCII characters.
150-
This notation is only used in
151-
:ref:`lexical definitions <notation-lexical-vs-syntactic>`.
152154
* ``<...>``: A phrase between angular brackets gives an informal description
153155
of the matched symbol (for example, ``<any ASCII character except "\">``),
154156
or an abbreviation that is defined in nearby text (for example, ``<Lu>``).
155-
This notation is only used in
156-
:ref:`lexical definitions <notation-lexical-vs-syntactic>`.
157+
158+
.. _lexical-lookaheads:
159+
160+
Some definitions also use *lookaheads*, which indicate that an element
161+
must (or must not) match at a given position, but without consuming any input:
162+
163+
* ``&e``: a positive lookahead (that is, ``e`` is required to match)
164+
* ``!e``: a negative lookahead (that is, ``e`` is required *not* to match)
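
As an analogy only, Python's ``re`` module offers lookaheads with the same "match without consuming" behaviour. The sketch below uses the regex syntax ``(?=...)`` and ``(?!...)``, not the grammar notation ``&e`` and ``!e``, and the sample strings are made up for illustration.

```python
import re

# Positive lookahead: "foo" matches only if "bar" follows,
# but "bar" itself is not consumed by the match.
positive = re.match(r"foo(?=bar)", "foobar")
consumed = positive.group()      # only "foo" was consumed
stopped_at = positive.end()      # the match ends before "bar"

# Negative lookahead: "foo" matches only if "bar" does NOT follow.
rejected = re.match(r"foo(?!bar)", "foobar")   # no match
accepted = re.match(r"foo(?!bar)", "foobaz")   # matches "foo"
```

In both cases the text examined by the lookahead is still available to whatever comes next in the pattern.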
157165

158166
The unary operators (``*``, ``+``, ``?``) bind as tightly as possible;
159167
the vertical bar (``|``) binds most loosely.

Doc/reference/lexical_analysis.rst

Lines changed: 63 additions & 47 deletions
Original file line numberDiff line numberDiff line change
@@ -39,25 +39,37 @@ The end of a logical line is represented by the token :data:`~token.NEWLINE`.
3939
Statements cannot cross logical line boundaries except where :data:`!NEWLINE`
4040
is allowed by the syntax (e.g., between statements in compound statements).
4141
A logical line is constructed from one or more *physical lines* by following
42-
the explicit or implicit *line joining* rules.
42+
the :ref:`explicit <explicit-joining>` or :ref:`implicit <implicit-joining>`
43+
*line joining* rules.
4344

4445

4546
.. _physical-lines:
4647

4748
Physical lines
4849
--------------
4950

50-
A physical line is a sequence of characters terminated by an end-of-line
51-
sequence. In source files and strings, any of the standard platform line
52-
termination sequences can be used - the Unix form using ASCII LF (linefeed),
53-
the Windows form using the ASCII sequence CR LF (return followed by linefeed),
54-
or the old Macintosh form using the ASCII CR (return) character. All of these
55-
forms can be used equally, regardless of platform. The end of input also serves
56-
as an implicit terminator for the final physical line.
51+
A physical line is a sequence of characters terminated by one of the following
52+
end-of-line sequences:
5753

58-
When embedding Python, source code strings should be passed to Python APIs using
59-
the standard C conventions for newline characters (the ``\n`` character,
60-
representing ASCII LF, is the line terminator).
54+
* the Unix form using ASCII LF (linefeed),
55+
* the Windows form using the ASCII sequence CR LF (return followed by linefeed),
56+
* the old Macintosh form using the ASCII CR (return) character.
57+
58+
Regardless of platform, each of these sequences is replaced by a single
59+
ASCII LF (linefeed) character.
60+
(This is done even inside :ref:`string literals <strings>`.)
61+
Each line can use any of the sequences; they do not need to be consistent
62+
within a file.
63+
64+
The end of input also serves as an implicit terminator for the final
65+
physical line.
66+
67+
Formally:
68+
69+
.. grammar-snippet::
70+
:group: python-grammar
71+
72+
newline: <ASCII LF> | <ASCII CR> <ASCII LF> | <ASCII CR>
6173

6274

6375
.. _comments:
@@ -484,14 +496,21 @@ Literals
484496

485497
Literals are notations for constant values of some built-in types.
486498

499+
In terms of lexical analysis, Python has :ref:`string, bytes <strings>`
500+
and :ref:`numeric <numbers>` literals.
501+
502+
Other “literals” are lexically denoted using :ref:`keywords <keywords>`
503+
(``None``, ``True``, ``False``) and the special
504+
:ref:`ellipsis token <lexical-ellipsis>` (``...``).
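
The distinction can be observed with the standard ``tokenize`` module. This is a sketch: the helper ``token_kinds`` is a hypothetical name introduced here, and the sample source line is made up.

```python
import io
import token
import tokenize

def token_kinds(source):
    """Return (token type name, text) pairs, skipping line-structure tokens."""
    readline = io.StringIO(source).readline
    return [
        (token.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in (token.NEWLINE, token.ENDMARKER)
    ]

kinds = token_kinds("'a', b'a', 1.5, None, ...\n")
# String and bytes literals share the STRING token type, numbers are
# NUMBER tokens, ``None`` arrives as a NAME, and ``...`` as an OP.
non_commas = [kind for kind in kinds if kind[1] != ","]
```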
505+
487506

488507
.. index:: string literal, bytes literal, ASCII
489508
single: ' (single quote); string literal
490509
single: " (double quote); string literal
491510
.. _strings:
492511

493512
String and Bytes literals
494-
-------------------------
513+
=========================
495514

496515
String literals are text enclosed in single quotes (``'``) or double
497516
quotes (``"``). For example:
@@ -635,41 +654,26 @@ They may not be combined with ``'b'``, ``'u'``, or each other.
635654

636655

637656
String literals, except "F-strings" and "T-strings", are described by the
638-
following lexical definitions:
657+
following lexical definitions.
658+
659+
These definitions use :ref:`negative lookaheads <lexical-lookaheads>` (``!``)
660+
to indicate that an ending quote ends the literal.
639661

640662
.. grammar-snippet::
641663
:group: python-grammar
642664

643-
STRING: stringliteral | bytesliteral | fstring | tstring
644-
645-
stringliteral: [`stringprefix`](`stringcontent`)
646-
stringprefix: <("r" | "u"), case-insensitive>
647-
stringcontent: `quote` `stringitem`* <matching `quote`>
648-
quote: "'" | '"' | "'''" | '"""'
665+
STRING: [`stringprefix`] (`stringcontent`)
666+
stringprefix: <("r" | "u" | "b" | "br" | "rb"), case-insensitive>
667+
stringcontent:
668+
| "'" ( !"'" `stringitem`)* "'"
669+
| '"' ( !'"' `stringitem`)* '"'
670+
| "'''" ( !"'''" `longstringitem`)* "'''"
671+
| '"""' ( !'"""' `longstringitem`)* '"""'
649672
stringitem: `stringchar` | `stringescapeseq`
650-
stringchar: <any `source_character`, except as listed below>
673+
stringchar: <any `source_character`, except backslash and newline>
674+
longstringitem: `stringitem` | newline
651675
stringescapeseq: "\" <any `source_character`>
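
To illustrate how the negative lookahead ends a literal at its closing quote, here is a rough regular-expression transcription of the first ``stringcontent`` alternative (single-quoted, no prefix). The pattern and its names are an approximation written for this example, not the tokenizer's actual implementation.

```python
import re

# stringchar: any character except backslash and newline
STRINGCHAR = r"[^\\\n]"
# stringescapeseq: a backslash followed by any character
STRINGESCAPESEQ = r"\\[\s\S]"

# "'" ( !"'" stringitem )* "'"
SINGLE_QUOTED = re.compile(
    rf"'(?:(?!')(?:{STRINGCHAR}|{STRINGESCAPESEQ}))*'"
)

# An escaped quote is consumed as part of an escape sequence,
# so the literal ends at the first unescaped quote.
escaped = SINGLE_QUOTED.match(r"'it\'s' + x")
# A missing closing quote or a raw newline means no match.
unterminated = SINGLE_QUOTED.match("'abc")
with_newline = SINGLE_QUOTED.match("'a\nb'")
```

Without the ``(?!')`` check, ``stringchar`` as defined above would also accept the quote character, so the lookahead is what stops the repetition at the closing quote.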
652676

653-
``stringchar`` can not include:
654-
655-
- the backslash, ``\``;
656-
- in triple-quoted strings (quoted by ``'''`` or ``"""``), the newline;
657-
- the quote character.
658-
659-
660-
.. grammar-snippet::
661-
:group: python-grammar
662-
663-
bytesliteral: `bytesprefix`(`shortbytes` | `longbytes`)
664-
bytesprefix: <("b" | "br" | "rb" ), case-insensitive>
665-
shortbytes: "'" `shortbytesitem`* "'" | '"' `shortbytesitem`* '"'
666-
longbytes: "'''" `longbytesitem`* "'''" | '"""' `longbytesitem`* '"""'
667-
shortbytesitem: `shortbyteschar` | `bytesescapeseq`
668-
longbytesitem: `longbyteschar` | `bytesescapeseq`
669-
shortbyteschar: <any ASCII `source_character` except "\" or newline or the quote>
670-
longbyteschar: <any ASCII `source_character` except "\">
671-
bytesescapeseq: "\" <any ASCII `source_character`>
672-
673677
Note that as in all lexical definitions, whitespace is significant.
674678
The prefix, if any, must be followed immediately by the quoted string content.
675679

@@ -692,7 +696,7 @@ The prefix, if any, must be followed immediately by the quoted string content.
692696
.. _escape-sequences:
693697

694698
Escape sequences
695-
^^^^^^^^^^^^^^^^
699+
----------------
696700

697701
Unless an ``'r'`` or ``'R'`` prefix is present, escape sequences in string and
698702
bytes literals are interpreted according to rules similar to those used by
@@ -985,7 +989,7 @@ and :meth:`str.format`, which uses a related format string mechanism.
985989
.. _numbers:
986990

987991
Numeric literals
988-
----------------
992+
================
989993

990994
.. index:: number, numeric literal, integer literal
991995
floating-point literal, hexadecimal literal
@@ -1241,14 +1245,26 @@ The following tokens serve as delimiters in the grammar:
12411245
12421246
( ) [ ] { }
12431247
, : ! . ; @ =
1248+
1249+
The period can also occur in floating-point and imaginary literals.
1250+
1251+
.. _lexical-ellipsis:
1252+
1253+
A sequence of three periods has a special meaning as an
1254+
:py:data:`Ellipsis` literal:
1255+
1256+
.. code-block:: none
1257+
1258+
...
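
A short sketch showing both sides of this: at run time ``...`` evaluates to the ``Ellipsis`` singleton, and the ``tokenize`` module reports the three periods as a single token. The variable names are illustrative.

```python
import io
import token
import tokenize

# At run time, the three periods evaluate to the Ellipsis singleton.
value = ...
is_singleton = value is Ellipsis

# Lexically, the three periods form one token rather than three "." tokens.
tokens = list(tokenize.generate_tokens(io.StringIO("...\n").readline))
first = tokens[0]
exact_name = token.tok_name[first.exact_type]
```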
1259+
1260+
The following *augmented assignment operators* serve
1261+
lexically as delimiters, but also perform an operation:
1262+
1263+
.. code-block:: none
1264+
12441265
-> += -= *= /= //= %=
12451266
@= &= |= ^= >>= <<= **=
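
As a small illustration of "delimiter that also performs an operation", the sketch below checks that ``+=`` is tokenized as one operator token and that, for a list, it modifies the object in place. The helper ``exact_names`` is a hypothetical name used only here.

```python
import io
import token
import tokenize

def exact_names(source):
    """Return the exact token type names of all operator tokens in source."""
    readline = io.StringIO(source).readline
    return [
        token.tok_name[tok.exact_type]
        for tok in tokenize.generate_tokens(readline)
        if tok.type == token.OP
    ]

# "+=" is a single token, not "+" followed by "=".
names = exact_names("total += 2\n")

# The operation part: for lists, "+=" extends the existing object,
# so another name bound to the same list sees the change.
items = alias = [1]
items += [2]
```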
12461267
1247-
The period can also occur in floating-point and imaginary literals. A sequence
1248-
of three periods has a special meaning as an ellipsis literal. The second half
1249-
of the list, the augmented assignment operators, serve lexically as delimiters,
1250-
but also perform an operation.
1251-
12521268
The following printing ASCII characters have special meaning as part of other
12531269
tokens or are otherwise significant to the lexical analyzer:
12541270
