From a85f11e1791df4cc7abf6ca26e4d6e19fa015c64 Mon Sep 17 00:00:00 2001
From: Eric Huss
Date: Sat, 20 Dec 2025 12:03:40 -0800
Subject: [PATCH] Clarify UNICODE_ESCAPE valid token value

This clarifies in the UNICODE_ESCAPE rule that the hex value must be a
valid Unicode scalar value. This resolves the problem that a string like
`"\u{ffffff}"` is not a valid token, but the grammar did not reflect that.

I don't see a practical way to define this with character ranges; the
resulting expression would be huge.

Note that this restriction means that the UNICODE_ESCAPE rule will not
match an invalid value, and that in all the places where UNICODE_ESCAPE is
used, the preceding character must *not* be `\`, which forces those rules
to fail their match. In turn, the only rules that contain UNICODE_ESCAPE
have `'` or `"` characters, which won't match any other rule in the
grammar, forcing them to fail the parse.

If all those assumptions seem too fragile, then we can consider adding the
[cut operator](https://github.com/rust-lang/reference/pull/2104) just
after the `\u` so that the interpretation is clear that a failure to match
the part from the opening brace is an immediate parse failure.
---
 src/tokens.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/tokens.md b/src/tokens.md
index f34fcb92d6..d83917ef01 100644
--- a/src/tokens.md
+++ b/src/tokens.md
@@ -157,9 +157,11 @@ ASCII_ESCAPE ->
     | `\n` | `\r` | `\t` | `\\` | `\0`
 
 UNICODE_ESCAPE ->
-    `\u{` ( HEX_DIGIT `_`* ){1..6} `}`
+    `\u{` ( HEX_DIGIT `_`* ){1..6} _valid hex char value_ `}`[^valid-hex-char]
 ```
 
+[^valid-hex-char]: See [lex.token.literal.char-escape.unicode].
+
 r[lex.token.literal.char.intro]
 A _character literal_ is a single Unicode character enclosed within two `U+0027` (single-quote) characters, with the exception of `U+0027` itself, which must be _escaped_ by a preceding `U+005C` character (`\`).
 
@@ -196,7 +198,7 @@ r[lex.token.literal.char-escape.ascii]
 * A _7-bit code point escape_ starts with `U+0078` (`x`) and is followed by exactly two _hex digits_ with value up to `0x7F`. It denotes the ASCII character with value equal to the provided hex value. Higher values are not permitted because it is ambiguous whether they mean Unicode code points or byte values.
 
 r[lex.token.literal.char-escape.unicode]
-* A _24-bit code point escape_ starts with `U+0075` (`u`) and is followed by up to six _hex digits_ surrounded by braces `U+007B` (`{`) and `U+007D` (`}`). It denotes the Unicode code point equal to the provided hex value.
+* A _24-bit code point escape_ starts with `U+0075` (`u`) and is followed by up to six _hex digits_ surrounded by braces `U+007B` (`{`) and `U+007D` (`}`). It denotes the Unicode code point equal to the provided hex value. The value must be a valid Unicode scalar value.
 
 r[lex.token.literal.char-escape.whitespace]
 * A _whitespace escape_ is one of the characters `U+006E` (`n`), `U+0072` (`r`), or `U+0074` (`t`), denoting the Unicode values `U+000A` (LF), `U+000D` (CR) or `U+0009` (HT) respectively.
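
Review note: the "valid Unicode scalar value" restriction that this patch adds can be illustrated with a small Rust sketch. It is not taken from the Reference or from rustc; `unicode_escape_value` is a hypothetical helper that takes only the text between the braces of a `\u{...}` escape, and it assumes the grammar shape shown in the hunk above (one to six hex digits, each optionally followed by underscores). The scalar-value check itself is delegated to the standard library's `char::from_u32`, which rejects surrogates and values above `0x10FFFF`.

```rust
/// Hypothetical helper (an illustration, not rustc's lexer): given the text
/// between the braces of a `\u{...}` escape, return the character it denotes,
/// or `None` if the escape would not form a valid token.
fn unicode_escape_value(body: &str) -> Option<char> {
    // `( HEX_DIGIT `_`* ){1..6}` must begin with a hex digit, so a leading
    // underscore is rejected; underscores elsewhere are separators.
    if body.starts_with('_') {
        return None;
    }
    let digits: String = body.chars().filter(|&c| c != '_').collect();

    // Between one and six hex digits are allowed.
    if digits.is_empty() || digits.len() > 6 {
        return None;
    }
    if !digits.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }

    // At most six hex digits always fit in a u32.
    let value = u32::from_str_radix(&digits, 16).ok()?;

    // `char::from_u32` returns `None` for surrogates (0xD800..=0xDFFF) and
    // for values above 0x10FFFF, i.e. anything that is not a scalar value.
    char::from_u32(value)
}
```

Under these assumptions, `unicode_escape_value("ffffff")` yields `None`, matching the commit message's example of `"\u{ffffff}"` not being a valid token, while `unicode_escape_value("10ffff")` yields the highest scalar value.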