From a85f11e1791df4cc7abf6ca26e4d6e19fa015c64 Mon Sep 17 00:00:00 2001
From: Eric Huss <eric@huss.org>
Date: Sat, 20 Dec 2025 12:03:40 -0800
Subject: [PATCH] Clarify UNICODE_ESCAPE valid token value

This clarifies in the UNICODE_ESCAPE rule that the hex value must be a
valid Unicode scalar value. A string like `"\u{ffffff}"` is not a valid
token, but the grammar did not previously reflect that.

I don't see a practical way to define this with character ranges. The
resulting expression is huge.
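
For illustration only, here is a minimal sketch of the check that the prose
restriction describes. It is not taken from rustc or from the Reference.
The function name `unicode_escape_value` is hypothetical, and its input is
assumed to be the text between the braces of a `\u{...}` escape. It relies
on `char::from_u32`, which returns `None` for surrogates and for values
above `0x10FFFF`.

```rust
/// Hypothetical helper (not part of rustc or the Reference): given the text
/// between the braces of a `\u{...}` escape, return the character it denotes,
/// or `None` if the text is malformed or is not a Unicode scalar value.
fn unicode_escape_value(body: &str) -> Option<char> {
    let mut value: u32 = 0;
    let mut digits = 0;
    for c in body.chars() {
        if c == '_' {
            // `( HEX_DIGIT _* ){1..6}`: an underscore may only follow a digit.
            if digits == 0 {
                return None;
            }
            continue;
        }
        // Reject anything that is not a hex digit.
        let d = c.to_digit(16)?;
        digits += 1;
        // At most six hex digits are allowed.
        if digits > 6 {
            return None;
        }
        value = value * 16 + d;
    }
    // At least one hex digit is required.
    if digits == 0 {
        return None;
    }
    // `None` for surrogates (0xD800..=0xDFFF) and values above 0x10FFFF.
    char::from_u32(value)
}
```

Under these assumptions, `unicode_escape_value("ffffff")` and
`unicode_escape_value("D800")` both return `None`, while
`unicode_escape_value("1F600")` returns a character.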

Note that this restriction means that the UNICODE_ESCAPE rule will not
match an invalid value, and that in all the places where UNICODE_ESCAPE
is used, the preceding character must *not* be `\`, which forces those
rules to fail their match. In turn, the only rules that contain
UNICODE_ESCAPE have `'` or `"` characters, which won't match any other
rule in the grammar, forcing the parse to fail.

If all those assumptions seem too fragile, then we can consider adding
the [cut operator](https://github.com/rust-lang/reference/pull/2104)
just after the `\u` to make it explicit that a failure to match the part
from the opening brace onward is an immediate parse failure.
---
 src/tokens.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/tokens.md b/src/tokens.md
index f34fcb92d6..d83917ef01 100644
--- a/src/tokens.md
+++ b/src/tokens.md
@@ -157,9 +157,11 @@ ASCII_ESCAPE ->
     | `\n` | `\r` | `\t` | `\\` | `\0`
 
 UNICODE_ESCAPE ->
-    `\u{` ( HEX_DIGIT `_`* ){1..6} `}`
+    `\u{` ( HEX_DIGIT `_`* ){1..6} _valid hex char value_ `}`[^valid-hex-char]
 ```
 
+[^valid-hex-char]: See [lex.token.literal.char-escape.unicode].
+
 r[lex.token.literal.char.intro]
 A _character literal_ is a single Unicode character enclosed within two `U+0027` (single-quote) characters, with the exception of `U+0027` itself, which must be _escaped_ by a preceding `U+005C` character (`\`).
 
@@ -196,7 +198,7 @@ r[lex.token.literal.char-escape.ascii]
 * A _7-bit code point escape_ starts with `U+0078` (`x`) and is followed by exactly two _hex digits_ with value up to `0x7F`. It denotes the ASCII character with value equal to the provided hex value. Higher values are not permitted because it is ambiguous whether they mean Unicode code points or byte values.
 
 r[lex.token.literal.char-escape.unicode]
-* A _24-bit code point escape_ starts with `U+0075` (`u`) and is followed by up to six _hex digits_ surrounded by braces `U+007B` (`{`) and `U+007D` (`}`). It denotes the Unicode code point equal to the provided hex value.
+* A _24-bit code point escape_ starts with `U+0075` (`u`) and is followed by up to six _hex digits_ surrounded by braces `U+007B` (`{`) and `U+007D` (`}`). It denotes the Unicode code point equal to the provided hex value. The value must be a valid Unicode scalar value.
 
 r[lex.token.literal.char-escape.whitespace]
 * A _whitespace escape_ is one of the characters `U+006E` (`n`), `U+0072` (`r`), or `U+0074` (`t`), denoting the Unicode values `U+000A` (LF), `U+000D` (CR) or `U+0009` (HT) respectively.