Treat lookahead/lookbehind as zero-width in RegexGroupParser and look through non-capturing groups in getRootAlternation#5603

Open
phpstan-bot wants to merge 1 commit into phpstan:2.1.x from phpstan-bot:create-pull-request/patch-08h0vuf
@phpstan-bot
Collaborator

Summary

Regex capturing groups containing lookahead/lookbehind assertions or empty alternation branches inside non-capturing groups were incorrectly inferred as non-empty-string or non-falsy-string when they could match empty strings. This fix teaches RegexGroupParser that lookahead/lookbehind assertions are zero-width (don't consume characters) and that alternation branches should use "any" semantics for maybe-empty checks.
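
The runtime behaviour behind this can be checked directly with `preg_match`. The snippet below is a minimal sketch, not part of the PR: `captureGroupOne` is a hypothetical helper, and the subjects are chosen so that the zero-width branch is the one that matches. In each case the capturing group participates in the match but captures `''`, which is why `non-empty-string` is too narrow.

```php
<?php

/**
 * Hypothetical helper for illustration: returns what capture group 1
 * matched, or null if the pattern did not match (or the group is unset).
 */
function captureGroupOne(string $pattern, string $subject): ?string
{
    if (preg_match($pattern, $subject, $matches) !== 1) {
        return null;
    }

    return $matches[1] ?? null;
}

// Lone lookahead: the assertion succeeds at offset 0 without consuming "x".
var_dump(captureGroupOne('/((?=x))/', 'x'));          // string(0) ""

// Lookahead as one alternation branch: the zero-width branch matches first.
var_dump(captureGroupOne('/((?:(?=x)|y))/', 'x'));    // string(0) ""

// Lookbehind branch: succeeds at offset 1, right after the "x".
var_dump(captureGroupOne('/((?:(?<=x)|y))/', 'xz'));  // string(0) ""

// Negative lookahead branch: succeeds at offset 0 because "a" is not "x".
var_dump(captureGroupOne('/((?:(?!x)|y))/', 'a'));    // string(0) ""
```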

Changes

  • src/Type/Regex/RegexGroupParser.php:

    • walkGroupAst: Added early return for #lookahead, #negativelookahead, #lookbehind, #negativelookbehind nodes to prevent their content from influencing nonEmpty/nonFalsy/numeric state
    • isMaybeEmptyNode: Added early return true for lookahead/lookbehind nodes (zero-width assertions). Added special handling for #alternation nodes: an alternation is maybe-empty if ANY branch is maybe-empty (previously used the generic recursion which required ALL children to be maybe-empty). Isolated $isNonFalsy per alternation branch to prevent cross-branch leakage
    • getRootAlternation: Extended to look through #noncapturing groups for both #capturing and #namedcapturing groups, enabling union type creation for patterns like ((?:|\d+)) → ''|numeric-string
  • tests/PHPStan/Analyser/nsrt/preg_match_shapes.php:

    • Added bug12840 test function with 9 test cases covering: empty alternation in non-capturing group, lookahead/lookbehind/negative variants in alternation, lone lookahead in group, two non-capturing groups with empty alternation, named capturing groups with non-capturing wrappers, and lookahead in concatenation with literal
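
The "any branch" versus "all children" distinction in `isMaybeEmptyNode` can be sketched on a toy AST. This is an illustration under stated assumptions, not the code from `RegexGroupParser`: nodes are plain arrays, `toyNode` and `toyIsMaybeEmpty` are hypothetical names, every `token` is treated as a literal that consumes one character, and quantifiers and `$isNonFalsy` tracking are left out.

```php
<?php

// Node ids treated as zero-width, mirroring the four ids named in the PR.
const ZERO_WIDTH_IDS = ['#lookahead', '#negativelookahead', '#lookbehind', '#negativelookbehind'];

/** Hypothetical toy node: an id plus a list of child nodes. */
function toyNode(string $id, array $children = []): array
{
    return ['id' => $id, 'children' => $children];
}

function toyIsMaybeEmpty(array $node): bool
{
    if (in_array($node['id'], ZERO_WIDTH_IDS, true)) {
        return true; // assertions never consume characters
    }

    if ($node['id'] === 'token') {
        return false; // assumption: a token is a one-character literal
    }

    if ($node['id'] === '#alternation') {
        // ANY branch that can be empty makes the alternation maybe-empty.
        foreach ($node['children'] as $branch) {
            if (toyIsMaybeEmpty($branch)) {
                return true;
            }
        }

        return false;
    }

    // Concatenation and group wrappers: ALL children must be maybe-empty.
    foreach ($node['children'] as $child) {
        if (!toyIsMaybeEmpty($child)) {
            return false;
        }
    }

    return true;
}

// (?:(?=x)|y): one zero-width branch, one literal branch.
$alternation = toyNode('#noncapturing', [
    toyNode('#alternation', [
        toyNode('#lookahead', [toyNode('token')]),
        toyNode('token'),
    ]),
]);

// (?=x)a: the lookahead is zero-width, but the literal still consumes a character.
$concatenation = toyNode('#concatenation', [
    toyNode('#lookahead', [toyNode('token')]),
    toyNode('token'),
]);

var_dump(toyIsMaybeEmpty($alternation));   // bool(true)
var_dump(toyIsMaybeEmpty($concatenation)); // bool(false)
```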

Root cause

Three interrelated issues in RegexGroupParser:

  1. Lookahead/lookbehind treated as character-consuming: walkGroupAst recursed into their children, causing the content of zero-width assertions (e.g., the x in (?=x)) to set nonEmpty=yes as if the assertion consumed characters. isMaybeEmptyNode similarly didn't recognize them as zero-width.

  2. isMaybeEmptyNode used wrong semantics for alternation: The generic recursion checked if ALL children were maybe-empty (correct for concatenation) but alternation only needs ANY branch to be maybe-empty. This meant (?:(?=x)|y) was considered non-empty because the y branch was non-empty, even though the (?=x) branch is zero-width.

  3. getRootAlternation didn't see through non-capturing groups: Patterns like ((?:|\d+)) couldn't use the union-type code path (which creates ''|numeric-string) because the alternation was wrapped in a #noncapturing node.
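
The third point can be pictured with a small sketch of the look-through step. It uses a hypothetical array-based AST (`astNode`, `findRootAlternation`) rather than the parser's real node class, assumes a group has exactly one child, ignores the name of a named group, and unwraps only a single `#noncapturing` level; whether the actual implementation handles deeper nesting is not stated in this PR.

```php
<?php

/** Hypothetical toy node: an id plus a list of child nodes. */
function astNode(string $id, array $children = []): array
{
    return ['id' => $id, 'children' => $children];
}

/**
 * Returns the alternation directly under a capturing group, looking
 * through one non-capturing wrapper, or null if there is none.
 */
function findRootAlternation(array $group): ?array
{
    if (!in_array($group['id'], ['#capturing', '#namedcapturing'], true)) {
        return null;
    }

    if (count($group['children']) !== 1) {
        return null;
    }

    $inner = $group['children'][0];

    // The look-through step: (?:...) adds no semantics of its own here.
    if ($inner['id'] === '#noncapturing' && count($inner['children']) === 1) {
        $inner = $inner['children'][0];
    }

    return $inner['id'] === '#alternation' ? $inner : null;
}

// ((?:|\d+)): the empty branch is modelled as a concatenation with no children.
$emptyOrDigits = astNode('#alternation', [
    astNode('#concatenation'),
    astNode('#quantification', [astNode('token')]),
]);
$wrapped = astNode('#capturing', [astNode('#noncapturing', [$emptyOrDigits])]);

var_dump(findRootAlternation($wrapped) !== null); // bool(true)
```

Without the look-through step, `$wrapped` would yield `null` and the group would fall back to the generic, non-union code path.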

Test

  • bug12840 function in tests/PHPStan/Analyser/nsrt/preg_match_shapes.php with 9 assertType calls covering:
    • Empty alternation in non-capturing group: ((?:|\d+)) → ''|numeric-string
    • All four assertion variants in alternation: (?=...), (?<=...), (?!...), (?<!...) → string (not non-falsy-string)
    • Lone lookahead in group: ((?=x)) → string (not non-empty-string)
    • Named group with non-capturing wrapper: (?P<g>(?:|\d+)) → ''|numeric-string
    • Lookahead in concatenation with literal: ((?=x)a) → non-falsy-string (correctly preserved)
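
As a sanity check on the `''|numeric-string` expectation, both members of the union can be observed at runtime. This is a sketch and not one of the PR's test cases: `matchNamedGroup` is a hypothetical helper, and the `^`/`$` anchors are added here so that the `\d+` branch is actually reachable (unanchored, the empty branch always wins at offset 0).

```php
<?php

/**
 * Hypothetical helper for illustration: returns what the named group
 * matched, or null if the pattern did not match.
 */
function matchNamedGroup(string $pattern, string $subject, string $name): ?string
{
    if (preg_match($pattern, $subject, $matches) !== 1) {
        return null;
    }

    return $matches[$name] ?? null;
}

// Anchored variant of the named-group pattern from the list above.
$pattern = '/^(?P<g>(?:|\d+))$/';

var_dump(matchNamedGroup($pattern, '', 'g'));   // string(0) ""
var_dump(matchNamedGroup($pattern, '42', 'g')); // string(2) "42"
var_dump(matchNamedGroup($pattern, 'ab', 'g')); // NULL

// Unanchored, the empty branch is tried first and matches at offset 0.
var_dump(matchNamedGroup('/(?P<g>(?:|\d+))/', '42', 'g')); // string(0) ""
```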

Fixes phpstan/phpstan#12840

…ok through non-capturing groups in `getRootAlternation`

- In `walkGroupAst`, skip recursion into `#lookahead`, `#negativelookahead`,
  `#lookbehind`, `#negativelookbehind` children since they are zero-width
  assertions that do not contribute characters to the match
- In `isMaybeEmptyNode`, return true for lookahead/lookbehind nodes (always
  zero-width) and handle `#alternation` correctly: an alternation is
  maybe-empty if ANY branch is maybe-empty, not ALL (fixing the generic
  recursion which required all children to be maybe-empty)
- Isolate `$isNonFalsy` per alternation branch in `isMaybeEmptyNode` to
  prevent a non-falsy literal in one branch from incorrectly marking the
  entire alternation as non-falsy
- In `getRootAlternation`, look through `#noncapturing` groups to find
  the alternation inside, enabling union type creation for patterns like
  `((?:|\d+))` → `''|numeric-string`

@staabm self-assigned this on May 6, 2026