Implement Thompson NFA-based Regular Expressions#1172

Open
JAi-SATHVIK wants to merge 119 commits into fortran-lang:master from JAi-SATHVIK:regex

Conversation

@JAi-SATHVIK
Contributor

issue #1163

@JAi-SATHVIK
Contributor Author

Hi @jalvesz @jvdp1, the "cmake-3.14" CI job is failing because pip install cmake==3.14.3 requests a version that is no longer available on the PyPI index. Can we update .github/workflows/CI.yml to use cmake==3.14.3.post1?

@JAi-SATHVIK
Contributor Author

JAi-SATHVIK commented Apr 4, 2026

Update

I have finalized the core implementation of the pure Fortran regex engine. Here is a summary of what I've completed:

  • Correct Shunting-Yard Parsing: Fixed logic bugs in the parser to properly handle parentheses (( and )) and operator precedence during postfix conversion.
  • Lexer Anchor Handling: Updated the lexer to accurately handle start (^) and end ($) anchors with correct implicit concatenation logic.
  • Accurate Match Reporting: Fixed an off-by-one error in regmatch to ensure correct 1-based match_start indices.
  • Safety and Stability: Hardened the engine against out-of-bounds access and memory issues by restructuring compound logical conditions that relied on short-circuit evaluation (which Fortran does not guarantee), and by keeping matcher state in local variables for thread safety.
  • Unit Test Integration: Migrated the test suite to the repository's standard test-drive framework, with 10 comprehensive test cases covering literals, character classes, anchors, and alternation.

The engine is now stable, zero-dependency, and ready for your feedback! @arjenmarkus @jvdp1 @jalvesz
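The parsing steps in the list above (implicit concatenation followed by shunting-yard postfix conversion) can be sketched as follows. This is a minimal illustration, not the code from this PR: the module name postfix_sketch, the functions add_concat and to_postfix, this version of prec, and the use of '&' as an explicit concatenation marker are all assumptions made for the example. Escapes, character classes, anchors, and unbalanced parentheses are not handled.

```fortran
module postfix_sketch
    implicit none
    private
    public :: to_postfix
    ! Hypothetical marker for explicit concatenation (assumed not to occur in the pattern)
    character(len=1), parameter :: CONCAT = '&'
contains

    !> Precedence of a binary operator; 0 for '(' and anything else.
    pure integer function prec(c)
        character(len=1), intent(in) :: c
        select case (c)
        case ('|')
            prec = 1
        case (CONCAT)
            prec = 2
        case default
            prec = 0
        end select
    end function prec

    !> Insert CONCAT wherever a term ender is directly followed by a term starter.
    pure function add_concat(pattern) result(res)
        character(len=*), intent(in) :: pattern
        character(len=:), allocatable :: res
        integer :: i
        res = ''
        do i = 1, len(pattern)
            ! Nested IF: Fortran does not guarantee short-circuit evaluation,
            ! so pattern(i-1:i-1) is only touched once i > 1 is known.
            if (i > 1) then
                if (scan(pattern(i-1:i-1), '(|') == 0 .and. &
                    scan(pattern(i:i), ')|*+?') == 0) res = res // CONCAT
            end if
            res = res // pattern(i:i)
        end do
    end function add_concat

    !> Shunting-yard conversion of an infix pattern to postfix.
    pure function to_postfix(pattern) result(out)
        character(len=*), intent(in) :: pattern
        character(len=:), allocatable :: out, infix, stack
        character(len=1) :: c
        integer :: i, top

        infix = add_concat(pattern)
        allocate(character(len=len(infix)) :: stack)
        out = ''
        top = 0
        do i = 1, len(infix)
            c = infix(i:i)
            select case (c)
            case ('(')
                top = top + 1
                stack(top:top) = c
            case (')')
                ! Pop operators until the matching '(' is found
                do while (top > 0)
                    if (stack(top:top) == '(') exit
                    out = out // stack(top:top)
                    top = top - 1
                end do
                if (top > 0) top = top - 1      ! discard the '('
            case ('|', CONCAT)
                ! Left-associative: pop operators of equal or higher precedence
                do while (top > 0)
                    if (prec(stack(top:top)) < prec(c)) exit
                    out = out // stack(top:top)
                    top = top - 1
                end do
                top = top + 1
                stack(top:top) = c
            case default
                ! Literals and postfix quantifiers (*, +, ?) go straight to the output
                out = out // c
            end select
        end do
        do while (top > 0)
            out = out // stack(top:top)
            top = top - 1
        end do
    end function to_postfix
end module postfix_sketch

program postfix_demo
    use postfix_sketch, only: to_postfix
    implicit none
    print '(a)', to_postfix('a(b|c)*d')   ! prints abc|*&d&
    if (to_postfix('a(b|c)*d') /= 'abc|*&d&') error stop 'unexpected postfix'
    if (to_postfix('ab|c') /= 'ab&c|') error stop 'unexpected postfix'
end program postfix_demo
```

Because concatenation binds tighter than alternation, ab|c becomes ab&c| rather than abc|&.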

Comment thread doc/specs/stdlib_regex.md
Comment thread doc/specs/stdlib_regex.md
Comment thread src/regex/stdlib_regex.f90 Outdated
Comment thread src/regex/stdlib_regex.f90 Outdated
Comment thread src/regex/stdlib_regex.f90 Outdated
Comment thread src/regex/stdlib_regex.f90 Outdated
Comment thread src/regex/stdlib_regex.f90
Comment thread src/regex/stdlib_regex.f90 Outdated
Comment thread src/regex/stdlib_regex.f90
Comment thread src/regex/stdlib_regex.f90 Outdated
@arjenmarkus
Member

arjenmarkus commented Apr 5, 2026 via email

@arjenmarkus
Member

arjenmarkus commented Apr 5, 2026 via email

@arjenmarkus
Member

arjenmarkus commented Apr 5, 2026 via email

@JAi-SATHVIK
Contributor Author

Thanks @jalvesz @arjenmarkus! I’ve updated the code and addressed those issues:

  • Off-by-one and match lengths: Fixed. abc now correctly returns ms=5, me=7, and aaaab with a*b correctly returns ms=1, me=5.
  • Leftmost-longest priority: The engine follows the standard rule that the leftmost start position wins. Because an "A" matches starting at index 1, that match is chosen over any later match in the string. Among all matches starting at that leftmost position, the engine then selects the longest one.
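The leftmost-longest rule described above can be illustrated with a small sketch. It assumes 1-based, inclusive start and end indices and a hypothetical list of candidate matches; the subroutine pick is invented for the example. The engine in this PR presumably tracks the best match during the NFA simulation rather than enumerating candidates like this.

```fortran
program leftmost_longest_demo
    implicit none
    ! Hypothetical candidate matches as (start, end) pairs
    integer, parameter :: starts(4) = [5, 1, 1, 3]
    integer, parameter :: ends(4)   = [9, 2, 5, 8]
    integer :: ms, me

    call pick(starts, ends, ms, me)
    print '(a,i0,a,i0)', 'ms=', ms, ', me=', me   ! prints ms=1, me=5
    if (ms /= 1 .or. me /= 5) error stop 'unexpected selection'

contains

    !> Choose the smallest start, then the largest end among matches with that start.
    pure subroutine pick(s, e, ms, me)
        integer, intent(in)  :: s(:), e(:)
        integer, intent(out) :: ms, me
        ms = minval(s)
        me = maxval(e, mask=(s == ms))
    end subroutine pick
end program leftmost_longest_demo
```

The candidate (5, 9) is longer than (1, 5), but it loses because its start position is further right.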

@JAi-SATHVIK JAi-SATHVIK requested a review from jalvesz April 5, 2026 19:59
@arjenmarkus
Member

arjenmarkus commented Apr 7, 2026 via email

@JAi-SATHVIK
Contributor Author

Thanks for the tests, @arjenmarkus.

  • Updated the tokenize lexer to check the preceding token before accepting a repetition quantifier (*, +, ?).
  • The parser now rejects nested quantifiers and quantifiers that lack a valid operand (e.g. a**, a+*, (*a), or |*).
  • Improved the parenthesis matching logic to detect and report explicitly empty groups, ().
  • Tested for compliance using Arjen Markus's regex test catalogue runner. All of these edge cases now return stat = 1 during regcomp.
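The validation rules listed above can be sketched as a single pass over the pattern. This is an illustration only: the function check and its return convention (1 for an invalid pattern, 0 otherwise, mirroring the stat value mentioned above) are assumptions, and the sketch ignores escapes and character classes.

```fortran
program quantifier_check_demo
    implicit none
    if (check('a**') /= 1 .or. check('a+*') /= 1) error stop 'nested quantifier accepted'
    if (check('(*a)') /= 1 .or. check('|*') /= 1) error stop 'missing operand accepted'
    if (check('()') /= 1) error stop 'empty group accepted'
    if (check('a*b') /= 0 .or. check('(ab)+') /= 0) error stop 'valid pattern rejected'
    print '(a)', 'all checks passed'

contains

    !> Return 1 if a quantifier lacks an operand or a group is empty, 0 otherwise.
    pure function check(pattern) result(stat)
        character(len=*), intent(in) :: pattern
        integer :: stat
        integer :: i

        stat = 0
        do i = 1, len(pattern)
            select case (pattern(i:i))
            case ('*', '+', '?')
                ! A quantifier needs a literal or a closed group directly before it
                if (i == 1) then
                    stat = 1
                else if (scan(pattern(i-1:i-1), '(|*+?') /= 0) then
                    stat = 1
                end if
            case (')')
                ! An opening parenthesis directly before ')' is an empty group
                if (i > 1) then
                    if (pattern(i-1:i-1) == '(') stat = 1
                end if
            end select
            if (stat /= 0) return
        end do
    end function check
end program quantifier_check_demo
```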

@JAi-SATHVIK
Contributor Author

Hi @jvdp1 @jalvesz, there are some CI failures, which I have covered in issue #1178. Could you take a look?

Comment thread src/regex/stdlib_regex.f90 Outdated
Comment thread src/regex/stdlib_regex.f90 Outdated
Comment thread src/regex/stdlib_regex.f90 Outdated
@JAi-SATHVIK
Contributor Author

Thanks @jalvesz, I’ve updated is_term_ender, is_term_starter, and prec to be elemental, so they can be applied element-wise to arrays as well as to scalars within the module.
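As a rough illustration of what the elemental attribute allows, the sketch below applies one function to a whole array of characters in a single call. The signature is hypothetical: it takes a single character, whereas the functions in this PR may operate on token codes instead.

```fortran
program elemental_demo
    implicit none
    character(len=1), parameter :: toks(5) = ['a', '(', ')', '*', '|']
    logical :: enders(5)

    ! One call evaluates the function for every array element
    enders = is_term_ender(toks)
    print '(5(l1,1x))', enders
    if (any(enders .neqv. [.true., .false., .true., .true., .false.])) error stop 'unexpected result'

contains

    !> Hypothetical version: a token ends a term unless it is '(' or '|'.
    elemental logical function is_term_ender(c)
        character(len=1), intent(in) :: c
        is_term_ender = scan(c, '(|') == 0
    end function is_term_ender
end program elemental_demo
```

The same function still accepts a scalar argument, so existing scalar call sites need no changes.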

@JAi-SATHVIK JAi-SATHVIK requested a review from jalvesz April 17, 2026 16:19
@jalvesz jalvesz requested a review from arjenmarkus April 17, 2026 16:38
Member

@jvdp1 jvdp1 left a comment

Thank you @JAi-SATHVIK. Here are a few comments after a very quick review.

Comment thread test/regex/catalogue_regex.f90 Outdated
integer :: mismatches
logical :: matched

open( 10, file = 'catalogue_regex.inp', status = 'old', iostat = ierr )
Member

Use newunit instead of a hard-coded unit number.

Suggested change
- open( 10, file = 'catalogue_regex.inp', status = 'old', iostat = ierr )
+ open( newunit=un, file = 'catalogue_regex.inp', status = 'old', iostat = ierr )

Contributor Author

Refactored to use newunit

Comment thread test/regex/catalogue_regex.f90 Outdated
error stop
endif

open( 20, file = 'catalogue_regex.report' )
Member

Suggested change
- open( 20, file = 'catalogue_regex.report' )
+ open( newunit=un20, file = 'catalogue_regex.report' )

Comment thread test/regex/catalogue_regex.f90 Outdated
mismatches = 0

do
read( 10, '(a)', iostat = ierr ) line
Member

Suggested change
- read( 10, '(a)', iostat = ierr ) line
+ read( un, '(a)', iostat = ierr ) line
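Taken together, the newunit suggestions in these threads amount to the pattern sketched below: the runtime picks a free unit number, and every later read, write, and close uses that variable. The file name newunit_demo.tmp and its contents are made up so that the example runs on its own; this is not the catalogue test program from the PR.

```fortran
program newunit_demo
    implicit none
    integer :: un, ierr
    character(len=80) :: line

    ! Create a small scratch file so the example is self-contained
    open(newunit=un, file='newunit_demo.tmp', status='replace', action='write', iostat=ierr)
    if (ierr /= 0) error stop 'cannot create scratch file'
    write(un, '(a)') 'a*b aaaab'
    close(un)

    ! newunit assigns an unused unit number, so nothing can clash with 10 or 20
    open(newunit=un, file='newunit_demo.tmp', status='old', action='read', iostat=ierr)
    if (ierr /= 0) error stop 'cannot open scratch file'
    do
        read(un, '(a)', iostat=ierr) line
        if (ierr /= 0) exit          ! end of file or read error
        print '(a)', trim(line)
    end do
    close(un, status='delete')
end program newunit_demo
```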

! Anchored match
call regcomp(re, "^hello", stat)
call regmatch(re, "hello world", found)
print "(A,L1)", "found = ", found
Member

Suggested change
- print "(A,L1)", "found = ", found
+ print "(a,l1)", "found = ", found

Comment thread src/regex/stdlib_regex.f90 Outdated
Comment on lines +34 to +44
integer, parameter :: CHAR_ZERO = iachar('0')
integer, parameter :: CHAR_NINE = iachar('9')
integer, parameter :: CHAR_LOWER_A = iachar('a')
integer, parameter :: CHAR_LOWER_Z = iachar('z')
integer, parameter :: CHAR_UPPER_A = iachar('A')
integer, parameter :: CHAR_UPPER_Z = iachar('Z')
integer, parameter :: CHAR_SPACE = iachar(' ')
integer, parameter :: CHAR_TAB = 9
integer, parameter :: CHAR_LF = 10
integer, parameter :: CHAR_CR = 13
integer, parameter :: CHAR_UNDERSCORE = iachar('_')
Member

Some of these might already be defined in stdlib_ascii. Did you check them?

Contributor Author

Updated stdlib_regex.f90 to use the TAB, LF, and CR constants from stdlib_ascii.
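For reference, a minimal sketch of that change, assuming stdlib is available to the build. TAB, LF, and CR are single-character constants in stdlib_ascii, so the integer codes that were previously hard-coded (9, 10, 13) can be recovered with iachar where they are still needed. How the regex module actually uses them is not shown in this thread.

```fortran
program ascii_constants_demo
    use stdlib_ascii, only: TAB, LF, CR
    implicit none

    ! The character constants carry the same codes as the removed integer parameters
    if (iachar(TAB) /= 9 .or. iachar(LF) /= 10 .or. iachar(CR) /= 13) error stop 'unexpected code'
    print '(3(i0,1x))', iachar(TAB), iachar(LF), iachar(CR)
end program ascii_constants_demo
```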

@JAi-SATHVIK JAi-SATHVIK requested a review from jvdp1 April 19, 2026 11:24

4 participants