Implement Thompson NFA-based Regular Expressions by JAi-SATHVIK · Pull Request #1172 · fortran-lang/stdlib

JAi-SATHVIK · 2026-03-31T19:26:58Z

…m` routines.

Fix CMakeLists.txt for the addition of stdlib_storting_pca

Master cpy

optimized for performance and stability

JAi-SATHVIK · 2026-04-03T21:31:00Z

Hi @jalvesz @jvdp1, the "cmake-3.14" CI job is failing because pip install cmake==3.14.3 requests a version that is no longer available on the PyPI index. Can we update .github/workflows/CI.yml to use cmake==3.14.3.post1 instead?

JAi-SATHVIK · 2026-04-04T08:46:03Z

Update

I have finalized the core implementation of the pure Fortran regex engine. Here is a summary of what I've completed:

Correct Shunting-Yard Parsing: Fixed logic bugs in the parser so that it properly handles parentheses, ( and ), and operator precedence during postfix conversion.
Lexer Anchor Handling: Updated the lexer to accurately handle start (^) and end ($) anchors with correct implicit concatenation logic.
Accurate Match Reporting: Fixed an off-by-one error in regmatch to ensure correct 1-based match_start indices.
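Safety and Stability: Hardened the engine against out-of-bounds access and memory issues by refactoring compound logical conditions that could be evaluated eagerly (Fortran does not guarantee short-circuit evaluation of .and. and .or.), and by keeping matcher state in local variables for thread safety.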
Unit Test Integration: Migrated the test suite to the repository's standard test-drive framework, with 10 comprehensive test cases covering literals, character classes, anchors, and alternation.
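
To illustrate the "eager logical evaluations" item, here is a minimal sketch of the kind of refactoring it most likely refers to. This is an assumption about the change, not code from the PR; find_state and its arguments are hypothetical names. The Fortran standard lets a compiler evaluate both operands of .and., so a bounds test and an array access combined in one condition can read out of bounds; splitting them into a loop test and a nested if avoids that.

```fortran
program guarded_logical_demo
    implicit none
    integer, parameter :: states(3) = [4, 7, 9]

    ! Self-checks: 7 is at position 2, 5 is absent.
    if (find_state(states, 7) /= 2) error stop "expected position 2"
    if (find_state(states, 5) /= 0) error stop "expected 0 for a missing value"
    print '(a)', "all checks passed"

contains

    ! Hypothetical helper: returns the position of wanted in list, or 0.
    pure integer function find_state(list, wanted) result(pos)
        integer, intent(in) :: list(:)
        integer, intent(in) :: wanted
        integer :: i

        pos = 0
        i = 1
        ! Risky form, deliberately not used:
        !   do while (i <= size(list) .and. list(i) /= wanted)
        ! A compiler may evaluate list(i) even when i > size(list).
        do while (i <= size(list))
            ! The access happens only after the bounds test has passed.
            if (list(i) == wanted) then
                pos = i
                exit
            end if
            i = i + 1
        end do
    end function find_state

end program guarded_logical_demo
```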

The engine is now stable, zero-dependency, and ready for your feedback! @arjenmarkus @jvdp1 @jalvesz

…egex

arjenmarkus · 2026-04-05T13:04:42Z

I had a first look at the test program you provided some days ago. I noticed that the indices are off by one:

=== Testing Fortran Regex (Thompson NFA) ===
regcomp 'abc': status = 0
Match 'xyz_abc_def' -> T 4 7

The substring "abc" starts at 5, not 4. This off-by-one error occurs in another test as well.

Another one:

Match 'aaaab' with 'a*b' -> T 4 5

The match starts at 1, not 4.

foo123bar: the matching substring is too short; the reported substring runs from 3 to 4 instead of 4 to 6.

cats: the matching substring is reported as 7 to 11 (five characters), whereas the matching substring is "cats", so only four characters.

So, some work to be done, unless you have already fixed these bugs ;), but in any case a good start.

On Sat, 4 Apr 2026 at 22:43, José Alves wrote:

…

José Alves requested changes on this pull request.

------------------------------

In doc/specs/stdlib_regex.md:

> +The regular expression pattern string to compile.
> +
> +`status` (optional): Shall be of type `integer`. It is an `intent(out)` argument.
> +Returns 0 on success, or a non-zero value if the pattern is invalid
> +(e.g., mismatched parentheses or brackets).
> +
> +### Example
> +
> +```fortran
> +use stdlib_regex, only: regex_type, regcomp
> +type(regex_type) :: re
> +integer :: stat
> +
> +call regcomp(re, "(cat|dog)s?", stat)
> +if (stat /= 0) error stop "Invalid regex pattern"
> +```

This should ideally be an executable example program in the examples folder.

------------------------------

In doc/specs/stdlib_regex.md:

> +
> +`string`: Shall be of type `character(len=*)`. It is an `intent(in)` argument.
> +The input string to search for a match.
> +
> +`is_match`: Shall be of type `logical`. It is an `intent(out)` argument.
> +Set to `.true.` if a match is found, `.false.` otherwise.
> +
> +`match_start` (optional): Shall be of type `integer`. It is an `intent(out)` argument.
> +The 1-based index of the first character of the match.
> +
> +`match_end` (optional): Shall be of type `integer`. It is an `intent(out)` argument.
> +The 1-based index of the last character of the match.
> +
> +### Example
> +
> +```fortran

Same as before: this should be an executable program in the examples folder.

------------------------------

In src/regex/stdlib_regex.f90:

> +        integer :: tail
> +    end type out_list_type
> +
> +    type :: frag_type
> +        integer :: start
> +        type(out_list_type) :: out_list
> +    end type frag_type
> +
> +    type :: thread
> +        integer :: state
> +        integer :: start_pos
> +    end type thread
> +
> +contains
> +
> +    logical function is_term_ender(tag)

Can this be made pure or, even better, elemental?

------------------------------

In src/regex/stdlib_regex.f90:

> +        integer :: state
> +        integer :: start_pos
> +    end type thread
> +
> +contains
> +
> +    logical function is_term_ender(tag)
> +        integer, intent(in) :: tag
> +        is_term_ender = (tag == TOK_CHAR .or. tag == TOK_ANY .or. &
> +                         tag == TOK_CLASS .or. tag == TOK_STAR .or. &
> +                         tag == TOK_PLUS .or. tag == TOK_QUEST .or. &
> +                         tag == TOK_RPAREN .or. tag == TOK_END .or. &
> +                         tag == TOK_START)
> +    end function is_term_ender
> +
> +    logical function is_term_starter(tag)

Can this be made pure or, even better, elemental?

------------------------------

In src/regex/stdlib_regex.f90:

> +        integer, intent(in) :: tag
> +        is_term_ender = (tag == TOK_CHAR .or. tag == TOK_ANY .or. &
> +                         tag == TOK_CLASS .or. tag == TOK_STAR .or. &
> +                         tag == TOK_PLUS .or. tag == TOK_QUEST .or. &
> +                         tag == TOK_RPAREN .or. tag == TOK_END .or. &
> +                         tag == TOK_START)
> +    end function is_term_ender
> +
> +    logical function is_term_starter(tag)
> +        integer, intent(in) :: tag
> +        is_term_starter = (tag == TOK_CHAR .or. tag == TOK_ANY .or. &
> +                           tag == TOK_CLASS .or. tag == TOK_LPAREN .or. &
> +                           tag == TOK_START .or. tag == TOK_END)
> +    end function is_term_starter
> +
> +    integer function prec(tag)

Can this be made pure or, even better, elemental?

------------------------------

In src/regex/stdlib_regex.f90:

> +            c = pattern(i:i)
> +            t%tag = TOK_CHAR
> +            t%c = ' '
> +            t%bmap = .false.
> +            t%invert = .false.
> +
> +            if (c == '\') then
> +                if (i < len_p) then
> +                    i = i + 1
> +                    c = pattern(i:i)
> +                end if
> +                t%tag = TOK_CHAR
> +                t%c = c
> +                if (c == 'd') then
> +                    t%tag = TOK_CLASS
> +                    do k = iachar('0'), iachar('9'); t%bmap(k) = .true.; end do

Two things:

1. Could you move all calls to iachar() into module-level parameters that are then usable across the module? You can check out https://github.com/fortran-lang/stdlib/blob/master/src/core/stdlib_ascii.fypp for inspiration. Maybe some of those constants would be worth storing there?
2. Please avoid one-liners for anything that is not a scalar constant:
   - one = 1; zero = 0 is tolerable.
   - do k = iachar('0'), iachar('9'); t%bmap(k) = .true.; end do is not very "debugger friendly", as one cannot easily set breakpoints to follow the iterations. For this specific case, t%bmap(iachar('0'):iachar('9')) = .true. would be equivalent, "fortranic", and needs no semicolons.

------------------------------

In src/regex/stdlib_regex.f90:

> +                stack(top) = tokens(i)
> +            end if
> +        end do
> +
> +        do while (top > 0)
> +            if (stack(top)%tag == TOK_LPAREN) then
> +                stat = 1
> +                return
> +            end if
> +            num_postfix = num_postfix + 1
> +            postfix(num_postfix) = stack(top)
> +            top = top - 1
> +        end do
> +    end subroutine parse_to_postfix
> +
> +    integer function new_out(s, o, pool, p_size)

This function has side effects on pool. I totally agree with this guideline: https://github.com/JorgeG94/fortran_programmer_llm/blob/main/Fortran_programmer.md#functions-should-have-no-side-effects

Would it be possible to use subroutines whenever input derived types are mutated?

------------------------------

In src/regex/stdlib_regex.f90:

> +            top = top - 1
> +        end do
> +    end subroutine parse_to_postfix
> +
> +    integer function new_out(s, o, pool, p_size)
> +        integer, intent(in) :: s, o
> +        type(out_node), intent(inout) :: pool(:)
> +        integer, intent(inout) :: p_size
> +        p_size = p_size + 1
> +        pool(p_size)%s = s
> +        pool(p_size)%o = o
> +        pool(p_size)%next = 0
> +        new_out = p_size
> +    end function new_out
> +
> +    subroutine merge_lists(l1, l2, res, pool)

Same as https://github.com/fortran-lang/stdlib/pull/1172/changes#r3036020902

------------------------------

In src/regex/stdlib_regex.f90:

> +        type(out_list_type), intent(in) :: l1, l2
> +        type(out_list_type), intent(out) :: res
> +        type(out_node), intent(inout) :: pool(:)
> +        if (l1%head == 0) then
> +            res = l2
> +        else if (l2%head == 0) then
> +            res = l1
> +        else
> +            pool(l1%tail)%next = l2%head
> +            res%head = l1%head
> +            res%tail = l2%tail
> +        end if
> +    end subroutine merge_lists
> +
> +    subroutine do_patch(states, list, target, pool)
> +        type(state_type), intent(inout) :: states(:)

Another style comment: with subroutines, it is customary to put the non-mutable inputs first (strict intent(in)), then the intent(out) or intent(inout) arguments, and finally the optional ones.

Would it be possible to follow this recommendation?

------------------------------

In src/regex/stdlib_regex.f90:

> +        integer, intent(in) :: target
> +        type(out_node), intent(in) :: pool(:)
> +        integer :: curr
> +        curr = list%head
> +        do while (curr /= 0)
> +            if (pool(curr)%o == 1) then
> +                states(pool(curr)%s)%out1 = target
> +            else
> +                states(pool(curr)%s)%out2 = target
> +            end if
> +            curr = pool(curr)%next
> +        end do
> +    end subroutine do_patch
> +
> +    subroutine build_nfa(postfix, num_postfix, states, n_states, start_state, stat)
> +        type(token_type), intent(in) :: postfix(:)

Same as https://github.com/fortran-lang/stdlib/pull/1172/changes#r3036024863

------------------------------

In src/regex/stdlib_regex.f90:

> +        if (stat /= 0) then
> +            if (present(status)) status = stat
> +            return
> +        end if
> +
> +        call parse_to_postfix(tokens, n_tok, postfix, n_post, stat)
> +        if (stat /= 0) then
> +            if (present(status)) status = stat
> +            return
> +        end if
> +
> +        call build_nfa(postfix, n_post, re%states, re%n_states, re%start_state, stat)
> +        if (present(status)) status = stat
> +    end subroutine regcomp
> +
> +    recursive subroutine add_thread(list, count, state_idx, start_pos, step_index, states, str_len, visited)

Same as https://github.com/fortran-lang/stdlib/pull/1172/changes#r3036024863

On the implementation level, is recursion absolutely necessary, or could this be implemented without it?
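
As a rough sketch of what the review asks for (not the code that ended up in the PR), the following shows an elemental predicate, character codes stored as module-level parameters, an array-section assignment in place of the one-line loop, and new_out rewritten as a subroutine with the intent(in) arguments first. The tag values, the reduced set of tags in is_term_ender, the bitmap bounds 0:255, the helper digit_bitmap and the extra idx argument are all assumptions made for illustration.

```fortran
module review_sketch
    implicit none
    private
    public :: TOK_CHAR, TOK_ANY, TOK_STAR, TOK_LPAREN
    public :: out_node, is_term_ender, digit_bitmap, new_out

    ! Hypothetical tag values; the PR defines its own constants.
    integer, parameter :: TOK_CHAR = 1, TOK_ANY = 2, TOK_STAR = 3, TOK_LPAREN = 4

    ! Character codes computed once, as module-level parameters.
    integer, parameter :: code_zero = iachar('0')
    integer, parameter :: code_nine = iachar('9')

    type :: out_node
        integer :: s = 0
        integer :: o = 0
        integer :: next = 0
    end type out_node

contains

    ! Elemental (hence pure): works on scalars and arrays of tags alike.
    ! Only a subset of the PR's tags is tested here, to keep the sketch short.
    elemental logical function is_term_ender(tag)
        integer, intent(in) :: tag
        is_term_ender = (tag == TOK_CHAR .or. tag == TOK_ANY .or. tag == TOK_STAR)
    end function is_term_ender

    ! Array-section assignment replaces "do k = ...; bmap(k) = .true.; end do".
    pure function digit_bitmap() result(bmap)
        logical :: bmap(0:255)   ! assumed bounds
        bmap = .false.
        bmap(code_zero:code_nine) = .true.
    end function digit_bitmap

    ! A subroutine rather than a function, because it mutates pool and p_size.
    ! Argument order: intent(in), then intent(inout), then intent(out).
    pure subroutine new_out(s, o, pool, p_size, idx)
        integer, intent(in) :: s, o
        type(out_node), intent(inout) :: pool(:)
        integer, intent(inout) :: p_size
        integer, intent(out) :: idx
        p_size = p_size + 1
        pool(p_size)%s = s
        pool(p_size)%o = o
        pool(p_size)%next = 0
        idx = p_size
    end subroutine new_out

end module review_sketch

program review_sketch_demo
    use review_sketch
    implicit none
    type(out_node) :: pool(4)
    integer :: p_size, idx

    if (.not. all(is_term_ender([TOK_CHAR, TOK_STAR]))) error stop "ender check failed"
    if (is_term_ender(TOK_LPAREN)) error stop "ender check failed"
    if (count(digit_bitmap()) /= 10) error stop "bitmap check failed"

    p_size = 0
    call new_out(3, 1, pool, p_size, idx)
    if (idx /= 1 .or. pool(1)%s /= 3) error stop "pool check failed"
    print '(a)', "all checks passed"
end program review_sketch_demo
```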

arjenmarkus · 2026-04-05T13:11:59Z

I also tried a test:

call regcomp(re, "A+", stat)
call regmatch(re, "A AAAAAAAAAAAAAAA", is_match, match_start, match_end)
print *, "Match 'A AAAAAAAAAAAA' with 'A+' -> ", is_match, match_start, match_end

The result was a match with the first substring, not the longest. That is not the classical behaviour of a regular expression matcher. I will read the documentation to see if this was expected 😇


arjenmarkus · 2026-04-05T13:40:36Z

Ah, that was my mistake. The engine should look for the longest matching substring once a start has been found, unless non-greedy expressions are selected. That is always a tricky part of regular expressions: knowing precisely what is to be matched and what is not.


JAi-SATHVIK · 2026-04-05T19:56:25Z

Thanks @jalvesz @arjenmarkus ! I’ve updated the code and addressed those issues:

Off-by-one and match lengths: Fixed! abc now correctly returns ms=5, me=7, and aaaab with a*b correctly returns ms=1, me=5.
Leftmost-Longest Priority: The engine follows the standard leftmost-longest rule, in which the leftmost starting position always wins. Because an "A" matches starting at index 1, it is chosen over any later matches elsewhere in the string.

Among all matches starting at that same leftmost position, the engine will strictly select the longest one before concluding.
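
A minimal sketch of that rule for the "A+" test above, assuming nothing about the engine's internals: match_run is a hypothetical stand-in that handles only patterns of the form "c+" for a single character c. It first fixes the leftmost start, then extends the match as far as possible from that start only.

```fortran
program leftmost_longest_demo
    implicit none
    character(len=*), parameter :: text = "A AAAAAAAAAAAAAAA"
    integer :: ms, me

    call match_run(text, "A", ms, me)
    ! The leftmost "A" is at index 1 and is followed by a blank,
    ! so the longest match from that start is a single character.
    print '(a,i0,a,i0)', "match_start = ", ms, ", match_end = ", me
    if (ms /= 1 .or. me /= 1) error stop "unexpected match bounds"

contains

    ! Hypothetical helper, equivalent to matching the pattern c+ against string.
    ! Returns 0, 0 when there is no match.
    subroutine match_run(string, c, match_start, match_end)
        character(len=*), intent(in) :: string
        character(len=1), intent(in) :: c
        integer, intent(out) :: match_start, match_end
        integer :: i

        match_start = 0
        match_end = 0

        ! Step 1: the leftmost position where a match can start wins.
        do i = 1, len(string)
            if (string(i:i) == c) then
                match_start = i
                exit
            end if
        end do
        if (match_start == 0) return

        ! Step 2: among matches from that start, take the longest.
        match_end = match_start
        do while (match_end < len(string))
            if (string(match_end+1:match_end+1) /= c) exit
            match_end = match_end + 1
        end do
    end subroutine match_run

end program leftmost_longest_demo
```

With the input "aaaa" and c = "a", the same helper would report 1 and 4, since the run can be extended to the end of the string.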

arjenmarkus · 2026-04-07T18:42:24Z

I have written a small "interpreter" that will allow you to easily extend the set of tests. See the attachments. The sample tests are just a start, of course, but I already found one incompleteness in the checking for a proper regular expression. You are welcome to use it (or just to ignore it, if you think it is not useful).


! catalogue_regex.f90 --
!
!     By Arjen Markus, dd. 7 april 2026.
!
!     A catalogue of tests for the regular expression module
!
!     The program reads a file with test cases:
!     expression "expression" - the expression to be compiled and used
!     input "string"          - the string to be matched against the last regular expression
!     expected "string"       - the string that is expected to match (hence output from the match routine)
!     error-exp               - expecting an error from the compilation of the last regular expression
!     no-match                - expecting a "no match" result
!
!     These lines instruct the program to compile and use the regular expression.
!     The results are reported in the output file.
!     The lines in the file should be no more than 100 characters long.
!     Also: the double quotes should surround the strings, so that they are properly delimited.
!
!     The order of the lines is expected to be:
!     the expected output comes after the expression and the input
!
program catalogue_regex
    use stdlib_regex

    implicit none

    type(regex_type)              :: re
    character(len=100)            :: line
    character(len=20)             :: keyword
    character(len=:), allocatable :: value
    character(len=:), allocatable :: expression
    character(len=:), allocatable :: string
    character(len=:), allocatable :: expected
    integer                       :: match_start, match_end, status, ierr
    integer                       :: mismatches
    logical                       :: matched

    open( 10, file = 'catalogue_regex.inp', status = 'old', iostat = ierr )

    if ( ierr /= 0 ) then
        write( *, '(a)' ) 'Could not open the file "catalogue_regex.inp"'
        write( *, '(a)' ) 'It should exist - please check'
        error stop
    endif

    open( 20, file = 'catalogue_regex.report' )

    mismatches = 0

    do
        read( 10, '(a)', iostat = ierr ) line

        if ( ierr /= 0 ) then
            exit
        endif

        call extract_information( line, keyword, value )

        select case( keyword )
            case( 'expression' )
                expression = value

            case( 'input' )
                string = value

            case( 'expected' )
                write( 20, '(a)' ) ''
                expected = value

                call regcomp( re, expression, status )

                if ( status /= 0 ) then
                    mismatches = mismatches + 1
                    write( 20, '(a,i0)' ) 'Error compiling the expression: status = ', status
                    write( 20, '(a,2a)' ) ' Expression: "', expression, '"'
                else
                    call regmatch( re, string, matched, match_start, match_end )

                    if ( matched ) then
                        write( 20, '(a,2a)' ) 'Match found:'
                        write( 20, '(a,2a)' ) ' Expression: "', expression, '"'
                        write( 20, '(a,2a)' ) ' Input string: "', string, '"'
                        write( 20, '(a,2a)' ) ' Substring: "', string(match_start:match_end), '"'
                        write( 20, '(a,2a)' ) ' Expected: "', expected, '"'
                        if ( expected == string(match_start:match_end) ) then
                            write( 20, '(a,2a)' ) ' Success!'
                        else
                            mismatches = mismatches + 1
                            write( 20, '(a,2a)' ) ' MISMATCH!'
                        endif
                    else
                        mismatches = mismatches + 1
                        write( 20, '(a,2a)' ) 'NO match found:'
                        write( 20, '(a,2a)' ) ' Expression: "', expression, '"'
                        write( 20, '(a,2a)' ) ' Input string: "', string, '"'
                        write( 20, '(a,2a)' ) ' Substring: (none)'
                        write( 20, '(a,2a)' ) ' Expected: "', expected, '"'
                    endif
                endif

            case( 'error-exp' )
                write( 20, '(a)' ) ''

                call regcomp( re, expression, status )

                if ( status /= 0 ) then
                    write( 20, '(a)' ) 'Error detected as expected:'
                    write( 20, '(a,2a)' ) ' Expression: "', expression, '"'
                else
                    mismatches = mismatches + 1
                    write( 20, '(a)' ) 'An error was expected but not detected:'
                    write( 20, '(a,2a)' ) ' Expression: "', expression, '"'
                endif

            case( 'no-match' )
                write( 20, '(a)' ) ''

                call regcomp( re, expression, status )

                if ( status /= 0 ) then
                    mismatches = mismatches + 1
                    write( 20, '(a,i0)' ) 'Error compiling the expression: status = ', status
                    write( 20, '(a,2a)' ) ' Expression: "', expression, '"'
                else
                    call regmatch( re, string, matched, match_start, match_end )

                    if ( matched ) then
                        mismatches = mismatches + 1
                        write( 20, '(a,2a)' ) 'Match found where none expected:'
                        write( 20, '(a,2a)' ) ' Expression: "', expression, '"'
                        write( 20, '(a,2a)' ) ' Input string: "', string, '"'
                        write( 20, '(a,2a)' ) ' Substring: "', string(match_start:match_end), '"'
                        write( 20, '(a,2a)' ) ' Expected: (none)'
                    else
                        write( 20, '(a,2a)' ) 'No match found, as expected:'
                        write( 20, '(a,2a)' ) ' Expression: "', expression, '"'
                        write( 20, '(a,2a)' ) ' Input string: "', string, '"'
                        write( 20, '(a,2a)' ) ' Expected: (none)'
                    endif
                endif

            case default
                ! Treat any other keyword as comment
        end select
    enddo

    write( 20, '(/,a,i0)' ) 'Number of mismatches or other errors: ', mismatches
    write( *, '(a)' ) 'Program completed'

contains

! extract_information --
!     Split an input line into its keyword and the double-quoted value (if any)
!
subroutine extract_information( line, keyword, value )
    character(len=*), intent(in)               :: line
    character(len=*), intent(out)              :: keyword
    character(len=:), intent(out), allocatable :: value

    character(len=20), dimension(5) :: known_keywords = &
        [ 'expression', &
          'input     ', &
          'expected  ', &
          'error-exp ', &
          'no-match  ' ]

    integer :: k1, k2

    if ( line == " " ) then
        keyword = ""
        value   = ""
        return
    endif

    read( line, *, iostat = ierr ) keyword

    if ( keyword == 'error-exp' .or. keyword == 'no-match' ) then
        value = ""
        return
    endif

    if ( any( keyword == known_keywords ) ) then
        allocate( value, mold = line )

        k1 = index( line, '"' )
        if ( k1 > 0 ) then
            k2 = k1 + index( line(k1+1:), '"' )
            ! k2 equals k1 when the closing quote is missing, so compare against k1
            if ( k2 > k1 ) then
                value = line(k1+1:k2-1)
            else
                write( 20, '(a)' )  'Error interpreting the input line:'
                write( 20, '(2a)' ) ' "', trim(line), '"'
                write( 20, '(2a)' ) 'Program stopped'
                write( *, '(2a)' )  'Program stopped - error reading input. Please check'
                error stop
            endif
        endif
    else
        value = ""
    endif
end subroutine extract_information

end program catalogue_regex

JAi-SATHVIK · 2026-04-10T08:21:29Z

Thanks for the tests, @arjenmarkus.

Updated the tokenize lexer logic to check the preceding AST token before accepting a repeat quantifier (*, +, ?).
The parser now rejects nested or invalid quantifiers that lack a valid operand (e.g., a**, a+*, (*a), or |*).
Enhanced the parenthesis-matching logic to report an error for explicitly empty groups ().
Tested with Arjen Markus's regex test catalogue program. All of these edge cases now return stat = 1 directly from regcomp.
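As a rough sketch of the kind of check described above (not the PR's actual lexer), a single pass over the pattern can track whether the previous token is something a quantifier may apply to. The function name `check_pattern` and its 0/1 return convention are assumptions chosen to mirror the stat = 1 behaviour mentioned in the comment; escapes and character classes are deliberately ignored.

```fortran
program quantifier_check_demo
    implicit none
    character(len=*), parameter :: patterns(6) = &
        [character(len=4) :: 'a*b', 'a**', 'a+*', '(*a)', '|*', '()']
    integer :: i

    do i = 1, size(patterns)
        print '(a,a,a,i0)', '"', trim(patterns(i)), '" -> stat = ', &
            check_pattern(trim(patterns(i)))
    end do

contains

    ! Return 0 if every quantifier has an operand and no group is empty, 1 otherwise.
    pure function check_pattern(pattern) result(stat)
        character(len=*), intent(in) :: pattern
        integer :: stat
        logical :: operand_ready   ! previous token can take a quantifier
        logical :: after_open      ! previous character was '('
        integer :: i

        stat = 0
        operand_ready = .false.
        after_open = .false.
        do i = 1, len(pattern)
            select case (pattern(i:i))
            case ('*', '+', '?')
                if (.not. operand_ready) then
                    stat = 1
                    return
                end if
                operand_ready = .false.    ! a second quantifier in a row is rejected
                after_open = .false.
            case ('(')
                operand_ready = .false.
                after_open = .true.
            case (')')
                if (after_open) then       ! explicitly empty group ()
                    stat = 1
                    return
                end if
                operand_ready = .true.     ! a closed group is a valid operand
            case ('|')
                operand_ready = .false.
                after_open = .false.
            case default
                operand_ready = .true.     ! any other character is a literal here
                after_open = .false.
            end select
        end do
    end function check_pattern

end program quantifier_check_demo
```

In this sketch only a*b yields stat = 0; the other five patterns are rejected at the first offending character.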

JAi-SATHVIK · 2026-04-10T21:12:46Z

Hi @jvdp1 @jalvesz, there are some CI failures, which I have raised in issue #1178. Could you have a look?

JAi-SATHVIK · 2026-04-17T14:25:25Z

Thanks @jalvesz, I’ve updated is_term_ender, is_term_starter, and prec to be elemental. This makes them much more versatile for array-based operations within the module.
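To illustrate what the elemental attribute buys, here is a minimal sketch with a stand-in for prec. The signature, the precedence values, and the use of '.' as an explicit concatenation operator are all assumptions; the module's real function may differ.

```fortran
program elemental_demo
    implicit none
    character(len=1), parameter :: ops(4) = ['|', '.', '*', 'a']

    ! Scalar call: works exactly like a normal pure function.
    print '(a,i0)', 'prec of "*": ', prec('*')

    ! Array call: the same function is applied to every element.
    print '(a,4(1x,i0))', 'prec of ops:', prec(ops)

contains

    ! Hypothetical precedence table: alternation < concatenation < repetition.
    elemental integer function prec(op)
        character(len=1), intent(in) :: op

        select case (op)
        case ('|')
            prec = 1
        case ('.')
            prec = 2
        case ('*', '+', '?')
            prec = 3
        case default
            prec = 0    ! not an operator
        end select
    end function prec

end program elemental_demo
```

Because an elemental procedure must be pure and take scalar dummy arguments, the same definition serves both the scalar call and the array call without a separate loop or overload.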

jvdp1

Thank you @JAi-SATHVIK . Here are a few comments after a very quick review

jvdp1 · 2026-04-17T20:47:40Z

+    integer            :: mismatches
+    logical            :: matched
+
+    open( 10, file = 'catalogue_regex.inp', status = 'old', iostat = ierr )


Use newunit instead of a hard-coded unit number.

Suggested change

open( 10, file = 'catalogue_regex.inp', status = 'old', iostat = ierr )

open( newunit=un, file = 'catalogue_regex.inp', status = 'old', iostat = ierr )

Refactored to use newunit
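For reference, a minimal self-contained sketch of the newunit pattern suggested here. The file name newunit_demo.tmp and the line written to it are placeholders, not part of the PR.

```fortran
program newunit_demo
    implicit none
    integer :: un, ierr
    character(len=100) :: line

    ! The processor picks a free unit number and stores it in un,
    ! so it cannot clash with units opened elsewhere in the program.
    open( newunit = un, file = 'newunit_demo.tmp', status = 'replace', iostat = ierr )
    if ( ierr /= 0 ) error stop 'Could not create the scratch file'
    write( un, '(a)' ) 'expression "a*b"'
    close( un )

    ! Reopen the same file for reading; un receives a fresh unit number.
    open( newunit = un, file = 'newunit_demo.tmp', status = 'old', iostat = ierr )
    if ( ierr /= 0 ) error stop 'Could not reopen the scratch file'
    read( un, '(a)', iostat = ierr ) line
    if ( ierr == 0 ) write( *, '(a)' ) trim(line)

    ! Remove the scratch file when done.
    close( un, status = 'delete' )
end program newunit_demo
```

Every later read, write, and close must then use the variable returned by newunit rather than a literal such as 10 or 20, which is why the follow-up suggestions in this review also replace the unit in the read statement.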

jvdp1 · 2026-04-17T20:47:59Z

+        error stop
+    endif
+
+    open( 20, file = 'catalogue_regex.report' )


Suggested change

open( 20, file = 'catalogue_regex.report' )

open( newunit=un20, file = 'catalogue_regex.report' )

jvdp1 · 2026-04-17T20:48:11Z

+    mismatches = 0
+
+    do
+        read( 10, '(a)', iostat = ierr ) line


Suggested change

read( 10, '(a)', iostat = ierr ) line

read( un, '(a)', iostat = ierr ) line

jvdp1 · 2026-04-17T20:49:40Z

+  ! Anchored match
+  call regcomp(re, "^hello", stat)
+  call regmatch(re, "hello world", found)
+  print "(A,L1)", "found = ", found


Suggested change

print "(A,L1)", "found = ", found

print "(a,l1)", "found = ", found

jvdp1 · 2026-04-17T20:55:09Z

+  integer, parameter :: CHAR_ZERO = iachar('0')
+  integer, parameter :: CHAR_NINE = iachar('9')
+  integer, parameter :: CHAR_LOWER_A = iachar('a')
+  integer, parameter :: CHAR_LOWER_Z = iachar('z')
+  integer, parameter :: CHAR_UPPER_A = iachar('A')
+  integer, parameter :: CHAR_UPPER_Z = iachar('Z')
+  integer, parameter :: CHAR_SPACE = iachar(' ')
+  integer, parameter :: CHAR_TAB = 9
+  integer, parameter :: CHAR_LF = 10
+  integer, parameter :: CHAR_CR = 13
+  integer, parameter :: CHAR_UNDERSCORE = iachar('_')


Some of these might already be defined in stdlib_ascii. Did you check?

Update stdlib_regex.f90 to use constants from stdlib_ascii (TAB, LF, CR)

JAi-SATHVIK and others added 30 commits January 6, 2026 08:01

add PCA to public api

dc4aaac

include pca submodule

27599e1

Add PCA module with pca, pca_transform, and `pca_inverse_transfor…

d77fb0e

…m` routines.

add PCA unit test

24358d1

update end interface statement

1dd44ad

update CmakeLists

7f79ef6

fixed_conflicts

0d2738c

update interface

20b0e98

allined with the other linalg function

654edba

convert to subroutines,updated test

b7c2be1

fix errors

63a0a1f

fixed errors

cfbcdee

fix PCA BLAS/LAPACK linking

db19731

fix PCA BLAS/LAPACK

d9ba548

fix: remove xdp/qp from PCA use statements to fix CI builds

11902b6

both updated

d7f8790

test

f8bbd27

modify interfaces for core.

75db887

add stdlib_sorting.fypp in cmakelists.txt

d72f72c

Fix CMakeLists.txt for the addition of stdlib_storting_pca

44ee2e7

Merge pull request #1 from jvdp1/fix_jai

6d2a4fd

Fix CMakeLists.txt for the addition of stdlib_storting_pca

Add center_data Helper Subroutine

b3ea627

Replace Manual Mean with stdlib mean

0e94be3

Replace Covariance Loops with BLAS syrk

05d4968

Extract pca_svd_driver and pca_eigh_driver & Updated Main pca Subroutine

d3d1c71

Merge pull request #2 from JAi-SATHVIK/master-cpy

7b49baa

Master cpy

optimized for performance and stability

0659b39

Merge pull request #3 from JAi-SATHVIK/master-cpy

ac3b0e9

optimized for performance and stability

Merge branch 'master-cpy'

4751866

Merge branch 'master' of https://github.com/JAi-SATHVIK/stdlib

cc21db0

update doc

8e2390f

Merge branch 'master' of https://github.com/JAi-SATHVIK/stdlib into r…

7b00548

…egex

jalvesz requested changes Apr 4, 2026

core engine logic, purity fix

bab5e5e

JAi-SATHVIK added 4 commits April 5, 2026 20:05

standalone example (pattern matching)

c58b15d

new build config

8d27abc

add regex examples

8ed7942

update docs

29f598b

JAi-SATHVIK requested a review from jalvesz April 5, 2026 19:59

Merge branch 'master' of https://github.com/fortran-lang/stdlib into …

731b57c

…regex

add strict rules

dbeedce

Merge branch 'fortran-lang:master' into regex

8319cc3

jalvesz requested changes Apr 16, 2026

refactor utility function

0cc9abd

JAi-SATHVIK requested a review from jalvesz April 17, 2026 16:19

jalvesz requested a review from arjenmarkus April 17, 2026 16:38

jvdp1 requested changes Apr 17, 2026

JAi-SATHVIK added 2 commits April 19, 2026 16:12

address review feedback on unit numbers and constants

544f34a

regex: add CMake dependency for stdlib_core

d78f89a

JAi-SATHVIK requested a review from jvdp1 April 19, 2026 11:24

4 participants
