Implement Thompson NFA-based Regular Expressions #1172
JAi-SATHVIK wants to merge 119 commits into fortran-lang:master
Conversation
Fix CMakeLists.txt for the addition of stdlib_storting_pca
Master cpy
optimized for performance and stability
Update: I have finalized the core implementation of the pure Fortran regex engine. Here is a summary of what I've completed:
The engine is now stable, zero-dependency, and ready for your feedback! @arjenmarkus @jvdp1 @jalvesz
|
I had a first look at the test program you provided some days ago. I
noticed that the indices are off by one:
=== Testing Fortran Regex (Thompson NFA) ===
regcomp 'abc': status = 0
Match 'xyz_abc_def' -> T 4 7
The substring "abc" starts at 5, not 4. This off-by-one error occurs in
another test as well.
Another one:
Match 'aaaab' with 'a*b' -> T 4 5
The match starts at 1, not 4.
foo123bar: the matching substring is too short - the reported substring is
from 3 to 4, instead of 4 to 6.
cats: the matching substring is reported as 7 to 11 (five characters)
whereas the matching substring is "cats", so four characters only.
So, some work to be done, unless you have already fixed these bugs ;), but
in any case a good start.
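The expected values in this report follow from Fortran's 1-based substring convention. For literal patterns only, the intrinsic `index` can serve as an independent reference when checking the reported positions. The sketch below is not part of the PR; the module and procedure names are hypothetical.

```fortran
module literal_reference
    implicit none
    private
    public :: literal_span
contains
    !> Reference 1-based start/end of the first occurrence of a literal
    !> (a pattern without metacharacters); both are 0 when nothing matches.
    pure subroutine literal_span(string, literal, match_start, match_end)
        character(len=*), intent(in) :: string, literal
        integer, intent(out) :: match_start, match_end

        match_start = index(string, literal)
        if (match_start > 0) then
            ! Inclusive end index, as in the match_end argument of the spec
            match_end = match_start + len(literal) - 1
        else
            match_end = 0
        end if
    end subroutine literal_span
end module literal_reference
```

For the string 'xyz_abc_def' and the literal 'abc' this yields 5 and 7, the values the report expects.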
On Sat 4 Apr 2026 at 22:43, José Alves wrote:
José Alves requested changes on this pull request.
------------------------------
In doc/specs/stdlib_regex.md
<#1172 (comment)>:
> +The regular expression pattern string to compile.
+
+`status` (optional): Shall be of type `integer`. It is an `intent(out)` argument.
+Returns 0 on success, or a non-zero value if the pattern is invalid
+(e.g., mismatched parentheses or brackets).
+
+### Example
+
+```fortran
+use stdlib_regex, only: regex_type, regcomp
+type(regex_type) :: re
+integer :: stat
+
+call regcomp(re, "(cat|dog)s?", stat)
+if (stat /= 0) error stop "Invalid regex pattern"
+```
This should ideally be an executable example program in the examples folder.
------------------------------
In doc/specs/stdlib_regex.md
<#1172 (comment)>:
> +
+`string`: Shall be of type `character(len=*)`. It is an `intent(in)` argument.
+The input string to search for a match.
+
+`is_match`: Shall be of type `logical`. It is an `intent(out)` argument.
+Set to `.true.` if a match is found, `.false.` otherwise.
+
+`match_start` (optional): Shall be of type `integer`. It is an `intent(out)` argument.
+The 1-based index of the first character of the match.
+
+`match_end` (optional): Shall be of type `integer`. It is an `intent(out)` argument.
+The 1-based index of the last character of the match.
+
+### Example
+
+```fortran
Same as before: this should be an executable program in the examples folder.
------------------------------
In src/regex/stdlib_regex.f90
<#1172 (comment)>:
> + integer :: tail
+ end type out_list_type
+
+ type :: frag_type
+ integer :: start
+ type(out_list_type) :: out_list
+ end type frag_type
+
+ type :: thread
+ integer :: state
+ integer :: start_pos
+ end type thread
+
+contains
+
+ logical function is_term_ender(tag)
Can this be made pure or, even better, elemental?
------------------------------
In src/regex/stdlib_regex.f90
<#1172 (comment)>:
> + integer :: state
+ integer :: start_pos
+ end type thread
+
+contains
+
+ logical function is_term_ender(tag)
+ integer, intent(in) :: tag
+ is_term_ender = (tag == TOK_CHAR .or. tag == TOK_ANY .or. &
+ tag == TOK_CLASS .or. tag == TOK_STAR .or. &
+ tag == TOK_PLUS .or. tag == TOK_QUEST .or. &
+ tag == TOK_RPAREN .or. tag == TOK_END .or. &
+ tag == TOK_START)
+ end function is_term_ender
+
+ logical function is_term_starter(tag)
Can this be made pure or, even better, elemental?
------------------------------
In src/regex/stdlib_regex.f90
<#1172 (comment)>:
> + integer, intent(in) :: tag
+ is_term_ender = (tag == TOK_CHAR .or. tag == TOK_ANY .or. &
+ tag == TOK_CLASS .or. tag == TOK_STAR .or. &
+ tag == TOK_PLUS .or. tag == TOK_QUEST .or. &
+ tag == TOK_RPAREN .or. tag == TOK_END .or. &
+ tag == TOK_START)
+ end function is_term_ender
+
+ logical function is_term_starter(tag)
+ integer, intent(in) :: tag
+ is_term_starter = (tag == TOK_CHAR .or. tag == TOK_ANY .or. &
+ tag == TOK_CLASS .or. tag == TOK_LPAREN .or. &
+ tag == TOK_START .or. tag == TOK_END)
+ end function is_term_starter
+
+ integer function prec(tag)
Can this be made pure or, even better, elemental?
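As a sketch of what this request could look like: both predicates only read a scalar integer, so they can be declared `elemental` (which implies `pure`) without changing their bodies. The numeric values of the `TOK_*` constants below are placeholders chosen for illustration; the real ones are defined in the PR's module.

```fortran
module token_predicates
    implicit none
    private
    public :: is_term_ender, is_term_starter
    public :: TOK_CHAR, TOK_ANY, TOK_CLASS, TOK_STAR, TOK_PLUS
    public :: TOK_QUEST, TOK_LPAREN, TOK_RPAREN, TOK_START, TOK_END

    ! Placeholder tag values (assumption): only their distinctness matters here.
    integer, parameter :: TOK_CHAR = 1, TOK_ANY = 2, TOK_CLASS = 3
    integer, parameter :: TOK_STAR = 4, TOK_PLUS = 5, TOK_QUEST = 6
    integer, parameter :: TOK_LPAREN = 7, TOK_RPAREN = 8
    integer, parameter :: TOK_START = 9, TOK_END = 10
contains
    !> True when a token can end a term. Elemental, so it also accepts arrays.
    elemental logical function is_term_ender(tag)
        integer, intent(in) :: tag
        is_term_ender = (tag == TOK_CHAR .or. tag == TOK_ANY .or. &
                         tag == TOK_CLASS .or. tag == TOK_STAR .or. &
                         tag == TOK_PLUS .or. tag == TOK_QUEST .or. &
                         tag == TOK_RPAREN .or. tag == TOK_END .or. &
                         tag == TOK_START)
    end function is_term_ender

    !> True when a token can start a term.
    elemental logical function is_term_starter(tag)
        integer, intent(in) :: tag
        is_term_starter = (tag == TOK_CHAR .or. tag == TOK_ANY .or. &
                           tag == TOK_CLASS .or. tag == TOK_LPAREN .or. &
                           tag == TOK_START .or. tag == TOK_END)
    end function is_term_starter
end module token_predicates
```

With the `elemental` prefix a call such as `is_term_ender(tags)` on an integer array returns a logical array of the same shape, which can be handy when scanning a whole token list.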
------------------------------
In src/regex/stdlib_regex.f90
<#1172 (comment)>:
> + c = pattern(i:i)
+ t%tag = TOK_CHAR
+ t%c = ' '
+ t%bmap = .false.
+ t%invert = .false.
+
+ if (c == '\') then
+ if (i < len_p) then
+ i = i + 1
+ c = pattern(i:i)
+ end if
+ t%tag = TOK_CHAR
+ t%c = c
+ if (c == 'd') then
+ t%tag = TOK_CLASS
+ do k = iachar('0'), iachar('9'); t%bmap(k) = .true.; end do
Two things:
1. Could you turn all calls to iachar( ) into module-level parameters
that are then usable across the module? You can check out
https://github.com/fortran-lang/stdlib/blob/master/src/core/stdlib_ascii.fypp
for inspiration. Maybe some of those constants would be worth storing
there?
2. Please avoid one-liners for anything that is not a scalar constant:
- one = 1; zero = 0 is tolerable
- do k = iachar('0'), iachar('9'); t%bmap(k) = .true.; end do is not
very "debugger friendly", as one cannot easily set breakpoints to follow
iterations.
For this specific case, t%bmap(iachar('0'):iachar('9')) = .true. would
be equivalent, fortranic, and needs no semicolons.
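The two suggestions can be combined in a few lines. This is a minimal sketch under assumptions: the constant names are illustrative, and the bitmap is taken to be indexed by character code over 0:255, which the excerpt of the PR does not show.

```fortran
module digit_class
    implicit none
    private
    public :: set_digit_class, ASCII_ZERO, ASCII_NINE

    ! Character codes evaluated once as named constants (illustrative names).
    integer, parameter :: ASCII_ZERO = iachar('0')
    integer, parameter :: ASCII_NINE = iachar('9')
contains
    !> Mark the ASCII digits in a bitmap indexed by character code.
    !> The 0:255 bounds are an assumption about the layout of t%bmap.
    pure subroutine set_digit_class(bmap)
        logical, intent(inout) :: bmap(0:255)
        ! Array-section assignment replaces the one-line do loop
        bmap(ASCII_ZERO:ASCII_NINE) = .true.
    end subroutine set_digit_class
end module digit_class
```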
------------------------------
In src/regex/stdlib_regex.f90
<#1172 (comment)>:
> + stack(top) = tokens(i)
+ end if
+ end do
+
+ do while (top > 0)
+ if (stack(top)%tag == TOK_LPAREN) then
+ stat = 1
+ return
+ end if
+ num_postfix = num_postfix + 1
+ postfix(num_postfix) = stack(top)
+ top = top - 1
+ end do
+ end subroutine parse_to_postfix
+
+ integer function new_out(s, o, pool, p_size)
This function has side effects on pool. I totally agree with this:
https://github.com/JorgeG94/fortran_programmer_llm/blob/main/Fortran_programmer.md#functions-should-have-no-side-effects
Is it possible to consider subroutines when mutating input
derived types?
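One possible shape for this change is to return the new index through an `intent(out)` argument, so that the mutation of `pool` and `p_size` is visible at the call site. The definition of `out_node` below is inferred from the fields used in the diff and is an assumption; the extra argument name `idx` is hypothetical.

```fortran
module out_pool
    implicit none
    private
    public :: out_node, new_out

    ! Assumed layout, reconstructed from the components used in the diff.
    type :: out_node
        integer :: s = 0
        integer :: o = 0
        integer :: next = 0
    end type out_node
contains
    !> Append a node to the pool and report its index in idx.
    !> The caller must ensure that pool has room for one more entry.
    pure subroutine new_out(s, o, pool, p_size, idx)
        integer, intent(in) :: s, o
        type(out_node), intent(inout) :: pool(:)
        integer, intent(inout) :: p_size
        integer, intent(out) :: idx

        p_size = p_size + 1
        pool(p_size)%s = s
        pool(p_size)%o = o
        pool(p_size)%next = 0
        idx = p_size
    end subroutine new_out
end module out_pool
```

A call then reads `call new_out(s, o, pool, p_size, idx)` instead of `idx = new_out(s, o, pool, p_size)`.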
------------------------------
In src/regex/stdlib_regex.f90
<#1172 (comment)>:
> + top = top - 1
+ end do
+ end subroutine parse_to_postfix
+
+ integer function new_out(s, o, pool, p_size)
+ integer, intent(in) :: s, o
+ type(out_node), intent(inout) :: pool(:)
+ integer, intent(inout) :: p_size
+ p_size = p_size + 1
+ pool(p_size)%s = s
+ pool(p_size)%o = o
+ pool(p_size)%next = 0
+ new_out = p_size
+ end function new_out
+
+ subroutine merge_lists(l1, l2, res, pool)
same https://github.com/fortran-lang/stdlib/pull/1172/changes#r3036020902
------------------------------
In src/regex/stdlib_regex.f90
<#1172 (comment)>:
> + type(out_list_type), intent(in) :: l1, l2
+ type(out_list_type), intent(out) :: res
+ type(out_node), intent(inout) :: pool(:)
+ if (l1%head == 0) then
+ res = l2
+ else if (l2%head == 0) then
+ res = l1
+ else
+ pool(l1%tail)%next = l2%head
+ res%head = l1%head
+ res%tail = l2%tail
+ end if
+ end subroutine merge_lists
+
+ subroutine do_patch(states, list, target, pool)
+ type(state_type), intent(inout) :: states(:)
Another style comment: usually, with subroutines, it is customary to have
the non-mutable inputs first (strict intent(in)), then the intent(out) or
intent(inout)s, and finally the optionals.
Would it be possible to follow this recommendation?
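Applied to `do_patch`, the recommended ordering would put `list`, `target` and `pool` first and the mutated `states` array last. The sketch below keeps the body from the diff; the three type definitions are assumptions reconstructed from the components it uses.

```fortran
module patch_order
    implicit none
    private
    public :: state_type, out_node, out_list_type, do_patch

    ! Assumed layouts, inferred from the components referenced in the diff.
    type :: state_type
        integer :: out1 = 0
        integer :: out2 = 0
    end type state_type

    type :: out_node
        integer :: s = 0
        integer :: o = 0
        integer :: next = 0
    end type out_node

    type :: out_list_type
        integer :: head = 0
        integer :: tail = 0
    end type out_list_type
contains
    !> Point every dangling arrow in list at target.
    !> Argument order: intent(in) first, then intent(inout).
    pure subroutine do_patch(list, target, pool, states)
        type(out_list_type), intent(in) :: list
        integer, intent(in) :: target
        type(out_node), intent(in) :: pool(:)
        type(state_type), intent(inout) :: states(:)
        integer :: curr

        curr = list%head
        do while (curr /= 0)
            if (pool(curr)%o == 1) then
                states(pool(curr)%s)%out1 = target
            else
                states(pool(curr)%s)%out2 = target
            end if
            curr = pool(curr)%next
        end do
    end subroutine do_patch
end module patch_order
```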
------------------------------
In src/regex/stdlib_regex.f90
<#1172 (comment)>:
> + integer, intent(in) :: target
+ type(out_node), intent(in) :: pool(:)
+ integer :: curr
+ curr = list%head
+ do while (curr /= 0)
+ if (pool(curr)%o == 1) then
+ states(pool(curr)%s)%out1 = target
+ else
+ states(pool(curr)%s)%out2 = target
+ end if
+ curr = pool(curr)%next
+ end do
+ end subroutine do_patch
+
+ subroutine build_nfa(postfix, num_postfix, states, n_states, start_state, stat)
+ type(token_type), intent(in) :: postfix(:)
same as
https://github.com/fortran-lang/stdlib/pull/1172/changes#r3036024863
------------------------------
In src/regex/stdlib_regex.f90
<#1172 (comment)>:
> + if (stat /= 0) then
+ if (present(status)) status = stat
+ return
+ end if
+
+ call parse_to_postfix(tokens, n_tok, postfix, n_post, stat)
+ if (stat /= 0) then
+ if (present(status)) status = stat
+ return
+ end if
+
+ call build_nfa(postfix, n_post, re%states, re%n_states, re%start_state, stat)
+ if (present(status)) status = stat
+ end subroutine regcomp
+
+ recursive subroutine add_thread(list, count, state_idx, start_pos, step_index, states, str_len, visited)
same as
https://github.com/fortran-lang/stdlib/pull/1172/changes#r3036024863
At the implementation level, is recursion absolutely necessary? Or
could there be a way to implement this without recursion?
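One way to avoid recursion in this kind of routine is an explicit work stack. The sketch below is a simplified stand-in for `add_thread`, not the PR's code: it collects plain state indices instead of `thread` values, and the `state_type` layout and the `ST_*` kinds are assumptions. Each state is expanded at most once, so a stack of `size(states) + 1` entries is sufficient.

```fortran
module iterative_closure
    implicit none
    private
    public :: state_type, add_thread_iterative
    public :: ST_CHAR, ST_SPLIT, ST_MATCH

    ! Hypothetical state kinds; a split state has two epsilon arrows.
    integer, parameter :: ST_CHAR = 1, ST_SPLIT = 2, ST_MATCH = 3

    type :: state_type
        integer :: kind = ST_CHAR
        integer :: out1 = 0
        integer :: out2 = 0
    end type state_type
contains
    !> Append to list every non-split state reachable from state_idx
    !> through split states, without recursion.
    pure subroutine add_thread_iterative(states, state_idx, visited, list, count)
        type(state_type), intent(in) :: states(:)
        integer, intent(in) :: state_idx
        logical, intent(inout) :: visited(:)
        integer, intent(inout) :: list(:)
        integer, intent(inout) :: count
        integer :: stack(size(states) + 1)
        integer :: top, s

        top = 1
        stack(top) = state_idx
        do while (top > 0)
            s = stack(top)
            top = top - 1
            if (s == 0) cycle          ! index 0 marks a missing arrow
            if (visited(s)) cycle      ! guards against epsilon cycles
            visited(s) = .true.
            if (states(s)%kind == ST_SPLIT) then
                ! Push out2 first so that out1 is explored first
                top = top + 1
                stack(top) = states(s)%out2
                top = top + 1
                stack(top) = states(s)%out1
            else
                count = count + 1
                list(count) = s
            end if
        end do
    end subroutine add_thread_iterative
end module iterative_closure
```

Pushing `out2` before `out1` keeps the visiting order of the recursive formulation, which matters if thread priority is derived from list order.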
|
|
I also tried a test:
call regcomp(re, "A+", stat)
call regmatch(re, "A AAAAAAAAAAAAAAA", is_match, match_start, match_end)
print *, "Match 'A AAAAAAAAAAAA' with 'A+' -> ", is_match, match_start, &
match_end
The result was a match with the first substring, not the longest. That is
not the classical behaviour of a regular expression matcher. I will read
the documentation to see if this was expected 😇
|
|
Ah, that was my mistake. The engine should look for the longest matching
substring when a start has been found, unless non-greedy expressions are
selected. Always a tricky part of regular expressions: to know precisely
what is to be matched and what not.
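The rule described here (leftmost start first, then the longest match from that start) can be illustrated for the special case of a single repeated character, using only intrinsics. This is a stand-in written for illustration, not the NFA engine from the PR, and the names are hypothetical.

```fortran
module leftmost_longest_demo
    implicit none
    private
    public :: match_char_plus
contains
    !> Leftmost-longest match of the pattern "c+" for one character c.
    pure subroutine match_char_plus(string, c, is_match, match_start, match_end)
        character(len=*), intent(in) :: string
        character(len=1), intent(in) :: c
        logical, intent(out) :: is_match
        integer, intent(out) :: match_start, match_end
        integer :: run_stop

        ! Leftmost: the first occurrence of c fixes the start
        match_start = index(string, c)
        is_match = match_start > 0
        if (.not. is_match) then
            match_end = 0
            return
        end if

        ! Longest: extend until the first character that is not c
        run_stop = verify(string(match_start:), c)
        if (run_stop == 0) then
            match_end = len(string)
        else
            match_end = match_start + run_stop - 2
        end if
    end subroutine match_char_plus
end module leftmost_longest_demo
```

For the input "A AAAAAAAAAAAAAAA" this reports a match from 1 to 1: the single leading "A" wins because it starts first, even though a longer run follows.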
|
|
Thanks @jalvesz @arjenmarkus! I've updated the code and addressed those issues:
Off-by-one and match lengths: Fixed! abc now correctly returns ms=5, me=7, and aaaab with a*b correctly returns ms=1, me=5.
Leftmost-Longest Priority: The engine follows the standard where the leftmost start always wins first. Because an "A" matches starting at index 1, it is chosen over any subsequent matches elsewhere in the string. Among all matches starting at that same leftmost position, the engine will strictly select the longest one before concluding.
|
|
I have written a small "interpreter" that will allow you to easily extend
the set of tests. See the attachments. The sample tests are just a start,
of course, but I already found one incompleteness in the checking for a
proper regular expression. You are welcome to use it (or just to ignore it,
if you think it is not useful).
! catalogue_regex.f90 --
!
! By Arjen Markus, dd. 7 april 2026.
!
! A catalogue of tests for the regular expression module
!
! The program reads a file with test cases:
! regexp "expression" - the expression to be compiled and used
! input "string" - the string to be matched against the last regular expression
! expected "string" - the string that is expected to match (hence output from the match routine)
! error-exp - expecting an error from the compilation of the last regular expression
! no-match - expecting a "no match" result
!
! These lines instruct the program to compile and use the regular expression.
! The results are reported in the output file.
! The lines in the file should be no more than 100 characters long.
! Also: the double quotes should surround the strings, so that they are properly delimited.
!
! The order of the lines is expected to be:
! the expected output comes after the expression and the input
!
program catalogue_regex
    use stdlib_regex, only: regex_type, regcomp, regmatch
    implicit none

    type(regex_type)              :: re
    character(len=100)            :: line
    character(len=20)             :: keyword
    character(len=:), allocatable :: value
    character(len=:), allocatable :: expression
    character(len=:), allocatable :: string
    character(len=:), allocatable :: expected
    integer                       :: match_start, match_end, status, ierr
    integer                       :: mismatches
    logical                       :: matched

    open( 10, file = 'catalogue_regex.inp', status = 'old', iostat = ierr )
    if ( ierr /= 0 ) then
        write( *, '(a)' ) 'Could not open the file "catalogue_regex.inp"'
        write( *, '(a)' ) 'It should exist - please check'
        error stop
    endif

    open( 20, file = 'catalogue_regex.report' )

    mismatches = 0

    do
        read( 10, '(a)', iostat = ierr ) line
        if ( ierr /= 0 ) then
            exit
        endif

        call extract_information( line, keyword, value )

        select case( keyword )
            case( 'expression' )
                expression = value

            case( 'input' )
                string = value

            case( 'expected' )
                ! Compile the last expression and compare the matched substring
                write( 20, '(a)' ) ''
                expected = value
                call regcomp( re, expression, status )
                if ( status /= 0 ) then
                    mismatches = mismatches + 1
                    write( 20, '(a,i0)' ) 'Error compiling the expression: status = ', status
                    write( 20, '(a,2a)' ) ' Expression: "', expression, '"'
                else
                    call regmatch( re, string, matched, match_start, match_end )
                    if ( matched ) then
                        write( 20, '(a,2a)' ) 'Match found:'
                        write( 20, '(a,2a)' ) ' Expression: "', expression, '"'
                        write( 20, '(a,2a)' ) ' Input string: "', string, '"'
                        write( 20, '(a,2a)' ) ' Substring: "', string(match_start:match_end), '"'
                        write( 20, '(a,2a)' ) ' Expected: "', expected, '"'
                        if ( expected == string(match_start:match_end) ) then
                            write( 20, '(a,2a)' ) ' Success!'
                        else
                            mismatches = mismatches + 1
                            write( 20, '(a,2a)' ) ' MISMATCH!'
                        endif
                    else
                        mismatches = mismatches + 1
                        write( 20, '(a,2a)' ) 'NO match found:'
                        write( 20, '(a,2a)' ) ' Expression: "', expression, '"'
                        write( 20, '(a,2a)' ) ' Input string: "', string, '"'
                        write( 20, '(a,2a)' ) ' Substring: (none)'
                        write( 20, '(a,2a)' ) ' Expected: "', expected, '"'
                    endif
                endif

            case( 'error-exp' )
                ! The last expression is expected to fail to compile
                write( 20, '(a)' ) ''
                call regcomp( re, expression, status )
                if ( status /= 0 ) then
                    write( 20, '(a)' ) 'Error detected as expected:'
                    write( 20, '(a,2a)' ) ' Expression: "', expression, '"'
                else
                    mismatches = mismatches + 1
                    write( 20, '(a)' ) 'An error was expected but not detected:'
                    write( 20, '(a,2a)' ) ' Expression: "', expression, '"'
                endif

            case( 'no-match' )
                ! The last expression should compile but not match the input
                write( 20, '(a)' ) ''
                call regcomp( re, expression, status )
                if ( status /= 0 ) then
                    mismatches = mismatches + 1
                    write( 20, '(a,i0)' ) 'Error compiling the expression: status = ', status
                    write( 20, '(a,2a)' ) ' Expression: "', expression, '"'
                else
                    call regmatch( re, string, matched, match_start, match_end )
                    if ( matched ) then
                        mismatches = mismatches + 1
                        write( 20, '(a,2a)' ) 'Match found where none expected:'
                        write( 20, '(a,2a)' ) ' Expression: "', expression, '"'
                        write( 20, '(a,2a)' ) ' Input string: "', string, '"'
                        write( 20, '(a,2a)' ) ' Substring: "', string(match_start:match_end), '"'
                        write( 20, '(a,2a)' ) ' Expected: (none)'
                    else
                        write( 20, '(a,2a)' ) 'No match found, as expected:'
                        write( 20, '(a,2a)' ) ' Expression: "', expression, '"'
                        write( 20, '(a,2a)' ) ' Input string: "', string, '"'
                        write( 20, '(a,2a)' ) ' Expected: (none)'
                    endif
                endif

            case default
                ! Treat any other keyword as comment
        end select
    enddo

    write( 20, '(/,a,i0)' ) 'Number of mismatches or other errors: ', mismatches
    close( 10 )
    close( 20 )

    write( *, '(a)' ) 'Program completed'

contains

subroutine extract_information( line, keyword, value )
    character(len=*), intent(in)               :: line
    character(len=*), intent(out)              :: keyword
    character(len=:), intent(out), allocatable :: value

    character(len=20), dimension(5) :: known_keywords = &
        [ 'expression          ', &
          'input               ', &
          'expected            ', &
          'error-exp           ', &
          'no-match            ' ]
    integer :: k1, k2

    if ( line == " " ) then
        keyword = ""
        value = ""
        return
    endif

    read( line, *, iostat = ierr ) keyword

    if ( keyword == 'error-exp' .or. keyword == 'no-match' ) then
        value = ""
        return
    endif

    ! Default: an empty value, also used when no quoted string is present
    value = ""

    if ( any( keyword == known_keywords ) ) then
        k1 = index( line, '"' )
        if ( k1 > 0 ) then
            ! index() returns 0 if there is no closing quote, so k2 == k1 then
            k2 = k1 + index( line(k1+1:), '"' )
            if ( k2 > k1 ) then
                value = line(k1+1:k2-1)
            else
                write( 20, '(a)' ) 'Error interpreting the input line:'
                write( 20, '(3a)' ) ' "', trim(line), '"'
                write( 20, '(2a)' ) 'Program stopped'
                write( *, '(2a)' ) 'Program stopped - error reading input. Please check'
                error stop
            endif
        endif
    endif
end subroutine extract_information

end program catalogue_regex
|
|
Thanks for the tests @arjenmarkus
|
|
Thanks @jalvesz, I’ve updated |
jvdp1
left a comment
Thank you @JAi-SATHVIK. Here are a few comments after a very quick review.
integer :: mismatches
logical :: matched

open( 10, file = 'catalogue_regex.inp', status = 'old', iostat = ierr )
Use `newunit` instead of a hard-coded unit number.
- open( 10, file = 'catalogue_regex.inp', status = 'old', iostat = ierr )
+ open( newunit=un, file = 'catalogue_regex.inp', status = 'old', iostat = ierr )
Refactored to use newunit
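For readers unfamiliar with the `newunit=` specifier suggested above, here is a minimal, self-contained sketch. The variable name `un` follows the suggestion; the file name `newunit_demo.txt` and the program name are placeholders chosen for this example and are not part of the PR.

```fortran
program demo_newunit
    implicit none
    integer :: un, ierr
    character(len=100) :: line

    ! newunit= lets the processor pick a free unit number, so the code
    ! cannot clash with units opened elsewhere in the program.
    open( newunit=un, file="newunit_demo.txt", status="replace", &
          action="write", iostat=ierr )
    if ( ierr /= 0 ) error stop "Could not create newunit_demo.txt"
    write( un, "(a)" ) 'expression "abc"'
    close( un )

    ! Reopen the same file for reading; un receives a fresh unit number.
    open( newunit=un, file="newunit_demo.txt", status="old", &
          action="read", iostat=ierr )
    if ( ierr /= 0 ) error stop "Could not reopen newunit_demo.txt"
    read( un, "(a)", iostat=ierr ) line
    close( un, status="delete" )   ! remove the scratch file again

    print "(a)", trim(line)
end program demo_newunit
```

The unit variable has to be carried to every later `read`, `write` and `close`, which is why the suggestions in this review also replace the literal `10` in the `read` statement.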
error stop
endif

open( 20, file = 'catalogue_regex.report' )
- open( 20, file = 'catalogue_regex.report' )
+ open( newunit=un20, file = 'catalogue_regex.report' )
mismatches = 0

do
read( 10, '(a)', iostat = ierr ) line
- read( 10, '(a)', iostat = ierr ) line
+ read( un, '(a)', iostat = ierr ) line
! Anchored match
call regcomp(re, "^hello", stat)
call regmatch(re, "hello world", found)
print "(A,L1)", "found = ", found
- print "(A,L1)", "found = ", found
+ print "(a,l1)", "found = ", found
integer, parameter :: CHAR_ZERO = iachar('0')
integer, parameter :: CHAR_NINE = iachar('9')
integer, parameter :: CHAR_LOWER_A = iachar('a')
integer, parameter :: CHAR_LOWER_Z = iachar('z')
integer, parameter :: CHAR_UPPER_A = iachar('A')
integer, parameter :: CHAR_UPPER_Z = iachar('Z')
integer, parameter :: CHAR_SPACE = iachar(' ')
integer, parameter :: CHAR_TAB = 9
integer, parameter :: CHAR_LF = 10
integer, parameter :: CHAR_CR = 13
integer, parameter :: CHAR_UNDERSCORE = iachar('_')
Some of these might already be defined in stdlib_ascii. Did you check them?
Updated stdlib_regex.f90 to use the constants from stdlib_ascii (TAB, LF, CR).
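As a rough illustration of what reusing the `stdlib_ascii` constants can look like, the sketch below imports `TAB`, `LF` and `CR` (single-character parameters in `stdlib_ascii`) together with `is_digit`. It assumes the Fortran stdlib is installed. The helper `is_space_like` and the set of characters it accepts are assumptions made for this example and do not describe how the engine in this PR defines its character classes.

```fortran
program demo_ascii_constants
    ! Assumes the Fortran stdlib is available; TAB, LF and CR are
    ! character(len=1) parameters exported by stdlib_ascii.
    use stdlib_ascii, only: TAB, LF, CR, is_digit
    implicit none
    character(len=*), parameter :: sample = "a" // TAB // "1" // LF
    integer :: i

    do i = 1, len(sample)
        print "(a,i0,a,l1,a,l1)", "char code ", iachar(sample(i:i)), &
            ": space-like = ", is_space_like(sample(i:i)), &
            ", digit = ", is_digit(sample(i:i))
    end do

contains

    ! Hypothetical helper for this sketch: which characters count as
    ! whitespace here is an assumption, not the engine's definition.
    pure logical function is_space_like(c)
        character(len=1), intent(in) :: c
        is_space_like = ( c == " " .or. c == TAB .or. c == LF .or. c == CR )
    end function is_space_like

end program demo_ascii_constants
```

Comparing against the character constants directly avoids maintaining a parallel set of integer codes such as `CHAR_TAB = 9`.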
issue #1163