After the merge of https://github.com/pallets/markupsafe/pull/413/changes, the `Markup.striptags` method has become much slower for large inputs, contrary to what that PR suggested.
I noticed the problem after benchmarking with arbitrarily large inputs:
| Case | Regex µs | Current µs | Speedup |
| --- | ---: | ---: | --- |
| plaintext_no_tags_50k_words | 2937.05 | 2795.78 | 0.95x ← current_implementation |
| html_entities_only_no_tags_5k | 4843.88 | 4818.92 | 0.99x ← current_implementation |
| comments_only_1k | 382.75 | 1985.56 | 5.19x ← regex_old_implementation |
| comments_containing_tags_1k | 234.88 | 1243.65 | 5.29x ← regex_old_implementation |
| comments_only_50k | 21047.84 | 21081024.01 | 1001.58x ← regex_old_implementation |
| comments_containing_tags_50k | 13948.90 | 11301014.95 | 810.17x ← regex_old_implementation |
| short_tags_5k | 915.32 | 28235.99 | 30.85x ← regex_old_implementation |
| short_tags_20k | 3603.37 | 876001.98 | 243.11x ← regex_old_implementation |
| short_tags_50k | 10648.04 | 9064394.50 | 851.27x ← regex_old_implementation |
| nested_divs_1k_deep | 148.99 | 880.14 | 5.91x ← regex_old_implementation |
| nested_divs_10k_deep | 1730.14 | 54162.25 | 31.31x ← regex_old_implementation |
| tags_with_many_attrs_2k | 1165.15 | 16554.24 | 14.21x ← regex_old_implementation |
| tags_with_many_attrs_20k | 12659.77 | 7297957.10 | 576.47x ← regex_old_implementation |
| multiline_tags_20k | 10227.63 | 5476065.12 | 535.42x ← regex_old_implementation |
| mixed_comments_with_tags_text_2k | 309.79 | 3034.69 | 9.80x ← regex_old_implementation |
| mixed_comments_with_tags_text_10k | 1593.39 | 79993.58 | 50.20x ← regex_old_implementation |
The benchmark cases are named `<case_description>_<number_of_tags>`. The only cases where the current implementation is slightly faster are those with no tags in the input, which is explained by the while loops simply exiting early; the small overhead of the regex implementation there is likely the deeper call stack needed to start the regex scan.
Apart from that, the old implementation is consistently much more performant.
I believe the new approach has quadratic complexity, which scales very badly with larger inputs, unlike the regex scan, which happens in C and achieves near-linear behavior on these inputs (I'm not 100% sure how the scan works internally for compiled expressions). In any case, a C-level loop is much faster than a Python one. On top of that, the current implementation rebuilds the string with an f-string in every iteration of the loop, which adds even more work.
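To make the quadratic claim concrete, here is a back-of-the-envelope count of bytes copied. `copied_bytes` is a hypothetical helper of mine, assuming each of the `k` tag removals rebuilds a string of roughly the current length `n`, so total copying is on the order of `n * k`:

```python
# Hypothetical cost model: each iteration that does
# value = f"{value[:start]}{value[end:]}" copies ~len(value) characters.
def copied_bytes(n_tags: int, tag_len: int = 15) -> int:
    n = n_tags * tag_len  # assumed input length: n_tags tags of tag_len chars
    total = 0
    for _ in range(n_tags):
        total += n        # one full rebuild per removed tag
        n -= tag_len      # the string shrinks by one tag per iteration
    return total

print(copied_bytes(1_000))   # ~7.5 MB copied for 1k tags
print(copied_bytes(50_000))  # ~18.75 GB copied for 50k tags
```

Going from 1k to 50k tags multiplies the copying by roughly 2500x under this model, which matches the roughly three-orders-of-magnitude blowup in the 50k rows of the table.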
I would like to suggest reverting the commit added in that PR to avoid the performance regression.
Environment:
- Python version: 3.12
- MarkupSafe version: >=2.1.4