Performance regression in striptags

After the merge of this PR: https://github.com/pallets/markupsafe/pull/413/changes the function `Markup.striptags` has become much slower for larger inputs, contrary to what has been suggested by this PR.  

This problem has been noticed after benchmarking with arbitrary large inputs: 


| Case                               | Regex µs | Current µs   | Speedup  |
|----------------------------------------|----------|--------------|----------|
| plaintext_no_tags_50k_words            | 2937.05  | 2795.78      | 0.95x ← current_implementation |
| html_entities_only_no_tags_5k               | 4843.88  | 4818.92      | 0.99x ← current_implementation |
| comments_only_1k                       | 382.75   | 1985.56      | 5.19x ← regex_old_implementation |
| comments_containing_tags_1k            | 234.88   | 1243.65      | 5.29x ← regex_old_implementation |
| comments_only_50k                      | 21047.84 | 21081024.01  | 1001.58x ← regex_old_implementation |
| comments_containing_tags_50k           | 13948.90 | 11301014.95  | 810.17x ← regex_old_implementation |
| short_tags_5k                          | 915.32   | 28235.99     | 30.85x ← regex_old_implementation |
| short_tags_20k                         | 3603.37  | 876001.98    | 243.11x ← regex_old_implementation |
| short_tags_50k                         | 10648.04 | 9064394.50   | 851.27x ← regex_old_implementation |
| nested_divs_1k_deep                    | 148.99   | 880.14       | 5.91x ← regex_old_implementation |
| nested_divs_10k_deep                   | 1730.14  | 54162.25     | 31.31x ← regex_old_implementation |
| tags_with_many_attrs_2k                | 1165.15  | 16554.24     | 14.21x ← regex_old_implementation |
| tags_with_many_attrs_20k               | 12659.77 | 7297957.10   | 576.47x ← regex_old_implementation |
| multiline_tags_20k                     | 10227.63 | 5476065.12   | 535.42x ← regex_old_implementation |
| mixed_comments_with_tags_text_2k            | 309.79   | 3034.69      | 9.80x ← regex_old_implementation |
| mixed_comments_with_tags_text_10k           | 1593.39  | 79993.58     | 50.20x ← regex_old_implementation |
 

The benchmark cases are in the form `<case_description>_<number_of_tags>`. You can see that the only cases where the current implementation is slightly faster is when you have no tags in the input which can be explained by the fact that the while loops will simply exit early. The time lost in the regex implementation is likely due to the deeper call stack to scan for the regex. 

Apart from that, the old implementation is consistently much more performant. 


I believe that this new approach is causing a quadratic complexity which scales very badly with larger inputs, unlike the regex scan which happens in C and could achieve near linear complexity (I'm not 100% sure how regex scan for compiled expressions work). In any case, a C-level loop would be much faster than in python. On top of that, the current implementation recreates an f-string in every iteration of the loop which is adding more complexity. 

I would like to suggest reverting the commit added in this PR to avoid the performance regression.

Environment:

- Python version: 3.12
- MarkupSafe version: >=2.1.4


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Performance regression in striptags #521

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Case	Regex µs	Current µs	Speedup
plaintext_no_tags_50k_words	2937.05	2795.78	0.95x ← current_implementation
html_entities_only_no_tags_5k	4843.88	4818.92	0.99x ← current_implementation
comments_only_1k	382.75	1985.56	5.19x ← regex_old_implementation
comments_containing_tags_1k	234.88	1243.65	5.29x ← regex_old_implementation
comments_only_50k	21047.84	21081024.01	1001.58x ← regex_old_implementation
comments_containing_tags_50k	13948.90	11301014.95	810.17x ← regex_old_implementation
short_tags_5k	915.32	28235.99	30.85x ← regex_old_implementation
short_tags_20k	3603.37	876001.98	243.11x ← regex_old_implementation
short_tags_50k	10648.04	9064394.50	851.27x ← regex_old_implementation
nested_divs_1k_deep	148.99	880.14	5.91x ← regex_old_implementation
nested_divs_10k_deep	1730.14	54162.25	31.31x ← regex_old_implementation
tags_with_many_attrs_2k	1165.15	16554.24	14.21x ← regex_old_implementation
tags_with_many_attrs_20k	12659.77	7297957.10	576.47x ← regex_old_implementation
multiline_tags_20k	10227.63	5476065.12	535.42x ← regex_old_implementation
mixed_comments_with_tags_text_2k	309.79	3034.69	9.80x ← regex_old_implementation
mixed_comments_with_tags_text_10k	1593.39	79993.58	50.20x ← regex_old_implementation

Uh oh!

Performance regression in striptags #521

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions