Skip to content

Performance regression in striptags #521

@MohamedKasem99

Description

@MohamedKasem99

After the merge of this PR: https://github.com/pallets/markupsafe/pull/413/changes the function Markup.striptags has become much slower for larger inputs, contrary to what has been suggested by this PR.

This problem has been noticed after benchmarking with arbitrary large inputs:

Case Regex µs Current µs Speedup
plaintext_no_tags_50k_words 2937.05 2795.78 0.95x ← current_implementation
html_entities_only_no_tags_5k 4843.88 4818.92 0.99x ← current_implementation
comments_only_1k 382.75 1985.56 5.19x ← regex_old_implementation
comments_containing_tags_1k 234.88 1243.65 5.29x ← regex_old_implementation
comments_only_50k 21047.84 21081024.01 1001.58x ← regex_old_implementation
comments_containing_tags_50k 13948.90 11301014.95 810.17x ← regex_old_implementation
short_tags_5k 915.32 28235.99 30.85x ← regex_old_implementation
short_tags_20k 3603.37 876001.98 243.11x ← regex_old_implementation
short_tags_50k 10648.04 9064394.50 851.27x ← regex_old_implementation
nested_divs_1k_deep 148.99 880.14 5.91x ← regex_old_implementation
nested_divs_10k_deep 1730.14 54162.25 31.31x ← regex_old_implementation
tags_with_many_attrs_2k 1165.15 16554.24 14.21x ← regex_old_implementation
tags_with_many_attrs_20k 12659.77 7297957.10 576.47x ← regex_old_implementation
multiline_tags_20k 10227.63 5476065.12 535.42x ← regex_old_implementation
mixed_comments_with_tags_text_2k 309.79 3034.69 9.80x ← regex_old_implementation
mixed_comments_with_tags_text_10k 1593.39 79993.58 50.20x ← regex_old_implementation

The benchmark cases are in the form <case_description>_<number_of_tags>. You can see that the only cases where the current implementation is slightly faster is when you have no tags in the input which can be explained by the fact that the while loops will simply exit early. The time lost in the regex implementation is likely due to the deeper call stack to scan for the regex.

Apart from that, the old implementation is consistently much more performant.

I believe that this new approach is causing a quadratic complexity which scales very badly with larger inputs, unlike the regex scan which happens in C and could achieve near linear complexity (I'm not 100% sure how regex scan for compiled expressions work). In any case, a C-level loop would be much faster than in python. On top of that, the current implementation recreates an f-string in every iteration of the loop which is adding more complexity.

I would like to suggest reverting the commit added in this PR to avoid the performance regression.

Environment:

  • Python version: 3.12
  • MarkupSafe version: >=2.1.4

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions