@maurycy maurycy commented Jan 31, 2026

The MADV_HUGEPAGE hint enables Transparent Huge Pages (THP) on systems where THP is set to madvise mode, which seems to be the default on Ubuntu and Fedora, at least according to this article.

More on THP:

Importantly, it seems to carry no SIGBUS risk. mimalloc seems to already do this with MIMALLOC_LARGE_OS_PAGES=1.
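
For readers who want to try the hint outside of CPython's allocator, here is a minimal sketch using the standard `mmap` module. It is not the code from this PR: `ARENA_SIZE` and `map_with_hugepage_hint` are hypothetical names, the 2 MiB size is an assumption (the usual huge page size on x86-64), and the sketch is Unix-only. The kernel is free to ignore the hint.

```python
import mmap

# Hypothetical size for illustration: one 2 MiB huge page on x86-64.
ARENA_SIZE = 2 * 1024 * 1024


def map_with_hugepage_hint(size=ARENA_SIZE):
    """Map private anonymous memory and ask the kernel to back it with THP.

    Returns (mapping, hinted). `hinted` is False when MADV_HUGEPAGE is not
    available on this platform or the kernel rejects the request.
    """
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    advice = getattr(mmap, "MADV_HUGEPAGE", None)  # Linux-only constant
    hinted = False
    if advice is not None and hasattr(buf, "madvise"):
        try:
            buf.madvise(advice)  # only a hint, the mapping works either way
            hinted = True
        except OSError:
            # For example EINVAL on kernels built without THP support.
            pass
    return buf, hinted
```

A failed or missing hint is not treated as an error here, since the memory stays usable with regular pages.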

Reusing the benchmark from #144319:

bench_obmalloc.py
import sys, gc

def bench_small_object_churn():
    objs = []
    for _ in range(200_000): objs.append(bytearray(64))
    for _ in range(200_000): objs.append(bytearray(64)); objs.pop(0)

def bench_bulk_small_alloc():
    objs = [bytearray(48) for _ in range(1_000_000)]
    for o in objs: o[0] = 1

def bench_dict_churn():
    for _ in range(500_000): d = {"a": 1, "b": 2, "c": 3, "d": 4}; del d

def bench_mixed_sizes():
    sizes = [8, 16, 24, 32, 48, 64, 96, 128, 192, 256, 384, 512]
    objs = [bytearray(sizes[i % 12]) for i in range(500_000)]

def bench_fragmentation():
    objs = [bytearray(128) for _ in range(500_000)]
    for i in range(0, len(objs), 2): objs[i] = None
    for i in range(0, len(objs), 2): objs[i] = bytearray(128)

def bench_list_of_tuples():
    objs = [(i, i+1, i+2) for i in range(1_000_000)]

def bench_class_instances():
    class Pt:
        __slots__ = ('x', 'y', 'z')
        def __init__(s, x, y, z): s.x = x; s.y = y; s.z = z
    objs = [Pt(i, i+1, i+2) for i in range(500_000)]

def bench_arena_pressure():
    layers = [[bytearray(256) for _ in range(200_000)] for _ in range(10)]

def bench_random_walk():
    import random; random.seed(42)
    objs = [bytearray(64) for _ in range(1_000_000)]
    idx = list(range(len(objs))); random.shuffle(idx)
    for i in idx: objs[i][0] = i & 0xff

BENCHMARKS = dict(small_object_churn=bench_small_object_churn,
    bulk_small_alloc=bench_bulk_small_alloc, dict_churn=bench_dict_churn,
    mixed_sizes=bench_mixed_sizes, fragmentation=bench_fragmentation,
    list_of_tuples=bench_list_of_tuples, class_instances=bench_class_instances,
    arena_pressure=bench_arena_pressure, random_walk=bench_random_walk)

if __name__ == "__main__":
    gc.collect(); gc.disable(); BENCHMARKS[sys.argv[1]](); gc.enable()

on

[126] 2026-01-31T02:32:04.127734128+0100 maurycy@eiger /home/maurycy  % sudo cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
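
The active mode is the bracketed word in that file. As a small sketch (the helper names `parse_thp_mode` and `current_thp_mode` are made up for illustration), it can be read programmatically like this:

```python
import re

THP_ENABLED_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def parse_thp_mode(text):
    """Return the bracketed (active) mode, or None if none is marked."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def current_thp_mode(path=THP_ENABLED_PATH):
    """Read the active THP mode, or None if the file cannot be read."""
    try:
        with open(path) as f:
            return parse_thp_mode(f.read())
    except OSError:
        return None
```

With the output shown above, `parse_thp_mode("always [madvise] never")` returns `"madvise"`.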

The baseline is the main branch.

Wall-clock time

| Benchmark | Baseline | With MADV_HUGEPAGE | Change |
| --- | --- | --- | --- |
| fragmentation | 0.107s | 0.101s | -5.4% |
| bulk_small_alloc | 0.126s | 0.121s | -4.1% |
| class_instances | 0.078s | 0.076s | -2.9% |
| list_of_tuples | 0.102s | 0.101s | -1.2% |
| mixed_sizes | 0.085s | 0.084s | -1.1% |
| random_walk | 0.517s | 0.515s | -0.4% |
| arena_pressure | 0.325s | 0.326s | +0.3% |

dTLB load misses

| Benchmark | Baseline | With MADV_HUGEPAGE | Change |
| --- | --- | --- | --- |
| fragmentation | 123,390 | 99,413 | -19.4% |
| arena_pressure | 280,228 | 237,222 | -15.3% |
| bulk_small_alloc | 93,894 | 85,661 | -8.8% |
| list_of_tuples | 88,019 | 81,778 | -7.1% |
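
The Change column is the relative difference against the baseline. A tiny sketch of that arithmetic (`percent_change` is a name chosen here, not something from the benchmark script):

```python
def percent_change(baseline, modified):
    """Relative change of `modified` versus `baseline`, in percent."""
    return (modified - baseline) / baseline * 100.0


# dTLB load misses for the fragmentation benchmark.
print(round(percent_change(123_390, 99_413), 1))  # -19.4
```

Recomputing the wall-clock percentages from the rounded times shown above will not match exactly; those were presumably derived from unrounded timings.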

The improvement is smaller than with MAP_HUGETLB because MADV_HUGEPAGE is just a hint, so khugepaged may not have kicked in yet.
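
One way to check whether huge pages were actually handed out is to look at the `AnonHugePages` counters the kernel exposes for the process. This is a sketch, not part of the measurements above: the helper names are hypothetical, and it assumes a Linux kernel that provides `/proc/self/smaps_rollup`.

```python
def anon_huge_kib(smaps_text):
    """Sum all AnonHugePages entries (in kB) found in smaps-style text."""
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith("AnonHugePages:"):
            total += int(line.split()[1])  # format: "AnonHugePages:  4096 kB"
    return total


def current_anon_huge_kib(path="/proc/self/smaps_rollup"):
    """THP-backed anonymous memory of this process in kB, or None if unknown."""
    try:
        with open(path) as f:
            return anon_huge_kib(f.read())
    except OSError:
        return None
```

Calling `current_anon_huge_kib()` at the end of a benchmark function gives a rough indication; a value of 0 would suggest the hint had not taken effect by that point.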

I noted no regression with THP=always.

The only thing I'm wondering is whether and how it should be guarded. Enabling it by default seems risky, but it's not exactly --with-pymalloc-hugepages. That's why I'm opening this as a draft.

pyperformance --rigorous suite (I'd say it's jitter: asyncio_tcp is I/O bound, scimark is numpy, the benchmarks are short-lived etc.)

uv run --with pyperf python -m pyperf compare_to /tmp/baseline_affinity.json /tmp/modified_affinity.json --table --table-format md
| Benchmark | baseline_affinity | modified_affinity |
| --- | --- | --- |
| many_optionals | 693 us | 688 us: 1.01x faster |
| subparsers | 7.71 ms | 7.65 ms: 1.01x faster |
| async_generators | 290 ms | 288 ms: 1.00x faster |
| async_tree_cpu_io_mixed_tg | 411 ms | 414 ms: 1.01x slower |
| async_tree_eager_cpu_io_mixed | 344 ms | 344 ms: 1.00x faster |
| async_tree_eager_cpu_io_mixed_tg | 380 ms | 385 ms: 1.01x slower |
| async_tree_eager_memoization | 172 ms | 170 ms: 1.01x faster |
| async_tree_eager_tg | 162 ms | 165 ms: 1.02x slower |
| async_tree_io | 447 ms | 454 ms: 1.02x slower |
| async_tree_memoization | 243 ms | 234 ms: 1.04x faster |
| async_tree_memoization_tg | 252 ms | 239 ms: 1.06x faster |
| async_tree_none_tg | 200 ms | 195 ms: 1.03x faster |
| asyncio_tcp | 301 ms | 269 ms: 1.12x faster |
| asyncio_tcp_ssl | 1.28 sec | 1.27 sec: 1.00x faster |
| asyncio_websockets | 359 ms | 357 ms: 1.01x faster |
| chameleon | 12.1 ms | 12.1 ms: 1.01x faster |
| chaos | 44.4 ms | 44.1 ms: 1.01x faster |
| comprehensions | 12.6 us | 12.7 us: 1.01x slower |
| bench_thread_pool | 795 us | 800 us: 1.01x slower |
| crypto_pyaes | 56.8 ms | 57.3 ms: 1.01x slower |
| dask | 700 ms | 698 ms: 1.00x faster |
| deepcopy | 186 us | 186 us: 1.00x slower |
| deepcopy_reduce | 2.06 us | 2.08 us: 1.01x slower |
| deepcopy_memo | 18.6 us | 19.1 us: 1.03x slower |
| deltablue | 2.50 ms | 2.47 ms: 1.01x faster |
| django_template | 29.7 ms | 29.5 ms: 1.01x faster |
| docutils | 2.21 sec | 2.19 sec: 1.01x faster |
| dulwich_log | 44.2 ms | 45.0 ms: 1.02x slower |
| fannkuch | 285 ms | 280 ms: 1.02x faster |
| gc_traversal | 4.08 ms | 4.25 ms: 1.04x slower |
| generators | 22.9 ms | 22.6 ms: 1.01x faster |
| genshi_text | 17.1 ms | 17.3 ms: 1.01x slower |
| genshi_xml | 39.5 ms | 39.2 ms: 1.01x faster |
| go | 90.0 ms | 89.8 ms: 1.00x faster |
| hexiom | 4.39 ms | 4.47 ms: 1.02x slower |
| html5lib | 48.9 ms | 48.3 ms: 1.01x faster |
| json_dumps | 7.57 ms | 7.50 ms: 1.01x faster |
| json_loads | 18.4 us | 18.5 us: 1.01x slower |
| logging_simple | 4.54 us | 4.43 us: 1.02x faster |
| mako | 8.47 ms | 8.49 ms: 1.00x slower |
| mdp | 941 ms | 965 ms: 1.03x slower |
| meteor_contest | 95.9 ms | 94.8 ms: 1.01x faster |
| nbody | 67.5 ms | 67.9 ms: 1.01x slower |
| nqueens | 73.6 ms | 72.4 ms: 1.02x faster |
| pathlib | 10.0 ms | 10.1 ms: 1.01x slower |
| pickle | 13.8 us | 13.8 us: 1.01x faster |
| pickle_dict | 24.9 us | 24.6 us: 1.01x faster |
| pickle_list | 4.08 us | 4.11 us: 1.01x slower |
| pickle_pure_python | 250 us | 247 us: 1.01x faster |
| pidigits | 185 ms | 184 ms: 1.00x faster |
| pprint_safe_repr | 573 ms | 568 ms: 1.01x faster |
| pprint_pformat | 1.18 sec | 1.16 sec: 1.02x faster |
| pyflate | 327 ms | 324 ms: 1.01x faster |
| python_startup | 11.0 ms | 11.0 ms: 1.00x faster |
| python_startup_no_site | 6.48 ms | 6.48 ms: 1.00x faster |
| raytrace | 211 ms | 214 ms: 1.02x slower |
| regex_compile | 98.2 ms | 98.3 ms: 1.00x slower |
| regex_dna | 164 ms | 156 ms: 1.05x faster |
| regex_effbot | 2.32 ms | 2.18 ms: 1.06x faster |
| regex_v8 | 18.2 ms | 17.5 ms: 1.04x faster |
| richards | 33.6 ms | 34.3 ms: 1.02x slower |
| richards_super | 38.5 ms | 38.3 ms: 1.00x faster |
| scimark_fft | 204 ms | 203 ms: 1.01x faster |
| scimark_lu | 68.9 ms | 66.6 ms: 1.04x faster |
| scimark_monte_carlo | 43.2 ms | 44.0 ms: 1.02x slower |
| scimark_sor | 75.9 ms | 74.7 ms: 1.02x faster |
| scimark_sparse_mat_mult | 3.24 ms | 3.05 ms: 1.06x faster |
| spectral_norm | 64.5 ms | 64.7 ms: 1.00x slower |
| sphinx | 808 ms | 798 ms: 1.01x faster |
| sqlglot_v2_normalize | 82.4 ms | 83.6 ms: 1.01x slower |
| sqlglot_v2_optimize | 41.7 ms | 41.8 ms: 1.00x slower |
| sqlglot_v2_parse | 973 us | 990 us: 1.02x slower |
| sqlglot_v2_transpile | 1.26 ms | 1.25 ms: 1.00x faster |
| sympy_integrate | 16.4 ms | 16.4 ms: 1.00x slower |
| sympy_sum | 112 ms | 112 ms: 1.00x faster |
| sympy_str | 214 ms | 215 ms: 1.01x slower |
| telco | 118 ms | 120 ms: 1.02x slower |
| tomli_loads | 1.49 sec | 1.50 sec: 1.01x slower |
| tornado_http | 79.7 ms | 79.4 ms: 1.00x faster |
| typing_runtime_protocols | 124 us | 121 us: 1.02x faster |
| unpack_sequence | 32.7 ns | 31.9 ns: 1.03x faster |
| unpickle | 11.0 us | 10.7 us: 1.02x faster |
| unpickle_list | 3.95 us | 3.99 us: 1.01x slower |
| unpickle_pure_python | 163 us | 163 us: 1.00x slower |
| xdsl_constant_fold | 36.0 ms | 36.2 ms: 1.01x slower |
| xml_etree_parse | 109 ms | 108 ms: 1.01x faster |
| xml_etree_iterparse | 68.3 ms | 67.3 ms: 1.01x faster |
| xml_etree_generate | 68.3 ms | 67.3 ms: 1.01x faster |
| xml_etree_process | 47.4 ms | 47.7 ms: 1.01x slower |
| Geometric mean | (ref) | 1.00x faster |
