mgyoo86 commented Oct 30, 2025

Status

🚧 Work in Progress - Testing and feedback welcome

Summary

This PR adds and fixes @nospecialize annotations on frequently called utility functions to prevent excessive method specialization and reduce compilation overhead.

Major Changes

Core Functions Modified

  • ulocation(), f2u() - Added @nospecialize for IDS parameters
  • info(), coordinates() - Prevented specialization on IDS types
  • diff(), merge!(), freeze!() - Refactored type handling
  • dict2imas() - Fixed specialization in IO operations

Known Issues

⚠️ Specialization still occurs when calling high-level functions like hdf2imas() or through certain code paths. Root cause investigation ongoing.

Testing

Test File

See tmp_test/test_ulocation.jl for specialization behavior tests.

Required Temporary Modification

To test properly, you need to modify dd.jl so that setfield! calls use getfield(ids, :field) instead of ids.field property access (which dispatches through getproperty):

macOS:

sed -E -i '' 's/setfield!\(ids\.([^,]*)/setfield!(getfield(ids, :\1)/g' src/dd.jl

Linux (GNU sed, which takes no backup-extension argument):

sed -E -i 's/setfield!\(ids\.([^,]*)/setfield!(getfield(ids, :\1)/g' src/dd.jl

This replaces the ids.field property access inside setfield! calls with getfield(ids, :field), bypassing the getproperty implementations during testing.

Next Steps

  • Identify remaining specialization sources in high-level functions
  • Validate performance improvements with benchmarks
  • Determine if dd.jl modifications should be permanent or test-only
  • Add comprehensive test coverage

    Changed from the @nospecialize(x::T) where {T<:Type} pattern to
    @nospecialize(x::Type) to properly prevent type specialization.

    This ensures Julia doesn't compile separate versions for each
    concrete type, reducing compilation overhead.
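A minimal sketch of the signature change, using an illustrative function name (the PR applies this to ulocation, f2u, and related functions):

```julia
# Before: the method type parameter reintroduces specialization, so Julia
# still compiles one method instance per concrete type despite @nospecialize.
fs2u_before(@nospecialize(x::T)) where {T<:Type} = string(x)

# After: no type parameter; a single compiled method serves all Type inputs.
fs2u_after(@nospecialize(x::Type)) = string(x)
```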
Add @assert type checks to prevent merging/freezing different IDS types.
This fixes type safety issues introduced by commit 74ded67 where 'where T'
constraints were removed.

Functions updated:
- merge!(::IDS, ::IDS) - assert same type
- merge!(::IDSvector, ::IDSvector) - assert same eltype
- freeze!(::IDS, ::IDS) - assert same type
- freeze!(::IDSvector, ::IDSvector) - assert same eltype

These assertions prevent runtime errors from field mismatches when
operating on incompatible types.
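A minimal sketch of the guard pattern these assertions follow (IDS types come from IMASdd; the function name and message text are illustrative):

```julia
# Without `where T` constraints, dispatch no longer guarantees matching
# types, so the check moves into the body:
function merge_guard!(@nospecialize(target), @nospecialize(source))
    @assert typeof(target) === typeof(source) "merge!: IDS types must match"
    # ... field-by-field merge ...
    return target
end

# The IDSvector variants assert eltype(target) === eltype(source) instead.
```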
Replace error() with a Dict return when comparing different types.
diff() now returns a Dict with a "type_mismatch" key instead of
throwing an error, making it more flexible and non-disruptive.

Example:
  diff(dd, dd.equilibrium)
  => Dict("type_mismatch" => "dd{Float64} != equilibrium{Float64}")
Add @nospecialize annotations to info and coordinates to reduce
compilation overhead and binary size.
Added @nospecialize annotations to location and conversion functions
in f2.jl to prevent excessive method specialization:
- ulocation(ids, field) and variants
- f2u(ids) - converts IDS to universal location string
- fs2u(ids_type) - converts IDS type to universal location

Note: ulocation specialization still observed during constructor
execution despite @nospecialize - investigation ongoing into when
and why specialization occurs in the call chain.
Add dd_nospecialize() helper function that uses Base.invokelatest to
prevent compiler from analyzing dd() internals during type inference.
Replace all dd() default arguments in I/O functions with dd_nospecialize()
to significantly reduce allocation and compilation time.

The invokelatest barrier prevents the compiler from specializing on the
complex 180k-line dd struct generation, while ::dd{Float64} type assertion
ensures proper type propagation without additional inference overhead.

Affected functions:
- json2imas, jstr2imas
- hdf2imas (default arg and internal call)
- h5i2imas

Performance impact: ~20M fewer allocations in hdf2imas calls.
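A minimal sketch of the barrier, assuming IMASdd's generated dd type (the exact helper in the PR may differ):

```julia
# Base.invokelatest acts as an inference barrier: the compiler treats the
# call as opaque rather than analyzing the ~180k-line generated dd() method,
# while the type assertion restores a concrete return type for callers.
dd_nospecialize() = Base.invokelatest(dd)::dd{Float64}

# Assumed use as a default argument in an I/O entry point:
function hdf2imas_sketch(filename::AbstractString, ids::dd{Float64}=dd_nospecialize())
    # ... populate `ids` from the HDF5 file ...
    return ids
end
```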
Apply @nospecialize to fieldtype, getproperty, parent, name, goto,
getindex, and time-related functions to reduce method specialization.
Temporary test file for investigating method specialization behavior.
mgyoo86 added the WIP (Work in Progress) label Oct 30, 2025
mgyoo86 changed the title from "# [WIP] Reduce compilation overhead with @nospecialize" to "[WIP] Reduce compilation overhead with @nospecialize" Oct 30, 2025
codecov bot commented Oct 30, 2025

Codecov Report

❌ Patch coverage is 63.47656% with 187 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.94%. Comparing base (01f7aff) to head (7b16c97).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/data.jl | 62.50% | 63 Missing ⚠️ |
| src/identifiers.jl | 0.00% | 31 Missing ⚠️ |
| src/time.jl | 62.12% | 25 Missing ⚠️ |
| src/show.jl | 22.72% | 17 Missing ⚠️ |
| src/diagnostics.jl | 76.81% | 16 Missing ⚠️ |
| src/io.jl | 68.88% | 14 Missing ⚠️ |
| src/f2.jl | 82.69% | 9 Missing ⚠️ |
| src/expressions.jl | 83.33% | 6 Missing ⚠️ |
| src/macros.jl | 80.00% | 3 Missing ⚠️ |
| src/math.jl | 0.00% | 2 Missing ⚠️ |
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master      #83      +/-   ##
==========================================
+ Coverage   43.76%   43.94%   +0.18%     
==========================================
  Files          13       15       +2     
  Lines       31243    31398     +155     
==========================================
+ Hits        13672    13797     +125     
- Misses      17571    17601      +30     

Remove type parameters from @nospecialize signatures and extract types inside function body using eltype() to properly prevent specialization.
Remove type parameters from @nospecialize signatures in ==, isequal, isapprox, and _extract_comparable_fields to properly prevent specialization on Union types.

Also update docstrings for resize!, diff, and time_groups functions.
…me bottleneck

Added @nospecializeinfer to 165+ functions across 9 core modules, drastically
reducing type inference overhead and specialization explosion.

Impact:
- Eliminates ~50% of top-level type inference allocations
- Prevents Union type specialization combinatorial explosion
- Dramatically reduces first-time function compilation overhead
- Measured: compilation's share of hdf2imas execution time drops from 98.71% to negligible levels

Changes:
- Added `using Base: @nospecializeinfer` import
- Applied to functions with @nospecialize parameters in:
  cocos (6), data (25), expressions (18), f2 (24), identifiers (15),
  io (31), show (16), time (30)

Technical rationale:
@nospecializeinfer prevents both specialization AND type inference propagation.
Critical for Union types like Union{IDS,IDSvector,Vector{IDS}} which trigger
combinatorial method generation and expensive typeinf_ext_toplevel calls.
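A minimal sketch of the annotation pattern applied across these modules (requires Julia >= 1.10; the function and Union below are illustrative):

```julia
using Base: @nospecializeinfer

@nospecializeinfer function summarize(@nospecialize(x::Union{AbstractVector,AbstractDict}))
    # Inference stops at the signature: the body is inferred with x::Any,
    # so no per-concrete-type method instances are generated for this Union.
    return string(typeof(x), " with ", length(x), " entries")
end

summarize([1, 2, 3])        # one compiled method...
summarize(Dict(:a => 1))    # ...reused for every concrete input type
```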
This partially reverts the previous commit which added @nospecializeinfer
to almost every function. While @nospecializeinfer improves compilation time,
it prevents type inference for certain key functions that are essential for
nested calls to return concrete types.

Functions affected:
- cocos_out, cocos_transform, transform_cocos_* (src/cocos.jl)
- concrete_fieldtype_typeof, eltype_concrete_fieldtype_typeof, Base.getproperty (src/data.jl)

These functions must infer concrete return types for nested IMASdd function
calls to work correctly, as tested in test/runtests_concrete.jl.

Result: All tests now pass again, including concrete type inference tests.
orso82 commented Nov 1, 2025

great find with the '@nospecializeinfer' macro !!

Due to @nospecialize/@nospecializeinfer constraints, type parameters
cannot be used to guarantee matching types at dispatch time.

Solution:
- Add hot path methods for identical types (Int32/64, Float32/64, UInt64, Bool)
- These route to __convert_same_real_type! helper for fast path
- Generic Real->Real method now performs runtime type checking
- Convert to target type when needed (e.g., Float64 → Measurement{Float64})

Changes:
- Check COCOS conversion consistently in all cases (previously assumed to always be needed)
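A hedged sketch of the dispatch layout described above (__convert_same_real_type! is named in the commit; the outer function and bodies are illustrative stand-ins):

```julia
__convert_same_real_type!(dst, src) = copyto!(dst, src)  # fast path, same eltype

# Hot path methods for identical element types, generated per type:
for T in (Int32, Int64, Float32, Float64, UInt64, Bool)
    @eval assign_data!(dst::Vector{$T}, src::Vector{$T}) =
        __convert_same_real_type!(dst, src)
end

# Generic fallback: a runtime check replaces the `where {T}` dispatch guarantee.
function assign_data!(@nospecialize(dst::Vector), @nospecialize(src::Vector))
    if eltype(dst) === eltype(src)
        return __convert_same_real_type!(dst, src)
    end
    return copyto!(dst, convert.(eltype(dst), src))  # e.g. Float64 -> Measurement{Float64}
end
```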
Separated DD from Union{DD, IDSraw, IDSvectorRawElement} into dedicated method.
Large Union (~130+ concrete subtypes) prevented compiler from inlining hot path.

DD now gets specialized method without @nospecialize for optimal performance.
Added guard condition (user_cocos != to_cocos) before calling cocos_out.
Since both default to 11, this avoids unnecessary function calls on the hot path.
Added @inline and ::Bool annotations to hasdata for better type inference.
Helps compiler optimize getproperty hot path by eliminating runtime type checks.
mgyoo86 commented Nov 11, 2025

Performance Improvements: @nospecializeinfer Optimization

This PR dramatically reduces compilation overhead by applying @nospecializeinfer to hot compilation paths, along with targeted optimizations in key areas. While further micro-benchmarking could identify additional sweet spots, the current improvements are substantial and production-ready. This comment summarizes the progress achieved so far.


Root Cause: Excessive Specializations

Previously, type inference penetrated deep into function call chains, ignoring @nospecialize annotations and generating tens of thousands of specialized function variants. This behavior deviated from the original design intent, and the unnecessary overhead of compiling these excessive specializations was the primary cause of slow FUSE startup times.

IMASdd Performance Comparison

Test Environment: Julia 1.12.1 | arm64-apple-darwin24.0.0 | Date: 2025-11-11

First Execution Performance (@time)

| Benchmark | Baseline (master) | This PR | Improvement |
|---|---|---|---|
| json2imas | 16.2 s (96.7M allocs, 4.69 GiB) | 3.9 s (14.6M allocs, 726 MiB) | 4.1x |
| hdf2imas | 36.4 s (267M allocs, 12.9 GiB) | 8.2 s (24.9M allocs, 1.20 GiB) | 4.4x |
| deepcopy | 18.9 s (215M allocs, 10.4 GiB) | 5.2 s (62.6M allocs, 3.06 GiB) | 3.6x |
| get_timeslice | 312 s (897M allocs, 43.2 GiB) | 3.6 s (23.5M allocs, 1.14 GiB) | 87x 🚀 |
| diff | 2754 s (885M allocs, 42.5 GiB) | 0.3 s (2.85M allocs, 142 MiB) | 9200x 🚀 |

Runtime Performance (@btime)

| Benchmark | Baseline (master) | This PR | Improvement |
|---|---|---|---|
| json2imas | 7.84 ms (72.3k allocs, 3.28 MiB) | 8.92 ms (84.1k allocs, 3.89 MiB) | 0.88x ⚠️ |
| hdf2imas | 53.9 ms (99.0k allocs, 3.75 MiB) | 55.3 ms (112k allocs, 4.40 MiB) | 0.97x |
| deepcopy | 290 µs (20.8k allocs, 837 KiB) | 284 µs (20.8k allocs, 837 KiB) | 1.02x |
| get_timeslice | 376 µs (22.1k allocs, 1.00 MiB) | 382 µs (22.1k allocs, 1.00 MiB) | 0.98x |
| diff | N/A | 14.3 ms (164k allocs, 8.3 MiB) | - |

Legend: 🚀 Dramatic improvement (>2x) | ✅ Improved (>1.05x) | ≈ Similar (0.95-1.05x) | ⚠️ Regression (<0.95x)


Impact on FUSE

Problem

FUSE's initial compilation triggered excessive type inference, leading to extreme memory consumption. This was particularly problematic on memory-constrained GitHub macOS runners (~4GB RAM), where test runs took 3.5-4 hours.

Solution

By reducing inference depth through @nospecializeinfer, the memory footprint was reduced by nearly 50%, enabling FUSE test runs to complete in about one hour on macOS runners (though performance varies with GC behavior and remains slower than on Linux runners).

Impacts on FUSE's CI

The following two CI configurations are compared:

Before: FUSE (master) + IMASdd (master) Link

After: FUSE (master) + IMASdd (this PR) Link

| Platform | Stage | Before (master) | After (this PR) | Improvement |
|---|---|---|---|---|
| macOS | Total CI time | 3h 49m | 1h 25m | 2.7x faster |
| macOS | julia-runtest only | 3h 24m | 1h 6m | 3.1x faster |
| Ubuntu | Total CI time | 1h 20m | 1h 3m | 1.3x faster |
| Ubuntu | julia-runtest only | 1h 5m | 49m | 1.3x faster |

Note: Total CI time includes environment setup, dependency installation, and test execution. The julia-runtest stage represents pure test execution time.


📊 Detailed FUSE Benchmark Results

1. Ubuntu Runner

| Testset in CI | FUSE (master) + IMASdd (master) | FUSE (master) + IMASdd (this PR) | Improvement |
|---|---|---|---|
| warmup_before_compile | 1043 s (86.9 GiB) | 620 s (49.6 GiB) | 1.7x |
| warmup_after_compile | 62.2 s (9.56 GiB) | 64.6 s (10.0 GiB) | 0.96x |
| MANTA | 57.6 s (11.6 GiB) | 60.9 s (12.2 GiB) | 0.95x |
| FPP | 43.1 s (6.8 GiB) | 42.7 s (7.1 GiB) | 1.01x |
| JET_HDB5 | 34.4 s (9.13 GiB) | 33.6 s (9.23 GiB) | 1.02x |
| EXCITE | 23.6 s (4.15 GiB) | 26.0 s (4.47 GiB) | 0.91x ⚠️ |
| D3D_Hmode | 23.4 s (2.53 GiB) | 22.7 s (2.57 GiB) | 1.03x |
| D3D_Lmode | 20.8 s (1.98 GiB) | 19.7 s (2.00 GiB) | 1.05x |
| FluxMatcher | 9.31 s (478 MiB) | 9.76 s (479 MiB) | 0.95x |

2. macOS Runner

| Testset in CI | FUSE (master) + IMASdd (master) | FUSE (master) + IMASdd (this PR) | Improvement |
|---|---|---|---|
| warmup_before_compile | 2550 s (93.6 GiB) | 608 s (56.6 GiB) | 4.2x |
| warmup_after_compile | 370 s (15.5 GiB) | 113 s (16.7 GiB) | 3.3x |
| MANTA | 257 s (15.7 GiB) | 128 s (16.5 GiB) | 2.0x |
| FPP | 231 s (9.87 GiB) | 144 s (10.8 GiB) | 1.6x |
| JET_HDB5 | 113 s (4.54 GiB) | 29.7 s (4.54 GiB) | 3.8x |
| EXCITE | 101 s (4.08 GiB) | 58.7 s (4.39 GiB) | 1.7x |
| D3D_Hmode | 87.6 s (2.61 GiB) | 42.9 s (2.65 GiB) | 2.0x |
| D3D_Lmode | 62.1 s (2.02 GiB) | 33.8 s (2.04 GiB) | 1.8x |
| FluxMatcher | 26.5 s (494 MiB) | 9.44 s (475 MiB) | 2.8x |

Key Benefits & Thoughts

Performance Characteristics

  • Compilation time: 2-10x reduction across benchmarks
  • Runtime performance: slight memory overhead with <5% performance variance in most cases (note: GC interference makes precise measurement challenging)

Strategy

  1. Prevent unnecessary inference: Block excessive specialization with @nospecializeinfer
  2. Profile hot paths: Apply targeted specialization only where profiling proves necessary
  3. Optimize for interactive use: Most FUSE users (especially beginners) will use pure package workflows without system images

System Image Generation

  • Current approach: Dramatically reduces compilation time during sysimage creation
  • Alternative: If full specialization is required, systematically remove @nospecializeinfer (e.g., with a command-line tool such as sed) before compilation to restore the original behavior

Call for Testing

While benchmark results and CI metrics show significant improvements, real-world validation is essential before merging. This PR needs more intensive testing with actual FUSE workflows to ensure production readiness.

@bclyons12 @orso82 Could you please test this PR in actual use cases? Any feedback would be greatly appreciated :)

bclyons12 commented:
Here's a comparison on Julia 1.11 of FUSE.timer for FUSE.warmup(dd). Left is master, right is this branch. It looks like there's a regression in ActorHFSsizing (roughly 1 s to 2 s) and ActorFluxMatcher (roughly 4 s to 5 s). The latter is a bit concerning as we typically consider it performance critical. Is there low-hanging fruit to improve this, or is it a price we'd have to pay for the faster compilation?

────────────────────────────────────────────────────────────────────────────────        ────────────────────────────────────────────────────────────────────────────────
                                       Time                    Allocations                                                     Time                    Allocations      
                              ───────────────────────   ────────────────────────                                      ───────────────────────   ────────────────────────
      Tot / % measured:            43.7s /  97.8%           27.2GiB /  95.9%                  Tot / % measured:            45.2s /  98.0%           28.3GiB /  96.0%    
										                                                                                        
Section               ncalls     time    %tot     avg     alloc    %tot      avg        Section               ncalls     time    %tot     avg     alloc    %tot      avg
────────────────────────────────────────────────────────────────────────────────        ────────────────────────────────────────────────────────────────────────────────
WholeFacility              2    40.6s   95.0%   20.3s   24.1GiB   92.4%  12.1GiB        WholeFacility              2    42.2s   95.2%   21.1s   25.1GiB   92.5%  12.6GiB
  Equilibrium              2    15.1s   35.4%   7.57s   11.8GiB   45.0%  5.88GiB          Equilibrium              2    15.0s   33.9%   7.51s   12.0GiB   44.3%  6.01GiB
    TEQUILA                2    15.1s   35.4%   7.57s   11.8GiB   45.0%  5.88GiB            TEQUILA                2    15.0s   33.9%   7.50s   12.0GiB   44.3%  6.01GiB
  StationaryPlasma         2    11.2s   26.1%   5.59s   6.47GiB   24.8%  3.24GiB          StationaryPlasma         2    11.9s   26.8%   5.95s   6.76GiB   24.9%  3.38GiB
    Equilibrium            4    6.67s   15.6%   1.67s   2.56GiB    9.8%   654MiB            Equilibrium            4    6.33s   14.3%   1.58s   2.62GiB    9.7%   672MiB
      TEQUILA              4    6.66s   15.6%   1.66s   2.54GiB    9.7%   651MiB              TEQUILA              4    6.31s   14.2%   1.58s   2.61GiB    9.6%   668MiB
    CoreTransport          4    4.37s   10.2%   1.09s   3.86GiB   14.8%  0.97GiB            CoreTransport          4    5.43s   12.3%   1.36s   4.08GiB   15.0%  1.02GiB
      FluxMatcher          4    4.37s   10.2%   1.09s   3.86GiB   14.8%  0.97GiB              FluxMatcher          4    5.43s   12.3%   1.36s   4.08GiB   15.0%  1.02GiB
        Pedestal           4   31.8ms    0.1%  7.96ms   6.78MiB    0.0%  1.69MiB                Pedestal           4   9.14ms    0.0%  2.29ms   6.93MiB    0.0%  1.73MiB
          EPED             4   2.74ms    0.0%   685μs   2.26MiB    0.0%   579KiB                  EPED             4   2.64ms    0.0%   661μs   2.26MiB    0.0%   580KiB
        FluxCalcul...      4   3.93ms    0.0%   982μs   3.39MiB    0.0%   868KiB                FluxCalcul...      4   4.54ms    0.0%  1.14ms   3.53MiB    0.0%   904KiB
          TGLF             4   2.13ms    0.0%   532μs   2.93MiB    0.0%   750KiB                  TGLF             4   2.47ms    0.0%   618μs   3.00MiB    0.0%   768KiB
          Neoclass...      4   1.16ms    0.0%   289μs    410KiB    0.0%   102KiB                  Neoclass...      4   1.35ms    0.0%   338μs    475KiB    0.0%   119KiB
    Current                4    115ms    0.3%  28.7ms   36.9MiB    0.1%  9.23MiB            Current                4    111ms    0.3%  27.8ms   37.1MiB    0.1%  9.28MiB
      QED                  4    112ms    0.3%  28.1ms   35.8MiB    0.1%  8.94MiB              QED                  4    108ms    0.2%  27.0ms   35.9MiB    0.1%  8.96MiB
    HCD                    4   14.0ms    0.0%  3.51ms   11.0MiB    0.0%  2.76MiB            HCD                    4   16.3ms    0.0%  4.07ms   11.5MiB    0.0%  2.86MiB
      NeutralFueling       4   5.13ms    0.0%  1.28ms   3.43MiB    0.0%   879KiB              NeutralFueling       4   5.77ms    0.0%  1.44ms   3.56MiB    0.0%   910KiB
      SimpleEC             4   4.77ms    0.0%  1.19ms   5.85MiB    0.0%  1.46MiB              SimpleEC             4   4.69ms    0.0%  1.17ms   5.91MiB    0.0%  1.48MiB
      SimpleIC             4    994μs    0.0%   249μs    234KiB    0.0%  58.4KiB              SimpleIC             4   1.24ms    0.0%   309μs    276KiB    0.0%  69.0KiB
    Pedestal               4   7.82ms    0.0%  1.95ms   6.27MiB    0.0%  1.57MiB            Pedestal               4   8.30ms    0.0%  2.07ms   6.39MiB    0.0%  1.60MiB
      EPED                 4   2.55ms    0.0%   636μs   2.07MiB    0.0%   531KiB              EPED                 4   2.48ms    0.0%   619μs   2.08MiB    0.0%   533KiB
    Sawteeth               4    675μs    0.0%   169μs   44.1KiB    0.0%  11.0KiB            Sawteeth               4    814μs    0.0%   203μs   44.1KiB    0.0%  11.0KiB
  PFdesign                 4    7.82s   18.3%   1.96s   4.92GiB   18.8%  1.23GiB          PFdesign                 4    7.87s   17.8%   1.97s   5.15GiB   19.0%  1.29GiB
    PFactive               2    158ms    0.4%  79.1ms    425MiB    1.6%   213MiB            PFactive               2    174ms    0.4%  86.9ms    444MiB    1.6%   222MiB
  PlasmaLimits             2    3.82s    8.9%   1.91s   64.0MiB    0.2%  32.0MiB          PlasmaLimits             2    3.80s    8.6%   1.90s   66.4MiB    0.2%  33.2MiB
    VerticalStability      2    3.82s    8.9%   1.91s   63.1MiB    0.2%  31.6MiB            VerticalStability      2    3.80s    8.6%   1.90s   65.6MiB    0.2%  32.8MiB
    TroyonBetaNN           2    797μs    0.0%   398μs    393KiB    0.0%   197KiB            TroyonBetaNN           2    832μs    0.0%   416μs    400KiB    0.0%   200KiB
  Blanket                  2    1.16s    2.7%   580ms    294MiB    1.1%   147MiB          Blanket                  2    1.10s    2.5%   549ms    292MiB    1.1%   146MiB
  HFSsizing                2    1.05s    2.5%   526ms    509MiB    1.9%   255MiB          HFSsizing                2    2.01s    4.5%   1.01s    710MiB    2.6%   355MiB
    FluxSwing              2    583μs    0.0%   291μs   72.3KiB    0.0%  36.1KiB            Stresses               2    614μs    0.0%   307μs    335KiB    0.0%   168KiB
    Stresses               2    527μs    0.0%   263μs    302KiB    0.0%   151KiB            FluxSwing              2    578μs    0.0%   289μs   77.1KiB    0.0%  38.5KiB
  Neutronics               2    255ms    0.6%   127ms   31.6MiB    0.1%  15.8MiB          Neutronics               2    252ms    0.6%   126ms   32.9MiB    0.1%  16.5MiB
  CXbuild                  6    151ms    0.4%  25.2ms   77.6MiB    0.3%  12.9MiB          CXbuild                  6    172ms    0.4%  28.6ms   79.7MiB    0.3%  13.3MiB
  Divertors                2   5.79ms    0.0%  2.90ms   8.55MiB    0.0%  4.27MiB          Divertors                2   15.9ms    0.0%  7.94ms   8.87MiB    0.0%  4.43MiB
  PassiveStructures        2   5.12ms    0.0%  2.56ms   3.90MiB    0.0%  1.95MiB          PassiveStructures        2   10.6ms    0.0%  5.30ms   4.68MiB    0.0%  2.34MiB
  BalanceOfPlant           2   4.40ms    0.0%  2.20ms   3.58MiB    0.0%  1.79MiB          BalanceOfPlant           2   4.27ms    0.0%  2.14ms   3.59MiB    0.0%  1.80MiB
    ThermalPlant           2   3.54ms    0.0%  1.77ms   3.49MiB    0.0%  1.74MiB            ThermalPlant           2   3.38ms    0.0%  1.69ms   3.48MiB    0.0%  1.74MiB
    PowerNeeds             2    534μs    0.0%   267μs   61.9KiB    0.0%  31.0KiB            PowerNeeds             2    579μs    0.0%   289μs   77.4KiB    0.0%  38.7KiB
  Costing                  2   2.34ms    0.0%  1.17ms    800KiB    0.0%   400KiB          Costing                  2   2.40ms    0.0%  1.20ms    843KiB    0.0%   422KiB
    CostingARIES           2   1.66ms    0.0%   832μs    745KiB    0.0%   372KiB            CostingARIES           2   1.74ms    0.0%   868μs    781KiB    0.0%   391KiB
  LFSsizing                2    368μs    0.0%   184μs   78.0KiB    0.0%  39.0KiB          LFSsizing                2    464μs    0.0%   232μs   90.5KiB    0.0%  45.2KiB
init                       1    2.05s    4.8%   2.05s   1.98GiB    7.6%  1.98GiB        init                       1    2.03s    4.6%   2.03s   2.00GiB    7.4%  2.00GiB
  init_equilibrium         1    1.21s    2.8%   1.21s    276MiB    1.0%   276MiB          init_equilibrium         1    1.17s    2.6%   1.17s    297MiB    1.1%   297MiB
    Equilibrium            2    1.20s    2.8%   602ms    275MiB    1.0%   137MiB            Equilibrium            2    1.17s    2.6%   583ms    296MiB    1.1%   148MiB
      TEQUILA              2    1.20s    2.8%   598ms    267MiB    1.0%   134MiB              TEQUILA              2    1.16s    2.6%   579ms    288MiB    1.0%   144MiB
    init_core_prof...      1    356μs    0.0%   356μs    150KiB    0.0%   150KiB            init_core_prof...      1    420μs    0.0%   420μs    176KiB    0.0%   176KiB
  init_pulse_schedule      1    703ms    1.6%   703ms   1.64GiB    6.3%  1.64GiB          init_pulse_schedule      1    715ms    1.6%   715ms   1.64GiB    6.0%  1.64GiB
  init_currents            1   68.8ms    0.2%  68.8ms   26.1MiB    0.1%  26.1MiB          init_currents            1   69.6ms    0.2%  69.6ms   26.2MiB    0.1%  26.2MiB
    Current                2   68.6ms    0.2%  34.3ms   26.0MiB    0.1%  13.0MiB            Current                2   69.3ms    0.2%  34.7ms   26.1MiB    0.1%  13.0MiB
      QED                  2   67.5ms    0.2%  33.7ms   25.4MiB    0.1%  12.7MiB              QED                  2   67.9ms    0.2%  34.0ms   25.5MiB    0.1%  12.7MiB
  init_build               1   49.9ms    0.1%  49.9ms   26.7MiB    0.1%  26.7MiB          init_build               1   54.6ms    0.1%  54.6ms   27.5MiB    0.1%  27.5MiB
    CXbuild                2   49.4ms    0.1%  24.7ms   26.5MiB    0.1%  13.3MiB            CXbuild                2   53.8ms    0.1%  26.9ms   27.2MiB    0.1%  13.6MiB
  init_hcd                 1   10.9ms    0.0%  10.9ms   11.5MiB    0.0%  11.5MiB          init_hcd                 1   11.8ms    0.0%  11.8ms   11.8MiB    0.0%  11.8MiB
    HCD                    2   7.16ms    0.0%  3.58ms   5.89MiB    0.0%  2.94MiB            HCD                    2   8.47ms    0.0%  4.24ms   6.09MiB    0.0%  3.05MiB
      SimpleEC             2   2.40ms    0.0%  1.20ms   2.92MiB    0.0%  1.46MiB              NeutralFueling       2   2.72ms    0.0%  1.36ms   1.72MiB    0.0%   881KiB
      NeutralFueling       2   2.36ms    0.0%  1.18ms   1.68MiB    0.0%   862KiB              SimpleEC             2   2.35ms    0.0%  1.17ms   2.95MiB    0.0%  1.47MiB
      SimpleIC             2    505μs    0.0%   253μs    108KiB    0.0%  54.0KiB              SimpleIC             2    629μs    0.0%   314μs    127KiB    0.0%  63.5KiB
  PassiveStructures        2   5.18ms    0.0%  2.59ms   3.78MiB    0.0%  1.89MiB          PassiveStructures        2   10.0ms    0.0%  5.02ms   4.53MiB    0.0%  2.27MiB
  init_pf_active           1    904μs    0.0%   904μs    923KiB    0.0%   923KiB          init_pf_active           1   1.43ms    0.0%  1.43ms   0.99MiB    0.0%  0.99MiB
  init_core_profiles       1    432μs    0.0%   432μs    147KiB    0.0%   147KiB          init_core_profiles       1    469μs    0.0%   469μs    173KiB    0.0%   173KiB
  init_edge_profiles       1    143μs    0.0%   143μs   57.5KiB    0.0%  57.5KiB          init_edge_profiles       1    149μs    0.0%   149μs   63.9KiB    0.0%  63.9KiB
  init_requirements        1   25.2μs    0.0%  25.2μs   20.9KiB    0.0%  20.9KiB          init_requirements        1   26.2μs    0.0%  26.2μs   21.0KiB    0.0%  21.0KiB
  init_core_sources        1   21.0μs    0.0%  21.0μs   10.9KiB    0.0%  10.9KiB          init_core_sources        1   21.8μs    0.0%  21.8μs   11.4KiB    0.0%  11.4KiB
  init_bop                 1   9.58μs    0.0%  9.58μs      976B    0.0%     976B          init_bop                 1   7.54μs    0.0%  7.54μs      992B    0.0%     992B
  init_missing_fro...      1    750ns    0.0%   750ns      272B    0.0%     272B          init_missing_fro...      1    417ns    0.0%   417ns      272B    0.0%     272B
freeze                     1   92.1ms    0.2%  92.1ms   18.8MiB    0.1%  18.8MiB        freeze                     1   97.2ms    0.2%  97.2ms   23.9MiB    0.1%  23.9MiB
────────────────────────────────────────────────────────────────────────────────	────────────────────────────────────────────────────────────────────────────────

bclyons12 commented:
@mgyoo86 To be clear, this is outstanding work. That the performance is basically as good with the drastic improvements in compilation is remarkable. It would just be great to do even better on performance if we can.

- Add Preferences dependency
- Create @maybe_nospecializeinfer macro with runtime configuration
- Replace all @nospecializeinfer annotations with @maybe_nospecializeinfer (254 functions)
- Display setting during precompilation

Users can now toggle @nospecializeinfer via LocalPreferences.toml
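A hedged sketch of the toggle (the actual macro in the PR may differ; this would live inside the IMASdd module so Preferences can resolve the package UUID):

```julia
using Preferences

# Read once at precompilation time; changing LocalPreferences.toml
# triggers recompilation with the new setting.
const _USE_NOSPECIALIZEINFER = @load_preference("use_nospecializeinfer", true)

macro maybe_nospecializeinfer(ex)
    if _USE_NOSPECIALIZEINFER
        return esc(:(Base.@nospecializeinfer $ex))
    else
        return esc(ex)
    end
end
```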
mgyoo86 commented Nov 13, 2025

@bclyons12

Thanks for the feedback!
As you suggested, I've implemented a new @maybe_nospecializeinfer macro that can be controlled via LocalPreferences.toml.
This allows users to toggle @nospecializeinfer behavior without code changes.

The following is an example of LocalPreferences.toml:

[IMASdd]
use_nospecializeinfer = false  # or true (default)

I'll take a closer look at ActorHFSsizing and ActorFluxMatcher as we discussed.

…eld indices

- Replace fieldnames() iteration with fieldcount() + numeric indices
- Reduces from 14 allocations (4.125 KiB) to ~0 allocations
- Works correctly with @nospecialize by avoiding symbolic field names
- Added @inbounds for bounds-check elimination
- Replace enumerate() with eachindex() + @inbounds
- Replace fieldnames() with hasfield()
- Optimize hasdata() to use numeric field indices
- Use tuple literals instead of arrays for membership tests
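A minimal sketch of the numeric field iteration pattern (the function name is illustrative, not from the PR):

```julia
# fieldnames(T) allocates a tuple of Symbols; fieldcount(T) plus integer
# indexing into isdefined/getfield avoids that entirely.
function count_set_fields(@nospecialize(x))
    T = typeof(x)
    n = 0
    @inbounds for k in 1:fieldcount(T)
        isdefined(x, k) && (n += 1)
    end
    return n
end

count_set_fields(1 + 2im)  # == 2
```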
Optimize setproperty! by reusing already-computed coords variable
instead of calling coordinates(ids, field) multiple times:
- Line 772: Use inline generator with coords reuse
- Line 774: Reuse coords instead of recalling coordinates()

Eliminates 2 redundant function calls per setproperty! invocation.
Uses idiomatic inline generator with any() for short-circuit benefit.

Optimize name_2_index() by caching inverted Dict per IDS type:
- Add global cache _NAME_2_IDX_CACHE using IdDict
- Implement lazy initialization with get!() for thread-safety
- First call per type: inverts idx_2_name and caches result
- Subsequent calls: returns cached Dict (zero-allocation)

Performance improvement:
- Before: ~22μs, 5 allocations per call
- After: ~2ns, 0 allocations (after first call per type)

Related optimization in fix/nospecialize branch.
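A hedged sketch of the cache (idx_2_name and the container shapes follow the commit message; the exact code in the PR may differ):

```julia
const _NAME_2_IDX_CACHE = IdDict{Type,Dict{Symbol,Int}}()

function name_2_index_sketch(@nospecialize(ids))
    T = typeof(ids)
    return get!(_NAME_2_IDX_CACHE, T) do
        # slow path, runs once per IDS type: invert idx_2_name(T),
        # assumed here to be an indexable collection of Symbols
        Dict{Symbol,Int}(name => idx for (idx, name) in pairs(idx_2_name(T)))
    end
end
```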
Simplify in_expression() by removing redundant key check:
- Remove manual `if t_id ∉ keys(_in_expression)` check
- Use get!() directly for atomic check-and-create operation
- get!() already handles check atomically, making manual check redundant

Performance improvement:
- Eliminates one dict lookup (haskey check)
- Cleaner code with same thread-safety guarantees

Related to fix/nospecialize optimization work.
…debug code

Optimize two functions with numeric field iteration pattern:
- Stack-based fill function: Replace fieldnames() with fieldcount/fieldname
- Base.empty!(): Use numeric indices for field iteration
- Add @inbounds for bounds check elimination

Remove debug statements:
- Clean up Main.@infiltrate calls from resize!() function

Performance improvement:
- Eliminates allocations from fieldnames() vector creation
- Enables bounds check elimination with @inbounds
- Consistent with other @nospecialize optimizations

Related to fix/nospecialize optimization work.
mgyoo86 commented Dec 4, 2025

@bclyons12 @fredrikekre
The following are additional micro-optimizations that further improve performance in areas Brendan pointed out, such as ActorFluxMatcher.

Additional Performance Optimizations (6 commits)

Summary

Zero-allocation improvements across hot paths in @nospecialize functions.

Key Changes

Loop Optimization (data.jl, expressions.jl, f2.jl, findall.jl, io.jl)

  • Replace for (k, v) in enumerate(arr) → for k in eachindex(arr); v = @inbounds arr[k] (see the sketch after this list)
  • Eliminates tuple allocations in tight loops

Field Iteration (data.jl, expressions.jl)

  • Replace fieldnames(typeof(x)) iteration → numeric fieldcount/fieldname indices
  • hasdata(): Use early-return loop instead of generator with any()

Field Checks (identifiers.jl)

  • Replace :field in fieldnames(T) → hasfield(T, :field)
  • Avoids tuple allocation on every check

Caching (identifiers.jl)

  • Add lazy auto-inversion cache for name_2_index()
  • One-time Dict creation per IDS type

Thread-safe Access (expressions.jl)

  • Optimize in_expression() with direct get!() usage
  • Remove redundant key existence check

Misc (math.jl, data.jl)

  • Use tuple literals (:a, :b) instead of vectors [:a, :b] in checks
  • Eliminate redundant coordinates() calls in setproperty!
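Minimal before/after sketches of the loop, field-check, and membership patterns above (all names illustrative):

```julia
arr = rand(3)
for k in eachindex(arr)
    v = @inbounds arr[k]     # replaces: for (k, v) in enumerate(arr)
    # ... use k and v ...
end

x = 1 + 2im
hasfield(typeof(x), :re)     # replaces: :re in fieldnames(typeof(x)) (tuple alloc)

field = :a
field in (:a, :b)            # replaces: field in [:a, :b] (vector alloc)
```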

Files Changed

data.jl, expressions.jl, identifiers.jl, io.jl, f2.jl, findall.jl, math.jl

Add diagnose_shared_objects() to detect unintended array sharing in IDS trees.
This helps identify cases where `a = b` was used instead of `a .= b`.

Features:
- Stack-based tree traversal following isequal pattern
- SharedObjectReport with indexed access (report[1].id, report[1].paths)
- Cross-IDS sharing detection (e.g., core_profiles ↔ core_sources)
- REPL display with chronological path ordering
…ility

Replace @maybe_nospecializeinfer with @nospecializeinfer since the macro
wrapper is not defined on master branch.
- Add runtests_f2.jl with 81 test cases covering:
  - f2p, f2i, f2u path conversion functions
  - i2p, p2i, i2u string parsing functions
  - location, ulocation path accessors
  - fs2u type-based lookup
  - f2p_name IDS naming
  - Round-trip consistency validation
  - Edge cases (standalone IDS, utime flag, deeply nested)

- Move f2-related tests from runtests_ids.jl to dedicated file
- Include runtests_f2.jl in main test runner
- Add _F2P_SKELETON_CACHE with concrete types for type-stable lookup
- Split _f2p_skeleton into fast path (cache hit) and slow path (@noinline)
- Pre-compute and cache result_size to avoid redundant count() calls
- Use Vector{String} in cache for concrete value type
- Remove String() conversion in loop (already cached as String)
- Add internal @_typed_cache macro with proper hygiene (gensym, esc)
- Use helper function pattern to solve return-bypass caching bug
- Apply macro to f2.jl: fs2u, _f2p_skeleton, f2p_name(Type)
- Rename cache constants with _TCACHE_ prefix for consistency
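A hedged sketch of the fast-path/slow-path split that @_typed_cache expands to (constant naming follows the _TCACHE_ convention; the computation body is a stand-in):

```julia
const _TCACHE_FS2U = Dict{Type,String}()

function fs2u_cached(T::Type)
    s = get(_TCACHE_FS2U, T, nothing)   # single lookup on the hot path
    s === nothing || return s
    return _fs2u_slow(T)
end

# @noinline keeps the rarely-taken slow path out of the caller's code.
@noinline function _fs2u_slow(T::Type)
    s = lowercase(String(nameof(T)))    # stand-in for the real computation
    _TCACHE_FS2U[T] = s
    return s
end
```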
- Use Base.get single lookup instead of haskey+getindex pattern
- Remove type-based name computation (~15 lines)
- Reuse cached skeleton from _f2p_skeleton(T)
- Eliminates redundant replace/eachsplit/count calls per f2i invocation
- Add ::String return type to f2p_name(ids) for better type inference
- Refactor f2p_name(ids::IDS, ::IDS) to reuse cached f2p_name(Type)
- Remove redundant typename_str computation (now uses cache)
- Add @nospecialize to entry point f2p_name(ids) for compile time
- i2u fast path: avoid String(loc) allocation when loc is already String
- ulocation/location(IDSvector): use fs2u_base cache instead of SubString
- Add int_to_string() cache for small integers (0-10) used in f2p and f2p_name
- Add fs2u_base typed cache for IDSvector base paths (0 allocs)
- Expand benchmark_f2.jl with comprehensive allocation tests

Results:
- f2p: 10→8 allocs (simple), 14→10 allocs (nested)
- f2p_name(IDSvectorElement): 4→2 allocs
- ulocation/location(IDSvector): 0 allocs (cached)
- i2u(String, no brackets): 0 allocs
Changed eltype(ids) to typeof(ids) in ulocation/location(IDSvector) functions.
With @nospecialize, eltype(ids) returns Any causing 3 allocations and boxing.
Using typeof(ids) and extracting element type inside fs2u_base ensures type
stability and 0 allocations.

Result: ulocation/location(IDSvector) now ~27ns with 0 allocations
(previously ~600ns with 3 allocations/128 bytes)
Replace zeros(Int, N) with zeros!(pool, Int, N) using @with_pool macro.
This eliminates the small temporary array allocation (N typically 1-3)
that occurred on every f2p/f2i call by reusing pooled memory.
…e allocation

Under @nospecializeinfer, the closure created by `lock() do` can cause
boxing due to captured variables (ids, field, func, throw_on_missing, etc).
Using explicit try/finally eliminates the closure and reduces allocation.

Changes:
- exec_expression_with_ancestor_args (4-arg version): cache lock, use try/finally
- onetime expression path: same pattern for consistency
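A minimal sketch of the closure elimination (lock name and captured locals are stand-ins):

```julia
const _expr_lock = ReentrantLock()
ids, field = 1, :time   # stand-ins for the locals the old closure captured

# Before: lock(_expr_lock) do ... end -- the do-block creates a closure over
# ids/field/etc., which can box them under @nospecializeinfer.
# After: explicit try/finally, no closure, no boxing.
lock(_expr_lock)
try
    result = (ids, field)   # stand-in for the expression evaluation
finally
    unlock(_expr_lock)
end
```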
mgyoo86 commented Dec 18, 2025

@fredrikekre @bclyons12
I've updated f2.jl to reduce allocations and improve performance.

Summary: Introduced type-based caching infrastructure to reduce allocations and improve performance in path/location functions. Also applied various micro-optimizations (array pooling, string caching, closure elimination).

Key Changes

1. Type-based caching (@_typed_cache macro)

  • Cached: _f2p_skeleton, fs2u, f2p_name, fs2u_base
  • Thread-safe via ThreadSafeDict

2. Temp array reuse (f2p, f2i)

  • AdaptiveArrayPools: zeros!(pool, Int, N)

3. Other minor micro-optimizations
