mgyoo86 commented Oct 30, 2025

Status

🚧 Work in Progress - Testing and feedback welcome

Summary

This PR adds and fixes @nospecialize annotations on frequently called utility functions to prevent excessive method specialization and reduce compilation overhead.

Major Changes

Core Functions Modified

  • ulocation(), f2u() - Added @nospecialize for IDS parameters
  • info(), coordinates() - Prevented specialization on IDS types
  • diff(), merge!(), freeze!() - Refactored type handling
  • dict2imas() - Fixed specialization in IO operations

Known Issues

⚠️ Specialization still occurs when calling high-level functions like hdf2imas() or through certain code paths. Root cause investigation ongoing.

Testing

Test File

See tmp_test/test_ulocation.jl for specialization behavior tests.

Required Temporary Modification

To test properly, you need to modify dd.jl so that setfield! calls use getfield(ids, :field) instead of ids.field property access (which dispatches through getproperty):

macOS:

sed -E -i '' 's/setfield!\(ids\.([^,]*)/setfield!(getfield(ids, :\1)/g' src/dd.jl

Linux (GNU sed, which takes no backup-extension argument):

sed -E -i 's/setfield!\(ids\.([^,]*)/setfield!(getfield(ids, :\1)/g' src/dd.jl

This replaces the ids.field property access inside setfield! calls with getfield(ids, :field), bypassing the getproperty implementations during testing.

Next Steps

  • Identify remaining specialization sources in high-level functions
  • Validate performance improvements with benchmarks
  • Determine if dd.jl modifications should be permanent or test-only
  • Add comprehensive test coverage

    Changed from the @nospecialize(x::T) where {T<:Type} pattern to
    @nospecialize(x::Type) to properly prevent type specialization.

    This ensures Julia doesn't compile separate versions for each
    concrete type, reducing compilation overhead.
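A minimal sketch of the signature change, using an illustrative function name (the PR applies this to ulocation, f2u, and related functions):

```julia
# Before: the method type parameter reintroduces specialization, so Julia
# still compiles one method instance per concrete type despite @nospecialize.
fs2u_before(@nospecialize(x::T)) where {T<:Type} = string(x)

# After: no type parameter; a single compiled method serves all Type inputs.
fs2u_after(@nospecialize(x::Type)) = string(x)
```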
Add @assert type checks to prevent merging/freezing different IDS types.
This fixes type safety issues introduced by commit 74ded67 where 'where T'
constraints were removed.

Functions updated:
- merge!(::IDS, ::IDS) - assert same type
- merge!(::IDSvector, ::IDSvector) - assert same eltype
- freeze!(::IDS, ::IDS) - assert same type
- freeze!(::IDSvector, ::IDSvector) - assert same eltype

These assertions prevent runtime errors from field mismatches when
operating on incompatible types.
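A minimal sketch of the guard pattern these assertions follow (IDS types come from IMASdd; the function name and message text are illustrative):

```julia
# Without `where T` constraints, dispatch no longer guarantees matching
# types, so the check moves into the body:
function merge_guard!(@nospecialize(target), @nospecialize(source))
    @assert typeof(target) === typeof(source) "merge!: IDS types must match"
    # ... field-by-field merge ...
    return target
end

# The IDSvector variants assert eltype(target) === eltype(source) instead.
```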
Replace error() with a Dict return when comparing different types.
diff() now returns a Dict with a "type_mismatch" key instead of
throwing an error, making it more flexible and non-disruptive.

Example:
  diff(dd, dd.equilibrium)
  => Dict("type_mismatch" => "dd{Float64} != equilibrium{Float64}")
Add @nospecialize annotations to info and coordinates to reduce
compilation overhead and binary size.
Added @nospecialize annotations to location and conversion functions
in f2.jl to prevent excessive method specialization:
- ulocation(ids, field) and variants
- f2u(ids) - converts IDS to universal location string
- fs2u(ids_type) - converts IDS type to universal location

Note: ulocation specialization still observed during constructor
execution despite @nospecialize - investigation ongoing into when
and why specialization occurs in the call chain.
Add dd_nospecialize() helper function that uses Base.invokelatest to
prevent compiler from analyzing dd() internals during type inference.
Replace all dd() default arguments in I/O functions with dd_nospecialize()
to significantly reduce allocation and compilation time.

The invokelatest barrier prevents the compiler from specializing on the
complex 180k-line dd struct generation, while ::dd{Float64} type assertion
ensures proper type propagation without additional inference overhead.

Affected functions:
- json2imas, jstr2imas
- hdf2imas (default arg and internal call)
- h5i2imas

Performance impact: ~20M fewer allocations in hdf2imas calls.
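A minimal sketch of the barrier, assuming IMASdd's generated dd type (the exact helper in the PR may differ):

```julia
# Base.invokelatest acts as an inference barrier: the compiler treats the
# call as opaque rather than analyzing the ~180k-line generated dd() method,
# while the type assertion restores a concrete return type for callers.
dd_nospecialize() = Base.invokelatest(dd)::dd{Float64}

# Assumed use as a default argument in an I/O entry point:
function hdf2imas_sketch(filename::AbstractString, ids::dd{Float64}=dd_nospecialize())
    # ... populate `ids` from the HDF5 file ...
    return ids
end
```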
Apply @nospecialize to fieldtype, getproperty, parent, name, goto,
getindex, and time-related functions to reduce method specialization.
Temporary test file for investigating method specialization behavior.
mgyoo86 added the WIP (Work in Progress) label Oct 30, 2025
mgyoo86 changed the title from "# [WIP] Reduce compilation overhead with @nospecialize" to "[WIP] Reduce compilation overhead with @nospecialize" Oct 30, 2025
codecov bot commented Oct 30, 2025

Codecov Report

❌ Patch coverage is 63.47656% with 187 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.94%. Comparing base (01f7aff) to head (7b16c97).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/data.jl | 62.50% | 63 Missing ⚠️ |
| src/identifiers.jl | 0.00% | 31 Missing ⚠️ |
| src/time.jl | 62.12% | 25 Missing ⚠️ |
| src/show.jl | 22.72% | 17 Missing ⚠️ |
| src/diagnostics.jl | 76.81% | 16 Missing ⚠️ |
| src/io.jl | 68.88% | 14 Missing ⚠️ |
| src/f2.jl | 82.69% | 9 Missing ⚠️ |
| src/expressions.jl | 83.33% | 6 Missing ⚠️ |
| src/macros.jl | 80.00% | 3 Missing ⚠️ |
| src/math.jl | 0.00% | 2 Missing ⚠️ |
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master      #83      +/-   ##
==========================================
+ Coverage   43.76%   43.94%   +0.18%     
==========================================
  Files          13       15       +2     
  Lines       31243    31398     +155     
==========================================
+ Hits        13672    13797     +125     
- Misses      17571    17601      +30     

Remove type parameters from @nospecialize signatures and extract types inside function body using eltype() to properly prevent specialization.
Remove type parameters from @nospecialize signatures in ==, isequal, isapprox, and _extract_comparable_fields to properly prevent specialization on Union types.

Also update docstrings for resize!, diff, and time_groups functions.
…me bottleneck

Added @nospecializeinfer to 165+ functions across 9 core modules, drastically
reducing type inference overhead and specialization explosion.

Impact:
- Eliminates ~50% of top-level type inference allocations
- Prevents Union type specialization combinatorial explosion
- Dramatically reduces first-time function compilation overhead
- Measured: compilation's share of hdf2imas execution time drops from 98.71% to negligible levels

Changes:
- Added `using Base: @nospecializeinfer` import
- Applied to functions with @nospecialize parameters in:
  cocos (6), data (25), expressions (18), f2 (24), identifiers (15),
  io (31), show (16), time (30)

Technical rationale:
@nospecializeinfer prevents both specialization AND type inference propagation.
Critical for Union types like Union{IDS,IDSvector,Vector{IDS}} which trigger
combinatorial method generation and expensive typeinf_ext_toplevel calls.
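A minimal sketch of the annotation pattern applied across these modules (requires Julia >= 1.10; the function and Union below are illustrative):

```julia
using Base: @nospecializeinfer

@nospecializeinfer function summarize(@nospecialize(x::Union{AbstractVector,AbstractDict}))
    # Inference stops at the signature: the body is inferred with x::Any,
    # so no per-concrete-type method instances are generated for this Union.
    return string(typeof(x), " with ", length(x), " entries")
end

summarize([1, 2, 3])        # one compiled method...
summarize(Dict(:a => 1))    # ...reused for every concrete input type
```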
This partially reverts the previous commit which added @nospecializeinfer
to almost every function. While @nospecializeinfer improves compilation time,
it prevents type inference for certain key functions that are essential for
nested calls to return concrete types.

Functions affected:
- cocos_out, cocos_transform, transform_cocos_* (src/cocos.jl)
- concrete_fieldtype_typeof, eltype_concrete_fieldtype_typeof, Base.getproperty (src/data.jl)

These functions must infer concrete return types for nested IMASdd function
calls to work correctly, as tested in test/runtests_concrete.jl.

Result: All tests now pass again, including concrete type inference tests.
orso82 commented Nov 1, 2025

great find with the '@nospecializeinfer' macro !!

Due to @nospecialize/@nospecializeinfer constraints, type parameters
cannot be used to guarantee matching types at dispatch time.

Solution:
- Add hot path methods for identical types (Int32/64, Float32/64, UInt64, Bool)
- These route to __convert_same_real_type! helper for fast path
- Generic Real->Real method now performs runtime type checking
- Convert to target type when needed (e.g., Float64 → Measurement{Float64})

Changes:
- Check COCOS conversion consistently in all cases (previously assumed to always be needed)
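A hedged sketch of the dispatch layout described above (__convert_same_real_type! is named in the commit; the outer function and bodies are illustrative stand-ins):

```julia
__convert_same_real_type!(dst, src) = copyto!(dst, src)  # fast path, same eltype

# Hot path methods for identical element types, generated per type:
for T in (Int32, Int64, Float32, Float64, UInt64, Bool)
    @eval assign_data!(dst::Vector{$T}, src::Vector{$T}) =
        __convert_same_real_type!(dst, src)
end

# Generic fallback: a runtime check replaces the `where {T}` dispatch guarantee.
function assign_data!(@nospecialize(dst::Vector), @nospecialize(src::Vector))
    if eltype(dst) === eltype(src)
        return __convert_same_real_type!(dst, src)
    end
    return copyto!(dst, convert.(eltype(dst), src))  # e.g. Float64 -> Measurement{Float64}
end
```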
Separated DD from Union{DD, IDSraw, IDSvectorRawElement} into dedicated method.
Large Union (~130+ concrete subtypes) prevented compiler from inlining hot path.

DD now gets specialized method without @nospecialize for optimal performance.
Added guard condition (user_cocos != to_cocos) before calling cocos_out.
Since both default to 11, this avoids unnecessary function calls on the hot path.
Added @inline and ::Bool annotations to hasdata for better type inference.
Helps compiler optimize getproperty hot path by eliminating runtime type checks.
mgyoo86 commented Nov 11, 2025

Performance Improvements: @nospecializeinfer Optimization

This PR dramatically reduces compilation overhead by applying @nospecializeinfer to hot compilation paths, along with targeted optimizations in key areas. While further micro-benchmarking could identify additional sweet spots, the current improvements are substantial and production-ready. This comment summarizes the progress achieved so far.


Root Cause: Excessive Specializations

Previously, type inference penetrated deep into function call chains, ignoring @nospecialize annotations and generating tens of thousands of specialized function variants. This behavior deviated from the original design intent, and the unnecessary overhead of compiling these excessive specializations was the primary cause of slow FUSE startup times.

IMASdd Performance Comparison

Test Environment: Julia 1.12.1 | arm64-apple-darwin24.0.0 | Date: 2025-11-11

First Execution Performance (@time)

| Benchmark | Baseline (master) | This PR | Improvement |
|---|---|---|---|
| json2imas | 16.2 s (96.7M allocs, 4.69 GiB) | 3.9 s (14.6M allocs, 726 MiB) | 4.1x |
| hdf2imas | 36.4 s (267M allocs, 12.9 GiB) | 8.2 s (24.9M allocs, 1.20 GiB) | 4.4x |
| deepcopy | 18.9 s (215M allocs, 10.4 GiB) | 5.2 s (62.6M allocs, 3.06 GiB) | 3.6x |
| get_timeslice | 312 s (897M allocs, 43.2 GiB) | 3.6 s (23.5M allocs, 1.14 GiB) | 87x 🚀 |
| diff | 2754 s (885M allocs, 42.5 GiB) | 0.3 s (2.85M allocs, 142 MiB) | 9200x 🚀 |

Runtime Performance (@btime)

| Benchmark | Baseline (master) | This PR | Improvement |
|---|---|---|---|
| json2imas | 7.84 ms (72.3k allocs, 3.28 MiB) | 8.92 ms (84.1k allocs, 3.89 MiB) | 0.88x ⚠️ |
| hdf2imas | 53.9 ms (99.0k allocs, 3.75 MiB) | 55.3 ms (112k allocs, 4.40 MiB) | 0.97x |
| deepcopy | 290 µs (20.8k allocs, 837 KiB) | 284 µs (20.8k allocs, 837 KiB) | 1.02x |
| get_timeslice | 376 µs (22.1k allocs, 1.00 MiB) | 382 µs (22.1k allocs, 1.00 MiB) | 0.98x |
| diff | N/A | 14.3 ms (164k allocs, 8.3 MiB) | - |

Legend: 🚀 Dramatic improvement (>2x) | ✅ Improved (>1.05x) | ≈ Similar (0.95-1.05x) | ⚠️ Regression (<0.95x)


Impact on FUSE

Problem

FUSE's initial compilation triggered excessive type inference, leading to extreme memory consumption. This was particularly problematic on memory-constrained GitHub macOS runners (~4GB RAM), where test runs took 3.5-4 hours.

Solution

By reducing inference depth through @nospecializeinfer, the memory footprint was reduced by nearly 50%, enabling FUSE test runs to complete in about one hour on macOS runners (though performance varies with GC behavior and remains slower than on Linux runners).

Impacts on FUSE's CI

The following two CI configurations are compared:

Before: FUSE (master) + IMASdd (master) Link

After: FUSE (master) + IMASdd (this PR) Link

| Platform | Stage | Before (master) | After (this PR) | Improvement |
|---|---|---|---|---|
| macOS | Total CI time | 3h 49m | 1h 25m | 2.7x faster |
| macOS | julia-runtest only | 3h 24m | 1h 6m | 3.1x faster |
| Ubuntu | Total CI time | 1h 20m | 1h 3m | 1.3x faster |
| Ubuntu | julia-runtest only | 1h 5m | 49m | 1.3x faster |

Note: Total CI time includes environment setup, dependency installation, and test execution. The julia-runtest stage represents pure test execution time.


📊 Detailed FUSE Benchmark Results

1. Ubuntu Runner

| Testset in CI | FUSE (master) + IMASdd (master) | FUSE (master) + IMASdd (this PR) | Improvement |
|---|---|---|---|
| warmup_before_compile | 1043 s (86.9 GiB) | 620 s (49.6 GiB) | 1.7x |
| warmup_after_compile | 62.2 s (9.56 GiB) | 64.6 s (10.0 GiB) | 0.96x |
| MANTA | 57.6 s (11.6 GiB) | 60.9 s (12.2 GiB) | 0.95x |
| FPP | 43.1 s (6.8 GiB) | 42.7 s (7.1 GiB) | 1.01x |
| JET_HDB5 | 34.4 s (9.13 GiB) | 33.6 s (9.23 GiB) | 1.02x |
| EXCITE | 23.6 s (4.15 GiB) | 26.0 s (4.47 GiB) | 0.91x ⚠️ |
| D3D_Hmode | 23.4 s (2.53 GiB) | 22.7 s (2.57 GiB) | 1.03x |
| D3D_Lmode | 20.8 s (1.98 GiB) | 19.7 s (2.00 GiB) | 1.05x |
| FluxMatcher | 9.31 s (478 MiB) | 9.76 s (479 MiB) | 0.95x |

2. macOS Runner

| Testset in CI | FUSE (master) + IMASdd (master) | FUSE (master) + IMASdd (this PR) | Improvement |
|---|---|---|---|
| warmup_before_compile | 2550 s (93.6 GiB) | 608 s (56.6 GiB) | 4.2x |
| warmup_after_compile | 370 s (15.5 GiB) | 113 s (16.7 GiB) | 3.3x |
| MANTA | 257 s (15.7 GiB) | 128 s (16.5 GiB) | 2.0x |
| FPP | 231 s (9.87 GiB) | 144 s (10.8 GiB) | 1.6x |
| JET_HDB5 | 113 s (4.54 GiB) | 29.7 s (4.54 GiB) | 3.8x |
| EXCITE | 101 s (4.08 GiB) | 58.7 s (4.39 GiB) | 1.7x |
| D3D_Hmode | 87.6 s (2.61 GiB) | 42.9 s (2.65 GiB) | 2.0x |
| D3D_Lmode | 62.1 s (2.02 GiB) | 33.8 s (2.04 GiB) | 1.8x |
| FluxMatcher | 26.5 s (494 MiB) | 9.44 s (475 MiB) | 2.8x |

Key Benefits & Thoughts

Performance Characteristics

  • Compilation time: 2-10x reduction across benchmarks
  • Runtime performance: slight memory overhead with <5% performance variance in most cases (note: GC interference makes precise measurement challenging)

Strategy

  1. Prevent unnecessary inference: Block excessive specialization with @nospecializeinfer
  2. Profile hot paths: Apply targeted specialization only where profiling proves necessary
  3. Optimize for interactive use: Most FUSE users (especially beginners) will use pure package workflows without system images

System Image Generation

  • Current approach: Dramatically reduces compilation time during sysimage creation
  • Alternative: If full specialization is required, systematically remove @nospecializeinfer (e.g., with a command-line tool such as sed) before compilation to restore the original behavior

Call for Testing

While benchmark results and CI metrics show significant improvements, real-world validation is essential before merging. This PR needs more intensive testing with actual FUSE workflows to ensure production readiness.

@bclyons12 @orso82 Could you please test this PR in actual use cases? Any feedback would be greatly appreciated :)

bclyons12 commented:
Here's a comparison on Julia 1.11 of FUSE.timer for FUSE.warmup(dd). Left is master, right is this branch. It looks like there's a regression in ActorHFSsizing (roughly 1 s to 2 s) and ActorFluxMatcher (roughly 4 s to 5 s). The latter is a bit concerning as we typically consider it performance critical. Is there low-hanging fruit to improve this, or is it a price we'd have to pay for the faster compilation?

────────────────────────────────────────────────────────────────────────────────        ────────────────────────────────────────────────────────────────────────────────
                                       Time                    Allocations                                                     Time                    Allocations      
                              ───────────────────────   ────────────────────────                                      ───────────────────────   ────────────────────────
      Tot / % measured:            43.7s /  97.8%           27.2GiB /  95.9%                  Tot / % measured:            45.2s /  98.0%           28.3GiB /  96.0%    
										                                                                                        
Section               ncalls     time    %tot     avg     alloc    %tot      avg        Section               ncalls     time    %tot     avg     alloc    %tot      avg
────────────────────────────────────────────────────────────────────────────────        ────────────────────────────────────────────────────────────────────────────────
WholeFacility              2    40.6s   95.0%   20.3s   24.1GiB   92.4%  12.1GiB        WholeFacility              2    42.2s   95.2%   21.1s   25.1GiB   92.5%  12.6GiB
  Equilibrium              2    15.1s   35.4%   7.57s   11.8GiB   45.0%  5.88GiB          Equilibrium              2    15.0s   33.9%   7.51s   12.0GiB   44.3%  6.01GiB
    TEQUILA                2    15.1s   35.4%   7.57s   11.8GiB   45.0%  5.88GiB            TEQUILA                2    15.0s   33.9%   7.50s   12.0GiB   44.3%  6.01GiB
  StationaryPlasma         2    11.2s   26.1%   5.59s   6.47GiB   24.8%  3.24GiB          StationaryPlasma         2    11.9s   26.8%   5.95s   6.76GiB   24.9%  3.38GiB
    Equilibrium            4    6.67s   15.6%   1.67s   2.56GiB    9.8%   654MiB            Equilibrium            4    6.33s   14.3%   1.58s   2.62GiB    9.7%   672MiB
      TEQUILA              4    6.66s   15.6%   1.66s   2.54GiB    9.7%   651MiB              TEQUILA              4    6.31s   14.2%   1.58s   2.61GiB    9.6%   668MiB
    CoreTransport          4    4.37s   10.2%   1.09s   3.86GiB   14.8%  0.97GiB            CoreTransport          4    5.43s   12.3%   1.36s   4.08GiB   15.0%  1.02GiB
      FluxMatcher          4    4.37s   10.2%   1.09s   3.86GiB   14.8%  0.97GiB              FluxMatcher          4    5.43s   12.3%   1.36s   4.08GiB   15.0%  1.02GiB
        Pedestal           4   31.8ms    0.1%  7.96ms   6.78MiB    0.0%  1.69MiB                Pedestal           4   9.14ms    0.0%  2.29ms   6.93MiB    0.0%  1.73MiB
          EPED             4   2.74ms    0.0%   685μs   2.26MiB    0.0%   579KiB                  EPED             4   2.64ms    0.0%   661μs   2.26MiB    0.0%   580KiB
        FluxCalcul...      4   3.93ms    0.0%   982μs   3.39MiB    0.0%   868KiB                FluxCalcul...      4   4.54ms    0.0%  1.14ms   3.53MiB    0.0%   904KiB
          TGLF             4   2.13ms    0.0%   532μs   2.93MiB    0.0%   750KiB                  TGLF             4   2.47ms    0.0%   618μs   3.00MiB    0.0%   768KiB
          Neoclass...      4   1.16ms    0.0%   289μs    410KiB    0.0%   102KiB                  Neoclass...      4   1.35ms    0.0%   338μs    475KiB    0.0%   119KiB
    Current                4    115ms    0.3%  28.7ms   36.9MiB    0.1%  9.23MiB            Current                4    111ms    0.3%  27.8ms   37.1MiB    0.1%  9.28MiB
      QED                  4    112ms    0.3%  28.1ms   35.8MiB    0.1%  8.94MiB              QED                  4    108ms    0.2%  27.0ms   35.9MiB    0.1%  8.96MiB
    HCD                    4   14.0ms    0.0%  3.51ms   11.0MiB    0.0%  2.76MiB            HCD                    4   16.3ms    0.0%  4.07ms   11.5MiB    0.0%  2.86MiB
      NeutralFueling       4   5.13ms    0.0%  1.28ms   3.43MiB    0.0%   879KiB              NeutralFueling       4   5.77ms    0.0%  1.44ms   3.56MiB    0.0%   910KiB
      SimpleEC             4   4.77ms    0.0%  1.19ms   5.85MiB    0.0%  1.46MiB              SimpleEC             4   4.69ms    0.0%  1.17ms   5.91MiB    0.0%  1.48MiB
      SimpleIC             4    994μs    0.0%   249μs    234KiB    0.0%  58.4KiB              SimpleIC             4   1.24ms    0.0%   309μs    276KiB    0.0%  69.0KiB
    Pedestal               4   7.82ms    0.0%  1.95ms   6.27MiB    0.0%  1.57MiB            Pedestal               4   8.30ms    0.0%  2.07ms   6.39MiB    0.0%  1.60MiB
      EPED                 4   2.55ms    0.0%   636μs   2.07MiB    0.0%   531KiB              EPED                 4   2.48ms    0.0%   619μs   2.08MiB    0.0%   533KiB
    Sawteeth               4    675μs    0.0%   169μs   44.1KiB    0.0%  11.0KiB            Sawteeth               4    814μs    0.0%   203μs   44.1KiB    0.0%  11.0KiB
  PFdesign                 4    7.82s   18.3%   1.96s   4.92GiB   18.8%  1.23GiB          PFdesign                 4    7.87s   17.8%   1.97s   5.15GiB   19.0%  1.29GiB
    PFactive               2    158ms    0.4%  79.1ms    425MiB    1.6%   213MiB            PFactive               2    174ms    0.4%  86.9ms    444MiB    1.6%   222MiB
  PlasmaLimits             2    3.82s    8.9%   1.91s   64.0MiB    0.2%  32.0MiB          PlasmaLimits             2    3.80s    8.6%   1.90s   66.4MiB    0.2%  33.2MiB
    VerticalStability      2    3.82s    8.9%   1.91s   63.1MiB    0.2%  31.6MiB            VerticalStability      2    3.80s    8.6%   1.90s   65.6MiB    0.2%  32.8MiB
    TroyonBetaNN           2    797μs    0.0%   398μs    393KiB    0.0%   197KiB            TroyonBetaNN           2    832μs    0.0%   416μs    400KiB    0.0%   200KiB
  Blanket                  2    1.16s    2.7%   580ms    294MiB    1.1%   147MiB          Blanket                  2    1.10s    2.5%   549ms    292MiB    1.1%   146MiB
  HFSsizing                2    1.05s    2.5%   526ms    509MiB    1.9%   255MiB          HFSsizing                2    2.01s    4.5%   1.01s    710MiB    2.6%   355MiB
    FluxSwing              2    583μs    0.0%   291μs   72.3KiB    0.0%  36.1KiB            Stresses               2    614μs    0.0%   307μs    335KiB    0.0%   168KiB
    Stresses               2    527μs    0.0%   263μs    302KiB    0.0%   151KiB            FluxSwing              2    578μs    0.0%   289μs   77.1KiB    0.0%  38.5KiB
  Neutronics               2    255ms    0.6%   127ms   31.6MiB    0.1%  15.8MiB          Neutronics               2    252ms    0.6%   126ms   32.9MiB    0.1%  16.5MiB
  CXbuild                  6    151ms    0.4%  25.2ms   77.6MiB    0.3%  12.9MiB          CXbuild                  6    172ms    0.4%  28.6ms   79.7MiB    0.3%  13.3MiB
  Divertors                2   5.79ms    0.0%  2.90ms   8.55MiB    0.0%  4.27MiB          Divertors                2   15.9ms    0.0%  7.94ms   8.87MiB    0.0%  4.43MiB
  PassiveStructures        2   5.12ms    0.0%  2.56ms   3.90MiB    0.0%  1.95MiB          PassiveStructures        2   10.6ms    0.0%  5.30ms   4.68MiB    0.0%  2.34MiB
  BalanceOfPlant           2   4.40ms    0.0%  2.20ms   3.58MiB    0.0%  1.79MiB          BalanceOfPlant           2   4.27ms    0.0%  2.14ms   3.59MiB    0.0%  1.80MiB
    ThermalPlant           2   3.54ms    0.0%  1.77ms   3.49MiB    0.0%  1.74MiB            ThermalPlant           2   3.38ms    0.0%  1.69ms   3.48MiB    0.0%  1.74MiB
    PowerNeeds             2    534μs    0.0%   267μs   61.9KiB    0.0%  31.0KiB            PowerNeeds             2    579μs    0.0%   289μs   77.4KiB    0.0%  38.7KiB
  Costing                  2   2.34ms    0.0%  1.17ms    800KiB    0.0%   400KiB          Costing                  2   2.40ms    0.0%  1.20ms    843KiB    0.0%   422KiB
    CostingARIES           2   1.66ms    0.0%   832μs    745KiB    0.0%   372KiB            CostingARIES           2   1.74ms    0.0%   868μs    781KiB    0.0%   391KiB
  LFSsizing                2    368μs    0.0%   184μs   78.0KiB    0.0%  39.0KiB          LFSsizing                2    464μs    0.0%   232μs   90.5KiB    0.0%  45.2KiB
init                       1    2.05s    4.8%   2.05s   1.98GiB    7.6%  1.98GiB        init                       1    2.03s    4.6%   2.03s   2.00GiB    7.4%  2.00GiB
  init_equilibrium         1    1.21s    2.8%   1.21s    276MiB    1.0%   276MiB          init_equilibrium         1    1.17s    2.6%   1.17s    297MiB    1.1%   297MiB
    Equilibrium            2    1.20s    2.8%   602ms    275MiB    1.0%   137MiB            Equilibrium            2    1.17s    2.6%   583ms    296MiB    1.1%   148MiB
      TEQUILA              2    1.20s    2.8%   598ms    267MiB    1.0%   134MiB              TEQUILA              2    1.16s    2.6%   579ms    288MiB    1.0%   144MiB
    init_core_prof...      1    356μs    0.0%   356μs    150KiB    0.0%   150KiB            init_core_prof...      1    420μs    0.0%   420μs    176KiB    0.0%   176KiB
  init_pulse_schedule      1    703ms    1.6%   703ms   1.64GiB    6.3%  1.64GiB          init_pulse_schedule      1    715ms    1.6%   715ms   1.64GiB    6.0%  1.64GiB
  init_currents            1   68.8ms    0.2%  68.8ms   26.1MiB    0.1%  26.1MiB          init_currents            1   69.6ms    0.2%  69.6ms   26.2MiB    0.1%  26.2MiB
    Current                2   68.6ms    0.2%  34.3ms   26.0MiB    0.1%  13.0MiB            Current                2   69.3ms    0.2%  34.7ms   26.1MiB    0.1%  13.0MiB
      QED                  2   67.5ms    0.2%  33.7ms   25.4MiB    0.1%  12.7MiB              QED                  2   67.9ms    0.2%  34.0ms   25.5MiB    0.1%  12.7MiB
  init_build               1   49.9ms    0.1%  49.9ms   26.7MiB    0.1%  26.7MiB          init_build               1   54.6ms    0.1%  54.6ms   27.5MiB    0.1%  27.5MiB
    CXbuild                2   49.4ms    0.1%  24.7ms   26.5MiB    0.1%  13.3MiB            CXbuild                2   53.8ms    0.1%  26.9ms   27.2MiB    0.1%  13.6MiB
  init_hcd                 1   10.9ms    0.0%  10.9ms   11.5MiB    0.0%  11.5MiB          init_hcd                 1   11.8ms    0.0%  11.8ms   11.8MiB    0.0%  11.8MiB
    HCD                    2   7.16ms    0.0%  3.58ms   5.89MiB    0.0%  2.94MiB            HCD                    2   8.47ms    0.0%  4.24ms   6.09MiB    0.0%  3.05MiB
      SimpleEC             2   2.40ms    0.0%  1.20ms   2.92MiB    0.0%  1.46MiB              NeutralFueling       2   2.72ms    0.0%  1.36ms   1.72MiB    0.0%   881KiB
      NeutralFueling       2   2.36ms    0.0%  1.18ms   1.68MiB    0.0%   862KiB              SimpleEC             2   2.35ms    0.0%  1.17ms   2.95MiB    0.0%  1.47MiB
      SimpleIC             2    505μs    0.0%   253μs    108KiB    0.0%  54.0KiB              SimpleIC             2    629μs    0.0%   314μs    127KiB    0.0%  63.5KiB
  PassiveStructures        2   5.18ms    0.0%  2.59ms   3.78MiB    0.0%  1.89MiB          PassiveStructures        2   10.0ms    0.0%  5.02ms   4.53MiB    0.0%  2.27MiB
  init_pf_active           1    904μs    0.0%   904μs    923KiB    0.0%   923KiB          init_pf_active           1   1.43ms    0.0%  1.43ms   0.99MiB    0.0%  0.99MiB
  init_core_profiles       1    432μs    0.0%   432μs    147KiB    0.0%   147KiB          init_core_profiles       1    469μs    0.0%   469μs    173KiB    0.0%   173KiB
  init_edge_profiles       1    143μs    0.0%   143μs   57.5KiB    0.0%  57.5KiB          init_edge_profiles       1    149μs    0.0%   149μs   63.9KiB    0.0%  63.9KiB
  init_requirements        1   25.2μs    0.0%  25.2μs   20.9KiB    0.0%  20.9KiB          init_requirements        1   26.2μs    0.0%  26.2μs   21.0KiB    0.0%  21.0KiB
  init_core_sources        1   21.0μs    0.0%  21.0μs   10.9KiB    0.0%  10.9KiB          init_core_sources        1   21.8μs    0.0%  21.8μs   11.4KiB    0.0%  11.4KiB
  init_bop                 1   9.58μs    0.0%  9.58μs      976B    0.0%     976B          init_bop                 1   7.54μs    0.0%  7.54μs      992B    0.0%     992B
  init_missing_fro...      1    750ns    0.0%   750ns      272B    0.0%     272B          init_missing_fro...      1    417ns    0.0%   417ns      272B    0.0%     272B
freeze                     1   92.1ms    0.2%  92.1ms   18.8MiB    0.1%  18.8MiB        freeze                     1   97.2ms    0.2%  97.2ms   23.9MiB    0.1%  23.9MiB
────────────────────────────────────────────────────────────────────────────────	────────────────────────────────────────────────────────────────────────────────

bclyons12 commented:
@mgyoo86 To be clear, this is outstanding work. That the performance is basically as good with the drastic improvements in compilation is remarkable. It would just be great to do even better on performance if we can.

- Add Preferences dependency
- Create @maybe_nospecializeinfer macro with runtime configuration
- Replace all @nospecializeinfer annotations with @maybe_nospecializeinfer (254 functions)
- Display setting during precompilation

Users can now toggle @nospecializeinfer via LocalPreferences.toml
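A hedged sketch of the toggle (the actual macro in the PR may differ; this would live inside the IMASdd module so Preferences can resolve the package UUID):

```julia
using Preferences

# Read once at precompilation time; changing LocalPreferences.toml
# triggers recompilation with the new setting.
const _USE_NOSPECIALIZEINFER = @load_preference("use_nospecializeinfer", true)

macro maybe_nospecializeinfer(ex)
    if _USE_NOSPECIALIZEINFER
        return esc(:(Base.@nospecializeinfer $ex))
    else
        return esc(ex)
    end
end
```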
mgyoo86 commented Nov 13, 2025

@bclyons12

Thanks for the feedback!
As you suggested, I've implemented a new @maybe_nospecializeinfer macro that can be controlled via LocalPreferences.toml.
This allows users to toggle @nospecializeinfer behavior without code changes.

The following is an example of LocalPreferences.toml:

[IMASdd]
use_nospecializeinfer = false  # or true (default)

I'll take a closer look at ActorHFSsizing and ActorFluxMatcher as we discussed.

…eld indices

- Replace fieldnames() iteration with fieldcount() + numeric indices
- Reduces from 14 allocations (4.125 KiB) to ~0 allocations
- Works correctly with @nospecialize by avoiding symbolic field names
- Added @inbounds for bounds-check elimination
- Replace enumerate() with eachindex() + @inbounds
- Replace fieldnames() with hasfield()
- Optimize hasdata() to use numeric field indices
- Use tuple literals instead of arrays for membership tests
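A minimal sketch of the numeric field iteration pattern (the function name is illustrative, not from the PR):

```julia
# fieldnames(T) allocates a tuple of Symbols; fieldcount(T) plus integer
# indexing into isdefined/getfield avoids that entirely.
function count_set_fields(@nospecialize(x))
    T = typeof(x)
    n = 0
    @inbounds for k in 1:fieldcount(T)
        isdefined(x, k) && (n += 1)
    end
    return n
end

count_set_fields(1 + 2im)  # == 2
```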
Optimize setproperty! by reusing already-computed coords variable
instead of calling coordinates(ids, field) multiple times:
- Line 772: Use inline generator with coords reuse
- Line 774: Reuse coords instead of recalling coordinates()

Eliminates 2 redundant function calls per setproperty! invocation.
Uses idiomatic inline generator with any() for short-circuit benefit.

Optimize name_2_index() by caching inverted Dict per IDS type:
- Add global cache _NAME_2_IDX_CACHE using IdDict
- Implement lazy initialization with get!() for thread-safety
- First call per type: inverts idx_2_name and caches result
- Subsequent calls: returns cached Dict (zero-allocation)

Performance improvement:
- Before: ~22μs, 5 allocations per call
- After: ~2ns, 0 allocations (after first call per type)

Related optimization in fix/nospecialize branch.
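A hedged sketch of the cache (idx_2_name and the container shapes follow the commit message; the exact code in the PR may differ):

```julia
const _NAME_2_IDX_CACHE = IdDict{Type,Dict{Symbol,Int}}()

function name_2_index_sketch(@nospecialize(ids))
    T = typeof(ids)
    return get!(_NAME_2_IDX_CACHE, T) do
        # slow path, runs once per IDS type: invert idx_2_name(T),
        # assumed here to be an indexable collection of Symbols
        Dict{Symbol,Int}(name => idx for (idx, name) in pairs(idx_2_name(T)))
    end
end
```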
Simplify in_expression() by removing redundant key check:
- Remove manual `if t_id ∉ keys(_in_expression)` check
- Use get!() directly for atomic check-and-create operation
- get!() already handles check atomically, making manual check redundant

Performance improvement:
- Eliminates one dict lookup (haskey check)
- Cleaner code with same thread-safety guarantees

Related to fix/nospecialize optimization work.
…debug code

Optimize two functions with numeric field iteration pattern:
- Stack-based fill function: Replace fieldnames() with fieldcount/fieldname
- Base.empty!(): Use numeric indices for field iteration
- Add @inbounds for bounds check elimination

Remove debug statements:
- Clean up Main.@infiltrate calls from resize!() function

Performance improvement:
- Eliminates allocations from fieldnames() vector creation
- Enables bounds check elimination with @inbounds
- Consistent with other @nospecialize optimizations

Related to fix/nospecialize optimization work.
mgyoo86 commented Dec 4, 2025

@bclyons12 @fredrikekre
The following are additional micro-optimizations that further improve performance in areas Brendan pointed out, such as ActorFluxMatcher.

Additional Performance Optimizations (6 commits)

Summary

Zero-allocation improvements across hot paths in @nospecialize functions.

Key Changes

Loop Optimization (data.jl, expressions.jl, f2.jl, findall.jl, io.jl)

  • Replace for (k, v) in enumerate(arr) → for k in eachindex(arr); v = @inbounds arr[k] (see the sketch after this list)
  • Eliminates tuple allocations in tight loops

Field Iteration (data.jl, expressions.jl)

  • Replace fieldnames(typeof(x)) iteration → numeric fieldcount/fieldname indices
  • hasdata(): Use early-return loop instead of generator with any()

Field Checks (identifiers.jl)

  • Replace :field in fieldnames(T) → hasfield(T, :field)
  • Avoids tuple allocation on every check

Caching (identifiers.jl)

  • Add lazy auto-inversion cache for name_2_index()
  • One-time Dict creation per IDS type

Thread-safe Access (expressions.jl)

  • Optimize in_expression() with direct get!() usage
  • Remove redundant key existence check

Misc (math.jl, data.jl)

  • Use tuple literals (:a, :b) instead of vectors [:a, :b] in checks
  • Eliminate redundant coordinates() calls in setproperty!
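Minimal before/after sketches of the loop, field-check, and membership patterns above (all names illustrative):

```julia
arr = rand(3)
for k in eachindex(arr)
    v = @inbounds arr[k]     # replaces: for (k, v) in enumerate(arr)
    # ... use k and v ...
end

x = 1 + 2im
hasfield(typeof(x), :re)     # replaces: :re in fieldnames(typeof(x)) (tuple alloc)

field = :a
field in (:a, :b)            # replaces: field in [:a, :b] (vector alloc)
```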

Files Changed

data.jl, expressions.jl, identifiers.jl, io.jl, f2.jl, findall.jl, math.jl

Add diagnose_shared_objects() to detect unintended array sharing in IDS trees.
This helps identify cases where `a = b` was used instead of `a .= b`.

Features:
- Stack-based tree traversal following isequal pattern
- SharedObjectReport with indexed access (report[1].id, report[1].paths)
- Cross-IDS sharing detection (e.g., core_profiles ↔ core_sources)
- REPL display with chronological path ordering
…ility

Replace @maybe_nospecializeinfer with @nospecializeinfer since the macro
wrapper is not defined on master branch.
- Add runtests_f2.jl with 81 test cases covering:
  - f2p, f2i, f2u path conversion functions
  - i2p, p2i, i2u string parsing functions
  - location, ulocation path accessors
  - fs2u type-based lookup
  - f2p_name IDS naming
  - Round-trip consistency validation
  - Edge cases (standalone IDS, utime flag, deeply nested)

- Move f2-related tests from runtests_ids.jl to dedicated file
- Include runtests_f2.jl in main test runner
- Add _F2P_SKELETON_CACHE with concrete types for type-stable lookup
- Split _f2p_skeleton into fast path (cache hit) and slow path (@noinline)
- Pre-compute and cache result_size to avoid redundant count() calls
- Use Vector{String} in cache for concrete value type
- Remove String() conversion in loop (already cached as String)
- Add internal @_typed_cache macro with proper hygiene (gensym, esc)
- Use helper function pattern to solve return-bypass caching bug
- Apply macro to f2.jl: fs2u, _f2p_skeleton, f2p_name(Type)
- Rename cache constants with _TCACHE_ prefix for consistency
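A hedged sketch of the fast-path/slow-path split that @_typed_cache expands to (constant naming follows the _TCACHE_ convention; the computation body is a stand-in):

```julia
const _TCACHE_FS2U = Dict{Type,String}()

function fs2u_cached(T::Type)
    s = get(_TCACHE_FS2U, T, nothing)   # single lookup on the hot path
    s === nothing || return s
    return _fs2u_slow(T)
end

# @noinline keeps the rarely-taken slow path out of the caller's code.
@noinline function _fs2u_slow(T::Type)
    s = lowercase(String(nameof(T)))    # stand-in for the real computation
    _TCACHE_FS2U[T] = s
    return s
end
```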
- Use Base.get single lookup instead of haskey+getindex pattern
- Remove type-based name computation (~15 lines)
- Reuse cached skeleton from _f2p_skeleton(T)
- Eliminates redundant replace/eachsplit/count calls per f2i invocation
- Add ::String return type to f2p_name(ids) for better type inference
- Refactor f2p_name(ids::IDS, ::IDS) to reuse cached f2p_name(Type)
- Remove redundant typename_str computation (now uses cache)
- Add @nospecialize to entry point f2p_name(ids) for compile time
- i2u fast path: avoid String(loc) allocation when loc is already String
- ulocation/location(IDSvector): use fs2u_base cache instead of SubString
- Add int_to_string() cache for small integers (0-10) used in f2p and f2p_name
- Add fs2u_base typed cache for IDSvector base paths (0 allocs)
- Expand benchmark_f2.jl with comprehensive allocation tests

Results:
- f2p: 10→8 allocs (simple), 14→10 allocs (nested)
- f2p_name(IDSvectorElement): 4→2 allocs
- ulocation/location(IDSvector): 0 allocs (cached)
- i2u(String, no brackets): 0 allocs
Changed eltype(ids) to typeof(ids) in ulocation/location(IDSvector) functions.
With @nospecialize, eltype(ids) returns Any causing 3 allocations and boxing.
Using typeof(ids) and extracting element type inside fs2u_base ensures type
stability and 0 allocations.

Result: ulocation/location(IDSvector) now ~27ns with 0 allocations
(previously ~600ns with 3 allocations/128 bytes)
Replace zeros(Int, N) with zeros!(pool, Int, N) using @with_pool macro.
This eliminates the small temporary array allocation (N typically 1-3)
that occurred on every f2p/f2i call by reusing pooled memory.
…e allocation

Under @nospecializeinfer, the closure created by `lock() do` can cause
boxing due to captured variables (ids, field, func, throw_on_missing, etc).
Using explicit try/finally eliminates the closure and reduces allocation.

Changes:
- exec_expression_with_ancestor_args (4-arg version): cache lock, use try/finally
- onetime expression path: same pattern for consistency
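A minimal sketch of the closure elimination (lock name and captured locals are stand-ins):

```julia
const _expr_lock = ReentrantLock()
ids, field = 1, :time   # stand-ins for the locals the old closure captured

# Before: lock(_expr_lock) do ... end -- the do-block creates a closure over
# ids/field/etc., which can box them under @nospecializeinfer.
# After: explicit try/finally, no closure, no boxing.
lock(_expr_lock)
try
    result = (ids, field)   # stand-in for the expression evaluation
finally
    unlock(_expr_lock)
end
```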
mgyoo86 commented Dec 18, 2025

@fredrikekre @bclyons12
I've updated f2.jl to reduce allocations and improve performance.

Summary: Introduced type-based caching infrastructure to reduce allocations and improve performance in path/location functions. Also applied various micro-optimizations (array pooling, string caching, closure elimination).

Key Changes

1. Type-based caching (@_typed_cache macro)

  • Cached: _f2p_skeleton, fs2u, f2p_name, fs2u_base
  • Thread-safe via ThreadSafeDict

2. Temp array reuse (f2p, f2i)

  • AdaptiveArrayPools: zeros!(pool, Int, N)

3. Other minor micro-optimizations
