[WIP] Reduce compilation overhead with @nospecialize #83
Changed from @nospecialize(x::T) where {T<:Type} pattern to
@nospecialize(x::{<:Type}) to properly prevent type specialization.
This ensures Julia doesn't compile separate versions for each concrete
type, reducing compilation overhead.
Add @assert type checks to prevent merging/freezing different IDS types.
This fixes type safety issues introduced by commit 74ded67 where 'where T' constraints were removed.
Functions updated:
- merge!(::IDS, ::IDS) - assert same type
- merge!(::IDSvector, ::IDSvector) - assert same eltype
- freeze!(::IDS, ::IDS) - assert same type
- freeze!(::IDSvector, ::IDSvector) - assert same eltype
These assertions prevent runtime errors from field mismatches when operating on incompatible types.
Replace error() with Dict return when comparing different types.
Now diff() returns a dict with 'type_mismatch' key instead of
throwing an error, making it more flexible and non-disruptive.
Example:
diff(dd, dd.equilibrium)
=> Dict('type_mismatch' => 'dd{Float64} != equilibrium{Float64}')
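A minimal sketch of the "return a Dict instead of throwing" behavior described above. The function name `type_diff` and the message format are illustrative assumptions, not the package's actual `diff` implementation:

```julia
# Hypothetical helper (not IMASdd's diff): report a type mismatch as data
# instead of raising an error, so callers can inspect the result.
function type_diff(@nospecialize(a), @nospecialize(b))
    if typeof(a) !== typeof(b)
        return Dict("type_mismatch" => "$(typeof(a)) != $(typeof(b))")
    end
    # Same type: an empty Dict stands in for "no differences found".
    return Dict{String,String}()
end

mismatch = type_diff(1, 1.0)
same = type_diff(1, 2)
```

Returning a Dict keeps the call non-disruptive inside larger scripts, at the cost that callers must check for the `type_mismatch` key themselves.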
Add @nospecialize annotations to info and coordinates to reduce compilation overhead and binary size.
Added @nospecialize annotations to location and conversion functions in f2.jl to prevent excessive method specialization:
- ulocation(ids, field) and variants
- f2u(ids) - converts IDS to universal location string
- fs2u(ids_type) - converts IDS type to universal location
Note: ulocation specialization is still observed during constructor execution despite @nospecialize. Investigation is ongoing into when and why specialization occurs in the call chain.
Add dd_nospecialize() helper function that uses Base.invokelatest to
prevent compiler from analyzing dd() internals during type inference.
Replace all dd() default arguments in I/O functions with dd_nospecialize()
to significantly reduce allocation and compilation time.
The invokelatest barrier prevents the compiler from specializing on the
complex 180k-line dd struct generation, while ::dd{Float64} type assertion
ensures proper type propagation without additional inference overhead.
Affected functions:
- json2imas, jstr2imas
- hdf2imas (default arg and internal call)
- h5i2imas
Performance impact: ~20M fewer allocations in hdf2imas calls.
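The barrier pattern described in this commit can be sketched roughly as follows. `DD` and `load_values!` are hypothetical stand-ins for the generated `dd` struct and an I/O function; only `Base.invokelatest` and the trailing type assertion are taken from the commit message:

```julia
# Hypothetical stand-in for the large generated `dd` type.
struct DD{T<:Real}
    values::Vector{T}
end
DD() = DD{Float64}(Float64[])

# invokelatest makes a dynamic call, so inference does not look into DD();
# the type assertion gives callers a concrete return type again.
dd_nospecialize() = Base.invokelatest(DD)::DD{Float64}

# An I/O-style function that uses the barrier as its default argument.
function load_values!(data::Vector{Float64}, target::DD{Float64}=dd_nospecialize())
    append!(target.values, data)
    return target
end

result = load_values!([1.0, 2.0])
```

The trade-off is one dynamic dispatch per call to the helper, which is negligible next to constructing a large data structure.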
Apply @nospecialize to fieldtype, getproperty, parent, name, goto, getindex, and time-related functions to reduce method specialization.
Temporary test file for investigating method specialization behavior.
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #83 +/- ##
==========================================
+ Coverage 43.76% 43.94% +0.18%
==========================================
Files 13 15 +2
Lines 31243 31398 +155
==========================================
+ Hits 13672 13797 +125
- Misses 17571 17601 +30
Remove type parameters from @nospecialize signatures and extract types inside function body using eltype() to properly prevent specialization.
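As an illustration of this change, the sketch below drops the `where {T}` parameter from the signature and recovers the element type inside the body. The types `NodeA`/`NodeB` and the function `same_eltype` are hypothetical, chosen only to show the pattern:

```julia
abstract type AbstractNode end          # hypothetical stand-in for IDS
struct NodeA <: AbstractNode end
struct NodeB <: AbstractNode end

# Before (a `where {T}` parameter asks the compiler for one instance per T):
#   same_eltype(a::Vector{T}, b::Vector{T}) where {T<:AbstractNode} = true

# After: no type parameter in the signature; element types are obtained
# with eltype() and compared at run time inside the body.
function same_eltype(@nospecialize(a::Vector), @nospecialize(b::Vector))
    return eltype(a) === eltype(b)
end

matching = same_eltype(NodeA[], NodeA[])
differing = same_eltype(NodeA[], NodeB[])
```

Because dispatch no longer enforces matching types, a run-time check like this one (or an `@assert`) has to take over that role.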
Remove type parameters from @nospecialize signatures in ==, isequal, isapprox, and _extract_comparable_fields to properly prevent specialization on Union types. Also update docstrings for resize!, diff, and time_groups functions.
…me bottleneck
Added @nospecializeinfer to 165+ functions across 9 core modules, drastically
reducing type inference overhead and specialization explosion.
Impact:
- Eliminates ~50% of top-level type inference allocations
- Prevents Union type specialization combinatorial explosion
- Dramatically reduces first-time function compilation overhead
- Measured: hdf2imas compilation time drops from 98.71% to minimal levels
Changes:
- Added `using Base: @nospecializeinfer` import
- Applied to functions with @nospecialize parameters in:
cocos (6), data (25), expressions (18), f2 (24), identifiers (15),
io (31), show (16), time (30)
Technical rationale:
@nospecializeinfer prevents both specialization AND type inference propagation.
Critical for Union types like Union{IDS,IDSvector,Vector{IDS}} which trigger
combinatorial method generation and expensive typeinf_ext_toplevel calls.
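For readers unfamiliar with the macro, here is a small sketch of how `@nospecializeinfer` is combined with `@nospecialize`. The function `describe` is a made-up example; the macro requires Julia 1.10 or later:

```julia
using Base: @nospecializeinfer   # not exported, so it is imported explicitly

# Hypothetical helper: @nospecialize limits code generation per argument type,
# and @nospecializeinfer asks inference to also use the declared type only.
@nospecializeinfer function describe(@nospecialize(x))
    return string("value of type ", typeof(x))
end

a = describe(1)
b = describe("text")
```

The function still works for every argument type; what changes is how much work the compiler does per call site, which is why return types of such functions may be inferred less precisely.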
This partially reverts the previous commit which added @nospecializeinfer to almost every function. While @nospecializeinfer improves compilation time, it prevents type inference for certain key functions that are essential for nested calls to return concrete types.
Functions affected:
- cocos_out, cocos_transform, transform_cocos_* (src/cocos.jl)
- concrete_fieldtype_typeof, eltype_concrete_fieldtype_typeof, Base.getproperty (src/data.jl)
These functions must infer concrete return types for nested IMASdd function calls to work correctly, as tested in test/runtests_concrete.jl.
Result: All tests now pass again, including concrete type inference tests.
great find with the '@nospecializeinfer' macro !!
Due to @nospecialize/@nospecializeinfer constraints, type parameters
cannot be used to guarantee matching types at dispatch time.
Solution:
- Add hot path methods for identical types (Int32/64, Float32/64, UInt64, Bool)
- These route to __convert_same_real_type! helper for fast path
- Generic Real->Real method now performs runtime type checking
- Convert to target type when needed (e.g., Float64 → Measurement{Float64})
Changes:
- Apply the COCOS conversion check consistently in all cases (assumed always needed)
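The hot-path idea above might look roughly like the following. The names `store!` and `_store_same!` are hypothetical (the commit's helper is `__convert_same_real_type!`, whose signature is not shown here), and the storage target is simplified to a plain `Vector`:

```julia
# Fast-path helper, reached only when source and target types already match.
_store_same!(dest::Vector{T}, i::Int, x::T) where {T<:Real} = (dest[i] = x; dest)

# Hot-path methods for common identical types, generated in a loop.
for T in (Int32, Int64, Float32, Float64, UInt64, Bool)
    @eval store!(dest::Vector{$T}, i::Int, x::$T) = _store_same!(dest, i, x)
end

# Generic fallback: no type parameter ties `x` to `dest`, so the match is
# checked at run time and the value is converted when needed.
function store!(@nospecialize(dest::Vector), i::Int, @nospecialize(x::Real))
    T = eltype(dest)
    dest[i] = x isa T ? x : convert(T, x)
    return dest
end

same_type = store!([0.0], 1, 2.5)        # Float64 into Vector{Float64}: hot path
converted = store!([0.0], 1, 3)          # Int into Vector{Float64}: generic path
```

Dispatch picks the concrete methods whenever both types agree, so the run-time check is only paid on the mixed-type path.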
Separated DD from Union{DD, IDSraw, IDSvectorRawElement} into dedicated method.
Large Union (~130+ concrete subtypes) prevented compiler from inlining hot path.
DD now gets specialized method without @nospecialize for optimal performance.
Added guard condition (user_cocos != to_cocos) before calling cocos_out. Since both default to 11, this avoids unnecessary function calls on hot path.
Added @inline and ::Bool annotations to hasdata for better type inference. Helps compiler optimize getproperty hot path by eliminating runtime type checks.
Performance Improvements:
| Benchmark | Baseline (master) | This PR | Improvement |
|---|---|---|---|
| json2imas | 16.2 s (96.7M allocs, 4.69 GiB) | 3.9 s (14.6M allocs, 726 MiB) | 4.1x ✅ |
| hdf2imas | 36.4 s (267M allocs, 12.9 GiB) | 8.2 s (24.9M allocs, 1.20 GiB) | 4.4x ✅ |
| deepcopy | 18.9 s (215M allocs, 10.4 GiB) | 5.2 s (62.6M allocs, 3.06 GiB) | 3.6x ✅ |
| get_timeslice | 312 s (897M allocs, 43.2 GiB) | 3.6 s (23.5M allocs, 1.14 GiB) | 87x 🚀 |
| diff | 2754 s (885M allocs, 42.5 GiB) | 0.3 s (2.85M allocs, 142 MiB) | 9200x 🚀 |
Runtime Performance (@btime)
| Benchmark | Baseline (master) | This PR | Improvement |
|---|---|---|---|
| json2imas | 7.84 ms (72.3k allocs, 3.28 MiB) | 8.92 ms (84.1k allocs, 3.89 MiB) | 0.88x |
| hdf2imas | 53.9 ms (99.0k allocs, 3.75 MiB) | 55.3 ms (112k allocs, 4.40 MiB) | 0.97x ≈ |
| deepcopy | 290 µs (20.8k allocs, 837 KiB) | 284 µs (20.8k allocs, 837 KiB) | 1.02x ≈ |
| get_timeslice | 376 µs (22.1k allocs, 1.00 MiB) | 382 µs (22.1k allocs, 1.00 MiB) | 0.98x ≈ |
| diff | N/A | 14.3 ms (164k allocs, 8.3 MiB) | - |
Legend: 🚀 Dramatic improvement (>2x) | ✅ Improved (>1.05x) | ≈ Similar (0.95-1.05x)
Impact on FUSE
Problem
FUSE's initial compilation triggered excessive type inference, leading to extreme memory consumption. This was particularly problematic on memory-constrained GitHub macOS runners (~4GB RAM), where test runs took 3.5-4 hours.
Solution
By reducing inference depth through @nospecializeinfer, memory footprint was reduced by nearly 50%, enabling FUSE test runs to complete in about 1 hour on macOS runners (though performance varies with GC behavior, and remains slower than Linux runners).
Impacts on FUSE's CI
The following two CI combinations are compared:
Before: FUSE (master) + IMASdd (master) Link
After: FUSE (master) + IMASdd (this PR) Link
| Platform | Stage | Before (master) | After (this PR) | Improvement |
|---|---|---|---|---|
| macOS | Total CI time | 3h 49m | 1h 25m | 2.7x faster |
| | julia-runtest only | 3h 24m | 1h 6m | 3.1x faster |
| Ubuntu | Total CI time | 1h 20m | 1h 3m | 1.3x faster |
| | julia-runtest only | 1h 5m | 49m | 1.3x faster |
Note: Total CI time includes environment setup, dependency installation, and test execution. The julia-runtest stage represents pure test execution time.
📊 Detailed FUSE Benchmark Results
1. Ubuntu Runner
| Testset in CI | FUSE (master) + IMASdd (master) | FUSE (master) + IMASdd (This PR) | Improvement |
|---|---|---|---|
| warmup_before_compile | 1043 s (86.9 GiB) | 620 s (49.6 GiB) | 1.7x ✅ |
| warmup_after_compile | 62.2 s (9.56 GiB) | 64.6 s (10.0 GiB) | 0.96x ≈ |
| MANTA | 57.6 s (11.6 GiB) | 60.9 s (12.2 GiB) | 0.95x ≈ |
| FPP | 43.1 s (6.8 GiB) | 42.7 s (7.1 GiB) | 1.01x ≈ |
| JET_HDB5 | 34.4 s (9.13 GiB) | 33.6 s (9.23 GiB) | 1.02x ≈ |
| EXCITE | 23.6 s (4.15 GiB) | 26.0 s (4.47 GiB) | 0.91x |
| D3D_Hmode | 23.4 s (2.53 GiB) | 22.7 s (2.57 GiB) | 1.03x ≈ |
| D3D_Lmode | 20.8 s (1.98 GiB) | 19.7 s (2.00 GiB) | 1.05x ≈ |
| FluxMatcher | 9.31 s (478 MiB) | 9.76 s (479 MiB) | 0.95x ≈ |
2. macOS Runner
| Testset in CI | FUSE (master) + IMASdd (master) | FUSE (master) + IMASdd (This PR) | Improvement |
|---|---|---|---|
| warmup_before_compile | 2550 s (93.6 GiB) | 608 s (56.6 GiB) | 4.2x ✅ |
| warmup_after_compile | 370 s (15.5 GiB) | 113 s (16.7 GiB) | 3.3x ✅ |
| MANTA | 257 s (15.7 GiB) | 128 s (16.5 GiB) | 2.0x ✅ |
| FPP | 231 s (9.87 GiB) | 144 s (10.8 GiB) | 1.6x ✅ |
| JET_HDB5 | 113 s (4.54 GiB) | 29.7 s (4.54 GiB) | 3.8x ✅ |
| EXCITE | 101 s (4.08 GiB) | 58.7 s (4.39 GiB) | 1.7x ✅ |
| D3D_Hmode | 87.6 s (2.61 GiB) | 42.9 s (2.65 GiB) | 2.0x ✅ |
| D3D_Lmode | 62.1 s (2.02 GiB) | 33.8 s (2.04 GiB) | 1.8x ✅ |
| FluxMatcher | 26.5 s (494 MiB) | 9.44 s (475 MiB) | 2.8x ✅ |
Key Benefits & Thoughts
Performance Characteristics
Compilation Time: 2-10x reduction across benchmarks
Runtime Performance: Slight memory overhead with <5% performance variance in most cases (note: GC interference makes precise measurement challenging)
Strategy
- Prevent unnecessary inference: Block excessive specialization with @nospecializeinfer
- Profile hot paths: Apply targeted specialization only where profiling proves necessary
- Optimize for interactive use: Most FUSE users (especially beginners) will use pure package workflows, without system images
System Image Generation
- Current approach: Dramatically reduces compilation time during sysimage creation
- Alternative: If full specialization is required, systematically remove @nospecializeinfer (by simply executing command-line tools such as sed) before compilation to restore original behavior
Call for Testing
While benchmark results and CI metrics show significant improvements, real-world validation is essential before merging. This PR needs more intensive testing with actual FUSE workflows to ensure production readiness.
@bclyons12 @orso82 Could you please test this PR in actual use cases? Any feedback would be greatly appreciated :)
Here's a comparison on Julia 1.11 of
@mgyoo86 To be clear, this is outstanding work. That the performance is basically as good with the drastic improvements in compilation is remarkable. It would just be great to do even better on performance if we can.
- Add Preferences dependency
- Create @maybe_nospecializeinfer macro with runtime configuration
- Replace all @nospecializeinfer annotations (254 functions)
- Display setting during precompilation
Users can now toggle @nospecializeinfer via LocalPreferences.toml
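A toggleable wrapper macro of this kind could be sketched as below. This is an assumption-laden illustration: the constant `USE_NOSPECIALIZEINFER` stands in for a value the PR reads through Preferences, and the PR's real macro body is not shown in this conversation:

```julia
# Hypothetical stand-in for a setting loaded from LocalPreferences.toml.
const USE_NOSPECIALIZEINFER = true

# Wrap the definition in Base.@nospecializeinfer only when the flag is set;
# otherwise emit the definition unchanged.
macro maybe_nospecializeinfer(ex)
    if USE_NOSPECIALIZEINFER
        return esc(:(Base.@nospecializeinfer $ex))
    else
        return esc(ex)
    end
end

# Hypothetical function using the wrapper.
@maybe_nospecializeinfer function count_items(@nospecialize(x))
    return length(x)
end

n_items = count_items([1, 2, 3])
```

Since the flag is consulted at macro-expansion time, changing it only takes effect after the package is recompiled.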
Thanks for the feedback! The following is an example of
I'll take a closer look at
…eld indices
- Replace fieldnames() iteration with fieldcount() + numeric indices
- Reduces from 14 allocations (4.125 KiB) to ~0 allocations
- Works correctly with @nospecialize by avoiding symbolic field names
- Added @inbounds for bounds-check elimination
- Replace enumerate() with eachindex() + @inbounds
- Replace fieldnames() with hasfield()
- Optimize hasdata() to use numeric field indices
- Use tuple literals instead of arrays for membership tests
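The numeric-index iteration used in these two commits can be illustrated with a small sketch. `Profile` and `count_empty_fields` are hypothetical; only `fieldcount`, `getfield` with an integer index, and tuple-literal membership tests are the techniques named above:

```julia
struct Profile                 # hypothetical example type
    psi::Vector{Float64}
    rho::Vector{Float64}
    label::String
end

# Walk the fields by position with fieldcount/getfield rather than
# iterating over fieldnames(), so no collection of symbols is needed.
function count_empty_fields(@nospecialize(x))
    n = 0
    for i in 1:fieldcount(typeof(x))
        v = getfield(x, i)
        if v isa AbstractVector && isempty(v)
            n += 1
        end
    end
    return n
end

# Membership test against a tuple literal instead of an array literal.
is_reserved(field::Symbol) = field in (:psi, :rho)

n_empty = count_empty_fields(Profile([1.0], Float64[], "demo"))
```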
Optimize setproperty! by reusing already-computed coords variable instead of calling coordinates(ids, field) multiple times:
- Line 772: Use inline generator with coords reuse
- Line 774: Reuse coords instead of recalling coordinates()
Eliminates 2 redundant function calls per setproperty! invocation. Uses idiomatic inline generator with any() for short-circuit benefit.
fix/nospecialize
Optimize name_2_index() by caching inverted Dict per IDS type:
- Add global cache _NAME_2_IDX_CACHE using IdDict
- Implement lazy initialization with get!() for thread-safety
- First call per type: inverts idx_2_name and caches result
- Subsequent calls: returns cached Dict (zero-allocation)
Performance improvement:
- Before: ~22μs, 5 allocations per call
- After: ~2ns, 0 allocations (after first call per type)
Related optimization in fix/nospecialize branch.
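A rough sketch of a per-type cache of this shape follows. The `idx_2_name` method and the `Val{:demo}` key are made up for the example, and the explicit `ReentrantLock` is an added assumption for safe concurrent use rather than something stated in the commit message:

```julia
# Hypothetical per-type table of index => name.
idx_2_name(::Type{Val{:demo}}) = Dict(1 => :alpha, 2 => :beta)

const _NAME_2_IDX_CACHE = IdDict{Any,Dict{Symbol,Int}}()
const _NAME_2_IDX_LOCK = ReentrantLock()   # assumption: explicit lock for concurrent callers

function name_2_index(@nospecialize(T::Type))
    lock(_NAME_2_IDX_LOCK)
    try
        # get! builds the inverted Dict only on the first call for each type.
        return get!(_NAME_2_IDX_CACHE, T) do
            Dict{Symbol,Int}(name => idx for (idx, name) in idx_2_name(T))
        end
    finally
        unlock(_NAME_2_IDX_LOCK)
    end
end

first_lookup = name_2_index(Val{:demo})
second_lookup = name_2_index(Val{:demo})
```

The second call returns the very same Dict object, which is what removes the per-call inversion cost.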
Simplify in_expression() by removing redundant key check:
- Remove manual `if t_id ∉ keys(_in_expression)` check
- Use get!() directly for atomic check-and-create operation
- get!() already handles check atomically, making manual check redundant
Performance improvement:
- Eliminates one dict lookup (haskey check)
- Cleaner code with same thread-safety guarantees
Related to fix/nospecialize optimization work.
…debug code
Optimize two functions with numeric field iteration pattern:
- Stack-based fill function: Replace fieldnames() with fieldcount/fieldname
- Base.empty!(): Use numeric indices for field iteration
- Add @inbounds for bounds check elimination
Remove debug statements:
- Clean up Main.@infiltrate calls from resize!() function
Performance improvement:
- Eliminates allocations from fieldnames() vector creation
- Enables bounds check elimination with @inbounds
- Consistent with other @nospecialize optimizations
Related to fix/nospecialize optimization work.
@bclyons12 @fredrikekre
Additional Performance Optimizations (6 commits)
Summary
Zero-allocation improvements across hot paths in
Key Changes
Loop Optimization (
Field Iteration (
Field Checks (
Caching (
Thread-safe Access (
Misc (
Files Changed
Add diagnose_shared_objects() to detect unintended array sharing in IDS trees. This helps identify cases where `a = b` was used instead of `a .= b`.
Features:
- Stack-based tree traversal following isequal pattern
- SharedObjectReport with indexed access (report[1].id, report[1].paths)
- Cross-IDS sharing detection (e.g., core_profiles ↔ core_sources)
- REPL display with chronological path ordering
…ility Replace @maybe_nospecializeinfer with @nospecializeinfer since the macro wrapper is not defined on master branch.
- Add runtests_f2.jl with 81 test cases covering:
  - f2p, f2i, f2u path conversion functions
  - i2p, p2i, i2u string parsing functions
  - location, ulocation path accessors
  - fs2u type-based lookup
  - f2p_name IDS naming
  - Round-trip consistency validation
  - Edge cases (standalone IDS, utime flag, deeply nested)
- Move f2-related tests from runtests_ids.jl to dedicated file
- Include runtests_f2.jl in main test runner
- Add _F2P_SKELETON_CACHE with concrete types for type-stable lookup
- Split _f2p_skeleton into fast path (cache hit) and slow path (@noinline)
- Pre-compute and cache result_size to avoid redundant count() calls
- Use Vector{String} in cache for concrete value type
- Remove String() conversion in loop (already cached as String)

- Add internal @_typed_cache macro with proper hygiene (gensym, esc)
- Use helper function pattern to solve return-bypass caching bug
- Apply macro to f2.jl: fs2u, _f2p_skeleton, f2p_name(Type)
- Rename cache constants with _TCACHE_ prefix for consistency
- Use Base.get single lookup instead of haskey+getindex pattern

- Remove type-based name computation (~15 lines)
- Reuse cached skeleton from _f2p_skeleton(T)
- Eliminates redundant replace/eachsplit/count calls per f2i invocation

- Add ::String return type to f2p_name(ids) for better type inference
- Refactor f2p_name(ids::IDS, ::IDS) to reuse cached f2p_name(Type)
- Remove redundant typename_str computation (now uses cache)
- Add @nospecialize to entry point f2p_name(ids) for compile time
- i2u fast path: avoid String(loc) allocation when loc is already String
- ulocation/location(IDSvector): use fs2u_base cache instead of SubString
- Add int_to_string() cache for small integers (0-10) used in f2p and f2p_name
- Add fs2u_base typed cache for IDSvector base paths (0 allocs)
- Expand benchmark_f2.jl with comprehensive allocation tests
Results:
- f2p: 10→8 allocs (simple), 14→10 allocs (nested)
- f2p_name(IDSvectorElement): 4→2 allocs
- ulocation/location(IDSvector): 0 allocs (cached)
- i2u(String, no brackets): 0 allocs
Changed eltype(ids) to typeof(ids) in ulocation/location(IDSvector) functions. With @nospecialize, eltype(ids) returns Any causing 3 allocations and boxing. Using typeof(ids) and extracting element type inside fs2u_base ensures type stability and 0 allocations. Result: ulocation/location(IDSvector) now ~27ns with 0 allocations (previously ~600ns with 3 allocations/128 bytes)
Replace zeros(Int, N) with zeros!(pool, Int, N) using @with_pool macro. This eliminates the small temporary array allocation (N typically 1-3) that occurred on every f2p/f2i call by reusing pooled memory.
…e allocation
Under @nospecializeinfer, the closure created by `lock() do` can cause boxing due to captured variables (ids, field, func, throw_on_missing, etc). Using explicit try/finally eliminates the closure and reduces allocation.
Changes:
- exec_expression_with_ancestor_args (4-arg version): cache lock, use try/finally
- onetime expression path: same pattern for consistency
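The two locking styles contrasted in this commit are sketched below with a hypothetical counter. Neither function is taken from IMASdd; they only show the closure form next to the explicit try/finally form:

```julia
const COUNTER_LOCK = ReentrantLock()
const COUNTER = Ref(0)

# Closure form: the `do` block creates an anonymous function capturing `step`.
function bump_closure!(step::Int)
    lock(COUNTER_LOCK) do
        COUNTER[] += step
    end
end

# Explicit form: same locking semantics, but no closure object is created.
function bump_explicit!(step::Int)
    lock(COUNTER_LOCK)
    try
        COUNTER[] += step
    finally
        unlock(COUNTER_LOCK)   # always released, even if the body throws
    end
end

bump_closure!(1)
bump_explicit!(2)
```

Both versions release the lock on error; the explicit form is more verbose but leaves nothing for the compiler to capture.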
@fredrikekre @bclyons12
Summary: Introduced type-based caching infrastructure to reduce allocations and improve performance in path/location functions. Also applied various micro-optimizations (array pooling, string caching, closure elimination).
Key Changes
1. Type-based caching (
2. Temp array reuse (
3. + Other minor micro-optimizations
Status
🚧 Work in Progress - Testing and feedback welcome
Summary
This PR adds/fixes @nospecialize annotations to frequently-called utility functions to prevent excessive method specialization and reduce compilation overhead.
Major Changes
Core Functions Modified
- ulocation(), f2u() - Added @nospecialize for IDS parameters
- info(), coordinates() - Prevented specialization on IDS types
- diff(), merge!(), freeze!() - Refactored type handling
- dict2imas() - Fixed specialization in IO operations
Known Issues
hdf2imas() or through certain code paths. Root cause investigation ongoing.
Testing
Test File
See tmp_test/test_ulocation.jl for specialization behavior tests.
Required Temporary Modification
To properly test, you need to modify dd.jl to use getfield instead of getproperty or ids.field:
macOS:
Linux (without macOS backup extension):
sed -E -i 's/setfield!\(ids\.([^,]*)/setfield!(getfield(ids, :\1)/g' src/dd.jl
This replaces ids.field property access with getfield(ids, :field) to bypass getproperty via ids.field implementations during testing.
Next Steps