Changes from all commits (23 commits)
1e88c69
Add LAMMPS export and benchmark infrastructure
jameskermode Jan 6, 2026
3535517
Add ETACE LAMMPS export tutorial and documentation
jameskermode Jan 6, 2026
a125118
Restore export-ci.yml workflow lost in rebase
jameskermode Jan 6, 2026
e56f712
Update export CI to test ETACE models, remove outdated ACEModel tests
jameskermode Jan 6, 2026
cb04425
Remove ACE registry from CI (ACEpotentials now in General)
jameskermode Jan 6, 2026
67fc49c
Fix export tests: use ETACE for multispecies, loosen tolerances to 1e-10
jameskermode Jan 6, 2026
a9f9ca6
Fix polynomial recurrence in ETACE export to match P4ML
jameskermode Jan 6, 2026
a470238
Add workflow_dispatch trigger to export-ci.yml
jameskermode Jan 6, 2026
b596da0
Mark juliac compilation step as optional in CI
jameskermode Jan 6, 2026
a784f26
Make library-dependent CI jobs conditional on juliac success
jameskermode Jan 6, 2026
b1f52d5
Add compact code generation for reduced export file size
jameskermode Jan 6, 2026
0372f90
Simplify export generator code: remove dead code and merge radial basis
jameskermode Jan 6, 2026
d256b3f
Extract species dispatch helper functions to reduce code duplication
jameskermode Jan 6, 2026
ded91b9
Split export_ace_model.jl into modular files for maintainability
jameskermode Jan 6, 2026
b5a77f4
Fix CI: documentation errors and make juliac compilation mandatory
jameskermode Jan 6, 2026
116a938
Use JuliaC.jl package instead of bundled juliac.jl script
jameskermode Jan 6, 2026
4d60bcc
Fix JuliaC.jl compilation workflow: add LinkRecipe step
jameskermode Jan 6, 2026
4f5cca0
Fix CI: use test_etace_model.jl for library compilation
jameskermode Jan 6, 2026
3d0c24b
Fix CI: use cpu_target='generic' for portable library compilation
jameskermode Jan 6, 2026
be76e6c
Relax energy conservation test thresholds for random CI model
jameskermode Jan 6, 2026
a99fcab
Relax MPI energy drift threshold for random CI model
jameskermode Jan 6, 2026
b172ab2
Fix docs: disable execution for ETACE LAMMPS tutorial
jameskermode Jan 6, 2026
b66f4e1
Fix docs: use documenter=false for ETACE LAMMPS tutorial
jameskermode Jan 6, 2026
570 changes: 570 additions & 0 deletions .github/workflows/export-ci.yml

Large diffs are not rendered by default.

25 changes: 25 additions & 0 deletions .gitignore
@@ -1,8 +1,33 @@
*Manifest.toml
!export/Manifest.toml
/docs/Manifest.toml
/docs/build/
scratch
/docs/src/literate_tutorials
.vscode
.ipynb_checkpoints
.CondaPkg

# Export feature artifacts
export/deployments/
export/**/build/
export/**/*.so
export/**/*.log
export/examples/**/deployments/

# Benchmark build artifacts
benchmark/deployments/**/build/
benchmark/deployments/**/*.o
benchmark/deployments/**/*.o.a
benchmark/deployments/**/CMakeFiles/
benchmark/deployments/**/CMakeCache.txt
benchmark/deployments/**/Makefile
benchmark/deployments/**/cmake_install.cmake
benchmark/**/log.lammps
benchmark/log.lammps
log.lammps

# Python
__pycache__/
*.pyc
.venv/
2 changes: 2 additions & 0 deletions Project.toml
@@ -33,6 +33,7 @@ OffsetArrays = "6fe1bfb0-de20-5000-8ca7-80f57d26f881"
Optim = "429524aa-4258-5aef-a3af-852621145aeb"
Optimisers = "3bd65402-5787-11e9-1adc-39752487f4e2"
OrderedCollections = "bac558e1-5e72-5ebc-8fee-abe8a469f55d"
PackageCompiler = "9b87118b-4619-50d2-8e1e-99f35a4d4d9d"
Pkg = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f"
Polynomials4ML = "03c4bcba-a943-47e9-bfa1-b1661fc2974f"
PrettyTables = "08abe8d2-0d0c-5749-adfa-8a2ac140af0d"
@@ -79,6 +80,7 @@ OffsetArrays = "1"
Optim = "1"
Optimisers = "0.3.4, 0.4"
OrderedCollections = "1"
PackageCompiler = "2.2.4"
Polynomials4ML = "0.5"
PrettyTables = "1.3, 2"
Reexport = "1"
240 changes: 240 additions & 0 deletions benchmark/PERFORMANCE_ANALYSIS.md
@@ -0,0 +1,240 @@
# Performance Analysis: juliac Export vs ML-PACE

## Executive Summary

The juliac export approach is approximately **2x slower** than ML-PACE (v0.6.9) for equivalent ACE models. The gap is **architectural**: it is not explained by memory allocation overhead and cannot be closed by micro-optimization. Critically, **both approaches scale equally well in parallel** - the 2x performance ratio remains constant across all process counts.

## Benchmark Results

**System**: TiAl B2, 10x10x10 supercell (2000 atoms), order=3, totaldegree=10, rcut=5.5

| Processes | juliac Time | juliac ns/day | ML-PACE Time | ML-PACE ns/day | Ratio | Abs Gap (ns/day) |
|-----------|-------------|---------------|--------------|----------------|-------|------------------|
| 1 | 33.92s | 0.255 | 16.01s | 0.540 | 2.12x | 0.29 |
| 4 | 8.45s | 1.023 | 4.28s | 2.019 | 1.97x | 1.00 |
| 8 | 4.58s | 1.885 | 2.30s | 3.759 | 1.99x | 1.87 |

**Parallel scaling:**
- juliac: 7.4x speedup at 8 cores (92.6% efficiency)
- ML-PACE: 7.0x speedup at 8 cores (87.0% efficiency)

## Understanding the "Widening Gap"

**Important clarification:** On a linear-scale performance plot, the gap between juliac and ML-PACE appears to widen with more processes. This is a **visual artifact**, not a scaling deficiency.

Three ways to view the same data:

| Metric | 1 → 8 Processes | Interpretation |
|--------|-----------------|----------------|
| **Performance ratio** | 2.12x → 1.99x | Slightly *narrowing* (juliac catching up) |
| **Absolute gap** | 0.29 → 1.87 ns/day | Widening (misleading on linear plots) |
| **Parallel speedup** | juliac 7.4x, ML-PACE 7.0x | Both scale excellently |

**Why the absolute gap grows:** When both approaches scale by ~7x, the initial gap also scales by ~7x. If A=0.25 and B=0.54 at 1 core, then at 8 cores A≈1.9 and B≈3.8. The ratio (B/A) stays ~2x, but the difference (B-A) grows from 0.3 to 1.9.
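
The same arithmetic, checked directly against the benchmark table above:

```julia
# ns/day at 1, 4 and 8 processes, from the benchmark table above
juliac = [0.255, 1.023, 1.885]
mlpace = [0.540, 2.019, 3.759]

mlpace ./ juliac   # ratio:        ≈ [2.12, 1.97, 1.99]  (roughly constant)
mlpace .- juliac   # absolute gap: ≈ [0.29, 1.00, 1.87]  (grows with the speedup)
```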

**The log-scale plot** shows this clearly: the lines are parallel, indicating constant relative performance. The ratio plot confirms the gap stays at ~2x across all process counts.

## Root Cause Analysis

### What Was Investigated

1. **Memory allocation hypothesis** - Investigated whether per-atom allocations (~11KB) were the bottleneck
- Implemented pre-allocated workspace buffers (sketched after this list)
- Result: **No improvement** (actually 18-26% slower)
- Conclusion: Julia's allocator is efficient; allocation is not the bottleneck

2. **View/dispatch overhead** - Tested using `@view` vs full arrays
- `AbstractMatrix` dispatch adds overhead
- SubArray indirection slows hot loops
- Result: Workspace with views was **slower** than allocating version

3. **SIMD/bounds checking** - Added `@inbounds` and `@simd` annotations
- Minor improvements but not significant
- Limited by loop structure (complex operations in inner loops)
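
For the record, a minimal sketch of the two patterns that were compared, using hypothetical names and a toy radial basis (illustrative only, not the actual export code):

```julia
# Allocating variant (current export code style): fresh buffer on every call.
function site_energy_alloc(rs::Vector{Float64})
    Rnl = Matrix{Float64}(undef, length(rs), 4)   # per-atom allocation
    for n in 1:4, j in eachindex(rs)
        Rnl[j, n] = rs[j]^(n - 1)                 # toy radial basis
    end
    return sum(Rnl)
end

# Workspace variant (tested, then reverted): preallocated buffer plus @view.
struct Workspace
    Rnl::Matrix{Float64}                          # sized for the max neighbour count
end

function site_energy_ws!(ws::Workspace, rs::Vector{Float64})
    Rnl = @view ws.Rnl[1:length(rs), :]           # SubArray indirection in the hot loop
    for n in 1:4
        @inbounds @simd for j in eachindex(rs)
            Rnl[j, n] = rs[j]^(n - 1)
        end
    end
    return sum(Rnl)
end
```

Consistent with the findings above, the avoided allocation does not pay for the extra `SubArray` indexing in the innermost loops.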

### The Real Bottleneck

The performance gap is due to **fundamental architectural differences**:

| Aspect | juliac Export | ML-PACE |
|--------|--------------|---------|
| Evaluation pattern | Per-atom, per-neighbor loops | Batch + cache-optimized |
| Basis evaluation | Scalar function calls | Vectorized with SIMD |
| Memory layout | Julia arrays (general purpose) | Custom cache-aligned |
| Tensor contraction | EquivariantTensors (general) | Hand-tuned sparse ops |

## Why Lux Migration Will Help

The planned migration to fully Lux-based models (PR #305, `co/etback` branch) will change the architecture to:

```
Current (per-atom):             Lux-based (batched):
for each atom:                  G = ETGraph(edges)
  for each neighbor:            r = map(transform, G.edge_data)  # all edges
    Rnl_j = eval(r_j)           Rnl = rbasis(r)                  # batched
    Ylm_j = eval(r̂_j)           Ylm = ybasis(r̂)                 # batched
  A = pool(Rnl, Ylm)            B = SparseACElayer(Rnl, Ylm)     # fused
  ...                           E = sum(B)
```

Key improvements from Lux migration:
1. **Vectorized evaluation** - All edges processed together (sketched below)
2. **GPU acceleration** - KernelAbstractions enables GPU execution
3. **Automatic differentiation** - Zygote handles gradients
4. **Better cache utilization** - Contiguous edge data in 3D arrays
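
A minimal sketch of the per-atom vs. batched patterns in plain Julia; `rbasis` and `ybasis` are toy stand-ins defined here, not the EquivariantTensors API:

```julia
using LinearAlgebra

# Toy stand-ins for the real radial and angular bases (hypothetical).
rbasis(r) = [r^n for n in 0:3]            # 4 radial features per edge
ybasis(r̂) = [1.0, r̂[1], r̂[2], r̂[3]]      # 4 angular (l ≤ 1) features per edge

# Current pattern: evaluate the basis one neighbour at a time.
function energy_per_atom(rs, r̂s)
    E = 0.0
    for (r, r̂) in zip(rs, r̂s)
        A = rbasis(r) * ybasis(r̂)'        # per-neighbour outer product
        E += sum(A)
    end
    return E
end

# Batched pattern: evaluate all edges together, then contract once.
function energy_batched(rs, r̂s)
    Rnl = reduce(hcat, rbasis.(rs))       # 4 × nedges, one vectorized pass
    Ylm = reduce(hcat, ybasis.(r̂s))       # 4 × nedges
    return sum(Rnl * Ylm')                # single fused contraction
end

rs = rand(10)
r̂s = [normalize(rand(3)) for _ in 1:10]
energy_per_atom(rs, r̂s) ≈ energy_batched(rs, r̂s)   # true
```

The batched form is what makes SIMD and GPU kernels applicable: basis evaluation becomes array-level work over contiguous edge data instead of many small scalar calls.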

## Recommendations

### Short-term (before Lux migration)
- **Do not pursue further micro-optimizations** - The bottleneck is architectural
- The current code is clean and correct; keep it maintainable
- Focus development effort on the Lux migration

### For Lux migration
- Use `mlip.jl` example from EquivariantTensors as template
- Ensure `juliac --trim` compatibility with KernelAbstractions (may need CPU-only path)
- Consider keeping C interface similar for LAMMPS compatibility

### Future benchmarking
- Re-run this benchmark after Lux migration
- Compare CPU (KernelAbstractions) vs GPU performance
- Target: Match or exceed ML-PACE on CPU

## Hybrid MPI+OpenMP Benchmark Results

### Key Finding: ML-PACE Does NOT Support OpenMP

Investigation of the ML-PACE source code reveals it has **no OpenMP support**:

```cpp
// pair_pace.cpp line 175 - Simple sequential loop, no #pragma omp
for (ii = 0; ii < inum; ii++) {
// ... all work done serially
}
```

Searching the entire ML-PACE codebase (`/tmp/lammps-user-pace/ML-PACE/`) finds zero OpenMP pragmas.

### CPU Utilization Proves This

| Config | juliac CPU% | ML-PACE CPU% | Interpretation |
|--------|-------------|--------------|----------------|
| 1×8 | **677.8%** | 99.8% | juliac uses 7 threads; ML-PACE uses 1 |
| 2×4 | **379.1%** | 99.7% | juliac uses 4 threads; ML-PACE uses 1 |
| 4×2 | **194.3%** | 99.8% | juliac uses 2 threads; ML-PACE uses 1 |
| 8×1 | 99.2% | 99.7% | Both use 1 thread per rank |

When running `1 MPI × 8 OMP` with ML-PACE:
- LAMMPS allocates 8 threads
- `pair_pace` ignores them - has no OpenMP code
- **7 of 8 cores sit completely idle**
- Performance drops to roughly 1/7 of what pure MPI achieves on the same cores (0.547 vs 3.715 ns/day)

### Raw Benchmark Data

| Config | juliac (ns/day) | ML-PACE (ns/day) | Notes |
|--------|-----------------|------------------|-------|
| 8×1 (pure MPI) | 1.838 | 3.715 | **Valid comparison**: ML-PACE 2x faster |
| 4×2 | 1.808 | 2.002 | ML-PACE wastes 1 thread/rank |
| 2×4 | 1.768 | 1.036 | ML-PACE wastes 3 threads/rank |
| 1×8 (pure OMP) | 1.599 | 0.547 | ML-PACE wastes 7 threads |

### Correct Interpretation

The hybrid benchmark results are **not a fair comparison** for configs other than 8×1:
- ML-PACE is designed for **pure MPI only** (or Kokkos/GPU via `pace/kk`)
- juliac actually implements OpenMP threading over atoms
- Comparing them with OpenMP threads allocated but unused by ML-PACE is misleading

### What ML-PACE Supports

1. **MPI parallelism** - domain decomposition (works well)
2. **Kokkos/GPU** - `pair_style pace/kk` (**NOT compatible with ACEpotentials.jl exports** - see below)
3. **NO OpenMP** - `pair_style pace` is purely sequential within each MPI rank

### pace/kk Does NOT Work with ACEpotentials.jl Exports

Investigation of the ML-PACE source code reveals a fundamental incompatibility:

**Class hierarchy:**
```
AbstractRadialBasis (base class)
├── ACERadialFunctions (pace/kk supported)
└── ACEjlRadialFunctions (ACEpotentials.jl exports)
```

**The problem:** `pair_pace_kokkos.cpp` line 255 requires `ACERadialFunctions`:
```cpp
ACERadialFunctions* radial_functions = dynamic_cast<ACERadialFunctions*>(...);
if (radial_functions == nullptr)
  error->all(FLERR,"Chosen radial basis style not supported by pair style pace/kk");
```

ACEpotentials.jl exports use `radbasename: "ACE.jl"`, which creates `ACEjlRadialFunctions`.
Since this is NOT a subclass of `ACERadialFunctions`, the `dynamic_cast` fails → **pace/kk errors out**.

**Verified in our benchmark's .yace file:**
```yaml
bonds:
  [0, 0]:
    radbasename: "ACE.jl"   # Triggers ACEjlRadialFunctions, incompatible with pace/kk
```

**Implication:** For GPU acceleration of ACEpotentials.jl models, the Lux migration with KernelAbstractions is the only viable path. ML-PACE's GPU support (pace/kk) only works for potentials fitted with the Python pacemaker toolkit, not for ACEpotentials.jl exports.

### Valid Conclusions

| Scenario | juliac | ML-PACE (.yace export) |
|----------|--------|------------------------|
| Pure MPI | ✓ Works (~2x slower) | ✓ Works (fastest for CPU) |
| Hybrid MPI+OpenMP | ✓ Works (uses threads) | ✗ Threads wasted |
| GPU (Kokkos) | Pending Lux migration | ✗ pace/kk incompatible with ACEpotentials.jl |

**Bottom line for ACEpotentials.jl users:**
- **CPU (pure MPI):** Export to .yace and use ML-PACE for best performance
- **CPU (hybrid):** Use juliac plugin (only option that uses OpenMP)
- **GPU:** Not currently available; awaiting Lux migration

### juliac's OpenMP Implementation

The juliac/ACE plugin does use OpenMP threading. With `1 MPI × 8 OMP`:
- CPU utilization: 677.8% (using ~7 of 8 threads)
- Performance: 1.599 ns/day (only 13% slower than 8×1)
- Threads parallelize over atoms within the MPI domain

This is a genuine advantage of the juliac approach for hybrid parallelism scenarios.
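
Schematically, this corresponds to splitting the per-atom loop across the threads of each MPI rank. A minimal Julia sketch of the pattern (the plugin's actual implementation may differ):

```julia
using Base.Threads

# OpenMP-style parallel loop over the rank-local atoms: each thread writes a
# disjoint slice of per-atom energies, so there are no data races.
function total_energy(site_energy, atoms)
    Es = Vector{Float64}(undef, length(atoms))
    @threads for i in eachindex(atoms)
        Es[i] = site_energy(atoms[i])
    end
    return sum(Es)
end
```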

### To Reproduce

```bash
cd benchmark/lammps
./run_hybrid_scaling.sh

cd ..
python3 plot_hybrid_scaling.py
```

Results saved in `results/hybrid/`.

## Files Modified During Investigation

All changes were **reverted** as they did not improve performance:
- `export/src/export_ace_model.jl` - Workspace code (reverted)

## Baseline for Future Comparison

```
juliac export baseline (2024-12-11):
- TiAl B2 2000 atoms, order=3, degree=10
- Single core: 33.92s, 0.255 ns/day
- 8 cores: 4.58s, 1.885 ns/day
- Speedup: 7.4x (92.6% efficiency)

ML-PACE v0.6.9 baseline:
- Same system
- Single core: 16.01s, 0.540 ns/day
- 8 cores: 2.30s, 3.759 ns/day
- Speedup: 7.0x (87.0% efficiency)

Performance ratio: ~2x constant across all process counts
```