Conversation


@sbryngelson sbryngelson commented Dec 30, 2025

PR Type

Enhancement, Tests


Description

  • Unify CPU/GPU kernel code paths for turbulence models

    • Extract common computation logic into shared kernels (mixing_length, EARSM, GEP, SST transport)
    • Compile kernels for both CPU and GPU using #pragma omp declare target
    • Eliminates over 200 lines of duplicate code across turbulence implementations
  • Return valid host pointers in CPU TurbulenceDeviceView

    • CPU builds now return actual host memory pointers instead of invalid empty view
    • Follows SolverDeviceView pattern for consistency
    • Enables unified kernel usage on both backends
  • Improve CPU/GPU consistency testing

    • Always regenerate CPU reference before GPU comparison (avoid stale files)
    • Fail (not skip) if GPU unavailable during cross-build tests
    • Add GPU device verification and execution checks
    • Use OMP_TARGET_OFFLOAD=MANDATORY to prevent silent CPU fallback
  • Rename ci.sh and improve test infrastructure

    • Rename run_ci_local.sh → ci.sh (used by both local and CI pipelines)
    • Add ensure_build() function to guarantee both CPU/GPU builds exist
    • Implement caching to avoid redundant builds across multiple cross-build tests
    • Add MIN_EXPECTED_DIFF threshold to detect same-backend comparisons
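The unified-kernel pattern described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `eddy_viscosity_cell_kernel` and `compute_nu_t` are hypothetical names, and the real kernels take many more fields.

```cpp
#include <cmath>
#include <vector>

// One cell kernel compiled for both host and device via declare target.
#pragma omp declare target
inline void eddy_viscosity_cell_kernel(int idx, double l_mix,
                                       const double* s_mag, double* nu_t) {
    // nu_t = l_mix^2 * |S| (simplified mixing-length closure)
    nu_t[idx] = l_mix * l_mix * s_mag[idx];
}
#pragma omp end declare target

void compute_nu_t(const std::vector<double>& s_mag, std::vector<double>& nu_t,
                  double l_mix, bool use_gpu) {
    const int n = static_cast<int>(nu_t.size());
    const double* s = s_mag.data();
    double* out = nu_t.data();
    if (use_gpu) {
        // GPU path: same kernel, offloaded via OpenMP target
        #pragma omp target teams distribute parallel for \
            map(to: s[0:n]) map(from: out[0:n])
        for (int i = 0; i < n; ++i)
            eddy_viscosity_cell_kernel(i, l_mix, s, out);
    } else {
        // CPU path: same kernel, plain host loop
        for (int i = 0; i < n; ++i)
            eddy_viscosity_cell_kernel(i, l_mix, s, out);
    }
}
```

Because both branches call the identical cell function, a fix to the physics in one path cannot silently diverge from the other.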

Diagram Walkthrough

flowchart LR
  A["Turbulence Models<br/>mixing_length, EARSM, GEP, SST"] -->|Extract common logic| B["Unified Kernels<br/>with omp declare target"]
  B -->|Compile for CPU| C["CPU Execution"]
  B -->|Compile for GPU| D["GPU Execution"]
  C -->|Generate reference| E["CPU Reference Output"]
  D -->|Compare against| E
  E -->|Pass/Fail| F["Test Result"]
  G["TurbulenceDeviceView"] -->|Return host pointers| H["CPU Build"]
  H -->|Enable unified kernels| C

File Walkthrough

Relevant files

Enhancement (6 files)
  • solver.cpp: Return valid host pointers in CPU TurbulenceDeviceView (+38/-1)
  • turbulence_baseline.cpp: Replace inlined GPU code with unified mixing_length kernel (+9/-23)
  • turbulence_earsm.cpp: Extract EARSM output computation into unified kernel (+120/-250)
  • turbulence_gep.cpp: Create unified GEP cell kernel for CPU/GPU (+129/-106)
  • turbulence_transport.cpp: Unify SST k-ω transport into single cell kernel (+281/-215)
  • ci.sh: Rename from run_ci_local.sh and improve cross-build testing (+190/-102)

Tests (2 files)
  • test_cpu_gpu_bitwise.cpp: Add GPU device verification and MIN_EXPECTED_DIFF threshold (+41/-8)
  • test_poisson_cpu_gpu_3d.cpp: Add GPU device verification and MIN_EXPECTED_DIFF threshold (+32/-2)

Documentation (3 files)
  • rules.md: Update script references from run_ci_local.sh to ci.sh (+11/-11)
  • gpu_ci_correctness.sbatch.template: Update script references from run_ci_local.sh to ci.sh (+2/-2)
  • ci.yml: Update script references from run_ci_local.sh to ci.sh (+2/-2)

Summary by CodeRabbit

  • Tests

    • Strengthened GPU detection and execution verification in comparisons; added a tiny-difference threshold to flag suspiciously small CPU/GPU deltas.
  • New Features

    • Added GPU-aware cross-build test flow: CPU reference generation, GPU comparisons, and environment-controlled enforcement of GPU offload.
  • Refactor

    • Unified CPU/GPU execution paths and cell-level kernels for turbulence and transport; made device-view data consistent in CPU builds.
  • Chores

    • Updated CI/run guidance to reference the consolidated CI script.


sbryngelson and others added 8 commits December 30, 2025 10:39
The cross-build tests were silently skipping when only one build existed,
and would compare CPU-to-CPU when GPU was unavailable. Now:

- Builds both CPU and GPU versions automatically for cross-build tests
- Fails (not skips) if GPU is unavailable when required
- Always generates fresh CPU reference before GPU comparison
- Renamed run_ci_local.sh -> ci.sh (used by both local and CI pipelines)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Following the SolverDeviceView pattern, make get_device_view() return
valid host pointers in CPU builds instead of an invalid empty view.
This is the foundational change that enables future unification of
CPU/GPU turbulence model code paths.

The change is backwards-compatible:
- GPU builds: unchanged behavior (returns GPU-mapped pointers)
- CPU builds: now returns valid host pointers
- NN models: unaffected (their GPU checks are inside #ifdef USE_GPU_OFFLOAD)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
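A minimal sketch of that pattern, with illustrative names (the real TurbulenceDeviceView carries many more fields, and the GPU branch is elided):

```cpp
#include <vector>

// Hypothetical, trimmed-down device view: in CPU builds it carries plain
// host pointers, so the same struct (and the same kernels) work on both
// backends instead of the CPU build getting an invalid empty view.
struct DeviceView {
    double* k = nullptr;
    double* nu_t = nullptr;
    int cell_stride = 0;
};

class Solver {
public:
    Solver(int nx, int ny) : nx_(nx), k_(nx * ny, 0.0), nu_t_(nx * ny, 0.0) {}

    DeviceView get_device_view() {
        DeviceView v;
#ifdef USE_GPU_OFFLOAD
        // GPU build: fill v with device-mapped pointers (omitted in sketch)
#else
        // CPU build: valid host pointers instead of an invalid empty view
        v.k = k_.data();
        v.nu_t = nu_t_.data();
        v.cell_stride = nx_;
#endif
        return v;
    }

private:
    int nx_;
    std::vector<double> k_, nu_t_;
};
```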
Replace inlined GPU computation with call to mixing_length_cell_kernel()
which was already defined with #pragma omp declare target but wasn't
being used. This eliminates ~15 lines of duplicate code and ensures
GPU and CPU paths use identical kernel logic.

The mixing_length_cell_kernel() function (lines 34-64) computes:
- Strain rate magnitude from velocity gradients
- y+ and van Driest damping
- Mixing length with delta/2 cap
- Final eddy viscosity

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
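The four computation steps listed above can be sketched as a single function. This is an illustration only; the actual mixing_length_cell_kernel works through raw field pointers and has a different signature.

```cpp
#include <cmath>

// Illustrative single-cell mixing-length computation following the steps
// above: y+, van Driest damping, delta/2 cap, then eddy viscosity.
double mixing_length_nu_t(double y_wall, double u_tau, double nu,
                          double kappa, double A_plus, double delta,
                          double S_mag) {
    double y_plus = y_wall * u_tau / nu;                // wall units
    double damping = 1.0 - std::exp(-y_plus / A_plus);  // van Driest damping
    double l_mix = kappa * y_wall * damping;
    if (l_mix > 0.5 * delta) l_mix = 0.5 * delta;       // cap at delta/2
    return l_mix * l_mix * S_mag;                       // nu_t = l^2 |S|
}
```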
Following the unified kernel pattern, extract the GEP eddy viscosity
computation into a single gep_cell_kernel() function with
#pragma omp declare target. Both GPU and CPU paths now call the
same kernel function, eliminating duplicated logic.

Key changes:
- Add gep_cell_kernel() with unified GEP variant logic
- GPU path: replace inlined code with kernel call
- CPU path: replace inlined code with kernel call (using local
  wall_distance buffer for flat array access)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extract the common EARSM output computation (tensor basis, Reynolds
stresses, eddy viscosity) into a unified earsm_compute_output() kernel
with #pragma omp declare target. All three EARSM variants (WJ, GS, Pope)
now call this shared kernel after computing model-specific β coefficients.

Key changes:
- Add earsm_compute_output() unified kernel (~60 lines)
- WJ kernel: replace 60 lines of inlined output with kernel call
- GS kernel: replace 50 lines of inlined output with kernel call
- Pope kernel: replace 120 lines (including debug logging) with kernel call
- Remove unused debug logging infrastructure and includes
- Net reduction: ~130 lines

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add omp_is_initial_device() check to verify target regions execute on GPU
- Tighten warning threshold from exact 0.0 to 1e-14 (below machine epsilon)
- Print GPU device count and verification status

This catches cases where GPU offload compiles but silently produces
identical results to CPU, which would indicate something is wrong.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
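The verification described above can be sketched as below; `offload_verified` is a hypothetical helper that separates the decision logic from the OpenMP runtime calls so it can be tested without a GPU.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Decision logic kept separate so it can be exercised without a GPU.
bool offload_verified(int num_devices, int ran_on_device) {
    return num_devices > 0 && ran_on_device != 0;
}

// Sketch of the check the commit describes: confirm a target region really
// executes on the device rather than silently falling back to the host.
bool gpu_execution_verified() {
#ifdef _OPENMP
    int on_device = 0;
    #pragma omp target map(tofrom: on_device)
    { on_device = !omp_is_initial_device(); }
    return offload_verified(omp_get_num_devices(), on_device);
#else
    return false;  // no OpenMP at all: certainly no offload
#endif
}
```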
Extract sst_transport_cell_kernel() with #pragma omp declare target
that computes all SST physics (gradients, F1 blending, production,
advection, diffusion, cross-diffusion, time integration) for a single
cell. Both GPU and CPU paths now call this unified kernel.

Benefits:
- Single source of truth for SST transport physics
- Eliminates multi-pass CPU approach (separate gradient, blending,
  production, advection, diffusion loops)
- CPU and GPU use identical computation logic
- Easier to maintain and verify correctness

The unified kernel uses simple central-difference diffusion (same as
previous GPU path) rather than face-averaged coefficients. This is
acceptable per user guidance prioritizing unified logic over micro-
optimizations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
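As one example of the per-cell physics this kernel folds together, the SST F1 blending step can be sketched as follows. The constants are the standard Menter values; the guard on y and the exact argument forms are illustrative, not copied from the project.

```cpp
#include <algorithm>
#include <cmath>

// Illustrative F1 blending computation from the SST k-omega model, one of
// the steps a unified transport cell kernel performs. F1 -> 1 near walls
// (k-omega behavior) and -> 0 in the free stream (k-epsilon behavior).
double sst_F1(double k, double omega, double nu, double y, double CD_omega) {
    const double beta_star = 0.09;
    const double sigma_omega2 = 0.856;
    const double y_safe = std::max(y, 1e-10);  // avoid division by zero
    const double arg1 = std::min(
        std::max(std::sqrt(k) / (beta_star * omega * y_safe),
                 500.0 * nu / (y_safe * y_safe * omega)),
        4.0 * sigma_omega2 * k / (CD_omega * y_safe * y_safe));
    return std::tanh(arg1 * arg1 * arg1 * arg1);
}
```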

coderabbitai bot commented Dec 30, 2025

📝 Walkthrough

Walkthrough

Replaced local CI runner references with scripts/ci.sh and added a GPU-aware cross-build flow; CPU builds now return fully populated host-backed device views; per-cell turbulence and transport computations were centralized into unified cell kernels; tests add GPU availability/execution verification and thresholded diff checks.

Changes

Cohort / File(s) Summary
Docs / Guidance
\.claude/rules.md
Replaced ./scripts/run_ci_local.sh → ./scripts/ci.sh in guidance and quick-reference text.
CI Workflows & Templates
.github/workflows/ci.yml, .github/scripts/gpu_ci_correctness.sbatch.template
Updated workflow and SBATCH template to invoke ./scripts/ci.sh (including --cpu usage).
Local/CI Runner
scripts/ci.sh
Reworked runner into a GPU-aware cross-build script: added check_gpu_available(), ensure_build(<build_dir>, <use_gpu_offload>), run_cross_build_test(), build-caching flags (CPU_BUILD_ENSURED/GPU_BUILD_ENSURED), extended run_test() to accept optional env_prefix, and CPU-reference → GPU-compare orchestration.
Solver device views
src/solver.cpp
CPU builds: get_device_view() and get_solver_view() now return fully populated host-backed views (host pointers for all fields), aligning CPU/GPU data-access semantics.
Turbulence kernels (centralized)
src/turbulence_baseline.cpp, src/turbulence_gep.cpp, src/turbulence_earsm.cpp, src/turbulence_transport.cpp
Centralized per-cell logic into unified inline/device-agnostic kernels (mixing_length_cell_kernel, gep_cell_kernel, earsm_compute_output, sst_transport_cell_kernel); both GPU and CPU paths call the same kernels; added small host-side wall-distance buffers and minor includes.
Tests: CPU↔GPU comparisons
tests/test_cpu_gpu_bitwise.cpp, tests/test_poisson_cpu_gpu_3d.cpp
Added GPU availability and on-device execution verification, introduced MIN_EXPECTED_DIFF = 1e-14 and replaced exact-zero diff checks with threshold-based warnings; added OpenMP target execution guards.
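The thresholded diff check can be sketched as below. MIN_EXPECTED_DIFF matches the value named in the PR; MAX_ALLOWED_DIFF and the verdict strings are illustrative placeholders.

```cpp
#include <string>

// A max CPU/GPU difference below ~machine epsilon is suspicious: two
// independently compiled floating-point paths should differ slightly, so an
// (almost) exact match suggests both runs used the same backend.
constexpr double MIN_EXPECTED_DIFF = 1e-14;
constexpr double MAX_ALLOWED_DIFF = 1e-10;  // illustrative pass tolerance

std::string classify_diff(double max_abs_diff) {
    if (max_abs_diff > MAX_ALLOWED_DIFF) return "FAIL";
    if (max_abs_diff < MIN_EXPECTED_DIFF) return "WARN: suspiciously small";
    return "PASS";
}
```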

Sequence Diagram(s)

sequenceDiagram
    participant Dev as Developer / CI Trigger
    participant CI as scripts/ci.sh
    participant BuilderCPU as build_cpu
    participant BuilderGPU as build_gpu
    participant GPU as GPU Device
    participant Comparator as Comparator / Tests

    rect rgba(0,128,96,0.06)
    Note over CI: CI starts (local or workflow)
    end

    Dev->>CI: invoke ./scripts/ci.sh [--cpu / tests...]
    CI->>CI: check_gpu_available()
    alt GPU available
      CI->>BuilderCPU: ensure_build(build_cpu, offload=OFF)
      CI->>BuilderGPU: ensure_build(build_gpu, offload=ON)
      CI->>BuilderCPU: run tests → generate cpu_reference
      BuilderCPU-->>CI: cpu_reference artifacts
      CI->>BuilderGPU: run tests on GPU (use cpu_reference)
      BuilderGPU->>GPU: offloaded kernels execute
      BuilderGPU-->>CI: gpu outputs
      CI->>Comparator: compare gpu outputs vs cpu_reference
      Comparator-->>CI: pass/fail + diffs
    else No GPU
      CI-->>Dev: skip/fail GPU cross-build tests (based on flags)
    end
    CI-->>Dev: final status, logs, artifacts

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰 I hopped through kernels, neat and spry,

One kernel now for CPU and sky,
CI bakes refs and sends them to test,
GPU and host must both do their best,
A carrot-sprung patch — hop, code, rejoice! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 50.00%, below the required 80.00% threshold. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: The title accurately summarizes the two main change categories in the PR: CPU/GPU consistency test improvements and the CI script rename from run_ci_local.sh to ci.sh.

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4190803 and b30576e.

📒 Files selected for processing (1)
  • src/turbulence_baseline.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/turbulence_baseline.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-and-test (ubuntu-latest, Debug)
  • GitHub Check: build-and-test (ubuntu-latest, Release)
  • GitHub Check: gpu-tests



qodo-code-review bot commented Dec 30, 2025

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Unsafe pointer exposure: RANSSolver::get_device_view() returns writable raw pointers via const_cast which could
allow unintended mutation if the view is used outside trusted/internal code paths.

Referred Code
// CPU build: return host pointers (following get_solver_view() pattern)
TurbulenceDeviceView view;

// Velocity field (staggered, use same pattern as get_solver_view)
view.u_face = const_cast<double*>(velocity_.u_data().data());
view.v_face = const_cast<double*>(velocity_.v_data().data());
view.u_stride = velocity_.u_stride();
view.v_stride = velocity_.v_stride();

// Turbulence fields (cell-centered)
view.k = const_cast<double*>(k_.data().data());
view.omega = const_cast<double*>(omega_.data().data());
view.nu_t = const_cast<double*>(nu_t_.data().data());
view.cell_stride = mesh_->total_Nx();

// Reynolds stress tensor
view.tau_xx = const_cast<double*>(tau_ij_.xx_data().data());
view.tau_xy = const_cast<double*>(tau_ij_.xy_data().data());
view.tau_yy = const_cast<double*>(tau_ij_.yy_data().data());

// Gradient scratch buffers


 ... (clipped 17 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label


qodo-code-review bot commented Dec 30, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Possible issue
Guard duplicate device_view definitions

Wrap the two definitions of RANSSolver::get_device_view() with #ifdef
USE_GPU_OFFLOAD and #ifndef USE_GPU_OFFLOAD to prevent a duplicate symbol linker
error.

src/solver.cpp [4438-4477]

+#ifndef USE_GPU_OFFLOAD
 TurbulenceDeviceView RANSSolver::get_device_view() const {
     // CPU build: return host pointers (following get_solver_view() pattern)
     TurbulenceDeviceView view;
     // ...
     return view;
 }
+#endif

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 10


Why: The suggestion correctly identifies a critical build error (duplicate symbol definition) that would prevent the code from compiling and provides the correct fix using preprocessor guards.

High
Enforce correct test mode usage

To prevent generating incorrect reference files, add a check in the main
function to ensure that GPU-enabled builds of the test are only run in compare
mode, not dump mode.

tests/test_cpu_gpu_bitwise.cpp [291-463]

+int main(int argc, char* argv[]) {
+    #ifdef USE_GPU_OFFLOAD
+    // In a GPU build, verify we are in compare mode. Dump mode is for CPU-only builds.
+    bool is_compare_mode = false;
+    for (int i = 1; i < argc; ++i) {
+        if (std::strcmp(argv[i], "--compare-prefix") == 0) {
+            is_compare_mode = true;
+            break;
+        }
+    }
+    if (!is_compare_mode) {
+        std::cerr << "ERROR: GPU build (USE_GPU_OFFLOAD=ON) must be run with --compare-prefix.\n";
+        std::cerr << "       To generate reference data, use a CPU-only build (USE_GPU_OFFLOAD=OFF).\n";
+        return 1;
+    }
+    #endif
+
+    // ... existing main function logic ...
+}
+
+// ...
+
 int run_compare_mode([[maybe_unused]] const std::string& prefix) {
 #ifndef USE_GPU_OFFLOAD
-    std::cerr << "ERROR: --compare-prefix requires GPU build\n";
-    std::cerr << "       This binary was built with USE_GPU_OFFLOAD=OFF\n";
-    std::cerr << "       Rebuild with -DUSE_GPU_OFFLOAD=ON\n";
-    return 1;
+    // ... (this part is now theoretically unreachable but good for defense)
 #else
     std::cout << "=== GPU Comparison Mode ===\n";
     std::cout << "Reference prefix: " << prefix << "\n\n";
 
     // Verify GPU is actually accessible (not just compiled with offload)
     const int num_devices = omp_get_num_devices();
     std::cout << "GPU devices available: " << num_devices << "\n";
     if (num_devices == 0) {
         std::cerr << "ERROR: No GPU devices found. Cannot run GPU comparison.\n";
         return 1;
     }
 
     // Verify target regions actually execute on GPU (not host fallback)
     int on_device = 0;
     #pragma omp target map(tofrom: on_device)
     {
         on_device = !omp_is_initial_device();
     }
     if (!on_device) {
         std::cerr << "ERROR: Target region executed on host, not GPU.\n";
         std::cerr << "       Check GPU drivers and OMP_TARGET_OFFLOAD settings.\n";
         return 1;
     }
     std::cout << "GPU execution verified: YES\n\n";
 
     // Verify reference files exist
     if (!file_exists(prefix + "_u.dat")) {
         std::cerr << "ERROR: Reference file not found: " << prefix << "_u.dat\n";
         std::cerr << "       Run CPU build with --dump-prefix first\n";
         return 1;
     }
     // ...
 #endif
 }

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies a potential misuse of the test binary that could lead to invalid reference files and confusing test results, and the proposed fix is robust and improves the test harness's reliability.

Medium
General
Pass CMAKE_BUILD_TYPE to cmake
Suggestion Impact: The cmake command in ensure_build was modified to include -DCMAKE_BUILD_TYPE (set explicitly to Release), addressing the missing build-type propagation noted in the suggestion, though it does not use the suggested BUILD_TYPE variable/default handling.

code diff:

     if [ ! -f "CMakeCache.txt" ]; then
         log_info "  Configuring $build_name build..."
-        if ! cmake "$PROJECT_DIR" -DUSE_GPU_OFFLOAD=${gpu_offload} -DBUILD_TESTS=ON > cmake_output.log 2>&1; then
+        if ! cmake "$PROJECT_DIR" -DCMAKE_BUILD_TYPE=Release -DUSE_GPU_OFFLOAD=${gpu_offload} -DBUILD_TESTS=ON > cmake_output.log 2>&1; then

Add the -DCMAKE_BUILD_TYPE flag to the cmake command within the ensure_build
function to ensure that both Debug and Release configurations are tested as
intended.

scripts/ci.sh [104-143]

 ensure_build() {
     ...
     if [ ! -f "CMakeCache.txt" ]; then
-        if ! cmake "$PROJECT_DIR" -DUSE_GPU_OFFLOAD=${gpu_offload} -DBUILD_TESTS=ON > cmake_output.log 2>&1; then
+        if ! cmake "$PROJECT_DIR" \
+                  -DCMAKE_BUILD_TYPE="${BUILD_TYPE:-Debug}" \
+                  -DUSE_GPU_OFFLOAD=${gpu_offload} \
+                  -DBUILD_TESTS=ON > cmake_output.log 2>&1; then
             ...
         fi
     fi
     ...
 }

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies that the ensure_build function is missing the CMAKE_BUILD_TYPE flag, which would cause all cross-build tests to run only in one configuration (likely Debug), failing to test the Release build as intended by the CI logic.

High
Possible issue
Avoid ghost-cell wall distance reads

Correct the loop that populates wall_dist_buf to iterate only over interior
cells, preventing potential out-of-bounds access when calling
mesh.wall_distance.

src/turbulence_transport.cpp [818-827]

 // Create wall_distance buffer for unified kernel
 const size_t total_cells = (size_t)mesh.total_Nx() * mesh.total_Ny();
-std::vector<double> wall_dist_buf(total_cells, 0.0);
-for (int j = 0; j < mesh.total_Ny(); ++j) {
-    for (int i = 0; i < mesh.total_Nx(); ++i) {
+std::vector<double> wall_dist_buf(total_cells, 1e-10);  // safe default for ghosts
+for (int j = mesh.j_begin(); j < mesh.j_end(); ++j) {
+    for (int i = mesh.i_begin(); i < mesh.i_end(); ++i) {
         const int idx = j * cell_stride + i;
         wall_dist_buf[idx] = mesh.wall_distance(i, j);
     }
 }
 const double* wall_dist_ptr = wall_dist_buf.data();
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies that the loop populating wall_dist_buf iterates over ghost cells, which could lead to out-of-bounds access or incorrect behavior if mesh.wall_distance is not designed for it. The fix correctly restricts the loop to interior cells, preventing a potential crash or incorrect calculations.

Medium
Make job count portable

Improve script portability by adding a fallback for nproc to determine the
number of CPU cores, preventing failures on systems like macOS.

scripts/ci.sh [133-138]

-if ! make -j"$(nproc)" > build_output.log 2>&1; then
+local jobs
+jobs="$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)"
+if ! make -j"$jobs" > build_output.log 2>&1; then
     log_failure "$build_name build failed"
     cat build_output.log | tail -30 | sed 's/^/    /'
     cd "$orig_dir"
     return 1
 fi
Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies that using nproc without a fallback will cause the script to fail on macOS. The proposed change improves portability for local development, which is a key goal of this script.

Low
Use device-friendly math functions

Replace std:: math functions like std::sqrt and std::tanh with their C-style
counterparts (sqrt, tanh) inside the OpenMP target region to improve portability
and prevent potential GPU compilation issues.

src/turbulence_transport.cpp [115-122]

-double sqrt_k = std::sqrt(k_c);
+double sqrt_k = sqrt(k_c);
 double arg1_1 = sqrt_k / (beta_star * omega_c * y_safe);
 double arg1_2 = 500.0 * nu / (y_safe * y_safe * omega_c);
 double arg1_3 = 4.0 * sigma_omega2 * k_c / (CD_omega * y_safe * y_safe);
 double arg1 = arg1_1;
 arg1 = (arg1 > arg1_2) ? arg1 : arg1_2;
 arg1 = (arg1 < arg1_3) ? arg1 : arg1_3;
-double F1 = std::tanh(arg1 * arg1 * arg1 * arg1);
+double F1 = tanh(arg1 * arg1 * arg1 * arg1);
Suggestion importance[1-10]: 6


Why: The suggestion correctly points out a common portability issue with using std:: math functions in OpenMP target regions. Using the C-style global namespace functions (sqrt, tanh) is more portable across different compilers and toolchains for GPU offloading, improving the code's robustness.

Low
Safely pass environment variables

Refactor the run_test function to handle environment variables more robustly by
splitting env_prefix into an array, preventing issues with word-splitting.

scripts/ci.sh [203-207]

 if [ -n "$env_prefix" ]; then
-    env $env_prefix timeout "$timeout_secs" "$test_binary" > "$output_file" 2>&1 || exit_code=$?
+    # Split env_prefix into words intentionally, then pass as separate args to env
+    # shellcheck disable=SC2206
+    local env_kv=($env_prefix)
+    env "${env_kv[@]}" timeout "$timeout_secs" "$test_binary" > "$output_file" 2>&1 || exit_code=$?
 else
     timeout "$timeout_secs" "$test_binary" > "$output_file" 2>&1 || exit_code=$?
 fi
Suggestion importance[1-10]: 5


Why: The suggestion correctly points out a potential word-splitting issue with env $env_prefix. While the current usage in the PR is safe, the proposed change makes the run_test function more robust and prevents potential bugs if it's used with more complex environment variables in the future.

Low

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
src/turbulence_baseline.cpp (1)

332-354: CPU path does not use the unified kernel, risking code drift.

The CPU path still has inline computation instead of calling mixing_length_cell_kernel. While the logic appears equivalent, this defeats the purpose of the "unified kernel" approach. If either path is modified, the other won't automatically get the fix.

Consider refactoring the CPU path to also call the unified kernel for true consistency:

🔎 Suggested refactor
+    const int stride = mesh.total_Nx();
+    double* nu_t_ptr = nu_t.data().data();
+    const double* dudx_ptr = dudx_.data().data();
+    const double* dudy_ptr = dudy_.data().data();
+    const double* dvdx_ptr = dvdx_.data().data();
+    const double* dvdy_ptr = dvdy_.data().data();
+
     // CPU path
     for (int j = mesh.j_begin(); j < mesh.j_end(); ++j) {
         for (int i = mesh.i_begin(); i < mesh.i_end(); ++i) {
-            double y_wall = mesh.wall_distance(i, j);
-            double y_plus = y_wall * u_tau / nu_;
-            
-            // Use same formulation as GPU (single expression)
-            double damping = 1.0 - std::exp(-y_plus / A_plus_);
-            
-            double l_mix = kappa_ * y_wall * damping;
-            // Use same min operation as GPU
-            if (l_mix > 0.5 * delta_) {
-                l_mix = 0.5 * delta_;
-            }
-            
-            double Sxx = dudx_(i, j);
-            double Syy = dvdy_(i, j);
-            double Sxy = 0.5 * (dudy_(i, j) + dvdx_(i, j));
-            double S_mag = std::sqrt(2.0 * (Sxx*Sxx + Syy*Syy + 2.0*Sxy*Sxy));
-            
-            nu_t(i, j) = l_mix * l_mix * S_mag;
+            const int cell_idx = j * stride + i;
+            mixing_length_cell_kernel(
+                cell_idx, u_tau, nu_,
+                kappa_, A_plus_, delta_,
+                mesh.wall_distance(i, j),
+                dudx_ptr, dudy_ptr, dvdx_ptr, dvdy_ptr,
+                nu_t_ptr);
         }
     }
src/turbulence_earsm.cpp (1)

204-340: CPU path does not use the unified output kernel.

The CPU implementation in EARSMClosure::compute_nu_t still computes tensor basis and Reynolds stresses inline rather than calling earsm_compute_output. While the GPU-specific variants (WJ, GS, Pope) use the unified kernel, the base CPU path does not. This creates potential for drift between CPU and GPU implementations.

Consider refactoring to use earsm_compute_output in the CPU path as well, or at minimum document why the paths differ.

scripts/ci.sh (1)

104-143: Robust build helper with proper error handling.

The ensure_build function handles configuration and build with appropriate error reporting. However, the static analysis tool flagged line 116:

🔎 Address SC2155: Declare and assign separately
-    local orig_dir=$(pwd)
+    local orig_dir
+    orig_dir=$(pwd)

This prevents masking the return value of pwd if it were to fail (though unlikely).

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4aa91fb and 3d4d718.

📒 Files selected for processing (11)
  • .claude/rules.md
  • .github/scripts/gpu_ci_correctness.sbatch.template
  • .github/workflows/ci.yml
  • scripts/ci.sh
  • src/solver.cpp
  • src/turbulence_baseline.cpp
  • src/turbulence_earsm.cpp
  • src/turbulence_gep.cpp
  • src/turbulence_transport.cpp
  • tests/test_cpu_gpu_bitwise.cpp
  • tests/test_poisson_cpu_gpu_3d.cpp
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,cc,cxx}

📄 CodeRabbit inference engine (.cursorrules)

**/*.{cpp,cc,cxx}: Do not add platform-specific tolerances with #ifdef __APPLE__ or similar platform checks in tests
Include code comments explaining WHY, not WHAT; document numerical algorithms; note non-obvious optimizations

Files:

  • tests/test_cpu_gpu_bitwise.cpp
  • src/turbulence_gep.cpp
  • src/turbulence_earsm.cpp
  • src/solver.cpp
  • tests/test_poisson_cpu_gpu_3d.cpp
  • src/turbulence_transport.cpp
  • src/turbulence_baseline.cpp
**/*.{cpp,cc,cxx,h,hpp}

📄 CodeRabbit inference engine (.cursorrules)

**/*.{cpp,cc,cxx,h,hpp}: Avoid overly strict floating-point comparisons (e.g., == for doubles); use appropriate tolerances based on algorithm
Fix all compiler warnings before pushing; use [[maybe_unused]] for intentionally unused variables in assertions
Use const for read-only references and mark methods const if they don't modify state
Use RAII for resource management; use smart pointers or RAII wrappers instead of manual new/delete or malloc/free
Use OmpDeviceBuffer wrapper for GPU buffer management instead of manual allocation
Check return values and use exceptions for error conditions with informative error messages

Files:

  • tests/test_cpu_gpu_bitwise.cpp
  • src/turbulence_gep.cpp
  • src/turbulence_earsm.cpp
  • src/solver.cpp
  • tests/test_poisson_cpu_gpu_3d.cpp
  • src/turbulence_transport.cpp
  • src/turbulence_baseline.cpp
**/*.{sh,slurm,batch}

📄 CodeRabbit inference engine (.cursor/rules/lean-changes.mdc)

Always submit slurm jobs using the embers QOS

Files:

  • scripts/ci.sh
🧠 Learnings (14)
📓 Common learnings
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: For GPU-related changes, also run `./test_before_ci_gpu.sh` to run the complete GPU CI test suite locally
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: Run `./test_before_ci.sh` before every push to repository to test Debug build, Release build, and all unit tests
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: Applies to **/*.{cpp,cc,cxx,h,hpp} : Use `OmpDeviceBuffer` wrapper for GPU buffer management instead of manual allocation
📚 Learning: 2025-12-25T20:38:43.379Z
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: For GPU-related changes, also run `./test_before_ci_gpu.sh` to run the complete GPU CI test suite locally

Applied to files:

  • .github/workflows/ci.yml
  • .claude/rules.md
  • tests/test_cpu_gpu_bitwise.cpp
  • scripts/ci.sh
  • .github/scripts/gpu_ci_correctness.sbatch.template
  • tests/test_poisson_cpu_gpu_3d.cpp
📚 Learning: 2025-12-25T20:38:43.379Z
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: Run `./test_before_ci.sh` before every push to repository to test Debug build, Release build, and all unit tests

Applied to files:

  • .github/workflows/ci.yml
  • .claude/rules.md
  • scripts/ci.sh
  • .github/scripts/gpu_ci_correctness.sbatch.template
📚 Learning: 2025-12-25T20:38:43.379Z
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: Create feature branches from latest main; run `./test_before_ci.sh` to verify starting state before development

Applied to files:

  • .claude/rules.md
  • scripts/ci.sh
📚 Learning: 2025-12-25T20:38:56.549Z
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursor/rules/lean-debug.mdc:0-0
Timestamp: 2025-12-25T20:38:56.549Z
Learning: For GitHub CI debugging, run `./tools/ci_packet.sh <runid>` via terminal tool.

Applied to files:

  • .claude/rules.md
  • scripts/ci.sh
📚 Learning: 2025-12-25T20:38:43.379Z
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: Add tests for new features; update tests when changing behavior; don't commit broken tests

Applied to files:

  • .claude/rules.md
📚 Learning: 2025-12-25T20:38:56.549Z
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursor/rules/lean-debug.mdc:0-0
Timestamp: 2025-12-25T20:38:56.549Z
Learning: For local debugging, run `./tools/local_packet.sh <command...>` via terminal tool.

Applied to files:

  • .claude/rules.md
📚 Learning: 2025-12-25T20:38:43.379Z
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: Applies to **/*.{cpp,cc,cxx} : Do not add platform-specific tolerances with `#ifdef __APPLE__` or similar platform checks in tests

Applied to files:

  • .claude/rules.md
  • tests/test_cpu_gpu_bitwise.cpp
  • tests/test_poisson_cpu_gpu_3d.cpp
📚 Learning: 2025-12-25T20:38:43.379Z
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: Applies to **/*.{cpp,cc,cxx,h,hpp} : Fix all compiler warnings before pushing; use `[[maybe_unused]]` for intentionally unused variables in assertions

Applied to files:

  • .claude/rules.md
  • tests/test_cpu_gpu_bitwise.cpp
📚 Learning: 2025-12-25T20:38:43.379Z
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: Applies to **/*.{cpp,cc,cxx,h,hpp} : Use `OmpDeviceBuffer` wrapper for GPU buffer management instead of manual allocation

Applied to files:

  • tests/test_cpu_gpu_bitwise.cpp
  • src/turbulence_earsm.cpp
  • src/solver.cpp
  • tests/test_poisson_cpu_gpu_3d.cpp
  • src/turbulence_transport.cpp
  • src/turbulence_baseline.cpp
📚 Learning: 2025-12-25T20:38:43.379Z
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: Applies to **/*.{cpp,cc,cxx,h,hpp} : Avoid overly strict floating-point comparisons (e.g., `==` for doubles); use appropriate tolerances based on algorithm

Applied to files:

  • tests/test_cpu_gpu_bitwise.cpp
  • tests/test_poisson_cpu_gpu_3d.cpp
📚 Learning: 2025-12-25T20:38:43.379Z
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: Explicitly upload data to GPU; check `USE_GPU_OFFLOAD` is defined; verify `omp_get_num_devices() > 0` at runtime for GPU offload code

Applied to files:

  • tests/test_cpu_gpu_bitwise.cpp
  • src/turbulence_gep.cpp
  • src/turbulence_earsm.cpp
  • tests/test_poisson_cpu_gpu_3d.cpp
  • src/turbulence_transport.cpp
📚 Learning: 2025-12-25T20:38:43.379Z
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: Applies to tests/test_physics_validation.cpp : Navier-Stokes solver validation tests must verify: viscous terms, pressure gradient, parabolic velocity profiles, divergence-free constraint, momentum balance, channel symmetry, cross-model consistency, and numerical stability

Applied to files:

  • src/turbulence_gep.cpp
  • tests/test_poisson_cpu_gpu_3d.cpp
  • src/turbulence_transport.cpp
📚 Learning: 2025-12-25T20:38:43.379Z
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: Applies to tests/test_tg_validation.cpp : Taylor-Green Vortex test must validate energy decay with <5% error over 100 timesteps using viscous terms, time integration, and periodic BCs

Applied to files:

  • tests/test_poisson_cpu_gpu_3d.cpp
🧬 Code graph analysis (3)
tests/test_cpu_gpu_bitwise.cpp (1)
tests/test_poisson_cpu_gpu_3d.cpp (1)
  • cout (160-169)
src/turbulence_gep.cpp (3)
include/gpu_utils.hpp (5)
  • USE_GPU_OFFLOAD (88-95)
  • USE_GPU_OFFLOAD (98-104)
  • USE_GPU_OFFLOAD (107-113)
  • USE_GPU_OFFLOAD (116-123)
  • USE_GPU_OFFLOAD (188-203)
include/turbulence_model.hpp (8)
  • nu (189-189)
  • nu (189-189)
  • mesh (132-140)
  • mesh (161-177)
  • mesh (161-169)
  • mesh (184-184)
  • mesh (184-184)
  • nu_ (190-190)
src/features.cpp (2)
  • compute_gradients_from_mac (26-59)
  • compute_gradients_from_mac (26-30)
src/solver.cpp (2)
include/solver.hpp (5)
  • velocity_ (120-120)
  • velocity_ (127-127)
  • k_ (123-123)
  • omega_ (124-124)
  • nu_t_ (122-122)
include/fields.hpp (2)
  • mesh_ (47-47)
  • mesh_ (127-127)
🪛 Shellcheck (0.11.0)
scripts/ci.sh

[warning] 116-116: Declare and assign separately to avoid masking return values.

(SC2155)

🔇 Additional comments (23)
src/solver.cpp (1)

4438-4477: LGTM - CPU build now provides valid device view for consistency tests.

The implementation correctly mirrors the GPU path by returning a fully populated TurbulenceDeviceView with host pointers. All required fields are set consistently with the GPU version at lines 4268-4308, and the const_cast pattern matches the existing get_solver_view() implementation. This enables CPU/GPU consistency testing without conditional code paths in turbulence models.

src/turbulence_baseline.cpp (1)

265-282: GPU path correctly uses the unified kernel.

The GPU path now calls mixing_length_cell_kernel which consolidates the per-cell computation. This is a good refactor for maintainability and CPU/GPU consistency.

tests/test_poisson_cpu_gpu_3d.cpp (3)

23-26: Good addition of GPU verification includes and threshold constant.

The conditional #include <omp.h> and MIN_EXPECTED_DIFF constant are well-placed. The threshold of 1e-14 is appropriately above machine epsilon, providing a reasonable heuristic to detect whether CPU and GPU are running identical code paths.

Also applies to: 32-34


274-293: Excellent GPU execution verification.

This proactive verification catches two critical failure modes:

  1. No GPU devices available (driver issues, misconfiguration)
  2. GPU present but offload silently falling back to host

The use of omp_is_initial_device() inside a target region is the correct way to detect host fallback. Based on learnings, verifying omp_get_num_devices() > 0 at runtime is the recommended practice for GPU offload code.


361-365: Threshold-based warning is appropriate.

Replacing the exact-zero comparison with a threshold check is correct. This aligns with coding guidelines to avoid overly strict floating-point comparisons. The warning-but-pass behavior is suitable since suspiciously small diffs don't necessarily indicate a bug.

tests/test_cpu_gpu_bitwise.cpp (3)

25-27: Consistent GPU include guards and threshold constant.

Matches the pattern in test_poisson_cpu_gpu_3d.cpp, maintaining consistency across the test suite.

Also applies to: 35-37


301-320: GPU verification matches the 3D Poisson test pattern.

Consistent implementation of the GPU availability and execution verification across test files is good for maintainability.


371-376: Threshold-based warnings applied consistently across all field comparisons.

Each velocity component (u, v, w) and pressure uses the same MIN_EXPECTED_DIFF threshold for detecting suspiciously small differences. This uniform approach ensures consistent behavior across all field comparisons.

Also applies to: 396-398, 421-423, 446-448

src/turbulence_gep.cpp (2)

35-100: Well-designed unified GEP kernel.

The kernel is properly annotated with #pragma omp declare target guards and handles all the per-cell GEP computation including variant selection, van Driest damping, and eddy viscosity calculation. The inline function approach allows the compiler to optimize while maintaining code consistency.


283-318: CPU path correctly uses the unified kernel.

Unlike turbulence_baseline.cpp, this file properly calls gep_cell_kernel from both GPU and CPU paths. The wall distance buffer creation (lines 299-308) correctly adapts the kernel's pointer-based interface for CPU use. This is the right approach for maintaining CPU/GPU consistency.

src/turbulence_transport.cpp (3)

55-185: Comprehensive SST transport cell kernel.

The kernel correctly implements the full SST k-ω physics: velocity gradients, F1 blending, production limiting, upwind advection, central diffusion, cross-diffusion, and time integration. While the function signature is lengthy (24+ parameters), this is inherent to the SST formulation and clearly documented with doxygen comments.


726-776: GPU path uses the unified kernel correctly.

The GPU kernel loop properly calculates all necessary indices and calls sst_transport_cell_kernel. The output values are written back to the device arrays immediately after the kernel call.


799-890: CPU path mirrors GPU path structure.

The CPU implementation correctly uses the same unified kernel, ensuring bit-for-bit consistency with the GPU path. The wall distance buffer is populated for the full domain (including ghost cells via the loop bounds on lines 821-826), matching the kernel's expectations.

src/turbulence_earsm.cpp (2)

707-793: Good unified EARSM output kernel.

The earsm_compute_output function consolidates the common output stage across all EARSM variants (WJ, GS, Pope). Key features:

  • Re_t-based blending to fade nonlinear terms
  • Tensor basis construction (S*, [S*,Ω*], S*²)
  • Reynolds stress computation
  • Equivalent eddy viscosity with fallback for near-zero Sxy
  • Defensive NaN check (line 785)

This reduces code duplication in the GPU kernels.


880-885: All GPU EARSM kernels now use the unified output function.

The Wallin-Johansson, Gatski-Speziale, and Pope GPU kernels all call earsm_compute_output for the final stage, ensuring consistent output computation across variants.

Also applies to: 971-976, 1041-1045

scripts/ci.sh (3)

91-100: Good GPU availability check.

The check_gpu_available function correctly uses nvidia-smi to detect GPU presence. The double-check pattern (command exists AND succeeds) handles edge cases where nvidia-smi is installed but no GPU is present.


238-351: Comprehensive cross-build test implementation.

The run_cross_build_test function properly:

  1. Checks GPU availability and fails (not skips) if unavailable
  2. Uses build caching to avoid redundant builds
  3. Generates CPU reference first, then runs GPU comparison
  4. Uses OMP_TARGET_OFFLOAD=MANDATORY to prevent silent CPU fallback
  5. Reports detailed output on failure

The fail-not-skip behavior on missing GPU is intentional and correct for ensuring actual GPU testing.


179-233: Extended run_test with environment prefix support.

The optional env_prefix parameter (line 183) enables per-test environment variable injection, used for OMP_TARGET_OFFLOAD=MANDATORY on GPU utilization tests. This is a clean extension of the existing function.

.github/workflows/ci.yml (1)

43-45: CI workflow updated to use renamed script.

The workflow correctly references the new ci.sh script with the --cpu flag for CPU-only testing. This aligns with the script rename from run_ci_local.sh to ci.sh.

.claude/rules.md (3)

241-241: LGTM! All script references consistently updated.

All references to the CI script have been systematically updated from run_ci_local.sh to ci.sh across workflow guidelines, quick reference, and emergency procedures sections. The changes maintain consistency throughout the documentation.

Also applies to: 254-254, 315-315, 349-349, 359-359


474-477: LGTM! Documentation reflects enhanced CI script functionality.

The updated documentation correctly captures the new GPU-aware capabilities of ci.sh:

  • Auto-detection of GPU availability
  • Explicit CPU-only mode (--cpu)
  • GPU-specific test mode (gpu)

This aligns with the PR objective to implement "GPU-aware cross-build testing" and represents a meaningful enhancement over the previous script.


100-101: Documentation is correct; the referenced scripts exist in the repository.

The scripts ./scripts/ci.sh and ./scripts/run_gpu_ci_test.sh are both present and correctly documented in .claude/rules.md. The inconsistency noted was with outdated learnings that reference test_before_ci.sh, which no longer exists. The documentation accurately reflects the current repository structure.

.github/scripts/gpu_ci_correctness.sbatch.template (1)

58-62: Script rename verified and complete.

The updates correctly reference the renamed CI script ci.sh. The invocation pattern (./scripts/ci.sh full) is valid, the script exists and is executable, and no stale references to the old run_ci_local.sh remain in the codebase.

…kernel

Address reviewer feedback: CPU path in MixingLengthModel::update() now
calls the same mixing_length_cell_kernel() used by GPU path, eliminating
potential for drift between implementations.

Changes:
- Add <vector> include for wall_distance buffer
- Replace inline computation loop with unified kernel call
- Use same cell indexing pattern as GPU for consistency

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
src/turbulence_baseline.cpp (2)

345-354: Cache the wall distance buffer to avoid repeated allocations.

The CPU path allocates and populates wall_dist_buf on every update() call, while the GPU path pre-computes and caches y_wall_flat_ as a member variable (lines 120, 122-128). This creates unnecessary allocations in CPU-only builds.

Consider caching the wall distance buffer as a member variable (similar to dudx_, dudy_, etc.) to match the GPU optimization and improve CPU performance.

💡 Proposed refactor to cache wall distance buffer

Add a member variable to cache the wall distance buffer in the header file:

// In turbulence_baseline.hpp, add to MixingLengthModel class:
std::vector<double> wall_dist_cpu_;  // Cached wall distance buffer for CPU path

Then update the CPU path to lazy-initialize once:

-    // Create wall_distance buffer for unified kernel
-    const size_t total_cells = (size_t)mesh.total_Nx() * mesh.total_Ny();
-    std::vector<double> wall_dist_buf(total_cells, 0.0);
-    for (int j = 0; j < mesh.total_Ny(); ++j) {
-        for (int i = 0; i < mesh.total_Nx(); ++i) {
-            const int idx = j * stride + i;
-            wall_dist_buf[idx] = mesh.wall_distance(i, j);
+    // Create wall_distance buffer for unified kernel (lazy initialization)
+    const int total_cells = mesh.total_Nx() * mesh.total_Ny();
+    if (wall_dist_cpu_.empty() || static_cast<int>(wall_dist_cpu_.size()) != total_cells) {
+        wall_dist_cpu_.resize(total_cells, 0.0);
+        for (int j = 0; j < mesh.total_Ny(); ++j) {
+            for (int i = 0; i < mesh.total_Nx(); ++i) {
+                const int idx = j * stride + i;
+                wall_dist_cpu_[idx] = mesh.wall_distance(i, j);
+            }
         }
     }
-    const double* wall_dist_ptr = wall_dist_buf.data();
+    const double* wall_dist_ptr = wall_dist_cpu_.data();

346-346: Minor: Inconsistent integer type for total_cells.

This line uses size_t while the rest of the codebase consistently uses int for cell counts (e.g., lines 82, 104, 115, 150, 229, 250, 334-335).

🔎 Proposed fix
-    const size_t total_cells = (size_t)mesh.total_Nx() * mesh.total_Ny();
+    const int total_cells = mesh.total_Nx() * mesh.total_Ny();
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3d4d718 and fca792c.

📒 Files selected for processing (1)
  • src/turbulence_baseline.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,cc,cxx}

📄 CodeRabbit inference engine (.cursorrules)

**/*.{cpp,cc,cxx}: Do not add platform-specific tolerances with #ifdef __APPLE__ or similar platform checks in tests
Include code comments explaining WHY, not WHAT; document numerical algorithms; note non-obvious optimizations

Files:

  • src/turbulence_baseline.cpp
**/*.{cpp,cc,cxx,h,hpp}

📄 CodeRabbit inference engine (.cursorrules)

**/*.{cpp,cc,cxx,h,hpp}: Avoid overly strict floating-point comparisons (e.g., == for doubles); use appropriate tolerances based on algorithm
Fix all compiler warnings before pushing; use [[maybe_unused]] for intentionally unused variables in assertions
Use const for read-only references and mark methods const if they don't modify state
Use RAII for resource management; use smart pointers or RAII wrappers instead of manual new/delete or malloc/free
Use OmpDeviceBuffer wrapper for GPU buffer management instead of manual allocation
Check return values and use exceptions for error conditions with informative error messages

Files:

  • src/turbulence_baseline.cpp
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: For GPU-related changes, also run `./test_before_ci_gpu.sh` to run the complete GPU CI test suite locally
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: Run `./test_before_ci.sh` before every push to repository to test Debug build, Release build, and all unit tests
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-and-test (ubuntu-latest, Release)
  • GitHub Check: build-and-test (ubuntu-latest, Debug)
  • GitHub Check: gpu-tests
🔇 Additional comments (3)
src/turbulence_baseline.cpp (3)

23-23: LGTM: Header addition supports CPU buffer allocation.

The <vector> include is required for the wall_dist_buf allocation in the CPU path (line 347).


266-284: LGTM: GPU path successfully uses unified kernel.

The GPU computation path now invokes the shared mixing_length_cell_kernel, eliminating code duplication while maintaining the same computational logic.


369-375: LGTM: CPU path correctly invokes unified kernel.

The CPU execution path successfully uses the same mixing_length_cell_kernel as the GPU path, achieving the PR objective of eliminating code duplication while maintaining computational consistency.

Extract unified cell kernels for all EARSM variants (WJ, GS, Pope)
with #pragma omp declare target. Both GPU and CPU paths now call
the same kernel functions, eliminating potential for drift.

Added kernels:
- earsm_wj_cell_kernel(): Wallin-Johansson cubic solve + beta
- earsm_gs_cell_kernel(): Gatski-Speziale regularization + beta
- earsm_pope_cell_kernel(): Pope regularized model + beta

All kernels call earsm_compute_output() for tensor basis → output.
Host path added to #else blocks so CPU builds use same logic as GPU.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
src/turbulence_earsm.cpp (3)

735-736: Clarify that beta2 and beta3 modification is local.

The blending factors beta2 and beta3 are modified in-place. Since they are passed by value, this doesn't affect the caller, but adding a brief comment would improve readability.

🔎 Optional clarification comment
     // Apply Re_t blending: keep linear term (beta1), fade nonlinear terms (beta2, beta3)
+    // Note: beta2/beta3 are pass-by-value, so modification is local to this function
     beta2 *= alpha;
     beta3 *= alpha;

785-785: Consider using std::isfinite() for NaN check.

The NaN check (nu_t_loc == nu_t_loc) works but is less readable than std::isfinite(nu_t_loc). If this pattern is intentional for GPU compatibility, consider adding a comment explaining why.

🔎 More readable alternative
-    nu_t_loc = (nu_t_loc == nu_t_loc) ? nu_t_loc : 0.0;  // NaN check
+    nu_t_loc = std::isfinite(nu_t_loc) ? nu_t_loc : 0.0;  // NaN/Inf check

799-862: LGTM! Consider extracting common Re_t blending logic.

The WJ cell kernel correctly implements the Wallin-Johansson model with proper regularization and clamping. The Re_t computation and alpha blending (lines 820-823) are duplicated across all three EARSM variants, which is acceptable for performance but could be extracted into a helper if maintainability becomes a concern.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fca792c and b6ea0c3.

📒 Files selected for processing (1)
  • src/turbulence_earsm.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,cc,cxx}

📄 CodeRabbit inference engine (.cursorrules)

**/*.{cpp,cc,cxx}: Do not add platform-specific tolerances with #ifdef __APPLE__ or similar platform checks in tests
Include code comments explaining WHY, not WHAT; document numerical algorithms; note non-obvious optimizations

Files:

  • src/turbulence_earsm.cpp
**/*.{cpp,cc,cxx,h,hpp}

📄 CodeRabbit inference engine (.cursorrules)

**/*.{cpp,cc,cxx,h,hpp}: Avoid overly strict floating-point comparisons (e.g., == for doubles); use appropriate tolerances based on algorithm
Fix all compiler warnings before pushing; use [[maybe_unused]] for intentionally unused variables in assertions
Use const for read-only references and mark methods const if they don't modify state
Use RAII for resource management; use smart pointers or RAII wrappers instead of manual new/delete or malloc/free
Use OmpDeviceBuffer wrapper for GPU buffer management instead of manual allocation
Check return values and use exceptions for error conditions with informative error messages

Files:

  • src/turbulence_earsm.cpp
🧠 Learnings (3)
📓 Common learnings
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: For GPU-related changes, also run `./test_before_ci_gpu.sh` to run the complete GPU CI test suite locally
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: Applies to tests/test_physics_validation.cpp : Navier-Stokes solver validation tests must verify: viscous terms, pressure gradient, parabolic velocity profiles, divergence-free constraint, momentum balance, channel symmetry, cross-model consistency, and numerical stability
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: Run `./test_before_ci.sh` before every push to repository to test Debug build, Release build, and all unit tests
📚 Learning: 2025-12-25T20:38:43.379Z
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: Explicitly upload data to GPU; check `USE_GPU_OFFLOAD` is defined; verify `omp_get_num_devices() > 0` at runtime for GPU offload code

Applied to files:

  • src/turbulence_earsm.cpp
📚 Learning: 2025-12-25T20:38:43.379Z
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: Applies to **/*.{cpp,cc,cxx,h,hpp} : Use `OmpDeviceBuffer` wrapper for GPU buffer management instead of manual allocation

Applied to files:

  • src/turbulence_earsm.cpp
🧬 Code graph analysis (1)
src/turbulence_earsm.cpp (3)
include/features.hpp (2)
  • nu (129-129)
  • b_xx (116-120)
include/turbulence_transport.hpp (5)
  • nu (120-120)
  • nu (120-120)
  • k (187-187)
  • C_mu (150-150)
  • C_mu (150-150)
include/turbulence_baseline.hpp (2)
  • C_mu (86-86)
  • C_mu (86-86)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: gpu-tests
  • GitHub Check: build-and-test (ubuntu-latest, Debug)
  • GitHub Check: build-and-test (ubuntu-latest, Release)
🔇 Additional comments (5)
src/turbulence_earsm.cpp (5)

865-925: LGTM! GS model correctly implemented.

The Gatski-Speziale cell kernel is correctly implemented with appropriate regularization and rotation correction. The strain/rotation computation and Re_t blending logic are duplicated from the WJ kernel, which maintains consistency across models.


928-970: LGTM! Pope model correctly implemented.

The Pope quadratic cell kernel is correctly implemented with appropriate simplifications compared to the WJ/GS models. The computation is well-structured and consistent with the unified output kernel interface.


1001-1028: Excellent CPU/GPU unification for WJ model.

The modification successfully unifies the CPU and GPU code paths by calling the same earsm_wj_cell_kernel from both contexts. This eliminates code duplication and ensures consistent behavior across backends. The proper handling of total_size (marked unused in CPU builds) is good defensive programming.


1047-1074: Excellent CPU/GPU unification for GS model.

The GS model now uses the unified earsm_gs_cell_kernel for both CPU and GPU execution, ensuring consistency and reducing duplication. The implementation correctly follows the same pattern as the WJ model.


1089-1116: Excellent CPU/GPU unification for Pope model.

The Pope model successfully adopts the unified kernel approach. The CPU fallback is properly implemented with the same logic as the GPU path, ensuring bit-for-bit reproducibility across backends.

ci.sh:
- Fix SC2155: Split local declaration and assignment for orig_dir
- Add -DCMAKE_BUILD_TYPE=Release to ensure consistent build config

test_cpu_gpu_bitwise.cpp:
- Prevent GPU builds from running in --dump-prefix mode
- This avoids accidentally generating reference files with GPU build
- Clear error message directs user to rebuild with CPU

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
scripts/ci.sh (4)

304-318: Verify that reference files were actually created.

After CPU reference generation succeeds (exit code 0), the script doesn't verify that the expected reference files actually exist. If the test binary succeeds but doesn't write files (e.g., due to a bug or permission issue), the GPU comparison in step 2 will fail with a confusing error.

🔎 Add reference file verification
     if [ $cpu_exit_code -ne 0 ]; then
         log_failure "$test_name (CPU reference generation failed, exit code: $cpu_exit_code)"
         echo "  Output (last 30 lines):"
         tail -30 "$output_file" | sed 's/^/    /'
         FAILED=$((FAILED + 1))
         FAILED_TESTS="${FAILED_TESTS}\n  - $test_name (CPU ref generation)"
         rm -f "$output_file"
         return 0
     fi
+
+    # Verify at least one reference file was created
+    local ref_files_count
+    ref_files_count=$(find "$ref_dir" -name "${ref_prefix}*" -type f 2>/dev/null | wc -l)
+    if [ "$ref_files_count" -eq 0 ]; then
+        log_failure "$test_name (No reference files created with prefix '${ref_prefix}')"
+        FAILED=$((FAILED + 1))
+        FAILED_TESTS="${FAILED_TESTS}\n  - $test_name (no ref files)"
+        rm -f "$output_file"
+        return 0
+    fi

325-329: Document why OMP_TARGET_OFFLOAD=MANDATORY is needed.

Line 329 uses OMP_TARGET_OFFLOAD=MANDATORY but the comment only explains part of the purpose. This environment variable is critical for detecting GPU offload failures - without it, OpenMP silently falls back to CPU execution, making GPU tests meaningless.

🔎 Enhance comment
     # Run GPU comparison against CPU reference
-    # MANDATORY ensures we fail if GPU offload doesn't work (no silent CPU fallback)
+    # OMP_TARGET_OFFLOAD=MANDATORY ensures we fail if GPU offload doesn't work.
+    # Without this, OpenMP silently falls back to CPU, making GPU testing meaningless.
+    # This catches GPU unavailability, driver issues, or misconfigured offload.
     log_info "  Step 2: Running GPU and comparing against CPU reference..."

298-308: Consider cleaning reference directory before generating new files.

The script creates the reference directory but doesn't clean existing files. If a previous test run wrote files with different prefixes or the test's output format changed, stale files could cause confusion during debugging.

🔎 Add reference cleanup
     # Create reference directory
     local ref_dir="${PROJECT_DIR}/build_gpu/cpu_reference"
     mkdir -p "$ref_dir"
+    
+    # Clean any existing reference files for this test to avoid stale data
+    rm -f "${ref_dir}/${ref_prefix}"*

     local output_file="/tmp/test_output_$$.txt"

354-362: Inconsistent cmake configuration with ensure_build.

Line 359 doesn't specify -DCMAKE_BUILD_TYPE, while ensure_build (line 123) hardcodes it to Release. This inconsistency means:

  • Fallback builds will use CMake's default (often Debug)
  • Cross-build tests will always use Release

For consistency, either both paths should specify the build type or both should omit it.

🔎 Align build configurations
 if [ ! -d "$BUILD_DIR" ]; then
     log_info "Build directory not found. Creating and building..."
     mkdir -p "$BUILD_DIR"
     cd "$BUILD_DIR"
-    cmake .. -DUSE_GPU_OFFLOAD=${USE_GPU} -DBUILD_TESTS=ON
+    cmake .. -DCMAKE_BUILD_TYPE=Release -DUSE_GPU_OFFLOAD=${USE_GPU} -DBUILD_TESTS=ON
     make -j"$(nproc)"
     cd "$PROJECT_DIR"
 fi
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b6ea0c3 and 4190803.

📒 Files selected for processing (2)
  • scripts/ci.sh
  • tests/test_cpu_gpu_bitwise.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_cpu_gpu_bitwise.cpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{sh,slurm,batch}

📄 CodeRabbit inference engine (.cursor/rules/lean-changes.mdc)

Always submit slurm jobs using the embers QOS

Files:

  • scripts/ci.sh
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: For GPU-related changes, also run `./test_before_ci_gpu.sh` to run the complete GPU CI test suite locally
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: Applies to tests/test_physics_validation.cpp : Navier-Stokes solver validation tests must verify: viscous terms, pressure gradient, parabolic velocity profiles, divergence-free constraint, momentum balance, channel symmetry, cross-model consistency, and numerical stability
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: Run `./test_before_ci.sh` before every push to repository to test Debug build, Release build, and all unit tests
📚 Learning: 2025-12-25T20:38:56.549Z
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursor/rules/lean-debug.mdc:0-0
Timestamp: 2025-12-25T20:38:56.549Z
Learning: For GitHub CI debugging, run `./tools/ci_packet.sh <runid>` via terminal tool.

Applied to files:

  • scripts/ci.sh
📚 Learning: 2025-12-25T20:38:43.379Z
Learnt from: CR
Repo: comp-physics/cfd-nn PR: 0
File: .cursorrules:0-0
Timestamp: 2025-12-25T20:38:43.379Z
Learning: Create feature branches from latest main; run `./test_before_ci.sh` to verify starting state before development

Applied to files:

  • scripts/ci.sh
🔇 Additional comments (5)
scripts/ci.sh (5)

1-25: LGTM! Clear documentation of cross-build behavior.

The updated documentation clearly explains the GPU requirement for cross-build tests and the two-step process (CPU reference generation + GPU comparison).


91-100: LGTM! GPU detection logic is sound.

The function correctly checks for both nvidia-smi availability and successful execution.

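The two-step check praised here can be sketched as below; `gpu_available` is an illustrative name, not necessarily the one used in `scripts/ci.sh`. Both halves matter: `command -v` alone would pass on a machine where the driver is installed but broken, so the tool is also executed.

```shell
# GPU detection: nvidia-smi must be on PATH AND run successfully
# (a present-but-broken driver makes it exit non-zero).
gpu_available() {
    command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1
}

if gpu_available; then
    echo "GPU detected"
else
    echo "no GPU"
fi
```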

180-234: LGTM! env_prefix parameter correctly implemented.

The optional environment variable prefix is properly handled with a default empty string and correctly passed to the env command when present.


425-434: LGTM! Cross-build tests properly gated on GPU availability.

The logic correctly skips cross-build tests when running in CPU-only mode (--cpu flag), with clear messaging explaining why GPU is required for CPU vs GPU comparisons.


441-444: LGTM! GPU utilization test uses OMP_TARGET_OFFLOAD=MANDATORY.

The GPU utilization test correctly uses OMP_TARGET_OFFLOAD=MANDATORY via the env_prefix parameter to ensure GPU execution is enforced, consistent with the cross-build test approach.
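The `env_prefix` pattern described in these two comments can be sketched as follows. `run_test` and its argument order are illustrative, not the exact signature in `scripts/ci.sh`; the point is the defaulted optional argument and the conditional `env` invocation that lets cross-build tests inject `OMP_TARGET_OFFLOAD=MANDATORY` so a missing device aborts rather than silently falling back to the CPU.

```shell
# Run a named command, optionally under an environment-variable prefix.
run_test() {
    name="$1"
    cmd="$2"
    env_prefix="${3:-}"   # default empty: plain invocation
    if [ -n "$env_prefix" ]; then
        # $env_prefix is deliberately unquoted so multiple
        # VAR=value pairs split into separate words for env(1).
        env $env_prefix sh -c "$cmd"
    else
        sh -c "$cmd"
    fi
}

run_test "plain" 'echo offload=${OMP_TARGET_OFFLOAD:-unset}'
run_test "gpu"   'echo offload=${OMP_TARGET_OFFLOAD:-unset}' "OMP_TARGET_OFFLOAD=MANDATORY"
```

With the prefix supplied, the child shell sees `OMP_TARGET_OFFLOAD=MANDATORY`; without it, the variable is whatever the caller's environment already holds.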

Reuse y_wall_flat_ member variable instead of creating a local
wall_dist_buf on every update() call. Also fix size_t to int
for consistency with rest of codebase.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@sbryngelson sbryngelson merged commit 627a39b into master Jan 1, 2026
4 checks passed
