
Newton-Schulz via cuSOLVERMp#2706

Open
vcherepanov-nv wants to merge 39 commits into NVIDIA:main from
vcherepanov-nv:newton-schulz

Conversation

@vcherepanov-nv
Collaborator

Description

Adds an API to call the Newton-Schulz method on a distributed tensor.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Integrate cuSOLVERMp as a new dependency
  • Add corresponding API to TE/common
  • Add PyTorch binding and tests

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

vcherepanov-nv and others added 19 commits February 8, 2026 22:38
Add a new distributed Newton-Schulz inverse square root API to Transformer
Engine's common C library. This wraps the cusolverMpNewtonSchulz library
function, following the same pattern as the existing cuBLASMp integration
for comm_gemm.

New files:
- newton_schulz.h: Public C API header with context management and
  computation functions
- newton_schulz/newton_schulz.cpp: Implementation with RAII wrappers
  for cuSolverMp handles

Build integration:
- New NVTE_WITH_CUSOLVERMP CMake option and CUSOLVERMP_HOME env var
- NVTE_CHECK_CUSOLVERMP error checking macro in logging.h
- Conditional compilation guarded by NVTE_WITH_CUSOLVERMP

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Add PyTorch-level bindings for the cuSolverMp Newton-Schulz inverse
square root API introduced in the previous commit.

New files:
- pytorch/csrc/extensions/newton_schulz.cpp: C++ extension wrapping
  the C API with PyTorch tensor support
- pytorch/newton_schulz.py: Python wrapper that extracts NCCL
  communicator from torch.distributed ProcessGroup
- tests/pytorch/distributed/test_newton_schulz.py: pytest launcher
- tests/pytorch/distributed/run_newton_schulz.py: distributed test
  worker with reference implementation for numerical validation

Modified files:
- pytorch/csrc/extensions.h: Function declarations
- pytorch/csrc/extensions/pybind.cpp: pybind11 registrations
- pytorch/__init__.py: Public API export

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Fix API mismatches discovered during compilation:
- cusolverMpCreate takes (handle*, deviceId, stream), not (handle*, stream)
- cusolverMpCreateDeviceGrid takes handle as first arg with different
  parameter order
- Use cusolverMpGridMapping_t (not cusolverMpGridLayout_t) and
  CUSOLVERMP_GRID_MAPPING_COL_MAJOR
- cusolverMpCreateMatrixDesc has different parameter order: (desc*,
  grid, dtype, M, N, MB, NB, RSRC, CSRC, LLD)
- cusolverMpNewtonSchulzDescriptorCreate takes only (nsDesc*) with no
  iteration/coefficient args
- No cusolverMpStreamSet exists; create handle per-call with user stream
- cusolverMpNewtonSchulz requires computeType and info parameters
- Switch from generic template RAII to explicit deleter structs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
…build

Add NVTE_WITH_CUSOLVERMP compiler define and cusolverMp include/library
paths to the PyTorch C++ extension build, following the same pattern as
NVTE_UB_WITH_MPI and NVTE_ENABLE_NVSHMEM.

Without this, the #ifdef NVTE_WITH_CUSOLVERMP guards in the PyTorch
extension code would never be active since the define was only set as
PRIVATE in the CMake build for the common library.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Two fixes:
- Use ProcessGroupNCCL._comm_ptr() to extract the raw NCCL communicator
  pointer instead of the non-existent get_nccl_comm() method
- Pass global matrix dimensions (m, n) from Python to C++ instead of
  using local tensor dimensions, which would produce incorrect
  ScaLAPACK block sizes in the distributed computation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
cuSolverMp handle and grid creation are expensive operations. Move them
from per-call creation in nvte_newton_schulz into the NVTECusolverMpCtx,
which is their natural home — the context exists to encapsulate the grid.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
cuSolverMp cannot work with the default CUDA stream. Create a dedicated
stream inside nvte_cusolvermp_ctx_create and remove the stream parameter
from both C API functions since the context now owns its stream.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
The internal dedicated stream was reading the input tensor before the
caller's stream had finished producing it, resulting in all-zero output.

Add event-based synchronisation: the internal stream waits for the
caller's input to be ready, and the caller's stream waits for the
output to be written. Replaces the blocking cudaStreamSynchronize.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
cuSolverMp is asynchronous and uses the host workspace during multi-GPU
execution. The event-based output sync did not block the host, so the
local workspace_host vector was destroyed while the GPU was still
reading from it. Restore cudaStreamSynchronize to ensure the host
workspace remains valid for the full duration of the operation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Avoid creating and destroying a cudaEvent_t on every
nvte_newton_schulz call by making it a persistent member of
NVTECusolverMpCtx, matching the existing pattern for the stream.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Replace single event with in_ready and out_ready events. After the
cuSolverMp call, record out_ready on the internal stream and make the
caller's stream wait on it, ensuring the output tensor is ready before
the caller uses it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Replace reference-comparison test with a direct arithmetic check:
if X is the inverse square root of A, then X @ A @ X must equal the
identity matrix. This is more robust and removes the need for a
separate reference implementation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
@greptile-apps
Contributor

greptile-apps bot commented Feb 25, 2026

Greptile Summary

This PR adds distributed Newton-Schulz matrix orthogonalization to Transformer Engine by wrapping the cuSOLVERMp library. It introduces a new optional build dependency (NVTE_WITH_CUSOLVERMP), a C++ binding layer, a Python API (cusolvermp_ctx_create / newton_schulz), and distributed tests via torchrun.

Critical issues requiring fixes:

  • Memory safety: CusolverMpCtx.destroy() lacks a guard against double-free when called explicitly within a with block (line 34)
  • Resource leak: Raw CUDA handles created in nvte_cusolvermp_ctx_create (lines 98-104) can leak if exceptions occur before RAII wrappers initialize (line 109)
  • Correctness: Row-major vs. column-major layout mismatch in newton_schulz.cpp line 150 — PyTorch tensors are row-major but lld = local_rows assumes column-major, silently producing wrong results for non-symmetric matrices
  • Silent failures: nullptr passed as devInfo to cusolverMpNewtonSchulz suppresses convergence diagnostics, hiding numerical failures
  • Missing validation: No check that matrix rows are evenly distributed across ranks; uneven distribution causes incorrect global size m, leading to wrong results
  • CI failures: Test runs unconditionally without guarding for cuSolverMp build flag, causing guaranteed failures on standard builds

Important concerns:

  • Performance: Context is created/destroyed on every newton_schulz() call, defeating workspace caching and incurring repeated CUDA stream/event creation overhead
  • Encapsulation: cuSOLVERMp linked PUBLIC; newton_schulz symbols exported unconditionally; NCCL include unconditional in public header
  • Consistency: CMake option() declared 78 lines after first use, violating file conventions
  • Documentation: Fallback coefficients [1.5, -0.5, 0.0] for non-5 iterations unexplained

Related test issues:

  • No skip guard for cuSolverMp availability in test_newton_schulz.py; will fail on non-cuSolverMp builds
  • Subprocess has no timeout, risking indefinite hangs on NCCL deadlock

Confidence Score: 1/5

  • Not safe to merge — multiple unresolved correctness issues (row-major/col-major mismatch, resource leaks, double-free), missing runtime validation, and guaranteed CI failures on non-cuSolverMp builds.
  • Critical memory safety issues (double-free in destroy, resource leak in ctx_create), correctness issues (column-major layout mismatch silently breaking non-symmetric matrices), missing validation (uneven row distribution), and CI breakage make this unsafe. PR is still marked Draft, and several fundamental issues remain unresolved.
  • transformer_engine/common/newton_schulz/newton_schulz.cpp (resource leak, layout mismatch, devInfo), transformer_engine/pytorch/newton_schulz.py (double-free, missing validation), tests/pytorch/distributed/test_newton_schulz.py and qa/L1_pytorch_distributed_unittest/test.sh (CI guard), transformer_engine/common/include/transformer_engine/newton_schulz.h (unconditional NCCL include)

Comments Outside Diff (15)

  1. tests/pytorch/distributed/test_newton_schulz.py, line 14-15 (link)

    The test only guards against insufficient GPUs but doesn't check whether the build includes cuSolverMp support. On systems where TE was built without NVTE_WITH_CUSOLVERMP=1, the subprocess will exit with a RuntimeError and the test will fail with an unhelpful AssertionError — instead of a clean pytest.skip.

    Add a module-level skip that mirrors the GPU count guard:

    Without this, the test fails unconditionally on builds without cuSolverMp support.
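The elided suggestion can be sketched as follows. The probe assumes the PR's pybind layer only registers `newton_schulz` when TE was built with `NVTE_WITH_CUSOLVERMP=1` (the module name `transformer_engine_torch` and the `hasattr` check are illustrative assumptions, not the bot's original snippet):

```python
def _cusolvermp_available() -> bool:
    """Best-effort probe for cuSolverMp support in the installed build.

    Assumption: the pybind layer registers `newton_schulz` only when TE was
    compiled with NVTE_WITH_CUSOLVERMP=1, so its absence means no support.
    """
    try:
        import transformer_engine_torch as tex
    except ImportError:
        return False
    return hasattr(tex, "newton_schulz")

HAVE_CUSOLVERMP = _cusolvermp_available()

# At module level in test_newton_schulz.py, next to the GPU-count guard:
#     if not HAVE_CUSOLVERMP:
#         pytest.skip("TE built without cuSolverMp", allow_module_level=True)
```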

  2. qa/L1_pytorch_distributed_unittest/test.sh, line 35 (link)

    The Newton-Schulz test (line 35) runs unconditionally, but it requires NVTE_WITH_CUSOLVERMP=1 at build time. On CI runners where TE is built without cuSolverMp, this test will fail, breaking the whole test suite.

    Guard the test with a build flag check:

    Alternatively, rely solely on the module-level pytest.skip in the test file (once that guard is added), but the script-level check makes the conditional behavior explicit.

  3. transformer_engine/pytorch/newton_schulz.py, line 34-36 (link)

    destroy() calls tex.cusolvermp_ctx_destroy(self._ptr) but never nulls out self._ptr afterward. This makes it unsafe to call destroy() more than once — the second call will pass a dangling pointer to the C++ destructor.

    A user could hit this with a pattern like:

    with cusolvermp_ctx_create(group) as ctx:
        newton_schulz(x, ctx)
        ctx.destroy()          # explicit early release
    # __exit__ calls ctx.destroy() again here → double free

    Guard against this by checking and clearing the pointer:
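A minimal sketch of the idempotent `destroy()` pattern; `_ctx_destroy` below is a stand-in for `tex.cusolvermp_ctx_destroy` so the double-free scenario can be demonstrated without the extension:

```python
destroyed = []

def _ctx_destroy(ptr):
    # Stand-in for tex.cusolvermp_ctx_destroy; records what was freed.
    destroyed.append(ptr)

class CusolverMpCtx:
    """Sketch of a context wrapper whose destroy() is safe to call twice."""

    def __init__(self, ptr):
        self._ptr = ptr

    def destroy(self):
        if self._ptr is not None:   # guard: only free a live pointer
            _ctx_destroy(self._ptr)
            self._ptr = None        # a second destroy() becomes a no-op

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.destroy()

with CusolverMpCtx(ptr=0xDEAD) as ctx:
    ctx.destroy()  # explicit early release
# __exit__ calls destroy() again here; the guard makes it harmless
```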

  4. transformer_engine/pytorch/newton_schulz.py, line 118-121 (link)

    When num_iterations != 5, the default falls back to [1.5, -0.5, 0.0] * num_iterations (line 120). These values are unexplained — a reader has no way to know what iteration scheme they correspond to, where they come from, or whether they provide convergence guarantees comparable to the quintic coefficients.

    Additionally the third coefficient is 0.0, which means every triplet degenerates from a cubic polynomial to a linear one (a*X + b*X^3 + 0.0*X^5). This is likely intentional for a cubic Newton-Schulz step, but it should be documented.

    Add a comment explaining the fallback:
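For illustration, the triplet `(1.5, -0.5, 0.0)` matches the classic cubic Newton-Schulz step `X <- 1.5*X - 0.5*(X @ X.T) @ X`: each iteration applies the odd polynomial `p(s) = a*s + b*s**3 + c*s**5` to every singular value `s`, driving it toward 1, and the trailing `0.0` merely pads the cubic scheme into the quintic three-coefficient format. A scalar sketch of that convergence:

```python
def ns_step(s: float, a: float = 1.5, b: float = -0.5, c: float = 0.0) -> float:
    # One Newton-Schulz step applied to a single singular value.
    return a * s + b * s**3 + c * s**5

s = 0.3  # any singular value in (0, sqrt(3)) converges to 1
for _ in range(10):
    s = ns_step(s)
```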

  5. tests/pytorch/distributed/test_newton_schulz.py, line 37 (link)

    The subprocess call has no timeout. If the distributed test deadlocks or hangs (e.g., due to NCCL communication issues), this will block CI indefinitely.

    Add a timeout parameter to prevent hanging:

    A 300-second timeout is a reasonable default for distributed tests.
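A hedged sketch of the suggested guard: cap the wall-clock time of the distributed worker so an NCCL deadlock fails fast instead of hanging CI. `cmd` below stands in for the real torchrun invocation, and 300 s is the review's suggested default, not a measured bound:

```python
import subprocess
import sys

cmd = [sys.executable, "-c", "print('ok')"]  # stand-in for the torchrun command
try:
    # timeout raises TimeoutExpired (and kills the child) if exceeded
    result = subprocess.run(cmd, capture_output=True, timeout=300, check=False)
except subprocess.TimeoutExpired:
    raise AssertionError("distributed Newton-Schulz test timed out")
```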

  6. transformer_engine/pytorch/__init__.py, line 62-66 (link)

    newton_schulz, cusolvermp_ctx_create, and CusolverMpCtx are unconditionally imported and exported (lines 62-66), even when TE is built without NVTE_WITH_CUSOLVERMP. While the functions raise a RuntimeError at runtime, this exposes these symbols to all users, making them appear as supported features in auto-complete and documentation.

    Guard the import behind a check similar to other optional features:

    This prevents the symbols from appearing in the public API when the feature is not available.

  7. transformer_engine/pytorch/newton_schulz.py, line 79 (link)

    The newton_schulz function computes global matrix dimensions as m = x.size(0) * ctx.nranks (line 141) without validating that all ranks have the same number of local rows. If matrix rows are unevenly distributed, m will be incorrect, and cuSOLVERMp will mis-interpret the data layout, producing silent wrong results.

    For example, with a 10×10 matrix and 3 ranks distributing rows as [4, 3, 3]:

    • Rank 0 receives 4 rows, computes m = 4*3 = 12 (incorrect; should be 10)
    • cuSOLVERMp allocates blocks for a 12-row matrix and mis-reads rank 0's data

    Add validation to ensure even distribution:

    Or at minimum, document this requirement clearly in the docstring.
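A plain-Python sketch of that validation; in the real wrapper the per-rank row counts would come from something like `torch.distributed.all_gather_object` on the process group (the helper name and shape of the check are assumptions):

```python
def validate_even_rows(local_rows_per_rank):
    """Require every rank to hold the same number of rows; return global m."""
    first = local_rows_per_rank[0]
    if any(rows != first for rows in local_rows_per_rank):
        raise ValueError(
            "newton_schulz requires rows to be evenly distributed across "
            f"ranks, got per-rank row counts {local_rows_per_rank}"
        )
    return first * len(local_rows_per_rank)  # the global row count m

m = validate_even_rows([4, 4, 4])  # even split across 3 ranks
try:
    validate_even_rows([4, 3, 3])  # the 10x10-over-3-ranks example above
    uneven_rejected = False
except ValueError:
    uneven_rejected = True
```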

  8. transformer_engine/common/newton_schulz/newton_schulz.cpp, line 150-155 (link)

    The nvte_newton_schulz function computes lld = local_rows (line 155), which assumes column-major (Fortran/LAPACK) memory layout. However, PyTorch tensors are row-major (C-contiguous) by default.

    When cuSOLVERMp interprets the data with lld = local_rows (column-major convention), it will mis-index elements:

    • Expected offset (row-major): i * n + j
    • What cuSOLVERMp sees (column-major): j * local_rows + i

    The test matrix happens to be symmetric, so A^T = A and the polar factor is also symmetric, which masks this bug. For any non-symmetric matrix, results would be wrong.

    Verify the memory layout convention with the cuSOLVERMp documentation and either:

    1. Transpose the input tensor before calling (if cuSOLVERMp expects column-major), or
    2. Adjust lld to reflect row-major layout (if cuSOLVERMp supports row-major)
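The masking effect can be demonstrated independently of cuSOLVERMp with NumPy: reinterpreting the same row-major buffer with column-major strides (which is what `lld = local_rows` implies) scrambles a non-symmetric matrix but is invisible for a symmetric one:

```python
import numpy as np

# A row-major (C-contiguous) buffer read back with column-major strides.
a = np.arange(6, dtype=np.float32).reshape(2, 3)        # row-major layout
a_as_colmajor = a.reshape(-1).reshape((2, 3), order="F")
scrambled = not np.array_equal(a, a_as_colmajor)        # non-symmetric: wrong data

sym = np.array([[1.0, 2.0], [2.0, 3.0]], dtype=np.float32)
sym_as_colmajor = sym.reshape(-1).reshape((2, 2), order="F")
hidden = np.array_equal(sym, sym_as_colmajor)           # symmetric: bug invisible
```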
  9. transformer_engine/common/newton_schulz/newton_schulz.cpp, line 185-188 (link)

    The last argument to cusolverMpNewtonSchulz is the device info array (devInfo). Passing nullptr (line 188) means the library will not write convergence or per-iteration status back to the caller. If Newton-Schulz fails to converge or encounters a numerical issue, the NVTE_CHECK_CUSOLVERMP macro will only catch a non-CUSOLVER_STATUS_SUCCESS return code — convergence warnings or soft failures that still return SUCCESS will be silently swallowed.

    Allocate and check the device info to make convergence failures visible:

  10. transformer_engine/common/newton_schulz/newton_schulz.cpp, line 93-119 (link)

    The raw CUDA handles (stream, in_ready, out_ready) are created with plain C API calls (lines 98-104) before being moved into the NVTECusolverMpCtx struct on line 109. If MakeCusolverMpHandle or MakeCusolverMpGrid throw an exception (via NVTE_CHECK_CUSOLVERMP or NVTE_ERROR), the destructor for NVTECusolverMpCtx is never called, and these three CUDA resources leak.

    Since the exception unwinds the stack before the return new NVTECusolverMpCtx{...} line is reached, there is no way for the nvte_cusolvermp_ctx_destroy path to clean them up.

    Wrap each handle in its own RAII type to ensure safe cleanup:

  11. transformer_engine/common/CMakeLists.txt, line 230 (link)

    The option(NVTE_WITH_CUSOLVERMP ...) declaration (line 308) comes after its first conditional use (line 230: if (NVTE_WITH_CUSOLVERMP)). This is inconsistent with the established pattern in this file — NVTE_UB_WITH_MPI, NVTE_ENABLE_NVSHMEM, and NVTE_WITH_CUBLASMP all declare their option() immediately before the if() that uses it.

    While this works when -DNVTE_WITH_CUSOLVERMP=ON is passed on the CMake command line (the cache variable is already set), it violates the file's conventions and could surprise developers adding follow-on logic.

    Move the option() declaration to just before line 230:

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  12. transformer_engine/common/CMakeLists.txt, line 317 (link)

    The cuSOLVERMp library is linked with PUBLIC linkage (line 317: target_link_libraries(transformer_engine PUBLIC ${CUSOLVERMP_LIB})). This forces all downstream consumers of the transformer_engine library to have cuSOLVERMp in their link path, even if they never call Newton-Schulz.

    Since newton_schulz.h does not expose cuSOLVERMp types in the public API (only NCCL and standard CUDA types), PRIVATE linkage would provide better encapsulation. Change line 317 to:

  13. transformer_engine/common/include/transformer_engine/newton_schulz.h, line 17 (link)

    The unconditional #include <nccl.h> (line 17) means that newton_schulz.h — which is installed as a public header — forces all downstream projects that include it to have NCCL in their include path, even if they have no interest in Newton-Schulz.

    ncclComm_t is only used in the signatures of nvte_cusolvermp_ctx_create and nvte_newton_schulz, which are only meaningful when NVTE_WITH_CUSOLVERMP is defined. Guard the include and the declarations together:

    This prevents NCCL dependency leakage to downstream consumers who don't use Newton-Schulz.

  14. transformer_engine/common/newton_schulz/newton_schulz.cpp, line 172-178 (link)

    cudaFree followed by cudaMalloc inside nvte_newton_schulz (lines 175-177) will synchronize with the device each time the workspace needs to grow. Since the context is recreated on every call from newton_schulz.py (lines 77-78 create the context, and the user's code destroys it), the workspace will never be reused across calls — the grow-only caching here is ineffective and introduces unnecessary synchronization overhead.

    Consider either:

    1. Caching the context (e.g., in a module-level dict keyed by (nccl_comm_ptr, nranks, rank)) to amortize context creation and reuse workspaces, or
    2. Using cudaMallocAsync/cudaFreeAsync on ctx->stream to avoid synchronous stalls while caching per-context
  15. transformer_engine/pytorch/newton_schulz.py, line 82 (link)

    A new NVTECusolverMpCtx is created (line 77) and destroyed on every invocation of newton_schulz (line 66, via context manager). Context creation involves cudaStreamCreate, two cudaEventCreate calls, cusolverMpCreate, and cusolverMpCreateDeviceGrid — all heavyweight operations. Since the context is destroyed afterward, the grow-only workspace caching in the C++ layer is never actually reused.

    To improve performance when calling newton_schulz repeatedly (e.g., in a training loop), consider:

    1. Caching the context at the module level: _ctx_cache[(nccl_comm_ptr, nranks, rank)] = ctx, or
    2. Documenting that users should create the context once and reuse it across multiple calls
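A sketch of option 1, a module-level cache keyed by `(nccl_comm_ptr, nranks, rank)`. `fake_create` stands in for `tex.cusolvermp_ctx_create`; a real cache would also need a teardown path calling `tex.cusolvermp_ctx_destroy`:

```python
_ctx_cache = {}

def get_ctx(nccl_comm_ptr, nranks, rank, create):
    """Return a cached context, creating one only on first use."""
    key = (nccl_comm_ptr, nranks, rank)
    if key not in _ctx_cache:
        _ctx_cache[key] = create(*key)  # heavyweight: stream, events, grid
    return _ctx_cache[key]

creations = []

def fake_create(comm_ptr, nranks, rank):
    # Stand-in for tex.cusolvermp_ctx_create; records each creation.
    creations.append((comm_ptr, nranks, rank))
    return object()

ctx1 = get_ctx(0x1, 2, 0, fake_create)
ctx2 = get_ctx(0x1, 2, 0, fake_create)  # cache hit: no second creation
```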

Last reviewed commit: 1a35a36


@greptile-apps greptile-apps bot left a comment


15 files reviewed, 15 comments


Comment on lines +93 to +98
# Check: if X = A^{-1/2}, then X @ A @ X should be the identity matrix
if rank == 0:
    XXT = X @ X.t()
    I = torch.eye(N, device=XXT.device, dtype=XXT.dtype)
    max_diff = (XXT - I).abs().max().item()
    print(f"Max |X @ X.t() - I|: {max_diff:.6e}", flush=True)

verification doesn't match the comment - if X = A^{-1/2}, the check should be X @ A @ X ≈ I, not X @ X.t() ≈ I. The current check verifies X is orthogonal, not that X is the inverse square root of A. Note that A_orig is created on line 76 but never used.

Suggested change
# Check: if X = A^{-1/2}, then X @ A @ X should be the identity matrix
if rank == 0:
    XXT = X @ X.t()
    I = torch.eye(N, device=XXT.device, dtype=XXT.dtype)
    max_diff = (XXT - I).abs().max().item()
    print(f"Max |X @ X.t() - I|: {max_diff:.6e}", flush=True)
# Check: if X = A^{-1/2}, then X @ A @ X should be the identity matrix
XAX = X @ A_orig @ X
I = torch.eye(N, device=XAX.device, dtype=XAX.dtype)
max_diff = (XAX - I).abs().max().item()
print(f"Max |X @ A @ X - I|: {max_diff:.6e}", flush=True)
if torch.allclose(XAX, I, atol=args.atol, rtol=args.rtol):

Comment on lines +31 to +32
nccl_backend = group._get_backend(torch.device("cuda"))
return nccl_backend._comm_ptr()

uses private PyTorch APIs (_get_backend, _comm_ptr) that may change in future versions

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Comment on lines +39 to +58
quintic_coefficients = [
    4.0848, -6.8946, 2.9270,
    3.9505, -6.3029, 2.6377,
    3.7418, -5.5913, 2.3037,
    2.8769, -3.1427, 1.2046,
    2.8366, -3.0525, 1.2012,
]
coefficients = (
    quintic_coefficients if args.num_iterations == 5 else [1.5, -0.5, 0.0] * args.num_iterations
)

coefficients mismatch with API defaults - test uses 15 coefficients for 5 iterations, but newton_schulz.py defaults to 5 coefficients. This inconsistency means default API behavior isn't tested.
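For context, the iteration these triplets parameterize can be sketched as a NumPy reference. The update rule `X <- a*X + b*(X X^T) X + c*(X X^T)^2 X` and the spectral-norm prescale are the standard form for this family; treat this as an illustration of the coefficient semantics, not TE's exact algorithm:

```python
import numpy as np

def newton_schulz_ref(x: np.ndarray, coefficients) -> np.ndarray:
    """Apply X <- a*X + b*(X X^T) X + c*(X X^T)^2 X per triplet (a, b, c)."""
    assert len(coefficients) % 3 == 0
    x = x / np.linalg.norm(x, ord=2)  # prescale so all singular values are <= 1
    for i in range(0, len(coefficients), 3):
        a, b, c = coefficients[i : i + 3]
        g = x @ x.T
        x = a * x + b * (g @ x) + c * (g @ (g @ x))
    return x

m = np.array([[2.0, 0.0], [1.0, 1.0]])
q = newton_schulz_ref(m, [1.5, -0.5, 0.0] * 8)  # cubic fallback, 8 iterations
# q converges to the orthogonal polar factor of m, so q @ q.T approaches I
```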

vcherepanov-nv and others added 2 commits February 26, 2026 00:45
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Comment on lines +8 to +12
* \brief Functions for distributed Newton-Schulz inverse square root.
*
* This API is a TE-native binding to the cuSolverMp library.
* It computes an iterative Newton-Schulz inverse square root
* approximation on a distributed matrix.

Documentation claims this computes "inverse square root" but the test validates orthogonality (X @ X.t() ≈ I), and commit dd1dd0b states "it approximates orthogonal matrix, not inverse square root". If this computes the polar decomposition (orthogonal factor), the documentation should be updated to reflect that. Inverse square root would satisfy X @ A @ X ≈ I, which is different from orthogonality.
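The two properties the comment distinguishes can be checked numerically on a small SPD matrix: the true inverse square root `X = A^{-1/2}` satisfies `X @ A @ X = I`, while `X @ X.T = I` would make `X` orthogonal, and the two coincide only in special cases. A short NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
b = rng.standard_normal((3, 3))
a = b @ b.T + 3 * np.eye(3)          # symmetric positive definite

w, v = np.linalg.eigh(a)
x = v @ np.diag(w ** -0.5) @ v.T     # exact inverse square root of a

inv_sqrt_ok = np.allclose(x @ a @ x, np.eye(3))   # defining property of A^{-1/2}
orthogonal = np.allclose(x @ x.T, np.eye(3))      # x @ x.T equals A^{-1}, not I
```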

Comment on lines +79 to +80
m = x.size(0) * nranks # rows are distributed across ranks
n = x.size(1)

Assumes rows are evenly distributed (m = x.size(0) * nranks) but doesn't validate this. If matrix size isn't divisible by nranks, the computed global size m will be incorrect, leading to wrong results from cuSOLVERMp. Consider adding validation:

Suggested change
m = x.size(0) * nranks # rows are distributed across ranks
n = x.size(1)
# Global matrix dimensions
# Rows must be evenly distributed across ranks
local_rows = x.size(0)
m = local_rows * nranks
n = x.size(1)

Then add a validation check that all ranks have the same local_rows via dist.all_reduce.

num_iterations: int = 5,
coefficients: Optional[List[float]] = None,
) -> None:
"""Compute Newton-Schulz inverse square root in-place on a distributed matrix.

Docstring says "inverse square root" but test checks orthogonality. Update to match actual behavior (see comment on header file).

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Comment on lines +136 to +138
void nvte_newton_schulz(NVTECusolverMpCtx* ctx, int64_t m, int64_t n, NVTETensor x,
                        int64_t num_iterations, const float* coefficients, int64_t num_coefficients,
                        cudaStream_t caller_stream) {

num_coefficients parameter is unused

The num_coefficients parameter is accepted but never referenced in the function body. Neither cusolverMpNewtonSchulz_bufferSize (line 166) nor cusolverMpNewtonSchulz (line 183) receive this value. If cuSolverMp infers the count from num_iterations internally, then num_coefficients is dead code that should be removed from the API. If cuSolverMp actually needs it, then it should be passed to the cuSolverMp calls — otherwise the library may read out of bounds on the coefficients array.

from transformer_engine.pytorch import optimizers
from transformer_engine.pytorch.export import onnx_export
from transformer_engine.pytorch.cross_entropy import parallel_cross_entropy
from transformer_engine.pytorch.newton_schulz import newton_schulz

Unconditional import of optional feature

newton_schulz is unconditionally imported and exported as part of the public API, even when TE is built without NVTE_WITH_CUSOLVERMP. While the function itself raises a runtime error when called, this exposes the symbol to all users and makes it appear as a supported feature in auto-complete and docs. Consider guarding this import behind a check (similar to how other optional features are handled), or at minimum adding a note in the docstring that the function requires NVTE_WITH_CUSOLVERMP=1 at build time.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +170 to +177
// Allocate/grow device workspace
if (ctx->workspace_size < wrksp_size_device) {
  if (ctx->workspace) {
    NVTE_CHECK_CUDA(cudaFree(ctx->workspace));
  }
  NVTE_CHECK_CUDA(cudaMalloc(&ctx->workspace, wrksp_size_device));
  ctx->workspace_size = wrksp_size_device;
}

Synchronous cudaFree/cudaMalloc on hot path

cudaFree followed by cudaMalloc inside nvte_newton_schulz will synchronize with the device each time the workspace needs to grow. Since the context is recreated on every call from newton_schulz.py (line 82-86 creates + destroys ctx each invocation), the workspace will never be reused across calls — the grow-only caching here is ineffective. Consider either:

  1. Allowing callers to keep the context alive across calls, or
  2. Using cudaMallocAsync/cudaFreeAsync on ctx->stream to avoid synchronous stalls.

Comment on lines +82 to +86
ctx_ptr = tex.cusolvermp_ctx_create(nccl_comm_ptr, nranks, rank)
try:
    tex.newton_schulz(ctx_ptr, m, n, x, num_iterations, coefficients)
finally:
    tex.cusolvermp_ctx_destroy(ctx_ptr)

Context created/destroyed per call wastes resources

A new NVTECusolverMpCtx is created and destroyed on every invocation of newton_schulz. Context creation involves cudaStreamCreate, two cudaEventCreate calls, cusolverMpCreate, and cusolverMpCreateDeviceGrid — all of which are heavyweight operations. And since the context is destroyed afterward, the grow-only workspace caching in the C++ layer (lines 170-177 of newton_schulz.cpp) is never actually reused.

Consider caching the context (e.g., in a module-level dict keyed by (nccl_comm_ptr, nranks, rank)) and reusing it across calls, or exposing the context lifecycle to callers so they can amortize the cost when calling newton_schulz repeatedly in a training loop.

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Comment on lines +65 to +67
assert (
    len(coefficients) == num_iterations * 3
), f"Unexpected number of coefficients: {len(coefficients)} for {num_iterations} iterations"

use ValueError instead of assert for validation - assert can be disabled with Python's -O flag

Suggested change
assert (
    len(coefficients) == num_iterations * 3
), f"Unexpected number of coefficients: {len(coefficients)} for {num_iterations} iterations"
if len(coefficients) != num_iterations * 3:
    raise ValueError(
        f"Unexpected number of coefficients: {len(coefficients)} for {num_iterations} iterations"
    )

Comment on lines +69 to +72
if x.dim() != 2:
    raise ValueError(f"Expected 2D tensor, got {x.dim()}D")
if not x.is_cuda:
    raise ValueError("Input tensor must be on CUDA device")

missing contiguity check - C++ code uses data_ptr() which requires contiguous memory. Non-contiguous tensors will cause incorrect results.

Suggested change
if x.dim() != 2:
    raise ValueError(f"Expected 2D tensor, got {x.dim()}D")
if not x.is_cuda:
    raise ValueError("Input tensor must be on CUDA device")
if x.dim() != 2:
    raise ValueError(f"Expected 2D tensor, got {x.dim()}D")
if not x.is_cuda:
    raise ValueError("Input tensor must be on CUDA device")
if not x.is_contiguous():
    raise ValueError("Input tensor must be contiguous")

vcherepanov-nv and others added 3 commits February 27, 2026 03:01
Instead of requiring NVTE_WITH_CUSOLVERMP env var to be set for
both the common library and PyTorch extension builds, inspect the
already-built libtransformer_engine.so for exported symbols. This
is more robust for incremental builds and CI environments where
the env var may not be propagated to the extension build step.

The PyTorch extension only calls nvte_* C API functions, so it
does not need cusolverMp headers or libraries — only the compile
definition.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
When NVTE_WITH_CUSOLVERMP is not defined, omit the Newton-Schulz
functions entirely from the pybind module instead of registering
stubs that throw runtime errors. The Python wrapper checks for
the attribute at call time and raises a clear error message.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Comment on lines +69 to +72
if x.dim() != 2:
raise ValueError(f"Expected 2D tensor, got {x.dim()}D")
if not x.is_cuda:
raise ValueError("Input tensor must be on CUDA device")

Missing dtype validation: the docstring on line 36 states the tensor must be float32 or bfloat16, but this isn't enforced. Passing an unsupported dtype leads to confusing errors from cuSolverMp.

Suggested change
 if x.dim() != 2:
     raise ValueError(f"Expected 2D tensor, got {x.dim()}D")
 if not x.is_cuda:
     raise ValueError("Input tensor must be on CUDA device")
+if x.dtype not in (torch.float32, torch.bfloat16):
+    raise ValueError(f"Input tensor must be float32 or bfloat16, got {x.dtype}")

Raise FileNotFoundError when no libtransformer_engine.so is found in
any candidate location, and raise RuntimeError when nm is unavailable
or exits non-zero, rather than silently returning False in both cases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
if dtype == "bfloat16":
test_cmd += ["--atol=5e-2", "--rtol=5e-2"]

result = subprocess.run(test_cmd, env=os.environ, capture_output=True, check=False)

No timeout on subprocess: if the distributed test deadlocks or hangs (e.g., due to NCCL communication issues), this will block CI indefinitely. Add timeout=300 or similar.

PATHS ${CUSOLVERMP_DIR}
PATH_SUFFIXES lib
REQUIRED)
target_link_libraries(transformer_engine PUBLIC ${CUSOLVERMP_LIB})

PUBLIC linkage exposes cuSOLVERMp to all downstream consumers of transformer_engine library. Since newton_schulz.h doesn't expose cuSOLVERMp types in the public API, PRIVATE linkage would provide better encapsulation (consumers don't need cuSOLVERMp at link time).

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

In common_lib_has_symbol, prepend a candidate derived by importing
transformer_engine via importlib.util.find_spec and using the package
directory as the root. This correctly resolves the SO path for source
and PyPI installs (where it lives inside transformer_engine/), before
falling back to the repo-root and CMake build dir candidates.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Comment on lines +150 to +160
const int64_t mb = (m + ctx->nranks - 1) / ctx->nranks;
const int64_t nb = n;

// Compute local leading dimension
const int64_t local_rows = cusolverMpNUMROC(m, mb, ctx->rank, 0, ctx->nranks);
const int64_t lld = std::max(local_rows, static_cast<int64_t>(1));

const cudaDataType_t cuda_dtype = get_cuda_dtype(t->dtype());

// Create matrix descriptor
auto mat_desc = MakeCusolverMpMatrixDesc(ctx->grid.get(), cuda_dtype, m, n, mb, nb, 0, 0, lld);

Row-major vs. column-major layout mismatch

lld is set to local_rows, which is the column-major (Fortran/LAPACK) leading-dimension convention for a local_rows × n matrix. However, PyTorch tensors are row-major (C-contiguous) by default, where the correct leading dimension is n (number of columns).

When cuSolverMp reads the data pointer assuming lld = local_rows (column-major) but the data is actually laid out row-major, it will silently misinterpret every element [i,j]:

  • Expected offset (row-major): i * n + j
  • What cuSolverMp sees (column-major, lld = local_rows): j * local_rows + i

The test matrix happens to be symmetric (A = Q Λ Qᵀ), so A^T = A and the polar factor is also symmetric, which can mask this bug. For any non-symmetric rectangular matrix the result would be wrong.

If cuSolverMp requires column-major input, the caller should transpose the tensor before calling (or the API should accept a row-major flag). If cuSolverMp supports row-major, lld should be n:

// For row-major PyTorch tensors (C-contiguous):
const int64_t lld = n;

Please verify the expected memory layout against the cuSolverMp documentation and update accordingly, and add a non-symmetric test case to catch this class of bug.
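To make the mismatch concrete, here is a small NumPy sketch (dimensions chosen arbitrarily for illustration) of how a column-major interpretation with lld = local_rows scrambles a row-major buffer, and how a symmetric test matrix masks it:

```python
import numpy as np

m, n = 2, 3
a = np.arange(m * n, dtype=np.float32).reshape(m, n)  # row-major: [i, j] at offset i*n + j
buf = a.ravel()                                       # the raw buffer behind data_ptr()

# Column-major read with lld = m: element [i, j] is taken from offset j*m + i.
misread = buf.reshape((m, n), order="F")
assert not np.array_equal(misread, a)                 # scrambled for a generic matrix

# A symmetric (hence square) test matrix hides the bug: the column-major
# misread yields A^T, and A^T == A.
s = np.array([[1.0, 2.0], [2.0, 3.0]], dtype=np.float32)
misread_sym = s.ravel().reshape((2, 2), order="F")
assert np.array_equal(misread_sym, s)
```

This is why a non-symmetric (ideally rectangular) test case is needed to catch the layout bug.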

Comment on lines +185 to +188
NVTE_CHECK_CUSOLVERMP(cusolverMpNewtonSchulz(
ctx->handle.get(), ns_desc.get(), m, n, t->data.dptr, 1, 1, mat_desc.get(), num_iterations,
coefficients, CUDA_R_32F, ctx->workspace, ctx->workspace_size, workspace_host.data(),
workspace_host.size(), nullptr));

nullptr devInfo suppresses convergence diagnostics

The last argument to cusolverMpNewtonSchulz is the device info array (devInfo). Passing nullptr means the library will not write convergence or per-iteration status back to the caller. If Newton-Schulz fails to converge or encounters a numerical issue, the NVTE_CHECK_CUSOLVERMP macro will only catch a non-CUSOLVER_STATUS_SUCCESS return code — convergence warnings or soft failures that still return SUCCESS will be silently swallowed.

Consider allocating a small device integer and checking it after the call:

int* devInfo = nullptr;
NVTE_CHECK_CUDA(cudaMalloc(&devInfo, sizeof(int)));
NVTE_CHECK_CUSOLVERMP(cusolverMpNewtonSchulz(
    ..., devInfo));
int h_info = 0;
NVTE_CHECK_CUDA(cudaMemcpy(&h_info, devInfo, sizeof(int), cudaMemcpyDeviceToHost));
NVTE_CHECK(h_info == 0, "cusolverMpNewtonSchulz devInfo = ", h_info);
cudaFree(devInfo);

This would make convergence failures clearly visible to the user.

Comment on lines +93 to +119
NVTECusolverMpCtx* nvte_cusolvermp_ctx_create(ncclComm_t comm, int nranks, int rank) {
NVTE_API_CALL(nvte_cusolvermp_ctx_create);
int device_id{};
NVTE_CHECK_CUDA(cudaGetDevice(&device_id));

cudaStream_t stream{};
NVTE_CHECK_CUDA(cudaStreamCreate(&stream));

cudaEvent_t in_ready{};
NVTE_CHECK_CUDA(cudaEventCreate(&in_ready));
cudaEvent_t out_ready{};
NVTE_CHECK_CUDA(cudaEventCreate(&out_ready));

auto handle = MakeCusolverMpHandle(device_id, stream);
auto grid = MakeCusolverMpGrid(handle.get(), comm, nranks, 1, CUSOLVERMP_GRID_MAPPING_COL_MAJOR);

return new NVTECusolverMpCtx{
.nranks = nranks,
.rank = rank,
.stream = stream,
.in_ready = in_ready,
.out_ready = out_ready,
.handle = std::move(handle),
.grid = std::move(grid),
.workspace = nullptr,
.workspace_size = 0,
};

Resource leak on exception in nvte_cusolvermp_ctx_create

The raw CUDA handles stream, in_ready, and out_ready are created with plain C API calls before being moved into the NVTECusolverMpCtx struct. If MakeCusolverMpHandle or MakeCusolverMpGrid throw (via NVTE_CHECK_CUSOLVERMP or NVTE_ERROR), the destructor for NVTECusolverMpCtx is never called and these three CUDA resources leak.

Since the exception unwinds the stack before reaching the return new NVTECusolverMpCtx{...} line, there is no way for the nvte_cusolvermp_ctx_destroy path to clean them up.

Wrapping each handle in its own RAII type (similar to the existing CusolverMpHandle) would ensure safe cleanup on any early-exit path:

struct CudaStreamDeleter {
  void operator()(cudaStream_t s) const { cudaStreamDestroy(s); }
};
using CudaStream = std::unique_ptr<std::remove_pointer_t<cudaStream_t>, CudaStreamDeleter>;
// ... similar for cudaEvent_t

f"'nm' failed on {lib_path} (exit code {e.returncode}):\n{e.stderr}"
) from e

return symbol in result.stdout

Substring match can produce false positives

symbol in result.stdout does a plain substring search over the entire nm output. If the library ever contains a symbol that has the target symbol as a prefix (e.g. nvte_cusolvermp_ctx_create_with_options), this check will incorrectly return True and enable cuSolverMp support in the PyTorch extension even though the real symbol is absent.

The nm -D output format is <address> <type> <name>\n per line. A safer check is to match against word boundaries:

import re
return bool(re.search(r'\b' + re.escape(symbol) + r'\b', result.stdout))

or equivalently test for whitespace/line boundaries:

return any(line.split()[-1] == symbol for line in result.stdout.splitlines() if line.strip())
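A runnable sketch of the difference (the nm output below is fabricated for illustration; only the symbol names matter):

```python
# Hypothetical `nm -D` output containing a symbol that has the target
# symbol as a prefix.
nm_output = """\
0000000000012340 T nvte_cusolvermp_ctx_create_with_options
0000000000012380 T nvte_comm_gemm
"""

def has_symbol_substring(out: str, symbol: str) -> bool:
    # Plain substring search: prefix symbols cause false positives.
    return symbol in out

def has_symbol_exact(out: str, symbol: str) -> bool:
    # nm -D lines look like "<address> <type> <name>"; compare the name
    # field only, so only whole-symbol matches count.
    return any(line.split()[-1] == symbol
               for line in out.splitlines() if line.strip())

assert has_symbol_substring(nm_output, "nvte_cusolvermp_ctx_create")   # false positive
assert not has_symbol_exact(nm_output, "nvte_cusolvermp_ctx_create")   # correct
assert has_symbol_exact(nm_output, "nvte_comm_gemm")
```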

@greptile-apps

greptile-apps bot commented Mar 4, 2026

Additional Comments (1)

transformer_engine/common/CMakeLists.txt
option() declared after first use

The option(NVTE_WITH_CUSOLVERMP ...) declaration appears ~12 lines later in the file (at the target_link_libraries block), but this if (NVTE_WITH_CUSOLVERMP) guard relies on it. While this works when -DNVTE_WITH_CUSOLVERMP=ON is passed on the command line (the cache entry is set before CMake processes any option()), a developer who adds a second reference to NVTE_WITH_CUSOLVERMP above the option() call — or who processes the file without the cache — will get confusing behaviour.

CMake convention is to declare option() before first use. Moving the option() declaration to just above the first if (NVTE_WITH_CUSOLVERMP) block (line 281) would improve robustness and readability.

candidates = []
try:
te_spec = importlib.util.find_spec("transformer_engine")
print(f"TE_SPEC: {te_spec}")

Debug print statement left in production code.

print(f"TE_SPEC: {te_spec}") is a debug artifact that will pollute build output for every user who builds the PyTorch extension. It should be removed before merging.

Suggested change
-    print(f"TE_SPEC: {te_spec}")
     if te_spec is not None and te_spec.origin is not None:

Comment on lines +230 to +233
if (NVTE_WITH_CUSOLVERMP)
list(APPEND transformer_engine_SOURCES
newton_schulz/newton_schulz.cpp)
endif()

option() declared after its first use.

Every other optional feature in this file (NVTE_UB_WITH_MPI, NVTE_ENABLE_NVSHMEM, NVTE_WITH_CUBLASMP) follows the pattern of declaring the option() immediately before the if() that uses it. Here, option(NVTE_WITH_CUSOLVERMP ...) is declared ~78 lines later at line 308, after the source-file list is already conditionally extended on line 230.

While this works when -DNVTE_WITH_CUSOLVERMP=ON is passed on the CMake command line (the cache variable is already set before this file processes), it is inconsistent with the established pattern and could surprise developers who add follow-on logic. Moving the option() declaration to just before line 230 would make the file consistent.


#ifndef TRANSFORMER_ENGINE_COMMON_NEWTON_SCHULZ_H_
#define TRANSFORMER_ENGINE_COMMON_NEWTON_SCHULZ_H_

#include <nccl.h>

Unconditional #include <nccl.h> in a public header.

newton_schulz.h is installed as a public header (under include/transformer_engine/). The unconditional #include <nccl.h> means that any downstream project that includes this header — even one with no interest in Newton-Schulz — now requires NCCL in its include path.

ncclComm_t is only used in the function signatures of nvte_cusolvermp_ctx_create and nvte_newton_schulz, which are themselves only meaningful when NVTE_WITH_CUSOLVERMP is defined. Guarding the include and the declarations together would prevent the leakage:

Suggested change
-#include <nccl.h>
+#ifdef NVTE_WITH_CUSOLVERMP
+#include <nccl.h>
+// ... struct and function declarations ...
+#endif  // NVTE_WITH_CUSOLVERMP

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
@greptile-apps

greptile-apps bot commented Mar 4, 2026

Additional Comments (3)

build_tools/utils.py, line 95
FileNotFoundError propagates to the build system

When libtransformer_engine.so cannot be located, common_lib_has_symbol raises FileNotFoundError. The call-site in pytorch.py is a bare if common_lib_has_symbol(...) with no try/except, so any missing-library condition aborts the entire PyTorch extension build rather than gracefully treating it as "symbol absent → no cuSOLVERMp".

This scenario is realistic: the common library might be built in a non-standard path, or a developer may be doing an incremental build of only the PyTorch extension. The function should return False (with an optional warning) instead of raising:

    if lib_path is None:
        import warnings
        warnings.warn(
            "Could not find libtransformer_engine.so; cuSOLVERMp auto-detection disabled.",
            stacklevel=2,
        )
        return False

The subprocess.CalledProcessError path (nm failure) has the same problem and should similarly return False or at most warn.


tests/pytorch/distributed/test_newton_schulz.py, line 252
Missing skip guard for builds without cuSOLVERMp

The file correctly skips when fewer than 2 GPUs are present, but there is no skip when TE was built without NVTE_WITH_CUSOLVERMP=1. In that case, newton_schulz raises RuntimeError("newton_schulz requires Transformer Engine to be built with NVTE_WITH_CUSOLVERMP=1"), which will appear in CI as a test failure rather than a skip.

A simple guard at module level (or inside the test) would mirror the GPU-count check:

import transformer_engine_torch as tex
if not hasattr(tex, "newton_schulz"):
    pytest.skip(
        "Newton-Schulz tests require TE built with NVTE_WITH_CUSOLVERMP=1.",
        allow_module_level=True,
    )

transformer_engine/pytorch/newton_schulz.py, line 813
Undocumented fallback coefficients for num_iterations != 5

When coefficients is None and num_iterations != 5, the function silently falls back to [1.5, -0.5, 0.0] * num_iterations. These are generic cubic Newton-Schulz coefficients and are likely far less accurate than the carefully tuned QUINTIC_COEFFICIENTS. Nothing in the public docstring tells callers about this behaviour.

If 15-iteration runs use the test's custom coefficients (which are different from both defaults), the fallback path is also never exercised by the current tests with the default API — making the generic path untested in practice.

Consider either:

  • Documenting that "optimised coefficients are only bundled for num_iterations=5; for other values you must supply coefficients explicitly", and turning the fallback into a ValueError, or
  • Bundling a second set of optimised coefficients for the 15-iteration case that matches what the test actually uses.
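As a scalar-level sketch of why the trailing 0.0 matters (a simplified model of the iteration, not the library's implementation): each Newton-Schulz step maps every singular value σ through the polynomial aσ + bσ³ + cσ⁵, so c = 0 reduces the quintic update to the classic cubic one.

```python
def ns_step(sigma, a, b, c):
    # One Newton-Schulz step acting on a single singular value:
    # sigma -> a*sigma + b*sigma**3 + c*sigma**5, following the
    # (a, b, c) triple-per-iteration coefficient layout.
    return a * sigma + b * sigma**3 + c * sigma**5

# Generic cubic fallback (1.5, -0.5, 0.0): the trailing c = 0 makes the
# update degree-3. It still drives sigma toward 1 (the orthogonalized
# singular value), just with the convergence profile of the plain cubic
# iteration rather than the tuned quintic.
sigma = 0.3
for _ in range(10):
    sigma = ns_step(sigma, 1.5, -0.5, 0.0)
assert abs(sigma - 1.0) < 1e-6
```

Note that σ = 1 is a fixed point of the cubic update (1.5·1 − 0.5·1 = 1), which is why the fallback still converges; it just may need more iterations than tuned coefficients.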

@greptile-apps

greptile-apps bot commented Mar 4, 2026

Additional Comments (12)

qa/L1_pytorch_distributed_unittest/test.sh, line 35
Newton-Schulz test runs unconditionally in QA

The test is added without any guard for NVTE_WITH_CUSOLVERMP. If the CI runner is not built with cuSolverMp, this subprocess will fail with a RuntimeError, causing the entire L1 QA job to fail. Other optional features are guarded in their test paths.

Add a conditional:

if [ "${NVTE_WITH_CUSOLVERMP:-0}" == "1" ]; then
    python3 -m pytest -v -s --junitxml=$XML_LOG_DIR/pytest_test_newton_schulz.xml $TE_PATH/tests/pytorch/distributed/test_newton_schulz.py || test_fail "test_newton_schulz.py"
fi

tests/pytorch/distributed/test_newton_schulz.py, line 15
Missing skip guard for cuSolverMp availability

The file skips when fewer than 2 GPUs are available, but not when Transformer Engine is built without NVTE_WITH_CUSOLVERMP=1. On such builds, every test fails inside the subprocess with RuntimeError, producing confusing stderr messages.

Add a module-level skip check:

import transformer_engine_torch as tex
if not hasattr(tex, "newton_schulz"):
    pytest.skip(
        "TE not built with NVTE_WITH_CUSOLVERMP=1; skipping Newton-Schulz tests.",
        allow_module_level=True,
    )

build_tools/pytorch.py, line 95
Silent default for CUSOLVERMP_HOME inconsistent with NVSHMEM pattern

The cuSolverMp block silently defaults to "/usr" when CUSOLVERMP_HOME is unset. The NVSHMEM block asserts that NVSHMEM_HOME is explicitly set, providing a clear error message. If the library is not installed under /usr/include and /usr/lib, the build fails with a generic linker error rather than a clear message about the missing environment variable.

Align with the NVSHMEM pattern:

if bool(int(os.getenv("NVTE_WITH_CUSOLVERMP", "0"))):
    assert (
        os.getenv("CUSOLVERMP_HOME") is not None
    ), "CUSOLVERMP_HOME must be set when compiling with NVTE_WITH_CUSOLVERMP=1"
    cusolvermp_home = Path(os.getenv("CUSOLVERMP_HOME"))
    include_dirs.append(cusolvermp_home / "include")
    library_dirs.append(cusolvermp_home / "lib")
    libraries.append("cusolverMp")
    cxx_flags.append("-DNVTE_WITH_CUSOLVERMP")

transformer_engine/pytorch/newton_schulz.py, line 21
Uses private PyTorch APIs that may change

Lines 20-21 use _get_backend() and _comm_ptr(), which are private PyTorch APIs (underscore prefix indicates internal/unstable). These can change in future PyTorch versions, breaking this code.

Consider using public APIs or documenting this dependency clearly in comments, noting that this code may need updates with new PyTorch releases.


transformer_engine/pytorch/newton_schulz.py, line 67
Use ValueError instead of assert for validation

The assertion on line 65 validates user input. Assertions can be disabled with Python's -O flag, silently allowing invalid inputs. Use ValueError for user-facing validation:

if len(coefficients) != num_iterations * 3:
    raise ValueError(
        f"Unexpected number of coefficients: {len(coefficients)} for {num_iterations} iterations"
    )

transformer_engine/pytorch/newton_schulz.py, line 72
Missing tensor memory layout validation

The C++ code calls data_ptr() which requires contiguous memory. Non-contiguous tensors will cause silent incorrect results. Add a contiguity check before validation:

if not x.is_contiguous():
    raise ValueError("Input tensor must be contiguous (C-order)")

Also add dtype validation since the docstring specifies float32 or bfloat16:

if x.dtype not in (torch.float32, torch.bfloat16):
    raise ValueError(f"Input tensor must be float32 or bfloat16, got {x.dtype}")

tests/pytorch/distributed/test_newton_schulz.py, line 37
Missing subprocess timeout

The distributed test subprocess has no timeout. If it deadlocks (e.g., NCCL communication issue), the test will block indefinitely, hanging the CI job. Add a timeout:

result = subprocess.run(test_cmd, env=os.environ, capture_output=True, check=False, timeout=300)

transformer_engine/common/CMakeLists.txt, line 317
PUBLIC linkage exposes cuSOLVERMp to all downstream consumers

cuSOLVERMp is linked with PUBLIC visibility, meaning all projects depending on Transformer Engine must have it in their link path, even those that don't use Newton-Schulz. Since the public API doesn't expose cuSOLVERMp types, PRIVATE linkage would provide better encapsulation:

target_link_libraries(transformer_engine PRIVATE ${CUSOLVERMP_LIB})

transformer_engine/common/CMakeLists.txt, line 230
CMake option declared after first use

The option(NVTE_WITH_CUSOLVERMP) is declared at line 308, but used at line 230. Every other optional feature declares the option immediately before using it. While this works when -DNVTE_WITH_CUSOLVERMP=ON is passed on the command line, it breaks the established pattern and could surprise developers adding follow-on logic.

Move the option() declaration to just before line 230.


transformer_engine/common/include/transformer_engine/newton_schulz.h, line 17
Unconditional #include <nccl.h> in public header

newton_schulz.h is installed as a public header. The unconditional #include <nccl.h> forces NCCL into every downstream project's include path, even those with no interest in Newton-Schulz. Since ncclComm_t is only used in functions meaningful when NVTE_WITH_CUSOLVERMP is defined, guarding both the include and declarations would prevent this leakage:

#ifdef NVTE_WITH_CUSOLVERMP
#include <nccl.h>
...
NVTECusolverMpCtx* nvte_cusolvermp_ctx_create(ncclComm_t comm, int nranks, int rank);
void nvte_newton_schulz(...);
#endif

transformer_engine/pytorch/__init__.py, line 62
Unconditional import of optional feature

newton_schulz is unconditionally imported and exported as public API, even when TE is built without NVTE_WITH_CUSOLVERMP. While the function raises a runtime error when called, this exposes the symbol to all users and makes it appear as a supported feature in auto-complete and documentation.

Guard this import similarly to other optional features, or add a note in the docstring that NVTE_WITH_CUSOLVERMP=1 at build time is required.


transformer_engine/common/newton_schulz/newton_schulz.cpp, line 119
Resource leak on exception during context creation

Raw CUDA handles (stream, in_ready, out_ready) are created at lines 99, 102, 104 but not moved into the struct until lines 112-114. If MakeCusolverMpHandle (line 106) or MakeCusolverMpGrid (line 107) throw (both contain NVTE_CHECK_CUSOLVERMP which can throw), these three handles are never cleaned up.

Wrap each handle in an RAII type (similar to CusolverMpHandle and CusolverMpGrid) to ensure safe cleanup on any early-exit path.

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
@greptile-apps

greptile-apps bot commented Mar 4, 2026

Additional Comments (4)

build_tools/pytorch.py, line 95
Missing assertion for CUSOLVERMP_HOME

When NVTE_WITH_CUSOLVERMP=1 is set but CUSOLVERMP_HOME is not, this silently falls back to /usr and will produce a confusing linker error at build time rather than a clear configuration message.

Compare to the NVSHMEM_HOME handling directly above (lines 81-83), which explicitly asserts the variable must be set. Add the same guard here:

if bool(int(os.getenv("NVTE_WITH_CUSOLVERMP", "0"))):
    assert (
        os.getenv("CUSOLVERMP_HOME") is not None
    ), "CUSOLVERMP_HOME must be set when compiling with NVTE_WITH_CUSOLVERMP=1"
    cusolvermp_home = Path(os.getenv("CUSOLVERMP_HOME"))
    ...

qa/L1_pytorch_distributed_unittest/test.sh, line 35
Test runs unconditionally regardless of build config

This test is added unconditionally to the QA script, so it will always execute even when TE is built without NVTE_WITH_CUSOLVERMP=1. The subprocess will fail with a runtime error about missing cuSolverMp support, breaking the CI job.

Add a build-flag guard matching the build configuration:

if [ "${NVTE_WITH_CUSOLVERMP:-0}" = "1" ]; then
    python3 -m pytest -v -s --junitxml=$XML_LOG_DIR/pytest_test_newton_schulz.xml $TE_PATH/tests/pytorch/distributed/test_newton_schulz.py || test_fail "test_newton_schulz.py"
fi

tests/pytorch/distributed/test_newton_schulz.py, line 15
No guard for missing cuSolverMp build support

The test only skips when fewer than 2 GPUs are available, but does not check whether TE was built with NVTE_WITH_CUSOLVERMP=1. On a system with 2+ GPUs but a TE build without cuSolverMp, the torchrun subprocess will fail with a runtime error, causing the test to report AssertionError with confusing output rather than a clean skip.

Add an early skip guard:

import transformer_engine_torch as tex

if not hasattr(tex, "newton_schulz"):
    pytest.skip("Newton-Schulz tests require TE built with NVTE_WITH_CUSOLVERMP=1.", allow_module_level=True)

transformer_engine/pytorch/newton_schulz.py, line 64
Fallback coefficients for num_iterations != 5 are undocumented and degrade polynomial degree

When num_iterations != 5 and no custom coefficients are supplied, the fallback is [1.5, -0.5, 0.0] * num_iterations. The trailing 0.0 silently degenerates the quintic polynomial to a cubic one (a·X + b·X³ + 0·X⁵). This means users calling with, e.g., num_iterations=10 will unknowingly use different convergence behavior than the optimized 5-iteration case.

Consider either:

  1. Raising a ValueError when num_iterations != 5 and no coefficients are provided, forcing users to supply their own, or
  2. Documenting clearly in the docstring that only 5-iteration defaults are optimised and all other counts fall back to generic cubic steps

vcherepanov-nv and others added 4 commits March 5, 2026 20:35
Replace misleading 'inverse square root' descriptions with accurate
'matrix orthogonalization' in the module docstring, function docstring,
and pybind11 binding docstring.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Context creation is expensive and should not happen on every
newton_schulz call. Introduce CusolverMpCtx and cusolvermp_ctx_create()
so callers can create a context once from a ProcessGroup and reuse it.
CusolverMpCtx supports explicit destroy() and use as a context manager.
newton_schulz() now takes CusolverMpCtx instead of ProcessGroup.

Export CusolverMpCtx and cusolvermp_ctx_create from the pytorch package.
Update the distributed test worker to use explicit context lifecycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace assert with ValueError for the coefficients length check.
Add dtype (float32/bfloat16) and contiguity checks for the input tensor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@vcherepanov-nv vcherepanov-nv changed the title [Draft] Newton-Schulz via cuSOLVERMp Newton-Schulz via cuSOLVERMp Mar 9, 2026