Skip to content

[All] Added better error messages#2705

Open
ptrendx wants to merge 7 commits intoNVIDIA:mainfrom
ptrendx:pr_better_errors
Open

[All] Added better error messages#2705
ptrendx wants to merge 7 commits intoNVIDIA:mainfrom
ptrendx:pr_better_errors

Conversation

@ptrendx
Copy link
Member

@ptrendx ptrendx commented Feb 25, 2026

Description

Added better error messages throughout the codebase in order to provide more information when the error gets triggered.

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 25, 2026

Greptile Summary

This PR systematically replaces bare assert statements with explicit, typed exceptions (ValueError, RuntimeError, TypeError) across 31 files, and enriches the resulting error messages with actual runtime values (shapes, dtypes, layout names, counts, etc.) to make debugging significantly easier.

Key changes:

  • C++ macros (TRANSFORMER_ENGINE_TYPE_SWITCH_ALL and friends): "Invalid type." → descriptive errors listing the actual dtype and the full set of valid types; the previously reported gap where Int16 was missing from TRANSFORMER_ENGINE_TYPE_SWITCH_ALL's expected list is now corrected.
  • fused_attn.cpp: New to_string overloads added for NVTE_QKV_Layout and NVTE_QKV_Format; layout/format switch defaults now report the actual enum value and enclosing function name.
  • Python (JAX / PyTorch): Hundreds of bare asserts converted to raise ValueError/RuntimeError/TypeError; identity comparisons (is False, is not False, is not True) replaced with idiomatic truthiness checks, addressing patterns flagged in previous review threads.
  • Bonus fixes: global_amax_col = None and global_amax_row = None initialisations quietly resolve latent NameError paths noted in a TODO comment; the typo "Bias is implemented for FP4 GEMM." (missing "not") is corrected to "Bias is not supported in NVFP4QuantizerRef.qgemm". One dtype enum value in an error message (quantizer.cpp) is shown as a raw integer instead of a readable type name and should be updated.

Confidence Score: 4/5

  • Safe to merge with minor improvements for error message clarity and undefined variable handling.
  • This PR is safe to merge as it contains only error-message improvements and exception-type changes with no algorithmic modifications. All conversions from assert to raise occur in error-handling paths only. Two issues identified: (1) one dtype value shown as raw enum integer instead of human-readable string (style/UX), and (2) a latent NameError risk in a pow_2_scales code path (logic) where global_amax_row should be initialized alongside global_amax_col. Both are low-severity but worth addressing.
  • transformer_engine/pytorch/csrc/quantizer.cpp (dtype enum value) and transformer_engine/pytorch/custom_recipes/quantization_nvfp4.py (uninitialized global_amax_row)

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User Calls API] --> B{Pre-condition check}
    B -->|Before PR: bare assert| C[AssertionError\nno context]
    B -->|After PR: explicit raise| D{Which error type?}
    D -->|Wrong value / range| E[ValueError\nshows actual value + expected]
    D -->|Wrong type| F[TypeError\nshows actual type + expected]
    D -->|Runtime / state violation| G[RuntimeError\ndescribes state + hint]
    C --> H[User must read source\nto understand failure]
    E --> I[User sees what value\nfailed and why]
    F --> I
    G --> I
Loading

Last reviewed commit: 78227d8

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

26 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

ptrendx and others added 3 commits February 25, 2026 11:52
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

26 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@ptrendx
Copy link
Member Author

ptrendx commented Feb 25, 2026

/te-ci

Copy link
Collaborator

@jberchtold-nvidia jberchtold-nvidia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from the TE/JAX side! Thanks! I have not reviewed the core or PyTorch files

assert FusedAttnFwdPrimitive.inner_primitive is not None
assert (
FusedAttnFwdPrimitive.inner_primitive is not None
), "FusedAttnFwdPrimitive.inner_primitive has not been registered"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit SR: If you'd like, appending "This usually occurs when TransformerEngine was not installed properly and the shared library cannot be loaded. Please see the documentation troubleshooting for assistance" but as is it's already an improvement over no message

ksivaman
ksivaman previously approved these changes Mar 4, 2026
@ptrendx ptrendx dismissed stale reviews from ksivaman and jberchtold-nvidia via 2bca1c8 March 9, 2026 22:53
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx ptrendx force-pushed the pr_better_errors branch from 821e9e6 to 8dee07b Compare March 9, 2026 22:55
@ptrendx
Copy link
Member Author

ptrendx commented Mar 9, 2026

/te-ci

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
- sx_t: scale tensor for qx_t (if columnwise_usage), None otherwise
- global_amax_row, global_amax_col: global amax tensors
"""
global_amax_col = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The global_amax_col = None initialisation was added to fix a latent bug, but global_amax_row should also be initialised. In the pow_2_scales branch (lines 604–612), neither variable is assigned, so any code path that references global_amax_row afterwards raises a NameError. The original TODO comment (line 609–610) mentions both variables. Since both are returned at line 682, both must be defined:

Suggested change
global_amax_col = None
global_amax_col = None
global_amax_row = None

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants