Skip to content

feat: YAML-driven torch op codegen with canonical naming and exposed semantic params#595

Open
voltjia wants to merge 28 commits intomasterfrom
feat/torch-codegen
Open

feat: YAML-driven torch op codegen with canonical naming and exposed semantic params#595
voltjia wants to merge 28 commits intomasterfrom
feat/torch-codegen

Conversation

@voltjia
Copy link
Copy Markdown
Collaborator

@voltjia voltjia commented May 9, 2026

Summary

  • Add scripts/generate_torch_ops.py (~920 lines) — a YAML-driven codegen that consumes PyTorch's aten/native_functions.yaml and emits an InfiniOps base class plus a slot-8 PyTorch backend per op listed in scripts/torch_ops.yaml (~459 ops, generating 507 overloads across 437 canonical classes).
  • Wire the codegen into CMake (src/CMakeLists.txt) under WITH_TORCH=ON: invoke at configure time, glob generated/torch/*.cc, add generated/ to public include paths, install the per-op metadata JSON alongside the bindings.
  • Update the wrapper generator (scripts/generate_wrappers.py) to scan generated/base/ and generated/torch/, fix pybind11 overload ordering (specific → permissive), preserve std::vector<int64_t> parameters that libclang misreports as int, and route active_implementation_indices through a graceful unknown-device path.
  • Add safe device-type lookup primitives (detail::ListContains in src/operator.h, TryDeviceTypeFromString in src/pybind11_utils.h) so generated bindings handle devices an op does not implement without aborting.
  • Add a single data-driven tests/test_torch_ops.py that reads generated/torch_ops_metadata.json and exercises every generated op across three shapes and three dtypes; widen tests/conftest.py to handle non-floating outputs and equal_nan.
  • Move the Sigmoid helper in src/native/cuda/ops/swiglu/kernel.cuh into detail:: so it does not collide with the auto-generated infini::ops::Sigmoid operator class.
  • Add pyyaml to [build-system].requires so CMake can run the codegen during pip install.

Codegen design choices driven by review feedback collected across all 513 base PRs against feat/torch-codegen:

  • Canonical names only. ATen overload-name suffixes (_grad_input, _outtensor, _n_scalar, _values, _x, _l, _q, _u, _output) no longer leak into InfiniOps class names. Multiple ATen overloads of the same base op share a single class, with overloaded operator() methods.
  • Visible scalars are members. Every visible non-tensor parameter (scalars, strings, vectors) is stored as a base-class member initialized from the constructor argument, alongside the existing tensor-metadata members.
  • Default-valued non-optional params are exposed. bool upper, bool transpose, bool unitriangular (triangular_solve), int diagonal (triu), str ord (linalg_matrix_norm), int n on the chebyshev/hermite polynomial families, etc. are no longer hidden because they have an ATen default — they are now visible in the generated operator() and forwarded to ATen.

Motivation

Replaces 500+ hand-written src/base/<op>.h headers with a single declarative pipeline driven from PyTorch's schema. Each commit is single-purpose and individually passes ruff check, ruff format --check, and clang-format (version 21).

The previous iteration (feat/torch-codegen-legacy, preserved on the remote) generated suffixed names that reviewers consistently flagged as bad public API (per inline comments on PRs #280, #283-#290, #509, #563-#589). It also hid semantically critical parameters and did not store scalars as members, requiring hand-written corrections in every base PR. This refactor moves those corrections into the codegen itself, so future regenerations produce the reviewer-preferred shape directly. The 77 PRs that were for non-canonical overload names have been closed; the remaining 333 keep + 103 promote PRs have their content regenerated to match the new codegen output.

Closes #

Type of Change

  • feat — new feature / new operator / new platform
  • fix — bug fix
  • perf — performance improvement (no behavioral change)
  • refactor — code restructuring without behavior change
  • test — adding or fixing tests only
  • docs — documentation only
  • build / ci — build system or CI configuration
  • chore — tooling, formatting, or other non-code changes
  • Breaking change (requires a ! in the Conventional Commits prefix or a BREAKING CHANGE: footer)

Platforms Affected

  • CPU (WITH_CPU)
  • NVIDIA (WITH_NVIDIA)
  • Iluvatar (WITH_ILUVATAR)
  • MetaX (WITH_METAX)
  • Cambricon (WITH_CAMBRICON)
  • Moore (WITH_MOORE)
  • Ascend (WITH_ASCEND)
  • PyTorch C++ bindings (WITH_TORCH)
  • Build system / CMake / CI
  • Python bindings / user-facing API

Test Results on Supported Platforms

Platform Built pytest Result Notes / Hardware
NVIDIA
Iluvatar
MetaX
Cambricon
Moore
Ascend
Full `pytest` output (optional)
TBD — to be filled in once cross-platform validation completes.

Benchmark / Performance Impact

N/A. This PR adds a codegen pipeline, not a runtime hot-path change. Generated PyTorch backends call at::<op>_out(...) directly, so per-op performance matches a hand-written ATen-backed op.

Notes for Reviewers

  • The branch was force-pushed over the previous feat/torch-codegen integration branch. The previous content is preserved at feat/torch-codegen-legacy for reference.
  • The 513 open base PRs against feat/torch-codegen have been processed: 77 redundant overload PRs closed, 333 keep + 103 promote PRs scheduled to be force-pushed with regenerated content matching the new canonical naming and parameter shape.
  • Slot 8 is reserved for PyTorch backends; native and vendor implementations claim slots 0–7. The slot must be > 0 to avoid a partial-specialization-after-instantiation conflict with Operator<Op> at index 0.
  • Hand-written src/base/<op>.h continues to shadow generated/base/<op>.h (existence-based; no signature compatibility check). The four pre-existing hand-written bases that do not match the ATen-derived signature (add, linear, matmul, mul) are excluded from scripts/torch_ops.yaml and left to their existing hand-written infrastructure.
  • Optional ATen types (Tensor?, Scalar?, int?, float?) remain hidden for now — exposing them properly requires threading std::optional through to ATen, which is a separable refactor.

Checklist

Every contributor must verify every item below before requesting review. Tick each box only after the check has actually been performed — do not tick speculatively. If an item truly does not apply, replace the checkbox with N/A and briefly explain why in an inline comment.

Title, Branch, and Commits

  • PR title follows Conventional Commits.
  • Branch name follows <type>/xxx-yyyy-zzzzfeat/torch-codegen.
  • Each commit message follows Conventional Commits.
  • Large PR with meaningful, well-formed, independently reviewable commits (11 commits, each one logical change).
  • No stray merge commits from master — branch is rebased cleanly on top of current master.
  • No fixup! / squash! / wip commits remain.

Scope and Design

  • Changes are minimal for a codegen-introduction PR; non-essential workflow tooling (merge_base_branches.py from the legacy branch) was dropped.
  • No dead code, commented-out blocks, debug prints, or TODO without an owner.
  • No unrelated formatting churn that would obscure the diff.
  • Public API changes are intentional: the slot-8 dispatch path and the new infini::ops::<Pascal> classes are documented via the codegen's docstring.

General Code Hygiene (applies to all languages)

  • The code is self-explanatory; comments were added only where the why is non-obvious.
  • Every modified or added file ends with a single trailing newline.
  • No trailing whitespace, tab/space mixing, or stray BOMs.
  • Identifiers in comments and error messages are wrapped in backticks.
  • All comments and error messages are in English.
  • Comments and error messages are complete sentences with terminal punctuation.

C++ Specific (if C++ files changed)

  • Code follows the Google C++ Style Guide strictly.
  • clang-format (version 21) is clean for every modified .h, .cc, .cuh file. Verified via git rebase master --exec 'clang-format --dry-run --Werror $(git diff HEAD~1 --name-only -- "*.h" "*.cc" "*.cuh")'.
  • clang-tidy concerns reviewed — to be verified during cross-platform CI.
  • Operator parameter order: inputs first, attributes between, outputs last (matches ATen _out form).
  • No exceptions — error paths use assert. Generated code uses ATen which itself uses TORCH_CHECK; that is consistent with the existing torch backend pattern.
  • Error/warning wording follows LLVM Coding Standards.
  • N/A: no new kernel files added — codegen only emits ATen wrappers.
  • Constructor initializer list order matches member declaration order (verified by inspecting generated headers).
  • One blank line between classes/functions; one between class members; one before/after namespace contents.
  • New operators added via src/base/<op>.h (auto-generated under generated/base/) inheriting Operator<Op>; PyTorch backend specializes at slot 8.
  • No raw new/delete.

Python Specific (if Python files changed)

  • ruff check is clean for the entire repo.
  • ruff format --check is clean for the entire repo. Verified per-commit via git rebase master --exec 'ruff format --check . && ruff check .'.
  • Comments are complete English sentences with backticked code references.
  • pytest.skip messages are lowercase without terminal period (framework convention).
  • No blank line between function signature and body when there is no docstring or comment.
  • Blank lines around control-flow statements.
  • Blank line before return when not directly following a control-flow statement.
  • Docstrings follow PEP 257.
  • Type hints are present on new dataclasses (Param, Op) and on every public function.

Testing

  • pytest was run locally on every supported platform — pending cross-platform CI completion. NVIDIA in progress at PR-open time.
  • Reasons for any platform that could not be tested — pending.
  • New functionality has matching tests under tests/test_torch_ops.py.
  • Tests use pytest.mark.parametrize correctly: dependent parameters share one decorator (("dtype", "rtol", "atol")); independent parameters use separate decorators.
  • Default dtype / device parameterization is relied on; op_meta and shape are added with explicit parametrize.
  • N/A: no flaky tests under parallelism added.
  • N/A: this is a feature PR, not a bug fix — no regression test required.

Build, CI, and Tooling

  • Builds cleanly from a fresh directory with pip install .[dev] — pending cross-platform CI.
  • compile_commands.json regenerates (no change to pyproject.toml's CMAKE_EXPORT_COMPILE_COMMANDS=ON).
  • N/A: no new backends / devices added.
  • N/A: CUDA-like GPU mutual exclusion not changed.
  • Both CI workflows (clang-format.yml, ruff.yml) are expected to be green — verified locally per-commit.
  • New build dependency pyyaml added to pyproject.toml's [build-system].requires.

Documentation

  • N/A: README.md, CONTRIBUTING.md, and developer workflow are unchanged for end users; the codegen docstring documents internal behaviour for maintainers.
  • Codegen docstring + per-section comments explain the codegen pipeline; the PyTorch slot-8 convention is documented in the _PYTORCH_SLOT comment.
  • N/A: no user-visible breaking change.

Security and Safety

  • No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
  • No third-party code added (the codegen reads aten/native_functions.yaml from PyTorch's GitHub but does not vendor it).
  • No unsafe pointer arithmetic, uninitialized reads, or missing bounds checks introduced.

voltjia added 12 commits May 9, 2026 13:24
The torch op codegen script imports `yaml` to parse `scripts/torch_ops.yaml`
and PyTorch's `native_functions.yaml`. Since CMake invokes the script at
configure time, PyYAML must be available in the build environment.
Frees the `infini::ops::Sigmoid` name for the auto-generated PyTorch
operator class emitted by the upcoming `scripts/generate_torch_ops.py`.
Adds two pieces used by the upcoming pybind bindings for auto-generated
torch ops:

  - `detail::ListContains` and an early-out in
    `Operator::active_implementation_indices` so querying impls for a
    device the op does not support returns an empty vector instead of
    crashing in `DispatchFunc`.

  - `TryDeviceTypeFromString` returning `std::optional<Device::Type>`,
    so generated bindings can resolve a device name without aborting on
    unrecognized inputs.
For each entry in `scripts/torch_ops.yaml`, the script finds the
matching `.out` variant in PyTorch's `native_functions.yaml` (fetched
from GitHub on first invocation, cached under `generated/.cache/`),
parses its schema, and emits an InfiniOps base class plus a PyTorch
backend specialization at slot 8 that wraps `at::<op>_out`.

Key strategies:

  - Overload-aware lookup: prefers `<name>.out` then any
    `<name>.<overload>_out`, picking the variant with the most tensor
    inputs (so `pow.Tensor_Tensor_out` wins over `pow.Tensor_Scalar_out`).

  - Hidden-parameter pattern: optional types (`Scalar?`, `int[]?`,
    `ScalarType?`, `Generator?`, …), `bool` defaults, numeric
    `int`/`float` defaults, `int[N]=[]` defaults, and ATen enum
    symbols (`Mean`, `Sum`) are filtered from the user-facing API
    and substituted at the ATen call site.  Unlocks reductions, scans,
    comparisons, losses, and multi-scalar activations from a single
    mechanism.

  - Slot 8: reserved for PyTorch backends; native and vendor
    implementations use 0–7.  Also avoids a partial-specialization-after-
    instantiation conflict with `Operator<Op>` at index 0.

  - Hand-written-base coexistence: if `src/base/<op>.h` exists, the
    generator skips emitting `generated/base/<op>.h` so the
    hand-written one wins.  Ops whose pre-existing hand-written base
    has a different parameter shape (`add`, `linear`, `matmul`,
    `mul`) are kept out of the YAML; including them would cause the
    generated torch override to mismatch the hand-written base.

  - Per-op metadata (`generated/torch_ops_metadata.json`): records the
    full parameter list per op for the test harness, so adding a new op
    to the allowlist requires no code changes.
When `WITH_TORCH=ON`, run `scripts/generate_torch_ops.py` at configure
time and add the generated tree to the torch source glob and include
path.  Vendor compilers (`mxcc`/`mcc`) get the same include via the
system-`g++` torch recompile loop.  When Python bindings are enabled,
also install `generated/torch_ops_metadata.json` so the torch-op test
can discover the generated catalog at runtime.
Three changes that let `generate_wrappers.py` see the codegen output:

  - `_find_base_header` resolves an op's base in `src/base/` first,
    then `generated/base/` — mirroring the C++ include-path order so a
    hand-written base wins.  `_OperatorExtractor`,
    `_find_optional_tensor_params`, and `_find_vector_tensor_params`
    use it; clang's parser also picks up `-I generated` so the include
    in a generated torch source resolves through the parser too.

  - `_get_all_ops` now scans both base directories and both impl roots
    (`src/` and `generated/`), so generated PyTorch backends are
    bound alongside hand-written ones.  `_to_include_path` strips
    either `src/` or `generated/` when emitting legacy-C `#include`
    directives.

  - Active-impl device lookup goes through the new
    `TryDeviceTypeFromString<Self>(device)` helper, returning an empty
    vector for an unknown name instead of aborting.

Also wipes the bindings/src/include output trees at start so files for
ops removed from the active set do not linger and get globbed by the
next build, and pulls `_get_system_include_flags` out as a
module-level `lru_cache` (the `subprocess` probes were the slow
path).
Tensor parameters bind to `py::object`, which accepts any Python value
and only rejects inside `TensorFromPybind11Handle` at runtime.  When
a class has both scalar and Tensor overloads of `__call__` or its
constructor (e.g. `pow.Tensor_Tensor_out` vs `pow.Tensor_Scalar_out`),
pybind's overload resolver tries them in registration order, so the
`Tensor` signature swallows scalar calls if it sits first and the call
aborts inside the conversion.

`_overload_order_key` sorts by (object-like-arg count ascending, total
arg count descending), so the most-specific signature is registered
first and pybind walks toward more permissive ones only on a real
type-mismatch.  While here, rename the `__call__` lambda's first
parameter from `self` to `op` so it does not collide with ATen ops
that take a parameter literally named `self`.
A single parametrized `test_op` reads `generated/torch_ops_metadata.json`
(installed alongside the bindings, with a fallback to the source-tree
copy), synthesises inputs by parameter type, calls the InfiniOps
wrapper at slot 8, and compares each output tensor against `torch.<op>`
or its `torch.special` / `torch.nn.functional` counterpart.  Adding
an op to `scripts/torch_ops.yaml` extends coverage with no test
changes.

Skip-lists narrow the harness around known harness limitations: vendor
kernels that lack a given (op, dtype, device) combination, random ops
whose RNG state diverges from a fresh torch reference, low-precision
reductions where the functional and `_out` paths diverge, ops that
fire CUDA device-side asserts on random inputs, and ops whose inputs
or outputs use dtypes outside the InfiniOps `DataType` enum.

`tests/conftest.py` now compares non-floating outputs with
`torch.equal` (since `torch.allclose` rejects `bool`) and passes
`equal_nan=True` for floats so symmetric NaNs (common for special
functions fed out-of-domain inputs) do not fail the test.
Reviewers consistently flagged class names like `xlogy_outtensor`,
`triangular_solve_x`, `*_grad_input`, `*_forward_output`,
`*_n_scalar`, `*_dim_values`, `*_values_stable` etc. as bad
public-API naming — the suffix is just an ATen schema artifact and
carries no semantic info.

Use only the canonical `aten_name` for the InfiniOps class; multiple
ATen overloads of the same base op (e.g. `scatter.src`,
`scatter.value`, `scatter.reduce`) become overloaded `operator()`
methods on a single `Scatter` class, with tensor metadata members
shared across overloads.  Overloads that collapse to identical
visible C++ signatures after hidden defaults are still deduped by
`_dedupe_visible_overloads`.

The test harness's parametrize-id falls back to `overload_name` so
pytest does not collide ids between overloads.
Reviewers flagged on multiple PRs that scalar parameters such as `n`
on `special_chebyshev_polynomial_v` were declared in the
constructor but never stored on the class — leaving the backend with
no way to read them outside of `operator()`.  Add a `<type>
<name>_;` member for every visible non-tensor parameter, initialized
from the matching constructor argument.

Same-named scalars across overloads must agree on type; if a later
overload disagrees, that overload's value is left default-constructed
rather than emitting a conflicting member.  Tensor metadata members
(`<name>_shape_`, `_strides_`, `_type_`) keep their existing
union-across-overloads behaviour.
Reviewers consistently flagged on multiple PRs that semantically
critical default-valued parameters were being hidden by the codegen:

  - `bool upper`, `bool transpose`, `bool unitriangular` on
    `triangular_solve` (PR #580)
  - `int diagonal` on `triu` (PR #509)
  - `int n` on the `special_chebyshev_polynomial_*` family
  - `str ord` on `linalg_matrix_norm` (PR #280)
  - `int[N]` dims with `[]` defaults on reductions

These were hidden because they have a default in ATen's schema, but
defaults do not equal "optional to expose".  Stop hiding non-optional
default-valued params; they are now visible in the generated
`operator()` signatures and forwarded to ATen.

Optional ATen types (`Tensor?`, `Scalar?`, `int?`, …) remain hidden
for now — exposing them properly requires threading `std::optional`
through to ATen, which is a larger refactor and tracked separately.
…tion

libclang silently reports the type of `std::vector<int64_t>` parameters
as `int` on systems where the STL headers are not fully indexable
(observed under the NVIDIA build's libclang).  The fallback type then
leaks into the generated binding as `const int padding` instead of
`const std::vector<int64_t> padding`, and the binding's call to the
base operator fails to compile with a long instantiation trace at
`Operator::operator()` for any op with `int[N]` schema parameters
(im2col, col2im, reflection_pad*, replication_pad*, fft_*, upsample_*,
nuclear_norm, …).

Adopt the same regex-scan workaround already used for
`std::optional<Tensor>` and `std::vector<Tensor>` parameters: scan
the base header text for `std::vector<int64_t> <name>` declarations and
emit the binding parameter with that exact type, bypassing libclang's
inferred spelling.
@voltjia voltjia requested a review from a team May 9, 2026 07:58
voltjia added 16 commits May 9, 2026 16:13
The wrapper generator picked up `generated/base/<op>.h` headers
unconditionally whenever the directory existed.  When a CI container
inherits a `generated/` tree via rsync but configures with
`WITH_TORCH=OFF` (so the codegen never re-runs and the matching
torch sources never compile), the generated bindings reference base
headers that are not on the include path of any compiled target —
`ops.cc` then fails with "fatal error: base/<op>.h: No such file or
directory".

Skip the `generated/base/` scan unless `--with-torch` is in effect,
mirroring the existing gate on `generated/torch/`.
ATen names the first tensor parameter `self` to mirror the
method-style invocation `tensor.abs()`.  InfiniOps' hand-written
bases (`Add`, `Gemm`, …) use `input` for the primary tensor input,
matching `CONTRIBUTING.md` §C++'s preference for PyTorch user-facing
naming conventions over PyTorch internal C++ names.

Rename `self` → `input` at parse time so generated headers stay
consistent with hand-written ones.
The generated torch source instantiated all 10 `Operator<Op, kDev,
8>` device specializations unconditionally.  Each instantiation pulls
in a deep ATen template tree that costs roughly 0.5-1 GB of RSS
during compilation; when the build compiles 451 ops in parallel
(scikit-build's default ninja `-j$(nproc)`), peak memory exceeds
what some CI containers can spare, and `cc1plus` is killed by the
OOM killer.

Guard each explicit instantiation with `#ifdef WITH_<DEV>`.  Each
`WITH_<DEV>` macro is set by `target_compile_definitions` (or, for
`WITH_METAX` / `WITH_MOORE` / `WITH_CPU`, added to the vendor
recompile loop's command line, since those sources are compiled
outside the cmake target with the system C++ compiler).  A typical
NVIDIA-only build now instantiates only `kCpu` + `kNvidia`, cutting
template instantiation work to 2 / 10.
The hand-written bases that get added via review (`src/base/<op>.h`)
do not carry an `AUTO-GENERATED` header.  Generated and reviewed
files end up with the same content otherwise — the marker becomes
the only visible difference and produces churn during the
`generated/` ↔ `src/base/` migration.  Drop the marker so a
hand-written base is byte-for-byte the same as the generated one.
Some generated signatures (e.g. `Xlogy::operator()(const Tensor input,
const Tensor other, Tensor out)` at 89 columns) overflow the 80-column
limit enforced by `.clang-format` and CI's `clang-format-action@v4`
running `clang-format` v21.  The codegen previously emitted them as
single lines, so every base PR ran into the same line-length violation
once the workflow re-ran.

Pipe each emitted header / source through the local `clang-format`
(passing `--assume-filename=<path>` so the include-order rule treats
each `.cc`'s own header as the primary include).  Adds ~30s to a
full regeneration but eliminates the recurring CI failure across
433+ PR branches.
The previous fix landed on a slightly older `ruff` version that
preferred a multi-line `base_path.write_text(\n  ...\n)` form; CI
runs the latest `ruff format --check` which collapses the line.
Reformatted to match upstream.
Each generated `<op>.cc` instantiates `at::<op>_out(...)`, which
expands roughly 0.5-1 GB of ATen template metaprogramming.  With 451
ops compiled in parallel at Ninja's default `-j$(nproc)`, peak
memory can exceed 30 GB and the OOM killer drops `cc1plus` on
build hosts that allocate less RAM (observed on metax, moore, and
cambricon CI containers).

Add a Ninja job pool `torch_compile=4` and apply it to:
  - the vendor-system-g++ `add_custom_command` recompile loop
    (metax / moore), via `JOB_POOL`;
  - a new `infiniops_torch_objs` OBJECT library for the regular
    cmake build path (cambricon / nvidia / iluvatar), via
    `JOB_POOL_COMPILE`.

The rest of the build keeps full parallelism.
The codegen pipes generated headers/sources through `clang-format`
to satisfy CI's style check.  CI containers (metax, moore, cambricon)
do not ship a system `clang-format` binary, so cmake-time codegen
fails with `FileNotFoundError: clang-format`.

Pin it as a build dep so `pip install` provisions `clang-format`
into the build env before scikit-build invokes cmake.
CI containers running with `--no-build-isolation` (metax, moore,
cambricon) skip `[build-system].requires` and never install
`clang-format` from PyPI; system packages do not provide it
either, so cmake-time codegen fails with `FileNotFoundError`.

Probe `PATH` for `clang-format` at codegen entry; if missing,
`pip install clang-format` into the running interpreter and reuse
the installed binary.  Adds at most a couple of seconds to a
first-time configure on hosts without the binary.
Some CI containers (metax, cambricon) run offline and cannot reach
PyPI; `pip install clang-format` fails with name-resolution errors
and the codegen aborts before any output is written.

Generated files live under `generated/` (gitignored), so they do
not need to satisfy the repo-level `clang-format` check — they only
need to compile.  Fall through to writing unformatted output when no
`clang-format` binary is reachable.  When a binary is available
(local dev, online CI), formatting still happens and the output that
gets pushed to `src/base/<op>.h` for hand-written-base PRs stays
clang-format-clean.
`target_link_libraries(infiniops_torch_objs PUBLIC infiniops)` and
`target_sources(infiniops PRIVATE $<TARGET_OBJECTS:...>)` form a
cycle that cmake rejects on cambricon
("Cyclic dependencies are allowed only among static libraries").

Inherit `infiniops`'s include directories, compile definitions, and
compile options via `$<TARGET_PROPERTY>` generator expressions
instead of linking, so the object library compiles with the same
settings without a back-edge to `infiniops`.
`torch_mlu` is pinned to an older ATen release whose `<op>_out`
overloads do not match the codegen's `pytorch v2.4.0` schema.  For
example, `at::all_out` in `torch_mlu` only accepts `int64_t dim`
or `at::Dimname dim`, while the codegen emits
`c10::optional<at::IntArrayRef> dim` (the v2.4.0 `all.dims_out`
shape).  The build dies with no-known-conversion errors on the first
such op.

Skip auto-detecting PyTorch on Cambricon for now; the WITH_TORCH
backend can be opted in explicitly with `-DWITH_TORCH=ON` once the
`torch_mlu` fork catches up with the upstream schema.
Two classes of false failures observed in the cross-platform run:

  - Multiple ATen overloads sharing one `aten_name` (e.g. `std.dim`
    and `std.correction`) all map to a single InfiniOps class but
    have different ATen-side semantics for hidden defaults.  The
    harness builds the same reference call (`torch.<op>(...)`) for
    every overload, so the secondary overload's nullopt-default
    behaviour disagrees with the reference.  Keep only the first
    overload of each `aten_name`.

  - `binary_cross_entropy` / `binary_cross_entropy_backward` carry
    `weight: Tensor?` (hidden) between visible inputs and
    `reduction: int` (now visible).  The harness passes inputs
    positionally, so `reduction` lands on the reference's `weight`
    parameter and `F.binary_cross_entropy` crashes inside
    `weight.size()`.  Skip these ops; the wrapper itself is fine.
`torch.uint16` / `uint32` / `uint64` only exist in PyTorch ≥ 2.3.
Vendor forks pinned to older releases (cambricon's `torch_mlu`)
fail collection at module import with
`AttributeError: module 'torch' has no attribute 'uint16'`.

Look up each dtype attribute via `getattr` and drop the missing
ones from the supported set.
`mode` blocks indefinitely inside `at::mode_out` when `self` is a
MUSA tensor, which hangs the entire CI run for ~30 min before pytest
gives up.  Add a vendor-hang skip list and put `mode` in it; remove
when the `torch_musa` kernel is fixed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant