
PULPOpen Training Harness: tiling-ready backward and optimizer deployment #174

Draft
runwangdl wants to merge 183 commits into pulp-platform:devel from runwangdl:TrainingPlatform

Conversation

@runwangdl
Contributor

Adds end-to-end on-device training graph deployment support for PULPOpen/Siracusa targets. This includes a full code-generation pipeline for training networks (forward + backward + optimizer), tiling support for gradient operators, and the necessary runtime harness changes to run SGD-based on-device learning.
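To make the "forward + backward + optimizer" pipeline concrete, here is a minimal plain-C sketch of the SGD update applied after each backward pass. This is an illustration only: the function name `sgd_step_fp32` is hypothetical, and the PR's actual optimizer kernels are target-specific and tiled.

```c
#include <stddef.h>

// Hypothetical sketch of one SGD optimizer step: after the backward pass
// has filled the gradient buffer, each weight moves against its gradient.
static void sgd_step_fp32(float *weights, const float *grads,
                          size_t n, float lr) {
    for (size_t i = 0; i < n; i++) {
        weights[i] -= lr * grads[i];  // w <- w - lr * dL/dw
    }
}
```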

Added

  • generateTrainingNetwork.py — CLI script to generate tiled training C code; supports --tiling, --l1, --l2, --doublebuffer, --defaultMemLevel
  • deeployTrainingRunner_siracusa.py — end-to-end training test driver for Siracusa
  • InPlaceAccumulatorV2 operator: parser, type checker, template, bindings, and SBTiler-based tile constraint (gradient accumulation buffer)
  • SoftmaxCrossEntropyLoss dual-output variant (loss scalar + log_prob): separate parser, checker, template, bindings, and MultiOutputMixin-based tile constraint
  • ConvGradX / ConvGradW / ConvGradB operators split from ConvGrad via SplitConvGradPass: individual parsers, templates, and bindings for each
  • MultiOutputTileConstraint framework (MultiOutputMixin, ScalarOutputAppender, FullTensorOutputAppender) — generic mechanism for wrapping multi-output tile constraints without per-operator boilerplate
  • deeploytraintest.c — C harness for running training steps on device, with mb % TRAINING_DATA_SIZE data cycling and post-init grad buffer memset
  • testinputs.h with TRAINING_DATA_SIZE, TRAINING_GRAD_BUF_START_IDX, TRAINING_NUM_GRAD_INPUTS macros
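The `mb % TRAINING_DATA_SIZE` data cycling mentioned above can be sketched as follows. The macro value and the helper `training_sample_index` are illustrative, not the generated code:

```c
// Illustrative sketch of the data-cycling scheme used by the C harness:
// only TRAINING_DATA_SIZE unique samples are stored on device, and
// training step mb reuses sample mb % TRAINING_DATA_SIZE rather than
// requiring one stored sample per step. Macro value is made up.
#define TRAINING_DATA_SIZE 4

static int training_sample_index(int mb) {
    return mb % TRAINING_DATA_SIZE;
}
```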

Changed

  • inputs.npz / outputs.npz format — added meta_data_size (unique samples stored) and meta_n_batches (total training steps) keys; C harness cycles data via mb % TRAINING_DATA_SIZE instead of storing all batches

  • TilerExtension.py — _setupTensorDimensionProducts and _setupHeuristics now receive layerBinding as parameter; four hasattr(template, 'tileConstraint') guards added so non-tileable ops (e.g. ConvGradB) execute on their current memory level without blocking the tiler

Fixed

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR is reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the Docker image was modified, change its link back after review.

runwangdl and others added 3 commits March 11, 2026 17:40
Implements the backward pass of MaxPool2D (MaxPoolGrad) following the
same architecture as AveragePoolGrad. The gradient is scattered only
to the argmax position in each pooling window (re-computed from the
original forward input), unlike AveragePoolGrad which distributes
uniformly.

New files:
- TargetLibraries/PULPOpen/inc/kernel/MaxPool.h: declare PULP_MaxPoolGrad2d_fp32_fp32_HWC
- TargetLibraries/PULPOpen/src/MaxPool.c: implement MaxPoolGrad kernel
  (zero-init + argmax scatter, channel-parallel across cores)
- Deeploy/Targets/PULPOpen/TileConstraints/MaxPoolGradTileConstraint.py:
  MaxPoolGradCTileConstraint (channel-tiling for 3 tensors:
  grad_output, original_input, grad_input)
- Deeploy/Targets/PULPOpen/Templates/FloatMaxPoolTemplate.py: add
  referenceGradTemplate calling the new kernel

Modified files:
- Generic/Parsers.py: MaxPoolGradParser (2 inputs, 1 output, same attrs as MaxPool)
- Generic/Layers.py: MaxPoolGradLayer
- Generic/TypeCheckers.py: MaxPoolGradChecker (2 float32 in, 1 float32 out)
- PULPOpen/Bindings.py: PULPMaxPoolGrad2DBindings
- PULPOpen/Tiler.py: PULPMaxPoolGrad2DTilingReadyBindings
- PULPOpen/Platform.py: MaxPoolGrad2DMapper + 'MaxPoolGrad' in PULPMapping

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
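The argmax-scatter idea described above can be sketched in plain C for a 1D, non-overlapping window (the real `PULP_MaxPoolGrad2d_fp32_fp32_HWC` kernel operates on 2D HWC tensors and parallelizes across cores; this simplified function is hypothetical):

```c
#include <stddef.h>

// Simplified 1D sketch of MaxPoolGrad: zero-init the input gradient,
// re-find the argmax in each pooling window of the original forward
// input, then scatter the whole incoming gradient to that one position
// (unlike AveragePoolGrad, which distributes it uniformly).
static void maxpool_grad_1d(const float *x, const float *grad_out,
                            float *grad_in, size_t n, size_t win) {
    for (size_t i = 0; i < n; i++) grad_in[i] = 0.0f;  // zero-init
    for (size_t o = 0; o * win < n; o++) {
        size_t base = o * win, arg = base;
        for (size_t k = 1; k < win && base + k < n; k++)
            if (x[base + k] > x[arg]) arg = base + k;  // recompute argmax
        grad_in[arg] += grad_out[o];                   // argmax scatter
    }
}
```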
@runwangdl runwangdl marked this pull request as draft March 12, 2026 01:30
@runwangdl runwangdl self-assigned this Mar 12, 2026
Bump pulp-trainlib to 37f70e5 (CNNTiling):
- DW ConvGradW/X: padded kernels for non-zero padding or stride != 1
- im2col: fix early-return bug blocking stride > 1 weight gradients
- Conv2D bw param grads: pass actual padding to im2col (was zero)
Add _infer_n_accum() to read meta_n_accum from inputs.npz.
When neither --n-steps nor --n-accum is given on the command
line, n_accum is now read from the npz (written by the exporter)
instead of defaulting to 1.  Same fix applied to testMVPTraining.py.
…er+DSCNN train models

- MSELoss/MSELossGrad: full op support (parser, checker, bindings, tile constraint, template,
  platform registration)
- ConvGradB: new ConvGradBTileConstraint (tiles C, keeps N/H/W full); ConvGradBMapper now uses
  tiling-ready bindings
- MaxPoolGrad: RewriteMaxPoolGradInputPass added to PULPOptimizer; MaxPoolGradParser handles
  ORT [dY, mask_indices] format by traversing back to forward input X
- BatchNorm tile constraints: _in_solution guards for global params (scale/bias/running_mean/
  running_var, gamma/saved_mean/saved_inv_std) matching Conv.bias_in_solution pattern
- MchanDma: fix 17-bit assertion (compare size directly, not log2); add transfer() override
  to chunk 1D transfers exceeding 131072-byte limit into multiple commands
- TilerModel: auto-register tensor dims in addTensorNumEltToModel for untiled ops; add
  debug output for infeasible constraints (last-OK / first-failing)
- New test models: Autoencoder_Train (MSELoss forward+backward) and DS-CNN_Train
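The MchanDma chunking fix above can be illustrated generically: the 131072-byte limit follows from the 17-bit length field, and any 1D transfer beyond it must be split into multiple commands. `dma_cmd_1d` below is a hypothetical stand-in for issuing one hardware command, modeled as a memcpy for illustration:

```c
#include <stddef.h>
#include <string.h>

#define MCHAN_MAX_1D_LEN 131072u  // 17-bit transfer-length limit (2^17)

// Hypothetical stand-in for one hardware 1D DMA command.
static void dma_cmd_1d(char *dst, const char *src, size_t len) {
    memcpy(dst, src, len);  // models the copy in plain C
}

// Split an oversized 1D transfer into commands that each fit the
// 17-bit length field, as the transfer() override described above does.
static void dma_transfer_chunked(char *dst, const char *src, size_t len) {
    while (len > 0) {
        size_t chunk = len > MCHAN_MAX_1D_LEN ? MCHAN_MAX_1D_LEN : len;
        dma_cmd_1d(dst, src, chunk);
        dst += chunk; src += chunk; len -= chunk;
    }
}
```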
@runwangdl runwangdl added the Feature Addition of new features label Mar 19, 2026
L3 training harness (deeploytraintest.c, GAP9 + Siracusa):
- Add IS_L2() macro and l3_aware_copy() helper for L3-aware data movement
  using ram_read/ram_write (FC-side DMA) following deeploytest.c patterns
- Handle all 6 data-movement sites: gradient zero-init (chunked DMA),
  initial weight copy, per-mini-batch data load, lazy_reset_grad write,
  loss readback, and optimizer buffer sync
- Increase MAINSTACKSIZE to 12000 (Siracusa) and enable
  CONFIG_CL_MASTER_CORE_STACK_SIZE=14000 (GAP9 sdk.config) for large
  L3-tiled training functions with deep closure chains
- Increase GAP9 SLAVESTACKSIZE to 6000
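The IS_L2()/l3_aware_copy() pattern can be sketched as below. The L2 window constants are illustrative (not the real Siracusa/GAP9 memory map), and `ram_copy_stub` stands in for the FC-side `ram_read`/`ram_write` DMA path:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Illustrative L2 address window; real values depend on the SoC memory map.
#define L2_BASE 0x1C000000u
#define L2_SIZE 0x00100000u
#define IS_L2(p) ((uintptr_t)(p) >= L2_BASE && \
                  (uintptr_t)(p) < L2_BASE + L2_SIZE)

// Stand-in for ram_read/ram_write (FC-side DMA to external L3 RAM).
static void ram_copy_stub(void *dst, const void *src, size_t len) {
    memcpy(dst, src, len);  // placeholder so the sketch runs on a host
}

// Route the copy based on where both endpoints live: on-chip L2 pairs
// take a plain copy, anything touching L3 goes through the RAM helper.
static void l3_aware_copy(void *dst, const void *src, size_t len) {
    if (IS_L2(dst) && IS_L2(src))
        memcpy(dst, src, len);
    else
        ram_copy_stub(dst, src, len);
}
```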

Simulation runner (execution.py):
- Add hex file detection and --flash-property flags for L3 flash images
- Siracusa: hyperflash:readfs:files via gvsoc
- GAP9: gapy with readfs_flash layout (mirrors cmake/gap9/gap9_gvsoc.cmake)

Code generation (codeGenerate.py):
- Generate L3 hex dumps for training network inputs (data, weights,
  gradients, lazy_reset_grad) so InitTrainingNetwork can load them
  from flash via load_file_to_ram

Revert training-introduced regressions to devel core logic:
- MemoryConstraintFlows.py: remove ConstantBuffer kill-set skip that
  prevented constants from entering tensorMemoryConstraints, breaking
  L3 DMA generation for weights/biases
- GEMMTileConstraint.py: remove conditional bias DMA skip that checked
  tensorMemoryConstraints membership — bias must always be DMA'd
- ConvTileConstraint.py: same fix for Conv2D weight/bias DMA schedule
- ConvTemplate.py: restore im2col buffer formula (8*2*ch_in*kernel_y)
  that was incorrectly changed, causing L1 buffer underallocation

Tested: simplemlp, dscnn, lightweightcnn, autoencoder, tinytransformer
pass on both Siracusa and GAP9 with --defaultMemLevel=L3.