
PULPOpen Training Harness: tiling-ready backward and optimizer deployment #174

Draft
runwangdl wants to merge 183 commits into pulp-platform:devel from runwangdl:TrainingPlatform

Conversation

@runwangdl
Contributor

Adds end-to-end on-device training graph deployment support for PULPOpen/Siracusa targets. This includes a full code-generation pipeline for training networks (forward + backward + optimizer), tiling support for gradient operators, and the necessary runtime harness changes to run SGD-based on-device learning.
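To make the "forward + backward + optimizer" pipeline concrete, here is a minimal plain-C sketch of the SGD update applied after each backward pass. This is an illustration only: the function name `sgd_step_fp32` is hypothetical, and the PR's actual optimizer kernels are target-specific and tiled.

```c
#include <stddef.h>

// Hypothetical sketch of one SGD optimizer step: after the backward pass
// has filled the gradient buffer, each weight moves against its gradient.
static void sgd_step_fp32(float *weights, const float *grads,
                          size_t n, float lr) {
    for (size_t i = 0; i < n; i++) {
        weights[i] -= lr * grads[i];  // w <- w - lr * dL/dw
    }
}
```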

Added

  • generateTrainingNetwork.py — CLI script to generate tiled training C code; supports --tiling, --l1, --l2, --doublebuffer, --defaultMemLevel
  • deeployTrainingRunner_siracusa.py — end-to-end training test driver for Siracusa
  • InPlaceAccumulatorV2 operator: parser, type checker, template, bindings, and SBTiler-based tile constraint (gradient accumulation buffer)
  • SoftmaxCrossEntropyLoss dual-output variant (loss scalar + log_prob): separate parser, checker, template, bindings, and MultiOutputMixin-based tile constraint
  • ConvGradX / ConvGradW / ConvGradB operators split from ConvGrad via SplitConvGradPass: individual parsers, templates, and bindings for each
  • MultiOutputTileConstraint framework (MultiOutputMixin, ScalarOutputAppender, FullTensorOutputAppender) — generic mechanism for wrapping multi-output tile constraints without per-operator boilerplate
  • deeploytraintest.c — C harness for running training steps on device, with mb % TRAINING_DATA_SIZE data cycling and post-init grad buffer memset
  • testinputs.h with TRAINING_DATA_SIZE, TRAINING_GRAD_BUF_START_IDX, TRAINING_NUM_GRAD_INPUTS macros
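The `mb % TRAINING_DATA_SIZE` data cycling mentioned above can be sketched as follows. The macro value and the helper `training_sample_index` are illustrative, not the generated code:

```c
// Illustrative sketch of the data-cycling scheme used by the C harness:
// only TRAINING_DATA_SIZE unique samples are stored on device, and
// training step mb reuses sample mb % TRAINING_DATA_SIZE rather than
// requiring one stored sample per step. Macro value is made up.
#define TRAINING_DATA_SIZE 4

static int training_sample_index(int mb) {
    return mb % TRAINING_DATA_SIZE;
}
```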

Changed

  • inputs.npz / outputs.npz format — added meta_data_size (unique samples stored) and meta_n_batches (total training steps) keys; C harness cycles data via mb % TRAINING_DATA_SIZE instead of storing all batches

  • TilerExtension.py — _setupTensorDimensionProducts and _setupHeuristics now receive layerBinding as parameter; four hasattr(template, 'tileConstraint') guards added so non-tileable ops (e.g. ConvGradB) execute on their current memory level without blocking the tiler

Fixed

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR is reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the Docker image was modified, change its link back after review.

runwangdl and others added 3 commits March 11, 2026 17:40
Implements the backward pass of MaxPool2D (MaxPoolGrad) following the
same architecture as AveragePoolGrad. The gradient is scattered only
to the argmax position in each pooling window (re-computed from the
original forward input), unlike AveragePoolGrad which distributes
uniformly.

New files:
- TargetLibraries/PULPOpen/inc/kernel/MaxPool.h: declare PULP_MaxPoolGrad2d_fp32_fp32_HWC
- TargetLibraries/PULPOpen/src/MaxPool.c: implement MaxPoolGrad kernel
  (zero-init + argmax scatter, channel-parallel across cores)
- Deeploy/Targets/PULPOpen/TileConstraints/MaxPoolGradTileConstraint.py:
  MaxPoolGradCTileConstraint (channel-tiling for 3 tensors:
  grad_output, original_input, grad_input)
- Deeploy/Targets/PULPOpen/Templates/FloatMaxPoolTemplate.py: add
  referenceGradTemplate calling the new kernel

Modified files:
- Generic/Parsers.py: MaxPoolGradParser (2 inputs, 1 output, same attrs as MaxPool)
- Generic/Layers.py: MaxPoolGradLayer
- Generic/TypeCheckers.py: MaxPoolGradChecker (2 float32 in, 1 float32 out)
- PULPOpen/Bindings.py: PULPMaxPoolGrad2DBindings
- PULPOpen/Tiler.py: PULPMaxPoolGrad2DTilingReadyBindings
- PULPOpen/Platform.py: MaxPoolGrad2DMapper + 'MaxPoolGrad' in PULPMapping

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
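The argmax-scatter idea described above can be sketched in plain C for a 1D, non-overlapping window (the real `PULP_MaxPoolGrad2d_fp32_fp32_HWC` kernel operates on 2D HWC tensors and parallelizes across cores; this simplified function is hypothetical):

```c
#include <stddef.h>

// Simplified 1D sketch of MaxPoolGrad: zero-init the input gradient,
// re-find the argmax in each pooling window of the original forward
// input, then scatter the whole incoming gradient to that one position
// (unlike AveragePoolGrad, which distributes it uniformly).
static void maxpool_grad_1d(const float *x, const float *grad_out,
                            float *grad_in, size_t n, size_t win) {
    for (size_t i = 0; i < n; i++) grad_in[i] = 0.0f;  // zero-init
    for (size_t o = 0; o * win < n; o++) {
        size_t base = o * win, arg = base;
        for (size_t k = 1; k < win && base + k < n; k++)
            if (x[base + k] > x[arg]) arg = base + k;  // recompute argmax
        grad_in[arg] += grad_out[o];                   // argmax scatter
    }
}
```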
@runwangdl runwangdl marked this pull request as draft March 12, 2026 01:30
@runwangdl runwangdl self-assigned this Mar 12, 2026
Bump pulp-trainlib to 37f70e5 (CNNTiling):
- DW ConvGradW/X: padded kernels for non-zero padding or stride != 1
- im2col: fix early-return bug blocking stride > 1 weight gradients
- Conv2D bw param grads: pass actual padding to im2col (was zero)
Add _infer_n_accum() to read meta_n_accum from inputs.npz.
When neither --n-steps nor --n-accum is given on the command
line, n_accum is now read from the npz (written by the exporter)
instead of defaulting to 1.  Same fix applied to testMVPTraining.py.
…er+DSCNN train models

- MSELoss/MSELossGrad: full op support (parser, checker, bindings, tile constraint, template,
  platform registration)
- ConvGradB: new ConvGradBTileConstraint (tiles C, keeps N/H/W full); ConvGradBMapper now uses
  tiling-ready bindings
- MaxPoolGrad: RewriteMaxPoolGradInputPass added to PULPOptimizer; MaxPoolGradParser handles
  ORT [dY, mask_indices] format by traversing back to forward input X
- BatchNorm tile constraints: _in_solution guards for global params (scale/bias/running_mean/
  running_var, gamma/saved_mean/saved_inv_std) matching Conv.bias_in_solution pattern
- MchanDma: fix 17-bit assertion (compare size directly, not log2); add transfer() override
  to chunk 1D transfers exceeding 131072-byte limit into multiple commands
- TilerModel: auto-register tensor dims in addTensorNumEltToModel for untiled ops; add
  debug output for infeasible constraints (last-OK / first-failing)
- New test models: Autoencoder_Train (MSELoss forward+backward) and DS-CNN_Train
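The MchanDma chunking fix above can be illustrated generically: the 131072-byte limit follows from the 17-bit length field, and any 1D transfer beyond it must be split into multiple commands. `dma_cmd_1d` below is a hypothetical stand-in for issuing one hardware command, modeled as a memcpy for illustration:

```c
#include <stddef.h>
#include <string.h>

#define MCHAN_MAX_1D_LEN 131072u  // 17-bit transfer-length limit (2^17)

// Hypothetical stand-in for one hardware 1D DMA command.
static void dma_cmd_1d(char *dst, const char *src, size_t len) {
    memcpy(dst, src, len);  // models the copy in plain C
}

// Split an oversized 1D transfer into commands that each fit the
// 17-bit length field, as the transfer() override described above does.
static void dma_transfer_chunked(char *dst, const char *src, size_t len) {
    while (len > 0) {
        size_t chunk = len > MCHAN_MAX_1D_LEN ? MCHAN_MAX_1D_LEN : len;
        dma_cmd_1d(dst, src, chunk);
        dst += chunk; src += chunk; len -= chunk;
    }
}
```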
@runwangdl runwangdl added the Feature Addition of new features label Mar 19, 2026
L3 training harness (deeploytraintest.c, GAP9 + Siracusa):
- Add IS_L2() macro and l3_aware_copy() helper for L3-aware data movement
  using ram_read/ram_write (FC-side DMA) following deeploytest.c patterns
- Handle all 6 data-movement sites: gradient zero-init (chunked DMA),
  initial weight copy, per-mini-batch data load, lazy_reset_grad write,
  loss readback, and optimizer buffer sync
- Increase MAINSTACKSIZE to 12000 (Siracusa) and enable
  CONFIG_CL_MASTER_CORE_STACK_SIZE=14000 (GAP9 sdk.config) for large
  L3-tiled training functions with deep closure chains
- Increase GAP9 SLAVESTACKSIZE to 6000
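The IS_L2()/l3_aware_copy() pattern can be sketched as below. The L2 window constants are illustrative (not the real Siracusa/GAP9 memory map), and `ram_copy_stub` stands in for the FC-side `ram_read`/`ram_write` DMA path:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Illustrative L2 address window; real values depend on the SoC memory map.
#define L2_BASE 0x1C000000u
#define L2_SIZE 0x00100000u
#define IS_L2(p) ((uintptr_t)(p) >= L2_BASE && \
                  (uintptr_t)(p) < L2_BASE + L2_SIZE)

// Stand-in for ram_read/ram_write (FC-side DMA to external L3 RAM).
static void ram_copy_stub(void *dst, const void *src, size_t len) {
    memcpy(dst, src, len);  // placeholder so the sketch runs on a host
}

// Route the copy based on where both endpoints live: on-chip L2 pairs
// take a plain copy, anything touching L3 goes through the RAM helper.
static void l3_aware_copy(void *dst, const void *src, size_t len) {
    if (IS_L2(dst) && IS_L2(src))
        memcpy(dst, src, len);
    else
        ram_copy_stub(dst, src, len);
}
```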

Simulation runner (execution.py):
- Add hex file detection and --flash-property flags for L3 flash images
- Siracusa: hyperflash:readfs:files via gvsoc
- GAP9: gapy with readfs_flash layout (mirrors cmake/gap9/gap9_gvsoc.cmake)

Code generation (codeGenerate.py):
- Generate L3 hex dumps for training network inputs (data, weights,
  gradients, lazy_reset_grad) so InitTrainingNetwork can load them
  from flash via load_file_to_ram

Revert training-introduced regressions to devel core logic:
- MemoryConstraintFlows.py: remove ConstantBuffer kill-set skip that
  prevented constants from entering tensorMemoryConstraints, breaking
  L3 DMA generation for weights/biases
- GEMMTileConstraint.py: remove conditional bias DMA skip that checked
  tensorMemoryConstraints membership — bias must always be DMA'd
- ConvTileConstraint.py: same fix for Conv2D weight/bias DMA schedule
- ConvTemplate.py: restore im2col buffer formula (8*2*ch_in*kernel_y)
  that was incorrectly changed, causing L1 buffer underallocation

Tested: simplemlp, dscnn, lightweightcnn, autoencoder, tinytransformer
pass on both Siracusa and GAP9 with --defaultMemLevel=L3.