PULPOpen Training Harness: tiling-ready backward and optimizer deployment #174
Draft
runwangdl wants to merge 183 commits into pulp-platform:devel from …
Conversation
… have finite lifetime
…y are I/O buffers
Implements the backward pass of MaxPool2D (MaxPoolGrad) following the same architecture as AveragePoolGrad. The gradient is scattered only to the argmax position in each pooling window (re-computed from the original forward input), unlike AveragePoolGrad, which distributes it uniformly.

New files:
- TargetLibraries/PULPOpen/inc/kernel/MaxPool.h: declare PULP_MaxPoolGrad2d_fp32_fp32_HWC
- TargetLibraries/PULPOpen/src/MaxPool.c: implement MaxPoolGrad kernel (zero-init + argmax scatter, channel-parallel across cores)
- Deeploy/Targets/PULPOpen/TileConstraints/MaxPoolGradTileConstraint.py: MaxPoolGradCTileConstraint (channel tiling for 3 tensors: grad_output, original_input, grad_input)
- Deeploy/Targets/PULPOpen/Templates/FloatMaxPoolTemplate.py: add referenceGradTemplate calling the new kernel

Modified files:
- Generic/Parsers.py: MaxPoolGradParser (2 inputs, 1 output, same attrs as MaxPool)
- Generic/Layers.py: MaxPoolGradLayer
- Generic/TypeCheckers.py: MaxPoolGradChecker (2 float32 in, 1 float32 out)
- PULPOpen/Bindings.py: PULPMaxPoolGrad2DBindings
- PULPOpen/Tiler.py: PULPMaxPoolGrad2DTilingReadyBindings
- PULPOpen/Platform.py: MaxPoolGrad2DMapper + 'MaxPoolGrad' in PULPMapping

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
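The scatter rule described above can be expressed as a short NumPy reference (a behavioral sketch only, not the C kernel; the single-channel plane, zero padding, and square window here are illustrative assumptions):

```python
import numpy as np

def maxpool_grad_2d_ref(grad_out, x, ksize, stride):
    """Reference MaxPoolGrad: scatter each output gradient onto the
    argmax position of its pooling window, re-computed from the
    forward input x. Single channel plane, no padding, for clarity."""
    Ho, Wo = grad_out.shape
    grad_in = np.zeros_like(x)  # zero-init, as in the C kernel
    for ho in range(Ho):
        for wo in range(Wo):
            h0, w0 = ho * stride, wo * stride
            window = x[h0:h0 + ksize, w0:w0 + ksize]
            # re-compute the argmax from the original forward input
            ih, iw = np.unravel_index(np.argmax(window), window.shape)
            grad_in[h0 + ih, w0 + iw] += grad_out[ho, wo]
    return grad_in
```

Unlike AveragePoolGrad, exactly one element per window receives the gradient; all other positions stay zero.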
Bump pulp-trainlib to 37f70e5 (CNNTiling):
- DW ConvGradW/X: padded kernels for non-zero padding or stride != 1
- im2col: fix early-return bug blocking stride > 1 weight gradients
- Conv2D bw param grads: pass actual padding to im2col (was zero)
Add _infer_n_accum() to read meta_n_accum from inputs.npz. When neither --n-steps nor --n-accum is given on the command line, n_accum is now read from the npz (written by the exporter) instead of defaulting to 1. The same fix is applied to testMVPTraining.py.
…er+DSCNN train models
- MSELoss/MSELossGrad: full op support (parser, checker, bindings, tile constraint, template, platform registration)
- ConvGradB: new ConvGradBTileConstraint (tiles C, keeps N/H/W full); ConvGradBMapper now uses tiling-ready bindings
- MaxPoolGrad: RewriteMaxPoolGradInputPass added to PULPOptimizer; MaxPoolGradParser handles the ORT [dY, mask_indices] format by traversing back to the forward input X
- BatchNorm tile constraints: _in_solution guards for global params (scale/bias/running_mean/running_var, gamma/saved_mean/saved_inv_std) matching the Conv.bias_in_solution pattern
- MchanDma: fix the 17-bit assertion (compare size directly, not log2); add a transfer() override to chunk 1D transfers exceeding the 131072-byte limit into multiple commands
- TilerModel: auto-register tensor dims in addTensorNumEltToModel for untiled ops; add debug output for infeasible constraints (last-OK / first-failing)
- New test models: Autoencoder_Train (MSELoss forward+backward) and DS-CNN_Train
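The Mchan transfer-size field is 17 bits wide (2^17 = 131072 bytes), which is why oversized 1D transfers must be split into several commands. The splitting arithmetic amounts to this (a sketch of the idea only; the function name is hypothetical and the real transfer() override issues DMA commands rather than returning a list):

```python
MCHAN_MAX_1D = 131072  # 17-bit transfer-size field limit: 2**17 bytes

def split_1d_transfer(size):
    """Sketch: break a 1D transfer of `size` bytes into
    (offset, length) chunks that each fit the 17-bit Mchan limit."""
    cmds, off = [], 0
    while off < size:
        length = min(MCHAN_MAX_1D, size - off)
        cmds.append((off, length))
        off += length
    return cmds
```

Note the related assertion fix: comparing the size directly against the limit avoids the off-by-one ambiguity of comparing log2 values.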
This reverts commit 7ba871c.
L3 training harness (deeploytraintest.c, GAP9 + Siracusa):
- Add IS_L2() macro and l3_aware_copy() helper for L3-aware data movement using ram_read/ram_write (FC-side DMA), following deeploytest.c patterns
- Handle all 6 data-movement sites: gradient zero-init (chunked DMA), initial weight copy, per-mini-batch data load, lazy_reset_grad write, loss readback, and optimizer buffer sync
- Increase MAINSTACKSIZE to 12000 (Siracusa) and enable CONFIG_CL_MASTER_CORE_STACK_SIZE=14000 (GAP9 sdk.config) for large L3-tiled training functions with deep closure chains
- Increase GAP9 SLAVESTACKSIZE to 6000

Simulation runner (execution.py):
- Add hex-file detection and --flash-property flags for L3 flash images
- Siracusa: hyperflash:readfs:files via gvsoc
- GAP9: gapy with readfs_flash layout (mirrors cmake/gap9/gap9_gvsoc.cmake)

Code generation (codeGenerate.py):
- Generate L3 hex dumps for training network inputs (data, weights, gradients, lazy_reset_grad) so InitTrainingNetwork can load them from flash via load_file_to_ram

Revert training-introduced regressions to devel core logic:
- MemoryConstraintFlows.py: remove the ConstantBuffer kill-set skip that prevented constants from entering tensorMemoryConstraints, breaking L3 DMA generation for weights/biases
- GEMMTileConstraint.py: remove the conditional bias DMA skip that checked tensorMemoryConstraints membership; the bias must always be DMA'd
- ConvTileConstraint.py: same fix for the Conv2D weight/bias DMA schedule
- ConvTemplate.py: restore the im2col buffer formula (8*2*ch_in*kernel_y) that was incorrectly changed, causing L1 buffer underallocation

Tested: simplemlp, dscnn, lightweightcnn, autoencoder, tinytransformer pass on both Siracusa and GAP9 with --defaultMemLevel=L3.
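The IS_L2()/l3_aware_copy() dispatch reduces to an address-range check: copies between two L2 buffers can use plain memcpy, while any L3 endpoint must go through the FC-side ram driver. A Python sketch of that decision logic (the L2 window bounds and driver callbacks here are illustrative assumptions, not the real platform constants):

```python
# Illustrative L2 address window; real bounds are platform-specific.
L2_BASE, L2_SIZE = 0x1C000000, 0x00180000

def is_l2(addr):
    """Sketch of the IS_L2() check: is `addr` inside the L2 window?"""
    return L2_BASE <= addr < L2_BASE + L2_SIZE

def l3_aware_copy(dst, src, size, ram_write, ram_read, memcpy):
    """Sketch of l3_aware_copy(): memcpy when both ends live in L2,
    otherwise route through the FC-side ram driver for the L3 end."""
    if is_l2(dst) and is_l2(src):
        memcpy(dst, src, size)
    elif is_l2(src):
        ram_write(dst, src, size)   # L2 -> L3
    else:
        ram_read(dst, src, size)    # L3 -> L2
```

The same dispatch covers all six data-movement sites listed above, so each call site stays agnostic about where the tensor was placed.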
Adds end-to-end on-device training graph deployment support for PULPOpen/Siracusa targets. This includes a full code-generation pipeline for training networks (forward + backward + optimizer), tiling support for gradient operators, and the necessary runtime harness changes to run SGD-based on-device learning.
Added
Changed
inputs.npz / outputs.npz format — added meta_data_size (unique samples stored) and meta_n_batches (total training steps) keys; C harness cycles data via mb % TRAINING_DATA_SIZE instead of storing all batches
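The cycling scheme means only meta_data_size unique samples are stored while meta_n_batches steps are executed. A sketch of which stored sample each step loads (the harness itself is C; this is just the `mb % TRAINING_DATA_SIZE` index arithmetic in Python form, with a hypothetical function name):

```python
def training_load_schedule(n_batches, data_size):
    """Sketch: the stored-sample index loaded at each mini-batch
    step, mirroring `mb % TRAINING_DATA_SIZE` in the C harness."""
    return [mb % data_size for mb in range(n_batches)]
```

So with 2 stored samples and 5 training steps, the harness replays samples 0, 1, 0, 1, 0 instead of flashing 5 full batches.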
TilerExtension.py — _setupTensorDimensionProducts and _setupHeuristics now receive layerBinding as parameter; four hasattr(template, 'tileConstraint') guards added so non-tileable ops (e.g. ConvGradB) execute on their current memory level without blocking the tiler
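The guard pattern amounts to branching on whether a template exposes a tileConstraint at all (a minimal sketch; the attribute name comes from the text above, but the function and callback names here are hypothetical and the real tiler passes far more state):

```python
def schedule_op(template, run_tiler, run_untiled):
    """Sketch of the hasattr guard: only templates exposing a
    tileConstraint go through the tiler; ops without one
    (e.g. ConvGradB) run on their current memory level."""
    if hasattr(template, "tileConstraint"):
        return run_tiler(template)
    return run_untiled(template)
```

Without the guard, a single non-tileable op would abort tiling for the whole layer binding; with it, such ops simply fall through.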
Fixed
PR Merge Checklist
- … devel commit and pointing to devel.
- CHANGELOG.md file has been updated.