
Add membw subcommand: DDR bandwidth probe (memset / read / memcpy) #165

Merged
widgetii merged 1 commit into master from membw-issue-160 on May 14, 2026

Conversation

@widgetii (Member)

Closes #160.

Summary

ipctool membw runs three synthetic memory-bandwidth ops against large anonymous DDR buffers (mmap of /dev/zero, NOT malloc) and reports MB/s:

| Op | Bytes counted | Trustworthy across libcs? |
|---|---|---|
| write | sz | depends on libc memset vectorization |
| read | sz | yes: volatile uint32_t sum loop, libc-independent |
| copy | 2 × sz | depends on libc memcpy vectorization |

The read op is the most trustworthy number when comparing across firmwares (musl vs uClibc vs glibc); write / copy are bounded by libc memset/memcpy vectorization.
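The byte accounting and the libc-independent read loop can be sketched as follows (illustrative C with hypothetical `read_pass` / `bytes_counted` names; not necessarily ipctool's actual code):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* read op: libc-independent bandwidth probe; the volatile sum keeps
   the compiler from eliding the loads */
uint32_t read_pass(const uint32_t *buf, size_t words) {
    volatile uint32_t sum = 0;
    for (size_t i = 0; i < words; i++)
        sum += buf[i];
    return sum;
}

/* bytes credited per pass: write = sz, read = sz, copy = 2 * sz
   (copy touches both a source and a destination buffer) */
size_t bytes_counted(const char *op, size_t sz) {
    return strcmp(op, "copy") == 0 ? 2 * sz : sz;
}
```

With the defaults (16 MB buffers, 16 iters, all three ops) this credits 16 × (sz + sz + 2 × sz) = 1024 MB, consistent with the "~1 GB total" figure quoted below.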

CLI

Matches the existing clocks / cpubench shape:

ipctool membw [--size MB] [--iters N] [--ops set,...] [--json]
  --size MB     buffer size per pass (default: 16; must exceed L2)
  --iters N     passes per op        (default: 16)
  --ops a,b,c   comma list of write,read,copy (default: all)
  --json        JSON output instead of YAML
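A minimal sketch of how the --ops comma list could be parsed (hypothetical helper; the PR's real option handling may differ):

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Parse a "write,read,copy" subset into run[0..2] (write, read, copy).
   Returns false on an unknown op name. */
bool parse_ops(const char *arg, bool run[3]) {
    static const char *names[3] = { "write", "read", "copy" };
    char buf[64];  /* strtok modifies its input, so work on a copy */
    snprintf(buf, sizeof buf, "%s", arg);
    run[0] = run[1] = run[2] = false;
    for (char *tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
        bool found = false;
        for (int i = 0; i < 3; i++)
            if (strcmp(tok, names[i]) == 0)
                run[i] = found = true;
        if (!found)
            return false;
    }
    return true;
}
```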

Sample output

membw:
  buffer_mb: 16
  iters: 16
  results:
    write:
      mb_per_sec: 2243
      duration_s: 0.120
    read:
      mb_per_sec: 421
      duration_s: 0.637
    copy:
      mb_per_sec: 1863
      duration_s: 0.288
  chip: hi3516ev300

Why this fits

The clocks work (#162-#164), cpubench (#162), and this membw PR together close the diagnostic loop motivated by #161: when two boards on the same SoC behave differently, you can now separate the CPU pipeline (cpubench), the DDR pipeline (this PR), and the PLL configuration (clocks) with three quick subcommands.

Caveats baked into design (per issue body)

  • Buffers come from mmap(/dev/zero, MAP_PRIVATE), so they are anonymous DDR pages rather than tmpfs / page cache.
  • Default 16 MB per buffer comfortably exceeds the V4-family L2 (256 KB - 1 MB). Smaller sizes measure L2/L1, not DDR.
  • Streamer / encoder DMA traffic loads DDR. To measure the DDR config baseline, stop majestic / vendor App first; to measure real workload bandwidth, leave them running.
  • Default 16 MB × 16 iters × 3 ops processes ~1 GB total and takes <2 s on a healthy V4 board.
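The first caveat (anonymous DDR pages via /dev/zero, not malloc) can be sketched like this (a minimal illustration under that assumption, not the PR's exact allocation code):

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `mb` megabytes of private, copy-on-write pages backed by
   /dev/zero: plain anonymous DDR memory, untouched by the heap
   allocator and by tmpfs / page cache. Returns NULL on failure. */
void *map_buffer(size_t mb) {
    size_t sz = mb << 20;
    int fd = open("/dev/zero", O_RDONLY);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping survives the close */
    return p == MAP_FAILED ? NULL : p;
}
```

A NULL return here makes it straightforward to report a clean allocation error instead of crashing when the requested size does not fit.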

Cross-board verification

All four lab boards, majestic / App paused for the DDR baseline:

| Board | Buffer | write | read | copy |
|---|---|---|---|---|
| hi3516ev300 (V4, OpenIPC) | 16 MB | 2243 | 421 | 1863 |
| gk7205v300 (V4, OpenIPC) | 16 MB | 2096 | 417 | 1633 |
| gk7205v300 (V4, XM Sofia) | 4 MB | 1576 | 370 | 1302 |
| hi3516av300 (V4A, OpenIPC) | 16 MB | 2320 | 427 | 2440 |

(All numbers MB/s; **read** is the libc-independent column.)

XM Sofia ran with --size 4 because the board has only 48 MB of userspace memory (the rest is mmz_anonymous for the encoder), so the default 32 MB total (2 × 16 MB copy buffers) doesn't fit. This confirms --size is a genuinely useful knob, not a speculative tunable. The V4A board's copy number is notably higher because it is a dual-core SMP part.

Test plan

  • cv100 toolchain build + UPX, no warnings under -Wextra
  • ipctool membw --help prints concise usage
  • ipctool membw --json produces valid JSON via jq round-trip
  • ipctool membw --ops read,copy skips the write op
  • ipctool membw --size 1 runs without complaint (the must-exceed-L2 rule is documented but deliberately not enforced; whether a cache-sized buffer is useful is left to the caller)
  • OOM on a too-large --size returns a clean error (membw: mmap N MB: Cannot allocate memory) rather than crashing
  • Verified on 4 boards above; numbers stable across runs to within ~1%
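For completeness, the reported figure is just total bytes moved over elapsed time; a trivial sketch (assuming MB means 2^20 bytes here, which is an assumption about the PR's convention):

```c
#include <stddef.h>

/* MB/s from total bytes moved and elapsed seconds */
double mb_per_sec(size_t total_bytes, double seconds) {
    return (double)total_bytes / (1024.0 * 1024.0) / seconds;
}
```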

🤖 Generated with Claude Code

widgetii merged commit 086cf8e into master on May 14, 2026
3 checks passed
widgetii deleted the membw-issue-160 branch on May 14, 2026 at 13:43

Development

Successfully merging this pull request may close these issues.

Add DDR bandwidth test subcommand (memset/memcpy/read scan)