Commit 41747c1: Add documentation for full system test scripts (parent: b1911e8)

1 file changed: 66 additions, 0 deletions.

## Full system test configuration and scripts

The full system test workflow consists of 3 shell scripts:
* `dpl-workflow.sh`: The main script, which runs the DPL workflow for the reconstruction.
  It can either read the input itself or receive it externally from one of the other two scripts.
* `raw-reader.sh`: Runs `o2-raw-file-reader` to read raw files as external input to `dpl-workflow.sh`.
* `datadistribution.sh`: Runs the `StfBuilder` to read time frame files as external input to `dpl-workflow.sh`.

One can either run `dpl-workflow.sh` standalone (with `EXTINPUT=0`) or run it in parallel with one of the other scripts in a separate shell (with `EXTINPUT=1`).
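
As a rough illustration of the two modes, the bash sketch below prints which script to start in which shell for each `EXTINPUT` value. The function `launch_plan` is hypothetical and executes nothing; any further environment the real scripts need (see the options below) is omitted.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints, one line per shell, which script to start
# for a given EXTINPUT value. Nothing is executed.
launch_plan() {
  local extinput="$1" input_script="$2"
  case "$extinput" in
    0)
      # Standalone: dpl-workflow.sh reads the input itself.
      echo "shell 1: EXTINPUT=0 ./dpl-workflow.sh"
      ;;
    1)
      # External input: one of the input scripts runs in a second shell.
      case "$input_script" in
        raw-reader.sh|datadistribution.sh) ;;
        *)
          echo "error: unknown input script '$input_script'" >&2
          return 1
          ;;
      esac
      echo "shell 1: EXTINPUT=1 ./dpl-workflow.sh"
      echo "shell 2: ./$input_script"
      ;;
    *)
      echo "error: EXTINPUT must be 0 or 1" >&2
      return 1
      ;;
  esac
}

launch_plan 1 raw-reader.sh
```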

In addition, there is the shared `setenv.sh` script, which sets the default configuration options, and an additional benchmark script:
* `start_tmux.sh`: Starts the full test in the EPN configuration (2 NUMA domains, 512 GB RAM, 8 GPUs).
  It runs tmux with 3 sessions: two running `dpl-workflow.sh` and one running one of the external input scripts (selected via the `dd` or `rr` command-line option).
  * Please note that `start_tmux.sh` overrides several of the environment options (see below) with the EPN defaults.
    The only options relevant for `start_tmux.sh` should be `TFDELAY` and `GPUMEMSIZE`.
  * Note also that while `dpl-workflow.sh` is a generic, flexible script that can be used for actual operation, `start_tmux.sh` is a benchmark script that demonstrates how the full workflow is supposed to run on the EPN.
    It is meant for standalone tests, not for starting the actual processing on the EPN.
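
The session layout described above can be sketched in bash as follows. This is not the content of `start_tmux.sh`: the function names are hypothetical, and the per-session variables (`NUMAID=0`/`1`, `NUMAGPUIDS=1`, `EXTINPUT=1`) are inferred from the option descriptions below rather than taken from the script. The sketch only prints the plan.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the described layout: two dpl-workflow.sh
# instances (one per NUMA domain) plus one external input script chosen
# by the "dd" or "rr" option. It only prints the plan.

# Maps the command-line option to the input script.
select_input() {
  case "$1" in
    dd) echo "datadistribution.sh" ;;
    rr) echo "raw-reader.sh" ;;
    *)
      echo "usage: select_input dd|rr" >&2
      return 1
      ;;
  esac
}

session_plan() {
  local input
  input="$(select_input "$1")" || return 1
  # One workflow per NUMA domain; NUMAID selects the SHM segment and GPU set.
  # The variable values are assumptions based on the option descriptions.
  echo "session 1: NUMAID=0 NUMAGPUIDS=1 EXTINPUT=1 ./dpl-workflow.sh"
  echo "session 2: NUMAID=1 NUMAGPUIDS=1 EXTINPUT=1 ./dpl-workflow.sh"
  echo "session 3: ./$input"
}

session_plan dd
```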

`dpl-workflow.sh` can run both the synchronous and the asynchronous workflow, selected via the `SYNCMODE` option (see below). Note the following:
* By default, it runs the full chain (EPN and FLP parts), so that it can operate as a full standalone benchmark processing simulated raw data.
* An `EPNONLY` option, to run only the EPN part (skipping the steps that will run on the FLP), will be added later.

All settings are configured via environment variables.
The default settings (used if the corresponding variable is not exported) are defined in `setenv.sh`, which is sourced by all other scripts.
(Please note that `start_tmux.sh` overrides a couple of options with EPN defaults.)
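
A minimal sketch of this "default unless exported" behavior, assuming `setenv.sh` uses parameter expansion with a fallback value; the real file may be written differently, and the default values below are placeholders, not the actual defaults. The function `apply_defaults` is hypothetical and stands in for sourcing `setenv.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for sourcing setenv.sh: each variable keeps the
# value already set in the environment and falls back to a default otherwise.
# The defaults are placeholders, NOT the real values from setenv.sh.
apply_defaults() {
  export NTIMEFRAMES="${NTIMEFRAMES:-1}"
  export TFDELAY="${TFDELAY:-1}"
  export EXTINPUT="${EXTINPUT:-0}"
  export GPUTYPE="${GPUTYPE:-CPU}"
}

# Usage: a value exported before sourcing wins over the default.
unset NTIMEFRAMES TFDELAY EXTINPUT GPUTYPE
export NTIMEFRAMES=5
apply_defaults
echo "NTIMEFRAMES=$NTIMEFRAMES TFDELAY=$TFDELAY EXTINPUT=$EXTINPUT GPUTYPE=$GPUTYPE"
```

Note that `${VAR:-default}` also replaces a variable that is set but empty, which is usually what one wants for such options.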

The following options exist (some options are not used by all scripts, and some behave slightly differently, as noted):
* `NTIMEFRAMES`: Number of time frames to process.
  * `dpl-workflow.sh` without `EXTINPUT`: Replays the time frame `NTIMEFRAMES` times and then exits.
  * `raw-reader.sh`: Replays the time frame `NTIMEFRAMES` times and then exits; the DPL workflows keep running.
  * `datadistribution.sh`: Ignored; it always runs in an endless loop.
* `TFDELAY`: Delay in seconds between publishing time frames (1 / rate).
* `NGPUS`: Number of GPUs to use; data is distributed among them round-robin.
* `GPUTYPE`: GPU tracking backend to use: `CPU`, `CUDA`, `HIP`, `OCL`, or `OCL2`.
* `SHMSIZE`: Size of the global shared memory segment.
* `DDSHMSIZE`: Size of the unmanaged shared memory region for DataDistribution input.
* `GPUMEMSIZE`: Size of the allocated GPU memory (if `GPUTYPE` is not `CPU`).
* `HOSTMEMSIZE`: Size of the allocated host memory for GPU reconstruction (0 = default).
  * For `GPUTYPE=CPU`: TPC tracking scratch memory size (default: 0, which means dynamic allocation).
  * Otherwise: Size of the page-locked host memory for GPU processing (default: 0, which means 1 GB).
* `CREATECTFDICT`: Create the CTF dictionary.
  * 0: Read `ctf_dictionary.root` as input.
  * 1: Create `ctf_dictionary.root`. Note that this is already done automatically if the raw data was simulated with `full_system_test.sh`.
* `SYNCMODE`: Run only the reconstruction steps of the synchronous reconstruction.
  * 0: Runs all reconstruction steps, of both the synchronous and the asynchronous reconstruction, using raw data input.
  * 1: Runs only the steps of the synchronous reconstruction, using raw data input.
* `NUMAGPUIDS`: NUMA-aware GPU id selection. Needed for the full EPN configuration with 8 GPUs in 2 NUMA domains (4 GPUs per domain).
  In this configuration, 2 instances of `dpl-workflow.sh` must run in parallel.
  To be used in combination with `NUMAID` to select the id per workflow.
  `start_tmux.sh` sets up these variables automatically.
* `NUMAID`: SHM segment id to use for shipping data, as well as the set of GPUs to use (use `0` / `1` for 2 NUMA domains: 0 = GPUs `0` to `NGPUS - 1`, 1 = GPUs `NGPUS` to `2 * NGPUS - 1`).
* `EXTINPUT`: Receive input from a raw FMQ channel instead of running `o2-raw-file-reader`.
  * 0: `dpl-workflow.sh` runs as a standalone benchmark and reads the input itself.
  * 1: To be used in combination with either `datadistribution.sh` or `raw-reader.sh`, or with another DataDistribution instance.
* `NHBPERTF`: Time frame length (in heartbeat frames, HBF).
* `GLOBALDPLOPT`: Global DPL workflow options appended to `o2-dpl-run`.
* `EPNPIPELINES`: Set default EPN pipeline multiplicities.
  Normally, the workflow starts one DPL device per data processor.
  For some of the CPU steps, this is insufficient to keep pace with the GPU processing rate, e.g. a single ITS-TPC matcher on the CPU is slower than the TPC tracking on multiple GPUs.
  This option adds multiplicities for CPU processes using DPL's pipeline feature.
  The settings were tuned for EPN processing with 8 GPUs.
  It is selected automatically by `start_tmux.sh`.
* `SEVERITY`: Log verbosity (e.g. `info` or `error`).
* `SHMTHROW`: Throw an exception when running out of SHM memory.
  It is recommended to leave this enabled (the default) for tests on a laptop, to get an actual error when running out of memory.
  It is disabled in `start_tmux.sh` to avoid aborting the processing when there is a chance that another process frees memory and the processing can continue.
* `NORATELOG`: Disable FairMQ rate logging.
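
The GPU ranges given for `NUMAID`, combined with the round-robin distribution mentioned for `NGPUS`, can be worked through with a small bash sketch. It assumes that `NUMAGPUIDS` is enabled, that `NGPUS` counts the GPUs per workflow instance (as the `NUMAID` ranges imply), and that the round-robin is applied per time frame; `gpu_ids` and `gpu_for_tf` are hypothetical helpers, not part of the scripts.

```bash
#!/usr/bin/env bash
# Sketch of the NUMAID GPU ranges plus a round-robin assignment of
# time frames. Assumes NUMAGPUIDS is enabled and that round-robin is
# done per time frame; both are assumptions.

# Prints the GPU ids used by one workflow instance:
# NUMAID 0 -> 0 .. NGPUS-1, NUMAID 1 -> NGPUS .. 2*NGPUS-1.
gpu_ids() {
  local numaid="$1" ngpus="$2" first i
  local -a ids=()
  first=$(( numaid * ngpus ))
  for (( i = first; i < first + ngpus; i++ )); do
    ids+=("$i")
  done
  echo "${ids[*]}"
}

# Prints the GPU id that receives time frame number $1 (0-based).
gpu_for_tf() {
  local tf="$1" numaid="$2" ngpus="$3"
  echo $(( numaid * ngpus + tf % ngpus ))
}

# EPN layout: 2 NUMA domains, NGPUS=4 per workflow instance.
echo "NUMAID=0 -> GPUs $(gpu_ids 0 4)"
echo "NUMAID=1 -> GPUs $(gpu_ids 1 4)"
echo "TF 5 on NUMAID=1 -> GPU $(gpu_for_tf 5 1 4)"
```

With `NGPUS=4`, the two instances cover GPUs 0 to 3 and 4 to 7, i.e. all 8 GPUs of the EPN configuration.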
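
How `NTIMEFRAMES` and `TFDELAY` interact in a finite replay can be sketched as follows. This is not the replay code of `raw-reader.sh` or `dpl-workflow.sh`; `replay_tfs` and `tf_rate` are hypothetical helpers that only print what would be published and the rate implied by `TFDELAY`. `datadistribution.sh`, which ignores `NTIMEFRAMES`, would correspond to an endless loop instead.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: publish NTIMEFRAMES time frames, waiting TFDELAY
# seconds between two consecutive ones. Not taken from the actual scripts.
replay_tfs() {
  local ntf="$1" delay="$2" i
  for (( i = 0; i < ntf; i++ )); do
    echo "publish TF $i"
    # Wait before the next time frame (no wait after the last one).
    if (( i + 1 < ntf )); then
      sleep "$delay"
    fi
  done
}

# Publishing rate implied by TFDELAY: rate = 1 / TFDELAY (time frames per second).
tf_rate() {
  awk -v d="$1" 'BEGIN { printf "%g\n", 1 / d }'
}

replay_tfs 3 0.1
echo "TFDELAY=0.5 -> $(tf_rate 0.5) TF/s"
```

For example, `TFDELAY=0.5` corresponds to 2 time frames per second. Whether the real scripts also skip the delay after the last time frame is not stated here; the sketch simply chooses to.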
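
Since `HOSTMEMSIZE` means different things depending on `GPUTYPE`, a small bash sketch that maps a setting to its meaning may help. It only restates the descriptions above; the function `describe_gpu_config` is hypothetical, and treating non-zero `HOSTMEMSIZE` values as bytes is an assumption, since the unit is not stated here.

```bash
#!/usr/bin/env bash
# Hypothetical check that restates the GPUTYPE / HOSTMEMSIZE descriptions.
# It does not reproduce any check performed by the real scripts.
# Non-zero HOSTMEMSIZE values are assumed to be in bytes.
describe_gpu_config() {
  local gputype="$1" hostmemsize="$2"
  case "$gputype" in
    CPU|CUDA|HIP|OCL|OCL2) ;;
    *)
      echo "error: unsupported GPUTYPE '$gputype'" >&2
      return 1
      ;;
  esac
  if [[ "$gputype" == "CPU" ]]; then
    # GPUMEMSIZE is not used; HOSTMEMSIZE is the TPC tracking scratch memory.
    if (( hostmemsize == 0 )); then
      echo "CPU: TPC tracking scratch memory allocated dynamically"
    else
      echo "CPU: TPC tracking scratch memory of $hostmemsize bytes"
    fi
  else
    # HOSTMEMSIZE is the page-locked host memory for GPU processing.
    if (( hostmemsize == 0 )); then
      echo "$gputype: 1 GB of page-locked host memory (default)"
    else
      echo "$gputype: $hostmemsize bytes of page-locked host memory"
    fi
  fi
}

describe_gpu_config CUDA 0
```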
