Conversation
|
Thanks a lot for the PR @DhanashreePetare. Before we go further, I’d like to better understand the motivation and concrete use cases behind this change. The PR adds a fairly substantial amount of new code, and as maintainers we try to be cautious about introducing functionality unless there’s a clear and compelling need, since it also increases the long-term maintenance burden. Could you elaborate a bit on the scenarios this is meant to support, and whether there are existing users or workflows that would benefit from it? That context would really help us evaluate how to move forward. |
|
Thanks for the thoughtful review and for asking about motivation. This PR is aimed at containerized and scheduler‑managed workloads where Concrete scenarios this helps:
The implementation is automatic and safe: it detects v1/v2/hybrid, gracefully degrades when cgroups are absent, and can be disabled with It would be great to hear on it, and I’m happy to adjust further implementation/changes if required based on maintainer's guidance. |
|
Hello @DhanashreePetare thank you for the explanations. @amete and I will look in more detail at the code and get back to you. |
|
Hello @DhanashreePetare Thank you again for submitting this PR and apologies that many other constraints on our time prevented a more prompt review. Serhan and I decided on a strategic approach to this that should inform the most significant changes to the PR:
Can you update the PR along these lines and then we will do a more detailed review of the code? Thank you! |
graeme-a-stewart
left a comment
There was a problem hiding this comment.
See comment above for restructuring changes.
|
Hello @DhanashreePetare Is this an issue you would still like to work on? Please let us know and we can try to assign to another person if you are not able to continue. Thanks. |
|
sorry @graeme-a-stewart, very sorry for my late reply. Yes I will change the workflow according to the above provided changes, I will get back very soon with the changes. |
graeme-a-stewart
left a comment
There was a problem hiding this comment.
Hello @DhanashreePetare - thanks a lot for the changes. There are still a few things to improve on with some further simplifications that will make the code even leaner.
| - Use WSL2 (Ubuntu) or Docker Desktop to provide a Linux runtime. | ||
| - Run all builds and tests inside WSL2 or a Linux container. | ||
|
|
||
| For formatting on Windows before opening a pull request: |
There was a problem hiding this comment.
Windows is not a supported environment for prmon, so while I appreciate your instructions, it's just adding noise. If one is using Windows or OS X then a container / WSL2 environment needs to be used and then you have your Linux for build, tests, formatting, etc.
|
|
||
| ```powershell | ||
| cmake -S . -B build | ||
| cmake --build build --target clang-format |
There was a problem hiding this comment.
This is already documented (L59).
| @@ -0,0 +1,291 @@ | |||
| // Copyright (C) 2018-2025 CERN | |||
There was a problem hiding this comment.
This is a new file, so it should be (C) for the dates the file existed, viz, 2025-2026 in this case.
|
|
||
| std::string line; | ||
| // For cgroup v2, look for line starting with "0::" | ||
| // For cgroup v2, look for line starting with "0::" |
| // Find the cgroup filesystem mount point | ||
| std::string cgroupmon::find_cgroup_mount_point() { | ||
| return "/sys/fs/cgroup"; | ||
| } |
There was a problem hiding this comment.
This is now OTT for returning the same string every time (and you're not "detecting" anything). Instead just define a const std::string.
| return; | ||
| } | ||
|
|
||
| hw_json["cgroup"]["version"] = "v2"; |
There was a problem hiding this comment.
This is not useful anymore as cgroups implies v2, and the mount point is fixed. Please remove.
| #include "registry.h" | ||
|
|
||
| // Cgroup version enumeration | ||
| enum class CgroupVersion { NONE, V2 }; |
There was a problem hiding this comment.
I think this is no longer needed - see comments above w.r.t. detecting cgroupsv2. An enum is overdesign.
| prmon::monitored_list cgroup_stats; | ||
|
|
||
| // Cgroup detection state | ||
| CgroupVersion cgroup_version; |
| CgroupVersion cgroup_version; | ||
| bool valid; | ||
| std::string cgroup_path; | ||
| std::string cgroup_mount_point; |
There was a problem hiding this comment.
This could now just be a const std::string.
| @@ -0,0 +1,239 @@ | |||
| # Cgroup v2 Support Implementation - Summary | |||
There was a problem hiding this comment.
We really don't need this file. The historical record of the addition of the cgroups monitoring is in the pull request.
- Implement cgroupmon monitor with 17 metrics across CPU, memory, I/O, and process counts - Support automatic detection of cgroup v1, v2, and hybrid environments - Parse cgroup controllers: cpu.stat, memory.stat, io.stat for both versions - Integrate with prmon build system and test infrastructure - Add Python and GTest test harnesses with precooked cgroup v2 fixtures - Update documentation: README, ADDING_MONITORS, CONTRIBUTING with usage and Windows dev notes - Enable container resource tracking with --disable cgroupmon flag - Provide more accurate I/O statistics for containerized workloads
…ead counting - Remove unused cgroup_stat_update variable in update_stats() - Fix read_single_value() to count lines for cgroup.procs and cgroup.threads instead of reading first value, since these are multi-line PID lists
… focused metrics Based on maintainer guidance: - Remove cgroup v1 and hybrid support (v2-only going forward) - Remove 3 less useful metrics: cgroup_mem_slab, cgroup_mem_pgfault, cgroup_mem_pgmajfault - Simplify logic for easier maintenance (14 metrics, no v1/v2 branching) - Fix get_hardware_info() to avoid referencing removed enum values - Update all documentation to reflect v2-only implementation Changes: - Enum: NONE, V1, V2, HYBRID → NONE, V2 - Metrics: 17 → 14 (removed: slab, pgfault, pgmajfault) - Removed functions: read_cgroup_v1_stats, parse_*_v1 (4 functions) - Simplified detection, path resolution, and hardware info logic - Updated tests, README, ADDING_MONITORS, CGROUP_IMPLEMENTATION docs
30a2dc1 to
816f3de
Compare
|
FYI, I just fixed the |
|
There is also a test failure, can you investigate it? 25/28 Testing: testUnits
25/28 Test: testUnits
Command: "/usr/bin/python3" "test_cpu.py" "--units" "--time" "3" "--slack" "0" "--interval" "1"
Directory: /tmp/build/package/tests
"testUnits" start time: Mar 24 17:33 UTC
Output:
----------------------------------------------------------
Will run for 3s using 1 process(es) and 1 thread(s)
[2026-03-24 17:33:21.529] [nvidiamon] [warning] Failed to load libnvidia-ml.so: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
[2026-03-24 17:33:21.529] [nvidiamon] [warning] Both NVML and nvidia-smi failed. NVIDIA monitoring disabled.
F
======================================================================
FAIL: test_run_test_with_params (__main__.setup_configurable_test.<locals>.ConfigurableProcessMonitor)
Actual test runner
----------------------------------------------------------------------
Traceback (most recent call last):
File "/tmp/build/package/tests/test_cpu.py", line 99, in test_run_test_with_params
self.assertEqual(
AssertionError: 3 != 0 : Wrong number of unit values for 'Avg' - missing parameters are {'cgroup_cpu_periods', 'cgroup_cpu_throttled', 'cgroup_mem_max'}
----------------------------------------------------------------------
Ran 1 test in 3.075sN.B. You can ignore the |
Cgroup v1/v2 Monitoring Support
Overview
Adds comprehensive cgroup (container) resource monitoring to prmon with automatic detection of cgroup v1, v2, and hybrid environments.
Changes
cgroupmonwith 17 metrics across CPU, memory, I/O, and process countscpu.stat,memory.stat,io.statfor both v1 and v2--disable cgroupmonif neededBenefits
Testing
Python linting passes. CI will verify build and tests in Linux environments.