Conversation
…ar one where the memory clock would always be seen as not-equal due to a rounding error
…goes to framework time instead of benchmark time
…ck on gpu-architecture compiler option, added gpu-architecture auto-adding to CuPy
…d tests for this function, removed setting --gpu-architecture for CuPy as it is already set internally
csbnw
left a comment
Added a few (small) suggestions.
* [Optional] both Mamba and Miniconda can be automatically activated via :bash:`~/.bashrc`. Do not forget to add these (usually provided at the end of the installation).
* Exit the shell and re-enter to make sure Conda is available. :bash:`cd` to the kernel tuner directory.
- * [Optional] if you have limited user folder space, the Pip cache can be pointed elsewhere with the environment variable :bash:`PIP_CACHE_DIR`. The cache location can be checked with :bash:`pip cache dir`.
+ * [Optional] if you have limited user folder space, the Pip cache can be pointed elsewhere with the environment variable :bash:`PIP_CACHE_DIR`. The cache location can be checked with :bash:`pip cache dir`. On Linu, to point the entire :bash:`~/.cache` default elsewhere, use the :bash:`XDG_CACHE_HOME` environment variable.
Suggested change:
- * [Optional] if you have limited user folder space, the Pip cache can be pointed elsewhere with the environment variable :bash:`PIP_CACHE_DIR`. The cache location can be checked with :bash:`pip cache dir`. On Linu, to point the entire :bash:`~/.cache` default elsewhere, use the :bash:`XDG_CACHE_HOME` environment variable.
+ * [Optional] if you have limited user folder space, the Pip cache can be pointed elsewhere with the environment variable :bash:`PIP_CACHE_DIR`. The cache location can be checked with :bash:`pip cache dir`. On Linux, to point the entire :bash:`~/.cache` default elsewhere, use the :bash:`XDG_CACHE_HOME` environment variable.
def allocate_ndarray(self, array):
    return hip.hipMalloc(array.nbytes)
Don't you need to store the allocated memory?
Suggested change:
- def allocate_ndarray(self, array):
-     return hip.hipMalloc(array.nbytes)
+ def allocate_ndarray(self, array):
+     alloc = hip.hipMalloc(array.nbytes)
+     self.allocations.append(alloc)
+     return alloc
# get the number of registers per thread used in this kernel
num_regs = cuda.cuFuncGetAttribute(cuda.CUfunction_attribute.CU_FUNC_ATTRIBUTE_NUM_REGS, self.func)
assert num_regs[0] == 0, f"Retrieving number of registers per thread unsuccesful: code {num_regs[0]}"
Would it make sense to move this code to a helper function?
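A possible shape for such a helper, as a sketch only: the name `get_num_registers` is made up here, and it assumes cuda-python's low-level bindings, where `cuFuncGetAttribute` returns an (error, value) tuple.

```python
from cuda import cuda

def get_num_registers(func):
    """Return the number of registers per thread used by a compiled kernel.

    Hypothetical helper (not part of the PR); assumes cuda-python's low-level
    bindings, where cuFuncGetAttribute returns a (CUresult, value) tuple.
    """
    err, num_regs = cuda.cuFuncGetAttribute(
        cuda.CUfunction_attribute.CU_FUNC_ATTRIBUTE_NUM_REGS, func
    )
    assert err == cuda.CUresult.CUDA_SUCCESS, (
        f"Retrieving number of registers per thread unsuccessful: code {err}"
    )
    return num_regs
```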
def benchmark_default(self, func, gpu_args, threads, grid, result):
    """Benchmark one kernel execution at a time"""

def flush_cache(self):
    """This special function can be called to flush the L2 cache."""
I would suggest changing the comment to:
| """This special function can be called to flush the L2 cache.""" | |
| """Flush the L2 cache by overwriting it with zeros.""" |
I am surprised that this works at all, I thought that memset just touched the device memory.
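For reference, a minimal standalone sketch of the zero-overwrite idea using CuPy; this is not the PR's actual implementation, and the "L2CacheSize" attribute key and the buffer handling are assumptions.

```python
import cupy as cp

# Sketch: allocate a device buffer at least as large as the L2 cache and
# overwrite it with zeros before each timed launch, so that previously
# cached lines are evicted.
l2_size = cp.cuda.Device(0).attributes["L2CacheSize"]  # size in bytes (assumed key)
flush_buffer = cp.empty(l2_size, dtype=cp.uint8)

def flush_l2():
    # fill() launches a kernel that writes the entire buffer on the device
    flush_buffer.fill(0)
```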
# benchmark
if func:
    # setting the NVML parameters here avoids this time from leaking into the benchmark time, ends up in framework time instead
Suggested change:
- # setting the NVML parameters here avoids this time from leaking into the benchmark time, ends up in framework time instead
+ # Setting the NVML parameters takes a non-negligible amount of time. By setting them
+ # here, this time is added to the framework time rather than to benchmark time.
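To illustrate the attribution the suggested comment describes, here is a schematic sketch; the structure and callable names are hypothetical, not the backend's actual code. Anything executed before the timed kernel launch counts as framework time, and only the launch itself counts as benchmark time.

```python
import time

def run_benchmark(launch_kernel, apply_nvml_settings=None):
    """Schematic of the timing attribution (hypothetical structure, not the backend code)."""
    # Work done here, before the timed region, ends up in framework time.
    start = time.perf_counter()
    if apply_nvml_settings is not None:
        apply_nvml_settings()  # relatively slow, so keep it out of the benchmark timing
    framework_time = time.perf_counter() - start

    # Only the kernel execution itself is counted as benchmark time.
    start = time.perf_counter()
    launch_kernel()
    benchmark_time = time.perf_counter() - start
    return framework_time, benchmark_time
```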
@@ -0,0 +1,16 @@
from kernel_tuner.observers.observer import BenchmarkObserver


class RegisterObserver(BenchmarkObserver):
I like this new observer, but adding it seems outside the scope of this PR which is about flushing the L2 cache.
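For context, a sketch of what a minimal observer of this kind could look like, based on my understanding of kernel_tuner's BenchmarkObserver interface (register_device / get_results); the `num_regs` attribute on the device wrapper is an assumption, and the PR's actual RegisterObserver may differ.

```python
from kernel_tuner.observers.observer import BenchmarkObserver


class RegisterObserver(BenchmarkObserver):
    """Report the number of registers per thread used by the compiled kernel."""

    def register_device(self, dev):
        # kernel_tuner hands the observer the backend device wrapper
        self.dev = dev

    def get_results(self):
        # assumes the backend exposes the register count of the last compiled
        # kernel as `num_regs`; this attribute name is an assumption
        return {"num_regs": self.dev.num_regs}
```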
    highest_cc_index = max([i for i, cc in enumerate(subset_cc) if int(cc[1]) <= int(compute_capability[1])])
    return subset_cc[highest_cc_index]
# if all else fails, return the default 52
return '52'
Suggested change:
- return '52'
+ return valid_cc[0]
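For clarity, a sketch of the surrounding selection logic as I read it from the diff (the function and variable names are assumptions); with the suggested change, the final fallback becomes the lowest supported capability rather than a hard-coded '52'.

```python
def match_gpu_architecture(compute_capability, valid_cc):
    """Sketch of the selection logic; names and structure are assumptions.

    compute_capability: the device's capability as a string, e.g. "86".
    valid_cc: sorted list of capabilities the compiler supports, e.g. ["52", "60", "70", "80"].
    """
    # exact match with a supported capability
    if compute_capability in valid_cc:
        return compute_capability
    # same major version: take the highest minor version not exceeding the device's
    subset_cc = [cc for cc in valid_cc if cc[0] == compute_capability[0]]
    candidates = [cc for cc in subset_cc if int(cc[1]) <= int(compute_capability[1])]
    if candidates:
        return max(candidates, key=lambda cc: int(cc[1]))
    # if all else fails, fall back to the lowest supported capability
    return valid_cc[0]
```

For example, with `valid_cc = ["52", "60", "70", "80"]`, a device with capability "86" would map to "80", while an older device with capability "35" would fall back to "52" as the lowest supported value.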
…nted by CuPy, and attempt free of previous allocation after checking if flush is possible
… added interfacing for flushing L2 and recopying arguments
self.dev.synchronize()
- for _ in range(self.iterations):
+ for i in range(self.iterations):
`i` doesn't seem to be used below. The for-loop on line 377 even defines its own `i`.
self.flush_array = np.zeros((self.dev.cache_size_L2 // t(0).itemsize), order='F').astype(t)
self.flush_type = np.uint8
size = (self.dev.cache_size_L2 // self.flush_type(0).itemsize)
# self.flush_array = np.zeros((size), order='F', dtype=self.flush_type)
Suggested change:
- # self.flush_array = np.zeros((size), order='F', dtype=self.flush_type)
@@ -47,7 +47,7 @@ def __init__(self, device=0, iterations=7, compiler_options=None, observers=None
        self.devprops = dev.attributes
        self.cc = dev.compute_capability
        self.max_threads = self.devprops["MaxThreadsPerBlock"]
Also cast this to int for consistency?
This requires further investigation. The goal is to prevent caching effects between benchmark iterations and to mitigate the measurement noise they cause. The various proposed methods need to be evaluated in experiments with a fixed clock frequency. The ideal method must be both effective, correctly flushing the caches between iterations without re-copying all input data, and durable, in the sense that it does not rely on opaque cache implementation details.




This pull request adds the ability to flush the L2 cache between iterations on the GPU backends.
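The description above mentions validating the flush methods under a fixed clock frequency; below is a hedged sketch of locking the clocks via NVML using pynvml. The 1350 MHz value is only an example, and the calls typically require elevated privileges.

```python
import pynvml

# Sketch: lock the GPU clocks so cache-flush experiments are not confounded
# by frequency scaling; reset them afterwards.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
pynvml.nvmlDeviceSetGpuLockedClocks(handle, 1350, 1350)  # min/max clock in MHz (example values)
try:
    pass  # run the benchmarks with and without L2 flushing here
finally:
    pynvml.nvmlDeviceResetGpuLockedClocks(handle)
    pynvml.nvmlShutdown()
```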