Conversation
…ar one where the memory clock would always be seen as not-equal due to a rounding error
…goes to framework time instead of benchmark time
…ck on gpu-architecture compiler option, added gpu-architecture auto-adding to CuPy
…d tests for this function, removed setting --gpu-architecture for CuPy as it is already set internally
csbnw
left a comment
Added a few (small) suggestions.
* [Optional] both Mamba and Miniconda can be automatically activated via :bash:`~/.bashrc`. Do not forget to add these (usually provided at the end of the installation).
* Exit the shell and re-enter to make sure Conda is available. :bash:`cd` to the kernel tuner directory.
- * [Optional] if you have limited user folder space, the Pip cache can be pointed elsewhere with the environment variable :bash:`PIP_CACHE_DIR`. The cache location can be checked with :bash:`pip cache dir`.
+ * [Optional] if you have limited user folder space, the Pip cache can be pointed elsewhere with the environment variable :bash:`PIP_CACHE_DIR`. The cache location can be checked with :bash:`pip cache dir`. On Linu, to point the entire :bash:`~/.cache` default elsewhere, use the :bash:`XDG_CACHE_HOME` environment variable.
Suggested change:
- * [Optional] if you have limited user folder space, the Pip cache can be pointed elsewhere with the environment variable :bash:`PIP_CACHE_DIR`. The cache location can be checked with :bash:`pip cache dir`. On Linu, to point the entire :bash:`~/.cache` default elsewhere, use the :bash:`XDG_CACHE_HOME` environment variable.
+ * [Optional] if you have limited user folder space, the Pip cache can be pointed elsewhere with the environment variable :bash:`PIP_CACHE_DIR`. The cache location can be checked with :bash:`pip cache dir`. On Linux, to point the entire :bash:`~/.cache` default elsewhere, use the :bash:`XDG_CACHE_HOME` environment variable.
def allocate_ndarray(self, array):
    return hip.hipMalloc(array.nbytes)
Don't you need to store the allocated memory?
Suggested change:
- def allocate_ndarray(self, array):
-     return hip.hipMalloc(array.nbytes)
+ def allocate_ndarray(self, array):
+     alloc = hip.hipMalloc(array.nbytes)
+     self.allocations.append(alloc)
+     return alloc
# get the number of registers per thread used in this kernel
num_regs = cuda.cuFuncGetAttribute(cuda.CUfunction_attribute.CU_FUNC_ATTRIBUTE_NUM_REGS, self.func)
assert num_regs[0] == 0, f"Retrieving number of registers per thread unsuccesful: code {num_regs[0]}"
Would it make sense to move this code to a helper function?
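A possible shape for such a helper, as a sketch only: the name `get_num_registers` is made up here, and it assumes cuda-python's low-level bindings, where `cuFuncGetAttribute` returns an (error, value) tuple.

```python
from cuda import cuda

def get_num_registers(func):
    """Return the number of registers per thread used by a compiled kernel.

    Hypothetical helper (not part of the PR); assumes cuda-python's low-level
    bindings, where cuFuncGetAttribute returns a (CUresult, value) tuple.
    """
    err, num_regs = cuda.cuFuncGetAttribute(
        cuda.CUfunction_attribute.CU_FUNC_ATTRIBUTE_NUM_REGS, func
    )
    assert err == cuda.CUresult.CUDA_SUCCESS, (
        f"Retrieving number of registers per thread unsuccessful: code {err}"
    )
    return num_regs
```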
def benchmark_default(self, func, gpu_args, threads, grid, result):
    """Benchmark one kernel execution at a time"""

def flush_cache(self):
    """This special function can be called to flush the L2 cache."""
I would suggest changing the comment to:
| """This special function can be called to flush the L2 cache.""" | |
| """Flush the L2 cache by overwriting it with zeros.""" |
I am surprised that this works at all, I thought that memset just touched the device memory.
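For reference, a minimal standalone sketch of the zero-overwrite idea using CuPy; this is not the PR's actual implementation, and the "L2CacheSize" attribute key and the buffer handling are assumptions.

```python
import cupy as cp

# Sketch: allocate a device buffer at least as large as the L2 cache and
# overwrite it with zeros before each timed launch, so that previously
# cached lines are evicted.
l2_size = cp.cuda.Device(0).attributes["L2CacheSize"]  # size in bytes (assumed key)
flush_buffer = cp.empty(l2_size, dtype=cp.uint8)

def flush_l2():
    # fill() launches a kernel that writes the entire buffer on the device
    flush_buffer.fill(0)
```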
# benchmark
if func:
    # setting the NVML parameters here avoids this time from leaking into the benchmark time, ends up in framework time instead
Suggested change:
- # setting the NVML parameters here avoids this time from leaking into the benchmark time, ends up in framework time instead
+ # Setting the NVML parameters takes a non-negligible amount of time. By setting them
+ # here, this time is added to the framework time rather than to benchmark time.
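To illustrate the attribution the suggested comment describes, here is a schematic sketch; the structure and callable names are hypothetical, not the backend's actual code. Anything executed before the timed kernel launch counts as framework time, and only the launch itself counts as benchmark time.

```python
import time

def run_benchmark(launch_kernel, apply_nvml_settings=None):
    """Schematic of the timing attribution (hypothetical structure, not the backend code)."""
    # Work done here, before the timed region, ends up in framework time.
    start = time.perf_counter()
    if apply_nvml_settings is not None:
        apply_nvml_settings()  # relatively slow, so keep it out of the benchmark timing
    framework_time = time.perf_counter() - start

    # Only the kernel execution itself is counted as benchmark time.
    start = time.perf_counter()
    launch_kernel()
    benchmark_time = time.perf_counter() - start
    return framework_time, benchmark_time
```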
@@ -0,0 +1,16 @@
from kernel_tuner.observers.observer import BenchmarkObserver


class RegisterObserver(BenchmarkObserver):
I like this new observer, but adding it seems outside the scope of this PR which is about flushing the L2 cache.
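For context, a sketch of what a minimal observer of this kind could look like, based on my understanding of kernel_tuner's BenchmarkObserver interface (register_device / get_results); the `num_regs` attribute on the device wrapper is an assumption, and the PR's actual RegisterObserver may differ.

```python
from kernel_tuner.observers.observer import BenchmarkObserver


class RegisterObserver(BenchmarkObserver):
    """Report the number of registers per thread used by the compiled kernel."""

    def register_device(self, dev):
        # kernel_tuner hands the observer the backend device wrapper
        self.dev = dev

    def get_results(self):
        # assumes the backend exposes the register count of the last compiled
        # kernel as `num_regs`; this attribute name is an assumption
        return {"num_regs": self.dev.num_regs}
```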
    highest_cc_index = max([i for i, cc in enumerate(subset_cc) if int(cc[1]) <= int(compute_capability[1])])
    return subset_cc[highest_cc_index]
# if all else fails, return the default 52
return '52'
Suggested change:
- return '52'
+ return valid_cc[0]
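For clarity, a sketch of the surrounding selection logic as I read it from the diff (the function and variable names are assumptions); with the suggested change, the final fallback becomes the lowest supported capability rather than a hard-coded '52'.

```python
def match_gpu_architecture(compute_capability, valid_cc):
    """Sketch of the selection logic; names and structure are assumptions.

    compute_capability: the device's capability as a string, e.g. "86".
    valid_cc: sorted list of capabilities the compiler supports, e.g. ["52", "60", "70", "80"].
    """
    # exact match with a supported capability
    if compute_capability in valid_cc:
        return compute_capability
    # same major version: take the highest minor version not exceeding the device's
    subset_cc = [cc for cc in valid_cc if cc[0] == compute_capability[0]]
    candidates = [cc for cc in subset_cc if int(cc[1]) <= int(compute_capability[1])]
    if candidates:
        return max(candidates, key=lambda cc: int(cc[1]))
    # if all else fails, fall back to the lowest supported capability
    return valid_cc[0]
```

For example, with `valid_cc = ["52", "60", "70", "80"]`, a device with capability "86" would map to "80", while an older device with capability "35" would fall back to "52" as the lowest supported value.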
…nted by CuPy, and attempt free of previous allocation after checking if flush is possible
… added interfacing for flushing L2 and recopying arguments
self.dev.synchronize()
- for _ in range(self.iterations):
+ for i in range(self.iterations):
`i` doesn't seem to be used below. The for-loop on line 377 even defines its own `i`.
self.flush_array = np.zeros((self.dev.cache_size_L2 // t(0).itemsize), order='F').astype(t)
self.flush_type = np.uint8
size = (self.dev.cache_size_L2 // self.flush_type(0).itemsize)
# self.flush_array = np.zeros((size), order='F', dtype=self.flush_type)
Suggested change:
- # self.flush_array = np.zeros((size), order='F', dtype=self.flush_type)
@@ -47,7 +47,7 @@ def __init__(self, device=0, iterations=7, compiler_options=None, observers=None
        self.devprops = dev.attributes
        self.cc = dev.compute_capability
        self.max_threads = self.devprops["MaxThreadsPerBlock"]
Also cast this to int for consistency?
This requires further investigation. The goal is to prevent caching effects between benchmark iterations and to mitigate the measurement noise they cause. The various proposed methods need to be evaluated in experiments with a fixed clock frequency. The ideal method must be both effective, correctly flushing the caches between iterations without re-copying all input data, and durable, in the sense that it does not rely on opaque cache implementation details.




This pull request adds the ability to flush the L2 cache between iterations on the GPU backends.
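The description above mentions validating the flush methods under a fixed clock frequency; below is a hedged sketch of locking the clocks via NVML using pynvml. The 1350 MHz value is only an example, and the calls typically require elevated privileges.

```python
import pynvml

# Sketch: lock the GPU clocks so cache-flush experiments are not confounded
# by frequency scaling; reset them afterwards.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
pynvml.nvmlDeviceSetGpuLockedClocks(handle, 1350, 1350)  # min/max clock in MHz (example values)
try:
    pass  # run the benchmarks with and without L2 flushing here
finally:
    pynvml.nvmlDeviceResetGpuLockedClocks(handle)
    pynvml.nvmlShutdown()
```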