Description
Hi,
This is not a bug report per se; it is more of a question, or just an FYI.
This is a difficult issue to report because it is quite hard to describe.
Note that this has only been tested/observed with the `stumpy.stump` function;
however, it is relevant in many other places too via `core.preprocess_non_normalized`.
It appears that the use of `parallel=True` in `core._parallel_rolling_func` does
not have the desired effect in all circumstances.
Although there is no facility to pass `cache=True` to the `njit` decorators in
stumpy, there are use cases where `cache=True` is an absolute requirement (I
shall update #699). If modifications are made to allow numba caching, the
following is revealed: on the first run in a process/thread, `stumpy.stump`
compiles and runs fine (FYI, with `len(T_A) == 1004`), with the `njit`
compilation taking around 13 seconds. The numba cache files are created, and
each subsequent call of `stumpy.stump` in the same process/thread with the same
`T_A` runs super fast with no issues, in about 0.009974628977943212 seconds.
All good.
The problem arises when another process is started, imports stumpy, and loads
the cached numba jit files. At this point stumpy imports fine, but when the
`stumpy.stump` function is called, the kernel dies in Jupyter, and in a Python
terminal a `Segmentation fault (core dumped)` is encountered.
As we are all aware debugging with numba can be quite difficult and laborious :)
However, if `parallel=True` is simply commented out of the
`core._parallel_rolling_func` `njit` decorator, everything works fine. I went
through the decorators one by one (on a stripped-down version), disabling
`parallel` and `fastmath` individually and testing each, which eventually
revealed `core._parallel_rolling_func` to be the culprit.
```python
@njit(
    # parallel=True,
    fastmath={"nsz", "arcp", "contract", "afn", "reassoc"},
    cache=config.STUMPY_NUMBA_CACHE,
)
def _parallel_rolling_func(a, w, func):
    """
    Compute the (embarrassingly parallel) rolling metric by applying a user
    defined function on a 1-D array
    """
```
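For context, here is a minimal sketch of how such a cache toggle could be wired up. The names (`STUMPY_NUMBA_CACHE`, `njit_kwargs`) are illustrative assumptions, not stumpy's actual config module:

```python
import os

# Hypothetical config pattern: read an environment variable once at import
# time so that every @njit decorator sees the same, consistent value.
STUMPY_NUMBA_CACHE = os.environ.get("STUMPY_NUMBA_CACHE", "0") == "1"


def njit_kwargs(parallel=False):
    # Central place to build decorator kwargs, so cache (and, while
    # debugging, parallel) can be toggled globally rather than per function.
    return {"parallel": parallel, "cache": STUMPY_NUMBA_CACHE}
```

With something like this in place, a decorator could be written as `@njit(**njit_kwargs(parallel=True))`, making it easy to flip caching on for a whole run.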
I am not certain why this is the case. Perhaps, when caching is in play, the
`func` parameter is the problem: which `func` is numba compiling against, and
how does it know which `func` will be passed to it? However, my understanding
of what can and cannot be achieved with numba is not that deep.
That said, I did try modifying the function to dispatch on the function name
instead, and that had the same kernel-died/segmentation-fault result, so
perhaps not.
For example:
```python
def _parallel_rolling_func(a, w, func_name):
    ...
    l = a.shape[0] - w + 1
    out = np.empty(l)
    for i in prange(l):
        # out[i] = func(a[i : i + w])
        if func_name == 'np.ptp':
            out[i] = np.ptp(a[i : i + w])

# AND

def _rolling_isconstant(a, w):
    ...
    out = _parallel_rolling_func(a, w, 'np.ptp')
```
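Another pattern that might sidestep passing a function at call time is a factory that bakes the metric into a specialized rolling function. The sketch below is plain Python so it runs without numba; in stumpy the inner function would presumably carry the `@njit` decorator, so each specialization is compiled (and cached) against one concrete metric rather than a function-typed argument. All names here are illustrative:

```python
def make_rolling_func(func):
    # In the real library the inner function would carry the @njit decorator;
    # here it is plain Python so the pattern itself can be demonstrated.
    def rolling(a, w):
        l = len(a) - w + 1
        out = [0.0] * l
        for i in range(l):  # would be prange under numba
            out[i] = func(a[i : i + w])
        return out
    return rolling


def ptp(window):
    # Peak-to-peak (max - min), mirroring np.ptp on a window.
    return max(window) - min(window)


# One specialization per metric; `func` is fixed at definition time.
rolling_ptp = make_rolling_func(ptp)
```

Whether numba's caching machinery actually handles such closure-produced functions cleanly is exactly the kind of thing that would need testing, given the behavior described above.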
As I said, a difficult one to describe. Although it may run during the first
compilation, the fact that `parallel=True` in `core._parallel_rolling_func`
breaks once the function is compiled and cached does raise the question: is it
valid to use `parallel` in this function at all?
That is one of the advantages of `cache=True`: I find it really does validate
`njit` functions. Anything that is not correct will reliably break when the
cached version is loaded. I find it quite handy for development testing in that
way (for me at least) :)
I must further caveat this with the fact that I tested on a stripped-down
version, which I reduced in order to trace this bug. That version only has:
__init__.py - ONLY the imports required by core.py, stump.py and aamp.py
core.py - ONLY the functions required by stump.py and aamp.py
aamp.py
stump.py
Mainly because stump is all I currently need, but I must have `cache=True`;
otherwise I would have to consider stepping back to
matrixprofile-foundation/matrixprofile, though I would have to maintain it
myself seeing as they have binned it now :) That library did load and run in
under 1 second, and having to wait between 13 and 17 seconds to run
`stumpy.stump` is not an option, but 0.009974628977943212 seconds is awesome!
So I am not sure what to do with this, other than advise you of the findings.
I have also applied the change to a full v1.11.1 version, modified so that
`cache=True` can be set on all `core`, `stump` and `aamp` `njit` decorators,
and that works too, as long as `parallel` is not passed. Once again, this was
only tested with `stumpy.stump`.