Conversation
rgommers
left a comment
There was a problem hiding this comment.
Thanks @kgryte. Looks pretty good to me. I agree with the design choices in the PR description.
Would we be okay with requiring that value equality must be used? Is there a scenario where we want to allow libraries some wiggle room, such as with NaN and signed zero comparison?
I am not sure wiggle room is needed here. This function has more to do with equal than with unique I think. I just checked NumPy, PyTorch, JAX and CuPy - all seem to be using value equality for nan.
Are we okay with leaving out
assume_unique?
Yes.
Are we okay with not mandating reshape behavior if
x2is multi-dimensional?
I think that that part of the np.isin docstring is confusing. Reshaping is meaningless, the only point of that is trying to express that the comparisons are element-wise. It'd be better to have a simple double for-loop with pseudo-code. There is no broadcasting either, any shapes should work and the output has the same shape as x1.
|
Run a basic hypothesis test for isin at data-apis/array-api-tests#407 Immediate observations:
I rather strongly suspect that nobody really uses |
| ---------- | ||
| x1: Union[array, int, float, complex, bool] | ||
| first input array. **May** have any data type. | ||
| x2: Union[array, int, float, complex, bool] |
There was a problem hiding this comment.
Should there maybe be a constraint on the rank of x2?
There was a problem hiding this comment.
We intentionally did not impose a constraint and it is not clear whether there is a conceptual reason to do so, as this API is a vectorized API for finding whether a needle (a value) is in a haystack (an array) regardless of the dimensionality of the haystack.
There was a problem hiding this comment.
FWIW (at least some) other array libraries happily accept arbitrarily shaped x1 and x2:
In [1]: import numpy as np
In [2]: a = np.arange(3*4*5).reshape(3, 4, 5)
In [3]: b = np.arange(11)
In [4]: np.isin(a, b).shape
Out[4]: (3, 4, 5)
In [5]: np.isin(b, a).shape
Out[5]: (11,)
Here's a hypothesis test data-apis/array-api-tests#407 which does not restrict the shapes, and which seems to pass on numpy,cupy, jax and torch locally.
ev-br
left a comment
There was a problem hiding this comment.
LGTM modulo an optional nit.
A preliminary test does not surface any problems, data-apis/array-api-tests#407. The test does not perform value tests though, so it won't catch if a library does something other than equality testing. As long as only integer dtypes are allowed, there's not that many options though.
|
just a note that we probably won't deprecate https://data-apis.org/array-api-extra/generated/array_api_extra.isin.html for now as it also exposes |
it would be good however to open an issue to track updating the documentation of |
This PR:
resolves RFC: add
isinfor elementwise set inclusion test #854 by addingisinto the specification.of the keyword arguments determined according to array comparison data, this PR chooses to support only the
invertkwarg. Theassume_uniquekwarg was not included for the following reasons:assume_uniquewhen usingisinand that was when searching lists of already known unique values.assume_uniqueis something of a performance optimization/implementation detail which we have generally attempted to avoid when standardizing APIs.does not place restrictions on the shape of
x2. While some libraries may choose to flatten a multi-dimensionalx2, that is something of an implementation detail and not strictly necessary. For example, an implementation could defer to an "includes" kernel which performs nested loop iteration without needing to perform explicit reshapes/copies.adds support for scalar arguments for either
x1orx2. This follows recent general practice in standardized APIs, with the restriction that at least one ofx1orx2must be an array.specifies that value equality
should be used, but notmust be used. This follows other set APIs (e.g.,unique*). As a consequence of value equality,NaNvalues can never test asTrueand there is no distinction between signed zeros.allows bothlimits portability to integer data types, as floating-point data types are not widely supported across all array libraries (e.g., PyTorch). However, ifx1andx2to be of any data typex1andx2have no promotable data type, behavior is left unspecified and thus implementation-defined.Questions
Update: answers provided based on feedback below and discussions during workgroup meetings.
NaNand signed zero comparison?must, notshould, due to predominant usage patterns.assume_unique?x2is multi-dimensional?