Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: take nans correctly into consideration in complex and tuple #41952

Merged
merged 11 commits into from
Jun 16, 2021

Conversation

realead
Copy link
Contributor

@realead realead commented Jun 11, 2021

NaNs as Float are already handled correctly (all nans are the same equavalency class), this PR adds similar functionality for complex and tuple.

It doesn't handle frozenset (see also discussion in #41836) because it would require to duplicate a lot of code/internal details from setobject.

@simonjayhawkins simonjayhawkins added Bug Internals Related to non-user accessible pandas implementation labels Jun 11, 2021
@realead
Copy link
Contributor Author

realead commented Jun 11, 2021

For the upcoming issues with Py3.10 and hash-functions (see this comment #41836 (comment) and #41953) there will be a separate PR soonish.

@realead
Copy link
Contributor Author

realead commented Jun 11, 2021

It looks as if this exampe fails:

def test_isin_nan_not_pypy():
        idx = MultiIndex.from_arrays([["foo", "bar"], [1.0, np.nan]])
>       tm.assert_numpy_array_equal(idx.isin([("bar", np.nan)]), np.array([False, False]))

pandas\tests\indexes\multi\test_isin.py:37: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

left = array([False,  True]), right = array([False, False]), err_msg = None

To me it seems, that the new behavior is correct. At least it would be result if np.nan would be replaced through something like 2.

@jreback
Copy link
Contributor

jreback commented Jun 13, 2021

does this change outward behavior for <= 3.9?

asking if we should put this in 1.3 (inclination is yes)

@realead
Copy link
Contributor Author

realead commented Jun 15, 2021

@jreback The only behavioral change I'm aware of is fixing the bug #41836 (which also lead to this change in the test suitehttps://github.com//pull/41952/commits/0b02e27eb14f7f059740081eb71639550832d05f).

The performance didn't really suffer much under this change, there is a typical run (somehow on my machine the variance is quite high between different runs):

+     2.42±0.02ms      2.58±0.06ms     1.07  algos.isin.IsInFloat64.time_isin('Float64', 'many_different_values')
-         531±3ms          503±3ms     0.95  algos.isin.IsinWithRandomFloat.time_isin(<class 'numpy.object_'>, 900000, 'inside')
-         283±2μs          261±4μs     0.92  algos.isin.IsinWithRandomFloat.time_isin(<class 'numpy.object_'>, 1300, 'outside')
-     1.28±0.02ms      1.16±0.01ms     0.91  algos.isin.IsinWithRandomFloat.time_isin(<class 'numpy.object_'>, 7000, 'outside')
-         423±2ms          382±7ms     0.90  algos.isin.IsinWithRandomFloat.time_isin(<class 'numpy.object_'>, 900000, 'outside')
-     1.34±0.04ms      1.19±0.01ms     0.89  algos.isin.IsinWithRandomFloat.time_isin(<class 'numpy.object_'>, 7000, 'inside')
-         369±2ms          321±2ms     0.87  algos.isin.IsinWithRandomFloat.time_isin(<class 'numpy.object_'>, 750000, 'inside')
-      16.0±0.7ms       12.7±0.3ms     0.80  algos.isin.IsinWithRandomFloat.time_isin(<class 'numpy.object_'>, 80000, 'outside')
-         374±4ms          253±7ms     0.68  algos.isin.IsinWithRandomFloat.time_isin(<class 'numpy.object_'>, 750000, 'outside')
-        710±10μs          471±8μs     0.66  algos.isin.IsInForObjects.time_isin('nans', 'nans')

@jreback jreback added this to the 1.3 milestone Jun 15, 2021
@jreback
Copy link
Contributor

jreback commented Jun 15, 2021

ok labeling this for 1.3

@realead
Copy link
Contributor Author

realead commented Jun 15, 2021

@jreback Do you think a what's new entry is needed?

@jreback
Copy link
Contributor

jreback commented Jun 15, 2021

@jreback Do you think a what's new entry is needed?

yeah why don't you add one just in case. merge master and ping on green.

@realead
Copy link
Contributor Author

realead commented Jun 16, 2021

@jreback it is green(ish) - the failures seem to be unrelated to this PR.

@simonjayhawkins
Copy link
Member

@jreback it is green(ish) - the failures seem to be unrelated to this PR.

agreed. restarted ci to be sure

@@ -1033,6 +1033,7 @@ Missing
- Bug in :meth:`DataFrame.fillna` not accepting a dictionary for the ``downcast`` keyword (:issue:`40809`)
- Bug in :func:`isna` not returning a copy of the mask for nullable types, causing any subsequent mask modification to change the original array (:issue:`40935`)
- Bug in :class:`DataFrame` construction with float data containing ``NaN`` and an integer ``dtype`` casting instead of retaining the ``NaN`` (:issue:`26919`)
- Bug in :func:`isin` didn't treat all nans as equivalent if they were in tuples (:issue:`41836`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this ref is an internal one. does this for example affect Series.isin in a user visible way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the motivating case was MultiIndex.isin

@@ -163,18 +163,81 @@ KHASH_MAP_INIT_COMPLEX128(complex128, size_t)
#define kh_exist_complex128(h, k) (kh_exist(h, k))


int PANDAS_INLINE floatobject_cmp(PyFloatObject* a, PyFloatObject* b){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some comments before these of what is going on / why doing is & the issue reference

@jreback
Copy link
Contributor

jreback commented Jun 16, 2021

lgtm @realead just a few doc comments, ping when green

@realead
Copy link
Contributor Author

realead commented Jun 16, 2021

@jreback green.

@jreback jreback merged commit 87c9fc7 into pandas-dev:master Jun 16, 2021
@jreback
Copy link
Contributor

jreback commented Jun 16, 2021

thanks @realead

@jreback
Copy link
Contributor

jreback commented Jun 16, 2021

@meeseeksdev backport 1.3.x

@lumberbot-app
Copy link

lumberbot-app bot commented Jun 16, 2021

Something went wrong ... Please have a look at my logs.

jreback pushed a commit that referenced this pull request Jun 16, 2021
…mplex and tuple (#42058)

Co-authored-by: realead <egor.dranischnikow@googlemail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

_libs.hashtable.ismember incorrect with tuples containing nan
4 participants