rank() for int64 gives NAN for most negative value #32859

Closed
karthikeyann opened this issue Mar 20, 2020 · 4 comments · Fixed by #40659
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug

@karthikeyann

Code to recreate the issue

import pandas as pd
import numpy as np
s = pd.Series(np.array([np.inf, np.nan, -np.inf]).astype(np.int32))
print(s)
print(s.rank())
s = pd.Series(np.array([np.inf, np.nan, -np.inf]).astype(np.int64))
print(s)
print(s.rank())
s = pd.Series([-9223372036854775808, -9223372036854775808, -9223372036854775808], dtype=np.int64)
print(s.rank())

Output

0   -2147483648
1   -2147483648
2   -2147483648
dtype: int32
0    2.0
1    2.0
2    2.0
dtype: float64
0   -9223372036854775808
1   -9223372036854775808
2   -9223372036854775808
dtype: int64
0   NaN
1   NaN
2   NaN
dtype: float64
0   NaN
1   NaN
2   NaN
dtype: float64

Problem description

rank() returns NaN as the rank for the most negative np.int64 value (-9223372036854775808).
The equivalent np.int32 case works as expected.

Expected Output

0   -2147483648
1   -2147483648
2   -2147483648
dtype: int32
0    2.0
1    2.0
2    2.0
dtype: float64
0   -9223372036854775808
1   -9223372036854775808
2   -9223372036854775808
dtype: int64
0   2.0
1   2.0
2   2.0
dtype: float64
0   2.0
1   2.0
2   2.0
dtype: float64

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.2
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 41.6.0.post20191030
Cython : 0.29.13
pytest : 5.2.2
hypothesis : None
sphinx : 1.8.2
blosc : None
feather : None
xlsxwriter : 1.2.2
lxml.etree : 4.3.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : 1.2.1
fastparquet : 0.1.6
gcsfs : None
lxml.etree : 4.3.0
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 2.5.12
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
pytest : 5.2.2
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.10
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.2
numba : 0.42.0

@skasturi
Contributor

skasturi commented Mar 22, 2020

I traced the problem to the following method:

def rank_1d(rank_t[:] in_arr, ties_method='average',

I'm trying to debug further using http://docs.cython.org/en/latest/src/userguide/debugging.html, as I haven't debugged pyx code before.

@jbrockmendel FYI, as I see that you added this code earlier.

@simonjayhawkins simonjayhawkins added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug labels Apr 25, 2020
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Apr 25, 2020
@gabrielNT
Contributor

It looks like this is just following the na_option behavior: the most negative int64 value is the same integer as NPY_NAT, so it is treated as NA. This is how NA values are selected for int64:

elif rank_t is int64_t:
    mask = values == NPY_NAT
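
To illustrate why this mask catches the value from the report, here is a minimal sketch in plain NumPy, not the actual Cython code. It assumes NPY_NAT is the integer that NumPy stores for NaT in datetime64 arrays; the name NAT_SENTINEL is a stand-in introduced only for this example.

```python
import numpy as np

# Integer that NumPy stores for NaT in datetime64/timedelta64 arrays
# (assumed here to be the same value as NPY_NAT in the Cython code).
NAT_SENTINEL = np.array(["NaT"], dtype="datetime64[ns]").view("int64")[0]

# Most negative value representable as int64.
int64_min = np.iinfo(np.int64).min

values = np.array([int64_min, 0, 5], dtype=np.int64)

# Same comparison as in the snippet above: a plain integer that happens
# to equal the sentinel is flagged as missing.
mask = values == NAT_SENTINEL

print(NAT_SENTINEL == int64_min)
print(mask)
```

Under that assumption, an ordinary int64 column cannot hold its minimum value without the mask reading it as NA, which matches the NaN ranks shown in the report.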

With the default na_option='keep' the result is NaN, but with any other na_option the ranking works:

In [3]: pd.Series(np.array([np.inf, np.nan, -np.inf]).astype(np.int64)).rank()
Out[3]: 
0   NaN
1   NaN
2   NaN
dtype: float64

In [4]: pd.Series(np.array([np.inf, np.nan, -np.inf]).astype(np.int64)).rank(na_option='bottom')
Out[4]: 
0    2.0
1    2.0
2    2.0
dtype: float64

I'm not sure there's a good way to handle this without major restructuring. Maybe we should just be more explicit that this can happen for int64?

@jbrockmendel
Member

Some other cython methods have a datetimelike keyword that determines if NPY_NAT is considered an NA value. We probably need to add that keyword here.
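
A rough sketch of what such a keyword could look like, written in plain NumPy rather than Cython. The function names missing_mask and average_rank are hypothetical and exist only for this illustration; this is not the pandas implementation, and it covers only the 'average' ties method with NA entries kept as NaN.

```python
import numpy as np

# Assumed to match NPY_NAT: the most negative int64 value.
NAT_SENTINEL = np.iinfo(np.int64).min


def missing_mask(values, datetimelike=False):
    """Hypothetical helper: flag int64 entries that count as NA.

    The sentinel is only treated as missing when the data is
    datetime-like; plain integers are never flagged.
    """
    if datetimelike:
        return values == NAT_SENTINEL
    return np.zeros(values.shape, dtype=bool)


def average_rank(values, datetimelike=False):
    """Hypothetical average-method ranking; NA entries stay NaN."""
    values = np.asarray(values, dtype=np.int64)
    mask = missing_mask(values, datetimelike)
    ranks = np.full(values.shape, np.nan)
    valid = values[~mask]
    if valid.size:
        _, inverse, counts = np.unique(
            valid, return_inverse=True, return_counts=True
        )
        # 1-based first and last sorted positions of each distinct value.
        last = np.cumsum(counts)
        first = last - counts + 1
        # Tied values share the mean of the positions they occupy.
        ranks[~mask] = ((first + last) / 2.0)[inverse]
    return ranks


data = [NAT_SENTINEL, NAT_SENTINEL, NAT_SENTINEL]
print(average_rank(data))                     # plain integers
print(average_rank(data, datetimelike=True))  # sentinel treated as NaT
```

With datetimelike left at False, the three minimum values are ranked like any other tie; only a caller that knows the data is datetime-like opts into treating the sentinel as missing.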

@gabrielNT
Contributor

take
