rank() for int64 gives NAN for most negative value #32859

karthikeyann · 2020-03-20T10:29:01Z

Code to recreate the issue

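import pandas as pd
import numpy as np

# Casting inf/NaN to an integer dtype is not well defined; on this platform
# every element becomes the dtype's most negative value.
s = pd.Series(np.array([np.inf, np.nan, -np.inf]).astype(np.int32))
print(s)
print(s.rank())  # int32: ranks as expected
s = pd.Series(np.array([np.inf, np.nan, -np.inf]).astype(np.int64))
print(s)
print(s.rank())  # int64: all NaN
# Same result when the int64 minimum is written out explicitly.
s = pd.Series([-9223372036854775808, -9223372036854775808, -9223372036854775808], dtype=np.int64)
print(s.rank())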

0   -2147483648
1   -2147483648
2   -2147483648
dtype: int32
0    2.0
1    2.0
2    2.0
dtype: float64
0   -9223372036854775808
1   -9223372036854775808
2   -9223372036854775808
dtype: int64
0   NaN
1   NaN
2   NaN
dtype: float64
0   NaN
1   NaN
2   NaN
dtype: float64

Problem description

rank() on an np.int64 Series returns NaN as the rank of the most negative value (-9223372036854775808).
The equivalent np.int32 case ranks correctly.

Expected Output

0   -2147483648
1   -2147483648
2   -2147483648
dtype: int32
0    2.0
1    2.0
2    2.0
dtype: float64
0   -9223372036854775808
1   -9223372036854775808
2   -9223372036854775808
dtype: int64
0   2.0
1   2.0
2   2.0
dtype: float64
0   2.0
1   2.0
2   2.0
dtype: float64

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.2
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 41.6.0.post20191030
Cython : 0.29.13
pytest : 5.2.2
hypothesis : None
sphinx : 1.8.2
blosc : None
feather : None
xlsxwriter : 1.2.2
lxml.etree : 4.3.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : 1.2.1
fastparquet : 0.1.6
gcsfs : None
lxml.etree : 4.3.0
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 2.5.12
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
pytest : 5.2.2
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.10
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.2
numba : 0.42.0

skasturi · 2020-03-22T21:54:29Z

I traced the problem as far as the following method:

pandas/pandas/_libs/algos.pyx

Line 794 in 44de8dc

def rank_1d(rank_t[:] in_arr, ties_method='average',

I'm trying to debug further using http://docs.cython.org/en/latest/src/userguide/debugging.html, as I haven't debugged pyx code before.

@jbrockmendel FYI, as I see that you added this code earlier.

gabrielNT · 2020-07-18T00:32:22Z

So basically it looks like this is just following the na_option behavior. This is the way NA values are selected for int64:

pandas/pandas/_libs/algos.pyx

Lines 842 to 843 in bfac136

        elif rank_t is int64_t:
            mask = values == NPY_NAT
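To make the collision concrete, here is a small NumPy-only sketch (not the pandas source). It assumes only that NaT is stored as the smallest int64 value, which is how NumPy represents it. The variable names are made up for illustration.

```python
import numpy as np

# Smallest value an int64 can hold.
int64_min = np.iinfo(np.int64).min

# Integer representation of NaT in a datetime64 array.
nat_as_int = np.array(["NaT"], dtype="datetime64[ns]").view("i8")[0]

# A plain integer array that happens to contain the int64 minimum.
values = np.array([int64_min, 0, 5], dtype=np.int64)

# The same comparison as in the quoted snippet: the first element is
# flagged as missing even though it is a legitimate integer.
mask = values == nat_as_int
print(mask)

# The int32 minimum is a different number, so it does not collide.
int32_collides = bool(np.int64(np.iinfo(np.int32).min) == nat_as_int)
print(int32_collides)
```

Because the mask cannot tell a real int64 minimum from NaT, any plain integer equal to that value is treated as NA.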

With the default na_option='keep' the result is NaN, but any other na_option gives the expected ranks:

In [3]: pd.Series(np.array([np.inf, np.nan, -np.inf]).astype(np.int64)).rank()
Out[3]: 
0   NaN
1   NaN
2   NaN
dtype: float64

In [4]: pd.Series(np.array([np.inf, np.nan, -np.inf]).astype(np.int64)).rank(na_option='bottom')
Out[4]: 
0    2.0
1    2.0
2    2.0
dtype: float64

Not sure there's a good way to handle this without major restructuring. Maybe just document more explicitly that this can happen for int64?

jbrockmendel · 2020-08-03T15:32:53Z

Some other cython methods have a datetimelike keyword that determines if NPY_NAT is considered an NA value. We probably need to add that keyword here.
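A rough pure-NumPy sketch of what such a keyword could look like. This is not the pandas implementation: `rank_average` and its `is_datetimelike` parameter are hypothetical names used only for illustration, and only the 'average' ties method with NA values kept as NaN is covered.

```python
import numpy as np

# Sentinel used for NaT in int64-backed datetime-like data.
NPY_NAT = np.iinfo(np.int64).min


def rank_average(values, is_datetimelike=False):
    """Average-rank a 1-D int64 array, leaving missing entries as NaN.

    Hypothetical helper: NPY_NAT counts as missing only when the caller
    states that the data is datetime-like.
    """
    values = np.asarray(values, dtype=np.int64)
    if is_datetimelike:
        mask = values == NPY_NAT
    else:
        # Plain integers have no missing-value sentinel.
        mask = np.zeros(values.shape, dtype=bool)

    ranks = np.full(values.shape, np.nan)
    valid = values[~mask]
    if valid.size:
        sorted_valid = np.sort(valid)
        # Tied values occupy 1-based positions lo + 1 .. hi in sorted order;
        # the average of those positions is (lo + hi + 1) / 2.
        lo = np.searchsorted(sorted_valid, valid, side="left")
        hi = np.searchsorted(sorted_valid, valid, side="right")
        ranks[~mask] = (lo + hi + 1) / 2.0
    return ranks


print(rank_average([NPY_NAT] * 3))                        # 2.0 for every element
print(rank_average([NPY_NAT] * 3, is_datetimelike=True))  # NaN for every element
```

With the flag off, the int64 minimum ranks like any other integer. With it on, the value is treated as NaT and stays NaN, which matches the behavior reported above.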

gabrielNT · 2020-08-03T23:57:49Z

take

karthikeyann mentioned this issue Mar 20, 2020

[REVIEW] Series rank and Dataframe rank rapidsai/cudf#4294

Merged

11 tasks

simonjayhawkins added the Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) and Bug labels Apr 25, 2020

simonjayhawkins added this to the Contributions Welcome milestone Apr 25, 2020

github-actions bot assigned gabrielNT Aug 3, 2020

gabrielNT added a commit to gabrielNT/pandas that referenced this issue Aug 4, 2020

Check if NPY_NAT is NA for int64 in rank() (pandas-dev#32859)

f2a8cd0

gabrielNT mentioned this issue Aug 4, 2020

Check if NPY_NAT is NA for int64 in rank() (#32859) #35533

Closed

5 tasks

gabrielNT removed their assignment Aug 18, 2020

mzeitlin11 mentioned this issue Dec 28, 2020

REF/POC: Share groupby/series algos (rank) #38744

Merged

2 tasks

mzeitlin11 mentioned this issue Mar 27, 2021

BUG: rank treating min int as NaN #40659

Merged

4 tasks

jreback modified the milestones: Contributions Welcome, 1.3 Mar 29, 2021

jreback closed this as completed in #40659 Mar 29, 2021
