BUG: Replacing NaNs with a Series object based on boolean indexing does not replace NaNs #39717

Closed
2 of 3 tasks
shivendra90 opened this issue Feb 10, 2021 · 3 comments
Labels: Indexing (related to indexing on series/frames, not to indexes themselves), Usage Question

Comments

@shivendra90

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

diamond_data.isna().sum()

carat        0
cut          0
color        0
clarity      0
depth      691
table        0
x            0
y            0
z            0
price        0
dtype: int64

# Generate random numbers for filling nans

ideal_data = pd.Series(np.random.randint(low=float(diamond_data[diamond_data.cut == "Ideal"]["depth"].min()),
                                         high=float(diamond_data[diamond_data.cut == "Ideal"]["depth"].max()),
                                         size=270)).astype("float")

prem_data = pd.Series(np.random.randint(low=float(diamond_data[diamond_data.cut == "Premium"]["depth"].min()),
                                        high=float(diamond_data[diamond_data.cut == "Premium"]["depth"].max()),
                                        size=192)).astype("float")

vgood_data = pd.Series(np.random.randint(low=float(diamond_data[diamond_data.cut == "Very Good"]["depth"].min()),
                                        high=float(diamond_data[diamond_data.cut == "Very Good"]["depth"].max()),
                                        size=152)).astype("float")

good_data = pd.Series(np.random.randint(low=float(diamond_data[diamond_data.cut == "Good"]["depth"].min()),
                                        high=float(diamond_data[diamond_data.cut == "Good"]["depth"].max()),
                                        size=59)).astype("float")

fair_data = pd.Series(np.random.randint(low=float(diamond_data[diamond_data.cut == "Fair"]["depth"].min()),
                                        high=float(diamond_data[diamond_data.cut == "Fair"]["depth"].max()),
                                        size=24)).astype("float")

# Fill it up
diamond_data.loc[(diamond_data.cut == "Ideal") & (diamond_data.depth.isna()), "depth"] = ideal_data
diamond_data.loc[(diamond_data.cut == "Premium") & (diamond_data.depth.isna()), "depth"] = prem_data
diamond_data.loc[(diamond_data.cut == "Very Good") & (diamond_data.depth.isna()), "depth"] = vgood_data
diamond_data.loc[(diamond_data.cut == "Good") & (diamond_data.depth.isna()), "depth"] = good_data
diamond_data.loc[(diamond_data.cut == "Fair") & (diamond_data.depth.isna()), "depth"] = fair_data

Problem description

When I try to replace the NaN values, which are spread randomly across the dataset I'm working on, through .loc indexing as shown above, I expect every missing value in the depth column to be filled. Instead, nothing is filled, and running isna().sum() again shows the same number of missing values.

Expected Output

Expected output should be 0 na values for the column depth.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 9d598a5
python : 3.7.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-65-generic
Version : #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.2.1
numpy : 1.19.2
pytz : 2021.1
dateutil : 2.8.1
pip : 20.3.3
setuptools : 52.0.0.post20210125
Cython : None
pytest : 6.2.2
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.20.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.3
fastparquet : None
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.2
odfpy : None
openpyxl : 3.0.6
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.0
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 2.0.1
xlwt : None
numba : 0.51.2

@shivendra90 added the Bug and Needs Triage labels Feb 10, 2021
@attack68
Contributor

You need to provide a minimalist example, preferably code-only, that demonstrates the problem.

As it stands there could be a number of problems, some of them not even pandas related, so it is difficult to debug and for developers to fix.

@attack68
Contributor

Here is an example:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [2, np.nan], [3, np.nan]], columns=['A', 'B'])
s = pd.Series(np.random.randint(3, 7, size=2)).astype(float)
df.loc[df['B'].isna(), 'B'] = s
print(df)
   A    B
0  1  2.0
1  2  5.0
2  3  NaN

It has only filled in one value, presumably because the index in s has been matched with the index in df.
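To make the index matching visible, here is a small sketch with fixed values instead of random ones. The numbers 10.0 and 20.0 and the names `mask` and `aligned` are only illustrative; the point is that the right-hand Series is looked up by label, so only label 1 (which exists in both) receives a value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [2.0, np.nan, np.nan]})
s = pd.Series([10.0, 20.0])      # default index: 0, 1

mask = df["B"].isna()            # True for row labels 1 and 2

# What the assignment effectively sees: s looked up at labels 1 and 2.
# Label 1 exists in s (20.0); label 2 does not, so it becomes NaN.
aligned = s.reindex(df.index[mask])
print(aligned)

df.loc[mask, "B"] = s            # assignment aligns on labels, not position
print(df)
```

Row 1 ends up with 20.0 (the value at label 1 in s, not the first value), and row 2 stays NaN because s has no label 2.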

You have two solutions:

  1. Either use the ndarray version of the series (so ignore the index)

df.loc[df['B'].isna(), 'B'] = s.values

  2. Generate an ndarray directly (which has no index) and assign it to the column

s = np.random.randint(3,7, size=2)
df.loc[df['B'].isna(), 'B'] = s

I'm fairly sure this is working as intended, i.e. giving the power of index matching where it is useful.
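As an illustration of where index matching helps, here is a minimal sketch in which the fill values carry the same labels as the rows they belong to but arrive in a different order. The string labels and the `fill` Series are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"B": [2.0, np.nan, np.nan]}, index=["a", "b", "c"])

# Fill values keyed by row label, deliberately out of order.
fill = pd.Series({"c": 30.0, "b": 20.0})

# Label alignment puts each value on its own row regardless of order.
df.loc[df["B"].isna(), "B"] = fill
print(df)
```

Assigning `fill.values` here would instead place 30.0 on row "b" and 20.0 on row "c", because the ndarray is applied purely by position.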

@attack68 added the Usage Question and Indexing labels and removed the Bug and Needs Triage labels Feb 10, 2021
@shivendra90
Author

@attack68 Thanks, using .values actually worked.
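For anyone landing here with the same per-group fill, below is a rough sketch of how the fix might look on the original problem. It runs on a tiny hypothetical frame (`diamonds`) standing in for diamond_data, sizes each draw from the mask instead of hard-coding counts, and draws with `Generator.uniform` rather than the `randint` call used in the report, purely to keep the example short. It uses `.to_numpy()`, which the pandas documentation recommends over `.values`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for diamond_data; only the two relevant columns.
diamonds = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Ideal", "Fair", "Fair", "Fair"],
    "depth": [61.0, np.nan, 62.5, 64.0, 66.0, np.nan],
})

for cut in diamonds["cut"].unique():
    is_cut = diamonds["cut"] == cut
    missing = is_cut & diamonds["depth"].isna()
    n_missing = int(missing.sum())
    if n_missing == 0:
        continue

    # Observed range for this cut; min/max skip NaN by default.
    lo = diamonds.loc[is_cut, "depth"].min()
    hi = diamonds.loc[is_cut, "depth"].max()

    fill = pd.Series(rng.uniform(lo, hi, size=n_missing))
    # Drop the 0..n-1 index so values are assigned by position.
    diamonds.loc[missing, "depth"] = fill.to_numpy()

print(diamonds["depth"].isna().sum())
```

Taking the size from `missing.sum()` also avoids a length mismatch if the number of NaNs per cut ever differs from a hard-coded count.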
