Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLN: Remove special handling of nans in the float64-case of isin #22117

Merged
merged 5 commits into from
Aug 10, 2018

Conversation

realead
Copy link
Contributor

@realead realead commented Jul 29, 2018

It is no longer needed because the hash-table handles the nans correctly out of the box (see #21866)

Not having to scan the values via isna(...).any() will improve the performance.

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff

There are no changes visible for the user (others than slightly better performance).

@codecov
Copy link

codecov bot commented Jul 30, 2018

Codecov Report

Merging #22117 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22117      +/-   ##
==========================================
- Coverage   92.08%   92.08%   -0.01%     
==========================================
  Files         169      169              
  Lines       50704    50703       -1     
==========================================
- Hits        46691    46690       -1     
  Misses       4013     4013
Flag Coverage Δ
#multiple 90.49% <100%> (-0.01%) ⬇️
#single 42.33% <0%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/algorithms.py 94.68% <100%> (-0.01%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 475e391...567e7bb. Read the comment docs.

@jreback
Copy link
Contributor

jreback commented Jul 30, 2018

can you run the asv suite for value_counts, isin, factorize and report any changes.

@@ -623,6 +623,27 @@ def test_empty(self, empty):
result = algos.isin(vals, empty)
tm.assert_numpy_array_equal(expected, result)

def test_different_nans_as_float64(self):
# create different nans from bit-patterns,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue number as a comment here

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Numeric Operations Arithmetic, Comparison, and Logical operations labels Jul 30, 2018
@realead
Copy link
Contributor Author

realead commented Jul 30, 2018

I had to add some performance tests for Series.isin and float64, because there weren't any.

For asv continuous -f 1.1 upstream/master HEAD -b ^series_methods -b ^algorithms the result was:

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

@pep8speaks
Copy link

pep8speaks commented Aug 4, 2018

Hello @realead! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 09, 2018 at 21:45 Hours UTC

@realead
Copy link
Contributor Author

realead commented Aug 4, 2018

@jreback to bump it and to show, that there is really a performance benefit for some scenarios:

>>> asv continuous -f 1.00001 measure cleanup_nans_in_isin_float64  -b ^series_methods.IsIn
...
       before           after         ratio
     [8c973f48]       [a94d8c34]
-      45.0±0.2ms       37.8±0.8ms     0.84  series_methods.IsInFloat64.time_isin_few_different
-      45.1±0.3ms       37.4±0.2ms     0.83  series_methods.IsInFloat64.time_isin_nan_values

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

wiih measured = master+cherry-picked test-commits

The strange thing: when I put all tests into IsIn class and just add float64 to params no difference is reported. However, when I test it via %timeit or in a separate class (like now) I can see an improvement of around 15% for some scenarios. I don't know enough about asv to claim this is a bug in asv 0.3.dev1327+8bce115f, but at least a behavior I wouldn't expect.

@jreback
Copy link
Contributor

jreback commented Aug 9, 2018

can you rebase

realead added 5 commits August 9, 2018 23:16
It is no longer needed because the hash-table handles the nans correctly out of the box (see GH21866)

Not having to scan the values via isna(...).any()  will improve the perfomance.
@realead realead force-pushed the cleanup_nans_in_isin_float64 branch from 9432f86 to 567e7bb Compare August 9, 2018 21:45
@realead
Copy link
Contributor Author

realead commented Aug 9, 2018

Performance test:

asv continuous -f 1.01 upstream/master HEAD -b ^series_methods -b ^algorithms -b ^categorial

· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
· Running 66 total benchmarks (2 commits * 1 environments * 33 benchmarks)
[  0.00%] · For pandas commit hash 567e7bb8:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  0.00%] ··· Running benchmarks............................
[  0.00%] ··· Setting up algorithms.py:95
[  0.00%] ··· Running benchmarks.........
[  1.52%] ··· algorithms.Duplicated.time_duplicated_float                                                  16.4±0.7ms;...
[  3.03%] ··· algorithms.Duplicated.time_duplicated_int                                                    10.3±0.2ms;...
[  4.55%] ··· algorithms.Duplicated.time_duplicated_string                                                 10.5±0.2μs;...
[  6.06%] ··· algorithms.DuplicatedUniqueIndex.time_duplicated_unique_int                                     30.6±0.08μs
[  7.58%] ··· algorithms.Factorize.time_factorize_float                                                      62.6±2ms;...
[  9.09%] ··· algorithms.Factorize.time_factorize_int                                                        21.8±2ms;...
[ 10.61%] ··· algorithms.Factorize.time_factorize_string                                                      154±8ms;...
[ 12.12%] ··· algorithms.Match.time_match_string                                                                  373±5μs
[ 13.64%] ··· series_methods.Clip.time_clip                                                                     121±0.8μs
[ 15.15%] ··· series_methods.Dir.time_dir_strings                                                             1.98±0.02ms
[ 16.67%] ··· series_methods.Dropna.time_dropna                                                            2.87±0.1ms;...
[ 18.18%] ··· series_methods.IsIn.time_isin                                                               1.60±0.03ms;...
[ 19.70%] ··· series_methods.IsInFloat64.time_isin_few_different                                               37.9±0.4ms
[ 21.21%] ··· series_methods.IsInFloat64.time_isin_many_different                                              41.9±0.9ms
[ 22.73%] ··· series_methods.IsInFloat64.time_isin_nan_values                                                  38.6±0.5ms
[ 24.24%] ··· series_methods.IsInForObjects.time_isin_long_series_long_values                                 3.50±0.08ms
[ 25.76%] ··· series_methods.IsInForObjects.time_isin_long_series_long_values_floats                           6.08±0.1ms
[ 27.27%] ··· series_methods.IsInForObjects.time_isin_long_series_short_values                                2.20±0.01ms
[ 28.79%] ··· series_methods.IsInForObjects.time_isin_nans                                                        660±7μs
[ 30.30%] ··· series_methods.IsInForObjects.time_isin_short_series_long_values                                1.16±0.02ms
[ 31.82%] ··· series_methods.Map.time_map                                                                 1.02±0.03ms;...
[ 33.33%] ··· series_methods.NSort.time_nlargest                                                           2.93±0.1ms;...
[ 34.85%] ··· series_methods.NSort.time_nsmallest                                                          2.35±0.2ms;...
[ 36.36%] ··· series_methods.SeriesConstructor.time_constructor                                               164±2μs;...
[ 37.88%] ··· series_methods.SeriesGetattr.time_series_datetimeindex_repr                                     3.09±0.03μs
[ 39.39%] ··· series_methods.ValueCounts.time_value_counts                                                2.25±0.07ms;...
[ 40.91%] ··· algorithms.Hashing.time_frame                                                                    21.1±0.3ms
[ 42.42%] ··· algorithms.Hashing.time_series_categorical                                                      5.17±0.08ms
[ 43.94%] ··· algorithms.Hashing.time_series_dates                                                            3.26±0.06ms
[ 45.45%] ··· algorithms.Hashing.time_series_float                                                            3.24±0.04ms
[ 46.97%] ··· algorithms.Hashing.time_series_int                                                              3.20±0.02ms
[ 48.48%] ··· algorithms.Hashing.time_series_string                                                            14.3±0.7ms
[ 50.00%] ··· algorithms.Hashing.time_series_timedeltas                                                        3.30±0.1ms
[ 50.00%] · For pandas commit hash 475e391e:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running benchmarks............................
[ 50.00%] ··· Setting up algorithms.py:95
[ 50.00%] ··· Running benchmarks.........
[ 51.52%] ··· algorithms.Duplicated.time_duplicated_float                                                  15.3±0.7ms;...
[ 53.03%] ··· algorithms.Duplicated.time_duplicated_int                                                    9.54±0.2ms;...
[ 54.55%] ··· algorithms.Duplicated.time_duplicated_string                                                 10.5±0.1μs;...
[ 56.06%] ··· algorithms.DuplicatedUniqueIndex.time_duplicated_unique_int                                      30.7±0.1μs
[ 57.58%] ··· algorithms.Factorize.time_factorize_float                                                      57.2±2ms;...
[ 59.09%] ··· algorithms.Factorize.time_factorize_int                                                        21.5±3ms;...
[ 60.61%] ··· algorithms.Factorize.time_factorize_string                                                      153±8ms;...
[ 62.12%] ··· algorithms.Match.time_match_string                                                                  374±4μs
[ 63.64%] ··· series_methods.Clip.time_clip                                                                       120±2μs
[ 65.15%] ··· series_methods.Dir.time_dir_strings                                                             2.05±0.05ms
[ 66.67%] ··· series_methods.Dropna.time_dropna                                                           2.73±0.05ms;...
[ 68.18%] ··· series_methods.IsIn.time_isin                                                               1.60±0.02ms;...
[ 69.70%] ··· series_methods.IsInFloat64.time_isin_few_different                                               45.6±0.6ms
[ 71.21%] ··· series_methods.IsInFloat64.time_isin_many_different                                                43.3±1ms
[ 72.73%] ··· series_methods.IsInFloat64.time_isin_nan_values                                                  45.2±0.7ms
[ 74.24%] ··· series_methods.IsInForObjects.time_isin_long_series_long_values                                  3.44±0.1ms
[ 75.76%] ··· series_methods.IsInForObjects.time_isin_long_series_long_values_floats                          5.82±0.05ms
[ 77.27%] ··· series_methods.IsInForObjects.time_isin_long_series_short_values                                2.09±0.03ms
[ 78.79%] ··· series_methods.IsInForObjects.time_isin_nans                                                        642±3μs
[ 80.30%] ··· series_methods.IsInForObjects.time_isin_short_series_long_values                                 1.18±0.1ms
[ 81.82%] ··· series_methods.Map.time_map                                                                  1.13±0.1ms;...
[ 83.33%] ··· series_methods.NSort.time_nlargest                                                           2.95±0.1ms;...
[ 84.85%] ··· series_methods.NSort.time_nsmallest                                                         2.33±0.08ms;...
[ 86.36%] ··· series_methods.SeriesConstructor.time_constructor                                             163±0.9μs;...
[ 87.88%] ··· series_methods.SeriesGetattr.time_series_datetimeindex_repr                                     3.21±0.02μs
[ 89.39%] ··· series_methods.ValueCounts.time_value_counts                                                2.24±0.02ms;...
[ 90.91%] ··· algorithms.Hashing.time_frame                                                                    23.2±0.7ms
[ 92.42%] ··· algorithms.Hashing.time_series_categorical                                                       5.36±0.2ms
[ 93.94%] ··· algorithms.Hashing.time_series_dates                                                             3.35±0.2ms
[ 95.45%] ··· algorithms.Hashing.time_series_float                                                            3.49±0.08ms
[ 96.97%] ··· algorithms.Hashing.time_series_int                                                              3.23±0.05ms
[ 98.48%] ··· algorithms.Hashing.time_series_string                                                            15.5±0.5ms
[100.00%] ··· algorithms.Hashing.time_series_timedeltas                                                        3.43±0.2ms
       before           after         ratio
     [475e391e]       [567e7bb8]
-      45.2±0.7ms       38.6±0.5ms     0.85  series_methods.IsInFloat64.time_isin_nan_values
-      45.6±0.6ms       37.9±0.4ms     0.83  series_methods.IsInFloat64.time_isin_few_different

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jreback jreback added this to the 0.24.0 milestone Aug 9, 2018
@jreback
Copy link
Contributor

jreback commented Aug 9, 2018

looks fine. ping on green.

@realead
Copy link
Contributor Author

realead commented Aug 10, 2018

@jreback Done

@jreback jreback merged commit 2c2b658 into pandas-dev:master Aug 10, 2018
@jreback
Copy link
Contributor

jreback commented Aug 10, 2018

thanks @realead nice patch, keep em coming!

@realead realead deleted the cleanup_nans_in_isin_float64 branch August 11, 2018 19:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants