CLN: ASV Gil benchmark #18675

Merged
merged 3 commits into from
Dec 11, 2017

mroeschke
Member

  • Adding missing setup import to attrs_caching.py

  • Now lint benchmark files that start with g and h

  • Utilized params for gil.py benchmarks, applied flake8 fixes, and removed star imports
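For readers unfamiliar with asv's `params` mechanism, here is a rough sketch of how a parameterized benchmark class is laid out. The class name `ParallelSum` and the list-summing workload are made up for illustration and are not the code in gil.py; only the `params` / `param_names` / `setup` / `time_*` convention comes from asv.

```python
from concurrent.futures import ThreadPoolExecutor


class ParallelSum:
    # asv runs each time_* method once per value in `params` and passes
    # that value to both setup() and the benchmark method.
    params = [2, 4, 8]
    param_names = ['threads']

    def setup(self, threads):
        # Hypothetical workload; the real benchmarks build pandas objects.
        self.data = list(range(100000))

    def _work(self):
        return sum(self.data)

    def time_loop(self, threads):
        # Serial baseline: the same number of calls, one after another.
        for _ in range(threads):
            self._work()

    def time_parallel(self, threads):
        # Same number of calls, spread over a pool of `threads` workers.
        with ThreadPoolExecutor(max_workers=threads) as pool:
            list(pool.map(lambda _: self._work(), range(threads)))
```

With this layout asv reports one row per `threads` value for each method, which is the shape of the tables in the output below.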

asv dev -b ^gil
· Discovering benchmarks
· Running 15 total benchmarks (1 commits * 1 environments * 15 benchmarks)
[  0.00%] ·· Building for existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[  0.00%] ·· Benchmarking existing-py_home_matt_anaconda_envs_pandas_dev_bin_python
[  6.67%] ··· Running gil.ParallelDatetimeFields.time_datetime_field_day                                                   211ms
[ 13.33%] ··· Running gil.ParallelDatetimeFields.time_datetime_field_daysinmonth                                           224ms
[ 20.00%] ··· Running gil.ParallelDatetimeFields.time_datetime_field_normalize                                             324ms
[ 26.67%] ··· Running gil.ParallelDatetimeFields.time_datetime_field_year                                                  201ms
[ 33.33%] ··· Running gil.ParallelDatetimeFields.time_datetime_to_period                                                   258ms
[ 40.00%] ··· Running gil.ParallelDatetimeFields.time_period_to_datetime                                                   325ms
[ 46.67%] ··· Running gil.ParallelFactorize.time_loop                                                                         ok
[ 46.67%] ···· 
               ========= ========
                threads          
               --------- --------
                   2      76.9ms 
                   4      156ms  
                   8      314ms  
               ========= ========

[ 53.33%] ··· Running gil.ParallelFactorize.time_parallel                                                                     ok
[ 53.33%] ···· 
               ========= ========
                threads          
               --------- --------
                   2      95.9ms 
                   4      209ms  
                   8      474ms  
               ========= ========

[ 60.00%] ··· Running gil.ParallelGroupbyMethods.time_loop                                                                    ok
[ 60.00%] ···· 
               ========= ======== ======== ======== ======== ======== ======== ======== =======
               --                                        method                                
               --------- ----------------------------------------------------------------------
                threads   count     last     max      mean     min      prod     sum      var  
               ========= ======== ======== ======== ======== ======== ======== ======== =======
                   2      92.6ms   85.2ms   79.0ms   82.7ms   82.3ms   82.0ms   82.2ms   106ms 
                   4      187ms    163ms    158ms    164ms    156ms    165ms    160ms    207ms 
                   8      452ms    406ms    396ms    410ms    398ms    405ms    401ms    493ms 
               ========= ======== ======== ======== ======== ======== ======== ======== =======

[ 66.67%] ··· Running gil.ParallelGroupbyMethods.time_parallel                                                                ok
[ 66.67%] ···· 
               ========= ======= ======= ======= ======== ======= ======= ======= =======
               --                                     method                             
               --------- ----------------------------------------------------------------
                threads   count    last    max     mean     min     prod    sum     var  
               ========= ======= ======= ======= ======== ======= ======= ======= =======
                   2      150ms   104ms   105ms   83.9ms   104ms   103ms   103ms   111ms 
                   4      305ms   235ms   227ms   227ms    227ms   231ms   219ms   271ms 
                   8      740ms   466ms   481ms   504ms    493ms   515ms   484ms   709ms 
               ========= ======= ======= ======= ======== ======= ======= ======= =======

[ 73.33%] ··· Running gil.ParallelGroups.time_get_groups                                                                      ok
[ 73.33%] ···· 
               ========= =======
                threads         
               --------- -------
                   2      1.41s 
                   4      2.94s 
                   8      5.76s 
               ========= =======

[ 80.00%] ··· Running gil.ParallelKth.time_kth_smallest                                                                    291ms
[ 86.67%] ··· Running gil.ParallelReadCSV.time_read_csv                                                                       ok
[ 86.67%] ···· 
               ========== ========
                 dtype            
               ---------- --------
                 float     554ms  
                 object    24.4ms 
                datetime   561ms  
               ========== ========

[ 93.33%] ··· Running gil.ParallelRolling.time_rolling                                                                        ok
[ 93.33%] ···· 
               ================ ========
                    method              
               ---------------- --------
                rolling_median   239ms  
                 rolling_mean    24.7ms 
                 rolling_min     27.3ms 
                 rolling_max     29.1ms 
                 rolling_var     23.7ms 
                 rolling_skew    31.3ms 
                 rolling_kurt    29.7ms 
                 rolling_std     30.0ms 
               ================ ========

[ 93.33%] ····· 
                
                For parameters: 'rolling_median'
                /home/matt/Projects/pandas-mroeschke/asv_bench/benchmarks/gil.py:194: FutureWarning: pd.rolling_median is deprecated for ndarrays and will be removed in a future version
                  rolling[method](arr, win)
                
                For parameters: 'rolling_mean'
                /home/matt/Projects/pandas-mroeschke/asv_bench/benchmarks/gil.py:194: FutureWarning: pd.rolling_mean is deprecated for ndarrays and will be removed in a future version
                  rolling[method](arr, win)
                
                For parameters: 'rolling_min'
                /home/matt/Projects/pandas-mroeschke/asv_bench/benchmarks/gil.py:194: FutureWarning: pd.rolling_min is deprecated for ndarrays and will be removed in a future version
                  rolling[method](arr, win)
                
                For parameters: 'rolling_max'
                /home/matt/Projects/pandas-mroeschke/asv_bench/benchmarks/gil.py:194: FutureWarning: pd.rolling_max is deprecated for ndarrays and will be removed in a future version
                  rolling[method](arr, win)
                
                For parameters: 'rolling_var'
                /home/matt/Projects/pandas-mroeschke/asv_bench/benchmarks/gil.py:194: FutureWarning: pd.rolling_var is deprecated for ndarrays and will be removed in a future version
                  rolling[method](arr, win)
                
                For parameters: 'rolling_skew'
                /home/matt/Projects/pandas-mroeschke/asv_bench/benchmarks/gil.py:194: FutureWarning: pd.rolling_skew is deprecated for ndarrays and will be removed in a future version
                  rolling[method](arr, win)
                
                For parameters: 'rolling_kurt'
                /home/matt/Projects/pandas-mroeschke/asv_bench/benchmarks/gil.py:194: FutureWarning: pd.rolling_kurt is deprecated for ndarrays and will be removed in a future version
                  rolling[method](arr, win)
                
                For parameters: 'rolling_std'
                /home/matt/Projects/pandas-mroeschke/asv_bench/benchmarks/gil.py:194: FutureWarning: pd.rolling_std is deprecated for ndarrays and will be removed in a future version
                  rolling[method](arr, win)

[100.00%] ··· Running gil.ParallelTake1D.time_take1d                                                                          ok
[100.00%] ···· 
               ========= ========
                 dtype           
               --------- --------
                 int64    24.3ms 
                float64   8.17ms 
               ========= ========

@jorisvandenbossche
Member

Isn't it a bit strange that you see such a slowdown with multiple threads?

@jreback jreback added the Benchmark Performance (ASV) benchmarks label Dec 7, 2017
@codecov

codecov bot commented Dec 8, 2017

Codecov Report

Merging #18675 into master will decrease coverage by 0.02%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #18675      +/-   ##
==========================================
- Coverage   91.59%   91.57%   -0.03%     
==========================================
  Files         153      153              
  Lines       51263    51212      -51     
==========================================
- Hits        46956    46899      -57     
- Misses       4307     4313       +6
Flag Coverage Δ
#multiple 89.43% <ø> (-0.01%) ⬇️
#single 40.67% <ø> (-0.13%) ⬇️
Impacted Files Coverage Δ
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/plotting/_converter.py 65.25% <0%> (-1.27%) ⬇️
pandas/core/config_init.py 98.34% <0%> (-0.12%) ⬇️
pandas/core/frame.py 97.81% <0%> (-0.1%) ⬇️
pandas/core/window.py 96.31% <0%> (ø) ⬆️
pandas/tseries/frequencies.py 94.09% <0%> (+0.07%) ⬆️
pandas/plotting/_core.py 82.49% <0%> (+0.12%) ⬆️
pandas/util/_test_decorators.py 95.23% <0%> (+1.9%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ba3a442...a83286d. Read the comment docs.

@codecov

codecov bot commented Dec 8, 2017

Codecov Report

Merging #18675 into master will decrease coverage by <.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #18675      +/-   ##
==========================================
- Coverage    91.6%    91.6%   -0.01%     
==========================================
  Files         153      153              
  Lines       51273    51306      +33     
==========================================
+ Hits        46970    46999      +29     
- Misses       4303     4307       +4
Flag Coverage Δ
#multiple 89.46% <ø> (+0.01%) ⬆️
#single 40.72% <ø> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/io/parquet.py 64.55% <0%> (-0.83%) ⬇️
pandas/core/reshape/merge.py 94.2% <0%> (-0.21%) ⬇️
pandas/core/frame.py 97.81% <0%> (-0.1%) ⬇️
pandas/core/indexes/datetimes.py 95.68% <0%> (ø) ⬆️
pandas/core/generic.py 95.9% <0%> (ø) ⬆️
pandas/io/pytables.py 92.84% <0%> (ø) ⬆️
pandas/plotting/_core.py 82.41% <0%> (+0.03%) ⬆️
pandas/core/indexes/numeric.py 97.33% <0%> (+0.07%) ⬆️
pandas/util/testing.py 82.34% <0%> (+0.32%) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 34a8d36...0c4f3e7. Read the comment docs.

@mroeschke
Member Author

@jorisvandenbossche This benchmark makes heavy use of the test_parallel decorator which creates threads using the threading module, but I think all these operations are CPU bound so threading shouldn't help here.

Here's an example analysis of that point: https://medium.com/practo-engineering/threading-vs-multiprocessing-in-python-7b57f224eadb
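To make the discussion concrete, here is a minimal sketch of what a decorator of this kind might look like. It is not the pandas implementation of test_parallel; the name `run_parallel` and its body are assumptions, written only to show the "call the same function in N threads and wait" idea.

```python
import functools
import threading


def run_parallel(num_threads=2):
    """Hypothetical stand-in for a test_parallel-style decorator."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            threads = [
                threading.Thread(target=func, args=args, kwargs=kwargs)
                for _ in range(num_threads)
            ]
            for t in threads:
                t.start()
            for t in threads:
                t.join()  # return only after every thread has finished
        return wrapper
    return decorator


results = []


@run_parallel(num_threads=4)
def record():
    # list.append is atomic in CPython, so no lock is needed here.
    results.append(threading.current_thread().name)


record()
print(len(results))
```

Whether the threads actually overlap depends on whether the decorated function releases the GIL while it runs.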

@jorisvandenbossche
Member

I think all these operations are CPU bound so threading shouldn't help here.

As far as I understand, threads don't help for that due to the GIL, but here we are explicitly benchmarking the GIL-freeing of those groupby et al methods.

Running the example interactively does give a speedup, so I was mainly wondering why we don't see this in the asv benchmarks:

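import numpy as np
from pandas import DataFrame
from pandas.util.testing import test_parallel

N = 1000000
ngroups = 1000
np.random.seed(1234)
df = DataFrame({'key': np.random.randint(0, ngroups, size=N),
                'data': np.random.randn(N)})

def f():
    df.groupby('key')['data'].sum()

# Serial baseline: four groupby sums, one after another.
def g4():
    for i in range(4):
        f()

# Threaded version: test_parallel runs the body in 4 threads.
@test_parallel(num_threads=4)
def pg4():
    f()

In [14]: %timeit g4()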
74.5 ms ± 1e+03 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [15]: %timeit pg4()
39.4 ms ± 2.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
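The timings above work out to roughly a 1.9x speedup (74.5 ms / 39.4 ms). The same serial-versus-threaded comparison can be sketched with only the standard library. Note the assumption: `time.sleep` is used here purely as a stand-in for an operation that releases the GIL, so the numbers say nothing about groupby itself.

```python
import threading
import time


def work():
    # Stand-in for GIL-releasing work: sleeping does not hold the GIL.
    time.sleep(0.05)


def serial(n):
    for _ in range(n):
        work()


def threaded(n):
    threads = [threading.Thread(target=work) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def elapsed(func, n):
    start = time.perf_counter()
    func(n)
    return time.perf_counter() - start


serial_s = elapsed(serial, 4)
threaded_s = elapsed(threaded, 4)
print("serial   %.3fs" % serial_s)
print("threaded %.3fs" % threaded_s)
print("speedup  %.2fx" % (serial_s / threaded_s))
```

If `work` held the GIL for its whole duration instead, the threaded version would be expected to take about as long as the serial one, or longer.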

@mroeschke
Member Author

Okay, thanks for the explanation. The gil.py benchmarks on master also show that threads are slower, so maybe it's an artifact of asv?

(time_sum_4_notp is faster than time_sum_4)

asv run -b ^gil.NoGilGroupby
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
· Running 15 total benchmarks (1 commits * 1 environments * 15 benchmarks)
[  0.00%] · For pandas commit hash 34a8d36e:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  6.67%] ··· Running gil.NoGilGroupby.time_count_2                                            161±3ms
[ 13.33%] ··· Running gil.NoGilGroupby.time_groups_2                                        1.32±0.02s
[ 20.00%] ··· Running gil.NoGilGroupby.time_groups_4                                             2.70s
[ 26.67%] ··· Running gil.NoGilGroupby.time_groups_8                                             5.28s
[ 33.33%] ··· Running gil.NoGilGroupby.time_last_2                                           144±0.4ms
[ 40.00%] ··· Running gil.NoGilGroupby.time_max_2                                              141±2ms
[ 46.67%] ··· Running gil.NoGilGroupby.time_mean_2                                           130±0.6ms
[ 53.33%] ··· Running gil.NoGilGroupby.time_min_2                                            141±0.6ms
[ 60.00%] ··· Running gil.NoGilGroupby.time_prod_2                                             132±1ms
[ 66.67%] ··· Running gil.NoGilGroupby.time_sum_2                                            141±0.7ms
[ 73.33%] ··· Running gil.NoGilGroupby.time_sum_4                                              296±3ms
[ 80.00%] ··· Running gil.NoGilGroupby.time_sum_4_notp                                        190±20ms
[ 86.67%] ··· Running gil.NoGilGroupby.time_sum_8                                            554±0.8ms
[ 93.33%] ··· Running gil.NoGilGroupby.time_sum_8_notp                                        296±40ms
[100.00%] ··· Running gil.NoGilGroupby.time_var_2                                              158±2ms

@jorisvandenbossche jorisvandenbossche added this to the 0.22.0 milestone Dec 10, 2017
@jorisvandenbossche
Member

There is still one linting issue:

Linting asv_bench/benchmarks/

asv_bench/benchmarks/groupby.py:279:9: E731 do not assign a lambda expression, use a def

Linting asv_bench/benchmarks/*.py DONE
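For reference, E731 is usually resolved by turning the named lambda into a def. The snippet below is a generic illustration with a made-up function name (`get_total`); it is not the actual line 279 of groupby.py.

```python
# Flagged by flake8 as E731: a lambda bound to a name.
# get_total = lambda values: sum(values)


# Preferred form: a def, which also gives the function a real __name__.
def get_total(values):
    return sum(values)


print(get_total([1, 2, 3]))
```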

@jorisvandenbossche jorisvandenbossche merged commit 5d9151c into pandas-dev:master Dec 11, 2017
@jorisvandenbossche
Member

Thanks!

@jorisvandenbossche
Member

For a follow-up, can you 'fix' the usage of those rolling_ functions? E.g. have the dict depend on whether pd.DataFrame.rolling exists and otherwise fall back to rolling_.. etc. (This can be done in #18725.)
See #18723, then that PR does not need to worry about the benchmarks.
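One possible shape for that follow-up, as a sketch only: build the dispatch dict from `hasattr(pd.DataFrame, 'rolling')` and fall back to the module-level `rolling_*` functions otherwise. The helper name `build_rolling_dispatch` and the choice to pass the pandas module in as an argument are assumptions made for illustration, not what the benchmark file does.

```python
METHODS = ['median', 'mean', 'min', 'max', 'var', 'skew', 'kurt', 'std']


def build_rolling_dispatch(pd_module, methods=METHODS):
    """Map 'rolling_<name>' to a callable taking (arr, win)."""
    dispatch = {}
    have_rolling = hasattr(pd_module.DataFrame, 'rolling')
    for name in methods:
        key = 'rolling_' + name
        if have_rolling:
            # Newer API: DataFrame(arr).rolling(win).<name>().
            # `_name=name` pins the current loop value for this closure.
            def call(arr, win, _name=name):
                rolled = pd_module.DataFrame(arr).rolling(win)
                return getattr(rolled, _name)()
        else:
            # Older API: module-level functions such as pd.rolling_mean.
            call = getattr(pd_module, key)
        dispatch[key] = call
    return dispatch


# Intended use inside a benchmark (requires pandas):
#     import pandas as pd
#     rolling = build_rolling_dispatch(pd)
#     rolling['rolling_mean'](arr, win)
```

Passing the module in keeps the version check in one place, so the benchmark body can keep calling `rolling[method](arr, win)` unchanged.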

@topper-123
Contributor

See #18723, then that PR does not need to worry about the benchmarks.

So I'll roll back my changes in the gil.py file, unless otherwise notified.

@jorisvandenbossche
Member

So I'll roll back my changes in the gil.py file, unless otherwise notified.

Yes, that is probably the easiest; then you don't need to worry about merge conflicts.

@mroeschke mroeschke deleted the asv_clean_gil branch December 12, 2017 03:20
4 participants