[ArrowStringArray] PERF: isin using native pyarrow function if available #41281

Merged
merged 3 commits into from
May 5, 2021

simonjayhawkins
Member

Unlike MaskedArray, this returns a NumPy bool array. This keeps it consistent with the EA interface and StringArray, and it matches the latest version of pyarrow, where the returned boolean array has no null values.
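For readers unfamiliar with the approach, here is a minimal sketch of the idea, not the code in this PR. It assumes a plain Python list of strings as input, uses a hypothetical helper name `isin_sketch`, and simplifies missing-value handling so that `None` never matches. It calls `pyarrow.compute.is_in` when pyarrow is importable and otherwise falls back to a set lookup, returning a plain NumPy bool array in both cases.

```python
import numpy as np

try:
    import pyarrow as pa
    import pyarrow.compute as pc
except ImportError:  # pyarrow is optional in this sketch
    pa = None


def isin_sketch(strings, values):
    """Hypothetical helper: return a NumPy bool mask of membership.

    Simplifying assumption: missing entries (None) never match.
    """
    # Only string candidates can match entries of a string array.
    value_set = [v for v in values if isinstance(v, str)]
    if not value_set:
        return np.zeros(len(strings), dtype=bool)

    if pa is not None:
        result = pc.is_in(
            pa.array(strings, type=pa.string()),
            value_set=pa.array(value_set, type=pa.string()),
        )
        # dtype=bool maps any null (None) in the result to False.
        return np.array(result.to_pylist(), dtype=bool)

    # Fallback when pyarrow is not installed.
    lookup = set(value_set)
    return np.array([s in lookup for s in strings], dtype=bool)


mask = isin_sketch(["a", None, "c", "b"], ["a", "b", 1])
```

The early return for an empty candidate set mirrors the `np.zeros(len(self), dtype=bool)` line visible in the reviewed diff; everything else here is illustrative.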

[  0.00%] ·· Benchmarking existing-py_home_simon_miniconda3_envs_pandas-dev_bin_python
[  4.17%] ··· algos.isin.IsIn.time_isin                                                                                                                   ok
[  4.17%] ··· ================== ==========
                    dtype                  
              ------------------ ----------
                    int64         295±0μs  
                    uint64        348±0μs  
                    object        337±0μs  
                    Int64         785±0μs  
                   boolean        868±0μs  
                     bool         420±0μs  
                datetime64[ns]    4.67±0ms 
               category[object]   9.46±0ms 
                category[int]     7.30±0ms 
                     str          535±0μs  
                    string        556±0μs  
                 arrow_string     330±0μs  
              ================== ==========

[  8.33%] ··· algos.isin.IsIn.time_isin_categorical                                                                                                       ok
[  8.33%] ··· ================== ==========
                    dtype                  
              ------------------ ----------
                    int64         374±0μs  
                    uint64        507±0μs  
                    object        467±0μs  
                    Int64         633±0μs  
                   boolean        702±0μs  
                     bool         458±0μs  
                datetime64[ns]    3.10±0ms 
               category[object]   10.2±0ms 
                category[int]     9.12±0ms 
                     str          598±0μs  
                    string        628±0μs  
                 arrow_string     404±0μs  
              ================== ==========

[ 12.50%] ··· algos.isin.IsIn.time_isin_empty                                                                                                             ok
[ 12.50%] ··· ================== ==========
                    dtype                  
              ------------------ ----------
                    int64         275±0μs  
                    uint64        285±0μs  
                    object        390±0μs  
                    Int64         1.06±0ms 
                   boolean        1.14±0ms 
                     bool         280±0μs  
                datetime64[ns]    295±0μs  
               category[object]   4.17±0ms 
                category[int]     3.49±0ms 
                     str          424±0μs  
                    string        718±0μs  
                 arrow_string     140±0μs  
              ================== ==========

[ 16.67%] ··· algos.isin.IsIn.time_isin_mismatched_dtype                                                                                                  ok
[ 16.67%] ··· ================== ==========
                    dtype                  
              ------------------ ----------
                    int64         221±0μs  
                    uint64        216±0μs  
                    object        337±0μs  
                    Int64         337±0μs  
                   boolean        356±0μs  
                     bool         323±0μs  
                datetime64[ns]    348±0μs  
               category[object]   4.86±0ms 
                category[int]     3.47±0ms 
                     str          586±0μs  
                    string        525±0μs  
                 arrow_string     224±0μs  
              ================== ==========

@simonjayhawkins simonjayhawkins added Performance Memory or execution speed performance Strings String extension data type and string data labels May 3, 2021
@simonjayhawkins simonjayhawkins added this to the 1.3 milestone May 3, 2021
@simonjayhawkins simonjayhawkins changed the title [ArrowStringArray] isin using native pyarrow function if available [ArrowStringArray] PERF: isin using native pyarrow function if available May 4, 2021
return np.zeros(len(self), dtype=bool)

kwargs = {}
if LooseVersion(pa.__version__) < "3.0.0":
Contributor

see my suggestion elsewhere, let's create these accessors before doing this.

Member

We can add some variables to pandas/compat/__init__.py, similarly as we have for Python and numpy versions.

@jreback jreback merged commit 691a2c4 into pandas-dev:master May 5, 2021
Contributor

jreback commented May 5, 2021

thanks @simonjayhawkins

@simonjayhawkins simonjayhawkins deleted the arrow-isin branch May 5, 2021 12:55
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Labels
Performance Memory or execution speed performance Strings String extension data type and string data
3 participants