PERF: MultiIndex.get_locs #45931

Merged (3 commits, Feb 17, 2022)

Conversation

lukemanley (Member) commented Feb 11, 2022

Perf improvements for MultiIndex.get_locs.

import numpy as np
import pandas as pd

n1 = 10 ** 7
n2 = 10

mi = pd.MultiIndex.from_product([np.arange(n1), np.arange(n2)])

%timeit mi.get_locs([n1 - 1])
756 ms ± 39.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- main
13 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- PR

Two changes:

  1. Use pandas.core.algorithms.searchsorted rather than numpy.searchsorted.

This comment in algos.searchsorted explains the rationale:

# if `arr` and `value` have different dtypes, `arr` would be
# recast by numpy, causing a slow search.
# Before searching below, we therefore try to give `value` the
# same dtype as `arr`, while guarding against integer overflows.
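
As an illustration of that rationale, here is a minimal sketch (hypothetical helper name, not the pandas implementation): give the lookup value the array's dtype before searching, provided it fits, so numpy never has to recast the array itself.

import numpy as np

def searchsorted_matched_dtype(arr, value, side="left"):
    # Hypothetical helper for illustration only. If the lookup value fits in
    # the array's integer dtype, downcast the value so numpy can search `arr`
    # as-is instead of upcasting a copy of the whole array.
    if np.issubdtype(arr.dtype, np.integer):
        info = np.iinfo(arr.dtype)
        if info.min <= value <= info.max:
            value = arr.dtype.type(value)
    return arr.searchsorted(value, side=side)

arr = np.arange(1_000, dtype=np.int32)
print(searchsorted_matched_dtype(arr, np.int64(999)))  # 999, without recasting arr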

Mismatched dtypes are common here because factorize_from_iterable returns codes with different integer dtypes depending on how many distinct values there are:

from pandas.core.arrays.categorical import factorize_from_iterable

codes, levels = factorize_from_iterable(range(10 ** 2))
print(codes.dtype)  # int8

codes, levels = factorize_from_iterable(range(10 ** 5))
print(codes.dtype)  # int32

The impact when using numpy.searchsorted directly with mismatched types can be significant:

import numpy as np
import pandas.core.algorithms as algos

arr32 = np.arange(10 ** 8, dtype=np.int32)
arr64 = np.arange(10 ** 8, dtype=np.int64)

idx = np.int64(len(arr32) - 1)

%timeit arr32.searchsorted(idx)         # numpy, int64 scalar vs. int32 array
103 ms ± 6.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit arr64.searchsorted(idx)         # numpy, matching dtypes
458 ns ± 16.8 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

%timeit algos.searchsorted(arr32, idx)  # pandas, casts the scalar to int32 first
11.5 µs ± 165 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

%timeit algos.searchsorted(arr64, idx)
10.9 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
  2. Avoid allocating and intersecting a large array up front in MultiIndex.get_locs; a rough sketch of the idea is shown below. This change has a bigger impact than the first.
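
A rough sketch of the idea (hypothetical helper, not the actual pandas code): seed the result with the first level's matching positions instead of a full np.arange over the whole index, and intersect the remaining levels into that.

import numpy as np

def combine_level_locs(level_locs):
    # level_locs: one array of matching positions per selected level.
    # Rather than starting from np.arange(len(index)) and intersecting each
    # level's positions into that full-size array, start from the first
    # level's positions and narrow the selection from there.
    indexer = level_locs[0]
    for locs in level_locs[1:]:
        indexer = np.intersect1d(indexer, locs, assume_unique=True)
    return indexer

# illustrative use: positions matching level 0, narrowed by level 1
print(combine_level_locs([np.array([6, 7, 8]), np.array([0, 3, 6, 9, 12])]))  # [6]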

One asv benchmark added:

       before           after         ratio
     [c24a3c8b]       [bd742436]
     <main>           <multiindex-get-locs>
-        543±20μs         461±10μs     0.85  multiindex_object.GetLocs.time_small_get_locs
-         764±6μs         601±10μs     0.79  multiindex_object.GetLocs.time_med_get_locs
-      22.8±0.5ms      7.59±0.09ms     0.33  multiindex_object.GetLocs.time_large_get_locs
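
For context, an asv benchmark along these lines might look roughly like the sketch below; the class name, index sizes, and keys are illustrative, not necessarily the exact benchmark added here.

import numpy as np
import pandas as pd

class GetLocs:
    # asv-style benchmark: setup() builds the index once per run,
    # and the time_* method is what gets measured.
    def setup(self):
        self.mi = pd.MultiIndex.from_product(
            [np.arange(1_000), np.arange(20), list("abcdefghij")]
        )

    def time_get_locs(self):
        self.mi.get_locs([999, 19, "a"])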

jreback added the Indexing, MultiIndex, and Performance labels Feb 11, 2022
jreback added this to the 1.5 milestone Feb 11, 2022
jreback (Contributor) commented Feb 11, 2022

Does this fully cover #38650?

lukemanley (Member, Author):

> Does this fully cover #38650?

Using the example from #38650 (see results below), these changes make a decent impact. However, #38650 includes additional discussion about an alternative data structure, which I suspect is still worth pursuing for further improvements. I can leave #38650 open if you want to keep that discussion going there.

from datetime import datetime
import pandas as pd
import numpy as np
import timeit


dates = pd.date_range('1997-01-01', '2020-12-31')
cats = list('abcdefghijklmnop')

multi = pd.MultiIndex.from_product([dates, cats, cats])

series = pd.Series(np.random.rand(len(multi)), index=multi)
date = datetime(year=2020, month=1, day=1)

# generate cache
series.loc[(date, 'a', 'a')]

repeats = 1000

print("Performance of indexing with full keys")
print(timeit.timeit(lambda: series.loc[(date, 'a', 'a')], number=repeats))

print("Performance of indexing with partial keys")
print(timeit.timeit(lambda: series.loc[date], number=repeats))
Performance of indexing with full keys
0.1040964999992866   <- main
0.10500609999871813  <- PR

Performance of indexing with partial keys
4.212487299999339    <- main
0.21816960000433028  <- PR

jreback (Contributor) commented Feb 11, 2022

Thanks @lukemanley.

OK to leave that issue open, though this seems to get a lot of the low-hanging fruit, so it may not be worth it; happy to be proven wrong.

lukemanley (Member, Author):

> OK to leave that issue open

Sounds good. It's now open.

lukemanley (Member, Author):

@jreback - gentle ping. If this looks OK, I have a few perf-related follow-ups for MultiIndex indexing.

jreback merged commit 9d6d587 into pandas-dev:main Feb 17, 2022
jreback (Contributor) commented Feb 17, 2022

Thanks @lukemanley, this is great!

rhshadrach (Member):

This looks like a great performance improvement for partial indexing; I think it should also get a line in the whatsnew.

lukemanley (Member, Author):

> This looks like a great performance improvement for partial indexing; I think it should also get a line in the whatsnew.

#46040 (still open) adds some additional performance improvements beyond this, and I'm working on one more as well. I can expand the whatsnew entry in one of the pending PRs.

lukemanley deleted the multiindex-get-locs branch March 2, 2022
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
Linked issue that this PR may close: PERF: .loc slow with large DataFrame with MultiIndex while old pandas versions perform well