PERF: MultiIndex.get_locs #45931

Merged (3 commits, Feb 17, 2022)

Conversation

lukemanley (Member) commented Feb 11, 2022

Perf improvements for MultiIndex.get_locs.

import numpy as np
import pandas as pd

n1 = 10 ** 7
n2 = 10

mi = pd.MultiIndex.from_product([np.arange(n1), np.arange(n2)])

%timeit mi.get_locs([n1 - 1])
756 ms ± 39.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- main
13 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- PR

Two changes:

  1. Use pandas.core.algorithms.searchsorted rather than numpy.searchsorted.

This comment in algos.searchsorted explains the rationale:

# if `arr` and `value` have different dtypes, `arr` would be
# recast by numpy, causing a slow search.
# Before searching below, we therefore try to give `value` the
# same dtype as `arr`, while guarding against integer overflows.
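
As an illustration of that rationale, here is a minimal sketch (hypothetical helper name, not the pandas implementation): give the lookup value the array's dtype before searching, provided it fits, so numpy never has to recast the array itself.

import numpy as np

def searchsorted_matched_dtype(arr, value, side="left"):
    # Hypothetical helper for illustration only. If the lookup value fits in
    # the array's integer dtype, downcast the value so numpy can search `arr`
    # as-is instead of upcasting a copy of the whole array.
    if np.issubdtype(arr.dtype, np.integer):
        info = np.iinfo(arr.dtype)
        if info.min <= value <= info.max:
            value = arr.dtype.type(value)
    return arr.searchsorted(value, side=side)

arr = np.arange(1_000, dtype=np.int32)
print(searchsorted_matched_dtype(arr, np.int64(999)))  # 999, without recasting arr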

Mismatched dtypes are common here because factorize_from_iterable returns codes with different integer dtypes depending on how many distinct values there are:

from pandas.core.arrays.categorical import factorize_from_iterable

codes, levels = factorize_from_iterable(range(10 ** 2))
print(codes.dtype)  # int8

codes, levels = factorize_from_iterable(range(10 ** 5))
print(codes.dtype)  # int32

The impact when using numpy.searchsorted directly with mismatched types can be significant:

import numpy as np
import pandas.core.algorithms as algos

arr32 = np.arange(10 ** 8, dtype=np.int32)
arr64 = np.arange(10 ** 8, dtype=np.int64)

idx = np.int64(len(arr32) - 1)

%timeit arr32.searchsorted(idx)         # numpy, int64 scalar vs. int32 array
103 ms ± 6.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit arr64.searchsorted(idx)         # numpy, matching dtypes
458 ns ± 16.8 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

%timeit algos.searchsorted(arr32, idx)  # pandas, casts the scalar to int32 first
11.5 µs ± 165 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

%timeit algos.searchsorted(arr64, idx)
10.9 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
  2. Avoid allocating and intersecting a large array up front in MultiIndex.get_locs; a rough sketch of the idea is shown below. This change has a bigger impact than the first.
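
A rough sketch of the idea (hypothetical helper, not the actual pandas code): seed the result with the first level's matching positions instead of a full np.arange over the whole index, and intersect the remaining levels into that.

import numpy as np

def combine_level_locs(level_locs):
    # level_locs: one array of matching positions per selected level.
    # Rather than starting from np.arange(len(index)) and intersecting each
    # level's positions into that full-size array, start from the first
    # level's positions and narrow the selection from there.
    indexer = level_locs[0]
    for locs in level_locs[1:]:
        indexer = np.intersect1d(indexer, locs, assume_unique=True)
    return indexer

# illustrative use: positions matching level 0, narrowed by level 1
print(combine_level_locs([np.array([6, 7, 8]), np.array([0, 3, 6, 9, 12])]))  # [6]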

One asv benchmark added:

       before           after         ratio
     [c24a3c8b]       [bd742436]
     <main>           <multiindex-get-locs>
-        543±20μs         461±10μs     0.85  multiindex_object.GetLocs.time_small_get_locs
-         764±6μs         601±10μs     0.79  multiindex_object.GetLocs.time_med_get_locs
-      22.8±0.5ms      7.59±0.09ms     0.33  multiindex_object.GetLocs.time_large_get_locs
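
For context, an asv benchmark along these lines might look roughly like the sketch below; the class name, index sizes, and keys are illustrative, not necessarily the exact benchmark added here.

import numpy as np
import pandas as pd

class GetLocs:
    # asv-style benchmark: setup() builds the index once per run,
    # and the time_* method is what gets measured.
    def setup(self):
        self.mi = pd.MultiIndex.from_product(
            [np.arange(1_000), np.arange(20), list("abcdefghij")]
        )

    def time_get_locs(self):
        self.mi.get_locs([999, 19, "a"])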

jreback added the Indexing, MultiIndex, and Performance labels Feb 11, 2022
jreback added this to the 1.5 milestone Feb 11, 2022
jreback (Contributor) commented Feb 11, 2022

Does this fully cover #38650?

lukemanley (Member, Author):

> Does this fully cover #38650?

Using the example from #38650 (see results below), these changes make a decent impact. However, #38650 includes additional discussion about an alternative data structure, which I suspect is still worth pursuing for further improvements. I can leave #38650 open if you want to keep that discussion going there.

from datetime import datetime
import pandas as pd
import numpy as np
import timeit


dates = pd.date_range('1997-01-01', '2020-12-31')
cats = list('abcdefghijklmnop')

multi = pd.MultiIndex.from_product([dates, cats, cats])

series = pd.Series(np.random.rand(len(multi)), index=multi)
date = datetime(year=2020, month=1, day=1)

# generate cache
series.loc[(date, 'a', 'a')]

repeats = 1000

print("Performance of indexing with full keys")
print(timeit.timeit(lambda: series.loc[(date, 'a', 'a')], number=repeats))

print("Performance of indexing with partial keys")
print(timeit.timeit(lambda: series.loc[date], number=repeats))
Performance of indexing with full keys
0.1040964999992866   <- main
0.10500609999871813  <- PR

Performance of indexing with partial keys
4.212487299999339    <- main
0.21816960000433028  <- PR

jreback (Contributor) commented Feb 11, 2022

Thanks @lukemanley.

OK to leave that issue open, though this seems to get a lot of the low-hanging fruit, so it may not be worth it; happy to be proven wrong.

lukemanley (Member, Author):

> OK to leave that issue open

Sounds good. It's now open.

lukemanley (Member, Author):

@jreback - gentle ping. If this looks OK, I have a few perf-related follow-ups for MultiIndex indexing.

jreback merged commit 9d6d587 into pandas-dev:main Feb 17, 2022
jreback (Contributor) commented Feb 17, 2022

Thanks @lukemanley, this is great!

rhshadrach (Member):

This looks like a great performance improvement for partial indexing; I think it should also get a line in the whatsnew.

lukemanley (Member, Author):

> This looks like a great performance improvement for partial indexing; I think it should also get a line in the whatsnew.

#46040 (still open) adds some additional performance improvements beyond this, and I'm working on one more as well. I can expand the whatsnew entry in one of the pending PRs.

lukemanley deleted the multiindex-get-locs branch March 2, 2022
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
Linked issue that this PR may close: PERF: .loc slow with large DataFrame with MultiIndex while old pandas versions perform well