ENH: Reimplement DataFrame.lookup #61185

Closed
wants to merge 20 commits into from
spacing
stevenae committed Mar 26, 2025
commit e0b0b57ffcc5fe2031777d7b2054e9bb0c146246
10 changes: 6 additions & 4 deletions doc/source/user_guide/indexing.rst
@@ -1458,11 +1458,13 @@ default value.

The :meth:`~pandas.DataFrame.lookup` method
-------------------------------------------
Sometimes you want to extract a set of values given a sequence of row labels
and column labels, and the ``lookup`` method allows for this and returns a
NumPy array. For instance:

.. ipython:: python
Sometimes you want to extract a set of values given a sequence of row labels
and column labels, and the ``lookup`` method allows for this and returns a
NumPy array. For instance:
Member

Do we have other places in our API where we return a NumPy array? With the prevalence of the Arrow type system, it doesn't seem desirable to be locked into returning a NumPy array.

Contributor Author

It looks like values also does this.

Member

Agreed. I think this API should return an ExtensionArray or a NumPy array, depending on the initial type or result type.

Member

values only returns a NumPy array for numpy types. For extension types or arrow-backed types you get something different:

>>> pd.Series([1, 2, 3], dtype="int64[pyarrow]").values
<ArrowExtensionArray>
[1, 2, 3]
Length: 3, dtype: int64[pyarrow]

I don't think we should force a NumPy array return here; particularly for string data, the conversion could be expensive.
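The same behaviour can be reproduced without pyarrow installed. The sketch below uses the nullable ``Int64`` dtype as a stand-in for the Arrow-backed dtype in the snippet above; that substitution and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# NumPy-backed column: .values hands back a plain ndarray.
numpy_backed = pd.Series([1, 2, 3], dtype="int64")

# Extension-backed column (nullable Int64 used here instead of pyarrow).
masked = pd.Series([1, 2, 3], dtype="Int64")

numpy_values = numpy_backed.values
masked_values = masked.values

# to_numpy() is the explicit request for an ndarray, whatever the dtype.
masked_as_numpy = masked.to_numpy()

print(type(numpy_values).__name__)
print(type(masked_values).__name__)
print(type(masked_as_numpy).__name__)
```

Only ``to_numpy()`` is guaranteed to produce an ndarray, so a function that always returns one has to pay for that conversion on extension-backed data.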

Contributor Author

@stevenae, Mar 31, 2025

Thought this through and did a more heavy-handed rewrite.

It now uses melt to achieve the outcome of values or to_numpy.

Performance does take a hit; however, we are still outperforming the naive to_numpy lookup for mixed-type lookups.

            Old PR                 New PR
2 100       0.1964133749715984     0.5150299999950221
            0.274302874924615      0.5055611249990761
3 100       0.15044220816344023    0.48040162499819417
            0.2768622918520123     0.5237024579982972
4 100       0.15489325020462275    0.49075670799356885
            0.26732829213142395    0.5079907500039553
5 100       0.1546538749244064     0.4678692500019679
            0.2721201251260936     0.5082256250025239
2 100000    0.8096102089621127     2.114792499996838
            1.9508202918805182     2.619460332993185
3 100000    0.8242515418678522     2.2221941250027157
            1.9535491249989718     2.6292148750071647
4 100000    0.8302762501407415     2.3314981659932528
            1.9240409170743078     2.711707041991758
5 100000    0.8654224998317659     2.201970291993348
            2.0630989999044687     2.674396375005017
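For readers following the thread, a melt-based lookup could look roughly like the sketch below. This is not the code from this PR: ``lookup_via_melt`` and the temporary column names ``row``, ``col`` and ``value`` are hypothetical, and the sketch assumes unique row and column labels and no existing column called ``row``.

```python
import pandas as pd


def lookup_via_melt(df, row_labels, col_labels):
    """Hypothetical helper: pairwise label lookup built on DataFrame.melt."""
    # Long format: one record per (row label, column label, value).
    long = (
        df.rename_axis(index="row")
        .reset_index()
        .melt(id_vars="row", var_name="col", value_name="value")
    )
    # One record per requested (row, column) pair, in the requested order.
    keys = pd.DataFrame({"row": list(row_labels), "col": list(col_labels)})
    # A left merge keeps the order of `keys`; the result is a Series.
    return keys.merge(long, on=["row", "col"], how="left")["value"]


# Mixed-dtype frame: the melted value column falls back to object dtype.
mixed = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})
picked = lookup_via_melt(mixed, [0, 2], ["B", "A"])
print(picked)
```

Reshaping the whole frame to long format before selecting a handful of cells is one plausible reason for a slowdown of the kind reported in the timings above.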

Member

> Do we have other places in our API where we return a NumPy array?

factorize

> With the prevalence of the Arrow type system this doesn't seem desirable to be locked into returning a NumPy array

This function can operate on multiple columns of different dtypes. I think the only option in such a case is to return a NumPy array.

Member

That's true on factorize but that isn't 100% an equivalent comparison. For sure the indexer is a numpy array, but the values in the two-tuple are an Index that should be type-preserving.

That's also a great point on the mixed column types, but that makes me wary of re-implementing this function. With all of the work going towards clarifying our nullability handling and implementing more than just NumPy types, it seems like this function is going to have a ton of edge cases
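A small sketch of the factorize point, assuming a recent pandas release in which ``Index`` can hold nullable dtypes; the example data and variable names are mine, not from the thread.

```python
import numpy as np
import pandas as pd

ser = pd.Series([10, 20, 10, 30], dtype="Int64")

# factorize returns a two-tuple: integer codes and the unique values.
codes, uniques = ser.factorize()

print(type(codes).__name__, codes.dtype)
print(type(uniques).__name__, uniques.dtype)
```

The codes are a plain NumPy array, while the uniques come back as an ``Index`` that keeps the input dtype.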

Member

We could also wrap the result in a Series.


.. ipython:: python

   dflookup = pd.DataFrame(np.random.rand(20, 4), columns = ['A', 'B', 'C', 'D'])
   dflookup.lookup(list(range(0, 10, 2)), ['B', 'C', 'A', 'B', 'D'])
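``DataFrame.lookup`` was deprecated in pandas 1.2 and removed in 2.0, so the call in the diff only runs on older releases or with this PR applied. A minimal sketch of the same pairwise lookup on a released pandas, assuming unique row and column labels (the seeded generator and the ``KeyError`` check are my additions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dflookup = pd.DataFrame(rng.random((20, 4)), columns=["A", "B", "C", "D"])

row_labels = list(range(0, 10, 2))
col_labels = ["B", "C", "A", "B", "D"]

# Translate labels to integer positions; get_indexer marks misses with -1.
row_pos = dflookup.index.get_indexer(row_labels)
col_pos = dflookup.columns.get_indexer(col_labels)
if (row_pos == -1).any() or (col_pos == -1).any():
    raise KeyError("one or more labels were not found")

# Pairwise fancy indexing: element i is the cell (row_pos[i], col_pos[i]).
result = dflookup.to_numpy()[row_pos, col_pos]
print(result)
```

This always returns a NumPy array, which is the behaviour questioned in the review comments above.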
