
BUG: fix setitem with enlargement with pyarrow Scalar #52833

Conversation

jorisvandenbossche (Member):

Potentially closes #52235

This updates lib.is_scalar to recognize pyarrow.Scalar objects.

Further, it updates a very specific code path for setitem with enlargement with a scalar to preserve the arrow extension dtype (#52235). Many other places would need similar specific handling to support pyarrow Scalars more generally.
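
For reference, a minimal reproduction of the behavior this PR targets (#52235); a sketch, and since the PR was ultimately not merged, the observed output depends on the pandas/pyarrow versions installed:

import pandas as pd
import pyarrow as pa

ser = pd.Series([1, 2, 3], dtype="int64[pyarrow]")

# Setitem with enlargement: the label 3 does not exist yet, so the
# series is expanded by one row.
ser.loc[3] = pa.scalar(4, type=pa.int64())

# Without the fix, the enlargement falls back to object dtype;
# with it, the int64[pyarrow] dtype is preserved.
print(ser.dtype)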

elif lib.is_pyarrow_scalar(obj):
    return (
        obj.is_null()
        and hasattr(dtype, "pyarrow_dtype")
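
For context, a small illustration of the two conditions in the branch above (a sketch; is_valid is the validity flag pyarrow scalars expose publicly):

import pandas as pd
import pyarrow as pa

# A typed null scalar: pyarrow scalars carry their type and validity.
null_scalar = pa.scalar(None, type=pa.int64())
print(null_scalar.is_valid)  # False

# ArrowDtype exposes a pyarrow_dtype attribute, which is what the
# hasattr(dtype, "pyarrow_dtype") check above is duck-typing for.
dtype = pd.ArrowDtype(pa.int64())
print(hasattr(dtype, "pyarrow_dtype"))  # True
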
Member:

Is this a proxy for isinstance(dtype, ArrowDtype)?

Member:

xref #51378

jorisvandenbossche (Author):

> Is this a proxy for isinstance(dtype, ArrowDtype)?

Yes, I wanted to ask about that: currently this file does not import anything from pandas.core.arrays (only from libs and other core.dtypes files). I imagine that is on purpose?
It's a bit of a strange situation, with some of our own dtypes being defined in core.dtypes.dtypes (and so those can be imported) and some others in core.arrays.

Member:

> I imagine that is on purpose?

Yes. The idea was that core.dtypes can be mostly before/above the rest of core in the dependency structure (which is reflected in the isort order). If importing ArrowDtype here avoids a code smell, I won't object.

> It's a bit of a strange situation, with some of our own dtypes being defined in core.dtypes.dtypes (and so those can be imported) and some others in core.arrays.

Agreed. I've been thinking recently it would be nice to move ArrowDtype and SparseDtype to core.dtypes.dtypes.


cpdef is_pyarrow_scalar(obj):
    if PYARROW_INSTALLED:
        return isinstance(obj, pa.Scalar)
Member:

For my edification: if we did have pyarrow as a required dependency, is there a more performant version of this check? IIRC you said something about lack of ABI stability, so that might not be viable?

jorisvandenbossche (Author):

No, this is exactly what we would do if pyarrow were a required dependency, except for the if PYARROW_INSTALLED: check (but that should be fast, since it's checking a boolean).

Only if we added a compile-time dependency on pyarrow could we cimport the class, and that might be faster. (But that's not part of the current discussion, and this specific scalar check is certainly not in itself worth a compile-time dependency.)
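
The actual check lives in Cython (the lib module shown in the diff above); as an aside, this is a plain-Python sketch of the same optional-dependency pattern:

try:
    import pyarrow as pa
    PYARROW_INSTALLED = True
except ImportError:
    PYARROW_INSTALLED = False


def is_pyarrow_scalar(obj) -> bool:
    # The boolean guard is cheap, so the isinstance check only runs
    # when pyarrow is actually available.
    if PYARROW_INSTALLED:
        return isinstance(obj, pa.Scalar)
    return False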

Member:

> Only if we added a compile-time dependency on pyarrow

Yeah, that's the one I was wondering about.

@@ -238,6 +264,7 @@ def is_scalar(val: object) -> bool:

    # Note: PyNumber_Check check includes Decimal, Fraction, numbers.Number
    return (PyNumber_Check(val)
            or is_pyarrow_scalar(val)
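
With the patch applied, the public wrapper around lib.is_scalar would classify pyarrow scalars correctly (illustrative; since the PR was closed unmerged, released pandas may behave differently):

import pyarrow as pa
from pandas.api.types import is_scalar

# With this change, a pyarrow scalar is recognized as a scalar instead
# of falling through to the sequence checks later in the function.
print(is_scalar(pa.scalar(1)))
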
Member:

Are there any sequence-like scalars that could accidentally get caught by the PySequence_Check check?

jorisvandenbossche (Author), Apr 21, 2023:

Ah, yes, that's a good point. For example, a ListScalar has __getitem__, and I think that alone is already enough to pass PySequence_Check.

I can move it before PySequence_Check then?

I am wondering if we shouldn't remove getitem/len from ListScalar et al. on the pyarrow side. Having scalars that also behave as sequences is just very annoying, and it's only there for a bit of convenience (although we will also have to handle this for python list objects).
(For example, in shapely 2.0 we made all the LineString and Polygon objects non-sequences, to avoid the issues of putting those in numpy arrays.)
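
To make the hazard concrete, a list-typed pyarrow scalar is conceptually a scalar but quacks like a sequence, which is exactly what sequence-style checks match on:

import pyarrow as pa

list_scalar = pa.scalar([1, 2, 3], type=pa.list_(pa.int64()))
print(list_scalar[0])    # supports __getitem__, like a sequence
print(len(list_scalar))  # and __len__, so PySequence_Check-style logic matches it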

Member:

I'd be OK with moving it to before the PySequence_Check call. I'm guessing the perf difference is only a few nanoseconds?

@@ -2098,8 +2099,15 @@ def _setitem_with_indexer_missing(self, indexer, value):
    # We should not cast, if we have object dtype because we can
    # set timedeltas into object series
    curr_dtype = self.obj.dtype
    curr_dtype = getattr(curr_dtype, "numpy_dtype", curr_dtype)
Member:

This is already pretty kludgy...

jorisvandenbossche (Author):

Yeah, I certainly don't disagree ... but this is not really new to this PR; happy to hear suggestions.

This part of the code was added in #47342 to support retaining the dtype in setitem with expansion.
I am wondering, though, if we could simplify this by "just" trying to create an array from the scalar with the dtype of the series first and, if that fails, falling back on creating an array without specifying a dtype, letting inference and concat/common_dtype do their thing (so without trying to determine the new_dtype up front).

jorisvandenbossche (Author):

Something like

try:
    new_values = Series([value], dtype=self.obj.dtype)._values
except ...:
    new_values = Series([value])._values

(Although I assume there is a reason for all the checks ... I can check which tests would fail with the above.)

Member:

I expect the try/except approach here would run into the same set of problems that _from_scalars is intended to address. I'm hoping you and I can collaborate on implementing that soonish.

Maybe shorter-term something like:

dummy = ser.iloc[:1].copy()  # assuming self.obj is a Series; needs an additional getitem otherwise
dummy.iloc[0] = value

Then see if dummy raises/casts to determine if we can keep the dtype? This would be a little bit like #36226 but without a new method.
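
A slightly fleshed-out sketch of that probe (_can_keep_dtype is a hypothetical helper name, not pandas API, and it ignores the empty-series edge case):

import pandas as pd

def _can_keep_dtype(ser: pd.Series, value) -> bool:
    # Probe on a one-row copy: if the assignment raises, or silently
    # casts to a different dtype, the enlargement cannot keep the
    # current dtype.
    dummy = ser.iloc[:1].copy()
    try:
        dummy.iloc[0] = value
    except (TypeError, ValueError):
        return False
    return dummy.dtype == ser.dtype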

jbrockmendel (Member):

In #27462 you suggested a method on the EADtype/EA to check for a recognized scalar (might be similar to _recognized_scalars on our datetimelike EAs). I think this might be the right approach long-term (which isn't mutually exclusive with this PR in the short term).

jbrockmendel (Member):

I went down a rabbit hole looking at issues regarding is_scalar and have a half-written Deep Dive post about it. For the time being I think we should hold off on updating is_scalar, since it is public, and do a targeted fix in the indexing code.

mroeschke added the Arrow (pyarrow functionality) label on Apr 28, 2023

if lib.is_pyarrow_scalar(value) and hasattr(
    curr_dtype, "pyarrow_dtype"
):
    # TODO promote arrow scalar and type
Member:

This comment is referring to e.g. an int16[pyarrow] dtype where the value is a too-big integer, right? It looks like the existing code path might actually work for regular Python ints, but that seems fragile. I'm thinking the robust way to handle this would be to explicitly handle ArrowEADtype in maybe_promote.
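
For example (illustrative only; what actually happens here varies by version, since the promotion path was never finished in this PR):

import pandas as pd

ser = pd.Series([1, 2], dtype="int16[pyarrow]")

# 40000 does not fit in int16; a maybe_promote that understands the
# arrow dtype would need to promote (e.g. to int32[pyarrow]) rather
# than error out or fall back to object.
ser.loc[2] = 40000
print(ser.dtype)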

github-actions bot (Contributor), Jun 4, 2023:

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

github-actions bot added the Stale label on Jun 4, 2023

mroeschke (Member):

Looks like this PR has gone stale. Closing to clear the queue but feel free to reopen when you have time

mroeschke closed this on Aug 1, 2023

Labels: Arrow (pyarrow functionality), Stale

Linked issue (would have been closed by merging): BUG: Setting a pyarrow scalar value with the same type that expands the series changes the dtype into object (#52235)