
BUG: fix setitem with enlargement with pyarrow Scalar #52833

Conversation

jorisvandenbossche (Member):

Potentially closes #52235

This updates lib.is_scalar to recognize pyarrow.Scalar objects.

Further, it updates a very specific code path for setitem with enlargement with a scalar to preserve the arrow extension dtype (#52235). Many other places would need similar specific handling to support pyarrow Scalars more generally.
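
For reference, a minimal reproduction of the behavior this PR targets (#52235); a sketch, and since the PR was ultimately not merged, the observed output depends on the pandas/pyarrow versions installed:

import pandas as pd
import pyarrow as pa

ser = pd.Series([1, 2, 3], dtype="int64[pyarrow]")

# Setitem with enlargement: the label 3 does not exist yet, so the
# series is expanded by one row.
ser.loc[3] = pa.scalar(4, type=pa.int64())

# Without the fix, the enlargement falls back to object dtype;
# with it, the int64[pyarrow] dtype is preserved.
print(ser.dtype)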

elif lib.is_pyarrow_scalar(obj):
    return (
        obj.is_null()
        and hasattr(dtype, "pyarrow_dtype")
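
For context, a small illustration of the two conditions in the branch above (a sketch; is_valid is the validity flag pyarrow scalars expose publicly):

import pandas as pd
import pyarrow as pa

# A typed null scalar: pyarrow scalars carry their type and validity.
null_scalar = pa.scalar(None, type=pa.int64())
print(null_scalar.is_valid)  # False

# ArrowDtype exposes a pyarrow_dtype attribute, which is what the
# hasattr(dtype, "pyarrow_dtype") check above is duck-typing for.
dtype = pd.ArrowDtype(pa.int64())
print(hasattr(dtype, "pyarrow_dtype"))  # True
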
Member:

Is this a proxy for isinstance(dtype, ArrowDtype)?

Member:

xref #51378

jorisvandenbossche (Author):

> Is this a proxy for isinstance(dtype, ArrowDtype)?

Yes, I wanted to ask about that: currently this file does not import anything from pandas.core.arrays (only from libs and other core.dtypes files). I imagine that is on purpose?
It's a bit of a strange situation, with some of our own dtypes being defined in core.dtypes.dtypes (and so those can be imported) and some others in core.arrays.

Member:

> I imagine that is on purpose?

Yes. The idea was that core.dtypes can be mostly before/above the rest of core in the dependency structure (which is reflected in the isort order). If importing ArrowDtype here avoids a code smell, I won't object.

> It's a bit of a strange situation, with some of our own dtypes being defined in core.dtypes.dtypes (and so those can be imported) and some others in core.arrays.

Agreed. I've been thinking recently it would be nice to move ArrowDtype and SparseDtype to core.dtypes.dtypes.


cpdef is_pyarrow_scalar(obj):
    if PYARROW_INSTALLED:
        return isinstance(obj, pa.Scalar)
Member:

For my edification: if we did have pyarrow as a required dependency, is there a more performant version of this check? IIRC you said something about lack of ABI stability, so that might not be viable?

jorisvandenbossche (Author):

No, this is exactly what we would do if pyarrow were a required dependency, except for the if PYARROW_INSTALLED: check (but that should be fast, since it's checking a boolean).

Only if we added a compile-time dependency on pyarrow could we cimport the class, and that might be faster. (But that's not part of the current discussion, and this specific scalar check is certainly not in itself worth a compile-time dependency.)
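
The actual check lives in Cython (the lib module shown in the diff above); as an aside, this is a plain-Python sketch of the same optional-dependency pattern:

try:
    import pyarrow as pa
    PYARROW_INSTALLED = True
except ImportError:
    PYARROW_INSTALLED = False


def is_pyarrow_scalar(obj) -> bool:
    # The boolean guard is cheap, so the isinstance check only runs
    # when pyarrow is actually available.
    if PYARROW_INSTALLED:
        return isinstance(obj, pa.Scalar)
    return False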

Member:

> Only if we added a compile-time dependency on pyarrow

Yeah, that's the one I was wondering about.

@@ -238,6 +264,7 @@ def is_scalar(val: object) -> bool:

    # Note: PyNumber_Check check includes Decimal, Fraction, numbers.Number
    return (PyNumber_Check(val)
            or is_pyarrow_scalar(val)
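
With the patch applied, the public wrapper around lib.is_scalar would classify pyarrow scalars correctly (illustrative; since the PR was closed unmerged, released pandas may behave differently):

import pyarrow as pa
from pandas.api.types import is_scalar

# With this change, a pyarrow scalar is recognized as a scalar instead
# of falling through to the sequence checks later in the function.
print(is_scalar(pa.scalar(1)))
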
Member:

Are there any sequence-like scalars that could accidentally get caught by the PySequence_Check check?

jorisvandenbossche (Author), Apr 21, 2023:

Ah, yes, that's a good point. For example, a ListScalar has __getitem__, and I think that alone is already enough to pass PySequence_Check.

I can move it before PySequence_Check then?

I am wondering if we shouldn't remove getitem/len from ListScalar et al. on the pyarrow side. Having scalars that also behave as sequences is just very annoying, and it's only there for a bit of convenience (although we will also have to handle this for python list objects).
(For example, in shapely 2.0 we made all the LineString and Polygon objects non-sequences, to avoid the issues of putting those in numpy arrays.)
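
To make the hazard concrete, a list-typed pyarrow scalar is conceptually a scalar but quacks like a sequence, which is exactly what sequence-style checks match on:

import pyarrow as pa

list_scalar = pa.scalar([1, 2, 3], type=pa.list_(pa.int64()))
print(list_scalar[0])    # supports __getitem__, like a sequence
print(len(list_scalar))  # and __len__, so PySequence_Check-style logic matches it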

Member:

I'd be OK with moving it to before the PySequence_Check call. I'm guessing the perf difference is only a few nanoseconds?

@@ -2098,8 +2099,15 @@ def _setitem_with_indexer_missing(self, indexer, value):
    # We should not cast, if we have object dtype because we can
    # set timedeltas into object series
    curr_dtype = self.obj.dtype
    curr_dtype = getattr(curr_dtype, "numpy_dtype", curr_dtype)
Member:

This is already pretty kludgy...

jorisvandenbossche (Author):

Yeah, I certainly don't disagree ... but this is not really new to this PR; happy to hear suggestions.

This part of the code was added in #47342 to support retaining the dtype in setitem with expansion.
I am wondering, though, if we could simplify this by "just" trying to create an array from the scalar with the dtype of the series first and, if that fails, falling back on creating an array without specifying a dtype, letting inference and concat/common_dtype do their thing (so without trying to determine the new_dtype up front).

jorisvandenbossche (Author):

Something like

try:
    new_values = Series([value], dtype=self.obj.dtype)._values
except ...:
    new_values = Series([value])._values

(Although I assume there is a reason for all the checks ... I can check which tests would fail with the above.)

Member:

I expect the try/except approach here would run into the same set of problems that _from_scalars is intended to address. I'm hoping you and I can collaborate on implementing that soonish.

Maybe shorter-term something like:

dummy = ser.iloc[:1].copy()  # assuming self.obj is a Series; needs an additional getitem otherwise
dummy.iloc[0] = value

Then see if dummy raises/casts to determine if we can keep the dtype? This would be a little bit like #36226 but without a new method.
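
A slightly fleshed-out sketch of that probe (_can_keep_dtype is a hypothetical helper name, not pandas API, and it ignores the empty-series edge case):

import pandas as pd

def _can_keep_dtype(ser: pd.Series, value) -> bool:
    # Probe on a one-row copy: if the assignment raises, or silently
    # casts to a different dtype, the enlargement cannot keep the
    # current dtype.
    dummy = ser.iloc[:1].copy()
    try:
        dummy.iloc[0] = value
    except (TypeError, ValueError):
        return False
    return dummy.dtype == ser.dtype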

jbrockmendel (Member):

In #27462 you suggested a method on the EADtype/EA to check for a recognized scalar (might be similar to _recognized_scalars on our datetimelike EAs). I think this might be the right approach long-term (which isn't mutually exclusive with this PR in the short term).

jbrockmendel (Member):

I went down a rabbit hole looking at issues regarding is_scalar and have a half-written Deep Dive post about it. For the time being I think we should hold off on updating is_scalar, since it is public, and do a targeted fix in the indexing code.

mroeschke added the Arrow (pyarrow functionality) label on Apr 28, 2023

if lib.is_pyarrow_scalar(value) and hasattr(
    curr_dtype, "pyarrow_dtype"
):
    # TODO promote arrow scalar and type
Member:

This comment is referring to e.g. an int16[pyarrow] dtype where the value is a too-big integer, right? It looks like the existing code path might actually work for regular Python ints, but that seems fragile. I'm thinking the robust way to handle this would be to explicitly handle ArrowEADtype in maybe_promote.
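
For example (illustrative only; what actually happens here varies by version, since the promotion path was never finished in this PR):

import pandas as pd

ser = pd.Series([1, 2], dtype="int16[pyarrow]")

# 40000 does not fit in int16; a maybe_promote that understands the
# arrow dtype would need to promote (e.g. to int32[pyarrow]) rather
# than error out or fall back to object.
ser.loc[2] = 40000
print(ser.dtype)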

github-actions bot (Contributor), Jun 4, 2023:

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

github-actions bot added the Stale label on Jun 4, 2023

mroeschke (Member):

Looks like this PR has gone stale. Closing to clear the queue but feel free to reopen when you have time

mroeschke closed this on Aug 1, 2023

Labels: Arrow (pyarrow functionality), Stale

Linked issue (would have been closed by merging): BUG: Setting a pyarrow scalar value with the same type that expands the series changes the dtype into object (#52235)