BUG: fix setitem with enlargement with pyarrow Scalar #52833

Closed
27 changes: 27 additions & 0 deletions pandas/_libs/lib.pyx
@@ -145,6 +145,32 @@ i8max = <int64_t>INT64_MAX
u8max = <uint64_t>UINT64_MAX


cdef bint PYARROW_INSTALLED = False

try:
    import pyarrow as pa

    PYARROW_INSTALLED = True
except ImportError:
    pa = None


cpdef is_pyarrow_array(obj):
    if PYARROW_INSTALLED:
        return isinstance(obj, (pa.Array, pa.ChunkedArray))
    return False


cpdef is_pyarrow_scalar(obj):
    if PYARROW_INSTALLED:
        return isinstance(obj, pa.Scalar)
Member:

For my edification: if we did have pyarrow as a required dependency, is there a more performant version of this check? IIRC you said something about a lack of ABI stability, so that might not be viable?

Member Author:

No, this is exactly the same as what we would do if pyarrow were a required dependency, except for the if PYARROW_INSTALLED: check (but that should be fast, since it's just a boolean).

Only if we added a compile-time dependency on pyarrow could we cimport the class, and that might be faster (but that's not part of the current discussion, and this specific scalar check is certainly not in itself worth a compile-time dependency).

Member:

Only if we added a compile-time dependency on pyarrow

Yeah, that's the one I was wondering about.

    return False


def is_pyarrow_installed():
    return PYARROW_INSTALLED
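For reference, a quick illustration of how these new helpers behave (hedged: this assumes the branch from this PR and that pyarrow is importable; when it is not, both checks simply return False):

import pyarrow as pa

from pandas._libs import lib

lib.is_pyarrow_scalar(pa.scalar(1))        # True: plain isinstance check against pa.Scalar
lib.is_pyarrow_scalar(1)                   # False
lib.is_pyarrow_array(pa.array([1, 2, 3]))  # True: covers pa.Array and pa.ChunkedArray
lib.is_pyarrow_installed()                 # True whenever the guarded import succeeded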


@cython.wraparound(False)
@cython.boundscheck(False)
def memory_usage_of_objects(arr: object[:]) -> int64_t:
@@ -238,6 +264,7 @@ def is_scalar(val: object) -> bool:

    # Note: PyNumber_Check check includes Decimal, Fraction, numbers.Number
    return (PyNumber_Check(val)
            or is_pyarrow_scalar(val)
Member:

are there any sequence-like scalars that could accidentally get caught by the PySequence_Check check?

Member Author (@jorisvandenbossche, Apr 21, 2023):

Ah, yes, that's a good point. For example, a ListScalar has __getitem__, and I think that alone is already enough to pass PySequence_Check (a short demonstration follows after this hunk).

I can move it before PySequence_Check then?

I am wondering if we shouldn't remove __getitem__/__len__ from ListScalar et al. on the pyarrow side. Having scalars that also behave as sequences is just very annoying, and it's only there for a bit of convenience (although we will also have to handle this for Python list objects).
(For example, in shapely 2.0 we made all the LineString and Polygon objects non-sequences, to avoid all the issues of putting those in numpy arrays.)

Member:

I'd be OK with moving it to before the PySequence_Check call. I'm guessing the perf difference is only a few nanoseconds?

            or is_period_object(val)
            or is_interval(val)
            or is_offset_object(val))
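To make the ListScalar point from the discussion above concrete, a short demonstration (hedged: based on pyarrow's scalar behaviour at the time of this PR):

import pyarrow as pa

ls = pa.scalar([1, 2, 3])   # a pyarrow ListScalar
ls[0]                       # works: ListScalar implements __getitem__
len(ls)                     # 3: it also implements __len__
# Because of that, a PySequence_Check-style test can treat it as a sequence,
# which is why the is_pyarrow_scalar(val) check should run before any sequence check.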
6 changes: 6 additions & 0 deletions pandas/core/dtypes/missing.py
@@ -698,6 +698,12 @@ def is_valid_na_for_dtype(obj, dtype: DtypeObj) -> bool:
"""
if not lib.is_scalar(obj) or not isna(obj):
return False
elif lib.is_pyarrow_scalar(obj):
return (
obj.is_null()
and hasattr(dtype, "pyarrow_dtype")
Member:

is this a proxy for isinstance(dtype, ArrowDtype)?

Member:

xref #51378

Member Author:

is this a proxy for isinstance(dtype, ArrowDtype)?

Yes, and I wanted to ask about that: currently this file does not import anything from pandas.core.arrays (only from libs and other core.dtypes modules). I imagine that is on purpose?
It's a bit of a strange situation, with some of our own dtypes defined in core.dtypes.dtypes (so those can be imported) and some others in core.arrays.

Member:

I imagine that is on purpose?

Yes. The idea was that core.dtypes can be mostly before/above the rest of core in the dependency structure (which is reflected in the isort order). If importing ArrowDtype here avoids a code smell, I won't object.

It's a bit of a strange situation, with some of our own dtypes defined in core.dtypes.dtypes (so those can be imported) and some others in core.arrays

Agreed. I've been thinking recently it would be nice to move ArrowDtype and SparseDtype to core.dtypes.dtypes.

            and dtype.pyarrow_dtype == obj.type
        )
    elif dtype.kind == "M":
        if isinstance(dtype, np.dtype):
            # i.e. not tzaware
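As an aside on the hasattr(dtype, "pyarrow_dtype") proxy discussed above, a minimal sketch of what the new branch checks (hedged illustration: it uses the public Scalar.is_valid property rather than the is_null() call from the diff, and assumes pandas' ArrowDtype plus pyarrow are available):

import pandas as pd
import pyarrow as pa

dtype = pd.ArrowDtype(pa.int32())              # exposes a .pyarrow_dtype attribute
na_scalar = pa.scalar(None, type=pa.int32())   # a null Int32Scalar

na_scalar.is_valid                       # False, i.e. the scalar is null
dtype.pyarrow_dtype == na_scalar.type    # True -> a valid NA for this dtype
dtype.pyarrow_dtype == pa.scalar(None, type=pa.int64()).type   # False: type mismatch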
12 changes: 10 additions & 2 deletions pandas/core/indexing.py
@@ -15,6 +15,7 @@

from pandas._config import using_copy_on_write

from pandas._libs import lib
from pandas._libs.indexing import NDFrameIndexerBase
from pandas._libs.lib import item_from_zerodim
from pandas.compat import PYPY
@@ -2098,8 +2099,15 @@ def _setitem_with_indexer_missing(self, indexer, value):
                # We should not cast, if we have object dtype because we can
                # set timedeltas into object series
                curr_dtype = self.obj.dtype
                curr_dtype = getattr(curr_dtype, "numpy_dtype", curr_dtype)
Member:

this is already pretty kludgy...

Member Author:

Yeah, I certainly don't disagree... But this is not really new to this PR; happy to hear suggestions.

This part of the code was added in #47342 to support retaining the dtype in setitem with expansion.
I am wondering, though, if we could simplify this by "just" trying to create an array from the scalar with the dtype of the series first, and, if that fails, falling back on creating an array without specifying a dtype, letting inference and concat/common_dtype do their thing (so without trying to determine the new_dtype up front).

Member Author:

Something like

try:
    new_values = Series([value], dtype=self.obj.dtype)._values
except ...:
    new_values = Series([value])._values

(although I assume there is a reason for all the checks... I can see which tests would fail with the above)

Member:

I expect the try/except approach here would run into the same set of problems that _from_scalars is intended to address. I'm hoping you and i can collaborate on implementing that soonish.

Maybe shorter-term something like:

dummy = ser.iloc[:1].copy()  #  assuming self.obj is a Series, needs an additional getitem otherwise
dummy.iloc[0] = value

Then see if dummy raises/casts to determine if we can keep the dtype? This would be a little bit like #36226 but without a new method.

                new_dtype = maybe_promote(curr_dtype, value)[0]
                if lib.is_pyarrow_scalar(value) and hasattr(
                    curr_dtype, "pyarrow_dtype"
                ):
                    # TODO promote arrow scalar and type
Member:

This comment is referring to e.g. an int16[pyarrow] dtype with a value that is a too-big integer, right? It looks like the existing code path might actually work for regular Python ints, but that looks fragile. I'm thinking the robust way to handle this would be to explicitly handle ArrowDtype in maybe_promote (a rough sketch of that idea follows after this hunk).

                    new_dtype = curr_dtype
                    value = value.as_py()
                else:
                    curr_dtype = getattr(curr_dtype, "numpy_dtype", curr_dtype)
                    new_dtype = maybe_promote(curr_dtype, value)[0]
            else:
                new_dtype = None
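Picking up the maybe_promote suggestion from the comment above, a rough, hypothetical sketch of what "explicitly handle ArrowDtype in maybe_promote" could look like; maybe_promote_arrow is an illustrative name rather than a pandas function, fill_value is assumed to be a plain Python value, and real promotion logic would pick a wider arrow type instead of falling back to object:

import pyarrow as pa

from pandas.api.types import pandas_dtype


def maybe_promote_arrow(dtype, fill_value):
    # dtype is assumed to be an ArrowDtype, e.g. pandas_dtype("int16[pyarrow]")
    pa_type = dtype.pyarrow_dtype
    try:
        # if the scalar fits the current arrow type, keep the dtype as-is
        pa.scalar(fill_value, type=pa_type)
        return dtype, fill_value
    except (OverflowError, pa.ArrowInvalid, pa.ArrowTypeError):
        # e.g. a too-big integer; a real implementation would promote to a
        # wider arrow type (int16 -> int32) rather than dropping to object
        return pandas_dtype("object"), fill_value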

23 changes: 23 additions & 0 deletions pandas/tests/extension/test_arrow.py
@@ -2557,6 +2557,29 @@ def test_describe_numeric_data(pa_type):
    tm.assert_series_equal(result, expected)


@pytest.mark.parametrize(
    "value, target_value, dtype",
    [
        (pa.scalar(4, type="int32"), 4, "int32[pyarrow]"),
        (pa.scalar(4, type="int64"), 4, "int32[pyarrow]"),
        # (pa.scalar(4.5, type="float64"), 4, "int32[pyarrow]"),
        (4, 4, "int32[pyarrow]"),
        (pd.NA, None, "int32[pyarrow]"),
        (None, None, "int32[pyarrow]"),
        (pa.scalar(None, type="int32"), None, "int32[pyarrow]"),
        (pa.scalar(None, type="int64"), None, "int32[pyarrow]"),
    ],
)
def test_series_setitem_with_enlargement(value, target_value, dtype):
    # GH-52235
    # similar to series/indexing/test_setitem.py::test_setitem_keep_precision
    # and test_setitem_enlarge_with_na, but for arrow dtypes
    ser = pd.Series([1, 2, 3], dtype=dtype)
    ser[3] = value
    expected = pd.Series([1, 2, 3, target_value], dtype=dtype)
    tm.assert_series_equal(ser, expected)


@pytest.mark.parametrize(
    "pa_type", tm.DATETIME_PYARROW_DTYPES + tm.TIMEDELTA_PYARROW_DTYPES
)