BUG: fix setitem with enlargement with pyarrow Scalar #52833
```diff
@@ -145,6 +145,32 @@ i8max = <int64_t>INT64_MAX
 u8max = <uint64_t>UINT64_MAX


+cdef bint PYARROW_INSTALLED = False
+
+try:
+    import pyarrow as pa
+
+    PYARROW_INSTALLED = True
+except ImportError:
+    pa = None
+
+
+cpdef is_pyarrow_array(obj):
+    if PYARROW_INSTALLED:
+        return isinstance(obj, (pa.Array, pa.ChunkedArray))
+    return False
+
+
+cpdef is_pyarrow_scalar(obj):
+    if PYARROW_INSTALLED:
+        return isinstance(obj, pa.Scalar)
+    return False
+
+
+def is_pyarrow_installed():
+    return PYARROW_INSTALLED
+
+
 @cython.wraparound(False)
 @cython.boundscheck(False)
 def memory_usage_of_objects(arr: object[:]) -> int64_t:
```
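For readers without the Cython toolchain, the guarded-import pattern above can be sketched in plain Python. This is a simplified equivalent of the new helpers, not the compiled implementation:

```python
# Pure-Python sketch of the optional-dependency guard added above.
# If pyarrow is missing, every check degrades to a cheap False.
try:
    import pyarrow as pa

    PYARROW_INSTALLED = True
except ImportError:
    pa = None
    PYARROW_INSTALLED = False


def is_pyarrow_scalar(obj):
    # The isinstance check only runs when pyarrow imported successfully,
    # so `pa` is never dereferenced while it is None.
    if PYARROW_INSTALLED:
        return isinstance(obj, pa.Scalar)
    return False


def is_pyarrow_installed():
    return PYARROW_INSTALLED


# A plain int is never a pyarrow scalar, whether or not pyarrow is present.
print(is_pyarrow_scalar(1))  # False
```

Because the module-level flag is set once at import time, each call pays only a boolean test plus (at most) one `isinstance` check.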
```diff
@@ -238,6 +264,7 @@ def is_scalar(val: object) -> bool:

     # Note: PyNumber_Check check includes Decimal, Fraction, numbers.Number
     return (PyNumber_Check(val)
+            or is_pyarrow_scalar(val)
             or is_period_object(val)
             or is_interval(val)
             or is_offset_object(val))
```

Review thread on the `is_pyarrow_scalar(val)` line:

> **Reviewer:** Are there any sequence-like scalars that could accidentally get caught by the `PySequence_Check` check?
>
> **Author:** Ah, yes, that's a good point. For example, a `ListScalar` has `__getitem__`/`__len__`. I can move it before the `PySequence_Check` call. I am wondering if we shouldn't remove `__getitem__`/`__len__` from the `ListScalar` et al. on the pyarrow side. Having scalars that also behave as a sequence is just very annoying, and it's only there for a bit of convenience (although we will also have to handle this for Python list objects).
>
> **Reviewer:** I'd be OK with moving it to before the `PySequence_Check` call. I'm guessing the perf difference is only a few nanoseconds?
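The ordering concern in this thread can be illustrated with a toy example. The `ListLikeScalar` class below is a hypothetical stand-in for pyarrow's `ListScalar`, and the duck-check mimics `PySequence_Check`; neither is the real implementation:

```python
class ListLikeScalar:
    """Hypothetical stand-in for a scalar that also quacks like a sequence."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, i):
        return self._values[i]


def is_scalar_seq_check_first(obj):
    # Running a generic sequence duck-check before the scalar type check:
    # the sequence-like scalar is misclassified as non-scalar.
    if hasattr(obj, "__len__") and hasattr(obj, "__getitem__"):
        return False
    return isinstance(obj, ListLikeScalar)


def is_scalar_type_check_first(obj):
    # Checking the concrete scalar type first gives the intended answer.
    if isinstance(obj, ListLikeScalar):
        return True
    return not (hasattr(obj, "__len__") and hasattr(obj, "__getitem__"))


val = ListLikeScalar([1, 2])
print(is_scalar_seq_check_first(val))   # False: wrongly treated as a sequence
print(is_scalar_type_check_first(val))  # True: correctly treated as a scalar
```

This is why moving the `is_pyarrow_scalar` test ahead of the sequence check matters for types like `ListScalar`.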
```diff
@@ -698,6 +698,12 @@ def is_valid_na_for_dtype(obj, dtype: DtypeObj) -> bool:
     """
     if not lib.is_scalar(obj) or not isna(obj):
         return False
+    elif lib.is_pyarrow_scalar(obj):
+        return (
+            obj.is_null()
+            and hasattr(dtype, "pyarrow_dtype")
+            and dtype.pyarrow_dtype == obj.type
+        )
     elif dtype.kind == "M":
         if isinstance(dtype, np.dtype):
             # i.e. not tzaware
```

Review thread on the `hasattr(dtype, "pyarrow_dtype")` check:

> **Reviewer:** Is this a proxy for `isinstance(dtype, ArrowDtype)`? (xref #51378)
>
> **Author:** Yes, I wanted to ask about that: currently this file does not import anything from `pandas.core.arrays` (only from libs and other `core.dtypes` files). I imagine that is on purpose?
>
> **Reviewer:** Yes. The idea was that `core.dtypes` can be mostly before/above the rest of `core` in the dependency structure (which is reflected in the isort order). If importing `ArrowDtype` here avoids a code smell, I won't object.
>
> **Author:** Agreed. I've been thinking recently it would be nice to move `ArrowDtype` and `SparseDtype` to `core.dtypes.dtypes`.
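The new branch's logic can be exercised in isolation with small stand-in objects. The stub classes below are hypothetical; the real objects are `pa.Scalar` and pandas' `ArrowDtype`:

```python
def is_valid_arrow_na(obj, dtype):
    # Mirrors the branch above: a null pyarrow scalar is a valid NA only
    # for an arrow-backed dtype whose pyarrow type matches the scalar's.
    return (
        obj.is_null()
        and hasattr(dtype, "pyarrow_dtype")
        and dtype.pyarrow_dtype == obj.type
    )


class StubScalar:
    """Hypothetical stand-in for pa.Scalar."""

    def __init__(self, is_null, typ):
        self._is_null = is_null
        self.type = typ

    def is_null(self):
        return self._is_null


class StubArrowDtype:
    """Hypothetical stand-in for pandas' ArrowDtype."""

    def __init__(self, typ):
        self.pyarrow_dtype = typ


print(is_valid_arrow_na(StubScalar(True, "int64"), StubArrowDtype("int64")))   # True
print(is_valid_arrow_na(StubScalar(True, "int64"), StubArrowDtype("string")))  # False
print(is_valid_arrow_na(StubScalar(False, "int64"), StubArrowDtype("int64")))  # False
```

Note that a non-null scalar or a mismatched pyarrow type both fall through to `False`, so a null `int64` scalar is not treated as a valid NA for, say, a string-backed column.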
```diff
@@ -15,6 +15,7 @@
 from pandas._config import using_copy_on_write

+from pandas._libs import lib
 from pandas._libs.indexing import NDFrameIndexerBase
 from pandas._libs.lib import item_from_zerodim
 from pandas.compat import PYPY
```

```diff
@@ -2098,8 +2099,15 @@ def _setitem_with_indexer_missing(self, indexer, value):
                 # We should not cast, if we have object dtype because we can
                 # set timedeltas into object series
                 curr_dtype = self.obj.dtype
-                curr_dtype = getattr(curr_dtype, "numpy_dtype", curr_dtype)
-                new_dtype = maybe_promote(curr_dtype, value)[0]
+                if lib.is_pyarrow_scalar(value) and hasattr(
+                    curr_dtype, "pyarrow_dtype"
+                ):
+                    # TODO promote arrow scalar and type
+                    new_dtype = curr_dtype
+                    value = value.as_py()
+                else:
+                    curr_dtype = getattr(curr_dtype, "numpy_dtype", curr_dtype)
+                    new_dtype = maybe_promote(curr_dtype, value)[0]
             else:
                 new_dtype = None
```

Review thread on the replaced `getattr(curr_dtype, "numpy_dtype", ...)` lines:

> **Reviewer:** This is already pretty kludgy...
>
> **Author:** Yeah, I certainly don't disagree, but this is not really new to this PR; happy to hear suggestions. This part of the code was added in #47342 to support retaining the dtype in setitem with expansion. Something like … (although I assume there is a reason for all the checks; I can see what tests would fail with the above).
>
> **Reviewer:** I expect the try/except approach here would run into the same set of problems that `_from_scalars` is intended to address. I'm hoping you and I can collaborate on implementing that soonish. Maybe shorter-term something like: … Then see if `dummy` raises/casts to determine if we can keep the dtype? This would be a little bit like #36226 but without a new method.

Review thread on the TODO comment:

> **Reviewer:** This comment is referring to e.g. if …
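The dtype decision added in this hunk can be sketched standalone. The helper name and stub classes below are illustrative stand-ins, not pandas internals:

```python
def choose_enlargement_dtype(curr_dtype, value, is_pyarrow_scalar, maybe_promote):
    # If the incoming value is a pyarrow scalar and the current dtype is
    # arrow-backed, keep the dtype and unbox the scalar to a Python object;
    # otherwise fall back to numpy-based promotion, as in the pre-existing path.
    if is_pyarrow_scalar(value) and hasattr(curr_dtype, "pyarrow_dtype"):
        return curr_dtype, value.as_py()
    curr_dtype = getattr(curr_dtype, "numpy_dtype", curr_dtype)
    return maybe_promote(curr_dtype, value)[0], value


class StubArrowScalar:
    """Hypothetical stand-in for pa.Scalar."""

    def __init__(self, py_value, typ):
        self._py_value = py_value
        self.type = typ

    def as_py(self):
        return self._py_value


class StubArrowDtype:
    """Hypothetical stand-in for pandas' ArrowDtype."""

    def __init__(self, typ):
        self.pyarrow_dtype = typ


dtype, value = choose_enlargement_dtype(
    StubArrowDtype("int64"),
    StubArrowScalar(3, "int64"),
    is_pyarrow_scalar=lambda v: isinstance(v, StubArrowScalar),
    maybe_promote=lambda d, v: (object, None),
)
print(value)  # 3, and the arrow-backed dtype is preserved
```

With the real objects, the intent is that enlarging an arrow-backed Series with a `pa.scalar(...)` value preserves the `ArrowDtype` instead of promoting to object dtype.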
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for my edification, if we did have pyarrow as a required dependency is there a more performant version of this check? IIRC you said something about lack of ABI stability so that might not be viable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is exactly the same as what we would do if pyarrow is a required dependency, except for the
if PYARROW_INSTALLED:
check (but that should be fast, since that's a boolean).Only if we would add a compile-time dependency on pyarrow, we could
cimport
the class, and that might be faster then? (but that's not included in the current discussion, and this specific case for checking scalars is certainly in itself not worth a compile time dependency)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yah that's the one i was wondering about.