BUG: Interchange object data buffer has the wrong dtype / from_dataframe incorrect #55227

Conversation
Force-pushed from 8d31eb0 to 9876c64
```python
)
data_buff, data_dtype = buffers["data"]

if (data_dtype[1] == 8) and (
```
Out of curiosity, what is the `8` here supposed to represent?
hey - from https://data-apis.org/dataframe-protocol/latest/API.html, it's the number of bits:

```python
@property
@abstractmethod
def dtype(self) -> Dtype:
    """
    Dtype description as a tuple ``(kind, bit-width, format string, endianness)``.

    Bit-width : the number of bits as an integer
    Format string : data type description format string in Apache Arrow C
                    Data Interface format.
    Endianness : currently only native endianness (``=``) is supported
    """
```
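As a minimal illustration of that tuple (using pandas' own interchange implementation; the exact format string can vary by pandas version):

```python
import pandas as pd

# Inspect the dtype tuple the interchange protocol reports for an int64 column.
df = pd.DataFrame({"a": [1, 2, 3]})
col = df.__dataframe__().get_column_by_name("a")
kind, bit_width, fmt, endianness = col.dtype
print(bit_width, endianness)  # 64 =
```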
Are string dtypes supposed to have an 8-bit association? That is kind of confusing for variable-length types; granted, I know very little of how this interchange is supposed to work.
I think the idea is that strings are meant to be UTF-8, and so each character can be represented with 8 bits
Hmm, interesting. Well, keep in mind that UTF-8 doesn't mean a character is 8 bits; a character is still 1-4 bytes.

In arrow-adbc I've seen this assigned the value of 0
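The variable-width point is easy to demonstrate:

```python
# One character can occupy 1-4 bytes in UTF-8, so "8 bits" describes the
# code units in the data buffer, not the characters themselves.
for ch in ("a", "é", "€", "𝄞"):
    print(ch, len(ch.encode("utf-8")))  # 1, 2, 3, 4 bytes respectively
```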
A UTF-8 string array consists of the actual string data plus an offsets array. The buffer holding all the string data (which is typically much longer than the logical length of the array) can be seen as simply a buffer of bytes (a "bytearray"), so in numpy / buffer-interface terms you can view such an array as one with bit width 8.

> In arrow-adbc I've seen this assigned the value of 0

That's something Postgres-specific, I think
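A small sketch of that layout, using plain NumPy arrays to stand in for the two buffers:

```python
import numpy as np

# The data buffer for ["foo", "bar"] is just the concatenated UTF-8 bytes,
# viewable as uint8; the offsets buffer delimits the individual strings.
data = np.frombuffer(b"foobar", dtype=np.uint8)
offsets = np.array([0, 3, 6], dtype=np.int64)
strings = [
    data[offsets[i]:offsets[i + 1]].tobytes().decode("utf-8")
    for i in range(len(offsets) - 1)
]
print(strings)  # ['foo', 'bar']
```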
```python
# temporary workaround to keep backwards compatibility due to
# https://github.com/pandas-dev/pandas/issues/54781
# Consider dtype being `uint` to get number of units passed since the 01.01.1970
data_dtype = (
    DtypeKind.UINT,
```
Shouldn't this be an INT? Timestamps are backed by 64-bit signed integers in Arrow:
https://github.com/apache/arrow/blob/772a01c080ad57eb11e9323f5347472b769d45de/format/Schema.fbs#L264
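NumPy's `datetime64` illustrates the same storage model: the underlying value is a signed 64-bit count of units since the epoch, so pre-1970 timestamps require signed storage.

```python
import numpy as np

# One day before the epoch: the raw storage must be a negative integer.
before_epoch = np.datetime64("1969-12-31T00:00:00", "s")
print(before_epoch.astype("int64"))  # -86400
```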
Yes, AFAIK that should be signed INT.

The tricky thing will be to test this... Given that we don't yet have implementations to test with (pandas itself doesn't yet return the correct dtype), should we add some very specific unit tests explicitly targeting the helper function (where we can pass both kinds of dtypes)? Although I assume that's still difficult, as those helpers take a Column object, and then we would have to mock a Column object just for those tests.
good point, marking as draft til I get back to that
I've tried this out with:

and now

```python
df = pl.DataFrame({'a': ['foo', 'bar']})
print(pd.api.interchange.from_dataframe(df))
```

runs fine (whereas using pandas main and that branch from polars would've raised

```
>       assert protocol_data_dtype[2] in (
            ArrowCTypes.STRING,
            ArrowCTypes.LARGE_STRING,
        )  # format_str == utf-8
E       AssertionError
```

). @stinodego fancy taking a look?
Looks to me like you're making things more complicated than they need to be! The fix should be really simple.
If my comments aren't clear, I can make a small PR to show what I was thinking.
```python
assert protocol_data_dtype[2] in (
    ArrowCTypes.STRING,
    ArrowCTypes.LARGE_STRING,
)  # format_str == utf-8
```
This assertion is valid, but it should be on `col.dtype[2]` rather than `protocol_data_dtype[2]`.
```
@@ -266,21 +266,29 @@ def string_column_to_ndarray(col: Column) -> tuple[np.ndarray, Any]:

     assert buffers["offsets"], "String buffers must contain offsets"
     # Retrieve the data buffer containing the UTF-8 code units
     data_buff, protocol_data_dtype = buffers["data"]
     # We're going to reinterpret the buffer as uint8, so make sure we can do it safely
     assert protocol_data_dtype[1] == 8
```
This assertion can indeed be deleted, as we can assume bit width 8 if the column dtype is STRING or LARGE_STRING.
```python
    ArrowCTypes.UINT8,
    Endianness.NATIVE,
)
data_buff, data_dtype = buffers["data"]
```
We can simply ignore the data dtype here, as we know what it needs to be (we set it later).
```diff
-data_buff, data_dtype = buffers["data"]
+data_buff, _ = buffers["data"]
```
```python
data_dtype = (
    DtypeKind.UINT,
    8,
    ArrowCTypes.UINT8,
    Endianness.NATIVE,
)
```
This does not need to be in an `if`-block. We can simply disregard the data buffer dtype: for string columns, this will ALWAYS be as listed here. This was already the case in the previous code! Only the assertions were wrong.
```python
    DtypeKind.UINT,
    dtype[1],
    getattr(ArrowCTypes, f"UINT{dtype[1]}"),
    Endianness.NATIVE,
```
This code was fine, but needs to be `DtypeKind.INT` as you have found, and use `col.dtype[1]` rather than `dtype[1]`.
thanks for your review 🙏 ! I've tried addressing your comments
```python
    data_dtype,
    (
        DtypeKind.INT,
        col.dtype[1],
```
We unpack `col.dtype` on line 381; it'll be slightly more efficient to get the bit width from there!

Yeah I think this should do it!
Co-authored-by: Stijn de Gooijer <stijn@degooijer.io>
Any chance this can get merged and be part of the next release? 🙏 This is blocking improvements to the protocol across all dataframe libraries.

@jorisvandenbossche fancy taking a look please? I'd suggest backporting this, it has no user-facing impact anyway

Should this block the release?
```python
df = pd.DataFrame({"a": ["foo", "bar"]})
interchange = df.__dataframe__()
column = interchange.get_column_by_name("a")
buffers = column.get_buffers()
```
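For context, a rough sketch of what `get_buffers()` returns for a string column (the exact dtype tuples depend on the pandas version):

```python
import pandas as pd

df = pd.DataFrame({"a": ["foo", "bar"]})
column = df.__dataframe__().get_column_by_name("a")
buffers = column.get_buffers()
data_buffer, data_dtype = buffers["data"]          # UTF-8 code units
offsets_buffer, offsets_dtype = buffers["offsets"]  # string boundaries
print(data_dtype, offsets_dtype)
```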
Not a blocker for this PR but I think these tests would be more impactful if we made the PandasBuffer implement the buffer protocol:
https://docs.cython.org/en/latest/src/userguide/buffer.html
That way we could inspect the bytes for tests
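As an illustration of what byte-level inspection buys (using a plain NumPy array here, since PandasBuffer didn't yet implement the buffer protocol at the time):

```python
import numpy as np

# An object implementing the buffer protocol can be wrapped in a memoryview,
# letting tests inspect the raw bytes directly.
arr = np.frombuffer(b"foobar", dtype=np.uint8)
mv = memoryview(arr)
print(bytes(mv))  # b'foobar'
```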
Started this in #55671
```python
column = interchange.get_column_by_name("a")
buffers = column.get_buffers()
buffers_data = buffers["data"]
buffer_dtype = buffers_data[1]
```
Feel free to ignore my possibly wrong commentary as I'm new to this, but I think the offsets buffer doesn't have the proper bufsize here either:

```
(Pdb) buffers["offsets"]
(PandasBuffer({'bufsize': 24, 'ptr': 94440192356160, 'device': 'CPU'}), (<DtypeKind.INT: 0>, 64, 'l', '='))
```

The standard StringType, which inherits from BinaryType in Arrow, uses a 32-bit offset value, so I think that bufsize should only be 12, unless we are mapping to a LargeString intentionally.
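The bufsize arithmetic behind that observation, sketched with NumPy:

```python
import numpy as np

# Arrow's "string" type uses int32 offsets; "large_string" uses int64.
# For two strings there are three offsets, hence the bufsize difference:
offsets_string = np.zeros(3, dtype=np.int32)        # 3 * 4 = 12 bytes
offsets_large_string = np.zeros(3, dtype=np.int64)  # 3 * 8 = 24 bytes
print(offsets_string.nbytes, offsets_large_string.nbytes)  # 12 24
```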
thanks for looking into this - looks like it comes from pandas/core/interchange/column.py, line 364 (at d2f05c2):

```python
offsets = np.zeros(shape=(len(values) + 1,), dtype=np.int64)
```

where the dtype's being set to int64. OK to discuss/address this separately?
not at all! if it can't be backported to 1.1.2, no issue, definitely not a blocker

merging then, thanks all!
Backport PR #55227 (#55863): BUG: Interchange object data buffer has the wrong dtype / from_dataframe incorrect

Co-authored-by: Marco Edward Gorelli <marcogorelli@protonmail.com>
Co-authored-by: MarcoGorelli <33491632+MarcoGorelli@users.noreply.github.com>
- closes #54781 (BUG: from_dataframe incorrect)
- No entry added to the doc/source/whatsnew/vX.X.X.rst file: I haven't included a whatsnew note as this isn't user-facing.