BUG: groupby.agg with UDF changing pyarrow dtypes #59601

Open
wants to merge 45 commits into
base: main

rhshadrach
Member

@rhshadrach rhshadrach commented Aug 25, 2024

Continuation of #58129

Root cause:

  • agg_series always forces the output dtype to match the input dtype, but depending on the lambda, the output dtype can differ.

Fix:

  • Replace all NA with nan.
  • Convert the `results` to the corresponding pyarrow extension array, using pyarrow library methods.
  • pyarrow library methods are used instead of maybe_convert_objects because maybe_convert_objects does not check for NA and forces the dtype to float if NA is present (NA is not a float in pyarrow).

Kei added 30 commits April 1, 2024 19:04
@rhshadrach rhshadrach marked this pull request as draft August 25, 2024 13:02
@rhshadrach rhshadrach added Groupby, Arrow (pyarrow functionality), pyarrow dtype retention (op with pyarrow dtype -> expect pyarrow result), Bug and removed Arrow (pyarrow functionality) labels Aug 25, 2024
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Sep 25, 2024
@rhshadrach rhshadrach changed the title from "Fix/group by agg pyarrow bool numpy same type" to "BUG: groupby.agg with UDF changing pyarrow dtypes" Oct 6, 2024
@rhshadrach rhshadrach marked this pull request as ready for review March 22, 2025 16:15
@rhshadrach rhshadrach removed the Stale label Mar 22, 2025
Comment on lines +1899 to +1905
result = gb.agg(lambda x: {"number": 1})

arr = pa.array([{"number": 1}, {"number": 1}, {"number": 1}])
expected = DataFrame(
    {"B": ArrowExtensionArray(arr)},
    index=Index(["c1", "c2", "c3"], name="A"),
)
Member Author

When the column starts as a PyArrow dtype and the UDF returns dictionaries, it seems questionable to me whether we should return the corresponding PyArrow dtype. The other option is a NumPy array of object dtype. Both seem like reasonable results, but I imagine the PyArrow result is likely to be more convenient for a user who is already using PyArrow dtypes.

@rhshadrach rhshadrach requested a review from mroeschke March 23, 2025 12:06
@rhshadrach rhshadrach added this to the 3.0 milestone Mar 23, 2025
Labels
Bug, Groupby, pyarrow dtype retention (op with pyarrow dtype -> expect pyarrow result)
Development

Successfully merging this pull request may close these issues.

BUG: Groupby-aggregate on a boolean column returns a different datatype with pyarrow than with numpy
2 participants