ENH: Remove ArrowStringArray and StringDtype("pyarrow") #48469

Closed
1 of 3 tasks
gsheni opened this issue Sep 8, 2022 · 6 comments
Labels
Arrow (pyarrow functionality), Deprecate (Functionality to remove in pandas), Enhancement, Needs Discussion (Requires discussion from core team before further action), Strings (String extension data type and string data)

Comments

@gsheni
Contributor

gsheni commented Sep 8, 2022

Feature Type

  • Adding new functionality to pandas
  • Changing existing functionality in pandas
  • Removing existing functionality in pandas

Problem Description

I wish I could use pandas to create pyarrow-backed Series for strings.
I wish there were a single data type and a single extension array for strings (rather than two).

Currently, we have two pyarrow data types and arrays for strings:

  • StringDtype("pyarrow"), backed by arrays.ArrowStringArray
  • ArrowDtype(pa.string()), backed by arrays.ArrowExtensionArray

I propose we use ArrowDtype(pa.string()) and ArrowExtensionArray.

Feature Description

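import pandas as pd
import pyarrow as pa

# Today: StringDtype("pyarrow"), backed by arrays.ArrowStringArray
series_str_arry = pd.Series(['red', 'blue', None], dtype="string[pyarrow]")

# Today: ArrowDtype(pa.string()), backed by arrays.ArrowExtensionArray
string_ext_arry = pd.ArrowDtype(pa.string())
series_ext_arry = pd.Series(['red', 'blue', None], dtype=string_ext_arry)

# Proposed behavior (these assertions fail today because the dtypes differ)
assert series_str_arry.dtype == series_ext_arry.dtype
assert series_str_arry.dtype.construct_array_type() == series_ext_arry.dtype.construct_array_type()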

Alternative Solutions

  • Keep both data types and arrays

Additional Context

gsheni added the Enhancement and Needs Triage labels on Sep 8, 2022
@mroeschke
Member

mroeschke commented Sep 8, 2022

Noting that there was discussion to not immediately deprecate StringDtype("pyarrow") when ArrowDtype was introduced.

Also ArrowExtensionArray would need to adopt string manipulation methods for them to be equivalent.

Nonetheless, agreed that ArrowDtype should be the primary way to use pyarrow string types in the future.

mroeschke added the Strings, Deprecate, Arrow, and Needs Discussion labels and removed the Needs Triage label on Sep 8, 2022
@jorisvandenbossche
Member

As I mentioned in #47818 (review), I am personally -1 on moving away from StringDtype(storage="arrow").
I think we should rather consider moving ArrowDtype towards the same model as StringDtype (and I am meaning the user-facing API, not necessarily implementation wise).

@jbrockmendel
Member

> As I mentioned in #47818 (review), I am personally -1 on moving away from StringDtype(storage="arrow").
> I think we should rather consider moving ArrowDtype towards the same model as StringDtype (and I am meaning the user-facing API, not necessarily implementation wise).

Is the idea here to introduce storage parameter/option/... for other dtypes so that "int64" could alias to one of, say, "int64[pyarrow]", "int64[masked]", "int64[numpy]"?
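
To make the question concrete, here is a toy sketch of what such aliasing could look like. Everything in it is hypothetical: `resolve_dtype_alias`, `KNOWN_STORAGES`, and the `default_storage` parameter are invented for illustration and are not pandas APIs.

```python
# Hypothetical storage names, taken from the examples in the question above.
KNOWN_STORAGES = ("numpy", "masked", "pyarrow")


def resolve_dtype_alias(alias: str, default_storage: str = "numpy") -> str:
    """Expand a bare alias such as "int64" to "int64[<storage>]".

    An alias that already names a storage, such as "int64[masked]",
    is validated and returned unchanged.
    """
    if alias.endswith("]") and "[" in alias:
        base, storage = alias[:-1].split("[", 1)
    else:
        base, storage = alias, default_storage
    if storage not in KNOWN_STORAGES:
        raise ValueError(f"unknown storage {storage!r} in {alias!r}")
    return f"{base}[{storage}]"


print(resolve_dtype_alias("int64"))
print(resolve_dtype_alias("int64", default_storage="pyarrow"))
print(resolve_dtype_alias("int64[masked]"))
```

Under this reading, the bare alias stays stable for users while a global option (here just a function argument) decides which backend it maps to.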

@jbrockmendel
Member

What if we used ArrowDtype(pa.string()) under the hood but patched StringDtype.__instancecheck__ to return True for ArrowDtype(pa.string())?
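
A rough sketch of the mechanism, using toy classes rather than the real pandas dtypes. Python looks up `__instancecheck__` on the metaclass, so the hook would have to live there; `ToyStringDtype`, `ToyArrowDtype`, and `_StringDtypeMeta` are hypothetical names and nothing here reflects how pandas is actually implemented.

```python
class ToyArrowDtype:
    """Stand-in for ArrowDtype; arrow_type is a plain string here."""

    def __init__(self, arrow_type: str) -> None:
        self.arrow_type = arrow_type


class _StringDtypeMeta(type):
    def __instancecheck__(cls, instance) -> bool:
        # Regular instances (and subclass instances) still pass.
        if super().__instancecheck__(instance):
            return True
        # Additionally accept an Arrow-backed dtype holding a string type.
        return isinstance(instance, ToyArrowDtype) and instance.arrow_type == "string"


class ToyStringDtype(metaclass=_StringDtypeMeta):
    """Stand-in for StringDtype."""

    def __init__(self, storage: str = "python") -> None:
        self.storage = storage


print(isinstance(ToyStringDtype(), ToyStringDtype))        # True
print(isinstance(ToyArrowDtype("string"), ToyStringDtype))  # True
print(isinstance(ToyArrowDtype("int64"), ToyStringDtype))   # False
```

Existing `isinstance(dtype, StringDtype)` checks would keep working under this approach, though `type(dtype) is StringDtype` comparisons would not.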

@thesword53

The real issue is that string[pyarrow] points to StringDtype("pyarrow") instead of ArrowDtype(pa.string()). However, str[pyarrow] points to ArrowDtype(pa.string()), and the two types behave differently:

>>> s = pd.Series(["a,b,c", "c,d"], dtype="str[pyarrow]")
>>> s.str.split(",")
0    ['a' 'b' 'c']
1        ['c' 'd']
dtype: list<item: string>[pyarrow]
>>> s = pd.Series(["a,b,c", "c,d"], dtype="string[pyarrow]")
>>> s.str.split(",")
0    [a, b, c]
1       [c, d]
dtype: object

str[pyarrow] and string[pyarrow] should be the same type, and the current string[pyarrow] (StringDtype("pyarrow")) should be renamed to something like string_pyarrow.
Every dtype ending with [pyarrow] should use ArrowDtype.

@simonjayhawkins
Copy link
Member

I'm going to close this issue since we are not going to "Remove ArrowStringArray and StringDtype("pyarrow")" anytime soon. On the contrary, it will effectively become the default in pandas 3.0, albeit as a variant using NaN semantics.

There may be a few points raised in this discussion that participants feel have not been addressed, but I think they can be opened as separate issues if needed.
