BUG: DataFrame.to_parquet doesn't round-trip pyarrow StringDtype #42664
Comments
This is the tested behaviour. The storage type is an implementation detail; this allows reading a parquet file without pyarrow. If you change the global setting to pyarrow, you can read any string array into a pyarrow-backed string array.
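A minimal sketch of the global setting referred to above, assuming a recent pandas with pyarrow installed; the file name and the exact dtype reprs are illustrative and may differ by version:

```python
import pandas as pd

# Illustrative file name.
path = "strings.parquet"

df = pd.DataFrame({"col": pd.array(["a", "b", None], dtype="string[pyarrow]")})
df.to_parquet(path)

# With the default string storage, the column comes back python-backed.
print(pd.read_parquet(path)["col"].dtype)

# Switching the global default storage to pyarrow makes read_parquet
# return pyarrow-backed string arrays instead.
with pd.option_context("mode.string_storage", "pyarrow"):
    print(pd.read_parquet(path)["col"].dtype)
```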
Thanks. I think it's worth adding a small note to the docs, both in
Hi there, long-time user, first-time contributor. Is it alright for me to pick up this issue and expand the docs?
Yep, that would be great @jeremyswerdlow!
@TomAugspurger @mzeitlin11
take
Although it is tested, I don't know if we intentionally don't support a roundtrip. It's true that we should be able to read a parquet file with string data using fastparquet, without having pyarrow installed. But when starting from a pandas DataFrame, we store information about the dtype in the metadata of the arrow table / parquet file, and if we want, I think we could make use of this to support a faithful roundtrip out of the box.
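As a rough illustration of the metadata mentioned here, the pandas-specific dtype information can be inspected on the Arrow schema stored in the file. This is only a sketch: it assumes the file was written from a pandas DataFrame with pyarrow installed, and what exactly is recorded per column depends on the pandas/pyarrow versions:

```python
import json
import pyarrow.parquet as pq

# Assumes "strings.parquet" was written from a pandas DataFrame,
# e.g. with DataFrame.to_parquet (see the snippet above).
schema = pq.read_schema("strings.parquet")
pandas_meta = json.loads(schema.metadata[b"pandas"])

# Each column entry records the original pandas dtype; in principle this is
# the information a faithful roundtrip could be built on.
for col in pandas_meta["columns"]:
    print(col["name"], col["pandas_type"], col["numpy_type"])
```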
This would be really nice, as the memory difference can be huge. Got tripped up when trying to load a table stored by pyarrow that would take 16G when using
Hey @TomAugspurger, does this only require updating the docs? Can I take it up?
Need some help with the above PR. One of the CI checks is failing, but I am able to build the docs locally. Can someone help me understand this:
@TomAugspurger can you please review the PR? |
I'm not sure what the latest status of the StringDtype is, but based on #48469 it sounds like there's still some flux. It'd be good to confirm that no changes there will affect the parquet storage type. And based on #42664 (comment), it sounds like my earlier comment wasn't necessarily correct.
I'll probably wait out the discussion on #48469 |
Take |
The more relevant discussion could now be #60639. @jorisvandenbossche has an open PR to address that, and when that is merged we might be able to close this issue. In the OP, the claim that the data is not round-tripped refers to dtype equality, not to a comparison of the values. However, #42664 (comment) suggests that the round-trip could be achieved using metadata, and #42664 (comment) suggests that updating the docs would close this issue. I'll remove the good first issue tag for now. @preet545 what were you planning to do to resolve this issue?
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
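The original copy-pastable sample is not preserved here; a minimal reproduction sketch consistent with the problem description below (the file name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"col": pd.array(["a", "b"], dtype="string[pyarrow]")})
df.to_parquet("example.parquet")

result = pd.read_parquet("example.parquet")
print(df["col"].dtype)      # string[pyarrow]
print(result["col"].dtype)  # string (python storage), not string[pyarrow]
```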
Problem description
read_parquet currently loads all string dtypes as string[python]. We'd ideally match what was written.

Expected Output

A DataFrame with string[pyarrow] rather than string[python].
Output of pd.show_versions()