ENH: Deprecate literal json string input to read_json #52271

Closed
1 of 3 tasks
wence- opened this issue Mar 29, 2023 · 8 comments · Fixed by #53409
Assignees
Labels
API - Consistency Internal Consistency of API/Behavior Deprecate Functionality to remove in pandas IO JSON read_json, to_json, json_normalize

Comments

@wence-
Contributor

wence- commented Mar 29, 2023

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

As seen in #29102 (the rejected #29104), and then #46718, it is in general not possible to determine from pd.read_json(some_string) whether the user meant a file path or literal JSON. #46718 is a halfway house: it explicitly treats some "file extensions" as "you probably wanted to read from a file", but it is easily defeated by, for example, pd.read_json("missing.jsonl", lines=True) (jsonl being a common extension for "lines"-formatted JSON files).
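To make the ambiguity concrete, here is a minimal sketch of an extension-based heuristic in the spirit of #46718. The function name guess_intent and the extension list are hypothetical and chosen only for illustration; this is not pandas' actual logic.

```python
import json

# Hypothetical list of extensions treated as "probably a file path".
KNOWN_EXTENSIONS = (".json", ".json.gz", ".json.bz2")


def guess_intent(text: str) -> str:
    """Guess whether `text` is a file path or literal JSON (illustrative only)."""
    if text.lower().endswith(KNOWN_EXTENSIONS):
        return "path"
    try:
        json.loads(text)
    except ValueError:
        # Neither a known extension nor parseable JSON: intent is unclear.
        return "unknown"
    return "literal"


print(guess_intent("missing.json"))     # path
print(guess_intent('{"a": {"0": 1}}'))  # literal
print(guess_intent("missing.jsonl"))    # unknown
```

Any fixed extension list can be sidestepped the same way, which is the motivation for dropping the guesswork entirely.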

AFAICT, read_json is the only read_XXX function that accepts a literal representation of the data in its path_or_buf argument, so there doesn't seem to be a great deal of precedent here.

Feature Description

Deprecate literal JSON input to pd.read_json. To read from a string, wrap it in a StringIO.

e.g.

import pandas as pd
from io import StringIO
data = '{"a":{"0":1}}'
# old, proposed for deprecation
df = pd.read_json(data)
# new
df = pd.read_json(StringIO(data))

Alternative Solutions

None

Additional Context

No response

@wence- wence- added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 29, 2023
@mroeschke mroeschke added IO JSON read_json, to_json, json_normalize Deprecate Functionality to remove in pandas and removed Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 29, 2023
@mroeschke
Member

Agreed that being stricter here by only accepting a "file like object" is reasonable

@jbrockmendel
Member

xref #5924 can be closed as "no" if/when we restrict this consistently

@wence-
Contributor Author

wence- commented Mar 30, 2023

Thanks for that cross-ref. To broaden the scope, here's an audit of the current state of play:

  • read_clipboard: doesn't take input argument
  • read_csv: pathlike, BinaryIO, StringIO
  • read_excel: pathlike, BinaryIO, engine-specific objects, literal bytes
  • read_feather: pathlike, BinaryIO
  • read_fwf: pathlike, BinaryIO, StringIO
  • read_gbq: SQL query (str)
  • read_hdf: pathlike, pytables.HDFStore
  • read_html: pathlike, BinaryIO, StringIO, literal bytes, literal string
  • read_json: pathlike, BinaryIO, StringIO, literal string
  • read_orc: pathlike, BinaryIO
  • read_parquet: pathlike, BinaryIO
  • read_pickle: pathlike, BinaryIO
  • read_sas: pathlike, BinaryIO
  • read_spss: pathlike (pyreadstat doesn't allow reading from a buffer)
  • read_sql_query: SQL query (str)
  • read_sql_table: name of database table (str)
  • read_stata: pathlike, BinaryIO
  • read_table: pathlike, BinaryIO, StringIO
  • read_xml: pathlike, BinaryIO, StringIO, literal bytes, literal string

So some, but not all, of the textual formats support reading literally, some support reading bytes as well as strings, and some of the binary formats support reading literal bytes.

If deprecating literal input to read_json I would suggest it makes sense to therefore also do so for:

  • read_excel
  • read_html
  • read_xml
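The audit can also be written down as data, which makes the set of affected readers easy to recompute. This is only a sketch: the dictionary transcribes the list above, and the names AUDIT and accepts_literal are made up for this example.

```python
# Accepted input kinds per reader, transcribed from the audit above.
AUDIT = {
    "read_clipboard": set(),
    "read_csv": {"pathlike", "BinaryIO", "StringIO"},
    "read_excel": {"pathlike", "BinaryIO", "engine-specific objects", "literal bytes"},
    "read_feather": {"pathlike", "BinaryIO"},
    "read_fwf": {"pathlike", "BinaryIO", "StringIO"},
    "read_gbq": {"SQL query"},
    "read_hdf": {"pathlike", "pytables.HDFStore"},
    "read_html": {"pathlike", "BinaryIO", "StringIO", "literal bytes", "literal string"},
    "read_json": {"pathlike", "BinaryIO", "StringIO", "literal string"},
    "read_orc": {"pathlike", "BinaryIO"},
    "read_parquet": {"pathlike", "BinaryIO"},
    "read_pickle": {"pathlike", "BinaryIO"},
    "read_sas": {"pathlike", "BinaryIO"},
    "read_spss": {"pathlike"},
    "read_sql_query": {"SQL query"},
    "read_sql_table": {"table name"},
    "read_stata": {"pathlike", "BinaryIO"},
    "read_table": {"pathlike", "BinaryIO", "StringIO"},
    "read_xml": {"pathlike", "BinaryIO", "StringIO", "literal bytes", "literal string"},
}


def accepts_literal(kinds: set) -> bool:
    """True if any accepted input kind is a literal string or literal bytes."""
    return any(kind.startswith("literal") for kind in kinds)


literal_readers = sorted(name for name, kinds in AUDIT.items() if accepts_literal(kinds))
print(literal_readers)  # ['read_excel', 'read_html', 'read_json', 'read_xml']
```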

I see in the past there's been some discussion about introducing utility reads_XXX functions to provide a slightly slicker experience for reading from literals than read_XXX(StringIO(foo)). I can introduce those as well with the deprecation if that is still considered a good idea.
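For reference, a reads_XXX helper of this kind would amount to a thin wrapper. The sketch below is hypothetical: reads is not a pandas function, and json.load from the standard library stands in for a reader such as pd.read_json so that the example runs without pandas installed.

```python
import json
from io import StringIO
from typing import Any, Callable


def reads(reader: Callable[..., Any], text: str, **kwargs: Any) -> Any:
    """Wrap `text` in a StringIO and hand it to a file-based reader."""
    return reader(StringIO(text), **kwargs)


data = '{"a": {"0": 1}}'
result = reads(json.load, data)  # json.load is a stand-in for pd.read_json
print(result)  # {'a': {'0': 1}}
```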

@mroeschke
Member

If deprecating literal input to read_json I would suggest it makes sense to therefore also do so for:

Yeah makes sense to have API consistency among the other read functions too

I can introduce those as well with the deprecation if that is still considered a good idea.

I would opt for not introducing those methods personally

@mroeschke mroeschke added the API - Consistency Internal Consistency of API/Behavior label Mar 30, 2023
@rmhowe425
Contributor

take

@rmhowe425
Contributor

@mroeschke Do you think it would make sense to change the name of the path_or_buf parameter for read_json() to something more appropriate such as obj_or_buf?

I feel like path_or_buf insinuates that you can pass a string-type argument that represents a file path.

@wence-
Contributor Author

wence- commented May 22, 2023

I feel like path_or_buf insinuates that you can pass a string-type argument that represents a file path.

I think path_or_buf should remain. Passing a string to mean "read this file" is common across all of the readers. What should be deprecated is the interpretation of path_or_buf as literal json to be parsed directly.
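A rough sketch of that distinction, using only the standard library: a string that names an existing file is read as a path, a buffer is read directly, and any other string falls back to the literal interpretation with a deprecation warning. The function load_json and its warning text are assumptions made for illustration, not the code pandas uses.

```python
import json
import os
import warnings
from io import StringIO


def load_json(path_or_buf):
    """Hypothetical reader: paths and buffers are fine, literal JSON warns."""
    if isinstance(path_or_buf, str):
        if os.path.exists(path_or_buf):
            with open(path_or_buf, encoding="utf-8") as handle:
                return json.load(handle)
        # Not an existing file: fall back to the deprecated literal interpretation.
        warnings.warn(
            "Passing literal JSON is deprecated; wrap the string in StringIO.",
            FutureWarning,
            stacklevel=2,
        )
        path_or_buf = StringIO(path_or_buf)
    return json.load(path_or_buf)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old = load_json('{"a": {"0": 1}}')            # deprecated spelling, warns
    new = load_json(StringIO('{"a": {"0": 1}}'))  # preferred spelling, silent

print(old == new, len(caught))  # True 1
```

Once the deprecation period ends, the fallback branch would be removed and a string would only ever be treated as a path.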

@wence-
Contributor Author

wence- commented Jun 21, 2023

Thanks @rmhowe425! I opened #53767 to track deprecation of literal input for the remaining IO routines that support it.
