Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise FileNotFoundError in read_json if input looks like file path but file is missing #46718

Merged
merged 9 commits into from
Jun 7, 2022
Merged

Raise FileNotFoundError in read_json if input looks like file path but file is missing #46718

merged 9 commits into from
Jun 7, 2022

Conversation

janosh
Copy link
Contributor

@janosh janosh commented Apr 9, 2022

@janosh janosh changed the title Raise FileNotFoundError in read_json if input is string ending in .json{,.gz,.bz2} but file is missing Raise FileNotFoundError in read_json if input looks like file path but file is missing Apr 9, 2022
@mroeschke mroeschke added this to the 1.5 milestone Apr 9, 2022
@mroeschke mroeschke added Error Reporting Incorrect or improved errors from pandas IO JSON read_json, to_json, json_normalize labels Apr 9, 2022
@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label May 12, 2022
@mroeschke mroeschke removed the Stale label Jun 3, 2022
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Will need approval from @jreback to merge. (long string's overflowing the filesystem is handled by file_exists; there's a comment # gh-5874: if the filepath is too long will raise here)

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, after merging master looks like there's a code path needs updating.

Also @twoertwein, for other file extensions with compression, it doesn't appear we manually raise a FileNotFoundError. Should this automatically be handled by the handler code?

@twoertwein
Copy link
Member

Also @twoertwein, for other file extensions with compression, it doesn't appear we manually raise a FileNotFoundError. Should this automatically be handled by the handler code?

Would maybe be better to raise here when the file doesn't exist:

if "r" not in mode and is_path:

@janosh
Copy link
Contributor Author

janosh commented Jun 6, 2022

Is this a known error in TestiLocBaseIndependent.test_iloc_setitem_multicolumn_to_datetime? Looks unrelated to my changes.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm merge on green

@jreback
Copy link
Contributor

jreback commented Jun 6, 2022

cc @mroeschke

)
and not file_exists(filepath_or_buffer)
):
raise FileNotFoundError(f"File {filepath_or_buffer} does not exist")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can investigate if this logic can be moved according to #46718 (comment) (such that not just json files raise this) that would be great

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the safe assumptions to make here about how the file path ends? Is there sth like _extension_to_compression but for file formats?

# Only for write methods
if "r" not in mode and is_path:
    check_parent_directory(str(handle))

_supported_formats = ["json", "csv", "xls", "xlsx", ...]
_allowed_extensions = tuple(itertools.chain(
    (x,) + tuple(f"{x}{c}" for c in _extension_to_compression)
    for x in _supported_formats
))
if (
    is_path
    and handle.lower().endswith(_allowed_extensions)
    and not file_exists(handle)
):
    raise FileNotFoundError(f"File {handle} does not exist")

Copy link
Member

@twoertwein twoertwein Jun 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After read_json knows that it is a file, I think you can just raise inside get_handle when the file doesn't exist:

if "r" not in mode and is_path and not file_exists(handle):
    raise ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean this, right?

if "r" in mode and is_path and not file_exists(handle):
    raise ...

Copy link
Contributor Author

@janosh janosh Jun 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually read_json doesn't even call get_handle since it's behind a bunch of or cases none of which are true:

if (
    not isinstance(filepath_or_buffer, str)
    or is_url(filepath_or_buffer)
    or is_fsspec_url(filepath_or_buffer)
    or file_exists(filepath_or_buffer)
):
    self.handles = get_handle(...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there were attempts at creating an efficient way of checking whether a str is a legit JSON string. I assume these attempts were unsuccessful - we check the reverse. If that's the case, then moving the check into get_handle, will not work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's the case, then moving the check into get_handle, will not work.

So leave PR as is?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So leave PR as is?

Probably.

@mroeschke mroeschke merged commit 6704590 into pandas-dev:main Jun 7, 2022
@mroeschke
Copy link
Member

Great, thanks @janosh

@janosh janosh mentioned this pull request Jun 7, 2022
5 tasks
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
…h but file is missing (pandas-dev#46718)

* raise FileNotFoundError in _get_data_from_filepath()

* update tests test_read_non_existent + test_read_expands_user_home_dir

* add changelog entry in doc/source/whatsnew/v1.5.0.rst

* use pandas.io.common._compression_to_extension instead of hard-coded extensions

* move changelog entry from IO to other API changes

* fix ImportError from _compression_to_extension -> _extension_to_compression rename

* add test read_json very long file path

* remove extra period in extension checking

Co-authored-by: Matthew Roeschke <emailformattr@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Misleading error messages when opening inexistent json file
4 participants