Read csv category fix #18402
Conversation
Force-pushed e3bd809 to 009311a
doc/source/whatsnew/v0.21.1.txt (Outdated)

@@ -64,6 +64,7 @@ Bug Fixes
- Bug in ``pd.concat`` when empty and non-empty DataFrames or Series are concatenated (:issue:`18178` :issue:`18187`)
- Bug in :class:`IntervalIndex` constructor when a list of intervals is passed with non-default ``closed`` (:issue:`18334`)
- Bug in :meth:`IntervalIndex.copy` when copying and ``IntervalIndex`` with non-default ``closed`` (:issue:`18339`)
- Bug in ``pd.read_csv`` when reading numeric category fields with high cardinality (:issue `18186`)
``pd.read_csv`` -> :func:`read_csv`. Also, :issue is missing the closing `:`, and there should be no space between :issue: and the number.
pandas/_libs/parsers.pyx (Outdated)

    dtypes = set(a.dtype for a in arrs)
    if len(dtypes) > 1:
        common_type = np.find_common_type(dtypes, [])
    dtypes = set([a.dtype for a in arrs])
Can you rewrite this as a set comprehension, i.e. `{a.dtype for a in arrs}`? That style seems to be preferred based on #18383. I think this line is what caused the failure on Travis.
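The suggested style change can be sketched as follows. This is a minimal stand-in, not the actual parser code: `arrs` is a hypothetical list of column arrays, and `np.result_type` is used here in place of the `np.find_common_type` call from the diff, since the two promote dtypes the same way for this case.

```python
import numpy as np

# Hypothetical stand-in for the per-chunk column arrays in the parser.
arrs = [np.array([1, 2], dtype=np.int64), np.array([1.5], dtype=np.float64)]

# Set comprehension (the preferred style) instead of set([a.dtype for a in arrs]).
dtypes = {a.dtype for a in arrs}

# Promote to a common dtype only when the chunks disagree.
common_type = np.result_type(*dtypes) if len(dtypes) > 1 else next(iter(dtypes))
```

Both forms build the same set; the comprehension just avoids constructing an intermediate list.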
pandas/tests/io/parser/dtypes.py (Outdated)

    def test_categorical_dtype_high_cardinality_numeric(self):
        # GH 18186
        data = sorted([str(i) for i in range(10**6)])
        expected = pd.DataFrame({'a': Categorical(data, ordered=True)})
`DataFrame` has already been imported, so you can remove the `pd.` prefix.
pandas/tests/io/parser/dtypes.py
Outdated
actual = self.read_csv(StringIO('a\n' + '\n'.join(data)), | ||
dtype='category') | ||
actual.a.cat.reorder_categories(sorted(actual.a.cat.categories), | ||
ordered=True, inplace=True) |
Can you do this by assignment, i.e. `actual['a'] = ...`, instead of `inplace`? The convention is generally to avoid using `inplace` in tests.
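The assignment form being requested would look roughly like this (a small self-contained sketch with made-up data, not the test itself):

```python
import pandas as pd

df = pd.DataFrame({'a': pd.Categorical(['b', 'a', 'c'])})

# Assignment form, instead of
# df.a.cat.reorder_categories(..., ordered=True, inplace=True):
df['a'] = df['a'].cat.reorder_categories(sorted(df['a'].cat.categories),
                                         ordered=True)
```

The result is identical; the column is simply rebound to the Series returned by `reorder_categories`.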
I will make this change, but I used inplace hoping it would be faster and more memory efficient. Can you explain why it is preferred not to use inplace?
@sam-cohan `inplace=True` is not more efficient in any way. Furthermore, it is much harder to read; we do not use it in the codebase except in support of specific user-facing APIs.
Sorting should use `np.sort` here.
Done; switched to `np.sort`.

I see your point about `inplace=True` in an ideal world, but FYI, in my workflows I typically use it to avoid temporarily needing double the storage for sorting (since I am often playing close to the limits of available physical memory). It would have been nice if assignment were smart enough to be memory efficient.
To expand on `inplace=True` a bit: per my understanding, for most operations `inplace=True` essentially does the same thing as assignment. It will still operate on a copy, but with the top-level reference being reassigned, so there usually isn't actually a gain in terms of memory efficiency, etc.
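The equivalence described above can be demonstrated with a small sketch (using `sort_values` as an arbitrary example operation, not anything from this PR):

```python
import pandas as pd

df1 = pd.DataFrame({'x': [3, 1, 2]})
df2 = df1.copy()

# For most operations, inplace=True computes the result on a copy and then
# rebinds it under the hood, so it is not an in-place mutation of the buffers.
df1.sort_values('x', inplace=True)  # same net effect as...
df2 = df2.sort_values('x')          # ...plain assignment
```

Both frames end up identical, which is why the review treats the two spellings as interchangeable apart from readability.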
Thanks for the note... that is unfortunate. Perhaps that is worthy of discussion in another venue.
A few comments, but looks good. Thanks!
Codecov Report
@@ Coverage Diff @@
## master #18402 +/- ##
==========================================
- Coverage 91.36% 91.34% -0.02%
==========================================
Files 164 164
Lines 49730 49730
==========================================
- Hits 45435 45426 -9
- Misses 4295 4304 +9
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #18402 +/- ##
==========================================
- Coverage 91.35% 91.33% -0.02%
==========================================
Files 163 163
Lines 49714 49714
==========================================
- Hits 45415 45406 -9
- Misses 4299 4308 +9
Continue to review full report at Codecov.
@jschendel PTAL. I made the changes you requested but I think there is still something off...
@@ -114,6 +114,17 @@ def test_categorical_dtype(self):
        actual = self.read_csv(StringIO(data), dtype='category')
        tm.assert_frame_equal(actual, expected)

    @pytest.mark.slow
how slow is this test?
On my machine it is about 4.5 seconds for the high memory parser, and 6.5 seconds for low memory and python parsers.
The minimal `range` necessary to reproduce the error is `range(524289)`, at least locally for me. Might be beneficial to lower the `range` in the test to offset some of the slowness? If so, not sure if we should push it down to the limit, or just go down to something like 600k to leave a little buffer room.
Checked, and the limit is the same on both Ubuntu and OSX, so I decided to use exactly the limit. This cut the times down to 2.25 seconds and 3 seconds, so I removed the slow mark.
Added the slow mark back, rebased, and waiting for tests.
Force-pushed b831699 to c17819b
@jreback PTAL.
doc/source/whatsnew/v0.21.1.txt (Outdated)

- Bug in ``pd.concat`` when empty and non-empty DataFrames or Series are concatenated (:issue:`18178` :issue:`18187`)
- Bug in :class:`IntervalIndex` constructor when a list of intervals is passed with non-default ``closed`` (:issue:`18334`)
- Bug in :meth:`IntervalIndex.copy` when copying and ``IntervalIndex`` with non-default ``closed`` (:issue:`18339`)
- Bug in :func:``read_csv`` when reading numeric category fields with high cardinality (:issue:`18186`)
Looks like all entries above yours got duplicated after rebasing #18408; you can see repeat entries under different sections below. So, I think the following should be done:
- Delete all entries above yours (the repeat entries should still be there)
- Move your entry under either the I/O or Categorical section (not sure which is more appropriate, so should double check on this)
- When using references like :func: you only need single backticks, so :func:``read_csv`` -> :func:`read_csv`
Done. I put it under I/O because the direct call to `CategoricalDtype` for the `expected` was not broken in the original code.
Force-pushed c17819b to 019116c
@jreback requested changes are completed. PTAL.
pandas/tests/io/parser/dtypes.py (Outdated)

@@ -114,6 +114,16 @@ def test_categorical_dtype(self):
        actual = self.read_csv(StringIO(data), dtype='category')
        tm.assert_frame_equal(actual, expected)

    def test_categorical_dtype_high_cardinality_numeric(self):
@sam-cohan can you restore the slow marker? Otherwise lgtm. Ping on green.
Force-pushed 019116c to b14746c
pls rebase
Force-pushed b14746c to ff1945d
thanks!
(cherry picked from commit d421a09)
git diff upstream/master -u -- "*.py" | flake8 --diff