Read csv category fix #18402
Conversation
Force-pushed e3bd809 to 009311a
doc/source/whatsnew/v0.21.1.txt (Outdated)

@@ -64,6 +64,7 @@ Bug Fixes
- Bug in ``pd.concat`` when empty and non-empty DataFrames or Series are concatenated (:issue:`18178` :issue:`18187`)
- Bug in :class:`IntervalIndex` constructor when a list of intervals is passed with non-default ``closed`` (:issue:`18334`)
- Bug in :meth:`IntervalIndex.copy` when copying and ``IntervalIndex`` with non-default ``closed`` (:issue:`18339`)
- Bug in ``pd.read_csv`` when reading numeric category fields with high cardinality (:issue `18186`)
``pd.read_csv`` -> :func:`read_csv`. Also, :issue is missing the closing `:`, and there should be no space between :issue: and the number.
pandas/_libs/parsers.pyx (Outdated)

    dtypes = set(a.dtype for a in arrs)
    if len(dtypes) > 1:
        common_type = np.find_common_type(dtypes, [])
    dtypes = set([a.dtype for a in arrs])
Can you rewrite this as a set comprehension, i.e. `{a.dtype for a in arrs}`? That style seems to be preferred based on #18383. I think this line is what caused the failure on Travis.
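The suggested style change can be sketched as follows. This is a minimal stand-in, not the actual parser code: `arrs` is a hypothetical list of column arrays, and `np.result_type` is used here in place of the `np.find_common_type` call from the diff, since the two promote dtypes the same way for this case.

```python
import numpy as np

# Hypothetical stand-in for the per-chunk column arrays in the parser.
arrs = [np.array([1, 2], dtype=np.int64), np.array([1.5], dtype=np.float64)]

# Set comprehension (the preferred style) instead of set([a.dtype for a in arrs]).
dtypes = {a.dtype for a in arrs}

# Promote to a common dtype only when the chunks disagree.
common_type = np.result_type(*dtypes) if len(dtypes) > 1 else next(iter(dtypes))
```

Both forms build the same set; the comprehension just avoids constructing an intermediate list.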
pandas/tests/io/parser/dtypes.py (Outdated)

    def test_categorical_dtype_high_cardinality_numeric(self):
        # GH 18186
        data = sorted([str(i) for i in range(10**6)])
        expected = pd.DataFrame({'a': Categorical(data, ordered=True)})
`DataFrame` has already been imported, so you can remove the `pd.` prefix.
pandas/tests/io/parser/dtypes.py
Outdated
actual = self.read_csv(StringIO('a\n' + '\n'.join(data)), | ||
dtype='category') | ||
actual.a.cat.reorder_categories(sorted(actual.a.cat.categories), | ||
ordered=True, inplace=True) |
Can you do this by assignment, i.e. `actual['a'] = ...`, instead of `inplace`? The convention is generally to avoid using `inplace` in tests.
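The assignment form being requested would look roughly like this (a small self-contained sketch with made-up data, not the test itself):

```python
import pandas as pd

df = pd.DataFrame({'a': pd.Categorical(['b', 'a', 'c'])})

# Assignment form, instead of
# df.a.cat.reorder_categories(..., ordered=True, inplace=True):
df['a'] = df['a'].cat.reorder_categories(sorted(df['a'].cat.categories),
                                         ordered=True)
```

The result is identical; the column is simply rebound to the Series returned by `reorder_categories`.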
I will make this change, but I used inplace hoping it would be faster and more memory efficient. Can you explain why it is preferred not to use inplace?
@sam-cohan `inplace=True` is not more efficient in any way. Furthermore, it is much harder to read; we do not use it in the codebase except in support of specific user-facing APIs.
Sorting should use `np.sort` here.
Done; switched to `np.sort`.

I see your point about `inplace=True` in an ideal world, but FYI, in my workflows I typically use it to avoid temporarily needing double the storage for sorting (since I am often playing close to the limits of available physical memory). It would have been nice if assignment were smart enough to be memory efficient.
To expand on `inplace=True` a bit: per my understanding, for most operations `inplace=True` essentially does the same thing as assignment. It will still operate on a copy, but with the top-level reference being reassigned, so there usually isn't actually a gain in terms of memory efficiency, etc.
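The equivalence described above can be demonstrated with a small sketch (using `sort_values` as an arbitrary example operation, not anything from this PR):

```python
import pandas as pd

df1 = pd.DataFrame({'x': [3, 1, 2]})
df2 = df1.copy()

# For most operations, inplace=True computes the result on a copy and then
# rebinds it under the hood, so it is not an in-place mutation of the buffers.
df1.sort_values('x', inplace=True)  # same net effect as...
df2 = df2.sort_values('x')          # ...plain assignment
```

Both frames end up identical, which is why the review treats the two spellings as interchangeable apart from readability.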
Thanks for the note... that is unfortunate. Perhaps that is worthy of discussion in another venue.
A few comments, but looks good. Thanks!
Codecov Report
@@ Coverage Diff @@
## master #18402 +/- ##
==========================================
- Coverage 91.36% 91.34% -0.02%
==========================================
Files 164 164
Lines 49730 49730
==========================================
- Hits 45435 45426 -9
- Misses 4295 4304 +9
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #18402 +/- ##
==========================================
- Coverage 91.35% 91.33% -0.02%
==========================================
Files 163 163
Lines 49714 49714
==========================================
- Hits 45415 45406 -9
- Misses 4299 4308 +9
Continue to review full report at Codecov.
@jschendel PTAL. I made the changes you requested but I think there is still something off...
@@ -114,6 +114,17 @@ def test_categorical_dtype(self):
        actual = self.read_csv(StringIO(data), dtype='category')
        tm.assert_frame_equal(actual, expected)

    @pytest.mark.slow
how slow is this test?
On my machine it is about 4.5 seconds for the high memory parser, and 6.5 seconds for low memory and python parsers.
The minimal `range` necessary to reproduce the error is `range(524289)`, at least locally for me. Might be beneficial to lower the `range` in the test to offset some of the slowness? If so, not sure if we should push it down to the limit, or just go down to something like 600k to leave a little buffer room.
Checked, and the limit is the same on both Ubuntu and OSX, so I decided to use exactly the limit. This cut the times down to 2.25 seconds and 3 seconds, so I removed the slow mark.
Added the slow mark back, rebased, and waiting for tests.
Force-pushed b831699 to c17819b
@jreback PTAL.
doc/source/whatsnew/v0.21.1.txt (Outdated)

- Bug in ``pd.concat`` when empty and non-empty DataFrames or Series are concatenated (:issue:`18178` :issue:`18187`)
- Bug in :class:`IntervalIndex` constructor when a list of intervals is passed with non-default ``closed`` (:issue:`18334`)
- Bug in :meth:`IntervalIndex.copy` when copying and ``IntervalIndex`` with non-default ``closed`` (:issue:`18339`)
- Bug in :func:``read_csv`` when reading numeric category fields with high cardinality (:issue:`18186`)
Looks like all entries above yours got duplicated after rebasing #18408; you can see repeat entries under different sections below. So, I think the following should be done:
- Delete all entries above yours (the repeat entries should still be there)
- Move your entry under either the I/O or Categorical section (not sure which is more appropriate, so should double check on this)
- When using references like :func: you only need single backticks, so :func:``read_csv`` -> :func:`read_csv`
Done. I put it under I/O because the direct call to `CategoricalDtype` for the `expected` was not broken in the original code.
Force-pushed c17819b to 019116c
@jreback requested changes are completed. PTAL.
pandas/tests/io/parser/dtypes.py (Outdated)

@@ -114,6 +114,16 @@ def test_categorical_dtype(self):
        actual = self.read_csv(StringIO(data), dtype='category')
        tm.assert_frame_equal(actual, expected)

    def test_categorical_dtype_high_cardinality_numeric(self):
@sam-cohan can you restore the slow marker? Otherwise lgtm. Ping on green.
Force-pushed 019116c to b14746c
pls rebase
Force-pushed b14746c to ff1945d
thanks!
(cherry picked from commit d421a09)
git diff upstream/master -u -- "*.py" | flake8 --diff