-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Thoroughly dedup column names in read_csv #17095
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,7 +25,7 @@ def test_basic(self): | |
mangle_dupe_cols=True) | ||
assert list(df.columns) == expected | ||
|
||
def test_thorough_mangle(self): | ||
def test_thorough_mangle_columns(self): | ||
# see gh-17060 | ||
data = "a,a,a.1\n1,2,3" | ||
df = self.read_csv(StringIO(data), sep=",", mangle_dupe_cols=True) | ||
|
@@ -40,3 +40,25 @@ def test_thorough_mangle(self): | |
df = self.read_csv(StringIO(data), sep=",", mangle_dupe_cols=True) | ||
assert list(df.columns) == ["a", "a.1", "a.3", "a.1.1", | ||
"a.2", "a.2.1", "a.3.1"] | ||
|
||
def test_thorough_mangle_names(self): | ||
# see gh-17095 | ||
data = "a,b,b\n1,2,3" | ||
names = ["a.1", "a.1", "a.1.1"] | ||
df = self.read_csv(StringIO(data), sep=",", names=names, | ||
mangle_dupe_cols=True) | ||
assert list(df.columns) == ["a.1", "a.1.1", "a.1.1.1"] | ||
|
||
data = "a,b,c,d,e,f\n1,2,3,4,5,6" | ||
names = ["a", "a", "a.1", "a.1.1", "a.1.1.1", "a.1.1.1.1"] | ||
df = self.read_csv(StringIO(data), sep=",", names=names, | ||
mangle_dupe_cols=True) | ||
assert list(df.columns) == ["a", "a.1", "a.1.1", "a.1.1.1", | ||
"a.1.1.1.1", "a.1.1.1.1.1"] | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. maybe we should just raise for cases like these. this is clearly a user error in specification of the names. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The thought has crossed my mind. I don't see why not. However, it should probably go through a deprecation cycle (this behavior is dedup-ing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can do that as a follow-up if that works. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think is an ambiguous / unsupported case. I would just raise rather than try to get fancy. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I didn't really change to much to make this change. I still would want to deprecate this behavior (in general) instead of raising immediately on a special case because There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On further investigation, |
||
data = "a,b,c,d,e,f,g\n1,2,3,4,5,6,7" | ||
names = ["a", "a", "a.3", "a.1", "a.2", "a", "a"] | ||
df = self.read_csv(StringIO(data), sep=",", names=names, | ||
mangle_dupe_cols=True) | ||
assert list(df.columns) == ["a", "a.1", "a.3", "a.1.1", | ||
"a.2", "a.2.1", "a.3.1"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also another behaviour that might be sensible: a expected result of
"a.1", "a.1.1.1", "a.1.1"]
(so taking into account all names and not only the ones that already were processed). This ends up with only one column being different than specified, while in the test two columns did change.Not sure this is necessarily better, just pointing out the possibility. But less changed names sounds like a nice argument.
I would also add a test with a different order to be specific about this, for example for
names=['a', 'a.1', 'a']
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyhow, I agree with @jreback that we should actually just raise an error here (possibly after a deprecation cycle), and then this is not really important.
So if we decide to do that, I would not bother discussing / changing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would aim for a deprecation cycle.
_maybe_dedup_names
has been well-ingrained for sometime. In addition, the function is actually used to dedup column names when the data portion of the CSV is empty. Thus, I would like to give some more thought on how best to deprecate dedupingnames
without issuing a warning (or raising) in the empty case.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already have a test for that lower in the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback : Your thoughts? I think we're agreement that this behavior should be disallowed, but given some of the complications I've mentioned here, I would prefer to refrain from disallowing or deprecating until a subsequent PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the issue is that if we have an
ambiguous
dedupe like above then I would just raise. This doesn't work now, right?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't work correctly on
master
, yes. However, I'm following along with what I did in #17060, so in fact, what you're seeing is consistent with my decision in that PR.