Refactor groupby helper from tempita to fused types #24954

Merged 1 commit on Feb 9, 2019
51 changes: 51 additions & 0 deletions pandas/_libs/groupby.pyx
@@ -2,6 +2,7 @@

import cython
from cython import Py_ssize_t
from cython cimport floating

from libc.stdlib cimport malloc, free

@@ -382,5 +383,55 @@ def group_any_all(uint8_t[:] out,
out[lab] = flag_val


@cython.wraparound(False)
@cython.boundscheck(False)
def _group_add(floating[:, :] out,
Member:

Unrelated to this PR but I see there are some functions at the end of the template file which have been converted to fused types but still remain in the template file. Could make sense to move them here as well (separate PR)

Contributor (author):

I'll take a look at that (for the next PR). Thanks.

Contributor (author):

@WillAyd which functions did you find that are unnecessary? I couldn't find any.

Contributor:

Just call this group_add.

int64_t[:] counts,
floating[:, :] values,
const int64_t[:] labels,
Py_ssize_t min_count=0):
"""
Only aggregates on axis=0
"""
cdef:
Py_ssize_t i, j, N, K, lab, ncounts = len(counts)
floating val, count
ndarray[floating, ndim=2] sumx, nobs

if not len(values) == len(labels):
raise AssertionError("len(index) != len(labels)")

nobs = np.zeros_like(out)
sumx = np.zeros_like(out)

N, K = (<object>values).shape

with nogil:

for i in range(N):
lab = labels[i]
if lab < 0:
continue

counts[lab] += 1
for j in range(K):
val = values[i, j]

# not nan
if val == val:
nobs[lab, j] += 1
sumx[lab, j] += val

for i in range(ncounts):
for j in range(K):
if nobs[i, j] < min_count:
out[i, j] = NAN
else:
out[i, j] = sumx[i, j]


group_add_float32 = _group_add['float']
Member:

Do we still need these? I thought from a previous comment that they weren't required.

Side note: I see you've been force-pushing changes. If you merge master you can push without forcing, which typically does a better job of maintaining comment history.

Contributor (author):

Yes, they are still needed (to avoid changing the code that calls this function), at least until some logic to handle int64_t differently is added to the _group_add function.

I force push to keep only one commit (to avoid bloat). I can add commits and squash before merging; do you think that's better?

group_add_float64 = _group_add['double']

# generated from template
include "groupby_helper.pxi"
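The kernel added above can be summarized in pure Python. The sketch below is a simplified, hypothetical mirror of _group_add (not the pandas implementation itself): labels of -1 are dropped, NaNs are skipped via the val == val test, and cells whose observation count falls below min_count come out as NaN.

```python
import numpy as np

def group_add_sketch(values, labels, ngroups, min_count=0):
    """NaN-skipping grouped sum over axis 0, mirroring _group_add.

    values : (N, K) float array
    labels : (N,) int array; -1 means "no group", the row is skipped
    ngroups : number of output rows
    """
    N, K = values.shape
    sumx = np.zeros((ngroups, K))
    nobs = np.zeros((ngroups, K))
    counts = np.zeros(ngroups, dtype=np.int64)

    for i in range(N):
        lab = labels[i]
        if lab < 0:          # dropped label, same as the Cython loop
            continue
        counts[lab] += 1
        for j in range(K):
            val = values[i, j]
            if val == val:   # NaN != NaN, so this skips missing values
                nobs[lab, j] += 1
                sumx[lab, j] += val

    # cells with fewer than min_count observations become NaN
    out = np.where(nobs >= min_count, sumx, np.nan)
    return out, counts
```

With the default min_count=0, a group with only NaNs sums to 0, matching the behavior of the Cython version.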
49 changes: 1 addition & 48 deletions pandas/_libs/groupby_helper.pxi.in
@@ -9,7 +9,7 @@ cdef extern from "numpy/npy_math.h":
_int64_max = np.iinfo(np.int64).max

# ----------------------------------------------------------------------
# group_add, group_prod, group_var, group_mean, group_ohlc
# group_prod, group_var, group_mean, group_ohlc
# ----------------------------------------------------------------------

{{py:
@@ -27,53 +27,6 @@ def get_dispatch(dtypes):
{{for name, c_type in get_dispatch(dtypes)}}


@cython.wraparound(False)
@cython.boundscheck(False)
def group_add_{{name}}({{c_type}}[:, :] out,
int64_t[:] counts,
{{c_type}}[:, :] values,
const int64_t[:] labels,
Py_ssize_t min_count=0):
"""
Only aggregates on axis=0
"""
cdef:
Py_ssize_t i, j, N, K, lab, ncounts = len(counts)
{{c_type}} val, count
ndarray[{{c_type}}, ndim=2] sumx, nobs

if not len(values) == len(labels):
raise AssertionError("len(index) != len(labels)")

nobs = np.zeros_like(out)
sumx = np.zeros_like(out)

N, K = (<object>values).shape

with nogil:

for i in range(N):
lab = labels[i]
if lab < 0:
continue

counts[lab] += 1
for j in range(K):
val = values[i, j]

# not nan
if val == val:
nobs[lab, j] += 1
sumx[lab, j] += val

for i in range(ncounts):
for j in range(K):
if nobs[i, j] < min_count:
out[i, j] = NAN
else:
out[i, j] = sumx[i, j]


@cython.wraparound(False)
@cython.boundscheck(False)
def group_prod_{{name}}({{c_type}}[:, :] out,
2 changes: 1 addition & 1 deletion pandas/core/groupby/ops.py
@@ -380,7 +380,7 @@ def get_func(fname):
# otherwise find dtype-specific version, falling back to object
for dt in [dtype_str, 'object']:
f = getattr(libgroupby, "{fname}_{dtype_str}".format(
-                fname=fname, dtype_str=dtype_str), None)
+                fname=fname, dtype_str=dt), None)
Member:

So do we even need this object fallback loop?

Contributor (author) @noamher, Jan 27, 2019:

Not for group_add, and everything worked fine until now. Still, it seemed like a bug.

if f is not None:
return f

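The bug fixed in ops.py is that the loop variable dt was never used: dtype_str was formatted into the lookup name on every iteration, so the 'object' fallback was never actually tried. A simplified sketch of the corrected lookup (get_func_sketch is a hypothetical name, and a plain dict stands in for the compiled libgroupby module):

```python
def get_func_sketch(namespace, fname, dtype_str):
    """Find a dtype-specific kernel, falling back to an 'object' version.

    Tries "<fname>_<dtype>" first, then "<fname>_object"; returns None
    if neither exists. Before the fix, the fallback name was built from
    dtype_str instead of the loop variable, so it was never reached.
    """
    for dt in [dtype_str, "object"]:
        f = namespace.get("{fname}_{dt}".format(fname=fname, dt=dt))
        if f is not None:
            return f
    return None
```

For group_add the fallback is not needed once the fused-type version exposes group_add_float32 and group_add_float64 under the expected names, which is why the explicit aliases in groupby.pyx are kept.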