BUG: pandas.cut should optionally allow overlapping IntervalIndex bins #27654
Comments
cc @jschendel.
I don't think it's obvious that one or the other is "more" correct.
The big issue with overlapping bins is the lack of uniqueness. Every other method for using an `IntervalIndex` as bins assumes the bins are unique. My comment about overlapping bins not really making sense comes from the viewpoint (a possibly incorrect one on my part) of uniqueness being a defining characteristic of an `IntervalIndex` used for binning.
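For context, the definition of the bins `ii` used in the transcripts below is not shown in this excerpt; judging from the `Out[]` values, it was presumably constructed along these lines (a reconstruction, not the original code):

```python
import pandas as pd

# Reconstructed from the Out[] values in the examples below; the
# original definition of `ii` is not shown in this thread excerpt.
ii = pd.IntervalIndex.from_tuples([(0, 10), (2, 12), (4, 14)])

# The bins overlap: a value like 5 falls into all three intervals.
assert ii.is_overlapping
```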
Only a small portion of this feature was actually working at the time it was removed. The specific example you gave works fine, but very similar variants of it did not work, in large part due to some of the uniqueness issues discussed above. For example, indexed input fails to have its index reapplied, because the output does not conform to the size of the input:

In [3]: pd.cut([5, 6], bins=ii)
Out[3]:
[(0, 10], (2, 12], (4, 14], (0, 10], (2, 12], (4, 14]]
Categories (3, interval[int64]): [(0, 10] < (2, 12] < (4, 14]]
In [4]: pd.cut(pd.Series([5, 6]), bins=ii)
---------------------------------------------------------------------------
ValueError: Length of passed values is 6, index implies 2

Applying the proper index values here (i.e. ones that align with the input data) in a performant way is non-trivial, and I suspect it would add a non-negligible amount of complexity to `cut`. Using values that fall outside overlapping bins also fails in an unfriendly way, raising a `KeyError` instead of producing `NaN` as non-overlapping bins do:

In [5]: pd.cut([5, 6, 100], bins=ii)
---------------------------------------------------------------------------
KeyError: 100
In [6]: pd.cut([5, 6, 100], bins=pd.interval_range(4, 7))
Out[6]:
[(4, 5], (5, 6], NaN]
Categories (3, interval[int64]): [(4, 5] < (5, 6] < (6, 7]]

There are a few issues at play above, one of which being the inconsistent handling of out-of-bounds values: a `KeyError` with overlapping bins versus `NaN` with non-overlapping bins.
I'm not necessarily against this if it's the path we want to go down, but at the same time I'm not convinced that `cut` is the right place for this behavior.
In the meantime, a combination of `take` and `get_indexer_for` can replicate the removed behavior:

In [3]: ii.take(ii.get_indexer_for([5, 6]))
Out[3]:
IntervalIndex([(0, 10], (2, 12], (4, 14], (0, 10], (2, 12], (4, 14]],
closed='right',
dtype='interval[int64]')

Note that the above results in an `IntervalIndex` rather than the `Categorical` that `cut` returns; `Categorical.from_codes` can produce the latter:

In [4]: pd.Categorical.from_codes(ii.get_indexer_for([5, 6]), ii, ordered=True)
Out[4]:
[(0, 10], (2, 12], (4, 14], (0, 10], (2, 12], (4, 14]]
Categories (3, interval[int64]): [(0, 10] < (2, 12] < (4, 14]]
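The two snippets above can be packaged into a small helper. The name `cut_overlapping` and the function itself are hypothetical, a sketch of the workaround rather than any pandas API:

```python
import pandas as pd

def cut_overlapping(values, bins):
    """Bin `values` against a possibly overlapping IntervalIndex.

    Each value is repeated once per interval containing it, so the
    result's length generally differs from len(values). get_indexer_for
    returns -1 for unmatched values, which Categorical.from_codes turns
    into NaN, mirroring cut's behavior for out-of-bounds values.
    """
    codes = bins.get_indexer_for(values)
    return pd.Categorical.from_codes(codes, categories=bins, ordered=True)

ii = pd.IntervalIndex.from_tuples([(0, 10), (2, 12), (4, 14)])
result = cut_overlapping([5, 6], ii)  # 6 entries: each value hits 3 bins
```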
I also think this is valuable. I'm not sure how complex an option to allow overlapping intervals would be. If it's not too bad, then I think it's reasonable to add it as an option, but with unique bins as the default.
Hello all, I am interested / in need of being able to groupby on overlapping intervals.
It is not thoroughly tested yet (I have only checked it with the example posted below).

import random
import numpy as np
import pandas as pd
from numba import guvectorize
# Function to filter out overlapping intervals from an IntervalIndex.
@guvectorize('void(int64[:], int64[:], int64[:])',
             '(m),(m)->(m)', nopython=True)
def non_overlapping_left_closed(start, end, non_overlapping):
    """
    Flag non-overlapping intervals. It has to be applied in several
    steps until there are no more overlapping intervals.
    Parameters:
    start (np.array[int]):
        Array containing the starts of the intervals (included).
    end (np.array[int]):
        Array containing the ends of the intervals (excluded).
    non_overlapping (np.array[int]):
        Array of 0 and 1. 1 indicates a non-overlapping interval and 0
        an overlapping interval.
    """
    start = start[1:]
    previous_end = end[0]
    non_overlapping[0] = 1
    for idx, ts in np.ndenumerate(start):
        # Switch from tuple to single value.
        idx, = idx
        if ts >= previous_end:
            # Non-overlapping interval.
            non_overlapping[idx+1] = 1
            # Update to current end.
            previous_end = end[idx+1]
        else:
            # Overlapping interval.
            non_overlapping[idx+1] = 0
    return

def groupby_on_overlapping_intervals(start, end, df, agg) -> pd.DataFrame:
    """
    Conduct a groupby considering left-closed intervals defined as
    [start, end).
    Parameters:
    start (np.array[int]):
        Array containing the starts of the intervals (included).
    end (np.array[int]):
        Array containing the ends of the intervals (excluded).
    df (pd.DataFrame):
        DataFrame onto which to conduct the aggregation by groupby.
    agg (dict):
        Aggregation functions (named aggregations).
    Returns:
    aggregated_data (pd.DataFrame):
        Aggregated data based on intervals.
    """
    buffer = []
    intervals = pd.IntervalIndex.from_arrays(start, end, closed='left')
    while True:
        if intervals.is_overlapping:
            # Set apart overlapping intervals for the next iteration.
            non_overlapping = np.zeros(len(intervals), dtype='int64')
            # .asi8 views the datetime endpoints as int64 nanoseconds
            # (avoids the deprecated astype('long')).
            non_overlapping_left_closed(intervals.left.asi8,
                                        intervals.right.asi8,
                                        non_overlapping)
            overlapping = intervals[(non_overlapping == 0).nonzero()[0]]
            intervals = intervals[(non_overlapping == 1).nonzero()[0]]
        else:
            overlapping = pd.IntervalIndex([])
        groups = pd.IntervalIndex(pd.cut(df['timestamp'], intervals)).right
        buffer.append(df.groupby(groups).agg(**agg))
        if overlapping.size == 0:
            # No more intervals in overlap.
            break
        else:
            # Manage intervals in overlap that have been set apart.
            intervals = overlapping
    return buffer[0] if len(buffer) == 1 else pd.concat(buffer).sort_index()
# Test
# Dataset.
ts_raw = pd.DatetimeIndex([pd.Timestamp('2021/01/01 00:30'),
pd.Timestamp('2021/01/01 00:45'),
pd.Timestamp('2021/01/01 01:00'),
pd.Timestamp('2021/01/01 01:55'),
pd.Timestamp('2021/01/01 02:00'),
pd.Timestamp('2021/01/01 03:38'),
pd.Timestamp('2021/01/01 03:55'),
pd.Timestamp('2021/01/01 05:00'),
pd.Timestamp('2021/01/01 05:20'),
pd.Timestamp('2021/01/01 05:21'),
pd.Timestamp('2021/01/01 06:00')])
length = len(ts_raw)
random.seed(1)
value = random.sample(range(1, length+1), length)
df_raw = pd.DataFrame({'value': value, 'timestamp': ts_raw})
df_raw['value'] = df_raw['value'].astype('float64')
# Overlapping intervals.
start = df_raw['timestamp'].iloc[[0, 2, 5, 8]]
end = df_raw['timestamp'].iloc[[3, 6, 8, 10]]
# Aggregation function (named aggregation).
agg = {'high':('value', 'max')}
res = groupby_on_overlapping_intervals(start, end, df_raw, agg)
expected = pd.DataFrame({'high': [10.0, 5.0, 8.0, 11.0]},
index = end)
assert res.equals(expected)

In case someone sees a shorter / better / cleaner / faster way to operate this aggregation on overlapping intervals, I will gladly accept any advice. Best
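Answering the question above: a shorter, though not vectorized, alternative is to aggregate each interval independently with a boolean mask. `groupby_overlapping_simple` is a hypothetical sketch assuming the same named-aggregation dict and left-closed [start, end) windows; it is O(intervals × rows), so only suited to a modest number of intervals:

```python
import pandas as pd

def groupby_overlapping_simple(start, end, df, agg):
    """Aggregate rows falling in each left-closed window [start, end).

    Overlap needs no special handling because each window is masked
    independently; the trade-off is quadratic cost versus the
    numba-based approach above.
    """
    out = {}
    for name, (col, func) in agg.items():
        out[name] = [
            df.loc[(df['timestamp'] >= s) & (df['timestamp'] < e), col].agg(func)
            for s, e in zip(start, end)
        ]
    # Index by window end, mirroring the groupby-based version above.
    return pd.DataFrame(out, index=pd.Index(end, name='timestamp'))

# Small illustration with integer timestamps.
df = pd.DataFrame({'timestamp': [1, 2, 3, 4],
                   'value': [10.0, 20.0, 30.0, 40.0]})
res = groupby_overlapping_simple([1, 2], [3, 5], df,
                                 {'high': ('value', 'max')})
```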
Thanks for the request, but it appears this feature request hasn't gained much traction in years, so closing.
Due to #23980 the following code now raises a ValueError since 0.25:
Code Sample, a copy-pastable example if possible
Problem description
Before #23980 an IntervalIndex with overlapping intervals could be used. It would return every Interval that matches the given data, which is obviously the correct solution.
In #23980 it was stated that this doesn't make sense in the context of cut. Unfortunately I missed the discussion over there (there really was none). I argue that by raising a ValueError we unnecessarily remove a valid feature: I use cut frequently as a more versatile replacement for pd.rolling, for overlapping, non-equal-sized custom windows. If there is a smarter way to do this I am happy to learn about it. Otherwise we should at least give the option to use overlapping indices in cut. Thus I would recommend raising a warning instead of an error here:
pandas/pandas/core/reshape/tile.py, lines 247 to 249 in 0fd888c
Expected Output
Raise a warning maybe (I am still not sure if this is necessary) and return: