BUG: pandas.cut should optionally allow overlapping IntervalIndex bins #27654

c-thiel · 2019-07-30T07:19:40Z

Due to #23980 the following code now raises a ValueError since 0.25:

Code Sample, a copy-pastable example if possible

ii = pd.IntervalIndex.from_tuples([(0, 10), (2, 12), (4, 14)])
pd.cut([5, 6], bins=ii)

Problem description

Before #23980 an IntervalIndex with overlapping columns could be used. It would return every Interval which is valid for the required data, which is obviously the correct solution.

In #23980 it was stated that this doesn't make sense in the context of cut. Unfortunately I missed the discussion over there (there really was None). I argue that by raising a value error we unnecessarily remove a valid feature: I use cut frequently as kind of a more versatile replacement to pd.rolling for overlapping non-equal sized custom windows.

If there is a smarter way to do this I am happy to learn about it. Otherwise we should at least give the option to use overlapping indices in cut. Thus I would recommend to raise a warning instead of an error here:

pandas/pandas/core/reshape/tile.py

Lines 247 to 249 in 0fd888c

    
           elif isinstance(bins, IntervalIndex): 
        
               if bins.is_overlapping: 
        
                   raise ValueError("Overlapping IntervalIndex is not accepted.")

Expected Output

Raise a warning maybe (I am still not sure if this is necessary) and return:

[(0, 10], (2, 12], (4, 14], (0, 10], (2, 12], (4, 14]]
Categories (3, interval[int64]): [(0, 10] < (2, 12] < (4, 14]]

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2019-07-30T11:40:38Z

cc @jschendel.

which is obviously the correct solution.

I don't think it's obvious that one or the other is "more" correct.

jschendel · 2019-07-31T00:28:33Z

The big issue with overlapping bins is the lack of uniqueness. Every other method for using cut results in a unique 1-1 mapping from value to bin. Historically, cut was introduced before IntervalIndex, with IntervalIndex support being added on after the fact, so the uniqueness assumption has been baked into the implementation of cut.

My comment about overlapping bins not really making sense comes from the viewpoint (a possibly incorrect one on my part) of uniqueness being a defining characteristic of cut. The advantage of uniqueness being that there is always a straightforward way to realign with (or assign to) the input data. Once you lose uniqueness, you basically start to enter into join territory for alignment purposes.

I argue that by raising a value error we unnecessarily remove a valid feature

Only a small portion of this feature was actually working at the time it was removed. The specific example you gave works fine, but very similar variants of it did not work, in large part due to some of the uniqueness issues discussed above.

For example, indexed input fails to have it's index reapplied due to not conforming with the size of the input:

In [3]: pd.cut([5, 6], bins=ii)
Out[3]: 
[(0, 10], (2, 12], (4, 14], (0, 10], (2, 12], (4, 14]]
Categories (3, interval[int64]): [(0, 10] < (2, 12] < (4, 14]]

In [4]: pd.cut(pd.Series([5, 6]), bins=ii)
---------------------------------------------------------------------------
ValueError: Length of passed values is 6, index implies 2

Applying the proper index values here (i.e. ones that align with the input data) in a performant way is non-trivial, and I suspect it would add a non-negligible amount of complexity to cut.

Using cut with overlapping bins also failed for out of bounds data, but was fine for non-overlapping:

In [5]: pd.cut([5, 6, 100], bins=ii)
---------------------------------------------------------------------------
KeyError: 100

In [6]: pd.cut([5, 6, 100], bins=pd.interval_range(4, 7))
Out[6]: 
[(4, 5], (5, 6], NaN]
Categories (3, interval[int64]): [(4, 5] < (5, 6] < (6, 7]]

There are a few issues at play above, one of which being cut relies on indexing methods that assume unique return values (along with a now fixed IntervalIndex bug). I don't think it'd be terribly hard to get this working properly now.

we should at least give the option to use overlapping indices in cut

I'm not necessarily against this if it's the path we want to go down, but at the same time I'm not convinced that cut is the right place for this functionality. To be clear, I do think this functionality would be useful somewhere, it's moreso balancing if the increase in complexity of adding it to cut in a fully supported way is worth it.

If there is a smarter way to do this I am happy to learn about it.

In the meantime using a combination of take and get_indexer_for should generically work for both overlapping and non-overlapping IntervalIndex in 0.25.0:

In [3]: ii.take(ii.get_indexer_for([5, 6]))
Out[3]: 
IntervalIndex([(0, 10], (2, 12], (4, 14], (0, 10], (2, 12], (4, 14]],
              closed='right',
              dtype='interval[int64]')

Note that the above results in an IntervalIndex instead of a Categorial. If you want a Categorical it should be more memory efficient to construct via Categorical.from_codes:

In [4]: pd.Categorical.from_codes(ii.get_indexer_for([5, 6]), ii, ordered=True)
Out[4]: 
[(0, 10], (2, 12], (4, 14], (0, 10], (2, 12], (4, 14]]
Categories (3, interval[int64]): [(0, 10] < (2, 12] < (4, 14]]

TomAugspurger · 2019-08-02T18:54:29Z

The big issue with overlapping bins is the lack of uniqueness. Every other method for using cut results in a unique 1-1 mapping from value to bin.

I also think this is valuable.

I'm not sure how complex an option to allow overlapping intervals would be. If it's not too bad, then I think it's reasonable to add it as an option, but for unique to be the default.

yohplala · 2021-05-26T20:58:15Z

Hello all,

I am interested / in need of being able to groupby on overlapping intervals.
I hope this is something pandas lib will be able to handle in the future.
Meanwhile, to those encountering this limit in their work, I am posting here below the workaround I have coded.
Hope this can also help pandas team in some ways, although I am not sure anything of it can really be re-used.
There are some shortcuts made, specific to my use case:

intervals are left closed.
I am expecting in the index of the resulting DataFrame the right edge of the bin
...

It is not thoroughly tested yet. (I have only checked with the example I am posting along)

import random
import numpy as np
import pandas as pd
from numba import guvectorize

# Function to filter out overlapping intervals from an IntervalIndex.
@guvectorize('void(int64[:], int64[:], int64[:])',
             '(m),(m)->(m)', nopython=True)
def non_overlapping_left_closed(start, end, non_overlapping):
    """
    Return indices of non-overlapping intervals. It has to be applied
    in several steps until there is no more overlapping intervals.

    Parameters:
        start (np.array[int]):
            Array containing the starts of the intervals (included).
        end (np.array[int]):
            Array containing the ends of the intervals (excluded).
        non_overlapping (np_array[int]):
            Array of 0 and 1. 1 indicate a non-overlapping interval and 0 an
            overlapping interval.

    """

    start = start[1:]
    previous_end = end[0]
    non_overlapping[0] = 1
    for idx, ts in np.ndenumerate(start):
        # Switch from tuple to single value.
        idx, = idx
        if ts >= previous_end:
            # Non-overlapping interval.
            non_overlapping[idx+1] = 1
            # Update to current end.
            previous_end = end[idx+1]
        else:
            # Overlapping interval.
            non_overlapping[idx+1] = 0
    return

def groupby_on_overlapping_intervals(start, end, df, agg) \
                                     -> pd.DataFrame:
    """
    Conduct a groupby considering left closed intervals defined as
    [start, end).

    Parameters:
        start (np.array[int]):
            Array containing the starts of the intervals (included).
        end (np.array[int]):
            Array containing the ends of the intervals (excluded).
        df (pd.DataFrame):
            DataFrame onto which conducting the aggregation by groupby.
        agg (dict):
            Aggregation functions (named aggregations).

    Returns:
        aggregated_data (pd.DataFrame):
            Aggregated data based on intervals.

    """

    buffer = []
    intervals = pd.IntervalIndex.from_arrays(start, end, closed='left')
    while True:
        if intervals.is_overlapping:
            # Set apart overlapping intervals for next iteration.
            non_overlapping = np.zeros(len(intervals), dtype='int64')
            non_overlapping_left_closed(intervals.left.astype('long'),
                                        intervals.right.astype('long'),
                                        non_overlapping)
            overlapping = intervals[(non_overlapping == 0).nonzero()[0]]
            intervals = intervals[(non_overlapping == 1).nonzero()[0]]
        else:
            overlapping = pd.IntervalIndex([])            
        groups = pd.IntervalIndex(pd.cut(df['timestamp'], intervals)).right
        buffer.append(df.groupby(groups).agg(**agg))
        if overlapping.size == 0:
            # No more intervals in overlap.
            break
        else:
            # Manage intervals in overlap that have been set apart
            intervals = overlapping

    return buffer[0] if len(buffer) == 1 else pd.concat(buffer).sort_index()

# Test
# Dataset.
ts_raw = pd.DatetimeIndex([pd.Timestamp('2021/01/01 00:30'),
                           pd.Timestamp('2021/01/01 00:45'),
                           pd.Timestamp('2021/01/01 01:00'),
                           pd.Timestamp('2021/01/01 01:55'),
                           pd.Timestamp('2021/01/01 02:00'),
                           pd.Timestamp('2021/01/01 03:38'),
                           pd.Timestamp('2021/01/01 03:55'),
                           pd.Timestamp('2021/01/01 05:00'),
                           pd.Timestamp('2021/01/01 05:20'),
                           pd.Timestamp('2021/01/01 05:21'),
                           pd.Timestamp('2021/01/01 06:00')])
length = len(ts_raw)
random.seed(1)
value = random.sample(range(1, length+1), length)
df_raw = pd.DataFrame({'value': value, 'timestamp': ts_raw})
df_raw['value'] = df_raw['value'].astype('float64')
# Overlapping intervals.
start = df_raw['timestamp'].iloc[[0, 2, 5, 8]]
end = df_raw['timestamp'].iloc[[3, 6, 8, 10]]
# Aggregation function (named aggregation).
agg = {'high':('value', 'max')}

res = groupby_on_overlapping_intervals(start, end, df_raw, agg)
expected = pd.DataFrame({'high': [10.0, 5.0, 8.0, 11.0]},
                        index = end)
assert res.equals(expected)

In case someone sees a shorter / better / cleaner / faster way to operate this aggregation on overlapping intervals, I will gladly accept any advices.

Bests

mroeschke · 2024-08-28T17:43:41Z

Thanks for the request, but it appears this feature request hasn't gain much traction in years so closing

TomAugspurger added Interval Interval data type API Design labels Jul 30, 2019

ghost mentioned this issue Aug 2, 2019

API? how to do a "rolling groupby" or groupby with overlapping groups? #26959

Open

niowniow mentioned this issue Dec 10, 2019

Feature Request: Efficient rolling with strides pydata/xarray#3608

Open

mroeschke added the Enhancement label Apr 27, 2020

jbrockmendel added the cut cut, qcut label Sep 21, 2020

yohplala mentioned this issue May 26, 2021

ENH: IntervalIndex as groups in groupby #37949

Open

mroeschke added Needs Discussion Requires discussion from core team before further action and removed API Design labels Jul 10, 2021

mroeschke closed this as completed Aug 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: pandas.cut should optionally allow overlapping IntervalIndex bins #27654

BUG: pandas.cut should optionally allow overlapping IntervalIndex bins #27654

c-thiel commented Jul 30, 2019

TomAugspurger commented Jul 30, 2019

jschendel commented Jul 31, 2019

TomAugspurger commented Aug 2, 2019

yohplala commented May 26, 2021 •

edited

Loading

mroeschke commented Aug 28, 2024

BUG: pandas.cut should optionally allow overlapping IntervalIndex bins #27654

BUG: pandas.cut should optionally allow overlapping IntervalIndex bins #27654

Comments

c-thiel commented Jul 30, 2019

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

TomAugspurger commented Jul 30, 2019

jschendel commented Jul 31, 2019

TomAugspurger commented Aug 2, 2019

yohplala commented May 26, 2021 • edited Loading

mroeschke commented Aug 28, 2024

yohplala commented May 26, 2021 •

edited

Loading