Skip to content

BUG: pandas.cut should optionally allow overlapping IntervalIndex bins #27654

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
c-thiel opened this issue Jul 30, 2019 · 5 comments
Closed

BUG: pandas.cut should optionally allow overlapping IntervalIndex bins #27654

c-thiel opened this issue Jul 30, 2019 · 5 comments
Labels
cut cut, qcut Enhancement Interval Interval data type Needs Discussion Requires discussion from core team before further action

Comments

@c-thiel
Copy link

c-thiel commented Jul 30, 2019

Due to #23980 the following code now raises a ValueError since 0.25:

Code Sample, a copy-pastable example if possible

ii = pd.IntervalIndex.from_tuples([(0, 10), (2, 12), (4, 14)])
pd.cut([5, 6], bins=ii)

Problem description

Before #23980 an IntervalIndex with overlapping columns could be used. It would return every Interval which is valid for the required data, which is obviously the correct solution.

In #23980 it was stated that this doesn't make sense in the context of cut. Unfortunately I missed the discussion over there (there really was None). I argue that by raising a value error we unnecessarily remove a valid feature: I use cut frequently as kind of a more versatile replacement to pd.rolling for overlapping non-equal sized custom windows.

If there is a smarter way to do this I am happy to learn about it. Otherwise we should at least give the option to use overlapping indices in cut. Thus I would recommend to raise a warning instead of an error here:

elif isinstance(bins, IntervalIndex):
if bins.is_overlapping:
raise ValueError("Overlapping IntervalIndex is not accepted.")

Expected Output

Raise a warning maybe (I am still not sure if this is necessary) and return:

[(0, 10], (2, 12], (4, 14], (0, 10], (2, 12], (4, 14]]
Categories (3, interval[int64]): [(0, 10] < (2, 12] < (4, 14]]
@TomAugspurger
Copy link
Contributor

cc @jschendel.

which is obviously the correct solution.

I don't think it's obvious that one or the other is "more" correct.

@TomAugspurger TomAugspurger added Interval Interval data type API Design labels Jul 30, 2019
@jschendel
Copy link
Member

The big issue with overlapping bins is the lack of uniqueness. Every other method for using cut results in a unique 1-1 mapping from value to bin. Historically, cut was introduced before IntervalIndex, with IntervalIndex support being added on after the fact, so the uniqueness assumption has been baked into the implementation of cut.

My comment about overlapping bins not really making sense comes from the viewpoint (a possibly incorrect one on my part) of uniqueness being a defining characteristic of cut. The advantage of uniqueness being that there is always a straightforward way to realign with (or assign to) the input data. Once you lose uniqueness, you basically start to enter into join territory for alignment purposes.

I argue that by raising a value error we unnecessarily remove a valid feature

Only a small portion of this feature was actually working at the time it was removed. The specific example you gave works fine, but very similar variants of it did not work, in large part due to some of the uniqueness issues discussed above.

For example, indexed input fails to have it's index reapplied due to not conforming with the size of the input:

In [3]: pd.cut([5, 6], bins=ii)
Out[3]: 
[(0, 10], (2, 12], (4, 14], (0, 10], (2, 12], (4, 14]]
Categories (3, interval[int64]): [(0, 10] < (2, 12] < (4, 14]]

In [4]: pd.cut(pd.Series([5, 6]), bins=ii)
---------------------------------------------------------------------------
ValueError: Length of passed values is 6, index implies 2

Applying the proper index values here (i.e. ones that align with the input data) in a performant way is non-trivial, and I suspect it would add a non-negligible amount of complexity to cut.

Using cut with overlapping bins also failed for out of bounds data, but was fine for non-overlapping:

In [5]: pd.cut([5, 6, 100], bins=ii)
---------------------------------------------------------------------------
KeyError: 100

In [6]: pd.cut([5, 6, 100], bins=pd.interval_range(4, 7))
Out[6]: 
[(4, 5], (5, 6], NaN]
Categories (3, interval[int64]): [(4, 5] < (5, 6] < (6, 7]]

There are a few issues at play above, one of which being cut relies on indexing methods that assume unique return values (along with a now fixed IntervalIndex bug). I don't think it'd be terribly hard to get this working properly now.

we should at least give the option to use overlapping indices in cut

I'm not necessarily against this if it's the path we want to go down, but at the same time I'm not convinced that cut is the right place for this functionality. To be clear, I do think this functionality would be useful somewhere, it's moreso balancing if the increase in complexity of adding it to cut in a fully supported way is worth it.

If there is a smarter way to do this I am happy to learn about it.

In the meantime using a combination of take and get_indexer_for should generically work for both overlapping and non-overlapping IntervalIndex in 0.25.0:

In [3]: ii.take(ii.get_indexer_for([5, 6]))
Out[3]: 
IntervalIndex([(0, 10], (2, 12], (4, 14], (0, 10], (2, 12], (4, 14]],
              closed='right',
              dtype='interval[int64]')

Note that the above results in an IntervalIndex instead of a Categorial. If you want a Categorical it should be more memory efficient to construct via Categorical.from_codes:

In [4]: pd.Categorical.from_codes(ii.get_indexer_for([5, 6]), ii, ordered=True)
Out[4]: 
[(0, 10], (2, 12], (4, 14], (0, 10], (2, 12], (4, 14]]
Categories (3, interval[int64]): [(0, 10] < (2, 12] < (4, 14]]

@TomAugspurger
Copy link
Contributor

The big issue with overlapping bins is the lack of uniqueness. Every other method for using cut results in a unique 1-1 mapping from value to bin.

I also think this is valuable.

I'm not sure how complex an option to allow overlapping intervals would be. If it's not too bad, then I think it's reasonable to add it as an option, but for unique to be the default.

@yohplala
Copy link

yohplala commented May 26, 2021

Hello all,

I am interested / in need of being able to groupby on overlapping intervals.
I hope this is something pandas lib will be able to handle in the future.
Meanwhile, to those encountering this limit in their work, I am posting here below the workaround I have coded.
Hope this can also help pandas team in some ways, although I am not sure anything of it can really be re-used.
There are some shortcuts made, specific to my use case:

  • intervals are left closed.
  • I am expecting in the index of the resulting DataFrame the right edge of the bin
  • ...

It is not thoroughly tested yet. (I have only checked with the example I am posting along)

import random
import numpy as np
import pandas as pd
from numba import guvectorize

# Function to filter out overlapping intervals from an IntervalIndex.
@guvectorize('void(int64[:], int64[:], int64[:])',
             '(m),(m)->(m)', nopython=True)
def non_overlapping_left_closed(start, end, non_overlapping):
    """
    Return indices of non-overlapping intervals. It has to be applied
    in several steps until there is no more overlapping intervals.

    Parameters:
        start (np.array[int]):
            Array containing the starts of the intervals (included).
        end (np.array[int]):
            Array containing the ends of the intervals (excluded).
        non_overlapping (np_array[int]):
            Array of 0 and 1. 1 indicate a non-overlapping interval and 0 an
            overlapping interval.

    """

    start = start[1:]
    previous_end = end[0]
    non_overlapping[0] = 1
    for idx, ts in np.ndenumerate(start):
        # Switch from tuple to single value.
        idx, = idx
        if ts >= previous_end:
            # Non-overlapping interval.
            non_overlapping[idx+1] = 1
            # Update to current end.
            previous_end = end[idx+1]
        else:
            # Overlapping interval.
            non_overlapping[idx+1] = 0
    return

def groupby_on_overlapping_intervals(start, end, df, agg) \
                                     -> pd.DataFrame:
    """
    Conduct a groupby considering left closed intervals defined as
    [start, end).

    Parameters:
        start (np.array[int]):
            Array containing the starts of the intervals (included).
        end (np.array[int]):
            Array containing the ends of the intervals (excluded).
        df (pd.DataFrame):
            DataFrame onto which conducting the aggregation by groupby.
        agg (dict):
            Aggregation functions (named aggregations).

    Returns:
        aggregated_data (pd.DataFrame):
            Aggregated data based on intervals.

    """

    buffer = []
    intervals = pd.IntervalIndex.from_arrays(start, end, closed='left')
    while True:
        if intervals.is_overlapping:
            # Set apart overlapping intervals for next iteration.
            non_overlapping = np.zeros(len(intervals), dtype='int64')
            non_overlapping_left_closed(intervals.left.astype('long'),
                                        intervals.right.astype('long'),
                                        non_overlapping)
            overlapping = intervals[(non_overlapping == 0).nonzero()[0]]
            intervals = intervals[(non_overlapping == 1).nonzero()[0]]
        else:
            overlapping = pd.IntervalIndex([])            
        groups = pd.IntervalIndex(pd.cut(df['timestamp'], intervals)).right
        buffer.append(df.groupby(groups).agg(**agg))
        if overlapping.size == 0:
            # No more intervals in overlap.
            break
        else:
            # Manage intervals in overlap that have been set apart
            intervals = overlapping

    return buffer[0] if len(buffer) == 1 else pd.concat(buffer).sort_index()

# Test
# Dataset.
ts_raw = pd.DatetimeIndex([pd.Timestamp('2021/01/01 00:30'),
                           pd.Timestamp('2021/01/01 00:45'),
                           pd.Timestamp('2021/01/01 01:00'),
                           pd.Timestamp('2021/01/01 01:55'),
                           pd.Timestamp('2021/01/01 02:00'),
                           pd.Timestamp('2021/01/01 03:38'),
                           pd.Timestamp('2021/01/01 03:55'),
                           pd.Timestamp('2021/01/01 05:00'),
                           pd.Timestamp('2021/01/01 05:20'),
                           pd.Timestamp('2021/01/01 05:21'),
                           pd.Timestamp('2021/01/01 06:00')])
length = len(ts_raw)
random.seed(1)
value = random.sample(range(1, length+1), length)
df_raw = pd.DataFrame({'value': value, 'timestamp': ts_raw})
df_raw['value'] = df_raw['value'].astype('float64')
# Overlapping intervals.
start = df_raw['timestamp'].iloc[[0, 2, 5, 8]]
end = df_raw['timestamp'].iloc[[3, 6, 8, 10]]
# Aggregation function (named aggregation).
agg = {'high':('value', 'max')}

res = groupby_on_overlapping_intervals(start, end, df_raw, agg)
expected = pd.DataFrame({'high': [10.0, 5.0, 8.0, 11.0]},
                        index = end)
assert res.equals(expected)

In case someone sees a shorter / better / cleaner / faster way to operate this aggregation on overlapping intervals, I will gladly accept any advices.

Bests

@mroeschke mroeschke added Needs Discussion Requires discussion from core team before further action and removed API Design labels Jul 10, 2021
@mroeschke
Copy link
Member

Thanks for the request, but it appears this feature request hasn't gain much traction in years so closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cut cut, qcut Enhancement Interval Interval data type Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

6 participants