Very slow aggregation performance on object type column with Decimal datatype values #25168
Comments
What kind of performance are you seeing if just using the NumPy array alone? |
Just to clarify, are you grouping on a Decimal column? I am seeing that a Decimal column does impose a performance penalty, but the case below has a runtime of 1.5 seconds for 1,000,000 rows and 4 columns, which is much less than 70 seconds. Can you provide a runnable example that exhibits a 70-second groupby time? I suspect that most of the extra cost from Decimal values comes from allocation costs while assembling a data frame for each group.
|
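A small, self-contained benchmark along the lines the comment asks for. The sizes, group cardinalities, and column names below are made up for illustration; the point is only to contrast the cythonized float path with the object-dtype Decimal fallback on otherwise identical data.

```python
import time
from decimal import Decimal

import numpy as np
import pandas as pd

# Illustrative sizes; shrink or grow to taste.
n = 100_000
rng = np.random.default_rng(0)

df = pd.DataFrame({
    "g1": rng.integers(0, 100, n),
    "g2": rng.integers(0, 100, n),
    "g3": rng.integers(0, 100, n),
    "val": rng.random(n),
})

# Float column: aggregation runs through pandas' fast compiled kernels.
t0 = time.perf_counter()
fast = df.groupby(["g1", "g2", "g3"])["val"].sum()
t_float = time.perf_counter() - t0

# Same values wrapped in Decimal: the column becomes object dtype,
# so the groupby falls back to a much slower Python-level path.
df["val_dec"] = [Decimal(str(v)) for v in df["val"]]
t0 = time.perf_counter()
slow = df.groupby(["g1", "g2", "g3"])["val_dec"].sum()
t_dec = time.perf_counter() - t0

print(f"float sum: {t_float:.3f}s   Decimal sum: {t_dec:.3f}s")
```

On typical hardware the Decimal column is one to two orders of magnitude slower than the float column, though the exact ratio depends on the number of groups.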
you can do a lot better by using DecimalArray here |
Our DecimalArray isn't public, FYI. Perhaps you mean using pyarrow's and fletcher?
|
The performance is very slow if the number of groups is very high.
|
I am just looking at DecimalArray. We get a protobuf object, which we convert to a dict and then to a pandas dataframe. Is there a small example of converting the decimal columns to a decimal array when creating a pandas dataframe from a dictionary? |
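Since pandas' internal DecimalArray is not public, one practical sketch (with made-up record data standing in for the protobuf-derived dict) is to cast the Decimal columns to float64 when building the frame, so that grouping can use the fast compiled kernels; pyarrow's decimal128 type (via fletcher, as suggested above) is the alternative if exact precision must be kept.

```python
from decimal import Decimal

import pandas as pd

# Hypothetical records, standing in for a protobuf message converted to a dict.
records = {
    "key": ["a", "b", "a"],
    "amount": [Decimal("1.10"), Decimal("2.25"), Decimal("3.00")],
}

df = pd.DataFrame(records)

# As built, the Decimal column has object dtype, which forces the
# slow Python-level aggregation path.
assert df["amount"].dtype == object

# Workaround sketch: cast to float64 before grouping. This trades exact
# decimal precision for the cythonized groupby kernels.
df["amount"] = df["amount"].astype("float64")
out = df.groupby("key")["amount"].sum()
print(out)
```

Whether the loss of exact precision is acceptable depends on the use case; for currency-style data that must round-trip exactly, an Arrow decimal type is the safer choice.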
Problem description
I have a dataframe with a million rows and 10 columns. I want to group by 3 columns and sum 1 column. If the column to be aggregated is float, it takes less than a second to get the result. But if the column is object dtype holding Decimal values, it takes 70 seconds to return the result. I know it is not a numpy-datatype column and will take longer, but does 70 seconds to aggregate a million rows seem reasonable? Is there any way to get results faster with Decimal?