DOC: User Guide Page on user-defined functions #61195
base: main
Currently writing this, so I would appreciate any feedback on it!
Thanks for the PR! I'm not opposed to a dedicated page on UDFs, but I am opposed to duplicating documentation that exists elsewhere in the user guide, as I think much of this does. Instead of e.g. examples of `apply`, I recommend linking to the appropriate section. This page can then focus on recommendations of when to use `apply` vs other methods.
Why Use User-Defined Functions?
-------------------------------
I think we should lead with `Why _not_ User-Defined Functions`. While performance is called out down below, I think the poor behavior of UDFs should be mentioned as well. Namely that pandas has no information on what a UDF is doing, and so has to infer (guess) at how to handle the result.
In particular, I think it should be mentioned that none of the examples on this page should be UDFs in practice.
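To make the inference point concrete, here is a minimal sketch (the frame and lambdas are made up for illustration, not taken from the PR) in which the same `DataFrame.apply` call produces differently shaped results depending only on what the UDF happens to return:

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# The UDF returns a scalar per column, so pandas builds a Series.
scalar_result = df.apply(lambda col: col.sum())

# The UDF returns a Series per column, so pandas builds a DataFrame.
series_result = df.apply(lambda col: col * 2)

# The UDF returns a list per row, so pandas builds a Series of lists.
row_lists = df.apply(lambda row: [row["a"], row["b"]], axis=1)

print(type(scalar_result).__name__)  # Series
print(type(series_result).__name__)  # DataFrame
print(type(row_lists).__name__)      # Series
```

pandas cannot know the return type ahead of time, so the output shape is only decided after the function has been called.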
Hi @rhshadrach, thanks for the feedback! I agree with you and will push updates soon.
I think this is looking a lot better. Can we also link to https://pandas.pydata.org/pandas-docs/dev/user_guide/enhancingperf.html#numba-jit-compilation at the very bottom in a section titled something like "Improving Performance with UDFs".
While UDFs provide flexibility, they come with significant drawbacks, primarily
related to performance. Unlike vectorized pandas operations, UDFs are slower because pandas lacks
In my opinion the primary drawback is behavior and not performance, but others may disagree. I'd suggest not being opinionated here between the two, but rather saying `primarily related to performance and behavior`.
In any case, can you include that pandas must perform inference on the result, and that inference can be incorrect.
insight into what they are computing, making it difficult to apply efficient handling or optimization
techniques. As a result, pandas resorts to less efficient processing methods that significantly
slow down computations. Additionally, relying on UDFs often sacrifices the benefits
of pandas’ built-in, optimized methods, limiting compatibility and overall performance.
It seems to me these three sentences are all saying the same thing, and that one sentence here would do.
* **Custom Computations Are Needed**: Implementing complex logic or domain-specific calculations that pandas'
  built-in methods cannot handle.
* **Extending pandas' Functionality**: Applying external libraries or specialized algorithms unavailable in pandas.
* **Handling Complex Grouped Operations**: Performing operations on grouped data that standard methods do not support.
For the last line here, would love to see a real-world example of this that couldn't be broken down into supported operations. But I'm okay with this staying regardless.
ways to apply UDFs across different pandas data structures.

.. note::
    Some of these methods are can also be applied to Groupby Objects. Refer to :ref:`groupby`.
Can you also mention `resample`, `rolling`, `expanding`, and `ewm`? Perhaps link to each section in the User Guide.
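As a rough illustration of UDFs on window and resample objects, the sketch below passes one hypothetical helper (`value_range`, not from the PR) to `rolling`, `expanding`, and `resample`. `ewm` is left out of this sketch, and the data is made up:

```python
import pandas as pd

# Hypothetical daily series, used only for illustration.
s = pd.Series(
    [1.0, 2.0, 4.0, 8.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)


def value_range(window):
    """Return max minus min of the values in one window or bin."""
    return window.max() - window.min()


rolled = s.rolling(window=2).apply(value_range)   # NaN, 1.0, 2.0, 4.0
expanded = s.expanding().apply(value_range)       # 0.0, 1.0, 3.0, 7.0
resampled = s.resample("2D").apply(value_range)   # 1.0, 4.0
```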
+==================+======================================+===========================+====================+===========================+==========================================+
| :meth:`apply`    | General-purpose function             | Yes                       | Yes (when axis=1)  | Slow                      | Custom row-wise or column-wise operations|
+------------------+--------------------------------------+---------------------------+--------------------+---------------------------+------------------------------------------+
| :meth:`agg`      | Aggregation                          | Yes                       | No                 | Fast (if using built-ins) | Custom aggregation logic                 |
Fast (if using built-ins)
We should decide if this page is about using UDFs, in which case I think e.g. `.agg("sum")` is not within the scope, or if it's about using methods that take UDFs.
I'd suggest the former, and remove any mention of not using UDFs - and with that the performance column.
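The distinction can be sketched as follows, with made-up data: the first call names a built-in aggregation by string, so no user code runs, while the second passes a UDF that pandas has to call once per group.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"key": ["x", "x", "y"], "val": [1, 2, 3]})

# Not a UDF: the string refers to pandas' own sum implementation.
builtin = df.groupby("key")["val"].agg("sum")

# A UDF: pandas calls this lambda once per group.
udf = df.groupby("key")["val"].agg(lambda s: s.max() - s.min())

print(builtin.to_dict())  # {'x': 3, 'y': 3}
print(udf.to_dict())      # {'x': 1, 'y': 0}
```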
def is_long_name(column_name):
    return len(column_name) > 1


df_filtered = df[[col for col in df.columns if is_long_name(col)]]
This example doesn't actually use `.filter`. Shouldn't it?
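For reference, `DataFrame.filter` selects labels through `items`, `like`, or `regex` and does not accept a callable. A `filter` that does take a UDF is `DataFrameGroupBy.filter`. The sketch below is one possible shape for such an example; the data and the `has_multiple_rows` helper are hypothetical, not from the PR:

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"team": ["a", "a", "b"], "score": [1, 2, 10]})


def has_multiple_rows(group):
    """Keep only groups that contain more than one row."""
    return len(group) > 1


# The UDF is called once per group and must return a single boolean.
filtered = df.groupby("team").filter(has_multiple_rows)

print(filtered["team"].tolist())  # ['a', 'a']
```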
The pipe method is useful for chaining operations together into a clean and readable pipeline.
It is a helpful tool for organizing complex data processing workflows.

When to use: Use pipe when you need to create a pipeline of transformations and want to keep the code readable and maintainable.
Nit: We'll probably want to stay away from `transformations` here to avoid confusion with `.transform`. I'd suggest `operations`.
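A minimal sketch of chaining operations with `pipe`, assuming two hypothetical helper functions (`drop_missing` and `add_total`) and made-up data:

```python
import pandas as pd


def drop_missing(df):
    """Drop rows that contain any missing value."""
    return df.dropna()


def add_total(df, columns):
    """Add a 'total' column holding the row-wise sum of the given columns."""
    return df.assign(total=df[columns].sum(axis=1))


# Hypothetical data, used only for illustration.
raw = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, 6.0]})

# Each pipe call passes the frame as the first argument of the function.
result = raw.pipe(drop_missing).pipe(add_total, columns=["a", "b"])

print(result["total"].tolist())  # [5.0, 9.0]
```

Extra positional and keyword arguments given to `pipe` are forwarded to the function, which is how `columns` reaches `add_total` here.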
it is slower than vectorized operations and should be used only when you need operations
that cannot be achieved with built-in pandas functions.

When to use: :meth:`DataFrame.apply` is suitable when no alternative vectorized method is available, but consider
I think we should recommend when no other UDF method is suitable as well.
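To illustrate why `apply` is a last resort, the sketch below computes the same made-up column product twice: once with a row-wise UDF and once with the vectorized expression that should be preferred when it exists.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"price": [10.0, 20.0], "qty": [3, 4]})

# Row-wise UDF: pandas builds a Series for every row and calls the lambda.
via_apply = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Vectorized equivalent: a single column-level multiplication.
vectorized = df["price"] * df["qty"]

print(via_apply.tolist())   # [30.0, 80.0]
print(vectorized.tolist())  # [30.0, 80.0]
```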
- Tests added and passed if fixing a bug or adding a new feature
- All code checks passed.
- Added type annotations to new arguments/methods/functions.
- Added an entry in the latest `doc/source/whatsnew/vX.X.X.rst` file if fixing a bug or adding a new feature.