DOC: User Guide Page on user-defined functions #61195

Open
wants to merge 14 commits into
base: main


@arthurlw
Contributor Author

Currently writing this, so I would appreciate any feedback on it!

Member

@rhshadrach left a comment

Thanks for the PR! I'm not opposed to a dedicated page on UDFs, but I am opposed to duplicating documentation that exists elsewhere in the user guide, as I think much of this does. Instead of e.g. examples of apply, I recommend linking to the appropriate section. This page can then focus on recommendations of when to use apply vs other methods.

Comment on lines 16 to 17
Why Use User-Defined Functions?
-------------------------------
Member

I think we should lead with Why _not_ User-Defined Functions. While performance is called out down below, I think the poor behavior of UDFs should be mentioned as well. Namely that pandas has no information on what a UDF is doing, and so has to infer (guess) at how to handle the result.

In particular, I think it should be mentioned that none of the examples on this page should be UDFs in practice.
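To illustrate the inference point, here is a minimal sketch (the frame and the three helper functions are hypothetical, not taken from the PR). With the default `result_type`, the shape of what `DataFrame.apply` returns depends on what the UDF happens to return, because pandas can only look at the results after the fact:

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})


def as_scalar(row):
    # One scalar per row: apply returns a Series of scalars.
    return row["a"] + row["b"]


def as_list(row):
    # One list per row: apply returns a Series whose values are lists.
    return [row["a"], row["b"]]


def as_series(row):
    # One Series per row: apply expands the results into DataFrame columns.
    return pd.Series({"lo": row.min(), "hi": row.max()})


scalar_result = df.apply(as_scalar, axis=1)
list_result = df.apply(as_list, axis=1)
series_result = df.apply(as_series, axis=1)

print(type(scalar_result).__name__)
print(type(list_result).__name__)
print(type(series_result).__name__)
```

The same call, `df.apply(..., axis=1)`, produces three differently shaped results depending only on the UDF's return value.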

@rhshadrach added the "Apply" (Apply, Aggregate, Transform, Map) and "Docs" labels on Mar 29, 2025
@arthurlw
Contributor Author

Hi @rhshadrach thanks for the feedback! I agree with you and will push updates soon

Member

@rhshadrach left a comment

I think this is looking a lot better. Can we also link to https://pandas.pydata.org/pandas-docs/dev/user_guide/enhancingperf.html#numba-jit-compilation at the very bottom, in a section titled something like "Improving Performance with UDFs"?

Comment on lines +19 to +20
While UDFs provide flexibility, they come with significant drawbacks, primarily
related to performance. Unlike vectorized pandas operations, UDFs are slower because pandas lacks
Member

@rhshadrach Apr 6, 2025

In my opinion the primary drawback is behavior and not performance, but others may disagree. I'd suggest not being opinionated between the two here, and instead saying "primarily related to performance and behavior".

In any case, can you mention that pandas must perform inference on the result, and that this inference can be incorrect?

Comment on lines +21 to +24
insight into what they are computing, making it difficult to apply efficient handling or optimization
techniques. As a result, pandas resorts to less efficient processing methods that significantly
slow down computations. Additionally, relying on UDFs often sacrifices the benefits
of pandas’ built-in, optimized methods, limiting compatibility and overall performance.
Member

It seems to me these three sentences are all saying the same thing, and that one sentence here would do.

* **Custom Computations Are Needed**: Implementing complex logic or domain-specific calculations that pandas'
built-in methods cannot handle.
* **Extending pandas' Functionality**: Applying external libraries or specialized algorithms unavailable in pandas.
* **Handling Complex Grouped Operations**: Performing operations on grouped data that standard methods do not support.
Member

@rhshadrach Apr 6, 2025

For the last line here, would love to see a real-world example of this that couldn't be broken down into supported operations. But I'm okay with this staying regardless.

ways to apply UDFs across different pandas data structures.

.. note::
    Some of these methods can also be applied to GroupBy objects. Refer to :ref:`groupby`.
Member

Can you also mention resample, rolling, expanding, and ewm? Perhaps link to each section in the User Guide.
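For reference, a small sketch of passing the same UDF to some of these objects (the series and the `value_range` helper are hypothetical). It only covers `rolling`, `expanding`, and `resample`, each through its `apply` method, and does not cover `ewm`:

```python
import pandas as pd

# Hypothetical daily series, used only for illustration.
idx = pd.date_range("2025-01-01", periods=6, freq="D")
s = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0, 6.0], index=idx)


def value_range(window):
    # Spread between the largest and smallest value in the window.
    return window.max() - window.min()


rolling_range = s.rolling(window=3).apply(value_range)
expanding_range = s.expanding().apply(value_range)
resampled_range = s.resample("3D").apply(value_range)

print(rolling_range)
print(expanding_range)
print(resampled_range)
```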

+==================+======================================+===========================+====================+===========================+==========================================+
| :meth:`apply` | General-purpose function | Yes | Yes (when axis=1) | Slow | Custom row-wise or column-wise operations|
+------------------+--------------------------------------+---------------------------+--------------------+---------------------------+------------------------------------------+
| :meth:`agg` | Aggregation | Yes | No | Fast (if using built-ins) | Custom aggregation logic |
Member

> Fast (if using built-ins)

We should decide if this page is about using UDFs, in which case I think e.g. .agg("sum") is not within the scope, or if it's about using methods that take UDFs.

I'd suggest the former, and remove any mention of not using UDFs - and with that the performance column.
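The distinction can be sketched as follows (the frame and the `total` helper are hypothetical). Both calls go through `agg`, but only the first one passes a UDF; the second names a built-in aggregation, so pandas knows what is being computed:

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"key": ["x", "x", "y"], "value": [1, 2, 3]})


def total(values):
    # A UDF: pandas only sees an opaque callable.
    return values.sum()


# Using a UDF with a method that accepts one.
udf_result = df.groupby("key")["value"].agg(total)

# Not a UDF: a built-in aggregation referenced by name.
builtin_result = df.groupby("key")["value"].agg("sum")

print(udf_result)
print(builtin_result)
```

The two results are equal here, which is also why this particular UDF would not be needed in practice.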

def is_long_name(column_name):
    return len(column_name) > 1

df_filtered = df[[col for col in df.columns if is_long_name(col)]]
Member

This example doesn't actually use .filter. Shouldn't it?
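One possible direction, as a sketch only: `DataFrame.filter` selects labels through `items`, `like`, or `regex` and does not take a callable, so an example that passes a UDF to `.filter` would most naturally use the GroupBy version. The data and the `has_large_total` helper below are hypothetical:

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"key": ["x", "x", "y", "z"], "value": [1, 2, 1, 10]})


def has_large_total(group):
    # Return a single boolean per group: keep the group or drop it.
    return group["value"].sum() > 2


# Rows from groups where the UDF returns True are kept, in original order.
filtered = df.groupby("key").filter(has_large_total)

print(filtered)
```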

The pipe method is useful for chaining operations together into a clean and readable pipeline.
It is a helpful tool for organizing complex data processing workflows.

When to use: Use pipe when you need to create a pipeline of transformations and want to keep the code readable and maintainable.
Member

Nit: We'll probably want to stay away from "transformations" here to avoid confusion with .transform. I'd suggest "operations".
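For context, a minimal sketch of the kind of chained operations the paragraph describes (the frame and both helper functions are hypothetical). Each step takes a DataFrame and returns a DataFrame, and extra arguments are forwarded by `pipe`:

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})


def add_total(frame):
    # Add a column holding the row-wise sum of "a" and "b".
    return frame.assign(total=frame["a"] + frame["b"])


def keep_above(frame, column, threshold):
    # Keep only the rows where `column` exceeds `threshold`.
    return frame[frame[column] > threshold]


result = df.pipe(add_total).pipe(keep_above, column="total", threshold=11)

print(result)
```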

it is slower than vectorized operations and should be used only when you need operations
that cannot be achieved with built-in pandas functions.

When to use: :meth:`DataFrame.apply` is suitable when no alternative vectorized method is available, but consider
Member

I think we should also recommend it only when no other UDF method is suitable.
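As a small illustration of checking for an alternative first (hypothetical data and helper), a row-wise `apply` and its vectorized equivalent give the same values, so `apply` is not needed in this case:

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"price": [10.0, 20.0], "qty": [3, 4]})


def row_total(row):
    # Called once per row with the row passed in as a Series.
    return row["price"] * row["qty"]


via_apply = df.apply(row_total, axis=1)

# Vectorized equivalent: one operation over whole columns.
vectorized = df["price"] * df["qty"]

print(via_apply.equals(vectorized))
```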

Labels
Apply Apply, Aggregate, Transform, Map Docs
Development

Successfully merging this pull request may close these issues.

DOC: Write user guide page on apply/map/transform methods