DOC: User Guide Page on user-defined functions #61195

arthurlw · 2025-03-28T19:15:48Z

closes DOC: Write user guide page on apply/map/transform methods #61126
~~Tests added and passed if fixing a bug or adding a new feature~~
~~All code checks passed.~~
~~Added type annotations to new arguments/methods/functions.~~
~~Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.~~

arthurlw · 2025-03-28T19:34:49Z

Currently writing this, so I would appreciate any feedback on it!

rhshadrach

Thanks for the PR! I'm not opposed to a dedicated page on UDFs, but I am opposed to duplicating documentation that exists elsewhere in the user guide, as I think much of this does. Instead of e.g. examples of apply, I recommend linking to the appropriate section. This page can then focus on recommendations of when to use apply vs other methods.

rhshadrach · 2025-03-29T12:54:27Z

doc/source/user_guide/user_defined_functions.rst

+Why Use User-Defined Functions?
+-------------------------------


I think we should lead with Why _not_ User-Defined Functions. While performance is called out down below, I think the poor behavior of UDFs should be mentioned as well: pandas has no information on what a UDF is doing, and so has to infer (guess) how to handle the result.

In particular, I think it should be mentioned that none of the examples on this page should be UDFs in practice.
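To illustrate the point (this is a sketch for the thread, not code from the PR), here is a typical toy UDF next to the vectorized form that would be used in practice. The data and the arithmetic are made up for the example.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# UDF version: pandas calls the Python function once per element.
via_udf = s.apply(lambda x: x * 2 + 1)

# Vectorized version: the same arithmetic expressed on the whole Series.
vectorized = s * 2 + 1

print(via_udf.tolist(), vectorized.tolist())
```

Both give the same values here; the second form simply tells pandas what the operation is instead of hiding it inside a callable.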

arthurlw · 2025-03-29T17:10:53Z

Hi @rhshadrach, thanks for the feedback! I agree with you and will push updates soon.

rhshadrach

I think this is looking a lot better. Can we also link to https://pandas.pydata.org/pandas-docs/dev/user_guide/enhancingperf.html#numba-jit-compilation at the very bottom, in a section titled something like "Improving Performance with UDFs"?

rhshadrach · 2025-04-06T12:55:20Z

doc/source/user_guide/user_defined_functions.rst

+While UDFs provide flexibility, they come with significant drawbacks, primarily
+related to performance. Unlike vectorized pandas operations, UDFs are slower because pandas lacks


In my opinion the primary drawback is behavior and not performance, but others may disagree. I'd suggest not being opinionated here between the two, but rather saying "primarily related to performance and behavior".

In any case, can you include that pandas must perform inference on the result, and that this inference can be incorrect?
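A small sketch of what result inference looks like in practice (illustrative only, with made-up data): the same row-wise logic produces a differently shaped result depending on the type the UDF happens to return, because pandas can only inspect the returned objects.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Returning a list per row: each list is kept as one element of a Series.
as_lists = df.apply(lambda row: [row["a"], row["b"], 0], axis=1)

# Returning a Series per row: the results are expanded into DataFrame columns.
as_frame = df.apply(lambda row: pd.Series([row["a"], row["b"], 0]), axis=1)

print(type(as_lists).__name__, type(as_frame).__name__)
```

Neither outcome is wrong as such, but the caller gets a Series in one case and a DataFrame in the other without asking for either explicitly.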

rhshadrach · 2025-04-06T12:57:01Z

doc/source/user_guide/user_defined_functions.rst

+insight into what they are computing, making it difficult to apply efficient handling or optimization
+techniques. As a result, pandas resorts to less efficient processing methods that significantly
+slow down computations. Additionally, relying on UDFs often sacrifices the benefits
+of pandas’ built-in, optimized methods, limiting compatibility and overall performance.


It seems to me these three sentences are all saying the same thing, and that one sentence here would do.

rhshadrach · 2025-04-06T12:58:35Z

doc/source/user_guide/user_defined_functions.rst

+* **Custom Computations Are Needed**: Implementing complex logic or domain-specific calculations that pandas'
+  built-in methods cannot handle.
+* **Extending pandas' Functionality**: Applying external libraries or specialized algorithms unavailable in pandas.
+* **Handling Complex Grouped Operations**: Performing operations on grouped data that standard methods do not support.


For the last line here, would love to see a real-world example of this that couldn't be broken down into supported operations. But I'm okay with this staying regardless.

rhshadrach · 2025-04-06T13:02:47Z

doc/source/user_guide/user_defined_functions.rst

+ways to apply UDFs across different pandas data structures.
+
+.. note::
+    Some of these methods are can also be applied to Groupby Objects. Refer to :ref:`groupby`.


Can you also mention resample, rolling, expanding, and ewm? Perhaps link to each section in the User Guide.
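As a rough sketch of what such a mention could point at (not code from the PR), window objects accept UDFs through their `apply` method. The helper `value_range` and the data are hypothetical; only rolling and expanding are shown here, and the other objects are left to the linked User Guide sections.

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 6.0])


def value_range(window):
    # Hypothetical UDF: spread between the largest and smallest value.
    # With raw=True the window arrives as a NumPy array.
    return window.max() - window.min()


# Fixed-size window of 2 observations; the first result is NaN
# because the window is not yet full.
rolling_range = s.rolling(2).apply(value_range, raw=True)

# Expanding window: all observations seen so far.
expanding_range = s.expanding().apply(value_range, raw=True)
```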

rhshadrach · 2025-04-06T13:07:56Z

doc/source/user_guide/user_defined_functions.rst

+==================+======================================+===========================+====================+===========================+==========================================+
+| :meth:`apply`    | General-purpose function             | Yes                       | Yes (when axis=1)  | Slow                      | Custom row-wise or column-wise operations|
+------------------+--------------------------------------+---------------------------+--------------------+---------------------------+------------------------------------------+
+| :meth:`agg`      | Aggregation                          | Yes                       | No                 | Fast (if using built-ins) | Custom aggregation logic                 |


> Fast (if using built-ins)

We should decide if this page is about using UDFs, in which case I think e.g. .agg("sum") is not within the scope, or if it's about using methods that take UDFs.

I'd suggest the former, and remove any mention of not using UDFs - and with that the performance column.
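To make the distinction concrete, a minimal sketch with made-up data: the first call names a built-in aggregation, the second passes a UDF that happens to compute the same thing.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Named built-in: pandas knows which operation is being requested.
by_name = df.agg("sum")

# UDF: pandas only sees an opaque callable and calls it on each column.
by_udf = df.agg(lambda col: col.sum())

print(by_name.tolist(), by_udf.tolist())
```

Only the second call would be in scope for a page that is strictly about UDFs.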

rhshadrach · 2025-04-06T13:10:50Z

doc/source/user_guide/user_defined_functions.rst

+    def is_long_name(column_name):
+        return len(column_name) > 1
+
+    df_filtered = df[[col for col in df.columns if is_long_name(col)]]


This example doesn't actually use .filter. Shouldn't it?
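For reference, a sketch of an example that does call a filter method with a UDF. This assumes the page means `GroupBy.filter`, since `DataFrame.filter` selects labels via `items`, `like`, or `regex` and does not take a callable. The helper `has_multiple_rows` and the data are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})


def has_multiple_rows(group):
    # Hypothetical UDF: keep only groups with more than one row.
    return len(group) > 1


filtered = df.groupby("key").filter(has_multiple_rows)
print(filtered)
```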

rhshadrach · 2025-04-06T13:12:08Z

doc/source/user_guide/user_defined_functions.rst

+The pipe method is useful for chaining operations together into a clean and readable pipeline.
+It is a helpful tool for organizing complex data processing workflows.
+
+When to use: Use pipe when you need to create a pipeline of transformations and want to keep the code readable and maintainable.


Nit: We'll probably want to stay away from "transformations" here to avoid confusion with .transform. I'd suggest "operations".
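A short sketch of the kind of pipeline the quoted text describes (illustrative only; the helper names `drop_incomplete` and `add_total` and the data are made up):

```python
import pandas as pd


def drop_incomplete(df):
    # Remove rows that contain missing values.
    return df.dropna()


def add_total(df, columns):
    # Add a column holding the row-wise sum of the given columns.
    return df.assign(total=df[columns].sum(axis=1))


df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [10.0, 20.0, 30.0]})

# Each step receives the DataFrame returned by the previous one.
result = df.pipe(drop_incomplete).pipe(add_total, columns=["a", "b"])
print(result)
```

Extra arguments such as `columns` are forwarded by `pipe` to the function, which keeps the chain readable without nesting calls.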

rhshadrach · 2025-04-06T13:16:11Z

doc/source/user_guide/user_defined_functions.rst

+it is slower than vectorized operations and should be used only when you need operations
+that cannot be achieved with built-in pandas functions.
+
+When to use: :meth:`DataFrame.apply` is suitable when no alternative vectorized method is available, but consider


I think we should also recommend it only when no other UDF method is suitable.

arthurlw added 8 commits March 25, 2025 19:39

udf user guide introduction

3f94137

added apply method

bf984ca

added agg, transform and filter

fe67ec8

added map, pipe and vectorized operations

4ec5697

bugfix

11392d7

updated map method

f322d9e

precommit

b6b7b02

trim trailing whitespace

d20bcc7

toctree

72f7b62

rhshadrach requested changes Mar 29, 2025

rhshadrach added Apply Apply, Aggregate, Transform, Map Docs labels Mar 29, 2025

arthurlw added 5 commits March 29, 2025 13:28

restructured udf user guide

90a2d24

updated documentation links

0d02d64

precommit

214f0ac

fix links

fffaad0

change links

561a1f5

rhshadrach requested changes Apr 6, 2025
