DOC: User Guide Page on user-defined functions #61195
Open: arthurlw wants to merge 15 commits into pandas-dev:main from arthurlw:udf_user_guide
+276
−0
Commits (all by arthurlw):

* 3f94137 udf user guide introduction
* bf984ca added apply method
* fe67ec8 added agg, transform and filter
* 4ec5697 added map, pipe and vectorized operations
* 11392d7 bugfix
* f322d9e updated map method
* b6b7b02 precommit
* d20bcc7 trim trailing whitespace
* 72f7b62 toctree
* 90a2d24 restructured udf user guide
* 0d02d64 updated documentation links
* 214f0ac precommit
* fffaad0 fix links
* 561a1f5 change links
* c6891a0 updated user guide
@@ -88,3 +88,4 @@ Guides
sparse
gotchas
cookbook
user_defined_functions
@@ -0,0 +1,275 @@
.. _user_defined_functions:

{{ header }}

**************************************
Introduction to User-Defined Functions
**************************************

In pandas, User-Defined Functions (UDFs) provide a way to extend the library’s
functionality by allowing users to apply custom computations to their data. While
pandas comes with a set of built-in functions for data manipulation, UDFs offer
flexibility when built-in methods are not sufficient. These functions can be
applied at different levels (element-wise, row-wise, column-wise, or group-wise),
and the shape of the result depends on the method used.

Why Not To Use User-Defined Functions
-------------------------------------

While UDFs provide flexibility, they come with significant drawbacks, primarily
related to performance and behavior. pandas cannot know in advance what a UDF
returns, so it must infer the dtype and shape of the result from the returned
values, and that inference may not match what you intended. Furthermore, unlike
vectorized operations, UDFs are typically called from Python once per element, row,
or group, so pandas cannot optimize the computation and processing is slower.
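
As a minimal sketch of the inference described above (the sample values and the helper name ``halve_or_label`` are illustrative assumptions, not part of the original guide), the dtype of the result depends entirely on what the UDF happens to return:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# The UDF always returns an int, so pandas infers an integer dtype.
doubled = s.apply(lambda value: value * 2)

# The UDF always returns a float, so pandas infers a float dtype.
halved = s.apply(lambda value: value / 2)

# Hypothetical UDF that returns an int for odd inputs and a str otherwise.
# The only dtype that can hold every result is ``object``.
def halve_or_label(value):
    return value if value % 2 else str(value)

mixed = s.apply(halve_or_label)

print(doubled.dtype, halved.dtype, mixed.dtype)
```

A UDF that returns inconsistent types therefore silently produces an ``object`` column, which is usually slower and harder to work with than a numeric one.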
|
||
.. note::
    In general, most tasks can and should be accomplished using pandas’ built-in methods or vectorized operations.

Despite their drawbacks, UDFs can be helpful when:

* **Custom Computations Are Needed**: Implementing complex logic or domain-specific calculations that pandas'
  built-in methods cannot handle.
* **Extending pandas' Functionality**: Applying external libraries or specialized algorithms unavailable in pandas.
* **Handling Complex Grouped Operations**: Performing operations on grouped data that standard methods do not support.

For example, fitting a separate regression model to each group:

.. code-block:: python

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Sample data
    df = pd.DataFrame({
        'group': ['A', 'A', 'A', 'B', 'B', 'B'],
        'x': [1, 2, 3, 1, 2, 3],
        'y': [2, 4, 6, 1, 2, 1.5]
    })

    # Fit a linear model to one group and attach its predictions
    def fit_model(group):
        group = group.copy()  # avoid mutating the object pandas passes in
        model = LinearRegression()
        model.fit(group[['x']], group['y'])
        group['y_pred'] = model.predict(group[['x']])
        return group

    result = df.groupby('group').apply(fit_model)
|
||
|
||
Methods that support User-Defined Functions
-------------------------------------------

User-Defined Functions can be applied across various pandas methods:

* :meth:`~DataFrame.apply` - A flexible method that allows applying a function to Series,
  DataFrames, or groups of data.
* :meth:`~DataFrame.agg` (Aggregate) - Used for summarizing data, supporting multiple
  aggregation functions.
* :meth:`~DataFrame.transform` - Applies a function while preserving the shape of
  the original data.
* :meth:`~DataFrame.filter` - Subsets rows or columns by label. It does not accept a UDF
  directly; the GroupBy counterpart, :meth:`.DataFrameGroupBy.filter`, does.
* :meth:`~DataFrame.map` - Applies a function element-wise, useful for
  transforming individual values.
* :meth:`~DataFrame.pipe` - Allows chaining custom functions to process entire DataFrames or
  Series in a clean, readable manner.

All of these pandas methods can be used with both Series and DataFrame objects, providing versatile
ways to apply UDFs across different pandas data structures.

.. note::
    Some of these methods can also be applied to GroupBy objects. Refer to :ref:`groupby`.

Review comment: Can you also make a mention of resample, rolling, expanding, and ewm?
Perhaps link to each section in the User Guide.

Additionally, operations such as :ref:`resample()<timeseries>`, :ref:`rolling()<window>`,
:ref:`expanding()<window>`, and :ref:`ewm()<window>` also support UDFs for performing custom
computations over temporal or statistical windows.
|
||
|
||
Choosing the Right Method
-------------------------

When applying UDFs in pandas, it is essential to select the appropriate method based
on your specific task. Each method has its strengths and is designed for different use
cases. Understanding the purpose and behavior of each method will help you make informed
decisions, ensuring more efficient and maintainable code.

Below is an overview of the methods discussed in this guide:

+------------------------------+---------------------------------------+---------------+-------------------------+-------------------------------------------+
| Method                       | Purpose                               | Supports UDFs | Keeps Shape             | Recommended Use Case                      |
+==============================+=======================================+===============+=========================+===========================================+
| :meth:`~DataFrame.apply`     | General-purpose function              | Yes           | Depends on the function | Custom row-wise or column-wise operations |
+------------------------------+---------------------------------------+---------------+-------------------------+-------------------------------------------+
| :meth:`~DataFrame.agg`       | Aggregation                           | Yes           | No                      | Custom aggregation logic                  |
+------------------------------+---------------------------------------+---------------+-------------------------+-------------------------------------------+
| :meth:`~DataFrame.transform` | Transform without reducing dimensions | Yes           | Yes                     | Broadcast element-wise transformations    |
+------------------------------+---------------------------------------+---------------+-------------------------+-------------------------------------------+
| :meth:`~DataFrame.map`       | Element-wise mapping                  | Yes           | Yes                     | Simple element-wise transformations       |
+------------------------------+---------------------------------------+---------------+-------------------------+-------------------------------------------+
| :meth:`~DataFrame.pipe`      | Functional chaining                   | Yes           | Depends on the function | Building clean operation pipelines        |
+------------------------------+---------------------------------------+---------------+-------------------------+-------------------------------------------+
| :meth:`~DataFrame.filter`    | Row/column selection by label         | Not directly  | No                      | Subsetting by row or column labels        |
+------------------------------+---------------------------------------+---------------+-------------------------+-------------------------------------------+
|
||
:meth:`DataFrame.apply`
~~~~~~~~~~~~~~~~~~~~~~~

The :meth:`DataFrame.apply` method lets you apply UDFs along either rows or columns.
While flexible, it is slower than vectorized operations and should be used only when
you need operations that cannot be achieved with built-in pandas functions.

When to use: :meth:`DataFrame.apply` is suitable when no alternative vectorized method
or more specific UDF method is available, but consider optimizing performance with
vectorized operations wherever possible.

Documentation can be found at :meth:`~DataFrame.apply`.
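
The row-versus-column behavior can be sketched as follows. The column names and the ``value_range`` helper are assumptions made up for this illustration:

```python
import pandas as pd

df = pd.DataFrame({"low": [1, 4, 7], "high": [3, 9, 8]})

# Hypothetical UDF: receives one column (axis=0) or one row (axis=1) as a Series.
def value_range(values):
    return values.max() - values.min()

per_column = df.apply(value_range)       # default axis=0: one result per column
per_row = df.apply(value_range, axis=1)  # axis=1: one result per row

print(per_column)
print(per_row)
```

Both calls reduce each Series to a scalar, so the result is a Series rather than a DataFrame. This is one reason the output shape of ``apply`` depends on the function you pass.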
|
||
:meth:`DataFrame.agg`
~~~~~~~~~~~~~~~~~~~~~

If you need to aggregate data, :meth:`DataFrame.agg` is a better choice than apply because it is
specifically designed for aggregation operations.

When to use: Use :meth:`DataFrame.agg` for performing aggregations like sum, mean, or custom
aggregation functions across columns or groups.

Documentation can be found at :meth:`~DataFrame.agg`.
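
A minimal sketch of mixing a built-in aggregation with a custom one on grouped data. The data and the ``spread`` function are hypothetical names chosen for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "score": [10, 14, 3, 9],
})

# Hypothetical custom aggregation: reduces each group's values to one number.
def spread(values):
    return values.max() - values.min()

# A string names a built-in aggregation; a callable is treated as a UDF.
summary = df.groupby("team")["score"].agg(["mean", spread])

print(summary)
```

The result has one row per group and one column per aggregation, with the UDF's column named after the function.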
|
||
:meth:`DataFrame.transform`
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The transform method is ideal for performing element-wise transformations while preserving the shape of the original DataFrame.
It can be faster than apply, particularly on grouped data when given the name of a built-in
operation, because pandas can then use its internal optimizations.

When to use: When you need to perform element-wise transformations that retain the original structure of the DataFrame.

Documentation can be found at :meth:`~DataFrame.transform`.

Passing an aggregation such as ``mean`` or ``sum`` to a grouped ``transform`` broadcasts
each group's result back to the original dimensions:

.. ipython:: python

    # Sample DataFrame
    df = pd.DataFrame({
        'Category': ['A', 'A', 'B', 'B', 'B'],
        'Values': [10, 20, 30, 40, 50]
    })

    # Using transform with mean
    df['Mean_Transformed'] = df.groupby('Category')['Values'].transform('mean')

    # Using transform with sum
    df['Sum_Transformed'] = df.groupby('Category')['Values'].transform('sum')

    # Result broadcasted to DataFrame
    print(df)
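
The same pattern works with a UDF instead of a built-in name. The following is a sketch under the assumption that you want to center each value on its group mean; the ``center`` function and the ``Centered`` column are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "A", "B", "B", "B"],
    "Values": [10, 20, 30, 40, 50],
})

# Hypothetical UDF: receives each group's values and returns a result of the
# same length, so the output lines up with the original rows.
def center(values):
    return values - values.mean()

df["Centered"] = df.groupby("Category")["Values"].transform(center)

print(df)
```

Because the UDF returns one value per input row, the result can be assigned straight back as a new column.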
|
||
:meth:`DataFrame.filter`
~~~~~~~~~~~~~~~~~~~~~~~~

The :meth:`DataFrame.filter` method is used to select subsets of the DataFrame’s
columns or rows according to their labels. It is useful when you want to extract
specific columns or rows whose labels match particular conditions.

When to use: Use :meth:`DataFrame.filter` when you want to create a subset of a
DataFrame or Series based on its labels.

.. note::
    :meth:`DataFrame.filter` does not accept UDFs, but can accept
    list comprehensions that have UDFs applied to them.

.. ipython:: python

    # Sample DataFrame
    df = pd.DataFrame({
        'AA': [1, 2, 3],
        'BB': [4, 5, 6],
        'C': [7, 8, 9],
        'D': [10, 11, 12]
    })

    # Return True for column names longer than one character (these are kept)
    def is_long_name(column_name):
        return len(column_name) > 1

    df_filtered = df.filter(items=[col for col in df.columns if is_long_name(col)])
    print(df_filtered)

Since filter does not directly accept a UDF, you have to apply the UDF indirectly,
for example, by using list comprehensions.

Documentation can be found at :meth:`~DataFrame.filter`.
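
For contrast, the GroupBy method of the same name does take a UDF directly: it calls the function once per group and keeps only the groups for which it returns ``True``. A minimal sketch, with a made-up threshold and the hypothetical helper ``has_large_total``:

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "A", "B", "B", "B"],
    "Values": [10, 20, 30, 40, 50],
})

# Hypothetical UDF: receives one group as a DataFrame and returns a single bool.
def has_large_total(group):
    return group["Values"].sum() > 50

# Rows of group "A" (total 30) are dropped; rows of group "B" (total 120) are kept.
kept = df.groupby("Category").filter(has_large_total)

print(kept)
```

Note that this selects rows by group membership, whereas :meth:`DataFrame.filter` selects by label.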
|
||
:meth:`DataFrame.map`
~~~~~~~~~~~~~~~~~~~~~

:meth:`DataFrame.map` is used specifically to apply element-wise UDFs. For this purpose
it is a better fit than :meth:`DataFrame.apply`, because it states the intent directly
and typically performs better.

When to use: Use map for applying element-wise UDFs to DataFrames or Series.

Documentation can be found at :meth:`~DataFrame.map`.
|
||
:meth:`DataFrame.pipe`
~~~~~~~~~~~~~~~~~~~~~~

The pipe method is useful for chaining operations together into a clean and readable pipeline.
It is a helpful tool for organizing complex data processing workflows.

When to use: Use pipe when you need to create a pipeline of operations and want to keep the code readable and maintainable.

Documentation can be found at :meth:`~DataFrame.pipe`.
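
As an illustration, each step below is a plain function that takes a DataFrame and returns a new one. The step names ``drop_missing`` and ``add_total`` and the ``markup`` parameter are assumptions for this sketch:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, None, 30.0], "qty": [1, 2, 3]})

# Hypothetical pipeline steps: each receives the whole DataFrame.
def drop_missing(frame):
    return frame.dropna()

def add_total(frame, markup):
    return frame.assign(total=frame["price"] * frame["qty"] * (1 + markup))

# pipe passes the DataFrame as the first argument and forwards any extras.
result = df.pipe(drop_missing).pipe(add_total, markup=0.5)

print(result)
```

Without ``pipe`` the same logic would read inside-out as ``add_total(drop_missing(df), markup=0.5)``, which becomes harder to follow as steps are added.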
|
||
|
||
Best Practices
--------------

While UDFs provide flexibility, their use is currently discouraged as they can introduce
performance issues, especially when written in pure Python. To improve efficiency,
consider using built-in ``NumPy`` or ``pandas`` functions instead of UDFs
for common operations.

.. note::
    If performance is critical, explore **vectorized operations** before resorting
    to UDFs.

Vectorized Operations
~~~~~~~~~~~~~~~~~~~~~

Below is a comparison of using UDFs versus using Vectorized Operations:

.. code-block:: python

    # Assumes ``df`` is a DataFrame with numeric columns "one" and "two"

    # User-defined function, called once per row
    def calc_ratio(row):
        return 100 * (row["one"] / row["two"])

    df["new_col"] = df.apply(calc_ratio, axis=1)

    # Vectorized Operation, computed on whole columns at once
    df["new_col2"] = 100 * (df["one"] / df["two"])

Measuring how long each operation takes:

.. code-block:: text

    User-defined function: 5.6435 secs
    Vectorized: 0.0043 secs

Vectorized operations in pandas are significantly faster than using :meth:`DataFrame.apply`
with UDFs because they leverage highly optimized C functions
via NumPy to process entire arrays at once. This approach avoids the overhead of looping
through rows in Python and making separate function calls for each row, which is slow and
inefficient. Additionally, NumPy arrays benefit from memory efficiency and CPU-level
optimizations, making vectorized operations the preferred choice whenever possible.
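
The guide does not say how the timings above were measured, so the following is only one possible way to reproduce such a comparison. The frame size (10,000 rows), the random data, and the use of ``timeit`` are all assumptions; your numbers will differ from those quoted:

```python
import timeit

import numpy as np
import pandas as pd

# Assumed sample data: 10,000 rows, with "two" kept away from zero.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "one": rng.random(10_000),
    "two": rng.random(10_000) + 1.0,
})

def calc_ratio(row):
    return 100 * (row["one"] / row["two"])

# Time a single run of each approach.
udf_seconds = timeit.timeit(lambda: df.apply(calc_ratio, axis=1), number=1)
vectorized_seconds = timeit.timeit(
    lambda: 100 * (df["one"] / df["two"]), number=1
)

print(f"User-defined function: {udf_seconds:.4f} secs")
print(f"Vectorized: {vectorized_seconds:.4f} secs")
```

Both approaches compute the same values; only the time taken differs.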
|
||
|
||
Improving Performance with UDFs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In scenarios where UDFs are necessary, there are still ways to mitigate their performance drawbacks.
One approach is to use **Numba**, a Just-In-Time (JIT) compiler that can significantly speed up numerical
Python code by compiling Python functions to optimized machine code at runtime.

By annotating your UDFs with ``@numba.jit``, you can achieve performance closer to vectorized operations,
especially for computationally heavy tasks.

.. note::
    You may also refer to the user guide on `Enhancing performance <https://pandas.pydata.org/pandas-docs/dev/user_guide/enhancingperf.html#numba-jit-compilation>`_
    for a more detailed guide to using **Numba**.
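
A minimal sketch of the idea, assuming Numba is installed. The ``clipped_ratio`` function and its ``upper`` parameter are hypothetical, and the fallback decorator exists only so the example still runs (uncompiled) without Numba:

```python
import numpy as np
import pandas as pd

try:
    import numba
    jit = numba.jit(nopython=True)
except ImportError:
    # Assumption for this sketch: without Numba, run the plain Python function.
    def jit(func):
        return func

# Hypothetical UDF written as an explicit loop over NumPy arrays.
# Numba compiles loops like this to machine code on the first call.
@jit
def clipped_ratio(one, two, upper):
    out = np.empty(one.shape[0])
    for i in range(one.shape[0]):
        ratio = 100.0 * one[i] / two[i]
        out[i] = ratio if ratio < upper else upper
    return out

df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [4.0, 1.0, 2.0]})

# Pass NumPy arrays rather than Series, since Numba works on arrays.
df["ratio"] = clipped_ratio(df["one"].to_numpy(), df["two"].to_numpy(), 120.0)

print(df)
```

The first call includes compilation time, so the benefit shows up on large inputs or repeated calls.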
Review comment: For the last line here, would love to see a real-world example of this that couldn't be broken down into supported operations. But I'm okay with this staying regardless.