From 3f94137161f7fa3d5d05d1a9284d04c0e971bb79 Mon Sep 17 00:00:00 2001
From: arthurlw
Date: Tue, 25 Mar 2025 19:39:11 -0700
Subject: [PATCH 01/16] udf user guide introduction

---
 .../user_guide/user_defined_functions.rst | 51 +++++++++++++++++++
 1 file changed, 51 insertions(+)
 create mode 100644 doc/source/user_guide/user_defined_functions.rst

diff --git a/doc/source/user_guide/user_defined_functions.rst b/doc/source/user_guide/user_defined_functions.rst
new file mode 100644
index 0000000000000..e874d11324e1c
--- /dev/null
+++ b/doc/source/user_guide/user_defined_functions.rst
@@ -0,0 +1,51 @@
+.. _user_defined_functions:
+
+{{ header }}
+
+**************************************
+Introduction to User Defined Functions
+**************************************
+
+In pandas, User Defined Functions (UDFs) provide a way to extend the library’s
+functionality by allowing users to apply custom computations to their data. While
+pandas comes with a set of built-in functions for data manipulation, UDFs offer
+flexibility when built-in methods are not sufficient. These functions can be
+applied at different levels: element-wise, row-wise, column-wise, or group-wise,
+depending on the method used.
+
+Note: User Defined Functions will be abbreviated to UDFs throughout this guide.
+
+Why Use UDFs?
+-------------
+
+Pandas is designed for high-performance data processing, but sometimes your specific
+needs go beyond standard aggregation, transformation, or filtering. UDFs allow you to:
+* Customize Computations: Implement logic tailored to your dataset, such as complex
+  transformations, domain-specific calculations, or conditional modifications.
+* Improve Code Readability: Encapsulate logic into functions rather than writing long,
+  complex expressions.
+* Handle Complex Grouped Operations: Perform operations on grouped data that standard
+  methods do not support.
+* Extend pandas' Functionality: Apply external libraries or advanced calculations that
+  are not natively available.
+
+
+Where Can UDFs Be Used?
+-----------------------
+
+UDFs can be applied across various pandas methods that work with both Series and DataFrames:
+
+* :meth:`DataFrame.apply` - A flexible method that allows applying a function to Series,
+  DataFrames, or groups of data.
+* :meth:`DataFrame.agg` (Aggregate) - Used for summarizing data, supporting multiple
+  aggregation functions.
+* :meth:`DataFrame.transform` - Applies a function to groups while preserving the shape of
+  the original data.
+* :meth:`DataFrame.filter` - Filters groups based on a function returning a Boolean condition.
+* :meth:`DataFrame.map` - Applies an element-wise function to a Series, useful for
+  transforming individual values.
+* :meth:`DataFrame.pipe` - Allows chaining custom functions to process entire DataFrames or
+  Series in a clean, readable manner.
+
+Each of these methods can be used with both Series and DataFrame objects, providing versatile
+ways to apply user-defined functions across different pandas data structures.
\ No newline at end of file

From bf984caf3c199e47531d073da120633473a4313a Mon Sep 17 00:00:00 2001
From: arthurlw
Date: Tue, 25 Mar 2025 21:12:27 -0700
Subject: [PATCH 02/16] added apply method

---
 .../user_guide/user_defined_functions.rst | 135 ++++++++++++++++--
 1 file changed, 122 insertions(+), 13 deletions(-)

diff --git a/doc/source/user_guide/user_defined_functions.rst b/doc/source/user_guide/user_defined_functions.rst
index e874d11324e1c..867c59fc2685f 100644
--- a/doc/source/user_guide/user_defined_functions.rst
+++ b/doc/source/user_guide/user_defined_functions.rst
@@ -3,37 +3,40 @@
 {{ header }}
 
 **************************************
-Introduction to User Defined Functions
+Introduction to User-Defined Functions
 **************************************
 
-In pandas, User Defined Functions (UDFs) provide a way to extend the library’s
+In pandas, User-Defined Functions (UDFs) provide a way to extend the library’s
 functionality by allowing users to apply custom computations to their data. While
 pandas comes with a set of built-in functions for data manipulation, UDFs offer
 flexibility when built-in methods are not sufficient. These functions can be
 applied at different levels: element-wise, row-wise, column-wise, or group-wise,
 depending on the method used.
 
-Note: User Defined Functions will be abbreviated to UDFs throughout this guide.
+.. .. note::
+
+.. User-Defined Functions will be abbreviated to UDFs throughout this guide.
 
-Why Use UDFs?
--------------
+Why Use User-Defined Functions?
+-------------------------------
 
 Pandas is designed for high-performance data processing, but sometimes your specific
 needs go beyond standard aggregation, transformation, or filtering. UDFs allow you to:
+
-* Customize Computations: Implement logic tailored to your dataset, such as complex
+* **Customize Computations**: Implement logic tailored to your dataset, such as complex
   transformations, domain-specific calculations, or conditional modifications.
-* Improve Code Readability: Encapsulate logic into functions rather than writing long,
+* **Improve Code Readability**: Encapsulate logic into functions rather than writing long,
   complex expressions.
-* Handle Complex Grouped Operations: Perform operations on grouped data that standard
+* **Handle Complex Grouped Operations**: Perform operations on grouped data that standard
   methods do not support.
-* Extend pandas' Functionality: Apply external libraries or advanced calculations that
+* **Extend pandas' Functionality**: Apply external libraries or advanced calculations that
   are not natively available.
 
 
-Where Can UDFs Be Used?
------------------------
+What functions support User-Defined Functions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-UDFs can be applied across various pandas methods that work with both Series and DataFrames:
+UDFs can be applied across various pandas methods that work with Series and DataFrames:
 
 * :meth:`DataFrame.apply` - A flexible method that allows applying a function to Series,
   DataFrames, or groups of data.
@@ -48,4 +51,110 @@ UDFs can be applied across various pandas methods that work with both Series and
   Series in a clean, readable manner.
 
 Each of these methods can be used with both Series and DataFrame objects, providing versatile
-ways to apply user-defined functions across different pandas data structures.
\ No newline at end of file
+ways to apply user-defined functions across different pandas data structures.
+
+
+:meth:`DataFrame.apply`
+-----------------------
+
+The :meth:`DataFrame.apply` allows applying a user-defined functions along either axis (rows or columns):
+
+..
ipython:: python
+
+    import pandas as pd
+
+    # Sample DataFrame
+    df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
+
+    # User-Defined Function
+    def add_one(x):
+        return x + 1
+
+    # Apply function
+    df_transformed = df.apply(add_one)
+    print(df_transformed)
+
+    # This works with lambda functions too
+    df_lambda = df.apply(lambda x : x + 1)
+    print(df_lambda)
+
+
+:meth:`DataFrame.apply` also accepts dictionaries of multiple user-defined functions:
+
+.. ipython:: python
+
+    import pandas as pd
+
+    # Sample DataFrame
+    df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3]})
+
+    # User-Defined Function
+    def add_one(x):
+        return x + 1
+
+    def add_two(x):
+        return x + 2
+
+    # Apply function
+    df_transformed = df.apply({"A": add_one, "B": add_two})
+    print(df_transformed)
+
+    # This works with lambda functions too
+    df_lambda = df.apply({"A": lambda x : x + 1, "B": lambda x : x + 2})
+    print(df_lambda)
+
+:meth:`DataFrame.apply` works with Series objects as well:
+
+.. ipython:: python
+
+    import pandas as pd
+
+    # Sample Series
+    s = pd.Series([1, 2, 3])
+
+    # User-Defined Function
+    def add_one(x):
+        return x + 1
+
+    # Apply function
+    s_transformed = s.apply(add_one)
+    print(df_transformed)
+
+    # This works with lambda functions too
+    s_lambda = s.apply(lambda x : x + 1)
+    print(s_lambda)
+
+:meth:`DataFrame.agg`
+---------------------
+
+When working with grouped data, user-defined functions can be used within :meth:`DataFrame.agg`:
+
+..
ipython:: python
+
+    # Sample DataFrame
+    df = pd.DataFrame({
+        'Category': ['A', 'A', 'B', 'B'],
+        'Values': [10, 20, 30, 40]
+    })
+
+    # Define a function for group operations
+    def group_mean(group):
+        return group.mean()
+
+    # Apply UDF to each group
+    grouped_result = df.groupby('Category')['Values'].agg(group_mean)
+    print(grouped_result)
+
+Performance Considerations
+--------------------------
+
+While UDFs provide flexibility, their use is currently discouraged as they can introduce performance issues, especially when
+written in pure Python. To improve efficiency:
+
+* Use **vectorized operations** (`NumPy` or `pandas` built-ins) when possible.
+* Leverage **Cython or Numba** to speed up computations.
+* Consider using **pandas' built-in methods** instead of UDFs for common operations.
+
+.. note::
+    If performance is critical, explore **pandas' vectorized functions** before resorting
+    to UDFs.
\ No newline at end of file

From fe67ec87e92c722c89c1e61ca28974b0ae985ab6 Mon Sep 17 00:00:00 2001
From: arthurlw
Date: Thu, 27 Mar 2025 20:30:58 -0700
Subject: [PATCH 03/16] added agg, transform and filter

---
 .../user_guide/user_defined_functions.rst | 100 ++++++++++++++----
 1 file changed, 80 insertions(+), 20 deletions(-)

diff --git a/doc/source/user_guide/user_defined_functions.rst b/doc/source/user_guide/user_defined_functions.rst
index 867c59fc2685f..5489ce2329179 100644
--- a/doc/source/user_guide/user_defined_functions.rst
+++ b/doc/source/user_guide/user_defined_functions.rst
@@ -13,10 +13,6 @@ flexibility when built-in methods are not sufficient. These functions can be
 applied at different levels: element-wise, row-wise, column-wise, or group-wise,
 depending on the method used.
 
-.. .. note::
-
-.. User-Defined Functions will be abbreviated to UDFs throughout this guide.
-
 Why Use User-Defined Functions?
 -------------------------------
 
@@ -36,7 +32,7 @@ needs go beyond standard aggregation, transformation, or filtering.
UDFs allow y
 What functions support User-Defined Functions
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-UDFs can be applied across various pandas methods that work with Series and DataFrames:
+User-Defined Functions can be applied across various pandas methods that work with Series and DataFrames:
 
 * :meth:`DataFrame.apply` - A flexible method that allows applying a function to Series,
   DataFrames, or groups of data.
@@ -60,7 +56,6 @@ ways to apply user-defined functions across different pandas data structures.
 The :meth:`DataFrame.apply` allows applying a user-defined functions along either axis (rows or columns):
 
 .. ipython:: python
-
     import pandas as pd
 
     # Sample DataFrame
@@ -71,8 +66,8 @@ The :meth:`DataFrame.apply` allows applying a user-defined functions along eithe
         return x + 1
 
     # Apply function
-    df_transformed = df.apply(add_one)
-    print(df_transformed)
+    df_applied = df.apply(add_one)
+    print(df_applied)
 
     # This works with lambda functions too
     df_lambda = df.apply(lambda x : x + 1)
@@ -82,9 +77,6 @@ The :meth:`DataFrame.apply` allows applying a user-defined functions along eithe
 :meth:`DataFrame.apply` also accepts dictionaries of multiple user-defined functions:
 
 .. ipython:: python
-
-    import pandas as pd
-
     # Sample DataFrame
     df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3]})
 
@@ -96,8 +88,8 @@ The :meth:`DataFrame.apply` allows applying a user-defined functions along eithe
         return x + 2
 
     # Apply function
-    df_transformed = df.apply({"A": add_one, "B": add_two})
-    print(df_transformed)
+    df_applied = df.apply({"A": add_one, "B": add_two})
+    print(df_applied)
 
     # This works with lambda functions too
     df_lambda = df.apply({"A": lambda x : x + 1, "B": lambda x : x + 2})
@@ -106,9 +98,6 @@ The :meth:`DataFrame.apply` allows applying a user-defined functions along eithe
 :meth:`DataFrame.apply` works with Series objects as well:
 
 ..
ipython:: python
-
-    import pandas as pd
-
     # Sample Series
     s = pd.Series([1, 2, 3])
 
@@ -117,8 +106,8 @@ The :meth:`DataFrame.apply` allows applying a user-defined functions along eithe
         return x + 1
 
     # Apply function
-    s_transformed = s.apply(add_one)
-    print(df_transformed)
+    s_applied = s.apply(add_one)
+    print(s_applied)
 
     # This works with lambda functions too
     s_lambda = s.apply(lambda x : x + 1)
@@ -127,10 +116,9 @@ The :meth:`DataFrame.apply` allows applying a user-defined functions along eithe
 :meth:`DataFrame.agg`
 ---------------------
 
-When working with grouped data, user-defined functions can be used within :meth:`DataFrame.agg`:
+The :meth:`DataFrame.agg` allows aggregation with a user-defined function along either axis (rows or columns):
 
 .. ipython:: python
-
     # Sample DataFrame
     df = pd.DataFrame({
         'Category': ['A', 'A', 'B', 'B'],
@@ -145,6 +133,78 @@ When working with grouped data, user-defined functions can be used within :meth:
     grouped_result = df.groupby('Category')['Values'].agg(group_mean)
     print(grouped_result)
 
+In terms of the API, :meth:`DataFrame.agg` has similar usage to :meth:`DataFrame.apply`,
+but it is primarily used for **aggregation**, applying functions that summarize or reduce data.
+Typically, the result of :meth:`DataFrame.agg` reduces the dimensions of data as shown
+in the above example. Conversely, :meth:`DataFrame.apply` is more general and allows for both
+transformations and custom row-wise or element-wise operations.
+
+:meth:`DataFrame.transform`
+---------------------------
+
+The :meth:`DataFrame.transform` allows transforms a Dataframe, Series or Grouped object
+while preserving the original shape of the object.
+
+..
ipython:: python
+    # Sample DataFrame
+    df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
+
+    # User-Defined Function
+    def double(x):
+        return x * 2
+
+    # Apply transform
+    df_transformed = df.transform(double)
+    print(df_transformed)
+
+    # This works with lambda functions too
+    df_lambda = df.transform(lambda x: x * 2)
+    print(df_lambda)
+
+Attempting to use common aggregation functions such as `mean` or `sum` will result in
+values being broadcasted to the original dimensions:
+
+.. ipython:: python
+    # Sample DataFrame
+    df = pd.DataFrame({
+        'Category': ['A', 'A', 'B', 'B', 'B'],
+        'Values': [10, 20, 30, 40, 50]
+    })
+
+    # Using transform with mean
+    df['Mean_Transformed'] = df.groupby('Category')['Values'].transform('mean')
+
+    # Using transform with sum
+    df['Sum_Transformed'] = df.groupby('Category')['Values'].transform('sum')
+
+    # Result broadcasted to DataFrame
+    print(df)
+
+:meth:`DataFrame.filter`
+------------------------
+
+The :meth:`DataFrame.filter` method is used to select subsets of the DataFrame’s
+columns or rows and accepts user-defined functions. Specifically, these functions
+return boolean values to filter columns or rows. It is useful when you want to
+extract specific columns or rows that match particular conditions.
+
+.. ipython:: python
+    # Sample DataFrame
+    df = pd.DataFrame({
+        'A': [1, 2, 3],
+        'B': [4, 5, 6],
+        'C': [7, 8, 9],
+        'D': [10, 11, 12]
+    })
+
+    # Define a function that filters out columns where the name is longer than 1 character
+    df_filtered_func = df.filter(items=lambda x: len(x) > 1)
+    print(df_filtered_func)
+
+Unlike the methods discussed earlier, :meth:`DataFrame.filter` does not accept
+functions that do not return boolean values, such as `mean` or `sum`.
+
+
 Performance Considerations
 --------------------------
 

From 4ec569712e5be4b5b5648d7a7ca432ca1ddcbf3b Mon Sep 17 00:00:00 2001
From: arthurlw
Date: Fri, 28 Mar 2025 11:19:10 -0700
Subject: [PATCH 04/16] added map, pipe and vectorized operations

---
 .../user_guide/user_defined_functions.rst | 99 ++++++++++++++++---
 1 file changed, 85 insertions(+), 14 deletions(-)

diff --git a/doc/source/user_guide/user_defined_functions.rst b/doc/source/user_guide/user_defined_functions.rst
index 5489ce2329179..7addd0206edcf 100644
--- a/doc/source/user_guide/user_defined_functions.rst
+++ b/doc/source/user_guide/user_defined_functions.rst
@@ -11,13 +11,13 @@ functionality by allowing users to apply custom computations to their data. Whil
 pandas comes with a set of built-in functions for data manipulation, UDFs offer
 flexibility when built-in methods are not sufficient. These functions can be
 applied at different levels: element-wise, row-wise, column-wise, or group-wise,
-depending on the method used.
+and change the data differently, depending on the method used.
 
 Why Use User-Defined Functions?
 -------------------------------
 
 Pandas is designed for high-performance data processing, but sometimes your specific
-needs go beyond standard aggregation, transformation, or filtering. UDFs allow you to:
+needs go beyond standard aggregation, transformation, or filtering. User-defined functions allow you to:
 
 * **Customize Computations**: Implement logic tailored to your dataset, such as complex
   transformations, domain-specific calculations, or conditional modifications.
@@ -32,7 +32,7 @@ needs go beyond standard aggregation, transformation, or filtering.
UDFs allow y
 What functions support User-Defined Functions
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-User-Defined Functions can be applied across various pandas methods that work with Series and DataFrames:
+User-Defined Functions can be applied across various pandas methods:
 
 * :meth:`DataFrame.apply` - A flexible method that allows applying a function to Series,
   DataFrames, or groups of data.
@@ -46,7 +46,7 @@ User-Defined Functions can be applied across various pandas methods that work wi
 * :meth:`DataFrame.pipe` - Allows chaining custom functions to process entire DataFrames or
   Series in a clean, readable manner.
 
-Each of these methods can be used with both Series and DataFrame objects, providing versatile
+All of these pandas methods can be used with both Series and DataFrame objects, providing versatile
 ways to apply user-defined functions across different pandas data structures.
 
 
@@ -184,10 +184,13 @@ values being broadcasted to the original dimensions:
 ------------------------
 
 The :meth:`DataFrame.filter` method is used to select subsets of the DataFrame’s
-columns or rows and accepts user-defined functions. Specifically, these functions
-return boolean values to filter columns or rows. It is useful when you want to
+columns or rows and accepts user-defined functions. It is useful when you want to
 extract specific columns or rows that match particular conditions.
 
+.. note::
+    :meth:`DataFrame.filter` expects a user-defined function that returns a boolean
+    value
+
 .. ipython:: python
     # Sample DataFrame
     df = pd.DataFrame({
@@ -204,17 +207,85 @@ extract specific columns or rows that match particular conditions.
 Unlike the methods discussed earlier, :meth:`DataFrame.filter` does not accept
 functions that do not return boolean values, such as `mean` or `sum`.
 
+:meth:`DataFrame.map`
+---------------------
+
+The :meth:`DataFrame.map` method is used to apply a function element-wise to a pandas Series
+or Dataframe.
It is particularly useful for substituting values or transforming data.
+
+.. ipython:: python
+    # Sample DataFrame
+    df = pd.DataFrame({ 'A': ['cat', 'dog', 'bird'], 'B': ['pig', 'cow', 'lamb'] })
+
+    # Using map with a user-defined function
+    def animal_to_length(animal):
+        return len(animal)
+
+    df_mapped = df.map(animal_to_length)
+    print(df_mapped)
+
+    # This works with lambda functions too
+    df_lambda = df.map(lambda x: x.upper())
+    print(df_lambda)
+
+:meth:`DataFrame.pipe`
+----------------------
+
+The :meth:`DataFrame.pipe` method allows you to apply a function or a series of functions to a
+DataFrame in a clean and readable way. This is especially useful for building data processing pipelines.
+
+.. ipython:: python
+    # Sample DataFrame
+    df = pd.DataFrame({ 'A': [1, 2, 3], 'B': [4, 5, 6] })
+
+    # User-defined functions for transformation
+    def add_one(df):
+        return df + 1
+
+    def square(df):
+        return df ** 2
+
+    # Applying functions using pipe
+    df_piped = df.pipe(add_one).pipe(square)
+    print(df_piped)
+
+The advantage of using :meth:`DataFrame.pipe` is that it allows you to chain together functions
+without nested calls, promoting a cleaner and more readable code style.
+
 
 Performance Considerations
 --------------------------
 
-While UDFs provide flexibility, their use is currently discouraged as they can introduce performance issues, especially when
-written in pure Python. To improve efficiency:
-
-* Use **vectorized operations** (`NumPy` or `pandas` built-ins) when possible.
-* Leverage **Cython or Numba** to speed up computations.
-* Consider using **pandas' built-in methods** instead of UDFs for common operations.
+While UDFs provide flexibility, their use is currently discouraged as they can introduce
+performance issues, especially when written in pure Python. To improve efficiency,
+consider using built-in `NumPy` or `pandas` functions instead of user-defined functions
+for common operations.
 
 ..
note::
-    If performance is critical, explore **pandas' vectorized functions** before resorting
-    to UDFs.
\ No newline at end of file
+    If performance is critical, explore **vectorizated operations** before resorting
+    to user-defined functions.
+
+Vectorized Operations
+~~~~~~~~~~~~~~~~~~~~~
+
+Below is an example of vectorized operations in pandas:
+
+.. ipython:: python
+    # Vectorized operation:
+    df["new_col"] = 100 * (df["one"] / df["two"])
+
+    # User-defined function
+    def calc_ratio(row):
+        return 100 * (row["one"] / row["two"])
+
+    df["new_col2"] = df.apply(calc_ratio, axis=1)
+
+Measuring how long each operation takes:
+
+.. ipython:: python
+    Vectorized: 0.0043 secs
+    User-defined function: 5.6435 secs
+
+This happens because user-defined functions loop through each row and apply its function,
+while vectorized operations are applied to underlying `Numpy` arrays, skipping inefficient
+Python code.
\ No newline at end of file

From 11392d7fdcae2c3dabd33ea974ed298673965ded Mon Sep 17 00:00:00 2001
From: arthurlw
Date: Fri, 28 Mar 2025 12:07:42 -0700
Subject: [PATCH 05/16] bugfix

---
 .../user_guide/user_defined_functions.rst | 64 +++++++++++--------
 1 file changed, 39 insertions(+), 25 deletions(-)

diff --git a/doc/source/user_guide/user_defined_functions.rst b/doc/source/user_guide/user_defined_functions.rst
index 7addd0206edcf..961769e7d3532 100644
--- a/doc/source/user_guide/user_defined_functions.rst
+++ b/doc/source/user_guide/user_defined_functions.rst
@@ -40,7 +40,7 @@ User-Defined Functions can be applied across various pandas methods:
   aggregation functions.
 * :meth:`DataFrame.transform` - Applies a function to groups while preserving the shape of
   the original data.
-* :meth:`DataFrame.filter` - Filters groups based on a function returning a Boolean condition.
+* :meth:`DataFrame.filter` - Filters groups based on a list of Boolean conditions.
 * :meth:`DataFrame.map` - Applies an element-wise function to a Series, useful for
   transforming individual values.
 * :meth:`DataFrame.pipe` - Allows chaining custom functions to process entire DataFrames or
@@ -56,6 +56,7 @@ ways to apply user-defined functions across different pandas data structures.
 The :meth:`DataFrame.apply` allows applying a user-defined functions along either axis (rows or columns):
 
 .. ipython:: python
+
     import pandas as pd
 
     # Sample DataFrame
@@ -77,6 +78,7 @@ The :meth:`DataFrame.apply` allows applying a user-defined functions along eithe
 :meth:`DataFrame.apply` also accepts dictionaries of multiple user-defined functions:
 
 .. ipython:: python
+
     # Sample DataFrame
     df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3]})
 
@@ -98,6 +100,7 @@ The :meth:`DataFrame.apply` allows applying a user-defined functions along eithe
 :meth:`DataFrame.apply` works with Series objects as well:
 
 .. ipython:: python
+
     # Sample Series
     s = pd.Series([1, 2, 3])
 
@@ -119,6 +122,7 @@ The :meth:`DataFrame.apply` allows applying a user-defined functions along eithe
 The :meth:`DataFrame.agg` allows aggregation with a user-defined function along either axis (rows or columns):
 
 .. ipython:: python
+
     # Sample DataFrame
     df = pd.DataFrame({
         'Category': ['A', 'A', 'B', 'B'],
@@ -146,6 +150,7 @@ The :meth:`DataFrame.transform` allows transforms a Dataframe, Series or Grouped
 while preserving the original shape of the object.
 
 .. ipython:: python
+
     # Sample DataFrame
     df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
 
@@ -165,6 +170,7 @@ Attempting to use common aggregation functions such as `mean` or `sum` will resu
 values being broadcasted to the original dimensions:
 
 ..
ipython:: python
+
     # Sample DataFrame
     df = pd.DataFrame({
         'Category': ['A', 'A', 'B', 'B', 'B'],
@@ -184,28 +190,29 @@ values being broadcasted to the original dimensions:
 ------------------------
 
 The :meth:`DataFrame.filter` method is used to select subsets of the DataFrame’s
-columns or rows and accepts user-defined functions. It is useful when you want to
-extract specific columns or rows that match particular conditions.
+columns or row. It is useful when you want to extract specific columns or rows that
+match particular conditions.
 
 .. note::
-    :meth:`DataFrame.filter` expects a user-defined function that returns a boolean
-    value
+    :meth:`DataFrame.filter` does not accept user-defined functions, but can accept
+    list comprehensions that have user-defined functions applied to them.
 
 .. ipython:: python
+
     # Sample DataFrame
     df = pd.DataFrame({
-        'A': [1, 2, 3],
-        'B': [4, 5, 6],
+        'AA': [1, 2, 3],
+        'BB': [4, 5, 6],
         'C': [7, 8, 9],
         'D': [10, 11, 12]
     })
 
-    # Define a function that filters out columns where the name is longer than 1 character
-    df_filtered_func = df.filter(items=lambda x: len(x) > 1)
-    print(df_filtered_func)
+    def is_long_name(column_name):
+        return len(column_name) > 1
 
-Unlike the methods discussed earlier, :meth:`DataFrame.filter` does not accept
-functions that do not return boolean values, such as `mean` or `sum`.
+    # Define a function that filters out columns where the name is longer than 1 character
+    df_filtered = df[[col for col in df.columns if is_long_name(col)]]
+    print(df_filtered)
 
 :meth:`DataFrame.map`
 ---------------------
@@ -214,19 +221,20 @@ The :meth:`DataFrame.map` method is used to apply a function element-wise to a p
 or Dataframe. It is particularly useful for substituting values or transforming data.
 
 ..
ipython:: python
+
     # Sample DataFrame
-    df = pd.DataFrame({ 'A': ['cat', 'dog', 'bird'], 'B': ['pig', 'cow', 'lamb'] })
+    s = pd.Series(['cat', 'dog', 'bird'])
 
     # Using map with a user-defined function
     def animal_to_length(animal):
         return len(animal)
 
-    df_mapped = df.map(animal_to_length)
-    print(df_mapped)
+    s_mapped = s.map(animal_to_length)
+    print(s_mapped)
 
     # This works with lambda functions too
-    df_lambda = df.map(lambda x: x.upper())
-    print(df_lambda)
+    s_lambda = s.map(lambda x: x.upper())
+    print(s_lambda)
 
 :meth:`DataFrame.pipe`
 ----------------------
@@ -235,6 +243,7 @@ The :meth:`DataFrame.pipe` method allows you to apply a function or a series of
 DataFrame in a clean and readable way. This is especially useful for building data processing pipelines.
 
 .. ipython:: python
+
     # Sample DataFrame
     df = pd.DataFrame({ 'A': [1, 2, 3], 'B': [4, 5, 6] })
 
@@ -256,7 +265,7 @@ without nested calls, promoting a cleaner and more readable code style.
 Performance Considerations
 --------------------------
 
-While UDFs provide flexibility, their use is currently discouraged as they can introduce
+While user-defined functions provide flexibility, their use is currently discouraged as they can introduce
 performance issues, especially when written in pure Python. To improve efficiency,
 consider using built-in `NumPy` or `pandas` functions instead of user-defined functions
 for common operations.
@@ -270,9 +279,7 @@ Vectorized Operations
 
 Below is an example of vectorized operations in pandas:
 
-.. ipython:: python
-    # Vectorized operation:
-    df["new_col"] = 100 * (df["one"] / df["two"])
+.. code-block:: text
 
     # User-defined function
     def calc_ratio(row):
@@ -280,12 +287,19 @@ Below is an example of vectorized operations in pandas:
 
     df["new_col2"] = df.apply(calc_ratio, axis=1)
 
+    # Vectorized Operation
+    df["new_col"] = 100 * (df["one"] / df["two"])
+
 Measuring how long each operation takes:
 
-.. ipython:: python
+..
code-block:: text
+
     Vectorized: 0.0043 secs
     User-defined function: 5.6435 secs
 
-This happens because user-defined functions loop through each row and apply its function,
-while vectorized operations are applied to underlying `Numpy` arrays, skipping inefficient
-Python code.
\ No newline at end of file
+Vectorized operations in pandas are significantly faster than using :meth:`DataFrame.apply`
+with user-defined functions because they leverage highly optimized C functions
+via NumPy to process entire arrays at once. This approach avoids the overhead of looping
+through rows in Python and making separate function calls for each row, which is slow and
+inefficient. Additionally, NumPy arrays benefit from memory efficiency and CPU-level
+optimizations, making vectorized operations the preferred choice whenever possible.
\ No newline at end of file

From f322d9eeab02a4bd092d3adcff4934a866101cb7 Mon Sep 17 00:00:00 2001
From: arthurlw
Date: Fri, 28 Mar 2025 12:13:40 -0700
Subject: [PATCH 06/16] updated map method

---
 doc/source/user_guide/user_defined_functions.rst | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/doc/source/user_guide/user_defined_functions.rst b/doc/source/user_guide/user_defined_functions.rst
index 961769e7d3532..6007de1657645 100644
--- a/doc/source/user_guide/user_defined_functions.rst
+++ b/doc/source/user_guide/user_defined_functions.rst
@@ -223,18 +223,18 @@ or Dataframe. It is particularly useful for substituting values or transforming
 ..
ipython:: python
 
     # Sample DataFrame
-    s = pd.Series(['cat', 'dog', 'bird'])
+    df = pd.DataFrame({ 'A': ['cat', 'dog', 'bird'], 'B': ['pig', 'cow', 'lamb'] })
 
     # Using map with a user-defined function
     def animal_to_length(animal):
         return len(animal)
 
-    s_mapped = s.map(animal_to_length)
-    print(s_mapped)
+    df_mapped = df.map(animal_to_length)
+    print(df_mapped)
 
     # This works with lambda functions too
-    s_lambda = s.map(lambda x: x.upper())
-    print(s_lambda)
+    df_lambda = df.map(lambda x: x.upper())
+    print(df_lambda)
 
 :meth:`DataFrame.pipe`
 ----------------------

From b6b7b0245354ba1574a64ee19bbd4b9dcf5d9eff Mon Sep 17 00:00:00 2001
From: arthurlw
Date: Fri, 28 Mar 2025 12:25:35 -0700
Subject: [PATCH 07/16] precommit

---
 .../user_guide/user_defined_functions.rst | 32 +++++++++----------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/doc/source/user_guide/user_defined_functions.rst b/doc/source/user_guide/user_defined_functions.rst
index 6007de1657645..bd8514dceaf90 100644
--- a/doc/source/user_guide/user_defined_functions.rst
+++ b/doc/source/user_guide/user_defined_functions.rst
@@ -151,25 +151,25 @@ while preserving the original shape of the object.
 
 ..
ipython:: python
 
-    # Sample DataFrame
-    df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
+    # Sample DataFrame
+    df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
 
-    # User-Defined Function
-    def double(x):
-        return x * 2
+    # User-Defined Function
+    def double(x):
+        return x * 2
 
-    # Apply transform
-    df_transformed = df.transform(double)
-    print(df_transformed)
+    # Apply transform
+    df_transformed = df.transform(double)
+    print(df_transformed)
 
-    # This works with lambda functions too
-    df_lambda = df.transform(lambda x: x * 2)
-    print(df_lambda)
+    # This works with lambda functions too
+    df_lambda = df.transform(lambda x: x * 2)
+    print(df_lambda)
 
-Attempting to use common aggregation functions such as `mean` or `sum` will result in
+Attempting to use common aggregation functions such as ``mean`` or ``sum`` will result in
 values being broadcasted to the original dimensions:
 
-.. ipython:: python
+.. ipython:: python
 
     # Sample DataFrame
     df = pd.DataFrame({
@@ -197,7 +197,7 @@ match particular conditions.
     :meth:`DataFrame.filter` does not accept user-defined functions, but can accept
     list comprehensions that have user-defined functions applied to them.
 
-.. ipython:: python
+.. ipython:: python
 
     # Sample DataFrame
     df = pd.DataFrame({
@@ -267,7 +267,7 @@ Performance Considerations
 
 While user-defined functions provide flexibility, their use is currently discouraged as they can introduce
 performance issues, especially when written in pure Python. To improve efficiency,
-consider using built-in `NumPy` or `pandas` functions instead of user-defined functions
+consider using built-in ``NumPy`` or ``pandas`` functions instead of user-defined functions
 for common operations.
 
 .. note::
@@ -302,4 +302,4 @@ with user-defined functions because they leverage highly optimized C functions
 via NumPy to process entire arrays at once. This approach avoids the overhead of looping
 through rows in Python and making separate function calls for each row, which is slow and
 inefficient.
Additionally, NumPy arrays benefit from memory efficiency and CPU-level -optimizations, making vectorized operations the preferred choice whenever possible. \ No newline at end of file +optimizations, making vectorized operations the preferred choice whenever possible. From d20bcc735bc39d07a21018527ea3f000e1d17eb0 Mon Sep 17 00:00:00 2001 From: arthurlw Date: Fri, 28 Mar 2025 12:31:43 -0700 Subject: [PATCH 08/16] trim trailing whitespace --- .../user_guide/user_defined_functions.rst | 26 +++++++++---------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/doc/source/user_guide/user_defined_functions.rst b/doc/source/user_guide/user_defined_functions.rst index bd8514dceaf90..a976e62dac1c8 100644 --- a/doc/source/user_guide/user_defined_functions.rst +++ b/doc/source/user_guide/user_defined_functions.rst @@ -9,7 +9,7 @@ Introduction to User-Defined Functions In pandas, User-Defined Functions (UDFs) provide a way to extend the library’s functionality by allowing users to apply custom computations to their data. While pandas comes with a set of built-in functions for data manipulation, UDFs offer -flexibility when built-in methods are not sufficient. These functions can be +flexibility when built-in methods are not sufficient. These functions can be applied at different levels: element-wise, row-wise, column-wise, or group-wise, and change the data differently, depending on the method used. @@ -19,13 +19,13 @@ Why Use User-Defined Functions? Pandas is designed for high-performance data processing, but sometimes your specific needs go beyond standard aggregation, transformation, or filtering. User-defined functions allow you to: -* **Customize Computations**: Implement logic tailored to your dataset, such as complex +* **Customize Computations**: Implement logic tailored to your dataset, such as complex transformations, domain-specific calculations, or conditional modifications. 
* **Improve Code Readability**: Encapsulate logic into functions rather than writing long, complex expressions. * **Handle Complex Grouped Operations**: Perform operations on grouped data that standard methods do not support. -* **Extend pandas' Functionality**: Apply external libraries or advanced calculations that +* **Extend pandas' Functionality**: Apply external libraries or advanced calculations that are not natively available. @@ -58,14 +58,14 @@ The :meth:`DataFrame.apply` allows applying a user-defined functions along eithe .. ipython:: python import pandas as pd - + # Sample DataFrame df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}) - + # User-Defined Function def add_one(x): return x + 1 - + # Apply function df_applied = df.apply(add_one) print(df_applied) @@ -81,14 +81,14 @@ The :meth:`DataFrame.apply` allows applying a user-defined functions along eithe # Sample DataFrame df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3]}) - + # User-Defined Function def add_one(x): return x + 1 def add_two(x): return x + 2 - + # Apply function df_applied = df.apply({"A": add_one, "B": add_two}) print(df_applied) @@ -103,11 +103,11 @@ The :meth:`DataFrame.apply` allows applying a user-defined functions along eithe # Sample Series s = pd.Series([1, 2, 3]) - + # User-Defined Function def add_one(x): return x + 1 - + # Apply function s_applied = s.apply(add_one) print(s_applied) @@ -128,11 +128,11 @@ The :meth:`DataFrame.agg` allows aggregation with a user-defined function along 'Category': ['A', 'A', 'B', 'B'], 'Values': [10, 20, 30, 40] }) - + # Define a function for group operations def group_mean(group): return group.mean() - + # Apply UDF to each group grouped_result = df.groupby('Category')['Values'].agg(group_mean) print(grouped_result) @@ -149,7 +149,7 @@ transformations and custom row-wise or element-wise operations. The :meth:`DataFrame.transform` allows transforms a Dataframe, Series or Grouped object while preserving the original shape of the object. -.. 
ipython:: python +.. ipython:: python # Sample DataFrame df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}) From 72f7b62b048e3d48f370d37b2681d704723d47d1 Mon Sep 17 00:00:00 2001 From: arthurlw Date: Fri, 28 Mar 2025 13:42:56 -0700 Subject: [PATCH 09/16] toctree --- doc/source/user_guide/index.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/doc/source/user_guide/index.rst b/doc/source/user_guide/index.rst index f0d6a76f0de5b..b321559976bda 100644 --- a/doc/source/user_guide/index.rst +++ b/doc/source/user_guide/index.rst @@ -88,3 +88,4 @@ Guides sparse gotchas cookbook + user_defined_functions From 90a2d240cdc2ebb52f24566c813f2b1972cd1614 Mon Sep 17 00:00:00 2001 From: arthurlw Date: Sat, 29 Mar 2025 13:28:53 -0700 Subject: [PATCH 10/16] restructured udf user guide --- .../user_guide/user_defined_functions.rst | 255 ++++++------------ 1 file changed, 88 insertions(+), 167 deletions(-) diff --git a/doc/source/user_guide/user_defined_functions.rst b/doc/source/user_guide/user_defined_functions.rst index a976e62dac1c8..3fe95381d6d46 100644 --- a/doc/source/user_guide/user_defined_functions.rst +++ b/doc/source/user_guide/user_defined_functions.rst @@ -13,24 +13,28 @@ flexibility when built-in methods are not sufficient. These functions can be applied at different levels: element-wise, row-wise, column-wise, or group-wise, and change the data differently, depending on the method used. -Why Use User-Defined Functions? -------------------------------- +Why Not To Use User-Defined Functions +----------------------------------------- -Pandas is designed for high-performance data processing, but sometimes your specific -needs go beyond standard aggregation, transformation, or filtering. User-defined functions allow you to: +While UDFs provide flexibility, they come with significant drawbacks, primarily +related to performance. 
Unlike vectorized pandas operations, UDFs are slower because pandas lacks +insight into what they are computing, making it difficult to apply efficient handling or optimization +techniques. As a result, pandas resorts to less efficient processing methods that significantly +slow down computations. Additionally, relying on UDFs often sacrifices the benefits +of pandas’ built-in, optimized methods, limiting compatibility and overall performance. -* **Customize Computations**: Implement logic tailored to your dataset, such as complex - transformations, domain-specific calculations, or conditional modifications. -* **Improve Code Readability**: Encapsulate logic into functions rather than writing long, - complex expressions. -* **Handle Complex Grouped Operations**: Perform operations on grouped data that standard - methods do not support. -* **Extend pandas' Functionality**: Apply external libraries or advanced calculations that - are not natively available. +.. note:: + In general, most tasks can and should be accomplished using pandas’ built-in methods or vectorized operations. + +Despite their drawbacks, UDFs can be helpful when: +* **Custom Computations Are Needed**: Implementing complex logic or domain-specific calculations that pandas' + built-in methods cannot handle. +* **Extending pandas' Functionality**: Applying external libraries or specialized algorithms unavailable in pandas. +* **Handling Complex Grouped Operations**: Performing operations on grouped data that standard methods do not support. -What functions support User-Defined Functions -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Methods that support User-Defined Functions +------------------------------------------- User-Defined Functions can be applied across various pandas methods: @@ -47,124 +51,66 @@ User-Defined Functions can be applied across various pandas methods: Series in a clean, readable manner. 
All of these pandas methods can be used with both Series and DataFrame objects, providing versatile -ways to apply user-defined functions across different pandas data structures. - +ways to apply UDFs across different pandas data structures. + + +Choosing the Right Method +------------------------- +When applying UDFs in pandas, it is essential to select the appropriate method based +on your specific task. Each method has its strengths and is designed for different use +cases. Understanding the purpose and behavior of each method will help you make informed +decisions, ensuring more efficient and maintainable code. + +Below is a table overview of all methods that accept UDFs: + ++------------------+--------------------------------------+---------------------------+--------------------+---------------------------+------------------------------------------+ +| Method | Purpose | Supports UDFs | Keeps Shape | Performance | Recommended Use Case | ++==================+======================================+===========================+====================+===========================+==========================================+ +| :meth:`apply` | General-purpose function | Yes | Yes (when axis=1) | Slow | Custom row-wise or column-wise operations| ++------------------+--------------------------------------+---------------------------+--------------------+---------------------------+------------------------------------------+ +| :meth:`agg` | Aggregation | Yes | No | Fast (if using built-ins) | Custom aggregation logic | ++------------------+--------------------------------------+---------------------------+--------------------+---------------------------+------------------------------------------+ +| :meth:`transform`| Transform without reducing dimensions| Yes | Yes | Fast (if vectorized) | Broadcast Element-wise transformations | 
++------------------+--------------------------------------+---------------------------+--------------------+---------------------------+------------------------------------------+ +| :meth:`map` | Element-wise mapping | Yes | Yes | Moderate | Simple element-wise transformations | ++------------------+--------------------------------------+---------------------------+--------------------+---------------------------+------------------------------------------+ +| :meth:`pipe` | Functional chaining | Yes | Yes | Depends on function | Building clean pipelines | ++------------------+--------------------------------------+---------------------------+--------------------+---------------------------+------------------------------------------+ +| :meth:`filter` | Row/Column selection | Not directly | Yes | Fast | Subsetting based on conditions | ++------------------+--------------------------------------+---------------------------+--------------------+---------------------------+------------------------------------------+ :meth:`DataFrame.apply` ------------------------ - -The :meth:`DataFrame.apply` allows applying a user-defined functions along either axis (rows or columns): - -.. ipython:: python - - import pandas as pd +~~~~~~~~~~~~~~~~~~~~~~~ - # Sample DataFrame - df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}) - - # User-Defined Function - def add_one(x): - return x + 1 - - # Apply function - df_applied = df.apply(add_one) - print(df_applied) - - # This works with lambda functions too - df_lambda = df.apply(lambda x : x + 1) - print(df_lambda) +The :meth:`DataFrame.apply` allows you to apply UDFs along either rows or columns. While flexible, +it is slower than vectorized operations and should be used only when you need operations +that cannot be achieved with built-in pandas functions. 
+When to use: :meth:`DataFrame.apply` is suitable when no alternative vectorized method is available, but consider +optimizing performance with vectorized operations wherever possible. -:meth:`DataFrame.apply` also accepts dictionaries of multiple user-defined functions: - -.. ipython:: python - - # Sample DataFrame - df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3]}) - - # User-Defined Function - def add_one(x): - return x + 1 - - def add_two(x): - return x + 2 - - # Apply function - df_applied = df.apply({"A": add_one, "B": add_two}) - print(df_applied) - - # This works with lambda functions too - df_lambda = df.apply({"A": lambda x : x + 1, "B": lambda x : x + 2}) - print(df_lambda) - -:meth:`DataFrame.apply` works with Series objects as well: - -.. ipython:: python - - # Sample Series - s = pd.Series([1, 2, 3]) - - # User-Defined Function - def add_one(x): - return x + 1 - - # Apply function - s_applied = s.apply(add_one) - print(s_applied) - - # This works with lambda functions too - s_lambda = s.apply(lambda x : x + 1) - print(s_lambda) +Examples of usage can be found at :meth:`DataFrame.apply` :meth:`DataFrame.agg` ---------------------- - -The :meth:`DataFrame.agg` allows aggregation with a user-defined function along either axis (rows or columns): - -.. ipython:: python - - # Sample DataFrame - df = pd.DataFrame({ - 'Category': ['A', 'A', 'B', 'B'], - 'Values': [10, 20, 30, 40] - }) +~~~~~~~~~~~~~~~~~~~~~ - # Define a function for group operations - def group_mean(group): - return group.mean() +If you need to aggregate data, :meth:`DataFrame.agg` is a better choice than apply because it is +specifically designed for aggregation operations. - # Apply UDF to each group - grouped_result = df.groupby('Category')['Values'].agg(group_mean) - print(grouped_result) +When to use: Use :meth:`DataFrame.agg` for performing aggregations like sum, mean, or custom aggregation +functions across groups. 
-In terms of the API, :meth:`DataFrame.agg` has similar usage to :meth:`DataFrame.apply`, -but it is primarily used for **aggregation**, applying functions that summarize or reduce data. -Typically, the result of :meth:`DataFrame.agg` reduces the dimensions of data as shown -in the above example. Conversely, :meth:`DataFrame.apply` is more general and allows for both -transformations and custom row-wise or element-wise operations. +Examples of usage can be found at :meth:`DataFrame.agg ` :meth:`DataFrame.transform` ---------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The :meth:`DataFrame.transform` allows transforms a Dataframe, Series or Grouped object -while preserving the original shape of the object. +The transform method is ideal for performing element-wise transformations while preserving the shape of the original DataFrame. +It’s generally faster than apply because it can take advantage of pandas' internal optimizations. -.. ipython:: python - - # Sample DataFrame - df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}) - - # User-Defined Function - def double(x): - return x * 2 +When to use: When you need to perform element-wise transformations that retain the original structure of the DataFrame. - # Apply transform - df_transformed = df.transform(double) - print(df_transformed) - - # This works with lambda functions too - df_lambda = df.transform(lambda x: x * 2) - print(df_lambda) +Documentation: DataFrame.transform Attempting to use common aggregation functions such as ``mean`` or ``sum`` will result in values being broadcasted to the original dimensions: @@ -187,15 +133,17 @@ values being broadcasted to the original dimensions: print(df) :meth:`DataFrame.filter` ------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~ The :meth:`DataFrame.filter` method is used to select subsets of the DataFrame’s columns or row. It is useful when you want to extract specific columns or rows that match particular conditions. 
+When to use: Use :meth:`DataFrame.filter` when you want to use a UDF to create a subset of a DataFrame or Series + .. note:: - :meth:`DataFrame.filter` does not accept user-defined functions, but can accept - list comprehensions that have user-defined functions applied to them. + :meth:`DataFrame.filter` does not accept UDFs, but can accept + list comprehensions that have UDFs applied to them. .. ipython:: python @@ -207,72 +155,45 @@ match particular conditions. 'D': [10, 11, 12] }) + # Define a function that filters out columns where the name is longer than 1 character def is_long_name(column_name): return len(column_name) > 1 - # Define a function that filters out columns where the name is longer than 1 character df_filtered = df[[col for col in df.columns if is_long_name(col)]] print(df_filtered) :meth:`DataFrame.map` ---------------------- - -The :meth:`DataFrame.map` method is used to apply a function element-wise to a pandas Series -or Dataframe. It is particularly useful for substituting values or transforming data. - -.. ipython:: python - - # Sample DataFrame - df = pd.DataFrame({ 'A': ['cat', 'dog', 'bird'], 'B': ['pig', 'cow', 'lamb'] }) +~~~~~~~~~~~~~~~~~~~~~ - # Using map with a user-defined function - def animal_to_length(animal): - return len(animal) +:meth:`DataFrame.map` is used specifically to apply element-wise UDFs and is better +for this purpose compared to :meth:`DataFrame.apply` because of its better performance. - df_mapped = df.map(animal_to_length) - print(df_mapped) +When to use: Use map for applying element-wise UDFs to DataFrames or Series. - # This works with lambda functions too - df_lambda = df.map(lambda x: x.upper()) - print(df_lambda) +Documentation: DataFrame.map :meth:`DataFrame.pipe` ----------------------- - -The :meth:`DataFrame.pipe` method allows you to apply a function or a series of functions to a -DataFrame in a clean and readable way. This is especially useful for building data processing pipelines. - -.. 
ipython:: python - - # Sample DataFrame - df = pd.DataFrame({ 'A': [1, 2, 3], 'B': [4, 5, 6] }) - - # User-defined functions for transformation - def add_one(df): - return df + 1 +~~~~~~~~~~~~~~~~~~~~~~ - def square(df): - return df ** 2 +The pipe method is useful for chaining operations together into a clean and readable pipeline. +It is a helpful tool for organizing complex data processing workflows. - # Applying functions using pipe - df_piped = df.pipe(add_one).pipe(square) - print(df_piped) +When to use: Use pipe when you need to create a pipeline of transformations and want to keep the code readable and maintainable. -The advantage of using :meth:`DataFrame.pipe` is that it allows you to chain together functions -without nested calls, promoting a cleaner and more readable code style. +Documentation: DataFrame.pipe -Performance Considerations --------------------------- +Best Practices +-------------- -While user-defined functions provide flexibility, their use is currently discouraged as they can introduce +While UDFs provide flexibility, their use is currently discouraged as they can introduce performance issues, especially when written in pure Python. To improve efficiency, -consider using built-in ``NumPy`` or ``pandas`` functions instead of user-defined functions +consider using built-in ``NumPy`` or ``pandas`` functions instead of UDFs for common operations. .. note:: If performance is critical, explore **vectorizated operations** before resorting - to user-defined functions. + to UDFs. 
Vectorized Operations ~~~~~~~~~~~~~~~~~~~~~ @@ -285,10 +206,10 @@ Below is an example of vectorized operations in pandas: def calc_ratio(row): return 100 * (row["one"] / row["two"]) - df["new_col2"] = df.apply(calc_ratio, axis=1) + df["new_col"] = df.apply(calc_ratio, axis=1) # Vectorized Operation - df["new_col"] = 100 * (df["one"] / df["two"]) + df["new_col2"] = 100 * (df["one"] / df["two"]) Measuring how long each operation takes: @@ -298,8 +219,8 @@ Measuring how long each operation takes: User-defined function: 5.6435 secs Vectorized operations in pandas are significantly faster than using :meth:`DataFrame.apply` -with user-defined functions because they leverage highly optimized C functions +with UDFs because they leverage highly optimized C functions via NumPy to process entire arrays at once. This approach avoids the overhead of looping through rows in Python and making separate function calls for each row, which is slow and inefficient. Additionally, NumPy arrays benefit from memory efficiency and CPU-level -optimizations, making vectorized operations the preferred choice whenever possible. +optimizations, making vectorized operations the preferred choice whenever possible. 
\ No newline at end of file From 0d02d6400cf13694dc7c0a3fdd272243119eaf14 Mon Sep 17 00:00:00 2001 From: arthurlw Date: Mon, 31 Mar 2025 19:34:04 -0700 Subject: [PATCH 11/16] updated documentation links --- .../user_guide/user_defined_functions.rst | 36 +++++++++++-------- 1 file changed, 21 insertions(+), 15 deletions(-) diff --git a/doc/source/user_guide/user_defined_functions.rst b/doc/source/user_guide/user_defined_functions.rst index 3fe95381d6d46..5aec645c704ad 100644 --- a/doc/source/user_guide/user_defined_functions.rst +++ b/doc/source/user_guide/user_defined_functions.rst @@ -38,21 +38,24 @@ Methods that support User-Defined Functions User-Defined Functions can be applied across various pandas methods: -* :meth:`DataFrame.apply` - A flexible method that allows applying a function to Series, +* :meth:`~DataFrame.apply` - A flexible method that allows applying a function to Series, DataFrames, or groups of data. -* :meth:`DataFrame.agg` (Aggregate) - Used for summarizing data, supporting multiple +* :meth:`~DataFrame.agg` (Aggregate) - Used for summarizing data, supporting multiple aggregation functions. -* :meth:`DataFrame.transform` - Applies a function to groups while preserving the shape of +* :meth:`~DataFrame.transform` - Applies a function to groups while preserving the shape of the original data. -* :meth:`DataFrame.filter` - Filters groups based on a list of Boolean conditions. -* :meth:`DataFrame.map` - Applies an element-wise function to a Series, useful for +* :meth:`~DataFrame.filter` - Filters groups based on a list of Boolean conditions. +* :meth:`~DataFrame.map` - Applies an element-wise function to a Series, useful for transforming individual values. -* :meth:`DataFrame.pipe` - Allows chaining custom functions to process entire DataFrames or +* :meth:`~DataFrame.pipe` - Allows chaining custom functions to process entire DataFrames or Series in a clean, readable manner. 
All of these pandas methods can be used with both Series and DataFrame objects, providing versatile ways to apply UDFs across different pandas data structures. +.. note:: + Some of these methods are can also be applied to Groupby Objects. Refer to :ref:`groupby`. + Choosing the Right Method ------------------------- @@ -70,7 +73,7 @@ Below is a table overview of all methods that accept UDFs: +------------------+--------------------------------------+---------------------------+--------------------+---------------------------+------------------------------------------+ | :meth:`agg` | Aggregation | Yes | No | Fast (if using built-ins) | Custom aggregation logic | +------------------+--------------------------------------+---------------------------+--------------------+---------------------------+------------------------------------------+ -| :meth:`transform`| Transform without reducing dimensions| Yes | Yes | Fast (if vectorized) | Broadcast Element-wise transformations | +| :meth:`transform`| Transform without reducing dimensions| Yes | Yes | Fast (if vectorized) | Broadcast element-wise transformations | +------------------+--------------------------------------+---------------------------+--------------------+---------------------------+------------------------------------------+ | :meth:`map` | Element-wise mapping | Yes | Yes | Moderate | Simple element-wise transformations | +------------------+--------------------------------------+---------------------------+--------------------+---------------------------+------------------------------------------+ @@ -89,7 +92,7 @@ that cannot be achieved with built-in pandas functions. When to use: :meth:`DataFrame.apply` is suitable when no alternative vectorized method is available, but consider optimizing performance with vectorized operations wherever possible. -Examples of usage can be found at :meth:`DataFrame.apply` +Examples of usage can be found :ref:`here`. 
:meth:`DataFrame.agg` ~~~~~~~~~~~~~~~~~~~~~ @@ -100,7 +103,7 @@ specifically designed for aggregation operations. When to use: Use :meth:`DataFrame.agg` for performing aggregations like sum, mean, or custom aggregation functions across groups. -Examples of usage can be found at :meth:`DataFrame.agg ` +Examples of usage can be found :ref:`here`. :meth:`DataFrame.transform` ~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -110,7 +113,7 @@ It’s generally faster than apply because it can take advantage of pandas' inte When to use: When you need to perform element-wise transformations that retain the original structure of the DataFrame. -Documentation: DataFrame.transform +Documentation can be found :ref:`here`. Attempting to use common aggregation functions such as ``mean`` or ``sum`` will result in values being broadcasted to the original dimensions: @@ -162,6 +165,9 @@ When to use: Use :meth:`DataFrame.filter` when you want to use a UDF to create a df_filtered = df[[col for col in df.columns if is_long_name(col)]] print(df_filtered) +Since filter does not direclty accept a UDF, you have to apply the UDF indirectly, +such as by using list comprehensions. + :meth:`DataFrame.map` ~~~~~~~~~~~~~~~~~~~~~ @@ -170,7 +176,7 @@ for this purpose compared to :meth:`DataFrame.apply` because of its better perfo When to use: Use map for applying element-wise UDFs to DataFrames or Series. -Documentation: DataFrame.map +Documentation can be found :ref:`here`. :meth:`DataFrame.pipe` ~~~~~~~~~~~~~~~~~~~~~~ @@ -180,7 +186,7 @@ It is a helpful tool for organizing complex data processing workflows. When to use: Use pipe when you need to create a pipeline of transformations and want to keep the code readable and maintainable. -Documentation: DataFrame.pipe +Documentation can be found :ref:`here`. Best Practices @@ -198,9 +204,9 @@ for common operations. 
Vectorized Operations ~~~~~~~~~~~~~~~~~~~~~ -Below is an example of vectorized operations in pandas: +Below is a comparison of using UDFs versus using Vectorized Operations: -.. code-block:: text +.. code-block:: python # User-defined function def calc_ratio(row): @@ -215,8 +221,8 @@ Measuring how long each operation takes: .. code-block:: text - Vectorized: 0.0043 secs User-defined function: 5.6435 secs + Vectorized: 0.0043 secs Vectorized operations in pandas are significantly faster than using :meth:`DataFrame.apply` with UDFs because they leverage highly optimized C functions From 214f0ac31de3aa89333405dd010cc37cb7426b64 Mon Sep 17 00:00:00 2001 From: arthurlw Date: Mon, 31 Mar 2025 19:39:40 -0700 Subject: [PATCH 12/16] precommit --- doc/source/user_guide/user_defined_functions.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/source/user_guide/user_defined_functions.rst b/doc/source/user_guide/user_defined_functions.rst index 5aec645c704ad..8dc294edc103a 100644 --- a/doc/source/user_guide/user_defined_functions.rst +++ b/doc/source/user_guide/user_defined_functions.rst @@ -165,7 +165,7 @@ When to use: Use :meth:`DataFrame.filter` when you want to use a UDF to create a df_filtered = df[[col for col in df.columns if is_long_name(col)]] print(df_filtered) -Since filter does not direclty accept a UDF, you have to apply the UDF indirectly, +Since filter does not directly accept a UDF, you have to apply the UDF indirectly, such as by using list comprehensions. :meth:`DataFrame.map` @@ -229,4 +229,4 @@ with UDFs because they leverage highly optimized C functions via NumPy to process entire arrays at once. This approach avoids the overhead of looping through rows in Python and making separate function calls for each row, which is slow and inefficient. Additionally, NumPy arrays benefit from memory efficiency and CPU-level -optimizations, making vectorized operations the preferred choice whenever possible. 
\ No newline at end of file +optimizations, making vectorized operations the preferred choice whenever possible. From fffaad0470d9d433875031a84b4306313abfb3bc Mon Sep 17 00:00:00 2001 From: arthurlw Date: Mon, 31 Mar 2025 21:11:02 -0700 Subject: [PATCH 13/16] fix links --- doc/source/user_guide/user_defined_functions.rst | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/doc/source/user_guide/user_defined_functions.rst b/doc/source/user_guide/user_defined_functions.rst index 8dc294edc103a..01f573d1b9faa 100644 --- a/doc/source/user_guide/user_defined_functions.rst +++ b/doc/source/user_guide/user_defined_functions.rst @@ -92,7 +92,7 @@ that cannot be achieved with built-in pandas functions. When to use: :meth:`DataFrame.apply` is suitable when no alternative vectorized method is available, but consider optimizing performance with vectorized operations wherever possible. -Examples of usage can be found :ref:`here`. +Examples of usage can be found :ref:`here `. :meth:`DataFrame.agg` ~~~~~~~~~~~~~~~~~~~~~ @@ -103,7 +103,7 @@ specifically designed for aggregation operations. When to use: Use :meth:`DataFrame.agg` for performing aggregations like sum, mean, or custom aggregation functions across groups. -Examples of usage can be found :ref:`here`. +Examples of usage can be found :ref:`here `. :meth:`DataFrame.transform` ~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -113,7 +113,7 @@ It’s generally faster than apply because it can take advantage of pandas' inte When to use: When you need to perform element-wise transformations that retain the original structure of the DataFrame. -Documentation can be found :ref:`here`. +Documentation can be found :ref:`here `. 
Attempting to use common aggregation functions such as ``mean`` or ``sum`` will result in values being broadcasted to the original dimensions: @@ -176,7 +176,7 @@ for this purpose compared to :meth:`DataFrame.apply` because of its better perfo When to use: Use map for applying element-wise UDFs to DataFrames or Series. -Documentation can be found :ref:`here`. +Documentation can be found :ref:`here `. :meth:`DataFrame.pipe` ~~~~~~~~~~~~~~~~~~~~~~ @@ -186,7 +186,7 @@ It is a helpful tool for organizing complex data processing workflows. When to use: Use pipe when you need to create a pipeline of transformations and want to keep the code readable and maintainable. -Documentation can be found :ref:`here`. +Documentation can be found :ref:`here `. Best Practices From 561a1f517ccda0797101ecbe857dedcfe4b2e284 Mon Sep 17 00:00:00 2001 From: arthurlw Date: Mon, 31 Mar 2025 21:31:49 -0700 Subject: [PATCH 14/16] change links --- doc/source/user_guide/user_defined_functions.rst | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/doc/source/user_guide/user_defined_functions.rst b/doc/source/user_guide/user_defined_functions.rst index 01f573d1b9faa..449232611ce43 100644 --- a/doc/source/user_guide/user_defined_functions.rst +++ b/doc/source/user_guide/user_defined_functions.rst @@ -92,7 +92,7 @@ that cannot be achieved with built-in pandas functions. When to use: :meth:`DataFrame.apply` is suitable when no alternative vectorized method is available, but consider optimizing performance with vectorized operations wherever possible. -Examples of usage can be found :ref:`here `. +Examples of usage can be found :meth:`~DataFrame.apply`. :meth:`DataFrame.agg` ~~~~~~~~~~~~~~~~~~~~~ @@ -103,7 +103,7 @@ specifically designed for aggregation operations. When to use: Use :meth:`DataFrame.agg` for performing aggregations like sum, mean, or custom aggregation functions across groups. -Examples of usage can be found :ref:`here `. 
+Examples of usage can be found :meth:`~DataFrame.agg`. :meth:`DataFrame.transform` ~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -113,7 +113,7 @@ It’s generally faster than apply because it can take advantage of pandas' inte When to use: When you need to perform element-wise transformations that retain the original structure of the DataFrame. -Documentation can be found :ref:`here `. +Documentation can be found :meth:`~DataFrame.transform`. Attempting to use common aggregation functions such as ``mean`` or ``sum`` will result in values being broadcasted to the original dimensions: @@ -168,6 +168,8 @@ When to use: Use :meth:`DataFrame.filter` when you want to use a UDF to create a Since filter does not directly accept a UDF, you have to apply the UDF indirectly, such as by using list comprehensions. +Documentation can be found :meth:`~DataFrame.filter`. + :meth:`DataFrame.map` ~~~~~~~~~~~~~~~~~~~~~ @@ -176,7 +178,7 @@ for this purpose compared to :meth:`DataFrame.apply` because of its better perfo When to use: Use map for applying element-wise UDFs to DataFrames or Series. -Documentation can be found :ref:`here `. +Documentation can be found :meth:`~DataFrame.map`. :meth:`DataFrame.pipe` ~~~~~~~~~~~~~~~~~~~~~~ @@ -186,7 +188,7 @@ It is a helpful tool for organizing complex data processing workflows. When to use: Use pipe when you need to create a pipeline of transformations and want to keep the code readable and maintainable. -Documentation can be found :ref:`here `. +Documentation can be found :meth:`~DataFrame.pipe`. 
 Best Practices

From c6891a01707354a78ed30cf1b90c26abab3dfdb4 Mon Sep 17 00:00:00 2001
From: arthurlw
Date: Mon, 7 Apr 2025 16:11:58 -0700
Subject: [PATCH 15/16] updated user guide

---
 .../user_guide/user_defined_functions.rst     | 107 ++++++++++++------
 1 file changed, 74 insertions(+), 33 deletions(-)

diff --git a/doc/source/user_guide/user_defined_functions.rst b/doc/source/user_guide/user_defined_functions.rst
index 449232611ce43..7fd1615f697af 100644
--- a/doc/source/user_guide/user_defined_functions.rst
+++ b/doc/source/user_guide/user_defined_functions.rst
@@ -17,11 +17,10 @@ Why Not To Use User-Defined Functions
 -----------------------------------------

 While UDFs provide flexibility, they come with significant drawbacks, primarily
-related to performance. Unlike vectorized pandas operations, UDFs are slower because pandas lacks
-insight into what they are computing, making it difficult to apply efficient handling or optimization
-techniques. As a result, pandas resorts to less efficient processing methods that significantly
-slow down computations. Additionally, relying on UDFs often sacrifices the benefits
-of pandas’ built-in, optimized methods, limiting compatibility and overall performance.
+related to performance and behavior. When using UDFs, pandas must perform inference
+on the result, and that inference could be incorrect. Furthermore, unlike vectorized operations,
+UDFs are slower because pandas can't optimize their computations, leading to
+inefficient processing.

 .. note::
     In general, most tasks can and should be accomplished using pandas’ built-in methods or vectorized operations.
@@ -33,6 +32,29 @@ Despite their drawbacks, UDFs can be helpful when:
 * **Extending pandas' Functionality**: Applying external libraries or specialized algorithms unavailable in pandas.
 * **Handling Complex Grouped Operations**: Performing operations on grouped data that standard methods do not support.

+For example:
+
+.. code-block:: python
+
+    from sklearn.linear_model import LinearRegression
+
+    # Sample data
+    df = pd.DataFrame({
+        'group': ['A', 'A', 'A', 'B', 'B', 'B'],
+        'x': [1, 2, 3, 1, 2, 3],
+        'y': [2, 4, 6, 1, 2, 1.5]
+    })
+
+    # Function to fit a model to each group
+    def fit_model(group):
+        model = LinearRegression()
+        model.fit(group[['x']], group['y'])
+        group['y_pred'] = model.predict(group[['x']])
+        return group
+
+    result = df.groupby('group').apply(fit_model)
+
+

 Methods that support User-Defined Functions
 -------------------------------------------
@@ -56,6 +78,10 @@ ways to apply UDFs across different pandas data structures.
 .. note::
     Some of these methods are can also be applied to Groupby Objects. Refer to :ref:`groupby`.

+Additionally, operations such as :ref:`resample()`, :ref:`rolling()`,
+:ref:`expanding()`, and :ref:`ewm()` also support UDFs for performing custom
+computations over temporal or statistical windows.
+

 Choosing the Right Method
 -------------------------
@@ -66,21 +92,21 @@ decisions, ensuring more efficient and maintainable code.
 Below is a table overview of all methods that accept UDFs:

-+------------------+--------------------------------------+---------------------------+--------------------+---------------------------+------------------------------------------+
-| Method           | Purpose                              | Supports UDFs             | Keeps Shape        | Performance               | Recommended Use Case                     |
-+==================+======================================+===========================+====================+===========================+==========================================+
-| :meth:`apply`    | General-purpose function             | Yes                       | Yes (when axis=1)  | Slow                      | Custom row-wise or column-wise operations|
-+------------------+--------------------------------------+---------------------------+--------------------+---------------------------+------------------------------------------+
-| :meth:`agg`      | Aggregation                          | Yes                       | No                 | Fast (if using built-ins) | Custom aggregation logic                 |
-+------------------+--------------------------------------+---------------------------+--------------------+---------------------------+------------------------------------------+
-| :meth:`transform`| Transform without reducing dimensions| Yes                       | Yes                | Fast (if vectorized)      | Broadcast element-wise transformations   |
-+------------------+--------------------------------------+---------------------------+--------------------+---------------------------+------------------------------------------+
-| :meth:`map`      | Element-wise mapping                 | Yes                       | Yes                | Moderate                  | Simple element-wise transformations      |
-+------------------+--------------------------------------+---------------------------+--------------------+---------------------------+------------------------------------------+
-| :meth:`pipe`     | Functional chaining                  | Yes                       | Yes                | Depends on function       | Building clean pipelines                 |
-+------------------+--------------------------------------+---------------------------+--------------------+---------------------------+------------------------------------------+
-| :meth:`filter`   | Row/Column selection                 | Not directly              | Yes                | Fast                      | Subsetting based on conditions           |
-+------------------+--------------------------------------+---------------------------+--------------------+---------------------------+------------------------------------------+
++------------------+--------------------------------------+---------------------------+--------------------+------------------------------------------+
+| Method           | Purpose                              | Supports UDFs             | Keeps Shape        | Recommended Use Case                     |
++==================+======================================+===========================+====================+==========================================+
+| :meth:`apply`    | General-purpose function             | Yes                       | Yes (when axis=1)  | Custom row-wise or column-wise operations|
++------------------+--------------------------------------+---------------------------+--------------------+------------------------------------------+
+| :meth:`agg`      | Aggregation                          | Yes                       | No                 | Custom aggregation logic                 |
++------------------+--------------------------------------+---------------------------+--------------------+------------------------------------------+
+| :meth:`transform`| Transform without reducing dimensions| Yes                       | Yes                | Broadcast element-wise transformations   |
++------------------+--------------------------------------+---------------------------+--------------------+------------------------------------------+
+| :meth:`map`      | Element-wise mapping                 | Yes                       | Yes                | Simple element-wise transformations      |
++------------------+--------------------------------------+---------------------------+--------------------+------------------------------------------+
+| :meth:`pipe`     | Functional chaining                  | Yes                       | Yes                | Building clean operation pipelines       |
++------------------+--------------------------------------+---------------------------+--------------------+------------------------------------------+
+| :meth:`filter`   | Row/Column selection                 | Not directly              | Yes                | Subsetting based on conditions           |
++------------------+--------------------------------------+---------------------------+--------------------+------------------------------------------+

 :meth:`DataFrame.apply`
 ~~~~~~~~~~~~~~~~~~~~~~~
@@ -89,10 +115,10 @@ The :meth:`DataFrame.apply` allows you to apply UDFs along either rows or column
 it is slower than vectorized operations and should be used only when you need operations
 that cannot be achieved with built-in pandas functions.

-When to use: :meth:`DataFrame.apply` is suitable when no alternative vectorized method is available, but consider
-optimizing performance with vectorized operations wherever possible.
+When to use: :meth:`DataFrame.apply` is suitable when no alternative vectorized method or UDF method is available,
+but consider optimizing performance with vectorized operations wherever possible.

-Examples of usage can be found :meth:`~DataFrame.apply`.
+Documentation can be found at :meth:`~DataFrame.apply`.

 :meth:`DataFrame.agg`
 ~~~~~~~~~~~~~~~~~~~~~
@@ -103,7 +129,7 @@ specifically designed for aggregation operations.
 When to use: Use :meth:`DataFrame.agg` for performing aggregations like sum, mean, or custom aggregation
 functions across groups.

-Examples of usage can be found :meth:`~DataFrame.agg`.
+Documentation can be found at :meth:`~DataFrame.agg`.

 :meth:`DataFrame.transform`
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~

 The transform method is ideal for performing element-wise transformations while preserving the shape of the original DataFrame.
-It’s generally faster than apply because it can take advantage of pandas' internal optimizations.
+It is generally faster than apply because it can take advantage of pandas' internal optimizations.

 When to use: When you need to perform element-wise transformations that retain the original structure of the DataFrame.

-Documentation can be found :meth:`~DataFrame.transform`.
+Documentation can be found at :meth:`~DataFrame.transform`.
 Attempting to use common aggregation functions such as ``mean`` or ``sum`` will result in
 values being broadcasted to the original dimensions:
@@ -158,17 +184,17 @@ When to use: Use :meth:`DataFrame.filter` when you want to use a UDF to create a
         'D': [10, 11, 12]
     })

-    # Define a function that filters out columns where the name is longer than 1 character
+    # Function that filters out columns where the name is longer than 1 character
     def is_long_name(column_name):
         return len(column_name) > 1

-    df_filtered = df[[col for col in df.columns if is_long_name(col)]]
+    df_filtered = df.filter(items=[col for col in df.columns if is_long_name(col)])
     print(df_filtered)

 Since filter does not directly accept a UDF, you have to apply the UDF indirectly,
-such as by using list comprehensions.
+for example, by using list comprehensions.

-Documentation can be found :meth:`~DataFrame.filter`.
+Documentation can be found at :meth:`~DataFrame.filter`.

 :meth:`DataFrame.map`
 ~~~~~~~~~~~~~~~~~~~~~
@@ -178,7 +204,7 @@ for this purpose compared to :meth:`DataFrame.apply` because of its better perfo

 When to use: Use map for applying element-wise UDFs to DataFrames or Series.

-Documentation can be found :meth:`~DataFrame.map`.
+Documentation can be found at :meth:`~DataFrame.map`.

 :meth:`DataFrame.pipe`
 ~~~~~~~~~~~~~~~~~~~~~~
@@ -186,9 +212,9 @@ Documentation can be found :meth:`~DataFrame.map`.
 The pipe method is useful for chaining operations together into a clean and readable pipeline.
 It is a helpful tool for organizing complex data processing workflows.

-When to use: Use pipe when you need to create a pipeline of transformations and want to keep the code readable and maintainable.
+When to use: Use pipe when you need to create a pipeline of operations and want to keep the code readable and maintainable.

-Documentation can be found :meth:`~DataFrame.pipe`.
+Documentation can be found at :meth:`~DataFrame.pipe`.


 Best Practices
@@ -232,3 +258,18 @@ via NumPy to process entire arrays at once.
 This approach avoids the overhead of through rows in Python and making separate function calls for each row,
 which is slow and inefficient. Additionally, NumPy arrays benefit from memory efficiency and CPU-level
 optimizations, making vectorized operations the preferred choice whenever possible.
+
+
+Improving Performance with UDFs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In scenarios where UDFs are necessary, there are still ways to mitigate their performance drawbacks.
+One approach is to use **Numba**, a Just-In-Time (JIT) compiler that can significantly speed up numerical
+Python code by compiling Python functions to optimized machine code at runtime.
+
+By annotating your UDFs with ``@numba.jit``, you can achieve performance closer to vectorized operations,
+especially for computationally heavy tasks.
+
+.. note::
+    You may also refer to the user guide on `Enhancing performance `_
+    for a more detailed guide to using **Numba**.

From f56ec28d6504ca0c04d7d29dd1f4669a63b0cbd3 Mon Sep 17 00:00:00 2001
From: arthurlw
Date: Sun, 13 Apr 2025 12:51:12 -0700
Subject: [PATCH 16/16] updated udf user guide based on reviews

---
 doc/source/user_guide/index.rst               |  2 +-
 .../user_guide/user_defined_functions.rst     | 94 ++++++++++++-------
 2 files changed, 61 insertions(+), 35 deletions(-)

diff --git a/doc/source/user_guide/index.rst b/doc/source/user_guide/index.rst
index b321559976bda..230b2b86b2ffd 100644
--- a/doc/source/user_guide/index.rst
+++ b/doc/source/user_guide/index.rst
@@ -78,6 +78,7 @@ Guides
     boolean
     visualization
     style
+    user_defined_functions
     groupby
     window
     timeseries
@@ -88,4 +89,3 @@ Guides
     sparse
     gotchas
     cookbook
-    user_defined_functions
diff --git a/doc/source/user_guide/user_defined_functions.rst b/doc/source/user_guide/user_defined_functions.rst
index 7fd1615f697af..0db38184585ba 100644
--- a/doc/source/user_guide/user_defined_functions.rst
+++ b/doc/source/user_guide/user_defined_functions.rst
@@ -2,19 +2,46 @@

 {{ header }}

-**************************************
-Introduction to User-Defined Functions
-**************************************
+*****************************
+User-Defined Functions (UDFs)
+*****************************

 In pandas, User-Defined Functions (UDFs) provide a way to extend the library’s
 functionality by allowing users to apply custom computations to their data. While
 pandas comes with a set of built-in functions for data manipulation, UDFs offer
 flexibility when built-in methods are not sufficient. These functions can be
 applied at different levels: element-wise, row-wise, column-wise, or group-wise,
-and change the data differently, depending on the method used.
+and behave differently, depending on the method used.
+
+Here’s a simple example to illustrate a UDF applied to a Series:
+
+.. ipython:: python
+
+    s = pd.Series([1, 2, 3])
+
+    # Simple UDF that adds 1 to a value
+    def add_one(x):
+        return x + 1
+
+    # Apply the function element-wise using .map
+    s.map(add_one)
+
+You can also apply UDFs to an entire DataFrame. For example:
+
+.. ipython:: python
+
+    df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})
+
+    # UDF that takes a row and returns the sum of columns A and B
+    def sum_row(row):
+        return row["A"] + row["B"]
+
+    # Apply the function row-wise (axis=1 means apply across columns per row)
+    df.apply(sum_row, axis=1)
+

 Why Not To Use User-Defined Functions
------------------------------------------
+-------------------------------------

 While UDFs provide flexibility, they come with significant drawbacks, primarily
 related to performance and behavior. When using UDFs, pandas must perform inference
@@ -60,27 +87,25 @@ Methods that support User-Defined Functions

 User-Defined Functions can be applied across various pandas methods:

-* :meth:`~DataFrame.apply` - A flexible method that allows applying a function to Series,
-  DataFrames, or groups of data.
-* :meth:`~DataFrame.agg` (Aggregate) - Used for summarizing data, supporting multiple
+* :meth:`~DataFrame.apply` - A flexible method that allows applying a function to Series and
+  DataFrames.
+* :meth:`~DataFrame.agg` (Aggregate) - Used for summarizing data, supporting custom
   aggregation functions.
-* :meth:`~DataFrame.transform` - Applies a function to groups while preserving the shape of
+* :meth:`~DataFrame.transform` - Applies a function to Series and Dataframes while preserving the shape of
   the original data.
-* :meth:`~DataFrame.filter` - Filters groups based on a list of Boolean conditions.
-* :meth:`~DataFrame.map` - Applies an element-wise function to a Series, useful for
+* :meth:`~DataFrame.filter` - Filters Series and Dataframes based on a list of Boolean conditions.
+* :meth:`~DataFrame.map` - Applies an element-wise function to a Series or Dataframe, useful for
   transforming individual values.
-* :meth:`~DataFrame.pipe` - Allows chaining custom functions to process entire DataFrames or
-  Series in a clean, readable manner.
+* :meth:`~DataFrame.pipe` - Allows chaining custom functions to process Series or
+  Dataframes in a clean, readable manner.

 All of these pandas methods can be used with both Series and DataFrame objects, providing versatile
 ways to apply UDFs across different pandas data structures.

 .. note::
-    Some of these methods are can also be applied to Groupby Objects. Refer to :ref:`groupby`.
-
-Additionally, operations such as :ref:`resample()`, :ref:`rolling()`,
-:ref:`expanding()`, and :ref:`ewm()` also support UDFs for performing custom
-computations over temporal or statistical windows.
+    Some of these methods are can also be applied to groupby, resample, and various window objects.
+    See :ref:`groupby`, :ref:`resample()`, :ref:`rolling()`, :ref:`expanding()`,
+    and :ref:`ewm()` for details.


 Choosing the Right Method
@@ -126,8 +151,8 @@ Documentation can be found at :meth:`~DataFrame.apply`.
 If you need to aggregate data, :meth:`DataFrame.agg` is a better choice than apply because it is
 specifically designed for aggregation operations.

-When to use: Use :meth:`DataFrame.agg` for performing aggregations like sum, mean, or custom aggregation
-functions across groups.
+When to use: Use :meth:`DataFrame.agg` for performing custom aggregations, where the operation returns
+a scalar value on each input.

 Documentation can be found at :meth:`~DataFrame.agg`.

@@ -141,25 +166,26 @@ When to use: When you need to perform element-wise transformations that retain t

 Documentation can be found at :meth:`~DataFrame.transform`.

-Attempting to use common aggregation functions such as ``mean`` or ``sum`` will result in
-values being broadcasted to the original dimensions:
+.. code-block:: python

-.. ipython:: python
+    from sklearn.linear_model import LinearRegression

-    # Sample DataFrame
     df = pd.DataFrame({
-        'Category': ['A', 'A', 'B', 'B', 'B'],
-        'Values': [10, 20, 30, 40, 50]
-    })
-
-    # Using transform with mean
-    df['Mean_Transformed'] = df.groupby('Category')['Values'].transform('mean')
+        'group': ['A', 'A', 'A', 'B', 'B', 'B'],
+        'x': [1, 2, 3, 1, 2, 3],
+        'y': [2, 4, 6, 1, 2, 1.5]
+    }).set_index("x")

-    # Using transform with sum
-    df['Sum_Transformed'] = df.groupby('Category')['Values'].transform('sum')
+    # Function to fit a model to each group
+    def fit_model(group):
+        x = group.index.to_frame()
+        y = group
+        model = LinearRegression()
+        model.fit(x, y)
+        pred = model.predict(x)
+        return pred

-    # Result broadcasted to DataFrame
-    print(df)
+    result = df.groupby('group').transform(fit_model)

 :meth:`DataFrame.filter`
 ~~~~~~~~~~~~~~~~~~~~~~~~