From 3b54a73e0bd34fbbef24c2d38b56c570fe659663 Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Sat, 23 Mar 2024 19:40:16 -0500
Subject: [PATCH 1/2] Add pdep content

---
 web/pandas/pdeps/0015-ice-cream-agreement.md | 212 +++++++++++++++++++
 1 file changed, 212 insertions(+)
 create mode 100644 web/pandas/pdeps/0015-ice-cream-agreement.md

diff --git a/web/pandas/pdeps/0015-ice-cream-agreement.md b/web/pandas/pdeps/0015-ice-cream-agreement.md
new file mode 100644
index 0000000000000..ba2eb1e4507f7
--- /dev/null
+++ b/web/pandas/pdeps/0015-ice-cream-agreement.md
@@ -0,0 +1,212 @@
+# PDEP-15: Ice Cream Agreement
+
+- Created: March 2024
+- Status: Under discussion
+- Discussion: [#32265](https://github.com/pandas-dev/pandas/issues/32265)
+- Author: [Patrick Hoefler](https://github.com/phofl)
+  [Joris Van den Bossche](https://github.com/jorisvandenbossche)
+- Revision: 1
+
+## Abstract
+
+Short summary of the proposal:
+
+1. The pandas Extension Array interface will fully support 2D arrays. Currently, pandas publicly
+   only supports 1D Extension Arrays. Additionally, pandas will make all internal NumPy-based
+   Extension Arrays 2D. This specifically includes our nullable Extension Arrays and
+   __excludes__ Arrow-based extension arrays. Consequently, pandas will move to the nullable
+   extension dtypes by default to provide consistent missing value handling across all dtypes.
+
+2. The NumPy-based Extension Arrays will exclusively use ``pd.NA`` as a missing value indicator.
+   ``np.nan`` will not be allowed to be present, which removes the need to distinguish between
+   ``pd.NA`` and ``np.nan``. The ``FloatingArray`` will thus only use ``NA`` and not ``nan``.
+
+This addresses several issues that have been open for years:
+
+1) Clear and consistent missing value handling across all dtypes.
+2) A resolution of the discussion on how to treat ``NA`` and ``NaN`` in FloatingArrays.
+3) Making the NA scalar easier to use by no longer raising on ``bool(pd.NA)``.
+4) The ExtensionArray interface will be a first-class citizen, which simplifies 3rd-party
+   extensions.
+
+## Background
+
+pandas currently maintains three different sets of dtypes next to each other:
+
+- NumPy dtypes that use NumPy arrays to store the data
+- Arrow dtypes that use PyArrow arrays to store the data
+- Nullable extension dtypes that use pandas Extension Arrays to store the data. These
+  arrays add a layer on top of NumPy to modify the behavior.
+
+The NumPy dtypes are currently the default and the most widely used. They use NaN as the missing
+value indicator, which is a float and can't be stored in an integer or boolean array. Consequently,
+these dtypes are cast to float/object if a missing value is inserted into them.
+
+The nullable extension dtypes were originally designed to solve these problems and to provide
+consistent missing value behavior between different dtypes. These arrays use a strict 1D layout
+and store missing values through an accompanying mask. The integer and boolean dtypes are
+well supported across the pandas API, but the float dtypes still have many inconsistencies
+with respect to missing value handling and the behavior of ``pd.NA`` and ``np.nan``. The
+nullable arrays are generally hindered in some scenarios because of the 1D layout (``axis=1``
+operations, transposing, etc.).
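+
+To illustrate the difference described above, a small sketch of current behavior (output
+abbreviated; ``pd`` refers to pandas as in the other examples in this document):
+
+```python
+In [1]: pd.Series([1, 2, None]).dtype            # NumPy-backed: upcast to float, NaN as missing value
+Out[1]: dtype('float64')
+
+In [2]: pd.Series([1, 2, None], dtype="Int64")   # nullable dtype: stays integer, missing value is NA
+Out[2]:
+0       1
+1       2
+2    <NA>
+dtype: Int64
+```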
+
+The Arrow dtypes are the most recent addition to pandas. They are currently separate from the
+other two sets of dtypes since they use a different data model under the hood and are strictly
+1D.
+
+## Proposal
+
+This proposal aims to unify the missing value handling across all dtypes and to resolve
+outstanding issues for the FloatingArray implementation. This proposal is not meant to
+address implementation details, but rather to provide a high-level way forward.
+
+1. The ``FloatingArray`` implementation will exclusively use ``pd.NA`` as the missing value
+   indicator. ``np.nan`` will not be allowed to be present in the array. The missing value
+   behavior will follow the semantics of the other nullable extension dtypes.
+
+2. The ExtensionArray interface will be extended to support 2D arrays. This will allow
+   us to make our internal nullable ExtensionArrays 2D and also make this option available
+   to 3rd-party arrays.
+
+3. pandas will move to nullable extension arrays by default instead of using the NumPy
+   dtypes that are currently the default. Every constructor and IO method will infer
+   extension dtypes by default if not explicitly specified by the user. This is
+   similar to the current ``dtype_backend="numpy_nullable"`` keyword in IO methods,
+   but will be made the new default and extended to the constructors.
+
+We will obey the following dtype mapping:
+
+- int*/uint* -> Int*/UInt*
+- float* -> Float*
+- bool -> boolean
+- object dtype will be mapped to string, but this is covered by PDEP-10
+- object dtype will be used for values that aren't strings
+
+This will ensure that all dtypes have consistent missing value handling and there
+is no need to upcast if a missing value is inserted into integers or booleans. Those
+nullability semantics will be mostly consistent with how PyArrow treats nulls and thus
+make switching between both sets of dtypes easier. Additionally, it allows the usage of
+other Arrow dtypes by default that use the same semantics (bytes, nested dtypes, ...).
+
+This proposal formalizes the results of the pandas core sprint in 2023.
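+
+As a rough illustration of this mapping, the existing ``convert_dtypes`` method already performs
+a very similar inference (sketch of current behavior; exact dtype reprs depend on the pandas
+version):
+
+```python
+In [1]: df = pd.DataFrame({"a": [1, 2], "b": [1.5, np.nan], "c": [True, False], "d": ["x", None]})
+
+In [2]: df.convert_dtypes().dtypes
+Out[2]:
+a             Int64
+b           Float64
+c           boolean
+d    string[python]
+dtype: object
+```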
+
+## Backward compatibility
+
+Making Extension Arrays 2D can be considered an implementation detail and shouldn't
+impact users negatively.
+
+The ``FloatingArray`` implementation is still experimental and currently riddled with
+bugs with respect to the handling of ``pd.NA`` and ``np.nan``. Its experimental status allows
+us to change this without worrying too much about backwards compatibility. Additionally,
+the bugs related to NA handling make it unlikely that it is used in serious
+applications.
+
+Switching to nullable dtypes by default will be a huge change for pandas. It will deviate
+from the current NumPy dtypes and change nullability semantics for users. This will require
+care when implementing this change to keep the change in behavior as small as possible and
+to ensure that the new implementation is well tested and easy for users to opt in to before
+we make this switch.
+
+## Considerations
+
+### 2D Extension Arrays
+
+The current restriction to 1D-only Extension Arrays has a number of limitations internally.
+``axis=1`` operations, and more generally operations that transpose the data in some way,
+tend to fall back to object. Additionally, the 1D limitation requires copies when converting
+between NumPy and pandas in all cases for DataFrames. Our internal algorithms, like groupby
+aggregations, are more performant on 2D arrays. There are currently 35
+TODOs across the code base that are related to 2D extension arrays.
+
+We are not aware of any drawbacks compared to the current default dtypes at the time
+of writing.
+
+### FloatingArray
+
+The FloatingArray implementation is currently experimental and has a number of bugs.
+The main source of issues stems from the fact that both ``np.nan`` and ``pd.NA``
+are allowed and not properly handled.
+
+**Status quo**
+
+When constructing a FloatingArray from a NumPy array, a Series with a
+NumPy dtype or another list-like, the constructor converts ``np.nan`` to ``pd.NA``.
+
+```python
+In [3]: pd.array(np.array([1.5, np.nan]), dtype="Float64")
+Out[3]:
+<FloatingArray>
+[1.5, <NA>]
+Length: 2, dtype: Float64
+```
+
+This is done because NumPy doesn't have a missing value sentinel and pandas
+considers ``np.nan`` to be missing.
+
+Inserting ``np.nan`` into a FloatingArray will also coerce it to ``pd.NA``.
+
+```python
+In [4]: arr = pd.array([1.5, np.nan], dtype="Float64")
+In [5]: arr[0] = np.nan
+
+In [6]: arr
+Out[6]:
+<FloatingArray>
+[<NA>, <NA>]
+Length: 2, dtype: Float64
+```
+
+You can introduce NaN values through ``0/0`` for example, but having NaN present
+causes other issues. None of our NA-detection methods (``fillna``, ``isna``, ...)
+will match NaN values; they only match ``pd.NA``. A non-exhaustive list of
+issues this behavior causes can be found on the
+[pandas issue tracker](https://github.com/pandas-dev/pandas/issues?q=is%3Aopen+is%3Aissue+label%3A%22Ice+Cream+Agreement%22).
+
+**Solution**
+
+The current state makes the FloatingArray unusable if you rely on missing values
+in any way. We solve this problem by disallowing ``np.nan`` in the FloatingArray.
+Only ``NA`` will be allowed to be present.
+
+- This solution makes the implementation of all methods that interact with NA
+  simpler and more consistent. This includes methods like ``fillna`` but also
+  sorting operations.
+- Users are used to only having ``np.nan`` as a missing value indicator in pandas.
+  Staying with one missing value indicator in ``pd.NA`` will make the behavior
+  less confusing for users.
+- Input and output to and from NumPy are unambiguous: every NaN is converted to NA
+  and back.
+
+**Drawbacks**
+
+- There is no option to distinguish between missing and invalid values. This is currently not
+  possible either and would generally require increasing the API surface to handle both cases.
+  Methods interacting with missing values would need to be configurable. There was never much
+  demand for this feature, so the additional complexity does not seem justified.
+
+Distinguishing NA and NaN adds a lot of complexity:
+
+- Roundtripping through NumPy is not really possible. Currently, we are converting to NA and then
+  converting back. This is potentially very confusing if users do not end up with the same values
+  after a roundtrip through NumPy.
+- It increases the API surface and complexity for all methods that interact with NA values.
+  Additionally, it also concerns all methods that have the ``skipna`` argument.
+- Having both values in pandas is confusing for users. Historically, pandas used only NaN.
+  Differentiating between NA and NaN would make behavior less intuitive for non-expert users.
+- It adds maintenance burden.
+- NA and NaN have different semantics in comparison operations, which adds further mental
+  complexity (see the sketch below).
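+
+For reference, the differing scalar semantics look roughly like this in current pandas
+(output abbreviated):
+
+```python
+In [7]: np.nan == np.nan     # NaN compares unequal to itself
+Out[7]: False
+
+In [8]: pd.NA == pd.NA       # NA propagates in comparisons
+Out[8]: <NA>
+
+In [9]: bool(pd.NA)          # NA currently raises in a boolean context
+TypeError: boolean value of NA is ambiguous
+```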
+
+## Timeline
+
+Make Extension Arrays 2D and fix all inconsistencies in the FloatingArray. Ensure that
+this is done by the time that pandas 4.0 is released and then prepare the migration
+to nullable dtypes by default in the next major release.
+
+### PDEP History
+
+- March 2024: Initial draft
+
+Note: There is a very long discussion in [GH-32265](https://github.com/pandas-dev/pandas/issues/32265)
+that concerns this topic.

From d7dcbdb90d4d2e1658c34cff16dc99990aa6cdb0 Mon Sep 17 00:00:00 2001
From: Joris Van den Bossche
Date: Wed, 12 Jun 2024 17:49:03 +0200
Subject: [PATCH 2/2] rename file, rephrase main proposal points, temporarily
 remove most other content

---
 web/pandas/pdeps/0015-ice-cream-agreement.md | 212 ------------------
 .../pdeps/0016-consistent-missing-values.md  | 107 +++++++++
 2 files changed, 107 insertions(+), 212 deletions(-)
 delete mode 100644 web/pandas/pdeps/0015-ice-cream-agreement.md
 create mode 100644 web/pandas/pdeps/0016-consistent-missing-values.md

diff --git a/web/pandas/pdeps/0015-ice-cream-agreement.md b/web/pandas/pdeps/0015-ice-cream-agreement.md
deleted file mode 100644
index ba2eb1e4507f7..0000000000000
--- a/web/pandas/pdeps/0015-ice-cream-agreement.md
+++ /dev/null
@@ -1,212 +0,0 @@
-# PDEP-15: Ice Cream Agreement
-
-- Created: March 2024
-- Status: Under discussion
-- Discussion: [#32265](https://github.com/pandas-dev/pandas/issues/32265)
-- Author: [Patrick Hoefler](https://github.com/phofl)
-  [Joris Van den Bossche](https://github.com/jorisvandenbossche)
-- Revision: 1
-
-## Abstract
-
-Short summary of the proposal:
-
-1. The pandas Extension Array interface will fully support 2D arrays. Currently, pandas publicly
-   only supports 1D Extension Arrays. Additionally, pandas will make all internal NumPy-based
-   Extension Arrays 2D. This specifically includes our nullable Extension Arrays and
-   __excludes__ Arrow-based extension arrays. Consequently, pandas will move to the nullable
-   extension dtypes by default to provide consistent missing value handling across all dtypes.
-
-2. The NumPy-based Extension Arrays will exclusively use ``pd.NA`` as a missing value indicator.
-   ``np.nan`` will not be allowed to be present, which removes the need to distinguish between
-   ``pd.NA`` and ``np.nan``. The ``FloatingArray`` will thus only use ``NA`` and not ``nan``.
-
-This addresses several issues that have been open for years:
-
-1) Clear and consistent missing value handling across all dtypes.
-2) A resolution of the discussion on how to treat ``NA`` and ``NaN`` in FloatingArrays.
-3) Making the NA scalar easier to use by no longer raising on ``bool(pd.NA)``.
-4) The ExtensionArray interface will be a first-class citizen, which simplifies 3rd-party
-   extensions.
-
-## Background
-
-pandas currently maintains three different sets of dtypes next to each other:
-
-- NumPy dtypes that use NumPy arrays to store the data
-- Arrow dtypes that use PyArrow arrays to store the data
-- Nullable extension dtypes that use pandas Extension Arrays to store the data. These
-  arrays add a layer on top of NumPy to modify the behavior.
-
-The NumPy dtypes are currently the default and the most widely used. They use NaN as the missing
-value indicator, which is a float and can't be stored in an integer or boolean array. Consequently,
-these dtypes are cast to float/object if a missing value is inserted into them.
-
-The nullable extension dtypes were originally designed to solve these problems and to provide
-consistent missing value behavior between different dtypes. These arrays use a strict 1D layout
-and store missing values through an accompanying mask.
-The integer and boolean dtypes are well supported across the pandas API, but the float dtypes
-still have many inconsistencies with respect to missing value handling and the behavior of
-``pd.NA`` and ``np.nan``. The nullable arrays are generally hindered in some scenarios because
-of the 1D layout (``axis=1`` operations, transposing, etc.).
-
-The Arrow dtypes are the most recent addition to pandas. They are currently separate from the
-other two sets of dtypes since they use a different data model under the hood and are strictly
-1D.
-
-## Proposal
-
-This proposal aims to unify the missing value handling across all dtypes and to resolve
-outstanding issues for the FloatingArray implementation. This proposal is not meant to
-address implementation details, but rather to provide a high-level way forward.
-
-1. The ``FloatingArray`` implementation will exclusively use ``pd.NA`` as the missing value
-   indicator. ``np.nan`` will not be allowed to be present in the array. The missing value
-   behavior will follow the semantics of the other nullable extension dtypes.
-
-2. The ExtensionArray interface will be extended to support 2D arrays. This will allow
-   us to make our internal nullable ExtensionArrays 2D and also make this option available
-   to 3rd-party arrays.
-
-3. pandas will move to nullable extension arrays by default instead of using the NumPy
-   dtypes that are currently the default. Every constructor and IO method will infer
-   extension dtypes by default if not explicitly specified by the user. This is
-   similar to the current ``dtype_backend="numpy_nullable"`` keyword in IO methods,
-   but will be made the new default and extended to the constructors.
-
-We will obey the following dtype mapping:
-
-- int*/uint* -> Int*/UInt*
-- float* -> Float*
-- bool -> boolean
-- object dtype will be mapped to string, but this is covered by PDEP-10
-- object dtype will be used for values that aren't strings
-
-This will ensure that all dtypes have consistent missing value handling and there
-is no need to upcast if a missing value is inserted into integers or booleans. Those
-nullability semantics will be mostly consistent with how PyArrow treats nulls and thus
-make switching between both sets of dtypes easier. Additionally, it allows the usage of
-other Arrow dtypes by default that use the same semantics (bytes, nested dtypes, ...).
-
-This proposal formalizes the results of the pandas core sprint in 2023.
-
-## Backward compatibility
-
-Making Extension Arrays 2D can be considered an implementation detail and shouldn't
-impact users negatively.
-
-The ``FloatingArray`` implementation is still experimental and currently riddled with
-bugs with respect to the handling of ``pd.NA`` and ``np.nan``. Its experimental status allows
-us to change this without worrying too much about backwards compatibility. Additionally,
-the bugs related to NA handling make it unlikely that it is used in serious
-applications.
-
-Switching to nullable dtypes by default will be a huge change for pandas. It will deviate
-from the current NumPy dtypes and change nullability semantics for users. This will require
-care when implementing this change to keep the change in behavior as small as possible and
-to ensure that the new implementation is well tested and easy for users to opt in to before
-we make this switch.
-
-## Considerations
-
-### 2D Extension Arrays
-
-The current restriction to 1D-only Extension Arrays has a number of limitations internally.
-``axis=1`` operations, and more generally operations that transpose the data in some way,
-tend to fall back to object. Additionally, the 1D limitation requires copies when converting
-between NumPy and pandas in all cases for DataFrames. Our internal algorithms, like groupby
-aggregations, are more performant on 2D arrays. There are currently 35
-TODOs across the code base that are related to 2D extension arrays.
-
-We are not aware of any drawbacks compared to the current default dtypes at the time
-of writing.
-
-### FloatingArray
-
-The FloatingArray implementation is currently experimental and has a number of bugs.
-The main source of issues stems from the fact that both ``np.nan`` and ``pd.NA``
-are allowed and not properly handled.
-
-**Status quo**
-
-When constructing a FloatingArray from a NumPy array, a Series with a
-NumPy dtype or another list-like, the constructor converts ``np.nan`` to ``pd.NA``.
-
-```python
-In [3]: pd.array(np.array([1.5, np.nan]), dtype="Float64")
-Out[3]:
-<FloatingArray>
-[1.5, <NA>]
-Length: 2, dtype: Float64
-```
-
-This is done because NumPy doesn't have a missing value sentinel and pandas
-considers ``np.nan`` to be missing.
-
-Inserting ``np.nan`` into a FloatingArray will also coerce it to ``pd.NA``.
-
-```python
-In [4]: arr = pd.array([1.5, np.nan], dtype="Float64")
-In [5]: arr[0] = np.nan
-
-In [6]: arr
-Out[6]:
-<FloatingArray>
-[<NA>, <NA>]
-Length: 2, dtype: Float64
-```
-
-You can introduce NaN values through ``0/0`` for example, but having NaN present
-causes other issues. None of our NA-detection methods (``fillna``, ``isna``, ...)
-will match NaN values; they only match ``pd.NA``. A non-exhaustive list of
-issues this behavior causes can be found on the
-[pandas issue tracker](https://github.com/pandas-dev/pandas/issues?q=is%3Aopen+is%3Aissue+label%3A%22Ice+Cream+Agreement%22).
-
-**Solution**
-
-The current state makes the FloatingArray unusable if you rely on missing values
-in any way. We solve this problem by disallowing ``np.nan`` in the FloatingArray.
-Only ``NA`` will be allowed to be present.
-
-- This solution makes the implementation of all methods that interact with NA
-  simpler and more consistent. This includes methods like ``fillna`` but also
-  sorting operations.
-- Users are used to only having ``np.nan`` as a missing value indicator in pandas.
-  Staying with one missing value indicator in ``pd.NA`` will make the behavior
-  less confusing for users.
-- Input and output to and from NumPy are unambiguous: every NaN is converted to NA
-  and back.
-
-**Drawbacks**
-
-- There is no option to distinguish between missing and invalid values. This is currently not
-  possible either and would generally require increasing the API surface to handle both cases.
-  Methods interacting with missing values would need to be configurable. There was never much
-  demand for this feature, so the additional complexity does not seem justified.
-
-Distinguishing NA and NaN adds a lot of complexity:
-
-- Roundtripping through NumPy is not really possible. Currently, we are converting to NA and then
-  converting back. This is potentially very confusing if users do not end up with the same values
-  after a roundtrip through NumPy.
-- It increases the API surface and complexity for all methods that interact with NA values.
-  Additionally, it also concerns all methods that have the ``skipna`` argument.
-- Having both values in pandas is confusing for users. Historically, pandas used only NaN.
-  Differentiating between NA and NaN would make behavior less intuitive for non-expert users.
-- It adds maintenance burden.
-- NA and NaN have different semantics in comparison operations, which adds further mental
-  complexity.
-
-## Timeline
-
-Make Extension Arrays 2D and fix all inconsistencies in the FloatingArray. Ensure that
-this is done by the time that pandas 4.0 is released and then prepare the migration
-to nullable dtypes by default in the next major release.
-
-### PDEP History
-
-- March 2024: Initial draft
-
-Note: There is a very long discussion in [GH-32265](https://github.com/pandas-dev/pandas/issues/32265)
-that concerns this topic.
diff --git a/web/pandas/pdeps/0016-consistent-missing-values.md b/web/pandas/pdeps/0016-consistent-missing-values.md
new file mode 100644
index 0000000000000..a74f7c42eef4f
--- /dev/null
+++ b/web/pandas/pdeps/0016-consistent-missing-values.md
@@ -0,0 +1,107 @@
+# PDEP-16: Consistent missing value handling (with a single NA scalar)
+
+- Created: March 2024
+- Status: Under discussion
+- Discussion: [#32265](https://github.com/pandas-dev/pandas/issues/32265)
+- Author: [Patrick Hoefler](https://github.com/phofl)
+  [Joris Van den Bossche](https://github.com/jorisvandenbossche)
+- Revision: 1
+
+## Abstract
+
+...
+
+## Background
+
+Currently, pandas handles missing data differently for different data types. We
+use different types to indicate that a value is missing: ``np.nan`` for
+floating-point data, ``np.nan`` or ``None`` for object-dtype data -- typically
+strings or booleans -- with missing values, and ``pd.NaT`` for datetimelike
+data. Some other data types, such as integer and bool, cannot store missing data
+or are cast to float or object dtype. In addition, pandas 1.0 introduced a new
+missing value sentinel, ``pd.NA``, which is being used for the experimental
+nullable integer, float, boolean, and string data types, and more recently also
+for the pyarrow-backed data types.
+
+These different missing values also have different behaviors in user-facing
+operations. Specifically, we introduced different semantics for the nullable
+data types for certain operations (e.g. propagating in comparison operations
+instead of comparing as False).
+
+The nullable extension dtypes and the `pd.NA` scalar were originally designed to
+solve these problems and to provide consistent missing value behavior between
+different dtypes. Historically, those have been implemented as 1D arrays, which
+hinders usage of those dtypes in certain scenarios that rely on the 2D block
+structure of the pandas internals for fast operations (``axis=1`` operations,
+transposing, etc.).
+
+Long term, we want to introduce consistent missing data handling for all data
+types. This includes consistent behavior in all operations (indexing, arithmetic
+operations, comparisons, etc.) and using a missing value scalar that behaves
+consistently.
+
+## Proposal
+
+This proposal aims to unify the missing value handling across all dtypes. This
+proposal is not meant to address implementation details, but rather to provide a
+high-level way forward.
+
+1. All data types support missing values and use `pd.NA` exclusively as the
+   user-facing missing value indicator.
+
+2. All data types implement consistent missing value "semantics" corresponding
+   to the current nullable dtypes using `pd.NA` (i.e. regarding behaviour in
+   comparisons, see below for details).
+
+3. As a consequence, pandas will move to nullable extension arrays by default
+   for all data types, instead of using the NumPy dtypes that are currently the
+   default. To preserve the default 2D block structure of the DataFrame internals,
+   the ExtensionArray interface will be extended to support 2D arrays.
+
+4. For backwards compatibility, existing missing value indicators like `NaN` and
+   `NaT` will be interpreted as `pd.NA` when introduced in user input, IO or
+   through operations (to ensure it keeps being considered as missing).
+   Specifically for floating dtypes, in practice this means a float column can
+   for now only contain NA values. Potentially distinguishing NA and NaN is left
+   for a separate discussion.
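+
+As a rough illustration of the semantics referred to in point 2, this is how the existing
+nullable dtypes already behave in current pandas (sketch, output abbreviated):
+
+```python
+In [1]: arr = pd.array([1, None], dtype="Int64")
+
+In [2]: arr > 0          # comparisons propagate NA instead of returning False
+Out[2]:
+<BooleanArray>
+[True, <NA>]
+Length: 2, dtype: boolean
+
+In [3]: pd.NA | True     # boolean operations follow three-valued (Kleene) logic
+Out[3]: True
+```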
+
+This will ensure that all dtypes have consistent missing value handling and there
+is no need to upcast if a missing value is inserted into integers or booleans. Those
+nullability semantics will be mostly consistent with how PyArrow treats nulls and thus
+make switching between both sets of dtypes easier. Additionally, it allows the usage of
+other Arrow dtypes by default that use the same semantics (bytes, nested dtypes, ...).
+
+In practice, this means solidifying the integer, float, boolean and
+string nullable data types that already exist, and implementing (variants of)
+the categorical, datetimelike and interval data types using `pd.NA`. The
+proposal leaves the exact implementation details (e.g. whether to use a mask or
+a sentinel (where the best strategy might vary by data type depending on
+existing code), or whether to use byte masks vs bitmaps, or whether to use
+PyArrow under the hood like the string dtype, etc.) out of scope.
+
+This PDEP also does not define the exact API for dtype constructors or
+propose a new consistent interface; this is left for a separate discussion
+(PDEP-13).
+
+### The `NA` scalar
+
+...
+
+### Missing value semantics
+
+...
+
+## Backward compatibility
+
+...
+
+## Timeline
+
+...
+
+### PDEP History
+
+- March 2024: Initial draft
+
+Note: There is a very long discussion in [GH-32265](https://github.com/pandas-dev/pandas/issues/32265)
+that concerns this topic.