From 3b54a73e0bd34fbbef24c2d38b56c570fe659663 Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Sat, 23 Mar 2024 19:40:16 -0500
Subject: [PATCH 1/2] Add pdep content

---
 web/pandas/pdeps/0015-ice-cream-agreement.md | 212 +++++++++++++++++++
 1 file changed, 212 insertions(+)
 create mode 100644 web/pandas/pdeps/0015-ice-cream-agreement.md

diff --git a/web/pandas/pdeps/0015-ice-cream-agreement.md b/web/pandas/pdeps/0015-ice-cream-agreement.md
new file mode 100644
index 0000000000000..ba2eb1e4507f7
--- /dev/null
+++ b/web/pandas/pdeps/0015-ice-cream-agreement.md
@@ -0,0 +1,212 @@
+# PDEP-15: Ice Cream Agreement
+
+- Created: March 2024
+- Status: Under discussion
+- Discussion: [#32265](https://github.com/pandas-dev/pandas/issues/32265)
+- Author: [Patrick Hoefler](https://github.com/phofl)
+  [Joris Van den Bossche](https://github.com/jorisvandenbossche)
+- Revision: 1
+
+## Abstract
+
+Short summary of the proposal:
+
+1. The pandas Extension Array interface will fully support 2D arrays. Currently, pandas publicly
+   only supports 1D Extension Arrays. Additionally, pandas will make all internal NumPy-based
+   Extension Arrays 2D. This specifically includes our nullable Extension Arrays and
+   __excludes__ Arrow-based extension arrays. Consequently, pandas will move to the nullable
+   extension dtypes by default to provide consistent missing value handling across all dtypes.
+
+2. The NumPy-based Extension Arrays will exclusively use ``pd.NA`` as a missing value indicator.
+   ``np.nan`` will not be allowed to be present, which removes the need to distinguish between
+   ``pd.NA`` and ``np.nan``. The ``FloatingArray`` will thus only use ``NA`` and not ``nan``.
+
+This addresses several issues that have been open for years:
+
+1) Clear and consistent missing value handling across all dtypes.
+2) A resolution of the discussion on how to treat ``NA`` and ``NaN`` in FloatingArrays.
+3) Making the NA scalar easier to use by no longer raising on ``bool(pd.NA)``.
+4) The ExtensionArray interface will be a first-class citizen, which simplifies 3rd-party
+   extensions.
+
+## Background
+
+pandas currently maintains three different sets of dtypes next to each other:
+
+- NumPy dtypes that use NumPy arrays to store the data
+- Arrow dtypes that use PyArrow arrays to store the data
+- Nullable extension dtypes that use pandas Extension Arrays to store the data. These
+  arrays add a layer on top of NumPy to modify the behavior.
+
+The NumPy dtypes are currently the default and the most widely used. They use NaN as the missing
+value indicator, which is a float and can't be stored in an integer or boolean array. Consequently,
+these dtypes are cast to float/object if a missing value is inserted into them.
+
+The nullable extension dtypes were originally designed to solve these problems and to provide
+consistent missing value behavior between different dtypes. These arrays use a strict 1D layout
+and store missing values through an accompanying mask. The integer and boolean dtypes are
+well supported across the pandas API, but the float dtypes still have many inconsistencies
+with respect to missing value handling and the behavior of ``pd.NA`` and ``np.nan``. The
+nullable arrays are generally hindered in some scenarios because of the 1D layout (``axis=1``
+operations, transposing, etc.).
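+
+To illustrate the difference described above, a small sketch of current behavior (output
+abbreviated; ``pd`` refers to pandas as in the other examples in this document):
+
+```python
+In [1]: pd.Series([1, 2, None]).dtype            # NumPy-backed: upcast to float, NaN as missing value
+Out[1]: dtype('float64')
+
+In [2]: pd.Series([1, 2, None], dtype="Int64")   # nullable dtype: stays integer, missing value is NA
+Out[2]:
+0       1
+1       2
+2    <NA>
+dtype: Int64
+```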
+
+The Arrow dtypes are the most recent addition to pandas. They are currently separate from the
+other two sets of dtypes since they use a different data model under the hood and are strictly
+1D.
+
+## Proposal
+
+This proposal aims to unify the missing value handling across all dtypes and to resolve
+outstanding issues for the FloatingArray implementation. This proposal is not meant to
+address implementation details, but rather to provide a high-level way forward.
+
+1. The ``FloatingArray`` implementation will exclusively use ``pd.NA`` as the missing value
+   indicator. ``np.nan`` will not be allowed to be present in the array. The missing value
+   behavior will follow the semantics of the other nullable extension dtypes.
+
+2. The ExtensionArray interface will be extended to support 2D arrays. This will allow
+   us to make our internal nullable ExtensionArrays 2D and also make this option available
+   to 3rd-party arrays.
+
+3. pandas will move to nullable extension arrays by default instead of using the NumPy
+   dtypes that are currently the default. Every constructor and IO method will infer
+   extension dtypes by default if not explicitly specified by the user. This is
+   similar to the current ``dtype_backend="numpy_nullable"`` keyword in IO methods,
+   but will be made the new default and extended to the constructors.
+
+We will obey the following dtype mapping:
+
+- int*/uint* -> Int*/UInt*
+- float* -> Float*
+- bool -> boolean
+- object dtype will be mapped to string, but this is covered by PDEP-10
+- object dtype will be used for values that aren't strings
+
+This will ensure that all dtypes have consistent missing value handling and there
+is no need to upcast if a missing value is inserted into integers or booleans. Those
+nullability semantics will be mostly consistent with how PyArrow treats nulls and thus
+make switching between both sets of dtypes easier. Additionally, it allows the usage of
+other Arrow dtypes by default that use the same semantics (bytes, nested dtypes, ...).
+
+This proposal formalizes the results of the pandas core sprint in 2023.
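+
+As a rough illustration of this mapping, the existing ``convert_dtypes`` method already performs
+a very similar inference (sketch of current behavior; exact dtype reprs depend on the pandas
+version):
+
+```python
+In [1]: df = pd.DataFrame({"a": [1, 2], "b": [1.5, np.nan], "c": [True, False], "d": ["x", None]})
+
+In [2]: df.convert_dtypes().dtypes
+Out[2]:
+a             Int64
+b           Float64
+c           boolean
+d    string[python]
+dtype: object
+```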
+
+## Backward compatibility
+
+Making Extension Arrays 2D can be considered an implementation detail and shouldn't
+impact users negatively.
+
+The ``FloatingArray`` implementation is still experimental and currently riddled with
+bugs with respect to the handling of ``pd.NA`` and ``np.nan``. Its experimental status allows
+us to change this without worrying too much about backwards compatibility. Additionally,
+the bugs related to NA handling make it unlikely that it is used in serious
+applications.
+
+Switching to nullable dtypes by default will be a huge change for pandas. It will deviate
+from the current NumPy dtypes and change nullability semantics for users. This will require
+care when implementing this change to keep the change in behavior as small as possible and
+to ensure that the new implementation is well tested and easy for users to opt in to before
+we make this switch.
+
+## Considerations
+
+### 2D Extension Arrays
+
+The current restriction to 1D-only Extension Arrays has a number of limitations internally.
+``axis=1`` operations, and more generally operations that transpose the data in some way,
+tend to fall back to object. Additionally, the 1D limitation requires copies when converting
+between NumPy and pandas in all cases for DataFrames. Our internal algorithms, like groupby
+aggregations, are more performant on 2D arrays. There are currently 35
+TODOs across the code base that are related to 2D extension arrays.
+
+We are not aware of any drawbacks compared to the current default dtypes at the time
+of writing.
+
+### FloatingArray
+
+The FloatingArray implementation is currently experimental and has a number of bugs.
+The main source of issues stems from the fact that both ``np.nan`` and ``pd.NA``
+are allowed and not properly handled.
+
+**Status quo**
+
+When constructing a FloatingArray from a NumPy array, a Series with a
+NumPy dtype or another list-like, the constructor converts ``np.nan`` to ``pd.NA``.
+
+```python
+In [3]: pd.array(np.array([1.5, np.nan]), dtype="Float64")
+Out[3]:
+<FloatingArray>
+[1.5, <NA>]
+Length: 2, dtype: Float64
+```
+
+This is done because NumPy doesn't have a missing value sentinel and pandas
+considers ``np.nan`` to be missing.
+
+Inserting ``np.nan`` into a FloatingArray will also coerce it to ``pd.NA``.
+
+```python
+In [4]: arr = pd.array([1.5, np.nan], dtype="Float64")
+In [5]: arr[0] = np.nan
+
+In [6]: arr
+Out[6]:
+<FloatingArray>
+[<NA>, <NA>]
+Length: 2, dtype: Float64
+```
+
+You can introduce NaN values through ``0/0`` for example, but having NaN present
+causes other issues. None of our NA-detection methods (``fillna``, ``isna``, ...)
+will match NaN values; they only match ``pd.NA``. A non-exhaustive list of
+issues this behavior causes can be found on the
+[pandas issue tracker](https://github.com/pandas-dev/pandas/issues?q=is%3Aopen+is%3Aissue+label%3A%22Ice+Cream+Agreement%22).
+
+**Solution**
+
+The current state makes the FloatingArray unusable if you rely on missing values
+in any way. We solve this problem by disallowing ``np.nan`` in the FloatingArray.
+Only ``NA`` will be allowed to be present.
+
+- This solution makes the implementation of all methods that interact with NA
+  simpler and more consistent. This includes methods like ``fillna`` but also
+  sorting operations.
+- Users are used to only having ``np.nan`` as a missing value indicator in pandas.
+  Staying with one missing value indicator in ``pd.NA`` will make the behavior
+  less confusing for users.
+- Input and output to and from NumPy are unambiguous: every NaN is converted to NA
+  and back.
+
+**Drawbacks**
+
+- There is no option to distinguish between missing and invalid values. This is currently not
+  possible either and would generally require increasing the API surface to handle both cases.
+  Methods interacting with missing values would need to be configurable. There was never much
+  demand for this feature, so the additional complexity does not seem justified.
+
+Distinguishing NA and NaN adds a lot of complexity:
+
+- Roundtripping through NumPy is not really possible. Currently, we are converting to NA and then
+  converting back. This is potentially very confusing if users do not end up with the same values
+  after a roundtrip through NumPy.
+- It increases the API surface and complexity for all methods that interact with NA values.
+  Additionally, it also concerns all methods that have the ``skipna`` argument.
+- Having both values in pandas is confusing for users. Historically, pandas used only NaN.
+  Differentiating between NA and NaN would make behavior less intuitive for non-expert users.
+- It adds maintenance burden.
+- NA and NaN have different semantics in comparison operations, which adds further mental
+  complexity (see the sketch below).
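+
+For reference, the differing scalar semantics look roughly like this in current pandas
+(output abbreviated):
+
+```python
+In [7]: np.nan == np.nan     # NaN compares unequal to itself
+Out[7]: False
+
+In [8]: pd.NA == pd.NA       # NA propagates in comparisons
+Out[8]: <NA>
+
+In [9]: bool(pd.NA)          # NA currently raises in a boolean context
+TypeError: boolean value of NA is ambiguous
+```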
+
+## Timeline
+
+Make Extension Arrays 2D and fix all inconsistencies in the FloatingArray. Ensure that
+this is done by the time that pandas 4.0 is released and then prepare the migration
+to nullable dtypes by default in the next major release.
+
+### PDEP History
+
+- March 2024: Initial draft
+
+Note: There is a very long discussion in [GH-32265](https://github.com/pandas-dev/pandas/issues/32265)
+that concerns this topic.

From d7dcbdb90d4d2e1658c34cff16dc99990aa6cdb0 Mon Sep 17 00:00:00 2001
From: Joris Van den Bossche
Date: Wed, 12 Jun 2024 17:49:03 +0200
Subject: [PATCH 2/2] rename file, rephrase main proposal points, temporarily
 remove most other content

---
 web/pandas/pdeps/0015-ice-cream-agreement.md | 212 ------------------
 .../pdeps/0016-consistent-missing-values.md  | 107 +++++++++
 2 files changed, 107 insertions(+), 212 deletions(-)
 delete mode 100644 web/pandas/pdeps/0015-ice-cream-agreement.md
 create mode 100644 web/pandas/pdeps/0016-consistent-missing-values.md

diff --git a/web/pandas/pdeps/0015-ice-cream-agreement.md b/web/pandas/pdeps/0015-ice-cream-agreement.md
deleted file mode 100644
index ba2eb1e4507f7..0000000000000
--- a/web/pandas/pdeps/0015-ice-cream-agreement.md
+++ /dev/null
@@ -1,212 +0,0 @@
-# PDEP-15: Ice Cream Agreement
-
-- Created: March 2024
-- Status: Under discussion
-- Discussion: [#32265](https://github.com/pandas-dev/pandas/issues/32265)
-- Author: [Patrick Hoefler](https://github.com/phofl)
-  [Joris Van den Bossche](https://github.com/jorisvandenbossche)
-- Revision: 1
-
-## Abstract
-
-Short summary of the proposal:
-
-1. The pandas Extension Array interface will fully support 2D arrays. Currently, pandas publicly
-   only supports 1D Extension Arrays. Additionally, pandas will make all internal NumPy-based
-   Extension Arrays 2D. This specifically includes our nullable Extension Arrays and
-   __excludes__ Arrow-based extension arrays. Consequently, pandas will move to the nullable
-   extension dtypes by default to provide consistent missing value handling across all dtypes.
-
-2. The NumPy-based Extension Arrays will exclusively use ``pd.NA`` as a missing value indicator.
-   ``np.nan`` will not be allowed to be present, which removes the need to distinguish between
-   ``pd.NA`` and ``np.nan``. The ``FloatingArray`` will thus only use ``NA`` and not ``nan``.
-
-This addresses several issues that have been open for years:
-
-1) Clear and consistent missing value handling across all dtypes.
-2) A resolution of the discussion on how to treat ``NA`` and ``NaN`` in FloatingArrays.
-3) Making the NA scalar easier to use by no longer raising on ``bool(pd.NA)``.
-4) The ExtensionArray interface will be a first-class citizen, which simplifies 3rd-party
-   extensions.
-
-## Background
-
-pandas currently maintains three different sets of dtypes next to each other:
-
-- NumPy dtypes that use NumPy arrays to store the data
-- Arrow dtypes that use PyArrow arrays to store the data
-- Nullable extension dtypes that use pandas Extension Arrays to store the data. These
-  arrays add a layer on top of NumPy to modify the behavior.
-
-The NumPy dtypes are currently the default and the most widely used. They use NaN as the missing
-value indicator, which is a float and can't be stored in an integer or boolean array. Consequently,
-these dtypes are cast to float/object if a missing value is inserted into them.
-
-The nullable extension dtypes were originally designed to solve these problems and to provide
-consistent missing value behavior between different dtypes. These arrays use a strict 1D layout
-and store missing values through an accompanying mask.
-The integer and boolean dtypes are well supported across the pandas API, but the float dtypes
-still have many inconsistencies with respect to missing value handling and the behavior of
-``pd.NA`` and ``np.nan``. The nullable arrays are generally hindered in some scenarios because
-of the 1D layout (``axis=1`` operations, transposing, etc.).
-
-The Arrow dtypes are the most recent addition to pandas. They are currently separate from the
-other two sets of dtypes since they use a different data model under the hood and are strictly
-1D.
-
-## Proposal
-
-This proposal aims to unify the missing value handling across all dtypes and to resolve
-outstanding issues for the FloatingArray implementation. This proposal is not meant to
-address implementation details, but rather to provide a high-level way forward.
-
-1. The ``FloatingArray`` implementation will exclusively use ``pd.NA`` as the missing value
-   indicator. ``np.nan`` will not be allowed to be present in the array. The missing value
-   behavior will follow the semantics of the other nullable extension dtypes.
-
-2. The ExtensionArray interface will be extended to support 2D arrays. This will allow
-   us to make our internal nullable ExtensionArrays 2D and also make this option available
-   to 3rd-party arrays.
-
-3. pandas will move to nullable extension arrays by default instead of using the NumPy
-   dtypes that are currently the default. Every constructor and IO method will infer
-   extension dtypes by default if not explicitly specified by the user. This is
-   similar to the current ``dtype_backend="numpy_nullable"`` keyword in IO methods,
-   but will be made the new default and extended to the constructors.
-
-We will obey the following dtype mapping:
-
-- int*/uint* -> Int*/UInt*
-- float* -> Float*
-- bool -> boolean
-- object dtype will be mapped to string, but this is covered by PDEP-10
-- object dtype will be used for values that aren't strings
-
-This will ensure that all dtypes have consistent missing value handling and there
-is no need to upcast if a missing value is inserted into integers or booleans. Those
-nullability semantics will be mostly consistent with how PyArrow treats nulls and thus
-make switching between both sets of dtypes easier. Additionally, it allows the usage of
-other Arrow dtypes by default that use the same semantics (bytes, nested dtypes, ...).
-
-This proposal formalizes the results of the pandas core sprint in 2023.
-
-## Backward compatibility
-
-Making Extension Arrays 2D can be considered an implementation detail and shouldn't
-impact users negatively.
-
-The ``FloatingArray`` implementation is still experimental and currently riddled with
-bugs with respect to the handling of ``pd.NA`` and ``np.nan``. Its experimental status allows
-us to change this without worrying too much about backwards compatibility. Additionally,
-the bugs related to NA handling make it unlikely that it is used in serious
-applications.
-
-Switching to nullable dtypes by default will be a huge change for pandas. It will deviate
-from the current NumPy dtypes and change nullability semantics for users. This will require
-care when implementing this change to keep the change in behavior as small as possible and
-to ensure that the new implementation is well tested and easy for users to opt in to before
-we make this switch.
-
-## Considerations
-
-### 2D Extension Arrays
-
-The current restriction to 1D-only Extension Arrays has a number of limitations internally.
-``axis=1`` operations, and more generally operations that transpose the data in some way,
-tend to fall back to object. Additionally, the 1D limitation requires copies when converting
-between NumPy and pandas in all cases for DataFrames. Our internal algorithms, like groupby
-aggregations, are more performant on 2D arrays. There are currently 35
-TODOs across the code base that are related to 2D extension arrays.
-
-We are not aware of any drawbacks compared to the current default dtypes at the time
-of writing.
-
-### FloatingArray
-
-The FloatingArray implementation is currently experimental and has a number of bugs.
-The main source of issues stems from the fact that both ``np.nan`` and ``pd.NA``
-are allowed and not properly handled.
-
-**Status quo**
-
-When constructing a FloatingArray from a NumPy array, a Series with a
-NumPy dtype or another list-like, the constructor converts ``np.nan`` to ``pd.NA``.
-
-```python
-In [3]: pd.array(np.array([1.5, np.nan]), dtype="Float64")
-Out[3]:
-<FloatingArray>
-[1.5, <NA>]
-Length: 2, dtype: Float64
-```
-
-This is done because NumPy doesn't have a missing value sentinel and pandas
-considers ``np.nan`` to be missing.
-
-Inserting ``np.nan`` into a FloatingArray will also coerce it to ``pd.NA``.
-
-```python
-In [4]: arr = pd.array([1.5, np.nan], dtype="Float64")
-In [5]: arr[0] = np.nan
-
-In [6]: arr
-Out[6]:
-<FloatingArray>
-[<NA>, <NA>]
-Length: 2, dtype: Float64
-```
-
-You can introduce NaN values through ``0/0`` for example, but having NaN present
-causes other issues. None of our NA-detection methods (``fillna``, ``isna``, ...)
-will match NaN values; they only match ``pd.NA``. A non-exhaustive list of
-issues this behavior causes can be found on the
-[pandas issue tracker](https://github.com/pandas-dev/pandas/issues?q=is%3Aopen+is%3Aissue+label%3A%22Ice+Cream+Agreement%22).
-
-**Solution**
-
-The current state makes the FloatingArray unusable if you rely on missing values
-in any way. We solve this problem by disallowing ``np.nan`` in the FloatingArray.
-Only ``NA`` will be allowed to be present.
-
-- This solution makes the implementation of all methods that interact with NA
-  simpler and more consistent. This includes methods like ``fillna`` but also
-  sorting operations.
-- Users are used to only having ``np.nan`` as a missing value indicator in pandas.
-  Staying with one missing value indicator in ``pd.NA`` will make the behavior
-  less confusing for users.
-- Input and output to and from NumPy are unambiguous: every NaN is converted to NA
-  and back.
-
-**Drawbacks**
-
-- There is no option to distinguish between missing and invalid values. This is currently not
-  possible either and would generally require increasing the API surface to handle both cases.
-  Methods interacting with missing values would need to be configurable. There was never much
-  demand for this feature, so the additional complexity does not seem justified.
-
-Distinguishing NA and NaN adds a lot of complexity:
-
-- Roundtripping through NumPy is not really possible. Currently, we are converting to NA and then
-  converting back. This is potentially very confusing if users do not end up with the same values
-  after a roundtrip through NumPy.
-- It increases the API surface and complexity for all methods that interact with NA values.
-  Additionally, it also concerns all methods that have the ``skipna`` argument.
-- Having both values in pandas is confusing for users. Historically, pandas used only NaN.
-  Differentiating between NA and NaN would make behavior less intuitive for non-expert users.
-- It adds maintenance burden.
-- NA and NaN have different semantics in comparison operations, which adds further mental
-  complexity.
-
-## Timeline
-
-Make Extension Arrays 2D and fix all inconsistencies in the FloatingArray. Ensure that
-this is done by the time that pandas 4.0 is released and then prepare the migration
-to nullable dtypes by default in the next major release.
-
-### PDEP History
-
-- March 2024: Initial draft
-
-Note: There is a very long discussion in [GH-32265](https://github.com/pandas-dev/pandas/issues/32265)
-that concerns this topic.
diff --git a/web/pandas/pdeps/0016-consistent-missing-values.md b/web/pandas/pdeps/0016-consistent-missing-values.md
new file mode 100644
index 0000000000000..a74f7c42eef4f
--- /dev/null
+++ b/web/pandas/pdeps/0016-consistent-missing-values.md
@@ -0,0 +1,107 @@
+# PDEP-16: Consistent missing value handling (with a single NA scalar)
+
+- Created: March 2024
+- Status: Under discussion
+- Discussion: [#32265](https://github.com/pandas-dev/pandas/issues/32265)
+- Author: [Patrick Hoefler](https://github.com/phofl)
+  [Joris Van den Bossche](https://github.com/jorisvandenbossche)
+- Revision: 1
+
+## Abstract
+
+...
+
+## Background
+
+Currently, pandas handles missing data differently for different data types. We
+use different types to indicate that a value is missing: ``np.nan`` for
+floating-point data, ``np.nan`` or ``None`` for object-dtype data -- typically
+strings or booleans -- with missing values, and ``pd.NaT`` for datetimelike
+data. Some other data types, such as integer and bool, cannot store missing data
+or are cast to float or object dtype. In addition, pandas 1.0 introduced a new
+missing value sentinel, ``pd.NA``, which is being used for the experimental
+nullable integer, float, boolean, and string data types, and more recently also
+for the pyarrow-backed data types.
+
+These different missing values also have different behaviors in user-facing
+operations. Specifically, we introduced different semantics for the nullable
+data types for certain operations (e.g. propagating in comparison operations
+instead of comparing as False).
+
+The nullable extension dtypes and the `pd.NA` scalar were originally designed to
+solve these problems and to provide consistent missing value behavior between
+different dtypes. Historically, those have been implemented as 1D arrays, which
+hinders usage of those dtypes in certain scenarios that rely on the 2D block
+structure of the pandas internals for fast operations (``axis=1`` operations,
+transposing, etc.).
+
+Long term, we want to introduce consistent missing data handling for all data
+types. This includes consistent behavior in all operations (indexing, arithmetic
+operations, comparisons, etc.) and using a missing value scalar that behaves
+consistently.
+
+## Proposal
+
+This proposal aims to unify the missing value handling across all dtypes. This
+proposal is not meant to address implementation details, but rather to provide a
+high-level way forward.
+
+1. All data types support missing values and use `pd.NA` exclusively as the
+   user-facing missing value indicator.
+
+2. All data types implement consistent missing value "semantics" corresponding
+   to the current nullable dtypes using `pd.NA` (i.e. regarding behaviour in
+   comparisons, see below for details).
+
+3. As a consequence, pandas will move to nullable extension arrays by default
+   for all data types, instead of using the NumPy dtypes that are currently the
+   default. To preserve the default 2D block structure of the DataFrame internals,
+   the ExtensionArray interface will be extended to support 2D arrays.
+
+4. For backwards compatibility, existing missing value indicators like `NaN` and
+   `NaT` will be interpreted as `pd.NA` when introduced in user input, IO or
+   through operations (to ensure it keeps being considered as missing).
+   Specifically for floating dtypes, in practice this means a float column can
+   for now only contain NA values. Potentially distinguishing NA and NaN is left
+   for a separate discussion.
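+
+As a rough illustration of the semantics referred to in point 2, this is how the existing
+nullable dtypes already behave in current pandas (sketch, output abbreviated):
+
+```python
+In [1]: arr = pd.array([1, None], dtype="Int64")
+
+In [2]: arr > 0          # comparisons propagate NA instead of returning False
+Out[2]:
+<BooleanArray>
+[True, <NA>]
+Length: 2, dtype: boolean
+
+In [3]: pd.NA | True     # boolean operations follow three-valued (Kleene) logic
+Out[3]: True
+```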
+
+This will ensure that all dtypes have consistent missing value handling and there
+is no need to upcast if a missing value is inserted into integers or booleans. Those
+nullability semantics will be mostly consistent with how PyArrow treats nulls and thus
+make switching between both sets of dtypes easier. Additionally, it allows the usage of
+other Arrow dtypes by default that use the same semantics (bytes, nested dtypes, ...).
+
+In practice, this means solidifying the integer, float, boolean and
+string nullable data types that already exist, and implementing (variants of)
+the categorical, datetimelike and interval data types using `pd.NA`. The
+proposal leaves the exact implementation details (e.g. whether to use a mask or
+a sentinel (where the best strategy might vary by data type depending on
+existing code), or whether to use byte masks vs bitmaps, or whether to use
+PyArrow under the hood like the string dtype, etc.) out of scope.
+
+This PDEP also does not define the exact API for dtype constructors or
+propose a new consistent interface; this is left for a separate discussion
+(PDEP-13).
+
+### The `NA` scalar
+
+...
+
+### Missing value semantics
+
+...
+
+## Backward compatibility
+
+...
+
+## Timeline
+
+...
+
+### PDEP History
+
+- March 2024: Initial draft
+
+Note: There is a very long discussion in [GH-32265](https://github.com/pandas-dev/pandas/issues/32265)
+that concerns this topic.