What's new in 2.0.0 (April 3, 2023)

These are the changes in pandas 2.0.0. See :ref:`release` for a full changelog including other versions of pandas.

{{ header }}

Enhancements

Installing optional dependencies with pip extras

When installing pandas using pip, sets of optional dependencies can also be installed by specifying extras.

pip install "pandas[performance, aws]>=2.0.0"

The available extras, found in the :ref:`installation guide <install.dependencies>`, are [all, performance, computation, fss, aws, gcp, excel, parquet, feather, hdf5, spss, postgresql, mysql, sql-other, html, xml, plot, output_formatting, clipboard, compression, test] (:issue:`39164`).

:class:`Index` can now hold numpy numeric dtypes

It is now possible to use any numpy numeric dtype in an :class:`Index` (:issue:`42717`).

Previously it was only possible to use int64, uint64 & float64 dtypes:

In [1]: pd.Index([1, 2, 3], dtype=np.int8)
Out[1]: Int64Index([1, 2, 3], dtype="int64")
In [2]: pd.Index([1, 2, 3], dtype=np.uint16)
Out[2]: UInt64Index([1, 2, 3], dtype="uint64")
In [3]: pd.Index([1, 2, 3], dtype=np.float32)
Out[3]: Float64Index([1.0, 2.0, 3.0], dtype="float64")

:class:`Int64Index`, :class:`UInt64Index` & :class:`Float64Index` were deprecated in pandas version 1.4 and have now been removed. Instead, :class:`Index` should be used directly; it can now take all numpy numeric dtypes, i.e. int8/int16/int32/int64/uint8/uint16/uint32/uint64/float32/float64:

.. ipython:: python

    pd.Index([1, 2, 3], dtype=np.int8)
    pd.Index([1, 2, 3], dtype=np.uint16)
    pd.Index([1, 2, 3], dtype=np.float32)

The ability of :class:`Index` to hold numpy numeric dtypes has meant some changes in pandas functionality. In particular, operations that previously were forced to create 64-bit indexes can now create indexes with lower bit sizes, e.g. 32-bit indexes.

Below is a possibly non-exhaustive list of changes:

  1. Instantiating using a numpy numeric array now follows the dtype of the numpy array. Previously, all indexes created from numpy numeric arrays were forced to 64-bit. Now, for example, Index(np.array([1, 2, 3])) will be int32 on 32-bit systems, where it previously would have been int64 even on 32-bit systems. Instantiating :class:`Index` using a list of numbers will still return 64-bit dtypes, e.g. Index([1, 2, 3]) will have an int64 dtype, the same as previously.
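    For example (a minimal check; the integer dtype of the first index depends on the platform's default integer size):

    .. ipython:: python

        pd.Index(np.array([1, 2, 3]))  # follows the array's dtype (platform-dependent)
        pd.Index([1, 2, 3])            # lists still give int64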

  2. The various numeric datetime attributes of :class:`DatetimeIndex` (:attr:`~DatetimeIndex.day`, :attr:`~DatetimeIndex.month`, :attr:`~DatetimeIndex.year` etc.) were previously of dtype int64, while they were int32 for :class:`arrays.DatetimeArray`. They are now int32 on :class:`DatetimeIndex` also:

    .. ipython:: python
    
        idx = pd.date_range(start='1/1/2018', periods=3, freq='ME')
        idx.array.year
        idx.year
    
    
  3. Level dtypes on Indexes from :meth:`Series.sparse.from_coo` are now of dtype int32, the same as the rows/cols of a scipy sparse matrix. Previously they were of dtype int64.

    .. ipython:: python
    
        from scipy import sparse
        A = sparse.coo_matrix(
            ([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4)
        )
        ser = pd.Series.sparse.from_coo(A)
        ser.index.dtypes
    
    
  4. :class:`Index` cannot be instantiated using a float16 dtype. Previously, instantiating an :class:`Index` using dtype float16 resulted in a :class:`Float64Index` with a float64 dtype. It now raises NotImplementedError:

    .. ipython:: python
        :okexcept:
    
        pd.Index([1, 2, 3], dtype=np.float16)
    
    
    

Argument dtype_backend, to return pyarrow-backed or numpy-backed nullable dtypes

Several functions, including :func:`read_csv`, gained a new keyword dtype_backend (:issue:`36712`).

When this keyword is set to "numpy_nullable", these functions will return a :class:`DataFrame` that is backed by nullable extension dtypes.

When it is set to "pyarrow", these functions will return pyarrow-backed nullable :class:`ArrowDtype` DataFrames (:issue:`48957`, :issue:`49997`):

.. ipython:: python

    import io
    data = io.StringIO("""a,b,c,d,e,f,g,h,i
        1,2.5,True,a,,,,,
        3,4.5,False,b,6,7.5,True,a,
    """)
    df = pd.read_csv(data, dtype_backend="pyarrow")
    df.dtypes

    data.seek(0)
    df_pyarrow = pd.read_csv(data, dtype_backend="pyarrow", engine="pyarrow")
    df_pyarrow.dtypes
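
The "numpy_nullable" backend is requested the same way. As a minimal sketch reusing the buffer from above, it returns the numpy-backed nullable extension dtypes instead (the variable name df_nullable is just for illustration):

.. ipython:: python

    data.seek(0)
    df_nullable = pd.read_csv(data, dtype_backend="numpy_nullable")
    df_nullable.dtypes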

Copy-on-Write improvements

  • A new lazy copy mechanism that defers the copy until the object in question is modified was added to the methods listed in :ref:`Copy-on-Write optimizations <copy_on_write.optimizations>`. These methods return views when Copy-on-Write is enabled, which provides a significant performance improvement compared to the regular execution (:issue:`49473`).
  • Accessing a single column of a DataFrame as a Series (e.g. df["col"]) now always returns a new object each time it is constructed when Copy-on-Write is enabled (rather than returning a cached Series object multiple times). This ensures that those Series objects correctly follow the Copy-on-Write rules (:issue:`49450`)
  • The :class:`Series` constructor will now create a lazy copy (deferring the copy until a modification to the data happens) when constructing a Series from an existing Series with the default of copy=False (:issue:`50471`)
  • The :class:`DataFrame` constructor will now create a lazy copy (deferring the copy until a modification to the data happens) when constructing from an existing :class:`DataFrame` with the default of copy=False (:issue:`51239`)
  • The :class:`DataFrame` constructor, when constructing a DataFrame from a dictionary of Series objects and specifying copy=False, will now use a lazy copy of those Series objects for the columns of the DataFrame (:issue:`50777`)
  • The :class:`DataFrame` constructor, when constructing a DataFrame from a :class:`Series` or :class:`Index` and specifying copy=False, will now respect Copy-on-Write.
  • The :class:`DataFrame` and :class:`Series` constructors, when constructing from a NumPy array, will now copy the array by default to avoid mutating the :class:`DataFrame` / :class:`Series` when mutating the array. Specify copy=False to get the old behavior. When setting copy=False pandas does not guarantee correct Copy-on-Write behavior when the NumPy array is modified after creation of the :class:`DataFrame` / :class:`Series`.
  • The :meth:`DataFrame.from_records` will now respect Copy-on-Write when called with a :class:`DataFrame`.
  • Trying to set values using chained assignment (for example, df["a"][1:3] = 0) will now always raise a warning when Copy-on-Write is enabled. In this mode, chained assignment can never work because we are always setting into a temporary object that is the result of an indexing operation (getitem), which under Copy-on-Write always behaves as a copy. Thus, assigning through a chain can never update the original Series or DataFrame. Therefore, an informative warning is raised to the user to avoid silently doing nothing (:issue:`49467`)
  • :meth:`DataFrame.replace` will now respect the Copy-on-Write mechanism when inplace=True.
  • :meth:`DataFrame.transpose` will now respect the Copy-on-Write mechanism.
  • Arithmetic operations that can be inplace, e.g. ser *= 2 will now respect the Copy-on-Write mechanism.
  • :meth:`DataFrame.__getitem__` will now respect the Copy-on-Write mechanism when the :class:`DataFrame` has :class:`MultiIndex` columns.
  • :meth:`Series.__getitem__` will now respect the Copy-on-Write mechanism when the
    :class:`Series` has a :class:`MultiIndex`.
  • :meth:`Series.view` will now respect the Copy-on-Write mechanism.

Copy-on-Write can be enabled through one of the following:

pd.set_option("mode.copy_on_write", True)
pd.options.mode.copy_on_write = True

Alternatively, Copy-on-Write can be enabled locally through:

with pd.option_context("mode.copy_on_write", True):
    ...
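
For example, a minimal sketch of the lazy-copy rule for the :class:`Series` constructor described above (variable names are just for illustration):

with pd.option_context("mode.copy_on_write", True):
    ser = pd.Series([1, 2, 3])
    ser2 = pd.Series(ser)    # lazy copy: data is shared until one side is modified
    ser2.iloc[0] = 100       # this write triggers the deferred copy
    assert ser.iloc[0] == 1  # the original Series is unchanged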

Other enhancements

Notable bug fixes

These are bug fixes that might have notable behavior changes.

:meth:`.DataFrameGroupBy.cumsum` and :meth:`.DataFrameGroupBy.cumprod` overflow instead of lossy casting to float

In previous versions we cast to float when applying cumsum and cumprod, which led to incorrect results even if the result could be held by int64 dtype. Additionally, the aggregation now overflows, consistent with numpy and the regular :meth:`DataFrame.cumprod` and :meth:`DataFrame.cumsum` methods, when the limit of int64 is reached (:issue:`37493`).

Old Behavior

In [1]: df = pd.DataFrame({"key": ["b"] * 7, "value": 625})
In [2]: df.groupby("key")["value"].cumprod()[5]
Out[2]: 5.960464477539062e+16

We returned an incorrect result for the 6th value.

New Behavior

.. ipython:: python

    df = pd.DataFrame({"key": ["b"] * 7, "value": 625})
    df.groupby("key")["value"].cumprod()

We overflow with the 7th value, but the 6th value is still correct.
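
The precision loss in the old behavior comes from float64's 53-bit mantissa: integers above 2**53 can no longer be represented exactly. A quick check (plain Python, no pandas involved):

.. ipython:: python

    625 ** 6            # the exact integer result
    float(625 ** 6)     # what float64 rounds it to
    625 ** 6 > 2 ** 53  # beyond the exactly-representable integer range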

:meth:`.DataFrameGroupBy.nth` and :meth:`.SeriesGroupBy.nth` now behave as filtrations

In previous versions of pandas, :meth:`.DataFrameGroupBy.nth` and :meth:`.SeriesGroupBy.nth` acted as if they were aggregations. However, for most inputs n, they may return either zero or multiple rows per group. This means that they are filtrations, similar to e.g. :meth:`.DataFrameGroupBy.head`. pandas now treats them as filtrations (:issue:`13666`).

.. ipython:: python

    df = pd.DataFrame({"a": [1, 1, 2, 1, 2], "b": [np.nan, 2.0, 3.0, 4.0, 5.0]})
    gb = df.groupby("a")

Old Behavior

In [5]: gb.nth(n=1)
Out[5]:
   a    b
1  1  2.0
4  2  5.0

New Behavior

.. ipython:: python

    gb.nth(n=1)

In particular, the index of the result is derived from the input by selecting the appropriate rows. Also, when n is larger than the group, no rows are returned instead of NaN.

Old Behavior

In [5]: gb.nth(n=3, dropna="any")
Out[5]:
    b
a
1 NaN
2 NaN

New Behavior

.. ipython:: python

    gb.nth(n=3, dropna="any")

Backwards incompatible API changes

Construction with datetime64 or timedelta64 dtype with unsupported resolution

In past versions, when constructing a :class:`Series` or :class:`DataFrame` and passing a "datetime64" or "timedelta64" dtype with unsupported resolution (i.e. anything other than "ns"), pandas would silently replace the given dtype with its nanosecond analogue:

Previous behavior:

In [5]: pd.Series(["2016-01-01"], dtype="datetime64[s]")
Out[5]:
0   2016-01-01
dtype: datetime64[ns]

In [6]: pd.Series(["2016-01-01"], dtype="datetime64[D]")
Out[6]:
0   2016-01-01
dtype: datetime64[ns]

In pandas 2.0 we support resolutions "s", "ms", "us", and "ns". When passing a supported dtype (e.g. "datetime64[s]"), the result now has exactly the requested dtype:

New behavior:

.. ipython:: python

   pd.Series(["2016-01-01"], dtype="datetime64[s]")

With an unsupported dtype, pandas now raises instead of silently swapping in a supported dtype:

New behavior:

.. ipython:: python
   :okexcept:

   pd.Series(["2016-01-01"], dtype="datetime64[D]")

Value counts sets the resulting name to count

In past versions, when running :meth:`Series.value_counts`, the result would inherit the original object's name, and the result index would be nameless. This would cause confusion when resetting the index, and the column names would not correspond with the column values. Now, the result name will be 'count' (or 'proportion' if normalize=True was passed), and the index will be named after the original object (:issue:`49497`).

Previous behavior:

In [8]: pd.Series(['quetzal', 'quetzal', 'elk'], name='animal').value_counts()
Out[8]:
quetzal    2
elk        1
Name: animal, dtype: int64

New behavior:

.. ipython:: python

    pd.Series(['quetzal', 'quetzal', 'elk'], name='animal').value_counts()

Likewise for other value_counts methods (for example, :meth:`DataFrame.value_counts`).
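
As a minimal sketch with :meth:`DataFrame.value_counts`, which likewise names its result 'count':

.. ipython:: python

    pd.DataFrame({'animal': ['quetzal', 'quetzal', 'elk']}).value_counts()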

Disallow astype conversion to non-supported datetime64/timedelta64 dtypes

In previous versions, converting a :class:`Series` or :class:`DataFrame` from datetime64[ns] to a different datetime64[X] dtype would return with datetime64[ns] dtype instead of the requested dtype. In pandas 2.0, support is added for "datetime64[s]", "datetime64[ms]", and "datetime64[us]" dtypes, so converting to those dtypes gives exactly the requested dtype:


.. ipython:: python

   idx = pd.date_range("2016-01-01", periods=3)
   ser = pd.Series(idx)

Previous behavior:

In [4]: ser.astype("datetime64[s]")
Out[4]:
0   2016-01-01
1   2016-01-02
2   2016-01-03
dtype: datetime64[ns]

With the new behavior, we get exactly the requested dtype:

New behavior:

.. ipython:: python

   ser.astype("datetime64[s]")

For non-supported resolutions e.g. "datetime64[D]", we raise instead of silently ignoring the requested dtype:

New behavior:

.. ipython:: python
   :okexcept:

   ser.astype("datetime64[D]")

For conversion from timedelta64[ns] dtypes, the old behavior converted to a floating point format.


.. ipython:: python

   idx = pd.timedelta_range("1 Day", periods=3)
   ser = pd.Series(idx)

Previous behavior:

In [7]: ser.astype("timedelta64[s]")
Out[7]:
0     86400.0
1    172800.0
2    259200.0
dtype: float64

In [8]: ser.astype("timedelta64[D]")
Out[8]:
0    1.0
1    2.0
2    3.0
dtype: float64

The new behavior, as for datetime64, either gives exactly the requested dtype or raises:

New behavior:

.. ipython:: python
   :okexcept:

   ser.astype("timedelta64[s]")
   ser.astype("timedelta64[D]")

UTC and fixed-offset timezones default to standard-library tzinfo objects

In previous versions, the default tzinfo object used to represent UTC was pytz.UTC. In pandas 2.0, we default to datetime.timezone.utc instead. Similarly, for timezones that represent fixed UTC offsets, we use datetime.timezone objects instead of pytz.FixedOffset objects (:issue:`34916`).

Previous behavior:

In [2]: ts = pd.Timestamp("2016-01-01", tz="UTC")
In [3]: type(ts.tzinfo)
Out[3]: pytz.UTC

In [4]: ts2 = pd.Timestamp("2016-01-01 04:05:06-07:00")
In [5]: type(ts2.tzinfo)
Out[5]: pytz._FixedOffset

New behavior:

.. ipython:: python

   ts = pd.Timestamp("2016-01-01", tz="UTC")
   type(ts.tzinfo)

   ts2 = pd.Timestamp("2016-01-01 04:05:06-07:00")
   type(ts2.tzinfo)

For timezones that are neither UTC nor fixed offsets, e.g. "US/Pacific", we continue to default to pytz objects.
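
For example (pytz is a required dependency of pandas 2.0, so this sketch assumes it is available):

.. ipython:: python

   ts3 = pd.Timestamp("2016-01-01", tz="US/Pacific")
   type(ts3.tzinfo)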

Empty DataFrames/Series will now default to have a RangeIndex

Before, constructing an empty :class:`Series` or :class:`DataFrame` (where data is None or an empty list-like argument) without specifying the axes (index=None, columns=None) would return the axes as an empty :class:`Index` with object dtype.

Now, the axes return an empty :class:`RangeIndex` (:issue:`49572`).

Previous behavior:

In [8]: pd.Series().index
Out[8]:
Index([], dtype='object')

In [9]: pd.DataFrame().axes
Out[9]:
[Index([], dtype='object'), Index([], dtype='object')]

New behavior:

.. ipython:: python

   pd.Series().index
   pd.DataFrame().axes

DataFrame to LaTeX has a new render engine

The existing :meth:`DataFrame.to_latex` has been restructured to utilise the extended implementation previously available under :meth:`.Styler.to_latex`. The argument signature is similar, albeit col_space has been removed since it is ignored by LaTeX engines. This render engine requires jinja2 as a dependency, which needs to be installed, since rendering is based upon jinja2 templates.

The pandas latex options below are no longer used and have been removed. The generic max rows and columns arguments remain, but for this functionality they should be replaced by the Styler equivalents. The alternative options offering similar functionality are indicated below:

  • display.latex.escape: replaced with styler.format.escape,
  • display.latex.longtable: replaced with styler.latex.environment,
  • display.latex.multicolumn, display.latex.multicolumn_format and display.latex.multirow: replaced with styler.sparse.rows, styler.sparse.columns, styler.latex.multirow_align and styler.latex.multicol_align,
  • display.latex.repr: replaced with styler.render.repr,
  • display.max_rows and display.max_columns: replaced with styler.render.max_rows, styler.render.max_columns and styler.render.max_elements.

Note that due to this change some defaults have also changed:

  • multirow now defaults to True.
  • multirow_align defaults to "r" instead of "l".
  • multicol_align defaults to "r" instead of "l".
  • escape now defaults to False.

Note that the behaviour of _repr_latex_ has also changed. Previously, setting display.latex.repr would generate LaTeX only when using nbconvert on a Jupyter Notebook, and not when the user was running the notebook itself. Now the styler.render.repr option allows control of the specific output within Jupyter Notebooks (not just on nbconvert). See :issue:`39911`.
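
A minimal call with the new render engine (jinja2 must be installed; the small frame here is just for illustration and shows the new defaults):

.. ipython:: python

   df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
   print(df.to_latex())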

Increased minimum versions for dependencies

Some minimum supported versions of dependencies were updated. If installed, we now require:

Package             Minimum Version  Required  Changed
mypy (dev)          1.0                        X
pytest (dev)        7.0.0                      X
pytest-xdist (dev)  2.2.0                      X
hypothesis (dev)    6.34.2                     X
python-dateutil     2.8.2            X         X
tzdata              2022.1           X         X

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package      Minimum Version  Changed
pyarrow      7.0.0            X
matplotlib   3.6.1            X
fastparquet  0.6.3            X
xarray       0.21.0           X

See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.

Datetimes are now parsed with a consistent format

In the past, :func:`to_datetime` guessed the format for each element independently. This was appropriate for some cases where elements had mixed date formats; however, it would regularly cause problems when users expected a consistent format but the function would switch formats between elements. As of version 2.0.0, parsing uses a consistent format, determined by the first non-NA value (unless the user specifies a format, in which case that is used).

Old behavior:

In [1]: ser = pd.Series(['13-01-2000', '12-01-2000'])
In [2]: pd.to_datetime(ser)
Out[2]:
0   2000-01-13
1   2000-12-01
dtype: datetime64[ns]

New behavior:

.. ipython:: python
    :okwarning:

    ser = pd.Series(['13-01-2000', '12-01-2000'])
    pd.to_datetime(ser)

Note that this affects :func:`read_csv` as well.
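
For example, a minimal sketch with :func:`read_csv` (pandas may warn that it inferred a day-first format from the first value):

import io
data = io.StringIO("date\n13-01-2000\n12-01-2000\n")
pd.read_csv(data, parse_dates=["date"])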

If you still need to parse dates with inconsistent formats, you can use format='mixed' (possibly alongside dayfirst):

ser = pd.Series(['13-01-2000', '12 January 2000'])
pd.to_datetime(ser, format='mixed', dayfirst=True)

or, if your formats are all ISO8601 (but possibly not identically formatted):

ser = pd.Series(['2020-01-01', '2020-01-01 03:00'])
pd.to_datetime(ser, format='ISO8601')

Other API changes

Note

A current PDEP proposes the deprecation and removal of the keywords inplace and copy for all but a small subset of methods from the pandas API. The discussion is taking place on the pandas issue tracker. These keywords won't be necessary anymore in the context of Copy-on-Write. If the proposal is accepted, both keywords would be deprecated in the next release of pandas, and removed in pandas 3.0.

Deprecations

Removal of prior version deprecations/changes

Performance improvements

Bug fixes

Categorical

Datetimelike

Timedelta

Timezones

Numeric

Conversion

Strings

Interval

Indexing

Missing

MultiIndex

I/O

Period

Plotting

Groupby/resample/rolling

Reshaping

Sparse

ExtensionArray

Styler

Metadata

Other

Contributors

.. contributors:: v1.5.0rc0..v2.0.0