
What's new in 1.5.0 (??)

These are the changes in pandas 1.5.0. See :ref:`release` for a full changelog including other versions of pandas.

{{ header }}

Enhancements

DataFrame exchange protocol implementation

Pandas now implements the DataFrame exchange API spec. See the full details on the API at https://data-apis.org/dataframe-protocol/latest/index.html

The protocol consists of two parts:

  • New method :meth:`DataFrame.__dataframe__` which produces the exchange object. It effectively "exports" the pandas DataFrame as an exchange object, so any other library that implements the protocol can "import" that dataframe without knowing anything about the producer except that it produces an exchange object.
  • New function :func:`pandas.api.exchange.from_dataframe` which can take an arbitrary exchange object from any conformant library and construct a Pandas DataFrame out of it.
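
A minimal sketch of the round trip (the column names are illustrative):

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

    # "Export": produce the exchange object from the pandas DataFrame.
    exchange_object = df.__dataframe__()

    # "Import": rebuild a pandas DataFrame from any conformant exchange object.
    df_roundtrip = pd.api.exchange.from_dataframe(exchange_object)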

Styler

The most notable development is the new method :meth:`.Styler.concat`, which allows adding customised footer rows to visualise additional calculations on the data, e.g. totals and counts (:issue:`43875`, :issue:`46186`)

Additionally there is an alternative output method :meth:`.Styler.to_string`, which allows using the Styler's formatting methods to create, for example, CSVs (:issue:`44502`).
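
A sketch of both methods (the sales column and the "sum" footer are illustrative):

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"sales": [10, 20, 30]})

    # Append a customised "sum" footer row computed from the data.
    styled = df.style.concat(df.agg(["sum"]).style)

    # Render with the Styler's formatting applied, e.g. as CSV-like text.
    csv_text = df.style.format(precision=2).to_string(delimiter=",")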

Minor feature improvements are:

Control of index with group_keys in :meth:`DataFrame.resample`

The argument group_keys has been added to the method :meth:`DataFrame.resample`. As with :meth:`DataFrame.groupby`, this argument controls whether each group is added to the index in the resample result when :meth:`.Resampler.apply` is used.

Warning

Not specifying the group_keys argument will retain the previous behavior and emit a warning if the result will change by specifying group_keys=False. In a future version of pandas, not specifying group_keys will default to the same behavior as group_keys=False.

.. ipython:: python

    df = pd.DataFrame(
        {'a': range(6)},
        index=pd.date_range("2021-01-01", periods=6, freq="8H")
    )
    df.resample("D", group_keys=True).apply(lambda x: x)
    df.resample("D", group_keys=False).apply(lambda x: x)

Previously, the resulting index would depend upon the values returned by apply, as seen in the following example.

In [1]: # pandas 1.3
In [2]: df.resample("D").apply(lambda x: x)
Out[2]:
                     a
2021-01-01 00:00:00  0
2021-01-01 08:00:00  1
2021-01-01 16:00:00  2
2021-01-02 00:00:00  3
2021-01-02 08:00:00  4
2021-01-02 16:00:00  5

In [3]: df.resample("D").apply(lambda x: x.reset_index())
Out[3]:
                           index  a
2021-01-01 0 2021-01-01 00:00:00  0
           1 2021-01-01 08:00:00  1
           2 2021-01-01 16:00:00  2
2021-01-02 0 2021-01-02 00:00:00  3
           1 2021-01-02 08:00:00  4
           2 2021-01-02 16:00:00  5

Reading directly from TAR archives

I/O methods like :func:`read_csv` or :meth:`DataFrame.to_json` now allow reading and writing directly on TAR archives (:issue:`44787`).

df = pd.read_csv("./movement.tar.gz")
# ...
df.to_csv("./out.tar.gz")

This supports .tar, .tar.gz, .tar.bz2 and .tar.xz archives. The compression method is inferred from the filename. If it cannot be inferred, use the compression argument:

df = pd.read_csv(some_file_obj, compression={"method": "tar", "mode": "r:gz"}) # noqa F821

(mode being one of tarfile.open's modes: https://docs.python.org/3/library/tarfile.html#tarfile.open)
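
Writing works analogously. A sketch, assuming the archive_name key (which names the file inside the archive) mirrors the existing zip option:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2]})

    # "w:gz" is one of tarfile.open's write modes (gzip-compressed tar);
    # "archive_name" (an assumption here) names the CSV member in the archive.
    df.to_csv(
        "./out.tar.gz",
        index=False,
        compression={"method": "tar", "mode": "w:gz", "archive_name": "out.csv"},
    )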

Other enhancements

Notable bug fixes

These are bug fixes that might have notable behavior changes.

Using dropna=True with groupby transforms

A transform is an operation whose result has the same size as its input. When the result is a :class:`DataFrame` or :class:`Series`, it is also required that the index of the result matches that of the input. In pandas 1.4, using :meth:`.DataFrameGroupBy.transform` or :meth:`.SeriesGroupBy.transform` with null values in the groups and dropna=True gave incorrect results. As demonstrated by the examples below, the incorrect results either contained incorrect values or did not have the same index as the input.

.. ipython:: python

    df = pd.DataFrame({'a': [1, 1, np.nan], 'b': [2, 3, 4]})

Old behavior:

In [3]: # Value in the last row should be np.nan
        df.groupby('a', dropna=True).transform('sum')
Out[3]:
   b
0  5
1  5
2  5

In [3]: # Should have one additional row with the value np.nan
        df.groupby('a', dropna=True).transform(lambda x: x.sum())
Out[3]:
   b
0  5
1  5

In [3]: # The value in the last row is np.nan interpreted as an integer
        df.groupby('a', dropna=True).transform('ffill')
Out[3]:
                     b
0                    2
1                    3
2 -9223372036854775808

In [3]: # Should have one additional row with the value np.nan
        df.groupby('a', dropna=True).transform(lambda x: x)
Out[3]:
   b
0  2
1  3

New behavior:

.. ipython:: python

    df.groupby('a', dropna=True).transform('sum')
    df.groupby('a', dropna=True).transform(lambda x: x.sum())
    df.groupby('a', dropna=True).transform('ffill')
    df.groupby('a', dropna=True).transform(lambda x: x)

Serializing tz-naive Timestamps with to_json() with iso_dates=True

:meth:`DataFrame.to_json`, :meth:`Series.to_json`, and :meth:`Index.to_json` would incorrectly localize DatetimeArrays/DatetimeIndexes with tz-naive Timestamps to UTC. (:issue:`38760`)

Note that this patch does not fix the localization of tz-aware Timestamps to UTC upon serialization. (Related issue :issue:`12997`)

Old Behavior

.. ipython:: python

    index = pd.date_range(
        start='2020-12-28 00:00:00',
        end='2020-12-28 02:00:00',
        freq='1H',
    )
    a = pd.Series(
        data=range(3),
        index=index,
    )

In [4]: a.to_json(date_format='iso')
Out[4]: '{"2020-12-28T00:00:00.000Z":0,"2020-12-28T01:00:00.000Z":1,"2020-12-28T02:00:00.000Z":2}'

In [5]: pd.read_json(a.to_json(date_format='iso'), typ="series").index == a.index
Out[5]: array([False, False, False])

New Behavior

.. ipython:: python

    a.to_json(date_format='iso')
    # Roundtripping now works
    pd.read_json(a.to_json(date_format='iso'), typ="series").index == a.index

Backwards incompatible API changes

read_xml now supports dtype, converters, and parse_dates

Similar to other IO methods, :func:`pandas.read_xml` now supports assigning specific dtypes to columns, applying converter methods, and parsing dates (:issue:`43567`).

.. ipython:: python

    xml_dates = """<?xml version='1.0' encoding='utf-8'?>
    <data>
      <row>
        <shape>square</shape>
        <degrees>00360</degrees>
        <sides>4.0</sides>
        <date>2020-01-01</date>
       </row>
      <row>
        <shape>circle</shape>
        <degrees>00360</degrees>
        <sides/>
        <date>2021-01-01</date>
      </row>
      <row>
        <shape>triangle</shape>
        <degrees>00180</degrees>
        <sides>3.0</sides>
        <date>2022-01-01</date>
      </row>
    </data>"""

    df = pd.read_xml(
        xml_dates,
        dtype={'sides': 'Int64'},
        converters={'degrees': str},
        parse_dates=['date']
    )
    df
    df.dtypes

read_xml now supports large XML using iterparse

For very large XML files that can range from hundreds of megabytes to gigabytes, :func:`pandas.read_xml` now supports parsing such sizeable files using lxml's iterparse and etree's iterparse, which are memory-efficient methods to iterate through an XML tree and extract specific elements and attributes without holding the entire tree in memory (:issue:`45442`).

In [1]: df = pd.read_xml(
   ...:     "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
   ...:     iterparse={"page": ["title", "ns", "id"]},
   ...: )

In [2]: df
Out[2]:
                                                     title   ns        id
0                                       Gettysburg Address    0     21450
1                                                Main Page    0     42950
2                            Declaration by United Nations    0      8435
3             Constitution of the United States of America    0      8435
4                     Declaration of Independence (Israel)    0     17858
...                                                    ...  ...       ...
3578760               Page:Black cat 1897 07 v2 n10.pdf/17  104    219649
3578761               Page:Black cat 1897 07 v2 n10.pdf/43  104    219649
3578762               Page:Black cat 1897 07 v2 n10.pdf/44  104    219649
3578763      The History of Tom Jones, a Foundling/Book IX    0  12084291
3578764  Page:Shakespeare of Stratford (1926) Yale.djvu/91  104     21450

[3578765 rows x 3 columns]

Increased minimum versions for dependencies

Some minimum supported versions of dependencies were updated. If installed, we now require:

=============== ================ ========== =========
Package         Minimum Version  Required   Changed
=============== ================ ========== =========
mypy (dev)      0.960                       X
beautifulsoup4  4.9.3                       X
blosc           1.21.0                      X
bottleneck      1.3.2                       X
fsspec          2021.05.0                   X
hypothesis      6.13.0                      X
gcsfs           2021.05.0                   X
jinja2          3.0.0                       X
lxml            4.6.3                       X
numba           0.53.1                      X
numexpr         2.7.3                       X
openpyxl        3.0.7                       X
pandas-gbq      0.15.0                      X
psycopg2        2.8.6                       X
pymysql         1.0.2                       X
pyreadstat      1.1.2                       X
pyxlsb          1.0.8                       X
s3fs            2021.05.0                   X
scipy           1.7.1                       X
sqlalchemy      1.4.16                      X
tabulate        0.8.9                       X
xarray          0.19.0                      X
xlsxwriter      1.4.3                       X
=============== ================ ========== =========

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

+----------------+-----------------+---------+
| Package        | Minimum Version | Changed |
+================+=================+=========+
|                |                 | X       |
+----------------+-----------------+---------+

See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.

Other API changes

Deprecations

In a future version, integer slicing on a :class:`Series` with an :class:`Int64Index` or :class:`RangeIndex` will be treated as label-based, not positional. This will make the behavior consistent with other :meth:`Series.__getitem__` and :meth:`Series.__setitem__` behaviors (:issue:`45162`).

For example:

.. ipython:: python

   ser = pd.Series([1, 2, 3, 4, 5], index=[2, 3, 5, 7, 11])

In the old behavior, ser[2:4] treats the slice as positional:

Old behavior:

In [3]: ser[2:4]
Out[3]:
5    3
7    4
dtype: int64

In a future version, this will be treated as label-based:

Future behavior:

In [4]: ser.loc[2:4]
Out[4]:
2    1
3    2
dtype: int64

To retain the old behavior, use series.iloc[i:j]. To get the future behavior, use series.loc[i:j].

Slicing on a :class:`DataFrame` will not be affected.

All attributes of :class:`ExcelWriter` were previously documented as not public. However, some third party Excel engines documented accessing ExcelWriter.book or ExcelWriter.sheets, and users were utilizing these and possibly other attributes. Previously these attributes were not safe to use; e.g. modifications to ExcelWriter.book would not update ExcelWriter.sheets and vice versa. In order to support this, pandas has made some attributes public and improved their implementations so that they may now be safely used (:issue:`45572`).

The following attributes are now public and considered safe to access.

  • book
  • check_extension
  • close
  • date_format
  • datetime_format
  • engine
  • if_sheet_exists
  • sheets
  • supported_extensions

The following attributes have been deprecated. They now raise a FutureWarning when accessed and will be removed in a future version. Users should be aware that their usage is considered unsafe, and can lead to unexpected results.

  • cur_sheet
  • handles
  • path
  • save
  • write_cells

See the documentation of :class:`ExcelWriter` for further details.
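
A sketch of the now-supported usage, assuming the openpyxl engine is installed:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2]})

    with pd.ExcelWriter("report.xlsx", engine="openpyxl") as writer:
        df.to_excel(writer, sheet_name="data")
        book = writer.book          # underlying openpyxl Workbook, kept in sync
        ws = writer.sheets["data"]  # worksheet handle matching the workbook
        ws["D1"] = "note"           # engine-level modification is now safe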

Using group_keys with transformers in :meth:`.GroupBy.apply`

In previous versions of pandas, if it was inferred that the function passed to :meth:`.GroupBy.apply` was a transformer (i.e. the resulting index was equal to the input index), the group_keys argument of :meth:`DataFrame.groupby` and :meth:`Series.groupby` was ignored and the group keys would never be added to the index of the result. In the future, the group keys will be added to the index when the user specifies group_keys=True.

As group_keys=True is the default value of :meth:`DataFrame.groupby` and :meth:`Series.groupby`, not specifying group_keys with a transformer will raise a FutureWarning. This can be silenced and the previous behavior retained by specifying group_keys=False.
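
A sketch of pinning the behavior explicitly (the column names are illustrative):

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"g": ["a", "a", "b"], "v": [1, 2, 3]})

    # An identity function is inferred as a transformer
    # (the result index equals the input index).
    df.groupby("g", group_keys=False).apply(lambda x: x)  # retains old behavior
    df.groupby("g", group_keys=True).apply(lambda x: x)   # future: keys in index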

Try operating inplace when setting values with loc and iloc

Most of the time setting values with frame.iloc attempts to set values in-place, only falling back to inserting a new array if necessary. In the past, setting entire columns has been an exception to this rule:

.. ipython:: python

   values = np.arange(4).reshape(2, 2)
   df = pd.DataFrame(values)
   ser = df[0]

Old behavior:

In [3]: df.iloc[:, 0] = np.array([10, 11])
In [4]: ser
Out[4]:
0    0
1    2
Name: 0, dtype: int64

This behavior is deprecated. In a future version, setting an entire column with iloc will attempt to operate inplace.

Future behavior:

In [3]: df.iloc[:, 0] = np.array([10, 11])
In [4]: ser
Out[4]:
0    10
1    11
Name: 0, dtype: int64

To get the old behavior, use :meth:`DataFrame.__setitem__` directly:

Future behavior:

In [5]: df[0] = np.array([21, 31])
In [6]: ser
Out[6]:
0    10
1    11
Name: 0, dtype: int64

In the case where df.columns is not unique, use :meth:`DataFrame.isetitem`:

Future behavior:

In [7]: df.columns = ["A", "A"]
In [8]: df.isetitem(0, np.array([21, 31]))
In [9]: ser
Out[9]:
0    10
1    11
Name: 0, dtype: int64

numeric_only default value

Across the DataFrame and DataFrameGroupBy operations such as min, sum, and idxmax, the default value of the numeric_only argument, if it exists at all, was inconsistent. Furthermore, operations with the default value None can lead to surprising results. (:issue:`46560`)

In [1]: df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

In [2]: # Reading the next line without knowing the contents of df, one would
        # expect the result to contain the products for both columns a and b.
        df[["a", "b"]].prod()
Out[2]:
a    2
dtype: int64

To avoid this behavior, specifying the value numeric_only=None has been deprecated and will be removed in a future version of pandas. In the future, all operations with a numeric_only argument will default to False. Users should either call the operation only with columns that can be operated on, or specify numeric_only=True to operate only on Boolean, integer, and float columns.

In order to support the transition to the new behavior, the following methods have gained the numeric_only argument.
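
A sketch of the explicit styles that avoid the deprecated None default:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

    # Either call the operation only on columns it can operate on...
    df[["a"]].prod()

    # ...or state the intent explicitly.
    df.prod(numeric_only=True)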

Other Deprecations

Performance improvements

Bug fixes

Categorical

Datetimelike

Timedelta

Time Zones

Numeric

  • Bug in operations with array-likes with dtype="boolean" and :attr:`NA` incorrectly altering the array in-place (:issue:`45421`)
  • Bug in division, pow and mod operations on array-likes with dtype="boolean" not being like their np.bool_ counterparts (:issue:`46063`)
  • Bug in multiplying a :class:`Series` with IntegerDtype or FloatingDtype by an array-like with timedelta64[ns] dtype incorrectly raising (:issue:`45622`)

Conversion

Strings

Interval

Indexing

Missing

MultiIndex

I/O

Period

Plotting

Groupby/resample/rolling

Reshaping

Sparse

ExtensionArray

Styler

Metadata

Other

Contributors