Commit 5d8bbf9

Merge recent changes from upstream
2 parents 7627e93 + 269e95e commit 5d8bbf9

122 files changed, with 3128 additions and 2273 deletions

Diff for: asv_bench/benchmarks/sparse.py

+2-2
Original file line number | Diff line number | Diff line change
@@ -1,4 +1,4 @@
1-
from itertools import repeat
1+
import itertools
22

33
from .pandas_vb_common import *
44
import scipy.sparse
@@ -33,7 +33,7 @@ def time_sparse_from_scipy(self):
3333
SparseDataFrame(scipy.sparse.rand(1000, 1000, 0.005))
3434

3535
def time_sparse_from_dict(self):
36-
SparseDataFrame(dict(zip(range(1000), repeat([0]))))
36+
SparseDataFrame(dict(zip(range(1000), itertools.repeat([0]))))
3737

3838

3939
class sparse_series_from_coo(object):

Diff for: asv_bench/benchmarks/timeseries.py

+1-1
@@ -56,7 +56,7 @@ def setup(self):
5656
self.no_freq = self.rng7[:50000].append(self.rng7[50002:])
5757
self.d_freq = self.rng7[:50000].append(self.rng7[50000:])
5858

59-
self.rng8 = date_range(start='1/1/1700', freq='B', periods=100000)
59+
self.rng8 = date_range(start='1/1/1700', freq='B', periods=75000)
6060
self.b_freq = self.rng8[:50000].append(self.rng8[50000:])
6161

6262
def time_add_timedelta(self):

Diff for: ci/requirements-3.6_NUMPY_DEV.build

-1
@@ -1,3 +1,2 @@
11
python=3.6*
22
pytz
3-
cython

Diff for: ci/requirements-3.6_NUMPY_DEV.build.sh

+3
@@ -14,4 +14,7 @@ pip install --pre --upgrade --timeout=60 -f $PRE_WHEELS numpy scipy
1414
# install dateutil from master
1515
pip install -U git+git://github.com/dateutil/dateutil.git
1616

17+
# cython via pip
18+
pip install cython
19+
1720
true

Diff for: doc/source/10min.rst

+2-12
@@ -11,7 +11,7 @@
1111
np.random.seed(123456)
1212
np.set_printoptions(precision=4, suppress=True)
1313
import matplotlib
14-
matplotlib.style.use('ggplot')
14+
# matplotlib.style.use('default')
1515
pd.options.display.max_rows = 15
1616
1717
#### portions of this were borrowed from the
@@ -95,17 +95,7 @@ will be completed:
9595
df2.append df2.combine_first
9696
df2.apply df2.compound
9797
df2.applymap df2.consolidate
98-
df2.as_blocks df2.convert_objects
99-
df2.asfreq df2.copy
100-
df2.as_matrix df2.corr
101-
df2.astype df2.corrwith
102-
df2.at df2.count
103-
df2.at_time df2.cov
104-
df2.axes df2.cummax
105-
df2.B df2.cummin
106-
df2.between_time df2.cumprod
107-
df2.bfill df2.cumsum
108-
df2.blocks df2.D
98+
df2.D
10999

110100
As you can see, the columns ``A``, ``B``, ``C``, and ``D`` are automatically
111101
tab completed. ``E`` is there as well; the rest of the attributes have been

Diff for: doc/source/advanced.rst

+3-1
@@ -638,9 +638,11 @@ and allows efficient indexing and storage of an index with a large number of dup
638638

639639
.. ipython:: python
640640
641+
from pandas.api.types import CategoricalDtype
642+
641643
df = pd.DataFrame({'A': np.arange(6),
642644
'B': list('aabbca')})
643-
df['B'] = df['B'].astype('category', categories=list('cab'))
645+
df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
644646
df
645647
df.dtypes
646648
df.B.cat.categories
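
Reviewer note: the hunk above replaces the keyword form `astype('category', categories=...)` with an explicit `CategoricalDtype`. A minimal standalone sketch of the new spelling follows, assuming a pandas version that ships `pandas.api.types.CategoricalDtype`; the names `categories` and `indexed` are illustrative and not part of the commit.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'A': np.arange(6),
                   'B': list('aabbca')})

# Pass the category order through the dtype object instead of astype keywords.
df['B'] = df['B'].astype(CategoricalDtype(list('cab')))

# The categories keep the order given to CategoricalDtype, not sorted order.
categories = list(df['B'].cat.categories)

# Setting the categorical column as the index produces a CategoricalIndex.
indexed = df.set_index('B')
```

The dtype object carries both `categories` and `ordered`, so the same instance can be reused wherever a `dtype` argument is accepted.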

Diff for: doc/source/api.rst

+4-1
@@ -646,7 +646,10 @@ strings and apply several methods to it. These can be accessed like
646646
Categorical
647647
~~~~~~~~~~~
648648

649-
If the Series is of dtype ``category``, ``Series.cat`` can be used to change the the categorical
649+
.. autoclass:: api.types.CategoricalDtype
650+
:members: categories, ordered
651+
652+
If the Series is of dtype ``CategoricalDtype``, ``Series.cat`` can be used to change the categorical
650653
data. This accessor is similar to the ``Series.dt`` or ``Series.str`` and has the
651654
following usable methods and properties:
652655

Diff for: doc/source/categorical.rst

+95-8
@@ -89,12 +89,22 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to
8989
df["B"] = raw_cat
9090
df
9191
92-
You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to ``astype()``:
92+
Anywhere above we passed a keyword ``dtype='category'``, we used the default behavior of
93+
94+
1. categories are inferred from the data
95+
2. categories are unordered.
96+
97+
To control those behaviors, instead of passing ``'category'``, use an instance
98+
of :class:`~pandas.api.types.CategoricalDtype`.
9399

94100
.. ipython:: python
95101
96-
s = pd.Series(["a","b","c","a"])
97-
s_cat = s.astype("category", categories=["b","c","d"], ordered=False)
102+
from pandas.api.types import CategoricalDtype
103+
104+
s = pd.Series(["a", "b", "c", "a"])
105+
cat_type = CategoricalDtype(categories=["b", "c", "d"],
106+
ordered=True)
107+
s_cat = s.astype(cat_type)
98108
s_cat
99109
100110
Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
@@ -133,6 +143,75 @@ constructor to save the factorize step during normal constructor mode:
133143
splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
134144
s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
135145
146+
.. _categorical.categoricaldtype:
147+
148+
CategoricalDtype
149+
----------------
150+
151+
.. versionchanged:: 0.21.0
152+
153+
A categorical's type is fully described by
154+
155+
1. ``categories``: a sequence of unique values and no missing values
156+
2. ``ordered``: a boolean
157+
158+
This information can be stored in a :class:`~pandas.api.types.CategoricalDtype`.
159+
The ``categories`` argument is optional, which implies that the actual categories
160+
should be inferred from whatever is present in the data when the
161+
:class:`pandas.Categorical` is created. The categories are assumed to be unordered
162+
by default.
163+
164+
.. ipython:: python
165+
166+
from pandas.api.types import CategoricalDtype
167+
168+
CategoricalDtype(['a', 'b', 'c'])
169+
CategoricalDtype(['a', 'b', 'c'], ordered=True)
170+
CategoricalDtype()
171+
172+
A :class:`~pandas.api.types.CategoricalDtype` can be used in any place pandas
173+
expects a `dtype`. For example :func:`pandas.read_csv`,
174+
:func:`pandas.DataFrame.astype`, or in the Series constructor.
175+
176+
.. note::
177+
178+
As a convenience, you can use the string ``'category'`` in place of a
179+
:class:`~pandas.api.types.CategoricalDtype` when you want the default behavior of
180+
the categories being unordered, and equal to the set values present in the
181+
array. In other words, ``dtype='category'`` is equivalent to
182+
``dtype=CategoricalDtype()``.
183+
184+
Equality Semantics
185+
~~~~~~~~~~~~~~~~~~
186+
187+
Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal
188+
whenever they have the same categories and orderedness. When comparing two
189+
unordered categoricals, the order of the ``categories`` is not considered
190+
191+
.. ipython:: python
192+
193+
c1 = CategoricalDtype(['a', 'b', 'c'], ordered=False)
194+
195+
# Equal, since order is not considered when ordered=False
196+
c1 == CategoricalDtype(['b', 'c', 'a'], ordered=False)
197+
198+
# Unequal, since the second CategoricalDtype is ordered
199+
c1 == CategoricalDtype(['a', 'b', 'c'], ordered=True)
200+
201+
All instances of ``CategoricalDtype`` compare equal to the string ``'category'``
202+
203+
.. ipython:: python
204+
205+
c1 == 'category'
206+
207+
.. warning::
208+
209+
Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``,
210+
and since all instances ``CategoricalDtype`` compare equal to ``'category'``,
211+
all instances of ``CategoricalDtype`` compare equal to a
212+
``CategoricalDtype(None, False)``, regardless of ``categories`` or
213+
``ordered``.
214+
136215
Description
137216
-----------
138217

@@ -184,7 +263,7 @@ It's also possible to pass in the categories in a specific order:
184263

185264
.. ipython:: python
186265
187-
s = pd.Series(list('babc')).astype('category', categories=list('abcd'))
266+
s = pd.Series(list('babc')).astype(CategoricalDtype(list('abcd')))
188267
s
189268
190269
# categories
@@ -301,7 +380,9 @@ meaning and certain operations are possible. If the categorical is unordered, ``
301380
302381
s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
303382
s.sort_values(inplace=True)
304-
s = pd.Series(["a","b","c","a"]).astype('category', ordered=True)
383+
s = pd.Series(["a","b","c","a"]).astype(
384+
CategoricalDtype(ordered=True)
385+
)
305386
s.sort_values(inplace=True)
306387
s
307388
s.min(), s.max()
@@ -401,9 +482,15 @@ categories or a categorical with any list-like object, will raise a TypeError.
401482

402483
.. ipython:: python
403484
404-
cat = pd.Series([1,2,3]).astype("category", categories=[3,2,1], ordered=True)
405-
cat_base = pd.Series([2,2,2]).astype("category", categories=[3,2,1], ordered=True)
406-
cat_base2 = pd.Series([2,2,2]).astype("category", ordered=True)
485+
cat = pd.Series([1,2,3]).astype(
486+
CategoricalDtype([3, 2, 1], ordered=True)
487+
)
488+
cat_base = pd.Series([2,2,2]).astype(
489+
CategoricalDtype([3, 2, 1], ordered=True)
490+
)
491+
cat_base2 = pd.Series([2,2,2]).astype(
492+
CategoricalDtype(ordered=True)
493+
)
407494
408495
cat
409496
cat_base
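
Reviewer note: the "Equality Semantics" text added in this file can be checked with a short script. This is a sketch under the assumption of a pandas release that includes `CategoricalDtype`. It deliberately leaves out the comparison against `CategoricalDtype(None, False)` described in the warning block, since that behavior may differ between releases. All variable names are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

unordered = CategoricalDtype(['a', 'b', 'c'], ordered=False)
reordered = CategoricalDtype(['b', 'c', 'a'], ordered=False)
ordered = CategoricalDtype(['a', 'b', 'c'], ordered=True)

# Unordered dtypes compare on the set of categories, so order is ignored.
eq_unordered = unordered == reordered

# Orderedness is part of the type, so these two differ.
eq_ordered = unordered == ordered

# Any CategoricalDtype compares equal to the string 'category'.
eq_string = unordered == 'category'

# Values absent from the declared categories become missing on conversion.
s_cat = pd.Series(["a", "b", "c", "a"]).astype(
    CategoricalDtype(categories=["b", "c", "d"], ordered=True)
)
missing_mask = s_cat.isna().tolist()
```

Here `"a"` is not among the declared categories, which is why the first and last elements of `s_cat` end up missing.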

Diff for: doc/source/computation.rst

+1-1
@@ -8,7 +8,7 @@
88
np.set_printoptions(precision=4, suppress=True)
99
import pandas as pd
1010
import matplotlib
11-
matplotlib.style.use('ggplot')
11+
# matplotlib.style.use('default')
1212
import matplotlib.pyplot as plt
1313
plt.close('all')
1414
pd.options.display.max_rows=15

Diff for: doc/source/cookbook.rst

+1-1
@@ -20,7 +20,7 @@
2020
pd.options.display.max_rows=15
2121
2222
import matplotlib
23-
matplotlib.style.use('ggplot')
23+
# matplotlib.style.use('default')
2424
2525
np.set_printoptions(precision=4, suppress=True)
2626

Diff for: doc/source/dsintro.rst

+1-1
@@ -10,7 +10,7 @@
1010
pd.options.display.max_rows = 15
1111
1212
import matplotlib
13-
matplotlib.style.use('ggplot')
13+
# matplotlib.style.use('default')
1414
import matplotlib.pyplot as plt
1515
plt.close('all')
1616

Diff for: doc/source/gotchas.rst

+1-1
@@ -14,7 +14,7 @@ Frequently Asked Questions (FAQ)
1414
import pandas as pd
1515
pd.options.display.max_rows = 15
1616
import matplotlib
17-
matplotlib.style.use('ggplot')
17+
# matplotlib.style.use('default')
1818
import matplotlib.pyplot as plt
1919
plt.close('all')
2020

Diff for: doc/source/groupby.rst

+3-3
@@ -10,7 +10,7 @@
1010
import pandas as pd
1111
pd.options.display.max_rows = 15
1212
import matplotlib
13-
matplotlib.style.use('ggplot')
13+
# matplotlib.style.use('default')
1414
import matplotlib.pyplot as plt
1515
plt.close('all')
1616
from collections import OrderedDict
@@ -1060,7 +1060,7 @@ To select from a DataFrame or Series the nth item, use the nth method. This is a
10601060
g.nth(-1)
10611061
g.nth(1)
10621062
1063-
If you want to select the nth not-null item, use the ``dropna`` kwarg. For a DataFrame this should be either ``'any'`` or ``'all'`` just like you would pass to dropna, for a Series this just needs to be truthy.
1063+
If you want to select the nth not-null item, use the ``dropna`` kwarg. For a DataFrame this should be either ``'any'`` or ``'all'`` just like you would pass to dropna:
10641064

10651065
.. ipython:: python
10661066
@@ -1072,7 +1072,7 @@ If you want to select the nth not-null item, use the ``dropna`` kwarg. For a Dat
10721072
g.nth(-1, dropna='any') # NaNs denote group exhausted when using dropna
10731073
g.last()
10741074
1075-
g.B.nth(0, dropna=True)
1075+
g.B.nth(0, dropna='all')
10761076
10771077
As with other methods, passing ``as_index=False``, will achieve a filtration, which returns the grouped row.
10781078

Diff for: doc/source/io.rst

+2-2
@@ -113,8 +113,8 @@ header : int or list of ints, default ``'infer'``
113113
rather than the first line of the file.
114114
names : array-like, default ``None``
115115
List of column names to use. If file contains no header row, then you should
116-
explicitly pass ``header=None``. Duplicates in this list are not allowed unless
117-
``mangle_dupe_cols=True``, which is the default.
116+
explicitly pass ``header=None``. Duplicates in this list will cause
117+
a ``UserWarning`` to be issued.
118118
index_col : int or sequence or ``False``, default ``None``
119119
Column to use as the row labels of the DataFrame. If a sequence is given, a
120120
MultiIndex is used. If you have a malformed file with delimiters at the end of

Diff for: doc/source/merging.rst

+8-3
@@ -830,8 +830,10 @@ The left frame.
830830

831831
.. ipython:: python
832832
833+
from pandas.api.types import CategoricalDtype
834+
833835
X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,)))
834-
X = X.astype('category', categories=['foo', 'bar'])
836+
X = X.astype(CategoricalDtype(categories=['foo', 'bar']))
835837
836838
left = pd.DataFrame({'X': X,
837839
'Y': np.random.choice(['one', 'two', 'three'], size=(10,))})
@@ -842,8 +844,11 @@ The right frame.
842844

843845
.. ipython:: python
844846
845-
right = pd.DataFrame({'X': pd.Series(['foo', 'bar']).astype('category', categories=['foo', 'bar']),
846-
'Z': [1, 2]})
847+
right = pd.DataFrame({
848+
'X': pd.Series(['foo', 'bar'],
849+
dtype=CategoricalDtype(['foo', 'bar'])),
850+
'Z': [1, 2]
851+
})
847852
right
848853
right.dtypes
849854
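
Reviewer note: the two hunks above build the merge keys with a shared `CategoricalDtype` rather than `astype('category', categories=...)`. A small deterministic sketch of the same pattern follows. The fixed data replaces the random choices in the docs, and `key_type` and `result` are illustrative names. It assumes that merging on keys with an identical categorical dtype keeps that dtype in the result, which is what the surrounding documentation section describes.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# One dtype instance shared by both frames, so the key columns match exactly.
key_type = CategoricalDtype(['foo', 'bar'])

left = pd.DataFrame({
    'X': pd.Series(['foo', 'bar', 'foo'], dtype=key_type),
    'Y': ['one', 'two', 'three'],
})
right = pd.DataFrame({
    'X': pd.Series(['foo', 'bar'], dtype=key_type),
    'Z': [1, 2],
})

# Inner join on the categorical key column.
result = pd.merge(left, right, on='X')
```

Constructing the key with `dtype=` in the `Series` constructor avoids a separate `astype` call and makes it harder for the two sides to drift apart.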

Diff for: doc/source/missing_data.rst

+1-1
@@ -7,7 +7,7 @@
77
import pandas as pd
88
pd.options.display.max_rows=15
99
import matplotlib
10-
matplotlib.style.use('ggplot')
10+
# matplotlib.style.use('default')
1111
import matplotlib.pyplot as plt
1212
1313
.. _missing_data:
