{{ header }}
.. ipython:: python import pandas as pd
I want to analyze the Titanic passenger data, available as a CSV file.
.. ipython:: python titanic = pd.read_csv("data/titanic.csv")
pandas provides the :func:`read_csv` function to read data stored as a csv file into a pandas
DataFrame
. pandas supports many different file formats or data sources out of the box (csv, excel, sql, json, parquet, …), each of them with the prefixread_*
.
Make sure to always have a check on the data after reading in the
data. When displaying a DataFrame
, the first and last 5 rows will be
shown by default:
.. ipython:: python titanic
I want to see the first 8 rows of a pandas DataFrame.
.. ipython:: python titanic.head(8)
To see the first N rows of a
DataFrame
, use the :meth:`~DataFrame.head` method with the required number of rows (in this case 8) as argument.
Note
Interested in the last N rows instead? pandas also provides a
:meth:`~DataFrame.tail` method. For example, titanic.tail(10)
will return the last
10 rows of the DataFrame.
A check on how pandas interpreted each of the column data types can be
done by requesting the pandas dtypes
attribute:
.. ipython:: python titanic.dtypes
For each of the columns, the used data type is enlisted. The data types
in this DataFrame
are integers (int64
), floats (float64
) and
strings (object
).
Note
When asking for the dtypes
, no parentheses ()
are used!
dtypes
is an attribute of a DataFrame
and Series
. Attributes
of a DataFrame
or Series
do not need ()
. Attributes
represent a characteristic of a DataFrame
/Series
, whereas
methods (which require parentheses ()
) do something with the
DataFrame
/Series
as introduced in the :ref:`first tutorial <10min_tut_01_tableoriented>`.
My colleague requested the Titanic data as a spreadsheet.
Note
If you want to use :func:`~pandas.to_excel` and :func:`~pandas.read_excel`, you need to install an Excel reader as outlined in the :ref:`Excel files <install.excel_dependencies>` section of the installation documentation.
.. ipython:: python titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)
Whereas
read_*
functions are used to read data to pandas, theto_*
methods are used to store data. The :meth:`~DataFrame.to_excel` method stores the data as an excel file. In the example here, thesheet_name
is named passengers instead of the default Sheet1. By settingindex=False
the row index labels are not saved in the spreadsheet.
The equivalent read function :meth:`~DataFrame.read_excel` will reload the data to a
DataFrame
:
.. ipython:: python titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers")
.. ipython:: python titanic.head()
.. ipython:: python :suppress: import os os.remove("titanic.xlsx")
I’m interested in a technical summary of a
DataFrame
.. ipython:: python titanic.info()
The method :meth:`~DataFrame.info` provides technical information about a
DataFrame
, so let’s explain the output in more detail:- It is indeed a :class:`DataFrame`.
- There are 891 entries, i.e. 891 rows.
- Each row has a row label (aka the
index
) with values ranging from 0 to 890. - The table has 12 columns. Most columns have a value for each of the
rows (all 891 values are
non-null
). Some columns do have missing values and less than 891non-null
values. - The columns
Name
,Sex
,Cabin
andEmbarked
consist of textual data (strings, akaobject
). The other columns are numerical data, some of them are whole numbers (integer
) and others are real numbers (float
). - The kind of data (characters, integers, …) in the different columns
are summarized by listing the
dtypes
. - The approximate amount of RAM used to hold the DataFrame is provided as well.
- Getting data in to pandas from many different file formats or data
sources is supported by
read_*
functions. - Exporting data out of pandas is provided by different
to_*
methods. - The
head
/tail
/info
methods and thedtypes
attribute are convenient for a first check.
For a complete overview of the input and output possibilities from and to pandas, see the user guide section about :ref:`reader and writer functions <io>`.