Skip to content

Latest commit

 

History

History
214 lines (141 loc) · 5.61 KB

02_read_write.rst

File metadata and controls

214 lines (141 loc) · 5.61 KB

{{ header }}

.. ipython:: python

    import pandas as pd

Data used for this tutorial:

How do I read and write tabular data?

  • I want to analyze the Titanic passenger data, available as a CSV file.

    .. ipython:: python
    
        titanic = pd.read_csv("data/titanic.csv")
    
    

    pandas provides the :func:`read_csv` function to read data stored as a csv file into a pandas DataFrame. pandas supports many different file formats or data sources out of the box (csv, excel, sql, json, parquet, …), each of them with the prefix read_*.

Make sure to always have a check on the data after reading in the data. When displaying a DataFrame, the first and last 5 rows will be shown by default:

.. ipython:: python

    titanic

  • I want to see the first 8 rows of a pandas DataFrame.

    .. ipython:: python
    
        titanic.head(8)
    
    

    To see the first N rows of a DataFrame, use the :meth:`~DataFrame.head` method with the required number of rows (in this case 8) as argument.

Note

Interested in the last N rows instead? pandas also provides a :meth:`~DataFrame.tail` method. For example, titanic.tail(10) will return the last 10 rows of the DataFrame.

A check on how pandas interpreted each of the column data types can be done by requesting the pandas dtypes attribute:

.. ipython:: python

    titanic.dtypes

For each of the columns, the used data type is enlisted. The data types in this DataFrame are integers (int64), floats (float64) and strings (object).

Note

When asking for the dtypes, no parentheses () are used! dtypes is an attribute of a DataFrame and Series. Attributes of a DataFrame or Series do not need (). Attributes represent a characteristic of a DataFrame/Series, whereas methods (which require parentheses ()) do something with the DataFrame/Series as introduced in the :ref:`first tutorial <10min_tut_01_tableoriented>`.

  • My colleague requested the Titanic data as a spreadsheet.

    Note

    If you want to use :func:`~pandas.to_excel` and :func:`~pandas.read_excel`, you need to install an Excel reader as outlined in the :ref:`Excel files <install.excel_dependencies>` section of the installation documentation.

    .. ipython:: python
    
        titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)
    
    

    Whereas read_* functions are used to read data to pandas, the to_* methods are used to store data. The :meth:`~DataFrame.to_excel` method stores the data as an excel file. In the example here, the sheet_name is named passengers instead of the default Sheet1. By setting index=False the row index labels are not saved in the spreadsheet.

The equivalent read function :meth:`~DataFrame.read_excel` will reload the data to a DataFrame:

.. ipython:: python

    titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers")

.. ipython:: python

    titanic.head()

.. ipython:: python
   :suppress:

   import os

   os.remove("titanic.xlsx")

  • I’m interested in a technical summary of a DataFrame

    .. ipython:: python
    
        titanic.info()
    
    
    

    The method :meth:`~DataFrame.info` provides technical information about a DataFrame, so let’s explain the output in more detail:

    • It is indeed a :class:`DataFrame`.
    • There are 891 entries, i.e. 891 rows.
    • Each row has a row label (aka the index) with values ranging from 0 to 890.
    • The table has 12 columns. Most columns have a value for each of the rows (all 891 values are non-null). Some columns do have missing values and less than 891 non-null values.
    • The columns Name, Sex, Cabin and Embarked consist of textual data (strings, aka object). The other columns are numerical data, some of them are whole numbers (integer) and others are real numbers (float).
    • The kind of data (characters, integers, …) in the different columns are summarized by listing the dtypes.
    • The approximate amount of RAM used to hold the DataFrame is provided as well.

REMEMBER

  • Getting data in to pandas from many different file formats or data sources is supported by read_* functions.
  • Exporting data out of pandas is provided by different to_* methods.
  • The head/tail/info methods and the dtypes attribute are convenient for a first check.
To user guide

For a complete overview of the input and output possibilities from and to pandas, see the user guide section about :ref:`reader and writer functions <io>`.