6 | 6 | Package overview |
7 | 7 | **************** |
8 | 8 | |
9 | | -:mod:`pandas` is an open source, BSD-licensed library providing high-performance, |
10 | | -easy-to-use data structures and data analysis tools for the `Python <https://www.python.org/>`__ |
11 | | -programming language. |
12 | | - |
13 | | -:mod:`pandas` consists of the following elements: |
14 | | - |
15 | | -* A set of labeled array data structures, the primary of which are |
16 | | - Series and DataFrame. |
17 | | -* Index objects enabling both simple axis indexing and multi-level / |
18 | | - hierarchical axis indexing. |
19 | | -* An integrated group by engine for aggregating and transforming data sets. |
20 | | -* Date range generation (date_range) and custom date offsets enabling the |
21 | | - implementation of customized frequencies. |
22 | | -* Input/Output tools: loading tabular data from flat files (CSV, delimited, |
23 | | - Excel 2003), and saving and loading pandas objects from the fast and |
24 | | - efficient PyTables/HDF5 format. |
25 | | -* Memory-efficient "sparse" versions of the standard data structures for storing |
26 | | - data that is mostly missing or mostly constant (some fixed value). |
27 | | -* Moving window statistics (rolling mean, rolling standard deviation, etc.). |
| 9 | +**pandas** is a `Python <https://www.python.org>`__ package providing fast, |
| 10 | +flexible, and expressive data structures designed to make working with |
| 11 | +"relational" or "labeled" data both easy and intuitive. It aims to be the |
| 12 | +fundamental high-level building block for doing practical, **real world** data |
| 13 | +analysis in Python. Additionally, it has the broader goal of becoming **the |
| 14 | +most powerful and flexible open source data analysis / manipulation tool |
| 15 | +available in any language**. It is already well on its way toward this goal. |
| 16 | + |
| 17 | +pandas is well suited for many different kinds of data: |
| 18 | + |
| 19 | + - Tabular data with heterogeneously-typed columns, as in an SQL table or |
| 20 | + Excel spreadsheet |
| 21 | + - Ordered and unordered (not necessarily fixed-frequency) time series data. |
| 22 | + - Arbitrary matrix data (homogeneously typed or heterogeneous) with row and |
| 23 | + column labels |
| 24 | + - Any other form of observational / statistical data sets. The data actually |
| 25 | + need not be labeled at all to be placed into a pandas data structure |
| 26 | + |
| 27 | +The two primary data structures of pandas, :class:`Series` (1-dimensional) |
| 28 | +and :class:`DataFrame` (2-dimensional), handle the vast majority of typical use |
| 29 | +cases in finance, statistics, social science, and many areas of |
| 30 | +engineering. For R users, :class:`DataFrame` provides everything that R's |
| 31 | +``data.frame`` provides and much more. pandas is built on top of `NumPy |
| 32 | +<https://www.numpy.org>`__ and is intended to integrate well within a scientific |
| 33 | +computing environment with many other third-party libraries. |
| 34 | + |
| 35 | +Here are just a few of the things that pandas does well: |
| 36 | + |
| 37 | + - Easy handling of **missing data** (represented as NaN) in floating point as |
| 38 | + well as non-floating point data |
| 39 | + - Size mutability: columns can be **inserted and deleted** from DataFrame and |
| 40 | + higher dimensional objects |
| 41 | + - Automatic and explicit **data alignment**: objects can be explicitly |
| 42 | + aligned to a set of labels, or the user can simply ignore the labels and |
| 43 | + let :class:`Series`, :class:`DataFrame`, etc. automatically align the data for you in |
| 44 | + computations |
| 45 | + - Powerful, flexible **group by** functionality to perform |
| 46 | + split-apply-combine operations on data sets, for both aggregating and |
| 47 | + transforming data |
| 48 | + - **Easy conversion** of ragged, differently-indexed data in other |
| 49 | + Python and NumPy data structures into DataFrame objects |
| 50 | + - Intelligent label-based **slicing**, **fancy indexing**, and **subsetting** |
| 51 | + of large data sets |
| 52 | + - Intuitive **merging** and **joining** of data sets |
| 53 | + - Flexible **reshaping** and pivoting of data sets |
| 54 | + - **Hierarchical** labeling of axes (possible to have multiple labels per |
| 55 | + tick) |
| 56 | + - Robust IO tools for loading data from **flat files** (CSV and delimited), |
| 57 | + Excel files, databases, and saving / loading data from the ultrafast **HDF5 |
| 58 | + format** |
| 59 | + - **Time series**-specific functionality: date range generation and frequency |
| 60 | + conversion, moving window statistics, moving window linear regressions, |
| 61 | + date shifting and lagging, etc. |
| 62 | + |
| 63 | +Many of these principles are here to address the shortcomings frequently |
| 64 | +experienced using other languages / scientific research environments. For data |
| 65 | +scientists, working with data is typically divided into multiple stages: |
| 66 | +munging and cleaning data, analyzing / modeling it, then organizing the results |
| 67 | +of the analysis into a form suitable for plotting or tabular display. pandas |
| 68 | +is the ideal tool for all of these tasks. |
| 69 | + |
| 70 | +Some other notes |
| 71 | + |
| 72 | + - pandas is **fast**. Many of the low-level algorithmic bits have been |
| 73 | + extensively tweaked in `Cython <https://cython.org>`__ code. However, as with |
| 74 | + anything else, generalization usually sacrifices performance. So if you focus |
| 75 | + on one feature for your application you may be able to create a faster |
| 76 | + specialized tool. |
| 77 | + |
| 78 | + - pandas is a dependency of `statsmodels |
| 79 | + <https://www.statsmodels.org/stable/index.html>`__, making it an important part of the |
| 80 | + statistical computing ecosystem in Python. |
| 81 | + |
| 82 | + - pandas has been used extensively in production in financial applications. |
28 | 83 | |
29 | 84 | Data Structures |
30 | 85 | --------------- |
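
A few of the features listed in the added overview text (automatic data alignment, missing-data handling, group by, and merging) can be illustrated with a short sketch. The variable names and sample values below are made up for illustration and are not part of the commit; the sketch assumes a reasonably recent pandas installation.

```python
import pandas as pd

# Automatic alignment: arithmetic matches on index labels, not on position.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])
total = a + b  # "x" has no partner in b, so the result there is NaN

# Missing data: NaN marks the gap and can be filled (or dropped) explicitly.
filled = total.fillna(0.0)

# Group by: split rows by key, apply an aggregation, combine the results.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
sums = df.groupby("key")["value"].sum()

# Merging: join two tables on a shared column.
labels = pd.DataFrame({"key": ["a", "b"], "label": ["alpha", "beta"]})
merged = df.merge(labels, on="key")
```

Here ``total`` carries a NaN for the unmatched label rather than raising an error, which is the behavior the overview describes as letting pandas "automatically align the data for you in computations".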