
Commit 38ef59d

Authored and committed by xhlulu
ML Docs: Update PCA page
1 parent 46d93de commit 38ef59d

File tree

1 file changed: +17, -2 lines


doc/python/ml-pca.md

Lines changed: 17 additions & 2 deletions
@@ -20,7 +20,7 @@ jupyter:
 name: python
 nbconvert_exporter: python
 pygments_lexer: ipython3
-version: 3.7.6
+version: 3.7.7
 plotly:
 description: Visualize Principal Component Analysis (PCA) of your high-dimensional
 data in Python with Plotly.
@@ -34,12 +34,21 @@ jupyter:
 thumbnail: thumbnail/ml-pca.png
 ---
 
+This page first shows how to visualize higher-dimensional data using various Plotly figures combined with dimensionality reduction (aka projection). Then, we dive into the specific details of our projection algorithm.
+
+We will use [Scikit-learn](https://scikit-learn.org/) to load one of the datasets and apply dimensionality reduction. Scikit-learn is a popular Machine Learning (ML) library that offers various tools for creating and training ML algorithms, feature engineering, data cleaning, and evaluating and testing models. It was designed to be accessible, and to work seamlessly with popular libraries like NumPy and Pandas.
+
+
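As a rough sketch of that workflow (the page's own code may differ), one could load the Iris dataset with scikit-learn, project it onto two principal components, and plot the projection with Plotly Express; the variable names below are only illustrative:

```python
import plotly.express as px
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load the Iris dataset (4 features) and project it onto 2 principal components
iris = load_iris()
components = PCA(n_components=2).fit_transform(iris.data)

# Plot the 2D projection, colored by species
fig = px.scatter(
    x=components[:, 0],
    y=components[:, 1],
    color=[iris.target_names[t] for t in iris.target],
    labels={"x": "PC 1", "y": "PC 2"},
)
fig.show()
```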
 ## High-dimensional PCA Analysis with `px.scatter_matrix`
 
+The dimensionality reduction technique we will be using is called [Principal Component Analysis (PCA)](https://scikit-learn.org/stable/modules/decomposition.html#pca). It is a powerful technique that arises from linear algebra and probability theory. In essence, it computes a matrix that represents the variation of your data ([covariance matrix/eigenvectors][covmatrix]) and ranks these directions by their relevance (explained variance/eigenvalues). For a video tutorial, see [this segment on PCA](https://youtu.be/rng04VJxUt4?t=98) from the Coursera ML course.
+
+[covmatrix]: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues#:~:text=As%20it%20is%20a%20square%20symmetric%20matrix%2C%20it%20can%20be%20diagonalized%20by%20choosing%20a%20new%20orthogonal%20coordinate%20system%2C%20given%20by%20its%20eigenvectors%20(incidentally%2C%20this%20is%20called%20spectral%20theorem)%3B%20corresponding%20eigenvalues%20will%20then%20be%20located%20on%20the%20diagonal.%20In%20this%20new%20coordinate%20system%2C%20the%20covariance%20matrix%20is%20diagonal%20and%20looks%20like%20that%3A
+
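To make the eigenvector/eigenvalue vocabulary concrete, here is a minimal sketch (not taken from the page) showing that scikit-learn's `PCA` exposes exactly these quantities, with the covariance eigenvalues computed separately for comparison:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# components_ are the eigenvectors of the covariance matrix;
# explained_variance_ holds the corresponding eigenvalues.
pca = PCA().fit(X)

# The same eigenvalues, recovered directly from the covariance matrix
eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

print(pca.explained_variance_)        # eigenvalues, largest first
print(eigenvalues)                    # matches the line above
print(pca.explained_variance_ratio_)  # each eigenvalue divided by the total variance
```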
 
 ### Visualize all the original dimensions
 
-First, let's plot all the features and see how the `species` in the Iris dataset are grouped. In a [splom](https://plot.ly/python/splom/), each subplot displays a feature against another, so if we have $N$ features we have a $N \times N$ matrix.
+First, let's plot all the features and see how the `species` in the Iris dataset are grouped. In a [Scatter Plot Matrix (splom)](https://plot.ly/python/splom/), each subplot displays a feature against another, so if we have $N$ features we have a $N \times N$ matrix.
 
 In our example, we are plotting all 4 features from the Iris dataset, thus we can see how `sepal_width` is compared against `sepal_length`, then against `petal_width`, and so forth. Keep in mind how some pairs of features can more easily separate different species.
 
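A splom like the one described above can be drawn with `px.scatter_matrix`; a minimal sketch using the Iris data bundled with Plotly Express (the page's own example may differ in details):

```python
import plotly.express as px

# Iris ships with Plotly Express as a tidy DataFrame
df = px.data.iris()

# One subplot per pair of features: 4 features -> a 4x4 scatter plot matrix
fig = px.scatter_matrix(
    df,
    dimensions=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    color="species",
)
fig.show()
```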
@@ -169,6 +178,8 @@ fig.show()
 
 Often, you might be interested in seeing how much variance PCA is able to explain as you increase the number of components, in order to decide how many dimensions to ultimately keep or analyze. This example shows you how to quickly plot the cumulative sum of explained variance for a high-dimensional dataset like [Diabetes](https://scikit-learn.org/stable/datasets/index.html#diabetes-dataset).
 
+With a higher explained variance, you are able to capture more variability in your dataset, which could potentially lead to better performance when training your model. For a more mathematical explanation, see this [Q&A thread](https://stats.stackexchange.com/questions/22569/pca-and-proportion-of-variance-explained).
+
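Since the page's code block is truncated in this diff, here is a minimal sketch of such a cumulative explained-variance plot for the Diabetes data (the page's own example may differ):

```python
import numpy as np
import plotly.express as px
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA

X = load_diabetes().data  # 10 features

# Fit PCA on all features and accumulate the explained variance ratio
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

fig = px.area(
    x=np.arange(1, cumulative.shape[0] + 1),
    y=cumulative,
    labels={"x": "# Components", "y": "Cumulative Explained Variance"},
)
fig.show()
```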
 ```python
 import numpy as np
 import pandas as pd
@@ -198,6 +209,8 @@ $$
 loadings = eigenvectors \cdot \sqrt{eigenvalues}
 $$
 
+For more details about the linear algebra behind eigenvectors and loadings, see this [Q&A thread](https://stats.stackexchange.com/questions/143905/loadings-vs-eigenvectors-in-pca-when-to-use-one-or-another).
+
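In scikit-learn terms, this corresponds to scaling the rows of `pca.components_` by the square roots of `pca.explained_variance_`; a minimal sketch (not necessarily the exact code on the page):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=2).fit(X)

# components_ holds the unit-length eigenvectors (one per row); scaling each by
# the square root of its eigenvalue gives the loadings from the formula above.
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
print(loadings.shape)  # (n_features, n_components), i.e. (4, 2)
```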
 ```python
 import plotly.express as px
 from sklearn.decomposition import PCA
@@ -244,3 +257,5 @@ The following resources offer an in-depth overview of PCA and explained variance
 * https://en.wikipedia.org/wiki/Explained_variation
 * https://scikit-learn.org/stable/modules/decomposition.html#pca
 * https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/140579#140579
+* https://stats.stackexchange.com/questions/143905/loadings-vs-eigenvectors-in-pca-when-to-use-one-or-another
+* https://stats.stackexchange.com/questions/22569/pca-and-proportion-of-variance-explained
