doc/python/ml-pca.md (+17 −2: 17 additions, 2 deletions)
@@ -20,7 +20,7 @@ jupyter:
name: python
nbconvert_exporter: python
pygments_lexer: ipython3
-version: 3.7.6
+version: 3.7.7
plotly:
  description: Visualize Principal Component Analysis (PCA) of your high-dimensional
    data in Python with Plotly.
@@ -34,12 +34,21 @@ jupyter:
  thumbnail: thumbnail/ml-pca.png
---

+This page first shows how to visualize higher-dimensional data using various Plotly figures combined with dimensionality reduction (aka projection). Then, we dive into the specific details of our projection algorithm.
+
+We will use [Scikit-learn](https://scikit-learn.org/) to load one of the datasets and apply dimensionality reduction. Scikit-learn is a popular Machine Learning (ML) library that offers various tools for creating and training ML algorithms, feature engineering, data cleaning, and evaluating and testing models. It was designed to be accessible, and to work seamlessly with popular libraries like NumPy and Pandas.
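
For orientation, here is a minimal sketch of that workflow: load a small dataset and project it onto two components with scikit-learn's `PCA`. The choice of the Iris data and of two components is only an illustrative assumption.

```python
# Minimal sketch: load a small dataset and reduce it to 2 dimensions with PCA.
import plotly.express as px
from sklearn.decomposition import PCA

df = px.data.iris()                    # 4 numeric features plus a species label
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

pca = PCA(n_components=2)              # keep the 2 most informative directions
components = pca.fit_transform(df[features])
print(components.shape)                # (150, 2)
```
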
## High-dimensional PCA Analysis with `px.scatter_matrix`

+The dimensionality reduction technique we will be using is called [Principal Component Analysis (PCA)](https://scikit-learn.org/stable/modules/decomposition.html#pca). It is a powerful technique that arises from linear algebra and probability theory. In essence, it computes a matrix that represents the variation of your data ([covariance matrix/eigenvectors][covmatrix]) and ranks these directions by their relevance (explained variance/eigenvalues). For a video tutorial, see [this segment on PCA](https://youtu.be/rng04VJxUt4?t=98) from the Coursera ML course.
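
As a hedged sketch of that idea (random stand-in data, not a dataset from the page), scikit-learn's `PCA` recovers exactly the eigenvalues of the sample covariance matrix, ranked from largest to smallest:

```python
# Minimal sketch: PCA's explained variance equals the covariance matrix's eigenvalues.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).normal(size=(100, 4))   # stand-in data
pca = PCA().fit(X)

eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # largest first
print(pca.explained_variance_)   # eigenvalues, ranked by relevance
print(eigenvalues)               # same values, straight from the covariance matrix
```
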
-First, let's plot all the features and see how the `species` in the Iris dataset are grouped. In a [splom](https://plot.ly/python/splom/), each subplot displays a feature against another, so if we have $N$ features we have a $N \times N$ matrix.
+First, let's plot all the features and see how the `species` in the Iris dataset are grouped. In a [Scatter Plot Matrix (splom)](https://plot.ly/python/splom/), each subplot displays a feature against another, so if we have $N$ features we have an $N \times N$ matrix.
In our example, we are plotting all 4 features from the Iris dataset, thus we can see how `sepal_width` is compared against `sepal_length`, then against `petal_width`, and so forth. Keep in mind how some pairs of features can more easily separate different species.
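
As a minimal sketch of such a figure (the column names are those provided by `px.data.iris()`; the page's own example may differ in details):

```python
# Minimal sketch: scatter-plot matrix of the four Iris features, colored by species.
import plotly.express as px

df = px.data.iris()
fig = px.scatter_matrix(
    df,
    dimensions=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    color="species",
)
fig.show()
```
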
@@ -169,6 +178,8 @@ fig.show()
Often, you might be interested in seeing how much variance PCA is able to explain as you increase the number of components, in order to decide how many dimensions to ultimately keep or analyze. This example shows you how to quickly plot the cumulative sum of explained variance for a high-dimensional dataset like [Diabetes](https://scikit-learn.org/stable/datasets/index.html#diabetes-dataset).

+With a higher explained variance, you are able to capture more variability in your dataset, which could potentially lead to better performance when training your model. For a more mathematical explanation, see this [Q&A thread](https://stats.stackexchange.com/questions/22569/pca-and-proportion-of-variance-explained).
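
As a hedged sketch of the idea behind that plot (random stand-in data rather than the Diabetes set), the curve is just the running sum of `explained_variance_ratio_`:

```python
# Minimal sketch: cumulative share of variance explained as components are added.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).normal(size=(200, 10))   # stand-in for a real dataset
pca = PCA().fit(X)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)   # climbs toward 1.0 as more components are kept
```
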
```python
import numpy as np
import pandas as pd
@@ -198,6 +209,8 @@ $$
loadings = eigenvectors \cdot \sqrt{eigenvalues}
$$

+For more details about the linear algebra behind eigenvectors and loadings, see this [Q&A thread](https://stats.stackexchange.com/questions/143905/loadings-vs-eigenvectors-in-pca-when-to-use-one-or-another).
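
As a sketch of how that formula translates to scikit-learn objects (treating the rows of `components_` as eigenvectors and `explained_variance_` as eigenvalues; the Iris features are used only for illustration):

```python
# Minimal sketch: per-feature loadings from fitted PCA components and eigenvalues.
import numpy as np
import plotly.express as px
from sklearn.decomposition import PCA

df = px.data.iris()
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

pca = PCA(n_components=2).fit(df[features])
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)  # shape: (n_features, 2)
print(loadings)
```
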
```python
import plotly.express as px
from sklearn.decomposition import PCA
@@ -244,3 +257,5 @@ The following resources offer an in-depth overview of PCA and explained variance