doc/python/ml-pca.md
thumbnail: thumbnail/ml-pca.png
---
## High-dimensional PCA Analysis with `px.scatter_matrix`
### Visualize all the original dimensions
First, let's plot all the features and see how the `species` in the Iris dataset are grouped. In a [splom](https://plot.ly/python/splom/), each subplot displays a feature against another one, so if we have $N$ features we have an $N \times N$ matrix.
In our example, we plot all 4 features from the Iris dataset, so we can see how `sepal_width` compares against `sepal_length`, then against `petal_width`, and so forth. Keep in mind that some pairs of features separate the species more easily than others.
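As a minimal sketch (assuming the standard `px.data.iris()` column names), the splom of all four original features can be drawn with `px.scatter_matrix`:

```python
import plotly.express as px

df = px.data.iris()
features = ["sepal_width", "sepal_length", "petal_width", "petal_length"]

# One subplot for each pair of original features, colored by species
fig = px.scatter_matrix(df, dimensions=features, color="species")
fig.show()
```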
### Visualize all the principal components

Now, we apply `PCA` to the same dataset, and retrieve **all** the components. We use the same `px.scatter_matrix` trace to display our results, but this time our features are the resulting *principal components*, ordered by how much variance they are able to explain.
The importance of explained variance is demonstrated in the example below. The subplot between PC3 and PC4 is clearly unable to separate the classes, whereas the subplot between PC1 and PC2 shows a clear separation between the species.
```python
import plotly.express as px
from sklearn.decomposition import PCA

df = px.data.iris()
features = ["sepal_width", "sepal_length", "petal_width", "petal_length"]

# Keep all the components, and label each axis with its explained variance
pca = PCA()
components = pca.fit_transform(df[features])
labels = {
    str(i): f"PC {i+1} ({var:.1f}%)"
    for i, var in enumerate(pca.explained_variance_ratio_ * 100)
}

fig = px.scatter_matrix(
    components,
    labels=labels,
    dimensions=range(4),
    color=df["species"]
)
fig.update_traces(diagonal_visible=False)
fig.show()
```
### Visualize a subset of the principal components
When you have too many features to visualize, you might be interested in only visualizing the most relevant components. Those components often capture a majority of the dataset's [explained variance](https://en.wikipedia.org/wiki/Explained_variation), which is a good way to tell whether they are sufficient for modelling this dataset.
In the example below, our dataset contains 10 features, but we only select the first 4 components, since they explain over 99% of the total variance.
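Below is a sketch of such an example, assuming scikit-learn's 10-feature [Diabetes](https://scikit-learn.org/stable/datasets/index.html#diabetes-dataset) dataset (the dataset choice and variable names here are illustrative):

```python
import pandas as pd
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.datasets import load_diabetes

# Diabetes has 10 features; keep only the first 4 principal components
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

pca = PCA(n_components=4)
components = pca.fit_transform(df)

# Report how much of the total variance the 4 components capture
total_var = pca.explained_variance_ratio_.sum() * 100

fig = px.scatter_matrix(
    components,
    color=diabetes.target,
    dimensions=range(4),
    labels={str(i): f"PC {i+1}" for i in range(4)},
    title=f"Total Explained Variance: {total_var:.2f}%",
)
fig.update_traces(diagonal_visible=False)
fig.show()
```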
## 2D PCA Scatter Plot

In the previous examples, you saw how to visualize high-dimensional PCs. In this example, we show you how to simply visualize the first two principal components of a PCA, by reducing a dataset of 4 dimensions to 2D.
```python
import plotly.express as px
from sklearn.decomposition import PCA

df = px.data.iris()
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]

# Project the 4 features down to 2 principal components
pca = PCA(n_components=2)
components = pca.fit_transform(X)

fig = px.scatter(components, x=0, y=1, color=df['species'])
fig.show()
```
## Plotting explained variance
Often, you might be interested in seeing how much variance PCA is able to explain as you increase the number of components, in order to decide how many dimensions to ultimately keep or analyze. This example shows you how to quickly plot the cumulative sum of explained variance for a high-dimensional dataset like [Diabetes](https://scikit-learn.org/stable/datasets/index.html#diabetes-dataset).
```python
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.datasets import load_diabetes

# Load the 10-feature Diabetes dataset
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Fit PCA with all components, then take the cumulative sum of the
# explained variance ratio
pca = PCA()
pca.fit(df)
exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)

px.area(
    x=range(1, exp_var_cumul.shape[0] + 1),
    y=exp_var_cumul,
    labels={"x": "# Components", "y": "Explained Variance"}
)
```
## Visualize Loadings
It is also possible to visualize loadings using `shapes`, and use `annotations` to indicate which feature a certain loading originally belongs to. Here, we define loadings as:
$$
loadings = eigenvectors \cdot \sqrt{eigenvalues}
$$
```python
import numpy as np
import plotly.express as px
from sklearn.decomposition import PCA

df = px.data.iris()
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

pca = PCA(n_components=2)
components = pca.fit_transform(df[features])

# loadings = eigenvectors * sqrt(eigenvalues)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

fig = px.scatter(components, x=0, y=1, color=df['species'])

# Draw a line (shape) from the origin to each loading, and annotate it
# with the name of the original feature
for i, feature in enumerate(features):
    fig.add_shape(type='line', x0=0, y0=0,
                  x1=loadings[i, 0], y1=loadings[i, 1])
    fig.add_annotation(x=loadings[i, 0], y=loadings[i, 1],
                       ax=0, ay=0, xanchor="center", yanchor="bottom",
                       text=feature)
fig.show()
```