
Commit b0f2c96

Author: xhlu

ML Docs: Updated PCA notebook
Added loadings, moved high-dimensional analysis first

1 parent 9fc81be

1 file changed: doc/python/ml-pca.md (+138 -27)
@@ -34,87 +34,145 @@ jupyter:
thumbnail: thumbnail/ml-pca.png
---

## High-dimensional PCA Analysis with `px.scatter_matrix`

### Visualize all the original dimensions

First, let's plot all the features and see how the `species` in the Iris dataset are grouped. In a [splom](https://plot.ly/python/splom/), each subplot displays one feature against another, so if we have $N$ features we have an $N \times N$ matrix.

In our example, we are plotting all 4 features from the Iris dataset, so we can see how `sepal_width` compares against `sepal_length`, then against `petal_width`, and so forth. Keep in mind that some pairs of features can separate different species more easily than others.

```python
import plotly.express as px

df = px.data.iris()
features = ["sepal_width", "sepal_length", "petal_width", "petal_length"]

fig = px.scatter_matrix(
    df,
    dimensions=features,
    color="species"
)
fig.update_traces(diagonal_visible=False)
fig.show()
```
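
Since the scatter matrix is symmetric across its diagonal, you can optionally hide the upper half as well. A small optional variation, using the splom trace's `showupperhalf` attribute:

```python
import plotly.express as px

df = px.data.iris()
features = ["sepal_width", "sepal_length", "petal_width", "petal_length"]

fig = px.scatter_matrix(df, dimensions=features, color="species")
# The matrix is symmetric, so the upper half adds no new information
fig.update_traces(diagonal_visible=False, showupperhalf=False)
fig.show()
```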

### Visualize all the principal components

Now, we apply `PCA` to the same dataset, and retrieve **all** the components. We use the same `px.scatter_matrix` trace to display our results, but this time our features are the resulting *principal components*, ordered by how much variance they are able to explain.

The importance of explained variance is demonstrated in the example below. The subplot between PC3 and PC4 is clearly unable to separate the classes, whereas the subplot between PC1 and PC2 shows a clear separation between the species.

```python
import plotly.express as px
from sklearn.decomposition import PCA

df = px.data.iris()
features = ["sepal_width", "sepal_length", "petal_width", "petal_length"]

pca = PCA()
components = pca.fit_transform(df[features])
labels = {
    str(i): f"PC {i+1} ({var:.1f}%)"
    for i, var in enumerate(pca.explained_variance_ratio_ * 100)
}

fig = px.scatter_matrix(
    components,
    labels=labels,
    dimensions=range(4),
    color=df["species"]
)
fig.update_traces(diagonal_visible=False)
fig.show()
```
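
If you want to inspect the ordering numerically rather than visually, you can print the explained variance ratios directly. A minimal sketch:

```python
import plotly.express as px
from sklearn.decomposition import PCA

df = px.data.iris()
features = ["sepal_width", "sepal_length", "petal_width", "petal_length"]

# Each ratio is the fraction of total variance captured by one component;
# scikit-learn returns them sorted in decreasing order
pca = PCA().fit(df[features])
for i, var in enumerate(pca.explained_variance_ratio_):
    print(f"PC {i+1}: {var:.1%}")
```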

### Visualize a subset of the principal components

When you have too many features to visualize, you might be interested in only visualizing the most relevant components. Those components often capture a majority of the [explained variance](https://en.wikipedia.org/wiki/Explained_variation), which is a good way to tell whether those components are sufficient for modelling this dataset.

In the example below, our dataset contains 13 features, but we only keep the first 4 components, since they explain over 99% of the total variance.

```python
import pandas as pd
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.datasets import load_boston

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
n_components = 4

pca = PCA(n_components=n_components)
components = pca.fit_transform(df)

total_var = pca.explained_variance_ratio_.sum() * 100

labels = {str(i): f"PC {i+1}" for i in range(n_components)}
labels['color'] = 'Median Price'

fig = px.scatter_matrix(
    components,
    color=boston.target,
    dimensions=range(n_components),
    labels=labels,
    title=f'Total Explained Variance: {total_var:.2f}%',
)
fig.update_traces(diagonal_visible=False)
fig.show()
```
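
To verify the "over 99%" claim, or to pick `n_components` for a different dataset, you can look at the cumulative explained variance of the full decomposition. A minimal sketch:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.datasets import load_boston

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)

# Fit all components, then accumulate their explained variance ratios
cumul = np.cumsum(PCA().fit(df).explained_variance_ratio_)
print(cumul)  # pick the smallest n where the sum crosses your threshold
```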

## 2D PCA Scatter Plot

In the previous examples, you saw how to visualize high-dimensional PCs. In this example, we show you how to simply visualize the first two principal components of a PCA, by reducing a dataset of 4 dimensions to 2D.

```python
import plotly.express as px
from sklearn.decomposition import PCA

df = px.data.iris()
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]

pca = PCA(n_components=2)
components = pca.fit_transform(X)

fig = px.scatter(components, x=0, y=1, color=df['species'])
fig.show()
```

## Visualize PCA with `px.scatter_3d`

With `px.scatter_3d`, you can visualize an additional dimension, which lets you capture even more variance.

```python
import plotly.express as px
from sklearn.decomposition import PCA

df = px.data.iris()
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]

pca = PCA(n_components=3)
components = pca.fit_transform(X)

total_var = pca.explained_variance_ratio_.sum() * 100

fig = px.scatter_3d(
    components, x=0, y=1, z=2, color=df['species'],
    title=f'Total Explained Variance: {total_var:.2f}%',
    labels={'0': 'PC 1', '1': 'PC 2', '2': 'PC 3'}
)
fig.show()
```

## Plotting explained variance

Often, you might be interested in seeing how much variance PCA is able to explain as you increase the number of components, in order to decide how many dimensions to ultimately keep or analyze. This example shows you how to quickly plot the cumulative sum of explained variance for a high-dimensional dataset like [Diabetes](https://scikit-learn.org/stable/datasets/index.html#diabetes-dataset).

```python
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.datasets import load_diabetes

# ... intermediate lines unchanged by this commit and elided from the
# diff (@@ -132,4 +190,57 @@) ...
px.area(
    # ... arguments elided from this diff ...
)
```
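
Since the diff elides the unchanged body of this example, here is a minimal sketch of how the cumulative curve can be computed and passed to `px.area`; the exact variable names in the original notebook may differ:

```python
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.datasets import load_diabetes

# Load the Diabetes dataset into a DataFrame
data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Fit PCA with every component, then accumulate the explained variance
pca = PCA()
pca.fit(df)
exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)

# Area chart of cumulative explained variance vs. number of components
px.area(
    x=range(1, exp_var_cumul.shape[0] + 1),
    y=exp_var_cumul,
    labels={"x": "# Components", "y": "Explained Variance"}
).show()
```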

## Visualize Loadings

It is also possible to visualize loadings using `shapes`, and to use `annotations` to indicate which feature a given loading originally belongs to. Here, we define loadings as:

$$
\text{loadings} = \text{eigenvectors} \cdot \sqrt{\text{eigenvalues}}
$$

```python
import numpy as np
import plotly.express as px
from sklearn.decomposition import PCA

df = px.data.iris()
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X = df[features]

pca = PCA(n_components=2)
components = pca.fit_transform(X)

# Loadings: scale each eigenvector by the square root of its eigenvalue
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

fig = px.scatter(components, x=0, y=1, color=df['species'])

for i, feature in enumerate(features):
    # Draw a line from the origin to the loading of each feature
    fig.add_shape(
        type='line',
        x0=0, y0=0,
        x1=loadings[i, 0],
        y1=loadings[i, 1]
    )
    # Label the tip of the line with the feature name
    fig.add_annotation(
        x=loadings[i, 0],
        y=loadings[i, 1],
        ax=0, ay=0,
        xanchor="center",
        yanchor="bottom",
        text=feature,
    )
fig.show()
```
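
If your features live on very different scales, you may want to standardize them before fitting, so that no single feature dominates the loadings purely because of its units. A minimal sketch using scikit-learn's `StandardScaler`:

```python
import numpy as np
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = px.data.iris()
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# Standardize to zero mean and unit variance before fitting PCA
X_scaled = StandardScaler().fit_transform(df[features])

pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
```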

## References

Learn more about `px`, `px.scatter_3d`, and `px.scatter_matrix` here:

* https://plot.ly/python/plotly-express/
* https://plot.ly/python/3d-scatter-plots/
* https://plot.ly/python/splom/

The following resources offer an in-depth overview of PCA and explained variance:

* https://en.wikipedia.org/wiki/Explained_variation
* https://scikit-learn.org/stable/modules/decomposition.html#pca
* https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/140579#140579
