\chapter{Working with Unlabeled Data -- Clustering Analysis}

\section{Grouping objects by similarity using k-means}

Thus, our goal is to group the samples based on their feature similarities, which can be achieved using the k-means algorithm that can be summarized by the following four steps:
\begin{enumerate}
\item Randomly pick $k$ centroids from the sample points as initial cluster centers.
\item Assign each sample to the nearest centroid $\mu^{(j)}, \quad j \in \{1, \dots, k\}$.
\item Move each centroid to the center of the samples that were assigned to it.
\item Repeat steps 2 and 3 until the cluster assignments do not change or a user-defined tolerance or a maximum number of iterations is reached.
\end{enumerate}
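These four steps map directly onto scikit-learn's \texttt{KMeans} estimator. The following sketch is only an illustration (it is not a listing from this chapter); \texttt{X} is a placeholder feature matrix and all parameter values are arbitrary examples.

\begin{verbatim}
import numpy as np
from sklearn.cluster import KMeans

# Placeholder data: a (n_samples, n_features) feature matrix
X = np.random.RandomState(0).rand(150, 2)

km = KMeans(n_clusters=3,    # k: number of centroids
            init='random',   # step 1: pick random initial centroids
            n_init=10,       # rerun with 10 different seeds, keep the lowest SSE
            max_iter=300,    # cap on the assign/move iterations (steps 2-3)
            tol=1e-04,       # user-defined tolerance for declaring convergence
            random_state=0)
y_km = km.fit_predict(X)     # cluster index of each sample
print(km.inertia_)           # within-cluster SSE (cluster inertia)
\end{verbatim}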
Now the next question is \textit{how do we measure similarity between objects?} We can define similarity as the opposite of distance, and a commonly used distance for clustering samples with continuous features is the \textit{squared Euclidean distance} between two points $\mathbf{x}$ and $\mathbf{y}$ in $m$-dimensional space:
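\[
d(\mathbf{x}, \mathbf{y})^2 = \sum_{j=1}^{m} \big( x_j - y_j \big)^2 = \lVert \mathbf{x} - \mathbf{y} \rVert_2^2
\]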
Note that, in the preceding equation, the index $j$ refers to the $j$th dimension (feature column) of the sample points $\mathbf{x}$ and $\mathbf{y}$. In the rest of this section, we will use the superscripts $i$ and $j$ to refer to the sample index and cluster index, respectively.
Based on this Euclidean distance metric, we can describe the k-means algorithm as a simple optimization problem, an iterative approach for minimizing the \textit{within-cluster sum of squared errors (SSE)}, which is sometimes also called \textit{cluster inertia}:
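\[
SSE = \sum_{i=1}^{n} \sum_{j=1}^{k} w^{(i, j)} \, \big\lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(j)} \big\rVert_2^2
\]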
Here, $\mu^{(j)}$ is the representative point (centroid) for cluster $j$, and $w^{(i, j)} = 1$ if the sample $\mathbf{x}^{(i)}$ is in cluster $j$; $w^{(i, j)}=0$ otherwise.
\subsection{K-means++}

[...] The initialization in k-means++ can be summarized as follows:
\begin{enumerate}
\item Initialize an empty set $M$ to store the $k$ centroids being selected.
\item Randomly choose the first centroid $\mu^{(j)}$ from the input samples and assign it to $M$.
\item For each sample $\mathbf{x}^{(i)}$ that is not in $M$, find the minimum squared distance $d \big( \mathbf{x}^{(i)}, M \big)^2$ to any of the centroids in $M$.
\item To randomly select the next centroid $\mu^{(p)}$, use a weighted probability distribution equal to $\frac{d \big( \mu^{(p)}, M \big)^2}{\sum_i d \big( \mathbf{x}^{(i)}, M \big)^2}$.
\item Repeat steps 3 and 4 until $k$ centroids are chosen.
\item Proceed with the classic \textit{k}-means algorithm.
\end{enumerate}
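The seeding scheme above can be sketched in a few lines of NumPy. The helper below is a hypothetical illustration (its name and interface are not from this document); in scikit-learn the same strategy is available out of the box via \texttt{KMeans(init='k-means++')}.

\begin{verbatim}
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """Pick k initial centroids from X using k-means++ weighting."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    # Choose the first centroid uniformly at random
    centroids = [X[rng.integers(n_samples)]]
    for _ in range(1, k):
        # Squared distance of every sample to its closest chosen centroid
        diff = X[:, None, :] - np.asarray(centroids)[None, :, :]
        d2 = np.min((diff ** 2).sum(axis=-1), axis=1)
        # Sample the next centroid with probability proportional to d^2
        centroids.append(X[rng.choice(n_samples, p=d2 / d2.sum())])
    return np.asarray(centroids)
\end{verbatim}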
\subsection{Hard versus soft clustering}

The \textit{fuzzy c-means (FCM)} procedure is very similar to k-means. However, we replace the hard cluster assignment with probabilities for each point belonging to each cluster. In $k$-means, we could express the cluster membership of a sample $\mathbf{x}$ by a sparse vector of binary values:
\[
\begin{bmatrix}
\mathbf{\mu}^{(1)} \rightarrow 0 \\
\mathbf{\mu}^{(2)} \rightarrow 1 \\
\mathbf{\mu}^{(3)} \rightarrow 0
\end{bmatrix}
\]

Here, the index position with value 1 indicates the cluster centroid $\mathbf{\mu}^{(j)}$ the sample is assigned to (assuming $k=3, \; j \in\{1, 2, 3\}$). In contrast, a membership vector in FCM could be represented as follows:

\[
\begin{bmatrix}
\mathbf{\mu}^{(1)} \rightarrow 0.1 \\
\mathbf{\mu}^{(2)} \rightarrow 0.85 \\
\mathbf{\mu}^{(3)} \rightarrow 0.05
\end{bmatrix}
\]

Here, each value falls in the range $[0, 1]$ and represents a probability of membership to the respective cluster centroid. The sum of the memberships for a given sample is equal to 1. Similarly to the k-means algorithm, we can summarize the FCM algorithm in four key steps:
\begin{enumerate}
\item Specify the number of centroids, $k$, and randomly assign the cluster memberships for each point.
\item Compute the cluster centroids $\mathbf{\mu}^{(j)}, \; j \in\{1, \dots, k \}$.
\item Update the cluster memberships for each point.
\item Repeat steps 2 and 3 until the membership coefficients do not change or a user-defined tolerance or a maximum number of iterations is reached.
\end{enumerate}
The objective function of FCM -- we abbreviate it as $J_m$ -- looks very similar to the within-cluster sum of squared errors that we minimize in $k$-means:
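\[
J_m = \sum_{i=1}^{n} \sum_{j=1}^{k} \big( w^{(i, j)} \big)^{m} \, \big\lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(j)} \big\rVert_2^2, \quad m \in [1, \infty)
\]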
However, note that the membership indicator $w^{(i, j)}$ is not a binary value as in $k$-means $\big( w^{(i, j)} \in \{0, 1\} \big)$ but a real value that denotes the cluster membership probability $\big( w^{(i, j)} \in [0, 1] \big)$. You may also have noticed that we added an additional exponent to $w^{(i, j)}$; the exponent $m$, any number greater than or equal to 1 (typically $m=2$), is the so-called \textit{fuzziness coefficient} (or simply \textit{fuzzifier}), which controls the degree of \textit{fuzziness}. The larger the value of $m$, the smaller the cluster membership $w^{(i, j)}$ becomes, which leads to fuzzier clusters. The cluster membership probability itself is calculated as follows:
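\[
w^{(i, j)} = \Bigg[ \sum_{p=1}^{k} \bigg( \frac{\big\lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(j)} \big\rVert_2}{\big\lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(p)} \big\rVert_2} \bigg)^{\frac{2}{m-1}} \Bigg]^{-1}
\]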
For example, if we chose three cluster centers as in the previous $k$-means example, we could calculate the membership of the $\mathbf{x}^{(i)}$ sample belonging to its own cluster:
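\[
w^{(i, j)} = \Bigg[ \bigg( \frac{\lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(j)} \rVert_2}{\lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(1)} \rVert_2} \bigg)^{\frac{2}{m-1}} + \bigg( \frac{\lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(j)} \rVert_2}{\lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(2)} \rVert_2} \bigg)^{\frac{2}{m-1}} + \bigg( \frac{\lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(j)} \rVert_2}{\lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(3)} \rVert_2} \bigg)^{\frac{2}{m-1}} \Bigg]^{-1}
\]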
The center $\mathbf{\mu}^{(j)}$ of a cluster is calculated as the mean of all samples, weighted by the degree to which each sample belongs to that cluster:
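\[
\mathbf{\mu}^{(j)} = \frac{\sum_{i=1}^{n} \big( w^{(i, j)} \big)^{m} \mathbf{x}^{(i)}}{\sum_{i=1}^{n} \big( w^{(i, j)} \big)^{m}}
\]

As a rough NumPy sketch of how the two update equations interact (a hypothetical helper, not code from this document), one FCM iteration can be written as follows; \texttt{X} is the sample matrix and \texttt{W} a membership matrix whose rows sum to 1.

\begin{verbatim}
import numpy as np

def fcm_step(X, W, m=2.0):
    """One fuzzy c-means iteration: centroids from memberships, then new memberships."""
    Wm = W ** m                                         # fuzzified memberships w^m
    centroids = (Wm.T @ X) / Wm.sum(axis=0)[:, None]    # membership-weighted means
    # Euclidean distance of each sample to each centroid
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    dist = np.clip(dist, 1e-12, None)                   # guard against division by zero
    # w_(i,j) = 1 / sum_p (d_(i,j) / d_(i,p))^(2 / (m - 1))
    ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
    return centroids, 1.0 / ratio.sum(axis=2)
\end{verbatim}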
\subsection{Using the elbow method to find the optimal number of clusters}
\subsection{Quantifying the quality of clustering via silhouette plots}

To calculate the \textit{silhouette coefficient} of a single sample in our dataset, we can apply the following three steps:

\begin{enumerate}
\item Calculate the cluster cohesion $a^{(i)}$ as the average distance between a sample $\mathbf{x}^{(i)}$ and all other points in the same cluster.
\item Calculate the cluster separation $b^{(i)}$ from the next closest cluster as the average distance between the sample $\mathbf{x}^{(i)}$ and all samples in the nearest cluster.
\item Calculate the silhouette $s^{(i)}$ as the difference between cluster cohesion and separation divided by the greater of the two, as shown here:
\[
s^{(i)} = \frac{b^{(i)} - a^{(i)}}{\max \big\{ b^{(i)}, a^{(i)} \big\}}
\]
\end{enumerate}
The silhouette coefficient is bounded in the range $-1$ to $1$. Based on the preceding formula, we can see that the silhouette coefficient is 0 if the cluster separation and cohesion are equal $(b^{(i)} = a^{(i)})$. Furthermore, we get close to an ideal silhouette coefficient of $1$ if $b^{(i)} \gg a^{(i)}$, since $b^{(i)}$ quantifies how dissimilar a sample is to other clusters, and $a^{(i)}$ tells us how similar it is to the other samples in its own cluster.
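For reference, scikit-learn implements this computation directly; the sketch below assumes a feature matrix \texttt{X} and cluster labels \texttt{y\_km} from an earlier k-means fit.

\begin{verbatim}
from sklearn.metrics import silhouette_samples, silhouette_score

# y_km: cluster labels from a previous k-means fit on the feature matrix X
sil_vals = silhouette_samples(X, y_km, metric='euclidean')  # one s^(i) per sample
print(silhouette_score(X, y_km))                            # mean silhouette coefficient
\end{verbatim}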
\section{Organizing clusters as a hierarchical tree}

\subsection{Performing hierarchical clustering on a distance matrix}

\subsection{Attaching dendrograms to a heat map}

\subsection{Applying agglomerative clustering via scikit-learn}

\section{Locating regions of high density via DBSCAN}
[...] In \textit{Density-based Spatial Clustering of Applications with Noise} (DBSCAN), a special label is assigned to each sample (point) using the following criteria:
\begin{itemize}
\item A point is considered a \textit{core point} if at least a specified number (\textit{MinPts}) of neighboring points fall within the specified radius $\epsilon$.
\item A \textit{border point} is a point that has fewer neighbors than MinPts within $\epsilon$, but lies within the $\epsilon$ radius of a core point.
\item All other points that are neither core nor border points are considered \textit{noise points}.
\end{itemize}
After labeling the points as core, border, or noise points, the DBSCAN algorithm can be summarized in two simple steps:

\begin{enumerate}
\item Form a separate cluster for each core point or connected group of core points (core points are connected if they are no farther away from each other than $\epsilon$).
\item Assign each border point to the cluster of its corresponding core point.
\end{enumerate}
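As a short usage sketch (placeholder data, arbitrary parameter values), scikit-learn's \texttt{DBSCAN} carries out both steps; \texttt{eps} plays the role of $\epsilon$ and \texttt{min\_samples} the role of MinPts.

\begin{verbatim}
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.RandomState(0).rand(200, 2)   # placeholder data

db = DBSCAN(eps=0.2, min_samples=5, metric='euclidean')
labels = db.fit_predict(X)

# Core and border points receive a cluster index; noise points are labeled -1
print(np.unique(labels))
\end{verbatim}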