
Commit d6d63a8

committed
adding equations from chapter 11
1 parent f6d4665 commit d6d63a8

File tree

3 files changed: +142 -0 lines changed


README.md

+1
@@ -83,6 +83,7 @@ Simply click on the `ipynb`/`nbviewer` links next to the chapter headlines to vi
12. Training Artificial Neural Networks for Image Recognition [[dir](./code/ch12)] [[ipynb](./code/ch12/ch12.ipynb)] [[nbviewer](http://nbviewer.ipython.org/github/rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb)]
13. Parallelizing Neural Network Training via Theano [[dir](./code/ch13)] [[ipynb](./code/ch13/ch13.ipynb)] [[nbviewer](http://nbviewer.ipython.org/github/rasbt/python-machine-learning-book/blob/master/code/ch13/ch13.ipynb)]

+<br>

- Equation Reference [[PDF](./docs/equations/pymle-equations.pdf)] [[TEX](./docs/equations/pymle-equations.tex)]

docs/equations/pymle-equations.pdf

29.4 KB
Binary file not shown.

docs/equations/pymle-equations.tex

+141
@@ -1722,6 +1722,147 @@ \subsubsection{Random forest regression}
\section{Summary}


%%%%%%%%%%%%%%%
% CHAPTER 11
%%%%%%%%%%%%%%%

\chapter{Working with Unlabeled Data -- Clustering Analysis}


\section{Grouping objects by similarity using k-means}

Thus, our goal is to group the samples based on their feature similarities, which can be achieved using the k-means algorithm, summarized by the following four steps (a minimal code sketch follows the list):

\begin{enumerate}
\item Randomly pick $k$ centroids from the sample points as initial cluster centers.
\item Assign each sample to the nearest centroid $\mu^{(j)}, \; j \in \{1, \dots, k\}$.
\item Move each centroid to the center of the samples that were assigned to it.
\item Repeat steps 2 and 3 until the cluster assignments do not change or a user-defined tolerance or a maximum number of iterations is reached.
\end{enumerate}
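
A minimal NumPy sketch of these four steps is shown below; it is an illustrative implementation only (the function name \texttt{kmeans\_sketch} and its arguments are chosen here for clarity, not taken from the book's code):

\begin{verbatim}
import numpy as np

def kmeans_sketch(X, k, max_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k samples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest centroid
        # (squared Euclidean distances, shape (n_samples, k))
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned samples
        # (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        # Step 4: stop once the centroids barely move (tolerance) or
        # after max_iter iterations
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids
\end{verbatim}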
1743+
1744+
Now the next question is \textit{how do we measure similarity between objects?} We can de ne similarity as the opposite of distance, and a commonly used distance for clustering samples with continuous features is the \textit{squared Euclidean distance} between two points $\mathbf{x}$ and $\mathbf{y}$ in $m$-dimensional space:
1745+
1746+
\[
1747+
d(\mathbf{x}, \mathbf{y})^2 = \sum_{j=1}^{m} \big(x_j - y_j \big)^2 = \lVert \mathbf{x} - \mathbf{y} \rVert^{2}_{2}.
1748+
\]
1749+
1750+
Note that, in the preceding equation, the index $j$ refers to the $j$th dimension(feature column) of the sample points x and y. In the rest of this section, we will use the superscripts $i$ and $j$ to refer to the sample index and cluster index, respectively.
1751+
1752+
Based on this Euclidean distance metric, we can describe the k-means algorithmas a simple optimization problem, an iterative approach for minimizing the \textit{within-cluster sum of squared errors (SSE)}, which is sometimes also called \textit{cluster inertia}:
1753+
1754+
\[
1755+
SSE = \sum_{i=1}^{n} \sum^{k}_{j=1} w^{(i, j)} \big \lVert \mathbf{x}^{(i)} - \mu^{(j)} \big \rVert^{2}_{2}
1756+
\]
1757+
1758+
Here, $\mu^{(j)}$ is the representative point (centroid) for cluster $j$, and $w^{(i, j)} = 1$ if the sample $\mathbf{x}^{(i)}$ is in cluster $j$; $w^{(i, j)}=0$ otherwise.
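
For reference, scikit-learn's \texttt{KMeans} estimator implements this optimization; the following is a brief usage sketch on toy data (the \texttt{make\_blobs} dataset and the chosen parameter values are illustrative assumptions, not prescriptions):

\begin{verbatim}
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# toy data: 150 points scattered around 3 centers
X, _ = make_blobs(n_samples=150, centers=3,
                  cluster_std=0.5, random_state=0)

km = KMeans(n_clusters=3,     # k
            init='random',    # step 1: random initial centroids
            n_init=10,        # independent restarts; best SSE wins
            max_iter=300,     # cap on the step 2-3 repetitions
            tol=1e-04,        # user-defined convergence tolerance
            random_state=0)
y_km = km.fit_predict(X)      # cluster label per sample

# the final within-cluster SSE (cluster inertia)
print('SSE: %.2f' % km.inertia_)
\end{verbatim}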

\subsection{K-means++}

[...] The initialization in k-means++ can be summarized as follows (a minimal sketch follows the list):

\begin{enumerate}
\item Initialize an empty set $M$ to store the $k$ centroids being selected.
\item Randomly choose the first centroid $\mu^{(1)}$ from the input samples and assign it to $M$.
\item For each sample $\mathbf{x}^{(i)}$ that is not in $M$, find the minimum squared distance $d \big( \mathbf{x}^{(i)}, M \big)^2$ to any of the centroids in $M$.
\item To randomly select the next centroid $\mu^{(p)}$, use a weighted probability distribution equal to $\frac{d \big( \mu^{(p)}, M \big)^2}{\sum_i d \big( \mathbf{x}^{(i)}, M \big)^2}$.
\item Repeat steps 3 and 4 until $k$ centroids have been chosen.
\item Proceed with the classic \textit{k}-means algorithm.
\end{enumerate}
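
A minimal NumPy sketch of this initialization scheme (illustrative only; in practice, scikit-learn's \texttt{KMeans} already uses \texttt{init='k-means++'} by default):

\begin{verbatim}
import numpy as np

def kmeans_plusplus_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # step 2: pick the first centroid uniformly at random
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        # step 3: squared distance of every sample to its
        # closest already-chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # step 4: draw the next centroid with probability
        # proportional to that squared distance
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
\end{verbatim}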

\subsection{Hard versus soft clustering}

The \textit{fuzzy c-means (FCM)} procedure is very similar to k-means. However, we replace the hard cluster assignment with probabilities for each point belonging to each cluster. In $k$-means, we could express the cluster membership of a sample $\mathbf{x}$ by a sparse vector of binary values:

\[
\begin{bmatrix}
\mathbf{\mu}^{(1)} \rightarrow 0 \\
\mathbf{\mu}^{(2)} \rightarrow 1 \\
\mathbf{\mu}^{(3)} \rightarrow 0
\end{bmatrix}
\]

Here, the index position with value 1 indicates the cluster centroid $\mathbf{\mu}^{(j)}$ the sample is assigned to (assuming $k=3, \; j \in \{ 1, 2, 3 \}$). In contrast, a membership vector in FCM could be represented as follows:

\[
\begin{bmatrix}
\mathbf{\mu}^{(1)} \rightarrow 0.1 \\
\mathbf{\mu}^{(2)} \rightarrow 0.85 \\
\mathbf{\mu}^{(3)} \rightarrow 0.05
\end{bmatrix}
\]

Here, each value falls in the range $[0, 1]$ and represents a probability of membership to the respective cluster centroid. The sum of the memberships for a given sample is equal to 1. Similar to the k-means algorithm, we can summarize the FCM algorithm in four key steps:

\begin{enumerate}
\item Specify the number of centroids $k$ and randomly assign the cluster memberships for each point.
\item Compute the cluster centroids $\mathbf{\mu}^{(j)}, \; j \in \{1, \dots, k \}$.
\item Update the cluster memberships for each point.
\item Repeat steps 2 and 3 until the membership coefficients do not change or a user-defined tolerance or a maximum number of iterations is reached.
\end{enumerate}

The objective function of FCM -- we abbreviate it by $J_m$ -- looks very similar to the within-cluster sum of squared errors that we minimize in $k$-means:

\[
J_m = \sum_{i=1}^{n} \sum_{j=1}^{k} \big( w^{(i, j)} \big)^m \big \lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(j)} \big \rVert^{2}_{2}, \quad m \in [1, \infty)
\]

However, note that the membership indicator $w^{(i, j)}$ is not a binary value as in $k$-means $\big( w^{(i, j)} \in \{0, 1\} \big)$ but a real value that denotes the cluster membership probability $\big( w^{(i, j)} \in [0, 1] \big).$ You may also have noticed that we added an additional exponent to $w^{(i, j)}$; the exponent $m$, any number greater than or equal to 1 (typically $m=2$), is the so-called \textit{fuzziness coefficient} (or simply \textit{fuzzifier}) that controls the degree of \textit{fuzziness}. The larger the value of $m$, the smaller the cluster membership $w^{(i, j)}$ becomes, which leads to fuzzier clusters. The cluster membership probability itself is calculated as follows:

\[
w^{(i, j)} = \Bigg[ \sum^{k}_{p=1} \Bigg( \frac{\lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(j)} \rVert_2}{\lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(p)} \rVert_2} \Bigg)^{\frac{2}{m-1}} \Bigg]^{-1}
\]

For example, if we chose three cluster centers as in the previous $k$-means example, we could calculate the membership of the sample $\mathbf{x}^{(i)}$ to cluster $\mathbf{\mu}^{(j)}$ as follows:

\[
w^{(i, j)} = \Bigg[ \Bigg( \frac{\lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(j)} \rVert_2}{\lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(1)} \rVert_2} \Bigg)^{\frac{2}{m-1}} + \Bigg( \frac{\lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(j)} \rVert_2}{\lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(2)} \rVert_2} \Bigg)^{\frac{2}{m-1}} + \Bigg( \frac{\lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(j)} \rVert_2}{\lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(3)} \rVert_2} \Bigg)^{\frac{2}{m-1}} \Bigg]^{-1}
\]

The center $\mathbf{\mu}^{(j)}$ of a cluster itself is calculated as the mean of all samples, weighted by the degree to which each sample belongs to that cluster ($w^m$):

\[
\mathbf{\mu}^{(j)} = \frac{\sum_{i=1}^{n} \big( w^{(i, j)} \big)^m \mathbf{x}^{(i)}}{\sum_{i=1}^{n} \big( w^{(i, j)} \big)^m}
\]
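
These two update formulas translate directly into NumPy; the following is an illustrative sketch of a single FCM iteration (the helper names \texttt{fcm\_memberships} and \texttt{fcm\_centroids} are chosen here and are not part of any library):

\begin{verbatim}
import numpy as np

def fcm_memberships(X, centroids, m=2.0):
    # Euclidean distances of every sample to every centroid, shape (n, k)
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)  # guard against division by zero
    # w[i, j] = 1 / sum_p (d[i, j] / d[i, p])^(2 / (m - 1))
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

def fcm_centroids(X, W, m=2.0):
    # mu_j = sum_i w_ij^m x_i / sum_i w_ij^m
    Wm = W ** m                                   # shape (n, k)
    return (Wm.T @ X) / Wm.sum(axis=0)[:, None]   # shape (k, n_features)
\end{verbatim}

Iterating \texttt{fcm\_centroids} and \texttt{fcm\_memberships} until the membership matrix stops changing corresponds to steps 2--4 of the FCM procedure above.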

\subsection{Using the elbow method to find the optimal number of clusters}
\subsection{Quantifying the quality of clustering via silhouette plots}

To calculate the \textit{silhouette coefficient} of a single sample in our dataset, we can apply the following three steps:

\begin{enumerate}
\item Calculate the cluster cohesion $a^{(i)}$ as the average distance between a sample $\mathbf{x}^{(i)}$ and all other points in the same cluster.
\item Calculate the cluster separation $b^{(i)}$ from the next closest cluster as the average distance between the sample $\mathbf{x}^{(i)}$ and all samples in the nearest cluster.
\item Calculate the silhouette $s^{(i)}$ as the difference between cluster cohesion and separation divided by the greater of the two, as shown here:
\[
s^{(i)} = \frac{b^{(i)} - a^{(i)}}{\max \{ b^{(i)}, a^{(i)} \}}.
\]
\end{enumerate}

The silhouette coefficient is bounded in the range $-1$ to $1$. Based on the preceding formula, we can see that the silhouette coefficient is 0 if the cluster separation and cohesion are equal $(b^{(i)} = a^{(i)})$. Furthermore, we get close to an ideal silhouette coefficient of $1$ if $b^{(i)} \gg a^{(i)}$, since $b^{(i)}$ quantifies how dissimilar a sample is to other clusters, and $a^{(i)}$ tells us how similar it is to the other samples in its own cluster.
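
scikit-learn implements this computation in \texttt{silhouette\_samples} (per-sample coefficients) and \texttt{silhouette\_score} (their mean); a brief usage sketch, reusing the toy data and k-means labels from the earlier sketch (assumptions carried over, not fixed choices):

\begin{verbatim}
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=150, centers=3,
                  cluster_std=0.5, random_state=0)
y_km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# per-sample silhouette coefficients s^(i), each in [-1, 1]
s = silhouette_samples(X, y_km, metric='euclidean')

# their mean as a single score for the whole clustering
print('average silhouette: %.3f' % silhouette_score(X, y_km))
\end{verbatim}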

\section{Organizing clusters as a hierarchical tree}
\subsection{Performing hierarchical clustering on a distance matrix}
\subsection{Attaching dendrograms to a heat map}
\subsection{Applying agglomerative clustering via scikit-learn}
\section{Locating regions of high density via DBSCAN}

[...] In \textit{Density-based Spatial Clustering of Applications with Noise} (DBSCAN), a special label is assigned to each sample (point) using the following criteria:

\begin{itemize}
\item A point is considered a \textit{core point} if at least a specified number (\textit{MinPts}) of neighboring points fall within the specified radius $\epsilon$.
\item A \textit{border point} is a point that has fewer neighbors than MinPts within $\epsilon$, but lies within the $\epsilon$ radius of a core point.
\item All other points that are neither core nor border points are considered \textit{noise points}.
\end{itemize}

After labeling the points as core, border, or noise points, the DBSCAN algorithm can be summarized in two simple steps (a usage sketch follows the list):

\begin{enumerate}
\item Form a separate cluster for each core point or connected group of core points (core points are connected if they are no farther away from each other than $\epsilon$).
\item Assign each border point to the cluster of its corresponding core point.
\end{enumerate}
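
In practice, scikit-learn's \texttt{DBSCAN} estimator performs both the labeling and the clustering; the following is a small usage sketch on toy data (the half-moon dataset is an illustrative assumption, chosen because it produces non-spherical clusters):

\begin{verbatim}
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# toy data: two interleaving half-moon shapes
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2,           # radius epsilon
            min_samples=5,     # MinPts
            metric='euclidean')
y_db = db.fit_predict(X)       # noise points are labeled -1
\end{verbatim}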

\section{Summary}

\newpage

... to be continued ...