\chapter{Building Good Training Sets -- Data Pre-Processing}
\section{Dealing with missing data}
\subsection{Eliminating samples or features with missing values}
\subsection{Imputing missing values}
\subsection{Understanding the scikit-learn estimator API}
\section{Handling categorical data}
\subsection{Mapping ordinal features}
\subsection{Encoding class labels}
\subsection{Performing one-hot encoding on nominal features}
\section{Partitioning a dataset into training and test sets}
\section{Bringing features onto the same scale}
Now, there are two common approaches to bringing different features onto the same scale: \textit{normalization} and \textit{standardization}. These terms are often used quite loosely in different fields, and their meaning has to be derived from the context. Most often, \textit{normalization} refers to the rescaling of the features to a range of $[0, 1]$, which is a special case of min-max scaling. To normalize our data, we can simply apply min-max scaling to each feature column, where the new value $x_{norm}^{(i)}$ of a sample $x^{(i)}$ can be calculated as follows:
\[
x_{norm}^{(i)} = \frac{x^{(i)} - x_{min}}{x_{max} - x_{min}}
\]
Here, $x^{(i)}$ is a particular sample, $x_{min}$ is the smallest value in a feature column, and $x_{max}$ is the largest value.
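As a minimal sketch of this in code, scikit-learn's \texttt{MinMaxScaler} implements exactly this min-max scaling; the feature values below are invented for illustration:
\begin{verbatim}
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [5.0], [10.0]])  # hypothetical feature column

mms = MinMaxScaler()           # defaults to feature_range=(0, 1)
X_norm = mms.fit_transform(X)

# Equivalent manual computation: (x - x_min) / (x_max - x_min)
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_norm.ravel())    # [0.         0.44444444 1.        ]
print(X_manual.ravel())  # identical values
\end{verbatim}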
[...] Furthermore, standardization maintains useful information about outliers and makes the algorithm less sensitive to them, in contrast to min-max scaling, which scales the data to a limited range of values.
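To make this sensitivity concrete, the following small sketch (with invented values, where $100.0$ plays the role of an outlier) compares the two scalers on the same column:
\begin{verbatim}
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One extreme value (100.0) dominates the min-max range ...
X = np.array([[1.0], [2.0], [3.0], [100.0]])

print(MinMaxScaler().fit_transform(X).ravel())
# [0.         0.01010101 0.02020202 1.        ]
# ... squeezing the first three values into a narrow band near zero.

print(StandardScaler().fit_transform(X).ravel())
# approx. [-0.60 -0.58 -0.55  1.73]
# The standardized values keep more of their relative spread.
\end{verbatim}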
The procedure of standardization can be expressed by the following equation:
\[
x_{std}^{(i)} = \frac{x^{(i)} - \mu_{x}}{\sigma_{x}}
\]
Here, $\mu_{x}$ is the sample mean of a particular feature column and $\sigma_{x}$ is the corresponding standard deviation.
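A quick sketch checks this equation against scikit-learn's \texttt{StandardScaler} on a made-up feature column:
\begin{verbatim}
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])

stdsc = StandardScaler()
X_std = stdsc.fit_transform(X)

# Manual computation: (x - mu) / sigma, using the population standard
# deviation (ddof=0), which is what StandardScaler uses internally.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_std, X_manual))  # True
\end{verbatim}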
\section{Selecting meaningful features}
\subsection{Sparse solutions with L1 regularization}
We recall from \textit{Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn}, that L2 regularization is one approach to reduce the complexity of a model by penalizing large individual weights, where we defined the L2 norm of our weight vector $w$ as follows:
\[
L2: \quad \left\lVert w \right\rVert_2^2 = \sum_{j=1}^{m} w_j^2
\]
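As a hedged sketch of how L1 regularization produces such sparse solutions in practice, the following example uses scikit-learn's bundled \texttt{load\_wine} dataset and an arbitrary regularization strength of \texttt{C=0.1} as illustrative stand-ins for the chapter's own data and settings:
\begin{verbatim}
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Standardize the features first, as discussed above.
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

# penalty='l1' drives many weights to exactly zero (a sparse solution);
# recent scikit-learn versions require an L1-capable solver such as
# liblinear.
lr = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
lr.fit(X_train_std, y_train)

print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy:', lr.score(X_test_std, y_test))
print('Zero weights per class:', (lr.coef_ == 0).sum(axis=1))
\end{verbatim}
Note that the number of zero entries in \texttt{lr.coef\_} grows as \texttt{C} (the inverse regularization strength) decreases.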