
Commit f0e0525

author "jake" committed .
1 parent 43744a3 commit f0e0525

5 files changed: +106 -29 lines changed

classimbalance.rst

+13-1
@@ -15,6 +15,18 @@ An important thing to note is that **resampling must be done AFTER the train-tes
 Over-Sampling
 ---------------
 
+SMOTE (synthetic minority over-sampling technique) is a common and popular up-sampling technique.
+
+.. code:: python
+
+    from imblearn.over_sampling import SMOTE
+
+    smote = SMOTE()
+    X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
+    clf = LogisticRegression()
+    clf.fit(X_resampled, y_resampled)
+
+
 ADASYN is one of the more advanced over-sampling algorithms.
 
 .. code:: python
@@ -42,7 +54,7 @@ Under-Sampling
 Under/Over-Sampling
 --------------------
 
-SMOTEENN combines SMOTE (synthetic over sampling) with Edited Nearest Neighbours,
+SMOTEENN combines SMOTE with Edited Nearest Neighbours,
 which is used to pare down and centralise the negative cases.
 
 .. code:: python
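Note that the hunks above stop at the ".. code:: python" directives, so the ADASYN and SMOTEENN snippets themselves are not shown in this diff. A minimal, self-contained sketch of how both are typically used with imblearn; the toy dataset, the LogisticRegression classifier, and all variable names here are illustrative assumptions, not the file's own code:

.. code:: python

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from imblearn.over_sampling import ADASYN
    from imblearn.combine import SMOTEENN

    # toy imbalanced dataset; resampling is applied only AFTER the train-test split
    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # ADASYN: adaptive synthetic over-sampling of the minority class
    X_ada, y_ada = ADASYN().fit_resample(X_train, y_train)

    # SMOTEENN: SMOTE over-sampling followed by Edited Nearest Neighbours cleaning
    X_se, y_se = SMOTEENN().fit_resample(X_train, y_train)

    clf = LogisticRegression(max_iter=1000).fit(X_se, y_se)
    print(clf.score(X_test, y_test))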

evaluation.rst

+48-21
@@ -637,25 +637,52 @@ https://github.com/HDI-Project/BTB
     # remember to change INT to FLOAT where necessary
     tunables = [('n_estimators', HyperParameter(ParamTypes.INT, [500, 2000])),
                 ('max_depth', HyperParameter(ParamTypes.INT, [3, 20]))]
-    tuner = GP(tunables)
-    parameters = tuner.propose()
-    parameters
-
-    for i in range(10):
-        model = XGBClassifier(**parameters, n_jobs=-1)
-        model.fit(X_train, y_train)
-        y_predict = model.predict(X_test)
-        score = accuracy_score(y_test, y_predict)
-        tuner.add(parameters, score)
-        print(score, parameters)
+
+    def auto_tuning(tunables, epoch, X_train, X_test, y_train, y_test, verbose=0):
+        """Auto-tuner using the BTB library"""
+        tuner = GP(tunables)
         parameters = tuner.propose()
-
-    tuner._best_score
-
-    # 0.9492307692307692 {'n_estimators': 1200, 'max_depth': 13}
-    # 0.9507692307692308 {'n_estimators': 1659, 'max_depth': 15}
-    # 0.9492307692307692 {'n_estimators': 1661, 'max_depth': 14}
-    # 0.9492307692307692 {'n_estimators': 1654, 'max_depth': 13}
-    # 0.9492307692307692 {'n_estimators': 1658, 'max_depth': 16}
-    # 0.9476923076923077 {'n_estimators': 923, 'max_depth': 13}
-    # 0.9507692307692308 {'n_estimators': 1658, 'max_depth': 11}
+
+        score_list = []
+        param_list = []
+
+        for i in range(epoch):
+            # ** unpacks the parameter dict as keyword arguments
+            model = RandomForestClassifier(**parameters, n_jobs=-1)
+            model.fit(X_train, y_train)
+            y_predict = model.predict(X_test)
+            score = accuracy_score(y_test, y_predict)
+            print('epoch: {}, accuracy: {}'.format(i+1, score))
+
+            # store scores & parameters
+            score_list.append(score)
+            param_list.append(parameters)
+
+            if verbose == 0:
+                pass
+            elif verbose == 1:
+                print('epoch: {}, accuracy: {}'.format(i+1, score))
+            elif verbose == 2:
+                print('epoch: {}, accuracy: {}, param: {}'.format(i+1, score, parameters))
+
+            # register the result with the tuner and get new parameters to try
+            tuner.add(parameters, score)
+            parameters = tuner.propose()
+
+        best_s = tuner._best_score
+        best_score_index = score_list.index(best_s)
+        best_param = param_list[best_score_index]
+        print('\nbest accuracy: {}'.format(best_s))
+        print('best parameters: {}'.format(best_param))
+        return best_param
+
+    best_param = auto_tuning(tunables, 5, X_train, X_test, y_train, y_test)
+
+    # epoch: 1, accuracy: 0.7437106918238994
+    # epoch: 2, accuracy: 0.779874213836478
+    # epoch: 3, accuracy: 0.7940251572327044
+    # epoch: 4, accuracy: 0.7908805031446541
+    # epoch: 5, accuracy: 0.7987421383647799
+
+    # best accuracy: 0.7987421383647799
+    # best parameters: {'n_estimators': 1939, 'max_depth': 18}
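As a follow-up to the committed snippet (not part of the diff), the returned best_param can be used to refit a final model. A minimal sketch, assuming the same X_train/X_test, y_train/y_test, RandomForestClassifier and accuracy_score imports as above:

.. code:: python

    # refit once with the best hyperparameters found by the tuner
    final_model = RandomForestClassifier(**best_param, n_jobs=-1)
    final_model.fit(X_train, y_train)
    print(accuracy_score(y_test, final_model.predict(X_test)))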

exploratory.rst

+1-1
@@ -102,7 +102,7 @@ From different dataframes, displaying the same feature.
                        's3': cf20['Pressure'], 's4': cf30['Pressure'],'s5': cf45['Pressure']})
     df.boxplot(figsize=(10,5));
 
-.. image:: images/box3.png
+.. image:: images/box3.PNG
    :scale: 50 %
    :align: center
 

normalisation.rst

+27-6
@@ -4,7 +4,8 @@ Normalisation is another important concept needed to change all features to the
 This allows for faster convergence on learning, and more uniform influence for all weights.
 More on sklearn website:
 
-http://scikit-learn.org/stable/modules/preprocessing.html
+* http://scikit-learn.org/stable/modules/preprocessing.html
+
 
 `Tree-based models are not dependent on scaling, but non-tree models
 very often are hugely dependent on it.`
@@ -15,20 +16,33 @@ very often are hugely dependent on it.`
 
 Introduction to Machine Learning in Python
 
+Outliers can affect certain scalers, and it is important to either remove them or choose a scaler that is robust towards them.
+
+* https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
+* http://benalexkeen.com/feature-scaling-with-scikit-learn/
+
+
 Scaling
 -------
 Standard Scaler
 ****************
-This changes the data to have means of 0 and standard error of 1.
+It standardises features by removing the mean and scaling to unit variance.
+The standard score of a sample x is calculated as:
+
+z = (x - u) / s
 
 .. code:: python
 
     import pandas as pd
-    from sklearn import preprocessing
+    from sklearn.preprocessing import StandardScaler
+
+    X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,
+                                                        random_state = 0)
+    scaler = StandardScaler()
+    X_train_scaled = scaler.fit_transform(X_train)
+    # the scaler is fitted on the training set only, then used to transform the test set
+    X_test_scaled = scaler.transform(X_test)
 
-    # standardise the means to 0 and standard error to 1
-    for i in df.columns[:-1]: # df.columns[:-1] = dataframe for all features
-        df[i] = preprocessing.scale(df[i].astype('float64'))
 
 
 Min Max Scale
@@ -67,6 +81,9 @@ Pipeline
 ---------
 Scaling has a chance of leaking part of the test data in the train-test split into the training data.
 This is especially inevitable when using cross-validation.
+
+
+
 We can scale the train and test datasets separately to avoid this.
 However, a more convenient way is to use the pipeline function in sklearn, which wraps the scaler and classifier together,
 and scales them separately during cross validation.
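The pipeline code this paragraph refers to is not part of this hunk (only pipe.score appears as context further down), so here is a minimal sketch of the sklearn pipeline approach being described; the SVC classifier and the X_train/y_train names are assumptions for illustration:

.. code:: python

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # the scaler is re-fitted on each training fold only, so no test-fold data leaks into it
    pipe = make_pipeline(StandardScaler(), SVC())
    scores = cross_val_score(pipe, X_train, y_train, cv=5)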
@@ -88,3 +105,7 @@ Any other functions can also be input here, e.g., rolling window feature extract
 
     pipe.score(X_test, y_test)
     0.95104895104895104
+
+Persistence
+------------
+To reuse the fitted scaler on new datasets, it can be saved with pickle or joblib and loaded again later.
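The new Persistence section describes the idea without code; a minimal sketch using joblib (the file name and the scaler/X_new variables are assumptions):

.. code:: python

    import joblib

    # save the fitted scaler to disk
    joblib.dump(scaler, 'standard_scaler.joblib')

    # load it later and transform new data with the same fitted parameters
    scaler = joblib.load('standard_scaler.joblib')
    X_new_scaled = scaler.transform(X_new)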

preprocess.rst

+17
@@ -6,6 +6,23 @@ Numeric
 Feature Preprocessing
 ************************
 
+**Missing Values**
+^^^^^^^^^^^^^^^^^^^
+
+We can replace missing values across the entire dataframe with each column's mean or median.
+
+.. code:: python
+
+    import pandas as pd
+    import numpy as np
+    from sklearn.impute import SimpleImputer
+
+    impute = SimpleImputer(missing_values=np.nan, strategy='median', copy=False)
+    impute.fit(df)
+    # output is a numpy array, so convert it back to a dataframe
+    df2 = pd.DataFrame(impute.transform(df), columns=df.columns)
+
+
 **Scaling**
 ^^^^^^^^^^^^
 
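As an aside to the SimpleImputer example above (not part of the commit), the same column-wise median fill can also be done directly in pandas:

.. code:: python

    # fill each column's NaNs with that column's median; the result stays a dataframe
    df2 = df.fillna(df.median(numeric_only=True))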
