
Commit f0e0525

author "jake" committed .
1 parent 43744a3 commit f0e0525

5 files changed: +106 -29 lines changed

classimbalance.rst

+13-1
@@ -15,6 +15,18 @@ An important thing to note is that **resampling must be done AFTER the train-tes
 Over-Sampling
 ---------------
 
+SMOTE (synthetic minority over-sampling technique) is a common and popular up-sampling technique.
+
+.. code:: python
+
+    from imblearn.over_sampling import SMOTE
+
+    smote = SMOTE()
+    X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
+    clf = LogisticRegression()
+    clf.fit(X_resampled, y_resampled)
+
+
 ADASYN is one of the more advanced over-sampling algorithms.
 
 .. code:: python
@@ -42,7 +54,7 @@ Under-Sampling
 Under/Over-Sampling
 --------------------
 
-SMOTEENN combines SMOTE (synthetic over sampling) with Edited Nearest Neighbours,
+SMOTEENN combines SMOTE with Edited Nearest Neighbours,
 which is used to pare down and centralise the negative cases.
 
 .. code:: python
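Note that the hunks above stop at the ".. code:: python" directives, so the ADASYN and SMOTEENN snippets themselves are not shown in this diff. A minimal, self-contained sketch of how both are typically used with imblearn; the toy dataset, the LogisticRegression classifier, and all variable names here are illustrative assumptions, not the file's own code:

.. code:: python

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from imblearn.over_sampling import ADASYN
    from imblearn.combine import SMOTEENN

    # toy imbalanced dataset; resampling is applied only AFTER the train-test split
    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # ADASYN: adaptive synthetic over-sampling of the minority class
    X_ada, y_ada = ADASYN().fit_resample(X_train, y_train)

    # SMOTEENN: SMOTE over-sampling followed by Edited Nearest Neighbours cleaning
    X_se, y_se = SMOTEENN().fit_resample(X_train, y_train)

    clf = LogisticRegression(max_iter=1000).fit(X_se, y_se)
    print(clf.score(X_test, y_test))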

evaluation.rst

+48-21
@@ -637,25 +637,52 @@ https://github.com/HDI-Project/BTB
     # remember to change INT to FLOAT where necessary
     tunables = [('n_estimators', HyperParameter(ParamTypes.INT, [500, 2000])),
                 ('max_depth', HyperParameter(ParamTypes.INT, [3, 20]))]
-    tuner = GP(tunables)
-    parameters = tuner.propose()
-    parameters
-
-    for i in range(10):
-        model = XGBClassifier(**parameters, n_jobs=-1)
-        model.fit(X_train, y_train)
-        y_predict = model.predict(X_test)
-        score = accuracy_score(y_test, y_predict)
-        tuner.add(parameters, score)
-        print(score, parameters)
+
+    def auto_tuning(tunables, epoch, X_train, X_test, y_train, y_test, verbose=0):
+        """Auto-tuner using the BTB library"""
+        tuner = GP(tunables)
         parameters = tuner.propose()
-
-    tuner._best_score
-
-    # 0.9492307692307692 {'n_estimators': 1200, 'max_depth': 13}
-    # 0.9507692307692308 {'n_estimators': 1659, 'max_depth': 15}
-    # 0.9492307692307692 {'n_estimators': 1661, 'max_depth': 14}
-    # 0.9492307692307692 {'n_estimators': 1654, 'max_depth': 13}
-    # 0.9492307692307692 {'n_estimators': 1658, 'max_depth': 16}
-    # 0.9476923076923077 {'n_estimators': 923, 'max_depth': 13}
-    # 0.9507692307692308 {'n_estimators': 1658, 'max_depth': 11}
+
+        score_list = []
+        param_list = []
+
+        for i in range(epoch):
+            # ** unpacks the parameter dict as keyword arguments
+            model = RandomForestClassifier(**parameters, n_jobs=-1)
+            model.fit(X_train, y_train)
+            y_predict = model.predict(X_test)
+            score = accuracy_score(y_test, y_predict)
+            print('epoch: {}, accuracy: {}'.format(i+1, score))
+
+            # store scores & parameters
+            score_list.append(score)
+            param_list.append(parameters)
+
+            if verbose == 0:
+                pass
+            elif verbose == 1:
+                print('epoch: {}, accuracy: {}'.format(i+1, score))
+            elif verbose == 2:
+                print('epoch: {}, accuracy: {}, param: {}'.format(i+1, score, parameters))
+
+            # register the result with the tuner and get new parameters to try
+            tuner.add(parameters, score)
+            parameters = tuner.propose()
+
+        best_s = tuner._best_score
+        best_score_index = score_list.index(best_s)
+        best_param = param_list[best_score_index]
+        print('\nbest accuracy: {}'.format(best_s))
+        print('best parameters: {}'.format(best_param))
+        return best_param
+
+    best_param = auto_tuning(tunables, 5, X_train, X_test, y_train, y_test)
+
+    # epoch: 1, accuracy: 0.7437106918238994
+    # epoch: 2, accuracy: 0.779874213836478
+    # epoch: 3, accuracy: 0.7940251572327044
+    # epoch: 4, accuracy: 0.7908805031446541
+    # epoch: 5, accuracy: 0.7987421383647799
+
+    # best accuracy: 0.7987421383647799
+    # best parameters: {'n_estimators': 1939, 'max_depth': 18}
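As a follow-up to the committed snippet (not part of the diff), the returned best_param can be used to refit a final model. A minimal sketch, assuming the same X_train/X_test, y_train/y_test, RandomForestClassifier and accuracy_score imports as above:

.. code:: python

    # refit once with the best hyperparameters found by the tuner
    final_model = RandomForestClassifier(**best_param, n_jobs=-1)
    final_model.fit(X_train, y_train)
    print(accuracy_score(y_test, final_model.predict(X_test)))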

exploratory.rst

+1-1
@@ -102,7 +102,7 @@ From different dataframes, displaying the same feature.
                        's3': cf20['Pressure'], 's4': cf30['Pressure'],'s5': cf45['Pressure']})
     df.boxplot(figsize=(10,5));
 
-.. image:: images/box3.png
+.. image:: images/box3.PNG
    :scale: 50 %
    :align: center
 

normalisation.rst

+27-6
@@ -4,7 +4,8 @@ Normalisation is another important concept needed to change all features to the
 This allows for faster convergence on learning, and more uniform influence for all weights.
 More on sklearn website:
 
-http://scikit-learn.org/stable/modules/preprocessing.html
+* http://scikit-learn.org/stable/modules/preprocessing.html
+
 
 `Tree-based models are not dependent on scaling, but non-tree models
 very often are hugely dependent on it.`
@@ -15,20 +16,33 @@ very often are hugely dependent on it.`
 
 Introduction to Machine Learning in Python
 
+Outliers can affect certain scalers, and it is important to either remove them or choose a scaler that is robust towards them.
+
+* https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
+* http://benalexkeen.com/feature-scaling-with-scikit-learn/
+
+
 Scaling
 -------
 Standard Scaler
 ****************
-This changes the data to have means of 0 and standard error of 1.
+It standardises features by removing the mean and scaling to unit variance.
+The standard score of a sample x is calculated as:
+
+z = (x - u) / s
 
 .. code:: python
 
     import pandas as pd
-    from sklearn import preprocessing
+    from sklearn.preprocessing import StandardScaler
+
+    X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,
+                                                        random_state = 0)
+    scaler = StandardScaler()
+    X_train_scaled = scaler.fit_transform(X_train)
+    # the scaler is fitted on the training set only, then used to transform the test set
+    X_test_scaled = scaler.transform(X_test)
 
-    # standardise the means to 0 and standard error to 1
-    for i in df.columns[:-1]: # df.columns[:-1] = dataframe for all features
-        df[i] = preprocessing.scale(df[i].astype('float64'))
 
 
 Min Max Scale
@@ -67,6 +81,9 @@ Pipeline
 ---------
 Scaling has a chance of leaking part of the test data in the train-test split into the training data.
 This is especially inevitable when using cross-validation.
+
+
+
 We can scale the train and test datasets separately to avoid this.
 However, a more convenient way is to use the pipeline function in sklearn, which wraps the scaler and classifier together,
 and scales them separately during cross validation.
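The pipeline code this paragraph refers to is not part of this hunk (only pipe.score appears as context further down), so here is a minimal sketch of the sklearn pipeline approach being described; the SVC classifier and the X_train/y_train names are assumptions for illustration:

.. code:: python

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # the scaler is re-fitted on each training fold only, so no test-fold data leaks into it
    pipe = make_pipeline(StandardScaler(), SVC())
    scores = cross_val_score(pipe, X_train, y_train, cv=5)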
@@ -88,3 +105,7 @@ Any other functions can also be input here, e.g., rolling window feature extract
 
     pipe.score(X_test, y_test)
     0.95104895104895104
+
+Persistence
+------------
+To reuse the fitted scaler on new datasets, it can be saved with pickle or joblib and loaded again later.
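The new Persistence section describes the idea without code; a minimal sketch using joblib (the file name and the scaler/X_new variables are assumptions):

.. code:: python

    import joblib

    # save the fitted scaler to disk
    joblib.dump(scaler, 'standard_scaler.joblib')

    # load it later and transform new data with the same fitted parameters
    scaler = joblib.load('standard_scaler.joblib')
    X_new_scaled = scaler.transform(X_new)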

preprocess.rst

+17
@@ -6,6 +6,23 @@ Numeric
 Feature Preprocessing
 ************************
 
+**Missing Values**
+^^^^^^^^^^^^^^^^^^^
+
+We can replace missing values across the entire dataframe with each column's mean or median.
+
+.. code:: python
+
+    import pandas as pd
+    import numpy as np
+    from sklearn.impute import SimpleImputer
+
+    impute = SimpleImputer(missing_values=np.nan, strategy='median', copy=False)
+    impute.fit(df)
+    # output is a numpy array, so convert it back to a dataframe
+    df2 = pd.DataFrame(impute.transform(df), columns=df.columns)
+
+
 **Scaling**
 ^^^^^^^^^^^^
 
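As an aside to the SimpleImputer example above (not part of the commit), the same column-wise median fill can also be done directly in pandas:

.. code:: python

    # fill each column's NaNs with that column's median; the result stays a dataframe
    df2 = df.fillna(df.median(numeric_only=True))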
