Skip to content

Added Normalization and Standardization Algorithms #2192

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 10, 2020
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
62 changes: 62 additions & 0 deletions machine_learning/data_transformations.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
"""
Normalization Wikipedia: https://en.wikipedia.org/wiki/Normalization
Normalization is the process of converting numerical data to a standard range of values.
This range is typically between [0, 1] or [-1, 1]. The equation for normalization is
x_norm = (x - x_min)/(x_max - x_min) where x_norm is the normalized value, x is the
value, x_min is the minimum value within the column or list of data, and x_max is the
maximum value within the column or list of data. Normalization is used to speed up the
training of data and put all of the data on a similar scale. This is useful because
variance in the range of values of a dataset can heavily impact optimization
(particularly Gradient Descent).

Standardization Wikipedia: https://en.wikipedia.org/wiki/Standardization
Standardization is the process of converting numerical data to a normally distributed
range of values. This range will have a mean of 0 and standard deviation of 1. This is
also known as z-score normalization. The equation for standardization is
x_std = (x - mu)/(sigma) where mu is the mean of the column or list of values and sigma
is the standard deviation of the column or list of values.

Choosing between Normalization & Standardization is more of an art of a science, but it
is often recommended to run experiments with both to see which performs better.
Additionally, a few rules of thumb are:
1. gaussian (normal) distributions work better with standardization
2. non-gaussian (non-normal) distributions work better with normalization
3. If a column or list of values has extreme values / outliers, use standardization
"""
from statistics import mean, stdev


def normalization(data: list, ndigits: int = 3) -> list:
"""
Returns a normalized list of values
@params: data, a list of values to normalize
@returns: a list of normalized values (rounded to ndigits decimal places)
@examples:
>>> normalization([2, 7, 10, 20, 30, 50])
[0.0, 0.104, 0.167, 0.375, 0.583, 1.0]
>>> normalization([5, 10, 15, 20, 25])
[0.0, 0.25, 0.5, 0.75, 1.0]
"""
# variables for calculation
x_min = min(data)
x_max = max(data)
# normalize data
return [round((x - x_min) / (x_max - x_min), ndigits) for x in data]


def standardization(data: list, ndigits: int = 3) -> list:
"""
Returns a standardized list of values
@params: data, a list of values to standardize
@returns: a list of standardized values (rounded to ndigits decimal places)
@examples:
>>> standardization([2, 7, 10, 20, 30, 50])
[-0.999, -0.719, -0.551, 0.009, 0.57, 1.69]
>>> standardization([5, 10, 15, 20, 25])
[-1.265, -0.632, 0.0, 0.632, 1.265]
"""
# variables for calculation
mu = mean(data)
sigma = stdev(data)
# standardize data
return [round((x - mu) / (sigma), ndigits) for x in data]