Data Transformations #2191

Closed
wants to merge 24 commits into from
24632d4
NLP Word Frequency Algorithms
danmurphy1217 Jun 21, 2020
cbb5f41
Added type hints and Wikipedia link to tf-idf
danmurphy1217 Jun 22, 2020
eb260b0
Update machine_learning/word_frequency_functions.py
danmurphy1217 Jun 22, 2020
e961f52
Update machine_learning/word_frequency_functions.py
danmurphy1217 Jun 22, 2020
e6b2357
Update machine_learning/word_frequency_functions.py
danmurphy1217 Jun 22, 2020
aa61ec8
Update machine_learning/word_frequency_functions.py
danmurphy1217 Jun 22, 2020
bed579d
Fix line length for flake8
danmurphy1217 Jun 22, 2020
616087b
Fix line length for flake8
danmurphy1217 Jun 22, 2020
9ef8e62
Fix line length for flake8 V2
danmurphy1217 Jun 22, 2020
1152edd
Add line escapes and change int to float
danmurphy1217 Jun 22, 2020
e8890d6
Corrected doctests
danmurphy1217 Jun 23, 2020
bcbb8f6
Fix for TravisCI
danmurphy1217 Jun 23, 2020
a2628d4
Fix for TravisCI V2
danmurphy1217 Jun 23, 2020
a0bef59
Tests passing locally
danmurphy1217 Jun 23, 2020
4cd803a
Tests passing locally
danmurphy1217 Jun 23, 2020
fcc07c9
Update machine_learning/word_frequency_functions.py
danmurphy1217 Jun 24, 2020
d35b5a6
Update machine_learning/word_frequency_functions.py
danmurphy1217 Jun 24, 2020
e901e09
Update machine_learning/word_frequency_functions.py
danmurphy1217 Jun 24, 2020
0a85c0f
Update machine_learning/word_frequency_functions.py
danmurphy1217 Jun 24, 2020
fcef21e
Add doctest examples and clean up docstrings
danmurphy1217 Jun 24, 2020
a0892e5
Merge branch 'dan' of https://github.com/danmurphy1217/Python into dan
danmurphy1217 Jun 24, 2020
f669051
Added Standardization and Normalization algorithms
danmurphy1217 Jul 9, 2020
eab6ab4
Merge branch 'master' into dan
danmurphy1217 Jul 9, 2020
f273d6a
Delete word_frequency_functions.py
danmurphy1217 Jul 9, 2020
101 changes: 101 additions & 0 deletions machine_learning/data_transformations.py
@@ -0,0 +1,101 @@
"""
Normalization Wikipedia: https://en.wikipedia.org/wiki/Normalization
Standardization Wikipedia: https://en.wikipedia.org/wiki/Standardization

Normalization (min-max scaling) is the process of rescaling numerical data
to a fixed range of values, typically [0, 1] or [-1, 1]. The equation used
in this module is x_norm = (x - x_min) / (x_max - x_min), which maps the
data to [0, 1]. Here x_norm is the normalized value, x is the original
value, x_min is the minimum value within the column or list of data, and
x_max is the maximum value within the column or list of data. Normalization
puts all features on a similar scale, which can speed up model training.
This is useful because large differences in the ranges of features can
heavily impact optimization (particularly Gradient Descent).

Standardization is the process of rescaling numerical data so that it has a
mean of 0 and a standard deviation of 1. This is also known as z-score
normalization. The equation for standardization is x_std = (x - mu) / sigma
where mu is the mean of the column or list of values and sigma is the
standard deviation of the column or list of values. Standardization shifts
and scales the values but does not change the shape of their distribution,
so the result is normally distributed only if the input already was.

Choosing between Normalization & Standardization is more of an art than a
science, and it is often recommended to run experiments with both to see
which performs better. A few rules of thumb are:
1. gaussian (normal) distributions tend to work better with standardization
2. non-gaussian (non-normal) distributions tend to work better with
   normalization
3. if a column or list of values has extreme values / outliers, prefer
   standardization, because a single outlier sets x_min or x_max and
   compresses the remaining normalized values into a narrow band
"""
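Review note: the two formulas in the module docstring can be checked against the standard library. The sketch below is not part of the PR; the names `min_max` and `z_score` are hypothetical, and it assumes the population standard deviation (`statistics.pstdev`), which is what the helper in this file computes. It skips the rounding so the mean-0 and deviation-1 properties are visible.

```python
from statistics import mean, pstdev


def min_max(data):
    """x_norm = (x - x_min) / (x_max - x_min), without rounding."""
    lo, hi = min(data), max(data)
    return [(x - lo) / (hi - lo) for x in data]


def z_score(data):
    """x_std = (x - mu) / sigma, using the population standard deviation."""
    mu = mean(data)
    sigma = pstdev(data)
    return [(x - mu) / sigma for x in data]


values = [5, 10, 15, 20, 25]
scaled = min_max(values)
standardized = z_score(values)

print(scaled)                            # [0.0, 0.25, 0.5, 0.75, 1.0]
print(abs(mean(standardized)) < 1e-12)   # True: the mean is (numerically) 0
print(round(pstdev(standardized), 6))    # 1.0
```

Standardizing only shifts and scales the values, so a skewed input stays skewed after the transformation.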


def normalization(data: list) -> list:
    """
    Returns a normalized list of values
    @params: data, a list of values to normalize
    @returns: a list of normalized values (rounded to 3 decimals)
    @examples:
    >>> normalization([2, 7, 10, 20, 30, 50])
    [0.0, 0.104, 0.167, 0.375, 0.583, 1.0]

    >>> normalization([5, 10, 15, 20, 25])
    [0.0, 0.25, 0.5, 0.75, 1.0]
    """
    # variables for calculation
    x_min = min(data)
    x_max = max(data)
    # normalize data
    # note: raises ZeroDivisionError when all values are equal (x_max == x_min)
    return [round((x - x_min) / (x_max - x_min), 3) for x in data]
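Review note: the module docstring mentions both [0, 1] and [-1, 1] as target ranges, but the function above only produces [0, 1] and divides by zero on constant input. One possible generalization is sketched below. The name `rescale` and its `lower` / `upper` parameters are hypothetical and not part of the PR, and raising `ValueError` on constant data is an assumption about the desired behavior.

```python
def rescale(data, lower=0.0, upper=1.0):
    """Min-max scale data into [lower, upper], rounded to 3 decimals."""
    x_min, x_max = min(data), max(data)
    if x_max == x_min:
        # every value is identical, so the min-max formula is undefined
        raise ValueError("cannot rescale constant data: max equals min")
    span = x_max - x_min
    return [round(lower + (x - x_min) * (upper - lower) / span, 3) for x in data]


print(rescale([5, 10, 15, 20, 25]))         # [0.0, 0.25, 0.5, 0.75, 1.0]
print(rescale([5, 10, 15, 20, 25], -1, 1))  # [-1.0, -0.5, 0.0, 0.5, 1.0]
```

With the default bounds this reproduces the second doctest of `normalization`.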


def standardization(data: list) -> list:
    """
    Returns a standardized list of values
    @params: data, a list of values to standardize
    @returns: a list of standardized values (rounded to 3 decimals)
    @examples:
    >>> standardization([2, 7, 10, 20, 30, 50])
    [-1.095, -0.788, -0.604, 0.01, 0.624, 1.852]

    >>> standardization([5, 10, 15, 20, 25])
    [-1.414, -0.707, 0.0, 0.707, 1.414]
    """
    # variables for calculation
    mu = mean(data)
    sigma = stdDeviation(data)

    # standardize data
    return [round((x - mu) / sigma, 3) for x in data]


def mean(data: list) -> float:
    """
    Helper function that returns the mean of a list of values
    @params: data, a list of values
    @returns: a float representing the mean (rounded to 3 decimals)
    @examples:
    >>> mean([2, 7, 10, 20, 30, 50])
    19.833

    >>> mean([5, 10, 15, 20, 25])
    15.0
    """
    return round(sum(data) / len(data), 3)


def stdDeviation(data: list) -> float:
    """
    Helper function that returns the population standard deviation
    of a list of values
    @params: data, a list of values
    @returns: a float representing the standard deviation (rounded to 3 decimals)
    @examples:
    >>> stdDeviation([2, 7, 10, 20, 30, 50])
    16.293

    >>> stdDeviation([5, 10, 15, 20, 25])
    7.071
    """
    x_mean = mean(data)
    sum_squared_diff = sum((x - x_mean) ** 2 for x in data)
    # divide by len(data), i.e. the population (not sample) standard deviation
    return round((sum_squared_diff / len(data)) ** 0.5, 3)
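Review note: `stdDeviation` divides by `len(data)`, so it computes the population standard deviation rather than the sample one. The cross-check below uses only the standard library's `statistics` module and is not part of the PR. It shows how far apart the two definitions are on the second doctest input.

```python
from statistics import pstdev, stdev

data = [5, 10, 15, 20, 25]

population_sd = round(pstdev(data), 3)  # divides by n, as stdDeviation does
sample_sd = round(stdev(data), 3)       # divides by n - 1

print(population_sd)  # 7.071
print(sample_sd)      # 7.906
```

The distinction matters when comparing results with other tools: NumPy's `std` defaults to the population form (`ddof=0`), while pandas' `Series.std` defaults to the sample form (`ddof=1`).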
129 changes: 0 additions & 129 deletions machine_learning/word_frequency_functions.py

This file was deleted.