[Optimizer] Simplify optimizers using generalized vector math. (#218)
Thank you!
Sources/TensorFlow/Optimizer.swift
Outdated
public var step: Int = 0
public var accumulators: Model.AllDifferentiableVariables
public var accDelta: Model.AllDifferentiableVariables

Remove extra line.
Sources/TensorFlow/Optimizer.swift
Outdated
/// The current step.
public var step: Int = 0
public var accumulators: Model.AllDifferentiableVariables
public var accDelta: Model.AllDifferentiableVariables
- Could you add a doc comment that explains this property?
- Avoid abbreviations in this property name, since `accDelta` is not a term of art and may be confusing.
Sources/TensorFlow/Optimizer.swift
Outdated
                    along direction: Model.TangentVector) {
        update(&model.allDifferentiableVariables, along: direction)
    }
}
Please leave a trailing empty line in the file.
Sources/TensorFlow/Optimizer.swift
Outdated
public typealias Model = Model
/// The learning rate.
public var learningRate: Float
///decay factor, corresponding to fraction of gradient to keep at each time step.
Suggested change:
- ///decay factor, corresponding to fraction of gradient to keep at each time step.
+ /// The decay factor, corresponding to fraction of gradient to keep at each time step.
Sources/TensorFlow/Optimizer.swift
Outdated
accNew += (1 - rho) * (direction[keyPath: kp] * direction[keyPath: kp])
accumulators[keyPath: kp] = accNew

var stepSize = direction[keyPath: kp] * sqrt(updates[keyPath: kp] + epsilon)
Remove extra whitespace.
var stepSize = direction[keyPath: kp] * sqrt(updates[keyPath: kp] + epsilon)
Sources/TensorFlow/Optimizer.swift
Outdated
accumulators[keyPath: kp] = accNew

var stepSize = direction[keyPath: kp] * sqrt(updates[keyPath: kp] + epsilon)
stepSize /= sqrt(accumulators[keyPath: kp] + epsilon)
Remove extra whitespace.
stepSize /= sqrt(accumulators[keyPath: kp] + epsilon)
Sources/TensorFlow/Optimizer.swift
Outdated
var updatesNew = updates[keyPath: kp] * rho
updatesNew += (1 - rho) * (stepSize * stepSize)
updates[keyPath: kp] = updatesNew

Remove the redundant empty line.
Sources/TensorFlow/Optimizer.swift
Outdated
public var decay: Float
/// The current step.
public var step: Int = 0
/// Accumulate Gradients
Two suggestions on documentation comments:
- Properties should be described with a noun phrase ending with a period.
- Documentation comments for mathematical variables should use the term of art. In this case, we should switch to the accurate, proper terminology that's used by the paper.
Suggested change:
- /// Accumulate Gradients
+ /// The accumulated, exponentially decaying average of squared gradients.
Sources/TensorFlow/Optimizer.swift
Outdated
/// Accumulate Gradients
public var accumulators: Model.AllDifferentiableVariables
/// Accumulate Updates(here StepSizes)
public var updates: Model.AllDifferentiableVariables
Suggested change:
- public var updates: Model.AllDifferentiableVariables
+ public var accumulatedDelta: Model.AllDifferentiableVariables
Here using the word "delta" would make this variable better correspond to the 𝚫x term in the paper.
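For reference, these are the corresponding accumulators in Zeiler's ADADELTA paper (a restatement of the standard formulation, using the paper's symbols rather than identifiers from this file):

    E[g^2]_t = \rho \, E[g^2]_{t-1} + (1 - \rho)\, g_t^2
    \Delta x_t = -\frac{\sqrt{E[\Delta x^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}}\, g_t
    E[\Delta x^2]_t = \rho \, E[\Delta x^2]_{t-1} + (1 - \rho)\, \Delta x_t^2
    x_{t+1} = x_t + \Delta x_t

so the renamed "accumulatedDelta" tracks E[Δx²], while the squared-gradient accumulator tracks E[g²].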
Sources/TensorFlow/Optimizer.swift
Outdated
public var step: Int = 0
/// Accumulate Gradients
public var accumulators: Model.AllDifferentiableVariables
/// Accumulate Updates(here StepSizes)
Suggested change:
- /// Accumulate Updates(here StepSizes)
+ /// The accumulated parameter updates.
Sources/TensorFlow/Optimizer.swift
Outdated
public init(
    for model: __shared Model,
    learningRate: Float = 1,
    rho: Float = 0.9,
Keras uses `rho = 0.95` by default.
Sources/TensorFlow/Optimizer.swift
Outdated

// Update Float & Double Tensor variables.
for kp in model.recursivelyAllWritableKeyPaths(to: Tensor<Float>.self) {
    var accNew = rho * accumulators[keyPath: kp]
For clarity, I'd suggest rewriting this step as
    averageSquared[keyPath: kp] *= rho
    averageSquared[keyPath: kp] += (1 - rho) * (direction[keyPath: kp] * direction[keyPath: kp])
This way, you won't need to define extra variables. Same suggestion for the `Tensor<Double>` handling block.
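A minimal, self-contained sketch of the suggested in-place form, using a toy Float-based stand-in for the key-path loop (Params, its properties, and the values are illustrative, not the library's API):

    struct Params {
        var w: Float = 0
        var b: Float = 0
    }

    var averageSquared = Params()                  // E[g²] accumulator
    let direction = Params(w: 0.5, b: -1.0)        // stand-in for the gradient
    let rho: Float = 0.95

    let keyPaths: [WritableKeyPath<Params, Float>] = [\Params.w, \Params.b]
    for kp in keyPaths {
        // Exponential moving average of squared gradients, updated in place,
        // with no `accNew` temporary.
        averageSquared[keyPath: kp] *= rho
        averageSquared[keyPath: kp] += (1 - rho) * (direction[keyPath: kp] * direction[keyPath: kp])
    }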
Sources/TensorFlow/Optimizer.swift
Outdated
stepSize /= sqrt(accumulators[keyPath: kp] + epsilon)
model[keyPath: kp] -= learningRate * stepSize

var updatesNew = updates[keyPath: kp] * rho
I would suggest changing this step to the following:
    accumulatedDelta *= rho
    accumulatedDelta += (1 - rho) * stepSize.squared()
Same suggestion for the `Tensor<Double>` handling block.
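And a matching self-contained sketch of the rest of the step with the suggested in-place accumulation; plain Floats stand in for Tensor here, so `(stepSize * stepSize)` and `.squareRoot()` take the place of `.squared()` and `sqrt()`, and all names and values are illustrative:

    struct Params { var w: Float = 0; var b: Float = 0 }

    var model = Params(w: 1.0, b: -1.0)
    var averageSquared = Params(w: 0.25, b: 0.5)   // E[g²] accumulated so far
    var accumulatedDelta = Params()                // E[Δx²], starts at zero
    let direction = Params(w: 0.5, b: -1.0)        // stand-in for the gradient
    let rho: Float = 0.95
    let epsilon: Float = 1e-6
    let learningRate: Float = 1

    let keyPaths: [WritableKeyPath<Params, Float>] = [\Params.w, \Params.b]
    for kp in keyPaths {
        // Δx = g · RMS[Δx] / RMS[g]
        var stepSize = direction[keyPath: kp] * (accumulatedDelta[keyPath: kp] + epsilon).squareRoot()
        stepSize /= (averageSquared[keyPath: kp] + epsilon).squareRoot()
        model[keyPath: kp] -= learningRate * stepSize
        // Accumulate squared updates in place, with no `updatesNew` temporary.
        accumulatedDelta[keyPath: kp] *= rho
        accumulatedDelta[keyPath: kp] += (1 - rho) * (stepSize * stepSize)
    }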
Sources/TensorFlow/Optimizer.swift
Outdated
public func update(_ model: inout Model.AllDifferentiableVariables,
                   along direction: Model.AllDifferentiableVariables) {
    step += 1
    let learningRate = self.learningRate * 1 / (1 + decay * Float(step))
`* 1` is redundant.
Suggested change:
- let learningRate = self.learningRate * 1 / (1 + decay * Float(step))
+ let learningRate = self.learningRate / (1 + decay * Float(step))
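In other words, the schedule being applied is the usual inverse time decay (just a restatement of the line above):

    \eta_t = \frac{\eta_0}{1 + \text{decay} \cdot t}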
Sources/TensorFlow/Optimizer.swift
Outdated
for kp in model.recursivelyAllWritableKeyPaths(to: Tensor<Float>.self) {
    averageSquared[keyPath: kp] *= rho
    averageSquared[keyPath: kp] +=
    (1 - rho) * (direction[keyPath: kp] * direction[keyPath: kp])
Indent by 4.
        (1 - rho) * (direction[keyPath: kp] * direction[keyPath: kp])
Sources/TensorFlow/Optimizer.swift
Outdated
averageSquared[keyPath: kp] +=
    (1 - rho) * (direction[keyPath: kp] * direction[keyPath: kp])
var stepSize = direction[keyPath: kp] *
    sqrt(accumulatedDelta[keyPath: kp] + epsilon)
    sqrt(accumulatedDelta[keyPath: kp] + epsilon)
Sources/TensorFlow/Optimizer.swift
Outdated
for kp in model.recursivelyAllWritableKeyPaths(to: Tensor<Double>.self) {
    averageSquared[keyPath: kp] *= Double(rho)
    averageSquared[keyPath: kp] +=
    (1 - Double(rho)) * (direction[keyPath: kp] * direction[keyPath: kp])
        (1 - Double(rho)) * (direction[keyPath: kp] * direction[keyPath: kp])
Sources/TensorFlow/Optimizer.swift
Outdated
averageSquared[keyPath: kp] +=
    (1 - Double(rho)) * (direction[keyPath: kp] * direction[keyPath: kp])
var stepSize = direction[keyPath: kp] *
    sqrt(accumulatedDelta[keyPath: kp] + Double(epsilon))
    sqrt(accumulatedDelta[keyPath: kp] + Double(epsilon))
Sources/TensorFlow/Optimizer.swift
Outdated
@@ -355,3 +355,98 @@ public class AdaGrad<Model: Layer>: Optimizer
        update(&model.allDifferentiableVariables, along: direction)
    }
}

/// Adadelta Optimizer
- Use the canonical name (the name in the paper) to describe this class.
- End all comments with a period.
- Make the "O" in "Optimizer" lowercase, for consistency with the API documentation on other optimizers in this file.
Suggested change:
- /// Adadelta Optimizer
+ /// ADADELTA optimizer.
Sources/TensorFlow/Optimizer.swift
Outdated

/// Adadelta Optimizer
///
/// It is a method that uses the magnitude of recent gradients
API documentation should fit to size (i.e. to 100 columns).
I'd suggest changing the summary to the following, similar to the Keras documentation.
ADADELTA is a more robust extension of AdaGrad that adapts learning rates based on a moving window of gradient updates, instead of accumulating all past gradients. This way, Adadelta continues learning even when many updates have been done. Compared to AdaGrad, in the original version of ADADELTA you don't have to set an initial learning rate. In this version, initial learning rate and decay factor can be set, as in most other optimizers.
Sources/TensorFlow/Optimizer.swift
Outdated
step += 1
let learningRate = self.learningRate * 1 / (1 + decay * Float(step))

// Update Float & Double Tensor variables.
Suggested change:
- // Update Float & Double Tensor variables.
+ // Update `Tensor<Float>` and `Tensor<Double>` variables.
Sources/TensorFlow/Optimizer.swift
Outdated

/// ADADELTA optimizer.
///
/// ADADELTA is a more robust extension of AdaGrad that adapts learning rates
This block of API documentation has lines whose length varies significantly from line to line. Could you make sure each line is filled with words until it reaches 100 columns? This is what I meant by suggesting "fit to size". Hope this clarifies things!
Use "AdaGrad" and "ADADELTA" in documentation to stay faithful to paper spelling.
Thanks @dan-zheng for reordering the class definitions!