This repository was archived by the owner on Jul 1, 2023. It is now read-only.

Add Adadelta Optimizer #302

Merged
merged 19 commits into tensorflow:master on Jun 28, 2019

Conversation

lakshya-sky
Contributor

No description provided.

@rxwei rxwei requested review from rxwei and dan-zheng June 27, 2019 18:38
@rxwei rxwei left a comment (Contributor)

Thank you!

public var step: Int = 0
public var accumulators: Model.AllDifferentiableVariables
public var accDelta: Model.AllDifferentiableVariables

Contributor

Remove extra line.

/// The current step.
public var step: Int = 0
public var accumulators: Model.AllDifferentiableVariables
public var accDelta: Model.AllDifferentiableVariables
Contributor

  • Could you add a doc comment that explains this property?
  • Avoid abbreviations in this property name, since accDelta is not a term of art and may be confusing; see the sketch below for one possibility.
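
One way the renamed, documented property might look (a sketch only; it uses the `accumulatedDelta` name and the doc-comment wording suggested later in this review):

/// The accumulated parameter updates.
public var accumulatedDelta: Model.AllDifferentiableVariables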

along direction: Model.TangentVector) {
update(&model.allDifferentiableVariables, along: direction)
}
}
Contributor

Please leave a trailing empty line in the file.

public typealias Model = Model
/// The learning rate.
public var learningRate: Float
///decay factor, corresponding to fraction of gradient to keep at each time step.
Contributor

Suggested change
///decay factor, corresponding to fraction of gradient to keep at each time step.
/// The decay factor, corresponding to fraction of gradient to keep at each time step.

accNew += (1 - rho) * (direction[keyPath: kp] * direction[keyPath: kp])
accumulators[keyPath: kp] = accNew

var stepSize = direction[keyPath: kp] * sqrt(updates[keyPath: kp] + epsilon)
Contributor

Remove extra whitespace.

Suggested change
var stepSize = direction[keyPath: kp] * sqrt(updates[keyPath: kp] + epsilon)
var stepSize = direction[keyPath: kp] * sqrt(updates[keyPath: kp] + epsilon)

accumulators[keyPath: kp] = accNew

var stepSize = direction[keyPath: kp] * sqrt(updates[keyPath: kp] + epsilon)
stepSize /= sqrt(accumulators[keyPath: kp] + epsilon)
Contributor

Remove extra whitespace.

Suggested change
stepSize /= sqrt(accumulators[keyPath: kp] + epsilon)
stepSize /= sqrt(accumulators[keyPath: kp] + epsilon)

var updatesNew = updates[keyPath: kp] * rho
updatesNew += (1 - rho) * (stepSize * stepSize)
updates[keyPath: kp] = updatesNew

Contributor

Remove the redundant empty line.

Suggested change

public var decay: Float
/// The current step.
public var step: Int = 0
/// Accumulate Gradients
Contributor

Two suggestions on documentation comments:

  • Properties should be described with a noun phrase ending with a period.
  • Documentation comments for mathematical variables should use the term of art. In this case, we should switch to the accurate, proper terminology that's used by the paper.
Suggested change
/// Accumulate Gradients
/// The accumulated, exponentially decaying average of squared gradients.

/// Accumulate Gradients
public var accumulators: Model.AllDifferentiableVariables
/// Accumulate Updates(here StepSizes)
public var updates: Model.AllDifferentiableVariables
Contributor

Suggested change
public var updates: Model.AllDifferentiableVariables
public var accumulatedDelta: Model.AllDifferentiableVariables

Contributor

Here using the word "delta" would make this variable better correspond to the 𝚫x term in the paper.
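
For reference, the update rules from the ADADELTA paper (Zeiler, 2012) that these two accumulators track, written in LaTeX:

E[g^2]_t = \rho \, E[g^2]_{t-1} + (1 - \rho) \, g_t^2
\Delta x_t = -\frac{RMS[\Delta x]_{t-1}}{RMS[g]_t} \, g_t
E[\Delta x^2]_t = \rho \, E[\Delta x^2]_{t-1} + (1 - \rho) \, \Delta x_t^2
x_{t+1} = x_t + \Delta x_t

where RMS[z]_t = \sqrt{E[z^2]_t + \epsilon}. In this PR, `accumulators` (later `averageSquared`) corresponds to E[g^2] and `accumulatedDelta` to E[\Delta x^2].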

public var step: Int = 0
/// Accumulate Gradients
public var accumulators: Model.AllDifferentiableVariables
/// Accumulate Updates(here StepSizes)
Contributor

Suggested change
/// Accumulate Updates(here StepSizes)
/// The accumulated parameter updates.

public init(
for model: __shared Model,
learningRate: Float = 1,
rho: Float = 0.9,
Contributor

Keras uses rho = 0.95 by default.
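
If the Keras default were adopted here, the parameter would presumably become:

rho: Float = 0.95,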


// Update Float & Double Tensor variables.
for kp in model.recursivelyAllWritableKeyPaths(to: Tensor<Float>.self) {
var accNew = rho * accumulators[keyPath: kp]
Contributor

For clarity, I'd suggest rewriting this step as

averageSquared[keyPath: kp] *= rho
averageSquared[keyPath: kp] += (1 - rho) * (direction[keyPath: kp] * direction[keyPath: kp])

This way, you won't need to define extra variables. Same suggestion for the Tensor<Double> handling block.
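
For reference, the analogous Tensor<Double> rewrite would read as follows (the same lines appear in the later revision of the diff, with `averageSquared` as the property name):

averageSquared[keyPath: kp] *= Double(rho)
averageSquared[keyPath: kp] +=
    (1 - Double(rho)) * (direction[keyPath: kp] * direction[keyPath: kp])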

stepSize /= sqrt(accumulators[keyPath: kp] + epsilon)
model[keyPath: kp] -= learningRate * stepSize

var updatesNew = updates[keyPath: kp] * rho
Contributor

I would suggest changing this step to the following:

accumulatedDelta *= rho
accumulatedDelta += (1 - rho) * stepSize.squared()

Same suggestion for the Tensor<Double> handling block.
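
Applied inside the key-path loop of the Tensor<Double> block, this suggestion would presumably read (a sketch, with the key-path subscripts spelled out):

accumulatedDelta[keyPath: kp] *= Double(rho)
accumulatedDelta[keyPath: kp] += (1 - Double(rho)) * stepSize.squared()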

public func update(_ model: inout Model.AllDifferentiableVariables,
along direction: Model.AllDifferentiableVariables) {
step += 1
let learningRate = self.learningRate * 1 / (1 + decay * Float(step))
Contributor

* 1 is redundant.

Suggested change
let learningRate = self.learningRate * 1 / (1 + decay * Float(step))
let learningRate = self.learningRate / (1 + decay * Float(step))
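
(For a concrete illustration with hypothetical values: learningRate = 1 and decay = 0.01 give an effective rate of 1 / (1 + 0.01 * 100) = 0.5 at step 100.)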

for kp in model.recursivelyAllWritableKeyPaths(to: Tensor<Float>.self) {
averageSquared[keyPath: kp] *= rho
averageSquared[keyPath: kp] +=
(1 - rho) * (direction[keyPath: kp] * direction[keyPath: kp])
Contributor

Indent by 4.

Suggested change
(1 - rho) * (direction[keyPath: kp] * direction[keyPath: kp])
    (1 - rho) * (direction[keyPath: kp] * direction[keyPath: kp])

averageSquared[keyPath: kp] +=
(1 - rho) * (direction[keyPath: kp] * direction[keyPath: kp])
var stepSize = direction[keyPath: kp] *
sqrt(accumulatedDelta[keyPath: kp] + epsilon)
Contributor

Suggested change
sqrt(accumulatedDelta[keyPath: kp] + epsilon)
    sqrt(accumulatedDelta[keyPath: kp] + epsilon)

for kp in model.recursivelyAllWritableKeyPaths(to: Tensor<Double>.self) {
averageSquared[keyPath: kp] *= Double(rho)
averageSquared[keyPath: kp] +=
(1 - Double(rho)) * (direction[keyPath: kp] * direction[keyPath: kp])
Contributor

Suggested change
(1 - Double(rho)) * (direction[keyPath: kp] * direction[keyPath: kp])
    (1 - Double(rho)) * (direction[keyPath: kp] * direction[keyPath: kp])

averageSquared[keyPath: kp] +=
(1 - Double(rho)) * (direction[keyPath: kp] * direction[keyPath: kp])
var stepSize = direction[keyPath: kp] *
sqrt(accumulatedDelta[keyPath: kp] + Double(epsilon))
Contributor

Suggested change
sqrt(accumulatedDelta[keyPath: kp] + Double(epsilon))
    sqrt(accumulatedDelta[keyPath: kp] + Double(epsilon))
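
Putting the pieces together, the Tensor<Float> half of the update might read as follows once the suggestions above are applied (a sketch assembled from the hunks quoted in this review, not necessarily the exact merged code; the Tensor<Double> half is analogous with Double casts):

public func update(_ model: inout Model.AllDifferentiableVariables,
                   along direction: Model.AllDifferentiableVariables) {
    step += 1
    let learningRate = self.learningRate / (1 + decay * Float(step))
    for kp in model.recursivelyAllWritableKeyPaths(to: Tensor<Float>.self) {
        // Accumulate the exponentially decaying average of squared gradients.
        averageSquared[keyPath: kp] *= rho
        averageSquared[keyPath: kp] +=
            (1 - rho) * (direction[keyPath: kp] * direction[keyPath: kp])
        // Compute the update step from the two running averages.
        var stepSize = direction[keyPath: kp] *
            sqrt(accumulatedDelta[keyPath: kp] + epsilon)
        stepSize /= sqrt(averageSquared[keyPath: kp] + epsilon)
        model[keyPath: kp] -= learningRate * stepSize
        // Accumulate the exponentially decaying average of squared updates.
        accumulatedDelta[keyPath: kp] *= rho
        accumulatedDelta[keyPath: kp] += (1 - rho) * stepSize.squared()
    }
    // (Tensor<Double> handling omitted in this sketch.)
}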

@@ -355,3 +355,98 @@ public class AdaGrad<Model: Layer>: Optimizer
update(&model.allDifferentiableVariables, along: direction)
}
}

/// Adadelta Optimizer
Contributor

  • Use the canonical name (the name in the paper) to describe this class.
  • End all comments with a period.
  • Make the "O" in "Optimizer" be lowercased, for consistency with API documentation on other optimizers in this file.
Suggested change
/// Adadelta Optimizer
/// ADADELTA optimizer.


/// Adadelta Optimizer
///
/// It is a method that uses the magnitude of recent gradients
Contributor

API documentation should fit to size (i.e. to 100 columns).

I'd suggest changing the summary to the following, similar to the Keras documentation.

ADADELTA is a more robust extension of AdaGrad that adapts learning rates based on a moving window of gradient updates, instead of accumulating all past gradients. This way, Adadelta continues learning even when many updates have been done. Compared to AdaGrad, in the original version of ADADELTA you don't have to set an initial learning rate. In this version, initial learning rate and decay factor can be set, as in most other optimizers.

step += 1
let learningRate = self.learningRate * 1 / (1 + decay * Float(step))

// Update Float & Double Tensor variables.
Contributor

Suggested change
// Update Float & Double Tensor variables.
// Update `Tensor<Float>` and `Tensor<Double>` variables.


/// ADADELTA optimizer.
///
/// ADADELTA is a more robust extension of AdaGrad that adapts learning rates
Contributor

This block of API documentation has lines whose length varies significantly from line to line. Could you make sure each line is filled with words until it reaches 100 columns? This is what I meant by suggesting "fit to size". Hope this clarifies things!
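
For illustration, the summary wrapped to roughly 100 columns might look like this (a sketch based on the suggested wording above; the merged text may differ):

/// ADADELTA optimizer.
///
/// ADADELTA is a more robust extension of AdaGrad that adapts learning rates based on a moving
/// window of gradient updates, instead of accumulating all past gradients. This way, ADADELTA
/// continues learning even when many updates have been done. Compared to AdaGrad, in the
/// original version of ADADELTA you don't have to set an initial learning rate. In this
/// version, initial learning rate and decay factor can be set, as in most other optimizers.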

lakshya-sky and others added 3 commits June 28, 2019 13:07
Use "AdaGrad" and "ADADELTA" in documentation to stay faithful to paper spelling.
@rxwei rxwei (Contributor) commented Jun 28, 2019

Thanks @dan-zheng for reordering the class definitions!

@rxwei rxwei merged commit a41903b into tensorflow:master Jun 28, 2019
@Shashi456 (Contributor)

#127
