[2020 ICLR] Fantastic Generalization Measures and Where to Find Them #119

Jasonlee1995 · 2023-02-09T04:40:42Z

A paper that runs a large-scale study of how various complexity metrics relate to the generalization gap
(generalization gap : difference between train and test accuracy)

The results are quite interesting; in summary:

Many norm-based measures not only perform poorly, but negatively correlate with generalization (specifically when the optimization procedure injects some stochasticity)
Sharpness-based measures (PAC-Bayesian bounds, sharpness measure) perform the best overall and seem to be promising candidates for further research
Measures related to the optimization procedures (gradient noise, speed of the optimization) can be predictive of generalization

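In short, sharpness-based and optimization-based complexity measures are reported to predict the generalization gap well

As also pointed out on OpenReview, the paper has some weaknesses, but the range of metrics it introduces and its conclusions are interesting, so here is a brief summary

If you only want the results, reading Section 4 is enough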

1. Introduction

1.1. Related Work

Theoretically motivated measures
- VC-dimension
- norm-based bounds
- PAC-Bayes
Empirically motivated measures
- sharpness measure
- Fisher-Rao measure
- distance of trained weights from initialization
- path norm
Optimization based measures
- speed of the optimization algorithm
- magnitude of gradient noise

1.2. Notation

From this section alone it is easy to be confused about what the generalization gap is: it is (1-0 classification loss on the test dataset) - (1-0 classification loss on the train dataset)

1-0 classification loss : returns 1 if the logit of the correct class is less than or equal to the logit of some other class, and 0 if it is larger than all of them

Sample dependent margin : correct class logit - max of the other classes' logits

Overall margin $\gamma$ : 10th percentile of the sample dependent margins over the train dataset
(a robust surrogate for the minimum, i.e. the value at the bottom 10%)
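
The notation above can be sketched in a few lines of Python. This is only an illustration: the function names and the list-of-logits data layout are assumptions made here, and the 10th percentile uses linear interpolation, which the notes do not specify.

```python
import statistics


def sample_margin(logits, label):
    """Correct-class logit minus the largest logit among the other classes."""
    others = [z for k, z in enumerate(logits) if k != label]
    return logits[label] - max(others)


def zero_one_loss(logits, label):
    """1 if the correct-class logit does not strictly exceed every other logit, else 0."""
    return 1 if sample_margin(logits, label) <= 0 else 0


def overall_margin(all_logits, labels):
    """10th percentile of the sample dependent margins (linear interpolation, an assumption)."""
    margins = [sample_margin(z, y) for z, y in zip(all_logits, labels)]
    return statistics.quantiles(margins, n=10, method="inclusive")[0]


def generalization_gap(train, test):
    """Test 0-1 loss minus train 0-1 loss; each argument is (all_logits, labels)."""
    def error(split):
        all_logits, labels = split
        return sum(zero_one_loss(z, y) for z, y in zip(all_logits, labels)) / len(labels)
    return error(test) - error(train)
```

Note that a tie between the correct class and another class counts as an error, matching the "less than or equal" wording above.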

2. Generalization: What is the goal and how to evaluate?

2.1. Approaches to Compare Complexity Measures

How the model, the optimization algorithm, and the data properties lead to generalization beyond the training set is still not clearly understood

Many hypotheses have been proposed about this, and a complexity measure is the core component of each hypothesis
(complexity measure that monotonically relates to some aspect of generalization)

There are many possible ways to compare the complexity measures proposed by these hypotheses; the authors describe 3 approaches

Tightness of Generalization Bounds

proving generalization bound
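Pro : useful for establishing a causal relationship between the complexity measure and the generalization error
Con : most existing bounds are vacuous on current deep learning tasks (combinations of model and dataset), so they are hard to use as evidence of a causal relationship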

Regularizing the Complexity Measure

add the complexity measure as a regularizer and evaluate it by directly optimizing it
Con 1 : the complexity measure changes the loss landscape in non-trivial ways, making optimization difficult
Con 2 : the optimization algorithm has its own implicit regularization that cannot be turned off, so a controlled experiment is hard to run

Correlation with Generalization

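measure the correlation between the complexity measure and generalization
Con : factors such as architecture, optimization algorithm, and hyperparameters can lead to misleading interpretations, so very careful experimental design is required

This paper addresses the drawback of the third approach by training over many combinations of architecture, optimization algorithm, and hyperparameters, and uses them to examine the relationship (ideally causal) between complexity measures and the generalization gap

In other words, a paper that shows the relationship between complexity measures and the generalization gap through a very large number of experiments

The number of possible experiments is unbounded if the space is drawn widely, so prior knowledge about what is reasonable is used to fix the hyperparameter space

7 hyperparameters are used : batch size, dropout probability, learning rate, network depth, weight decay coefficient, network width, optimizer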

2.2. Evaluation Criteria

The simplest way to measure the quality of a complexity measure is ranking

That is, check how consistent the complexity measure is with the empirically observed generalization (generalization gap) across the differently trained models in the experimental setup

2.2.1. Kendall’s Rank-Correlation Coefficient

$\theta$ : one hyperparameter setting (a point in the hyperparameter space)
$\mu (\theta)$ : complexity measure of hyperparameter set $\theta$
$g (\theta)$ : generalization gap of hyperparameter set $\theta$

A measure that checks, for each pair of hyperparameter sets, whether the sign of the difference in complexity measure matches the sign of the difference in generalization gap
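
A minimal sketch of this sign-agreement statistic, assuming the usual form: the average over all ordered pairs of distinct models of sign(difference in $\mu$) times sign(difference in $g$). The function names and the toy numbers are made up for illustration.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def kendall_tau(mu, g):
    """Average sign agreement over all ordered pairs of distinct models."""
    n = len(mu)
    total = 0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += sign(mu[i] - mu[j]) * sign(g[i] - g[j])
    return total / (n * (n - 1))


# Toy values (hypothetical): four trained models.
mu = [1.0, 2.0, 3.0, 4.0]        # complexity measure per model
g = [0.10, 0.20, 0.15, 0.40]     # generalization gap per model
tau = kendall_tau(mu, g)         # 5 concordant pairs, 1 discordant pair out of 6
```

A value of 1 means the measure ranks every pair of models the same way as the generalization gap, and -1 means it ranks every pair the opposite way.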

2.2.2. Granulated Kendall’s Coefficient

Kendall’s Rank-Correlation Coefficient is a widely used, effective tool for examining the relationship between 2 rankings, but it can reach a high value in trivial ways
(the measure may strongly correlate with the generalization performance without necessarily capturing the cause of generalization)

To mitigate this at least somewhat, the Granulated Kendall’s Coefficient is used, which restricts the 2 hyperparameter sets being compared to differ in only 1 hyperparameter

The paper gives a thought experiment illustrating why this measure is more suitable:

A complexity measure that perfectly captures network depth but gives random predictions among networks of the same depth

In this case Kendall’s Rank-Correlation Coefficient will be large while the Granulated Kendall’s Coefficient will be small
(except for the cases comparing depth, the 2 hyperparameter sets being compared have the same depth)
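
The thought experiment can be reproduced on a toy grid. The sketch below makes several assumptions: the granulated coefficient is taken as the mean, over hyperparameter axes, of the Kendall coefficient computed while only that axis varies; the toy gap function is invented; and the "random prediction within the same depth" is simplified to a tie (the measure simply returns the depth) so that the output is deterministic.

```python
from itertools import product


def sign(x):
    return (x > 0) - (x < 0)


def kendall_tau(pairs):
    """pairs: list of (mu, g); average sign agreement over ordered pairs."""
    n = len(pairs)
    total = 0
    for i, (mu1, g1) in enumerate(pairs):
        for j, (mu2, g2) in enumerate(pairs):
            if i != j:
                total += sign(mu1 - mu2) * sign(g1 - g2)
    return total / (n * (n - 1))


def granulated_kendall(grid, mu, g):
    """Per-axis coefficients (one axis varies, the rest fixed) and their mean."""
    psis = []
    for axis in range(len(grid)):
        others = [vals for k, vals in enumerate(grid) if k != axis]
        taus = []
        for rest in product(*others):
            pairs = []
            for v in grid[axis]:
                theta = rest[:axis] + (v,) + rest[axis:]
                pairs.append((mu(theta), g(theta)))
            taus.append(kendall_tau(pairs))
        psis.append(sum(taus) / len(taus))
    return psis, sum(psis) / len(psis)


# Toy grid (hypothetical): depth, width, dropout level, 3 choices each.
grid = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]


def gap(theta):
    # Invented gap, dominated by depth.
    return 100 * theta[0] + 10 * theta[1] + theta[2]


def depth_only_measure(theta):
    # Captures depth perfectly, carries no information otherwise.
    return theta[0]


tau = kendall_tau([(depth_only_measure(t), gap(t)) for t in product(*grid)])
psis, psi = granulated_kendall(grid, depth_only_measure, gap)
```

Here `tau` is about 0.69 while `psi` is 1/3 (`psis` is `[1.0, 0.0, 0.0]`), so the granulated version exposes that only the depth axis is captured.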

2.2.3. Conditional Independence Test: Towards Capturing the Causal Relationships

The authors adopt the approach of the Inductive Causation (IC) Algorithm to decide whether an edge exists between the complexity measure and the generalization gap
(whether the hyperparameters affect the complexity measure, and whether the complexity measure affects the generalization gap)

To check whether an edge exists, a conditional independence test is run using the conditional mutual information between the complexity measure and the generalization gap given that a hyperparameter set S is observed

The conditional mutual information is computed with the formula given in the paper

Since the conditional mutual information between the complexity measure and generalization is at most the conditional entropy of generalization, it is normalized by the conditional entropy so that it lies between 0 and 1
(i.e. the conditional mutual information is divided by the conditional entropy so that the value falls in 0 ~ 1)

A normalized conditional mutual information of 0 means independence, and if the two are independent for even a single subset, the edge is removed
(Inductive Causation (IC) Algorithm logic : create all edges between nodes first, then remove them one by one)

Computing this for every hyperparameter subset is hard, so only subsets with at most 2 hyperparameters are computed

In the end complexity measures are evaluated with $K(\mu)$; the larger $K(\mu)$, the more likely it is that an edge exists between the complexity measure and the generalization gap
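
A rough sketch of how such a score could be computed, under assumptions that go beyond these notes: the random variables are taken to be the signs of the pairwise differences in $\mu$ and $g$, the conditioning variable is the pair of values the two models take on the hyperparameters in S, ties are mapped to -1, and $K(\mu)$ is taken as the minimum normalized value over subsets of size at most 2. The names `normalized_cmi` and `k_score` are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations, permutations, product


def sign(x):
    # Assumption: ties are mapped to -1 so every variable is binary.
    return 1 if x > 0 else -1


def normalized_cmi(models, subset):
    """models: list of (theta, mu, g); subset: indices of observed hyperparameters."""
    groups = defaultdict(Counter)
    for (t1, m1, g1), (t2, m2, g2) in permutations(models, 2):
        u = (tuple(t1[k] for k in subset), tuple(t2[k] for k in subset))
        groups[u][(sign(m1 - m2), sign(g1 - g2))] += 1
    total = sum(sum(c.values()) for c in groups.values())
    cmi, entropy = 0.0, 0.0
    for counts in groups.values():
        n_u = sum(counts.values())
        p_mu, p_g = Counter(), Counter()
        for (v_mu, v_g), c in counts.items():
            p_mu[v_mu] += c / n_u
            p_g[v_g] += c / n_u
        for (v_mu, v_g), c in counts.items():
            p = c / n_u
            cmi += (n_u / total) * p * math.log(p / (p_mu[v_mu] * p_g[v_g]))
        for p in p_g.values():
            entropy -= (n_u / total) * p * math.log(p)
    # Divide by the conditional entropy of the gap sign to land in [0, 1].
    return cmi / entropy if entropy > 0 else 0.0


def k_score(models, n_hparams, max_size=2):
    """Minimum normalized CMI over all subsets with at most max_size hyperparameters."""
    subsets = [s for r in range(max_size + 1) for s in combinations(range(n_hparams), r)]
    return min(normalized_cmi(models, s) for s in subsets)


# Toy grid (hypothetical): depth, width, dropout flag, with an invented gap.
grid = list(product([1, 2], [1, 2], [0, 1]))
gap = {t: 4 * t[0] + 2 * t[1] + t[2] for t in grid}
perfect = [(t, gap[t], gap[t]) for t in grid]      # measure equals the gap
depth_only = [(t, t[0], gap[t]) for t in grid]     # measure only sees depth
```

On this toy grid `k_score(perfect, 3)` is 1 and `k_score(depth_only, 3)` is 0 up to floating-point error: once depth is observed, the depth-only measure carries no further information about the gap, so the edge would be removed.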

3. Generating a Family of Trained Models

Experiments on the CIFAR-10 dataset
7 hyperparameters : batch size, dropout probability, learning rate, network depth, weight decay coefficient, network width, optimizer
3 choices per hyperparameter ($3^7 = 2187$ combinations)

This setup alone already involves a large number of experiments, but following a criticism (?) on OpenReview, the authors also checked randomness and another dataset and report that the same general behavior holds
(results of repeating the experiments 5 times in the setup above, and results on the SVHN dataset)

That is, the results are not limited to the CIFAR-10 dataset and are robust to randomness, so the authors argue they apply to image classification tasks

Of course, as noted on OpenReview and in the limitations the authors mention, the fact that a commonly used model was not used and whether the same holds on a large-scale dataset (ImageNet) still seem to be open problems
(the models are variations of Network in Network rather than something like ResNet)

When to stop training is also an issue; the authors stop training once the cross-entropy loss reaches 0.01

Computing the training loss over the whole train dataset at every iteration is very costly, so the training loss is computed on 100 randomly sampled train examples

Some hyperparameter sets never reach a cross-entropy loss of 0.01; these are discarded
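
The stopping rule described above can be sketched as follows. This is an illustration only: the helper names are hypothetical, the model is abstracted away as a list of predicted probabilities for the correct class, and returning `None` stands in for "discard this run".

```python
import math
import random


def sampled_cross_entropy(correct_class_probs, n_samples=100, rng=None):
    """Estimate the training cross-entropy from a random subset of examples."""
    rng = rng or random.Random()
    k = min(n_samples, len(correct_class_probs))
    sample = rng.sample(correct_class_probs, k)
    return -sum(math.log(p) for p in sample) / k


def first_step_below(loss_estimates, threshold=0.01):
    """First step whose loss estimate reaches the threshold, or None (run discarded)."""
    for step, loss in enumerate(loss_estimates):
        if loss <= threshold:
            return step
    return None
```

For example, `first_step_below([0.5, 0.1, 0.02, 0.009])` returns 3, while a run whose estimates never reach 0.01 returns `None` and would be dropped from the study.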

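Using Figure 2, the authors argue that although most models reach a training accuracy above 0.99, the generalization gap spans a wide range, which makes the setup ideal for evaluating complexity measures

For reference, the authors explain the choice of cross-entropy loss as the stopping criterion as follows

If the stopping criterion were the number of iterations, it would not be suitable because some hyperparameter sets optimize faster than others

Training loss or training error could be candidates

Models with the same training loss (cross-entropy) mostly have similar training error, so the choice between the two may not seem important

However, during optimization the training error behaves more noisily than the training loss, and once the training error reaches 0 models can no longer be distinguished, so training loss is chosen as the criterion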

4. Performance of Complexity Measures

The tables include an oracle model: think of it as measuring the rank correlation and mutual information between the generalization gap with a little noise added and the generalization gap itself
(just as the correlation between a complexity measure and the generalization gap is measured, the correlation between generalization gap + additive noise and the generalization gap is measured)

The $\epsilon$ next to oracle in the tables is the noise level
( $g(\theta) + N(0, \epsilon^2)$ )

For how it is actually computed and the exact formula, see the appendix of the paper
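
A small simulation of the oracle idea, assuming the simple sign-agreement form of the rank correlation. The gap values, the seed, and the function names are made up; the paper's exact procedure is in its appendix and is not reproduced here.

```python
import random


def sign(x):
    return (x > 0) - (x < 0)


def kendall_tau(a, b):
    """Average sign agreement over all ordered pairs of distinct items."""
    n = len(a)
    total = 0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += sign(a[i] - a[j]) * sign(b[i] - b[j])
    return total / (n * (n - 1))


def oracle_tau(gaps, eps, seed=0):
    """Rank correlation between g + N(0, eps^2) and g itself."""
    rng = random.Random(seed)
    noisy = [g + rng.gauss(0.0, eps) for g in gaps]
    return kendall_tau(noisy, gaps)


# Hypothetical generalization gaps for 20 models.
gaps = [0.01 * i for i in range(1, 21)]
clean = oracle_tau(gaps, eps=0.0)    # no noise: perfect agreement
noisy = oracle_tau(gaps, eps=10.0)   # noise far larger than the gaps
```

With `eps=0` the oracle reproduces the ranking exactly, and the score degrades as `eps` grows, which gives a reference scale for reading the scores of real measures.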

4.1. Baseline Complexity Measures

Canonical ordering : the result computed using a canonical ordering in place of a complexity measure

The authors analyze Table 1 as follows

VC-dimension and number of parameters are negatively correlated with generalization gap
- which confirms the widely known empirical observation that overparametrization improves generalization in deep learning
cross-entropy loss, margin, entropy of the output are closely related to each other
- these results confirm the general understanding that larger margin, lower cross-entropy and higher entropy would lead to better generalization

4.2. Surprising Failure of Some (Norm & Margin)-Based Measures

The authors analyze Table 2 as follows

Spectral bound
- spectral complexity is strongly negatively correlated with generalization
- spectral complexity has the highest conditional mutual information compared to all the other measures (because conditional mutual information is agnostic to the direction of correlation)
- majority of spectral complexity’s predictive power is due to its ability to capture the depth of the network, as the mutual information is significantly lower if depth is already observed
- Frobenius distance to initialization is negatively correlated, which contradicts some theories suggesting solutions closer to initialization should generalize better
- Frobenius norm of the parameters is slightly positively correlated with generalization
Path norm
- while path-norm is a proper norm in the function space but not in parameter space, we observe that it is positively correlated with generalization
Fisher-Rao metric
- lower bound on the path norm
- worse correlation than the path norm
- low Kendall’s Rank-Correlation Coefficient, high Granulated Kendall’s Coefficient
- captures a single hyperparameter change but is not able to capture the interactions between different hyperparameter types
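
For orientation, the raw quantities behind some of these measures can be computed on a toy network as below. Assumptions: fully connected layers without biases, each stored as a list of rows (outputs x inputs); the path norm is computed by pushing an all-ones input through the network with squared weights; taking the square root is a convention chosen here, and any extra scaling used in the paper's tables (for example by the margin) is not reproduced.

```python
import math


def frobenius_norm(layers):
    """Frobenius norm of all parameters taken together."""
    return math.sqrt(sum(w * w for W in layers for row in W for w in row))


def frobenius_distance(layers, init_layers):
    """Frobenius distance between trained weights and their initialization."""
    return math.sqrt(sum(
        (w - w0) ** 2
        for W, W0 in zip(layers, init_layers)
        for row, row0 in zip(W, W0)
        for w, w0 in zip(row, row0)
    ))


def path_norm(layers):
    """sqrt of the sum over input-output paths of the product of squared weights."""
    v = [1.0] * len(layers[0][0])
    for W in layers:
        v = [sum(w * w * x for w, x in zip(row, v)) for row in W]
    return math.sqrt(sum(v))


# Toy 2-layer network (hypothetical weights).
W1 = [[1.0, 2.0], [3.0, 4.0]]
W2 = [[2.0, 1.0]]
zeros = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0]]]
```

For this toy network the squared path norm is 4 * (1 + 4) + 1 * (9 + 16) = 45 and the squared Frobenius norm is 35.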

4.3. Success of Sharpness-Based Measures

PAC-Bayesian framework
- capture sharpness in the expected sense since adding randomly generated perturbations to the parameters
Sharpness
- concept from the paper On large-batch training for deep learning: Generalization gap and sharp minima
- the worst-case sharpness, where we search for the direction that changes the loss the most
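
The difference between the expected and the worst-case view can be illustrated on a toy quadratic loss. This is a simplification and not the paper's procedure: the noise scale `sigma` is fixed by hand, the loss increase is measured directly, and the maximum over random samples is only a crude stand-in for an actual worst-case search. All names are hypothetical.

```python
import random


def loss_increase_under_noise(loss, w, sigma, n_samples=200, seed=0):
    """Mean and max loss increase under Gaussian parameter perturbations."""
    rng = random.Random(seed)
    base = loss(w)
    increases = []
    for _ in range(n_samples):
        perturbed = [wi + rng.gauss(0.0, sigma) for wi in w]
        increases.append(loss(perturbed) - base)
    return sum(increases) / n_samples, max(increases)


def quadratic(curvature):
    """Toy loss with a minimum at the origin; larger curvature = sharper minimum."""
    return lambda w: 0.5 * curvature * sum(wi * wi for wi in w)


flat_mean, flat_max = loss_increase_under_noise(quadratic(1.0), [0.0, 0.0], sigma=0.1)
sharp_mean, sharp_max = loss_increase_under_noise(quadratic(10.0), [0.0, 0.0], sigma=0.1)
```

With the same perturbations, the sharper minimum shows a 10 times larger average loss increase, and in both cases the maximum over samples is at least the mean.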

4.4. Potential of Optimization-based Measures

step to 0.1 : number of iterations required to reach cross-entropy equals 0.1
step 0.1 to 0.01 : number of iterations required going from cross-entropy equals 0.1 to cross-entropy equals 0.01
grad noise 1 epoch : variance of the gradients after only seeing the entire dataset once (1 epoch)
grad noise final : variance of the gradients when the cross-entropy is approximately 0.01
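
The four quantities above could be computed from a training log roughly as follows. The log format and function names are assumptions for illustration, and the gradient variance is taken here as the mean squared distance of per-minibatch gradients from their mean, which may differ from the paper's exact normalization.

```python
def steps_to_reach(losses, threshold):
    """Index of the first iteration whose cross-entropy is at or below threshold."""
    for step, loss in enumerate(losses):
        if loss <= threshold:
            return step
    return None


def gradient_variance(grads):
    """Mean squared distance of per-minibatch gradients from their mean."""
    n, dim = len(grads), len(grads[0])
    mean = [sum(g[k] for g in grads) / n for k in range(dim)]
    return sum(sum((g[k] - mean[k]) ** 2 for k in range(dim)) for g in grads) / n


# Hypothetical cross-entropy curve, one value per iteration.
losses = [2.3, 0.9, 0.3, 0.09, 0.05, 0.02, 0.008]
step_to_01 = steps_to_reach(losses, 0.1)
step_01_to_001 = steps_to_reach(losses, 0.01) - step_to_01

# Hypothetical per-minibatch gradients at some checkpoint.
grad_noise = gradient_variance([[1.0, 0.0], [-1.0, 0.0]])
```

Evaluating `gradient_variance` on gradients collected after the first epoch or at the final checkpoint would give the two "grad noise" variants.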

The authors analyze Table 4 as follows

Number of Iterations
- speed of optimization is not an explicit capacity measure so either positive or negative correlation could potentially be informative
Variance of Gradients
- towards the end of the training, variance of the gradients also captures a particular type of “flatness” of the local minima
- surprisingly predictive of the generalization, and more importantly, is positively correlated across every type of hyperparameter
- connection between variance of the gradient and generalization is perhaps natural since much of the recent advancement in deep learning such as residual networks or batch normalization have enabled using larger learning rates to train neural networks (stability with higher learning rates implies smaller noises in the minibatch gradient)

