[2021 NeurIPS] Deep Learning on a Data Diet: Finding Important Examples Early in Training #109

Jasonlee1995 · 2022-12-05T09:56:23Z

In "An Empirical Study of Example Forgetting during Deep Neural Network Learning", example difficulty is measured with forgetting events, which can only be counted over a long training run of many epochs

This paper proposes two metrics that measure example importance early in training: the GraNd score and the EL2N score

The contributions stated by the authors are as follows

The importance of a training example is measured with the GraNd score, the expected loss gradient norm
Removing the training samples with low GraNd scores and training on only 50% of the original training data causes almost no drop in accuracy
After training has progressed for a while, the GraNd score is well approximated by the EL2N score, the norm of the error vector, and the EL2N score provides better data-pruning information
Removing a small fraction of the examples with the highest EL2N scores actually improves performance, and this effect is tied to the corrupted label regime
For examples with low EL2N scores, the linearly connected mode is determined earlier than for examples with high EL2N scores
The relationship between an example's EL2N score and the network's training dynamics is analyzed with data-dependent NTK submatrices

Only the parts of the paper that seem important are summarized below

1. Which Samples are Important for Learning?

1.1. Preliminaries

1.2. Gradient Norm Score and an infinitesimal analysis

The GraNd score measures importance through the norm of the loss gradient of an example with respect to the model at time t

The EL2N score measures importance through the norm of the difference between the model's predicted probabilities for an example at time t and its one-hot label
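Written out, the two definitions take roughly the following form. This is a sketch in notation assumed for these notes ($w_t$ for the weights at time t, $f_{w_t}$ for the logits, $\ell$ for the loss, $p(w_t, x)$ for the softmax output, $y$ for the one-hot label); the paper's own symbols may differ slightly.

$$
\mathrm{GraNd}_t(x, y) = \mathbb{E}_{w_t} \left\| \nabla_{w_t} \ell\big(f_{w_t}(x), y\big) \right\|_2
$$

$$
\mathrm{EL2N}_t(x, y) = \mathbb{E}_{w_t} \left\| p(w_t, x) - y \right\|_2
$$

The expectation is over the randomness in $w_t$ (initialization and minibatch order), which is why the scores are averaged over several training runs in the experiments.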

The intuition that led the authors to the GraNd score and the EL2N score is as follows

1.2.1. GraNd score

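What the authors want to know is the impact of an individual example on the training dynamics

A simple way to measure importance is to compare what happens when the example is included in training with what happens when it is excluded

Concretely, the authors look at how the time derivative of the loss on an example changes when the example $(x_j, y_j)$ is removed from the minibatch $S_t$ at time t, and use the size of that change as the importance
(The criterion is how the loss on a single example changes over time; it is chosen because the quantity of interest is the training dynamics)

Rearranging the expression gives an upper bound on the magnitude of this difference, and the bound is proportional to the loss gradient norm of the example $(x_j, y_j)$, whose expectation is the GraNd score

This is the motivation for the GraNd score: examples with a small GraNd score have only a small effect on training at time t
(bounded influence on learning how to classify the rest of the training data at a given training time)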
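The bound can be sketched as follows. The notation is an assumption made for these notes: $\Delta_t$ is the negative time derivative of the loss on an arbitrary example $(x^*, y^*)$ when training on a minibatch, $g_t(x, y)$ is the loss gradient of $(x, y)$ with respect to $w_t$, and $c$ is a constant that does not depend on $(x_j, y_j)$.

$$
\Delta_t\big((x^*, y^*), S_t\big) = -\frac{d\,\ell\big(f_t(x^*), y^*\big)}{dt}
$$

$$
\left\| \Delta_t\big((x^*, y^*), S_t\big) - \Delta_t\big((x^*, y^*), S_t \setminus \{(x_j, y_j)\}\big) \right\| \le c \, \left\| g_t(x_j, y_j) \right\|
$$

Taking the expectation of the right-hand side over $w_t$ gives a quantity proportional to the GraNd score of $(x_j, y_j)$.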

1.2.2. EL2N score

Under a few additional assumptions, the GraNd score can be approximated by the EL2N score

The advantage of the EL2N score is that it is much simpler to compute than the GraNd score
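To make the link between the two scores concrete, here is a minimal sketch for a toy linear softmax classifier $f(x) = Wx$ trained with cross-entropy. The toy model, the helper names (`softmax`, `el2n`, `grand_linear`) and the numbers are all assumptions for illustration, not the paper's setup. In this special case the gradient with respect to the logits is exactly the error vector $p - y$, so the loss gradient norm equals the EL2N value times $\|x\|$.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def el2n(probs, label):
    """L2 norm of the error vector p - y for one example and one model."""
    return math.sqrt(sum((p - (1.0 if k == label else 0.0)) ** 2
                         for k, p in enumerate(probs)))


def grand_linear(W, x, label):
    """Loss gradient norm w.r.t. W for the toy model f(x) = W x (hypothetical)."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    probs = softmax(logits)
    err = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    # For cross-entropy, dL/dW[k][i] = err[k] * x[i].
    grad_sq = sum((e * xi) ** 2 for e in err for xi in x)
    return math.sqrt(grad_sq), probs


W = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]]  # 3 classes, 2 features
x = [1.0, 2.0]
grad_norm, probs = grand_linear(W, x, label=1)
x_norm = math.sqrt(sum(xi * xi for xi in x))
ratio = grad_norm / el2n(probs, 1)  # equals ||x|| in this toy model
```

For a deep network the logit Jacobian replaces $x$, and the approximation relies on the extra assumptions mentioned above. In both cases a single model gives one sample; the actual scores average over several models.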

1.3. Correlations between scores

The scores have high Spearman rank correlation with one another, and the EL2N score and the forgetting score, which give the most similar pruning performance, have the highest Spearman rank correlation
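For reference, Spearman rank correlation is the Pearson correlation of the ranks. The sketch below is a plain-Python version with average ranks for ties; the function names and the example score lists are made up for illustration.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(a, b):
    """Pearson correlation computed on the ranks of a and b."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical per-example scores: same ordering, different scales.
el2n_scores = [0.1, 0.8, 0.4, 1.2]
forgetting_counts = [0, 5, 1, 9]
rho = spearman(el2n_scores, forgetting_counts)
```

Because only the ordering matters, two scores on very different scales can still be compared directly.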

2. Empirical Evaluation of GraNd and EL2N Scores via Data Pruning

Each example's score is computed as the average over 10 independent training runs

After the scores are computed, the dataset is pruned, and the mean and variance over 4 independent runs on the reduced dataset are plotted
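The scoring and pruning steps can be sketched as below. This assumes the per-run scores are already available as plain lists; `average_scores` and `keep_highest` are hypothetical helper names, and the rounding used for the number of pruned examples is an arbitrary choice.

```python
def average_scores(scores_per_run):
    """Average each example's score over independent training runs."""
    n_runs = len(scores_per_run)
    return [sum(column) / n_runs for column in zip(*scores_per_run)]


def keep_highest(scores, prune_fraction):
    """Indices kept after pruning the lowest-scoring fraction of examples."""
    n_prune = int(round(prune_fraction * len(scores)))
    n_keep = len(scores) - n_prune
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:n_keep])


# Two hypothetical runs over four examples (the summary above uses 10 runs).
runs = [
    [0.1, 0.9, 0.5, 0.3],
    [0.3, 0.7, 0.5, 0.1],
]
mean_scores = average_scores(runs)
kept = keep_highest(mean_scores, prune_fraction=0.5)
```

The kept indices would then define the subset used for the 4 independent training runs.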

The experiments address two questions

How early in training are forgetting, GraNd and EL2N score effective at identifying examples important for generalization
How GraNd scores at initialization, EL2N scores early in training and forgetting scores at the end of training negotiate the trade-off between generalization performance and training set size

In short, the questions are how early the important examples can be identified, and how training set size relates to generalization performance

2.1. Pruning at initialization

When the training subset is selected at initialization with the GraNd score, test accuracy is better than with random selection

This suggests that the geometry of the training distribution induced by a random network carries information about the structure of the classification problem

The EL2N score, which only contains information about errors, is not consistently effective when the training subset is selected at initialization

2.2. Pruning early in training

After the model has been trained for only a few epochs, the EL2N score identifies the examples that are important for generalization extremely effectively

Over a wide range of intermediate pruning levels, training on the examples with the highest EL2N scores performs as well as or better than training on the full dataset

Even at higher pruning levels, EL2N scores computed early in training are competitive with forgetting scores

This indicates that the average error vector of a model trained for a few epochs is enough to identify the examples the network relies on heavily to shape the decision boundary
(average error vector a few epochs into training can identify examples that the network heavily uses to shape the decision boundary throughout training)

At extreme levels of pruning, however, pruning by EL2N or GraNd score causes a large drop in performance

This means that at high pruning levels the subsets selected by GraNd or EL2N score cover the data distribution poorly

In other words, a small number of very difficult examples does not contain enough variety for the trained model to reach a good test error

2.3. A property of the data

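After computing EL2N and GraNd scores with ResNet18 and ResNet50, the authors report that the scores reflect a property of the dataset rather than something specific to the network

This is not shown by directly comparing the GraNd or EL2N score of each example across networks; it is inferred from the similar performance curves and from the fact that the same amount of data can be pruned

They also show that the scores must be averaged to remove the dependence on a model's specific weights, and only then do they behave as a property of the dataset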

3. Identifying Noise Examples

3.1. EL2N score

Pruning made it possible to shrink the dataset without losing accuracy

This leads the authors to ask about the nature of the subpopulations that still allow high accuracy

One natural hypothesis is that the highest-scoring examples are the most important for achieving an accurate classifier

The authors compute the EL2N score of every example with models trained for 10 epochs and sort the examples by score

They then use a sliding window analysis: train only on the examples whose score percentile lies between f and f + P, and measure performance

As the window slides to higher percentiles, performance increases, but it drops when the window includes the very highest scores
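The window selection itself is simple to express. The sketch below assumes the percentile is taken over examples sorted by ascending score, with f and P given in percent; the function name and the rounding of the boundaries are illustrative choices, not taken from the paper.

```python
def sliding_window_indices(scores, f, p):
    """Indices of examples whose score percentile lies in [f, f + p)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # ascending score
    start = int(round(n * f / 100.0))
    stop = min(n, int(round(n * (f + p) / 100.0)))
    return order[start:stop]


# Ten hypothetical EL2N scores.
scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 1.0]

# A 40% window that stops short of the very highest scores ...
window_mid = sliding_window_indices(scores, f=50, p=40)
# ... and one that includes them.
window_top = sliding_window_indices(scores, f=60, p=40)
```

Training on `window_mid` versus `window_top` is the kind of comparison the sliding window analysis makes, with the window size held fixed.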

When the same experiment is repeated after replacing K% of the labels with random labels, the optimal window shifts
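A minimal sketch of this kind of label corruption, under the assumption that the corrupted examples are chosen uniformly at random and that a new label is drawn uniformly over all classes (so it can coincide with the original one). The function name, the seed handling and the toy labels are hypothetical.

```python
import random


def corrupt_labels(labels, k_percent, num_classes, seed=0):
    """Replace k_percent of the labels with uniformly random labels."""
    rng = random.Random(seed)
    n_noisy = int(round(len(labels) * k_percent / 100.0))
    noisy_idx = rng.sample(range(len(labels)), n_noisy)
    new_labels = list(labels)
    for i in noisy_idx:
        # The random label may equal the original label by chance.
        new_labels[i] = rng.randrange(num_classes)
    return new_labels, sorted(noisy_idx)


clean = [i % 10 for i in range(20)]  # 20 toy labels over 10 classes
noisy, changed = corrupt_labels(clean, k_percent=10, num_classes=10)
```

Keeping the list of corrupted indices makes it possible to check afterwards whether the highest-scoring examples are the corrupted ones.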

So training only on the highest-scoring samples may not be optimal, and even less so when there is label noise

Does that mean the high-scoring samples should always be removed? The authors say no

The experiment in Figure 2 prunes 60% of the dataset, but when 50% is pruned, including the highest-scoring examples gives the best performance

That is, when there is enough data, high-score examples do not hurt performance much and can even help, even if they are noisy or difficult
(This of course assumes that the dataset itself is not noisy)

The authors' point is that without a validation set, one should be careful about removing high-score examples

3.2. GraNd score

Pruning is done with GraNd scores computed from untrained, randomly initialized models

Including the high-score examples helps performance

On a noisy dataset the picture is different: the larger the GraNd scores of the training subset, the lower the accuracy, and the result is even worse than the random subset baseline

The authors attribute the problem to measuring the GraNd score with randomly initialized networks

GraNd scores at initialization find important examples because of some favorable property of the data distribution, and adding noise breaks the method
(adding noise to samples selected uniformly over the training set cripples the method)

A detailed analysis is left to future work

4. Optimization Landscape and the Training Dynamics

Understanding this part requires some preliminaries that I am not very familiar with, so it is only explained roughly

4.1. Evolution of the data-dependent NTK

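Prior work has shown that a neural network behaves like a linear model that transforms the data with the NTK and then performs linear classification
(Transforming the data with the NTK can be understood as a kernel trick; this is a condensed summary, so see the paper for details)

However, a real neural network is finite, so the kernel is not constant; the right object is the data-dependent NTK, which is defined as the Gram matrix of the logit Jacobian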
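As a rough illustration of "Gram matrix of the logit Jacobian", the sketch below builds the Gram matrix from Jacobian rows, where each row is the flattened gradient of one logit of one example with respect to the weights. Measuring change as the cosine distance between Gram matrices at two nearby checkpoints is an assumption here, as are the function names and the tiny hand-written Jacobians.

```python
import math


def gram_matrix(jacobian_rows):
    """K[i][j] = <J_i, J_j> for rows of a (flattened) logit Jacobian."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in jacobian_rows]
            for r1 in jacobian_rows]


def kernel_distance(k1, k2):
    """Cosine distance between two Gram matrices of the same shape."""
    dot = sum(a * b for r1, r2 in zip(k1, k2) for a, b in zip(r1, r2))
    norm1 = math.sqrt(sum(a * a for row in k1 for a in row))
    norm2 = math.sqrt(sum(a * a for row in k2 for a in row))
    return 1.0 - dot / (norm1 * norm2)


# Hypothetical Jacobians at two nearby checkpoints (2 rows, 2 weights).
jac_t = [[1.0, 0.0], [0.0, 1.0]]
jac_next = [[1.0, 0.0], [1.0, 1.0]]

k_t = gram_matrix(jac_t)
k_next = gram_matrix(jac_next)
change = kernel_distance(k_t, k_next)  # larger means the kernel moved more
```

Restricting the rows to a subset of examples (for instance a band of EL2N scores) gives the kind of NTK submatrix the analysis refers to.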

Early in training the Gram matrix has a high velocity, and around the time linear mode connectivity sets in, the velocity stabilizes at a small value

The authors find that examples with higher EL2N scores have higher velocity, but the velocity decreases for the very highest-scoring examples

This can be attributed to those samples being too hard to learn and unrepresentative, or to label noise

4.2. Connections to the Linear Mode Connectivity

Prior work studied how the stochasticity that arises when training a neural network affects the optimization trajectory

There is a parent network with weights w_t, and a child network with weights v_t that starts from the parent's weights at time t

The parent and the child are each trained on independent minibatches, and the earliest t at which the two end up in a linearly connected mode is found
(linearly connected mode: the loss at a linear interpolation of the two trained models' weights is approximately equal to, and no higher than, the linear interpolation of the two models' losses)
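The check described in the parenthetical can be sketched as an error barrier: the largest gap between the loss on the interpolated weights and the interpolated loss. Everything below is a toy under stated assumptions: weights are flat lists, `loss_fn` is any function of the weights, and the two example losses are made up to show one connected and one disconnected case.

```python
def interpolate(w, v, alpha):
    """Weights at alpha * w + (1 - alpha) * v."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(w, v)]


def error_barrier(loss_fn, w, v, num_points=11):
    """Max over the path of loss(interpolated weights) minus interpolated loss."""
    loss_w, loss_v = loss_fn(w), loss_fn(v)
    barrier = 0.0
    for step in range(num_points):
        alpha = step / (num_points - 1)
        on_path = loss_fn(interpolate(w, v, alpha))
        baseline = alpha * loss_w + (1.0 - alpha) * loss_v
        barrier = max(barrier, on_path - baseline)
    return barrier


def bowl(w):
    """Convex toy loss: no barrier between any two points."""
    return sum(a * a for a in w)


def double_well(w):
    """Toy loss with minima at -1 and +1 separated by a bump at 0."""
    return (w[0] ** 2 - 1.0) ** 2


flat_barrier = error_barrier(bowl, [1.0, 0.0], [0.0, 1.0])
bump_barrier = error_barrier(double_well, [-1.0], [1.0])
```

A barrier near zero corresponds to the two networks being linearly mode connected. Evaluating the loss only on a subset of examples, such as those with low or high EL2N scores, gives a per-subset barrier.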

According to prior work, on standard vision datasets linear mode connectivity (LMC) sets in early in training
(Here t is the point from which the two networks share the same weights; "early in training" does not mean the very beginning)

For examples with low EL2N scores, the error barrier is low from early in training, so LMC sets in early

For examples with high EL2N scores, the error barrier is high overall, so LMC sets in late in training

So the loss landscape behaves differently for low and high EL2N score examples

The loss landscape obtained from low EL2N score examples is flat, whereas the loss landscape obtained from high EL2N score examples is rough

This suggests that most of the learning happens on the high EL2N score examples
(The loss landscape of low EL2N score examples is already flat, so there is little left to optimize, whereas the loss landscape of high EL2N score examples is rough and leaves more directions to optimize)

5. Discussion

In summary

High score = high potential influence (important) = hard to learn

Very highest scoring examples : unrepresentative outliers of a class
(non standard background, odd angles, label noise, ...)

High-scoring examples are shown to maximally support the velocity of the NTK and to create a rougher loss landscape

The experiments from the NTK and LMC perspectives show that the scores align with phenomena found in other studies, beyond accuracy alone

Labels added by Jasonlee1995 on Dec 5, 2022: Principle (Understanding the AI), Data (Related with data), Vision (Related with Computer Vision tasks)
