[2013 NIPS] Distributed Representations of Words and Phrases and their Compositionality #128

Jasonlee1995 · 2023-05-08T09:44:08Z

A paper that improves on the earlier word2vec
(several extensions of the original Skip-gram model)

The four main contributions are as follows

Subsampling of Frequent Words: discard words from the data based on their frequency
- significant speedup (around 2x - 10x)
- improves accuracy of the representations of less frequent words
Negative Sampling: classification against negative words that are sampled based on frequency
- faster training
- better vector representations for frequent words compared to more complex hierarchical softmax
A method for learning phrases
Another interesting property of the Skip-gram model: vector addition

Brief summary of only the parts I consider important

1. The Skip-gram Model

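The biggest bottleneck of the Skip-gram model is the final classification layer, so the paper briefly introduces hierarchical softmax, the previously used method, and then the proposed method, negative sampling

It also introduces subsampling of frequent words, which reduces the amount of training data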

1.1. Hierarchical Softmax

The explanation in the paper alone can be somewhat hard to follow, so this is explained with reference to a figure from the paper word2vec Parameter Learning Explained

First, the procedure in brief:

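Obtain the embedding vector of the input word
At each inner node of the tree, take the dot product of the node's learnable vector and the input embedding vector, and apply a sigmoid to assign probabilities to the left path and the right path
Applying the second step at every inner node gives a probability for every edge on every path from the root node to a leaf node
Define p(output_word | input_word) as the product of all probabilities along the path from the root node to the leaf node of that word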

Put simply, every edge has a probability, and p(output_word | input_word) is obtained by multiplying the edge probabilities along the path

On the design philosophy: if the probability of going from a node to its left child is a, the probability of going to its right child is set to 1-a
(this makes the result a normalized probability distribution, so there is no need to worry about the partition function)

The sigmoid function is used so that a always lies in (0, 1) while the underlying parameters can be trained freely without any bound constraint

The formula contains a function [[x]] that is 1 if x is true and -1 if x is false; this exploits the property sigmoid(-x) = 1 - sigmoid(x)
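A minimal sketch of the path-product computation, assuming a hand-built toy tree over four words. The `PATHS` table, the word list, and the random vectors are hypothetical and only illustrate that the leaf probabilities sum to 1 without computing a partition function.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy tree: each word is described by its path from the root
# as (inner_node_id, sign) pairs; sign +1 means "go left", -1 means "go right".
# The sign plays the role of [[x]] in the formula.
PATHS = {
    "the": [(0, +1)],
    "cat": [(0, -1), (1, +1)],
    "sat": [(0, -1), (1, -1), (2, +1)],
    "mat": [(0, -1), (1, -1), (2, -1)],
}


def hs_probability(word, input_vec, node_vecs):
    """p(word | input) = product over the path of sigmoid(sign * node_vec . input_vec)."""
    p = 1.0
    for node_id, sign in PATHS[word]:
        p *= sigmoid(sign * dot(node_vecs[node_id], input_vec))
    return p


rng = random.Random(0)
dim = 5
input_vec = [rng.uniform(-1, 1) for _ in range(dim)]                        # made-up embedding
node_vecs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]    # made-up node vectors

probs = {w: hs_probability(w, input_vec, node_vecs) for w in PATHS}
total = sum(probs.values())
print(probs)
print(total)  # equals 1 up to floating-point error
```

Because each inner node splits its probability mass into a and 1-a, the sum over leaves is 1 for any choice of vectors, and evaluating one word touches only the nodes on its path.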

The original word2vec uses a binary Huffman tree for speedup, since it assigns short paths to frequent words
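To illustrate why a Huffman tree helps, here is a small sketch that builds one from word counts with `heapq` and reports the path length of each word. The counts and the helper name `huffman_code_lengths` are made up for illustration; this is not the word2vec implementation.

```python
import heapq
import itertools


def huffman_code_lengths(freqs):
    """Return {word: path length} for a binary Huffman tree built from word counts."""
    counter = itertools.count()  # unique tie-breaker so the heap never compares dicts
    heap = [(f, next(counter), {w: 0}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every word in them one level deeper.
        merged = {w: depth + 1 for w, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]


freqs = {"the": 50, "cat": 20, "sat": 15, "mat": 10, "zygote": 5}  # made-up counts
lengths = huffman_code_lengths(freqs)
print(lengths)
```

Frequent words end up close to the root, so the most common training examples need the fewest sigmoid evaluations.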

1.2. Negative Sampling

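The negative sampling algorithm is said to be motivated by Noise Contrastive Estimation (NCE)

The philosophy of NCE is that a good model should be able to distinguish data from noise by means of logistic regression, and NCE has been shown to approximately maximize the log probability of the softmax

The negative sampling (NEG) objective is $\log \sigma({v'_{w_O}}^\top v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-{v'_{w_i}}^\top v_{w_I}) \right]$, and it replaces the log of p(output_word | input_word)

Put simply, the noise in NCE can be thought of as words that are not within the context window of the input word, and the model only needs to tell these apart from the real context words

k negative samples are drawn for each data sample; the paper suggests k in the range 5 - 20 for small training datasets and 2 - 5 for large training datasets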
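A minimal sketch of the NEG objective for a single (input, output) pair, assuming the expectation over the noise distribution is approximated by the k sampled negatives. The function names and the 2-d vectors are hypothetical.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def neg_objective(v_in, v_out_pos, v_out_negs):
    """log sigma(v'_pos . v_in) + sum_i log sigma(-v'_neg_i . v_in), to be maximized."""
    value = log_sigmoid(dot(v_out_pos, v_in))
    for v_neg in v_out_negs:
        value += log_sigmoid(-dot(v_neg, v_in))
    return value


v_in = [1.0, 0.0]  # made-up input word vector

# Positive word aligned with the input, negatives pointing away: high objective.
good = neg_objective(v_in, [2.0, 0.0], [[-2.0, 0.0], [-1.0, 0.5]])
# The reverse situation: low objective.
bad = neg_objective(v_in, [-2.0, 0.0], [[2.0, 0.0], [1.0, 0.5]])
print(good, bad)
```

The cost per training pair depends on k + 1 output vectors rather than on the vocabulary size, which is where the speedup over the full softmax comes from.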

Negative samples are drawn from the unigram distribution raised to the 3/4 power
(unigram = frequency distribution; samples come from $U(w)^{3/4}/Z$; 3/4 was chosen because it worked best in experiments)

Since frequent words are sampled often, the representations of frequent words improve
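A small sketch of the noise distribution $U(w)^{3/4}/Z$, using made-up counts for three words. `noise_distribution` is a hypothetical helper name.

```python
import random
from collections import Counter


def noise_distribution(counts, power=0.75):
    """Raise each count to `power` and renormalize: U(w)^power / Z."""
    weights = {w: c ** power for w, c in counts.items()}
    z = sum(weights.values())
    return {w: x / z for w, x in weights.items()}


counts = Counter({"the": 1000, "cat": 100, "zygote": 1})  # made-up counts
total = sum(counts.values())
unigram = {w: c / total for w, c in counts.items()}
noise = noise_distribution(counts)

# Draw k = 5 negative samples from the smoothed distribution.
rng = random.Random(0)
words = list(noise)
negatives = rng.choices(words, weights=[noise[w] for w in words], k=5)
print(unigram)
print(noise)
print(negatives)
```

Compared with the raw unigram distribution, the 3/4 power moves some probability mass from very frequent words to rare ones while keeping the frequency ordering.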

1.3. Subsampling of Frequent Words

Words such as 'in', 'the', and 'a' occur extremely often and carry less information value than rare words

This means that the representations of frequent words do not change significantly even after training on a very large number of examples

To counter the imbalance between frequent and rare words, each word $w_i$ in the training set is discarded with probability $P(w_i) = 1 - \sqrt{t / f(w_i)}$
(the formula was chosen heuristically, but it is used because it works well)

Here $f(w_i)$ is the frequency of word $w_i$, and $t$ is a chosen threshold, reported as around $10^{-5}$

This improves training speed and also gives better representations for rare words
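A sketch of the discard rule, assuming the paper's formula $P(w_i) = 1 - \sqrt{t / f(w_i)}$. Clipping negative values to 0 (words with $f(w_i) \le t$ are never discarded) is my own reading of the formula, and the toy corpus uses a much larger `t` than $10^{-5}$ only because it is tiny.

```python
import math
import random
from collections import Counter


def discard_probability(freq, t=1e-5):
    """P(w) = 1 - sqrt(t / f(w)); clipped at 0 (assumption) when f(w) <= t."""
    return max(0.0, 1.0 - math.sqrt(t / freq))


def subsample(tokens, t=1e-5, seed=0):
    """Drop each token independently with its word's discard probability."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for tok in tokens:
        if rng.random() >= discard_probability(counts[tok] / total, t):
            kept.append(tok)
    return kept


# Made-up toy corpus: relative frequencies 0.9, 0.09, 0.01.
tokens = ["the"] * 9000 + ["cat"] * 900 + ["zygote"] * 100
kept = subsample(tokens, t=1e-2)
print(Counter(tokens))
print(Counter(kept))
```

With these numbers 'the' is dropped roughly 89% of the time and 'cat' roughly 67% of the time, while 'zygote' is always kept, so the corpus shrinks and rare words make up a larger share of what remains.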

2. Learning Phrases

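Put simply, a phrase can be treated as a single token during training

What remains is the question of how to form the phrases

The authors identify phrases in the text by scoring each bigram as $\mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)}$, where $\delta$ is a discounting coefficient

That is, well-scoring bigrams are merged, and iterating this procedure also yields phrases longer than two words
(a bigram is treated as a phrase only if its score is above a chosen threshold)

A simple example suggests why the score is designed this way

The bigram he is will appear frequently in text, but it is not a phrase, so presumably the counts of he and is were put in the denominator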
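A toy sketch of the bigram score (count(a b) - delta) / (count(a) * count(b)) followed by one merge pass. The sentence, the `delta` value, and the threshold are made up, and `phrase_scores` and `merge_phrases` are hypothetical helper names.

```python
from collections import Counter


def phrase_scores(tokens, delta=0.0):
    """score(a, b) = (count(a b) - delta) / (count(a) * count(b)) for each bigram."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (a, b): (n - delta) / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }


def merge_phrases(tokens, scores, threshold):
    """One left-to-right pass that joins bigrams scoring above the threshold."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            out.append("_".join(pair))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


text = "he is in new york he is happy in paris she is sad it is in new york"
tokens = text.split()
scores = phrase_scores(tokens, delta=1.0)   # delta discounts bigrams seen only once
merged = merge_phrases(tokens, scores, threshold=0.2)
print(scores[("new", "york")], scores[("he", "is")])
print(merged)
```

Both bigrams occur twice, but 'is' is frequent on its own, so the denominator pushes the score of 'he is' below that of 'new york'. Running the same two steps again on `merged` is how longer phrases could be formed.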

3. Additive Compositionality

The previous paper showed that word vectors support subtraction-style analogies such as vector(queen) - vector(woman) + vector(man) ≈ vector(king); this paper goes further and shows that they also support addition, as in vector(Germany) + vector(capital) ≈ vector(Berlin)

This can be explained through the training objective

Training makes the input word vector represent its context distribution, i.e. which words are likely to appear around it

These values are logarithmically related to the probabilities, so the sum of two word vectors is related to the product of the two context distributions
(presumably the point is that logits are turned into probabilities through the softmax or sigmoid, so adding logits amounts to multiplying in the exponent)

A product of distributions acts like an AND function, which is why the vectors exhibit the additive property
(the resulting word will be one that has high probability under both words)
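The "sum of vectors corresponds to product of distributions" argument can be checked numerically under a full-softmax assumption. All vectors below are made up, and the word labels are only there to mirror the Germany + capital example.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 2-d output vectors for a tiny vocabulary.
out_vecs = {
    "Berlin": [2.0, 2.0],
    "Paris": [2.0, -1.0],
    "Bavaria": [-1.0, 2.0],
    "tree": [-1.0, -1.0],
}
v_capital = [1.0, 0.0]   # made-up input vector: favors Berlin and Paris
v_germany = [0.0, 1.0]   # made-up input vector: favors Berlin and Bavaria
v_sum = [a + b for a, b in zip(v_capital, v_germany)]

words = list(out_vecs)
p_capital = softmax([dot(out_vecs[w], v_capital) for w in words])
p_germany = softmax([dot(out_vecs[w], v_germany) for w in words])
p_sum = softmax([dot(out_vecs[w], v_sum) for w in words])

# Elementwise product of the two context distributions, renormalized.
product = [a * b for a, b in zip(p_capital, p_germany)]
z = sum(product)
p_product = [p / z for p in product]

best = words[p_sum.index(max(p_sum))]
print(p_sum)
print(p_product)
print(best)
```

The two printed distributions agree, and the top word is the one that is likely under both inputs, which is the AND-like behavior described above.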
