[2016 AAAI Oral] Character-Aware Neural Language Models #162

Jasonlee1995 · 2023-09-19T04:21:31Z

Count-based models, the traditional language models, are easy to train, but data sparsity causes the probabilities of rare n-grams to be poorly estimated

Neural language models (NLMs) address the n-gram data sparsity issue by mapping each word to a word embedding vector that is fed into a neural network, and they outperform count-based n-gram LMs

Word-level NLMs do not learn subword information, so the embeddings of rare words are poorly estimated, which raises perplexity

The authors propose a language model that leverages subword information through a character-level convolutional neural network (CNN)

Put simply, the paper proposes a way to compose character-level embeddings into a word embedding

Unlike earlier non-word-level NLM work, it needs no pre-processing step such as morphological tagging, and it is lightweight and performs well

Only the parts I found important are summarized briefly below

1. Model

Task: next-word prediction (classification over the word vocabulary, unsupervised learning)

The input to the RNN-LM is a word embedding computed from character embeddings

1.1. Character-level Convolutional Neural Network

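Concatenate the character embeddings of word k, which consists of l characters (shape: (d, l))
Apply a convolution filter H of width w (shape: (d, w)), add a bias, and apply tanh to produce the feature map of word k (shape: (l-w+1,))
Note: each window is scored with the Frobenius inner product <A, B> = Tr(AB^T), which equals the sum of all elementwise products of the window and the filter, so this is an ordinary convolution along the character axis
Take the largest value of the feature map, i.e. max-over-time pooling (shape: (1,))
Repeat for h convolution filters to obtain the feature vector of word k (shape: (h,))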
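
The steps above can be sketched for a single word in plain Python. This is a minimal illustration, not the authors' code: the function name `char_cnn_word_features`, the toy sizes, the random initialization, and the use of a single filter width are all assumptions made for the example.

```python
import math
import random

def char_cnn_word_features(char_ids, char_emb, filters, biases):
    """Return one max-pooled feature per filter for a single word.

    char_ids: list of l character indices
    char_emb: list of d-dimensional embedding vectors, indexed by character id
    filters:  list of h filters, each a d x w nested list
    biases:   list of h floats
    """
    # Character matrix stored column-wise: cols[i] is the embedding of character i.
    cols = [char_emb[c] for c in char_ids]
    l = len(cols)
    features = []
    for H, b in zip(filters, biases):
        d, w = len(H), len(H[0])
        if l < w:
            raise ValueError("word is shorter than the filter width; pad it first")
        feature_map = []
        for i in range(l - w + 1):
            # Frobenius inner product of window and filter:
            # the sum of all elementwise products.
            s = sum(cols[i + j][r] * H[r][j] for r in range(d) for j in range(w))
            feature_map.append(math.tanh(s + b))
        # Max-over-time pooling keeps one scalar per filter.
        features.append(max(feature_map))
    return features

# Toy setup (hypothetical sizes): d=4, w=3, h=5, 26 lowercase characters.
rng = random.Random(0)
d, w, h, n_chars = 4, 3, 5, 26
char_emb = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n_chars)]
filters = [[[rng.uniform(-1, 1) for _ in range(w)] for _ in range(d)] for _ in range(h)]
biases = [0.0] * h

word = [ord(c) - ord("a") for c in "absurdity"]
feats = char_cnn_word_features(word, char_emb, filters, biases)
print(len(feats))  # one feature per filter
```

A real model would use several filter widths and many more filters, and would learn `char_emb`, `filters`, and `biases` jointly with the language model.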

1.2. Highway Network

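Further layers, such as an MLP, can be stacked on top of the character-level CNN in an attempt to improve performance

The authors report that in their experiments an MLP degraded performance, whereas a highway network improved it

The word embedding is therefore computed with CNN + highway network and fed into the RNN-LM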
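
For reference, one highway layer can be sketched as follows. As I understand the formulation, the output is z = t * g(W_H y + b_H) + (1 - t) * y with transform gate t = sigmoid(W_T y + b_T); using ReLU for g, and the helper names `matvec` and `highway_layer`, are assumptions of this sketch rather than details taken from these notes.

```python
import math

def matvec(W, x):
    """Multiply a nested-list matrix W by a vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def highway_layer(y, W_H, b_H, W_T, b_T):
    """z = t * relu(W_H y + b_H) + (1 - t) * y, with t = sigmoid(W_T y + b_T)."""
    # Transform gate: how much of the transformed input to let through.
    t = [1.0 / (1.0 + math.exp(-(a + b))) for a, b in zip(matvec(W_T, y), b_T)]
    # Candidate transformation of the input.
    g = [max(0.0, a + b) for a, b in zip(matvec(W_H, y), b_H)]
    # The carry gate (1 - t) passes the input through unchanged.
    return [ti * gi + (1.0 - ti) * yi for ti, gi, yi in zip(t, g, y)]

# Toy check: with W_T = 0 and b_T = 0 the gate is 0.5 everywhere.
y = [1.0, -2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
z = highway_layer(y, identity, [0.0, 0.0], zeros, [0.0, 0.0])
print(z)
```

Because input and output have the same size, the layer can be stacked directly on the CharCNN feature vector; a strongly negative `b_T` makes it behave almost like the identity.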

2. Experiments

2.1. Experimental Setup

Experiments cover not only English but also datasets of morphologically rich languages
(morphologically rich languages: Czech, German, French, Spanish, Russian, and Arabic)

Singleton words, which appear only once, are replaced with the UNK token
(the character model could produce embeddings for OOV tokens, but they are replaced with the UNK token for an exact comparison with prior work)
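
A minimal sketch of the singleton replacement, assuming whitespace-tokenized text and an `<unk>` symbol; the function name `replace_singletons` and the token string are illustrative choices, not taken from the paper.

```python
from collections import Counter

def replace_singletons(tokens, unk="<unk>"):
    """Replace every token that occurs exactly once with the UNK token."""
    counts = Counter(tokens)
    return [tok if counts[tok] > 1 else unk for tok in tokens]

tokens = "the cat saw the dog".split()
print(replace_singletons(tokens))
```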

Perplexity is used as the evaluation metric
(perplexity: the exponential of the model's average per-word NLL on the test word sequence, PPL = exp(NLL / T))
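
As a small worked example of the metric, the sketch below computes perplexity from the probabilities a model assigns to each test word; the function name `perplexity` and the toy probabilities are assumptions for illustration.

```python
import math

def perplexity(word_probs):
    """exp of the average negative log-likelihood over T predicted words."""
    nll = -sum(math.log(p) for p in word_probs)
    return math.exp(nll / len(word_probs))

# A model that gives every word probability 1/4 is as uncertain as a
# uniform choice among four words, so its perplexity is about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```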

Other details: hierarchical softmax is used, and the gradient norm is constrained
(constrain the norm of the gradients to be below 5)
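
The gradient constraint can be sketched as L2-norm clipping over a flat list of gradient values. Rescaling to exactly the threshold when the norm exceeds it is my reading of the constraint; the function name `clip_grad_norm` is hypothetical.

```python
import math

def clip_grad_norm(grads, max_norm=5.0):
    """Rescale grads so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the original norm.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads], norm
    return list(grads), norm

clipped, norm = clip_grad_norm([6.0, 8.0])
print(clipped, norm)
```

Gradients whose norm is already below the threshold are returned unchanged, so the direction of the update is never altered, only its length.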

2.2. Result

The character-level model performs on par with the existing SOTA models despite having about 60% fewer parameters

The character-level model outperforms the word-level and morphological models

3. Discussion

3.1. Learned Word Representations

To compare the word representations of the word-level and character-level models, the authors look up nearest neighbor words

Representations before the highway layers appear to rely solely on surface form
(e.g. the nearest neighbor words of you are your, young, four, youth, which look close in terms of edit distance)

Representations after the highway layers appear to encode semantic features well

That is, the character-level model captures semantic information much as the word-level model does

It still has the limitation of making clear mistakes, such as returning hhs as a word close to his

It also finds sensible neighbors for OOV words and for words with incorrect or non-standard spelling
(incorrect/non-standard spelling word: looooook)
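
The nearest-neighbor lookup can be sketched with cosine similarity. The two-dimensional vectors below are made up purely for illustration and are not representations from the paper; `cosine` and `nearest_neighbors` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(query, embeddings, k=3):
    """Return the k words most similar to query, excluding query itself."""
    q = embeddings[query]
    scored = [(cosine(q, vec), word) for word, vec in embeddings.items() if word != query]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]

# Made-up toy vectors, not learned representations.
embeddings = {
    "you": [1.0, 0.0],
    "your": [0.9, 0.1],
    "four": [0.5, 0.5],
    "we": [0.0, 1.0],
}
print(nearest_neighbors("you", embeddings, k=2))
```

For an OOV or misspelled word, the query vector would come from running the character-level model on its spelling rather than from a lookup table.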

3.2. Learned Character N-gram Representations

Each CharCNN filter learns to detect particular character n-grams
(each filter outputs a single scalar, the max over windows, which amounts to selecting the subword in the window that attains the max)
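
To make the "max selects a window" reading concrete, the sketch below returns the character n-gram whose window attains the max for one filter. The one-hot character embeddings and the hand-built filter from `ngram_filter` are assumptions chosen so the result is easy to check; a trained filter would have learned real-valued weights.

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
# One-hot character embeddings (an assumption for this toy example).
CHAR_EMB = {c: [1.0 if i == j else 0.0 for j in range(26)] for i, c in enumerate(ALPHABET)}

def ngram_filter(ngram):
    """Hand-built d x w filter that responds most strongly to the given n-gram."""
    return [[1.0 if ALPHABET[r] == ch else 0.0 for ch in ngram] for r in range(26)]

def selected_ngram(word, H, b=0.0):
    """Return the n-gram picked by max-over-time pooling, and its activation."""
    d, w = len(H), len(H[0])
    cols = [CHAR_EMB[c] for c in word]
    scores = [
        math.tanh(sum(cols[i + j][r] * H[r][j] for r in range(d) for j in range(w)) + b)
        for i in range(len(word) - w + 1)
    ]
    # Index of the window with the largest activation.
    best = max(range(len(scores)), key=scores.__getitem__)
    return word[best:best + w], scores[best]

print(selected_ngram("looking", ngram_filter("ing")))
```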

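The authors' initial expectation was that each filter would learn to activate on different morphemes, and that the model would build up the semantic representation of a word from the identified morphemes

However, on inspecting the selected character n-grams, they found that these did not correspond to valid morphemes

To gain intuition about this, they apply PCA to the learned representations of all character n-grams and plot them

As Figure 2 shows, the model learns to distinguish prefixes, suffixes, and others
(prefix: an affix attached to the front of a word, suffix: an affix attached to the end of a word)

The model is also especially sensitive to character n-grams containing hyphens, which the authors attribute to hyphens being a strong signal of part-of-speech
(part-of-speech: the grammatical category of a word; the part-of-speech splits around the hyphen, hence the sensitivity to hyphens)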

3.3. Highway Layers

MLP does poorly, although this could be due to optimization issues
We hypothesize that highway networks are especially well-suited to work with CNNs

The authors also report the following additional findings
(not shown in Table 7)

Using one or two highway layers is important for performance, but stacking more does not improve it further
(this may of course depend on dataset size)
Adding more convolutional layers before max-pooling does not improve performance
Using word-level word embeddings + highway layers does not improve performance
Concatenating word-level word embeddings with the CharCNN's output slightly degrades performance

3.4. Effect of Corpus/Vocab Sizes

To control the vocabulary size, every word outside the most frequent k words is replaced with the UNK token
(as in the experimental setup above, the UNK replacement is for an exact comparison with prior work)
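
A sketch of the top-k vocabulary restriction, assuming the vocabulary is built from training tokens and then applied to any split; the function names `top_k_vocab` and `apply_vocab` and the `<unk>` symbol are illustrative.

```python
from collections import Counter

def top_k_vocab(train_tokens, k):
    """Return the set of the k most frequent training words."""
    return {word for word, _ in Counter(train_tokens).most_common(k)}

def apply_vocab(tokens, vocab, unk="<unk>"):
    """Replace every out-of-vocabulary token with the UNK token."""
    return [tok if tok in vocab else unk for tok in tokens]

train_tokens = "a a a b b c".split()
vocab = top_k_vocab(train_tokens, k=2)
print(apply_vocab("a c d b".split(), vocab))
```

Varying `k` while keeping the corpus fixed, or the corpus while keeping `k` fixed, gives the grid of scenarios compared in this section.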

The character-level model outperforms the word-level model in every scenario, but the gap narrows as the corpus size grows

Jasonlee1995 added the labels Representation Learning (Self-Supervised Learning, Manifold Learning) and Language (Related with Natural Language Processing tasks) on Sep 19, 2023
