[2023 INTERSPEECH] Language-Routing Mixture of Experts for Multilingual and Code-Switching Speech Recognition #160

Jasonlee1995 · 2023-09-10T09:20:41Z

Mixture of Experts (MoE) architectures have been performing well on multilingual ASR and code-switching ASR (CS-ASR)

Existing MoE-based methods use independent encoders to obtain language-specific representations and fuse them at decoding time to produce the output
(that is, each language has its own encoder)

This structure is not scalable, because the model's computational complexity grows with the number of supported languages

The paper proposes Language-Routing Mixture of Experts (LR-MoE) to address this computational inefficiency

LR-MoE consists of a shared block that captures global representations and a Mixture-of-Language-Experts (MLE) block that captures language-specific representations

Below is a brief summary of only the parts I consider important

1. Motivations

Existing MoE-based methods have the following problems

Every language-specific block has to be computed: even for monolingual input, the blocks of the other languages are evaluated too, which is redundant computational overhead
There is no interaction between the language-specific blocks, so cross-linguistic contextual information is easily lost

The authors say LR-MoE was inspired by the sparsely-gated mixture of experts (sMoE)

2. Proposed Methods

2.1. Sparsely-Gated Mixture of Experts

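This is an existing method that works as shown in Figure 1 (a); in detail:

The router takes the output of the preceding non-expert layer as input and predicts a probability for each expert
Only the expert with the highest router probability is evaluated
That expert's output is multiplied by its routing probability and used as the input to the next layer

Because only the single expert selected by the router is used, rather than all experts, the computation is efficient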
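
The three routing steps above can be sketched in plain Python. This is only an illustration, not the paper's implementation: the names `softmax`, `top1_moe`, and `router_weights` are hypothetical, and the router is assumed to be a bias-free linear layer followed by a softmax.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top1_moe(x, router_weights, experts):
    """x: feature vector; router_weights: one weight vector per expert (assumed linear router);
    experts: list of callables mapping a vector to a vector."""
    # Step 1: the router turns the non-expert layer output into per-expert probabilities.
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in router_weights]
    probs = softmax(logits)
    # Step 2: only the most probable expert is evaluated.
    k = max(range(len(probs)), key=probs.__getitem__)
    y = experts[k](x)
    # Step 3: the expert output is scaled by its routing probability.
    return k, [probs[k] * v for v in y]
```

The other experts are never called, which is where the saving over running every expert comes from.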

During training, an auxiliary loss is added to balance the load across experts
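
These notes do not record the exact form of the auxiliary loss. As an assumption, the sketch below uses the load-balancing loss from Switch Transformer, $\alpha \cdot N \cdot \sum_i f_i P_i$, where $f_i$ is the fraction of tokens routed to expert $i$ and $P_i$ is the mean router probability for expert $i$. The function name and the default `alpha` are hypothetical.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """router_probs: one list of per-expert probabilities per token.
    Assumed form (Switch Transformer style), not confirmed by the paper summary."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # f[i]: fraction of tokens whose top-1 expert is i.
    counts = [0] * num_experts
    for p in router_probs:
        counts[max(range(num_experts), key=p.__getitem__)] += 1
    f = [c / num_tokens for c in counts]
    # P[i]: mean routing probability assigned to expert i.
    P = [sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)]
    return alpha * num_experts * sum(fi * Pi for fi, Pi in zip(f, P))
```

Under this form, routing that sends every token to one expert is penalized more than routing that spreads tokens evenly.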

2.2. Architecture of LR-MoE

The LR-MoE encoder consists of two kinds of blocks

shared block
Mixture-of-Language-Experts (MLE) block (Figure 1 (b))

MLE differs from sMoE in that each expert is a Language-Specific Expert (LSE) and a language identification (LID) loss is used

Note that all MLE modules share a single router

2.3. Language Routing

2.3.1. LID-Gated Network

In the end, the router's job is to say which language expert should be used, given the output of the non-expert layer

Since the output of the non-expert layer already carries high-dimensional linguistic information, a single linear layer is used to perform the frame-level LID task

For frame-level LID, an LID-CTC loss is used to match the token-to-frame alignment

For comparison, the authors also train with an utterance-wise LID task
(global average pooling over the time dimension, then a cross-entropy loss on the utterance's language ID)

This looks somewhat complicated, but it boils down to the following

frame-level LID : the router predicts a language ID for each frame and is trained with a CTC loss to match the ground-truth text labels converted to language IDs
utterance-level LID : the router predicts a language ID for each frame and is trained with a cross-entropy loss to match the utterance-level language ID
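
The two training targets can be made concrete with a small sketch. Everything here is illustrative: `token_language` is a hypothetical rule that treats any token containing a CJK ideograph as Mandarin and everything else as English, and pooling the frame logits (rather than some other frame-level quantity) before the cross entropy is an assumption.

```python
import math

def token_language(token):
    # Hypothetical rule: a CJK Unified Ideograph marks a Mandarin token, anything else is English.
    return "CN" if any("\u4e00" <= ch <= "\u9fff" for ch in token) else "EN"

def lid_ctc_target(tokens):
    # Frame-level LID: replace each transcript token by its language ID, keeping order and length.
    # CTC then aligns this label sequence to the frame-level router outputs.
    return [token_language(t) for t in tokens]

def utterance_lid_loss(frame_logits, label_index):
    # Utterance-level LID: average the per-frame logits over time, then apply cross entropy.
    num_frames = len(frame_logits)
    pooled = [sum(col) / num_frames for col in zip(*frame_logits)]
    m = max(pooled)
    log_z = m + math.log(sum(math.exp(v - m) for v in pooled))
    return log_z - pooled[label_index]
```

For a code-switched transcript such as `["我", "想", "watch", "movie"]`, the frame-level target keeps the switch point, while the utterance-level target collapses the whole utterance into one label.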

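So we can infer that utterance-level LID is usable for multilingual ASR but is not a good fit for CS-ASR, since a single utterance-level label cannot describe an utterance that switches languages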

2.3.2. Shared Router

Unlike in sMoE, the router's job in LR-MoE is to identify the language, so the authors say they use one shared router for all MLE layers

the expert layers are language-specific and the desired routing paths are determined with a priori in LR-MoE
therefore, the shared LID router might be helpful to reduce additional computation and the multi-level error accumulation caused by the alignment drift of the language routing
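
One way to read the shared router is that the routing decision is made once per frame and then reused by every MLE layer. The sketch below follows that reading, which is an assumption on my part; `route_frames` and `mle_stack` are hypothetical names, and frames are reduced to scalars to keep the example short.

```python
def route_frames(frame_lid_logits):
    # One routing decision per frame: the index of the most likely language.
    return [max(range(len(l)), key=l.__getitem__) for l in frame_lid_logits]

def mle_stack(frames, routes, layers):
    """frames: per-frame features (scalars here for brevity);
    routes: per-frame expert index from the shared router;
    layers: list of MLE layers, each a list of per-language expert callables."""
    for experts in layers:
        # Every layer reuses the same routes, so no layer can drift to a different alignment.
        frames = [experts[r](x) for x, r in zip(frames, routes)]
    return frames
```

Because the routes are fixed before the MLE layers run, a frame stays with the same language expert at every depth.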

2.4. Pre-trained Shared Block

The authors say they use a pre-trained shared block to speed up convergence of the LID-gated network

2.5. Overall Loss

$L_{mtl} = L_{asr} + \lambda_{lid} L_{lid}$

For CTC-based ASR, $L_{asr} = L_{ctc}$; for attention-based ASR, $L_{asr} = \lambda_{ctc} L_{ctc} + L_{seq2seq}$

The paper uses $\lambda_{lid} = 0.3$ and $\lambda_{ctc} = 0.3$
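
As a quick check of how the terms combine, the loss can be written as a small function. The function name and the convention that `l_seq2seq=None` selects the CTC-based case are my own; the weights default to the values reported above.

```python
def multitask_loss(l_ctc, l_lid, l_seq2seq=None, lambda_lid=0.3, lambda_ctc=0.3):
    # CTC-based ASR: L_asr = L_ctc.
    # Attention-based ASR: L_asr = lambda_ctc * L_ctc + L_seq2seq.
    l_asr = l_ctc if l_seq2seq is None else lambda_ctc * l_ctc + l_seq2seq
    # L_mtl = L_asr + lambda_lid * L_lid
    return l_asr + lambda_lid * l_lid
```

With $L_{ctc} = 2.0$, $L_{lid} = 1.0$, and $L_{seq2seq} = 1.5$, the attention-based case gives $0.3 \cdot 2.0 + 1.5 + 0.3 \cdot 1.0 = 2.4$.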

3. Experiments

3.1. Datasets

ASRU 2019 Mandarin-English code-switching Challenge dataset
Aishell-1 (CN)
train-clean-100 subset of Librispeech (EN)
Japanese (JA) from Datatang
Zeroth-Korean (KR)
Mandarin-English code-switching (CN-EN) from Datatang

3.2. Experimental Results

3.2.1. Results on Mandarin-English ASR

As Table 3 shows, LR-MoE is better in terms of both recognition performance and parameter count
(FLR-MoE : LR-MoE with frame-level LID)

3.2.2. Results on multilingual ASR

As Table 4 shows, the method is effective in both monolingual and code-switching scenarios

FLR-MoE outperforms ULR-MoE (LR-MoE with utterance-level LID) in the code-switching scenario

Using a shared router also improves performance

3.3. Ablation Study and Analysis

3.3.1. Position of MLE

An ablation study on how many layers should be shared and from which layer MLE should start

The deeper the shared block, the more accurate LID becomes, but with fewer MLE layers the model learns language-specific representations less well

That is, there is a trade-off between LID accuracy and language-specific representation

Placing the MLE layers from a middle position gives the best performance

3.3.2. LID and Routing Analysis

The proposed method reduces language confusion
(comparing a CTC + LID-CTC baseline against FLR-MoE)

The router routes language segments to the corresponding language experts well
