Commit 510f82e

Update transformer_tutorial.py (#2363)
Fix for "perhaps there is a misprint at line 40" (#2111). Reviewing the referenced paper (https://arxiv.org/pdf/1706.03762.pdf, section 3.2.3): "Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections. See Figure 2." Thus the suggested change of the reference from nn.TransformerEncoder to nn.TransformerDecoder seems reasonable.
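
For illustration only (not part of the commit), here is a minimal sketch of the −∞ masking the quoted passage describes, applied inside scaled dot-product attention; the function and tensor names below are assumptions for this example, not code from the tutorial:

import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # Raw attention scores for inputs of shape (seq_len, d_model).
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        # Illegal (future) positions hold -inf, so their softmax weight becomes 0.
        scores = scores + mask
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
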
1 parent 921f4fb · commit 510f82e

1 file changed, 1 insertion(+), 1 deletion(-)

beginner_source/transformer_tutorial.py

@@ -37,7 +37,7 @@
 # ``nn.TransformerEncoder`` consists of multiple layers of
 # `nn.TransformerEncoderLayer <https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html>`__.
 # Along with the input sequence, a square attention mask is required because the
-# self-attention layers in ``nn.TransformerEncoder`` are only allowed to attend
+# self-attention layers in ``nn.TransformerDecoder`` are only allowed to attend
 # the earlier positions in the sequence. For the language modeling task, any
 # tokens on the future positions should be masked. To produce a probability
 # distribution over output words, the output of the ``nn.TransformerEncoder``
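
For reference, a minimal sketch of the square attention mask the changed comment refers to; the helper name here is illustrative (recent PyTorch releases expose an equivalent nn.Transformer.generate_square_subsequent_mask):

import torch

def generate_square_subsequent_mask(sz: int) -> torch.Tensor:
    # -inf above the main diagonal (future positions), 0.0 on and below it.
    return torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)

print(generate_square_subsequent_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])

In the tutorial this mask is passed along with the input sequence to the ``nn.TransformerEncoder`` forward call, so each position can only attend to itself and earlier positions.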
