
Commit 01b1d43

committed Apr 2, 2023
Tutorial 6 (JAX): Clarify initialization of qkv
1 parent fa80c4d commit 01b1d43

File tree

1 file changed (+2, -1 lines)

 

docs/tutorial_notebooks/JAX/tutorial6/Transformers_and_MHAttention.ipynb

+2 -1
@@ -340,7 +340,8 @@
 "\n",
 "<center width=\"100%\"><img src=\"../../tutorial6/multihead_attention.svg\" width=\"230px\"></center>\n",
 "\n",
-"How are we applying a Multi-Head Attention layer in a neural network, where we don't have an arbitrary query, key, and value vector as input? Looking at the computation graph above, a simple but effective implementation is to set the current feature map in a NN, $X\\in\\mathbb{R}^{B\\times T\\times d_{\\text{model}}}$, as $Q$, $K$ and $V$ ($B$ being the batch size, $T$ the sequence length, $d_{\\text{model}}$ the hidden dimensionality of $X$). The consecutive weight matrices $W^{Q}$, $W^{K}$, and $W^{V}$ can transform $X$ to the corresponding feature vectors that represent the queries, keys, and values of the input. Using this approach, we can implement the Multi-Head Attention module below."
+"How are we applying a Multi-Head Attention layer in a neural network, where we don't have an arbitrary query, key, and value vector as input? Looking at the computation graph above, a simple but effective implementation is to set the current feature map in a NN, $X\\in\\mathbb{R}^{B\\times T\\times d_{\\text{model}}}$, as $Q$, $K$ and $V$ ($B$ being the batch size, $T$ the sequence length, $d_{\\text{model}}$ the hidden dimensionality of $X$). The consecutive weight matrices $W^{Q}$, $W^{K}$, and $W^{V}$ can transform $X$ to the corresponding feature vectors that represent the queries, keys, and values of the input. Note that commonly, these weight matrices are initialized with the Xavier initialization. However, the layer is usually not too sensitive to the initialization, as long as the variance of $Q$ and $K$ do not become too large.\n",
+"With this in mind, we can implement the Multi-Head Attention module below."
 ]
 },
 {
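For context, the added sentences describe how the $W^{Q}$, $W^{K}$, and $W^{V}$ projections are typically initialized. Below is a minimal sketch of that step, assuming Flax linen as used elsewhere in the tutorial; the module name `QKVProjection` and the concrete shapes are illustrative, not the notebook's exact code.

```python
# Minimal sketch (assumption: Flax linen, as in the rest of the tutorial) of the
# QKV projection described in the changed cell. Names and shapes are illustrative.
import jax
import jax.numpy as jnp
import flax.linen as nn


class QKVProjection(nn.Module):
    embed_dim: int  # d_model, the hidden dimensionality of X

    @nn.compact
    def __call__(self, x):
        # One dense layer produces Q, K, and V stacked along the feature axis.
        # Xavier (Glorot) uniform initialization keeps the variance of Q and K moderate.
        qkv = nn.Dense(
            3 * self.embed_dim,
            kernel_init=nn.initializers.xavier_uniform(),
            bias_init=nn.initializers.zeros,
        )(x)
        q, k, v = jnp.split(qkv, 3, axis=-1)  # each [Batch, SeqLen, embed_dim]
        return q, k, v


# Usage: x plays the role of X in the text, with shape [Batch, SeqLen, d_model].
x = jnp.ones((4, 16, 128))
model = QKVProjection(embed_dim=128)
params = model.init(jax.random.PRNGKey(0), x)
q, k, v = model.apply(params, x)
```

Using a single dense layer of size $3 \cdot d_{\text{model}}$ and splitting its output is equivalent to applying three separate matrices $W^{Q}$, $W^{K}$, $W^{V}$; it is simply a common, slightly more efficient way to write the same projection.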
