+ "How are we applying a Multi-Head Attention layer in a neural network, where we don't have an arbitrary query, key, and value vector as input? Looking at the computation graph above, a simple but effective implementation is to set the current feature map in a NN, $X\\in\\mathbb{R}^{B\\times T\\times d_{\\text{model}}}$, as $Q$, $K$ and $V$ ($B$ being the batch size, $T$ the sequence length, $d_{\\text{model}}$ the hidden dimensionality of $X$). The consecutive weight matrices $W^{Q}$, $W^{K}$, and $W^{V}$ can transform $X$ to the corresponding feature vectors that represent the queries, keys, and values of the input. Note that commonly, these weight matrices are initialized with the Xavier initialization. However, the layer is usually not too sensitive to the initialization, as long as the variance of $Q$ and $K$ do not become too large.\n",