
Commit ffea03e

Tutorial 6: Fixing minor typo in WO dimension
1 parent 55d7996 commit ffea03e

File tree

2 files changed: +2 −2 lines changed


docs/tutorial_notebooks/JAX/tutorial6/Transformers_and_MHAttention.ipynb

Lines changed: 1 addition & 1 deletion
@@ -337,7 +337,7 @@
 "\\end{split}\n",
 "$$\n",
 "\n",
-"We refer to this as Multi-Head Attention layer with the learnable parameters $W_{1...h}^{Q}\\in\\mathbb{R}^{D\\times d_k}$, $W_{1...h}^{K}\\in\\mathbb{R}^{D\\times d_k}$, $W_{1...h}^{V}\\in\\mathbb{R}^{D\\times d_v}$, and $W^{O}\\in\\mathbb{R}^{h\\cdot d_k\\times d_{out}}$ ($D$ being the input dimensionality). Expressed in a computational graph, we can visualize it as below (figure credit - [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762)).\n",
+"We refer to this as Multi-Head Attention layer with the learnable parameters $W_{1...h}^{Q}\\in\\mathbb{R}^{D\\times d_k}$, $W_{1...h}^{K}\\in\\mathbb{R}^{D\\times d_k}$, $W_{1...h}^{V}\\in\\mathbb{R}^{D\\times d_v}$, and $W^{O}\\in\\mathbb{R}^{h\\cdot d_v\\times d_{out}}$ ($D$ being the input dimensionality). Expressed in a computational graph, we can visualize it as below (figure credit - [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762)).\n",
 "\n",
 "<center width=\"100%\"><img src=\"../../tutorial6/multihead_attention.svg\" width=\"230px\"></center>\n",
 "\n",

docs/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.ipynb

Lines changed: 1 addition & 1 deletion
@@ -319,7 +319,7 @@
 "\\end{split}\n",
 "$$\n",
 "\n",
-"We refer to this as Multi-Head Attention layer with the learnable parameters $W_{1...h}^{Q}\\in\\mathbb{R}^{D\\times d_k}$, $W_{1...h}^{K}\\in\\mathbb{R}^{D\\times d_k}$, $W_{1...h}^{V}\\in\\mathbb{R}^{D\\times d_v}$, and $W^{O}\\in\\mathbb{R}^{h\\cdot d_k\\times d_{out}}$ ($D$ being the input dimensionality). Expressed in a computational graph, we can visualize it as below (figure credit - [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762)).\n",
+"We refer to this as Multi-Head Attention layer with the learnable parameters $W_{1...h}^{Q}\\in\\mathbb{R}^{D\\times d_k}$, $W_{1...h}^{K}\\in\\mathbb{R}^{D\\times d_k}$, $W_{1...h}^{V}\\in\\mathbb{R}^{D\\times d_v}$, and $W^{O}\\in\\mathbb{R}^{h\\cdot d_v\\times d_{out}}$ ($D$ being the input dimensionality). Expressed in a computational graph, we can visualize it as below (figure credit - [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762)).\n",
 "\n",
 "<center width=\"100%\"><img src=\"multihead_attention.svg\" width=\"230px\"></center>\n",
 "\n",
