|
357 | 357 | "\n",
|
358 | 358 | "<center width=\"100%\"><img src=\"../../tutorial11/uniform_flow.png\" width=\"300px\"></center>\n",
|
359 | 359 | "\n",
|
360 |
| - "You can see that the height of $p(y)$ should be lower than $p(x)$ after scaling. This change in volume represents $\\left|\\frac{df(x)}{dx}\\right|$ in our equation above, and ensures that even after scaling, we still have a valid probability distribution. We can go on with making our function $f$ more complex. However, the more complex $f$ becomes, the harder it will be to find the inverse $f^{-1}$ of it, and to calculate the log-determinant of the Jacobian $\\log{} \\left|\\det \\frac{df(\\mathbf{x})}{d\\mathbf{x}}\\right|$. An easier trick to stack multiple invertible functions $f_{1,...,K}$ after each other, as all together, they still represent a single, invertible function. Using multiple, learnable invertible functions, a normalizing flow attempts to transform $p_z(z)$ slowly into a more complex distribution which should finally be $p_x(x)$. We visualize the idea below\n", |
| 360 | + "You can see that the height of $p(y)$ should be lower than $p(x)$ after scaling. This change in volume represents $\\left|\\frac{df(x)}{dx}\\right|$ in our equation above, and ensures that even after scaling, we still have a valid probability distribution. We can go on with making our function $f$ more complex. However, the more complex $f$ becomes, the harder it will be to find the inverse $f^{-1}$ of it, and to calculate the log-determinant of the Jacobian $\\log{} \\left|\\det \\frac{df(\\mathbf{x})}{d\\mathbf{x}}\\right|$ (often abbreviated as *LDJ*). An easier trick to stack multiple invertible functions $f_{1,...,K}$ after each other, as all together, they still represent a single, invertible function. Using multiple, learnable invertible functions, a normalizing flow attempts to transform $p_z(z)$ slowly into a more complex distribution which should finally be $p_x(x)$. We visualize the idea below\n", |
361 | 361 | "(figure credit - [Lilian Weng](https://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html)):\n",
|
362 | 362 | "\n",
|
363 | 363 | "<center width=\"100%\"><img src=\"../../tutorial11/normalizing_flow_layout.png\" width=\"700px\"></center>\n",
|
|
414 | 414 | " return bpd, rng\n",
|
415 | 415 | "\n",
|
416 | 416 | " def encode(self, imgs, rng):\n",
|
417 |
| - " # Given a batch of images, return the latent representation z and ldj of the transformations\n", |
| 417 | + " # Given a batch of images, return the latent representation z and \n", |
| 418 | + " # log-determinant jacobian (ldj) of the transformations\n", |
418 | 419 | " z, ldj = imgs, jnp.zeros(imgs.shape[0])\n",
|
419 | 420 | " for flow in self.flows:\n",
|
420 | 421 | " z, ldj, rng = flow(z, ldj, rng, reverse=False)\n",
|
|
446 | 447 | " z = z_init\n",
|
447 | 448 | " \n",
|
448 | 449 | " # Transform z to x by inverting the flows\n",
|
| 450 | + " # The log-determinant jacobian (ldj) is usually not of interest during sampling\n", |
449 | 451 | " ldj = jnp.zeros(img_shape[0])\n",
|
450 | 452 | " for flow in reversed(self.flows):\n",
|
451 | 453 | " z, ldj, rng = flow(z, ldj, rng, reverse=True)\n",
|
|
6712 | 6714 | "\n",
|
6713 | 6715 | "$$z'_{j+1:d} = \\mu_{\\theta}(z_{1:j}) + \\sigma_{\\theta}(z_{1:j}) \\odot z_{j+1:d}$$\n",
|
6714 | 6716 | "\n",
|
6715 |
| - "The functions $\\mu$ and $\\sigma$ are implemented as a shared neural network, and the sum and multiplication are performed element-wise. The LDJ is thereby the sum of the logs of the scaling factors: $\\sum_i \\left[\\log \\sigma_{\\theta}(z_{1:j})\\right]_i$. Inverting the layer can as simply be done as subtracting the bias and dividing by the scale: \n", |
| 6717 | + "The functions $\\mu$ and $\\sigma$ are implemented as a shared neural network, and the sum and multiplication are performed element-wise. The log-determinant Jacobian (LDJ) is thereby the sum of the logs of the scaling factors: $\\sum_i \\left[\\log \\sigma_{\\theta}(z_{1:j})\\right]_i$. Inverting the layer can as simply be done as subtracting the bias and dividing by the scale: \n", |
6716 | 6718 | "\n",
|
6717 | 6719 | "$$z_{j+1:d} = \\left(z'_{j+1:d} - \\mu_{\\theta}(z_{1:j})\\right) / \\sigma_{\\theta}(z_{1:j})$$\n",
|
6718 | 6720 | "\n",
|
|
8786 | 8788 | " return self.nn(x)"
|
8787 | 8789 | ]
|
8788 | 8790 | },
|
8789 |
| - { |
8790 |
| - "cell_type": "code", |
8791 |
| - "execution_count": 16, |
8792 |
| - "metadata": {}, |
8793 |
| - "outputs": [ |
8794 |
| - { |
8795 |
| - "name": "stdout", |
8796 |
| - "output_type": "stream", |
8797 |
| - "text": [ |
8798 |
| - "Out (3, 32, 32, 18)\n" |
8799 |
| - ] |
8800 |
| - } |
8801 |
| - ], |
8802 |
| - "source": [ |
8803 |
| - "## Test MultiheadAttention implementation\n", |
8804 |
| - "# Example features as input\n", |
8805 |
| - "main_rng, x_rng = random.split(main_rng)\n", |
8806 |
| - "x = random.normal(x_rng, (3, 32, 32, 16))\n", |
8807 |
| - "# Create attention\n", |
8808 |
| - "mh_attn = GatedConvNet(c_hidden=32, c_out=18, num_layers=3)\n", |
8809 |
| - "# Initialize parameters of attention with random key and inputs\n", |
8810 |
| - "main_rng, init_rng = random.split(main_rng)\n", |
8811 |
| - "params = mh_attn.init(init_rng, x)['params']\n", |
8812 |
| - "# Apply attention with parameters on the inputs\n", |
8813 |
| - "out = mh_attn.apply({'params': params}, x)\n", |
8814 |
| - "print('Out', out.shape)\n", |
8815 |
| - "\n", |
8816 |
| - "del mh_attn, params" |
8817 |
| - ] |
8818 |
| - }, |
8819 | 8791 | {
|
8820 | 8792 | "cell_type": "markdown",
|
8821 | 8793 | "metadata": {},
|
|