|
252 | 252 | "\n",
|
253 | 253 | "Initially, we want to give the current weights of the model to every GPU that we are using. To do so, we will **broadcast** the necessary tensors.\n",
|
254 | 254 | "\n",
|
255 |
| - "Then, each GPU will collect a subset of the full batch, lets say only 64 out of 256 samples, from memory and perform a forward pass of the model. At the end, we need to compute the loss over the entire batch of 256 samples, but no GPU can fit all of these. Here, the **reduction** primitive comes to the resque. The tensors that reside in different GPUs are collected and an operation is performed that will *reduce* the tensors to a single one. This allows for the result of the operation to still fit in memory. We may want to keep thisresult in a single GPU (using **reduce**) or send it to all of them (using **all_reduce**).\n", |
| 255 | + "Then, each GPU will collect a subset of the full batch, lets say only 64 out of 256 samples, from memory and perform a forward pass of the model. At the end, we need to compute the loss over the entire batch of 256 samples, but no GPU can fit all of these. Here, the **reduction** primitive comes to the rescue. The tensors that reside in different GPUs are collected and an operation is performed that will *reduce* the tensors to a single one. This allows for the result of the operation to still fit in memory. We may want to keep this result in a single GPU (using **reduce**) or send it to all of them (using **all_reduce**).\n", |
256 | 256 | "\n",
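| + "As a minimal sketch (assuming the process group has already been initialized and each rank has computed the loss of its own 64-sample shard), the two reductions could look like this:\n",
| + "\n",
| + "```python\n",
| + "import torch\n",
| + "import torch.distributed as dist\n",
| + "\n",
| + "rank = dist.get_rank()\n",
| + "# stand-in for the loss computed on this rank's 64-sample shard\n",
| + "local_loss = torch.rand(1, device=f'cuda:{rank}')\n",
| + "\n",
| + "# all_reduce: every rank ends up with the sum of all local losses\n",
| + "dist.all_reduce(local_loss, op=dist.ReduceOp.SUM)\n",
| + "\n",
| + "# reduce: only the destination rank (rank 0) is guaranteed to hold the summed result\n",
| + "# dist.reduce(local_loss, dst=0, op=dist.ReduceOp.SUM)\n",
| + "```\n",
| + "\n",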
|
257 | 257 | "The operations that we can perform are determined by the backend that we are currently using. When using `nccl`, the list of available operations is the following:\n",
|
258 | 258 | " - `SUM`\n",
|
|
389 | 389 | "source": [
|
390 | 390 | "### All Gather\n",
|
391 | 391 | "\n",
|
392 |
| - "The **all gather** operation allows for all GPUs to have access to all the data processed by the others. This can be expecially useful when different operations need to be performed by each GPU, after a common operation has been performed on each subset of the data. It is important to note that the entirety of the data needs to fit in a single GPU, so here the bottleneck won't be the memory, instead, it will be the processing speed. \n", |
| 392 | + "The **all gather** operation allows for all GPUs to have access to all the data processed by the others. This can be especially useful when different operations need to be performed by each GPU, after a common operation has been performed on each subset of the data. It is important to note that the entirety of the data needs to fit in a single GPU, so here the bottleneck won't be the memory, instead, it will be the processing speed. \n", |
393 | 393 | "\n",
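| + "In a minimal sketch (again assuming an initialized process group), the target of the collective is a list holding one empty tensor per rank:\n",
| + "\n",
| + "```python\n",
| + "import torch\n",
| + "import torch.distributed as dist\n",
| + "\n",
| + "rank, world_size = dist.get_rank(), dist.get_world_size()\n",
| + "# each rank contributes its own tensor...\n",
| + "local = torch.full((2, 2), float(rank), device=f'cuda:{rank}')\n",
| + "# ...and receives a copy of every rank's tensor in this list\n",
| + "gathered = [torch.empty_like(local) for _ in range(world_size)]\n",
| + "dist.all_gather(gathered, local)\n",
| + "```\n",
| + "\n",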
|
394 | 394 | "**Place a disk in the rank 0 device and a square in the rank 1 device, then create a list of empty tensors and use it as target for the all gather operation.**\n",
|
395 | 395 | "\n",
|
|
448 | 448 | "The first thing we do is to spawn the two processes. \n",
|
449 | 449 | "In each, we begin by initializing the distributed processing environment.\n",
|
450 | 450 | "\n",
|
451 |
| - "Then, the datasets needs to be downloaded. Here, I assume that it has not been downloaded yet, and I only let the GPU in rank 0 perform this operation. This avoids having two processes writing in the same file. In order to have the other process wait for the first one to download, a **barrier** is used. The working principle is very simple, when a barrier is reached in the code, the process waits for all other processes to also reach that point in the code. Here we see how this can be a very useful construct in parallel computing, all processes require the dataset to be downloaded before proceding, so one of them starts the download, and all wait until it's done.\n", |
| 451 | + "Then, the datasets needs to be downloaded. Here, I assume that it has not been downloaded yet, and I only let the GPU in rank 0 perform this operation. This avoids having two processes writing in the same file. In order to have the other process wait for the first one to download, a **barrier** is used. The working principle is very simple, when a barrier is reached in the code, the process waits for all other processes to also reach that point in the code. Here we see how this can be a very useful construct in parallel computing, all processes require the dataset to be downloaded before proceeding, so one of them starts the download, and all wait until it's done.\n", |
452 | 452 | "\n",
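| + "In a sketch, with `download_dataset` standing in for whatever routine actually fetches the data, the pattern could look like this:\n",
| + "\n",
| + "```python\n",
| + "import torch.distributed as dist\n",
| + "\n",
| + "if dist.get_rank() == 0:\n",
| + "    download_dataset()  # hypothetical stand-in for the actual download routine\n",
| + "dist.barrier()  # every process waits here until all processes have arrived\n",
| + "# from this point on, all ranks can safely read the dataset from disk\n",
| + "```\n",
| + "\n",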
|
453 | 453 | "Then we initialize the weights, only in the rank 0 GPU, and **broadcast** them to all other GPUs. This broadcast operation is performed asynchronously, to allow for the rank 0 GPU to start loading images before the rank 1 has received the weights. This operation is akin to what DataParallel does, which is slowing the processing of the other GPUs down, waiting to receive the weights from the root GPU.\n",
|
454 | 454 | "\n",
|
455 | 455 | "<center width=\"100%\"><img style=\"margin:0 auto\" src=\"assets/example_broadcast.png\" /></center>\n",
|
456 | 456 | "\n",
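| + "A possible sketch of the asynchronous broadcast, with the weight shape chosen arbitrarily for illustration:\n",
| + "\n",
| + "```python\n",
| + "import torch\n",
| + "import torch.distributed as dist\n",
| + "\n",
| + "rank = dist.get_rank()\n",
| + "weights = torch.empty(10, 10, device=f'cuda:{rank}')\n",
| + "if rank == 0:\n",
| + "    torch.nn.init.normal_(weights)  # only rank 0 initializes the weights\n",
| + "\n",
| + "# async_op=True returns a handle immediately, so rank 0 can start loading images\n",
| + "work = dist.broadcast(weights, src=0, async_op=True)\n",
| + "# ... load the images from disk here ...\n",
| + "work.wait()  # make sure the weights have arrived before using them\n",
| + "```\n",
| + "\n",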
|
457 |
| - "Each GPU will then load the images from disk, perform a product to find the activations of the next layer and caculate a softmax to get class-belonging probabilities. \n", |
| 457 | + "Each GPU will then load the images from disk, perform a product to find the activations of the next layer and calculate a softmax to get class-belonging probabilities. \n", |
458 | 458 | "\n",
|
459 | 459 | "Finally, the loss is computed by summing over the dimensions and a **reduction** with sum is performed to compute the overall loss over the entire batch.\n",
|
460 | 460 | "\n",
|
|
576 | 576 | "\n",
|
577 | 577 | "We have seen how we can use these collectives to perform a calculation of the loss of a neural network, but the same can be extended to any type of parallelizable computation.\n",
|
578 | 578 | "\n",
|
579 |
| - "Finally, we saw how simple it is to set a PyTorch Lighnining training to use multiple GPUs.\n", |
| 579 | + "Finally, we saw how simple it is to set a PyTorch Lightning training to use multiple GPUs.\n", |
580 | 580 | "\n",
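| + "For reference, the change usually amounts to a couple of `Trainer` arguments; `MyModel` and `train_loader` below are placeholders, and the exact flag names can differ between Lightning versions:\n",
| + "\n",
| + "```python\n",
| + "import pytorch_lightning as pl\n",
| + "\n",
| + "model = MyModel()  # placeholder LightningModule\n",
| + "trainer = pl.Trainer(\n",
| + "    accelerator='gpu',\n",
| + "    devices=2,       # number of GPUs to use\n",
| + "    strategy='ddp',  # one process per GPU, gradients synchronized with all_reduce\n",
| + "    max_epochs=10,\n",
| + ")\n",
| + "trainer.fit(model, train_loader)  # placeholder DataLoader\n",
| + "```\n",
| + "\n",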
|
581 | 581 | "### References\n",
|
582 | 582 | "\n",
|
|