
Commit 0bb66a7

Updating to 2022 version
1 parent cde7e46 commit 0bb66a7

File tree

6 files changed: +1516, -1478 lines


dl2021_cpu.yml

-23
This file was deleted.

dl2021_gpu.yml

-23
This file was deleted.

dl2022_cpu.yml

+23
@@ -0,0 +1,23 @@
+name: dl2022
+channels:
+  - pytorch
+  - conda-forge
+  - defaults
+dependencies:
+  - python=3.10.6
+  - pip=22.2.2
+  - cpuonly=2.0
+  - pytorch=1.13.0
+  - torchvision=0.14.0
+  - torchaudio=0.13.0
+  - pip:
+      - pytorch-lightning==1.7.7
+      - tensorboard==2.10.1
+      - tabulate>=0.8.9
+      - tqdm>=4.62.3
+      - pillow>=8.0.1
+      - notebook>=6.4.5
+      - jupyterlab>=3.2.1
+      - matplotlib>=3.4.3
+      - seaborn>=0.11.2
+      - ipywidgets>=7.6.5
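
For reference, a minimal sketch of how this CPU environment file would typically be used; the file name `dl2022_cpu.yml` comes from this commit, but the exact activation command may differ depending on your conda setup:

```bash
# Sketch: create and activate the CPU environment defined above.
# Assumes conda is installed and dl2022_cpu.yml is in the current directory.
conda env create -f dl2022_cpu.yml     # creates an environment named "dl2022"
conda activate dl2022                  # or "source activate dl2022" on older setups
python -c "import torch; print(torch.__version__)"   # should report a 1.13.0 build
```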

dl2022_gpu.yml

+24
@@ -0,0 +1,24 @@
+name: dl2022
+channels:
+  - pytorch
+  - nvidia
+  - conda-forge
+  - defaults
+dependencies:
+  - python=3.10.6
+  - pip=22.2.2
+  - pytorch-cuda=11.7
+  - pytorch=1.13.0
+  - torchvision=0.14.0
+  - torchaudio=0.13.0
+  - pip:
+      - pytorch-lightning==1.7.7
+      - tensorboard==2.10.1
+      - tabulate>=0.8.9
+      - tqdm>=4.62.3
+      - pillow>=8.0.1
+      - notebook>=6.4.5
+      - jupyterlab>=3.2.1
+      - matplotlib>=3.4.3
+      - seaborn>=0.11.2
+      - ipywidgets>=7.6.5
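
Similarly, a short sketch (not part of the commit) for checking that the GPU environment works once it is installed; run the check on a node that actually has a GPU, since login nodes usually do not:

```bash
# Sketch: verify the GPU environment after "conda env create -f dl2022_gpu.yml".
source activate dl2022
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# Expected on a GPU node: a 1.13.0 build and "True"; on CPU-only nodes it prints "False".
```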

docs/tutorial_notebooks/tutorial1/Lisa_Cluster.ipynb

+18-14
@@ -88,14 +88,14 @@
 "module load Anaconda3/2021.05\n",
 "```\n",
 "\n",
-"The CUDA and cuDNN libraries are already taken care of by installing the cudatoolkits in conda.\n",
+"Note that there also exists a `2022` module with slightly newer package versions, but it is not functional at the moment. Hence, we stick with the `2021` module. The CUDA and cuDNN libraries are already taken care of by installing the cudatoolkits in conda.\n",
 "\n",
 "### Install the environment\n",
 "\n",
-"To run the Deep Learning assignments and other code like the notebooks on Lisa, you need to install the [provided environment for Lisa](https://github.com/uvadlc/uvadlc_practicals_2021) (`dl2021_gpu.yml`). You can either download it locally and copy it to your Lisa account via rsync or scp as described before, or simply clone the [practicals github](https://github.com/uvadlc/uvadlc_practicals_2021) on Lisa: \n",
+"To run the Deep Learning assignments and other code like the notebooks on Lisa, you need to install the [provided environment for Lisa](https://github.com/uvadlc/uvadlc_practicals_2022) (`dl2022_gpu.yml`). You can either download it locally and copy it to your Lisa account via rsync or scp as described before, or simply clone the [practicals github](https://github.com/uvadlc/uvadlc_practicals_2022) on Lisa: \n",
 "\n",
 "```bash\n",
-"git clone https://github.com/uvadlc/uvadlc_practicals_2021.git\n",
+"git clone https://github.com/uvadlc/uvadlc_practicals_2022.git\n",
 "```\n",
 "\n",
 "Lisa provides an Anaconda module, which you can load via `module load Anaconda3/2021.05` as mentioned before (remember to load the `2021` module beforehand). We recommend installing the package via a job file since the installation can take 20-30 minutes, and any command on the login node will be killed without warning after 15 minutes.\n",
@@ -118,8 +118,8 @@
 "module load 2021\n",
 "module load Anaconda3/2021.05\n",
 "\n",
-"cd $HOME/uvadlc_practicals_2021/\n",
-"conda env create -f dl2021_gpu.yml\n",
+"cd $HOME/uvadlc_practicals_2022/\n",
+"conda env create -f dl2022_gpu.yml\n",
 "```\n",
 "\n",
 "You can use e.g. `nano` to do that. If the environment file is not in the cloned repository or you haven't cloned the repo, change the cd statement to the directory where it is stored. Once the file is saved, start the job with the command `sbatch install_environment.job`. The installation process is started on a compute node with a time limit of 4 hours, which should be sufficiently long. Let's look at the next section to understand what we have actually done here with respect to 'job files'.\n",
@@ -129,17 +129,17 @@
 "If the installation via job file does not work, try to install the environment with the following command from the login node after navigating to the directory the environment file is in: \n",
 "\n",
 "```bash\n",
-"conda env create -f dl2021_gpu.yml\n",
+"conda env create -f dl2022_gpu.yml\n",
 "```\n",
-"Note that the jobs on the login node on Lisa are limited to 15 minutes. This is often not enough to install the full environment. If the installation command is killed, you can simply restart it. If you get the error that a package is corrupted, go to `/home/lcur___/.conda/pkgs/` under your home directory and remove the directory of the corrupted package. If you get the error that the environment dl2021 already exists, go to `/home/lcur___/.conda/envs/`, and remove the folder 'dl2021'.\n",
+"Note that the jobs on the login node on Lisa are limited to 15 minutes. This is often not enough to install the full environment. If the installation command is killed, you can simply restart it. If you get the error that a package is corrupted, go to `/home/lcur___/.conda/pkgs/` under your home directory and remove the directory of the corrupted package. If you get the error that the environment dl2022 already exists, go to `/home/lcur___/.conda/envs/`, and remove the folder 'dl2022'.\n",
 "\n",
 "If you experience issues with the Anaconda module, you can also install Anaconda yourself ([download link](https://docs.anaconda.com/anaconda/install/linux/)) or ask your TA for help.\n",
 "\n",
 "#### Verifying the installation\n",
 "\n",
-"When the installation process is completed, you can check if the process was successful by activating your environment on the login node via `source activate dl2021` (remember to have loaded the anaconda module beforehand), and starting a python console with executing `python`. It should say `Python 3.9.7 | packaged by conda-forge`. If you see a different python version, you might not have activated the environment correctly. \n",
+"When the installation process is completed, you can check if the process was successful by activating your environment on the login node via `source activate dl2022` (remember to have loaded the anaconda module beforehand), and starting a python console with executing `python`. It should say `Python 3.10.6 | packaged by conda-forge`. If you see a different python version, you might not have activated the environment correctly. \n",
 "\n",
-"In the python console, try to import pytorch via `import torch` and check the version: `torch.__version__`. It should say `1.10.0`. Finally, check whether PyTorch can access the GPU: `torch.cuda.is_available()`. Note that in most cases, this will return `False` because most login-nodes on Lisa do not have GPUs. You can login to a GPU node via `ssh lcur___@login-gpu.lisa.surfsara.nl`, and on this node, you should see that the command returns `True`. If that is the case, you should be all set."
+"In the python console, try to import pytorch via `import torch` and check the version: `torch.__version__`. It should say `1.13.0`. Finally, check whether PyTorch can access the GPU: `torch.cuda.is_available()`. Note that in most cases, this will return `False` because most login-nodes on Lisa do not have GPUs. You can login to a GPU node via `ssh lcur___@login-gpu.lisa.surfsara.nl`, and on this node, you should see that the command returns `True`. If that is the case, you should be all set."
 ]
 },
 {
@@ -173,7 +173,7 @@
 "# Your job starts in the directory where you call sbatch\n",
 "cd $HOME/...\n",
 "# Activate your environment\n",
-"source activate dl2021\n",
+"source activate dl2022\n",
 "# Run your code\n",
 "srun python -u ...\n",
 "```\n",
@@ -197,7 +197,7 @@
 "\n",
 "If you work with a lot of data, or a larger dataset, it is advised to copy your data to the `/scratch` directory of the node. Otherwise, the read/write operation might become a bottleneck of your job. To do this, simply use your copy operation of choice (`cp`, `rsync`, ...), and copy the data to the directory `$TMPDIR`. You should add this command to your job file before calling `srun ...`. Remember to point to this data when you are running your code. If you have a dataset that can be downloaded, you can also directly download it to the scratch (can sometimes be faster than copying). In case you also write something on the scratch, you need to copy it back to your home directory before finishing the job. \n",
 "\n",
-"**Edit Dec. 6, 2021**: Due to internal changes to the filesystem of Lisa, it is required to **use the scratch for any dataset** such as CIFAR10. In PyTorch, the CIFAR10 dataset is structured into multiple large batches (usually 5), and only that batch is loaded which is currently needed. This is why during training, it requires a lot of reading operations on the disk which can slow down your training and constitutes a challenge to the Lisa system when hundreds of students share the same filesystem. Hence, you have to use the scratch for such datasets. For most parts in the assignment, you can do this by specifying `--data_dir $TMPDIR` on the python command in your job file (check if the argument parser has this argument, otherwise you can add it yourself). This will download the dataset to the scratch and only load it from there."
+"**Edit Dec. 6, 2021**: Due to internal changes to the filesystem of Lisa, it is required to **use the scratch for any dataset** such as CIFAR10. In PyTorch, the CIFAR10 dataset is structured into multiple large batches (usually 5), and only that batch is loaded which is currently needed. This is why during training, it requires a lot of reading operations on the disk which can slow down your training and constitutes a challenge to the Lisa system when hundreds of students share the same filesystem. Hence, you have to use the scratch for such datasets. For most parts in the assignment, you can do this by specifying `--data_dir $TMPDIR` on the python command in your job file (check if the argument parser has this argument, otherwise you can add it yourself). This will download the dataset to the scratch and only load it from there. We recommend using this approach also for future course editions."
 ]
 },
 {
@@ -268,11 +268,15 @@
 "\n",
 "### PyTorch or other packages cannot be imported\n",
 "\n",
-"If you run a job and see the python error message in the slurm output file that a package is missing although you have installed it in the environment, there are two things to check. Firstly, make sure to not have the environment activated on the login node when submitting the job. This can lead to an error in the anaconda module such that packages are not found on the compute node. Secondly, check that you activate the environment correctly. To verify that the correct python version is used, you can add the command `which python` before your training file. This prints out the path of the python that will be used, in which you should see the anaconda version in the dl2021 environment.\n",
+"If you run a job and see the python error message in the slurm output file that a package is missing although you have installed it in the environment, there are two things to check. Firstly, make sure to not have the environment activated on the login node when submitting the job. This can lead to an error in the anaconda module such that packages are not found on the compute node. Secondly, check that you activate the environment correctly. To verify that the correct python version is used, you can add the command `which python` before your training file. This prints out the path of the python that will be used, in which you should see the anaconda version in the dl2022 environment.\n",
 "\n",
 "### My job runs very slow\n",
 "\n",
-"If your job executes your script much slower than you expect, check for two things: (1) have you requested a GPU and are you using it, and (2) are you using the scratch for your dataset? Not using the scratch for your dataset can create a significant communication bottleneck, especially if multiple students do it at the same time. Make sure to download or copy your dataset to the scratch and load it from there."
+"If your job executes your script much slower than you expect, check for two things: (1) have you requested a GPU and are you using it, and (2) are you using the scratch for your dataset? Not using the scratch for your dataset can create a significant communication bottleneck, especially if multiple students do it at the same time. Make sure to download or copy your dataset to the scratch and load it from there.\n",
+"\n",
+"### I am not able to use Lisa at all\n",
+"\n",
+"If there are major issues with Lisa during the course (e.g. the cluster goes into maintenance for a long time, or there are problems with the filesystem), you can make use of Google Colab, as we already do for all notebook tutorials here. The assignments do not necessarily require large amounts of compute and can often be trained comfortably on a GPU provided by Google Colab. For an introduction to Google Colab, see this [tutorial](https://colab.research.google.com/)."
 ]
 },
 {
@@ -409,7 +413,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.8.2"
+"version": "3.10.6"
 }
 },
 "nbformat": 4,

0 commit comments
