1 | | -# CUDA Course |
2 | | - |
3 | | -> Note: This course is designed for Ubuntu Linux. Windows users can use Windows Subsystem for Linux or Docker containers to simulate the Ubuntu Linux environment. |
4 | | - |
5 | | -## Table of Contents |
6 | | - |
7 | | -1. [The Deep Learning Ecosystem](01%20Deep%20Learning%20Ecosystem/README.md) |
8 | | -2. [Setup/Installation](02%20Setup/README.md) |
9 | | -3. [C C++ Python R and Mojo review](03%20C%20C%2B%2B%20Python%20R%20and%20Mojo%20review) |
10 | | -4. [Gentle Intro to GPUs](04_Gentle_Intro_to_GPUs/README.md) |
11 | | -5. [Writing Your First Kernels](05_Writing_your_First_Kernels/README.md) |
12 | | -6. [CUDA APIs (cuBLAS, cuDNN, etc)](06_CUDA_APIs/README.md) |
13 | | -7. [Optimizing Matrix Multiplication](07_Faster_Matmul/README.md) |
14 | | -8. [Triton](08_Triton/README.md) |
15 | | -9. [PyTorch Extensions (CUDA)](08_PyTorch_Extensions/README.md) |
16 | | -10. [Final Project](09_Final_Project/README.md) |
17 | | -11. [Extras](10_Extras/README.md) |
18 | | - |
19 | | -## Course Philosophy |
20 | | - |
21 | | -This course aims to: |
22 | | - |
23 | | -- Lower the barrier to entry for HPC jobs |
24 | | -- Provide a foundation for understanding projects like Karpathy's [llm.c](https://github.com/karpathy/llm.c) |
25 | | -- Consolidate scattered CUDA programming resources into a comprehensive, organized course |
26 | | - |
27 | | -## Overview |
28 | | - |
29 | | -- Focus on GPU kernel optimization for performance improvement |
30 | | -- Cover CUDA, PyTorch, and Triton |
31 | | -- Emphasis on technical details of writing faster kernels |
32 | | -- Tailored for NVIDIA GPUs |
33 | | -- Culminates in a simple MLP MNIST project in CUDA |
34 | | - |
35 | | -## Prerequisites |
36 | | - |
37 | | -- Python programming (required) |
38 | | -- Basic differentiation and vector calculus for backprop (recommended) |
39 | | -- Linear algebra fundamentals (recommended) |
40 | | - |
41 | | -## Key Takeaways |
42 | | - |
43 | | -- Optimizing existing implementations |
44 | | -- Building CUDA kernels for cutting-edge research |
45 | | -- Understanding GPU performance bottlenecks, especially memory bandwidth |
46 | | - |
47 | | -## Hardware Requirements |
48 | | - |
49 | | -- Any NVIDIA GTX, RTX, or datacenter-level GPU |
50 | | -- Cloud GPU options available for those without local hardware |
51 | | - |
52 | | -## Use Cases for CUDA/GPU Programming |
53 | | - |
54 | | -- Deep Learning (primary focus of this course) |
55 | | -- Graphics and Ray-tracing |
| 1 | +# 🚀 Advanced CUDA Programming & GPU Architecture |
| 2 | + |
| 3 | +> *Unlocking the Power of Parallel Computing* |
| 4 | + |
| 5 | +## 🎯 Course Mission |
| 6 | +Transform complex GPU programming concepts into practical skills for high-performance computing professionals. Master CUDA programming through hands-on projects and real-world applications. |
| 7 | + |
| 8 | +## 🛠️ Core Technologies |
| 9 | +- **CUDA** - NVIDIA's parallel computing platform |
| 10 | +- **PyTorch** - Deep learning framework with CUDA support |
| 11 | +- **Triton** - Open-source GPU programming language |
| 12 | +- **cuBLAS & cuDNN** - GPU-accelerated libraries |
| 13 | + |
| 14 | +## 📚 Curriculum Roadmap |
| 15 | + |
| 16 | +### Phase 1: Foundations |
| 17 | +#### 1. Deep Learning Ecosystem Deep Dive |
| 18 | +- Modern GPU Architecture Overview |
| 19 | +- Memory Hierarchy & Data Flow |
| 20 | +- CUDA in the ML Stack |
| 21 | +- Hardware Accelerator Landscape (GPU vs TPU vs DPU) |
| 22 | + |
| 23 | +#### 2. Development Environment Setup |
| 24 | +- 🐧 Linux Environment Configuration |
| 25 | +- 🐋 Docker Containerization |
| 26 | +- 🔧 CUDA Toolkit Installation |
| 27 | +- 📊 Monitoring & Profiling Tools |
| 28 | + |
| 29 | +#### 3. Programming Language Mastery |
| 30 | +- C/C++ Advanced Concepts |
| 31 | +- Python High-Performance Computing |
| 32 | +- Mojo Language Introduction |
| 33 | +- R for GPU Computing |
| 34 | + |
| 35 | +### Phase 2: Core CUDA Concepts |
| 36 | +#### 4. GPU Architecture & Computing |
| 37 | +- SM Architecture Deep Dive |
| 38 | +- Memory Coalescing |
| 39 | +- Warp Execution Model |
| 40 | +- Shared Memory & L1/L2 Cache |
| 41 | + |
| 42 | +#### 5. CUDA Kernel Development |
| 43 | +- Thread Hierarchy |
| 44 | +- Memory Management |
| 45 | +- Synchronization Primitives |
| 46 | +- Error Handling & Debugging |
| 47 | + |
| 48 | +#### 6. Advanced CUDA APIs |
| 49 | +- cuBLAS Optimization |
| 50 | +- cuDNN for Deep Learning |
| 51 | +- Thrust Library |
| 52 | +- NCCL for Multi-GPU |
| 53 | + |
| 54 | +### Phase 3: Optimization & Performance |
| 55 | +#### 7. Matrix Operations Optimization |
| 56 | +- Tiled Matrix Multiplication |
| 57 | +- Memory Access Patterns |
| 58 | +- Bank Conflicts Resolution |
| 59 | +- Warp-Level Primitives |
| 60 | + |
| 61 | +#### 8. Modern GPU Programming |
| 62 | +- Triton Programming Model |
| 63 | +- Automatic Kernel Tuning |
| 64 | +- Memory Access Optimization |
| 65 | +- Performance Comparison with CUDA |
| 66 | + |
| 67 | +#### 9. PyTorch CUDA Extensions |
| 68 | +- Custom CUDA Kernels |
| 69 | +- C++/CUDA Extension Development |
| 70 | +- JIT Compilation |
| 71 | +- Performance Profiling |
| 72 | + |
| 73 | +### Phase 4: Applied Projects |
| 74 | +#### 10. Capstone Project |
| 75 | +- MNIST MLP Implementation |
| 76 | +- Custom CUDA Kernels |
| 77 | +- Performance Optimization |
| 78 | +- Multi-GPU Scaling |
| 79 | + |
| 80 | +#### 11. Advanced Topics |
| 81 | +- Ray Tracing |
56 | 82 | - Fluid Simulation |
57 | | -- Video Editing |
58 | | -- Crypto Mining |
59 | | -- 3D modeling |
60 | | -- Anything that requires parallel processing with large arrays |
61 | | - |
62 | | -## Resources |
63 | | - |
64 | | -- GitHub repo (this repository) |
65 | | -- Stack Overflow |
66 | | -- NVIDIA Developer Forums |
67 | | -- NVIDIA and PyTorch documentation |
68 | | -- LLMs for navigating the space |
69 | | - |
70 | | -## Other Learning Material |
71 | | - |
72 | | -- https://github.com/CoffeeBeforeArch/cuda_programming |
73 | | -- https://www.youtube.com/@CUDAMODE |
74 | | -- https://discord.com/invite/cudamode |
75 | | - |
76 | | -## Fun YouTube Videos: |
77 | | -- [How do GPUs works? Exploring GPU Architecture](https://www.youtube.com/watch?v=h9Z4oGN89MU) |
78 | | -- [But how do GPUs actually work?](https://www.youtube.com/watch?v=58jtf24uijw&ab_channel=Graphicode) |
79 | | -- [Getting Started With CUDA for Python Programmers](https://www.youtube.com/watch?v=nOxKexn3iBo&ab_channel=JeremyHoward) |
80 | | -- [Transformers Explained From The Atom Up](https://www.youtube.com/watch?v=7lJZHbg0EQ4&ab_channel=JacobRintamaki) |
81 | | -- [How CUDA Programming Works - Stephen Jones, CUDA Architect, NVIDIA](https://www.youtube.com/watch?v=QQceTDjA4f4&ab_channel=ChristopherHollinworth) |
82 | | -- [Parallel Computing with Nvidia CUDA - NeuralNine](https://www.youtube.com/watch?v=zSCdTOKrnII&ab_channel=NeuralNine) |
83 | | -- [CPU vs GPU vs TPU vs DPU vs QPU](https://www.youtube.com/watch?v=r5NQecwZs1A&ab_channel=Fireship) |
84 | | -- [Nvidia CUDA in 100 Seconds](https://www.youtube.com/watch?v=pPStdjuYzSI&ab_channel=Fireship) |
85 | | -- [How AI Discovered a Faster Matrix Multiplication Algorithm](https://www.youtube.com/watch?v=fDAPJ7rvcUw&t=1s&ab_channel=QuantaMagazine) |
86 | | -- [The fastest matrix multiplication algorithm](https://www.youtube.com/watch?v=sZxjuT1kUd0&ab_channel=Dr.TreforBazett) |
87 | | -- [From Scratch: Cache Tiled Matrix Multiplication in CUDA](https://www.youtube.com/watch?v=ga2ML1uGr5o&ab_channel=CoffeeBeforeArch) |
88 | | -- [From Scratch: Matrix Multiplication in CUDA](https://www.youtube.com/watch?v=DpEgZe2bbU0&ab_channel=CoffeeBeforeArch) |
89 | | -- [Intro to GPU Programming](https://www.youtube.com/watch?v=G-EimI4q-TQ&ab_channel=TomNurkkala) |
90 | | -- [CUDA Programming](https://www.youtube.com/watch?v=xwbD6fL5qC8&ab_channel=TomNurkkala) |
91 | | -- [Intro to CUDA (part 1): High Level Concepts](https://www.youtube.com/watch?v=4APkMJdiudU&ab_channel=JoshHolloway) |
92 | | -- [Intro to GPU Hardware](https://www.youtube.com/watch?v=kUqkOAU84bA&ab_channel=TomNurkkala) |
| 83 | +- Cryptographic Applications |
| 84 | +- Scientific Computing |
| 85 | + |
| 86 | +## 🎓 Learning Outcomes |
| 87 | +By the end of this course, you will be able to: |
| 88 | +- Design and implement efficient CUDA kernels |
| 89 | +- Optimize GPU memory usage and access patterns |
| 90 | +- Develop custom PyTorch extensions |
| 91 | +- Profile and debug GPU applications |
| 92 | +- Deploy multi-GPU solutions |
| 93 | + |
| 94 | +## 🔍 Prerequisites |
| 95 | +### Required: |
| 96 | +- Strong Python programming skills |
| 97 | +- Basic understanding of C/C++ |
| 98 | +- Computer architecture fundamentals |
| 99 | + |
| 100 | +### Recommended: |
| 101 | +- Linear algebra basics |
| 102 | +- Calculus (for backpropagation) |
| 103 | +- Basic ML/DL concepts |
| 104 | + |
| 105 | +## 💻 Hardware Requirements |
| 106 | +### Minimum: |
| 107 | +- NVIDIA GTX 1660 or better |
| 108 | +- 16GB RAM |
| 109 | +- 50GB free storage |
| 110 | + |
| 111 | +### Recommended: |
| 112 | +- NVIDIA RTX 3070 or better |
| 113 | +- 32GB RAM |
| 114 | +- 100GB SSD storage |
| 115 | + |
| 116 | +## 📚 Learning Resources |
| 117 | + |
| 118 | +### Official Documentation |
| 119 | +- [NVIDIA CUDA Documentation](https://docs.nvidia.com/cuda/) |
| 120 | +- [PyTorch CUDA Documentation](https://pytorch.org/docs/stable/cuda.html) |
| 121 | +- [Triton Documentation](https://triton-lang.org/) |
| 122 | + |
| 123 | +### Community Resources |
| 124 | +- 💬 NVIDIA Developer Forums |
| 125 | +- 🤝 Stack Overflow CUDA tag |
| 126 | +- 🎮 Discord: CUDAMODE community |
| 127 | + |
| 128 | +### Video Learning |
| 129 | +#### Fundamentals |
| 130 | +- 🎥 [GPU Architecture Deep Dive](https://www.youtube.com/watch?v=h9Z4oGN89MU) |
| 131 | +- 🎥 [CUDA Programming Essentials](https://www.youtube.com/watch?v=QQceTDjA4f4) |
| 132 | + |
| 133 | +#### Advanced Topics |
| 134 | +- 🎥 [Matrix Multiplication Optimization](https://www.youtube.com/watch?v=DpEgZe2bbU0) |
| 135 | +- 🎥 [Intro to CUDA (part 1): High Level Concepts](https://www.youtube.com/watch?v=4APkMJdiudU) |
| 136 | + |
| 137 | +## 🌟 Course Philosophy |
| 138 | +We believe in: |
| 139 | +- Hands-on learning through practical projects |
| 140 | +- Understanding fundamentals before optimization |
| 141 | +- Building real-world applicable skills |
| 142 | +- Community-driven knowledge sharing |
| 143 | + |
| 144 | +## 📈 Industry Applications |
| 145 | +- 🤖 Deep Learning & AI |
| 146 | +- 🎮 Graphics & Gaming |
| 147 | +- 🌊 Scientific Simulation |
| 148 | +- 📊 Data Analytics |
| 149 | +- 🔐 Cryptography |
| 150 | +- 🎬 Media Processing |