Commit fe3eb63

Update README.md
1 parent 56688f0 commit fe3eb63

1 file changed: +149 -91 lines changed

README.md

Lines changed: 149 additions & 91 deletions
@@ -1,92 +1,150 @@
1-
# CUDA Course
2-
3-
> Note: This course is designed for Ubuntu Linux. Windows users can use Windows Subsystem for Linux or Docker containers to simulate the ubuntu Linux environment.
4-
5-
## Table of Contents
6-
7-
1. [The Deep Learning Ecosystem](01%20Deep%20Learning%20Ecosystem/README.md)
8-
2. [Setup/Installation](02%20Setup/README.md)
9-
3. [C C++ Python R and Mojo review](03%20C%20C%2B%2B%20Python%20R%20and%20Mojo%20review)
10-
4. [Gentle Intro to GPUs](04_Gentle_Intro_to_GPUs/README.md)
11-
5. [Writing Your First Kernels](05_Writing_your_First_Kernels/README.md)
12-
6. [CUDA APIs (cuBLAS, cuDNN, etc)](06_CUDA_APIs/README.md)
13-
7. [Optimizing Matrix Multiplication](07_Faster_Matmul/README.md)
14-
8. [Triton](08_Triton/README.md)
15-
9. [PyTorch Extensions (CUDA)](08_PyTorch_Extensions/README.md)
16-
10. [Final Project](09_Final_Project/README.md)
17-
11. [Extras](10_Extras/README.md)
18-
19-
## Course Philosophy
20-
21-
This course aims to:
22-
23-
- Lower the barrier to entry for HPC jobs
24-
- Provide a foundation for understanding projects like Karpathy's [llm.c](https://github.com/karpathy/llm.c)
25-
- Consolidate scattered CUDA programming resources into a comprehensive, organized course
26-
27-
## Overview
28-
29-
- Focus on GPU kernel optimization for performance improvement
30-
- Cover CUDA, PyTorch, and Triton
31-
- Emphasis on technical details of writing faster kernels
32-
- Tailored for NVIDIA GPUs
33-
- Culminates in a simple MLP MNIST project in CUDA
34-
35-
## Prerequisites
36-
37-
- Python programming (required)
38-
- Basic differentiation and vector calculus for backprop (recommended)
39-
- Linear algebra fundamentals (recommended)
40-
41-
## Key Takeaways
42-
43-
- Optimizing existing implementations
44-
- Building CUDA kernels for cutting-edge research
45-
- Understanding GPU performance bottlenecks, especially memory bandwidth
46-
47-
## Hardware Requirements
48-
49-
- Any NVIDIA GTX, RTX, or datacenter level GPU
50-
- Cloud GPU options available for those without local hardware
51-
52-
## Use Cases for CUDA/GPU Programming
53-
54-
- Deep Learning (primary focus of this course)
55-
- Graphics and Ray-tracing
1+
# 🚀 Advanced CUDA Programming & GPU Architecture
2+
3+
> *Unlocking the Power of Parallel Computing*
4+
5+
## 🎯 Course Mission
6+
Transform complex GPU programming concepts into practical skills for high-performance computing professionals. Master CUDA programming through hands-on projects and real-world applications.
7+
8+
## 🛠️ Core Technologies
9+
- **CUDA** - NVIDIA's parallel computing platform
10+
- **PyTorch** - Deep learning framework with CUDA support
11+
- **Triton** - Open-source GPU programming language
12+
- **cuBLAS & cuDNN** - GPU-accelerated libraries
13+
14+
## 📚 Curriculum Roadmap
15+
16+
### Phase 1: Foundations
17+
#### 1. Deep Learning Ecosystem Deep Dive
18+
- Modern GPU Architecture Overview
19+
- Memory Hierarchy & Data Flow
20+
- CUDA in the ML Stack
21+
- Hardware Accelerator Landscape (GPU vs TPU vs DPU)
22+
23+
#### 2. Development Environment Setup
24+
- 🐧 Linux Environment Configuration
25+
- 🐋 Docker Containerization
26+
- 🔧 CUDA Toolkit Installation
27+
- 📊 Monitoring & Profiling Tools
28+
29+
#### 3. Programming Language Mastery
30+
- C/C++ Advanced Concepts
31+
- Python High-Performance Computing
32+
- Mojo Language Introduction
33+
- R for GPU Computing
34+
35+
### Phase 2: Core CUDA Concepts
36+
#### 4. GPU Architecture & Computing
37+
- SM Architecture Deep Dive
38+
- Memory Coalescing
39+
- Warp Execution Model
40+
- Shared Memory & L1/L2 Cache
41+
42+
#### 5. CUDA Kernel Development
43+
- Thread Hierarchy
44+
- Memory Management
45+
- Synchronization Primitives
46+
- Error Handling & Debugging
47+
48+
#### 6. Advanced CUDA APIs
49+
- cuBLAS Optimization
50+
- cuDNN for Deep Learning
51+
- Thrust Library
52+
- NCCL for Multi-GPU
53+
54+
### Phase 3: Optimization & Performance
55+
#### 7. Matrix Operations Optimization
56+
- Tiled Matrix Multiplication
57+
- Memory Access Patterns
58+
- Bank Conflicts Resolution
59+
- Warp-Level Primitives
60+
61+
#### 8. Modern GPU Programming
62+
- Triton Programming Model
63+
- Automatic Kernel Tuning
64+
- Memory Access Optimization
65+
- Performance Comparison with CUDA
66+
67+
#### 9. PyTorch CUDA Extensions
68+
- Custom CUDA Kernels
69+
- C++/CUDA Extension Development
70+
- JIT Compilation
71+
- Performance Profiling
72+
73+
### Phase 4: Applied Projects
74+
#### 10. Capstone Project
75+
- MNIST MLP Implementation
76+
- Custom CUDA Kernels
77+
- Performance Optimization
78+
- Multi-GPU Scaling
79+
80+
#### 11. Advanced Topics
81+
- Ray Tracing
56 82
- Fluid Simulation
57-
- Video Editing
58-
- Crypto Mining
59-
- 3D modeling
60-
- Anything that requires parallel processing with large arrays
61-
62-
## Resources
63-
64-
- GitHub repo (this repository)
65-
- Stack Overflow
66-
- NVIDIA Developer Forums
67-
- NVIDIA and PyTorch documentation
68-
- LLMs for navigating the space
69-
70-
## Other Learning Material
71-
72-
- https://github.com/CoffeeBeforeArch/cuda_programming
73-
- https://www.youtube.com/@CUDAMODE
74-
- https://discord.com/invite/cudamode
75-
76-
## Fun YouTube Videos:
77-
- [How do GPUs works? Exploring GPU Architecture](https://www.youtube.com/watch?v=h9Z4oGN89MU)
78-
- [But how do GPUs actually work?](https://www.youtube.com/watch?v=58jtf24uijw&ab_channel=Graphicode)
79-
- [Getting Started With CUDA for Python Programmers](https://www.youtube.com/watch?v=nOxKexn3iBo&ab_channel=JeremyHoward)
80-
- [Transformers Explained From The Atom Up](https://www.youtube.com/watch?v=7lJZHbg0EQ4&ab_channel=JacobRintamaki)
81-
- [How CUDA Programming Works - Stephen Jones, CUDA Architect, NVIDIA](https://www.youtube.com/watch?v=QQceTDjA4f4&ab_channel=ChristopherHollinworth)
82-
- [Parallel Computing with Nvidia CUDA - NeuralNine](https://www.youtube.com/watch?v=zSCdTOKrnII&ab_channel=NeuralNine)
83-
- [CPU vs GPU vs TPU vs DPU vs QPU](https://www.youtube.com/watch?v=r5NQecwZs1A&ab_channel=Fireship)
84-
- [Nvidia CUDA in 100 Seconds](https://www.youtube.com/watch?v=pPStdjuYzSI&ab_channel=Fireship)
85-
- [How AI Discovered a Faster Matrix Multiplication Algorithm](https://www.youtube.com/watch?v=fDAPJ7rvcUw&t=1s&ab_channel=QuantaMagazine)
86-
- [The fastest matrix multiplication algorithm](https://www.youtube.com/watch?v=sZxjuT1kUd0&ab_channel=Dr.TreforBazett)
87-
- [From Scratch: Cache Tiled Matrix Multiplication in CUDA](https://www.youtube.com/watch?v=ga2ML1uGr5o&ab_channel=CoffeeBeforeArch)
88-
- [From Scratch: Matrix Multiplication in CUDA](https://www.youtube.com/watch?v=DpEgZe2bbU0&ab_channel=CoffeeBeforeArch)
89-
- [Intro to GPU Programming](https://www.youtube.com/watch?v=G-EimI4q-TQ&ab_channel=TomNurkkala)
90-
- [CUDA Programming](https://www.youtube.com/watch?v=xwbD6fL5qC8&ab_channel=TomNurkkala)
91-
- [Intro to CUDA (part 1): High Level Concepts](https://www.youtube.com/watch?v=4APkMJdiudU&ab_channel=JoshHolloway)
92-
- [Intro to GPU Hardware](https://www.youtube.com/watch?v=kUqkOAU84bA&ab_channel=TomNurkkala)
83+
- Cryptographic Applications
84+
- Scientific Computing
85+
86+
## 🎓 Learning Outcomes
87+
By the end of this course, you will be able to:
88+
- Design and implement efficient CUDA kernels
89+
- Optimize GPU memory usage and access patterns
90+
- Develop custom PyTorch extensions
91+
- Profile and debug GPU applications
92+
- Deploy multi-GPU solutions
93+
94+
## 🔍 Prerequisites
95+
### Required:
96+
- Strong Python programming skills
97+
- Basic understanding of C/C++
98+
- Computer architecture fundamentals
99+
100+
### Recommended:
101+
- Linear algebra basics
102+
- Calculus (for backpropagation)
103+
- Basic ML/DL concepts
104+
105+
## 💻 Hardware Requirements
106+
### Minimum:
107+
- NVIDIA GTX 1660 or better
108+
- 16GB RAM
109+
- 50GB free storage
110+
111+
### Recommended:
112+
- NVIDIA RTX 3070 or better
113+
- 32GB RAM
114+
- 100GB SSD storage
115+
116+
## 📚 Learning Resources
117+
118+
### Official Documentation
119+
- [NVIDIA CUDA Documentation](https://docs.nvidia.com/cuda/)
120+
- [PyTorch CUDA Documentation](https://pytorch.org/docs/stable/cuda.html)
121+
- [Triton Documentation](https://triton-lang.org/)
122+
123+
### Community Resources
124+
- 💬 NVIDIA Developer Forums
125+
- 🤝 Stack Overflow CUDA tag
126+
- 🎮 Discord: CUDAMODE community
127+
128+
### Video Learning
129+
#### Fundamentals
130+
- 🎥 [GPU Architecture Deep Dive](https://www.youtube.com/watch?v=h9Z4oGN89MU)
131+
- 🎥 [CUDA Programming Essentials](https://www.youtube.com/watch?v=QQceTDjA4f4)
132+
133+
#### Advanced Topics
134+
- 🎥 [Matrix Multiplication Optimization](https://www.youtube.com/watch?v=DpEgZe2bbU0)
135+
- 🎥 [Multi-GPU Programming](https://www.youtube.com/watch?v=4APkMJdiudU)
136+
137+
## 🌟 Course Philosophy
138+
We believe in:
139+
- Hands-on learning through practical projects
140+
- Understanding fundamentals before optimization
141+
- Building real-world applicable skills
142+
- Community-driven knowledge sharing
143+
144+
## 📈 Industry Applications
145+
- 🤖 Deep Learning & AI
146+
- 🎮 Graphics & Gaming
147+
- 🌊 Scientific Simulation
148+
- 📊 Data Analytics
149+
- 🔐 Cryptography
150+
- 🎬 Media Processing
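
The new README lists "Tiled Matrix Multiplication" under Phase 3 without showing what tiling means. As a rough illustration only, here is a minimal pure-Python sketch of the idea: the output is computed one tile at a time, so each small block of the inputs is reused while it is "hot". The function names `matmul_tiled` and `matmul_naive` and the `tile` parameter are hypothetical and do not come from the course repository. A real CUDA kernel would stage each tile in shared memory and run the inner loops across a thread block, which this sketch does not attempt.

```python
def matmul_naive(a, b):
    """Reference result: plain triple-loop product of a (n x k) and b (k x m)."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_tiled(a, b, tile=2):
    """Same product, computed tile by tile (hypothetical helper for illustration)."""
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    c = [[0.0] * m for _ in range(n)]
    # Outer loops walk over tiles of the output (i0, j0) and of the shared dimension (p0).
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Inner loops touch only one tile of a and one tile of b.
                # On a GPU these two tiles are what a block would keep in shared memory.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(matmul_tiled(a, b, tile=2))
```

The `min(...)` bounds handle matrices whose sizes are not multiples of the tile size, which is the same edge case a CUDA version has to guard against with bounds checks.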
