-# Writing your first CUDA Kernels
+# 🚀 CUDA Programming Guide: From Basics to Advanced

-> Everything starts here -> https://docs.nvidia.com/cuda/
-> We mainly focus on the CUDA C programming guide -> https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
-> Consider following along here -> https://developer.nvidia.com/blog/even-easier-introduction-cuda/
+## 📚 Table of Contents

-- It's generally a good idea to write a kernel's logic on the CPU first (easy to write), then on the GPU, to ensure your logic lines up at the level of blocks and threads. Take some input x, feed it through both the CPU function and the GPU kernel, and check whether the outputs are the same; this tells you if your GPU code is working as expected.
+- [Introduction](#introduction)
+- [Prerequisites](#prerequisites)
+- [Core Concepts](#core-concepts)
+- [Getting Started](#getting-started)
+- [Memory Hierarchy](#memory-hierarchy)
+- [Best Practices](#best-practices)
+- [Advanced Topics](#advanced-topics)

-- Practice vector addition and matrix multiplication by hand
-- Understand the concept of threads, blocks, and grids
+## 🎯 Introduction

-## To compile & run our vec add kernel:
+CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model. This guide will help you understand and implement CUDA kernels efficiently.

-```bash
-nvcc -o 01 01_vector_addition.cu
-./01
+### Key Resources
+- [CUDA Documentation](https://docs.nvidia.com/cuda/)
+- [CUDA C Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)
+- [Hands-on Introduction](https://developer.nvidia.com/blog/even-easier-introduction-cuda/)
+
+## 💻 Prerequisites
+
+- NVIDIA GPU (Compute Capability 3.0+)
+- CUDA Toolkit installed
+- Basic C/C++ knowledge
+- Understanding of parallel computing concepts
+
+## 🔍 Core Concepts
+
+### Thread Hierarchy
+```
+Grid
+└── Blocks
+    └── Threads
```
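+
+As a rough sketch of how this hierarchy shows up in code (the grid and block sizes below are arbitrary choices for illustration, not from the original notes), a launch specifies the grid and block dimensions, and each thread locates itself with the built-in `gridDim`, `blockDim`, `blockIdx`, and `threadIdx` variables:
+
+```cpp
+#include <cstdio>
+
+// Each thread reports where it sits in the grid -> block -> thread hierarchy.
+__global__ void whoAmI() {
+    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
+    printf("block %d, thread %d -> global index %d\n",
+           blockIdx.x, threadIdx.x, globalIdx);
+}
+
+int main() {
+    dim3 grid(2);   // 2 blocks in the grid (x-dimension only)
+    dim3 block(4);  // 4 threads per block
+    whoAmI<<<grid, block>>>();
+    cudaDeviceSynchronize();  // wait for the kernel (and its printf output) to finish
+    return 0;
+}
+```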

-(add small explanations and diagrams from assets folder)
+## 🚀 Getting Started
+
+### First CUDA Program: Vector Addition

-## Hardware Mapping
+```cpp
+__global__ void vectorAdd(float *a, float *b, float *c, int n) {
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if (idx < n) {
+        c[idx] = a[idx] + b[idx];
+    }
+}
+```
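+
+The kernel above only describes the device-side work. A minimal host-side sketch (sizes and values chosen for illustration; it assumes the `vectorAdd` kernel above sits in the same `.cu` file) allocates device memory, copies the inputs over, launches enough blocks to cover `n` elements, and copies the result back:
+
+```cpp
+#include <cstdio>
+#include <cstdlib>
+#include <cuda_runtime.h>
+
+int main() {
+    const int n = 1 << 20;
+    const size_t bytes = n * sizeof(float);
+
+    // Host buffers with recognizable input data.
+    float *h_a = (float *)malloc(bytes);
+    float *h_b = (float *)malloc(bytes);
+    float *h_c = (float *)malloc(bytes);
+    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }
+
+    // Device buffers and host -> device copies.
+    float *d_a, *d_b, *d_c;
+    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
+    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
+    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
+
+    // Launch enough 256-thread blocks to cover all n elements.
+    const int blockSize = 256;
+    const int gridSize = (n + blockSize - 1) / blockSize;
+    vectorAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
+
+    // Copy the result back and spot-check it.
+    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
+    printf("c[0] = %.1f (expected 3.0)\n", h_c[0]);
+
+    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
+    free(h_a); free(h_b); free(h_c);
+    return 0;
+}
+```
+
+Checking a few output values against a plain CPU loop, as the original notes suggest, is a quick way to confirm that the kernel and its launch configuration are correct.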

-- CUDA cores handle threads
-- Streaming Multiprocessors (SMs) handle blocks (typically multiple blocks per SM depending on resources required)
-- Grids are mapped to the entire GPU since they are the highest level of the hierarchy
+### Compilation & Execution
+```bash
+nvcc -o vector_add vector_add.cu
+./vector_add
+```

-## Memory Model
+## 🧠 Memory Hierarchy
+
+| Memory Type | Scope | Lifetime | Speed |
+|------------|--------|----------|--------|
+| Registers | Thread | Thread | Fastest |
+| Shared Memory | Block | Block | Very Fast |
+| Global Memory | Grid | Application | Slow |
+| Constant Memory | Grid | Application | Fast (cached) |
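+
+To make the "Shared Memory / Block" row concrete, here is a small illustrative kernel (my own sketch, not part of the original notes): the threads of one block stage data in `__shared__` memory, synchronize, and then read values written by other threads of the same block.
+
+```cpp
+#include <cstdio>
+
+#define BLOCK_SIZE 64
+
+// Reverses BLOCK_SIZE elements using memory shared by all threads of the block.
+__global__ void reverseInBlock(int *d) {
+    __shared__ int s[BLOCK_SIZE];   // one tile per block; lives as long as the block
+    int t = threadIdx.x;
+    s[t] = d[t];                    // each thread stages one element
+    __syncthreads();                // wait until the whole tile has been written
+    d[t] = s[BLOCK_SIZE - 1 - t];   // read an element that another thread wrote
+}
+
+int main() {
+    int h[BLOCK_SIZE], *d;
+    for (int i = 0; i < BLOCK_SIZE; i++) h[i] = i;
+    cudaMalloc(&d, sizeof(h));
+    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
+    reverseInBlock<<<1, BLOCK_SIZE>>>(d);
+    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
+    printf("h[0] = %d (expected %d)\n", h[0], BLOCK_SIZE - 1);
+    cudaFree(d);
+    return 0;
+}
+```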
+
+## 🎯 Best Practices
+
+1. **Memory Coalescing**
+   - Align memory accesses
+   - Use appropriate data types
+
+2. **Occupancy Optimization**
+   - Balance resource usage
+   - Optimize block sizes
+
+3. **Warp Efficiency**
+   - Minimize divergent branching
+   - Utilize warp-level primitives (see the sketch below)
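+
+As one illustration of a warp-level primitive (a sketch of mine, not from the original notes), the kernel below sums 32 values within a single warp using `__shfl_down_sync` (available since CUDA 9), with no shared memory and no divergent branches:
+
+```cpp
+#include <cstdio>
+
+// Sums one value per lane across a 32-thread warp with warp shuffles.
+__global__ void warpSum(const int *in, int *out) {
+    int val = in[threadIdx.x];
+    // Each step folds the upper half of the warp onto the lower half.
+    for (int offset = 16; offset > 0; offset /= 2) {
+        val += __shfl_down_sync(0xffffffff, val, offset);
+    }
+    if (threadIdx.x == 0) *out = val;  // lane 0 ends up holding the warp's total
+}
+
+int main() {
+    int h_in[32], h_out = 0, *d_in, *d_out;
+    for (int i = 0; i < 32; i++) h_in[i] = 1;
+    cudaMalloc(&d_in, sizeof(h_in));
+    cudaMalloc(&d_out, sizeof(int));
+    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
+    warpSum<<<1, 32>>>(d_in, d_out);
+    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
+    printf("warp sum = %d (expected 32)\n", h_out);
+    cudaFree(d_in); cudaFree(d_out);
+    return 0;
+}
+```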
+
+## 🔬 Advanced Topics
+
+### Matrix Operations
+```cpp
+__global__ void matrixMul(float *A, float *B, float *C, int N) {
+    int row = blockIdx.y * blockDim.y + threadIdx.y;
+    int col = blockIdx.x * blockDim.x + threadIdx.x;
+
+    if (row < N && col < N) {
+        float sum = 0.0f;
+        for (int k = 0; k < N; k++) {
+            sum += A[row * N + k] * B[k * N + col];
+        }
+        C[row * N + col] = sum;
+    }
+}
+```
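+
+The naive kernel above reads every element of `A` and `B` from global memory `N` times. A common refinement (sketched below under the simplifying assumption that `N` is a multiple of the tile width, so no boundary checks are needed) is to tile the computation through shared memory so each block reuses the data it loads:
+
+```cpp
+#define TILE 16  // tile width; this sketch assumes N is a multiple of TILE
+
+// Tiled matrix multiply: each block computes a TILE x TILE patch of C,
+// staging tiles of A and B in shared memory to cut global-memory traffic.
+__global__ void matrixMulTiled(const float *A, const float *B, float *C, int N) {
+    __shared__ float As[TILE][TILE];
+    __shared__ float Bs[TILE][TILE];
+
+    int row = blockIdx.y * TILE + threadIdx.y;
+    int col = blockIdx.x * TILE + threadIdx.x;
+    float sum = 0.0f;
+
+    for (int t = 0; t < N / TILE; t++) {
+        // Cooperatively load one tile of A and one tile of B.
+        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
+        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
+        __syncthreads();
+
+        // Multiply the two tiles out of fast shared memory.
+        for (int k = 0; k < TILE; k++) {
+            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
+        }
+        __syncthreads();
+    }
+    C[row * N + col] = sum;
+}
+```
+
+A launch of the form `matrixMulTiled<<<dim3(N / TILE, N / TILE), dim3(TILE, TILE)>>>(A, B, C, N)` pairs one block with each output tile.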

-- Registers & Local Memory
-- Shared Memory ⇒ allows threads within a block to communicate
-- L2 cache ⇒ acts as a buffer between the cores/registers and global memory; it is also shared across all SMs
-- L2 cache and shared/L1 cache are both built from the same SRAM circuitry, yet L2 is generally slower than L1. This is not due to the underlying technology, but rather to:
-  - Size: L2 is larger, which increases access time.
-  - Shared nature: L2 is shared among all SMs, requiring more complex access mechanisms.
-  - Physical location: L2 is typically further from the compute units than L1.
-- Global Memory ⇒ stores data copied to and from the host. Everything on the device can access global memory
-- Host ⇒ 16/32/64 GB of DRAM depending on your rig (those four RAM sticks on the motherboard)
-- Arrays too big to fit into registers will spill into local memory. Our goal is to make sure this doesn't happen, because we want to keep the program running as fast as possible
+### Performance Monitoring
+
+```bash
+nvprof ./your_program  # Profile CUDA applications
+```
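+
+Note that on recent CUDA toolkits and GPU architectures `nvprof` is deprecated; the Nsight tools cover the same ground (the command names below are the standard CLI entry points, and the output file name is an arbitrary choice):
+
+```bash
+nsys profile -o report ./your_program   # timeline/system-level profiling (Nsight Systems)
+ncu ./your_program                      # per-kernel metrics (Nsight Compute)
+```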

-
+## 📈 Optimization Tips

+1. **Memory Transfer**
+   - Minimize host-device transfers
+   - Use pinned memory for better bandwidth (see the sketch after this list)

-### What is _random_ access memory?
+2. **Kernel Configuration**
+   - Choose optimal block sizes
+   - Consider hardware limitations

-- On a video tape you have to access the bits sequentially to reach the last ones; "random" refers to being able to get information instantly from any given index, without having to go through anything else first. We are given an abstraction in which memory looks like one giant line, but on the chip it is actually laid out as a grid (the circuitry takes care of this).
+3. **Algorithm Design**
+   - Design for parallelism
+   - Reduce sequential dependencies
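+
+As a small illustration of the pinned-memory tip in item 1 above (the buffer size and stream usage are choices made for this sketch), page-locked host memory allocated with `cudaMallocHost` can be copied with `cudaMemcpyAsync` and generally reaches higher transfer bandwidth than pageable `malloc` memory:
+
+```cpp
+#include <cstdio>
+#include <cuda_runtime.h>
+
+int main() {
+    const int n = 1 << 20;
+    const size_t bytes = n * sizeof(float);
+
+    // Pinned (page-locked) host buffer: the GPU can DMA from it directly,
+    // which typically beats pageable malloc memory for transfer bandwidth.
+    float *h_pinned, *d_buf;
+    cudaMallocHost(&h_pinned, bytes);
+    cudaMalloc(&d_buf, bytes);
+    for (int i = 0; i < n; i++) h_pinned[i] = 1.0f;
+
+    // Asynchronous copy on a stream; the copy can overlap with other work
+    // only because the host buffer is pinned.
+    cudaStream_t stream;
+    cudaStreamCreate(&stream);
+    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
+    cudaStreamSynchronize(stream);
+
+    printf("copied %zu bytes through pinned memory\n", bytes);
+
+    cudaStreamDestroy(stream);
+    cudaFree(d_buf);
+    cudaFreeHost(h_pinned);
+    return 0;
+}
+```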

-
+## 🔗 Additional Resources

+- [NVIDIA Developer Blog](https://developer.nvidia.com/blog)
+- [CUDA Samples Repository](https://github.com/NVIDIA/cuda-samples)
+- [CUDA Training](https://developer.nvidia.com/cuda-training)

-> [Efficient Matrix Transpose NVIDIA Blog Post](https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/)
+## 📝 License

+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.