Commit 0948284: Update README.md

1 parent b4a877d · 1 file changed: +107 -35 lines
# 🚀 CUDA Programming Guide: From Basics to Advanced
## 📚 Table of Contents

- [Introduction](#-introduction)
- [Prerequisites](#-prerequisites)
- [Core Concepts](#-core-concepts)
- [Getting Started](#-getting-started)
- [Memory Hierarchy](#-memory-hierarchy)
- [Best Practices](#-best-practices)
- [Advanced Topics](#-advanced-topics)
## 🎯 Introduction

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model. This guide will help you understand and implement CUDA kernels efficiently.

### Key Resources

- [CUDA Documentation](https://docs.nvidia.com/cuda/) (everything starts here)
- [CUDA C Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html) (the guide we mainly focus on)
- [Hands-on Introduction](https://developer.nvidia.com/blog/even-easier-introduction-cuda/) (worth following along)
## 💻 Prerequisites

- NVIDIA GPU (Compute Capability 3.0+; see the check below)
- CUDA Toolkit installed
- Basic C/C++ knowledge
- Understanding of parallel computing concepts
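Not sure whether your GPU qualifies? A quick check with the CUDA runtime API (a minimal sketch; the file name `device_query.cu` is just an example):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);                  // how many CUDA devices are visible
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);     // fills name, compute capability, etc.
        printf("Device %d: %s (compute capability %d.%d)\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```

Compile and run it the same way as the kernels below: `nvcc -o device_query device_query.cu && ./device_query`.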
## 🔍 Core Concepts

### Thread Hierarchy

```
Grid
└── Blocks
    └── Threads
```

### Hardware Mapping

- CUDA cores execute individual threads.
- Streaming Multiprocessors (SMs) execute blocks, typically several blocks per SM depending on the resources each block requires.
- The grid, the highest level of the hierarchy, maps onto the entire GPU.

*(TODO: add small explanations and diagrams from the assets folder.)*
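To see the hierarchy in code, consider the common grid-stride loop pattern (a sketch, not from the original notes; the `scale` kernel is hypothetical). Each thread starts at its global index and hops forward by the total number of threads in the grid, so one launch of any size covers all `n` elements:

```cpp
__global__ void scale(float *x, float alpha, int n) {
    int stride = gridDim.x * blockDim.x;                    // threads in the whole grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;     // this thread's global index
         i < n; i += stride) {
        x[i] = alpha * x[i];
    }
}
// Launch: scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
```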
## 🚀 Getting Started

It is generally a good idea to write a kernel's logic on the CPU first (it is easy to write there), then port it to the GPU so your logic lines up at the level of blocks and threads. Feed the same input `x` through the CPU function and the GPU kernel and check whether the outputs match; this tells you whether your GPU code is working as expected (a minimal harness sketch follows the kernel below). Practicing vector addition and matrix multiplication by hand also builds intuition for threads, blocks, and grids.

### First CUDA Program: Vector Addition

```cpp
// One thread per element: thread idx computes c[idx] = a[idx] + b[idx].
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n) {                                    // guard: the grid may overshoot n
        c[idx] = a[idx] + b[idx];
    }
}
```
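Putting the CPU-first tip into practice, here is a minimal verification harness. It is a sketch that assumes the `vectorAdd` kernel above sits in the same file; sizes and test values are arbitrary:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

// CPU reference: the same logic as the kernel, one loop instead of many threads.
void vectorAddCPU(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a = (float *)malloc(bytes), *b = (float *)malloc(bytes);
    float *ref = (float *)malloc(bytes), *out = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    vectorAddCPU(a, b, ref, n);                      // CPU answer first

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // ceil(n / threads)
    vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaMemcpy(out, d_c, bytes, cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (fabsf(ref[i] - out[i]) > 1e-5f) ++mismatches;
    if (mismatches) printf("FAILED: %d mismatches\n", mismatches);
    else            printf("PASSED\n");

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(ref); free(out);
    return mismatches != 0;
}
```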
### Compilation & Execution

```bash
nvcc -o vector_add vector_add.cu
./vector_add
```
## 🧠 Memory Hierarchy

| Memory Type | Scope | Lifetime | Speed |
|-------------|-------|----------|-------|
| Registers | Thread | Thread | Fastest |
| Shared Memory | Block | Block | Very fast |
| Global Memory | Grid | Application | Slow |
| Constant Memory | Grid | Application | Fast (cached) |

### Memory Model Notes

- Registers and local memory are per-thread. Arrays too big to fit in registers spill into local memory; our goal is to avoid spills, because we want to keep the program running as fast as possible (`nvcc -Xptxas -v` reports register usage and spills).
- Shared memory allows threads within a block to communicate.
- The L2 cache acts as a buffer between the cores/registers and global memory, and is shared across all SMs. L2 and shared/L1 memory are built from the same SRAM circuitry, yet L2 is generally slower than L1. This is not due to the underlying technology, but to:
  - Size: L2 is larger, which increases access time.
  - Shared nature: L2 serves all SMs, requiring more complex access mechanisms.
  - Physical location: L2 typically sits farther from the compute units than L1.
- Global memory stores the data copied to and from the host; everything on the device can access global memory.
- Host memory is the 16/32/64 GB of DRAM on your rig (those 4 RAM sticks on the motherboard).

![](assets/memhierarchy.png)

### What is _random_ access memory?

On a video tape you have to read the bits sequentially to reach the last ones. "Random" refers to getting information instantly from any given index, without having to touch anything else first. We are handed an abstraction in which memory looks like one giant line, but on the chip it is actually laid out as a grid (the circuitry takes care of the translation).

![](../assets/memmodel.png)
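To make the table concrete, the block-level reduction below stages data in shared memory so the threads of one block can cooperate (a sketch, not from the original notes; it assumes a power-of-two block size of 256):

```cpp
// Each block sums up to 256 input elements and writes one partial sum.
__global__ void blockSum(const float *in, float *blockSums, int n) {
    __shared__ float tile[256];                  // one slot per thread in the block
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    tile[tid] = (idx < n) ? in[idx] : 0.0f;      // stage global memory into shared
    __syncthreads();                             // everyone finished writing

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = tile[0];  // thread 0 publishes the result
}
```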
## 🎯 Best Practices

1. **Memory Coalescing**
   - Align memory accesses
   - Use appropriate data types

2. **Occupancy Optimization**
   - Balance resource usage
   - Optimize block sizes

3. **Warp Efficiency**
   - Minimize divergent branching
   - Utilize warp-level primitives (see the sketch after this list)
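For the warp-level primitives point, a minimal warp sum using `__shfl_down_sync` (a sketch; it assumes all 32 lanes of the warp are active, hence the full `0xffffffff` mask):

```cpp
// Lanes exchange values directly through registers: no shared memory,
// no __syncthreads() needed within a single warp.
__device__ float warpSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;   // lane 0 ends up holding the warp's total
}
```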
## 🔬 Advanced Topics

### Matrix Operations

```cpp
// Naive matrix multiply: one thread computes one element of C = A * B.
__global__ void matrixMul(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];  // dot product of row and column
        }
        C[row * N + col] = sum;
    }
}
```
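The naive kernel above rereads every element of `A` and `B` from global memory `N` times. A standard optimization (a sketch, not part of the original guide) stages `TILE x TILE` sub-matrices in shared memory so each element is loaded once per tile pass; the matrix transpose post under Additional Resources applies the same idea:

```cpp
#define TILE 16

__global__ void matrixMulTiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];             // tile of A staged by this block
    __shared__ float Bs[TILE][TILE];             // tile of B staged by this block

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();                         // tiles fully loaded

        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                         // done reading before the next load
    }
    if (row < N && col < N) C[row * N + col] = sum;
}
// Launch: dim3 block(TILE, TILE), grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);
```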
### Performance Monitoring

```bash
nvprof ./your_program  # Profile CUDA applications
```

Note that `nvprof` is deprecated and unsupported on newer GPU architectures; NVIDIA's Nsight tools are its replacement.
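A rough equivalent with the newer Nsight tools (same hypothetical `your_program` as above):

```bash
nsys profile -o report ./your_program  # Nsight Systems: timeline of kernels, copies, API calls
ncu ./your_program                     # Nsight Compute: per-kernel hardware metrics
```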
## 📈 Optimization Tips

1. **Memory Transfer**
   - Minimize host-device transfers
   - Use pinned memory for better bandwidth (see the sketch after this list)

2. **Kernel Configuration**
   - Choose optimal block sizes
   - Consider hardware limitations

3. **Algorithm Design**
   - Design for parallelism
   - Reduce sequential dependencies
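For the pinned-memory tip, a minimal sketch (the `stageToDevice` helper and its arguments are hypothetical): `cudaMallocHost` returns page-locked memory the GPU can DMA from directly, which improves transfer bandwidth and lets `cudaMemcpyAsync` actually overlap with other work:

```cpp
#include <cstring>
#include <cuda_runtime.h>

void stageToDevice(float *d_buf, const float *src, size_t bytes, cudaStream_t stream) {
    float *h_buf = nullptr;
    cudaMallocHost((void **)&h_buf, bytes);   // pinned (page-locked), unlike malloc()
    memcpy(h_buf, src, bytes);                // fill the staging buffer
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);            // wait before freeing the staging buffer
    cudaFreeHost(h_buf);
}
```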
## 🔗 Additional Resources

- [NVIDIA Developer Blog](https://developer.nvidia.com/blog)
- [CUDA Samples Repository](https://github.com/NVIDIA/cuda-samples)
- [CUDA Training](https://developer.nvidia.com/cuda-training)
- [Efficient Matrix Transpose (NVIDIA blog post)](https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/)

## 📝 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.