Stars
Kanban board to manage your AI coding agents
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
Introduction to Machine Learning Systems
ArcticInference: vLLM plugin for high-throughput, low-latency inference
Manage your GPUs across NVIDIA, AMD, Intel, and Apple Silicon systems.
FlashInfer: Kernel Library for LLM Serving
Real-time terminal monitor for InfiniBand networks: an htop for high-speed interconnects
Notes for using Julia while learning calculus
MIT IAP short course: Matrix Calculus for Machine Learning and Beyond
Code to create the figures and solve the exercises in the textbook
Master calculus 1 using Python: derivatives and applications
Examples demonstrating the available options for programming multiple GPUs in a single node or across a cluster
NCCL examples from the official NVIDIA NCCL Developer Guide.
Example code that uses DC QPs (Dynamically Connected Queue Pairs) to issue RDMA READ and WRITE operations against remote GPU memory
Examples of how to call collective operation functions in multi-GPU environments: simple uses of broadcast, reduce, allGather, reduceScatter, and sendRecv operations (a minimal NCCL-style sketch appears after this list).
High Performance Computing Project 2021-2022.
Implementation of the allreduce algorithm using only MPI point-to-point communication routines (MPI_Send, MPI_Recv); a point-to-point sketch also appears after this list.
MSCCL++: A GPU-driven communication stack for scalable AI applications
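The collective-operations entry above names broadcast, reduce, allGather, reduceScatter, and sendRecv, which match NCCL's collective API. The following minimal sketch therefore assumes NCCL and shows a single-process, multi-GPU broadcast; the buffer size and eight-device cap are arbitrary illustration choices, error checking is omitted, and this is not code from the starred repository.

```c
// Minimal sketch: single-process, multi-GPU broadcast with NCCL.
// Assumes CUDA and NCCL are installed; error handling omitted for brevity.
#include <stdio.h>
#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);
    if (nDev > 8) nDev = 8;          // cap to the fixed array sizes below

    ncclComm_t comms[8];
    cudaStream_t streams[8];
    float *buf[8];
    const size_t count = 1 << 20;    // 1M floats per device (arbitrary)

    // One communicator per visible GPU, all driven from this single process.
    ncclCommInitAll(comms, nDev, NULL);

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
        cudaMalloc((void **)&buf[i], count * sizeof(float));
    }

    // Broadcast device 0's buffer to every other device; calls between
    // GroupStart/GroupEnd are issued together as one fused operation.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        ncclBroadcast(buf[i], buf[i], count, ncclFloat, /*root=*/0,
                      comms[i], streams[i]);
    }
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(buf[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("broadcast of %zu floats across %d GPUs done\n", count, nDev);
    return 0;
}
```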
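The allreduce entry above builds the operation from MPI_Send and MPI_Recv alone. The sketch below does the same with the simplest possible scheme, a naive reduce-to-root followed by a send-back; the repository's actual algorithm (for example a ring or recursive doubling) may differ, and the helper name my_allreduce_sum is made up for illustration.

```c
/* Minimal sketch: allreduce (sum) built only from MPI_Send/MPI_Recv. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

static void my_allreduce_sum(const double *sendbuf, double *recvbuf,
                             int count, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == 0) {
        double *tmp = malloc(count * sizeof(double));
        for (int i = 0; i < count; ++i) recvbuf[i] = sendbuf[i];
        /* Accumulate contributions from every other rank. */
        for (int src = 1; src < size; ++src) {
            MPI_Recv(tmp, count, MPI_DOUBLE, src, 0, comm, MPI_STATUS_IGNORE);
            for (int i = 0; i < count; ++i) recvbuf[i] += tmp[i];
        }
        /* Send the final result back to every rank. */
        for (int dst = 1; dst < size; ++dst)
            MPI_Send(recvbuf, count, MPI_DOUBLE, dst, 1, comm);
        free(tmp);
    } else {
        MPI_Send(sendbuf, count, MPI_DOUBLE, 0, 0, comm);
        MPI_Recv(recvbuf, count, MPI_DOUBLE, 0, 1, comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double in = (double)rank, out = 0.0;
    my_allreduce_sum(&in, &out, 1, MPI_COMM_WORLD);

    /* Every rank should print the same sum: 0 + 1 + ... + (size-1). */
    printf("rank %d: sum = %g\n", rank, out);
    MPI_Finalize();
    return 0;
}
```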