Building a C++ Deep Learning Framework from Scratch: My Journey with CUDA
Building a PyTorch-like deep learning framework from the ground up.
Series: Learning CUDA
- 1. From CNNs to CUDA: An Intuitive Guide to Tiled Matrix Multiplication
- 2. Building a C++ Deep Learning Framework from Scratch: My Journey with CUDA (current)
Continuing the madness. Over the past few weeks, I’ve been heads-down in the weeds of C++, CUDA kernels, and memory management to build MiniTorch, my own GPU-accelerated ML library.
The Performance Gap: Beyond Naive Kernels
When you start, you write naive kernels. They work, but they’re slow. The real magic happens when you start thinking about GPU architecture.
1. The Caching Allocator (Memory Pool)
In a training loop with hundreds of thousands of iterations, calling cudaMalloc and cudaFree every step is a death sentence for performance: both are expensive driver calls, and cudaFree can also force a device synchronization. I implemented a custom GPU Memory Pool that caches and reuses allocations. Instead of asking the CUDA driver for memory every time, the framework manages its own slab of device memory and hands freed blocks back out, so steady-state tensor operations rarely pay for a fresh allocation.
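To make the caching idea concrete, here is a minimal sketch of a pool that reuses freed blocks. It is not MiniTorch's actual allocator: the class name `CachingPool`, the 512-byte rounding, and the size-bucketed free lists are all assumptions made for illustration, and `std::malloc` / `std::free` stand in for `cudaMalloc` / `cudaFree` so the example runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_map>
#include <vector>

// Hypothetical caching pool (names are illustrative, not MiniTorch's API).
// std::malloc / std::free stand in for cudaMalloc / cudaFree.
class CachingPool {
public:
    CachingPool() = default;
    CachingPool(const CachingPool&) = delete;
    CachingPool& operator=(const CachingPool&) = delete;

    ~CachingPool() {
        release_cached();
        for (auto& kv : in_use_) std::free(kv.first);  // reclaim leaked blocks
    }

    // Return a cached block of the right size class if one exists;
    // only fall back to the backend allocator on a cache miss.
    void* allocate(std::size_t bytes) {
        const std::size_t size = round_up(bytes);
        auto bucket = free_blocks_.find(size);
        if (bucket != free_blocks_.end() && !bucket->second.empty()) {
            void* p = bucket->second.back();
            bucket->second.pop_back();
            in_use_[p] = size;
            ++cache_hits_;
            return p;
        }
        void* p = std::malloc(size);  // the "cudaMalloc" path
        if (p == nullptr) return nullptr;
        in_use_[p] = size;
        ++backend_allocs_;
        return p;
    }

    // "Freeing" only moves the block to the free list for later reuse.
    void deallocate(void* p) {
        if (p == nullptr) return;
        auto it = in_use_.find(p);
        assert(it != in_use_.end() && "pointer was not allocated by this pool");
        if (it == in_use_.end()) return;
        free_blocks_[it->second].push_back(p);
        in_use_.erase(it);
    }

    // Actually hand cached memory back to the backend (the "cudaFree" path).
    void release_cached() {
        for (auto& kv : free_blocks_)
            for (void* p : kv.second) std::free(p);
        free_blocks_.clear();
    }

    std::size_t backend_allocs() const { return backend_allocs_; }
    std::size_t cache_hits() const { return cache_hits_; }

private:
    // Round requests up to a size class so similar sizes share a bucket.
    static std::size_t round_up(std::size_t bytes) {
        constexpr std::size_t kGranularity = 512;  // assumed granularity
        if (bytes == 0) bytes = 1;
        return (bytes + kGranularity - 1) / kGranularity * kGranularity;
    }

    std::unordered_map<std::size_t, std::vector<void*>> free_blocks_;
    std::unordered_map<void*, std::size_t> in_use_;
    std::size_t backend_allocs_ = 0;
    std::size_t cache_hits_ = 0;
};
```

In a training loop, the tensors created each step tend to have the same shapes as in the previous step, so after the first iteration nearly every `allocate` call in a design like this would be served from the free lists rather than the backend.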
2. Tiled Matrix Multiplication
Optimizing MatMul is where you truly understand GPU architecture. With Shared Memory Tiling, each thread block loads a tile of the inputs from slow Global Memory into fast on-chip shared memory once, and every thread in the block then reuses it, which cuts global memory traffic substantially. You can review my journey on this optimization in this blog post.
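The tile structure can be sketched on the CPU in plain C++. This is only an analogue of the CUDA kernel, not the kernel itself: the function name `matmul_tiled`, the tile size of 16, and the row-major layout are assumptions, the two outer loops play the role of the thread-block grid, and the local `tileA` / `tileB` buffers play the role of shared memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 16;  // assumed tile width

// C (M x N) = A (M x K) * B (K x N), all row-major.
// Hypothetical CPU analogue of a shared-memory tiled kernel.
std::vector<float> matmul_tiled(const std::vector<float>& A,
                                const std::vector<float>& B,
                                std::size_t M, std::size_t K, std::size_t N) {
    assert(A.size() == M * K && B.size() == K * N);
    std::vector<float> C(M * N, 0.0f);
    float tileA[TILE][TILE];  // stands in for __shared__ memory
    float tileB[TILE][TILE];

    for (std::size_t bi = 0; bi < M; bi += TILE) {          // block row
        for (std::size_t bj = 0; bj < N; bj += TILE) {      // block column
            for (std::size_t bk = 0; bk < K; bk += TILE) {  // march along K
                // Stage one tile of A and one of B, zero-padding the edges.
                for (std::size_t ty = 0; ty < TILE; ++ty) {
                    for (std::size_t tx = 0; tx < TILE; ++tx) {
                        const std::size_t ar = bi + ty, ac = bk + tx;
                        const std::size_t br = bk + ty, bc = bj + tx;
                        tileA[ty][tx] = (ar < M && ac < K) ? A[ar * K + ac] : 0.0f;
                        tileB[ty][tx] = (br < K && bc < N) ? B[br * N + bc] : 0.0f;
                    }
                }
                // Every output element in the block reuses the staged tiles.
                for (std::size_t ty = 0; ty < TILE; ++ty) {
                    for (std::size_t tx = 0; tx < TILE; ++tx) {
                        const std::size_t r = bi + ty, c = bj + tx;
                        if (r >= M || c >= N) continue;
                        float acc = 0.0f;
                        for (std::size_t k = 0; k < TILE; ++k)
                            acc += tileA[ty][k] * tileB[k][tx];
                        C[r * N + c] += acc;
                    }
                }
            }
        }
    }
    return C;
}
```

Each staged element of `tileA` and `tileB` is read `TILE` times during the compute step, which is the reuse that tiling buys. On the GPU the staging loop would be done cooperatively by the threads of a block, with a barrier between the load and compute phases.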
Design Choice: Why Modules over Tensor Autograd?
PyTorch uses Tensor-based Autograd, where every tensor remembers its history via a dynamic computational graph. For MiniTorch, I chose an Object-Oriented Module System.
In this architecture, layers like Linear or Sigmoid are explicitly responsible for their own forward and backward passes.
Why?
- Transparency: It forced me to manually cache the exact intermediate states (like Sigmoid’s output or ReLU’s mask) needed for backprop.
- Simplicity & Speed: At this stage, building the logic for dynamic graph construction adds complexity and overhead that distract from the core goal of writing a small framework that does just enough.
- The “Lego” Feel: Using a Sequential container that explicitly iterates through modules makes the flow of gradients incredibly clear.
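The points above can be sketched as a small class hierarchy. This is an illustration under assumptions rather than MiniTorch's real interface: the `Module` base class signature and the use of `std::vector<float>` as a stand-in `Tensor` are hypothetical, and only the layer names Sigmoid, ReLU, and Sequential come from the post.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

using Tensor = std::vector<float>;  // hypothetical stand-in for a GPU tensor

// Each module owns both its forward and its backward pass.
struct Module {
    virtual ~Module() = default;
    virtual Tensor forward(const Tensor& x) = 0;
    virtual Tensor backward(const Tensor& grad_out) = 0;
};

struct Sigmoid : Module {
    Tensor out;  // cached output, needed for the derivative
    Tensor forward(const Tensor& x) override {
        out.resize(x.size());
        for (std::size_t i = 0; i < x.size(); ++i)
            out[i] = 1.0f / (1.0f + std::exp(-x[i]));
        return out;
    }
    Tensor backward(const Tensor& grad_out) override {
        assert(grad_out.size() == out.size());
        Tensor grad_in(grad_out.size());
        for (std::size_t i = 0; i < grad_out.size(); ++i)
            grad_in[i] = grad_out[i] * out[i] * (1.0f - out[i]);
        return grad_in;
    }
};

struct ReLU : Module {
    std::vector<bool> mask;  // cached: which inputs were positive
    Tensor forward(const Tensor& x) override {
        mask.resize(x.size());
        Tensor y(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) {
            mask[i] = x[i] > 0.0f;
            y[i] = mask[i] ? x[i] : 0.0f;
        }
        return y;
    }
    Tensor backward(const Tensor& grad_out) override {
        assert(grad_out.size() == mask.size());
        Tensor grad_in(grad_out.size());
        for (std::size_t i = 0; i < grad_out.size(); ++i)
            grad_in[i] = mask[i] ? grad_out[i] : 0.0f;
        return grad_in;
    }
};

// Forward walks the layers in order; backward walks them in reverse.
struct Sequential : Module {
    std::vector<std::unique_ptr<Module>> layers;
    void add(std::unique_ptr<Module> m) { layers.push_back(std::move(m)); }
    Tensor forward(const Tensor& x) override {
        Tensor h = x;
        for (auto& layer : layers) h = layer->forward(h);
        return h;
    }
    Tensor backward(const Tensor& grad_out) override {
        Tensor g = grad_out;
        for (auto it = layers.rbegin(); it != layers.rend(); ++it)
            g = (*it)->backward(g);
        return g;
    }
};
```

A network would be assembled with calls like `net.add(std::make_unique<ReLU>())`, then driven with `net.forward(x)` followed by `net.backward(grad)`. The trade-off of this design is that each module must be called exactly once per forward pass before its backward pass, because the cached state is overwritten on every call.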
The Results
Running on a dataset of 350,000+ sales samples, the framework now trains a multi-layer DNN with Adam optimization completely on the GPU. Seeing the MSE loss drop in real time on a framework I wrote from scratch is one of the most rewarding moments in my engineering journey.
Check my LinkedIn post for the demo video of the framework in action!
What’s Next?
The foundation is solid. Next up:
- CNNs: Moving beyond MLPs.
- Infrastructure: Decoupling the backend to support CPU/GPU dispatch.
- The Big One: Building a Transformer block entirely in this C++ ecosystem.
Stay tuned—I’ll be sharing more deep dives into the kernels soon!