Blog
Thoughts on AI, Deep Learning & ML Optimization
Substack
Revisiting GEMM: From Kernel Tweaks to System Thinking
Revisited GEMM to move beyond kernel tuning, exploring how virtual memory, cache locality, and memory transactions shape GPU performance at a systems level.
Mar 2026
Substack
Deep Learning Profiling
Profiled deep learning inference to uncover how kernel execution, launch latency, and activation memory dominate performance, enabling targeted optimizations like proper warmup and mixed precision.
Feb 2026
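The "proper warmup" point from the profiling post above can be sketched in a few lines of plain Python (this is an illustrative harness, not code from the post; `warmup` and `iters` defaults are arbitrary):

```python
import time

def benchmark(fn, warmup=5, iters=20):
    """Time fn, discarding warmup runs that absorb one-time costs
    (JIT compilation, lazy initialization, cold caches) so the
    steady-state per-iteration time is what gets reported."""
    for _ in range(warmup):
        fn()  # warmup runs: results and timings discarded
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters  # mean seconds/iter
```

Without the warmup loop, the first few iterations (which pay launch and initialization overhead) get averaged into the result and inflate it.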
Substack
Revisiting Deep Learning: DenseNet, MobileNet, EfficientNet
Explored efficient CNN design from DenseNet to EfficientNet, showing how feature reuse, reduced computation, and compound scaling each depend on precise architectural choices to avoid wasting parameters.
Jan 2026
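The compound scaling idea from the post above can be sketched numerically. The constants `ALPHA`, `BETA`, `GAMMA` are the values reported in the EfficientNet paper; treat this as an illustration, not code from the post:

```python
# EfficientNet-style compound scaling: one coefficient phi scales
# depth, width, and resolution together under a fixed FLOPs budget.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # grid-searched in the paper

def compound_scale(phi):
    depth = ALPHA ** phi        # more layers
    width = BETA ** phi         # more channels
    resolution = GAMMA ** phi   # larger input images
    # FLOPs grow roughly as depth * width^2 * resolution^2, so choosing
    # alpha * beta^2 * gamma^2 ~= 2 doubles FLOPs per unit of phi.
    flops_factor = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops_factor
```

With `phi = 1` the FLOPs factor comes out just under 2, which is the "double the budget per step" property the scaling rule is built around.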
BearBlog
WorkLog: Optimizing Convolution
Benchmarked a custom CUDA convolution kernel against cuDNN, highlighting how shared-memory optimizations can outperform cuDNN in low-batch, low-channel scenarios.
Jan 2026
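The reason shared-memory tiling can beat a naive convolution kernel, as the post above benchmarks, is data reuse: every input element inside a tile is needed by up to K*K output positions. A rough global-read count (tile and filter sizes here are illustrative, not the post's configuration):

```python
def global_reads(tile, k):
    """Global-memory reads per TILE x TILE output tile of a direct
    convolution with a K x K filter."""
    naive = tile * tile * k * k   # every output re-reads its full window
    tiled = (tile + k - 1) ** 2   # one cooperative load of tile + halo
    return naive, tiled

naive, tiled = global_reads(16, 3)  # 2304 vs 324 reads: ~7x less traffic
```

The tiled count loads each input element into shared memory once (including the halo border) and serves all reuse from on-chip memory.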
Substack
Revisiting Deep Learning: AlexNet & ResNet
Revisited AlexNet and ResNet to examine how techniques like ReLU, dropout, and normalization evolved into residual learning and skip connections, enabling stable training of 100+ layer networks.
Jan 2026
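The residual-learning idea from the post above fits in one line: the layers learn a correction F(x), and the skip connection adds x back, so identity is the easy default. A minimal sketch (not code from the post):

```python
def residual_block(x, f):
    """Residual connection: output = F(x) + x. If the layers f learn
    approximately zero, the block degrades to a clean identity, which
    is what makes very deep stacks trainable."""
    return f(x) + x

# with f == 0 the block passes its input through unchanged
assert residual_block(3.0, lambda x: 0.0) == 3.0
```

The same additive structure is why gradients flow through 100+ layers: the skip path contributes an unscaled identity term to every backward step.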
BearBlog
WorkLog: Optimizing Softmax
Optimized Softmax with a custom CUDA kernel using Online Softmax, reducing memory passes and achieving up to a 16× speedup over cuDNN and PyTorch.
Dec 2025
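The Online Softmax update used in the post above can be sketched in plain Python: track a running max `m` and a running sum `d` of exp(x - m), rescaling `d` whenever the max grows, so max-finding and summation fuse into one pass over the data (a numerical sketch, not the post's CUDA kernel):

```python
import math

def online_softmax(xs):
    """Streaming 'safe' softmax: one fused pass computes the running
    max m and the running normalizer d; the classic formulation needs
    separate max and sum passes over the input."""
    m, d = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)  # rescale + add
        m = m_new
    return [math.exp(x - m) / d for x in xs]
```

On a GPU this fusion is what removes a full read of the input from global memory, which is where the speedup over the multi-pass implementations comes from.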
Substack
WorkLog: Reduction in CUDA
Optimized CUDA parallel reductions by addressing warp divergence, memory access inefficiencies, and synchronization overhead. Progressed from naive interleaved addressing to warp-level unrolling, achieving near-peak memory bandwidth.
Dec 2025
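The sequential-addressing pattern the post above converges toward can be modeled on the CPU: at each step, "thread" i adds the element `stride` positions away, and the stride halves. On a GPU this keeps active threads packed into contiguous warps (no divergence) and neighboring threads touching neighboring addresses (coalesced). A sketch, assuming a power-of-two input like a thread block:

```python
def tree_reduce(vals):
    """Tree reduction with sequential addressing: log2(n) steps,
    threads 0..stride-1 active at each step."""
    vals = list(vals)
    stride = len(vals) // 2
    while stride > 0:
        for i in range(stride):           # active "threads" this step
            vals[i] += vals[i + stride]   # partner is stride away
        stride //= 2
    return vals[0]
```

The naive interleaved scheme the post starts from instead activates every other thread (i % (2*stride) == 0), which scatters the active lanes across warps and is what the sequential layout fixes.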
BearBlog
WorkLog: Optimizing GEMM
Explored GEMM optimization from naive kernels to 2D thread tiling, improving arithmetic intensity and reaching ~50% of cuBLAS performance.
Nov 2025