Blog

Thoughts on AI, Deep Learning & ML Optimization

Substack

Revisiting GEMM: From Kernel Tweaks to System Thinking

Revisited GEMM to move beyond kernel tuning, exploring how virtual memory, cache locality, and memory transactions shape GPU performance at a systems level.

Mar 2026

Substack

Deep Learning Profiling

Profiled deep learning inference to uncover how kernel execution, launch latency, and activation memory dominate performance, enabling targeted optimizations like proper warmup and mixed precision.

Feb 2026

Substack

Revisiting DeepLearning: DenseNet, MobileNet, EfficientNet

Explored efficient CNN design from DenseNet to EfficientNet, showing how feature reuse, computation reduction, and compound scaling require precise architectural choices to avoid parameter inefficiency.

Jan 2026

BearBlog

WorkLog: Optimizing Convolution

Performance benchmarking of CUDA-based convolution vs cuDNN, highlighting how shared memory optimizations can outperform cuDNN in low-batch, low-channel scenarios.

Jan 2026

Substack

Revisiting Deep Learning: AlexNet & ResNet

Revisited AlexNet and ResNet to examine how techniques like ReLU, dropout, and normalization evolved into residual learning and skip connections, enabling stable training of 100+ layer networks.

Jan 2026

BearBlog

WorkLog: Optimizing Softmax

Optimized Softmax with a custom CUDA kernel using Online Softmax, reducing memory passes and achieving up to a 16× speedup over cuDNN and PyTorch.
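The core idea behind Online Softmax is that the running maximum and the running sum of exponentials can be maintained in a single pass, rescaling the sum whenever a new maximum appears. A minimal Python sketch of that recurrence (the post's actual implementation is a CUDA kernel; this only models the algorithm):

```python
import math

def online_softmax(xs):
    # Single pass: track running max m and running sum d of exp(x - m).
    # When a new max appears, rescale the accumulated sum by exp(old_m - new_m).
    m = float("-inf")
    d = 0.0
    for x in xs:
        new_m = max(m, x)
        d = d * math.exp(m - new_m) + math.exp(x - new_m)
        m = new_m
    # A final pass writes the normalized outputs; in a fused kernel this
    # is combined with other work so the input is read fewer times overall.
    return [math.exp(x - m) / d for x in xs]
```

Compared to the textbook three-pass formulation (find max, sum exponentials, normalize), this folds the first two passes into one, which is where the memory-traffic savings come from.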

Dec 2025

Substack

WorkLog: Reduction in CUDA

Optimized CUDA parallel reductions by addressing warp divergence, memory access inefficiencies, and synchronization overhead. Progressed from naive interleaved addressing to warp-level unrolling, achieving near-peak memory bandwidth.
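The sequential-addressing step of that progression can be modeled outside CUDA: each "thread" tid adds element tid + s into element tid, with the stride s halving each round. A hedged Python sketch of the access pattern (the real kernel uses shared memory, `__syncthreads()`, and warp-level unrolling, none of which appear here):

```python
def block_reduce(data):
    # Model of a CUDA block reduction with sequential addressing.
    # Active "threads" stay contiguous (tid < s), which is what avoids
    # the warp divergence of the naive interleaved-addressing scheme.
    vals = list(data)
    n = len(vals)  # assume a power-of-two length, as a padded kernel would
    s = n // 2
    while s > 0:
        for tid in range(s):      # these iterations run in parallel on a GPU
            vals[tid] += vals[tid + s]
        s //= 2
    return vals[0]
```

Because the active threads occupy the low thread IDs at every step, whole warps retire together instead of idling half their lanes, which is the divergence fix the post describes.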

Dec 2025

BearBlog

WorkLog: Optimizing GEMM

Explored GEMM optimization from naive kernels to 2D thread tiling, improving arithmetic intensity and reaching ~50% of cuBLAS performance.
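Tiling raises arithmetic intensity by reusing each loaded tile of A and B across many output elements instead of re-fetching operands per multiply. A minimal Python sketch of the loop structure (tile size and square-matrix assumption are illustrative; the post's kernel stages tiles in CUDA shared memory):

```python
def tiled_matmul(A, B, tile=2):
    # Blocked GEMM: accumulate each output tile from tile x tile blocks
    # of A and B. On a GPU, each block is loaded into shared memory once
    # and reused tile-many times, raising FLOPs per byte moved.
    n = len(A)  # assume square n x n matrices with n divisible by tile
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):          # loop over K in tiles
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        acc = C[i][j]
                        for k in range(k0, k0 + tile):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

The 2D thread tiling in the post applies the same reuse idea at the register level: each thread computes a small 2D patch of C, amortizing every shared-memory load over multiple accumulators.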

Nov 2025