NVIDIA announced new fused MoE kernels for CUDA that deliver 1.3x–2x kernel-level speedups, achieving an 8% end-to-end improvement on DeepSeek-V3 pre-training and a 93% improvement on GPT-OSS (Source 1).
signal brief
NVIDIA announced new fused mixture-of-experts (MoE) kernels for CUDA that deliver 1.3x–2x kernel-level speedups, which translate into an 8% end-to-end improvement on DeepSeek-V3 pre-training and a 93% improvement on GPT-OSS (Source 1). The kernels are built with the NVIDIA CuTe DSL, ship in cuDNN Frontend, and are accessible through Transformer Engine and Megatron-Core. They target the memory and synchronization bottlenecks in MoE blocks, a core building block of large-scale AI models. The timing aligns with growing adoption of MoE models (e.g., DeepSeek-V3.2 in Hugging Face Transformers v5.11.0, Source 4), which reinforces CUDA's performance advantage for training workloads. This is not a breakthrough, but the steady stream of kernel enhancements strengthens NVIDIA's software moat and developer ecosystem and makes it harder for competitors to erode CUDA's dominance.
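To relate the kernel-level and end-to-end figures above, a rough Amdahl's-law sketch can estimate what share of a training step the accelerated kernels would need to occupy. This is an illustration under stated assumptions, not NVIDIA's methodology: it treats the 8% end-to-end improvement as a 1.08x speedup, assumes only the MoE kernels got faster, and uses a hypothetical helper name (`implied_kernel_fraction`) that appears in no cited source.

```python
def implied_kernel_fraction(end_to_end_speedup: float, kernel_speedup: float) -> float:
    """Share f of baseline step time spent in the accelerated kernels.

    Amdahl's law: S = 1 / ((1 - f) + f / k)
    Solved for f: f = (1 - 1/S) / (1 - 1/k)

    Assumption: everything outside the accelerated kernels is unchanged.
    """
    if end_to_end_speedup < 1.0:
        raise ValueError("end_to_end_speedup must be >= 1.0")
    if kernel_speedup <= 1.0:
        raise ValueError("kernel_speedup must be > 1.0")
    return (1.0 - 1.0 / end_to_end_speedup) / (1.0 - 1.0 / kernel_speedup)


if __name__ == "__main__":
    # Assumption: "8% end-to-end improvement" is read as a 1.08x speedup.
    for k in (1.3, 2.0):  # reported kernel-level speedup range
        f = implied_kernel_fraction(1.08, k)
        print(f"kernel speedup {k:.1f}x -> kernels ~{f:.0%} of baseline step time")
```

Under these assumptions, the DeepSeek-V3 figure is consistent with the fused kernels covering roughly 15% to 32% of baseline step time. A result above 1.0 for any pair of inputs would mean this simple model does not fit those two numbers.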
evidence
- https://developer.nvidia.com/blog/boosting-moe-training-throughput-with-advanced-fusion-kernels/ (web)
- https://www.hpcwire.com/off-the-wire/pnnl-prepares-for-quantum-advantage/ (web)
- https://developer.nvidia.com/cuda-toolkit (web)
- https://github.com/huggingface/transformers/releases/tag/v5.11.0 (github)
- https://manifold.markets/_deleted_/will-cuda-remain-a-monopoly-for-gpu (web)
Decision support, not stock advice. This signal is research with cited evidence — not a recommendation to buy, sell, or hold any security.