Semantic Scholar
Keeps: paper title, abstract snippet
Query: mixture of experts inference serving Authors: Can Hankendi, Rana Shahout, Minlan Yu, A. Coskun Citations: 1 Abstract: Large language model (LLM) inference has become a dominant workload in modern data centers, driving significant GPU utilization and energy consumption. While prior s
Query: mixture of experts inference serving Authors: Ananya Hegde, Akshata Kumble, Ravi Gupta Citations: 0 Abstract: Mixture-of-Experts (MoE) architectures reduce inference cost by activating only a sparse subset of parameters per token. However, when these models exceed single-GPU memory,
- 2026-04-25 Research paper: Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns
Query: mixture of experts inference serving Authors: A. Bambhaniya, Geonhwa Jeong, Jason Park, Jiecao Yu, Jaewon Lee Citations: 0 Abstract: Most recent state-of-the-art (SOTA) large language models (LLMs) use Mixture-of-Experts (MoE) architectures to scale model capacity without proportion
Query: photonic computing transformer accelerator Authors: S. Afifi, O. Alo, I. Thakkar, S. Pasricha Citations: 0 Abstract: Transformers achieve state-of-the-art performance in natural language processing, vision, and scientific computing, but demand high computation and memory. To address
Query: mixture of experts inference serving Authors: Adrian Zhao, Zhenkun Cai, Zhenyu Song, Lin Yu, Haozheng Fan Citations: 1 Abstract: Mixture-of-Experts (MoE) has recently emerged as the mainstream architecture for efficiently scaling large language models while maintaining near-constant
Query: mixture of experts inference serving Authors: Kexin Chu, Dawei Xiang, Zixu Shen, Yiwei Yang, Zechen Liu Citations: 3 Abstract: Mixture-of-Experts (MoE) has become a practical architecture for scaling LLM capacity while keeping per-token compute modest, but deploying MoE models on a
Query: photonic computing transformer accelerator Authors: Hanqing Zhu, Zhican Zhou, Shupeng Ning, Xuhao Wu, Ray T. Chen Citations: 0 Abstract: Photonic computing has emerged as a promising substrate for accelerating the dense linear-algebra operations at the heart of AI, but its adoption
- 2025-09-26 Research paper: Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts
Query: mixture of experts inference serving Authors: Naibin Gu, Zhenyu Zhang, Yuchen Feng, Yilong Chen, Peng Fu Citations: 8 Abstract: Mixture-of-Experts (MoE) models typically fix the number of activated experts $k$ at both training and inference. However, real-world deployments often fac
Query: mixture of experts inference serving Authors: Rui Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo Citations: 27 Abstract: Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational complexi
Query: photonic computing transformer accelerator Authors: Yi Li, Zijian Ye, Xiangqu Fu, Song-jian Wang, Shucheng Du Citations: 1 Abstract: Vision Transformers (ViTs) are new foundation models for vision applications. Edge-deploying ViTs to realize energy-saving, low-latency, and high-perf
Query: mixture of experts inference serving Authors: Shaoyu Wang, Guangrong He, Geon-Woo Kim, Yanqi Zhou, S. Park Citations: 3 Abstract: Mixture-of-Experts (MoE) architectures offer the promise of larger model capacity without the prohibitive costs of fully dense designs. However, in real-
Query: mixture of experts inference serving Authors: Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo Citations: 35 Abstract: Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational comp
Query: photonic computing transformer accelerator Authors: Shiyue Hua, Erwan Divita, Shanshan Yu, Bo Peng, C. Roques-Carmes Citations: 157 Abstract: Integrated photonics, particularly silicon photonics, have emerged as cutting-edge technology driven by promising applications such as short-
- 2025-03-31 Research paper: HyAtten: Hybrid Photonic-Digital Architecture for Accelerating Attention Mechanism
Query: photonic computing transformer accelerator Authors: Huize Li, Dan Chen, Tulika Mitra Citations: 0 Abstract: The wide adoption and substantial computational resource requirements of attention-based Transformers have spurred the demand for efficient hardware accelerators. Unlike digit
Query: photonic computing transformer accelerator Authors: Bo Chen, T. Chang Citations: 3 Abstract: This paper introduces the first low-power hardware accelerator for Spiking Transformers, an emerging alternative to traditional artificial neural networks. By modifying the base Spikformer m
Query: photonic computing transformer accelerator Authors: Pingcheng Dong, Yonghao Tan, Xuejiao Liu, Peng Luo, Yu Liu Citations: 17 Abstract: Recently, hybrid models integrating a CNN and a Transformer (ConvFormer), shown in Fig. 23.2.1, have achieved significant advancements in semantic s
Query: photonic computing transformer accelerator Authors: Jiaqi Liu, Yiwen Ma Citations: 1 Abstract: The demand for extensive computing resources and energy to support the increasing size of machine learning models has created a disparity between AI applications and the underlying hardwar
Query: photonic computing transformer accelerator Authors: Seok-Woo Chang, Dong-Sun Kim Citations: 0 Abstract: Processing-in-memory (PIM) is designed to overcome data transfer bottlenecks by performing repeated data-intensive operations on the same die as the memory. In this study, we prop
Query: mixture of experts inference serving Authors: Mengfan Liu, Wei Wang, Chuan Wu Citations: 13 Abstract: With the advancement of serverless computing, running machine learning (ML) inference services over a serverless platform has been advocated, given its labor-free scalability and co