Semantic Scholar

20 events in store | role: thematic | 30d history | access: keyless

Keeps: paper title, abstract snippet

Archive source — full history has value. Use pagination to browse older records.

2026-05-20 Research paper: PALS: Power-Aware LLM Serving for Mixture-of-Experts Models
Query: mixture of experts inference serving | Authors: Can Hankendi, Rana Shahout, Minlan Yu, A. Coskun | Citations: 1
Large language model (LLM) inference has become a dominant workload in modern data centers, driving significant GPU utilization and energy consumption. While prior s
2026-05-01 Research paper: Parallelism Strategies and Concurrency Effects for Mixture-of-Experts Inference on GPU Systems
Query: mixture of experts inference serving | Authors: Ananya Hegde, Akshata Kumble, Ravi Gupta | Citations: 0
Mixture-of-Experts (MoE) architectures reduce inference cost by activating only a sparse subset of parameters per token. However, when these models exceed single-GPU memory,
2026-04-25 Research paper: Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns
Query: mixture of experts inference serving | Authors: A. Bambhaniya, Geonhwa Jeong, Jason Park, Jiecao Yu, Jaewon Lee | Citations: 0
Most recent state-of-the-art (SOTA) large language models (LLMs) use Mixture-of-Experts (MoE) architectures to scale model capacity without proportion
2026-04-10 Research paper: Sustainable Transformer Neural Network Acceleration with Stochastic Photonic Computing
Query: photonic computing transformer accelerator | Authors: S. Afifi, O. Alo, I. Thakkar, S. Pasricha | Citations: 0
Transformers achieve state-of-the-art performance in natural language processing, vision, and scientific computing, but demand high computation and memory. To address
LIGHTMATTER
2026-01-12 Research paper: CRAFT: Fine-Grained Cost-Aware Expert Replication For Efficient Mixture-of-Experts Serving
Query: mixture of experts inference serving | Authors: Adrian Zhao, Zhenkun Cai, Zhenyu Song, Lin Yu, Haozheng Fan | Citations: 1
Mixture-of-Experts (MoE) has recently emerged as the mainstream architecture for efficiently scaling large language models while maintaining near-constant
2025-11-19 Research paper: Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference
Query: mixture of experts inference serving | Authors: Kexin Chu, Dawei Xiang, Zixu Shen, Yiwei Yang, Zechen Liu | Citations: 3
Mixture-of-Experts (MoE) has become a practical architecture for scaling LLM capacity while keeping per-token compute modest, but deploying MoE models on a
2025-10-02 Research paper: ENLighten: Lighten the Transformer, Enable Efficient Optical Acceleration
Query: photonic computing transformer accelerator | Authors: Hanqing Zhu, Zhican Zhou, Shupeng Ning, Xuhao Wu, Ray T. Chen | Citations: 0
Photonic computing has emerged as a promising substrate for accelerating the dense linear-algebra operations at the heart of AI, but its adoption
LIGHTMATTER
2025-09-26 Research paper: Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts
Query: mixture of experts inference serving | Authors: Naibin Gu, Zhenyu Zhang, Yuchen Feng, Yilong Chen, Peng Fu | Citations: 8
Mixture-of-Experts (MoE) models typically fix the number of activated experts $k$ at both training and inference. However, real-world deployments often fac
2025-08-27 Research paper: MegaScale-Infer: Efficient Mixture-of-Experts Model Serving with Disaggregated Expert Parallelism
Query: mixture of experts inference serving | Authors: Rui Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo | Citations: 27
Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational complexi
2025-06-22 Research paper: Efficient Edge Vision Transformer Accelerator with Decoupled Chunk Attention and Hybrid Computing-In-Memory
Query: photonic computing transformer accelerator | Authors: Yi Li, Zijian Ye, Xiangqu Fu, Song-jian Wang, Shucheng Du | Citations: 1
Vision Transformers (ViTs) are new foundation models for vision applications. Edge-deploying ViTs to realize energy-saving, low-latency, and high-perf
LIGHTMATTER
2025-05-13 Research paper: Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony
Query: mixture of experts inference serving | Authors: Shaoyu Wang, Guangrong He, Geon-Woo Kim, Yanqi Zhou, S. Park | Citations: 3
Mixture-of-Experts (MoE) architectures offer the promise of larger model capacity without the prohibitive costs of fully dense designs. However, in real-
2025-04-03 Research paper: MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
Query: mixture of experts inference serving | Authors: Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo | Citations: 35
Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational comp
2025-04-01 Research paper: An integrated large-scale photonic accelerator with ultralow latency
Query: photonic computing transformer accelerator | Authors: Shiyue Hua, Erwan Divita, Shanshan Yu, Bo Peng, C. Roques-Carmes | Citations: 157
Integrated photonics, particularly silicon photonics, have emerged as cutting-edge technology driven by promising applications such as short-
LIGHTMATTER
2025-03-31 Research paper: HyAtten: Hybrid Photonic-Digital Architecture for Accelerating Attention Mechanism
Query: photonic computing transformer accelerator | Authors: Huize Li, Dan Chen, Tulika Mitra | Citations: 0
The wide adoption and substantial computational resource requirements of attention-based Transformers have spurred the demand for efficient hardware accelerators. Unlike digit
LIGHTMATTER
2025-03-25 Research paper: Hardware Efficient Accelerator for Spiking Transformer With Reconfigurable Parallel Time Step Computing
Query: photonic computing transformer accelerator | Authors: Bo Chen, T. Chang | Citations: 3
This paper introduces the first low-power hardware accelerator for Spiking Transformers, an emerging alternative to traditional artificial neural networks. By modifying the base Spikformer m
LIGHTMATTER
2025-02-16 Research paper: A 28nm 0.22μJ/Token Memory-Compute-Intensity-Aware CNN-Transformer Accelerator with Hybrid-Attention-Based Layer-Fusion and Cascaded Pruning for Semantic-Segmentation
Query: photonic computing transformer accelerator | Authors: Pingcheng Dong, Yonghao Tan, Xuejiao Liu, Peng Luo, Yu Liu | Citations: 17
Recently, hybrid models integrating a CNN and a Transformer (ConvFormer), shown in Fig. 23.2.1, have achieved significant advancements in semantic s
LIGHTMATTER
2025-01-20 Research paper: OpticalHDC: Ultra-fast Photonic Hyperdimensional Computing Accelerator
Query: photonic computing transformer accelerator | Authors: Jiaqi Liu, Yiwen Ma | Citations: 1
The demand for extensive computing resources and energy to support the increasing size of machine learning models has created a disparity between AI applications and the underlying hardwar
LIGHTMATTER
2025-01-20 Research paper: Hybrid Photonic-digital Accelerator for Attention Mechanism
Query: photonic computing transformer accelerator | Authors: Huize Li, Dan Chen, Tulika Mitra | Citations: 0
The wide adoption and substantial computational resource requirements of attention-based Transformers have spurred the demand for efficient hardware accelerators. Unlike digit
LIGHTMATTER
2025-01-11 Research paper: An In-Memory Computing-based Efficient Transformer Accelerator Using Stateful Matrix Multiplier for Voice Assistant Consumer Applications
Query: photonic computing transformer accelerator | Authors: Seok-Woo Chang, Dong-Sun Kim | Citations: 0
Processing-in-memory (PIM) is designed to overcome data transfer bottlenecks by performing repeated data-intensive operations on the same die as the memory. In this study, we prop
LIGHTMATTER
2025-01-09 Research paper: Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing
Query: mixture of experts inference serving | Authors: Mengfan Liu, Wei Wang, Chuan Wu | Citations: 13
With the advancement of serverless computing, running machine learning (ML) inference services over a serverless platform has been advocated, given its labor-free scalability and co