Exploiting Student Parallelism for Low-latency GPU Inference of BERT-like Models in Online Services

Published in KDD 2025 (CCF-A)

Weiyan Wang, Yilun Jin, Yiming Zhang, Victor Junqiu Wei, Han Tian, Li Chen, Jinbao Xue, Yangyu Tao, Di Wang, Kai Chen

Recommended citation: Weiyan Wang, Yilun Jin, Yiming Zhang, Victor Junqiu Wei, Han Tian, Li Chen, Jinbao Xue, Yangyu Tao, Di Wang, Kai Chen. "Exploiting Student Parallelism for Low-latency GPU Inference of BERT-like Models in Online Services." KDD 2025.