Exploiting Student Parallelism for Low-latency GPU Inference of BERT-like Models in Online Services
Weiyan Wang, Yilun Jin, Yiming Zhang, Victor Junqiu Wei, Han Tian, Li Chen, Jinbao Xue, Yangyu Tao, Di Wang, Kai Chen. "Exploiting Student Parallelism for Low-latency GPU Inference of BERT-like Models in Online Services." KDD 2025.
