Yuancheng Wang


Yuancheng Wang is a second-year Ph.D. student in the School of Data Science (SDS) at the Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), supervised by Professor Zhizheng Wu. Before that, he received his B.S. degree from CUHK-Shenzhen. He also collaborates with Xu Tan from Microsoft Research Asia.

His research interests include text-to-speech synthesis, text-to-audio generation, and unified audio representation and generation. He is one of the main contributors to and leaders of the open-source Amphion toolkit. He has developed several advanced TTS models, including NaturalSpeech 3 and MaskGCT.

news

Jan 25, 2025 🎉 MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer has been accepted to ICLR 2025!
Oct 28, 2024 🔥 We released code (2.5k+ stars in one week) and checkpoints of MaskGCT, which has been used in Quwan All Voice.
Sep 20, 2024 🎉 Our paper, SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words, got accepted by NeurIPS 2024!
Aug 25, 2024 🎉 Our papers, Amphion and Emilia, got accepted by IEEE SLT 2024!
Jul 28, 2024 🔥 We released Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation, which contains 101k hours of speech in six languages with diverse speaking styles.
May 15, 2024 🎉 Our paper Factorized Diffusion Models are Natural and Zero-shot Speech Synthesizers, aka NaturalSpeech 3, got accepted by ICML 2024 as an Oral presentation!
Nov 26, 2023 🔥 We released Amphion v0.1, an open-source toolkit for audio, music, and speech generation.
Sep 20, 2023 🎉 My first paper on audio generation and editing, AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models, got accepted by NeurIPS 2023!

selected publications

  1. ICLR 2025
    MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
    Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Jiachen Zheng, Qiang Zhang, Xueyao Zhang, Shunsi Zhang, and Zhizheng Wu
    2025
TL;DR: A fully non-autoregressive large-scale zero-shot TTS model that eliminates the need for phone-level duration prediction.
  2. ICML 2024 Oral
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Zeqian Ju*, Yuancheng Wang*, Kai Shen*, Xu Tan*, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, and Sheng Zhao
    2024
TL;DR: A large-scale zero-shot TTS model that achieves quality on par with human recordings.
  3. NeurIPS 2024
    SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words
Junyi Ao*, Yuancheng Wang*, Xiaohai Tian, Dekun Chen, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, and Zhizheng Wu
    2024
    TL;DR: We propose a benchmark dataset to evaluate spoken dialogue understanding and generation.
  4. NeurIPS 2023
AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models
    Yuancheng Wang, Zeqian Ju, Xu Tan, Lei He, Zhizheng Wu, and Jiang Bian
    2023
    TL;DR: The first audio editing model that can follow natural language instructions.
  5. IEEE SLT 2024
Amphion: An Open-Source Audio, Music, and Speech Generation Toolkit
Xueyao Zhang*, Liumeng Xue*, Yicheng Gu*, Yuancheng Wang*, Jiaqi Li, Haorui He, Chaoren Wang, Songting Liu, Xi Chen, Junan Zhang, Tze Ying Tang, Lexiao Zou, Mingxuan Wang, Jun Han, Kai Chen, Haizhou Li, and Zhizheng Wu
    2024
    TL;DR: We develop a unified toolkit for audio, music, and speech generation.
  6. IEEE SLT 2024
    Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation
    Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, and Zhizheng Wu
    2024
TL;DR: We collect a 100k+ hour in-the-wild speech dataset for speech generation.

internships

ByteDance
Research Intern · Shenzhen, China · 2024.05 - Present
Speech Understanding
Microsoft Research Asia
Research Intern · Beijing, China · 2022.12 - 2023.06
Worked on audio generation & editing and large-scale text-to-speech synthesis.
Audio Generation & Editing · Speech Synthesis