Yuancheng Wang

Yuancheng Wang is a second-year Ph.D. student in the School of Data Science (SDS) at the Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), supervised by Professor Zhizheng Wu. Before that, he received his B.S. degree from CUHK-Shenzhen. He also collaborates with Xu Tan from Microsoft Research Asia.
His research interests include speech/audio generation and representation, large speech language models, and generative AI. He is one of the main contributors to and leaders of the open-source Amphion toolkit. He has developed advanced TTS models, including NaturalSpeech 3 and MaskGCT.
news
- May 17, 2025 | 🎉 Our paper Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment was accepted to ACL 2025 (main conference)!
- Jan 25, 2025 | 🎉 MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer was accepted to ICLR 2025!
- Oct 28, 2024 | 🔥 We released the code (2.5k+ stars in one week) and checkpoints of MaskGCT, which has been used in Quwan All Voice.
- Sep 20, 2024 | 🎉 Our paper SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words was accepted to NeurIPS 2024!
- Aug 25, 2024 | 🎉 Our papers Amphion and Emilia were accepted to IEEE SLT 2024!
- Jul 28, 2024 | 🔥 We released Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation, with 101k hours of speech in six languages, covering varied speaking styles.
- May 15, 2024 | 🎉 Our paper Factorized Diffusion Models are Natural and Zero-shot Speech Synthesizers, aka NaturalSpeech 3, was accepted to ICML 2024 as an oral presentation!
- Nov 26, 2023 | 🔥 We released Amphion v0.1.
- Sep 20, 2023 | 🎉 My first paper on audio generation and editing, AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models, was accepted to NeurIPS 2023!
selected publications
- Preprint | Metis: A Foundation Speech Generation Model with Masked Generative Pre-training (2025). TL;DR: We propose a foundation speech generation model with masked generative pre-training.
- ICLR 2025 | MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer (2025). TL;DR: A fully non-autoregressive, large-scale zero-shot TTS model that eliminates the need for phone-level duration prediction.
- NeurIPS 2024 | SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words (2024). TL;DR: We propose a benchmark dataset to evaluate spoken dialogue understanding and generation.
- IEEE SLT 2024 | Amphion: An Open-Source Audio, Music, and Speech Generation Toolkit (2024). TL;DR: We develop a unified toolkit for audio, music, and speech generation.
- IEEE SLT 2024 | Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation (2024). TL;DR: We collect a 101k-hour in-the-wild speech dataset for speech generation.
internships
- Meta, GenAI | Research Scientist Intern · California, USA · 2025.05 – Present
- ByteDance | Research Intern · Shenzhen, China · 2024.05 – 2025.04. Built a benchmark dataset for spoken dialogue understanding; our work SD-Eval was accepted to NeurIPS 2024. (Speech Understanding, Speech Language Model)
- Microsoft Research Asia | Research Intern · Beijing, China · 2022.12 – 2023.06. Worked on audio generation & editing (AUDIT) and large-scale text-to-speech synthesis (NaturalSpeech 3). (Audio Generation & Editing, Speech Synthesis)