Yuancheng Wang


Yuancheng Wang is a second-year Ph.D. student in the School of Data Science (SDS) at the Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), supervised by Professor Zhizheng Wu. Before that, he received his B.S. degree from CUHK-Shenzhen. He also collaborates with Xu Tan from Microsoft Research Asia.

His research interests include speech/audio generation & representation, large speech language models, and generative AI. He is one of the main contributors to and leaders of the open-source Amphion toolkit. He has developed several advanced TTS models, including NaturalSpeech 3 and MaskGCT.

news

May 17, 2025 🎉 Our paper, Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment, got accepted by ACL 2025 (main)!
Jan 25, 2025 🎉 MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer has been accepted to ICLR 2025!
Oct 28, 2024 🔥 We released code (2.5k+ stars in one week) and checkpoints of MaskGCT, which has been used in Quwan All Voice.
Sep 20, 2024 🎉 Our paper, SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words, got accepted by NeurIPS 2024!
Aug 25, 2024 🎉 Our papers, Amphion and Emilia, got accepted by IEEE SLT 2024!
Jul 28, 2024 🔥 We released Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation, with 101k hours of speech in six languages and varied speaking styles.
May 15, 2024 🎉 Our paper, Factorized Diffusion Models are Natural and Zero-shot Speech Synthesizers (aka NaturalSpeech 3), got accepted by ICML 2024 as an oral presentation!
Nov 26, 2023 🔥 We released Amphion v0.1, an open-source toolkit for audio, music, and speech generation.
Sep 20, 2023 🎉 My first paper on audio generation and editing, AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models, got accepted by NeurIPS 2023!

selected publications

  1. Preprint
    TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling
    Yuancheng Wang, Dekun Chen, Xueyao Zhang, Junan Zhang, Jiaqi Li, and Zhizheng Wu
    2025
    TL;DR: We propose a text-aware speech tokenizer with a single codebook and a frame rate of 6.25 Hz for speech language modeling.
  2. ACL 2025
    Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment
    Xueyao Zhang*, Yuancheng Wang*, Chaoren Wang, Ziniu Li, Zhuo Chen, and Zhizheng Wu
    2025
    TL;DR: We propose the INTP dataset and extend preference alignment to enhance the intelligibility and overall quality of TTS systems in challenging scenarios.
  3. Preprint
    Metis: A Foundation Speech Generation Model with Masked Generative Pre-training
    Yuancheng Wang, Jiachen Zheng, Junan Zhang, Xueyao Zhang, Huan Liao, and Zhizheng Wu
    2025
    TL;DR: We propose a foundation speech generation model with masked generative pre-training.
  4. ICLR 2025
    MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
    Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Jiachen Zheng, Qiang Zhang, Xueyao Zhang, Shunsi Zhang, and Zhizheng Wu
    2025
    TL;DR: A fully non-autoregressive, large-scale zero-shot TTS model that eliminates the need for phone-level duration prediction.
  5. ICML 2024 Oral
    NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
    Zeqian Ju*, Yuancheng Wang*, Kai Shen*, Xu Tan*, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, and Sheng Zhao
    2024
    TL;DR: A large-scale zero-shot TTS model that achieves quality on par with human recordings.
  6. NeurIPS 2024
    SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words
    Junyi Ao*, Yuancheng Wang*, Xiaohai Tian, Dekun Chen, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, and Zhizheng Wu
    2024
    TL;DR: We propose a benchmark dataset to evaluate spoken dialogue understanding and generation.
  7. NeurIPS 2023
    AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models
    Yuancheng Wang, Zeqian Ju, Xu Tan, Lei He, Zhizheng Wu, and Jiang Bian
    2023
    TL;DR: The first audio editing model that can follow natural language instructions.
  8. IEEE SLT 2024
    Amphion: an Open-Source Audio, Music, and Speech Generation Toolkit
    Xueyao Zhang*, Liumeng Xue*, Yicheng Gu*, Yuancheng Wang*, Jiaqi Li, Haorui He, Chaoren Wang, Songting Liu, Xi Chen, Junan Zhang, Tze Ying Tang, Lexiao Zou, Mingxuan Wang, Jun Han, Kai Chen, Haizhou Li, and Zhizheng Wu
    2024
    TL;DR: We develop a unified toolkit for audio, music, and speech generation.
  9. IEEE SLT 2024
    Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation
    Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, and Zhizheng Wu
    2024
    TL;DR: We collect a 101k-hour in-the-wild speech dataset for speech generation.

internships

Meta, GenAI
Research Scientist Intern | California, USA | 2025.05 - Present
ByteDance
Research Intern | Shenzhen, China | 2024.05 - 2025.04
Built a benchmark dataset for spoken dialogue understanding; this work, SD-Eval, was accepted to NeurIPS 2024.
Speech Understanding, Speech Language Model
Microsoft Research Asia
Research Intern | Beijing, China | 2022.12 - 2023.06
Worked on audio generation & editing (AUDIT) and large-scale text-to-speech synthesis (NaturalSpeech 3).
Audio Generation & Editing, Speech Synthesis