Yuancheng Wang


Yuancheng Wang is a second-year Ph.D. student in the School of Data Science (SDS) at the Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), supervised by Professor Zhizheng Wu. Before that, he received his B.S. degree from CUHK-Shenzhen. He also collaborates with Xu Tan from Microsoft Research Asia.

His research interests include text-to-speech synthesis, text-to-audio generation, and unified audio representation and generation. He is one of the main contributors to and leaders of the open-source Amphion toolkit. He has developed several advanced TTS models, including NaturalSpeech 3 and MaskGCT.

news

Jan 25, 2025 🎉 MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer has been accepted to ICLR 2025!
Oct 28, 2024 🔥 We released code (2.5k+ stars in one week) and checkpoints of MaskGCT, which has been used in Quwan All Voice.
Sep 20, 2024 🎉 Our paper, SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words, got accepted by NeurIPS 2024!
Aug 25, 2024 🎉 Our papers, Amphion and Emilia, got accepted by IEEE SLT 2024!
Jul 28, 2024 🔥 We released Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation, which contains 101k hours of speech in six languages with diverse speaking styles.
May 15, 2024 🎉 Our paper Factorized Diffusion Models are Natural and Zero-shot Speech Synthesizers, aka NaturalSpeech 3, got accepted by ICML 2024 as an Oral presentation!
Nov 26, 2023 🔥 We released Amphion v0.1, an open-source toolkit for audio, music, and speech generation.
Sep 20, 2023 🎉 My first paper on audio generation and editing, AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models, got accepted by NeurIPS 2023!

selected publications

  1. ICLR 2025
    MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
    Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Jiachen Zheng, Qiang Zhang, Xueyao Zhang, Shunsi Zhang, and Zhizheng Wu
    2025
TL;DR: A fully non-autoregressive large-scale zero-shot TTS model that eliminates the need for phone-level duration prediction.
  2. ICML 2024 Oral
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Zeqian Ju*, Yuancheng Wang*, Kai Shen*, Xu Tan*, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, and Sheng Zhao
    2024
TL;DR: A large-scale zero-shot TTS model that achieves quality on par with human recordings.
  3. NeurIPS 2024
    SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words
Junyi Ao*, Yuancheng Wang*, Xiaohai Tian, Dekun Chen, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, and Zhizheng Wu
    2024
    TL;DR: We propose a benchmark dataset to evaluate spoken dialogue understanding and generation.
  4. NeurIPS 2023
AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models
    Yuancheng Wang, Zeqian Ju, Xu Tan, Lei He, Zhizheng Wu, and Jiang Bian
    2023
    TL;DR: The first audio editing model that can follow natural language instructions.
  5. IEEE SLT 2024
Amphion: An Open-Source Audio, Music, and Speech Generation Toolkit
Xueyao Zhang*, Liumeng Xue*, Yicheng Gu*, Yuancheng Wang*, Jiaqi Li, Haorui He, Chaoren Wang, Songting Liu, Xi Chen, Junan Zhang, Tze Ying Tang, Lexiao Zou, Mingxuan Wang, Jun Han, Kai Chen, Haizhou Li, and Zhizheng Wu
    2024
    TL;DR: We develop a unified toolkit for audio, music, and speech generation.
  6. IEEE SLT 2024
    Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation
    Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, and Zhizheng Wu
    2024
TL;DR: We collect a 100k+ hour in-the-wild speech dataset for speech generation.

internships

ByteDance
Research Intern · Shenzhen, China · 2024.05 - Present
Speech Understanding
Microsoft Research Asia
Research Intern · Beijing, China · 2022.12 - 2023.06
Worked on audio generation & editing and large-scale text-to-speech synthesis.
Audio Generation & Editing · Speech Synthesis