Building multimodal generative audio systems for more immersive listening experiences.

Focus: Multi-Modal Generative AI · Speech Synthesis · Spatial Audio

I am an undergraduate student in the Turing Class at Chu Kochen Honors College, Zhejiang University, majoring in Artificial Intelligence. I currently work in the Audio Research Team at Zhejiang University under the supervision of Prof. Zhou Zhao.

My research interest lies in Multi-Modal Generative AI, with a particular focus on Speech Synthesis and Spatial Audio Generation. My work aims to build immersive auditory experiences through advanced generative modeling (e.g., Flow Matching). I have papers published or under review at top-tier venues including NeurIPS, ACM MM, and ACL.

I am always open to potential collaborations and opportunities. Feel free to reach out.

🔥 News

2026.01
🎉 Started my internship at Luna Lab (宇生月伴) as a text-to-speech model researcher.
2025.12
🎉 Submitted two papers to ACL 2026.

📝 Publications

ACL 2026 (Under Review)

CSAVocoder: A Causal Spatial Audio Vocoder Towards Real-Time Spatial Audio Generation

#Zhiyuan Zhu, #Han Wang, et al.

  • We introduce CSAVocoder, a strictly causal streaming neural vocoder. It features a Spatial Adaptor to fuse pose information and a Spatial Consistency Discriminator to explicitly supervise inter-channel phase and level differences.
  • The model achieves high-fidelity waveform reconstruction while preserving precise spatial rendering, all within a constant memory overhead suitable for real-time streaming.
ACL 2026 (Under Review)

Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios

#Changhao Pan, #Rui Yang, #Han Wang, et al.

  • We propose LFS-Bench, a standardized benchmark decomposing “long-form quality” into acoustics, semantics, and expressiveness. It includes 1,101 samples spanning 17 diverse scenarios (e.g., dialogues, audiobooks).
  • Our extensive experiments reveal that current SOTA models still struggle significantly with consistency and hierarchy in highly expressive scenarios compared to real recordings.

A Multimodal Evaluation Framework for Spatial Audio Playback Systems: From Localization to Listener Preference

Changhao Pan, Wenxiang Guo, Yu Zhang, Zhiyuan Zhu, Zhetao Chen, Han Wang, Zhou Zhao. ACM MM 2025

MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations

Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Xintong Hu, Yu Zhang, Li Tang, Rui Yang, Han Wang, et al. NeurIPS 2025

🎖 Honors and Awards

  • 2024 First-class Scholarship, Zhejiang University
  • 2025 Second-class Scholarship, Zhejiang University
  • 2025 National Student Research Training Program

📖 Education

  • 2023.08 – Present Undergraduate, Chu Kochen Honors College, Zhejiang University

💻 Internships

  • 2024.04 – 2025.12 Research Assistant, Audio Research Team, Zhejiang University.
    Under the supervision of Prof. Zhou Zhao.
  • 2026.01 – Present MLE Intern (TTS), VUI Lab, Hangzhou.
    Mentored by Mengxiao Bi; under the supervision of Prof. Yanmin Qian.