Building multimodal generative audio systems for more immersive listening experiences.

Focus: Multi-Modal Generative AI · Speech Synthesis · Spatial Audio

I am an undergraduate student in the Turing Class at Chu Kochen Honors College, Zhejiang University, majoring in Artificial Intelligence. I currently work in the Audio Research Team at Zhejiang University under the supervision of Prof. Zhou Zhao.

My research interests lie in Multi-Modal Generative AI, with a particular focus on Speech Synthesis and Spatial Audio Generation. My work aims to build immersive auditory experiences using generative modeling techniques such as Flow Matching. I have papers published or under review at top-tier venues including NeurIPS, ACM MM, and ACL.

I am always open to potential collaborations and opportunities. Feel free to reach out.

🔥 News

2026.01

🎉 Started my internship at Luna Lab (宇生月伴) as a text-to-speech model researcher.

2025.12

🎉 Submitted two papers to ACL 2026.

📝 Publications

ACL 2026 (Under Review)

CSAVocoder: A Causal Spatial Audio Vocoder Towards Real-Time Spatial Audio Generation

#Zhiyuan Zhu, #Han Wang, et al.

We introduce CSAVocoder, a strictly causal streaming neural vocoder. It features a Spatial Adaptor to fuse pose information and a Spatial Consistency Discriminator to explicitly supervise inter-channel phase and level differences.
The model achieves high-fidelity waveform reconstruction while preserving precise spatial rendering, with a constant memory overhead suitable for real-time streaming.

ACL 2026 (Under Review)

Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios

#Changhao Pan, #Rui Yang, #Han Wang, et al.

We propose LFS-Bench, a standardized benchmark decomposing “long-form quality” into acoustics, semantics, and expressiveness. It includes 1,101 samples spanning 17 diverse scenarios (e.g., dialogues, audiobooks).
Our extensive experiments reveal that current SOTA models still struggle significantly with consistency and hierarchy in highly expressive scenarios compared to real recordings.

ACM MM 2025

A Multimodal Evaluation Framework for Spatial Audio Playback Systems: From Localization to Listener Preference

Changhao Pan, Wenxiang Guo, Yu Zhang, Zhiyuan Zhu, Zhetao Chen, Han Wang, Zhou Zhao

NeurIPS 2025

MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations

Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Xintong Hu, Yu Zhang, Li Tang, Rui Yang, Han Wang, et al.

🎖 Honors and Awards

2024 First-Class Scholarship, Zhejiang University
2025 Second-Class Scholarship, Zhejiang University
2025 National Student Research Training Program

📖 Education

2023.08 – Present Undergraduate, Chu Kochen Honors College, Zhejiang University

💻 Internships

2024.04 – 2025.12 Research Assistant, Audio Research Team, Zhejiang University.
Under the supervision of Prof. Zhou Zhao.
2026.01 – Present MLE Intern (TTS), VUI Lab, Hangzhou.
Mentored by Mengxiao Bi; under the supervision of Prof. Yanmin Qian.