Music generation is pivotal in multimedia: it aids creation and lowers the creative threshold. Lyrics-conditioned music generation aims to produce clear vocals with harmonious accompaniment, a task that combines high artistic creativity with technical challenges. The music codec is an important bridging component in large language model-based music generation, connecting the language model to the generated music. However, existing neural codecs typically require token rates exceeding 50 Hz to achieve acceptable music quality, so a 4-minute song (240 s) needs a context of more than 12,000 tokens, a scale that is computationally demanding. This highlights the need for high-compression, high-fidelity music codecs that can reconstruct both vocals and accompaniment with high quality at low frame rates and bitrates, and thereby better support music generation. To address this, we introduce MuCodec, designed for high-quality music reconstruction at ultra-low bitrates, facilitating more efficient music generation. MuCodec employs a two-stage training method that enables its encoder, MuEncoder, to extract semantic and acoustic features in a unified representation. These features are discretized using residual vector quantization and converted into Mel-VAE features through flow matching, with reconstruction quality improved by representation alignment during training. The Mel-VAE features are then reconstructed into music using a pre-trained Mel-VAE decoder and HiFi-GAN. To the best of our knowledge, MuCodec is the first codec capable of reconstructing 48 kHz stereo music at an ultra-low bitrate of 0.35 kbps (25 Hz). It achieves state-of-the-art performance in both subjective and objective evaluations and supports music generation more effectively.
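To make the discretization step more concrete, the following is a minimal sketch of residual vector quantization (RVQ) in plain Python, not MuCodec's actual implementation. Each stage picks the nearest codebook entry and passes the leftover residual to the next stage. The toy codebooks, the function names, and the codebook size of 16,384 entries are all assumptions for illustration. The size is chosen only because one such codebook at 25 Hz works out to 0.35 kbps; the source does not state MuCodec's quantizer configuration.

```python
import math


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the vector by summing the selected entry of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def bitrate_bps(frame_rate_hz, num_codebooks, codebook_size):
    """Bits per second: frames/s x codebooks/frame x bits per codebook index."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)


# Hypothetical toy codebooks: a coarse stage followed by a finer stage.
TOY_CODEBOOKS = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, 0.25]],
]

if __name__ == "__main__":
    codes = rvq_encode([1.2, 0.3], TOY_CODEBOOKS)
    print(codes, rvq_decode(codes, TOY_CODEBOOKS))
    # Assumed configuration: one codebook of 16,384 entries at 25 Hz.
    print(bitrate_bps(25, 1, 16384), "bps")
```

Under these assumptions, the bitrate scales linearly with both the frame rate and the number of residual stages, which is why lowering the frame rate is an effective way to reach ultra-low bitrates.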
Sampled from YouTube; the link is shown below each sample.
| | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |
|---|---|---|---|---|---|
| | Link | Link | Link | Link | Link |
| Original Audio | | | | | |
| Low-bitrate scenario (0.35kbps) | | | | | |
| GAN-based (0.35kbps) | | | | | |
| SemantiCodec (0.375kbps) | | | | | |
| WavTokenizer (0.48kbps) | | | | | |
| XCodec (0.50kbps) | | | | | |
| MuCodec-proposed (0.35kbps) | | | | | |
| High-bitrate scenario (1.33kbps) | | | | | |
| GAN-based (1.33kbps) | | | | | |
| SemantiCodec (1.40kbps) | | | | | |
| WavTokenizer (0.90kbps) | | | | | |
| MuCodec-proposed (1.33kbps) | | | | | |
Sampled from YouTube; the link is shown below each sample.
| | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |
|---|---|---|---|---|---|
| | Link | Link | Link | Link | Link |
| Original Audio | | | | | |
| Low-bitrate scenario (0.35kbps) | | | | | |
| GAN-based (0.35kbps) | | | | | |
| SemantiCodec (0.375kbps) | | | | | |
| WavTokenizer (0.48kbps) | | | | | |
| XCodec (0.50kbps) | | | | | |
| MuCodec-proposed (0.35kbps) | | | | | |
| High-bitrate scenario (1.33kbps) | | | | | |
| GAN-based (1.33kbps) | | | | | |
| SemantiCodec (1.40kbps) | | | | | |
| WavTokenizer (0.90kbps) | | | | | |
| MuCodec-proposed (1.33kbps) | | | | | |
Sampled from YouTube; the link is shown below each sample.
| | French Music | Korean Music | Japanese Music | Indian Music |
|---|---|---|---|---|
| | Link | Link | Link | Link |
| Original Audio | | | | |
| Low-bitrate scenario (0.35kbps) | | | | |
| GAN-based (0.35kbps) | | | | |
| SemantiCodec (0.375kbps) | | | | |
| WavTokenizer (0.48kbps) | | | | |
| XCodec (0.50kbps) | | | | |
| MuCodec-proposed (0.35kbps) | | | | |
| High-bitrate scenario (1.33kbps) | | | | |
| GAN-based (1.33kbps) | | | | |
| SemantiCodec (1.40kbps) | | | | |
| WavTokenizer (0.90kbps) | | | | |
| MuCodec-proposed (1.33kbps) | | | | |
Sampled from YouTube; the link is shown below each sample.
| | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |
|---|---|---|---|---|---|
| | Link | Link | Link | Link | Link |
| Original Audio | | | | | |
| Low-bitrate scenario (0.35kbps) | | | | | |
| GAN-based (0.35kbps) | | | | | |
| SemantiCodec (0.375kbps) | | | | | |
| WavTokenizer (0.48kbps) | | | | | |
| XCodec (0.50kbps) | | | | | |
| MuCodec-proposed (0.35kbps) | | | | | |
| High-bitrate scenario (1.33kbps) | | | | | |
| GAN-based (1.33kbps) | | | | | |
| SemantiCodec (1.40kbps) | | | | | |
| WavTokenizer (0.90kbps) | | | | | |
| MuCodec-proposed (1.33kbps) | | | | | |
Sampled from the Opencpop dataset
| | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |
|---|---|---|---|---|---|
| Original Audio | | | | | |
| Low-bitrate scenario (0.35kbps) | | | | | |
| GAN-based (0.35kbps) | | | | | |
| SemantiCodec (0.375kbps) | | | | | |
| WavTokenizer (0.48kbps) | | | | | |
| XCodec (0.50kbps) | | | | | |
| MuCodec-proposed (0.35kbps) | | | | | |
| High-bitrate scenario (1.33kbps) | | | | | |
| GAN-based (1.33kbps) | | | | | |
| SemantiCodec (1.40kbps) | | | | | |
| WavTokenizer (0.90kbps) | | | | | |
| MuCodec-proposed (1.33kbps) | | | | | |
Sampled from AudioSet
| | Sample 1 | Sample 2 | Sample 3 | |
|---|---|---|---|---|
| Original Audio | | | | |
| Low-bitrate scenario (0.35kbps) | | | | |
| GAN-based (0.35kbps) | | | | |
| SemantiCodec (0.375kbps) | | | | |
| WavTokenizer (0.48kbps) | | | | |
| XCodec (0.50kbps) | | | | |
| MuCodec-proposed (0.35kbps) | | | | |
| High-bitrate scenario (1.33kbps) | | | | |
| GAN-based (1.33kbps) | | | | |
| SemantiCodec (1.40kbps) | | | | |
| WavTokenizer (0.90kbps) | | | | |
| MuCodec-proposed (1.33kbps) | | | | |
Sampled from THCHS-30
| | Sample 1 | Sample 2 | Sample 3 | |
|---|---|---|---|---|
| Original Audio | | | | |
| Low-bitrate scenario (0.35kbps) | | | | |
| GAN-based (0.35kbps) | | | | |
| SemantiCodec (0.375kbps) | | | | |
| WavTokenizer (0.48kbps) | | | | |
| XCodec (0.50kbps) | | | | |
| MuCodec-proposed (0.35kbps) | | | | |
| High-bitrate scenario (1.33kbps) | | | | |
| GAN-based (1.33kbps) | | | | |
| SemantiCodec (1.40kbps) | | | | |
| WavTokenizer (0.90kbps) | | | | |
| MuCodec-proposed (1.33kbps) | | | | |
Sampled from LibriSpeech
| | Sample 1 | Sample 2 | Sample 3 | |
|---|---|---|---|---|
| Original Audio | | | | |
| Low-bitrate scenario (0.35kbps) | | | | |
| GAN-based (0.35kbps) | | | | |
| SemantiCodec (0.375kbps) | | | | |
| WavTokenizer (0.48kbps) | | | | |
| XCodec (0.50kbps) | | | | |
| MuCodec-proposed (0.35kbps) | | | | |
| High-bitrate scenario (1.33kbps) | | | | |
| GAN-based (1.33kbps) | | | | |
| SemantiCodec (1.40kbps) | | | | |
| WavTokenizer (0.90kbps) | | | | |
| MuCodec-proposed (1.33kbps) | | | | |
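| Lyric | Codec | Sample |
|---|---|---|
| 风轻轻吹过古道 岁月在墙上刻下记号 梦中你笑得多甜 醒来却只剩下寂寥 (The wind blows gently over the old road; the years have carved their marks on the wall. In my dream you smiled so sweetly; on waking, only loneliness remains.) | WavTokenizer (40Hz) | |
| | XCodec (50Hz) | |
| | MuCodec (25Hz) | |
| 哎呀 时光匆匆流转 思绪纷飞不已 独自等待 期盼着你的消息 天空突然洒下细雨 湿润了我的心 无法逃避 脑海中全是你 (Ah, time rushes by and my thoughts keep swirling. Waiting alone, longing for word from you. The sky suddenly sheds a fine rain that dampens my heart. There is no escape; my mind is full of you.) | WavTokenizer (40Hz) | |
| | XCodec (50Hz) | |
| | MuCodec (25Hz) | |
| 每个清晨 当我睁开眼 你的笑容就像第一缕阳光 温暖我心 驱散夜的寒 每个夜晚 当星星闪烁 我和你 在梦中相遇 (Every morning when I open my eyes, your smile is like the first ray of sunlight, warming my heart and driving away the night's chill. Every night when the stars twinkle, you and I meet in a dream.) | WavTokenizer (40Hz) | |
| | XCodec (50Hz) | |
| | MuCodec (25Hz) | |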
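
The frame rates listed for each codec above determine how many tokens a language model has to generate per song. The short sketch below works through that arithmetic for a 4-minute song, which matches the abstract's figure of 12,000 tokens at 50 Hz. It assumes one token per frame, which may not hold for codecs that emit several codebook indices per frame; the function and constant names are illustrative only.

```python
# Frame rates as listed in the table above.
FRAME_RATES_HZ = {"WavTokenizer": 40, "XCodec": 50, "MuCodec": 25}

SONG_SECONDS = 4 * 60  # the 4-minute song used as the example in the abstract


def tokens_per_song(frame_rate_hz, duration_s, tokens_per_frame=1):
    """Sequence length = frames per second x seconds x tokens per frame.

    tokens_per_frame=1 is an assumption, not a property of any listed codec.
    """
    return round(frame_rate_hz * duration_s * tokens_per_frame)


def context_lengths(duration_s=SONG_SECONDS):
    """Token count per codec for a song of the given duration."""
    return {
        name: tokens_per_song(rate, duration_s)
        for name, rate in FRAME_RATES_HZ.items()
    }


if __name__ == "__main__":
    for name, count in context_lengths().items():
        print(f"{name}: {count} tokens")
```

Under the one-token-per-frame assumption, halving the frame rate from 50 Hz to 25 Hz halves the sequence length the language model must handle for the same song.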