MuCodec: Ultra Low-Bitrate Music Codec for Music Generation

TL;DR

Music generation is pivotal in multimedia, aiding creation and lowering the creative threshold. It focuses on generating music with clear vocals and harmonious accompaniment based on lyrics, combining high artistic creativity with technical challenges. The music codec is an important bridging component in large language model-based music generation, connecting language models with the generated music. However, existing neural codecs typically require token rates exceeding 50 Hz to achieve acceptable music quality, resulting in a context length that surpasses 12,000 tokens for a 4-minute song—a scale that is computationally demanding. This highlights the need for high-compression, high-fidelity music codecs that can reconstruct both vocals and accompaniment with high quality at low frame rates and bitrates, thereby better assisting music generation. To address this, we introduce MuCodec, designed for high-quality music reconstruction at ultra-low bitrates, facilitating more efficient music generation. MuCodec employs a two-stage training method, enabling its encoder, MuEncoder, to extract semantic and acoustic features in a unified representation. These features are discretized using residual vector quantization and converted into Mel-VAE features through flow matching, with reconstruction quality improved by representation alignment during training. The Mel-VAE features are then reconstructed into music using a pre-trained Mel-VAE decoder and HiFi-GAN. To the best of our knowledge, MuCodec is the first codec capable of reconstructing 48kHz stereo music at an ultra-low bitrate of 0.35 kbps (25 Hz), achieving state-of-the-art performance in both subjective and objective evaluations, and can more effectively support music generation.
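
As a quick sanity check on the figures quoted above, the short Python sketch below reproduces the arithmetic: a 4-minute song at a token rate above 50 Hz already exceeds 12,000 tokens, while MuCodec's 25 Hz frame rate at 0.35 kbps works out to 6,000 frames and 14 bits per frame. The one-token-per-frame assumption for the 50 Hz baseline is ours, and MuCodec's exact RVQ codebook layout is not restated here.

```python
# Back-of-the-envelope numbers for the claims above.
# Assumption: one token per frame for the 50 Hz baseline; the RVQ layout
# behind MuCodec's 14 bits per frame is not spelled out in this sketch.

SONG_SECONDS = 4 * 60  # a 4-minute song

# Conventional neural codecs: token rates above 50 Hz.
baseline_token_rate_hz = 50
baseline_tokens = baseline_token_rate_hz * SONG_SECONDS
print(f"50 Hz baseline: {baseline_tokens} tokens for 4 minutes")  # 12000

# MuCodec's low-bitrate setting: 0.35 kbps at a 25 Hz frame rate.
mucodec_frame_rate_hz = 25
mucodec_bitrate_bps = 350
mucodec_frames = mucodec_frame_rate_hz * SONG_SECONDS
bits_per_frame = mucodec_bitrate_bps / mucodec_frame_rate_hz
print(f"MuCodec: {mucodec_frames} frames for 4 minutes, "
      f"{bits_per_frame:.0f} bits per frame")  # 6000 frames, 14 bits/frame
```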

MuCodec's overall process

Table of Contents


1. Audio Reconstruction

1.1 Music Reconstruction

1.2 Other Types of Audio Reconstruction


2. Aid for Music Generation

2.1 Generated Music Samples

Music Samples

1.1 Music Reconstruction

Here we provide music samples in several languages, including English, Chinese, and others. The samples are sourced from YouTube, and a source link is provided below each sample.

1.1.1 English Music

Sampled from YouTube; the source link is shown below each sample.

Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
Link Link Link Link Link
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
WavTokenizer (0.48kbps)
XCodec (0.50kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
WavTokenizer (0.90kbps)
MuCodec-proposed (1.33kbps)

1.1.2 Chinese Music

Sampled from YouTube; the source link is shown below each sample.

Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
Link Link Link Link Link
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
WavTokenizer (0.48kbps)
XCodec (0.50kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
WavTokenizer (0.90kbps)
MuCodec-proposed (1.33kbps)

1.1.3 Other Language Music

Sampled from YouTube; the source link is shown below each sample.

French Music Korean Music Japanese Music Indian Music
Link Link Link Link
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
WavTokenizer (0.48kbps)
XCodec (0.50kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
WavTokenizer (0.90kbps)
MuCodec-proposed (1.33kbps)

1.2 Other Types of Audio (Domain Transfer Capability)

Please note that MuCodec itself does not target other types of audio: no audio data other than music was used in training. These samples only demonstrate its domain transfer capability. In future work, we will focus on developing a universal audio codec at ultra-low bitrates.

1.2.1 Music Background

Sampled from YouTube; the source link is shown below each sample.

Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
Link Link Link Link Link
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
WavTokenizer (0.48kbps)
XCodec (0.50kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
WavTokenizer (0.90kbps)
MuCodec-proposed (1.33kbps)

1.2.2 Vocal

Sampled from the Opencpop dataset

Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
WavTokenizer (0.48kbps)
XCodec (0.50kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
WavTokenizer (0.90kbps)
MuCodec-proposed (1.33kbps)

1.2.3 Audio Event

Sampled from AudioSet

Sample 1 Sample 2 Sample 3
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
WavTokenizer (0.48kbps)
XCodec (0.50kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
WavTokenizer (0.90kbps)
MuCodec-proposed (1.33kbps)

1.2.4 Chinese Speech

Sampled from THCHS-30

Sample 1 Sample 2 Sample 3
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
WavTokenizer (0.48kbps)
XCodec (0.50kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
WavTokenizer (0.90kbps)
MuCodec-proposed (1.33kbps)

1.2.5 English Speech

Sampled from the LibriSpeech dataset

Sample 1 Sample 2 Sample 3
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
WavTokenizer (0.48kbps)
XCodec (0.50kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
WavTokenizer (0.90kbps)
MuCodec-proposed (1.33kbps)

2.1 Generated Music Samples

To evaluate MuCodec's role in LLM-based music generation, we trained a 1.5B-parameter LLaMA-style language model to predict discrete tokens from lyrics, then reconstructed the audio with the codec. For comparison, we also benchmarked two state-of-the-art codecs (XCodec and WavTokenizer, both single-codebook) by training language models of identical architecture on their discrete music tokens from the same dataset. Below we present three generated song clips, using identical lyric inputs and consistent sampling methods across all codecs.
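
The minimal sketch below illustrates the structure of that pipeline, assuming the Hugging Face transformers stack for the LLaMA-style language model. The checkpoint path, the sampling settings, and the decode_tokens_to_audio helper are hypothetical placeholders, not MuCodec's released API; it only shows the flow from a lyric prompt to discrete codec tokens and then to audio.

```python
# Minimal sketch (not the authors' released code) of the lyric-to-music flow
# described above: a LLaMA-style LM predicts discrete codec tokens from the
# lyric prompt, and the codec reconstructs the waveform from those tokens.
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

lyrics = "每个清晨 当我睁开眼 ..."  # lyric prompt (one of the examples below)

# "path/to/lyric2token-lm" is a hypothetical checkpoint path.
tok = AutoTokenizer.from_pretrained("path/to/lyric2token-lm")
lm = LlamaForCausalLM.from_pretrained("path/to/lyric2token-lm")

prompt_ids = tok(lyrics, return_tensors="pt").input_ids
with torch.no_grad():
    out = lm.generate(
        prompt_ids,
        max_new_tokens=25 * 30,   # roughly 30 s of audio at a 25 Hz token rate
        do_sample=True,           # sampling settings are illustrative only
        top_p=0.9,
        temperature=1.0,
    )
codec_tokens = out[0, prompt_ids.shape[-1]:]  # strip the lyric prompt

# The codec stage (flow matching -> Mel-VAE decoder -> HiFi-GAN in MuCodec)
# turns the tokens back into 48 kHz stereo audio; this helper is hypothetical.
waveform = decode_tokens_to_audio(codec_tokens)
```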

Lyric Codec Sample
风轻轻吹过古道 (The wind blows gently over the ancient road)
岁月在墙上刻下记号 (The years carve their marks on the wall)
梦中你笑得多甜 (In my dream your smile was so sweet)
醒来却只剩下寂寥 (Waking, only loneliness remains)
WavTokenizer (40Hz)
XCodec (50Hz)
MuCodec (25Hz)
哎呀 (Oh)
时光匆匆流转 (Time rushes by)
思绪纷飞不已 (My thoughts drift without end)
独自等待 (Waiting alone)
期盼着你的消息 (Longing for news of you)
天空突然洒下细雨 (A light rain suddenly falls from the sky)
湿润了我的心 (Moistening my heart)
无法逃避 (There is no escape)
脑海中全是你 (My mind is filled with you)
WavTokenizer (40Hz)
XCodec (50Hz)
MuCodec (25Hz)
每个清晨 (Every morning)
当我睁开眼 (When I open my eyes)
你的笑容就像第一缕阳光 (Your smile is like the first ray of sunlight)
温暖我心 (Warming my heart)
驱散夜的寒 (Dispelling the chill of night)
每个夜晚 (Every night)
当星星闪烁 (When the stars twinkle)
我和你 (You and I)
在梦中相遇 (Meet in our dreams)
WavTokenizer (40Hz)
XCodec (50Hz)
MuCodec (25Hz)