MuCodec: Ultra Low-Bitrate Music Codec for Music Generation

TL;DR

Music generation is pivotal in multimedia, aiding creation and lowering the creative threshold. It focuses on generating music with clear vocals and harmonious accompaniment based on lyrics, combining high artistic creativity with technical challenges. The music codec is an important bridging component in large language model-based music generation, connecting language models with the generated music. However, existing neural codecs typically require token rates exceeding 50 Hz to achieve acceptable music quality, resulting in a context length that surpasses 12,000 tokens for a 4-minute song—a scale that is computationally demanding. This highlights the need for high-compression, high-fidelity music codecs that can reconstruct both vocals and accompaniment with high quality at low frame rates and bitrates, thereby better assisting music generation. To address this, we introduce MuCodec, designed for high-quality music reconstruction at ultra-low bitrates, facilitating more efficient music generation. MuCodec employs a two-stage training method, enabling its encoder, MuEncoder, to extract semantic and acoustic features in a unified representation. These features are discretized using residual vector quantization and converted into Mel-VAE features through flow matching, with reconstruction quality improved by representation alignment during training. The Mel-VAE features are then reconstructed into music using a pre-trained Mel-VAE decoder and HiFi-GAN. To the best of our knowledge, MuCodec is the first codec capable of reconstructing 48kHz stereo music at an ultra-low bitrate of 0.35 kbps (25 Hz), achieving state-of-the-art performance in both subjective and objective evaluations, and can more effectively support music generation.
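
As a quick sanity check on the figures quoted above, the short Python sketch below reproduces the arithmetic: a 4-minute song at a token rate above 50 Hz already exceeds 12,000 tokens, while MuCodec's 25 Hz frame rate at 0.35 kbps works out to 6,000 frames and 14 bits per frame. The one-token-per-frame assumption for the 50 Hz baseline is ours, and MuCodec's exact RVQ codebook layout is not restated here.

```python
# Back-of-the-envelope numbers for the claims above.
# Assumption: one token per frame for the 50 Hz baseline; the RVQ layout
# behind MuCodec's 14 bits per frame is not spelled out in this sketch.

SONG_SECONDS = 4 * 60  # a 4-minute song

# Conventional neural codecs: token rates above 50 Hz.
baseline_token_rate_hz = 50
baseline_tokens = baseline_token_rate_hz * SONG_SECONDS
print(f"50 Hz baseline: {baseline_tokens} tokens for 4 minutes")  # 12000

# MuCodec's low-bitrate setting: 0.35 kbps at a 25 Hz frame rate.
mucodec_frame_rate_hz = 25
mucodec_bitrate_bps = 350
mucodec_frames = mucodec_frame_rate_hz * SONG_SECONDS
bits_per_frame = mucodec_bitrate_bps / mucodec_frame_rate_hz
print(f"MuCodec: {mucodec_frames} frames for 4 minutes, "
      f"{bits_per_frame:.0f} bits per frame")  # 6000 frames, 14 bits/frame
```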

MuCodec's overall process

Table of Contents


1. Audio Reconstruction

1.1 Music Reconstruction

1.2 Other Types of Audio Reconstruction


2. Aid for Music Generation

2.1 Generated Music Samples

Music Samples

1.1 Music Reconstruction

Here we provide music samples in several languages, including English, Chinese, and others. The samples are sourced from YouTube, and a source link is provided below each sample.

1.1.1 English Music

Sampled from YouTube; the source link is shown below each sample.

Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
Link Link Link Link Link
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
WavTokenizer (0.48kbps)
XCodec (0.50kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
WavTokenizer (0.90kbps)
MuCodec-proposed (1.33kbps)

1.1.2 Chinese Music

Sampled from YouTube; the source link is shown below each sample.

Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
Link Link Link Link Link
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
WavTokenizer (0.48kbps)
XCodec (0.50kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
WavTokenizer (0.90kbps)
MuCodec-proposed (1.33kbps)

1.1.3 Other Language Music

Sampled from YouTube; the source link is shown below each sample.

French Music Korean Music Japanese Music Indian Music
Link Link Link Link
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
WavTokenizer (0.48kbps)
XCodec (0.50kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
WavTokenizer (0.90kbps)
MuCodec-proposed (1.33kbps)

1.2 Other Types of Audio (Domain Transfer Capability)

Please note that MuCodec itself does not target other types of audio: no audio data other than music was used in training. These samples only demonstrate its domain transfer capability. In future work, we will focus on developing a universal audio codec at ultra-low bitrates.

1.2.1 Music Background

Sampled from YouTube; the source link is shown below each sample.

Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
Link Link Link Link Link
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
WavTokenizer (0.48kbps)
XCodec (0.50kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
WavTokenizer (0.90kbps)
MuCodec-proposed (1.33kbps)

1.2.2 Vocal

Sampled from the Opencpop dataset

Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
WavTokenizer (0.48kbps)
XCodec (0.50kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
WavTokenizer (0.90kbps)
MuCodec-proposed (1.33kbps)

1.2.3 Audio Event

Sampled from AudioSet

Sample 1 Sample 2 Sample 3
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
WavTokenizer (0.48kbps)
XCodec (0.50kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
WavTokenizer (0.90kbps)
MuCodec-proposed (1.33kbps)

1.2.4 Chinese Speech

Sampled from THCHS-30

Sample 1 Sample 2 Sample 3
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
WavTokenizer (0.48kbps)
XCodec (0.50kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
WavTokenizer (0.90kbps)
MuCodec-proposed (1.33kbps)

1.2.5 English Speech

Sampled from the LibriSpeech dataset

Sample 1 Sample 2 Sample 3
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
WavTokenizer (0.48kbps)
XCodec (0.50kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
WavTokenizer (0.90kbps)
MuCodec-proposed (1.33kbps)

2.1 Generated Music Samples

To evaluate MuCodec's role in LLM-based music generation, we trained a 1.5B-parameter LLaMA-style language model to predict discrete tokens from lyrics, then reconstructed the audio with the codec. For comparison, we also benchmarked two state-of-the-art codecs (XCodec and WavTokenizer, both single-codebook) by training language models of identical architecture on their discrete music tokens from the same dataset. Below we present three generated song clips, using identical lyric inputs and consistent sampling methods across all codecs.
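
The minimal sketch below illustrates the structure of that pipeline, assuming the Hugging Face transformers stack for the LLaMA-style language model. The checkpoint path, the sampling settings, and the decode_tokens_to_audio helper are hypothetical placeholders, not MuCodec's released API; it only shows the flow from a lyric prompt to discrete codec tokens and then to audio.

```python
# Minimal sketch (not the authors' released code) of the lyric-to-music flow
# described above: a LLaMA-style LM predicts discrete codec tokens from the
# lyric prompt, and the codec reconstructs the waveform from those tokens.
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

lyrics = "每个清晨 当我睁开眼 ..."  # lyric prompt (one of the examples below)

# "path/to/lyric2token-lm" is a hypothetical checkpoint path.
tok = AutoTokenizer.from_pretrained("path/to/lyric2token-lm")
lm = LlamaForCausalLM.from_pretrained("path/to/lyric2token-lm")

prompt_ids = tok(lyrics, return_tensors="pt").input_ids
with torch.no_grad():
    out = lm.generate(
        prompt_ids,
        max_new_tokens=25 * 30,   # roughly 30 s of audio at a 25 Hz token rate
        do_sample=True,           # sampling settings are illustrative only
        top_p=0.9,
        temperature=1.0,
    )
codec_tokens = out[0, prompt_ids.shape[-1]:]  # strip the lyric prompt

# The codec stage (flow matching -> Mel-VAE decoder -> HiFi-GAN in MuCodec)
# turns the tokens back into 48 kHz stereo audio; this helper is hypothetical.
waveform = decode_tokens_to_audio(codec_tokens)
```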

Lyric Codec Sample
风轻轻吹过古道 (The wind blows gently over the ancient road)
岁月在墙上刻下记号 (The years carve their marks on the wall)
梦中你笑得多甜 (In my dream your smile was so sweet)
醒来却只剩下寂寥 (Waking, only loneliness remains)
WavTokenizer (40Hz)
XCodec (50Hz)
MuCodec (25Hz)
哎呀 (Oh)
时光匆匆流转 (Time rushes by)
思绪纷飞不已 (My thoughts drift without end)
独自等待 (Waiting alone)
期盼着你的消息 (Longing for news of you)
天空突然洒下细雨 (A light rain suddenly falls from the sky)
湿润了我的心 (Moistening my heart)
无法逃避 (There is no escape)
脑海中全是你 (My mind is filled with you)
WavTokenizer (40Hz)
XCodec (50Hz)
MuCodec (25Hz)
每个清晨 (Every morning)
当我睁开眼 (When I open my eyes)
你的笑容就像第一缕阳光 (Your smile is like the first ray of sunlight)
温暖我心 (Warming my heart)
驱散夜的寒 (Dispelling the chill of night)
每个夜晚 (Every night)
当星星闪烁 (When the stars twinkle)
我和你 (You and I)
在梦中相遇 (Meet in our dreams)
WavTokenizer (40Hz)
XCodec (50Hz)
MuCodec (25Hz)