MusicLM, a new model from Google Research, generates high-fidelity music from text descriptions. It outperforms previous systems in both audio quality and adherence to the text description, and can be conditioned on a melody as well as text. To support future research, the team has publicly released MusicCaps, a dataset of 5.5k music-text pairs with rich descriptions written by human experts.
- High-fidelity music generation: MusicLM generates high-quality music at 24 kHz from text descriptions such as "a calming violin melody backed by a distorted guitar riff".
- Hierarchical sequence-to-sequence modeling: The model casts conditional music generation as a hierarchical sequence-to-sequence modeling task, producing coarse semantic tokens first and fine acoustic tokens second (see the first sketch after this list).
- Consistency over several minutes: MusicLM generates music that remains consistent over several minutes.
- Outperforms previous systems: The model outperforms previous systems in both audio quality and adherence to the text description.
- Conditioned on both text and melody: MusicLM can also be conditioned on a melody, transforming whistled or hummed tunes into the style described in a text caption (see the second sketch below).
- Publicly released dataset: MusicCaps, a dataset of 5.5k music-text pairs with rich descriptions written by human experts, is publicly released to support future research (a loading example follows the sketches below).
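
The hierarchical modeling is easiest to see as a two-stage pipeline: a first sequence-to-sequence stage maps the conditioning signal to coarse semantic tokens that capture long-term structure, and a second stage expands those into fine acoustic tokens that a neural codec decodes to audio. The sketch below is a minimal stand-in, not Google's code: every function is a hypothetical placeholder for the real components (MuLan for text conditioning, w2v-BERT for semantic tokens, SoundStream for acoustic tokens), and the token rates are illustrative.

```python
# Minimal, illustrative sketch of MusicLM's hierarchical pipeline.
# All components are hypothetical stand-ins for the real models.
import numpy as np

rng = np.random.default_rng(0)

def text_to_conditioning_tokens(caption: str) -> np.ndarray:
    """Stand-in for MuLan: map a caption to a short sequence of
    conditioning tokens shared between text and music."""
    return rng.integers(0, 1024, size=12)

def semantic_stage(conditioning: np.ndarray, seconds: int) -> np.ndarray:
    """First stage: generate coarse *semantic* tokens (long-term
    structure) from the conditioning tokens. Token rate is illustrative."""
    return rng.integers(0, 1024, size=25 * seconds)

def acoustic_stage(conditioning: np.ndarray,
                   semantic: np.ndarray,
                   seconds: int) -> np.ndarray:
    """Second stage: generate fine *acoustic* tokens conditioned on both
    the conditioning and the semantic tokens; a SoundStream-style decoder
    would turn these into 24 kHz audio. Token rate is illustrative."""
    return rng.integers(0, 1024, size=600 * seconds)

def generate(caption: str, seconds: int = 10) -> np.ndarray:
    cond = text_to_conditioning_tokens(caption)
    semantic = semantic_stage(cond, seconds)
    return acoustic_stage(cond, semantic, seconds)

tokens = generate("a calming violin melody backed by a distorted guitar riff")
print(tokens.shape)  # acoustic tokens, ready for a codec decoder
```

Generating a short semantic sequence before the much longer acoustic one is what lets the model stay coherent over minutes: long-range structure is decided at the cheap, coarse level.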
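For melody conditioning, the paper describes a melody-embedding model trained so that different renditions of the same tune (sung, whistled, played) map to similar embeddings, which are quantized and combined with the text conditioning. The sketch below illustrates only that wiring; every function and parameter is a hypothetical placeholder, not the released method.

```python
# Hedged sketch of joint text + melody conditioning: melody tokens from a
# hummed input are concatenated with text tokens into one conditioning
# sequence. All functions here are invented stand-ins for illustration.
import numpy as np

rng = np.random.default_rng(1)

def melody_to_tokens(audio: np.ndarray, hop: int = 1600) -> np.ndarray:
    """Stand-in for the melody embedding model: a per-frame pitch-contour
    representation, quantized so the same tune yields similar tokens
    regardless of instrumentation."""
    n_frames = max(1, len(audio) // hop)
    return rng.integers(0, 512, size=n_frames)

def caption_to_tokens(caption: str) -> np.ndarray:
    """Stand-in for MuLan text conditioning."""
    return rng.integers(0, 1024, size=12)

def joint_conditioning(audio: np.ndarray, caption: str) -> np.ndarray:
    """One conditioning sequence the hierarchical stages would attend to:
    melody tokens fix the tune, text tokens fix the style."""
    return np.concatenate([melody_to_tokens(audio), caption_to_tokens(caption)])

hummed = rng.standard_normal(16000 * 5)  # 5 s of fake hummed audio
cond = joint_conditioning(hummed, "a cappella chorus in a cathedral")
print(cond.shape)
```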
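MusicCaps can be explored with standard tooling. The snippet below assumes the dataset is mirrored on the Hugging Face Hub under the id `google/MusicCaps` and that its fields include `ytid`, `start_s`, `end_s`, and `caption`; both the id and the schema are assumptions to verify against the official release.

```python
# Hedged example: inspect MusicCaps with the Hugging Face `datasets` library.
# Dataset id and field names are assumptions about the public release.
from datasets import load_dataset

ds = load_dataset("google/MusicCaps", split="train")  # id assumed
row = ds[0]
print(row["ytid"], row["start_s"], row["end_s"])  # source clip location (fields assumed)
print(row["caption"])                             # expert-written description
```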