Sesame CSM: A Conversational Speech Generation Model
Published on: 2025-06-09 00:48:16
CSM
2025/03/13 - We are releasing the 1B CSM variant. The checkpoint is hosted on Hugging Face.
CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.
A fine-tuned variant of CSM powers the interactive voice demo shown in our blog post.
A hosted Hugging Face space is also available for testing audio generation.
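To make the term "RVQ audio codes" concrete, here is a toy sketch of residual vector quantization on 2-D vectors. It is an illustration only: the codebooks, their sizes, and the function names are made up for this example and are not taken from Mimi or from the CSM code. Each stage picks the nearest codebook entry and passes the leftover error (the residual) to the next stage, so one input becomes a short list of integer codes.

```python
# Toy residual vector quantization (RVQ).
# Assumption: codebook values and helper names are hypothetical, chosen for illustration;
# they are unrelated to the real Mimi codec.

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage encodes the residual left by the previous one."""
    codes = []
    residual = list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codebooks, codes):
    """Reconstruct an approximation by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

CODEBOOKS = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],      # coarse stage
    [(0.0, 0.0), (0.25, 0.0), (0.0, 0.25), (0.25, 0.25)],  # finer stage
]

codes = rvq_encode(CODEBOOKS, (1.2, 0.3))
approx = rvq_decode(CODEBOOKS, codes)
print(codes, approx)  # [1, 3] [1.25, 0.25]
```

Adding more stages shrinks the reconstruction error, which is why a codec can trade bitrate for quality by changing the number of codebooks.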
Requirements
A CUDA-compatible GPU
The code has been tested on CUDA 12.4 and 12.6, but it may also work on other versions
Similarly, Python 3.10 is recommended, but newer versions may be fine
For some audio operations, ffmpeg may be required
Access to the following Hugging Face models: Llama-3.2-1B and CSM-1B
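The local requirements above can be probed with a short script before installing anything. This is a minimal sketch, not part of the CSM repository: the function name `check_environment` is hypothetical, the CUDA check assumes PyTorch is already installed (it reports False otherwise), and access to the Hugging Face models is not checked here.

```python
import shutil
import sys

def check_environment(min_python=(3, 10)):
    """Return a dict mapping each prerequisite to True or False."""
    results = {
        # Python 3.10 is the recommended version; newer ones also pass this check.
        "python": sys.version_info[:2] >= min_python,
        # ffmpeg is only needed for some audio operations.
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }
    try:
        import torch  # assumption: present only once dependencies are installed
        results["cuda"] = bool(torch.cuda.is_available())
    except ImportError:
        results["cuda"] = False
    return results

report = check_environment()
for name, ok in report.items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

A "missing" entry for ffmpeg is not necessarily fatal, since it is only required for some audio operations.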
Setup
git clone git@github.com:SesameAILabs/csm.git
cd csm
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# You will need acces...