Sesame CSM: A Conversational Speech Generation Model

Published on: 2025-06-09 00:48:16

CSM

2025/03/13 - We are releasing the 1B CSM variant. The checkpoint is hosted on Hugging Face.

CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The architecture pairs a Llama backbone with a smaller audio decoder that produces Mimi audio codes.

A fine-tuned variant of CSM powers the interactive voice demo shown in our blog post. A hosted Hugging Face space is also available for testing audio generation.

Requirements

- A CUDA-compatible GPU
- The code has been tested on CUDA 12.4 and 12.6, but it may also work on other versions
- Similarly, Python 3.10 is recommended, but newer versions may be fine
- For some audio operations, ffmpeg may be required
- Access to the following Hugging Face models:
  - Llama-3.2-1B
  - CSM-1B

Setup

    git clone git@github.com:SesameAILabs/csm.git
    cd csm
    python3.10 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt

    # You will need acces ...
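The requirements listed above can be checked with a short script before installing anything. This is a minimal sketch and is not part of the CSM repository. The function name `check_environment` is hypothetical. Probing CUDA through `torch.cuda.is_available()` is an assumption about how one might test for a usable GPU, and that check reports "unknown" if PyTorch is not installed yet.

```python
import shutil
import sys


def check_environment():
    """Return a dict mapping requirement name -> True, False, or None (unknown)."""
    results = {}

    # Python 3.10 is the recommended version; newer versions may also work.
    results["python>=3.10"] = sys.version_info >= (3, 10)

    # ffmpeg may be required for some audio operations, so look for it on PATH.
    results["ffmpeg"] = shutil.which("ffmpeg") is not None

    # CUDA can only be probed this way once PyTorch is installed (assumption:
    # PyTorch is the framework in use); otherwise the result stays unknown.
    try:
        import torch
        results["cuda"] = bool(torch.cuda.is_available())
    except ImportError:
        results["cuda"] = None

    return results


if __name__ == "__main__":
    labels = {True: "ok", False: "missing", None: "unknown"}
    for name, ok in check_environment().items():
        print(f"{name}: {labels[ok]}")
```

Running the script before `pip install -r requirements.txt` reports the CUDA check as unknown. Running it again inside the activated virtual environment after installation gives a definite answer, assuming the requirements install PyTorch.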