Over the last 6 months, we collected ~10k hours of data across thousands of unique individuals. As far as we know, this is the largest neuro-language dataset in the world. See here, here, here, here, and here (discussion only, no data available) for some of the larger datasets. See recent papers discussing the problem of small datasets here, here, and here. Why did we do this? We train thought-to-text models. That is, we train models to decode semantic content from noninvasive neural data. Here are some entirely zero-shot examples:
The neural data is taken from the seconds leading up to but not including the time when the subject typed or spoke, meaning that the model detects an idea before the subject even compiles that idea down into words.
| Ground truth | Model prediction (based ONLY on neural data) |
| --- | --- |
| the room seemed colder | there was a breeze even a gentle gust |
| do you have a favorite app or website | do you have any favorite robot |
| then she smiled faintly and nodded | she shrugged, hoping to look indifferent. |

All examples are zero-shot to new subjects, whom the model has never seen before.
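To make the timing concrete: for each typed or spoken phrase, the decoder only sees a short slice of neural signal that ends right before the phrase begins. Below is a minimal sketch of that slicing; the window length, sampling rate, and array layout are illustrative assumptions, not our actual pipeline.

```python
import numpy as np

def pre_onset_window(neural: np.ndarray, sample_rate_hz: float,
                     onset_s: float, window_s: float = 2.0) -> np.ndarray:
    """Slice the seconds leading up to, but not including, the moment the
    participant starts typing or speaking."""
    end = int(onset_s * sample_rate_hz)                   # onset sample (excluded)
    start = max(0, end - int(window_s * sample_rate_hz))  # window_s seconds earlier
    return neural[:, start:end]                           # (channels, window samples)

# Hypothetical usage: a 2 s window before a phrase onset at t = 14.3 s,
# with 64 channels sampled at 500 Hz (both made-up numbers).
recording = np.random.randn(64, 30 * 500)
window = pre_onset_window(recording, sample_rate_hz=500, onset_s=14.3)
```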
We'll write about the model in a future post. But before you can train a model that generalizes to new people, you need many thousands of hours of data. When we started, the existing datasets were either inapplicable or tiny: most were in the low hundreds of hours (if that), with tens or, at a stretch, hundreds of subjects.
So we got thousands of people to come wear headsets in our basement. This post is about how we collected our dataset—what participants do, the hardware and software involved, and what we learned about operations and ML when we scaled it up.
What participants actually do
A participant comes in, signs a consent form, and sits down in a booth. A session manager fits a headset onto them and starts the session. Then, the participant has a freeform conversation with an LLM for two hours.
Sessions vary. Some are listening and speaking with an LLM, and some are reading and typing. We use Deepgram for audio transcription, OSS120B on Cerebras for the LLM responses, and ElevenLabs for voicing certain replies. In the past, we used various Gemma and Llama models on Groq. The goal is to maximize the amount that subjects type or say during the two-hour period, without constraining the topics they discuss. In the beginning, we included tasks like 'retype this sentence', or 'paraphrase this but use this different tone'. Over time, we eliminated these and replaced them with more freeform conversation. We still include a few baseline tasks for calibration and easy model evals. Each session produces multimodal neural data time-aligned with text and audio.
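For a rough picture of how a session turn hangs together, here is a sketch of the loop and the time-aligned logging. `transcribe`, `respond`, and `speak` are placeholder callables standing in for the transcription, LLM, and text-to-speech services named above, and the log format is an assumption, not our actual schema.

```python
import time
from dataclasses import dataclass, field

@dataclass
class SessionLog:
    """Text and audio events stamped with session-relative times, so they can
    later be aligned against the continuously recorded neural stream."""
    session_start: float = field(default_factory=time.time)
    events: list = field(default_factory=list)

    def record(self, kind: str, text: str) -> None:
        # Timestamps share a clock with the neural recording.
        self.events.append({"t": time.time() - self.session_start,
                            "kind": kind, "text": text})

def run_turn(log: SessionLog, transcribe, respond, speak) -> None:
    """One conversational turn: hear the participant, log it, reply, log the reply."""
    user_text = transcribe()        # speech-to-text (or keystrokes, in typing sessions)
    log.record("user", user_text)
    reply = respond(user_text)      # LLM keeps the freeform conversation going
    log.record("assistant", reply)
    speak(reply)                    # voice the reply back to the participant
```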
Participants have to touch-type without looking at the keyboard. In the beginning, participants would occasionally press a crazy key combination that crashed or closed the software. We could have fixed this in the code, but that would've taken time—so instead we 'simplified' the keyboards.
What your participants type—and whether it's remotely coherent—is a more difficult problem. We implemented a token quantity/quality scoring system that determines if we invite a participant back for future sessions, and we make sure participants know about this so they're incentivized to engage.
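A toy version of such a scoring rule might look like the sketch below. Everything in it is an illustrative assumption rather than the system we actually run: the target token count, the weighting, the 0-to-1 coherence signal, and the invite-back threshold.

```python
def session_score(token_count: int, coherence: float,
                  target_tokens: int = 4000, w_quality: float = 0.5) -> float:
    """Blend how much a participant produced with how coherent it was.
    `coherence` is assumed to be a 0-1 signal from some upstream quality check."""
    quantity = min(1.0, token_count / target_tokens)   # saturate at the target
    return (1 - w_quality) * quantity + w_quality * coherence

def invite_back(score: float, threshold: float = 0.6) -> bool:
    # Hypothetical cutoff for re-inviting a participant to future sessions.
    return score >= threshold
```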