Mapping the neuronal building blocks of human language with language models

Single-neuronal recordings

Participants

All aspects of the study were approved and performed in strict accordance with guidelines set by the Massachusetts General Hospital Institutional Review Board (IRB) and Harvard Medical School for the ethical conduct of research. All of the participants included in the study were scheduled to undergo epilepsy monitoring30,33,34,70. The decision for surgery was made by a multidisciplinary team that commonly included neurologists, neurosurgeons, radiologists and neuropsychologists. Operative planning was made independently by the surgical team and without consideration of study participation. Once the decision to undergo surgery was made, the participant’s research candidacy was reviewed. The participants were enrolled only if the surgical plan was for cortical surface grid recordings for seizure monitoring. The patients were aged at least 18 years, and they had intact language function and were able to provide informed consent. A total of eight participants was included in the study (3 female and 5 male, mean age of 40 years; range, 27 to 52 years). All of the participants that were consented were free to withdraw from study participation at any time without impact on clinical care.

Microelectrode recordings

For each participant, multi-electrode microarrays (Utah arrays; Blackrock Microsystems) were implanted into an area planned for surgical resection. These micro-arrays (used for research) were placed within the cortex in tandem with the surface cortical grids (used for clinical monitoring) and were confirmed to be positioned in areas planned for resection based on intraoperative cortical mapping, as described previously30,71,72,73. The micro-arrays are configured in a 10 × 10 electrode channel mapping separated by 400 μm (96 recording electrodes covering an approximate area of 4 × 4 mm). For placement, the electrode arrays were implanted using a semi-automated impactor and were connected to a temporarily implanted female port. A total of eight arrays was implanted into eight participants. On the left hemisphere, two were in the anterior temporal cortex, one was in the posterior temporal cortex and one was in the prefrontal cortex. On the right hemisphere, one was in the anterior temporal cortex, two were in the posterior temporal cortex and one was in the prefrontal cortex. Together, these recordings lay in regions that have been shown to reliably engage in language production and sentence construction41,42,43,65,66 and to display robust language-selective responses41,42,44,65,74,75,76 and activation by validated language localizers on imaging studies28,41,45 (Fig. 1a). Each participant had a single array. The voltages were recorded and amplified via a 96-multichannel acquisition processor, and a band-pass filter was applied to the neuronal signals (150Hz to 8 kHz; 1-pole low-cut and 3-pole high-cut with 1,000× gain). The signals were then digitized at 30 kHz and stored offline. Audio recordings were tracked and aligned through a synchronization trigger that was integrated into both the neuronal and audio recording systems (Natus Medical). After clinical monitoring, all arrays were removed.

Single-unit isolation and LFP processing

To select and cluster APs of putative neurons, an automatic waveform detection was commonly performed using Kilosort 3.0 (https://github.com/MouseLand/Kilosort). The selected APs were then manually curated using Phy (https://phy.readthedocs.io/en/latest/, version 2.0b6). Candidate units were clustered based on their voltage waveform morphologies, spiking rates and their distribution profiles of inter-spike intervals, auto-correlation and cross-correlation profiles. We excluded units that did not demonstrate waveform stability over the course of recordings or that did not have inter-spike-interval profiles or waveform morphologies consistent with those of cortical neurons. Units from different sessions were recorded at separate time periods and across different linguistic contexts and were therefore sorted separately77,78,79. Finally, to confirm the quality of recordings, we used SpikeInterface to extract the measures of the single-unit sorting (v.0.102.3). Here the signal-to-noise ratio was defined as the ratio of the maximum amplitude of the mean spike waveform to the s.d. of the background noise on the same channel. Noise was computed through the median absolute deviation of the peak channel. Isolation distances were used to quantify the separation of putative units from their neighbours in principal component space. Here, after projecting spike waveforms into a lower-dimensional principal component space (across 5 principal components), the isolation distances were calculated using the Mahalanobis distance. Lastly, the inter-spike intervals for each unit were defined by the time intervals between consecutive spikes. Any putative units that had a high percentage of ISIs lower than 2 ms or firing rates lower than 0.1 Hz were excluded. Across all the single units, the mean signal-to-noise ratios were 2.5 ± 0.16 (mean ± s.d.), the mean isolation distances were 37.6 ± 2.7 and the mean peak inter-spike intervals (that is, the mean times that the inter-spike interval distributions peaked) were 8.7 ± 0.4 ms and were largely consistent with previous literature using analogous recording techniques32,80,81.

LFPs were obtained from the same electrode contacts as our single-neuronal recordings. Here, for each electrode of the Utah array (96 active channels in total per array), the raw voltages were processed through a series of notch filters to eliminate the AC powerline at 60 Hz and its higher-order harmonics. Then, a fourth-order Butterworth low-pass filter was applied with a cut-off frequency of 300 Hz to capture LFPs. Finally, the filtered signals were downsampled to 1 kHz for downstream analysis.

Language production and audio recordings

We tracked the activities of neurons as part of real-time conversations to study their responses during natural language production42,43,46,47,48,76,82. Here the participants responded freely to questions and prompts that were given ad hoc, therefore allowing them to produce phrases and sentences in a manner that reflected natural speech (Extended Data Table 1). The sentences were also constructed de novo (for example, rather than being simply read or repeated) and therefore enabled neuronal responses to be evaluated independently of explicit sensory cues. Thus, the participant may be prompted with a question such as “Do you want a phone right now?”, to which the participant may respond with the compound sentence “Yes, keep it there just in case, and then just bring that back over”. Therefore, in alignment with growing efforts in the field to study natural language21,42,43,46,47,48,76,82,83,84,85,86,87, the sentences produced by the participants varied naturalistically in structure and content, were constructed de novo and were representative of natural language production.

... continue reading