Sample collection and data analysis for the P56 dataset
Most methods that apply to the adult P56 10x scRNA-seq and MERFISH datasets used in this paper were described previously16; the methods below are either newly introduced or were modified for this Article. The P56 dataset used in this study comes from the subpallium GABA neighbourhood of the WMB cell atlas16 and comprises 611,423 high-quality single-cell transcriptomes (10xv2: 269,307 cells, 3,567 ± 1,264 (mean ± s.d.) genes per cell, 9,328 ± 5,502 unique molecular identifiers (UMIs) per cell; 10xv3: 342,116 cells, 5,949 ± 1,625 genes per cell, 26,476 ± 14,943 UMIs per cell) (Supplementary Table 3).
UMAP projection
We performed principal component analysis (R package stats, v.4.4.1, RRID: SCR_025968) on the imputed gene expression matrix of 4,895 marker genes, using the 10xv3 reference. We selected the top 100 principal components and then removed one principal component whose correlation with the technical bias vector, defined as log2[gene count] for each cell, was greater than 0.7. We used the remaining principal components as input to create 2D and 3D UMAPs (v.0.5.6, RRID: SCR_018217)94 using the parameters nn.neighbors = 25 and md = 0.4.
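As an illustration of the filtering step only, the following minimal Python sketch drops principal components whose correlation with log2[gene count] exceeds 0.7. The function names (pearson, drop_biased_components), the use of the absolute correlation (the sign of a principal component is arbitrary) and the list-of-rows input layout are assumptions made for this example; the analysis itself was carried out in R, and the PCA and UMAP steps are not reproduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def drop_biased_components(scores, gene_counts, threshold=0.7):
    """Remove PCs that track the technical bias vector log2(gene count).

    scores      -- list of rows, one per cell, each a list of PC scores
    gene_counts -- number of detected genes per cell (same order as scores)
    threshold   -- PCs with |correlation| above this value are dropped
                   (using the absolute value is an assumption of this sketch)
    Returns (filtered_scores, kept_column_indices).
    """
    bias = [math.log2(c) for c in gene_counts]
    kept = []
    for j in range(len(scores[0])):
        column = [row[j] for row in scores]
        if abs(pearson(column, bias)) <= threshold:
            kept.append(j)
    filtered = [[row[j] for j in kept] for row in scores]
    return filtered, kept
```

The filtered score matrix would then be passed to the UMAP step in place of the full set of principal components.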
Constellation plot
To generate the constellation plot, each transcriptomic supertype was represented by a node (circle) whose surface area reflected the number of cells within the supertype on a log scale. The position of each node was based on the centroid of the corresponding supertype in UMAP coordinates. The relationships between nodes were indicated by edges, which were calculated as follows. For each cell, the 15 nearest neighbours in reduced-dimension space were determined and summarized by supertype. For each supertype, we then calculated the fraction of nearest neighbours that were assigned to other supertypes. An edge connected two nodes when at least one of them had >5% of its nearest neighbours in the other node. The width of the edge at each node reflected the fraction of nearest neighbours that were assigned to the connecting node and was scaled to node size. For all nodes in the plot, we then determined the maximum fraction of outside neighbours and set this as edge width = 100% of node width. The function for creating these plots, plot_constellation, is included in scrattch.bigcat (v.0.0.5; https://github.com/AllenInstitute/scrattch.bigcat)16.
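The edge rule described above can be sketched as follows. This is a minimal Python illustration, not the plot_constellation implementation: the function name constellation_edges, its arguments and the returned dictionary layout are hypothetical, the nearest-neighbour lists are assumed to be precomputed, and the scaling of edge width to node size is omitted.

```python
from collections import Counter, defaultdict


def constellation_edges(knn, labels, min_frac=0.05):
    """Summarize cell-level nearest neighbours into supertype-level edges.

    knn      -- knn[i] is the list of neighbour indices of cell i
    labels   -- labels[i] is the supertype of cell i
    min_frac -- an edge is kept if either endpoint has more than this
                fraction of its neighbours in the other supertype
    Returns {(a, b): (frac_a_to_b, frac_b_to_a)} with a < b.
    """
    # Count, for every supertype, which supertypes its cells' neighbours fall in.
    counts = defaultdict(Counter)
    for cell, neighbours in enumerate(knn):
        for nb in neighbours:
            counts[labels[cell]][labels[nb]] += 1

    # Convert counts to the fraction of neighbours assigned to each other supertype.
    frac = {}
    for src, row in counts.items():
        total = sum(row.values())
        for dst, n in row.items():
            if dst != src:
                frac[(src, dst)] = n / total

    # Keep a pair if at least one direction exceeds the threshold.
    edges = {}
    for (src, dst), f in frac.items():
        if f > min_frac or frac.get((dst, src), 0.0) > min_frac:
            a, b = sorted((src, dst))
            edges[(a, b)] = (frac.get((a, b), 0.0), frac.get((b, a), 0.0))
    return edges
```

Each returned pair of fractions corresponds to the edge width at its two ends before scaling.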
Imputation of scRNA-seq data into the MERFISH space
The MERFISH dataset was collected using only 500 genes. To obtain the spatial distribution of all of the genes, we projected gene expression of the MERFISH dataset to the 10xv3 scRNA-seq dataset using the impute_knn_global function in the scrattch.bigcat package16. We used the self-imputed 10xv3 dataset as a reference, meaning that the expression of each 10xv3 cell was first imputed based on its nearest 15 neighbours in the reduced principal component space. This decision was made to ensure that, in the following hierarchical imputation step, the transitions between major cell types were preserved. The imputation was conducted in the order specified by the hierarchy defined by class and subclass. At the root, we imputed the expression for all of the genes for each MERFISH cell based on the average expression of their nearest neighbours from the reference 10xv3 dataset, defined by the cosine similarity using all 500 MERFISH genes. In each of the following iterations, we selected the node to which each MERFISH cell was assigned and imputed only the expression of the differentially expressed genes (DEGs) based on pairwise comparison for all of the clusters under this node. The nearest neighbours for imputation were selected from the clusters under this node in the reference dataset, using only the subset of DEGs that were present on the MERFISH gene panel. We repeated this procedure until reaching the leaf node. This strategy enabled us to preserve the cell-type resolution during imputation, making it less susceptible to the global platform differences between MERFISH and scRNA-seq.
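To make the order of operations concrete, the following Python sketch imputes one MERFISH cell along its path through the hierarchy. It is a simplified illustration under stated assumptions and not the impute_knn_global implementation: the function names, the dictionary-based data layout (gene name to expression value), and the arguments path, ref_by_node and degs_by_node are all hypothetical, and the self-imputation of the reference is assumed to have been done beforehand.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_average(query, ref_cells, match_genes, out_genes, k):
    """Average out_genes over the k reference cells most similar on match_genes."""
    q = [query[g] for g in match_genes]
    nearest = sorted(
        ref_cells,
        key=lambda cell: cosine(q, [cell[g] for g in match_genes]),
        reverse=True,
    )[:k]
    return {g: sum(cell[g] for cell in nearest) / len(nearest) for g in out_genes}


def impute_hierarchical(query, path, ref_by_node, degs_by_node, panel, all_genes, k=15):
    """Impute one cell from the root down to its assigned leaf node.

    query        -- measured expression of the cell on the panel genes
    path         -- node names from the root to the leaf the cell is assigned to
    ref_by_node  -- reference cells (full expression) belonging to each node
    degs_by_node -- DEGs among the clusters under each non-root node
    panel        -- genes measured on the spatial panel
    all_genes    -- every gene to impute at the root
    """
    # Root: match on all panel genes against the whole reference, impute all genes.
    imputed = knn_average(query, ref_by_node[path[0]], panel, all_genes, k)
    # Below the root: re-impute only the node's DEGs, using reference cells under
    # that node and matching on the DEGs that are present on the panel.
    for node in path[1:]:
        degs = degs_by_node[node]
        match_genes = [g for g in degs if g in panel]
        if match_genes:
            imputed.update(knn_average(query, ref_by_node[node], match_genes, degs, k))
    return imputed
```

Because each refinement overwrites only the DEGs of the current node, values for all other genes are inherited from the coarser levels of the hierarchy.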
Analysis of spatial gene expression gradients
We performed independent component analysis (ICA) using fastICA (v.1.2-5.1; RRID: SCR_013110)95 to decompose the gene expression matrix into independent components. These components were then projected onto the imputed MERFISH data to determine whether each component represented a spatial gene expression program. For components that represented a spatial gene expression program, the top loading genes were selected and visualized both on the UMAP in scRNA-seq space and on sections in the imputed MERFISH space. We evaluated the gene modules in the identified individual components and applied UCell (v.2.8.0; https://doi.org/10.18129/B9.bioc.UCell)96 to assign to each cell a gene module score based on both positive and negative genes.
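The selection of top loading genes and a signed, rank-based module score can be sketched in Python as follows. This is an illustration only: the component loadings are assumed to be already available from the ICA step, the function names are hypothetical, and the score is a simplified normalized Mann-Whitney U statistic. UCell's own handling of ties, rank capping and weighting of the negative gene set may differ from what is shown here.

```python
def top_loading_genes(loadings, n_top):
    """Split the n_top genes with the largest absolute loading by sign.

    loadings -- dict mapping gene name to its loading on one component
    Returns (positive_genes, negative_genes), each ordered by |loading|.
    """
    ranked = sorted(loadings, key=lambda g: abs(loadings[g]), reverse=True)[:n_top]
    positive = [g for g in ranked if loadings[g] > 0]
    negative = [g for g in ranked if loadings[g] < 0]
    return positive, negative


def rank_score(expr, genes):
    """Normalized Mann-Whitney U score of a gene set within one cell.

    expr -- dict mapping gene name to expression in this cell
    Returns 1.0 if the set occupies the top ranks, 0.0 if it occupies the bottom.
    Ties are broken by sort order, which is an assumption of this sketch.
    """
    genes = [g for g in genes if g in expr]
    n, total = len(genes), len(expr)
    if n == 0 or n == total:
        return 0.0
    order = sorted(expr, key=lambda g: expr[g], reverse=True)
    rank = {g: i + 1 for i, g in enumerate(order)}  # rank 1 = highest expression
    u = sum(rank[g] for g in genes) - n * (n + 1) / 2
    return 1.0 - u / (n * (total - n))


def module_score(expr, positive, negative):
    """Positive-set score minus negative-set score, floored at zero."""
    return max(0.0, rank_score(expr, positive) - rank_score(expr, negative))
```

A cell in which the positive genes are highly expressed and the negative genes are low receives a score near 1, whereas the reverse pattern gives 0.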