Sample acquisition and ethics approval
The University of Pennsylvania institutional review board (IRB) approved the collection of human kidney tissue for this study. Left-over kidney samples were irreversibly de-identified, and no personal identifiers were gathered. Therefore, they were exempt from IRB review (category 4). We engaged an external, honest broker who was responsible for clinical data collection without disclosing personally identifiable information. Participants were not compensated. Some samples were collected as part of the Transformative Research in Diabetic Nephropathy (TRIDENT) study, a multicentre observational cohort study that enrolls, follows, and performs multiomics characterization of individuals with DKD. This study was approved by the University of Pennsylvania IRB and by the TRIDENT Steering Committee. Informed consent was obtained from each participant. The present research using the UKBB Resource was approved under Application Number 273810. Participants from the UKBB provided written informed consent allowing the use of their samples and data for medical research purposes.
CosMx sample preparation and data preprocessing
Tissue sections were cut at 5 µm thickness and prepared according to the manufacturer specifications (NanoString Technologies). We used the human universal cell characterization RNA probes, and a 50 additional custom probes for the following genes: ESRRB, SLC12A1, UMOD, CD247, SLC8A1, SNTG1, SLC12A3, TRPM6, ACSL4, SCN2A, SATB2, STOX2, EMCN, MEIS2, SEMA3A, PLVAP, NEGR1, SERPINE1, CSMD1, SLC26A7, SLC22A7, SLC4A9, SLC26A4, CREB5, HAVCR1, REN, AP1S3, LAMA3, NOS1, PAPPA2, SYNPO2, RET, LHX1, SIX2, CITED1, WNT9B, AQP2, SCNN1G, ALDH1A2, CFH, NTRK3, WT1, NPHS2, PTPRQ, CUBN, LRP2, SLC13A3, ACSM2B, SLC4A4, PARD3, XIST and UTY. We used the DAPI, CD298/B2M, CK8/18 and PanCK/CD45 for additional staining per Nanostring protocol. Imaging was performed using configuration A. After imaging was completed, the flowcell was incubated in 100% xylene overnight, the coverslip was removed from the slide with a razor blade—the slide was then stained with haematoxylin and eosin. The expression matrix and metadata from each CosMx run was exported from the AtoMx platform and was converted to a Python object using squidpy. All samples were merged and preprocessed and analysed together using scanpy. Cells with fewer than 30 counts were filtered out.
Xenium sample preparation and data preprocessing
Tissue sections were cut at 5 µm thickness and cut onto a Xenium slide. according to the manufacturer specifications (10X Genomics). We used the human Xenium Prime 5K Human Pan Tissue and Pathways Panel with 100 additional custom probes for the following genes: TPM1, ESRRB, COL6A3, AGR2, SLC26A7, ATP1B1, SLC8A1, ATP6AP2, TAGLN, SPP1, SAT1, MYL9, LDB2, DEFB1, COL1A2, ACTA2, ST6GALNAC3, SLC13A3, SLC12A3, SLC12A1, MGP, IGHG1, FN1, C7, ACSM2B, AIF1, APOE, AQP3, AZGP1, C1QA, C1QB, C1QC, CAV1, PPIA, CD74, CHI3L1, COL1A1, COL6A1, CRYAB, CXCL14, ENO1, HLA-DPA1, HLA-DRA, IFI27, IGHA1, IL32, KLF2, LGALS3, LUM, MMP7, PIGR, S100A2, SLC4A4, SLPI, SOD2, SPINK1, SOX4, SPOCK2, TACSTD2, TM4SF1, TPM2, VIM, ZFP36, AQP2, RNASE1, ALDOB, PCGF6, RHOB, CD81, ASS1, MYL6, COX8A, CTSB, GATM, MT1G, TMSB10, COL3A1, MIF, TPT1, COL6A2, BST2, CLU, APOC1, APOD, PHKG2, RGCC, HLA-DQA2, CORO1A, HSPB1, ADIRF, CKB, HLA-DQB1, COX5B, MT1H, RAMP3, TYROBP, LAMTOR5, ITM2B, UBB and CTSD. Additionally, the same sections were stained according to the Xenium Cell Segmentation workflow for automated morphology-based cell segmentation, and subsequently loaded onto the Xenium Analyzer for in situ transcriptomic analysis. Xenium raw output files were processed using the spatialdata framework (v0.0.14) with the spatialdata-io Xenium plugin. Xenium transcript and segmentation data were loaded from the manufacturer’s output directory using the xenium() function, which parses transcript tables, cell segmentation boundaries and spatial metadata into a structured SpatialData object. The gene expression table was extracted as an AnnData object for downstream single-cell analysis. Cells with fewer than 30 counts were filtered out.
Integration of the single-cell spatial kidney atlas
To integrate snRNA-seq with spatial transcriptomic data, we applied a deep generative modelling framework using scVI and scANVI, as implemented in the scvi-tools Python package (v1.0+). Two parallel integrations were performed: one with CosMx and one with Xenium data. Raw count matrices were used, and cells were annotated with metadata, including technology, project, and sample identifiers. The SCVI.setup_anndata() function was used to register these batch and covariate annotations. The scVI model was initialized with a negative binomial likelihood and trained using 4 hidden layers and a 30-dimensional latent space. We applied early stopping based on validation loss and used a batch size of 2,048. We refined this by training a scANVI model based using the scVI-trained model as a prior and using the annotations from the annotated snRNA-seq reference atlas. scANVI was trained with early stopping, The mean distance between a given CosMx cells and 10 snRNA-seq cells was then calculated in the resulting latent space using k-nearest neighbours. Cells with a high distance were discarded and reintegration was performed. To create a unified reference map across technologies, we then concatenated the filtered CosMx and Xenium objects. scVI was trained using the raw counts as input, with tech as batch key, and proj, orig_ident, nCount_RNA, and percent_mt as additional covariates. The model was trained using 30 latent dimensions, 4 hidden layers, and a negative binomial likelihood, with dropout and early stopping enabled. scANVI was subsequently initialized using the trained scVI model and fine-tuned using coarse annotations. The model was trained for 50 epochs with a learning rate of 0.001. UMAP coordinates were computed based on the scANVI latent space, and Leiden clustering was performed for annotation. Based on the final scanVI latent space, the distance from each spatial cell was calculated to the nearest snRNA-seq cell using a k-nearest neighbour graph based on Euclidean distance in the latent space. The expression of 3,000 highly variable genes of the snRNA-seq dataset was then inferred for each spatial cell by averaging the corresponding gene expression values of its nearest neighbours, with contributions weighted by the squared inverse of their latent space distances, such that closer neighbours contributed more strongly.
Annotation-based niche generation
To quantify cell-type-specific neighbourhood compositions across multiple spatial scales, we computed the number of neighbouring cells of each cell type within 20, 40, 60 and 80 μm radii for each cell in the combined CosMx and Xenium dataset. The input contained spatial connectivity matrices at each distance threshold. For each sample, we extracted subsets of the data and iteratively computed, for every cell type, the total number of neighbours per cell using the respective spatial connectivity matrix. This process yielded four neighbourhood matrices, each representing a different spatial scale. To capture neighbours newly included at increasing distances, we computed differences between consecutive shells (for example, 40 μm minus 20 μm). Cells with zero neighbours across all types were excluded, and the bottom 5% of sparse neighbourhoods (based on total neighbour count at 80 µm) were filtered out to mitigate edge effect. Cell-type-specific neighbour compositions across all four scales were then used as input features to define spatially resolved kidney niches. These features were normalized using StandardScaler, and dimensionality reduction was performed using principal component analysis (PCA). The optimal number of clusters was determined using the elbow method, followed by MiniBatchKMeans clustering to assign niche labels. To assess disease-associated shifts in kidney niche composition, we quantified the relative abundance of annotated niche labels across healthy and DKD samples. For each condition and niche, we calculated the number of contributing cells and normalized this count by the total number of cells from the respective condition to correct for differences in sampling depth. This yielded a normalized abundance matrix for each niche within control and DKD groups. To compare niche representation between groups, we computed the relative increase or depletion in DKD by calculating the ratio of normalized counts (DKD/control), assigning negative values when niches were more abundant in controls. These relative changes were visualized as a heat map.
... continue reading