Patient recruitment and single-cell sequencing
This study was approved by the National Health Service (NHS) Research Ethics Committee (Cambridge South, REC ID 17/EE/0338). Written informed consent was given by all participants.
296 individuals without IBD and 125 individuals with CD were recruited at Addenbrooke’s Hospital, Cambridge, UK. Patients with CD were required to have a history of ileal disease involvement. During routine endoscopy, gastrointestinal biopsies were collected from the rectum and terminal ileum, and blood samples were drawn after the procedure. Terminal ileal biopsies were obtained from 243 individuals without IBD and 119 individuals with CD. Ileal biopsies from patients with CD included both inflamed and uninflamed tissue. Rectal biopsies were collected from 275 individuals without IBD. Where possible, rectal and ileal samples were collected from the same healthy individuals, and blood and ileal samples were collected from the same patients. For gastrointestinal biopsies, we have developed a gentle single-cell dissociation protocol that minimizes transcriptional changes and cell stress or death during cell isolation. Early samples were processed using version 1 of the protocol, whereas most of the samples were processed using version 290,91. Whole blood was collected for scRNA-seq from 95 individuals with CD. Peripheral blood mononuclear blood cells (PBMCs) were isolated from the whole blood using the EasySep Direct Human PBMC Isolation Kit (StemCell Technologies).
Because IBD tissue displays increased cellular heterogeneity due to immune infiltration, we targeted 6,000 and 3,000 cells for CD and healthy biopsies, respectively. This is to ensure adequate representation of diverse cell populations and to maintain statistical power for expression quantification and eQTL mapping across annotations. scRNA-seq was performed using 3′ 10X Genomics Chromium kits (v3.0 and v3.1) according to the manufacturer’s instructions. Libraries were sequenced to a target depth of 50,000 reads per cell. CellRanger (v7.2.0) was used to align reads to the GRCh38 human genome reference with Ensembl v93 transcript definitions (GRCh38-3.0.0 reference file provided by 10X Genomics), and to generate cell-by-gene count matrices. The greater number of cells targeted for patients with CD is confirmed in the terminal ileum samples from the ‘atlasing’ dataset (see ‘Single-cell RNA sequencing quality control and cell clustering’) (Wilcoxon P value < 2.2 × 10−16), with no discernable influence on sequencing depth, as indicated by median number of counts per sample (P = 0.27) (Supplementary Fig. 6).
Single-cell RNA-sequencing quality control and cell clustering
Cell count matrices were initially subset to retain droplets likely to contain cells and corrected for ambient RNA using CellBender v2.192, as previously described76. Scrublet v0.2.193 was then used to identify and remove potential multiplets. All subsequent quality control, normalization, and clustering steps were performed with Scanpy v1.9.694, including normalizing to library sizes of 10,000 total counts using the function sc.pp.normalise_per_cell and log transformation using sc.pp.log1p to produce the ln[cp10k + 1] count matrix.
A lineage-sensitive, relative quality control strategy was applied to account for technical and biological variation across tissues and lineages (Supplementary Fig. 1). First, the input dataset (4,099,804 cells) was aggregated and subjected to an initial round of quality control. Cells were removed if they had fewer than 250 genes expressed, fewer than 500 total counts, or more than 20% or 50% of reads originating from the mitochondrial genome, for blood-derived and non-blood-derived cells, respectively. This left a total of 2,725,568 cells. Genes were then removed if they were not expressed in at least five cells. A total of 5,000 highly variable genes were then classified within each sample and merged using sc.pp.highly_variable_genes and flavour = seurat. Ribosomal, mitochondrial and immunoglobulin genes were then excluded from the highly variable set.
These data were then subject to an initial round of integration and clustering. Data were integrated using the scVI v0.16.395 function scvi.model.SCVI on raw count data of highly variable genes. This was trained on 2 layers with 30 latent variables and gene likelihood modelled as a negative binomial distribution. A total of 50 nearest neighbours was calculated per cell using the function sc.pp.neighbors to generate an adjacency matrix, which was used for calculation of both an initial uniform manifold approximation and projection (UMAP) embedding using function sc.tl.umap with both a minimum distance and spread of 0.5, and Leiden clusters at a resolution of 0.03 using sc.tl.leiden. Initial clusters were then grouped into cell lineages, based on the expression of canonical marker genes (Supplementary Fig. 7), including: B cells, epithelial cells, mast cells, mesenchymal cells, myeloid cells, platelets and T/ILC cells. Markers for these annotations are described in ‘Annotation of cell clusters’.
Within each cell lineage (B cells, epithelial cells, mesenchymal cells, myeloid cells (not including platelets or mast cells) and T/ILC cells), we then further subset cells based on relative thresholds for quality control parameters. Cells with number of unique genes expressed, total counts and percentage of genes originating from the mitochondrial genome that lay outside the median ± 2.5 median absolute deviations were removed. For the epithelial lineage, the median deviation was far greater than that for non-epithelial, and so a window of 2 median absolute deviations was utilized. Additionally, epithelial cells from samples with lower overall depth (indicated by median number of genes expressed per cell <2,000) were removed from the epithelial population only, and any infiltration of blood-derived cells were also removed. Highly variable genes were then re-selected from the full initial set within each lineage independently, and re-integrated using scVI as before, computing 20 nearest neighbours for non-epithelial cells and 50 nearest neighbours for epithelial cells.
Cell types were then identified by clustering within each lineage independently. As described previously76, the resolution at which final clusters were derived was determined in a data-driven manner. Cells from each lineage were clustered at resolutions of 0.1 to 1.5 (interval of 0.1) using the sc.tl.leiden function. To determine the reproducibility of the clusters detected at each resolution, two-thirds of the data were then used to train a single layer dense neural network model using keras v2.4.3, and the per-cluster matthews correlation coefficient (MCC) of the fit was assessed in the remaining one-third of data. After clusters that were either comprising >40% cells from a single sample or showed outlying quality control parameters were removed, we selected the resolution at which all remaining clusters had a MCC > 0.75. Additionally, reproducible (MCC > 0.75) clusters demarcated by the expression of genes indicative of poor quality cells, such as MALAT196, and cross-lineage marker genes were also removed. The resolutions utilized for the final cell clusters were: epithelial cells, 0.9; mesenchymal cells, 1.0; myeloid cells, 0.7; B cells, 0.8; and T/ILC cells, 1.1.
... continue reading