Atlas-guided discovery of transcription factors for T cell programming

Dataset acquisition for mouse CD8+ T cell state multi-omics atlas

CD8+ T cell samples were collected from ten datasets, including those generated in this study (Extended Data Fig. 1e). In total, we analysed 121 experiments, comprising 52 ATAC-seq and 69 RNA-seq datasets, which were integrated to generate paired samples and served as input for the Taiji pipeline. The samples encompassed nine distinct CD8+ T cell subtypes: naive, TE, MP, T RM , T EM , T CM , TEX prog , TEX eff and TEX term . Cell states were defined on the basis of established surface marker combinations and LCMV-specific tetramers, including IL7R, KLRG1, PD1, SLAMF6, CD101, Tim3, CD69, CD103, H2-Db LCMV GP33–41 and H2-Db LCMV GP276–286 or congenic markers for P14 (T cell receptor (TCR) specific for the LCMV GP33–41 peptide CD8+ T cells), in the context of either acute (LCMV–Armstrong) or chronic (LCMV–Clone 13) infection models. A complete summary of dataset sources, accession numbers, infection conditions and corresponding cell state definitions (sorting gates) is provided in Supplementary Table 1 and Extended Data Fig. 1e.

TF regulatory networks construction and visualization

To perform integrative analysis of RNA-seq and ATAC-seq data, we developed Taiji v.2.0, which allows visualization of several downstream analysis–TF wave, TF–TF association and TF community analysis. Epitensor was used for the prediction of chromatin interactions. Putative TF binding motifs were curated from the latest CIS-BP database61. In this analysis, 695 TF genes were identified as having binding sites centred around ATAC-seq peak summits. The average number of nodes (genes) and edges (interactions) of the genetic regulatory networks across CD8+ T cell states were 15,845 and 1,325,694, respectively, including 695 (4.38%) TF nodes. On average, each TF regulated 1,907 genes, and each gene was regulated by 22 TFs.

Identification of single-state and multi-state TF genes

We first identified universal TF genes with mean PageRank across nine cell states ranked as top 10% and coefficients of variation less than 0.5. In total, 54 universal TF genes were identified (Supplementary Table 1). The remaining 641 TF genes were candidates for single-state TF genes. To identify single-state TF genes, we divided the samples into two groups: target and background. The target group included all samples belonging to the cell state of interest, and the background group comprised the remaining samples. We then performed the normality test using Shapiro-Wilk’s method to determine whether the two groups were distributed normally, and we found that the PageRank scores of most (90%) samples followed a log-normal distribution. On the basis of the log-normality assumption, an unpaired t-test was used to calculate the P value. A P value cut-off of 0.05 and log 2 fold change (log 2 FC) cut-off of 0.5 were used for calling lineage-specific TFs. In total, 255 specific TF genes were identified (Supplementary Table 2). Depending on whether the TF gene appeared in several cell states, they could be divided further into multi-state TF genes (Fig. 1e and Supplementary Table 2) and 136 single-state exclusive TF genes (Fig. 1d and Supplementary Table 2). Out of 255 single-state TF genes, 84 appear in TEX term or T RM cells. To identify the truly distinctive TF genes between TEX term and T RM , we performed a second round of unpaired t-tests, between only TEX term and T RM cells (Supplementary Table 3). The same cut-offs, that is, P value of 0.05 and log 2 FC of 0.5, were applied to select TEX term single-taskers and T RM single-taskers. Out of 84 TF genes that did not pass the cut-off, 30 were identified as TEX term and T RM multi-taskers. The full workflow is summarized in Extended Data Fig. 2a.

Identification of transcriptional waves

Combined with previous knowledge of the T cell differentiation path, TF waves are combinations of TFs that are particularly active in certain differentiation stages, revealing possible mechanisms of how TF activities are coordinated during differentiation. To be more specific, we clustered the TFs based on the normalized PageRank scores across samples. First, we performed principal component analysis (PCA) for dimensionality reduction of the TF score matrix. We retained the first ten principal components for further clustering analysis, which explained more than 70% of the variance (Extended Data Fig. 3b; left panel). We used the k-means algorithm for clustering analysis. To find the optimal number of clusters and similarity metric, we performed Silhouette analysis to evaluate the clustering quality using five distance metrics: Euclidean distance, Manhattan distance, Kendall correlation, Pearson correlation and Spearman correlation (Extended Data Fig. 3b; right panel). Pearson correlation was the most appropriate distance metric, as the average Silhouette width was highest of all five distance metrics. On the basis of these analyses, we identified seven distinct dynamic patterns of TF activity during immune cell development. We further performed functional enrichment analysis to identify gene ontology (GO) terms for these clusters.

TF–TF collaboration network analysis and visualization

To build the TF–TF association networks, we first defined a set of relevant TFs for each context (TEX term and T RM ) by combining cell state-important and single-state TF genes, resulting in 159 TFs for TEX term and 170 for T RM cells. The analysis was based on a TF–regulatee network derived from Taiji, where we first consolidated sample networks by averaging the edge weights for each TF–regulatee pair. To reduce noise, regulatees with low variation across all TFs (s.d. ≤ 1) were removed. Subsequently, a TF–TF correlation matrix was generated by calculating the Spearman’s correlation of edge weights for each TF pair across their common regulatees. From this matrix, we constructed a graphical model using the R package ‘huge’62, which uses the Graphical Lasso algorithm and a shrunken empirical cumulative distribution function estimator. An edge between two TFs was established if their correlation was deemed significant by the model, controlled by a lasso penalty parameter (lambda) of 0.052. This value was chosen as it represents a local minimum on the sparsity lambda curve, resulting in approximately 15% of TF–TF pairs being connected. To validate this method, we estimated the false discovery rate by generating a null model through random shuffling of the TF–regulatee edge weights. Applying our algorithm to this null data identified zero interactions, confirming that our approach has a very low false discovery rate.

... continue reading