Mastering Single-Cell RNA-Seq Analysis with Scanpy: A Step-by-Step Guide to Clustering, Annotation, and Trajectory Inference

This guide walks you through a complete single-cell RNA-seq analysis pipeline using Scanpy on the PBMC-3k benchmark dataset. You will learn how to load and inspect data, apply quality control filters, remove doublets, normalize expression, identify highly variable genes, reduce dimensionality, cluster cells, annotate populations, and explore developmental trajectories. Each step includes practical code snippets and visualizations to ensure a hands-on understanding of the workflow.

1. What is the PBMC-3k dataset and how do you load it into Scanpy?

The PBMC-3k dataset consists of 2,700 peripheral blood mononuclear cells (PBMCs) from a healthy donor, sequenced using 10x Genomics technology. It is a widely used benchmark for single-cell RNA-seq analysis. In Scanpy, you can load it directly with sc.datasets.pbmc3k(). This function retrieves the raw count matrix, gene names, and cell barcodes. After loading, we ensure gene names are unique using adata.var_names_make_unique(). The resulting AnnData object holds the count matrix in adata.X, gene metadata in adata.var, and cell metadata in adata.obs. Inspecting print(adata) shows the number of cells and genes, which is essential before proceeding to quality control.
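The loading step described above can be sketched as follows. The helper names load_pbmc3k and describe are hypothetical, introduced only for illustration; the Scanpy import sits inside the function so that the summary helper can be used without Scanpy installed.

```python
def load_pbmc3k():
    """Download (or read from the local cache) the PBMC-3k raw counts."""
    import scanpy as sc  # local import: only needed when data is actually loaded

    adata = sc.datasets.pbmc3k()     # raw count matrix with gene names and barcodes
    adata.var_names_make_unique()    # disambiguate duplicated gene symbols
    return adata


def describe(adata):
    """One-line size summary for any object exposing n_obs and n_vars."""
    return f"{adata.n_obs} cells x {adata.n_vars} genes"
```

Calling print(describe(load_pbmc3k())) should report the 2,700 cells mentioned above; the first call needs network access to fetch the data.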

Source: www.marktechpost.com

2. How do you perform quality control on single-cell RNA-seq data?

Quality control removes low-quality cells and poorly detected genes. First, we flag mitochondrial and ribosomal genes by gene-name prefix (MT- for mitochondrial; RPS and RPL for ribosomal). Using sc.pp.calculate_qc_metrics, we compute per-cell metrics such as the number of detected genes, total UMI counts, and the percentage of counts from mitochondrial genes. Violin plots of n_genes_by_counts, total_counts, and pct_counts_mt help choose cutoffs. A typical choice removes cells with fewer than 200 detected genes (likely empty droplets), cells with more than 2,500 detected genes (possible doublets), and cells with more than 5% mitochondrial counts (likely damaged cells). Genes detected in fewer than 3 cells are also removed. The lower gene-count cutoff is applied with sc.pp.filter_cells, the gene filter with sc.pp.filter_genes, and the upper gene-count and mitochondrial cutoffs with Boolean indexing on adata.obs.
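A minimal sketch of these QC steps, using the thresholds quoted above. The function names passes_qc and run_qc are hypothetical, and the cutoffs should be revisited after looking at the violin plots for your own data.

```python
MIN_GENES = 200     # fewer detected genes: likely empty droplet
MAX_GENES = 2500    # more detected genes: possible doublet
MAX_PCT_MT = 5.0    # higher mitochondrial percentage: likely damaged cell
MIN_CELLS = 3       # genes detected in fewer cells are dropped


def passes_qc(n_genes, pct_mt):
    """True if a cell with these metrics survives the cell-level cutoffs."""
    return MIN_GENES <= n_genes <= MAX_GENES and pct_mt <= MAX_PCT_MT


def run_qc(adata):
    """Annotate QC metrics on adata and return a filtered copy."""
    import scanpy as sc  # local import so passes_qc works without Scanpy

    # Flag gene families by name prefix (human gene symbols assumed).
    adata.var["mt"] = adata.var_names.str.startswith("MT-")
    adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))
    sc.pp.calculate_qc_metrics(
        adata, qc_vars=["mt", "ribo"], percent_top=None, log1p=False, inplace=True
    )
    sc.pp.filter_cells(adata, min_genes=MIN_GENES)   # lower gene-count cutoff
    sc.pp.filter_genes(adata, min_cells=MIN_CELLS)   # rarely detected genes
    keep = (adata.obs["n_genes_by_counts"] <= MAX_GENES) & (
        adata.obs["pct_counts_mt"] <= MAX_PCT_MT
    )
    return adata[keep, :].copy()
```

Before filtering, sc.pl.violin(adata, ["n_genes_by_counts", "total_counts", "pct_counts_mt"], multi_panel=True) gives the diagnostic plots used to judge whether these cutoffs fit the data.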

3. How do you detect and remove doublets using Scrublet?

Doublets are droplets that captured two or more cells under a single barcode, so their expression profile is a mixture of those cells. They can confound clustering and differential expression. We use Scrublet, available in Scanpy as sc.pp.scrublet, which simulates doublets by combining observed cells and assigns each cell a doublet score. By default the score threshold is chosen automatically from the distribution of simulated-doublet scores; it can also be set by hand through the threshold parameter. After the call, per-cell scores are stored in adata.obs['doublet_score'] and the resulting labels in adata.obs['predicted_doublet']. We print the total number of predicted doublets (in PBMC-3k it is typically around 30–50) and then remove those cells with adata = adata[~adata.obs['predicted_doublet'], :].copy() to obtain a clean dataset for downstream analysis.
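The sketch below shows the idea behind the simulation and one way to wrap the Scanpy call. The names simulate_doublet and remove_doublets are hypothetical. The toy simulation only illustrates the principle of summing two observed count profiles and is not Scrublet's actual implementation. In older Scanpy releases the wrapper lived under sc.external.pp.scrublet, so treat the exact entry point for your installed version as an assumption to check.

```python
def simulate_doublet(cell_a, cell_b):
    """Toy synthetic doublet: gene-wise sum of two cells' raw counts."""
    return [a + b for a, b in zip(cell_a, cell_b)]


def remove_doublets(adata, random_state=0):
    """Score cells with Scrublet and return a copy without predicted doublets."""
    import scanpy as sc  # local import so simulate_doublet works without Scanpy

    # Scrublet expects raw counts, so run this before normalization.
    sc.pp.scrublet(adata, random_state=random_state)
    flagged = adata.obs["predicted_doublet"].astype(bool)
    print(f"Predicted doublets: {int(flagged.sum())} of {adata.n_obs} cells")
    return adata[~flagged, :].copy()
```

It is worth checking the distribution of adata.obs['doublet_score'] before trusting the automatic threshold, since a poorly separated score distribution can make the cutoff unreliable.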

4. How do you normalize data and select highly variable genes?

Normalization corrects for differences in sequencing depth across cells. We first save raw counts to a layer with adata.layers['counts'] = adata.X.copy(). Then we perform total-count normalization to 10,000 per cell (sc.pp.normalize_total(target_sum=1e4)) followed by log transformation (sc.pp.log1p). This stabilizes variance. Next, we identify highly variable genes (HVGs) using sc.pp.highly_variable_genes with parameters min_mean=0.0125, max_mean=3, min_disp=0.5. A plot visualizes the mean-expression vs. dispersion relationship, highlighting selected HVGs. We store the full dataset in adata.raw (for later gene-level queries) and then subset the object to only HVGs for dimensionality reduction and clustering.
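To make the arithmetic concrete, the sketch below first reproduces total-count normalization plus log1p for a single cell in plain Python, then wraps the corresponding Scanpy calls with the parameters quoted above. Both function names are hypothetical.

```python
import math


def normalize_log1p(counts, target_sum=1e4):
    """Scale one cell's counts to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c * target_sum / total) for c in counts]


def normalize_and_select_hvgs(adata):
    """Normalize adata in place and return a copy restricted to HVGs."""
    import scanpy as sc  # local import so normalize_log1p works without Scanpy

    adata.layers["counts"] = adata.X.copy()       # keep raw counts
    sc.pp.normalize_total(adata, target_sum=1e4)  # depth normalization
    sc.pp.log1p(adata)                            # log transformation
    sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
    adata.raw = adata.copy()                      # full gene set for later queries
    return adata[:, adata.var["highly_variable"]].copy()
```

Two cells with the same relative composition but different sequencing depth, such as [1, 1, 2] and [2, 2, 4], map to identical normalized values, which is the point of the depth correction. sc.pl.highly_variable_genes(adata) draws the mean versus dispersion plot for the selected genes.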

5. How do you cluster cells and annotate cell types using canonical markers?

After scaling the data (sc.pp.scale) and reducing dimensionality with PCA (sc.pp.pca), we build a nearest-neighbor graph (sc.pp.neighbors), which both the UMAP embedding (sc.tl.umap) and Leiden clustering (sc.tl.leiden) use. Marker genes per cluster are identified with sc.tl.rank_genes_groups using a Wilcoxon rank-sum test. We then annotate clusters by comparing their marker lists to known PBMC markers, e.g., CD3D for T cells, CD14 for monocytes, and MS4A1 for B cells. Mapping each cluster ID to a label and storing it in adata.obs['cell_type'] turns cluster numbers into cell type labels, which can be shown on the UMAP with sc.pl.umap(adata, color='cell_type'). Clusters that match well-known marker profiles are evidence that the clustering resolution is reasonable, and the labels enable biological interpretation.
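One way to sketch clustering and a simple marker lookup is shown below. The function names, the three-entry MARKERS table (taken from the examples above), and settings such as max_value=10, n_neighbors=10, and n_pcs=40 are illustrative assumptions rather than values prescribed by this guide. A real annotation would use a longer marker list and manual review.

```python
MARKERS = {"CD3D": "T cells", "CD14": "Monocytes", "MS4A1": "B cells"}


def label_cluster(top_genes, markers=MARKERS):
    """Return the cell type of the first known marker among a cluster's top genes."""
    for gene in top_genes:
        if gene in markers:
            return markers[gene]
    return "Unknown"


def cluster_and_annotate(adata, n_top=25):
    """Run PCA, neighbors, UMAP, Leiden, and marker ranking, then label clusters."""
    import scanpy as sc  # local import so label_cluster works without Scanpy

    sc.pp.scale(adata, max_value=10)                  # assumed clipping value
    sc.pp.pca(adata)
    sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)  # assumed graph settings
    sc.tl.umap(adata)
    sc.tl.leiden(adata, key_added="leiden")
    sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")

    # "names" is a structured array with one field of ranked genes per cluster.
    names = adata.uns["rank_genes_groups"]["names"]
    labels = {g: label_cluster(list(names[g][:n_top])) for g in names.dtype.names}
    adata.obs["cell_type"] = adata.obs["leiden"].map(labels).astype("category")
    return adata
```

Clusters labeled "Unknown" are a prompt to inspect their top-ranked genes by hand rather than a final answer.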

6. How do you infer cell trajectories with PAGA and diffusion pseudotime?

Trajectory analysis orders cells along continuous changes in state, such as developmental progression. We first compute PAGA (Partition-based Graph Abstraction) using sc.tl.paga on the Leiden clusters, which outputs a graph whose edge weights reflect the connectivity between clusters. This graph can be visualized overlaid on UMAP. Then we compute diffusion maps (sc.tl.diffmap), designate a root cell by storing its index in adata.uns['iroot'] (usually a cell from the earliest population), and calculate diffusion pseudotime with sc.tl.dpt. The resulting values, stored in adata.obs['dpt_pseudotime'], order cells along a continuous trajectory, and plotting them on a diffusion map or UMAP helps identify branches. In PBMC data, candidate transitions to look for include naive to activated T cells or monocyte to dendritic cell. Because circulating PBMCs are largely mature cells, such orderings are better read as gradients of cell state than as confirmed developmental lineages.
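The trajectory steps can be sketched as below, assuming Leiden labels are already in adata.obs['leiden'] and a neighbor graph has been computed. The helper names are hypothetical, and picking the first cell of a chosen cluster as the root is a simplification; in practice the root is chosen from biological knowledge.

```python
def root_index(labels, root_label):
    """Index of the first cell whose cluster label equals root_label."""
    for i, label in enumerate(labels):
        if label == root_label:
            return i
    raise ValueError(f"no cell carries the label {root_label!r}")


def run_trajectory(adata, root_cluster):
    """Compute PAGA, diffusion maps, and diffusion pseudotime on adata."""
    import scanpy as sc  # local import so root_index works without Scanpy

    sc.tl.paga(adata, groups="leiden")   # cluster-level connectivity graph
    sc.tl.diffmap(adata)                 # diffusion components
    # sc.tl.dpt reads the root cell index from adata.uns["iroot"].
    adata.uns["iroot"] = root_index(adata.obs["leiden"].tolist(), root_cluster)
    sc.tl.dpt(adata)                     # writes adata.obs["dpt_pseudotime"]
    return adata
```

After this, sc.pl.paga(adata) draws the abstracted graph and sc.pl.umap(adata, color='dpt_pseudotime') shows the ordering on the embedding.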

7. How do you compute a custom gene signature score, e.g., interferon response?

Custom scores quantify the activity of a gene set per cell. Using Scanpy's sc.tl.score_genes, we compute an interferon-response score by providing a list of interferon-stimulated genes (e.g., IFIT1, IFIT3, MX1). For each cell, the function calculates the average expression of the signature genes minus the average expression of a reference set of genes, randomly sampled to match the signature's expression levels. Passing score_name='interferon_score' stores the result in adata.obs['interferon_score']; without it, the column is named 'score'. We can then visualize this score on UMAP to see which cell types or clusters show high interferon activity. Such scores are useful for identifying activated immune cells or a response to viral infection, providing cell-level functional insights beyond clustering.
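The following sketch shows the core of the score for a single cell in plain Python, then the corresponding Scanpy call. The function names are hypothetical, and the toy version takes the reference genes as given, whereas Scanpy samples them from expression-matched bins.

```python
IFN_GENES = ["IFIT1", "IFIT3", "MX1"]  # example genes named in the text


def signature_score(expression, signature, reference):
    """Mean expression of signature genes minus mean of reference genes."""
    sig_mean = sum(expression[g] for g in signature) / len(signature)
    ref_mean = sum(expression[g] for g in reference) / len(reference)
    return sig_mean - ref_mean


def score_interferon(adata, genes=IFN_GENES):
    """Add adata.obs['interferon_score'] using sc.tl.score_genes."""
    import scanpy as sc  # local import so signature_score works without Scanpy

    # score_genes reads from adata.raw when it is set, so check names there.
    var_names = adata.raw.var_names if adata.raw is not None else adata.var_names
    present = [g for g in genes if g in var_names]
    sc.tl.score_genes(adata, gene_list=present, score_name="interferon_score")
    return adata
```

Once the column exists, sc.pl.umap(adata, color='interferon_score') shows which clusters carry the signal. With only three genes the score is noisy, so a longer curated gene list is usually preferable.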

8. How do you save the fully analyzed AnnData object for future use?

After completing all analyses, we save the AnnData object to an HDF5-based .h5ad file using adata.write('pbmc3k_analyzed.h5ad'). This file preserves everything stored on the object: the raw counts kept in adata.layers['counts'], normalized data, cluster assignments, annotation, UMAP coordinates, pseudotime values, and custom scores. To reload later, use sc.read_h5ad('pbmc3k_analyzed.h5ad'). Storing the object allows easy reproduction of figures or sharing with collaborators. Files are written uncompressed by default, so we recommend passing compression='gzip' to adata.write_h5ad to save disk space. The saved object maintains all metadata, so future steps like differential expression or visualizing new gene sets can be performed without rerunning the entire pipeline.
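A small sketch of the save and reload round trip, with gzip compression requested explicitly. The helper names h5ad_path and save_and_reload are hypothetical.

```python
from pathlib import Path


def h5ad_path(name, directory="."):
    """Build an output path that always ends in .h5ad."""
    return Path(directory) / Path(name).with_suffix(".h5ad").name


def save_and_reload(adata, name="pbmc3k_analyzed", directory="."):
    """Write adata with gzip compression and read it back from disk."""
    import scanpy as sc  # local import so h5ad_path works without Scanpy

    path = h5ad_path(name, directory)
    adata.write_h5ad(path, compression="gzip")  # compression is off unless requested
    return sc.read_h5ad(path)
```

Reading the file back right after writing is a quick check that every field you rely on later, such as adata.obs['cell_type'], survived serialization.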
