PanKbase — sc/snRNA-seq processing pipeline

Parker Lab
Ha Vu PhD, Peter Orchard PhD, Steve Parker PhD

The purpose of this document is to outline the overall steps we take to process and QC sc/snRNA-seq data from the 10x Genomics platform. All steps are run through nextflow pipeline management and all tools are containerized to enable widespread reproduction of our results.

I. Single-nucleus RNA-seq preprocessing:

Note: single-cell RNA-seq preprocessing is performed in the same manner as single-nucleus RNA-seq, though slightly different QC metrics are emphasized. One could consider using a different count matrix for scRNA vs snRNA as well (e.g. one that excludes introns for scRNA vs one that includes introns for snRNA; both are produced in the pipeline), though 10X Genomics has stated that there may be value in including intronic reads for both single cell and single nucleus RNA data (https://kb.10xgenomics.com/hc/en-us/articles/4998628924429-Why-should-I-include-introns-for-my-single-cell-whole-transcriptome-Gene-Expression-data-analysis)

1. FastQC and fastQC-based MultiQC:

Purpose: QC of raw sequencing reads to detect flow cell problems, presence of unexpected technical sequences, etc.
Inputs: fastq files.
Outputs: HTML QC reports.

2. Starsolo:

Purpose:
- Align reads to the genome.
- Produce count matrices quantifying various features, most notably gene counts (with multiple different count methods, e.g. including vs excluding introns). We use count matrix GeneFull_ExonOverIntron for downstream analysis (this is gene counting that takes into account exons+introns, but favors exon overlap in the case of exon+intron overlap)
Inputs: Raw read fastq files.
Outputs: Aligned reads (BAM file) and count matrices.

3. MultiQC on STAR logs:

Purpose: QC of mapping stage (examining % uniquely mapped reads, etc.).
Inputs: STARsolo log files.
Output: HTML report.

4. Calculation of per-barcode QC metrics (custom python script)

Purpose: Calculate per-barcode QC metrics (total number of reads, uniquely mapped reads, secondary and supplementary alignments, UMI counts, mitochondrial fractions).
Inputs: BAM file + gene count matrix.
Outputs: Text file of QC metrics.

5. Samtools view:

Purpose: Filter to high-quality mapped reads.
Inputs: BAM file from Starsolo.
Outputs: Filtered BAM file
Command: samtools view -h -b -q 255 -F 4 -F 256 -F 2048 $bam

6. CellBender:

Purpose: Remove ambient RNA.
Inputs: Gene count matrix from starsolo.
Outputs: HDF5 file that includes a count matrix with ambient RNA removed, as well as additional information such as the probability that a barcode represents a cell or not. Also includes HTML reports that help QC this step or determine if cellbender parameters need to be updated.

7. EmptyDrops:

Purpose: Distinguish potential cells from empty droplets.
Inputs: matrix.mtx, features.tsv and barcodes.tsv files from Starsolo.
Outputs: Droplets with significant deviations from the ambient profile are detected at a specified FDR threshold; by default, with FDR below 5%. Also includes a report of locations of critical points such as inflection and knee points.

II. Selection of quality nuclei:

1. In the case of scRNA / snRNA only (no matching snATAC): Produce these QC plots, and threshold on:

a. Min. RNA UMIs. This threshold can be determined using Multi-Otsu thresholding or EmptyDrops, if desired (such a threshold is output by the RNA preprocessing pipeline above). One can use all cells/nuclei, or exclude cells/nuclei with low UMI counts before running Multi-Otsu thresholding. It's important to inspect the plots before making a final choice for thresholds.

b. Max fraction of mitochondrial reads: This threshold can be determined using Multi-Otsu thresholding, if desired. One can use all cells/nuclei, or exclude cells/nuclei with low UMI counts before running Multi-Otsu thresholding. This threshold is (usually between 0.05 and 0.2 for snRNA, depending on the distribution.)

c. Cellbender-estimated probability that a barcode represents a cell (usually use threshold = 0.99)

d. If it's snRNA data (not scRNA), ratio of exon coverage vs full gene body coverage (calculated using ratio of UMI counts from starsolo matrices Gene vs GeneFull_ExonOverIntron); quality nuclei often have ratio < 0.5.

III. Doublet removal

If the RNA modality is present, run doubletfinder on RNA gene count matrix (after cellbender correction for ambient RNA)
Any barcodes called as doublets by any of the above methods are considered doublets.
Optionally, if > 1 library is available: Jointly cluster libraries using an appropriate method (e.g., Seurat if all cells/nuclei have the RNA modality) with doublets included, and then filter out any nuclei that (1) are called as doublets by the above methods or (2) are in a suspected doublet cluster (e.g. a cluster with > 50% doublets).

IV. Clustering

If RNA modality is present for all samples, final clustering is usually performed using Seurat on the RNA modality only, usually testing both the standard workflow (log normalize, scale, find variable features) and scTransform workflow; if necessary batch correction may be performed using Harmony.