Single-cell RNA-seq data processing standards

1. Requirements on processing steps:

a. Information on genome build

b. Basic processing to obtain alignment, read filtering, barcode counts, and UMI counts

c. Ambient RNA correction

d. Doublet detection

2. Requirements on quality control:

a. Genotype checks to match donor identification and samples (optional)

b. Quality control for read files using tools such as FastQC and MultiQC

c. Distinction between empty and likely-true droplets using tools such as EmptyDrops (optional)

d. Filtering cells based on mitochondrial read rates

e. Filtering cells based on distribution of gene counts and UMI counts

f. Number of cells per sample

g. Barcode rank plots

h. Filtering clusters based on doublet identification and doublet rates

3. Requirements on integration:

a. Information on covariates included in integration model

b. Information on cell annotation approaches


PanKbase V3 freeze example:

Requirements on processing steps:

| Requirement | Implementation |
| --- | --- |
| Genome build | Human genome, GENCODE V39 (GRCh38.p13) |
| Basic processing to obtain alignment, read filtering, barcode counts, and UMI counts | STARsolo, custom scripts<br>https://github.com/PanKbase/snRNAseq-NextFlow |
| Ambient RNA correction | CellBender<br>CellBender was run twice: once with default parameters and once with modified settings |
| Doublet detection | DoubletFinder<br>https://github.com/PanKbase/Multiome-Doublet-Detection-NextFlow<br>DoubletFinder was run twice; the second run excluded all doublets detected in the first round |

Requirements on quality control:

| Requirement | Implementation |
| --- | --- |
| Genotype checks | mbv tool and manual search |
| Read file checks | FastQC; remove read files with flow cell issues (indicated by Per Tile Quality plots) or low-quality reads (indicated by Quality Score plots) |
| Distinction between empty and likely-true droplets | - EmptyDrops; FDR < 0.005<br>- CellBender:<br>&nbsp;&nbsp;- Cell probability > 0.99<br>&nbsp;&nbsp;- Cells with fractions of ambient reads < a dynamic threshold determined per sample using the Multi-Otsu Thresholding algorithm |
| Filtering cells based on mitochondrial read rates | Retain cells with fractions of mitochondrial reads < a dynamic threshold determined per sample using the Multi-Otsu Thresholding algorithm |
| Filtering cells based on distribution of gene counts and UMI counts | After integration, determine whether a cluster differs significantly in its profiles of gene numbers and UMI numbers using the Wilcoxon rank-sum test. A cluster is considered significantly different if its adjusted p-value < 0.05 and fold change > 2. |
| Number of cells per sample | Retain samples with > 200 cells that satisfy the QC criteria of EmptyDrops, CellBender, and mitochondrial reads. |
| Barcode rank plots | Obtain using custom scripts<br>https://github.com/PanKbase/PanKbase-scRNA-seq/blob/main/1_preprocessing/2_barcode_qc.ipynb |
| Filtering clusters based on doublet identification and doublet rates | - Remove doublets and doublet-enriched clusters, defined as clusters with doublet rates > 65%.<br>- Remove doublet-like cells, which exhibit high UMI counts and express markers from at least two cell populations. |
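
The two fixed thresholds in the table above (more than 200 QC-passing cells per sample, and a cluster doublet rate above 65%) can be illustrated with a minimal sketch. This is not the PanKbase implementation: the per-cell record layout (`sample`, `cluster`, `is_doublet`, `passed_qc`), the function names, and the choice to apply both filters in a single pass are assumptions made for illustration only.

```python
from collections import Counter

# Thresholds taken from the table above.
MIN_CELLS_PER_SAMPLE = 200       # retain samples with > 200 QC-passing cells
MAX_CLUSTER_DOUBLET_RATE = 0.65  # clusters with doublet rate > 65% are removed


def samples_to_keep(cells):
    """Return the sample IDs with more than MIN_CELLS_PER_SAMPLE QC-passing cells."""
    counts = Counter(c["sample"] for c in cells if c["passed_qc"])
    return {sample for sample, n in counts.items() if n > MIN_CELLS_PER_SAMPLE}


def doublet_enriched_clusters(cells):
    """Return the cluster IDs whose fraction of called doublets exceeds the cutoff."""
    totals = Counter()
    doublets = Counter()
    for c in cells:
        totals[c["cluster"]] += 1
        doublets[c["cluster"]] += int(c["is_doublet"])
    return {
        cluster
        for cluster, n in totals.items()
        if doublets[cluster] / n > MAX_CLUSTER_DOUBLET_RATE
    }


def filter_cells(cells):
    """Drop failed-QC cells, small samples, called doublets, and doublet-enriched clusters."""
    keep = samples_to_keep(cells)
    bad = doublet_enriched_clusters(cells)
    return [
        c
        for c in cells
        if c["passed_qc"]
        and c["sample"] in keep
        and not c["is_doublet"]
        and c["cluster"] not in bad
    ]
```

In practice the sample-level filter would likely run before integration and the cluster-level filter after it; they are combined here only to keep the sketch short.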

Requirements on integration:

| Requirement | Implementation |
| --- | --- |
| Integration model | Harmony, correcting for the following covariates: sex, BMI, age, studies, treatments, chemistry and tissue sources. |
| Cell annotation approach | Annotate using known marker genes:<br>- Beta: 'INS', 'IAPP'<br>- Alpha: 'GCG'<br>- Delta: 'SST'<br>- Gamma: 'PPY'<br>- Epsilon: 'GHRL'<br>- Ductal: 'KRT19'<br>- Acinar: 'REG1A', 'CTRB2', 'PRSS1', 'PRSS2', 'CPA1'<br>- Active Stellate: 'PDGFRB', 'COL6A1'<br>- Quiescent Stellate: 'PDGFRB', 'COL6A1', 'RGS5'<br>- Endothelial: 'PECAM1', 'PLVAP', 'ESAM', 'VWF'<br>- Immune: 'PTPRC'<br>- Cycling Alpha: 'GCG', 'MKI67', 'CDK1' |
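
The marker list above can be turned into a simple lookup, sketched below. The marker genes are copied from the table; everything else is an assumption for illustration, not the PanKbase procedure: the input (a mapping from gene to mean normalized expression in one cluster), the hypothetical `threshold` value, and the rule that a cluster must express all markers of a type, with the more specific type winning when one marker set contains another (for example Cycling Alpha over Alpha).

```python
# Marker genes as listed in the table above.
MARKERS = {
    "Beta": ["INS", "IAPP"],
    "Alpha": ["GCG"],
    "Delta": ["SST"],
    "Gamma": ["PPY"],
    "Epsilon": ["GHRL"],
    "Ductal": ["KRT19"],
    "Acinar": ["REG1A", "CTRB2", "PRSS1", "PRSS2", "CPA1"],
    "Active Stellate": ["PDGFRB", "COL6A1"],
    "Quiescent Stellate": ["PDGFRB", "COL6A1", "RGS5"],
    "Endothelial": ["PECAM1", "PLVAP", "ESAM", "VWF"],
    "Immune": ["PTPRC"],
    "Cycling Alpha": ["GCG", "MKI67", "CDK1"],
}


def annotate_cluster(mean_expr, threshold=1.0):
    """Label one cluster from its mean marker expression.

    mean_expr: dict mapping gene symbol -> mean expression in the cluster.
    threshold: hypothetical cutoff above which a gene counts as expressed.
    """
    expressed = {gene for gene, value in mean_expr.items() if value >= threshold}

    # A cell type matches when all of its markers are expressed.
    matches = {
        cell_type: set(genes)
        for cell_type, genes in MARKERS.items()
        if set(genes) <= expressed
    }

    # Drop matches whose markers are a strict subset of another match,
    # so the more specific label is preferred.
    maximal = [
        cell_type
        for cell_type, genes in matches.items()
        if not any(genes < other for other in matches.values())
    ]

    if len(maximal) == 1:
        return maximal[0]
    # Several unrelated types match (possible doublet-like cluster) or none do.
    return "Ambiguous" if maximal else "Unassigned"
```

For example, `annotate_cluster({"PDGFRB": 3.0, "COL6A1": 3.0, "RGS5": 2.0})` yields `"Quiescent Stellate"`, while a cluster expressing both Beta and Alpha markers is reported as `"Ambiguous"` and would need manual review.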