PanKbase — Single cell RNA-seq data processing standards

1. Requirements on processing steps:

a. Information on genome build

b. Basic processing to obtain alignment, read filtering, barcode counts, and UMI counts

c. Ambient RNA correction

d. Doublet detection

2. Requirements on quality control:

a. Genotype checks to match donor identification and samples (optional)

b. Quality control for read files using tools such as FastQC and MultiQC

c. Distinction between empty and likely-true droplets using tools such as EmptyDrops (optional)

d. Filtering cells based on mitochondrial read rates

e. Filtering cells based on distribution of gene counts and UMI counts

f. Number of cells per sample

g. Barcode rank plots

h. Filtering clusters based on doublet identification and doublet rates

3. Requirements on integration:

a. Information on covariates included in integration model

b. Information on cell annotation approaches

PanKbase V3 freeze example:

Requirements on processing steps:

Requirement	Implementation
Genome build	Human genome, Gencode V39 GRCh38.p13
Basic processing to obtain alignment, read filtering, barcode counts, and UMI counts	STARsolo, custom scripts<br>https://github.com/PanKbase/snRNAseq-NextFlow
Ambient RNA correction	CellBender<br>CellBender was run twice: once with default parameters and once with modified settings
Doublet detection	DoubletFinder<br>https://github.com/PanKbase/Multiome-Doublet-Detection-NextFlow<br>DoubletFinder was run twice; the second run excluded all doublets detected in the first round
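
The two-round doublet detection in the table can be sketched as follows. This is only an illustration of the control flow, not the PanKbase pipeline: `toy_detector` is a hypothetical stand-in for a DoubletFinder run (DoubletFinder itself is an R package), and returning the union of both rounds is an assumption, since the table does not state how the two result sets are combined.

```python
from statistics import median
from typing import Callable, Dict, Set

# A detector takes {barcode: UMI count} and returns the barcodes it calls doublets.
Detector = Callable[[Dict[str, int]], Set[str]]


def toy_detector(umi_counts: Dict[str, int]) -> Set[str]:
    """Hypothetical stand-in for a doublet caller.

    Flags barcodes whose UMI count exceeds twice the median of the input.
    """
    if not umi_counts:
        return set()
    cutoff = 2 * median(umi_counts.values())
    return {bc for bc, n in umi_counts.items() if n > cutoff}


def two_round_doublets(umi_counts: Dict[str, int], detect: Detector) -> Set[str]:
    """Run the detector, drop its calls, then run it again on what remains."""
    first = detect(umi_counts)
    remaining = {bc: n for bc, n in umi_counts.items() if bc not in first}
    second = detect(remaining)
    # Assumption: the final doublet set is the union of both rounds.
    return first | second


cells = {"c1": 1000, "c2": 1100, "c3": 900, "c4": 1050, "c5": 2120, "c6": 5000}
print(sorted(two_round_doublets(cells, toy_detector)))  # ['c5', 'c6']
```

In this toy input, `c6` is called in the first round; removing it lowers the median enough for `c5` to be called in the second round, which is the motivation for rerunning the detector on the cleaned set.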

Requirements on quality control:

Requirement	Implementation
Genotype checks	mbv tool and manual search
Read file checks	FastQC; remove read files with flow cell issues (indicated by Per Tile Quality plots) or low-quality reads (indicated by Quality Score plots)
Distinction between empty and likely-true droplets	- EmptyDrops; FDR < 0.005<br>- CellBender:<br>  - Cell probability > 0.99<br>  - Cells with fractions of ambient reads < a dynamic threshold determined per sample using the Multi-Otsu Thresholding algorithm
Filtering cells based on mitochondrial read rates	Retain cells with fractions of mitochondrial reads < a dynamic threshold determined per sample using the Multi-Otsu Thresholding algorithm
Filtering cells based on distribution of gene counts and UMI counts	After integration, test whether each cluster differs significantly in its distributions of gene counts and UMI counts using the Wilcoxon rank-sum test. A cluster is considered significantly different if its adjusted p-value < 0.05 and fold change > 2.
Number of cells per sample	Retain samples with > 200 cells that satisfy the EmptyDrops, CellBender, and mitochondrial-read QC criteria.
Barcode rank plots	Obtain using custom scripts<br>https://github.com/PanKbase/PanKbase-scRNA-seq/blob/main/1_preprocessing/2_barcode_qc.ipynb
Filtering clusters based on doublet identification and doublet rates	- Remove doublets and doublet-enriched clusters, defined as clusters with doublet rates > 65%.<br>- Remove doublet-like cells, which exhibit high UMI counts and express markers from at least two cell populations.
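
The numeric cutoffs in the table above can be combined roughly as in the sketch below. It is a minimal illustration under stated assumptions, not the PanKbase scripts: the `Cell` record and all function names are hypothetical, the per-sample ambient and mitochondrial cutoffs are taken as inputs (the Multi-Otsu step that derives them is not reproduced here), and requiring every cell-level criterion at once is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple


@dataclass
class Cell:
    # Hypothetical per-barcode record; field names are illustrative only.
    barcode: str
    sample: str
    cluster: str
    emptydrops_fdr: float
    cellbender_prob: float
    ambient_frac: float
    mito_frac: float
    is_doublet: bool


def passes_cell_qc(cell: Cell, ambient_cutoff: float, mito_cutoff: float) -> bool:
    """Apply the table's cell-level cutoffs (assumed to be required jointly)."""
    return (
        cell.emptydrops_fdr < 0.005
        and cell.cellbender_prob > 0.99
        and cell.ambient_frac < ambient_cutoff
        and cell.mito_frac < mito_cutoff
    )


def samples_to_keep(
    cells: Iterable[Cell],
    cutoffs: Dict[str, Tuple[float, float]],  # sample -> (ambient, mito) cutoffs
    min_cells: int = 200,
) -> Set[str]:
    """Keep samples with more than `min_cells` cells that pass cell-level QC."""
    passing = defaultdict(int)
    for cell in cells:
        ambient_cutoff, mito_cutoff = cutoffs[cell.sample]
        if passes_cell_qc(cell, ambient_cutoff, mito_cutoff):
            passing[cell.sample] += 1
    return {sample for sample, n in passing.items() if n > min_cells}


def doublet_enriched_clusters(cells: Iterable[Cell], max_rate: float = 0.65) -> Set[str]:
    """Return clusters whose fraction of called doublets exceeds `max_rate`."""
    total, doublets = defaultdict(int), defaultdict(int)
    for cell in cells:
        total[cell.cluster] += 1
        doublets[cell.cluster] += int(cell.is_doublet)
    return {c for c in total if doublets[c] / total[c] > max_rate}


def make_cells(sample: str, n: int) -> list:
    """Build n toy cells that pass every cell-level cutoff."""
    return [Cell(f"{sample}-{i}", sample, "k0", 0.001, 0.999, 0.01, 0.02, False)
            for i in range(n)]


toy = make_cells("A", 201) + make_cells("B", 200)
toy_cutoffs = {"A": (0.2, 0.1), "B": (0.2, 0.1)}
print(samples_to_keep(toy, toy_cutoffs))  # {'A'}
```

Sample `B` is dropped because the criterion is strictly more than 200 passing cells. The doublet-cluster check uses the same strict comparison, so a cluster with a rate of exactly 65% is kept.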

Requirements on integration:

Requirement	Implementation
Integration model	Harmony, correcting for the following covariates: sex, BMI, age, studies, treatments, chemistry and tissue sources.
Cell annotation approach	Annotate using known marker genes:<br>- Beta: 'INS', 'IAPP'<br>- Alpha: 'GCG'<br>- Delta: 'SST'<br>- Gamma: 'PPY'<br>- Epsilon: 'GHRL'<br>- Ductal: 'KRT19'<br>- Acinar: 'REG1A', 'CTRB2', 'PRSS1', 'PRSS2', 'CPA1'<br>- Active Stellate: 'PDGFRB', 'COL6A1'<br>- Quiescent Stellate: 'PDGFRB', 'COL6A1', 'RGS5'<br>- Endothelial: 'PECAM1', 'PLVAP', 'ESAM', 'VWF'<br>- Immune: 'PTPRC'<br>- Cycling Alpha: 'GCG', 'MKI67', 'CDK1'
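
Because several marker lists overlap (for example, Quiescent Stellate extends the Active Stellate markers with 'RGS5', and Cycling Alpha extends Alpha), a marker lookup has to prefer the most specific match. The sketch below shows one way to do that. It is illustrative only: the matching rule (all listed markers detected, then drop any match whose marker set is strictly contained in another match) is an assumption, as is representing a cluster by the set of genes considered detected in it.

```python
from typing import Dict, List, Set

# Marker genes as listed in the table above.
MARKERS: Dict[str, List[str]] = {
    "Beta": ["INS", "IAPP"],
    "Alpha": ["GCG"],
    "Delta": ["SST"],
    "Gamma": ["PPY"],
    "Epsilon": ["GHRL"],
    "Ductal": ["KRT19"],
    "Acinar": ["REG1A", "CTRB2", "PRSS1", "PRSS2", "CPA1"],
    "Active Stellate": ["PDGFRB", "COL6A1"],
    "Quiescent Stellate": ["PDGFRB", "COL6A1", "RGS5"],
    "Endothelial": ["PECAM1", "PLVAP", "ESAM", "VWF"],
    "Immune": ["PTPRC"],
    "Cycling Alpha": ["GCG", "MKI67", "CDK1"],
}


def annotate(detected: Set[str]) -> List[str]:
    """Return the most specific cell types whose markers are all detected.

    Assumed rule: a type matches when every one of its markers is in
    `detected`; a match is dropped when its marker set is a strict subset
    of another match's marker set.
    """
    matched = {
        name: set(genes)
        for name, genes in MARKERS.items()
        if set(genes) <= detected
    }
    return sorted(
        name
        for name, genes in matched.items()
        if not any(genes < other for other in matched.values())
    )


print(annotate({"PDGFRB", "COL6A1", "RGS5"}))  # ['Quiescent Stellate']
print(annotate({"GCG", "MKI67", "CDK1"}))      # ['Cycling Alpha']
print(annotate({"INS", "IAPP", "GCG"}))        # ['Alpha', 'Beta']
```

A result with more than one label, as in the last call, means the cluster carries markers of unrelated populations. Under the doublet criteria in the quality control table, such a cluster would warrant review rather than an automatic label.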
