PanKbase — snATAC processing pipeline

Gaulton Lab

Parul Kudtarkar, Emily Griffin, Kyle Gaulton

This pipeline is designed for reproducibility and transparency:

All scripts are well-documented and available on GitHub.
https://github.com/PanKbase-pipelines/HPAP-scATAC-seq
Intermediate and final results are saved, along with metadata in the data library.
https://data.pankbase.org/analysis-sets/PKBDS0470WCHR
Outputs include both raw and processed data to ensure traceability.
QC metrics are generated at each step for quality control.
For inquiries or feedback, please open a GitHub issue.
https://github.com/Gaulton-Lab/HPAP-scATAC-seq/issues
The project adheres to the MIT License, enabling reuse and modification with attribution.

I. Single-nucleus ATAC-seq preprocessing

1. Cell Ranger Single Cell ATAC

Purpose: Run Cell Ranger ATAC to generate position-sorted BAM files

Run the following script to process snATAC-seq FASTQ files using Cell Ranger ATAC. Replace HPAP-109 with sample names.

for SAMPLE in HPAP-109; do
    ~/scripts/cellranger-atac-2.0.0/cellranger-atac count \
    --id ${SAMPLE} \
    --fastqs ~/hpap/atac/${SAMPLE}/Upenn_scATACseq/fastq/ \
    --sample ${SAMPLE} \
    --reference ~/refdata-cellranger-arc-GRCh38-2020-A-2.0.0/ \
    --localcores 24 \
    --disable-ui;
done

Input: Raw Fastq files

Output: QC in HTML format, possorted_bam.bam: Position-sorted BAM file with barcode annotations

2. Generate QC metrics and generate chromatin accessibility matrices

Purpose: The script snATAC_pipeline_hg38_10X.py is used to process the BAM file and generate QC metrics and chromatin accessibility matrices. The script processes snATAC-seq BAM files to:

Filter reads and barcodes.
Remove duplicate and mitochondrial reads.
Calculate QC metrics.
Generate chromatin accessibility matrices.

python3 snATAC_pipeline_hg38_10X.py \
    -b /hpap/atac/HPAP-043/Upenn_scATACseq/HPAP-043/outs/possorted_bam.bam \
    -o HPAP-043 \
    -n HPAP-043 \
    -t 24 \
    -m 2 \
    --minimum-reads 1000

Command Options:

-b: Path to the input BAM file from Cell Ranger ATAC.
-o: Output directory for processed files.
-n: Prefix for output file names.
-t: Number of threads to use (default: 8).
-m: Maximum memory per thread in GB (default: 4).
--minimum-reads: Minimum unique usable reads per barcode (default: 500).

Note: To determine --minimum-reads we plotted distribution of Reads per Barcode for each sample. Low-read barcodes form the left peak (likely background or noise). High-read barcodes form the right peak (true cells). We selected a cutoff that separates the low-quality and high-quality barcodes, based on this we determined 1000 to be cutoff for our dataset. During processing in Step 2:

The pipeline calculates unique usable reads per barcode after filtering duplicates and mitochondrial reads.
Barcodes are retained only if their read count meets or exceeds the specified --minimum-reads threshold.

Barcodes failing the threshold are excluded from:

Sparse matrix generation.
Clustering and downstream analysis.

Input Files:

BAM File: Position-sorted BAM file from Cell Ranger ATAC (*.bam).
Chromosome Sizes File: Genome chromosome size information (e.g., hg38.chrom.sizes).
Blacklist File: BED file of blacklisted regions to exclude (e.g., hg38-blacklist.v3.bed).
Promoter File: BED file defining promoter regions for QC analysis.

Output Files:

Processed BAM Files:
- SAMPLE.compiled.filt.bam: Filtered BAM file with updated read names.
- SAMPLE.filt.rmdup.bam: BAM file after duplicate removal.
QC Metrics:
- SAMPLE.qc_metrics.txt: QC metrics for each barcode.
Sparse Matrix:
- SAMPLE.mtx.gz: Sparse chromatin accessibility matrix.
- SAMPLE.barcodes: List of barcodes included in the matrix.
- SAMPLE.regions: List of genomic regions (features).
Genomic Windows:
- SAMPLE.5kb_windows.bed: Non-overlapping 5 kb genomic windows filtered by blacklisted regions.

II. Window Based Clustering

This step involves window based clustering of snATAC-seq data using Seurat and Signac.

Purpose: Window-based clustering in Step II acts as an intermediate step to prepare the data for peak-based clustering. It groups cells based on broader chromatin accessibility patterns rather than specific peaks. Window-based clustering establishes the initial structure of cell populations, enabling robust peak calling and finer peak-based clustering in the final step.

Note: Why we can't directly use the peak calls from the 10X pipeline (step I-1)?

The 10X pipeline combines all cells to identify peaks, treating the dataset as bulk ATAC-seq data rather than single-cell data. Peaks identified globally may dilute or miss rare signals that are specific to subpopulations of cells, hence small cell types are missed.

Rscript 01_Seurat_snATAC_windows_Harmony_reducePCs.r

Steps 01_Seurat_snATAC_windows_Harmony_reducePCs.r

Workflow:

Use few samples initially to generate hvw.txt (hvw.txt is created by selecting the top 50,000 most variable windows using a subset of initial samples.) for dimensionality reduction.
Apply all samples with hvw.txt to refine final clustering.
Perform batch correction and clustering based on chromatin accessibility.

Key Steps in Script II:

Input Processing
- Reads matrices and metadata for each sample, appending barcodes and preparing data for Seurat compatibility.
- Note: Initially, this step is done only for a few samples to generate highly variable windows.
Window-Based Clustering:
- Uses windows_modified.txt for initial clustering based on global accessibility patterns.
Highly Variable Window (HVW) Selection:
- Identifies highly variable windows across initial samples, saving them to hvw.txt. (it is created by selecting the top 50,000 most variable windows)
Sparse Matrix Generation:
- Generates sparse matrices for all samples using only highly variable windows for dimensionality reduction.
Dimensionality Reduction:
- Normalizes data using TF-IDF, performs LSI, applies Harmony for batch correction, and generates UMAP projections.
Clustering:
- Finds clusters using graph-based clustering and visualizes them with UMAP.
Gene Activity Analysis:
- Computes gene activity scores and integrates RNA expression for downstream analysis.
QC and Metadata Integration:
- Adds QC metrics, such as reads in peaks and promoters, from the qc_metrics.txt file generated in Step 2.
- Annotates clusters with cell types based on known marker genes.
- Computes nucleosome signal and TSS enrichment scores, adding them to metadata for evaluation.

Inputs:

Processed sparse chromatin accessibility matrices from Step I-2.
windows_modified.txt contains non-overlapping 5 kb genomic windows filtered by blacklisted regions.
QC metrics (qc_metrics.txt) from Step 2.

Outputs:

Clustered and annotated Seurat object saved as HVG_all_samples.rds.
Highly variable windows saved as hvw.txt. hvw.txt is created by selecting the top 50,000 most variable windows using a subset of initial samples.
Gene activity scores integrated into metadata.
QC metrics (reads in peaks, promoters, TSS enrichment) added to metadata.
Final visualization files (UMAP projections and cluster annotations).

III. Multiplet Detection with AMULET

Purpose: This step identifies multiplets (multiple cells encapsulated in single droplet) using AMULET. AMULET (ATAC-seq MULtiplet Estimation Tool) is designed to detect multiplets in snATAC-seq data by identifying cells with higher-than-expected read overlaps, which may arise due to droplet encapsulation errors during sequencing.

AMULET uses read overlap statistics to determine whether a barcode corresponds to a multiplet. (https://github.com/UcarLab/AMULET)

#!/bin/bash

N=5
j=0
BASE_DIR="/nfs/lab"
AMULET_DIR="$BASE_DIR/amulet_zip"
HPAP_DIR="$BASE_DIR/parulk/HPAP_scATAC/amulet"
OUT_DIR="$BASE_DIR/parulk/HPAP_scATAC/amulet"

SAMPLES=(HPAP-036 HPAP-039 ... HPAP-109)

for sample in "${SAMPLES[@]}"; do
    ((j=j%N)); ((j++==0)) &amp;&amp; wait
    
    BAM_PATH="$HPAP_DIR/$sample/Upenn_scATACseq/cellranger_RME/$sample/outs/possorted_bam.bam"
    CSV_PATH="$HPAP_DIR/$sample/Upenn_scATACseq/cellranger_RME/$sample/outs/singlecell.csv"
    
    $AMULET_DIR/AMULET.sh --forcesorted --bambc CB --bcidx 0 --cellidx 8 --iscellidx 9 \
        $BAM_PATH \
        $CSV_PATH \
        $AMULET_DIR/human_autosomes.txt \
        $AMULET_DIR/RepeatFilterFiles/blacklist_repeats_segdups_rmsk_hg38.bed \
        $OUT_DIR/$sample \
        $AMULET_DIR &amp;
done

Inputs:

BAM files and single-cell metadata CSV.
Blacklist files and repeat region annotations.

Outputs:

Filtered barcode lists identifying multiplets.

MultipletBarcodes_01.txt

ACTAACGAGAGCCACA-1
AGCCTTCTCACGATTG-1
TAGGAGGTCAGGTTTG-1
AAATGAGCATTTGGCA-1
….

IV. Removing Multiplets and Reclustering

Purpose: This step removes multiplets identified by AMULET and re-clusters the data for further analysis.

Rscript 02_Removing_multiplets.r

Inputs:

Processed Seurat object with initial window based clustering Step II.
Multiplet barcode lists from Step III

Outputs:

Final Seurat object with multiplets removed and clusters refined.
Updated QC metrics in metadata, including TSS enrichment and nucleosome signal.

Key Steps:

Load Multiplets: Reads multiplet barcodes and labels cells as singlets or multiplets.
Remove Multiplets: Filters out multiplet cells and recalculates clusters.
QC Analysis: Adds updated QC metrics for TSS enrichment, nucleosome signal, and fraction of reads in peaks. Remove low quality cells based on QC metrics and recluster
Save Final Object: Stores the refined Seurat object for downstream analysis (peak based clustering).

V. Peak Calling on window based seurat leiden cluster

Purpose: Perform peak calling by generating and merging TagAlign files, splitting by cell type, and subsampling where needed.

Key Step:

1. Make TagAligns for each sample

In step I.2 TagAlign files are created by filtering a deduplicated BAM file to retain only properly paired reads, adjusting their start and end positions based on specified shift and extension sizes. Each read's chromosome, coordinates, barcode, mapping quality, and strand orientation are extracted and written in a tab-delimited format. The resulting compressed TagAlign file is used for downstream analysis such as peak calling, quality control, and chromatin accessibility analysis.

2. Merge the TagAligns into one file

merged_tagAlign.sh

#!/bin/bash

samples=("HPAP-035" "HPAP-036" "HPAP-039" "HPAP-040" "HPAP-044" "HPAP-045"
"HPAP-047" "HPAP-049" "HPAP-050" "HPAP-051" "HPAP-052" "HPAP-053"
"HPAP-054" "HPAP-055" "HPAP-056" "HPAP-059" "HPAP-061" "HPAP-062"
"HPAP-063" "HPAP-064" "HPAP-067" "HPAP-069" "HPAP-072" "HPAP-075"
"HPAP-077" "HPAP-079" "HPAP-080" "HPAP-081" "HPAP-083" "HPAP-084"
"HPAP-085" "HPAP-088" "HPAP-092" "HPAP-099" "HPAP-100" "HPAP-101"
"HPAP-103" "HPAP-104" "HPAP-105" "HPAP-106" "HPAP-109")

for SAMPLE in "${samples[@]}"; do
    zcat $SAMPLE.filt.rmdup.tagAlign.gz | awk -v SAMPLE=$SAMPLE \
    'BEGIN{FS=OFS="\t"} {print $1,$2,$3,$4"-1",$5,$6}'
done | sort -T /nfs/lab/parulk/HPAP_scATAC/tmp -k1,1 -k2,2n -S 64G | bgzip \
-c -@ 16 &gt; /nfs/lab/parulk/HPAP_scATAC/41merged.tagAlign2.gz

3. Make barcode files for each celltype/leiden (needs to be a \n delimited .txt file)

4. Split merged tagAlign file by cell type

splitTagAlign_parallelized.sh -c cells.txt -t 41merged.tagAlign2.gz -b \
barcodes.txt -o /nfs/lab/parulk/HPAP_scATAC/peak_calls_leiden/

5. Subsample cell types with more than 100,000,000 reads in the split tagAlign file

Check the number of reads with wc -l *tagAlign > tagAlign_counts.txt

179947374 split.10.tagAlign
191518770 split.11.tagAlign
152204378 split.12.tagAlign
182024692 split.13.tagAlign
107727312 split.14.tagAlign
76308524 split.15.tagAlign
120399652 split.16.tagAlign
29849565 split.17.tagAlign
54304928 split.18.tagAlign
44476814 split.19.tagAlign
202435758 split.1.tagAlign
53143650 split.20.tagAlign
10674640 split.21.tagAlign
19420652 split.22.tagAlign
1819296 split.23.tagAlign
789413 split.24.tagAlign
401137 split.25.tagAlign
796720 split.26.tagAlign
443876652 split.2.tagAlign
452888720 split.3.tagAlign
293626898 split.4.tagAlign
380076598 split.5.tagAlign
328138582 split.6.tagAlign
287800546 split.7.tagAlign
204336826 split.8.tagAlign
231128946 split.9.tagAlign

a. Classify Leiden Clusters Based on Read Counts

> 100 million reads: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13.

< 100 million reads: 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26.

b. Subsample Large Files (> 100M Reads)

for ct in {1..13}
do
    orig_tagAlign="split."$ct".tagAlign"
    new_tagAlign="split.subsample."$ct".tagAlign"
    subsample --reservoir -n 100000000 $orig_tagAlign &gt; $new_tagAlign
    rm $orig_tagAlign
done

c. Rename Smaller Files (< 100M Reads)

for ct in {14..26}
do
    orig_tagAlign="split."$ct".tagAlign"
    new_tagAlign="split.subsample."$ct".tagAlign"
    mv $orig_tagAlign $new_tagAlign
done

d. Compress All Processed Files

for ct in {1..26}
do
    new_tagAlign="split.subsample."$ct".tagAlign"
    gzip -cvf $new_tagAlign &gt; $new_tagAlign".gz"
done

Notes:

Replace subsample with the correct path or module if needed.
Ensure you are in the directory containing the split.*.tagAlign files before running the script.
Test the script with one or two cell types before executing it for all.

6. Run the call peaks script!

Inputs:

-c (file with all cell types in order)
-t (un-gzipped split tagAligns from step 5, same order as -c)
-b (paths to all cell type specific barcodes files, same order as -c)
-o output dir
-g genome sizes file

nohup bash call_peaks_unparallel.sh -c cells.txt -t \
split.subsample.tagAligns_byCT.txt -b barcodes.txt -o peaks/ &amp;

7. Merge Peaks

awk -v OFS='\t' ' { print $1, $2, $3} ' *merged* &gt; allPeaksAnno.bed
sort -k1,1 -k2,2n allPeaksAnno.bed &gt; allPeaksAnno_sorted.bed
bedtools merge -i allPeaksAnno_sorted.bed &gt; mergedPeak.txt
awk -F '\t' '{print $1,":",$2,"-",$3}' mergedPeak.txt &gt; \
mergedPeak_formatted.txt

Outputs:

TagAlign Files: Compressed files for each sample containing filtered reads.
Merged Peaks: Unified peaks across window based leiden cluster.

VI. Create long format matrix by intersecting reads with peaks from step V

Key Steps

Filter and process single-cell ATAC-seq data.
Generate chromatin accessibility matrices for downstream analysis.
Filter barcodes based on specified thresholds.
Focus on regions of interest using predefined peaks or windows.

nohup /usr/bin/python3 Josh_10XPipeline_withPeaks_justLFM.py \
    -o /nfs/lab/parulk/HPAP_scATAC/peak_calls_leiden/lfm \
    -k /nfs/lab/parulk/HPAP_scATAC/peak_calls_leiden/lfm/filtered_sample_barcodes/HPAP-106.barcodes \
    -a lfm1/HPAP-106.filt.rmdup.tagAlign.gz \
    -n HPAP-106 \
    -t 24 \
    -m 2 &amp;

Inputs

TagAlign file (-a): Compressed file containing mapped reads.
Barcodes file (-k): List of barcodes to retain.
Output directory (-o): Directory to store results.
Additional files:
- Predefined peaks file (mergedPeak.txt) specifying peaks from Step V.
- Chromosome size and blacklist files for region filtering.

Outputs

Long-format matrix (.long_fmt_mtx.txt.gz): Accessibility counts per barcode-region pair.
Barcodes file (.barcodes): List of filtered barcodes.

VII. Final Cluster (peak based)

Purpose: This script is designed for processing, integrating, and analyzing single-nucleus ATAC-seq (snATAC-seq) data using Seurat and Signac packages in R. It implements data preprocessing, quality control, dimensionality reduction, clustering, visualization, and sub-clustering for detailed characterization of cell types and conditions. It integrates multiple samples and processes metadata to enable downstream analyses. The script specifically uses peaks to create the Seurat object for chromatin accessibility analysis.

Rscript 03_Seurat_snATAC_peaks_Harmony_reducePCs_all_samples_final.r

Key Steps:

Package Installation and Loading:
- Installs required libraries (Seurat, Signac, ggplot2, EnsDb.Hsapiens.v86) and dependencies.
- Sets up parallelization for faster processing.
Data Import:
- Reads ATAC fragment matrices for all samples and processes them into long-format matrices (from step VI).
Quality Control:
- Filters cells based on metrics like TSS enrichment, nucleosome signal, and mitochondrial read fraction.
- Adds metadata for sex, condition, and autoantibody (AAB) status.
Matrix Construction:
- Merges sparse matrices of chromatin accessibility for merged peaks (step V).
- Converts matrices into Seurat-compatible ChromatinAssay objects.
Dimensionality Reduction and Clustering:
- Performs TF-IDF normalization and SVD.
- Applies Harmony for batch effect correction.
- Runs UMAP for visualization and clustering at multiple resolutions.
Sub-clustering:
- Identifies sub-populations within specific clusters for finer cell-type classification.
- Generates sub-cluster-specific UMAP and marker expression plots.
Cell Type Annotation:
- Assigns identities based on marker gene expression and metadata.
- Labels clusters with predefined cell types (e.g., Beta, Alpha, Ductal).
Gene Activity Scores:
- Calculates gene activity from ATAC data and normalizes it for RNA-based downstream analyses.
Visualization:
- Generates UMAP plots colored by sample, condition, and sex.
- Produces dot plots for marker gene expression.
- Bar plots display cell type distribution across conditions.
Export and Save Outputs:
- Saves intermediate and final Seurat objects for reproducibility.
- Exports figures in PDF and SVG formats.

Inputs:

ATAC Fragment Matrices: Long-format matrix files for each sample containing counts for accessible chromatin regions from step VI
Annotations: Gene annotations (EnsDb.Hsapiens.v86) for genome alignment.
Peak Windows File: List of highly variable peaks derived from Step V.
Metadata Files: Sample-specific metadata for sex, condition, and autoantibody (AAB) status.
Marker Gene List: Predefined cell-type-specific marker genes for downstream analysis.
Quality Control Metrics: QC metrics files to filter cells based on signal-to-noise properties.

Output:

Seurat Objects: Processed Seurat objects saved at intermediate and final stages.
- atac_obj_peak_based_final.rds: Final annotated Seurat object.
Figures/Plots:
- UMAP embeddings visualized by library, condition, and cell type.
- Dot plots showing marker expression.
- Quality control plots for TSS enrichment, nucleosome signal, and mitochondrial reads.
- Sub-cluster plots for detailed visualization of smaller cell populations.
Saved Metadata:
- Metadata files embedded within Seurat objects for sample attributes, conditions, and AAB status.

FIN