snATAC-seq data standards

This document describes data generation, processing, and QC standards in PanKbase for single-nucleus ATAC-sequencing (snATAC-seq) studies. The standards are adapted from the snATAC HPAP protocol and the Gaulton Lab snATAC data processing pipeline.


I. Data Generation

Based on standard practices for Chromium Single Cell ATAC Sequencing (10x Genomics) and the snATAC HPAP protocol:

1. Sequencing Depth

Recommended depth: ~25,000–50,000 read pairs per nucleus.
For a targeted recovery of 5,000 nuclei (as stated in the protocol), the total sequencing depth would typically range from 125 million to 250 million read pairs.
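
The depth range above is simply the target nuclei count multiplied by the per-nucleus depth. A minimal sketch of that arithmetic follows; the function name is illustrative and not part of any pipeline script.

```python
def total_read_pairs(target_nuclei: int, read_pairs_per_nucleus: int) -> int:
    """Total read pairs to request for one library."""
    return target_nuclei * read_pairs_per_nucleus


# 5,000 targeted nuclei at the recommended 25,000-50,000 read pairs per nucleus
low = total_read_pairs(5_000, 25_000)
high = total_read_pairs(5_000, 50_000)
print(f"{low / 1e6:.0f}M to {high / 1e6:.0f}M read pairs")  # 125M to 250M read pairs
```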

2. Read Length

Chromium Single Cell ATAC uses paired-end sequencing.
Standard read length:
    Read 1: 50 bp (genomic insert; one end of the transposed fragment)
    Read 2: 50 bp (genomic insert; opposite end of the fragment)
    Index 1 (i7): 8 bp (sample index)
    Index 2 (i5): 16 bp (cell barcode)

3. Quality Control Metrics

Cell viability threshold: ≥85%.
Nuclei concentration after isolation: ~500–5,000 nuclei/μl, with a final concentration of ~30 nuclei/μl during transposition.
Nuclei recovery rate: ~25–50% of input cells after processing.
Low debris and ambient DNA: DNase treatment can help reduce contamination.

4. Data Standards

Fragment size distribution: Enrichment around 200 bp and multiples (nucleosome phasing).
TSS enrichment score: ≥10 (for high-quality libraries).
Fraction of reads in peaks (FRiP): ≥15–20%.
Duplicate rate: ≤10%.
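
As an illustration of how the numeric thresholds above could be applied to a library's summary metrics, here is a small sketch. The metric names, the dictionary layout, and the example values are assumptions made for this example, and FRiP is checked against the lower bound (15%) of the stated range; this is not the pipeline's own QC code.

```python
# Thresholds taken from the list above. Metric names are hypothetical.
THRESHOLDS = {
    "tss_enrichment": (">=", 10.0),
    "frip": (">=", 0.15),            # lower bound of the 15-20% range
    "duplicate_rate": ("<=", 0.10),
}


def check_library(metrics: dict) -> dict:
    """Return a pass/fail flag for each thresholded metric."""
    results = {}
    for name, (op, limit) in THRESHOLDS.items():
        value = metrics[name]
        results[name] = value >= limit if op == ">=" else value <= limit
    return results


# Made-up values for a single library, for illustration only
example = {"tss_enrichment": 12.3, "frip": 0.22, "duplicate_rate": 0.14}
report = check_library(example)
for name, ok in report.items():
    print(f"{name}: {'PASS' if ok else 'FAIL'}")
```

With these made-up values the library passes the TSS enrichment and FRiP checks and fails the duplicate-rate check.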

For exact sequencing parameters, follow the recommendations in the Chromium Single Cell ATAC Solution User Guide provided by 10x Genomics.

References

10x Genomics Chromium Single Cell ATAC Solution User Guide

snATAC HPAP protocol


II. Data Processing

Overview

This pipeline is designed for reproducibility and transparency:

  • All scripts are documented and available in the GitHub repository.
  • Intermediate and final results are saved, along with metadata, in the Data Library.
  • Outputs include both raw and processed data to ensure traceability.
  • QC metrics are generated at each step for quality control.
  • For inquiries or feedback, please open a GitHub Issue.
  • The project is released under the MIT License, which permits reuse and modification with attribution.

1. Cell Ranger Single Cell ATAC

Purpose: Generate position-sorted BAM files using Cell Ranger ATAC.

Command:

# Run Cell Ranger ATAC once per sample; add more sample IDs to the list as needed.
for SAMPLE in HPAP-109; do
    ~/scripts/cellranger-atac-2.0.0/cellranger-atac count \
        --id "${SAMPLE}" \
        --fastqs ~/hpap/atac/"${SAMPLE}"/Upenn_scATACseq/fastq/ \
        --sample "${SAMPLE}" \
        --reference ~/refdata-cellranger-arc-GRCh38-2020-A-2.0.0/ \
        --localcores 24 \
        --disable-ui
done

Inputs: Raw FASTQ files
Outputs: QC HTML files, position-sorted BAM files with barcode annotations


2. Generate QC Metrics and Chromatin Accessibility Matrices

Purpose: Process BAM files to filter reads, calculate QC metrics, and generate chromatin accessibility matrices.

Command:

python3 snATAC_pipeline_hg38_10X.py \
-b SAMPLE/possorted_bam.bam \
-o SAMPLE \
-n SAMPLE \
-t 24 \
-m 2 \
--minimum-reads 1000

Options:

  • -b: Input BAM file
  • -o: Output directory
  • -n: Prefix for output names
  • -t: Number of threads (default: 8)
  • -m: Memory per thread in GB (default: 4)
  • --minimum-reads: Minimum reads per barcode (default: 500)
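
To make the --minimum-reads option concrete, the sketch below counts reads per barcode and keeps barcodes at or above the cutoff. It assumes the threshold is inclusive and is applied to a simple per-barcode read count; the function name and the toy barcodes are hypothetical, and snATAC_pipeline_hg38_10X.py may count reads differently (for example after filtering or deduplication).

```python
from collections import Counter


def barcodes_passing(read_barcodes, minimum_reads=500):
    """Return the set of barcodes with at least `minimum_reads` reads."""
    counts = Counter(read_barcodes)
    return {bc for bc, n in counts.items() if n >= minimum_reads}


# Toy input: one barcode entry per read (hypothetical barcodes)
reads = ["AAAC"] * 1200 + ["CCGT"] * 800 + ["GGTA"] * 1000
kept = barcodes_passing(reads, minimum_reads=1000)
print(sorted(kept))  # ['AAAC', 'GGTA']
```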

Outputs:

  • Processed BAM files
  • QC metrics
  • Sparse chromatin accessibility matrix
  • Genomic windows filtered by blacklisted regions

3. Window-Based Clustering

Purpose: Cluster nuclei on genomic windows to obtain the initial clusters needed for peak calling and peak-based clustering.

Command:

Rscript 01_Seurat_snATAC_windows_Harmony_reducePCs.r

Workflow:

  • Process input matrices and metadata.
  • Perform batch correction and clustering.
  • Generate UMAP projections and annotate clusters.

Outputs:

  • Clustered Seurat object (HVG_all_samples.rds)
  • Highly variable windows (hvw.txt)
  • Gene activity scores integrated with metadata

4. Multiplet Detection with AMULET

Purpose: Detect multiplets using AMULET.

Command:

AMULET.sh --forcesorted --bambc CB --bcidx 0 --cellidx 8 --iscellidx 9 \
  BAM_PATH CSV_PATH autosomes.txt blacklist.bed OUTPUT_DIR

Outputs:

  • Filtered barcode lists identifying multiplets
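
The --bcidx, --cellidx, and --iscellidx values in the command are column positions in the per-barcode CSV. The sketch below shows how such indices could select cell barcodes, assuming they are zero-based and that a value of "1" in the is-cell column marks a cell. The toy table and its column names are invented for illustration and do not reproduce the real CSV header.

```python
import csv
import io


def cell_barcodes(csv_text, bcidx=0, iscellidx=9):
    """Return barcodes whose is-cell column equals "1" (zero-based indices)."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    return [row[bcidx] for row in reader if row[iscellidx] == "1"]


# Toy table with 10 columns; only columns 0, 8 and 9 matter in this sketch
header = "barcode," + ",".join(f"col{i}" for i in range(1, 8)) + ",cell_id,is_cell"
rows = [
    "AAAC-1,0,0,0,0,0,0,0,_cell_0,1",
    "CCGT-1,0,0,0,0,0,0,0,None,0",
]
toy_csv = "\n".join([header] + rows)
print(cell_barcodes(toy_csv))  # ['AAAC-1']
```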

5. Removing Multiplets and Reclustering

Purpose: Remove multiplets and refine clusters.

Command:

Rscript 02_Removing_multiplets.r

Outputs:

  • Final Seurat object with multiplets removed
  • Updated QC metrics and refined clusters
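
Conceptually, multiplet removal drops every barcode that appears in the multiplet list before reclustering. The pipeline does this in R (02_Removing_multiplets.r); the Python sketch below only illustrates the idea, with a hypothetical function name and toy barcodes.

```python
def remove_multiplets(barcodes, multiplets):
    """Keep barcodes that are not flagged as multiplets, preserving order."""
    flagged = set(multiplets)
    return [bc for bc in barcodes if bc not in flagged]


# Toy example (hypothetical barcodes)
all_barcodes = ["AAAC-1", "CCGT-1", "GGTA-1", "TTAG-1"]
multiplet_barcodes = ["CCGT-1"]
singlets = remove_multiplets(all_barcodes, multiplet_barcodes)
print(singlets)  # ['AAAC-1', 'GGTA-1', 'TTAG-1']
```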

6. Peak Calling on Clusters

Purpose: Perform peak calling by generating and merging TagAlign files, splitting by cell type, and subsampling.

Key Steps:

  • Generate and merge TagAlign files.
  • Split merged TagAlign files by cell type.
  • Subsample clusters exceeding 100M reads.
  • Compress processed files.
  • Call peaks.
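
The subsampling step caps the number of reads per cluster at 100M. One possible way to do this is uniform random sampling without replacement with a fixed seed, sketched below; how call_peaks_unparallel.sh actually subsamples is not specified here, so the function and its parameters are assumptions. The toy call uses a small cap so the example runs instantly.

```python
import random


def subsample_reads(reads, max_reads=100_000_000, seed=0):
    """Return at most `max_reads` reads, sampled uniformly, in original order."""
    if len(reads) <= max_reads:
        return list(reads)
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    keep = sorted(rng.sample(range(len(reads)), max_reads))
    return [reads[i] for i in keep]


# Toy example: cap a 10-read cluster at 4 reads
reads = [f"read{i}" for i in range(10)]
subset = subsample_reads(reads, max_reads=4)
print(len(subset))  # 4
```

Clusters at or below the cap are returned unchanged, so only oversized clusters are affected.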

Command Example:

bash call_peaks_unparallel.sh -c cells.txt -t tagAligns.txt -b barcodes.txt -o peaks/

Outputs:

  • Compressed TagAlign files
  • Merged peaks (mergedPeak.txt)

7. Generate Long-Format Matrices

Purpose: Create long-format matrices by intersecting reads with peaks.

Command:

python3 Josh_10XPipeline_withPeaks_justLFM.py \
-o output_dir \
-k barcodes.txt \
-a tagAlign.gz \
-n SAMPLE \
-t 24 \
-m 2

Outputs:

  • Long-format matrix (.long_fmt_mtx.txt.gz)
  • Filtered barcodes
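
A long-format matrix stores one row per (peak, barcode) pair with a count, rather than a dense peak-by-cell table. The sketch below illustrates the idea with a naive overlap loop on half-open intervals. The row layout, the peak naming, and the function name are assumptions; the actual columns of .long_fmt_mtx.txt.gz are not described in this document, and a real implementation would use an indexed interval search instead of comparing every read with every peak.

```python
from collections import Counter


def long_format_counts(reads, peaks):
    """Count reads overlapping each peak, per barcode.

    reads: iterable of (chrom, start, end, barcode)
    peaks: iterable of (chrom, start, end); intervals are half-open
    Returns sorted (peak_name, barcode, count) rows.
    """
    counts = Counter()
    for chrom, start, end, barcode in reads:
        for p_chrom, p_start, p_end in peaks:
            if chrom == p_chrom and start < p_end and end > p_start:
                counts[(f"{p_chrom}:{p_start}-{p_end}", barcode)] += 1
    return sorted((peak, bc, n) for (peak, bc), n in counts.items())


# Toy data (hypothetical coordinates and barcodes)
peaks = [("chr1", 100, 200), ("chr1", 500, 600)]
reads = [
    ("chr1", 150, 180, "AAAC"),
    ("chr1", 190, 210, "AAAC"),
    ("chr1", 550, 560, "CCGT"),
    ("chr1", 300, 320, "CCGT"),  # overlaps no peak, so it is not counted
]
rows = long_format_counts(reads, peaks)
for peak, barcode, n in rows:
    print(peak, barcode, n, sep="\t")
```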

8. Final Clustering (Peak-Based Analysis)

Purpose: Process and analyze snATAC-seq data using peaks for clustering.

Command:

Rscript 03_Seurat_snATAC_peaks_Harmony_reducePCs_all_samples_final.r

Key Steps:

  • Install dependencies.
  • Process ATAC fragment matrices.
  • Apply QC and metadata filtering.
  • Perform dimensionality reduction and clustering.
  • Sub-cluster and annotate cell types.
  • Calculate gene activity scores.
  • Visualize and export results.

Outputs:

  • Final Seurat object (atac_obj_peak_based_final.rds)
  • Visualization files (UMAP plots, dot plots)
  • Saved metadata

Final Notes

This pipeline provides a structured workflow for processing single-nucleus ATAC-seq data, with QC metrics generated at each step to support reproducibility. It integrates tools such as Cell Ranger ATAC, Seurat, Signac, and AMULET for data preprocessing, clustering, and analysis. For support, visit the GitHub Repository.