I. Data Processing

Overview

This pipeline is designed for reproducibility and transparency:

All scripts are documented and available in the GitHub repository.
Intermediate and final results are saved, along with their metadata, in the Data Library.
Outputs include both raw and processed data to ensure traceability.
QC metrics are generated at each step.
For inquiries or feedback, please open a GitHub issue.
The project is released under the MIT License, which permits reuse and modification with attribution.

1. Cell Ranger Single Cell ATAC

Purpose: Generate position-sorted BAM files using Cell Ranger ATAC.

Command:

# Run Cell Ranger ATAC once per sample; add more sample IDs to the list as needed.
for SAMPLE in HPAP-109; do
    ~/scripts/cellranger-atac-2.0.0/cellranger-atac count \
        --id "${SAMPLE}" \
        --fastqs ~/hpap/atac/"${SAMPLE}"/Upenn_scATACseq/fastq/ \
        --sample "${SAMPLE}" \
        --reference ~/refdata-cellranger-arc-GRCh38-2020-A-2.0.0/ \
        --localcores 24 \
        --disable-ui
done

Inputs: Raw FASTQ files
Outputs: QC HTML files, position-sorted BAM files with barcode annotations
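
When many samples are processed, it can help to assemble the same command programmatically and review it before launching. The following is a minimal Python sketch, not part of the pipeline: the helper name build_count_command and the relative paths are hypothetical, and only flags that appear in the command above are used.

```python
from pathlib import PurePosixPath


def build_count_command(sample, fastq_root, reference,
                        cellranger="cellranger-atac", cores=24):
    """Return the argument list for one `cellranger-atac count` run.

    The directory layout <fastq_root>/<sample>/Upenn_scATACseq/fastq mirrors
    the shell command in this section; adjust it for other layouts.
    """
    fastqs = PurePosixPath(fastq_root) / sample / "Upenn_scATACseq" / "fastq"
    return [
        cellranger, "count",
        "--id", sample,
        "--fastqs", str(fastqs),
        "--sample", sample,
        "--reference", str(reference),
        "--localcores", str(cores),
        "--disable-ui",
    ]


samples = ["HPAP-109"]  # extend with further sample IDs
commands = [
    build_count_command(s, "hpap/atac",
                        "refdata-cellranger-arc-GRCh38-2020-A-2.0.0")
    for s in samples
]
for cmd in commands:
    # Print for review; pass cmd to subprocess.run(cmd, check=True) to launch.
    print(" ".join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems if a sample ID ever contains unusual characters.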

2. Generate QC Metrics and Chromatin Accessibility Matrices

Purpose: Process BAM files to filter reads, calculate QC metrics, and generate chromatin accessibility matrices.

Command:

python3 snATAC_pipeline_hg38_10X.py \
-b SAMPLE/possorted_bam.bam \
-o SAMPLE \
-n SAMPLE \
-t 24 \
-m 2 \
--minimum-reads 1000

Options:

-b: Input BAM file
-o: Output directory
-n: Prefix for output names
-t: Number of threads (default: 8)
-m: Memory per thread in GB (default: 4)
--minimum-reads: Minimum reads per barcode (default: 500)
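
To make the documented defaults concrete, here is a hedged sketch of how such an interface could be declared with argparse. It is an illustration only and does not reproduce the parser in snATAC_pipeline_hg38_10X.py: the destination names (bam, outdir, name, threads, memory) and the choice to mark -b, -o, and -n as required are assumptions.

```python
import argparse


def build_parser():
    """Declare the options listed above with their documented defaults."""
    parser = argparse.ArgumentParser(
        description="Illustrative option parser for the snATAC processing step")
    # Marking these three as required is an assumption of this sketch.
    parser.add_argument("-b", dest="bam", required=True, help="Input BAM file")
    parser.add_argument("-o", dest="outdir", required=True,
                        help="Output directory")
    parser.add_argument("-n", dest="name", required=True,
                        help="Prefix for output names")
    parser.add_argument("-t", dest="threads", type=int, default=8,
                        help="Number of threads")
    parser.add_argument("-m", dest="memory", type=int, default=4,
                        help="Memory per thread in GB")
    parser.add_argument("--minimum-reads", type=int, default=500,
                        help="Minimum reads per barcode")
    return parser


# The invocation shown in this section overrides all three defaults.
args = build_parser().parse_args([
    "-b", "SAMPLE/possorted_bam.bam", "-o", "SAMPLE", "-n", "SAMPLE",
    "-t", "24", "-m", "2", "--minimum-reads", "1000",
])
print(args.threads, args.memory, args.minimum_reads)
```

Note that the example command in this section raises the read threshold to 1000 and lowers the per-thread memory to 2 GB relative to the defaults.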

Outputs:

Processed BAM files
QC metrics
Sparse chromatin accessibility matrix
Genomic windows filtered by blacklisted regions
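
The idea behind the window-based matrix can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the pipeline's implementation: the 5 kb window size, the "at least minimum_reads" rule, the function names, and the in-memory read tuples are all hypothetical choices made for the example.

```python
from collections import Counter

WINDOW_SIZE = 5000  # assumed window width in bp; the pipeline's value may differ


def overlaps_blacklist(chrom, win_start, win_end, blacklist):
    """True if the half-open window [win_start, win_end) hits a blacklisted region."""
    return any(c == chrom and s < win_end and e > win_start
               for c, s, e in blacklist)


def window_counts(reads, blacklist, minimum_reads=500, window_size=WINDOW_SIZE):
    """Count reads per (chrom, window_start, barcode).

    reads: iterable of (chrom, position, barcode) tuples.
    Barcodes with fewer than minimum_reads reads are dropped, and windows
    overlapping a blacklisted region are skipped.
    """
    reads = list(reads)
    per_barcode = Counter(barcode for _, _, barcode in reads)
    kept = {bc for bc, n in per_barcode.items() if n >= minimum_reads}

    counts = Counter()
    for chrom, pos, barcode in reads:
        if barcode not in kept:
            continue
        start = (pos // window_size) * window_size
        if overlaps_blacklist(chrom, start, start + window_size, blacklist):
            continue
        counts[(chrom, start, barcode)] += 1
    return counts


# Toy data: barcode CCC has too few reads; the window at 10000 is blacklisted.
toy_reads = [("chr1", 100, "AAA"), ("chr1", 4999, "AAA"),
             ("chr1", 5000, "AAA"), ("chr1", 12000, "AAA"),
             ("chr1", 100, "CCC")]
toy_blacklist = [("chr1", 10000, 11000)]
matrix = window_counts(toy_reads, toy_blacklist, minimum_reads=2)
print(dict(matrix))
```

The resulting dictionary holds only non-zero entries, which is the same idea as the sparse matrix listed in the outputs.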

3. Window-Based Clustering

Purpose: Perform window-based clustering to prepare the data for peak-based clustering.

Command:

Rscript 01_Seurat_snATAC_windows_Harmony_reducePCs.r

Workflow:

Process input matrices and metadata.
Perform batch correction and clustering.
Generate UMAP projections and annotate clusters.

Outputs:

Clustered Seurat object (HVG_all_samples.rds)
Highly variable windows (hvw.txt)
Gene activity scores integrated with metadata

4. Multiplet Detection with AMULET

Purpose: Detect multiplets using AMULET.

Command:

AMULET.sh --forcesorted --bambc CB --bcidx 0 --cellidx 8 --iscellidx 9 \
  BAM_PATH CSV_PATH autosomes.txt blacklist.bed OUTPUT_DIR

Outputs:

Filtered barcode lists identifying multiplets
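
The --bcidx, --cellidx, and --iscellidx values in the command are column positions in the CSV file. The sketch below shows how those three indices can be read, assuming they are zero-based, that the CSV has a header row, and that a value of 1 in the is-cell column marks a called cell. The toy CSV and its column names are invented for illustration and do not reproduce any real Cell Ranger output.

```python
import csv
import io


def called_cells(csv_text, bcidx=0, cellidx=8, iscellidx=9):
    """Map barcode -> cell ID for rows whose is-cell column equals "1"."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row (assumed to be present)
    return {row[bcidx]: row[cellidx]
            for row in reader
            if row[iscellidx] == "1"}


# Invented ten-column CSV: barcode in column 0, cell ID in 8, is-cell flag in 9.
toy_csv = "\n".join([
    "barcode,c1,c2,c3,c4,c5,c6,c7,cell_id,is_cell",
    "AAAC-1,0,0,0,0,0,0,0,_cell_0,1",
    "AAAG-1,0,0,0,0,0,0,0,None,0",
    "AAAT-1,0,0,0,0,0,0,0,_cell_1,1",
])
cells = called_cells(toy_csv)
print(cells)
```

Checking a few rows this way before running AMULET is a quick guard against passing indices that do not match the CSV layout.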

5. Removing Multiplets and Reclustering

Purpose: Remove multiplets and refine clusters.

Command:

Rscript 02_Removing_multiplets.r

Outputs:

Final Seurat object with multiplets removed
Updated QC metrics and refined clusters
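
Conceptually, this step drops every barcode that the multiplet detection flagged before clustering again. The pipeline does this in R; the following Python sketch only illustrates the filtering logic, and the function name and barcode lists are hypothetical.

```python
def remove_multiplets(barcodes, multiplets):
    """Return the barcodes that were not flagged as multiplets, order preserved."""
    flagged = set(multiplets)
    return [bc for bc in barcodes if bc not in flagged]


all_barcodes = ["AAAC-1", "AAAG-1", "AAAT-1", "AACA-1"]
flagged_multiplets = ["AAAG-1"]

singlets = remove_multiplets(all_barcodes, flagged_multiplets)
removed_fraction = 1 - len(singlets) / len(all_barcodes)
print(singlets, f"{removed_fraction:.0%} removed")
```

Reporting the removed fraction per sample gives a simple sanity check, since an unusually high value may point to a problem upstream.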

6. Peak Calling on Clusters

Purpose: Call peaks for each cell type after generating and merging TagAlign files, splitting them by cell type, and subsampling large clusters.

Key Steps:

Generate and merge TagAlign files.
Split merged TagAlign files by cell type.
Subsample clusters that exceed 100 million reads.
Compress processed files.
Call peaks.
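
The split and subsample steps above can be sketched as follows. This is an in-memory illustration under stated assumptions and not the logic of call_peaks_unparallel.sh: it assumes the barcode sits in the fourth (name) column of each TagAlign row, uses hypothetical function names, and fixes a random seed so the subsample is reproducible. Real files at this scale would need a streaming approach.

```python
import random
from collections import defaultdict

MAX_READS = 100_000_000  # threshold named in the key steps


def split_by_cell_type(rows, barcode_to_type):
    """Group TagAlign rows by cell type.

    rows: (chrom, start, end, barcode, score, strand) tuples.
    Rows whose barcode has no cell-type assignment are dropped.
    """
    groups = defaultdict(list)
    for row in rows:
        cell_type = barcode_to_type.get(row[3])
        if cell_type is not None:
            groups[cell_type].append(row)
    return dict(groups)


def subsample(rows, max_reads=MAX_READS, seed=0):
    """Return rows unchanged if within the cap, else a random sample of max_reads."""
    if len(rows) <= max_reads:
        return list(rows)
    return random.Random(seed).sample(rows, max_reads)


toy_rows = [("chr1", i, i + 50, "AAA", 1000, "+") for i in range(5)]
toy_rows.append(("chr1", 900, 950, "CCC", 1000, "-"))
toy_rows.append(("chr1", 990, 1040, "GGG", 1000, "+"))  # unassigned barcode
cell_types = {"AAA": "alpha", "CCC": "beta"}

groups = split_by_cell_type(toy_rows, cell_types)
capped = {name: subsample(rows, max_reads=3) for name, rows in groups.items()}
print({name: len(rows) for name, rows in capped.items()})
```

Capping very large clusters keeps peak calling from being dominated by the most abundant cell types.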

Command Example:

bash call_peaks_unparallel.sh -c cells.txt -t tagAligns.txt -b barcodes.txt -o peaks/

Outputs:

Compressed TagAlign files
Merged peaks (mergedPeak.txt)

7. Generate Long-Format Matrices

Purpose: Create long-format matrices by intersecting reads with peaks.

Command:

python3 Josh_10XPipeline_withPeaks_justLFM.py \
-o output_dir \
-k barcodes.txt \
-a tagAlign.gz \
-n SAMPLE \
-t 24 \
-m 2

Outputs:

Long-format matrix (.long_fmt_mtx.txt.gz)
Filtered barcodes
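
Intersecting reads with peaks amounts to counting, for each peak and barcode, how many reads overlap the peak. The Python sketch below illustrates this under stated assumptions: coordinates are half-open as in BED, peaks on a chromosome do not overlap one another (as expected after merging), and each output record is a (peak, barcode, count) triple. The actual column layout of the .long_fmt_mtx.txt.gz file is not specified here, and the function names are hypothetical.

```python
from bisect import bisect_right
from collections import Counter, defaultdict


def index_peaks(peaks):
    """Build per-chromosome sorted lists of peak starts and ends."""
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        index[chrom] = ([s for s, _ in intervals], [e for _, e in intervals])
    return index


def long_format_counts(reads, peaks):
    """Return sorted (peak, barcode, count) triples for overlapping reads.

    reads: (chrom, start, end, barcode) tuples with half-open coordinates.
    """
    index = index_peaks(peaks)
    counts = Counter()
    for chrom, start, end, barcode in reads:
        if chrom not in index:
            continue
        starts, ends = index[chrom]
        i = bisect_right(ends, start)  # first peak that ends after the read starts
        while i < len(starts) and starts[i] < end:
            counts[(f"{chrom}:{starts[i]}-{ends[i]}", barcode)] += 1
            i += 1
    return sorted((peak, bc, n) for (peak, bc), n in counts.items())


toy_peaks = [("chr1", 100, 200), ("chr1", 500, 600)]
toy_reads = [("chr1", 150, 250, "AAA"), ("chr1", 190, 510, "AAA"),
             ("chr1", 300, 400, "AAA"), ("chr1", 550, 560, "CCC")]
records = long_format_counts(toy_reads, toy_peaks)
for record in records:
    print(*record, sep="\t")
```

A long format stores one row per non-zero peak and barcode pair, which stays compact because most such pairs have no reads.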

8. Final Clustering (Peak-Based Analysis)

Purpose: Process and analyze snATAC-seq data using peaks for clustering.

Command:

Rscript 03_Seurat_snATAC_peaks_Harmony_reducePCs_all_samples_final.r

Key Steps:

Install dependencies.
Process ATAC fragment matrices.
Apply QC and metadata filtering.
Perform dimensionality reduction and clustering.
Sub-cluster and annotate cell types.
Calculate gene activity scores.
Visualize and export results.

Outputs:

Final Seurat object (atac_obj_peak_based_final.rds)
Visualization files (UMAP plots, dot plots)
Saved metadata

Final Notes

This pipeline provides a structured workflow for processing single-nucleus ATAC-seq (snATAC-seq) data, with QC at each step to support reproducibility. It integrates Cell Ranger ATAC, Seurat, Signac, and AMULET for preprocessing, clustering, and analysis. For support, visit the GitHub repository.
