Gaulton Lab
Parul Kudtarkar, Emily Griffin, Kyle Gaulton
This pipeline is designed for reproducibility and transparency:
Purpose: Run Cell Ranger ATAC to generate position-sorted BAM files
Run the following script to process snATAC-seq FASTQ files using Cell Ranger ATAC. Replace HPAP-109 with sample names.
for SAMPLE in HPAP-109; do
~/scripts/cellranger-atac-2.0.0/cellranger-atac count \
--id ${SAMPLE} \
--fastqs ~/hpap/atac/${SAMPLE}/Upenn_scATACseq/fastq/ \
--sample ${SAMPLE} \
--reference ~/refdata-cellranger-arc-GRCh38-2020-A-2.0.0/ \
--localcores 24 \
--disable-ui;
doneInput: Raw Fastq files
Output: QC in HTML format, possorted_bam.bam: Position-sorted BAM file with barcode annotations
Purpose: The script snATAC_pipeline_hg38_10X.py is used to process the BAM file and generate QC metrics and chromatin accessibility matrices. The script processes snATAC-seq BAM files to:
python3 snATAC_pipeline_hg38_10X.py \
-b /hpap/atac/HPAP-043/Upenn_scATACseq/HPAP-043/outs/possorted_bam.bam \
-o HPAP-043 \
-n HPAP-043 \
-t 24 \
-m 2 \
--minimum-reads 1000Command Options:
-b: Path to the input BAM file from Cell Ranger ATAC.-o: Output directory for processed files.-n: Prefix for output file names.-t: Number of threads to use (default: 8).-m: Maximum memory per thread in GB (default: 4).--minimum-reads: Minimum unique usable reads per barcode (default: 500).Note: To determine --minimum-reads we plotted distribution of Reads per Barcode for each sample. Low-read barcodes form the left peak (likely background or noise). High-read barcodes form the right peak (true cells). We selected a cutoff that separates the low-quality and high-quality barcodes, based on this we determined 1000 to be cutoff for our dataset. During processing in Step 2:
Barcodes failing the threshold are excluded from:
Input Files:
*.bam).hg38.chrom.sizes).hg38-blacklist.v3.bed).Output Files:
SAMPLE.compiled.filt.bam: Filtered BAM file with updated read names.SAMPLE.filt.rmdup.bam: BAM file after duplicate removal.SAMPLE.qc_metrics.txt: QC metrics for each barcode.SAMPLE.mtx.gz: Sparse chromatin accessibility matrix.SAMPLE.barcodes: List of barcodes included in the matrix.SAMPLE.regions: List of genomic regions (features).SAMPLE.5kb_windows.bed: Non-overlapping 5 kb genomic windows filtered by blacklisted regions.This step involves window based clustering of snATAC-seq data using Seurat and Signac.
Purpose: Window-based clustering in Step II acts as an intermediate step to prepare the data for peak-based clustering. It groups cells based on broader chromatin accessibility patterns rather than specific peaks. Window-based clustering establishes the initial structure of cell populations, enabling robust peak calling and finer peak-based clustering in the final step.
Note: Why we can't directly use the peak calls from the 10X pipeline (step I-1)?
The 10X pipeline combines all cells to identify peaks, treating the dataset as bulk ATAC-seq data rather than single-cell data. Peaks identified globally may dilute or miss rare signals that are specific to subpopulations of cells, hence small cell types are missed.
Rscript 01_Seurat_snATAC_windows_Harmony_reducePCs.rWorkflow:
hvw.txt (hvw.txt is created by selecting the top 50,000 most variable windows using a subset of initial samples.) for dimensionality reduction.hvw.txt to refine final clustering.Key Steps in Script II:
windows_modified.txt for initial clustering based on global accessibility patterns.hvw.txt. (it is created by selecting the top 50,000 most variable windows)qc_metrics.txt file generated in Step 2.Inputs:
windows_modified.txt contains non-overlapping 5 kb genomic windows filtered by blacklisted regions.qc_metrics.txt) from Step 2.Outputs:
HVG_all_samples.rds.hvw.txt. hvw.txt is created by selecting the top 50,000 most variable windows using a subset of initial samples.Purpose: This step identifies multiplets (multiple cells encapsulated in single droplet) using AMULET. AMULET (ATAC-seq MULtiplet Estimation Tool) is designed to detect multiplets in snATAC-seq data by identifying cells with higher-than-expected read overlaps, which may arise due to droplet encapsulation errors during sequencing.
AMULET uses read overlap statistics to determine whether a barcode corresponds to a multiplet. (https://github.com/UcarLab/AMULET)
#!/bin/bash
N=5
j=0
BASE_DIR="/nfs/lab"
AMULET_DIR="$BASE_DIR/amulet_zip"
HPAP_DIR="$BASE_DIR/parulk/HPAP_scATAC/amulet"
OUT_DIR="$BASE_DIR/parulk/HPAP_scATAC/amulet"
SAMPLES=(HPAP-036 HPAP-039 ... HPAP-109)
for sample in "${SAMPLES[@]}"; do
((j=j%N)); ((j++==0)) && wait
BAM_PATH="$HPAP_DIR/$sample/Upenn_scATACseq/cellranger_RME/$sample/outs/possorted_bam.bam"
CSV_PATH="$HPAP_DIR/$sample/Upenn_scATACseq/cellranger_RME/$sample/outs/singlecell.csv"
$AMULET_DIR/AMULET.sh --forcesorted --bambc CB --bcidx 0 --cellidx 8 --iscellidx 9 \
$BAM_PATH \
$CSV_PATH \
$AMULET_DIR/human_autosomes.txt \
$AMULET_DIR/RepeatFilterFiles/blacklist_repeats_segdups_rmsk_hg38.bed \
$OUT_DIR/$sample \
$AMULET_DIR &
doneInputs:
Outputs:
MultipletBarcodes_01.txt
ACTAACGAGAGCCACA-1
AGCCTTCTCACGATTG-1
TAGGAGGTCAGGTTTG-1
AAATGAGCATTTGGCA-1
….Purpose: This step removes multiplets identified by AMULET and re-clusters the data for further analysis.
Rscript 02_Removing_multiplets.rInputs:
Outputs:
Key Steps:
Purpose: Perform peak calling by generating and merging TagAlign files, splitting by cell type, and subsampling where needed.
Key Step:
In step I.2 TagAlign files are created by filtering a deduplicated BAM file to retain only properly paired reads, adjusting their start and end positions based on specified shift and extension sizes. Each read's chromosome, coordinates, barcode, mapping quality, and strand orientation are extracted and written in a tab-delimited format. The resulting compressed TagAlign file is used for downstream analysis such as peak calling, quality control, and chromatin accessibility analysis.
merged_tagAlign.sh
#!/bin/bash
samples=("HPAP-035" "HPAP-036" "HPAP-039" "HPAP-040" "HPAP-044" "HPAP-045"
"HPAP-047" "HPAP-049" "HPAP-050" "HPAP-051" "HPAP-052" "HPAP-053"
"HPAP-054" "HPAP-055" "HPAP-056" "HPAP-059" "HPAP-061" "HPAP-062"
"HPAP-063" "HPAP-064" "HPAP-067" "HPAP-069" "HPAP-072" "HPAP-075"
"HPAP-077" "HPAP-079" "HPAP-080" "HPAP-081" "HPAP-083" "HPAP-084"
"HPAP-085" "HPAP-088" "HPAP-092" "HPAP-099" "HPAP-100" "HPAP-101"
"HPAP-103" "HPAP-104" "HPAP-105" "HPAP-106" "HPAP-109")
for SAMPLE in "${samples[@]}"; do
zcat $SAMPLE.filt.rmdup.tagAlign.gz | awk -v SAMPLE=$SAMPLE \
'BEGIN{FS=OFS="\t"} {print $1,$2,$3,$4"-1",$5,$6}'
done | sort -T /nfs/lab/parulk/HPAP_scATAC/tmp -k1,1 -k2,2n -S 64G | bgzip \
-c -@ 16 > /nfs/lab/parulk/HPAP_scATAC/41merged.tagAlign2.gzsplitTagAlign_parallelized.sh -c cells.txt -t 41merged.tagAlign2.gz -b \
barcodes.txt -o /nfs/lab/parulk/HPAP_scATAC/peak_calls_leiden/Check the number of reads with wc -l *tagAlign > tagAlign_counts.txt
179947374 split.10.tagAlign
191518770 split.11.tagAlign
152204378 split.12.tagAlign
182024692 split.13.tagAlign
107727312 split.14.tagAlign
76308524 split.15.tagAlign
120399652 split.16.tagAlign
29849565 split.17.tagAlign
54304928 split.18.tagAlign
44476814 split.19.tagAlign
202435758 split.1.tagAlign
53143650 split.20.tagAlign
10674640 split.21.tagAlign
19420652 split.22.tagAlign
1819296 split.23.tagAlign
789413 split.24.tagAlign
401137 split.25.tagAlign
796720 split.26.tagAlign
443876652 split.2.tagAlign
452888720 split.3.tagAlign
293626898 split.4.tagAlign
380076598 split.5.tagAlign
328138582 split.6.tagAlign
287800546 split.7.tagAlign
204336826 split.8.tagAlign
231128946 split.9.tagAlign> 100 million reads: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13.
< 100 million reads: 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26.
for ct in {1..13}
do
orig_tagAlign="split."$ct".tagAlign"
new_tagAlign="split.subsample."$ct".tagAlign"
subsample --reservoir -n 100000000 $orig_tagAlign > $new_tagAlign
rm $orig_tagAlign
donefor ct in {14..26}
do
orig_tagAlign="split."$ct".tagAlign"
new_tagAlign="split.subsample."$ct".tagAlign"
mv $orig_tagAlign $new_tagAlign
donefor ct in {1..26}
do
new_tagAlign="split.subsample."$ct".tagAlign"
gzip -cvf $new_tagAlign > $new_tagAlign".gz"
doneNotes:
subsample with the correct path or module if needed.Inputs:
nohup bash call_peaks_unparallel.sh -c cells.txt -t \
split.subsample.tagAligns_byCT.txt -b barcodes.txt -o peaks/ &awk -v OFS='\t' ' { print $1, $2, $3} ' *merged* > allPeaksAnno.bed
sort -k1,1 -k2,2n allPeaksAnno.bed > allPeaksAnno_sorted.bed
bedtools merge -i allPeaksAnno_sorted.bed > mergedPeak.txt
awk -F '\t' '{print $1,":",$2,"-",$3}' mergedPeak.txt > \
mergedPeak_formatted.txtOutputs:
Key Steps
nohup /usr/bin/python3 Josh_10XPipeline_withPeaks_justLFM.py \
-o /nfs/lab/parulk/HPAP_scATAC/peak_calls_leiden/lfm \
-k /nfs/lab/parulk/HPAP_scATAC/peak_calls_leiden/lfm/filtered_sample_barcodes/HPAP-106.barcodes \
-a lfm1/HPAP-106.filt.rmdup.tagAlign.gz \
-n HPAP-106 \
-t 24 \
-m 2 &Inputs
Outputs
Purpose: This script is designed for processing, integrating, and analyzing single-nucleus ATAC-seq (snATAC-seq) data using Seurat and Signac packages in R. It implements data preprocessing, quality control, dimensionality reduction, clustering, visualization, and sub-clustering for detailed characterization of cell types and conditions. It integrates multiple samples and processes metadata to enable downstream analyses. The script specifically uses peaks to create the Seurat object for chromatin accessibility analysis.
Rscript 03_Seurat_snATAC_peaks_Harmony_reducePCs_all_samples_final.rKey Steps:
Inputs:
Output:
FIN