Bulk RNA-seq Data Standards – PanKbase
This document describes data generation, processing, and QC standards in PanKbase for bulk RNA-sequencing studies, largely adopted from standards developed by the GTEx consortium.
Overview
Bulk RNA-sequencing (RNA-seq) produces sequencing data that describe the abundance of RNA transcripts in a biosample, including protein-coding and long non-coding RNA transcripts.
I. Data Generation
Data generation standards (from GTEx):
- Read length: At least 50 bp
- Insert size: Average library insert size should be 200 base pairs
- Sequencing depth: Minimum of 15 million reads per RNA-seq library
- Replicates:
  - Two or more biological replicates
    - Anisogenic: Biosamples from different donors
    - Isogenic: Independent biosamples from the same donor
  - Technical replicates (optional): Replicate experiments or replicate sequencing of the same biosample
> Although not required, RNA-seq experiments could include quantitative standards such as spike-ins of RNA of known length and quantity to help calibrate quantification, sensitivity, and coverage.
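The numeric data generation standards above can be checked before a study is submitted. The following is a minimal sketch, not part of the PanKbase or GTEx tooling: the `Library` class, its field names, and the `generation_issues` function are hypothetical, and each library is assumed to correspond to one biological replicate. Insert size is left out because the standard gives a target average rather than a threshold.

```python
from dataclasses import dataclass

# Thresholds taken from the data generation standards above.
MIN_READ_LENGTH_BP = 50
MIN_READS_PER_LIBRARY = 15_000_000
MIN_BIOLOGICAL_REPLICATES = 2


@dataclass
class Library:
    """Hypothetical per-library summary; field names are illustrative."""
    library_id: str
    read_length_bp: int
    total_reads: int


def generation_issues(libraries):
    """Return a list of problems; an empty list means every check passed.

    Assumption: each Library is one biological replicate.
    """
    issues = []
    if len(libraries) < MIN_BIOLOGICAL_REPLICATES:
        issues.append(
            f"only {len(libraries)} biological replicate(s); "
            f"need at least {MIN_BIOLOGICAL_REPLICATES}"
        )
    for lib in libraries:
        if lib.read_length_bp < MIN_READ_LENGTH_BP:
            issues.append(
                f"{lib.library_id}: read length {lib.read_length_bp} bp "
                f"is below {MIN_READ_LENGTH_BP} bp"
            )
        if lib.total_reads < MIN_READS_PER_LIBRARY:
            issues.append(
                f"{lib.library_id}: {lib.total_reads} reads "
                f"is below {MIN_READS_PER_LIBRARY}"
            )
    return issues


if __name__ == "__main__":
    study = [
        Library("donor1_rep", 100, 20_000_000),
        Library("donor2_rep", 36, 10_000_000),
    ]
    for issue in generation_issues(study):
        print(issue)
```

Returning a list of messages rather than a single boolean makes it easier to report every failing library at once.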
II. Data Processing
The recommended pipeline is the GTEx RNA-seq pipeline.
GTEx v10 Pipeline:
- Software versions:
  - STAR v2.7.10a
  - RSEM v1.3.3
  - RNA-SeQC v2.4.2
- Resources:
  - Human genome: GRCh38
  - Annotation: GENCODE v39
Key pipeline steps:
- Building indexes:
  - STAR and RSEM indexes must be built specifically for each read length
- Alignment:
  - STAR is used to align sequence reads to the reference genome; it can also report the alignments in transcript coordinates
- Mark duplicates:
  - Picard tools are used to mark duplicate reads in the alignments
- Quality control:
  - RNA-SeQC is used for quality control of the alignments
- Quantification:
  - RSEM is used to quantify transcript expression
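The read-length dependence of the STAR index comes from the `--sjdbOverhang` option, which the STAR manual recommends setting to the read length minus one. The sketch below only assembles command lines for the index-building and alignment steps; it does not run them. The function names, file paths, and thread counts are hypothetical, and the options shown are a simplified subset, not the full parameter set used by the GTEx pipeline.

```python
import shlex


def star_index_command(genome_fasta, annotation_gtf, index_dir,
                       read_length, threads=4):
    """Build a STAR genomeGenerate command for one read length."""
    overhang = read_length - 1  # STAR manual: max(read length) - 1
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", index_dir,
        "--genomeFastaFiles", genome_fasta,
        "--sjdbGTFfile", annotation_gtf,
        "--sjdbOverhang", str(overhang),
        "--runThreadN", str(threads),
    ]


def star_align_command(index_dir, fastq_1, fastq_2, out_prefix, threads=4):
    """Build a STAR alignment command for gzipped paired-end reads.

    --quantMode TranscriptomeSAM asks STAR to also write alignments in
    transcript coordinates, which is the input RSEM expects.
    """
    return [
        "STAR",
        "--runMode", "alignReads",
        "--genomeDir", index_dir,
        "--readFilesIn", fastq_1, fastq_2,
        "--readFilesCommand", "zcat",
        "--outSAMtype", "BAM", "Unsorted",
        "--quantMode", "TranscriptomeSAM",
        "--outFileNamePrefix", out_prefix,
        "--runThreadN", str(threads),
    ]


if __name__ == "__main__":
    # All paths below are placeholders for illustration.
    index_cmd = star_index_command(
        "GRCh38.fa", "gencode.v39.gtf", "star_index_oh75", read_length=76
    )
    align_cmd = star_align_command(
        "star_index_oh75", "sample_R1.fastq.gz", "sample_R2.fastq.gz",
        "sample."
    )
    print(shlex.join(index_cmd))
    print(shlex.join(align_cmd))
```

Naming the index directory after the overhang (here `oh75`) is one way to keep indexes for different read lengths from being mixed up.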
Pipeline Outputs
- Genome alignment: .bam
- Transcript alignment: .bam
- Normalized signal: .bigWig
- Gene quantifications: .tsv
- Transcript quantifications: .tsv
- Quality control metrics: .txt
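A quick completeness check on a sample's output files can be derived from the list above. The sketch below counts files by extension only, since the document does not specify file names; the `missing_outputs` function and the example file names are hypothetical.

```python
from collections import Counter
from pathlib import PurePath

# Number of files expected per extension, from the output list above:
# two .bam (genome, transcript), one .bigWig, two .tsv (gene,
# transcript), one .txt (QC metrics).
REQUIRED_COUNTS = {".bam": 2, ".bigWig": 1, ".tsv": 2, ".txt": 1}


def missing_outputs(filenames):
    """Return {extension: number of files still missing}.

    The comparison is case-sensitive, so ".bigwig" does not count as
    ".bigWig".
    """
    found = Counter(PurePath(name).suffix for name in filenames)
    return {
        ext: needed - found[ext]
        for ext, needed in REQUIRED_COUNTS.items()
        if found[ext] < needed
    }


if __name__ == "__main__":
    # Placeholder file names for illustration.
    files = [
        "sample.genome.bam",
        "sample.transcriptome.bam",
        "sample.bigWig",
        "sample.genes.tsv",
        "sample.qc.txt",
    ]
    print(missing_outputs(files))
```

An empty dictionary means every expected output type is present in the expected number.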
III. Data Quality Control
High-quality samples are defined based on GTEx criteria:
- Aligned reads:
  - > 10 million aligned reads (or mate-pairs if paired-end)
- Mapping rate criteria:
  - Read mapping rate > 0.2
  - Intergenic mapping rate < 0.3
  - Base mismatch rate < 0.01 for mate 1 and < 0.02 for mate 2
  - Ribosomal RNA (rRNA) mapping rate < 0.3
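The sample-level criteria above translate directly into a threshold check. The following is a minimal sketch: the metric keys are hypothetical and would need to be mapped to the column names in the actual QC metrics file, and the mismatch criterion is read as requiring both mates to pass, which is an assumption.

```python
def qc_failures(metrics):
    """Return the names of failed criteria; an empty list means high quality.

    `metrics` is a dict with hypothetical keys (not RNA-SeQC column names).
    Assumption: both mates must meet their mismatch-rate limits.
    """
    checks = {
        "aligned_reads": metrics["aligned_reads"] > 10_000_000,
        "mapping_rate": metrics["mapping_rate"] > 0.2,
        "intergenic_rate": metrics["intergenic_rate"] < 0.3,
        "mismatch_rate_mate1": metrics["mismatch_rate_mate1"] < 0.01,
        "mismatch_rate_mate2": metrics["mismatch_rate_mate2"] < 0.02,
        "rrna_rate": metrics["rrna_rate"] < 0.3,
    }
    return [name for name, passed in checks.items() if not passed]


if __name__ == "__main__":
    # Illustrative values only, not real data.
    sample = {
        "aligned_reads": 42_000_000,
        "mapping_rate": 0.95,
        "intergenic_rate": 0.05,
        "mismatch_rate_mate1": 0.002,
        "mismatch_rate_mate2": 0.004,
        "rrna_rate": 0.35,
    }
    failed = qc_failures(sample)
    print("high quality" if not failed else f"failed: {failed}")
```

For single-end data, the mate 2 check would have to be dropped or supplied with a passing placeholder value.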
Recommended correlations:
- Spearman correlation > 0.9 between isogenic replicates (same donor)
- Spearman correlation > 0.8 between anisogenic replicates (different donors within the same study)
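The replicate correlations above can be computed from two vectors of gene-level expression values covering the same genes in the same order. The sketch below implements Spearman correlation in plain Python as the Pearson correlation of average ranks; the function names are hypothetical, and the document does not specify which expression values (for example TPM or counts) or which gene set should be compared, so those choices are left to the caller.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if a vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


def replicates_concordant(x, y, isogenic):
    """Apply the recommended threshold: > 0.9 isogenic, > 0.8 anisogenic."""
    threshold = 0.9 if isogenic else 0.8
    return spearman(x, y) > threshold


if __name__ == "__main__":
    # Illustrative expression values for five genes, not real data.
    rep_a = [0.0, 1.5, 10.2, 33.0, 120.4]
    rep_b = [0.1, 2.0, 8.9, 40.1, 99.7]
    print(spearman(rep_a, rep_b), replicates_concordant(rep_a, rep_b, True))
```

In practice a library routine such as `scipy.stats.spearmanr` would serve the same purpose; the plain-Python version is shown only to keep the example free of dependencies.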