Bulk RNA-sequencing
This document describes the data generation, processing, and quality control (QC) standards that PanKbase applies to bulk RNA-sequencing studies. These standards are largely adopted from those developed by the GTEx consortium.
Overview
Bulk RNA-sequencing (RNA-seq) produces sequencing data that describe the abundance of RNA transcripts in a biosample, covering RNA molecules such as protein-coding and long non-coding RNA transcripts.
I. Data generation
Data generation standards (from GTEx)
Read length: The read length of sequencing data should be at least 50 bp.
Insert size: The average library insert size should be 200 bp.
Sequencing depth: Each RNA-seq library should be sequenced to a minimum depth of 15 million reads.
Replicates: Studies should be performed with two or more biological replicates. Biological replicates can be anisogenic (biosamples from different donors) or isogenic (independent biosamples from the same donor). Technical replicates are not required but can also be included, for example as replicate experiments or replicate sequencing of the same biosample.
Although not required, RNA-seq experiments could include quantitative standards such as spike-ins of RNA of known length and quantity to help calibrate quantification, sensitivity and coverage of an experiment.
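As an illustration only, the numeric thresholds above can be checked against study metadata before submission. The sketch below assumes a hypothetical `Library` record (the field names are not part of any PanKbase or GTEx schema) and assumes that each record represents one biological replicate. Insert size is left out because the standard states a target average rather than a pass/fail cutoff.

```python
from dataclasses import dataclass

# Thresholds copied from the data generation standards above.
MIN_READ_LENGTH_BP = 50
MIN_SEQUENCING_DEPTH = 15_000_000
MIN_BIOLOGICAL_REPLICATES = 2


@dataclass
class Library:
    """Hypothetical metadata record; field names are illustrative assumptions."""
    name: str
    read_length_bp: int
    total_reads: int


def check_generation_standards(libraries):
    """Return a list of problems; an empty list means every check passed.

    Assumption: each Library in the list is one biological replicate.
    """
    problems = []
    if len(libraries) < MIN_BIOLOGICAL_REPLICATES:
        problems.append(
            f"study has {len(libraries)} biological replicate(s), "
            f"expected at least {MIN_BIOLOGICAL_REPLICATES}"
        )
    for lib in libraries:
        if lib.read_length_bp < MIN_READ_LENGTH_BP:
            problems.append(f"{lib.name}: read length {lib.read_length_bp} bp is below {MIN_READ_LENGTH_BP} bp")
        if lib.total_reads < MIN_SEQUENCING_DEPTH:
            problems.append(f"{lib.name}: depth {lib.total_reads} reads is below {MIN_SEQUENCING_DEPTH}")
    return problems


if __name__ == "__main__":
    study = [Library("donor1", 100, 32_000_000), Library("donor2", 100, 12_500_000)]
    for problem in check_generation_standards(study):
        print(problem)
```

Returning a list of messages rather than a single boolean makes it easier to report every failed standard at once.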
II. Data processing
The recommended pipeline to perform data processing is the GTEx RNA-seq pipeline.
https://github.com/broadinstitute/gtex-pipeline/tree/master/rnaseq
The V10 release of the GTEx pipeline uses the following:
Software versions: STAR v2.7.10a, RSEM v1.3.3, RNA-SeQC v2.4.2
Resources: Human genome GRCh38, GENCODE v39
The key steps of the pipeline include:
Building indexes: The indexes for STAR and RSEM need to be built specifically for each read length.
Alignment: The STAR aligner is used to map sequence reads to transcripts.
Mark duplicates: Picard is used to mark duplicate reads in alignments.
Quality control: The RNA-SeQC tool is used to perform quality control of alignments.
Quantification: The RSEM tool is used to perform quantification of transcripts.
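To make the read-length dependence of the index step concrete, the sketch below assembles a plain STAR `genomeGenerate` command for a given read length. It follows the STAR manual's guidance that `--sjdbOverhang` is ideally the read length minus one. The function name, file paths, and output directory naming are hypothetical, and this is not the exact invocation used by the GTEx pipeline; the commands documented in the GTEx repository take precedence.

```python
def star_index_command(read_length, genome_fasta, annotation_gtf, out_dir, threads=4):
    """Build the argument list for a STAR index tailored to one read length.

    Paths and the output directory layout are placeholders (assumptions),
    not values prescribed by PanKbase or GTEx.
    """
    if read_length < 2:
        raise ValueError("read_length must be at least 2")
    overhang = read_length - 1  # STAR manual: ideal value is read length - 1
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", f"{out_dir}/star_index_oh{overhang}",
        "--genomeFastaFiles", genome_fasta,
        "--sjdbGTFfile", annotation_gtf,
        "--sjdbOverhang", str(overhang),
        "--runThreadN", str(threads),
    ]


if __name__ == "__main__":
    # Print the command instead of running it; STAR may not be installed here.
    print(" ".join(star_index_command(76, "GRCh38.fa", "gencode.v39.gtf", "indexes")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where STAR is installed.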
Pipeline outputs
Genome alignment - .bam format
Transcript alignment - .bam format
Normalized signal - .bigWig format
Gene quantifications - .tsv format
Transcript quantifications - .tsv format
Quality control metrics - .txt format
III. Data quality control
Samples are considered ‘high-quality’ based on these criteria (from GTEx):
Aligned reads: RNA-seq library should have:
Aligned reads, or mate-pairs if paired-end sequencing, >10 million
Mapping rate: Alignments satisfy the following criteria:
Read mapping rate >0.2
Intergenic mapping rate <0.3
Base mismatch rate <0.01 for mate 1 or <0.02 for mate 2
Ribosomal RNA (rRNA) mapping rate <0.3
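The criteria above translate directly into a table of threshold checks. The following sketch is one possible encoding, with several assumptions: the metric keys are hypothetical and do not correspond to column names in any particular QC report, each mate is checked against its own mismatch threshold, and the mate 2 check is skipped when that metric is absent (single-end data).

```python
import operator

# (metric key, comparison, threshold); keys are illustrative assumptions.
QC_RULES = [
    ("aligned_reads", operator.gt, 10_000_000),
    ("mapping_rate", operator.gt, 0.2),
    ("intergenic_rate", operator.lt, 0.3),
    ("mate1_mismatch_rate", operator.lt, 0.01),
    ("mate2_mismatch_rate", operator.lt, 0.02),
    ("rrna_rate", operator.lt, 0.3),
]

# Metrics that may legitimately be missing (assumption: single-end libraries).
OPTIONAL_METRICS = {"mate2_mismatch_rate"}


def failed_qc_checks(metrics):
    """Return the keys of all criteria the sample fails; empty means high-quality."""
    failed = []
    for key, compare, threshold in QC_RULES:
        value = metrics.get(key)
        if value is None:
            if key not in OPTIONAL_METRICS:
                failed.append(key)  # a required metric is missing
            continue
        if not compare(value, threshold):
            failed.append(key)
    return failed


if __name__ == "__main__":
    sample = {
        "aligned_reads": 24_000_000,
        "mapping_rate": 0.93,
        "intergenic_rate": 0.05,
        "mate1_mismatch_rate": 0.002,
        "mate2_mismatch_rate": 0.004,
        "rrna_rate": 0.35,
    }
    print(failed_qc_checks(sample))
```

Keeping the thresholds in one table means a change to the standards only needs to be made in one place.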
In addition, experiments should ideally have:
Spearman correlation of gene quantifications >0.9 between isogenic replicates (biosamples from the same donor)
Spearman correlation of gene quantifications >0.8 between anisogenic replicates (biosamples from different donors within the same study)
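For reference, replicate concordance can be computed as the Pearson correlation of ranked gene quantifications, which is the definition of the Spearman correlation. The dependency-free sketch below assumes that both replicates are given as equal-length lists of values (for example TPM, which is an assumption here) in the same gene order. The function names are illustrative.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


def replicates_concordant(x, y, isogenic):
    """Apply the recommended cutoff: >0.9 if isogenic, >0.8 if anisogenic."""
    threshold = 0.9 if isogenic else 0.8
    return spearman(x, y) > threshold


if __name__ == "__main__":
    rep1 = [0.0, 5.2, 13.1, 250.4, 3.3, 0.0]
    rep2 = [0.1, 4.8, 15.0, 198.7, 2.9, 0.0]
    print(spearman(rep1, rep2), replicates_concordant(rep1, rep2, isogenic=True))
```

In practice a library routine such as `scipy.stats.spearmanr` gives the same statistic; the explicit version is shown only to make the calculation visible.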