PanKbase — Bulk RNA-seq data standards

Bulk RNA-sequencing

This document describes the data generation, processing, and quality control (QC) standards in PanKbase for bulk RNA-sequencing studies. These standards are largely adopted from those developed by the GTEx consortium.

Overview

Bulk RNA-sequencing (RNA-seq) produces sequencing data that describe the abundance of RNA transcripts, such as protein-coding and long non-coding RNA transcripts, in a biosample.

I. Data generation

Data generation standards (from GTEx)

Read length: The read length of sequencing data should be at least 50 bp.

Insert size: The average library insert size should be 200 bp.

Sequencing depth: Each RNA-seq library should be sequenced to a minimum depth of 15 million reads.

Replicates: Studies should be performed with two or more biological replicates. Biological replicates may be anisogenic (biosamples from different donors) or isogenic (independent biosamples from the same donor). In addition, although not required, technical replicates may be performed, either as replicate experiments or as replicate sequencing of the same biosample.

Although not required, RNA-seq experiments may include quantitative standards, such as spike-ins of RNA of known length and quantity, to help calibrate the quantification, sensitivity, and coverage of an experiment.
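As an illustration of the read length and sequencing depth standards above, the following is a minimal sketch that summarizes a FASTQ stream and reports which standards are not met. The function names and dictionary keys are hypothetical. The sketch assumes uncompressed, four-line FASTQ records, treats the longest observed read as the sequenced read length, and counts every record as one read.

```python
import io

# Thresholds taken from the data generation standards above.
MIN_READ_LENGTH = 50
MIN_READS = 15_000_000


def summarize_fastq(handle):
    """Count reads and record the longest read in a four-line FASTQ stream."""
    n_reads = 0
    max_length = 0
    for line_number, line in enumerate(handle):
        # The sequence is the second line of each four-line record.
        if line_number % 4 == 1:
            n_reads += 1
            max_length = max(max_length, len(line.strip()))
    return {"n_reads": n_reads, "max_read_length": max_length}


def generation_failures(summary, min_read_length=MIN_READ_LENGTH, min_reads=MIN_READS):
    """Return a list of the generation standards that the summary does not meet."""
    failures = []
    if summary["max_read_length"] < min_read_length:
        failures.append("read length below minimum")
    if summary["n_reads"] < min_reads:
        failures.append("sequencing depth below minimum")
    return failures


# Tiny in-memory example; a real library would be read from a file.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
summary = summarize_fastq(example)
print(summary, generation_failures(summary))
```

For a gzip-compressed file, the same function could be given a handle opened in text mode with Python's gzip module.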

II. Data processing

The recommended pipeline to perform data processing is the GTEx RNA-seq pipeline.

https://github.com/broadinstitute/gtex-pipeline/tree/master/rnaseq

The V10 release of the GTEx pipeline uses the following:

Software versions: STAR v2.7.10a, RSEM v1.3.3, RNA-SeQC v2.4.2

Resources: Human genome assembly GRCh38, GENCODE v39 annotation

The key steps of the pipeline include:

Building indexes: The indexes for STAR and RSEM need to be built specifically for each read length

Alignment: The STAR aligner is used to map sequence reads to the reference genome and annotated transcripts

Mark duplicates: Picard MarkDuplicates is used to mark duplicate reads in alignments

Quality control: The RNA-SeQC tool is used to perform quality control of alignments

Quantification: The RSEM tool is used to perform quantification of transcripts
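The read-length dependence of the index-building step can be sketched for STAR as follows. The STAR manual recommends setting `--sjdbOverhang` to the read length minus one, which is why a separate index is needed per read length. The function name, file paths, and thread count below are hypothetical, and the sketch only assembles the command without running it. It is not the GTEx pipeline's own wrapper script.

```python
def star_index_command(genome_fasta, annotation_gtf, read_length, out_dir, threads=4):
    """Assemble a STAR genomeGenerate command for a given read length."""
    if read_length < 2:
        raise ValueError("read_length must be at least 2")
    # STAR manual: the ideal overhang is the read (mate) length minus one.
    overhang = read_length - 1
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", out_dir,
        "--genomeFastaFiles", genome_fasta,
        "--sjdbGTFfile", annotation_gtf,
        "--sjdbOverhang", str(overhang),
        "--runThreadN", str(threads),
    ]


# Hypothetical file names, used for illustration only.
cmd = star_index_command("GRCh38.fa", "gencode.v39.gtf", 76, "star_index_oh75")
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where STAR is installed.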

Pipeline outputs

Genome alignment - .bam format

Transcript alignment - .bam format

Normalized signal - .bigWig format

Gene quantifications - .tsv format

Transcript quantifications - .tsv format

Quality control metrics - .txt format

III. Data quality control

Samples are considered ‘high-quality’ based on these criteria (from GTEx):

Aligned reads: Each RNA-seq library should have:

Aligned reads (or aligned mate pairs, for paired-end sequencing) >10 million

Mapping rate: Alignments should satisfy the following criteria:

Read mapping rate >0.2

Intergenic mapping rate <0.3

Base mismatch rate <0.01 for mate 1 and <0.02 for mate 2

Ribosomal RNA (rRNA) mapping rate <0.3
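The sample-level criteria above can be applied programmatically, as in the following minimal sketch. The metric keys are hypothetical and would need to be mapped from the actual column names in the QC metrics file. The sketch also assumes that a paired-end sample must meet the mismatch threshold for both mates, and that the mate 2 check is skipped for single-end data.

```python
def evaluate_sample(metrics):
    """Return a pass/fail flag for each high-quality criterion."""
    mate2 = metrics.get("mismatch_rate_mate2")  # None for single-end data
    return {
        "aligned_reads": metrics["aligned_reads"] > 10_000_000,
        "mapping_rate": metrics["mapping_rate"] > 0.2,
        "intergenic_rate": metrics["intergenic_rate"] < 0.3,
        "mismatch_mate1": metrics["mismatch_rate_mate1"] < 0.01,
        "mismatch_mate2": mate2 is None or mate2 < 0.02,
        "rrna_rate": metrics["rrna_rate"] < 0.3,
    }


def is_high_quality(metrics):
    """A sample is high quality only if every criterion passes."""
    return all(evaluate_sample(metrics).values())


# Made-up values for illustration only.
good_sample = {
    "aligned_reads": 42_000_000,
    "mapping_rate": 0.95,
    "intergenic_rate": 0.05,
    "mismatch_rate_mate1": 0.002,
    "mismatch_rate_mate2": 0.004,
    "rrna_rate": 0.01,
}
low_depth_sample = dict(good_sample, aligned_reads=8_000_000)

print(is_high_quality(good_sample), is_high_quality(low_depth_sample))
```

Returning the per-criterion flags rather than a single boolean makes it easier to report why a sample was excluded.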

In addition, it is recommended that experiments have:

Spearman correlation of gene quantifications >0.9 between isogenic replicates (biosamples from the same donor)

Spearman correlation of gene quantifications >0.8 between anisogenic replicates (biosamples from different donors within the same study)
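The replicate concordance check can be sketched as follows, using a plain-Python Spearman correlation (Pearson correlation of average ranks, so ties are handled). The function names are hypothetical, and the sketch assumes the two inputs are gene quantifications for the same genes in the same order. In practice a library routine such as `scipy.stats.spearmanr` could be used instead.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def replicates_concordant(rho, isogenic):
    """Apply the recommended threshold: >0.9 isogenic, >0.8 anisogenic."""
    return rho > (0.9 if isogenic else 0.8)


# Made-up quantifications for four genes in two replicates.
rho = spearman([0.0, 5.2, 10.1, 300.0], [0.1, 4.8, 12.0, 250.0])
print(rho, replicates_concordant(rho, isogenic=True))
```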