Bulk RNA-seq data standards

This document describes data generation, processing, and QC standards in PanKbase for bulk RNA-sequencing studies, largely adopted from standards developed by the GTEx consortium.


Overview

Bulk RNA-sequencing (RNA-seq) produces sequencing data that describe the abundance of RNA transcripts in a biosample, including protein-coding and long non-coding RNA transcripts.


I. Data Generation

Data generation standards (from GTEx):

  • Read length: At least 50 bp
  • Insert size: Average library insert size should be 200 base pairs
  • Sequencing depth: Minimum of 15 million reads per RNA-seq library
  • Replicates:
    • Two or more biological replicates
      • Anisogenic: Biosamples from different donors
      • Isogenic: Independent biosamples from the same donor
    • Technical replicates (optional): Replicate experiments or replicate sequencing of the same biosample
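
The numeric standards above can be checked programmatically before a dataset is submitted. The following is a minimal sketch, not part of the GTEx or PanKbase tooling: the `Library` record and the `check_experiment` function are hypothetical names, and each library is assumed to represent one biological replicate. The 200 bp insert size is a target for the average rather than a hard cutoff, so it is not checked here.

```python
from dataclasses import dataclass

# Thresholds taken from the data generation standards listed above.
MIN_READ_LENGTH_BP = 50
MIN_READS_PER_LIBRARY = 15_000_000
MIN_BIOLOGICAL_REPLICATES = 2


@dataclass
class Library:
    """Hypothetical summary of one RNA-seq library (one biological replicate)."""
    name: str
    read_length_bp: int
    total_reads: int


def check_experiment(libraries):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    if len(libraries) < MIN_BIOLOGICAL_REPLICATES:
        problems.append(
            f"{len(libraries)} biological replicate(s), "
            f"need at least {MIN_BIOLOGICAL_REPLICATES}"
        )
    for lib in libraries:
        if lib.read_length_bp < MIN_READ_LENGTH_BP:
            problems.append(
                f"{lib.name}: read length {lib.read_length_bp} bp "
                f"is below {MIN_READ_LENGTH_BP} bp"
            )
        if lib.total_reads < MIN_READS_PER_LIBRARY:
            problems.append(
                f"{lib.name}: {lib.total_reads} reads "
                f"is below {MIN_READS_PER_LIBRARY}"
            )
    return problems


if __name__ == "__main__":
    experiment = [
        Library("donor1", read_length_bp=76, total_reads=22_000_000),
        Library("donor2", read_length_bp=76, total_reads=12_000_000),
    ]
    for problem in check_experiment(experiment):
        print(problem)
```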

> Although not required, RNA-seq experiments could include quantitative standards such as spike-ins of RNA of known length and quantity to help calibrate quantification, sensitivity, and coverage.


II. Data Processing

The recommended pipeline is the GTEx RNA-seq pipeline.

GTEx v10 Pipeline:

  • Software versions:
    • STAR v2.7.10a
    • RSEM v1.3.3
    • RNA-SeQC v2.4.2
  • Resources:
    • Human genome: GRCh38
    • Annotation: GENCODE v39

Key pipeline steps:

  1. Building indexes:
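    • STAR and RSEM indexes must be built before alignment and quantification; the STAR index is specific to read length (its --sjdbOverhang parameter should ideally equal the read length minus 1)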
  2. Alignment:
    • The STAR aligner is used to map sequence reads to the genome and to produce the transcript-coordinate alignments used for quantification
  3. Mark duplicates:
    • Picard MarkDuplicates is used to mark duplicate reads in the alignments
  4. Quality control:
    • RNA-SeQC is used for quality control of alignments
  5. Quantification:
    • RSEM is used to quantify transcripts
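
To illustrate why step 1 depends on read length: STAR's index generation takes an `--sjdbOverhang` value, which the STAR manual recommends setting to the read length minus 1. The sketch below only assembles the command as a list of arguments and does not run it. The function name, file paths, and thread count are hypothetical placeholders, and this is not the exact invocation used by the GTEx pipeline.

```python
def star_index_command(genome_dir, fasta, gtf, read_length, threads=4):
    """Build (but do not run) a STAR genomeGenerate command for one read length."""
    if read_length < 2:
        raise ValueError("read_length must be at least 2")
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
        # Ideal overhang is read length minus 1, so each read length
        # gets its own index directory.
        "--sjdbOverhang", str(read_length - 1),
        "--runThreadN", str(threads),
    ]


if __name__ == "__main__":
    # Hypothetical paths; substitute the GRCh38 FASTA and GENCODE GTF in use.
    for read_length in (76, 101):
        cmd = star_index_command(
            genome_dir=f"star_index_oh{read_length - 1}",
            fasta="GRCh38.fa",
            gtf="gencode.annotation.gtf",
            read_length=read_length,
        )
        print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` once the paths point at real files.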

Pipeline Outputs

  • Genome alignment: .bam
  • Transcript alignment: .bam
  • Normalized signal: .bigWig
  • Gene quantifications: .tsv
  • Transcript quantifications: .tsv
  • Quality control metrics: .txt

III. Data Quality Control

High-quality samples are defined based on GTEx criteria:

  • Aligned reads:
    • >10 million aligned reads (or mate-pairs if paired-end)
  • Mapping rate criteria:
    • Read mapping rate > 0.2
    • Intergenic mapping rate < 0.3
    • Base mismatch rate < 0.01 for mate 1 or < 0.02 for mate 2
    • Ribosomal RNA (rRNA) mapping rate < 0.3
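
The criteria above translate directly into a per-sample filter. The following is a minimal sketch under stated assumptions: the metric key names are hypothetical (they are not RNA-SeQC column names), and for paired-end data the mismatch criterion is interpreted as requiring mate 1 and mate 2 to each meet their own threshold.

```python
# Thresholds taken from the quality control criteria listed above.
MIN_ALIGNED_READS = 10_000_000
MIN_MAPPING_RATE = 0.2
MAX_INTERGENIC_RATE = 0.3
MAX_MISMATCH_RATE_MATE1 = 0.01
MAX_MISMATCH_RATE_MATE2 = 0.02
MAX_RRNA_RATE = 0.3


def qc_failures(metrics, paired_end=True):
    """Return the names of the metrics that fail; an empty list means the sample passes.

    `metrics` is a dict with hypothetical keys: aligned_reads, mapping_rate,
    intergenic_rate, mismatch_rate_mate1, mismatch_rate_mate2, rrna_rate.
    """
    failures = []
    if not metrics["aligned_reads"] > MIN_ALIGNED_READS:
        failures.append("aligned_reads")
    if not metrics["mapping_rate"] > MIN_MAPPING_RATE:
        failures.append("mapping_rate")
    if not metrics["intergenic_rate"] < MAX_INTERGENIC_RATE:
        failures.append("intergenic_rate")
    if not metrics["mismatch_rate_mate1"] < MAX_MISMATCH_RATE_MATE1:
        failures.append("mismatch_rate_mate1")
    # Assumption: mate 2 is only evaluated for paired-end libraries.
    if paired_end and not metrics["mismatch_rate_mate2"] < MAX_MISMATCH_RATE_MATE2:
        failures.append("mismatch_rate_mate2")
    if not metrics["rrna_rate"] < MAX_RRNA_RATE:
        failures.append("rrna_rate")
    return failures


if __name__ == "__main__":
    sample = {
        "aligned_reads": 18_500_000,
        "mapping_rate": 0.93,
        "intergenic_rate": 0.05,
        "mismatch_rate_mate1": 0.003,
        "mismatch_rate_mate2": 0.006,
        "rrna_rate": 0.45,
    }
    print(qc_failures(sample))
```

Returning the failing metric names, rather than a single boolean, makes it easier to report why a sample was excluded.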

Recommended correlations:

  • Spearman correlation > 0.9 between isogenic replicates (same donor)
  • Spearman correlation > 0.8 between anisogenic replicates (different donors within the same study)
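
The replicate concordance check can be sketched as follows. Spearman correlation is the Pearson correlation of the ranks, with tied values receiving their average rank. This example assumes the correlation is computed on gene-level quantifications for the same ordered set of genes in both replicates, which is not specified above; the function names are illustrative, and the implementation is kept in pure Python so that it has no dependencies.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the group of values tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


def replicates_concordant(x, y, isogenic):
    """Apply the recommended threshold: > 0.9 if isogenic, > 0.8 if anisogenic."""
    threshold = 0.9 if isogenic else 0.8
    return spearman(x, y) > threshold


if __name__ == "__main__":
    # Made-up expression values for five genes in two replicates.
    rep1 = [0.0, 5.2, 13.1, 250.4, 8.8]
    rep2 = [0.1, 4.9, 15.0, 231.7, 9.3]
    print(spearman(rep1, rep2), replicates_concordant(rep1, rep2, isogenic=True))
```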