Bulk RNA-seq Data Standards – PanKbase

This document describes data generation, processing, and QC standards in PanKbase for bulk RNA-sequencing studies, largely adopted from standards developed by the GTEx consortium.

Overview

Bulk RNA sequencing (RNA-seq) produces sequencing data that describe the abundance of RNA transcripts, such as protein-coding and long non-coding RNAs, in a biosample.

I. Data Generation

Data generation standards (from GTEx):

Read length: At least 50 bp
Insert size: Average library insert size should be 200 base pairs
Sequencing depth: Minimum of 15 million reads per RNA-seq library
Replicates:
- Two or more biological replicates
  - Anisogenic: Biosamples from different donors
  - Isogenic: Independent biosamples from the same donor
- Technical replicates (optional): Replicate experiments or replicate sequencing of the same biosample
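
The minimums above are simple to check automatically before a study is submitted. The following is a minimal sketch, not part of the GTEx or PanKbase tooling; the `Library` record and its field names are hypothetical and would need to be mapped onto whatever metadata a study actually records.

```python
from dataclasses import dataclass

# Thresholds taken from the data generation standards listed above.
MIN_READ_LENGTH_BP = 50
MIN_READS_PER_LIBRARY = 15_000_000
MIN_BIOLOGICAL_REPLICATES = 2


@dataclass
class Library:
    """Hypothetical per-library metadata record (field names are assumptions)."""
    library_id: str
    donor_id: str
    read_length_bp: int
    total_reads: int


def check_library(lib):
    """Return a list of problems; an empty list means the library meets the minimums."""
    problems = []
    if lib.read_length_bp < MIN_READ_LENGTH_BP:
        problems.append(f"{lib.library_id}: read length {lib.read_length_bp} bp is below {MIN_READ_LENGTH_BP} bp")
    if lib.total_reads < MIN_READS_PER_LIBRARY:
        problems.append(f"{lib.library_id}: {lib.total_reads} reads is below {MIN_READS_PER_LIBRARY}")
    return problems


def replicate_type(libs):
    """Classify a group of biological replicates by donor.

    Returns "insufficient" for fewer than two libraries, "isogenic" when all
    libraries share one donor, and "anisogenic" when more than one donor is present.
    """
    if len(libs) < MIN_BIOLOGICAL_REPLICATES:
        return "insufficient"
    donors = {lib.donor_id for lib in libs}
    return "isogenic" if len(donors) == 1 else "anisogenic"
```

The insert size recommendation is left out of the sketch because it is stated as a target average rather than a hard minimum.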

> Although not required, RNA-seq experiments could include quantitative standards such as spike-ins of RNA of known length and quantity to help calibrate quantification, sensitivity, and coverage.

II. Data Processing

The recommended pipeline is the GTEx RNA-seq pipeline.

GTEx v10 Pipeline:

Software versions:
- STAR v2.7.10a
- RSEM v1.3.3
- RNA-SeQC v2.4.2
Resources:
- Human genome: GRCh38
- Annotation: GENCODE v39

Key pipeline steps:

Building indexes:
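- The STAR index must be built specifically for each read length, because STAR's splice-junction overhang is set relative to read length; the RSEM index is built from the same genome and annotation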
Alignment:
- The STAR aligner is used to map sequence reads to the reference genome; the alignments are also reported in transcript coordinates for quantification
Mark duplicates:
- Picard tools are used to mark duplicate reads in alignments
Quality control:
- RNA-SeQC is used for quality control of alignments
Quantification:
- RSEM is used to quantify transcripts
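
To illustrate why index building depends on read length: in STAR, the dependence enters through the `--sjdbOverhang` option, which the STAR manual recommends setting to the read (mate) length minus 1. The sketch below only assembles the command line; it is not the GTEx pipeline's own wrapper script, and the function name, file paths, and thread count are placeholders.

```python
def star_index_command(read_length_bp, genome_fasta, annotation_gtf, out_dir, threads=4):
    """Build the argument list for a STAR genome index matched to one read length.

    All paths are supplied by the caller; nothing is executed here.
    """
    if read_length_bp < 2:
        raise ValueError("read length must be at least 2 bp")
    # STAR manual recommendation: overhang = read (mate) length - 1.
    overhang = read_length_bp - 1
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", out_dir,
        "--genomeFastaFiles", genome_fasta,
        "--sjdbGTFfile", annotation_gtf,
        "--sjdbOverhang", str(overhang),
        "--runThreadN", str(threads),
    ]


# Example with placeholder file names: 76 bp reads give an overhang of 75.
command = star_index_command(76, "genome.fa", "annotation.gtf", "star_index_oh75")
```

The returned list can be passed to `subprocess.run(command, check=True)` on a machine where STAR is installed.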

Pipeline Outputs

Genome alignment: .bam
Transcript alignment: .bam
Normalized signal: .bigWig
Gene quantifications: .tsv
Transcript quantifications: .tsv
Quality control metrics: .txt

III. Data Quality Control

High-quality samples are defined based on GTEx criteria:

Aligned reads:
- >10 million aligned reads (or read pairs if paired-end)
Mapping rate criteria:
- Read mapping rate > 0.2
- Intergenic mapping rate < 0.3
- Base mismatch rate < 0.01 for mate 1 and < 0.02 for mate 2
- Ribosomal RNA (rRNA) mapping rate < 0.3
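
The criteria above can be applied as a simple per-sample filter. The sketch below is an illustration under stated assumptions: the `SampleQC` field names are hypothetical and must be mapped to the corresponding columns of the QC metrics file, and for paired-end data it requires both mismatch thresholds to hold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleQC:
    """Hypothetical container for per-sample QC metrics (names are assumptions)."""
    aligned_reads: int                    # read pairs for paired-end libraries
    mapping_rate: float
    intergenic_rate: float
    rrna_rate: float
    mismatch_rate_mate1: float
    mismatch_rate_mate2: Optional[float] = None  # None for single-end libraries


def failed_criteria(qc):
    """Return the names of the criteria a sample fails; empty means high quality."""
    failed = []
    if not qc.aligned_reads > 10_000_000:
        failed.append("aligned_reads")
    if not qc.mapping_rate > 0.2:
        failed.append("mapping_rate")
    if not qc.intergenic_rate < 0.3:
        failed.append("intergenic_rate")
    if not qc.mismatch_rate_mate1 < 0.01:
        failed.append("mismatch_rate_mate1")
    # Mate 2 is only checked when present (paired-end data).
    if qc.mismatch_rate_mate2 is not None and not qc.mismatch_rate_mate2 < 0.02:
        failed.append("mismatch_rate_mate2")
    if not qc.rrna_rate < 0.3:
        failed.append("rrna_rate")
    return failed
```

Returning the list of failed criteria, rather than a single pass/fail flag, makes it easier to report why a sample was excluded.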

Recommended correlations:

Spearman correlation > 0.9 between isogenic replicates (same donor)
Spearman correlation > 0.8 between anisogenic replicates (different donors within the same study)
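
As a worked illustration of the replicate check, the sketch below computes the Spearman correlation as the Pearson correlation of average ranks (ties share the mean of their ranks) and compares it with the thresholds above. It uses only the standard library; the choice of input values (for example, gene-level TPM vectors over the same genes in the same order) is an assumption, since this document does not specify which quantification is correlated.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation between two expression vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


def replicates_concordant(x, y, isogenic):
    """Apply the recommended threshold: > 0.9 if isogenic, > 0.8 if anisogenic."""
    threshold = 0.9 if isogenic else 0.8
    return spearman(x, y) > threshold
```

In practice a library routine such as `scipy.stats.spearmanr` would normally be used instead; the explicit version is shown only to make the calculation visible.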