File format standards

This document provides an overview of required formats for the different file types hosted by PanKbase.

Sequencing data

Stored in .fastq format. The specification of .fastq format is an accepted standard and is provided here:

https://maq.sourceforge.net/fastq.shtml
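
As an illustration of the record layout, here is a minimal sketch of a reader for the common four-line form of .fastq (header line starting with "@", sequence, separator line starting with "+", quality string of the same length as the sequence). The names FastqRecord and parse_fastq are hypothetical helpers, not part of any PanKbase tooling, and the sketch assumes sequences are not wrapped across multiple lines.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str       # header text without the leading "@"
    sequence: str   # base calls
    quality: str    # per-base quality characters, same length as sequence


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (assumes unwrapped sequences)."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it, None)
        separator = next(it, None)
        quality = next(it, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError(f"truncated record: {header!r}")
        if not header.startswith("@"):
            raise ValueError(f"header must start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"separator must start with '+': {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


records = list(parse_fastq(["@read1\n", "ACGT\n", "+\n", "IIII\n"]))
```

A check like this can catch truncated or malformed files before they are submitted; it does not validate quality encodings.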

Sequence alignments

Stored in .bam format. The specification of .bam format is an accepted standard and is provided here:

https://samtools.github.io/hts-specs/SAMv1.pdf
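
For a quick sanity check, the sketch below tests whether a file begins like a .bam file. It relies on two facts from the specification: .bam files are BGZF-compressed (which standard gzip readers can decompress), and the decompressed stream starts with the four magic bytes "BAM\1". The function name looks_like_bam is a hypothetical helper, and passing this check does not mean the rest of the file is valid.

```python
import gzip

BAM_MAGIC = b"BAM\x01"  # first four bytes of the decompressed BAM stream


def looks_like_bam(path: str) -> bool:
    """Return True if the file decompresses as gzip/BGZF and starts with the BAM magic."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == BAM_MAGIC
    except (OSError, EOFError):
        # Not gzip-compressed, unreadable, or truncated.
        return False
```

Full validation is better left to a dedicated tool; this only screens out files that are clearly in the wrong format.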

Normalized genomic signal

Stored in .bigWig format. The specification of .bigWig format is an accepted standard and is provided here:

https://genome.ucsc.edu/goldenPath/help/bigWig.html

Gene quantifications

Quantifications of counts and normalized expression levels for genes or transcripts, such as those produced by processing a bulk RNA-seq experiment. Stored in a text file in tab-delimited format.

The specification of the gene quantifications file format, which was adopted from the GTEx project, is here:

https://docs.google.com/spreadsheets/d/1t873TAPyX1FUo6YEfLhXDfrXoosMg02rnmoCuccotJI/edit?gid=0#gid=0
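
As a rough sketch of how a tab-delimited file with a header row can be read, the example below loads each row into a dictionary keyed by column name. The column names used here (gene_id, count, tpm) are hypothetical placeholders chosen for illustration only; the actual required columns are those defined in the specification linked above. The helper name read_tab_delimited is likewise an assumption.

```python
import csv
import io
from typing import Dict, List, TextIO


def read_tab_delimited(handle: TextIO) -> List[Dict[str, str]]:
    """Read a tab-delimited table with a header row; all values are kept as strings."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(row) for row in reader]


# Hypothetical content; these column names are NOT taken from the specification.
example = io.StringIO(
    "gene_id\tcount\ttpm\n"
    "GENE_A\t12\t3.5\n"
    "GENE_B\t0\t0.0\n"
)
rows = read_tab_delimited(example)
```

Values are left as strings so that numeric conversion and validation can follow whatever rules the specification sets for each column.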

Gene count matrix

Quantitative trait locus (QTL) summary statistics

Summary statistic results from a QTL mapping analysis, for example of gene expression levels (eQTL) or chromatin accessibility levels (caQTL). Stored in a text file in tab-delimited format.

The specification of the QTL summary statistic file format is here:

Genetic association summary statistics

Summary statistic results from a genetic association analysis such as GWAS. Stored in a text file in tab-delimited format.

The specification of the genetic association summary statistic file format is here:

Gene sets

Sets of genes in defined units, such as a biological pathway or genes associated with disease. Stored in a Gene Matrix Transposed (.gmt) file.

The specification of the .gmt file format is here:

https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29
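
To illustrate the layout, here is a minimal sketch of a .gmt reader. In this format each line describes one gene set: the first tab-separated field is the set name, the second is a description, and the remaining fields are the genes, so lines may differ in length. The function name parse_gmt and the set names in the example are hypothetical.

```python
from typing import Dict, Iterable, List


def parse_gmt(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each gene set name to its list of genes; the description field is ignored."""
    gene_sets: Dict[str, List[str]] = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            raise ValueError(f"expected name, description, and at least one gene: {line!r}")
        name, _description, *genes = fields
        # Drop empty entries left by trailing tabs.
        gene_sets[name] = [gene for gene in genes if gene]
    return gene_sets


# Hypothetical gene sets for illustration.
sets = parse_gmt([
    "SET_A\texample pathway\tGENE1\tGENE2\tGENE3\n",
    "SET_B\tna\tGENE4\n",
])
```

This sketch keeps only the last occurrence if a set name is repeated; a stricter reader could raise an error on duplicates instead.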