This document summarizes the file format required for each data type hosted by PanKbase.
Sequencing data
Stored in .fastq format. The specification of .fastq format is an accepted standard and is provided here:
https://maq.sourceforge.net/fastq.shtml
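As an illustration of the record layout, the following is a minimal sketch of a reader for the common four-line form of .fastq (header starting with "@", sequence, separator starting with "+", quality string). The names FastqRecord and read_fastq are hypothetical and are not part of any PanKbase tooling. The sketch assumes each sequence and each quality string sits on a single line, which the specification does not strictly require.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One read: identifier, bases, and per-base quality characters."""
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses four lines per record.

    Assumption: sequences and qualities are not wrapped across lines.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        if not header.strip():
            continue  # tolerate blank lines between or after records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:].strip(), sequence, quality)


if __name__ == "__main__":
    example = io.StringIO("@read1\nACGT\n+\nIIII\n")
    for record in read_fastq(example):
        print(record.name, len(record.sequence))
```

A check like this can catch truncated uploads early, since a file cut off mid-record fails either the marker test or the length test.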
Sequence alignments
Stored in .bam format. The specification of .bam format is an accepted standard and is provided here:
https://samtools.github.io/hts-specs/SAMv1.pdf
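A quick sanity check for a .bam file can be sketched with the Python standard library alone. Per the SAM/BAM specification, a BAM file is BGZF-compressed (a gzip-compatible container) and its decompressed content begins with the four bytes "BAM\1". The function name looks_like_bam is hypothetical. This only inspects the first bytes; it does not validate the header or the alignment records.

```python
import gzip

# Magic bytes at the start of the decompressed BAM stream (SAM/BAM spec).
BAM_MAGIC = b"BAM\x01"


def looks_like_bam(path: str) -> bool:
    """Return True if the file decompresses as gzip/BGZF and starts with the BAM magic."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == BAM_MAGIC
    except (gzip.BadGzipFile, EOFError):
        # Not gzip-compatible, or truncated before four bytes could be read.
        return False
```

For full validation or record-level access, a dedicated library or samtools would be the appropriate tool; this sketch is only a first-pass filter.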
Normalized genomic signal
Stored in .bigWig format. The specification of .bigWig format is a standard and is provided here:
https://genome.ucsc.edu/goldenPath/help/bigWig.html
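As a first-pass check that a file is a bigWig, the leading four bytes can be compared against the bigWig signature. This sketch assumes the signature value 0x888FFC26 used by the UCSC tools, which may be written in either byte order; the function name looks_like_bigwig is hypothetical. It does not read the rest of the header or any signal data.

```python
import struct

# bigWig file signature (assumed from the UCSC binary format definition).
BIGWIG_MAGIC = 0x888FFC26


def looks_like_bigwig(path: str) -> bool:
    """Return True if the first four bytes match the bigWig signature in either byte order."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    if len(head) < 4:
        return False
    little_endian = struct.unpack("<I", head)[0]
    big_endian = struct.unpack(">I", head)[0]
    return BIGWIG_MAGIC in (little_endian, big_endian)
```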
Gene quantifications
Quantifications of counts and normalized expression levels for genes or transcripts, such as those produced by processing a bulk RNA-seq experiment. Stored in tab-delimited format.
The specification of the gene quantifications file format, which was adopted from the GTEx project, is here:
https://docs.google.com/spreadsheets/d/1t873TAPyX1FUo6YEfLhXDfrXoosMg02rnmoCuccotJI/edit?gid=0#gid=0
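The following is a minimal sketch of loading a tab-delimited quantification table and checking that expected columns are present. The column names used here (gene_id, count, tpm) are placeholders chosen for illustration only; the authoritative column names are those in the specification linked above. The function name read_quantifications is likewise hypothetical.

```python
import csv
import io
from typing import Dict, List, Sequence, TextIO


def read_quantifications(
    handle: TextIO,
    required_columns: Sequence[str] = ("gene_id",),  # placeholder, not from the spec
) -> List[Dict[str, str]]:
    """Read a tab-delimited table with a header row and verify required columns exist.

    Values are returned as strings; numeric conversion is left to the caller.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    present = reader.fieldnames or []
    missing = [name for name in required_columns if name not in present]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return list(reader)


if __name__ == "__main__":
    # Illustrative rows only; identifiers and values are made up.
    example = io.StringIO("gene_id\tcount\ttpm\nGENE_A\t10\t1.5\n")
    print(read_quantifications(example, required_columns=("gene_id", "count", "tpm")))
```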
Gene count matrix
Quantitative trait locus (QTL) summary statistics
Summary statistics from a QTL mapping analysis, for example, of gene expression levels (eQTL) or chromatin accessibility levels (caQTL). Stored in a text file in tab-delimited format.
The specification of the QTL summary statistic file format is here:
Genetic association summary statistics
Summary statistics from a genetic association analysis, such as a GWAS. Stored in a text file in tab-delimited format.
The specification of the genetic association summary statistic file format is here:
Gene sets
Sets of genes in defined units, such as a biological pathway or genes associated with disease. Stored in a Gene Matrix Transposed (.gmt) file.
The specification of the .gmt file format is here:
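
For illustration only, and separate from the specification itself: the sketch below reads a .gmt file under the widely used layout in which each line is one gene set, with tab-separated fields giving the set name, a description, and then the member genes. The function name read_gmt is hypothetical, and the set and gene names in the example are made up.

```python
import io
from typing import Dict, List, TextIO


def read_gmt(handle: TextIO) -> Dict[str, List[str]]:
    """Parse GMT text into a mapping of gene set name to member genes.

    Assumed layout per line: name <TAB> description <TAB> gene1 <TAB> gene2 ...
    The description field is read but not returned.
    """
    gene_sets: Dict[str, List[str]] = {}
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"line {line_number}: expected name and description")
        name, _description, *genes = fields
        if name in gene_sets:
            raise ValueError(f"line {line_number}: duplicate gene set {name!r}")
        # Drop empty strings left by trailing tabs.
        gene_sets[name] = [gene for gene in genes if gene]
    return gene_sets


if __name__ == "__main__":
    example = io.StringIO("SET_X\texample set\tGENE_A\tGENE_B\n")
    print(read_gmt(example))
```

Rejecting duplicate set names is a design choice of this sketch; a loader could instead merge or overwrite them, depending on what the specification requires.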