How to create a pigeon‐compatible annotation GTF

Pigeon is designed to work with annotation GTF files in the Gencode format. GTF files from other sources must be modified before they will work with pigeon classify.

The pigeon GTF format requirements are:

A tab-delimited, 9-column file in GFF/GTF format:

  • Column 1 must be the chromosome
  • Column 2 is ignored
  • Column 3 will only be processed if it is gene, transcript, or exon. All other types (e.g. CDS) are ignored.
  • Columns 4 and 5 are the 1-based start and end coordinates
  • Columns 6 and 8 are ignored
  • Column 7 is the strand which must be + or -
  • Column 9 is the attribute column, a semicolon-separated list of tag-value pairs. To be processed properly, the following tags must have values: gene_id, transcript_id, and gene_name. Example: gene_id "ENSG0001"; transcript_id "ENST000A"; gene_name "TP53";
  • No extra blank lines at the beginning or end of the file
  • Annotations must be organized with a “gene” record, followed by one or more associated “transcript” records, and each “transcript” record is followed by one or more associated “exon” records. Example:
    gene
    transcript_1
      exon_1_1
      exon_1_2
    transcript_2
      exon_2_1
      exon_2_2
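
The requirements above can be checked mechanically before running pigeon. The following is a minimal sketch and is not part of pigeon itself. The function name check_gtf is hypothetical, and the sketch makes three assumptions: lines starting with # are skipped, any blank line is reported (stricter than the rule above), and gene records are not required to carry a transcript_id, as in the Gencode and SIRV examples in this document.

```python
import re

PROCESSED_TYPES = {"gene", "transcript", "exon"}
TAG_RE = re.compile(r'(\w+) "([^"]*)"')  # tag "value" pairs in column 9


def check_gtf(lines):
    """Return (line_number, message) tuples for records that break the rules above."""
    problems = []
    gene_id = transcript_id = None  # IDs of the most recent gene / transcript record
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        if line.startswith("#"):
            continue  # assumption: header/comment lines are skipped
        fields = line.split("\t")
        if len(fields) != 9:
            problems.append((number, f"expected 9 tab-separated columns, found {len(fields)}"))
            continue
        feature, start, end, strand = fields[2], fields[3], fields[4], fields[6]
        if feature not in PROCESSED_TYPES:
            continue  # other types (e.g. CDS) are ignored
        if not (start.isdigit() and end.isdigit() and 1 <= int(start) <= int(end)):
            problems.append((number, "start/end must be 1-based integers with start <= end"))
        if strand not in ("+", "-"):
            problems.append((number, f"strand must be + or -, found {strand!r}"))
        tags = dict(TAG_RE.findall(fields[8]))
        # Assumption drawn from the examples: gene records need no transcript_id.
        required = ["gene_id", "gene_name"] + ([] if feature == "gene" else ["transcript_id"])
        for tag in required:
            if not tags.get(tag):
                problems.append((number, f"missing or empty {tag}"))
        # Ordering: a gene, then its transcripts, each followed by its exons.
        if feature == "gene":
            gene_id, transcript_id = tags.get("gene_id"), None
        elif feature == "transcript":
            if gene_id is None or tags.get("gene_id") != gene_id:
                problems.append((number, "transcript does not follow its gene record"))
            transcript_id = tags.get("transcript_id")
        elif transcript_id is None or tags.get("transcript_id") != transcript_id:
            problems.append((number, "exon does not follow its transcript record"))
    return problems
```

Passing an open file handle, for example check_gtf(open("annotation.gtf")), returns an empty list when no problems are found. An empty result only means the file passes these simplified checks, not that pigeon is guaranteed to accept it.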
    

Example 1: Gencode annotation

Below is a snippet of a Gencode annotation as a reference:

chr1	ENSEMBL	gene	17369	17436	.	-	.	gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;
chr1	ENSEMBL	transcript	17369	17436	.	-	.	gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; level 3; tag "basic"; transcript_support_level "NA";
chr1	ENSEMBL	exon	17369	17436	.	-	.	gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; exon_number 1; exon_id "ENSE00003746039.1"; level 3; tag "basic"; transcript_support_level "NA";
chr1	HAVANA	gene	29554	31109	.	+	.	gene_id "ENSG00000243485.3"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "RP11-34P13.3"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
chr1	HAVANA	transcript	29554	31097	.	+	.	gene_id "ENSG00000243485.3"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "RP11-34P13.3-001"; level 2; tag "not_best_in_genome_evidence"; tag "basic"; transcript_support_level "5"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1	HAVANA	exon	29554	30039	.	+	.	gene_id "ENSG00000243485.3"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "RP11-34P13.3-001"; exon_number 1; exon_id "ENSE00001947070.1"; level 2; tag "not_best_in_genome_evidence"; tag "basic"; transcript_support_level "5"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1	HAVANA	exon	30564	30667	.	+	.	gene_id "ENSG00000243485.3"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "RP11-34P13.3-001"; exon_number 2; exon_id "ENSE00001922571.1"; level 2; tag "not_best_in_genome_evidence"; tag "basic"; transcript_support_level "5"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1	HAVANA	exon	30976	31097	.	+	.	gene_id "ENSG00000243485.3"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "RP11-34P13.3-001"; exon_number 3; exon_id "ENSE00001827679.1"; level 2; tag "not_best_in_genome_evidence"; tag "basic"; transcript_support_level "5"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
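
To make the three required tags easier to inspect, column 9 of a record such as those above can be split into tag-value pairs. This is a minimal sketch under stated assumptions: parse_attributes is a hypothetical helper rather than anything pigeon provides, values are collected in lists because a tag such as tag may repeat, and unquoted values (for example level 2) are kept as strings. The sample line is a shortened version of the HAVANA transcript record.

```python
import re

# Matches either `tag "quoted value"` or `tag bare_value` inside column 9.
ATTRIBUTE_RE = re.compile(r'(\w+) (?:"([^"]*)"|([^";\s]+))')


def parse_attributes(column9):
    """Parse a GTF attribute column into a dict mapping each tag to a list of values."""
    attributes = {}
    for match in ATTRIBUTE_RE.finditer(column9):
        tag, quoted, bare = match.groups()
        value = quoted if quoted is not None else bare
        attributes.setdefault(tag, []).append(value)
    return attributes


# Shortened copy of the HAVANA transcript record (some tags omitted for brevity).
line = (
    "chr1\tHAVANA\ttranscript\t29554\t31097\t.\t+\t.\t"
    'gene_id "ENSG00000243485.3"; transcript_id "ENST00000473358.1"; '
    'gene_name "RP11-34P13.3"; level 2; '
    'tag "not_best_in_genome_evidence"; tag "basic";'
)
attrs = parse_attributes(line.split("\t")[8])
for required in ("gene_id", "transcript_id", "gene_name"):
    print(required, attrs.get(required))
```

A tag that is missing from the record prints as None, which is a quick way to spot records that lack one of the required values.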

Example 2: Modified non-model organism annotation for pigeon

Here is an example of a pigeon-compatible annotation after it’s been manually modified.

Pf3D7_13_v3	VEuPathDB	gene	21364	28787	.	+	.	gene_id "PF3D7_1300100"; transcript_id "PF3D7_1300100.1"; gene_name "PF3D7_1300100"; transcript_name "PF3D7_1300100.1"; biotype "test";
Pf3D7_13_v3	VEuPathDB	transcript	21364	28787	.	+	.	gene_id "PF3D7_1300100"; transcript_id "PF3D7_1300100.1"; gene_name "PF3D7_1300100"; transcript_name "PF3D7_1300100.1"; biotype "test";
Pf3D7_13_v3	VEuPathDB	exon	21364	26538	.	+	.	gene_id "PF3D7_1300100"; transcript_id "PF3D7_1300100.1"; gene_name "PF3D7_1300100"; transcript_name "PF3D7_1300100.1"; biotype "test";
Pf3D7_13_v3	VEuPathDB	exon	27474	28787	.	+	.	gene_id "PF3D7_1300100"; transcript_id "PF3D7_1300100.1"; gene_name "PF3D7_1300100"; transcript_name "PF3D7_1300100.1"; biotype "test";
Pf3D7_13_v3	VEuPathDB	CDS	21364	26538	.	+	0	Parent=PF3D7_1300100.1
Pf3D7_13_v3	VEuPathDB	CDS	27474	28787	.	+	0	Parent=PF3D7_1300100.1
Pf3D7_13_v3	VEuPathDB	gene	30605	31881	.	-	.	gene_id "PF3D7_1300200"; transcript_id "PF3D7_1300200.1"; gene_name "PF3D7_1300200"; transcript_name "PF3D7_1300200.1"; biotype "test";
Pf3D7_13_v3	VEuPathDB	transcript	30605	31881	.	-	.	gene_id "PF3D7_1300200"; transcript_id "PF3D7_1300200.1"; gene_name "PF3D7_1300200"; transcript_name "PF3D7_1300200.1"; biotype "test";
Pf3D7_13_v3	VEuPathDB	exon	30605	31597	.	-	.	gene_id "PF3D7_1300200"; transcript_id "PF3D7_1300200.1"; gene_name "PF3D7_1300200"; transcript_name "PF3D7_1300200.1"; biotype "test";
Pf3D7_13_v3	VEuPathDB	exon	31828	31881	.	-	.	gene_id "PF3D7_1300200"; transcript_id "PF3D7_1300200.1"; gene_name "PF3D7_1300200"; transcript_name "PF3D7_1300200.1"; biotype "test";
Pf3D7_13_v3	VEuPathDB	CDS	30605	31597	.	-	0	Parent=PF3D7_1300200.1
Pf3D7_13_v3	VEuPathDB	CDS	31828	31881	.	-	0	Parent=PF3D7_1300200.1

Example 3: SIRV control annotation

Here is an example of an SIRV control annotation compatible with pigeon.

SIRV1	LexogenSIRVData	gene	1001	11643	.	-	0	gene_name "SIRV1"; gene_id "SIRV1";
SIRV1	LexogenSIRVData	transcript	1001	10786	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV101";
SIRV1	LexogenSIRVData	exon	1001	1484	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV101"; exon_assignment "SIRV101_0";
SIRV1	LexogenSIRVData	exon	6338	6473	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV101"; exon_assignment "SIRV101_1";
SIRV1	LexogenSIRVData	exon	6561	6813	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV101"; exon_assignment "SIRV101_2";
SIRV1	LexogenSIRVData	exon	7553	7814	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV101"; exon_assignment "SIRV101_3";
SIRV1	LexogenSIRVData	exon	10283	10366	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV101"; exon_assignment "SIRV101_4";
SIRV1	LexogenSIRVData	exon	10445	10786	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV101"; exon_assignment "SIRV101_5";
SIRV1	LexogenSIRVData	transcript	1007	10366	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV102";
SIRV1	LexogenSIRVData	exon	1007	1484	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV102"; exon_assignment "SIRV102_0";
SIRV1	LexogenSIRVData	exon	6338	6813	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV102"; exon_assignment "SIRV102_1";
SIRV1	LexogenSIRVData	exon	7553	7814	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV102"; exon_assignment "SIRV102_2";
SIRV1	LexogenSIRVData	exon	10283	10366	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV102"; exon_assignment "SIRV102_3";
SIRV1	LexogenSIRVData	transcript	1001	10791	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV103";
SIRV1	LexogenSIRVData	exon	1001	1484	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV103"; exon_assignment "SIRV103_0";
SIRV1	LexogenSIRVData	exon	6338	6473	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV103"; exon_assignment "SIRV103_1";
SIRV1	LexogenSIRVData	exon	6561	6813	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV103"; exon_assignment "SIRV103_2";
SIRV1	LexogenSIRVData	exon	7553	7814	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV103"; exon_assignment "SIRV103_3";
SIRV1	LexogenSIRVData	exon	10283	10366	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV103"; exon_assignment "SIRV103_4";
SIRV1	LexogenSIRVData	exon	10648	10791	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV103"; exon_assignment "SIRV103_5";

Modification example

Modifications can be made using a variety of command-line tools, Python, or GFF-specific tools such as gffread.

Example of an unmodified sorghum GFF:

ChrM	cshl_gene	gene	444449	447142	.	-	.	ID=SbiRTX436.MG005100;Name=SbiRTX436.MG005100;biotype=protein_coding
ChrM	cshl_gene	mRNA	444449	447142	.	-	.	ID=SbiRTX436.MG005100.1;Parent=SbiRTX436.MG005100;Name=SbiRTX436.MG005100.1;biotype=protein_coding
ChrM	cshl_gene	five_prime_UTR	445616	447142	.	-	.	Parent=SbiRTX436.MG005100.1;Name=5UTR.1
ChrM	cshl_gene	exon	444449	447142	.	-	.	Parent=SbiRTX436.MG005100.1;Name=exon.1
ChrM	cshl_gene	CDS	444449	445615	.	-	.	Parent=SbiRTX436.MG005100.1;Name=CDS.1

Update the attribute fields to follow the guidelines above using Python. The command below renames the Name tag on gene records to gene_name.

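cat sorghum.gff | python -c '
import sys

# Read the GFF from stdin and rename the Name tag on gene records to gene_name.
for line in sys.stdin:
  fields = line.split("\t")
  # Only complete 9-column gene records are edited; all other lines pass through unchanged.
  if len(fields) > 8 and fields[2] == "gene":
    fields[8] = fields[8].replace("Name=", "gene_name=")
  sys.stdout.write("\t".join(fields))
' > sorghum.modified.gff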

Example of the GFF after the Python modifications:

ChrM	cshl_gene	gene	444449	447142	.	-	.	ID=SbiRTX436.MG005100;gene_name=SbiRTX436.MG005100;biotype=protein_coding
ChrM	cshl_gene	mRNA	444449	447142	.	-	.	ID=SbiRTX436.MG005100.1;Parent=SbiRTX436.MG005100;Name=SbiRTX436.MG005100.1;biotype=protein_coding
ChrM	cshl_gene	five_prime_UTR	445616	447142	.	-	.	Parent=SbiRTX436.MG005100.1;Name=5UTR.1
ChrM	cshl_gene	exon	444449	447142	.	-	.	Parent=SbiRTX436.MG005100.1;Name=exon.1
ChrM	cshl_gene	CDS	444449	445615	.	-	.	Parent=SbiRTX436.MG005100.1;Name=CDS.1

Next, convert to a simplified GTF using gffread.

gffread sorghum.modified.gff -T --keep-genes -o sorghum.modified.gtf

Final result:

ChrM	cshl_gene	transcript	444449	447142	.	-	.	transcript_id "SbiRTX436.MG005100.1"; gene_id "SbiRTX436.MG005100"; gene_name "SbiRTX436.MG005100"
ChrM	cshl_gene	exon	444449	447142	.	-	.	transcript_id "SbiRTX436.MG005100.1"; gene_id "SbiRTX436.MG005100"; gene_name "SbiRTX436.MG005100";
ChrM	cshl_gene	CDS	444449	445615	.	-	0	transcript_id "SbiRTX436.MG005100.1"; gene_id "SbiRTX436.MG005100"; gene_name "SbiRTX436.MG005100";
