Table 2 GENCODE annotation biotypes (2017)

From: Genome annotation for clinical genomic diagnostics: strengths and weaknesses

| Biotype | Description |
| --- | --- |
| Protein coding | Contains an ORF that has strong coding potential |
| - Known coding | 100% identical to a known RefSeq protein or Swiss-Prot entry |
| - Novel coding | Shares >60% length with a known coding sequence from RefSeq or Swiss-Prot, or has cross-species/family support or domain evidence |
| - Putative coding | Shares <60% length with a known coding sequence from RefSeq or Swiss-Prot, or has an alternative first or last coding exon |
| - Nonsense-mediated decay | If the coding sequence (following the appropriate reference) of a transcript finishes >50 bp from a downstream splice site, then it is tagged as NMD. If the transcript variant does not cover the full reference coding sequence, then it is annotated as NMD if NMD is unavoidable, i.e. no matter what the exon structure of the missing portion is, the transcript will be subject to NMD |
| - Non-stop decay | Transcripts that have poly(A) features (including signal) without a prior stop codon in the CDS, i.e. a non-genomic poly(A) tail attached directly to the CDS without a 3′ UTR; these transcripts are subject to degradation |
| - Retained intron | Alternatively spliced transcript believed to contain intronic sequence relative to other coding transcript variants |
| - Processed transcript | Cannot assign an ORF, but is part of a coding locus |
| lncRNA | Long non-coding RNA: lacks protein-coding potential and is of length >200 bp |
| - Bidirectional promoter | Transcription start sites of the lncRNA model and the protein-coding model are on opposite strands and within 200 bp of one another, or are found in the same CpG island |
| - 3-Prime overlapping | Transcription start site and/or published experimental data support independent transcription from the 3′ UTR of a coding gene |
| - Antisense | At least one transcript variant overlaps a protein-coding locus on the opposite strand, or evidence of antisense regulation of a coding gene has been published |
| - lincRNA | Long intergenic ncRNA: does not overlap a coding gene in either the sense or the antisense orientation |
| - Sense intronic | In an intron of a coding gene; no exonic overlap |
| - Sense overlapping | Contains a coding gene in an intron; no exonic overlap |
| Pseudogene | Matches to protein, but ORF disrupted by frameshifts and/or premature stop codons |
| - Processed | Lacks introns and arose from retrotransposition of parent gene mRNA |
| - Unprocessed | Can contain introns and is produced by genomic duplication |
| - Transcribed | Locus-specific transcripts indicate transcription; these can be classified into ‘processed’ and ‘unprocessed’ |
| - Translated | Locus-specific protein mass spectrometry data suggest translation; the connection is maintained with the pseudogene biotype until the experimental community validates it as a coding gene |
| - Polymorphic | Pseudogene owing to a single-nucleotide variant (SNV) or insertion-deletion variant (indel), but the same gene is translated in other individuals/haplotypes/strains |
| - Unitary | Species-specific unprocessed pseudogene, without a parent gene, that has an active orthologue in another species |

  1. Data sourced from the GENCODE project [196]
  2. Abbreviations: ncRNA, noncoding RNA; ORF, open reading frame; UTR, untranslated region
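
Two of the definitions in Table 2 are numeric rules, and a small sketch may make them easier to apply. The Python below is illustrative only and is not GENCODE's annotation code. All function and parameter names are hypothetical. It assumes transcripts are described by exon lengths in spliced-transcript coordinates, that "within 200 bp" includes exactly 200 bp, and that the opposite-strand condition also applies to the shared CpG island case, which the table wording leaves open. The second NMD clause (incomplete transcripts where NMD is unavoidable) is not modelled.

```python
from itertools import accumulate

NMD_MIN_DISTANCE_BP = 50         # ">50 bp" threshold quoted in Table 2
BIDIRECTIONAL_MAX_TSS_GAP = 200  # "within 200 bp" threshold quoted in Table 2


def splice_junctions(exon_lengths):
    """Return exon-exon junction offsets in spliced-transcript coordinates.

    A transcript with n exons has n - 1 junctions; a single-exon
    transcript has none.
    """
    return list(accumulate(exon_lengths))[:-1]


def is_nmd_candidate(exon_lengths, cds_end):
    """Apply the >50 bp rule for a transcript with a complete CDS.

    cds_end is the 0-based transcript offset just past the last CDS base
    (an assumed convention, not taken from the table).
    """
    return any(
        junction - cds_end > NMD_MIN_DISTANCE_BP
        for junction in splice_junctions(exon_lengths)
    )


def is_bidirectional_promoter(lnc_tss, lnc_strand, coding_tss, coding_strand,
                              same_cpg_island=False):
    """Apply the bidirectional promoter rule to two transcription start sites.

    Assumption: the two models must be on opposite strands in every case,
    and then either lie within 200 bp of each other or share a CpG island.
    """
    if lnc_strand == coding_strand:
        return False
    close = abs(lnc_tss - coding_tss) <= BIDIRECTIONAL_MAX_TSS_GAP
    return close or same_cpg_island


if __name__ == "__main__":
    exons = [200, 150, 300]  # hypothetical three-exon transcript
    print(splice_junctions(exons))
    print(is_nmd_candidate(exons, cds_end=180))
    print(is_bidirectional_promoter(1000, "+", 1150, "-"))
```

In this sketch the NMD check only needs the distance from the end of the CDS to each later junction, so genomic coordinates and intron lengths are not required.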