
Table 4 Important areas to consider for genome annotation

From: Genome annotation for clinical genomic diagnostics: strengths and weaknesses

Genome assembly is not complete

The human genome assembly is still incomplete and is still being refined

The current assembly, GRCh38, still contains fragmented genes and incorrectly represented gene duplications, yet most analysis is still performed on GRCh37

Transcriptome is still incomplete

Some exons are still not represented in human genome annotation, owing to low expression or to expression restricted to tissues or time points that have not yet been interrogated

WES kits will not contain all exons

WGS-negative cases should be iteratively re-analysed as new transcriptional features are revealed
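The re-analysis idea above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the `Interval` and `Variant` types, the `newly_exonic` helper, and all coordinates are hypothetical, and the check only considers single-position variants on 1-based inclusive coordinates.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


class Variant(NamedTuple):
    chrom: str
    pos: int    # 1-based position
    ref: str
    alt: str


def overlaps_any(variant: Variant, exons: List[Interval]) -> bool:
    """True if the variant position lies inside at least one exon."""
    return any(
        e.chrom == variant.chrom and e.start <= variant.pos <= e.end
        for e in exons
    )


def newly_exonic(variants: List[Variant],
                 old_exons: List[Interval],
                 new_exons: List[Interval]) -> List[Variant]:
    """Variants that hit an exon in the new annotation release
    but hit no exon in the old release (hypothetical helper)."""
    return [
        v for v in variants
        if overlaps_any(v, new_exons) and not overlaps_any(v, old_exons)
    ]


# Made-up example data: the new release adds one exon at chr1:500-560.
old_exons = [Interval("chr1", 100, 200)]
new_exons = [Interval("chr1", 100, 200), Interval("chr1", 500, 560)]
variants = [
    Variant("chr1", 150, "A", "G"),  # exonic in both releases
    Variant("chr1", 530, "C", "T"),  # exonic only in the new release
    Variant("chr1", 900, "G", "A"),  # exonic in neither
]

flagged = newly_exonic(variants, old_exons, new_exons)
```

Only the variant at position 530 is flagged here, which is the kind of call that would merit a second look when an unsolved case is re-analysed against updated annotation.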

Reference annotation datasets can be missing key features

Automatic annotation is fast but not as accurate as manual annotation

CCDS—missing UTRs

LRG—single, usually canonical, transcript—potential for missing exons; choice of transcript is arbitrary

RefSeq—based on transcriptome, potential for missing exons and problems with inconsistent mapping to reference assembly

Annotation does not necessarily determine which transcripts are the most likely to be functional, and the longest one might not be the major one

Non-coding genome

Long-range gene interactions are poorly understood; methods such as Capture Hi-C will provide insights into these interactions

Previously ignored transcript biotypes such as NMD and retained intron are now known to have important regulatory roles in disease

Non-coding RNAs have an important role in disease, yet they are hard to predict and their function remains largely unknown

Biotype associations

A biotype conflict in annotation datasets will cause incorrect variant calls—for example, lncRNA variant compared with coding gene, coding gene compared with pseudogene
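A biotype conflict of the kind described above can be detected by comparing overlapping loci between two annotation sets. The sketch below assumes each set has already been reduced to `(chrom, start, end, gene_name, biotype)` tuples with a shared biotype vocabulary; the `biotype_conflicts` function, the gene names, and the coordinates are all hypothetical, and strand is ignored for brevity.

```python
from typing import List, Tuple

# (chrom, start, end, gene_name, biotype); 1-based inclusive coordinates
Locus = Tuple[str, int, int, str, str]


def biotype_conflicts(set_a: List[Locus],
                      set_b: List[Locus]) -> List[Tuple[str, str, str, str]]:
    """Report overlapping loci whose biotypes disagree between two
    annotation sets (hypothetical helper, quadratic scan for clarity)."""
    conflicts = []
    for chrom_a, start_a, end_a, name_a, type_a in set_a:
        for chrom_b, start_b, end_b, name_b, type_b in set_b:
            same_place = (
                chrom_a == chrom_b
                and start_a <= end_b
                and start_b <= end_a
            )
            if same_place and type_a != type_b:
                conflicts.append((name_a, type_a, name_b, type_b))
    return conflicts


# Made-up loci: the two sets agree on GENE1 but disagree on the second locus.
annotation_a = [
    ("chr1", 1000, 5000, "GENE1", "protein_coding"),
    ("chr2", 200, 900, "GENE2", "protein_coding"),
]
annotation_b = [
    ("chr1", 1200, 4800, "GENE1", "protein_coding"),
    ("chr2", 250, 950, "GENE2P", "pseudogene"),
]

conflicts = biotype_conflicts(annotation_a, annotation_b)
```

Variants falling in a locus reported in `conflicts` would be candidates for manual review, since the predicted consequence depends on which annotation set is trusted.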

Transcript expression profile

Is the transcript expressed in the correct tissue for the disease phenotype?

Is the transcript expressed at the right developmental time for the disease phenotype?
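The two questions above amount to filtering transcripts by an expression profile. A minimal sketch follows, assuming expression is available as TPM values keyed by `(tissue, stage)`; the `expressed_transcripts` function, the transcript identifiers, the values, and the `min_tpm` cut-off of 1.0 are illustrative assumptions, not recommended settings.

```python
from typing import Dict, List, Tuple

# transcript id -> {(tissue, developmental stage): expression in TPM}
ExpressionTable = Dict[str, Dict[Tuple[str, str], float]]


def expressed_transcripts(expression: ExpressionTable,
                          tissue: str,
                          stage: str,
                          min_tpm: float = 1.0) -> List[str]:
    """Transcripts with at least min_tpm expression in the given tissue
    and developmental stage (hypothetical helper; a missing entry is
    treated as no measured expression)."""
    return sorted(
        tx for tx, profile in expression.items()
        if profile.get((tissue, stage), 0.0) >= min_tpm
    )


# Made-up profiles for two transcripts of the same gene.
expression = {
    "TX_LONG": {("brain", "adult"): 12.0, ("brain", "fetal"): 0.2},
    "TX_SHORT": {("brain", "fetal"): 8.5, ("heart", "adult"): 3.1},
}

fetal_brain = expressed_transcripts(expression, "brain", "fetal")
adult_brain = expressed_transcripts(expression, "brain", "adult")
```

In this toy data only `TX_SHORT` passes for fetal brain, illustrating that the transcript relevant to a developmental phenotype need not be the longest or the one dominant in adult tissue.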

Abbreviations: CCDS, Collaborative Consensus Coding Sequence project; lncRNA, long non-coding RNA; LRG, Locus Reference Genomic project; NMD, nonsense-mediated decay; WES, whole-exome sequencing; WGS, whole-genome sequencing