About RetrogeneDB:

RetrogeneDB is a database of retrocopy annotations in sequenced eukaryotic genomes. Retrocopies tend to be very poorly annotated in public genomic repositories (NCBI, Ensembl), and the main retrocopy databases (RCPedia, Pseudogene.org) are limited to a few model organisms. RetrogeneDB is the first database to provide retrocopy data on such a large scale (currently 62 genomes from Ensembl release 73). The retrocopies in RetrogeneDB were detected by our custom pipeline, which was designed to keep the number of false positives low. RetrogeneDB lets users easily search for retrocopies and their parental genes using various criteria, from similarity to the parental gene to expression levels.

What should you generally know about RetrogeneDB:

We use 0-based coordinates (i.e. the first position in the chromosome has the coordinate 0). This is the same as the convention used in UCSC and unlike the one used in Ensembl (which is 1-based).
Data regarding parental genes and their homology were imported from Ensembl. As a result, some data may be missing (for example many genes don't have the gene symbol assigned).
In some cases two or more genes give equally good alignments to the retrocopy. In such situations, the parental gene is assigned randomly.
Although the yeast genome is included in Ensembl 73, it was not included in RetrogeneDB because no retrocopies were detected in it.
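
The coordinate convention noted above matters when comparing RetrogeneDB positions with Ensembl. The sketch below converts between the two systems. It assumes intervals are half-open (start included, end excluded), as in UCSC BED files; the text above only states that coordinates are 0-based, so the treatment of the end coordinate is an assumption, and the function names are hypothetical.

```python
from typing import Tuple


def zero_to_one_based(start0: int, end0: int) -> Tuple[int, int]:
    """Convert a 0-based, half-open interval to a 1-based, closed one.

    Assumption: the 0-based end is exclusive (UCSC/BED style).
    """
    return start0 + 1, end0


def one_to_zero_based(start1: int, end1: int) -> Tuple[int, int]:
    """Convert a 1-based, closed interval (Ensembl style) to 0-based, half-open."""
    return start1 - 1, end1


# The first 100 bases of a chromosome in both conventions.
ensembl_style = zero_to_one_based(0, 100)
retrogenedb_style = one_to_zero_based(*ensembl_style)
```

Under this assumption the interval length is simply end minus start in the 0-based system, and end minus start plus one in the 1-based system.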

What protocols were used for the data analysis within RetrogeneDB:

Retrocopies were identified with the LAST program (Kiełbasa et al. 2011) by translated alignment of protein sequences to the hard-masked reference genome sequence. All sequences, for all species, were downloaded from Ensembl 73 (Flicek et al. 2013) and Ensembl Plants 30[[REF for correct ensemble version]]. Species names and genome assembly versions are listed in supplementary table S1. Genes that contain a reverse transcriptase domain were excluded from the set. We used the following LAST parameters:

  • Substitution matrix: BLOSUM62
  • Gap existence penalty: 11
  • Gap extension penalty: 2
  • Frameshift penalty: 15
  • Drop-off value: 20

Multiple alignment hits to the same genomic locus were clustered using BedTools (Quinlan and Hall 2010). We required an overlap of at least 150 bp between alignments on the same strand to join them into a cluster. If a cluster overlapped a known protein-coding gene, it was excluded from further analysis. All remaining clusters were considered potential retrocopies. For each of these clusters we selected the optimal alignment, i.e. the one with the highest score, together with any suboptimal alignments whose score was at least 98% of the optimal score. We considered a cluster to be a retrocopy if the following criteria were fulfilled:

  • The optimal or a suboptimal alignment involves the loss of at least 2 introns longer than 75 nt in comparison to the parental gene (based on Marques et al. 2005).
  • The alignment is at least 150 bp long at the nucleotide level.
  • Identity and coverage at the protein sequence level are both at least 50%.
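
The clustering and filtering steps described above can be sketched as follows. This is a minimal illustration and not the pipeline's actual code: the `Alignment` fields, the function names, the use of the genomic span as the alignment length, and the requirement that a single alignment satisfy all three criteria at once are assumptions. The clustering here is a plain single-linkage pass written in Python instead of BedTools, and the exclusion of clusters overlapping known protein-coding genes is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alignment:
    chrom: str
    start: int          # 0-based start on the genome
    end: int            # exclusive end (assumption)
    strand: str
    score: float
    lost_introns: int   # parental introns > 75 nt missing from the alignment
    identity: float     # protein-level identity, in percent
    coverage: float     # protein-level coverage, in percent


def cluster_alignments(alignments: List[Alignment],
                       min_overlap: int = 150) -> List[List[Alignment]]:
    """Join same-strand alignments that overlap by at least min_overlap bp."""
    clusters: List[List[Alignment]] = []
    cluster_end = 0
    ordered = sorted(alignments, key=lambda a: (a.chrom, a.strand, a.start))
    for aln in ordered:
        same_locus = (
            bool(clusters)
            and clusters[-1][0].chrom == aln.chrom
            and clusters[-1][0].strand == aln.strand
            and min(cluster_end, aln.end) - aln.start >= min_overlap
        )
        if same_locus:
            clusters[-1].append(aln)
            cluster_end = max(cluster_end, aln.end)
        else:
            clusters.append([aln])
            cluster_end = aln.end
    return clusters


def candidate_alignments(cluster: List[Alignment],
                         fraction: float = 0.98) -> List[Alignment]:
    """Keep the optimal alignment and those scoring >= 98% of its score."""
    best = max(a.score for a in cluster)
    return [a for a in cluster if a.score >= fraction * best]


def is_retrocopy(cluster: List[Alignment]) -> bool:
    """Apply the three criteria to the optimal and suboptimal alignments."""
    return any(
        a.lost_introns >= 2
        and (a.end - a.start) >= 150   # genomic span used as alignment length
        and a.identity >= 50
        and a.coverage >= 50
        for a in candidate_alignments(cluster)
    )
```

Because the alignments are sorted by start position, comparing each one against the running end of the current cluster is equivalent to asking whether it overlaps any earlier member of that cluster by the required amount.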

The final step was the annotation of the retrocopy genomic coordinates, the parental gene, and the identity and coverage values, all based on the best alignment in the cluster that showed signs of retroposition. If more than one alignment was equally good, the final alignment was chosen randomly. If at least 50% of a newly annotated retrocopy overlapped known pseudogenes from Ensembl annotations, its status was set to “KNOWN_PSEUDOGENE”. Otherwise, its status was set to “NOVEL”.
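
The status assignment can be illustrated with a short sketch. It assumes that the 50% threshold is measured against the retrocopy's own length, that overlap with a single pseudogene is evaluated at a time, and that strand is ignored; none of these details are specified above. The function names and the tuple layout are hypothetical.

```python
from typing import Iterable, Tuple

Region = Tuple[str, int, int]  # (chromosome, 0-based start, exclusive end)


def overlap_fraction(start: int, end: int, other_start: int, other_end: int) -> float:
    """Fraction of [start, end) that is covered by [other_start, other_end)."""
    overlap = max(0, min(end, other_end) - max(start, other_start))
    return overlap / (end - start)


def retrocopy_status(retrocopy: Region,
                     pseudogenes: Iterable[Region],
                     min_fraction: float = 0.5) -> str:
    """Return the status label for a retrocopy outside protein-coding genes."""
    chrom, start, end = retrocopy
    for p_chrom, p_start, p_end in pseudogenes:
        if p_chrom != chrom:
            continue
        if overlap_fraction(start, end, p_start, p_end) >= min_fraction:
            return "KNOWN_PSEUDOGENE"
    return "NOVEL"
```

For example, a 200 bp retrocopy that shares 100 bp with an annotated pseudogene sits exactly at the threshold and would be labeled “KNOWN_PSEUDOGENE” under these assumptions.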

To identify retrocopies that are known protein-coding genes, a modified approach was applied. For a given species, the collection of all products of “protein-coding” genes, based on Ensembl annotations, was self-aligned using LAST. Alignments between alternative products of the same gene were removed. We then considered as potential retrocopies all genes whose entire coding sequence is contained within a single exon. If a gene encoded more than one protein, the longest transcript was used. For each of these potential retrocopies, we searched for protein sequence alignments to products of genes that show no reverse transcriptase activity (based on Ensembl protein descriptions), whose coding sequence is at least 150 bp long, and for which the alignment coverage and identity are both at least 50%. For the parental gene protein sequence included in the alignment, we additionally required that it be encoded by at least 3 exons. All retrocopies that met these requirements received the “KNOWN_PROTEIN_CODING” status.

All retrocopies went through manual curation before the final release of the database. In particular, we manually screened fewer than 100 parental genes that gave rise to a large number of retrocopies. We excluded genes that originated from transposons and whose evolutionary conservation patterns seemed untrustworthy. In comparison to the first release of RetrogeneDB (Kabza et al. 2014), we have also excluded X human retrocopies (accession numbers here), which were located on chromosome patches. This was done because retrocopies located on genomic patches are usually identical or extremely similar to their counterparts located on the actual chromosomes or scaffolds. As a result, it is impossible to obtain unique mappings of reads (RNA-Seq, ChIP-Seq etc.) to those regions.

References:

  • ...
  • ...
A retrocopy is considered to have a conserved ORF if it contains no frameshifts and no stop codons.
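
The conserved ORF definition can be expressed as a simple check. This sketch assumes that the translated retrocopy sequence taken from the alignment marks stop codons with `*` and frameshifts with `/` or `\`; these marker characters and the function name are assumptions about the input, not something stated above.

```python
from typing import Sequence


def has_conserved_orf(translated: str,
                      stop_char: str = "*",
                      frameshift_chars: Sequence[str] = ("/", "\\")) -> bool:
    """True if the translated sequence has no stop codon and no frameshift marks."""
    if stop_char in translated:
        return False
    return not any(mark in translated for mark in frameshift_chars)
```

A stricter or looser rule (for instance, tolerating a terminal stop codon) would need an explicit decision; the sketch applies the definition literally.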