Retrocopies were identified using the LAST program  by the translated protein sequence alignment to the hard-masked reference genome sequence. All sequences of all species were downloaded from Ensembl 73  and Ensembl Plants 30 . Species names with the genome assembly numbers are listed in supplementary table ST1. Genes that contain a reverse transcriptase domain were excluded from the set. The following parameters of the LAST program were applied:
Multiple alignment hits to the same genomic locus were clustered using BEDTools . A minimum overlap of 150 bp between alignments on the same strands was required to join them into a cluster. If a particular cluster overlapped a known protein-coding gene, it was excluded from further analysis. All remaining clusters were considered to be potential retrocopies. For each of these clusters, the alignment with the highest score was selected as optimal. Suboptimal alignments were selected if their score was at least 98% of the optimal alignment. A cluster was considered to be a retrocopy if the following criteria were fulfilled:
The final step included annotation of the parental gene identity and genomic coordinates of the retrocopy based on the best alignment from the selected cluster. In the case of two or more alignments that were equally good, the final alignment was chosen randomly. If a newly annotated retrocopy had at least a 50% overlap with known pseudogenes from Ensembl annotations, its status was considered to be “KNOWN_PSEUDOGENE”. Otherwise, retrocopy status was considered as “NOVEL”.
To identify annotated, protein-coding genes that originated via retroposition, all protein sequences originating from protein-coding genes of a given organism were aligned to themselves using LAST . Proteome sequences were limited to the products of genes annotated in Ensembl as ‘protein-coding’. Alignments were subsequently filtered, and those between isoforms of the same gene were removed. Next, genes whose coding sequence consisted of one exon in every transcript were selected. If the gene encoded more than one protein, the longest isoform was chosen. Each gene was then searched for the best-scoring alignment and examined for signs of retroposition. To call the gene a retrogene, the best alignment must have met following criteria: neither tested gene nor its potential parental gene should show reverse transcriptase activity (detected from the Ensembl descriptions of protein families), the coding sequence of the tested gene is at least 150 bp long, the alignment identity and coverage of both proteins is greater than 50% and the aligned fragment with potential parental gene (excluding the first and last 10 amino acids) must span at least two introns. Every retrogene found this way, was classified as 'KNOWN_PROTEIN_CODING'.
All of the retrocopies went through manual curation before the final release of the database. Genes that originated from transposons, and whose evolutionary conservation patterns seemed untrustworthy, were excluded. In addition, genes that provided a large number of retrocopies were carefully examined. In the comparison to the first release of the RetrogeneDB database , 316 human and 9 mouse retrocopies, which were located on chromosome patches, were also removed. This removal was dictated by the fact that retrocopies located on genomic patches are usually identical or extremely similar to their counterparts located on actual chromosomes or scaffolds. As a result, it is impossible to obtain unique mappings of reads (e.g., RNA-Seq, ChIP-Seq) to those regions.
To estimate the expression of both retrogenes and their parental genes, we utilized a unique hit approach, similar to the one used in RCPedia . We downloaded short read libraries for various tissues from selected records of the NCBI SRA database and trimmed them using Trimmomatic . Short reads matching known rRNA sequences were removed using Bowtie 2 . Filtered reads were mapped to their reference genomes using STAR . Reads that mapped uniquely to retrocopies and their parental genes were counted using the featureCounts program from the Subread package . Expression estimates were normalized using the overall number of mapped reads, resulting in RPM (reads per million mapped reads) values.
EST sequences of 43 species, listed in supplementary table S5, were downloaded from the UCSC database . For a given species, all nucleotide sequences of retrocopies were aligned to the EST sequences using BLAST . EST sequences, which had a sequence alignment less than 100 bp long and a coverage of less than 90%, were excluded from further analysis. All remaining EST sequences were then aligned to the reference genomes using BLAT . An EST sequence was considered to be a retrocopy transcript if the EST – retrocopy alignment had an identity equal to or higher than 95% and EST coverage of at least 85%. Finally, EST alignment with a retrocopy had to have a higher score and identity than the alignment of the same EST with the parental gene or any other sequence.
The FANTOM5 project has provided enormous insight into the alternative transcription start sites (TSS) landscape in human and mouse genomes, where the TSS positions were determined using the CAGE method and applied in a unified protocol to 1829 human and 1072 mouse libraries. This vast collection of libraries, covering inter alia tissues, cell lines and time-courses, was used for the annotation of retrocopies of TSSs. The detailed description of all utilized libraries is provided here . FANTOM5 TSS annotations were downloaded from the official project website and converted into bed format. Mouse annotations were then converted from the mm9 to the mm10 genome version using the liftOver tool . TSS was considered as a possible retrocopy alternative transcription start site if it was located near the 5’ ends of retrocopies, that is, in the region up to 500 bp upstream of the annotated retrocopy 5’ end, down to 30% of the annotated retrocopy body:
The distance of 500 bp was adopted from the automated FANTOM5 protocols . The transcription start sites potentially associated with retrocopies were identified using the intersect program from the BEDTools package , which requires an overlap of at least 1 bp between the annotated TSS positions and the described region near the 5’ ends of retrocopies. Each TSS, based on FANTOM project data, has specified expression level normalized to tags per million (TPM).
ChIP-Seq experiments provide valuable information of the exact DNA regions with which DNA-binding proteins interact. Taking advantage of the efforts made within the Encode Project , precomputed RNA Polymerase II Subunit A activity peaks from 53 human and 4 mouse ChIP-Seq libraries were used to annotate polymerase activity. All libraries are listed here. Mouse datasets were converted from the mm9 to the mm10 genome version using the liftOver tool from UCSC . A polymerase peak was considered to be possibly linked to the retrocopy if its presence was detected within the annotated retrocopy body, down to 30% downstream of the 5’ end and no further than 1000 bp upstream of the 5’ end (figure below). Identification of such peaks was done using the intersect program from the BEDTools package . At least 1 bp of overlap between the RNA Polymerase II peaks and the described region surrounding the 5’ end of the retrocopy was required.
For data visualization in the web browser, peaks from various libraries were merged using the merge program from the BEDTools package with the default settings. Additionally, for the visualization of the peak signal strength, the raw signal/coverage values were downloaded for the corresponding ChIP-Seq libraries in bigWig format. Their signal was then merged into one bigWig file per species, using the bigWigMerge program from UCSC