Recognition of 3′-end L1, Alu, processed pseudogenes, and mRNA stem-loops in the human genome using sequence-based and structure-based machine-learning models
The role of 3′-end stem-loops in transposition has been demonstrated experimentally for transposons of various species in which LINE and SINE elements share the same 3′-end sequence containing a stem-loop. We found that 62-68% of processed pseudogenes and mRNAs also have 3′-end stem-loops. We investigated the properties of the 3′-end stem-loops of human L1s, Alus, processed pseudogenes and mRNAs, which do not share the same sequences but all have 3′-end stem-loops. We built sequence-based and structure-based machine-learning models that recognize 3′-end L1, Alu, processed pseudogene and mRNA stem-loops with high performance. The sequence-based models use only sequence information and capture compositional bias in 3′-ends. The structure-based models consider physical, chemical and geometrical properties of the dinucleotides composing a stem, together with the position-specific nucleotide content of a loop and a bulge. The most important parameters include shift, tilt, rise, and hydrophilicity. The results point to the existence of structural constraints on the 3′-end stem-loops of L1 and Alu, which are probably important for transposition, and reveal the potential of mRNAs to be recognized by the L1 machinery. The constructed models are freely available on GitHub (https://github.com/AlexShein/transposons/) and can be used for de novo discovery of transposon-related stem-loops.
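As an illustration of what "sequence information only" can mean in practice, the sketch below turns a 3′-end sequence into a vector of k-mer frequencies, one simple way to expose compositional bias to a classifier. It is a minimal sketch under stated assumptions, not the feature set used by the published models: the function name `kmer_frequencies`, the default `k=2`, and the choice to skip windows containing ambiguous bases are all hypothetical.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: RNA input is mapped to the DNA alphabet


def kmer_frequencies(seq, k=2):
    """Return the relative frequency of every k-mer over ACGT in seq.

    Hypothetical helper for illustration only. Windows containing
    characters outside ACGT (for example N) are not counted.
    """
    seq = seq.upper().replace("U", "T")
    # Enumerate all 4**k k-mers in a fixed order so feature vectors align.
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Normalize by the number of valid (unambiguous) windows.
    total = sum(counts[km] for km in kmers)
    return {km: (counts[km] / total if total else 0.0) for km in kmers}
```

Calling `kmer_frequencies(three_prime_end)` on each candidate sequence yields fixed-length vectors that could be passed to any standard classifier; the structure-based features described above (dinucleotide physical properties, loop and bulge content) would need a separate encoding.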