Recognition of 3’ UTR stem-loop in LINE transposons across the tree of life by machine learning methods
We previously found that L1 and Alu transposons in the human genome contain a conserved stem-loop structure in their 3’ UTR. We built a machine-learning model that could distinguish L1 3’ UTR stem-loop structures from stem-loops at other genomic locations. Later we found that all LINE transposons contain stem-loops at their 3’-end. Since the 3’-end stem-loop structure has been shown experimentally to play an important role in the recognition of transposon RNA by the LINE-encoded reverse transcriptase in several species [2-4], we hypothesize that this structure is preserved for the same purpose in other species. Here we built a machine-learning model using the random forest algorithm to study the structural properties of 3’-end transposon stem-loops. The model is based on physical, chemical and structural RNA characteristics such as enthalpy, entropy, Gibbs free energy, hydrophilicity, and the helical structural parameters of dinucleotides (Shift, Roll, Slide, Rise, Tilt, Bend). Each stem-loop structure was split into 30 positions and each position was described by 23 characteristics, so that the final property vector contained 602 position-specific characteristics for each stem-loop. We extracted 2200 sequences, covering all available LINE transposons from species across the tree of life, from the RepBase database. The resulting random forest model distinguished 3’-end LINE stem-loops from random stem-loops with 78% accuracy. Analysis of predictor importance revealed that enthalpy and entropy in loop positions, and hydrophilicity and stacking energy in stem positions, contributed most to the predictive power of the model. These results support the idea that 3’-end transposon stem-loops share similar structural properties, which are probably required for transposition.
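The position-specific encoding described above can be illustrated with a minimal sketch. It is not the authors' implementation: the property table holds arbitrary placeholder numbers rather than real thermodynamic or helical parameters, only three properties are used instead of 23, the names `placeholder_table` and `encode` are hypothetical, and it assumes that each "position" is a dinucleotide step read along the sequence. The sketch does not attempt to reproduce the exact 602-element vector reported in the abstract.

```python
from itertools import product
from typing import Dict, List

BASES = "ACGU"
# Illustrative subset of property names; the abstract reports 23 per position.
PROPERTIES = ["enthalpy", "entropy", "roll"]


def placeholder_table() -> Dict[str, List[float]]:
    """Build a dummy dinucleotide -> property-values table.

    The numbers are arbitrary placeholders (assumption), not measured
    physical or structural parameters.
    """
    table: Dict[str, List[float]] = {}
    for i, (a, b) in enumerate(product(BASES, repeat=2)):
        table[a + b] = [float(i + k) for k in range(len(PROPERTIES))]
    return table


def encode(seq: str, table: Dict[str, List[float]], n_positions: int) -> List[float]:
    """Concatenate the property values of the dinucleotide at each position.

    Positions past the end of the sequence, or containing symbols missing
    from the table, are filled with zeros so every vector has equal length.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA or RNA alphabet
    n_props = len(next(iter(table.values())))
    vector: List[float] = []
    for pos in range(n_positions):
        dinuc = seq[pos:pos + 2]
        vector.extend(table.get(dinuc, [0.0] * n_props))
    return vector


if __name__ == "__main__":
    table = placeholder_table()
    # Hypothetical stem-loop sequence used only to exercise the encoder.
    features = encode("GGCACGAAAGUGCC", table, n_positions=30)
    print(len(features))  # n_positions * number of properties
```

Fixed-length vectors of this kind, one per stem-loop, are the sort of input that a random forest classifier (for example `RandomForestClassifier` in scikit-learn) can be trained on, with labels marking 3’-end LINE stem-loops versus random stem-loops.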