Построение модели машинного обучения, распознающей G-квадруплексы на основе физико-химических свойств ДНК
Sentiment analysis has become a powerful tool in processing and analysing expressed opinions on a large scale. While the application of sentiment analysis on English-language content has been widely examined, the applications on the Russian language remains not as well-studied. In this survey, we comprehensively reviewed the applications of sentiment analysis of Russian-language content and identified current challenges and future research directions. In contrast with previous surveys, we targeted the applications of sentiment analysis rather than existing sentiment analysis approaches and their classification quality. We synthesised and systematically characterised existing applied sentiment analysis studies by their source of analysed data, purpose, employed sentiment analysis approach, and primary outcomes and limitations. We presented a research agenda to improve the quality of the applied sentiment analysis studies and to expand the existing research base to new directions. Additionally, to help scholars selecting an appropriate training dataset, we performed an additional literature review and identified publicly available sentiment datasets of Russian-language texts.
There is ample evidence that morphological and social cues in a human face provide signals of human personality and behaviour. Previous studies have discovered associations between the features of artificial composite facial images and attributions of personality traits by human experts. We present new findings demonstrating the statistically significant prediction of a wider set of personality features (all the Big Five personality traits) for both men and women using real-life static facial images. Volunteer participants (N = 12,447) provided their face photographs (31,367 images) and completed a self-report measure of the Big Five traits. We trained a cascade of artificial neural networks (ANNs) on a large labelled dataset to predict self-reported Big Five scores. The highest correlations between observed and predicted personality scores were found for conscientiousness (0.360 for men and 0.335 for women) and the mean effect size was 0.243, exceeding the results obtained in prior studies using ‘selfies’. The findings strongly support the possibility of predicting multidimensional personality profiles from static facial images using ANNs trained on large labelled datasets. Future research could investigate the relative contribution of morphological features of the face and other characteristics of facial images to predicting personality.
This book constitutes the proceedings of the 8th International Conference on Analysis of Images, Social Networks and Texts, AIST 2019, held in Kazan, Russia, in July 2019.
The 24 full papers and 10 short papers were carefully reviewed and selected from 134 submissions (of which 21 papers were rejected without being reviewed). The papers are organized in topical sections on general topics of data analysis; natural language processing; social network analysis; analysis of images and video; optimization problems on graphs and network structures; analysis of dynamic behaviour through event data.
With the advances in the sequencing technology the International Cancer Genome Consortium (ICGC)  and The Cancer Genome Atlas (TCGA)  collected data on more than 16 000 genome-wide pairs tumor-normal tissue providing a valuable resource to study cancer mutations. In this research we focus on pre- evaluation of the relationship between cancer breakpoint hotspots and DNA regions potentially forming secondary structures such as stem-loops (cruciforms) and quadru- plexes. We performed analysis of 2 234 samples covering 10 cancer types and built machine-learning models predicting cancer breakpoint distribution over chromosome based on the density distribution of stem-loops and quadruplexes. We developed pro- cedure for machine learning models building and evaluation as the considered data are extremely imbalanced and it is needed to get reliable estimate of prediction power. We conducted a set of experiments to select the best appropriate resampling scheme, class balancing technique and parameters of machine learning algorithms. The best final models were applied to cancer breakpoints data. From the performed analysis it could be concluded that the relationship between cancer breakpoints hotspots and studied DNA secondary structures exists, however, generally, this relationship is weak for stem-loops, but higher for quadruplexes. We also found differences in model predictive power depending on cancer types. Thus, stem-loop-based model performs better for pancreatic, prostate, ovary, uterus, brain and liver cancer, and quadruplex- based model works better for blood, bone, skin and breast cancer.
One of the most challenging data analysis tasks of modern High Energy Physics experiments is the identification of particles. In this proceedings we review the new approaches used for particle identification at the LHCb experiment. Machine-Learning based techniques are used to identify the species of charged and neutral particles using several observables obtained by the LHCb sub-detectors. We show the performances of various solutions based on Neural Network and Boosted Decision Tree models.
Non-B DNA structures have a great potential to form and influence various genomic processes including transcription. One of the mechanisms of transcription regulation is nucleosome positioning. Even though only B-DNA can be wrapped around a nucleosome, non-B DNA structures can compete with a nucleosome for a genomic location. Here we used permanganate/S1 nuclease footprinting data on non-B DNA structures, such as Z-DNA, H-DNA, G-quadruplexes and stress-induced duplex destabilization (SIDD) sites, together with MNase-seq data on nucleosome positioning in the mouse genome. We found three types of patterns of nucleosome positioning around non-B DNA structures: a structure is surrounded by nucleosomes from both sides, from one side, or nucleosome free region. Machine learning models based on random forest and XGBoost algorithms were constructed to recognize DNA regions of 1kB length containing a particular pattern of nucleosome positioning for four types of DNA structures (Z-DNA, H-DNA, G-quadruplexes and SIDD sites) based on statistics of di- and tri-nucleotides. The best performance (94% of accuracy) was reached for Gquadruplexes while for other types of structures the accuracy was under 70%. We conclude that 1kB regions containing Gquadruplexes have distinct compositional properties, and this fact points to preferential locations of such pattern in the genome and requires further investigation. Gene ontology analysis revealed that the genes intersecting with the discovered patterns are enriched in channel and transmembrane activity, transcription factor and receptor binding. The direction for further research is to study the distribution of the discovered patterns in different tissues to identify well-positioned and dynamic nucleosomes and reveal genes, regulated via DNA structures and nucleosome positioning.
This book constitutes the refereed proceedings of the 11th International Conference on Intelligent Data Processing, IDP 2016, held in Barcelona, Spain, in October 2016.
The 11 revised full papers were carefully reviewed and selected from 52 submissions. The papers of this volume are organized in topical sections on machine learning theory with applications; intelligent data processing in life and social sciences; morphological and technological approaches to image analysis.
Background: Chromosomal rearrangements are the typical phenomena in cancer genomes causing gene disruptions and fusions, corruption of regulatory elements, damage to chromosome integrity. Among the factors contributing to genomic instability are non-B DNA structures with stem-loops and quadruplexes being the most prevalent. We aimed at investigating the impact of specifically these two classes of non-B DNA structures on cancer breakpoint hotspots using machine learning approach.
Methods: We developed procedure for machine learning model building and evaluation as the considered data are extremely imbalanced and it was required to get a reliable estimate of the prediction power. We built logistic regression models predicting cancer breakpoint hotspots based on the densities of stem-loops and quadruplexes, jointly and separately. We also tested Random Forest models varying different resampling schemes (leave-one-out cross validation, train-test split, 3-fold cross-validation) and class balancing techniques (oversampling, stratification, synthetic minority oversampling).
Results: We performed analysis of 487,425 breakpoints from 2234 samples covering 10 cancer types available from the International Cancer Genome Consortium. We showed that distribution of breakpoint hotspots in different types of cancer are not correlated, confirming the heterogeneous nature of cancer. It appeared that stem-loop- based model best explains the blood, brain, liver, and prostate cancer breakpoint hotspot profiles while quadruplex- based model has higher performance for the bone, breast, ovary, pancreatic, and skin cancer. For the overall cancer profile and uterus cancer the joint model shows the highest performance. For particular datasets the constructed models reach high predictive power using just one predictor, and in the majority of the cases, the model built on both predictors does not increase the model performance.
Conclusion: Despite the heterogeneity in breakpoint hotspots’ distribution across different cancer types, our results demonstrate an association between cancer breakpoint hotspots and stem-loops and quadruplexes. Approximately for half of the cancer types stem-loops are the most influential factors while for the others these are quadruplexes. This fact reflects the differences in regulatory potential of stem-loops and quadruplexes at the tissue-specific level, which yet to be discovered at the genome-wide scale. The performed analysis demonstrates that influence of stem- loops and quadruplexes on breakpoint hotspots formation is tissue-specific.