JMLR Workshop and Conference Proceedings
The recently proposed distance dependent Chinese Restaurant Process (ddCRP) generalizes the extensively used Chinese Restaurant Process (CRP) by accounting for dependencies between data points. Its posterior is intractable, and so far only MCMC methods have been used for inference. Because of the very different nature of the ddCRP, no prior developments in variational methods for Bayesian nonparametrics are applicable. In this paper we propose a novel variational inference scheme for the important sequential case of the ddCRP (seqddCRP) by revealing its connection with the Laplacian of the random graph constructed by the process. We develop an efficient algorithm for optimizing the variational lower bound and demonstrate its efficiency compared to a Gibbs sampler. We also apply our variational approximation to the CRP-equivalent seqddCRP mixture model, where it can be considered an alternative to the approximation based on the truncated stick-breaking representation. This allows us to achieve a significantly better variational lower bound than the variational approximation based on truncated stick-breaking for the Dirichlet process.
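For readers unfamiliar with the construction, the following minimal Python sketch (all names are illustrative; this is not the paper's inference algorithm) samples customer links from a sequential ddCRP prior, in which each customer may only sit with itself or an earlier customer, and recovers clusters as connected components of the resulting link graph, the same random graph whose Laplacian the variational scheme exploits.

```python
import numpy as np

def sample_seqddcrp_links(distances, decay, alpha, rng=None):
    """Sample customer links from a sequential ddCRP prior.

    distances : (N, N) array of pairwise distances between data points
    decay     : decay function applied to distances, e.g. lambda d: np.exp(-d)
    alpha     : self-link mass (concentration parameter)
    """
    rng = np.random.default_rng() if rng is None else rng
    n = distances.shape[0]
    links = np.empty(n, dtype=int)
    for i in range(n):
        # sequential case: customer i links to itself or to an earlier customer
        weights = np.append(decay(distances[i, :i]), alpha)
        links[i] = rng.choice(i + 1, p=weights / weights.sum())
    return links

def links_to_clusters(links):
    """Clusters are the connected components of the link graph."""
    n = len(links)
    labels = np.full(n, -1)
    for i in range(n):
        path, j = [], i
        while labels[j] == -1 and links[j] != j:   # follow links down to a self-link
            path.append(j)
            j = links[j]
        root = labels[j] if labels[j] != -1 else j
        for k in path + [j]:
            labels[k] = root
    return labels

# toy usage: 20 points on a line, exponential decay of distance
x = np.linspace(0, 10, 20)
D = np.abs(x[:, None] - x[None, :])
links = sample_seqddcrp_links(D, decay=lambda d: np.exp(-d), alpha=1.0)
print(links_to_clusters(links))
```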
In this paper we present a new framework for dealing with probabilistic graphical models. Our approach relies on the recently proposed Tensor Train format (TT-format) of a tensor, which, while being compact, allows for efficient application of linear algebra operations. We present a way to convert the energy of a Markov random field to the TT-format and show how one can exploit the properties of the TT-format to attack the tasks of partition function estimation and MAP inference. We provide theoretical guarantees on the accuracy of the proposed algorithm for estimating the partition function and compare our methods against several state-of-the-art algorithms.
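As an illustration of the format only (the paper converts MRF energies and relies on TT arithmetic rather than full SVDs of a dense tensor), here is a minimal TT-SVD sketch in Python/NumPy that decomposes a small dense tensor into TT-cores and verifies the reconstruction.

```python
import numpy as np

def tt_svd(tensor, max_rank=None, eps=1e-10):
    """Decompose a dense tensor into TT-cores via sequential SVDs (TT-SVD).

    Returns cores G_k of shape (r_{k-1}, n_k, r_k) such that
    tensor[i1, ..., id] ~= G_1[:, i1, :] @ G_2[:, i2, :] @ ... @ G_d[:, id, :].
    """
    shape = tensor.shape
    d = len(shape)
    cores, r_prev = [], 1
    c = np.asarray(tensor, dtype=float)
    for k in range(d - 1):
        mat = c.reshape(r_prev * shape[k], -1)      # unfold the remaining tensor
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = int(np.sum(s > eps * s[0])) or 1        # numerical rank
        if max_rank is not None:
            r = min(r, max_rank)
        cores.append(u[:, :r].reshape(r_prev, shape[k], r))
        c = s[:r, None] * vt[:r]                    # carry the remainder to the next step
        r_prev = r
    cores.append(c.reshape(r_prev, shape[-1], 1))
    return cores

# verify exact reconstruction on a small random tensor (no rank truncation)
t = np.random.rand(3, 4, 5)
cores = tt_svd(t)
rec = cores[0]
for g in cores[1:]:
    rec = np.tensordot(rec, g, axes=(rec.ndim - 1, 0))
print(np.allclose(rec.reshape(t.shape), t))
```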
Although user-generated data are widely used in medical informatics in general, and for revealing side effects of various pharmaceuticals in particular, recent studies have focused merely on methods of extracting information on side effects from unstructured or semi-structured reviews of specific medications, without linking side effects to any outcomes.
In this study we demonstrate how user-generated online content on side effects experienced by patients while taking a pharmaceutical product can be used for research after the drug has been introduced to the market, thus complementing the results of official clinical studies and market research. In particular, we concentrate on revealing the contribution of various side effects to reported customer satisfaction with Tamiflu, a popular antiviral drug.
Publicly available data from an online platform with reviews from patients are used as an input to the analysis that applies statistical and machine learning methods (multivariate logit models and classification trees) to investigate the relationships of side effects to demographic characteristics and to the overall satisfaction with the medication.
We prioritized side effects of Tamiflu based on the significance of their association with patient ratings published on one of the well-known drug discussion forums. Among all types of side effects used in our study, neuropsychiatric symptoms and body pain are the most influential, followed by skin problems. Specific combinations of side effects associated with low satisfaction were also detected.
The proposed analytical approach can help pharmaceutical companies to improve their products and/or the medical guidelines associated with them, and to determine which adverse effects should be given priority from the customer satisfaction perspective.
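A minimal sketch of such an analysis is given below; the column names, the dissatisfaction threshold and the synthetic data are placeholders rather than the actual survey variables, and the models are deliberately simplified versions of the multivariate logit and classification-tree analyses described above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

# placeholder data: one row per review, binary side-effect indicators,
# demographics and a 1-5 satisfaction rating (column names are hypothetical)
rng = np.random.default_rng(0)
n = 500
reviews = pd.DataFrame({
    "neuropsychiatric": rng.integers(0, 2, n),
    "body_pain": rng.integers(0, 2, n),
    "skin": rng.integers(0, 2, n),
    "nausea": rng.integers(0, 2, n),
    "age": rng.integers(1, 80, n),
    "female": rng.integers(0, 2, n),
    "rating": rng.integers(1, 6, n),
})

X = reviews.drop(columns="rating")
y = (reviews["rating"] <= 2).astype(int)      # 1 = dissatisfied

# multivariate logit: coefficients show how each side effect is associated
# with low satisfaction, controlling for the other covariates
logit = LogisticRegression(max_iter=1000).fit(X, y)
print(dict(zip(X.columns, logit.coef_[0].round(2))))

# shallow classification tree: its splits expose combinations of side effects
# jointly associated with low ratings
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```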
This book constitutes the proceedings of the 23rd International Symposium on Foundations of Intelligent Systems, ISMIS 2017, held in Warsaw, Poland, in June 2017. The 56 regular and 15 short papers presented in this volume were carefully reviewed and selected from 118 submissions. The papers include both theoretical and practical aspects of machine learning, data mining methods, deep learning, bioinformatics and health informatics, intelligent information systems, knowledge-based systems, mining temporal, spatial and spatio-temporal data, text and Web mining. In addition, four special sessions were organized; namely, Special Session on Big Data Analytics and Stream Data Mining, Special Session on Granular and Soft Clustering for Data Science, Special Session on Knowledge Discovery with Formal Concept Analysis and Related Formalisms, and Special Session devoted to ISMIS 2017 Data Mining Competition on Trading Based on Recommendations, which was launched as a part of the conference.
This volume is the supplementary volume of the 14th International Conference on Formal Concept Analysis (ICFCA 2017), held from June 13th to 16th 2017, at IRISA, Rennes. The ICFCA conference series is one of the major venues for researchers from the field of Formal Concept Analysis and related areas to present and discuss their recent work with colleagues from all over the world. Since its start in Darmstadt in 2003, the ICFCA conference series has been held in Europe, Australia, America, and Africa.
The field of Formal Concept Analysis (FCA) originated in the 1980s in Darmstadt as a subfield of mathematical order theory, with prior developments in other research groups. Its original motivation was to consider complete lattices as lattices of concepts, drawing motivation from philosophy and mathematics alike. FCA has since then developed into a wide research area with applications much beyond its original motivation, for example in logic, data mining, learning, and psychology.
The FCA community is mourning the passing of Rudolf Wille on January 22nd 2017 in Bickenbach, Germany. As one of the leading researchers throughout the history of FCA, he was responsible for inventing and shaping many of the fundamental notions of this area. Indeed, the publication of his article "Restructuring Lattice Theory: An Approach Based on Hierarchies of Concepts" is seen by many as the starting point of Formal Concept Analysis as an independent direction of research. He was head of the FCA research group in Darmstadt from 1983 until his retirement in 2003, and remained an active researcher and contributor thereafter. In 2003, he was among the founding members of the ICFCA conference series.
For this supplementary volume, 13 papers were chosen to be published: four papers judged mature enough to be discussed at the conference and nine papers presented in the demonstration and poster session.
Non-B DNA structures have a great potential to form and influence various genomic processes including transcription. One of the mechanisms of transcription regulation is nucleosome positioning. Even though only B-DNA can be wrapped around a nucleosome, non-B DNA structures can compete with a nucleosome for a genomic location. Here we used permanganate/S1 nuclease footprinting data on non-B DNA structures, such as Z-DNA, H-DNA, G-quadruplexes and stress-induced duplex destabilization (SIDD) sites, together with MNase-seq data on nucleosome positioning in the mouse genome. We found three types of patterns of nucleosome positioning around non-B DNA structures: a structure is surrounded by nucleosomes from both sides, from one side, or lies in a nucleosome-free region. Machine learning models based on random forest and XGBoost algorithms were constructed to recognize DNA regions of 1 kb length containing a particular pattern of nucleosome positioning for four types of DNA structures (Z-DNA, H-DNA, G-quadruplexes and SIDD sites) based on statistics of di- and tri-nucleotides. The best performance (94% accuracy) was reached for G-quadruplexes, while for other types of structures the accuracy was under 70%. We conclude that 1 kb regions containing G-quadruplexes have distinct compositional properties, and this fact points to preferential locations of such patterns in the genome and requires further investigation. For other DNA structures, the region composition is not a sufficient predictive factor, and one should take into account other physical and structural DNA properties to improve nucleosome/DNA-structure pattern recognition.
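A minimal sketch of such a classifier is shown below; the sequences and labels are synthetic placeholders (the xgboost package is assumed installed), and the feature set of di- and tri-nucleotide frequencies of a 1 kb region together with the two model families mirrors the setup described above only schematically.

```python
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

DI = ["".join(p) for p in product("ACGT", repeat=2)]
TRI = ["".join(p) for p in product("ACGT", repeat=3)]

def kmer_features(seq):
    """Di- and tri-nucleotide frequencies of a 1 kb region."""
    counts = {k: 0 for k in DI + TRI}
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] = counts.get(seq[i:i + 2], 0) + 1
    for i in range(len(seq) - 2):
        counts[seq[i:i + 3]] = counts.get(seq[i:i + 3], 0) + 1
    return np.array([counts[k] / max(len(seq) - 1, 1) for k in DI] +
                    [counts[k] / max(len(seq) - 2, 1) for k in TRI])

# placeholders: random sequences stand in for the real 1 kb regions and pattern labels
rng = np.random.default_rng(0)
sequences = ["".join(rng.choice(list("ACGT"), 1000)) for _ in range(200)]
labels = rng.integers(0, 2, 200)

X = np.vstack([kmer_features(s) for s in sequences])
y = np.asarray(labels)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
xgb = XGBClassifier(n_estimators=500, max_depth=4, learning_rate=0.1, eval_metric="logloss")
for name, model in [("random forest", rf), ("XGBoost", xgb)]:
    print(name, cross_val_score(model, X, y, cv=5, scoring="accuracy").mean().round(3))
```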
With the advances in sequencing technology, the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) have collected data on more than 16,000 genome-wide tumor-normal tissue pairs, providing a valuable resource for studying cancer mutations. In this research we focus on a pre-evaluation of the relationship between cancer breakpoint hotspots and DNA regions potentially forming secondary structures such as stem-loops (cruciforms) and quadruplexes. We analyzed 2,234 samples covering 10 cancer types and built machine learning models predicting cancer breakpoint distribution over a chromosome based on the density distribution of stem-loops and quadruplexes. We developed a procedure for building and evaluating machine learning models, since the considered data are extremely imbalanced and a reliable estimate of predictive power is needed. We conducted a set of experiments to select the most appropriate resampling scheme, class balancing technique and parameters of the machine learning algorithms. The best final models were applied to the cancer breakpoint data. From the performed analysis it can be concluded that a relationship between cancer breakpoint hotspots and the studied DNA secondary structures exists; however, in general, this relationship is weak for stem-loops and stronger for quadruplexes. We also found differences in model predictive power depending on cancer type. Thus, the stem-loop-based model performs better for pancreatic, prostate, ovary, uterus, brain and liver cancer, while the quadruplex-based model works better for blood, bone, skin and breast cancer.
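The sketch below illustrates one simple way to handle the class imbalance when evaluating such a model (class weighting with stratified cross-validation and balanced accuracy); the data are synthetic placeholders, and the study compares several resampling and balancing schemes rather than this single choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# placeholder data: real features would be per-window densities of stem-loops
# and quadruplexes along a chromosome; labels mark breakpoint-hotspot windows
rng = np.random.default_rng(0)
X = rng.random((2000, 4))
y = (rng.random(2000) < 0.05).astype(int)     # ~5% positives: heavy imbalance

# class weighting is one simple balancing technique; other resampling schemes
# would be swapped in at this point
model = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")
print(f"balanced accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```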
There are many different methods for computing relevant patterns in sequential data and interpreting the results. In this paper, we compute emerging patterns (EP) in demographic sequences using sequence-based pattern structures, along with different algorithmic solutions. The purpose of this method is to meet the following domain requirement: the obtained patterns must be (closed) frequent contiguous prefixes of the input sequences. This is required in order for demographers to fully understand and interpret the results.
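The following sketch illustrates the idea on toy demographic sequences: it counts contiguous prefixes, keeps the frequent ones, and reports those whose support grows between two groups (emerging patterns). The closedness check and the pattern-structure machinery of the paper are omitted for brevity, and the event codes are illustrative.

```python
from collections import Counter

def frequent_prefixes(sequences, min_support):
    """Support of contiguous prefixes of the input sequences, keeping the frequent ones."""
    counts = Counter()
    for seq in sequences:
        for k in range(1, len(seq) + 1):
            counts[tuple(seq[:k])] += 1
    n = len(sequences)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}

def emerging_patterns(group_a, group_b, min_support=0.05, min_growth=2.0):
    """Prefixes frequent in group_a whose support grows by at least min_growth
    relative to group_b (growth-rate definition of emerging patterns)."""
    sup_a = frequent_prefixes(group_a, min_support)
    sup_b = frequent_prefixes(group_b, 0.0)
    eps = 1e-9
    return {p: sup_a[p] / (sup_b.get(p, 0.0) + eps)
            for p in sup_a
            if sup_a[p] / (sup_b.get(p, 0.0) + eps) >= min_growth}

# toy life-course sequences (illustrative event codes)
men = [["work", "marriage", "child"], ["work", "child"], ["work", "marriage"]]
women = [["education", "work", "marriage"], ["work", "marriage", "child"]]
print(emerging_patterns(men, women))
```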
Non-B DNA structures have a great potential to form and influence various genomic processes including transcription. One of the mechanisms of transcription regulation is nucleosome positioning. Even though only B-DNA can be wrapped around a nucleosome, non-B DNA structures can compete with a nucleosome for a genomic location. Here we used permanganate/S1 nuclease footprinting data on non-B DNA structures, such as Z-DNA, H-DNA, G-quadruplexes and stress-induced duplex destabilization (SIDD) sites, together with MNase-seq data on nucleosome positioning in the mouse genome. We found three types of patterns of nucleosome positioning around non-B DNA structures: a structure is surrounded by nucleosomes from both sides, from one side, or lies in a nucleosome-free region. Machine learning models based on random forest and XGBoost algorithms were constructed to recognize DNA regions of 1 kb length containing a particular pattern of nucleosome positioning for four types of DNA structures (Z-DNA, H-DNA, G-quadruplexes and SIDD sites) based on statistics of di- and tri-nucleotides. The best performance (94% accuracy) was reached for G-quadruplexes, while for other types of structures the accuracy was under 70%. We conclude that 1 kb regions containing G-quadruplexes have distinct compositional properties, and this fact points to preferential locations of such patterns in the genome and requires further investigation. Gene ontology analysis revealed that the genes intersecting with the discovered patterns are enriched in channel and transmembrane activity, transcription factor and receptor binding. The direction for further research is to study the distribution of the discovered patterns in different tissues to identify well-positioned and dynamic nucleosomes and reveal genes regulated via DNA structures and nucleosome positioning.
Proceedings of the international conference "Neural Information Processing Systems 2018" (NIPS 2018).
One of the most challenging data analysis tasks of modern High Energy Physics experiments is the identification of particles. In these proceedings we review the new approaches used for particle identification at the LHCb experiment. Machine-learning based techniques are used to identify the species of charged and neutral particles using several observables obtained from the LHCb sub-detectors. We show the performance of various solutions based on Neural Network and Boosted Decision Tree models.
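As a schematic illustration only (the features are placeholders, not the actual LHCb observables, and the production models are considerably more elaborate), a boosted-decision-tree PID classifier can be sketched as follows.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# placeholder observables: in practice these would be per-track quantities from
# the sub-detectors (RICH, calorimeters, muon stations, tracking)
rng = np.random.default_rng(0)
observables = rng.normal(size=(5000, 8))
labels = (observables[:, 0] + 0.5 * rng.normal(size=5000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    observables, labels, stratify=labels, random_state=0)

bdt = GradientBoostingClassifier(n_estimators=300, max_depth=3, learning_rate=0.1)
bdt.fit(X_train, y_train)

proba = bdt.predict_proba(X_test)[:, 1]
print("ROC AUC:", round(roc_auc_score(y_test, proba), 3))
```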
A model for organizing cargo transportation between two node stations connected by a railway line which contains a certain number of intermediate stations is considered. The movement of cargo is in one direction. Such a situation may occur, for example, if one of the node stations is located in a region which produces raw material for the manufacturing industry located in the region of the other node station. The organization of freight traffic is performed by means of a number of technologies. These technologies determine the rules for taking on cargo at the initial node station, the rules of interaction between neighboring stations, as well as the rule of distribution of cargo to the final node stations. The process of cargo transportation follows a given control rule. For such a model, one must determine the possible modes of cargo transportation and describe their properties. The model is described by a finite-dimensional system of differential equations with nonlocal linear restrictions. The class of solutions satisfying the nonlocal linear restrictions is extremely narrow. This results in the need for a "correct" extension of solutions of the system of differential equations to a class of quasi-solutions whose distinctive feature is gaps at a countable number of points. Using the fourth-order Runge–Kutta method, we were able to numerically construct these quasi-solutions and determine their rate of growth. We note that the main technical difficulty consisted in obtaining quasi-solutions satisfying the nonlocal linear restrictions. Furthermore, we investigated the dependence of the quasi-solutions and, in particular, of the sizes of the gaps (jumps) of the solutions on a number of parameters of the model characterizing the control rule, the technologies for cargo transportation and the intensity of cargo arrival at a node station.
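A minimal sketch of the fourth-order Runge–Kutta step used for the numerical construction is given below; the linear right-hand side is a toy stand-in, not the paper's system, and the handling of the nonlocal restrictions and of the jumps is omitted.

```python
import numpy as np

def rk4_step(f, t, y, h):
    """One step of the classical fourth-order Runge-Kutta method for y' = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def integrate(f, y0, t0, t1, n_steps):
    """Integrate on [t0, t1]; in the model the trajectory would additionally be
    restarted with a jump wherever the nonlocal restrictions force a gap."""
    ts = np.linspace(t0, t1, n_steps + 1)
    ys = [np.asarray(y0, dtype=float)]
    for t_prev, t_next in zip(ts[:-1], ts[1:]):
        ys.append(rk4_step(f, t_prev, ys[-1], t_next - t_prev))
    return ts, np.array(ys)

# toy linear dynamics as a stand-in for the station system
A = np.array([[-1.0, 0.5], [0.0, -0.5]])
ts, ys = integrate(lambda t, y: A @ y, y0=[1.0, 1.0], t0=0.0, t1=5.0, n_steps=200)
print(ys[-1])
```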
Event logs collected by modern information and technical systems usually contain enough data for automated discovery of process models. A variety of algorithms has been developed for process model discovery, conformance checking, log-to-model alignment, comparison of process models, etc.; nevertheless, quick analysis of ad hoc selected parts of a log still lacks a full-fledged implementation. This paper describes a ROLAP-based method of multidimensional event log storage for process mining. The result of analyzing the log is visualized as a directed graph representing the union of all possible event sequences, ranked by their occurrence probability. Our implementation allows the analyst to discover process models for sublogs defined by an ad hoc selection of criteria and a chosen value of occurrence probability.
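The core of such a visualization, building a directed graph of directly-follows relations ranked by occurrence probability and filtering it by a chosen threshold, can be sketched as follows; the ROLAP storage and sublog selection layers are omitted, and the activity names are illustrative.

```python
from collections import Counter, defaultdict

def dfg_with_probabilities(log):
    """Build a directed graph of directly-follows relations from an event log
    (a list of traces, each a list of activity names) and estimate the
    probability of each transition from its source activity."""
    edge_counts = Counter()
    out_counts = Counter()
    for trace in log:
        for a, b in zip(trace[:-1], trace[1:]):
            edge_counts[(a, b)] += 1
            out_counts[a] += 1
    graph = defaultdict(dict)
    for (a, b), c in edge_counts.items():
        graph[a][b] = c / out_counts[a]
    return dict(graph)

def filter_by_probability(graph, threshold):
    """Keep only transitions whose occurrence probability reaches the threshold,
    mirroring the ad hoc filtering of the visualized model."""
    return {a: {b: p for b, p in succ.items() if p >= threshold}
            for a, succ in graph.items()}

log = [["register", "check", "approve"],
       ["register", "check", "reject"],
       ["register", "approve"]]
print(filter_by_probability(dfg_with_probabilities(log), 0.3))
```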
The geographic information system (GIS) is based on the first and only Russian Imperial Census of 1897 and the First All-Union Census of the Soviet Union of 1926. The GIS features vector data (shapefiles) of all provinces of the two states. For the 1897 census, there is information about linguistic, religious, and social estate groups. The part based on the 1926 census features nationality. Both shapefiles include information on gender, rural and urban population. The GIS allows for producing any necessary maps for individual studies of the period which require the administrative boundaries and demographic information.
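Assuming hypothetical file and column names (the actual shapefiles carry their own attribute schema), a typical map-producing workflow with this GIS might look like the following GeoPandas sketch.

```python
import geopandas as gpd
import matplotlib.pyplot as plt

# hypothetical file and column names: the real shapefiles carry the 1897 / 1926
# province polygons with linguistic, religious, estate (1897) or nationality
# (1926) counts plus gender and urban/rural population attributes
provinces = gpd.read_file("census_1897_provinces.shp")
provinces["urban_share"] = provinces["urban_pop"] / (
    provinces["urban_pop"] + provinces["rural_pop"])

# choropleth of the urban population share per province
ax = provinces.plot(column="urban_share", legend=True, figsize=(10, 6))
ax.set_title("Urban population share, 1897 census (illustrative)")
plt.savefig("urban_share_1897.png", dpi=200)
```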
It is well known that the class of sets that can be computed by polynomial-size circuits is equal to the class of sets that are polynomial-time reducible to a sparse set. It is widely believed, but unfortunately up to now unproven, that there are sets in EXP^NP, or even in EXP, that are not computable by polynomial-size circuits and hence are not reducible to a sparse set. In this paper we study this question in a more restricted setting: what is the computational complexity of sparse sets that are selfreducible? It follows from earlier work of Lozano and Torán (in: Mathematical systems theory, 1991) that EXP^NP does not have sparse selfreducible hard sets. We define a natural version of selfreduction, tree-selfreducibility, and show that NEXP does not have sparse tree-selfreducible hard sets. We also construct an oracle relative to which all of EXP is reducible to a sparse tree-selfreducible set. These lower bounds are corollaries of more general results about the computational complexity of sparse sets that are selfreducible, and can be interpreted as super-polynomial circuit lower bounds for NEXP.