Unsupervised Graph Anomaly Detection Algorithms Implemented in Apache Spark
The graph anomaly detection problem occurs in many application areas and can be solved by spotting outliers in unstructured collections of multi-dimensional data points, which can be obtained by graph analysis algorithms. We implement an algorithm for small-community analysis and an approximate LOF (Local Outlier Factor) algorithm based on Locality-Sensitive Hashing, apply both algorithms to a real-world graph, and evaluate their scalability. We use Apache Spark, one of the most popular Big Data frameworks.
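The LSH-based approximation can be sketched as follows: random-hyperplane hashing groups nearby points into buckets, and LOF is then computed only over bucket candidates rather than all pairs. This is an illustrative single-machine sketch, not the paper's Spark implementation; the function names and the fallback rule for undersized buckets are assumptions of this sketch.

```python
# Sketch: approximate LOF with random-hyperplane LSH limiting the
# neighbour search to hash-bucket candidates (not the paper's code).
import numpy as np
from collections import defaultdict

def lsh_buckets(X, n_planes=2, seed=0):
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(n_planes, X.shape[1]))
    keys = (X @ planes.T > 0).astype(int)  # sign pattern = bucket key
    buckets = defaultdict(list)
    for i, key in enumerate(map(tuple, keys)):
        buckets[key].append(i)
    return buckets

def approx_lof(X, k=3, n_planes=2):
    n = len(X)
    cands = [[] for _ in range(n)]
    for idx in lsh_buckets(X, n_planes).values():
        for i in idx:
            cands[i] = [j for j in idx if j != i]
    for i in range(n):  # fall back to all points if the bucket is too small
        if len(cands[i]) < k:
            cands[i] = [j for j in range(n) if j != i]

    def knn(i):  # k nearest candidates as (distance, index) pairs
        d = sorted((np.linalg.norm(X[i] - X[j]), j) for j in cands[i])
        return d[:k]

    kdist = {i: knn(i)[-1][0] for i in range(n)}  # k-distance per point

    def lrd(i):  # local reachability density
        nb = knn(i)
        return len(nb) / sum(max(kdist[j], d) for d, j in nb)

    # LOF: mean neighbour density divided by own density
    return np.array([np.mean([lrd(j) for _, j in knn(i)]) / lrd(i)
                     for i in range(n)])
```

On a tight cluster plus one distant point, the distant point receives a much larger score, since its reachability density is far below that of its neighbours.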
Apache Spark is one of the most popular Big Data frameworks. Performance evaluation of Big Data frameworks is a topic of interest due to the increasing number and importance of data analytics applications in the context of HPC and Big Data convergence. In this paper, we present an early performance evaluation of a typical supervised graph anomaly detection problem implemented using the GraphX and MLlib libraries in Apache Spark on a cluster.
The article demonstrates that crimes that come to the attention of the criminal police have varying worth in the eyes of Russian policemen and, consequently, attract unequal efforts. The worth of crimes is closely related to the criteria for evaluation of police performance. The data, derived from 12 in-depth interviews with Russian police officers, nine in-depth interviews with senior students of Moscow University of the Russian Interior Ministry who are undergoing practice within police departments, and online discussions within the police community, show that policemen in Russia make their practical decisions while balancing between multiple orders of worth. The theoretical framework for data interpretation is a symbiosis of valuation theories and the institutional logics approach. Operationalized as a set of cultural rules and expectations defining legitimate grounds for assessing and determining what rational behavior in a given organizational context really is, the concept of institutional logics stresses the interrelations between the value-oriented and material dimensions of social action but also allows one to stress the hierarchy and constant competition between various orders of worth in an organization. Four institutional logics — state, clan, quasi-market, and professional — are empirically identified. Each of them brings its own order of worth to the police organizational environment. Crimes in the eyes of the police always have a price — expressed in either “checkmarks,” points of recognition by the boss or colleagues, or money. The data suggest that, despite the hierarchy between the orders of (crimes’) worth within the police system as a whole, in each case institutional logics and the criteria of worth related to them compete with each other. Depending on the characteristics of the criminal case and the situation in the police department at a given moment, the competition between various orders of worth is resolved by policemen in different ways.
The results of the study shed light on the functioning of police discretion and highlight the dysfunctional side of police reform in Russia.
The Semantic Evaluation (SemEval) series of workshops focuses on the evaluation and comparison of systems that can analyse diverse semantic phenomena in text with the aim of extending the current state of the art in semantic analysis and creating high quality annotated datasets in a range of increasingly challenging problems in natural language semantics. SemEval provides an exciting forum for researchers to propose challenging research problems in semantics and to build systems/techniques to address such research problems. SemEval-2016 is the tenth workshop in the series of International Workshops on Semantic Evaluation Exercises. The first three workshops, SensEval-1 (1998), SensEval-2 (2001), and SensEval-3 (2004), focused on word sense disambiguation, each time growing in the number of languages offered, in the number of tasks, and also in the number of participating teams. In 2007, the workshop was renamed to SemEval, and the subsequent SemEval workshops evolved to include semantic analysis tasks beyond word sense disambiguation. In 2012, SemEval turned into a yearly event. It currently runs every year, but on a two-year cycle, i.e., the tasks for SemEval-2016 were proposed in 2015. SemEval-2016 was co-located with the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’2016) in San Diego, California. 
It included the following 14 shared tasks organized in five tracks:

• Text Similarity and Question Answering Track
  – Task 1: Semantic Textual Similarity: A Unified Framework for Semantic Processing and Evaluation
  – Task 2: Interpretable Semantic Textual Similarity
  – Task 3: Community Question Answering
• Sentiment Analysis Track
  – Task 4: Sentiment Analysis in Twitter
  – Task 5: Aspect-Based Sentiment Analysis
  – Task 6: Detecting Stance in Tweets
  – Task 7: Determining Sentiment Intensity of English and Arabic Phrases
• Semantic Parsing Track
  – Task 8: Meaning Representation Parsing
  – Task 9: Chinese Semantic Dependency Parsing
• Semantic Analysis Track
  – Task 10: Detecting Minimal Semantic Units and their Meanings
  – Task 11: Complex Word Identification
  – Task 12: Clinical TempEval
• Semantic Taxonomy Track
  – Task 13: TExEval-2 – Taxonomy Extraction
  – Task 14: Semantic Taxonomy Enrichment

This volume contains both Task Description papers that describe each of the above tasks and System Description papers that describe the systems that participated in the above tasks. A total of 14 task description papers and 198 system description papers are included in this volume. We are grateful to all task organisers as well as the large number of participants whose enthusiastic participation has made SemEval once again a successful event. We are thankful to the task organisers who also served as area chairs, and to task organisers and participants who reviewed paper submissions. These proceedings have greatly benefited from their detailed and thoughtful feedback. We also thank the NAACL 2016 conference organizers for their support. Finally, we most gratefully acknowledge the support of our sponsor, the ACL Special Interest Group on the Lexicon (SIGLEX).

The SemEval-2016 organizers: Steven Bethard, Daniel Cer, Marine Carpuat, David Jurgens, Preslav Nakov and Torsten Zesch
In our research, we built a data processing pipeline for storing railway KPI data based on open-source Big Data technologies: Apache Hadoop, Kafka, the Kafka HDFS Connector, Spark, Airflow, and PostgreSQL. The methodology we created for data load testing allowed us to iteratively perform load tests with increasing data sizes, evaluate the required cluster software and hardware resources, and, finally, detect bottlenecks in the solution. As a result of the research, we propose an architecture for data processing and storage and give recommendations on data pipeline optimization. In addition, we calculate an approximate sizing of cluster machines for the current dataset volume for the data processing and storage services.
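The iterative load-testing loop described above can be sketched as follows. This is a hypothetical skeleton under the assumption that each round doubles the input size and records throughput; `run_pipeline` is a stub standing in for the real Hadoop/Kafka/Spark pipeline, which is not reproduced here.

```python
# Hypothetical sketch of iterative data load testing: double the input
# size each round and record throughput, so a degradation between
# rounds flags a bottleneck. run_pipeline is a stand-in stub.
import time

def run_pipeline(records):
    # stub replacing the real ingestion/processing pipeline
    return sum(hash(r) % 7 for r in records)

def load_test(start_size=1_000, rounds=4):
    results = []
    size = start_size
    for _ in range(rounds):
        data = [f"kpi-{i}" for i in range(size)]
        t0 = time.perf_counter()
        run_pipeline(data)
        elapsed = time.perf_counter() - t0
        results.append({"size": size, "throughput": size / elapsed})
        size *= 2  # increase data size for the next iteration
    return results
```

Comparing `throughput` across rounds shows whether the system scales linearly or saturates as the dataset grows.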
In many areas, such as social science, politics, or market research, people need to track sentiment and its changes over time. For sentiment analysis in this field, it is more important to correctly estimate the proportions of each sentiment expressed in a set of documents (the quantification task) than to accurately estimate the sentiment of a particular document (classification). Our study aims to analyze the effectiveness of two iterative quantification techniques and to compare them with baseline methods. All the techniques are evaluated on a set of synthesized data and on the SemEval-2016 Task 4 dataset. We have made the quantification methods from this paper available as an open-source Python library. The results of the comparison and possible limitations of the quantification techniques are discussed.
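Two standard quantification baselines against which iterative methods are usually compared are Classify & Count (CC) and Adjusted Classify & Count (ACC). The sketch below illustrates them; the threshold classifier is a stub, and the true-positive and false-positive rates are assumed to come from held-out data (the paper's own methods are not reproduced here).

```python
# Baseline quantification methods: Classify & Count (CC) and
# Adjusted Classify & Count (ACC). The classifier is a stub threshold
# rule; tpr/fpr are assumed to be estimated on held-out data.
def classify(score, threshold=0.5):
    return 1 if score >= threshold else 0

def classify_and_count(scores):
    # CC: prevalence = fraction of documents classified positive
    return sum(classify(s) for s in scores) / len(scores)

def adjusted_classify_and_count(scores, tpr, fpr):
    # ACC inverts the classifier's expected bias:
    #   cc = p * tpr + (1 - p) * fpr  =>  p = (cc - fpr) / (tpr - fpr)
    cc = classify_and_count(scores)
    p = (cc - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))  # clip to a valid proportion
```

ACC corrects CC's systematic error when the classifier's error rates are known, which is exactly the failure mode quantification methods are measured on.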
Because of the lack of data on cash flows, it is impossible to use traditional measures of return, such as IRR and TVPI, for evaluating the performance of private equity funds in emerging markets.
In this study, we propose an approach based on adjusted rates of return for PE funds, which can be implemented without data on the funds' cash flows and net assets. The proposed indicators can be calculated from publicly available data on the funds' portfolio transactions.
The study presents a methodology based on the performance of private equity portfolio transactions, as well as an analysis of empirical data on a sample of 1,957 deals in BRIC countries from 2000 to 2012.
The results of the empirical analysis largely support a number of fundamental characteristics of PE funds previously identified for developed capital markets, such as:
1. Private equity deals in developing countries are riskier assets than traditional instruments.
2. The return on the majority of transactions is below the stock market return; however, the most successful deals significantly outperform the market.
3. The β coefficient of buyout funds is less than one, indicating low exposure to systematic risk.
Some characteristics were confirmed only in part:
1. The investments of venture capital funds have a β coefficient greater than one for the markets of Brazil and India, and less than one for Russia and China.
2. Return on investment is higher for buyout funds than for venture capital funds in Russia and China; in India and Brazil the result is the opposite.
The remaining characteristics differ fundamentally from those identified in developed capital markets:
1. The holding period for private equity fund investments in developing countries is shorter than in developed countries, averaging 3.3 years.
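The β coefficient discussed above is conventionally the slope of a CAPM-style regression of asset returns on market returns. A minimal sketch, with made-up numbers rather than the study's dataset or estimator:

```python
# Illustrative deal-level beta: slope of the OLS regression of asset
# returns on market returns, i.e. cov(asset, market) / var(market).
import numpy as np

def beta(asset_returns, market_returns):
    a = np.asarray(asset_returns, dtype=float)
    m = np.asarray(market_returns, dtype=float)
    return float(np.cov(a, m, ddof=1)[0, 1] / np.var(m, ddof=1))
```

A β below one, as found for buyout funds, means the asset's returns move less than one-for-one with the market.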
A model is considered for organizing cargo transportation between two node stations connected by a railway line that contains a certain number of intermediate stations. The movement of cargo is in one direction. Such a situation may occur, for example, if one of the node stations is located in a region that produces raw materials for a manufacturing industry located in another region, where the other node station is situated. The organization of freight traffic is performed by means of a number of technologies. These technologies determine the rules for taking on cargo at the initial node station, the rules of interaction between neighboring stations, and the rule of distribution of cargo to the final node stations. The process of cargo transportation is governed by a set control rule. For such a model, one must determine the possible modes of cargo transportation and describe their properties. The model is described by a finite-dimensional system of differential equations with nonlocal linear restrictions. The class of solutions satisfying the nonlocal linear restrictions is extremely narrow. This results in the need for a “correct” extension of solutions of the system of differential equations to a class of quasi-solutions whose distinctive feature is gaps at a countable number of points. Using the fourth-order Runge–Kutta method, we were able to numerically construct these quasi-solutions and determine their rate of growth. Note that the main technical difficulty consisted in obtaining quasi-solutions satisfying the nonlocal linear restrictions. Furthermore, we investigated the dependence of the quasi-solutions and, in particular, of the sizes of the gaps (jumps) on a number of model parameters characterizing the control rule, the cargo transportation technologies, and the intensity of cargo arrival at the node station.
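The fourth-order Runge–Kutta scheme used to construct the quasi-solutions is, generically, the following classical stepper. This is only the generic integrator; the actual system with nonlocal linear restrictions is not reproduced, and the test equation dy/dt = -y is a stand-in.

```python
# Classical fourth-order Runge-Kutta step and fixed-step integrator
# (generic; the paper's system with nonlocal restrictions is omitted).
def rk4_step(f, t, y, h):
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def integrate(f, t0, y0, t1, n):
    h = (t1 - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y = rk4_step(f, t, y, h)
        t += h
    return y
```

For dy/dt = -y with y(0) = 1, integrating to t = 1 reproduces e^-1 to high accuracy, illustrating the method's fourth-order convergence.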
Event logs collected by modern information and technical systems usually contain enough data for automated discovery of process models. A variety of algorithms has been developed for process model discovery, conformance checking, log-to-model alignment, comparison of process models, etc.; nevertheless, quick analysis of ad-hoc selected parts of a log has not yet received a full-fledged implementation. This paper describes an ROLAP-based method of multidimensional event log storage for process mining. The result of the log analysis is visualized as a directed graph representing the union of all possible event sequences, ranked by their occurrence probability. Our implementation allows the analyst to discover process models for sublogs defined by an ad-hoc selection of criteria and an occurrence probability threshold.
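The core of such a visualization, a directed graph of event successions weighted by occurrence probability, can be sketched from a plain list of traces (the ROLAP storage layer and the paper's implementation are omitted; the probability definition used here, successor count over total outgoing count, is an assumption of this sketch):

```python
# Sketch: build a directly-follows graph from traces, with each edge
# (a, b) weighted by P(next event is b | current event is a).
from collections import Counter

def directly_follows(traces):
    pair_counts = Counter()    # count of each a -> b succession
    source_counts = Counter()  # count of a appearing with a successor
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            pair_counts[(a, b)] += 1
            source_counts[a] += 1
    return {(a, b): c / source_counts[a]
            for (a, b), c in pair_counts.items()}
```

Filtering this dictionary by a probability threshold yields exactly the kind of ranked sublog model the analyst selects ad hoc.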
Existing approaches suggest that IT strategy should be a reflection of business strategy. However, in practice organisations often do not follow business strategy even if it is formally declared. In these conditions, IT strategy can be viewed not as a plan, but as an organisational shared view on the role of information systems. This approach reflects only a top-down perspective on IT strategy, so it can be supplemented by a strategic behaviour pattern (i.e., a more or less standard response to changes, formed as a result of previous experience) to implement a bottom-up approach. Two components that can help to establish an effective reaction to new IT initiatives are proposed here: a model of IT-related decision making, and an efficiency measurement metric to estimate the maturity of business processes and the corresponding IT. The usage of the proposed tools is demonstrated in practical cases.