Interpretable machine learning for demand modeling with high-dimensional data using Gradient Boosting Machines and Shapley values
Forecasting demand and understanding sales drivers are among the most important tasks in retail analytics. Traditionally, however, sales modeling has relied predominantly on linear models and/or models with a small number of predictors. Real-world demand is determined by complex substitution and complementarity patterns among a large number of interrelated SKUs, by nonlinear effects of prices, promotions, and seasonality, and by many other factors, their lagged values, and their interactions, so a realistic model has to account for all of these. We propose a conceptual model for sales modeling based on standard POS data available to any retailer and use it to generate almost 500 potentially useful predictors of a focal SKU’s sales. In our comparison of three classes of models, Gradient Boosting Machines outperformed Random Forests and Elastic Nets. Using interpretable machine learning methods, we derive actionable insights about the importance of the various groups of predictors in the conceptual model, and we demonstrate how helpful it can be for marketing managers to decompose predictions into the effects of individual regressors using an approximation of Shapley values for feature attribution.
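To make the idea of decomposing a prediction into the effects of individual regressors concrete, the following is a minimal sketch of exact Shapley attribution for a single prediction. It is not the procedure used in the paper, which relies on an approximation (exact enumeration is infeasible with almost 500 predictors). The toy demand function, its coefficients, the three features, and the baseline values are all hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attribution of predict(x) - predict(baseline).

    Features outside a coalition are held at their baseline value.
    The cost grows as 2**n, so this is only usable for a handful of features.
    """
    n = len(x)

    def value(coalition):
        # Coalition members take their actual values; the rest stay at baseline.
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_demand(z):
    """Hypothetical demand model: own price, promo flag, competitor price."""
    price, promo, rival_price = z
    return (100.0 - 8.0 * price + 15.0 * promo + 3.0 * rival_price
            + 4.0 * promo * (5.0 - price))  # price-promo interaction


x = [3.0, 1.0, 4.0]         # week being explained (assumed values)
baseline = [4.0, 0.0, 4.0]  # reference week (assumed values)
phi = shapley_values(toy_demand, x, baseline)

for name, contribution in zip(["price", "promo", "rival_price"], phi):
    print(f"{name}: {contribution:+.2f}")
```

The contributions sum to the difference between the explained prediction and the baseline prediction, and a feature whose value equals its baseline receives zero attribution. The interaction term is split between price and promotion, which is the behavior that makes this kind of decomposition readable for a marketing manager.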
The chief aim of this paper is to analyse the dynamics of linear and non-linear methods for predicting the bankruptcy of Russian private small and medium-sized retail and wholesale trade companies. We use financial and non-financial data from before and after the economic crisis of 2008-2009 and apply two methods: logistic regression and random forest.
This research is especially relevant to banks and other credit organisations that provide loans to small and medium-sized businesses.
Our dataset comprises between 200,000 and 600,000 companies, depending on the year. We use data from the Ruslana database, which covers the period from 2004 to 2012.
The definition of default is extended to financial difficulties by adding voluntarily liquidated firms to those liquidated as a result of legal bankruptcy. We study active companies and two types of liquidated ones.
The heterogeneity of Russian companies is taken into account in several ways. In addition to financial ratios derived from financial statements, we include non-financial variables such as regional distribution, age, size, and legal form in the statistical models.
Prediction performance is evaluated using out-of-sample forecasts. We obtain models with fairly high predictive power: the area under the ROC curve reaches 0.75. Random forest outperformed the logit model. Adding non-financial information such as age and federal region improves the forecasts, while legal form and size have little impact on the outcome. Among financial measures, liquidity, profitability, and leverage ratios turned out to be essential. Moreover, our models captured a structural change that was likely caused by the crisis of 2008-2009.
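As an illustration of the evaluation metric, the area under the ROC curve can be read as the probability that a randomly chosen defaulted firm receives a higher predicted score than a randomly chosen active firm. The sketch below computes it directly from that definition. The labels and scores are made-up hold-out values, not data from the study, and the pairwise loop is quadratic, so a rank-based formula would be preferable on samples of the size described here.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels: 1 for default, 0 for active; scores: predicted default risk.
    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present in the hold-out sample")

    correct = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical out-of-sample labels and model scores (illustration only).
y_holdout = [1, 0, 1, 0, 0, 1, 0, 0]
model_scores = [0.9, 0.2, 0.4, 0.5, 0.1, 0.8, 0.3, 0.6]

auc = roc_auc(y_holdout, model_scores)
print(f"out-of-sample AUC: {auc:.3f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the reported 0.75 roughly halfway between the two.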
Nowadays, many people suffer from chronic diseases because of their lifestyle, food habits, and reduced physical activity. Diabetes is one of the most common chronic diseases and affects people of all ages. As a result, the healthcare sector generates extensive data characterized by huge volume, enormous velocity, and a vast variety of heterogeneous sources. In such a scenario, scientific solutions make it possible to harness these massive, heterogeneous, and complex datasets to obtain more meaningful information. Moreover, machine learning algorithms can play a major part in building statistical prediction models. The aim of this paper is to identify the prevalence of diabetes-related long-term complications among patients with type-2 diabetes mellitus. The processing and statistical analysis rely on machine learning environments: Scikit-Learn and Pandas for Python, and R-Studio for R. In this work, we study machine learning approaches such as decision trees and random forests for developing a classification-based prediction model to assess type-2 diabetes mellitus. Additionally, we propose an algorithm based solely on random forest and attempt to detect the areas of complication in type-2 diabetes patients.
Measuring the indirect importance of various attributes is a very common task in marketing analysis, for which researchers use correlation and regression techniques. We list and illustrate some common problems with widely used latent importance measures. A more theoretically sound approach, the Shapley Value decomposition, was applied to a rich data set of US internet stores. The use of store-level data instead of respondent-level data allowed us to reveal the factors that are powerful in explaining why some stores have higher rates of willingness to make repeat purchases than others. By comparing the indirect importance and performance measures for three different internet stores, we reveal their strengths and weaknesses, the attributes to which the company should draw customers’ attention, and the attributes whose improvement is not a high priority.
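One common form of Shapley Value decomposition in importance analysis splits the R-squared of a regression among its predictors by averaging each predictor's marginal gain in R-squared over all subsets of the other predictors. The sketch below assumes that form. The abstract does not state which value function the paper uses, and the store-level ratings and repeat-purchase rates are invented for illustration.

```python
from itertools import combinations
from math import factorial


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def center(v):
    mean = sum(v) / len(v)
    return [t - mean for t in v]


def r_squared(columns, y, subset):
    """R^2 of an OLS fit with intercept of y on the chosen columns."""
    if not subset:
        return 0.0
    yc = center(y)
    xs = [center(columns[j]) for j in subset]
    gram = [[sum(p * q for p, q in zip(u, v)) for v in xs] for u in xs]
    xty = [sum(p * q for p, q in zip(u, yc)) for u in xs]
    beta = solve(gram, xty)
    explained = sum(b * t for b, t in zip(beta, xty))
    return explained / sum(t * t for t in yc)


def shapley_r2(columns, y):
    """Share of the full-model R^2 attributed to each predictor."""
    n = len(columns)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for s in combinations(others, k):
                gain = (r_squared(columns, y, list(s) + [i])
                        - r_squared(columns, y, list(s)))
                phi[i] += weight * gain
    return phi


# Hypothetical store-level attribute ratings and repeat-purchase rates.
ratings = [
    [7.0, 8.0, 6.5, 9.0, 7.5, 8.5],  # e.g. ease of ordering
    [6.0, 7.5, 6.0, 8.5, 7.0, 8.0],  # e.g. on-time delivery
    [5.0, 4.0, 6.0, 5.5, 4.5, 6.5],  # e.g. product selection
]
repeat_rate = [0.40, 0.55, 0.35, 0.70, 0.50, 0.62]

shares = shapley_r2(ratings, repeat_rate)
full_r2 = r_squared(ratings, repeat_rate, [0, 1, 2])
print(shares, full_r2)
```

Because R-squared never decreases when a predictor is added, every share is non-negative, and the shares add up to the R-squared of the full model. This is what makes the decomposition better behaved than raw correlations or regression coefficients when attributes are correlated with one another.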
This paper is devoted to modern axiomatic approaches to estimating external conflict in the theory of evidence. The conflict measure is defined on the set of beliefs obtained from several sources of information. It is shown that the conflict measure should be a monotone set function with respect to sets of beliefs. Some robust procedures for evaluating the conflict measure that are stable under small changes in the evidence are introduced and discussed. An analysis of conflict among investment banks' forecasts of the share values of Russian companies is presented. In this analysis, the conflict measure estimates the inconsistency of the investment banks' recommendations, while the Shapley values of this measure on the set of evidence characterize the contribution of each investment bank to the overall conflict. The relationship between conflict and the precision of forecasts is also investigated.
To the best of the authors' knowledge, this is the first attempt to use Random Forest as a technique for residential estate mass appraisal. In an empirical study using data on residential apartments, the method performed better than techniques such as CHAID, CART, KNN, multiple regression analysis, Artificial Neural Networks (MLP and RBF), and Boosted Trees. We introduce an approach for automatically detecting segments where a model significantly underperforms and segments with systematically under- or overestimated predictions. This segmentation approach is applicable to various expert systems, including but not limited to those used for mass appraisal.
In this paper we consider choice problems under the assumption that the preferences of the decision maker are expressed in the form of a parametric partial weak order without assuming the existence of any value function. We investigate both the sensitivity (stability) of each non-dominated solution with respect to the changes of parameters of this order, and the sensitivity of the set of non-dominated solutions as a whole to similar changes. We show that this type of sensitivity analysis can be performed by employing techniques of linear programming.
The paper examines the structure, governance, and balance sheets of state-controlled banks in Russia, which accounted for over 55 percent of the total assets in the country's banking system in early 2012. The author offers a credible estimate of the size of the country's state banking sector by including banks that are indirectly owned by public organizations. Contrary to some predictions based on the theoretical literature on economic transition, he explains the relatively high profitability and efficiency of Russian state-controlled banks by pointing to their competitive position in such functions as acquisition and disposal of assets on behalf of the government. Also suggested in the paper is a different way of looking at market concentration in Russia (by consolidating the market shares of core state-controlled banks), which produces a picture of a more concentrated market than officially reported. Lastly, one of the author's interesting conclusions is that China provides a better benchmark than the formerly centrally planned economies of Central and Eastern Europe by which to assess the viability of state ownership of banks in Russia and to evaluate the country's banking sector.