relaimpo: relative importance of regressors
Relative importance is an old topic in regression applications: many scientists want to quantify the relative contributions of the regressors to the model's total explanatory value. The R package relaimpo offers eight different metrics for relative importance in linear models (seven in the CRAN version, which omits pmvd).
Initially, relaimpo was simply intended to provide a reasonably fast implementation of the (relatively) well-known method of averaging sequential sums of squares over orderings of regressors. Package relaimpo calls this method lmg because its first known mention is in Lindeman, Merenda and Gold (1980, p.119ff); Kruskal (1987) is a better-known source, and the method has been re-invented by researchers from various fields, e.g. under the names "Shapley value regression" or "dominance analysis". In the course of work on this method and its properties, also in comparison to other methods and in particular to the newly proposed method pmvd by Feldman (2005), the package grew to include a total of six metrics. With the later additions of a lesser-known metric from Genizi (1993) and a new one from Zuber and Strimmer (2011), there are now eight metrics, five of which yield "natural" decompositions of the model R2 in linear regression models. Besides calculating the metrics, the package can also produce bootstrap confidence intervals (exploratory in nature) for the metrics themselves, their ranks and their pairwise differences. Basic printing and plotting facilities are also provided.
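To make the idea of averaging over orderings concrete, here is a minimal base-R sketch. It is not relaimpo's implementation; the helper names r2_of, all_orderings and lmg_sketch, and the choice of three regressors from R's built-in swiss data, are assumptions made purely for illustration. It works with sequential increases in R2, i.e. sequential sums of squares divided by the total sum of squares.

```r
# Illustrative sketch of the lmg idea: average each regressor's sequential
# R^2 contribution over all orderings of the regressors.
# All helper names below are hypothetical and not part of relaimpo.

# R^2 of a linear model of `response` on a subset of regressors
r2_of <- function(data, response, vars) {
  if (length(vars) == 0) return(0)
  summary(lm(reformulate(vars, response = response), data = data))$r.squared
}

# All permutations of a vector (recursive; fine for a handful of regressors)
all_orderings <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (rest in all_orderings(v[-i])) {
      out[[length(out) + 1]] <- c(v[i], rest)
    }
  }
  out
}

lmg_sketch <- function(data, response, regressors) {
  shares <- setNames(numeric(length(regressors)), regressors)
  orderings <- all_orderings(regressors)
  for (ord in orderings) {
    r2_before <- 0
    for (k in seq_along(ord)) {
      # R^2 gained when ord[k] enters after ord[1], ..., ord[k - 1]
      r2_after <- r2_of(data, response, ord[seq_len(k)])
      shares[ord[k]] <- shares[ord[k]] + (r2_after - r2_before)
      r2_before <- r2_after
    }
  }
  shares / length(orderings)  # average over all orderings
}

regs <- c("Agriculture", "Examination", "Education")
lmg_shares <- lmg_sketch(swiss, "Fertility", regs)
full_r2 <- r2_of(swiss, "Fertility", regs)
print(lmg_shares)
print(c(sum_of_shares = sum(lmg_shares), full_r2 = full_r2))
```

By construction the shares are non-negative and add up to the R2 of the full model, which is what makes lmg a decomposition of R2. The brute-force loop over all p! orderings is only practical for a handful of regressors.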
The package downloadable on this website includes among its metrics the metric pmvd by Feldman. Since there might be an issue with US patents 6,640,204 or 6,961,678 regarding this metric, the package including this metric is offered on this website under GPL version 2 with the following explicit geographical restriction: Distribution is restricted to non-US countries.
Status May 6 2016: As both the above patents have lapsed (i.e. are currently not valid due to non-payment of fees by the patent owner), I currently have no issues with US users using the non-US version of the software. Should the patents be revived, permission for US users will again be withdrawn. It is the user's responsibility not to violate the patents (patent search link).
For convenience, the GPL paragraph on geographical restriction is reproduced here:
8. If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License.
By downloading the package from this website, you confirm that you are a non-US user in the sense of GPL paragraph 8 (cf. Legalese), or have checked that the patents mentioned in Legalese have lapsed / expired:
If you are not entitled to download this package due to the geographical restriction, you can download a reduced version of relaimpo (without pmvd, otherwise unchanged) on CRAN at http://cran.r-project.org/web/packages/relaimpo/index.html
If you install the package for the first time, it is recommended to first install package relaimpo from CRAN. This way, all packages on which relaimpo depends are automatically installed. Afterwards, you can install the non-US version and start working.
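The two-step installation might look roughly like the sketch below. The wrapper name install_relaimpo_non_us and the file path in the commented call are placeholders, not something provided by the package; use the path of the file you actually downloaded.

```r
# Sketch of the two-step installation; the function name is hypothetical.
install_relaimpo_non_us <- function(local_file) {
  # Step 1: CRAN version, which also installs the packages relaimpo depends on
  install.packages("relaimpo")
  # Step 2: replace it with the non-US build from a local file (no repository)
  install.packages(local_file, repos = NULL)
}

# Example call (placeholder path, not run here):
# install_relaimpo_non_us("path/to/relaimpo_2.2-6.tar.gz")
```

Wrapping the steps in a function keeps the snippet free of side effects until it is called explicitly.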
Package binaries (please select the appropriate version of R). I have stopped updating Windows builds for old R releases. Version 2.2 contains two additional metrics (genizi and car) over and above versions 2.1 and 2.1-4; version 2.2-2 adapts to R 3.0.0 and higher (almost no functional changes).
|Version||Windows binaries||Mac OS binaries|
|R 2.2.1 (built under 2.2.1)||relaimpo_2.1.zip|
|R 2.3.0 and R 2.3.1 (built under 2.3.1, should also work under 2.3.0)|
|R 2.4.0 up to 2.9.x (built under 2.9.1)||relaimpo_2.1-4.zip|
|R 2.10.0 to 2.15.x (built under 2.11.0)||relaimpo_2.2.zip|
|R 3.0.0 and higher (built under 3.1.0 Devel)||relaimpo_2.2-2.zip|
|R 3.5.0 and higher (built under 3.4.1)||relaimpo_2.2-3.zip|
|R 4.0.x and higher (built under 4.0.3 or higher)||relaimpo_2.2-5.zip||relaimpo_2.2-6.tgz|
Package sources: relaimpo_2.2-6.tar.gz, relaimpo_2.2-5.tar.gz, relaimpo_2.2-3.tar.gz, relaimpo_2.2-2.tar.gz; relaimpo_2.2.tar.gz (for older R versions, in case the newer files do not work).
Package manual: relaimpo.pdf (manual from CRAN; please disregard the licensing information there, since this is the non-US version with pmvd)
- JSS paper on relaimpo (Grömping, 2006) that explains the mathematics behind the metrics and the features of the package:
Relative Importance for Linear Regression in R: The Package relaimpo
- American Statistician paper (Grömping, 2007) regarding the statistical properties of the two metrics in relaimpo that decompose R2 into non-negative contributions (LMG and PMVD):
Estimators of Relative Importance in Linear Regression Based on Variance Decomposition
Users from the US should download the global version of the R-package relaimpo that is offered on the R-project's official download server, CRAN. This version is identical to the non-US version on this homepage, apart from the fact that it does not contain the metric pmvd by Feldman (2005) because of Legalese.
May 6, 2016: As the relevant patents are currently lapsed, US users may decide to use the non-US version - but see Legalese.
relaimpo was written before the infrastructure for parallelization was created in R. There are no arguments to handle parallelization. Large problems would benefit from parallelization in evaluating the different subset regressions, even without the bootstrap - this has not been implemented (and most likely won't be). The bootstrap, however, can be parallelized through features of package boot:
boot.relimp calls the boot function in the boot package, which takes its default behavior on parallelization (arguments parallel for the parallelization method and ncpus for the number of parallel processes to use) from the options boot.parallel and boot.ncpus. Thus, you can activate parallelization of the bootstrap by setting those options to the desired values (e.g. options(boot.parallel="snow", boot.ncpus=2L)) before calling boot.relimp.
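A minimal sketch of a parallelized bootstrap run follows. The swiss data, the lm fit, the number of bootstrap replicates and the selected metrics are assumptions chosen for illustration, and the relaimpo calls are guarded so that the snippet still runs where the package is not installed.

```r
# Tell boot() to use 2 worker processes via "snow"; boot.relimp picks these
# defaults up because it calls boot::boot() internally.
options(boot.parallel = "snow", boot.ncpus = 2L)

if (requireNamespace("relaimpo", quietly = TRUE)) {
  library(relaimpo)
  # Illustrative model on R's built-in swiss data
  fit <- lm(Fertility ~ ., data = swiss)
  # b = 200 replicates keeps the example quick
  bootrun <- boot.relimp(fit, b = 200, type = c("lmg", "last"))
  # Bootstrap confidence intervals (exploratory in nature)
  print(booteval.relimp(bootrun))
}
```

With only a few regressors the overhead of starting worker processes may outweigh the gain; parallelization pays off mainly for many bootstrap replicates or larger models.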
This bibliography has not been updated for a long time. There is a recent review paper that includes many newer references:
Grömping, U. (2015). Variable importance in regression models. WIREs Comput Stat 7, 137-152. DOI: 10.1002/wics.1346. (Corrigendum, regarding the Fabbris metric and numeric results on the Green metric)
The most important references for the R-package relaimpo are marked with an asterisk.
Azen, R. and Budescu, D.V. (2003). The dominance analysis approach for comparing predictors in multiple regression. Psychological Methods 8, 129-148.
Azen, R. (2003). Dominance Analysis SAS Macros. URL: www.uwm.edu/~azen/damacro.html.
Bring, J. (1996). A geometric approach to compare variables in a regression model. The American Statistician 50, 57-62.
Budescu, D.V. (1993). Dominance Analysis: A new approach to the problem of relative importance in multiple regression. Psychological Bulletin 114, 542-551.
Budescu, D.V. and Azen, R. (2004). Beyond Global Measures of Relative Importance: Some Insights from Dominance Analysis. Organizational Research Methods 7, 341-350.
Chevan, A. and Sutherland, M. (1991). Hierarchical Partitioning. The American Statistician 45, 90-96.
Conklin, M., Powaga, K. and Lipovetsky, S. (2004). Customer satisfaction analysis: Identification of key drivers. European Journal of Operational Research 154, 819–827.
*Darlington, R.B. (1968). Multiple regression in psychological research and practice. Psychological Bulletin 69, 161-182. (last, first, betasq, pratt)
Feldman, B. (1999). The proportional value of a cooperative game. Manuscript for a contributed paper at the Econometric Society World Congress 2000. Downloadable at http://fmwww.bc.edu/RePEc/es2000/1140.pdf.
Feldman, B. (2002). A Dual Model of Cooperative Value. Manuscript, downloadable from http://ssrn.com/abstract=317284.
*Feldman, B. (2005). Relative Importance and Value. Manuscript (latest version), downloadable at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2255827. (pmvd)
Feldman, B. (2007). A theory of attribution. MPRA Paper No. 3349. Downloadable at http://mpra.ub.uni-muenchen.de/3349/01/MPRA_paper_3349.pdf.
Fickel, N. (2001). Sequenzialregression: Eine neodeskriptive Lösung des Multikollinearitätsproblems mittels stufenweise bereinigter und synchronisierter Variablen. Habilitationsschrift, University of Erlangen-Nuremberg. VWF, Berlin.
Fickel, N. (2003). Measuring Supplementary Influence by Using Sequential Linear Regression. Downloadable from Mathematics Preprint Server.
Firth, D. (1998). Relative importance of explanatory variables. Conference "Statistical Issues in the Social Sciences", Stockholm, October 1998. URL: http://www.nuff.ox.ac.uk/sociology/alcd/relimp.pdf.
Fox, J. (2002). Bootstrapping regression models. In: An R and S-PLUS Companion to Applied Regression: A web appendix to the book. Sage, Thousand Oaks, CA. URL: http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-bootstrapping.pdf. (appropriate bootstrapping in regression models)
*Genizi, A. (1993). Decomposition of R2 in multiple regression with correlated regressors. Statistica Sinica 3, 407-420.
*Grömping, U. (2006). Relative Importance for Linear Regression in R: The Package relaimpo. Journal of Statistical Software 17, Issue 1.
*Grömping, U. (2007). Estimators of Relative Importance in Linear Regression Based on Variance Decomposition. The American Statistician 61, 139-147.
Grömping, U. (2007). Response to comment by Scott Menard, re: Estimators of Relative Importance in Linear Regression Based on Variance Decomposition. In: Letters to the Editor, The American Statistician 61, 280-284.
Grömping, U. (2009). Variable Importance Assessment in Regression: Linear Regression Versus Random Forest. The American Statistician 63, 308-319.
Grömping, U. and Landau, S. (2009). Do not adjust coefficients in Shapley value regression. To appear in Applied Stochastic Models in Business and Industry. Online early view: DOI: 10.1002/asmb.773.
Hart, S. and Mas-Colell, A. (1989). Potential, value and consistency. Econometrica 57, 589-614. (game-theoretic background for lmg)
Healy, M.J.R. (1990). Measuring importance. Statistics in Medicine 9, 633-637.
Hoffman, P.J. (1960). The paramorphic representation of clinical judgment. Psychological Bulletin 57, 116-131.
Hoffman, P.J. (1962). Assessment of the independent contributions of predictors. Psychological Bulletin 59, 77-80.
Johnson, J.W. (2000). A heuristic method for estimating the relative weight of predictor variables in multiple regression. Multivariate behavioral research 35, 1-19.
Johnson, J.W. (2004). Factors affecting relative weights: the influence of sampling and measurement error. Organizational Research Methods 7, 283-299.
Johnson, J.W. and Lebreton, J.M. (2004). History and Use of Relative Importance Indices in Organizational Research. Organizational Research Methods 7, 238-257.
Kruskal, W. (1987). Relative importance by averaging over orderings. The American Statistician 41, 6-10.
Kruskal, W. (1987b). Correction to "Relative importance by averaging over orderings". The American Statistician 41, 341.
Kruskal, W. and Majors, R. (1989). Concepts of relative importance in recent scientific literature. The American Statistician 43, 2-6.
Lebreton, J.M., Ployhart, R.E. and Ladd, R.T. (2004). A Monte Carlo Comparison of Relative Importance Methodologies. Organizational Research Methods 7, 258-282.
*Lindeman, R.H., Merenda, P.F. and Gold, R.Z. (1980). Introduction to Bivariate and Multivariate Analysis, Scott, Foresman, Glenview IL. (lmg, p.119ff)
Lipovetsky, S. and Conklin, M. (2001). Analysis of Regression in Game Theory Approach. Applied Stochastic Models in Business and Industry 17, 319-330.
MacNally, R. (2000). Regression and model building in conservation biology, biogeography and ecology: the distinction between and reconciliation of 'predictive' and 'explanatory' models. Biodiversity and Conservation 9, 655-671.
MacNally, R. (2002). Multiple regression and inference in conservation biology and ecology: further comments on identifying important predictor variables. Biodiversity and Conservation 11, 1397-1401.
MacNally, R. and Walsh, C. (2004). Hierarchical partitioning public-domain software. Biodiversity and Conservation 13, 659-660.
Nimon, K. and Roberts, J.K. (2009). yhat: Interpreting Regression Effects. R package version 1.0-2. http://cran.r-project.org/web/packages/yhat/yhat.pdf
Ortmann, K.M. (2000). The proportional value of a positive cooperative game. Mathematical Methods of Operations Research 51, 235-248. (game-theoretic background for pmvd)
Pedhazur, E.J. (1982, 2nd ed.). Multiple regression in behavioral research: explanation and prediction. Holt, Rinehart and Winston, New York.
*Pratt, J.W. (1987). Dividing the indivisible: Using simple symmetry to partition variance explained. In: Pukkila, T. and Puntanen, S. (Eds.): Proceedings of second Tampere conference in statistics, University of Tampere, Finland, 245-260. (pratt)
Shapley, L. (1953). A value for n-person games. Reprinted in: Roth, A. (1988, ed.): The Shapley Value: Essays in Honor of Lloyd S. Shapley. Cambridge University Press, Cambridge. (game-theoretic background for lmg)
Soofi, E.S., Retzer, J.J. and Yasai-Ardekani, M. (2000). A Framework for Measuring the Importance of Variables with Applications to Management Research and Decision Models. Decision Sciences 31, 1-31.
Theil, H. (1971). Principles of Econometrics. Wiley, New York.
Theil, H. (1987). How many bits of information does an independent variable yield in a multiple regression? Statistics and Probability Letters 6, 107-108.
Theil, H. and Chung, C.-F. (1988a). Relations between two sets of variates: the bits of information provided by each variate in each set. Statistics and Probability Letters 6, 137-139.
Theil, H. and Chung, C.-F. (1988). Information-theoretic measures of fit for univariate and multivariate linear regressions. The American Statistician 42, 249-252.
Thomas, D.R., Hughes, E. and Zumbo, B.D. (1998). On variable importance in linear regression. Social Indicators Research 45, 253-275.
Thomas, D.R., Zhu, P.C. and Decady, Y.J. (2007). Point estimates and confidence intervals for variable importance in multiple linear regression. J. Educational and Behavioral Statistics 32, 61-91.
Ward, J.H. (1962). Comments on "The paramorphic representation of clinical judgment". Psychological Bulletin 59, 74-76.
Walsh, C. and Mac Nally, R. (2003). The hier.part Package: Hierarchical Partitioning. (Part of: Documentation for R: A language and environment for statistical computing.) R Foundation for Statistical Computing, Vienna, Austria. URL: http://cran.r-project.org/web/packages/hier.part/hier.part.pdf.
Whittaker, T.A., Fouladi, R.T. and Williams, N.J. (2002). Determining Predictor Importance in Multiple Regression Under Varied Correlational And Distributional Conditions. J. Modern Applied Statistical Methods 1, 354-366.
*Zuber, V. and Strimmer, K. (2011). Variable importance and model selection by decorrelation. Statistical Applications in Genetics and Molecular Biology 10(1), 1-27. Preprint at http://arxiv.org/abs/1007.5516