Frank E. Harrell, Jr. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis.
Once the analyst is familiar with a model. But as described further in Section If the goal of the analysis is to make a series of hypothesis tests adjusting P-values for multiple comparisons instead of to predict future responses. Subjects may be more willing to check a box corresponding to a wide interval containing their income.
Logistic Model Case Study 2: Survival of Titanic Passengers. Ordinal Logistic Regression. Introduction to Survival Analysis. Parametric Survival Models. Cox Proportional Hazards Regression Model. Case Study in Cox Regression. About this book: Many texts are excellent sources of knowledge about individual statistical tools, but the art of data analysis is about choosing and using multiple tools.
Instead of presenting isolated techniques, this text emphasizes problem-solving strategies that address the many issues arising when developing multivariable models using real data and not standard textbook examples. The additional information used to predict the missing values can contain any variables that are potentially predictive.
Parameter estimates are averaged over these multiple imputations to obtain better estimates than those from single imputation. The variance-covariance matrix of the averaged parameter estimates. White et al.
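The averaging step described above can be sketched for a single coefficient. This is a minimal illustration under assumptions, not the procedure of any particular package: it applies the standard combining rules usually attributed to Rubin (pooled estimate = mean of the per-imputation estimates; total variance = average within-imputation variance plus (1 + 1/M) times the between-imputation variance). The function name `pool_estimates` and the numeric inputs are hypothetical.

```python
import statistics


def pool_estimates(estimates, variances):
    """Pool one scalar coefficient across M imputed datasets.

    estimates: coefficient estimate from each completed dataset
    variances: squared standard error from each completed dataset
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)    # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from M = 5 imputations (illustrative numbers only)
est = [1.10, 0.95, 1.02, 1.08, 0.90]
var = [0.040, 0.038, 0.042, 0.041, 0.039]

pooled, total = pool_estimates(est, var)
print(f"pooled estimate = {pooled:.3f}, total variance = {total:.5f}")
```

The total variance exceeds the average within-imputation variance because it also carries the uncertainty due to not knowing the missing values.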
Methods for estimating residuals were listed in Section 3. To properly account for variability due to unknown values. The default method used by aregImpute is weighted PMM (predictive mean matching) so that no residuals or distributional assumptions are required. There is an option to allow target variables to be optimally transformed. An iterative process cycles through all target variables to impute all missing values. The function implements regression imputation based on adding random residuals to predicted means.
This model is used to predict all of the original missing and non-missing values for the target variable for the current imputation. This approach is used in the MICE algorithm (multiple imputation using chained equations) implemented in R and other systems. The chained equation method does not attempt to use the full Bayesian multivariate model for all target variables.
Among other things. The aregImpute algorithm takes all aspects of uncertainty into account using the bootstrap while using the same estimation procedures as transcan section 4.
With a chained equations approach. When a predictor of the target variable is missing.
Yucel and Zaslavsky developed a diagnostic that is useful for checking the imputations themselves. Here is an example using the R Hmisc and rms packages. Duplicate the entire dataset. Suppose we were interested in the reasonableness of imputed values for a sometimes-missing predictor Xj. In solving a problem related to imputing binary variables using continuous data models. Develop imputed values for the missing values of Xj. For continuous variables imputing missings with the median non-missing value is adequate.
Fewer imputations may be possible with very large sample sizes. Use multiple imputation with number of imputations equal to max 5. It is important to note that the reasons for missing data are more important determinants of how missing values should be handled than is the quantity of missing values.
They also developed a method combining imputation of missing values with propensity score modeling of the probability of missingness. Multiple predictors frequently missing: More imputations may be required. Here f refers to the proportion of observations having any variables missing. Complete case analysis is also an option here.
Marshall et al. Type 1 predictive mean matching is usually preferred. But it is not appropriate to use the dummy variable or extra category method. Little and An also have an excellent review of imputation methods and developed several approximate formulas for understanding properties of various estimators.
The missingness indicator variables will be collinear. Twist et al. Barnard and Rubin [41] derived an estimate of the d. Barnes et al. Wood et al. They found that multiple imputation of the response resulted in much improved estimates. They mentioned that extra categories may be added to allow for missing data in propensity models and that adding indicator variables describing patterns of missingness will also allow the analyst to match on missingness patterns when comparing non-randomly assigned treatments.
A clear example is in where covariates X1. A good introductory article on missing data and imputation is by Donders et al. Joseph et al. Functions in the Hmisc package may be useful.
Print univariable summaries of all variables. Horton and Kleinman compare several software packages for handling missing data and have comparisons of results with that of aregImpute. See Little. We analyze 35 variables and a random sample of patients from the study. Make a plot showing all variables on one page that describes especially the continuous variables.
Since X usually contains a strange mixture of binary. Vach is an excellent text describing properties of various methods of dealing with missing data in binary logistic regression see also [ Moons et al.
Make a plot showing the extent of missing data and tendencies for some variables to be missing on the same patients. When d. These references show how to use maximum likelihood to explicitly model the missing data process. Patients were followed for in-hospital outcomes and for long-term survival. See Rubin for a comprehensive source on multiple imputation. Little and Rubin show how imputation can be avoided if the analyst is willing to assume a multivariate distribution for the joint distribution of X and Y.
Use predictive mean matching to multiply impute cost 10 times per missing observation. The is. The cost estimates are not available on patients.
Relate these two variables to each other with an eye toward using charges to predict totcst when totcst is missing. State in a g You can use the R command subset support.
You may want to use a statement like the following in R: Total hospital charges (bills) are available on all but 25 patients. Remove the observation having zero totcst. For this characterization use the following patient descriptors: Use transcan to develop single imputations for total cost. Make graphs that will tell whether linear regression or linear regression after taking logs of both variables is better. Prepare for later development of a model to predict costs by developing reliable imputations for missing costs.
If you used a log transformation. Characterize what kind of patients have missing totcst. The model should use the predictors in Problem 1 and should not assume linearity in any predictor but should assume additivity.
Using the multiple imputed values. Chapter 3 dealt with missing data, focusing on utilization of incomplete predictor information. All of these areas are important in the overall scheme of model development, and they cannot be separated from what is to follow. In this chapter we concern ourselves with issues related to the whole model, with emphasis on deciding on the amount of complexity to allow in the model and on dealing with large numbers of predictors.
The chapter concludes with three default modeling strategies depending on whether the goal is prediction, estimation, or hypothesis testing. This chapter addresses some of these issues. One general theme of what follows is the idea that in statistical inference when a method is capable of worsening performance of an estimator or inferential quantity i.
There are rare occasions in which one actually expects a relationship to be linear. For example, one might predict mean arterial blood pressure at two months after beginning drug administration using as baseline variables the pretreatment mean blood pressure and other variables. In this case one expects the pretreatment blood pressure to relate linearly to follow-up blood pressure, and modeling is simple (see footnote a).
In the vast majority of studies, however, there is every reason to suppose that all relationships involving nonbinary predictors are nonlinear. The amount of complexity e. For example, errors in estimating the curvature of a regression function are consequential in predicting Y only when the regression is somewhere steep.
Once the analyst decides to include a predictor in every model, it is fair to [Footnote a: Even then, the two blood pressures may need to be transformed to meet distributional assumptions.] Commands in the rms package can be used to plot only what is needed. Here is an example for a logistic model. This approach, and the rank correlation approach about to be discussed, do not require the analyst to really prespecify predictor complexity, so how are they not biased in our favor?
There are two reasons: Likewise, a low association measure between a categorical variable and Y might lead the analyst to collapse some of the categories based on their frequencies. This often helps, but sometimes the categories that are so combined are the. When collinearities or confounding are not problematic, a quicker approach based on pairwise measures of association can be useful.
This approach will not have numerical problems e.
This is the ordinary R² from predicting the rank of Y based on the rank of X and the square of the rank of X. For categorical predictors, ranks are not squared but instead the predictor is represented by a series of dummy variables. See p. Note that bivariable correlations can be misleading if marginal relationships vary greatly from ones obtained after adjusting for other predictors.
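The quadratic-in-rank index just described can be sketched for a continuous predictor. This is an illustration under assumptions rather than the Hmisc implementation: ties receive midranks, the two-predictor least-squares fit is solved by hand, and the function names and the toy U-shaped data are hypothetical.

```python
def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def centered(v):
    mean = sum(v) / len(v)
    return [a - mean for a in v]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def spearman_rho2(x, y):
    """Square of the ordinary Spearman rank correlation."""
    rx, ry = centered(midranks(x)), centered(midranks(y))
    return dot(rx, ry) ** 2 / (dot(rx, rx) * dot(ry, ry))


def generalized_rho2(x, y):
    """R^2 from regressing rank(y) on rank(x) and rank(x) squared."""
    r = midranks(x)
    z1 = centered(r)
    z2 = centered([a * a for a in r])
    ry = centered(midranks(y))
    s11, s12, s22 = dot(z1, z1), dot(z1, z2), dot(z2, z2)
    s1y, s2y = dot(z1, ry), dot(z2, ry)
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det   # least-squares slopes from the
    b2 = (s11 * s2y - s12 * s1y) / det   # 2 x 2 normal equations
    return (b1 * s1y + b2 * s2y) / dot(ry, ry)


# Hypothetical U-shaped relationship: strong but non-monotonic
x = list(range(21))
y = [(v - 10) ** 2 for v in x]

plain = spearman_rho2(x, y)
general = generalized_rho2(x, y)
print(f"Spearman rho^2 = {plain:.3f}, generalized rho^2 = {general:.3f}")
```

On data like these the ordinary squared rank correlation is essentially zero while the quadratic version is large, which is the reason for adding the squared-rank term.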
From the above discussion a general principle emerges. Examples of strategies that are improper without special adjustments e. It is also valuable to consider the reverse situation; that is, one posits a simple model and then additional analysis or outside subject matter information makes the analyst want to generalize the model.
Once the model is generalized e. So another general principle is that when one makes the model more complex, the d. This can be useful in demonstrating to the reader that some complexity was actually needed. Thus moving from simple to more complex models presents no problems other than conservatism if the new complex components are truly unnecessary.
Also, failure to adjust for an important factor can frequently alter the nature of the distribution of Y. Occasionally, however, it is unwieldy to deal simultaneously with all predictors at each stage in the analysis, and instead the regression function shapes are assessed separately for each continuous predictor.
Stepwise variable selection has been a very popular technique for many years, but if this procedure had just been proposed as a statistical method, it would most likely be rejected because it violates every principle of statistical estimation and hypothesis testing. Here is a summary of the problems with this method. It yields R² values that are biased high. It yields P-values that are too small i. In observational studies, variable selection to determine confounders for adjustment results in residual confounding. Rather than solving problems caused by collinearity, variable selection is made arbitrary by collinearity.
It allows us to not think about the problem. If stepwise selection must be used, a global test of no regression should be made before proceeding, simultaneously testing all candidate predictors and having degrees of freedom equal to the number of candidate variables plus any nonlinear or interaction terms. It must be remembered that no currently available stopping rule was developed for data-driven variable selection.
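The upward bias in R² can be seen in a small simulation. The sketch below is a simplified stand-in for stepwise selection, not the procedure itself: every candidate predictor is pure noise (an assumption made so that the true R² is zero), and the "selected" model is simply the single candidate with the largest R². The sample size, number of candidates, replication count, and seed are arbitrary choices.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation, i.e. R^2 of a one-predictor linear fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def simulate(n_obs=50, n_candidates=20, n_reps=300, seed=1):
    """Compare R^2 of a prespecified noise predictor with the best of many."""
    rng = random.Random(seed)
    prespecified, selected = [], []
    for _ in range(n_reps):
        y = [rng.gauss(0, 1) for _ in range(n_obs)]
        r2 = [
            r_squared([rng.gauss(0, 1) for _ in range(n_obs)], y)
            for _ in range(n_candidates)
        ]
        prespecified.append(r2[0])   # chosen before looking at the data
        selected.append(max(r2))     # chosen because it looked best
    return sum(prespecified) / n_reps, sum(selected) / n_reps


mean_prespecified, mean_selected = simulate()
print(f"mean R^2, prespecified predictor: {mean_prespecified:.3f}")
print(f"mean R^2, best of 20 candidates:  {mean_selected:.3f}")
```

Both numbers estimate a true R² of zero; the gap between them is due entirely to letting the data choose the variable.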
AIC can also work when the model that is best by AIC is much better than the runner-up so that if the process were bootstrapped the same model would almost always be found. When used for one variable at a time variable selection. Burnham and Anderson [84] recommend selection based on AIC for a limited number of theoretically well-founded models.
Variable selection does not compete well with shrinkage methods that simultaneously model all potential predictors. Even though forward stepwise variable selection is the most commonly used method, the step-down method is preferred for the following reasons.
It usually performs better than forward stepwise methods, especially when collinearity is present. For a given dataset, bootstrapping Efron et al. Bootstrapping can be done on the whole model and compared with bootstrapped estimates of predictive accuracy based on stepwise variable selection for each resample. However, there are a number of drawbacks to this approach. Selection from among a set of correlated predictors is arbitrary, and all highly correlated predictors may have a low bootstrap selection frequency.
This may be computationally prohibitive. The bootstrap did not improve upon traditional backward stepdown variable selection. For some applications the list of variables selected may be stabilized by grouping variables according to subject matter considerations or empirical correlations and testing each related group with a multiple degree of freedom test.
Then the entire group may be kept or deleted and, if desired, groups that are retained can be summarized into a single variable or the most accurately measured variable within the group can replace the group. See Section 4. Kass and Raftery showed that Bayes factors have several advantages in variable selection, including the selection of less complex models that may agree better with subject matter knowledge. Univariable screening is thus even worse than stepwise modeling as it can miss important variables that are only important after adjusting for other variables.
The online course notes contain a simple simulation study of stepwise selection using R. Here we concern ourselves with the reliability or calibration of a model, meaning the ability of the model to predict future observations as well as it appeared to predict the responses at hand. Similar validation experiments have considered the margin of error in estimating an absolute quantity such as event probability.
For example, Smith et al. The number of non- intercept parameters in the model p is usually greater than the number of predictors. Narrowly distributed predictor variables e. Note that the number of candidate variables must include all variables screened for association with the response, including nonlinear terms and interactions.
Instead of relying on the rules of thumb in the table, the shrinkage factor estimate presented in the next section can be used to guide the analyst in determining how many d. Rules of thumb such as the For the case of ordinary linear regression, estimation of the residual variance is central.
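As a rough illustration of how a shrinkage estimate can guide the number of parameters, the sketch below uses the heuristic shrinkage estimator commonly attributed to van Houwelingen and le Cessie, (model χ² − p) / model χ², where the model χ² is the likelihood ratio statistic and p is the number of candidate parameters. The formula is not shown in the text above, so treat it as an assumption about which estimate is meant; the function name and the example χ² value are hypothetical.

```python
def heuristic_shrinkage(model_chi2, p):
    """Heuristic shrinkage estimate: (model chi-square - p) / model chi-square.

    model_chi2: likelihood ratio chi-square of the fitted model
    p: number of candidate parameters examined against Y
    Values well below 1 suggest the model is too complex for the data.
    """
    if model_chi2 <= 0:
        raise ValueError("model chi-square must be positive")
    return (model_chi2 - p) / model_chi2


# Hypothetical model chi-square of 50 with increasing numbers of parameters
for p in (5, 15, 30):
    print(p, round(heuristic_shrinkage(50.0, p), 2))
```

For a fixed χ², each additional parameter lowers the estimate, so the same amount of signal spread over more terms implies more expected overfitting.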
Here n1 and n2 are the marginal frequencies of the two response levels. The linear model case is useful for examining n: As discussed in the next section, adjusted R² (R²adj) is a nearly unbiased estimate of R². In ordinary linear regression. Consider a clinical trial with 10 randomly assigned treatments such that the patient responses for each treatment are normally distributed.
Predictions near the mean predicted value will usually be quite accurate. But if we plotted the predicted mean response for patients against the observed responses from new data. The treatment group having the lowest sample mean response will usually have a higher mean in the future. The sample mean of the group having the highest sample mean is not an unbiased estimate of its population mean. Figure 4. When we want to highlight a treatment that is not chosen at random or a priori. The reference line at 0.
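The bias in the highest sample mean can be checked by simulation. The sketch below assumes, purely for illustration, that all 10 treatments share the same true mean of 0 with unit variance; the arm size, replication count, and seed are arbitrary, and the function name is hypothetical.

```python
import random
import statistics


def winner_bias(n_treatments=10, n_per_arm=20, n_reps=2000, seed=7):
    """Average sample mean of the 'best' arm now and in a fresh sample."""
    rng = random.Random(seed)
    observed_best, future_best = [], []
    for _ in range(n_reps):
        # Every arm has true mean 0 (an assumption of this sketch)
        means = [
            statistics.fmean([rng.gauss(0, 1) for _ in range(n_per_arm)])
            for _ in range(n_treatments)
        ]
        best = max(range(n_treatments), key=means.__getitem__)
        observed_best.append(means[best])
        # New patients on the arm that happened to look best
        future = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        future_best.append(statistics.fmean(future))
    return statistics.fmean(observed_best), statistics.fmean(future_best)


observed, future = winner_bias()
print(f"mean of the highest sample mean: {observed:.3f}")
print(f"same arm in new data:            {future:.3f}")
```

The arm selected for having the highest mean looks clearly better than 0 in the original data and returns to about 0 in new data, which is the regression to the mean that shrinkage is meant to anticipate.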
Just as clothing is sometimes preshrunk so that it will not shrink further once it is purchased. Ridge regression: A ridge parameter must be chosen to control the amount of shrinkage.
Penalized maximum likelihood estimation. See Section 5. Now turn to the second usage of the term shrinkage. For ordinary linear models.
See Section 9. Collinearity problems are then more likely to result from partially redundant subsets of predictors as in the cholesterol example above. It is very unlikely that this will result in any problems. Consider as two predictors the total and LDL cholesterols that are highly correlated. If predictions are made at the same combinations of total and LDL cholesterol that occurred in the training data.
Thus it is not possible for a combination of. This is accomplished by ignoring Y during data reduction. Note that indexes such as VIF are not very informative as some variables are algebraically connected to each other.
Eliminate variables whose distributions are too narrow. Eliminate candidate predictors that are missing in a large number of subjects.
Use of data reduction methods before model development is strongly recommended if the conditions in Table 4. Manipulations of X in unsupervised learning may result in a loss of information for predicting Y. Use the literature to eliminate unimportant variables. Note that some authors compute VIF from the correlation matrix form of the design matrix. Data reduction is aimed at reducing the number of parameters to estimate in the model.
Summarizing collinear variables using a summary score is more powerful and stable than arbitrary selection of one variable in a group of collinear variables see the next section.
Use a statistical data reduction method such as incomplete principal component regression. Some available data reduction methods are given below. If two cities had the same rainfall. One way to consider a categorical variable redundant is if a linear combination of dummy variables representing it can be predicted from a linear combination of other variables. See Chapters 8 and 14 for detailed case studies in data reduction.
The Hmisc redun function implements the following redundancy checking algorithm. The redun function implements both approaches. A second. When the predictor is expanded into multiple terms. Special consideration must be given to categorical predictors. Expand categorical predictors into dummy variables. One rigorous approach involves removing predictors that are easily predicted from other predictors. It may be advisable to cluster variables before scaling so that patterns are derived only from variables that are related.
The D statistic will detect a wide variety of dependencies between two variables. For the special case of representing a series of variables with one PC. Pairwise deletion of missing values is also advisable for this procedure; casewise deletion can result in a small biased sample. For either approach. For mixtures of categorical and continuous predictors.
Another approach. Often one can use these techniques to scale multiple dummy variables into a few dimensions. Pearson or Spearman squared correlations can miss important associations and thus are not always good similarity measures. See pp. H are marginal cumulative distribution functions and F is the joint CDF. Once each dimension is scored (see below).
For purely categorical predictors. Repeat steps 2 to 4 until the proportion of variation explained by PC1 reaches a plateau. This typically requires three to four iterations. MGV does not use PCs so one need not precede the analysis by variable clustering. The process is repeated until the transformations converge. Use ordinary linear regression to predict PC1 on the basis of functions of the Xs.
ACE handles monotonically restricted transformations and categorical variables. The expansion of each Xj is regressed separately on PC1. ACE does not handle missing values. MGV involves predicting each variable from the current transformations of all the other variables. Compute PC1. It automatically transforms all variables. Xq using the correlation matrix of Xs. When predicting variable i. See Chapter 16 for more about ACE.
The goal of MGV is to transform each variable so that it is most similar to predictions from the other transformed variables. If the sample size is low. See Chapter 8 for a detailed example of these scaling techniques. It does not implement monotonicity constraints. This approach of checking that transformations are optimal with respect to Y uses the response data. For continuous variables. Imputed values are initialized to medians of continuous variables and the most frequent category of categorical variables.
This problem is more likely when multiple variables are missing on the same subjects. Transformed variables are normalized to have mean 0 and standard deviation 1. In that way.
Then when using canonical variates to transform each variable in turn. For categorical ones. It defaults to imputing categorical variables using the category whose predicted canonical score is closest to the predicted score. These constants are ignored during the transformation-estimation phase. As an example of non-monotonic transformation and imputation. This technique has proved to be helpful when. Tick marks indicate the two imputed values for blood pressure.
For the ordinal count of the number of positive factors. For the more powerful predictor of the two summary measures. The adequacy of either type of scoring can be checked using tests of linearity in a regression model. A fair way to validate such two-stage models is to use a resampling method (Section 5). Either a shrunken estimator or data reduction is needed. If this falls below 0. If one constituent variable has a very high R² in predicting the original cluster score. A method called battery reduction can be used to delete variables from clusters by determining if a subset of the variables can explain most of the variance explained by PC1 see [ This approach does not require examination of associations with Y.
For clusters that are retained after limited step-down modeling. A simple method. Then a new cluster score is created and the response model is rerun with the new score in the place of the original one. Let p denote the number of parameters in this model.
All variables contained in clusters that were not selected initially are ignored. A reduced model may have acceptable calibration if associations with Y are not used to reduce the predictors. Chapter 12] and The full model with 15 d. The With these assumptions. This is unlikely in practice.
The other 10 variables would have to be reduced to a single variable using principal components or another scaling technique. The AIC-based calculation yields a maximum of 2. The analyst wishes to analyze age.
In this case the If the goal of the analysis is to make a series of hypothesis tests adjusting P-values for multiple comparisons instead of to predict future responses. The analyst may be forced to assume that age is linear. It is not known whether interaction between age and sex exists. The other 10 variables are assumed to be linear and to not interact with themselves or age and sex. A summary of the various data reduction methods is given in Figure 4. There is a total of 15 d. If the information explained by the omitted variables is less than one would expect by chance e.
When principal component analysis or related methods are used for data reduction. In one dataset of patients and deaths. Data reduction approaches covered in the last section can yield very interpretable. Remedies for this have been discussed in Sections 4. Sometimes the analyst may deem a subject so atypical of other subjects in the study that deletion of the case is warranted.
Extreme values of the predictor variables can have a great impact. These new approaches. When data reduction is not required. In some cases. Newer single stage approaches are evolving. It can be disheartening. In one example a single extreme predictor value in a sample of size that was not on a straight line relationship with.
On other occasions. Predictions were found to be more stable when WBC was truncated at Such disagreements should not lead to discarding the observations unless the predictor or response values are erroneous as in Reason 3. On rare occasions. The most common measures that apply to a variety of regression models are leverage. Most important. To compute leverage in ordinary least squares. Some believe that the distribution of hii should be examined for values that are higher than typical.
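For ordinary least squares, leverage is the diagonal of the hat matrix. The sketch below covers only the simplest case, one predictor plus an intercept, where the diagonal element reduces to 1/n + (x_i − x̄)² / Σ(x_j − x̄)². The function name and the data, including the single extreme predictor value, are hypothetical.

```python
def leverages(x):
    """Hat-matrix diagonals h_ii for a straight-line fit with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    return [1 / n + (v - mean_x) ** 2 / sxx for v in x]


# Hypothetical predictor values with one extreme observation
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
h = leverages(x)

for value, h_ii in zip(x, h):
    print(f"x = {value:5.1f}   h_ii = {h_ii:.3f}")

# The h_ii sum to the number of fitted parameters (here 2), so their
# average is 2 / n; values far above that average stand out.
print(f"sum of h_ii = {sum(h):.3f}")
```

The extreme point takes nearly all of the available leverage, which is why examining the distribution of the h_ii for unusually high values is a quick screen for observations that can dominate the fit.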
In both cases. Statistical measures can also be helpful. Various statistical indexes can quantify discrimination ability e. ROC area only measure how well predicted values can rank-order responses. If the purpose of the models is only to rank-order subjects. Items 3 through 7 require subjective judgment. The methods that follow assume that the performance of the models is evaluated on a sample not used to develop either one.
Some of the criteria for choosing one model over the other are 1. In this case. Given that the two models have similar calibration. Rank measures Dxy. For the relatively small subset of patients with extremely low white blood counts or serum albumin.
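The point that rank measures reflect only ordering can be made concrete. The sketch below computes the concordance probability (ROC area) for a binary outcome by comparing every event with every non-event, then converts it to Somers' Dxy through the standard relation Dxy = 2(C − 0.5). The predicted risks and outcomes are made-up values and the function name is hypothetical.

```python
def c_index(predicted, outcome):
    """Concordance probability (ROC area) for a binary outcome.

    Counts event/non-event pairs in which the event has the higher
    prediction; tied predictions count one half.
    """
    concordant = ties = pairs = 0
    for p_event, y_event in zip(predicted, outcome):
        if y_event != 1:
            continue
        for p_other, y_other in zip(predicted, outcome):
            if y_other != 0:
                continue
            pairs += 1
            if p_event > p_other:
                concordant += 1
            elif p_event == p_other:
                ties += 1
    return (concordant + 0.5 * ties) / pairs


# Hypothetical predicted risks and observed outcomes (1 = died)
pred = [0.10, 0.40, 0.35, 0.80, 0.70, 0.20]
died = [0, 0, 1, 1, 1, 0]

c = c_index(pred, died)
dxy = 2 * (c - 0.5)
print(f"C = {c:.3f}, Dxy = {dxy:.3f}")

# Dividing every prediction by 10 ruins calibration but keeps the ordering,
# so the rank-based index does not change.
c_rescaled = c_index([p / 10 for p in pred], died)
print(f"C after rescaling predictions = {c_rescaled:.3f}")
```

Because the index is unchanged by any order-preserving distortion of the predictions, it cannot by itself distinguish a well-calibrated model from a badly calibrated one.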
Suppose that the predicted value is the probability that a subject dies. Assuming corrections have been made for complexity. Again given that both models are equally well calibrated. This is especially true when the models are strong.
The worth of a model can be judged by how far it goes out on a limb while still maintaining good calibration. If one model assigns 0. Then high-resolution histograms of the predicted risk distributions for the two models can be very revealing. For example: Insist on validation of predictive models and discoveries. There are several things that a good analyst can do to improve the situation.
Show that alternative explanations are easy to posit. As stated in the Preface. Characterize observations that had to be discarded. For survival time data. Impute missing Xs if the fraction of observations with any missing Xs is not tiny. These strategies are far from failsafe. At the least these default strategies are concrete enough to be criticized so that statisticians can devise better ones. Depending on the model used. In what follows some default strategies are elaborated.
If there are missing Y values on a small fraction of the subjects but Y can be reliably substituted by a surrogate response. These models can simultaneously impute missing values while determining transformations. Characterize tendencies for Y to be missing using. Special imputation models may be needed if a continuous X needs a non-monotonic transformation p.
In most cases. Assemble as much accurate pertinent data as possible. For each predictor specify the complexity or degree of nonlinearity that should be allowed see Section 4. Use the entire sample in the model development as data are too precious to waste. The d. Transformations determined from the previous step may be used to reduce each predictor into 1 d.
When you can test for model complexity in a very structured way. When missing values were imputed. About this book: This highly anticipated second edition features new chapters and sections, new references, and comprehensive R software.
As in the first edition, this text is intended for Masters' or Ph. The book will also serve as a reference for data analysts and statistical methodologists, as it contains an up-to-date survey and bibliography of modern statistical modeling techniques.
Examples used in the text mostly come from biomedical research, but the methods are applicable anywhere predictive models ("analytics") are useful, including economics, epidemiology, sociology, psychology, engineering, and marketing. Keywords: Generalized least squares; Linear models; Logistic regression; Predictive modeling; R statistical software; Regression analysis; Survival analysis; knitr reproducible documents.