Binary logistic regression using RStudio
Introduction
We will be using an example from Chapter 11 of Pituch and Stevens (2016) text, Applied Multivariate Statistics for the Social Sciences (6th ed). In their chapter, all output was generated using SPSS and SAS. Data for our example is contained in an SPSS file and can be downloaded here: https://drive.google.com/file/d/1hQO0fUfuy19btI02HLCHBUip0cAzh4ia/view After downloading the data, import it into RStudio to follow along with the presentation.
The example in the Pituch and Stevens (2016) chapter was based around an intervention program designed to decrease the likelihood that pre-diabetic adults would be diagnosed with full-blown Type 2 diabetes later on. In their example, n=200 adults were randomly assigned to either a ‘normal’ treatment condition or an ‘enhanced’ treatment condition. The normal treatment condition group is coded 0 in the dataset, whereas the enhanced condition is coded 1. The treatment period for the participants in the study lasted 3 months. At the conclusion of the treatment period, participants were assessed for the presence or absence of diabetes. The binary outcome variable (Health) in the datafile is coded 0=diabetic, 0=NOT diabetic. Prior to treatment onset, participants responded to a (continuous) measure of motivation to change one’s lifestyle to improve health.
library(haven)
DIABETES1 <- read_sav("DIABETES1.sav")
A few preliminaries…
We can generate frequency tables for our binary predictor, Treat (coded 0=normal treatment, 1=enhanced treatment), and our binary outcome, Health (coded 0=not diabetic, 1=diabetic) using the frq() function associated with the sjmisc package.
library(sjmisc)
frq(DIABETES1$Health)
## x <numeric>
## # total N=200 valid N=200 mean=0.42 sd=0.49
##
## Value | N | Raw % | Valid % | Cum. %
## --------------------------------------
## 0 | 116 | 58.00 | 58.00 | 58.00
## 1 | 84 | 42.00 | 42.00 | 100.00
## <NA> | 0 | 0.00 | <NA> | <NA>
frq(DIABETES1$Treat)
## x <numeric>
## # total N=200 valid N=200 mean=0.50 sd=0.50
##
## Value | Label | N | Raw % | Valid % | Cum. %
## -----------------------------------------------------------
## 0 | normal treatment | 100 | 50 | 50 | 50
## 1 | enhanced treatment | 100 | 50 | 50 | 100
## <NA> | <NA> | 0 | 0 | <NA> | <NA>
We can also use the describe() function from the psych package to obtain various descriptives for the variables in our dataframe as well.
library(psych)
describe(DIABETES1)
## vars n mean sd median trimmed mad min max range skew
## Case 1 200 100.50 57.88 100.50 100.50 74.13 1.00 200.00 199.00 0.00
## Motiv 2 200 49.46 9.86 48.59 49.35 11.57 27.07 72.35 45.28 0.12
## Treat 3 200 0.50 0.50 0.50 0.50 0.74 0.00 1.00 1.00 0.00
## Health 4 200 0.42 0.49 0.00 0.40 0.00 0.00 1.00 1.00 0.32
## ZMotiv 5 200 0.00 1.00 -0.09 -0.01 1.17 -2.27 2.32 4.59 0.12
## transf 6 200 193.96 48.33 188.71 192.92 56.78 89.29 309.77 220.48 0.20
## kurtosis se
## Case -1.22 4.09
## Motiv -0.76 0.70
## Treat -2.01 0.04
## Health -1.91 0.03
## ZMotiv -0.76 0.07
## transf -0.75 3.42
Logistic regression model
We will run our logistic regression model using the glm() function. Inside the parenthesis, we begin with the formula, specifying Health as a function of Treat + Motiv. The second argument specifies the name of the dataframe, which is DIABETES1. Finally, we set the ‘family’ argument = binomial and indicate we are using the logit link. [If you are using probit regression, you would set the link = “probit”.] The assignment operator (<-) is pointing to an object I have (arbitrarily) named “fit1”. This object contains the results of our analysis and we draw from it later to generate various outputs for evaluating the fit of our model.
fit1 <- glm(Health ~ Treat + Motiv, data = DIABETES1, family = binomial (link="logit"))
summary(fit1)
##
## Call:
## glm(formula = Health ~ Treat + Motiv, family = binomial(link = "logit"),
## data = DIABETES1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.85504 0.81249 -3.514 0.000442 ***
## Treat 1.01385 0.30211 3.356 0.000791 ***
## Motiv 0.04029 0.01547 2.604 0.009218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 272.12 on 199 degrees of freedom
## Residual deviance: 253.14 on 197 degrees of freedom
## AIC: 259.14
##
## Number of Fisher Scoring iterations: 4
The regression slopes in this output represent the predicted change in the logit(Y=1) per unit increase on the predictor. [Recall, Y=1 indicates no diabetes and Y=0 indicates diabetes on the Health variable] Logits are a transformation of the probability(Y=1), which allows the relationship between the probability(Y=1) and the predictors to be estimated using a linear function. When it comes to interpretation of the regression slope, a positive value indicates that with increasing values on X, the probability(Y=1) is also increasing; a negative value indicates that with increasing values on X, the probability(Y=1) is decreasing; a slope of 0 indicates no change in the probability(Y=1) as values on a predictor increase.
In our output, we see that Treat was a positive and statistically significant (b=1.014, s.e.=.302, p<.001) predictor in the model. Formally speaking, this regression slope indicates adults in the enhanced treatment condition were expected to be 1.014 logits higher than those in the normal treatment condition. This simply means that adults in the enhanced condition had a higher probability of being in the non-diabetic group (Y=1) following treatment than those in the normal condition.
We also see Motiv was a positive and statistically significant (b=.040, s.e.=.015, p=.009) predictor in the model. Formally, speaking, the slope indicates that for every unit increase on motivation, the predicted logit(Y=1) increases by .04. More practically, the positive slope indicates that individuals with a higher level of motivation at the start of their treatments exhibited a greater probability of not being diagnosed with diabetes at the conclusion of the treatment period.
The following code shows you attributes of the fit1 object containing our results. We will be extracting some of the elements contained in this object later on.
attributes(fit1)
## $names
## [1] "coefficients" "residuals" "fitted.values"
## [4] "effects" "R" "rank"
## [7] "qr" "family" "linear.predictors"
## [10] "deviance" "aic" "null.deviance"
## [13] "iter" "weights" "prior.weights"
## [16] "df.residual" "df.null" "y"
## [19] "converged" "boundary" "model"
## [22] "call" "formula" "terms"
## [25] "data" "offset" "control"
## [28] "method" "contrasts" "xlevels"
##
## $class
## [1] "glm" "lm"
The next code chunk shows you how you can extract the regression coefficients from the fit1 object and exponentiate them to produce odds ratios (OR’s) for your predictors.
exp(fit1$coefficients)
## (Intercept) Treat Motiv
## 0.05755371 2.75618478 1.04111224
Odds ratios (OR’s) represent the multiplicative change in odds(Y=1) per unit increase on a predictor. The odds ratio for treatment condition was 2.756. This means that the odds for those in the enhanced treatment condition being diagnosed as diabetic following the treatment period was 2.76 times that of the odds for those in the normal treatment condition. From this, we can say that the odds of a person in the enhanced treatment condition being non-diabetic is (2.76-1)*100% = 176% greater than the odds of a person in the normal treatment condition. If we compute the inverse of the odds ratio - i.e., 1/OR = 1/2.76 = .362 - this tells us the odds of a person in the normal treatment condition being healthy (i.e., Y=1, or not diagnosed with diabetes) is .36 times that of the odds for those in the enhanced treatment condition. That is, the odds of a person in the normal condition being healthy (non-diabetic) is 63.8% less [computed as (.362-1)*100% = -63.8%] than the odds for those in the enhanced group.
The odds ratio for motivation was 1.041, meaning that for every one unit increase on motivation, the odds of a person being ‘healthy’ (non-diabetic) following the treatment period is multiplied by 1.041. In other words, the odds of being healthy increase by (1.041-1)*100% = 4.1% for every unit increase on pre-intervention motivation.
A more efficient way of obtaining the odds ratios, while also getting other results of value - such as the likelihood ratio (LR) chi-square test and pseudo-R-squares - would be to use the summ() function from the jtools package. The exp=T argument below specifies that we want exponentiated coefficients (odds ratios). Leaving this argument out or setting exp=F will produce the original regression coefficients.
library(jtools) #for odds ratios and LR chi-square test
summ(fit1, exp=T, vif=T, digits=3)
## MODEL INFO:
## Observations: 200
## Dependent Variable: Health
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(2) = 18.972, p = 0.000
## Pseudo-R² (Cragg-Uhler) = 0.122
## Pseudo-R² (McFadden) = 0.070
## AIC = 259.145, BIC = 269.040
##
## Standard errors: MLE
## ----------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p VIF
## ----------------- ----------- ------- ------- -------- ------- -------
## (Intercept) 0.058 0.012 0.283 -3.514 0.000
## Treat 2.756 1.525 4.983 3.356 0.001 1.002
## Motiv 1.041 1.010 1.073 2.604 0.009 1.002
## ----------------------------------------------------------------------
The LR chi-square test can be considered an omnibus test of the null hypothesis that all regression slopes for a model are equal to 0. Rejecting this null hypothesis indicates a logistic regression model fits the data significantly better than an intercept-only (null) model. For our analysis, the LR chi-square test is statistically significant, = 18.972, p < .001, indicating our model fits the data better than a null model.
Our output using the summ() function also includes McFadden’s pseudo . Values between .2 and .4 may be regarded as indicating a ‘strong improvement in fit’ (Pituch & Stevens, 2016, p. 449), relative to an intercept-only (null) model. For our analysis, McFadden’s = .07, indicating an approximate improvement in fit relative to an intercept-only model of 7%.You can obtain additional pseudo R-squares if you wish by using the PseudoR2() function associated with the DescTools package. These and others are described at https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/.
library(DescTools)
PseudoR2(fit1, which = c("McFadden", "McFaddenAdj", "CoxSnell", "Nagelkerke"))
## McFadden McFaddenAdj CoxSnell Nagelkerke
## 0.06972021 0.04767086 0.09049993 0.12172328
The C-statistic can be obtained using the Cstat() function associated with the DescTools package. For a description of this statistic, see https://rdrr.io/cran/DescTools/man/Cstat.html. According to the authors of this site, C-statistic values range from .5 to 1.0, with higher values indicating better prediction. ‘Reasonable’ models are those with a C-statistic > .70; and ‘strong’ models are those where the C-statistic > .8.
Cstat(fit1)
## [1] 0.674569
Another indicator of goodness of fit is the Hosmer-Lemshow test. Citing Allison (2012), Pituch and Stevens (2016), indicate this test ‘can be interpreted as a test of the null hypothesis that no additional interaction or nonlinear terms are needed in the model’ (p. 455). That said, it does not test whether other predictors that have been excluded would, upon inclusion in the model, result in an improvement in model fit.
To carry out the test, we can use the performance_hosmer() function associated with the performance package.
library(performance)
performance_hosmer(fit1, n_bins=10)
## # Hosmer-Lemeshow Goodness-of-Fit Test
##
## Chi-squared: 6.876
## df: 8
## p-value: 0.550
Another way of evaluating our model involves generating and studying a confusion matrix to examine the correspondence between the actual group memberships on the dependent variable in the data and those predicted by the model. Here is a break down of our steps:
- Recode predicted probabilities(Y=1), into a vector consisting of 0’s and 1’s representing predicted group membership on the outcome variable; i.e., 0=predicted diabetic, 1=predicted non-diabetic.
- Add this vector to the our original dataframe.
- Use the CrossTable() function from the gmodels package to generate confusion matrix.
library(gmodels)
# Step 1:
prg.m1 <- ifelse(fit1$fitted.values <= .5, 0 , 1)
# Step 2:
DIABETES1 <- cbind(DIABETES1,prg.m1)
# Step 3:
CrossTable(x=DIABETES1$Health, y=DIABETES1$prg.m1, chisq=F, prop.r=T, prop.c = F, prop.t=F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 200
##
##
## | DIABETES1$prg.m1
## DIABETES1$Health | 0 | 1 | Row Total |
## -----------------|-----------|-----------|-----------|
## 0 | 92 | 24 | 116 |
## | 1.606 | 3.660 | |
## | 0.793 | 0.207 | 0.580 |
## -----------------|-----------|-----------|-----------|
## 1 | 47 | 37 | 84 |
## | 2.218 | 5.055 | |
## | 0.560 | 0.440 | 0.420 |
## -----------------|-----------|-----------|-----------|
## Column Total | 139 | 61 | 200 |
## -----------------|-----------|-----------|-----------|
##
##
The confusion matrix can be used to talk about how well the model functions in a predictive sense. Sensitivity refers to the proportion of true positives that were correctly predicted by the model. Specificity refers to the proportion of true negatives that were correctly predicted by the model. The overall classification accuracy for the model refers to the proportion of observed outcomes correctly predicted by the model. Here are the calculated values for sensitivity, specificity, and classification accuracy from our confusion matrix:
- Specificity = ratio of number predicted Y=0 to number observed Y=0: 92/116=.793
- Sensitivity = ratio of number predicted Y=1 to the number observed Y=1: 37/84=.440
- Classification accuracy: (92+37)/200 = .645
As you can see, 44% of individuals who were observed to be healthy (non-diabetic) at the end of the treatment period were correctly predicted to fall into this category by the model. This is the sensitivity of the model since the target outcome of Y=1 refers to the non-diabetic category. 79.3% of individuals who were observed to be diabetic at the end of the treatment period were correctly predicted to fall into this category. This is the specificity of the model since again the target outcome of Y=1 refers to the non-diabetic category.
One might also be interested in describing how well the model performs in terms of the proportion of cases predicted to fall into a category that were (perhaps later) observed to actually fall into that category. Positive predictive value (PPV) refers to the proportion of cases predicted to fall into the target category (Y=1) that ultimately were observed to fall into that category. Negative predictive value (NPV) refers to the proportion of cases predicted to fall into the non-target category (Y=0) that ultimately were observed to fall into that category. We can calculate the PPV and NPV from our confusion matrix as follows:
- PPV = 37 individuals were observed Y=1 out of 61 predicted Y=1, for PPV = 37/61=.607.
- NPV = 92 individuals were observed Y=0 out of 139 predicted Y=0, for NPV = 92/139=.662
Of the 61 individuals predicted to be healthy at the end of the treatment period, 37 (or 60.7%) were observed to be healthy. Of the 139 individuals at the end of treatment predicted to be diabetic, 92 (or 66.2%) were observed to be diabetic.
Testing linearity in the logit using Box-Tidwell transformation: Xln(X)
One of the assumptions of logistic regression is linearity in the logit. This can be assessed by including non-linear transformations of your continuous predictors adn examining these ‘transformed predictors’ for statistical significance. [Keep in mind, it makes no sense to use the Box-Tidwell procedure with categorical predictors.] Significant regression slopes for these predictors suggest the possible need to add non-linear terms into your regression model. In the code below, I added the transformed predictor by specifying Motiv:log(Motiv) in the model formula. This caused RStudio to internally carry out the transformation. I could have just as easily computed the transformed variable, added it into the dataframe, and then included it as a predictor in the model formula below instead.
#Re-specify model containing the product of Motiv and the natural log of Motiv
fit1 <-glm(Health ~ Treat + Motiv + Motiv:log(Motiv),
family=binomial (link="logit"), data=DIABETES1)
summary(fit1)
##
## Call:
## glm(formula = Health ~ Treat + Motiv + Motiv:log(Motiv), family = binomial(link = "logit"),
## data = DIABETES1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.10181 6.66091 -0.316 0.752349
## Treat 1.01356 0.30216 3.354 0.000795 ***
## Motiv -0.03566 0.66708 -0.053 0.957367
## Motiv:log(Motiv) 0.01549 0.13601 0.114 0.909336
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 272.12 on 199 degrees of freedom
## Residual deviance: 253.13 on 196 degrees of freedom
## AIC: 261.13
##
## Number of Fisher Scoring iterations: 4
We see in our table, the transformed predictor was not statistically significant (p=.9093), indicating we have evidence the assumption of linearity in the logit is fulfilled.
References
Pituch, K. A., Stevens, J. P. (2016). Applied multivariate statistics for the social sciences (6th ed.). Routledge.
Comments
Post a Comment