Sequential (adjacent levels or repeated categories) recoding of an ordered multicategorical predictor in multiple regression
Overview
It is often assumed that the only way to analyze the relationship between a multicategorical independent variable and a continuous dependent variable is to use analysis of variance (ANOVA). Nevertheless, it is possible to perform a linear regression analysis using a multicategorical variable as the independent variable, so long as the original categorical variable is appropriately recoded into a new set of variables prior to being included in the analysis. These variables carry with them all the grouping information of the original independent variable. Darlington and Hayes (2017) refer to these variables as “compound variables”. The regression slope associated with each of the new variables in the model represents a contrast involving group means. When other focal predictors or covariates are included in the model, these slopes represent contrasts of adjusted group means (Darlington & Hayes, 2017). There are a variety of coding systems an analyst could use for modeling the functional relationship between the categorical independent variable and the dependent variable. How the original categorical variable has been recoded, however, has implications for interpretation of the regression slopes and intercept in one’s model.
Perhaps the most frequently used approach is dummy or indicator variable coding. This system involves recoding the original categorical variable into J-1 new variables that are coded with a particular configuration of 0s and 1s [where J = the number of groups associated with the original categorical variable]. With indicator coding, the analyst must choose one group to serve as a baseline or reference category whose mean is contrasted against the means of the other groups. The reference category is coded 0 across all indicator variables, which uniquely identifies that category. The remaining categories are easily identifiable based on the remaining pattern of 0s and 1s on the indicators.
Let’s say as researchers we ask the question, “Does education level among college students predict the degree to which they engage in deep processing while learning?” We collect sample data from a set of n = 96 college students, where we ask them their level of education (i.e., ‘freshman’, ‘sophomore’, ‘junior’, or ‘senior’) and have them respond to a measure of ‘deep processing’. Our research hypothesis is that education level predicts students’ reports of deep processing. After collecting our data, we decide to run our analysis using linear regression instead of ANOVA. [Perhaps this is just preference, or perhaps we decide to include other variables in the model as focal predictors as well.] We decide to recode education level [‘edlevel’] into a set of indicator variables where our reference category is ‘freshman’. The table below contains the codes for the indicator variables we create.
Indicator coding of categorical predictor

edlevel (original categorical variable) | IND1 | IND2 | IND3
freshman | 0 | 0 | 0
sophomore | 1 | 0 | 0
junior | 0 | 1 | 0
senior | 0 | 0 | 1
The prediction equation for our linear regression is:
Y’ = b0 + b1*IND1 + b2*IND2 + b3*IND3
…where b0 = intercept, b1 = slope for first indicator (IND1), b2 = slope for second indicator (IND2), b3 = slope for third indicator (IND3).
When we estimate our model using a statistical package, b0 is equal to the mean for freshmen (since freshmen are coded 0 across all indicators); b1 is the difference in means between sophomores (coded 1 on IND1 and 0 on IND2 and IND3) and freshmen; b2 is the difference in means between juniors (coded 1 on IND2 and 0 on IND1 and IND3) and freshmen; and b3 is the difference in means between seniors (coded 1 on IND3 and 0 on IND1 and IND2) and freshmen. If other predictors are included in the model, the intercept is an adjusted mean and the aforementioned slopes are differences in adjusted means (which should sound familiar in the ANCOVA sense). [It is worth pointing out that the designation of a group as the reference category is up to the analyst; there was no mathematical basis for using ‘freshman’ as the reference group in this example. We could have just as easily picked ‘sophomore’ as the reference category.]
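Plugging each group’s codes into the prediction equation shows where these interpretations come from:

Mean(freshman) = b0
Mean(sophomore) = b0 + b1, so b1 = Mean(sophomore) - Mean(freshman)
Mean(junior) = b0 + b2, so b2 = Mean(junior) - Mean(freshman)
Mean(senior) = b0 + b3, so b3 = Mean(senior) - Mean(freshman)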
In those cases where the independent variable is ordered categorical – i.e., the levels of the variable represent ordered categories – the researcher may have questions about whether groups falling at adjacent levels of the independent variable exhibit mean differences on the dependent variable. In the context of the example provided above, we might ask, “Do freshmen differ from sophomores, do sophomores differ from juniors, and do juniors differ from seniors in their mean levels of deep processing?” To address this question, we cannot use the indicator variable approach I described earlier. Instead, we can use sequential coding. Darlington and Hayes (2017) provide a nice discussion of sequential coding in their book, Regression analysis and linear models: Concepts, applications, and implementation, and I pivot off the strategy they lay out here. There is another example I ran across on the web here on repeated effect coding that (a) seems a bit more convoluted and (b) applies the strategy to a nominal variable, where its applicability seems more limited.
Below is a table containing the recoding of the original ‘edlevel’ variable into three new sequentially coded dummy variables (SC1, SC2, SC3). Based on this recoding, the regression slopes in our equation, Y’ = b0 + b1*SC1 + b2*SC2 + b3*SC3, represent differences in means between adjacent categories of the predictor variable. Thus, the slope (b1) for SC1 is equal to the mean for sophomores minus the mean for freshmen; the slope (b2) for SC2 is equal to the mean for juniors minus the mean for sophomores; and the slope (b3) for SC3 is equal to the mean for seniors minus the mean for juniors. Note too that the intercept (b0) remains the mean of the freshman group, since freshmen are coded 0 across the sequentially coded dummy variables.
Sequential coding of categorical predictor

edlevel (original categorical variable) | SC1 | SC2 | SC3
freshman | 0 | 0 | 0
sophomore | 1 | 0 | 0
junior | 1 | 1 | 0
senior | 1 | 1 | 1
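To see why these slopes capture adjacent-category differences, plug each group’s codes from the table into the prediction equation:

Mean(freshman) = b0
Mean(sophomore) = b0 + b1
Mean(junior) = b0 + b1 + b2
Mean(senior) = b0 + b1 + b2 + b3

Subtracting each row from the one below it gives b1 = Mean(sophomore) - Mean(freshman), b2 = Mean(junior) - Mean(sophomore), and b3 = Mean(senior) - Mean(junior).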
Example using IBM SPSS
For this example, we will use syntax to generate the recoded variables. If you prefer to use the SPSS drop-downs to perform the recoding, then see this YouTube video.

Here is a screenshot of a subset of the raw data. The raw data can be downloaded by clicking this link.
Open up the SPSS syntax editor and type in the syntax for recoding the original variable into the new set of variables. Inside each parenthesis is an equals (=) sign. To the left is the value of the original variable; to the right is the recode on the new variable. You will notice that on line 3 we are recoding the ‘edlevel’ variable into ‘SC1’, with value codes of 0, 1, 1, 1. This corresponds to the first column of codes (SC1) in the table given above. On line 4, we are recoding ‘edlevel’ into ‘SC2’, with value codes of 0, 0, 1, 1, corresponding to the second column (SC2). On line 5, we are recoding ‘edlevel’ into ‘SC3’, with value codes of 0, 0, 0, 1, corresponding to the third column (SC3). On line 6, we end the recoding with the EXECUTE command followed by a period.
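If you would rather type the commands than copy them from the screenshot, the recoding syntax looks something like the sketch below. This is a sketch rather than a reproduction of the screenshot: it assumes ‘edlevel’ is stored numerically as 1 = freshman, 2 = sophomore, 3 = junior, and 4 = senior (the line numbers mentioned above refer to rows in the syntax editor, which may also include a comment or blank lines before the RECODE commands).

* Sequential coding of edlevel (assumed codes: 1=freshman, 2=sophomore, 3=junior, 4=senior).
RECODE edlevel (1=0) (2=1) (3=1) (4=1) INTO SC1.
RECODE edlevel (1=0) (2=0) (3=1) (4=1) INTO SC2.
RECODE edlevel (1=0) (2=0) (3=0) (4=1) INTO SC3.
EXECUTE.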
Highlight the code and click the green arrow to generate the new variables for your dataset.
A subset of the new dataset (that includes the sequentially coded variables)…
Running the analysis
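The screenshots for this step are not reproduced here, but the regression can be run either through the menus (Analyze > Regression > Linear, with the deep processing measure as the dependent variable and SC1, SC2, and SC3 as the predictors) or with syntax along the lines of the sketch below. The dependent variable name ‘deep’ is an assumption; substitute whatever the deep processing variable is called in the dataset.

* Regress deep processing on the sequentially coded variables.
REGRESSION
  /STATISTICS COEFF R ANOVA
  /DEPENDENT deep
  /METHOD=ENTER SC1 SC2 SC3.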
Interpreting the results
The independent variable, ‘edlevel’, accounted for statistically significant variation in deep processing, R² = .614, F(3, 92) = 48.843, p < .001. The regression slope for SC1 was positive and statistically significant (b = 4.000, s.e. = .824, p < .001), suggesting sophomores in college score higher on deep processing than freshmen. The regression slope for SC2 was positive and statistically significant (b = 2.000, s.e. = .857, p = .022), suggesting juniors in college score higher on deep processing than sophomores. The regression slope for SC3 was positive and statistically significant (b = 2.000, s.e. = .731, p = .007), suggesting seniors in college score higher on deep processing than juniors.
The mean for freshmen in the sample was 20 (this is the intercept).

The mean for sophomores in the sample was 20 + 4 = 24 (intercept + slope for SC1).

The mean for juniors in the sample was 20 + 4 + 2 = 26 (intercept + slope for SC1 + slope for SC2).

The mean for seniors in the sample was 20 + 4 + 2 + 2 = 28 (intercept + slope for SC1 + slope for SC2 + slope for SC3).
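As a check on this arithmetic, the four group means can also be computed directly from the raw data, for example with the MEANS command (again assuming the deep processing variable is named ‘deep’):

* Group means of deep processing by education level.
MEANS TABLES=deep BY edlevel.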
References
Darlington, R. B., & Hayes, A. F. (2017). Regression analysis and linear models: Concepts, applications, and implementation. New York: The Guilford Press.