Factor analysis of EBI items: Tutorial with RStudio and EFA.dimensions package
For this posting, I will be walking you through an example of exploratory factor analysis (EFA) using RStudio with the EFA.dimensions package. In this tutorial, I will address various steps one typically considers when performing EFA and how to make decisions at various stages in the process.
We will be factor analyzing items from the Epistemic Belief Inventory (EBI; Schraw et al., 2002) based on data collected from Chilean high school students in this study. Download a copy of the Excel file containing the data here. The items for our analyses all have the root name 'ce' (see screenshot below). Although the original authors of the study analyzed all 28 items, to keep our output simpler, we will only perform the analysis on the 17 items contained in Table 2.
These items correspond to the following variables in the Excel data file: ce1, ce2, ce3, ce4, ce5, ce8, ce9, ce11, ce14, ce15, ce17, ce20, ce22, ce24-ce27. Below is a screenshot of a subset of the Excel file.
Preliminaries
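Before running any analyses, you will need to install and load the relevant packages and read in the data. Below is a minimal setup sketch. The file name ("EBI_data.xlsx") and the use of the readxl package are my assumptions, not part of the original study; point read_excel at wherever you saved the downloaded Excel file.

```r
# Preliminary setup (a sketch; the file name is a placeholder --
# adjust the path to wherever you saved the Excel file).
# install.packages(c("EFA.dimensions", "psych", "readxl"))  # run once
library(EFA.dimensions)
library(psych)
library(readxl)

ebi <- read_excel("EBI_data.xlsx")   # hypothetical file name

# Keep only the 17 items analyzed in this tutorial
items <- c("ce1","ce2","ce3","ce4","ce5","ce8","ce9","ce11","ce14",
           "ce15","ce17","ce20","ce22","ce24","ce25","ce26","ce27")
itemdata <- as.data.frame(ebi[, items])
```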
The analyses
Step 1: Assess factorability of correlation matrix
[For this step, we will assume you have already examined the distributions of your variables and considered factors that may have impacted the quality of the correlations being submitted to factor analysis.]
The first question you need to ask yourself during EFA is whether you have the kind of data that would support factor analytic investigation and yield reliable results. Some issues can produce a factor solution that is not particularly trustworthy or reliable (Cerny & Kaiser, 1977; Dziuban & Shirkey, 1974), whereas others can result in algorithmic problems that may prevent a solution from even being generated (Watkins, 2018). The way to address this question is to evaluate the correlation matrix based on the set of measured variables being submitted to factor analysis. [In the current example, the measured variables are the items from the EBI. Hereafter, I will be referring to "items" throughout the discussion.] Answering the question of whether it is appropriate to conduct factor analysis in the first place can assist you in correcting problems before they emerge during analysis or (at the very least) serve as valuable information that can assist you in diagnosing problems that occur when running your EFA.
A common preliminary step is to examine the correlation matrix based on your items for the presence of correlations that exceed .30 (in absolute value; Watkins, 2022). During factor analysis, the idea is to extract latent variables (factors) to account for (or explain) the correlational structure among a set of measured variables. If the correlations in your matrix are generally trivial (i.e., very small) or near zero, this logically precludes investigation of common factors, since there is little common variation among the variables for factors to explain. Additionally, when screening your correlation matrix, you should also look for very high correlations (e.g., r's > .90) among items that might suggest the presence of linear (or near-linear) dependencies among your variables. The presence of multicollinearity among your variables can produce unstable results, whereas singularity in your correlation matrix will generally result in the termination of the analysis (with a message indicating your matrix is nonpositive definite). In the presence of singularity, certain matrix operations cannot be performed (Watkins, 2021). [Note: A common reason why singularity occurs is when an analyst attempts to factor analyze correlations involving a set of items along with a full scale score that is based on the sum or average of those items. An example of the warning message in R/RStudio can be seen here]
To examine our correlation matrix, we will rely on the corr.test function associated with the psych package. Line 9 in the Script Editor below contains the corr.test function and associated arguments for our analysis. The 'itemdata' reference is in the data argument portion of the command. This points RStudio to the data frame containing the variables we are analyzing. Setting use="complete.obs" invokes listwise deletion of cases that have missing data on any of the variables included in the correlation matrix. To generate the correlation matrix, simply highlight line 9 and click Run.
The matrix is symmetric with 1's on the principal diagonal. If you wish to count the number of unique correlations greater than .30 (in absolute value), simply count the number in either the lower OR upper triangle. Counting the number of correlations in the lower triangle of this matrix, there look to be about 10 correlations [out of the p(p-1)/2 = 17(16)/2 = 136 unique correlations; remember, the matrix is symmetric about the principal diagonal containing 1’s] that exceed .30. None of the correlations exceed .90, which suggests a low likelihood of any linear dependencies.
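The screening steps above are easy to script directly in base R. The sketch below simulates a stand-in for 'itemdata' so it is self-contained; with the real data, drop the simulation and use your own data frame.

```r
# Counting unique correlations exceeding |.30| (base-R sketch).
# 'itemdata' is simulated here so the code is self-contained; with real
# data, replace the simulation with your data frame of the 17 items.
set.seed(123)
n <- 300; p <- 17
itemdata <- as.data.frame(matrix(sample(1:5, n * p, replace = TRUE), n, p))

R <- cor(itemdata, use = "complete.obs")

# Unique correlations live in one triangle of the symmetric matrix
lower <- R[lower.tri(R)]
length(lower)            # p(p-1)/2 = 17*16/2 = 136 unique correlations
sum(abs(lower) > .30)    # how many exceed |.30|
any(abs(lower) > .90)    # flag possible linear dependencies
```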
Bartlett’s test is used to test whether your correlation matrix differs significantly from an identity matrix [a matrix containing 1’s on the diagonal and 0’s in the off-diagonal elements]. In a correlation matrix, if all off-diagonal elements are zero, this signifies your variables are completely orthogonal. Rejection of the null hypothesis when performing this test is an indication that your correlation matrix is factorable. In our output, Bartlett’s test is statistically significant (p<.001), consistent with a factorable matrix. [Keep in mind that Bartlett's test is often significant, even with matrices that have more trivial correlations, as the power of the test is influenced by sample size. Factor analysis is generally performed using large samples.]
The determinant of a correlation matrix can be thought of as a generalized variance. If this value is 0, your matrix is singular (which would preclude matrix inversion). If the determinant is greater than .00001, then you have a matrix that should be factorable. In our results, the determinant is .113, which is greater than .00001.
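Both the determinant check and Bartlett's test can be computed by hand in a few lines, which helps demystify the output. The sketch below uses simulated data as a stand-in for the real items, and the chi-square formula is the standard one for Bartlett's test of sphericity.

```r
# Bartlett's test of sphericity and the determinant, computed by hand
# (base-R sketch; simulated data stands in for the real items).
set.seed(1)
n <- 300; p <- 17
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- X[, 1] + rnorm(n)   # build in some real correlation
R <- cor(X)

detR <- det(R)
detR > .00001                 # factorable if the determinant clears this floor

chisq <- -(n - 1 - (2 * p + 5) / 6) * log(detR)
df    <- p * (p - 1) / 2
pval  <- pchisq(chisq, df, lower.tail = FALSE)
pval                          # rejecting H0 (R = identity) supports factoring
```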
The Kaiser-Meyer-Olkin measure of sampling adequacy (KMO MSA) is another index to assess whether it makes sense to conduct EFA. Values closer to 1 on the overall MSA indicate greater possibility of the presence of common factors (making it more reasonable to perform EFA).
Here, we see the overall MSA is .78, which falls into Kaiser and Rice’s (1974) “middling” (but factorable) range [see scale below]. The remaining MSA’s are item-level measures. Items with MSA values < .50 are candidates for removal prior to running your EFA. The item (variable) level MSA’s in our data range from .67 to .85, suggesting it is reasonable to submit them all to EFA.
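If you are curious what the KMO is actually doing, it compares squared zero-order correlations against squared partial (anti-image) correlations. Here is a from-scratch sketch; the six simulated one-factor items are my stand-in, not the EBI data.

```r
# Kaiser-Meyer-Olkin MSA computed from first principles (base-R sketch;
# simulated one-factor data stands in for the EBI items).
set.seed(42)
n  <- 500
F1 <- rnorm(n)
X  <- sapply(1:6, function(i) .7 * F1 + rnorm(n))  # six items, one factor
R  <- cor(X)

Rinv <- solve(R)
P    <- -cov2cor(Rinv)    # partial (anti-image) correlations
diag(P) <- 0
R0 <- R; diag(R0) <- 0

# Overall MSA and item-level MSAs
kmo_overall <- sum(R0^2) / (sum(R0^2) + sum(P^2))
kmo_items   <- colSums(R0^2) / (colSums(R0^2) + colSums(P^2))
round(kmo_overall, 2)     # closer to 1 = more factorable
```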
Step 2: Determination of number of factors
Once you have concluded (from Step 1) that it is appropriate to conduct EFA, the next step is to estimate the number of factors that may account for the interrelationships among your items. Historically, analysts have leaned on the Kaiser criterion (i.e., eigenvalue cutoff rule) to the exclusion of better options. This largely seems to be the result of this criterion being programmed into many statistical packages as a default. Nevertheless, this rule often leads to over-factoring and is strongly discouraged as a sole basis for determining number of factors.
Ideally, you will try to answer the 'number of factors question' using multiple investigative approaches. This view is endorsed by Fabrigar and Wegener (2012) who stated, “determining the appropriate number of common factors is a decision that is best addressed in a holistic fashion by considering the configuration of evidence presented by several of the better performing procedures” (p. 55). It is important to recognize, however, that different analytic approaches can suggest different possible models (i.e., models that vary in the number of factors) that may explain your data. Resist the temptation to run from this ambiguity by selecting one and relying solely on it (unless perhaps it is a method that has a strong empirical track record for correctly identifying factors; a good example is parallel analysis). Instead, consider the different factor models [or at least a reasonable subset of them] as 'candidates' for further investigation. From there, perform EFA on each model, rotate the factors, and interpret them. The degree of factor saturation and interpretability of those factors can then be used as a basis for deciding which of the candidate models may best explain your data.
Method 1: Parallel analysis
Parallel analysis is a method that has strong empirical support as a basis for determining the number of factors. This method involves comparing eigenvalues from your data against eigenvalues that have been generated at random. To perform parallel analysis, use the 'RAWPAR' function associated with the EFA.dimensions package. The factormodel="PCA" argument instructs the function to compute the real and random eigenvalues using an unreduced correlation matrix (essentially extracting eigenvalues using the principal components method). The Ndatasets=1000 argument instructs the function to generate the random eigenvalues using 1000 simulations. The percentile=95 argument instructs the function to return the 95th percentiles for the randomly generated eigenvalues. See line 13 below.
Once you have generated the output using the 'RAWPAR' function, simply compare the eigenvalues from your data [see Real Data column below] against randomly generated eigenvalues [either Mean or 95th percentile]. Using this method, the number of factors equals the number of Real Data eigenvalues that exceed those that were randomly generated. Looking at the table below, we see that the first three eigenvalues from our data exceed the 95th percentiles of the randomly generated eigenvalues. This suggests retention of 3 factors.
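The logic behind RAWPAR is straightforward to sketch in base R: generate many random datasets of the same dimensions, collect their eigenvalues, and keep real eigenvalues that beat the random benchmark. The simulated two-factor data below is my stand-in for real data, and I use 200 rather than 1000 simulations for speed.

```r
# Parallel analysis from scratch (base-R sketch of the logic behind
# RAWPAR; simulated two-factor data stands in for the real items).
set.seed(7)
n <- 300; p <- 10
F1 <- rnorm(n); F2 <- rnorm(n)
X  <- cbind(sapply(1:5, function(i) .7 * F1 + rnorm(n)),
            sapply(1:5, function(i) .7 * F2 + rnorm(n)))
real_eigs <- eigen(cor(X))$values

Ndatasets <- 200   # the tutorial uses 1000; fewer here for speed
rand_eigs <- replicate(Ndatasets,
                       eigen(cor(matrix(rnorm(n * p), n, p)))$values)
crit95 <- apply(rand_eigs, 1, quantile, probs = .95)

# Retain factors whose real eigenvalues beat the 95th percentile of chance
nfactors <- sum(real_eigs > crit95)
nfactors
```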
Method 2: Minimum average partial (MAP) test
This method involves a comparison of PCA models that vary in number of extracted components. The starting point for the method is to extract 0 components and simply compute the average of the squared zero-order correlations among the set of variables. Next, a single component (i.e., 1 component model) is extracted and the average of the squared partial correlations (i.e., after partialling the first component) is computed. Following, two components are extracted (i.e., 2 component model) and the average of the squared partial correlations (i.e., after partialling the first two components) is computed; and so on. The number of recommended factors based on this procedure is associated with the component model that has the smallest average squared partial correlation. To generate the MAP test, use the 'MAP' function from the EFA.dimensions package. See line 17 below.
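The MAP procedure described above can also be coded directly, which makes the "partial out m components, average the squared partials" loop concrete. This is a sketch of the logic, not the package's implementation, and again uses simulated two-factor data.

```r
# Velicer's minimum average partial (MAP) test from scratch (base-R
# sketch of the logic behind the MAP function; simulated data).
set.seed(11)
n <- 300
F1 <- rnorm(n); F2 <- rnorm(n)
X  <- cbind(sapply(1:5, function(i) .7 * F1 + rnorm(n)),
            sapply(1:5, function(i) .7 * F2 + rnorm(n)))
R  <- cor(X); p <- ncol(R)
e  <- eigen(R)
L  <- e$vectors %*% diag(sqrt(e$values))   # principal component loadings

avg_sq    <- numeric(p)                    # index m+1 holds the m-component model
avg_sq[1] <- mean(R[lower.tri(R)]^2)       # 0 components: squared zero-order r's
for (m in 1:(p - 1)) {
  C  <- R - L[, 1:m, drop = FALSE] %*% t(L[, 1:m, drop = FALSE])
  D  <- diag(1 / sqrt(diag(C)))
  Pm <- D %*% C %*% D                      # partial correlations after m components
  avg_sq[m + 1] <- mean(Pm[lower.tri(Pm)]^2)
}
which.min(avg_sq) - 1                      # suggested number of factors
```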
Method 3: Scree test
The scree test is one of the older methods for identifying number of factors. This test involves plotting the eigenvalues (Y axis) against factor number (X axis). In general, the plot of the eigenvalues looks like the side of a mountain with the base of the mountain containing 'rubble'. To obtain the scree plot using EFA.dimensions, use the 'SCREE_PLOT' function. Here, I have typed the function into line 19 in the Script Editor.
A limitation of the scree plot is its subjectivity. Moreover, there are times when there may be other 'break-points' that can cloud the picture of how many factors to retain.
Method 4: Sequential chi-square tests
Another option for identifying the possible number of factors is to use sequential chi-square tests. Personally, I am not a big fan of this approach given that it is well known to lead to over-factoring, and this tendency is amplified in larger samples. The approach involves testing the fit of a set of common factor models, each containing a different number of factors. The significance test associated with each factor model is a test of whether the model deviates significantly from an exact fitting model. In effect, the significance tests that are performed are essentially a series of 'badness of fit' tests. The preferred factor model (in terms of number of factors) is the one that has the fewest factors and is not statistically significant. To perform this test using EFA.dimensions simply use the 'SMT' function. See line 21.
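You can see the same sequential-testing logic using base R's factanal function, which fits maximum likelihood factor models and reports the chi-square test of exact fit. This is a sketch of the general idea (not the SMT function itself), with simulated two-factor data.

```r
# Sequential chi-square tests via base R's factanal (a sketch of the
# sequential-testing logic; simulated two-factor data).
set.seed(5)
n <- 400
F1 <- rnorm(n); F2 <- rnorm(n)
X  <- cbind(sapply(1:4, function(i) .7 * F1 + rnorm(n)),
            sapply(1:4, function(i) .7 * F2 + rnorm(n)))

# Fit models with increasing numbers of factors; stop at the first model
# that does NOT deviate significantly from exact fit
for (k in 1:3) {
  fit <- factanal(X, factors = k)
  cat(sprintf("%d-factor model: chi-sq = %.1f, df = %d, p = %.3f\n",
              k, fit$STATISTIC, fit$dof, fit$PVAL))
  if (fit$PVAL > .05) break
}
```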
Method 5: Empirical Kaiser criterion.
This method was proposed by Braeken and van Assen (2017) as an alternative to the more deeply flawed Kaiser (1960) criterion (i.e., eigenvalue cutoff rule of 1). The basic approach is to compare the eigenvalues from your data (column 2) against a set of reference values that "can be seen as a sample-variant of the original Kaiser criterion (which is only effective at the population level), yet with a built-in empirical correction factor that is a function of the variables-to-sample-size ratio and the prior observed eigenvalues in the series" (p. 463). According to Braeken and van Assen (2017), the EKC appears to function as well as or better than parallel analysis, with the EKC outperforming parallel analysis in cases where factors are moderately to highly correlated and there are few measured variables per factor. In a manner analogous to what we did before with parallel analysis, we retain those factors that have eigenvalues from the data that exceed the reference values. The EFA.dimensions function for performing this test is 'EMPKC'. See line 23 below.
We see below that the method suggests maintaining 3 factors. [It turns out that if we’d used the traditional Kaiser criterion of maintaining factors with eigenvalues > 1, it would have also suggested 3 factors.]
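For the curious, the EKC reference values can be computed directly. The sketch below is based on my reading of Braeken and van Assen's formula (each reference value scales the average remaining eigenvalue by a (1 + sqrt(p/n))^2 correction, floored at 1), so treat it as an illustration rather than a definitive implementation.

```r
# Empirical Kaiser criterion reference values computed directly (a sketch
# based on my reading of Braeken & van Assen's formula; simulated data).
set.seed(3)
n <- 300; p <- 10
F1 <- rnorm(n); F2 <- rnorm(n)
X  <- cbind(sapply(1:5, function(i) .7 * F1 + rnorm(n)),
            sapply(1:5, function(i) .7 * F2 + rnorm(n)))
l  <- eigen(cor(X))$values

ref <- numeric(p)
for (k in 1:p) {
  prev   <- if (k == 1) 0 else sum(l[1:(k - 1)])
  # average remaining eigenvalue, scaled up for sampling error, floored at 1
  ref[k] <- max(((p - prev) / (p - k + 1)) * (1 + sqrt(p / n))^2, 1)
}
sum(l > ref)   # retained factors: data eigenvalues above their reference values
```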
For a closer look at the table, click here.
Step 3: Final extraction of factors, rotation, and interpretation
At this point, I have determined the most likely candidate is a model containing three factors [The one-factor model suggested by the MAP test is unlikely, especially given evidence epistemic beliefs are multidimensional]. For this reason, the remainder of this discussion will be based on a three-factor model. To perform the analysis, we will rely on Principal Axis Factoring (PAF) for factor extraction. Watkins (2018) notes factor extraction involves transforming correlations to factor space in a manner that is computationally, but not necessarily conceptually, convenient. As such, extracted factors are generally rotated to "achieve a simpler and theoretically more meaningful solution" (p. 231). For our example, we will consider two factor rotation approaches (Promax and Varimax) to facilitate interpretation of the final set of factors that are extracted. Promax rotation is one type of oblique rotation that relaxes the constraint of orthogonality of factors, thereby allowing them to correlate. Varimax rotation is a type of orthogonal rotation. When Varimax rotation is used, the factors are not permitted to correlate following rotation of the factor axes. Below is the Script Editor with the code for performing PAF, forcing 3 factors, and then rotating for interpretation. See lines 25 and 27 below.
PAF with Promax rotation
The syntax below is for performing Principal Axis Factoring (PAF) with Promax rotation. The 'itemdata' is listed first to refer to the data frame where our data are located. The argument, corkind="pearson", specifies we are analyzing Pearson correlations. The Nfactors=3 argument specifies we are forcing the extraction of 3 factors (recall, this was based on our assessment of the number of factors during Step 2). The iterpaf=50 argument specifies a maximum number of iterations for estimation of the parameters in the model. [If the model does not converge within this maximum, you can re-set this to a higher number.] The rotate="PROMAX" argument specifies (you guessed it!) Promax rotation. Prior to rotation, the extracted factors are orthogonal and account for additive variation in the set of measured variables (items). Promax rotation is an oblique rotation that allows factors to be intercorrelated - this is in contrast to Varimax rotation (an orthogonal rotation) which we use later. Finally, the ppower=3 argument is recommended in the package. The aim with this parameter is to give the simplest structure with the lowest inter-factor correlations.
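To show what PAF and Promax are doing under the hood, here is a from-scratch sketch: start with squared multiple correlations on the diagonal, iterate the eigendecomposition of the reduced matrix until communalities stabilize, then rotate with base R's stats::promax. This is an illustration of the method, not the package's own code, and uses simulated two-factor data.

```r
# Principal axis factoring by hand, then Promax rotation with base R's
# stats::promax (a sketch of the method; simulated two-factor data).
set.seed(9)
n <- 400
F1 <- rnorm(n); F2 <- rnorm(n)
X  <- cbind(sapply(1:5, function(i) .7 * F1 + rnorm(n)),
            sapply(1:5, function(i) .7 * F2 + rnorm(n)))
R  <- cor(X); p <- ncol(R); k <- 2

h2 <- 1 - 1 / diag(solve(R))       # initial communalities: SMCs
for (i in 1:50) {                  # iterate PAF (cf. iterpaf = 50)
  Rr <- R; diag(Rr) <- h2          # reduced correlation matrix
  e  <- eigen(Rr)
  L  <- e$vectors[, 1:k] %*% diag(sqrt(pmax(e$values[1:k], 0)))
  h2_new <- rowSums(L^2)           # updated communalities
  if (max(abs(h2_new - h2)) < 1e-6) { h2 <- h2_new; break }
  h2 <- h2_new
}
rot <- promax(L, m = 3)            # oblique rotation (cf. ppower = 3)
round(unclass(rot$loadings), 2)
```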
The table below contains initial and extracted communalities. During principal axis factoring (PAF) the values in the first column (Initial) are placed into the principal diagonal of the correlation matrix (instead of 1's). It is this reduced correlation matrix that is factored. The Initial communalities are computed as the squared multiple correlation between each measured variable (item) and the remaining variables (items). [You can easily generate these by performing a series of regressions where you regress each variable onto the remaining variables and obtaining the R-squares.] The Extraction communalities are final estimates of the proportion of variation in each item accounted for jointly by the set of common factors that have been extracted.
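The regression route to the initial communalities mentioned in brackets above is easy to verify: the SMCs can be obtained all at once from the inverse of the correlation matrix, and they match the R-squared from regressing each item on the rest. A quick self-contained check with simulated data:

```r
# Initial communalities are squared multiple correlations; verify by
# regressing one variable on the rest (base-R sketch, simulated data).
set.seed(2)
n  <- 300
F1 <- rnorm(n)
X  <- sapply(1:5, function(i) .6 * F1 + rnorm(n))
colnames(X) <- paste0("v", 1:5)
R  <- cor(X)

smc <- 1 - 1 / diag(solve(R))       # all SMCs at once from the inverse
fit <- lm(v1 ~ ., data = as.data.frame(X))
c(matrix_route = unname(smc[1]), regression_route = summary(fit)$r.squared)
```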
The next table in the output actually does not contain results from the PAF. Rather, these are results from an initial principal components analysis. Personally, I am not entirely sure why this is presented. The eigenvalues in the first column of values indicate the amount of variation in the set of items accounted for by each component that is extracted. Confusingly, the first column refers to 'factor' number (instead of the component number). I have to assume this is a holdover from the long tradition in EFA of using principal components as estimates of factors and the tendency to use the term 'factors' instead of 'components'. When discussing this table, I will stick with tradition (with the proviso that these are technically components).
The eigenvalue associated with a given component summarizes how much variation is accounted for in the original set of variables (items) by a given component. The first component accounts for as much variation as 3.06 of the original measured variables (items). We can compute the % of variance by dividing the eigenvalue by the total number of items (i.e., 17): 3.06/17 = .18 (or 18%). The second component accounts for as much variation as 2.04 of the original items. This translates into 2.04/17 = .12 (or about 12% of the variation). Cumulatively, the first and second components account for 30% of the variation in the items. Component 3 accounts for as much variation as 1.43 items, which is approximately (1.43/17)*100% = 8% of the variation. Cumulatively, Components 1-3 account for approximately 38% of the variation in the items. Notice that the fourth component explains less variation than a single measured variable. And so on.
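The percent-of-variance arithmetic in the paragraph above is just eigenvalues divided by the number of items, which you can verify in two lines using the eigenvalues reported in the output (3.06, 2.04, 1.43; p = 17):

```r
# The percent-of-variance arithmetic from the paragraph above, using the
# eigenvalues reported in the output (3.06, 2.04, 1.43; p = 17 items).
eigs <- c(3.06, 2.04, 1.43)
p <- 17
round(eigs / p, 2)           # proportion of variance per component
round(cumsum(eigs) / p, 2)   # cumulative proportion
```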
*It is worth noting the original Kaiser criterion (eigenvalue cutoff rule) was developed as a basis for determining number of factors based on a PCA solution (such as the one above). The reasoning was that for a factor to be useful, it needs to explain as much variation as at least one measured variable. If we applied the eigenvalue cutoff rule to this table (to determine number of factors), it would suggest retention of 3 factors. [The eigenvalues in this table are the same as those we saw earlier in the output when we used the EKC approach to determining number of factors.]
For a closer look at the table, click here.
References
Cerny, B. A., & Kaiser, H. F. (1977). A study of a measure of sampling adequacy for factor-analytic correlation matrices. Multivariate Behavioral Research, 12(1), 43–47.
Dziuban, C. D., & Shirkey, E. C. (1974). When is a correlation matrix appropriate for factor analysis? Psychological Bulletin, 81, 358-361.
Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20, 141–151.
Leal-Soto, F., & Ferrer-Urbina, R. (2017). Three-factor structure for Epistemic Belief Inventory: A cross-validation study. PLoS ONE, 12(3), e0173295. doi:10.1371/journal.pone.0173295
Pituch, K.A., & Stevens, J.P. (2016). Applied multivariate statistics for the social sciences (6th ed.). New York: Routledge.
Schraw, G., Bendixen, L. D., & Dunkle, M. E. (2002). Development and validation of the Epistemic Belief Inventory (EBI). In B. K. Hofer & P. R. Pintrich (Eds.), Personal epistemology: The psychology of beliefs about knowledge and knowing (pp. 261–275). Lawrence Erlbaum Associates Publishers.
Watkins, M. W. (2018). Exploratory factor analysis: A guide to best practice. Journal of Black Psychology, 44, 219-246.
Watkins, M. W. (2021). A step-by-step guide to exploratory factor analysis with R and RStudio. New York: Routledge.