diff --git a/02-Global_estimation.Rmd b/02-Global_estimation.Rmd index 2af3749..04f2c91 100644 --- a/02-Global_estimation.Rmd +++ b/02-Global_estimation.Rmd @@ -70,7 +70,7 @@ Replacing the values of x and y with the standardized versions in our calculatio sum(zx * zy)/(length(zx) - 1) == cor(x, y) ``` -In our example, the two variables are prefectly correlated, so $r = 1$. +In our example, the two variables are perfectly correlated, so $r = 1$. Incidentally, this is the same as dividing the covariance of $x$ and $y$ by the product of their standard deviations, which omits the need for the Z-transformation step but achieves the same outcome: @@ -82,7 +82,7 @@ Now that we have reviewed these basic concepts, we can begin to consider them wi ## Regression Coefficients -The inferential heart of structural equation modeling are the regression (or path) coefficients. These values mathematically quantify the (mostly linear) dependence of one variable on another. This verbage should sound familiar because that is what we have already established as the goal of covariance/correlation. +The inferential heart of structural equation modeling are the regression (or path) coefficients. These values mathematically quantify the (mostly linear) dependence of one variable on another. This verbiage should sound familiar because that is what we have already established as the goal of covariance/correlation. In this section, we will demonstrate how path coefficients can be derived from correlation coefficients and explore Grace's "8 rules of path coefficients." @@ -92,7 +92,7 @@ In a simple linear regression, one variable $y$ is the response and another $x$ $$\hat{y} = bx + a$$ -where $b$ is the regression coefficient and $a$ is the intercept. It's important to note that $b$ implies a linear relationship, i.e., the relatoinship between $x$ and $y$ can be captured by a straight line (for now). +where $b$ is the regression coefficient and $a$ is the intercept. It's important to note that $b$ implies a linear relationship, i.e., the relationship between $x$ and $y$ can be captured by a straight line (for now). The regression coefficient between $x$ and $y$ can be related to the correlation coefficient through the following equation: @@ -359,7 +359,7 @@ cov(data) returns the variance-covariance matrix for the three variables $x1$, $y1$, and $y2$. We would call this the *observed global* variance-covariance matrix. -The entire machinary behind covariance-based SEM is to reproduce that global variance-covariance matrix. In fact, all of covariance-based SEM can be boiled down into a simple equation: +The entire machinery behind covariance-based SEM is to reproduce that global variance-covariance matrix. In fact, all of covariance-based SEM can be boiled down into a simple equation: $$\Sigma = \Sigma(\Phi)$$ @@ -377,7 +377,7 @@ The maximum-likelihood fitting function can be expressed as: where $\Sigma$ is the modeled covariance matrix, $S$ is the observed covariance matrix, $p$ is the number of endogenous variables, and $q$ is the number of exogenous variables. $tr$ is the trace of the matrix (sum of the diagonal) and the $^{-1}$ is the inverse of the matrix. -Maximum-likelihood estimators have a few desireable properties, principally that they are invariant to the scales of the variables and provide unbiased estimates based on a few assumptions: +Maximum-likelihood estimators have a few desirable properties, principally that they are invariant to the scales of the variables and provide unbiased estimates based on a few assumptions: - variables must exhibit multivariate normality. Oftentimes this is the not case: dummy variables, interactions and other product terms have non-normal distributions. However, $F_{ML}$ is fairly robust to violations of multinormality, especially as the sample size grows large. - the observed matrix $S$ must be positive-definite. This means there are no negative variances, an implied correlation > 1.0, or redundant variables (one row is a linear function of another). @@ -416,7 +416,7 @@ Finally, consider a third equation: $$2a - 4 = 4b$$ -We now have more pieces of known information than unknowns, since we have already arrived at a solution for both $a$ and $b$ based on the previous two equations. In this case, we call the system of the equations *overidentified* because we have more information than is necessary to arrive at unique solutions for our unknown variables. This is the desireable state, because that extra information can be used to provide additional insight. +We now have more pieces of known information than unknowns, since we have already arrived at a solution for both $a$ and $b$ based on the previous two equations. In this case, we call the system of the equations *overidentified* because we have more information than is necessary to arrive at unique solutions for our unknown variables. This is the desirable state, because that extra information can be used to provide additional insight. You may alternately hear models referred to as *saturated*. Such a model would be *just identified*; an *unsaturated* model would be *overidentified* and and an *oversaturated* model would be *underidentified*. @@ -458,7 +458,7 @@ The order condition can be evaluated using the following equation: $$G \leq H$$ -where $G$ = the number of incoming paths, and $H$ = the number of exogenous variables + the number of indirectly-connected endogenous variabls. In the previous example, $G = 2$ while $H = 1$, so the model fails the order condition, as noted. +where $G$ = the number of incoming paths, and $H$ = the number of exogenous variables + the number of indirectly-connected endogenous variables. In the previous example, $G = 2$ while $H = 1$, so the model fails the order condition, as noted. Model identification is only the first step in determining whether a model can provide unique solutions: sample size can also restrict model fitting by not providing enough replication for the $F_{ML}$ function to arrive at a stable set of estimates for the path coefficients. @@ -494,7 +494,7 @@ We can then formally compare the $\chi^2$ statistic to the $\chi^2$-distribution Failing to reject the null hypothesis that the $\chi^2$ statistic is different from 0 (perfect fit) implies a generally good representation of the data (*P* > 0.05). Alternately, rejecting the null implies that the $\chi^2$ statistic is large, as is the discrepancy between the observed and modeled variance-covariance matrices, thus implying a poor fit to the data (*P* < 0.05). Interpreting the outcome of the significance test is often tricky, as a significant *P*-value indicates *poor* fit, so be careful. -The $\chi^2$ index also provides a way to gauge the relative fit of two models, one of which is nested within the other. The *$\chi^2$ difference test* is simply the difference in $\chi^2$ values between the two models, with the degrees of freedom being the difference in the degrees of freedom between the two models. The resulting statistic can then be compared to a $\chi^2$ table to yield a significance value. Again, this test is for *nested* models. For non-nested models, other statistics allow for model comparisons, including AIC and BIC. An AIC or BIC score $\geq2$ is generally considered to indicate significant differneces among models, with smaller values indicating equivalency between the two models. +The $\chi^2$ index also provides a way to gauge the relative fit of two models, one of which is nested within the other. The *$\chi^2$ difference test* is simply the difference in $\chi^2$ values between the two models, with the degrees of freedom being the difference in the degrees of freedom between the two models. The resulting statistic can then be compared to a $\chi^2$ table to yield a significance value. Again, this test is for *nested* models. For non-nested models, other statistics allow for model comparisons, including AIC and BIC. An AIC or BIC score $\geq2$ is generally considered to indicate significant differences among models, with smaller values indicating equivalency between the two models. $\chi^2$ tests tend to be affected by sample size, with larger samples more likely to generate poor fit due to small absolute deviations (note the scaling of $F_{ML}$ by $n-1$ in the above equation). As a result, there are several other fit indices for covariance-based SEM that attempt to correct for this problem: @@ -509,7 +509,7 @@ What happens if the model doesn't fit? Depending on the goals of your analysis ( - examination correlation of model residuals: parameters with large residual correlations (difference between observed and expected) could suggest missing information or linkages. - *modification indices*, or the expected decrease in the $\chi^2$ if a missing path were to be included in the model. A high value of a modification index would suggest the missing path should be included. (Tests of directed separation, which we cover in the chapter on Local Estimation, provide similar insight and are returned automatically by *piecewiseSEM*.) -Users should take caution when exploring these techniques as to avoid dredging the model. SEM is a technique that relies heavily on informed model specification: adding paths in that are suggested by the data but not anticipated by the user to achieve adequate fit, or comparing all sub-models using AIC, for example, might be appropriate in other applications, but ignore the basic philosophy behind SEM that relationships are tested bas don *a priori* knowledge. +Users should take caution when exploring these techniques as to avoid dredging the model. SEM is a technique that relies heavily on informed model specification: adding paths in that are suggested by the data but not anticipated by the user to achieve adequate fit, or comparing all sub-models using AIC, for example, might be appropriate in other applications, but ignore the basic philosophy behind SEM that relationships are tested based on *a priori* knowledge. ## Model Fitting Using *lavaan* @@ -559,7 +559,7 @@ summary(keeley_sem1) The output is organized into a few sections. First is the likelihood optimization method, number of parameters, the total sample size for the model, the estimator ($F_{ML}$ is the default) and the fit statistic. The model has $\chi^2 = 0$ with 0 degrees of freedom: this is because we have as many knowns as unknowns, and thus the model is just identified or saturated. To show this, we can apply the t-rule: we must estimate the two variances of the variables plus their path coefficient ($t = 3$) and know the values of the two variables ($n = 2$). Recall the equation for the t-rule $t \leq n(n + 1)/2$, so $3 = 2(2+1)/2 = 6/2 = 3$, and therefore the model is saturated. -Next up are the actual parameter estimates: the relationship between fire severity and stand age is $\beta = 0.06$ with $P < 0.001$. The model has reports the estimated error variance on the endogenous variable. +Next up are the actual parameter estimates: the relationship between fire severity and stand age is $\beta = 0.06$ with $P < 0.001$. The model reports the estimated error variance on the endogenous variable. We can dissect this output a little more. First, let's fit the corresponding linear model using `lm`: @@ -603,7 +603,7 @@ To return the standardized coefficients using *lavaan* requires a separate funct standardizedsolution(keeley_sem1) ``` -This output does not return the raw coefficients, however, or any other information about the model that is useful in interpretation. The obtain a single output, you can pass the argument `standardize = T` to `summary`: +This output does not return the raw coefficients, however, or any other information about the model that is useful in interpretation. To obtain a single output, you can pass the argument `standardize = T` to `summary`: ```{r} summary(keeley_sem1, standardize = T) @@ -627,7 +627,7 @@ Now that we have covered the basics of *lavaan*, let's fit a slightly more compl Here, we test the hypotheses that total cover of plants is a function of fire severity, which in turn is informed by how old the plants are in a particular plot (which we have already investigated using linear regression). This test is known as *full mediation*, in other words that the effect of age is fully mediated by fire severity (we will test another scenario shortly). -Again, we must provide the formulae as a character string. This model can be broken down into two equation representing the two endogenous variables: +Again, we must provide the formulae as a character string. This model can be broken down into two equations representing the two endogenous variables: ```{r} keeley_formula2 <- ' @@ -658,9 +658,9 @@ fitMeasures(keeley_sem2) Woah! We're certainly not lacking in statistics. -Returning to the summary output, we see the same coefficient for $firesev ~ age$ and a new estimate for $cover ~ firesev$. In this case, more severe fires reduce cover (not unexpectedly). +Returning to the summary output, we see the same coefficient for $firesev \sim age$ and a new estimate for $cover \sim firesev$. In this case, more severe fires reduce cover (not unexpectedly). -Now that we have multiple linkages, we can also compute the indirect effect of age on cover. Recall from Rule 3 of path coefficients that the indirect effects along a compound path are the product of the individual path coefficients: $0.454 * -0.437 = -0.198$. +Now that we have multiple linkages, we can also compute the indirect effect of age on cover. Recall from Rule 3 of path coefficients that the indirect effects along a compound path are the product of the individual path coefficients: $0.454 \times -0.437 = -0.198$. We can obtain this value by modifying the model formula to include these calculations directly. This involves giving a name to the coefficients in the model strings, then adding a new line indicating their product using the operator `:=`: diff --git a/03-Local_estimation.Rmd b/03-Local_estimation.Rmd index 452b295..057fbc8 100644 --- a/03-Local_estimation.Rmd +++ b/03-Local_estimation.Rmd @@ -35,7 +35,7 @@ Another way of thinking about model fit is to ask: are we missing anything? Reca The *tests of directed separation* evaluate this hypothesis: that we are justified in excluding relationships. This question is actually implicit in the $\chi^2$ statistic: a substantial deviation from the observed correlations suggests that we're missing information in our model that could bring in our estimates more in line with our observations. The tests of directed separation take this one step further by explicitly identifying and testing whether each piece of missing information (i.e., each missing path) could indeed change our interpretation of the overall model. -Two variables are *d-separated* if they are statistically independent conditional on their joint influences. Let's unpack this statement. First, 'two variables.' The two variables are *unrelated* in the hypothesized causal model: in other words, there is not a directed path already connecting them. Second, 'statistically independent.' We test for statistical dependence in our model all the time: the *P*-values associated with the path coefficients, for example, test whether the effect is significantly different than zero. Statistical *independence* then asks whether the two variables are significantly *unrelated*, or that that their relationship is in fact no diffeent than zero. Finally, 'conditional on their joint influences' means that the test for statistical independence must account for contributions from any other influences. In other words, the test must consider the *partial* effect of one variable on the other if either or both are already influenced by other variables in the model. +Two variables are *d-separated* if they are statistically independent conditional on their joint influences. Let's unpack this statement. First, 'two variables.' The two variables are *unrelated* in the hypothesized causal model: in other words, there is not a directed path already connecting them. Second, 'statistically independent.' We test for statistical dependence in our model all the time: the *P*-values associated with the path coefficients, for example, test whether the effect is significantly different than zero. Statistical *independence* then asks whether the two variables are significantly *unrelated*, or that that their relationship is in fact no different than zero. Finally, 'conditional on their joint influences' means that the test for statistical independence must account for contributions from any other influences. In other words, the test must consider the *partial* effect of one variable on the other if either or both are already influenced by other variables in the model. Procedurally, this evaluation is quite easy: identify missing paths, test whether the effect is not significantly different from zero (*P* > 0.05), and combine those inferences to gauge the overall trustworthiness of the model. But there is some background to cover first. @@ -47,7 +47,7 @@ In this case, we have specified two sets of directed relationships: $x1 -> y1$ a If we apply the t-rule from the chapter on global estimation, we have $3(3+1)/2$ or 6 pieces of known information (the variances on the 3 variables + the 3 sets of covariances). We want to estimate the 2 parameters $\gamma_{x1y1}$ and $\beta_{y1y2}$ and the variances on the 3 variables (we can get their covariances from that). Thus we have 6 known values to estimate 5 unknown values, and the model is *underidentified*. We noted in the chapter on global estimation that the number of leftover known values can be used as degrees of freedom in the $\chi^2$ goodness-of-fit test. In this case, there is 1 degree of freedom, so likewise, we can go on to test model fit. -This 1 degree of freedom actually corresponds to the missing relationship between $x1 -> y2$. This is the *independence claim* we wish to test: that there is indeed no relationship between $x1$ and $y2$. However, the effect of $x1$ on $y2$ must be independent (or the partial effect) or the known influence of $y1$. Thus, we are testing the partial effect of $x1$ on $y2$ given $y1$. You may see this claim written in the following notation: $x1 | y2 (y1)$ where the bar separates the two variables in the claim, and any conditioning variables follow in parantheses. (Shipley actually puts the two variables in parentheses followed by the conditioning variables in brackets: $(x1, y2) | {y1}$, for the record.) +This 1 degree of freedom actually corresponds to the missing relationship between $x1 -> y2$. This is the *independence claim* we wish to test: that there is indeed no relationship between $x1$ and $y2$. However, the effect of $x1$ on $y2$ must be independent (or the partial effect) or the known influence of $y1$. Thus, we are testing the partial effect of $x1$ on $y2$ given $y1$. You may see this claim written in the following notation: $x1 | y2 (y1)$ where the bar separates the two variables in the claim, and any conditioning variables follow in parentheses. (Shipley actually puts the two variables in parentheses followed by the conditioning variables in brackets: $(x1, y2) | {y1}$, for the record.) In this simple example, there is one conditioning variable for the single independence claim. This one independence claim constitutes what is called the *basis set* which is the minimum number of independence claims derived from a path diagram. The key word is *minimum*. @@ -63,7 +63,7 @@ There are several missing paths: $x1 -> y2$, $x1 -> y3$ and $y1 -> y3$. Let's consider the independence claim $x1 -> y3$. Based on our last example, $y2$ must be included as a conditioning variable due to its direct influence on $y3$, but what about $y1$? It has an indirect influence on $y3$ through $y2$. However, by having included $y2$ in the independence claim, we have already (theoretically) incorporated the indirect influence of $y1$ through it. In other words, any effect of $y1$ would change $y2$ before $y3$, and the variance in $y2$ is already considered in the independence claim. So the full claim would be: $x1 | y3 (y2)$. -_Our second rule is: conditioning variables consist of only those variables *immediately ancestral* to the two variables whose independence is being evaluated._ In other words, we assume that the effects of any other downstream variables are captured in the variance contributed by the immediate ancestors, and we can therefore ignore them. Upstream variables (those occuring later in the path diagram) are never considered as conditioning variables, for the obvious reason that they have no effect on the preceding variables. +_Our second rule is: conditioning variables consist of only those variables *immediately ancestral* to the two variables whose independence is being evaluated._ In other words, we assume that the effects of any other downstream variables are captured in the variance contributed by the immediate ancestors, and we can therefore ignore them. Upstream variables (those occurring later in the path diagram) are never considered as conditioning variables, for the obvious reason that they have no effect on the preceding variables. For the claim $y1 -> y3$ above, there are now two conditioning variables: $y2$ (on $y3$) and also $x1$ (on $y1$). So the final independence claim would be: $y1 | y2 (x1, y1)$. @@ -79,11 +79,11 @@ Deriving the basis set can be difficult but mercifully is automated in the `piec The basis set includes the unspecified paths from $x1 | y2 (y1)$ and $x2 | y2 (y1)$. But what about $x1 | x2$? -Shipley would include this claim in the basis set. However, several argument could be made against it along several fronts. +Shipley would include this claim in the basis set. However, several arguments could be made against it along several fronts. First, unlike $y2$ which very clearly is an effect (i.e., has a directed path flowing into it), there is no expectation of a cause-effect relationship between the two exogenous variables $x1$ and $x2$. In fact, such a relationship may yield nonsensical claims (e.g., between butterfly color and number of train stations) or where directionality is confounded in one direction (e.g., latitude and species richness). If the purpose of the test is to evaluate linkages that were originally deemed irrelevant, is it really that useful to test non-mechanistic or impossible links? If we did indeed recover a significant correlation between butterfly color and train stations, is that mechanistically interesting or (more likely) totally spurious? And should we therefore reject a model due to a totally spurious relationship? These are tough questions with no clear answer. From a personal perspective, I believe the tests of directed separation should be diagnostic: should I have included this path? Did it provide useful information? Including non-informative claims because they can be evaluated simply inflates the test statistic with no real benefit to the identifying underlying causal processes. -Second, and more practically, there is no easy way for the user to specify the distributional and other assumptions associated with exogenous variables in the same way they can for endogenous variables. By virtue of modeling $y2$ in a directed path (from $y1$), it is clear how that response should be treateed by the way the model is oded. However, no where in the regression models is there information on how $x1$ should be treated: is it binomial? Hierarchical? Polynomial? Asking the user to code this information would vastly inflate the amount of code necessary to run tests, and combined with the above, would yield little insight for a potentially very large investment. +Second, and more practically, there is no easy way for the user to specify the distributional and other assumptions associated with exogenous variables in the same way they can for endogenous variables. By virtue of modeling $y2$ in a directed path (from $y1$), it is clear how that response should be treated by the way the model is coded. However, no where in the regression models is there information on how $x1$ should be treated: is it binomial? Hierarchical? Polynomial? Asking the user to code this information would vastly inflate the amount of code necessary to run tests, and combined with the above, would yield little insight for a potentially very large investment. Nevertheless, independence claims could be added back into the basis set if the user decides they disagree with this perspective. @@ -91,7 +91,7 @@ Now that we are comfortable identifying missing paths and constructing the basis Once the model is fit, statistical independence is assessed with a t-, F-, or other test. If the resulting *P*-value is >0.05, then we fail to reject the null hypothesis that the two variables are conditionally independent. In this case, a high *P*-value is a *good* thing: it indicates that we were justified in excluding that relationship from our path diagram in the first place, because the data don't support a strong linkage between those variables within some tolerable threshold of error. -Shipley's most important contribution was to shose *P*-values can be used to construct a fit index analogous to the $\chi^2$ statistic from global estimation: Fisher's *C* statistic. +Shipley's most important contribution was to chose *P*-values that can be used to construct a fit index analogous to the $\chi^2$ statistic from global estimation: Fisher's *C* statistic. Fisher's *C* is calculated as: @@ -111,9 +111,9 @@ where $K$ is the likelihood degrees of freedom (not $k$, the number of claims in $$AIC_c = C + 2K\frac{n}{(n - K - 1)}$$ -It's important to point out that, like the $\chi^2$ statistic for global estimation, the *C* statistic can be affected by sample size, but not in as a direct way. As sample size increases, the probability of recovering a "significant" *P*-vaue increases, reducing the potential for a good-fitting model. Similarly, more complex models may lead to a kind of "overfitting" where significant d-sep tests are obscured by many more non-significant values leading to strong support for the (potentially incorrect) model structure. Paradoxically, poor sample size can also lead to a good-fitting model because the tests lack the power to detect an actual effect (high Type II error). Such biases should be considered when reporting the results of the test. +It's important to point out that, like the $\chi^2$ statistic for global estimation, the *C* statistic can be affected by sample size, but not in a direct way. As sample size increases, the probability of recovering a "significant" *P*-value increases, reducing the potential for a good-fitting model. Similarly, more complex models may lead to a kind of "overfitting" where significant d-sep tests are obscured by many more non-significant values leading to strong support for the (potentially incorrect) model structure. Paradoxically, poor sample size can also lead to a good-fitting model because the tests lack the power to detect an actual effect (high Type II error). Such biases should be considered when reporting the results of the test. -In this way, the tests of directed separation may be usefully diagnostic by drawing attention to the specific relationships that could be re-inserted into the model, which would have the added benefit of improving model fit (by removing those significant *P*-values from the basis set). Whether this is advisable depends on the goal of the exercise: in an "exploratory" mode, for example, adding paths might be useful *if* they are theoretically justifiable. Blindly selecting all the significant tests and re-inserting those paths, however, is irresponsible and in fact antithetical to the philsophy of SEM (where paths are carefully specified by the user). +In this way, the tests of directed separation may be usefully diagnostic by drawing attention to the specific relationships that could be re-inserted into the model, which would have the added benefit of improving model fit (by removing those significant *P*-values from the basis set). Whether this is advisable depends on the goal of the exercise: in an "exploratory" mode, for example, adding paths might be useful *if* they are theoretically justifiable. Blindly selecting all the significant tests and re-inserting those paths, however, is irresponsible and in fact antithetical to the philosophy of SEM (where paths are carefully specified by the user). Like $\chi^2$ and other fit indices from global estimation, just identified models have no degrees of freedom (missing paths) left over with which to test model fit. As in those cases, we can turn to other diagnostic tests, such as individual model $R^2$ to give us a sense of confidence in the model structure. If a high proportion of variance is explained in all endogenous variable and there are significant path coefficients, it follows that residual error is low, and it's safe to assume that there are no other variables out there that can further clarify the model structure. @@ -163,7 +163,7 @@ The first step is to derive the basis set using the function `basisSet`: basisSet(keeley_psem) ``` -Here, there is a single independence claim representing the missing path from $age -> cover$ conditional on the influence of $firesev$ on $cover$. +Here, there is a single independence claim representing the missing path from $age \rightarrow cover$ conditional on the influence of $firesev$ on $cover$. Now to evaluate the tests of directed separation using the function `dSep`: @@ -191,7 +191,7 @@ Or, we could just use the function `fisherC`: fisherC(keeley_psem) ``` -It seems in this case, as with the example for global estimation, we would fail to reject the partial mediation model with the directed path from $age -> cover$ and instead refer to the full mediation model. Also as previously, we can perform a Chi-square difference test to validate this conclusion empirically. +It seems in this case, as with the example for global estimation, we would fail to reject the partial mediation model with the directed path from $age \rightarrow cover$ and instead refer to the full mediation model. Also as previously, we can perform a $\chi^2$ difference test to validate this conclusion empirically. Let's create the partial mediation model and compare the two: @@ -205,7 +205,7 @@ keeley_psem2 <- psem( anova(keeley_psem, keeley_psem2) ``` -It seems that, based on the output of the Chi-square difference test, the two models are not significantly different and thus parsimony would tend towards the model with fewer estimatde parameters (the full mediation model). +It seems that, based on the output of the $\chi^2$ difference test, the two models are not significantly different and thus parsimony would tend towards the model with fewer estimated parameters (the full mediation model). Note that the output also computes AIC and BIC scores: these can be obtained with the `AIC` and `BIC` functions as well: @@ -223,7 +223,7 @@ This exercise was a long workaround to reveal that all the above can be executed summary(keeley_psem, .progressBar = FALSE) ``` -The output should look very familar to the output from other summary calls, like `summary.lm`. The d-sep test, Fisher's C and information criteria are all reported. +The output should look very familiar to the output from other summary calls, like `summary.lm`. The d-sep test, Fisher's C and information criteria are all reported. Additionally, model coefficients are returned. Unlike *lavaan*, the standardized estimates are provided by default. Also unlike *lavaan*, the individual model $R^2$ values are also returned by default. Both sets of statistics are key for inference, and thus we have decided to make them available with any further arguments passed to `summary`. @@ -273,7 +273,7 @@ summary(shipley_sem, standardize = T, rsq = T) Firstly, we notice the goodness-of-fit can be estimated, but the model is a poor fit (*P* < 0.001). The paths are all significant but this doesn't do us much good considering the model is not suitable for inference. -Instead of fiddling with modification indices and trying to rejigger the model strcuture, let's analyze the same path diagram using a piecewise approach and recognizing both the hierarchical structure AND non-normality of the data: +Instead of fiddling with modification indices and trying to rejigger the model structure, let's analyze the same path diagram using a piecewise approach and recognizing both the hierarchical structure AND non-normality of the data: ```{r, message = FALSE, warning = FALSE} library(nlme) library(lme4) @@ -297,11 +297,11 @@ shipley_psem <- psem( summary(shipley_psem, .progressBar = FALSE) ``` -(In the new version of *lme4* we get some warnings in the d-sep tests, but the model still return an output.) +(In the new version of *lme4* we get some warnings in the d-sep tests, but the model still returns an output.) -The immediately obvious difference is that the model is no longer a poor fit: we have 12 degrees of freedom corresponding to 6 indepdence claims, all of which have *P* > 0.05. Therefore, the model-wide *P* = 0.484, and we would therefore reject the null that the data do not support the hypothesized model structure. +The immediately obvious difference is that the model is no longer a poor fit: we have 12 degrees of freedom corresponding to 6 independence claims, all of which have *P* > 0.05. Therefore, the model-wide *P* = 0.484, and we would therefore reject the null that the data do not support the hypothesized model structure. -Moreover, while the direction of the parameter estimates remain the same, they vary considerably in their magnitudes (e.g., $\beta_{date, DD} = -0.652$ for *lavaan* and $\beta_{date, DD} = -0.497$ in *piecewiseSEM*). +Moreover, while the direction of the parameter estimates remain the same, they vary considerably in their magnitudes (e.g., $\beta_{date, DD} = -0.652$ for *lavaan* and $\beta_{date, DD} = -0.498$ in *piecewiseSEM*). The model $R^2$s are all higher as well, for fixed-effects only (marginal) and especially for fixed- and random-effects together (conditional). @@ -309,7 +309,7 @@ Thus, by addressing the non-independence of the data, we have converged on suppo ## A Special Case: Where Graph Theory Fails -In the majority of cases, as we have established, the direction of the independence claim doesn't matter because, while the coefficients will differ, their *P*-values will be identical. Thus it doesn't matter if you test $y | x$ or $x | y$ because the claim will yield the same significance test. EXCEPT when intermediate endogenous variables are non-normally distributed. +In the majority of cases, as we have established, the direction of the independence claim doesn't matter because, while the coefficients will differ, their *P*-values will be identical. Thus, it doesn't matter if you test $y | x$ or $x | y$ because the claim will yield the same significance test. EXCEPT when intermediate endogenous variables are non-normally distributed. Consider the following SEM: @@ -320,14 +320,19 @@ In this SEM, there are two independence claims: * $y3 | x1 (y1, y2)$ * $y2 | y1 (x1)$ -In the second independence claim, if both variables were normally distributed, the significance value is the same whether the test is conducted as $y2 | y1 (x1)$ or $y1 | y2 (x1)$. This is NOT true, however, when one or both of the responses are fit to a non-normal distribution. This is because the response is now transformed via a *link function* $g(\mu)$ (see chapter on coefficients), and the parameter estimates--and their standard errors--are now expressed on the link scale. This transformation means the *P*-value obtained by regressing $y1 ~ y2$ is NOT the same as the one obtained by regressing $y2 ~ y1$. +In the second independence claim, if both variables were normally distributed, the significance value is the same whether the test is conducted as $y2 | y1 (x1)$ or $y1 | y2 (x1)$. This is NOT true, however, when one or both of the responses are fit to a non-normal distribution. This is because the response is now transformed via a *link function* $g(\mu)$ (see chapter on coefficients), and the parameter estimates--and their standard errors--are now expressed on the link scale. This transformation means the *P*-value obtained by regressing $y1 \sim y2$ is NOT the same as the one obtained by regressing $y2 \sim y1$. To show this is true, let's generate some Poisson-distributed data and model using both LM and GLM with a log-link: ```{r} set.seed(87) -glmdat <- data.frame(x1 = runif(50), y1 = rpois(50, 10), y2 = rpois(50, 50), y3 = runif(50)) +glmdat <- data.frame( + x1 = runif(50), + y1 = rpois(50, 10), + y2 = rpois(50, 50), + y3 = runif(50) +) # LM summary(lm(y1 ~ y2 + x1, glmdat))$coefficients[2, 4] @@ -339,7 +344,7 @@ summary(glm(y1 ~ y2 + x1, "poisson", glmdat))$coefficients[2, 4] summary(glm(y2 ~ y1 + x1, "poisson", glmdat))$coefficients[2, 4] ``` -In the case of `lm` the *P*-value is identical regardless of the direction, and moreover is < 0.05, thus--depending on the outcome of the other claim--we might reject the model. +In the case of `lm`, the *P*-value is identical regardless of the direction, and moreover is < 0.05, thus--depending on the outcome of the other claim--we might reject the model. In contrast, when $y1$ and $y2$ are modeled as Poisson-distributed, the *P*-value is alternatingly < and >= 0.05. Thus, depending on how the claim is specified, we might or might not reject the model. A big difference! @@ -349,7 +354,7 @@ In contrast, when $y1$ and $y2$ are modeled as Poisson-distributed, the *P*-valu (2) We can remove that path from the basis set and instead specify it as a correlated error using `%~~%`. This circumvents the issue altogether but it may not make sense to assume both variables are generated by some underlying process; or -(3) We can conduct *both* tests and choose the most conservative (i.e., lowest) P**-value. +(3) We can conduct *both* tests and choose the most conservative (i.e., lowest) *P*-value. These options are returned by `summary` in the event the above scenario is identified in the SEM: ```{r, error = TRUE} @@ -391,4 +396,4 @@ Shipley, Bill. "Confirmatory path analysis in a generalized multilevel context." Shipley, Bill. "The AIC model selection method applied to path analytic models compared using ad‐separation test." Ecology 94.3 (2013): 560-564. -Lefcheck, Jonathan S. "piecewiseSEM: Piecewise structural equation modelling in r for ecology, evolution, and systematics." Methods in Ecology and Evolution 7.5 (2016): 573-579. \ No newline at end of file +Lefcheck, Jonathan S. "piecewiseSEM: Piecewise structural equation modeling in r for ecology, evolution, and systematics." Methods in Ecology and Evolution 7.5 (2016): 573-579. \ No newline at end of file diff --git a/04-Coefficients.Rmd b/04-Coefficients.Rmd index a1f812f..f7903eb 100644 --- a/04-Coefficients.Rmd +++ b/04-Coefficients.Rmd @@ -11,19 +11,19 @@ output: html_document Path (or regression) coefficients are the inferential engine behind structural equation modeling, and by extension all of linear regression. They relate changes in the dependent variable $y$ to changes in the independent variable $x$, and thus act as a measure of association. In fact, you may recall from the chapter on global estimation that, under specific circumstances, path coefficients can be expressed as (partial) correlations, a unitless measure of association that makes them excellent for comparisons. They also allow us to generate predictions for new values of $x$ and are therefore useful in testing and extrapolating model results. -We will consider two kinds of regression coefficients: unstandardized (or raw) coefficients, and standardized coefficients. +We will consider two kinds of regression coefficients: unstandardized (or raw) coefficients and standardized coefficients. Unstandardized coefficients are the default values returned by all statistical programs. In short, they reflect the expected (linear) change in the response with each unit change in the predictor. For a coefficient value $\beta = 0.5$, for example, a 1 unit change in $x$ there is, on average, an 0.5 unit change in $y$. In models with more than one independent variable (e.g., $x1$, $x2$, etc), the coefficient reflects the expected change in $y$ *given* the other variables in the model. This implies that the effect of one particular variable controls for the presence of other variables, generally by holding them constant at their mean. This is why such coefficients are referred to as *partial* regression coefficients, because they reflect the independent (or partial) contributions of any particular variable. -As an aside: one tricky aspect to interpretation involves transformations. When the log-transformation is applied, for example, the relationships between the variable are no longer linear. This means that we have to change our interpretation slightly. When $y$ is log-transformed, the coefficient $\beta$ is interpreted as a 1 unit change in $x$ leads to a $(exp(\beta) - 1) \times 100%$ change in $y$. Oppositely, when the independent variable $x$ is log-transformed, $\beta$ is interpreted as a 1% change in $x$ leads to a $\beta$ increase in $y$. Finally, when both are transformed, both are expressed in percentages: a 1% change in $x$ leads to a $(exp(\beta) - 1) \times 100%$ change in $y$. Transformations often confound intrepretation, so it is worth mentioning. +As an aside: one tricky aspect to interpretation involves transformations. When the log-transformation is applied, for example, the relationships between the variable are no longer linear. This means that we have to slightly change our interpretation. When $y$ is log-transformed, the coefficient $\beta$ is interpreted as a 1 unit change in $x$ leads to a $(exp(\beta) - 1) \times 100%$ change in $y$. Oppositely, when the independent variable $x$ is log-transformed, $\beta$ is interpreted as a 1% change in $x$ leads to a $\beta$ increase in $y$. Finally, when both are transformed, both are expressed in percentages: a 1% change in $x$ leads to a $(exp(\beta) - 1) \times 100%$ change in $y$. Transformations often confound interpretation, so it is worth mentioning. In contrast to raw coefficients, standardized coefficients are expressed in equivalent units, regardless of the original measurements. Often these are in units of standard deviations of the mean (scale standardization) but, as we shall see shortly, there are other possibilities. The goal of standardization is to increase _comparability_. In other words, the magnitude of standardized coefficients can be directly compared to make inferences about the relative strength of relationships. In SEM, it is often advised to report both unstandardized and standardized coefficients, because they present different and mutually exclusive information. Unstandardized coefficients contain information about both the variance *and* the mean, and thus are essential for prediction. Along these lines, they are also useful for comparing across models fit to the same variables, but using different sets of data. Because the most common form of standardization concerns scaling by the sample standard deviations, data derived from different sources (i.e., different datasets) have different sample variances and their standardized coefficients are not immediately comparable. -Unstandardized coefficients also reflect the phenomenon of interest in straightforward language. Imagine telling someone that 1 standard deviation change in nutrient input levels would result in a 6 standard deviation change in water quality. That might seem impressive until it becomes clear that the size of the dataset has reduced the sample variance, and the absoluty relationship reveals only a very tiny change in water quality with each unit change in nutrient levels. Not so impressive anymore. +Unstandardized coefficients also reflect the phenomenon of interest in straightforward language. Imagine telling someone that 1 standard deviation change in nutrient input levels would result in a 6 standard deviation change in water quality. That might seem impressive until it becomes clear that the size of the dataset has reduced the sample variance, and the absolute relationship reveals only a very tiny change in water quality with each unit change in nutrient levels. Not so impressive anymore. Standardized effects, on the other hand, are useful for comparing the relative magnitude of change associated with different paths in the same model (i.e., using the same dataset). Care should be taken *not* to interpret these relationships as the 'proportion of variance explained'--for example, a larger standardized coefficient does not explain more variance in the response than a smaller standardized coefficient--but rather in terms of relative influence on the mean of the response. @@ -37,7 +37,7 @@ Thus, both standardized and unstandardized coefficients have their place in stru The most typical implementation of standardization is placing the coefficients in units of standard deviations of the mean. This is accomplished by scaling the coefficient $\beta$ by the ratio of the standard deviation of $x$ over the standard deviation of $y$: - $$b = \beta*\left( \frac{sd_x}{sd_y} \right)$$ + $$b = \beta* \times \left( \frac{sd_x}{sd_y} \right)$$ Thus lending this standardization its name. This coefficient has the interpretation that, for a 1 standard deviation change in $x$, we expect a $b$ unit standard deviation change in $y$. @@ -77,21 +77,21 @@ standardizedsolution(xy_sem) summary(xy_sem, standardize = T) ``` -In all 3 cases, we have achieved a scale-standardized coefficient of $b = 0.095$. Thus, a 1 SD change in $x$ would result in a 0.095 SD change in $y$. +In all three cases, we have achieved a scale-standardized coefficient of $b = 0.095$. Thus, a 1 SD change in $x$ would result in a 0.095 SD change in $y$. ## Range Standardization An alternative to scale standardization is *relevant range* standardization. This approach scales the coefficients over some relevant range. Typically this is the full range of the data, in which case $\beta$ can be standardized as follows: - $$b = \beta * \frac{max(x) - min(x)}{max(y) - min(y)}$$ + $$b = \beta \times \frac{\max(x) - \min(x)}{\max(y) - \min(y)}$$ The interpretation for the coefficient would then be the expected proportional shift in $y$ along its range given a full shift along the range of $x$. At first, this might seem like a strange form of standardization, but it has some powerful applications. For example, consider a binary predictor: 0 or 1. In such a case, the relevant range-standardized coefficient is the expected shift in $y$ given the transition from one state (0) to another (1). Or consider a management target such as decreasing nutrient runoff by 10%. Would reducing fertilizer application by 10% of its range yield a 10% reduction in runoff? Such expressions are necessarily the currency of applied questions. -Perhaps the best application of relevant ranges is in comparing coefficients within a model: rather than dealing in somewhat esoteric quantities of standard deviations, relevant range standardization simply asks which variable causes a greater shift in $y$ along its range. This is a much more digestable concept to most scientists. It may even provide a more fair comparison across the same paths fit to different datasets, if the ranges are roughly similar and/or encompassed in the others. Restricting the range may be a useful solution for comparing coefficients across models fit to different data, as long as the range doesn't extend beyond that observed in any particular dataset. +Perhaps the best application of relevant ranges is in comparing coefficients within a model: rather than dealing in somewhat esoteric quantities of standard deviations, relevant range standardization simply asks which variable causes a greater shift in $y$ along its range. This is a much more digestible concept to most scientists. It may even provide a more fair comparison across the same paths fit to different datasets if the ranges are roughly similar and/or encompassed in the others. Restricting the range may be a useful solution for comparing coefficients across models fit to different data, as long as the range doesn't extend beyond that observed in any particular dataset. -For a worked example, we have now entered fully into the realm of *piecewiseSEM*--it does not appear as if *lavaan* has integrated this functionality a of yet. Let's attempt to scale the results by hand, then compare to the output from `coefs` with the argument `standardize = "range"`: +For a worked example, we have now entered fully into the realm of *piecewiseSEM*--it does not appear as if *lavaan* has integrated this functionality as of yet. Let's attempt to scale the results by hand, then compare to the output from `coefs` with the argument `standardize = "range"`: ```{r} #by hand @@ -117,7 +117,7 @@ x <- x[order(x)] y <- c(rbinom(10, 1, 0.8), rbinom(10, 1, 0.2)) -glm_model <- glm(y ~ x, data = data.frame(x = x, y = y), "binomial") +glm_model <- glm(y ~ x, data = data.frame(x = x, y = y), family = "binomial") xpred <- seq(min(x), max(x), 0.01) @@ -128,7 +128,7 @@ plot(x, y) lines(xpred, ypred) ``` -Clearly these data are not linear, and modeling them as such would ignore the underlying process. Instead, as you can see, we fit them to a binomial distribution using a generalized linear model (GLM). +Clearly, these data are not linear, and modeling them as such would ignore the underlying process. Instead, as you can see, we fit them to a binomial distribution using a generalized linear model (GLM). GLMs consist of three parts: (1) the random component, or the expected values of the response based on their underlying distribution, (2) the systematic component that represents the linear combination of predictors, and (3) the link function, which links the expected values of the response (random component) to the linear combination of predictors (systematic component). @@ -144,11 +144,11 @@ Note how the line is no longer sigmoidal, but straight! For binomial responses, there are two kinds of link functions: logit and probit. We'll focus on the logit link for now because it's more common. With this link, the coefficients are in units of logits or the *log odds ratio*, which reflect the log of the probability of observing an outcome (1) relative to the probability of not observing it (0). -Often these coefficients are reverted to just the odds ratio by taking the exponent, which yields the proportional change in the probablity observing one outcome (1) with a unit change change in the predictor. +Often these coefficients are reverted to just the odds ratio by taking the exponent, which yields the proportional change in the probability observing one outcome (1) with a unit change change in the predictor. Say, for example, we have a coefficient $\beta = -0.12$. A 1 unit change in $x$ would result in $exp(-0.12) = 0.88 \times 100%$ or 88% reduction in the odds of observing the outcome. -The problem is that (log) odds ratios themselves are not comparable across models, and it's unclear how they might be standardized, since the coefficient is on the link (linear) scale, while the only variance we can compute is from the raw data, which is on the non-linear scale. Thus, we need to find some sway to obtain estimates of variance on the same scale as the coefficient. +The problem is that (log) odds ratios themselves are not comparable across models, and it's unclear how they might be standardized, since the coefficient is on the link (linear) scale, while the only variance we can compute is from the raw data, which is on the non-linear scale. Thus, we need to find some way to obtain estimates of variance on the same scale as the coefficient. One approach is to consider that for every value of $x$, there is an underlying probability distribution of observing a 0 or a 1 for $y$. The mean of these distributions is where a particular outcome is *most* likely. Let's say at low values of $x$ we observe $y = 0$, at at high values of $x$ we observe $y = 1$. If we order $x$, the mean probabilities give rise to a linear increase in observing $y = 1$ with increasing $x$. Here is an illustration of this phenomenon (from Long 1997): @@ -170,13 +170,13 @@ The variance in $y^*$ is the sum of the variance of the predictions (on the line The square-root of this quantity gives the standard deviation of $SD_{y^*}$ on the linear scale for use in scale standardization, or alternately, the range of $y^*$ to use in relevant range standardization. -There is an alternate method to the 'latent theoretic approach', which relies on the proportion of variance explained, or $R^2$. Here, we can express the $R^2$ as the variance of the predicted values (on the non-linear scale) over the variance of the observed values (also on the non-linear scale): +There is an alternative method to the 'latent theoretic approach', which relies on the proportion of variance explained, or $R^2$. Here, we can express the $R^2$ as the variance of the predicted values (on the non-linear scale) over the variance of the observed values (also on the non-linear scale): $$R^2 = \frac{\sigma_{\hat{y}}^2}{\sigma_{y}^2}$$ We can obtain the variance of the observed values as the variance of the predicted values (on the linear scale) over the total explained variance of $R^2$. The standard deviation, of course, is the square-root of this variance. -This method, called the *observation-empirical approach*, does not require the acknowledgement of any latent variables or theoretical error variances, but does require an acceptance of this is a valid measurement of $R^2$ (which some consider it not, as GLM estimation is based on deviance, not variance, and thus this statistic is not equivalent). It also does not provide a measure of the range of $y$ although we can assume, based on sampling theory, that is $6 * \sigma_{y}$. +This method, called the *observation-empirical approach*, does not require the acknowledgment of any latent variables or theoretical error variances, but does require an acceptance of this as a valid measurement of $R^2$ (some consider it not, as GLM estimation is based on deviance, not variance, and thus this statistic is not equivalent). It also does not provide a measure of the range of $y$ although we can assume, based on sampling theory, that is $6 * \sigma_{y}$. Let's revisit our earlier GLM example and construct standardized coefficients: ```{r} @@ -206,11 +206,11 @@ coefs(glm_model, standardize.type = "Menard.OE"); beta_oe We see that both approaches produce coefficients, and they are the same as returned by the `coefs` function in *piecewiseSEM* (with the appropriate argument). -You'll note that the observation-empirical approach yields a smaller coefficient than the latent-theoretic. This is because the former approach is influenced by the fact that it is based on the relationship between a linear approximation (predictions) of a non-linear variable (raw values), introducing a loss of information. The *latent theoretic approach* also suffers from a loss of information from use of a distribution-specific but theoretically-derived error variance, which may or may not approach the true error variance (which is unknowable). Either way, both kinds of standardization are not without their drawbacks, but both provide potentially useful information in being able to compare linear and now *linearized* standardized coefficients. +You'll note that the observation-empirical approach yields a smaller coefficient than the latent-theoretic. This is because the former approach is influenced by the fact that it is based on the relationship between a linear approximation (predictions) of a non-linear variable (raw values), introducing a loss of information. The *latent theoretic approach* also suffers from a loss of information by using a distribution-specific but theoretically-derived error variance, which may or may not approach the true error variance (which is unknowable). Either way, both kinds of standardization are not without their drawbacks, but both provide potentially useful information in being able to compare linear and, now, *linearized* standardized coefficients. ## Scaling to Other Non-Normal Distributions -As it turns out, the latent-theoretic approach has one further benefit: we can extend it to other distributions (as they all have their own described theoretical error variances), and mixed-effects models that introduce another source of random variation. +As it turns out, the latent-theoretic approach has one further benefit: we can extend it to other distributions (as they all have their own described theoretical error variances) and mixed-effects models that introduce another source of random variation. [content to come]