# How to calculate an equivalence test, and power for an equivalence test, for a fixed effect in a mixed effects model using R and PANGEA

## Introduction to a two-part problem

Calculating equivalence test (TOST) power for fixed effects in a mixed model can be compartmentalized into two independent problems. First, we need a method for calculating TOST power from the non-centrality parameters (ncp) and degrees of freedom (df) of a test. TOSTER does not currently take this approach when calculating power, but it is straightforward to recast the calculations in terms of ncps and dfs based on formulas provided in the literature.

Once a method for calculating power based on ncp and df has been identified, the problem of calculating power reduces to a problem of identifying an appropriate ncp and df for the expected mixed model effect. This is by no means trivial, as calculations of ncp and df can become quite complicated with multiple random factors in the model. Fortunately, the free and open source power calculator PANGEA can be used to extract the ncp and df for a wide array of mixed model designs. We can use PANGEA to extract the ncp and df for both tests of a TOST procedure, given some equivalence bounds. From there, we can simply plug these values into our formula for calculating power since, at this point, power for mixed and fixed designs is calculated in the same way.

## Part 1: Calculating power for a (one-sample) equivalence test using noncentrality parameters (ncp)

Say we want to calculate the power of an equivalence test for a one-sample t-test with the following parameters:

es <- 0            # Assumed effect size (alternative hypothesis). Set to 0 to assume that there is no true effect.
s <- 1             # Standard deviation. When set to 1, "es" contains values on Cohen's d scale.
n <- 100           # Number of subjects.
se <- (s/sqrt(n))  # Standard error.
df <- n-1          # degrees of freedom.
alpha <- 0.05      # Type I error rate.

If we wanted to calculate power for a traditional two-sided null hypothesis test using noncentrality parameters and degrees of freedom, we could set up the analysis in the following way (from https://www.cyclismo.org/tutorial/R/power.html):

ncp <- es/se                    # Notice that the ncp is just a t value: the t value on which we would expect the t-distribution to be centered, given some difference between the groups. If the ncp is 0, we assume a central t-distribution.
crit_t <- qt(1-alpha/2, df=df)  # alpha/2 denotes a two-sided t test, and is the same as specifying alpha=0.025.

typeII <- pt(crit_t, df=df, ncp=ncp) - pt(-crit_t, df=df, ncp=ncp)   # The type II error of the test.
power <- 1-(pt(crit_t,df=df,ncp=ncp) - pt(-crit_t, df=df, ncp=ncp))  # The power of the test, which equals 1-typeII
paste("Power =", power)
##  "Power = 0.05"

We can verify that this calculation is correct by comparing it against the results of the power.t.test() function in base R and the pwr.t.test function in the “pwr” package. When es = 0, power will reduce to the type I error rate (in reality we have no power, since there is no true effect to detect). You can change the value of es to see that the calculations are identical regardless of effect size.

power.t.test(n = n, delta = es, sd = s, sig.level = alpha, type = "one", alternative = "two.sided", strict = TRUE)  # The strict argument must be set to TRUE for this test to calculate power correctly for the two-sided case.
##
##      One-sample t test power calculation
##
##               n = 100
##           delta = 0
##              sd = 1
##       sig.level = 0.05
##           power = 0.05
##     alternative = two.sided
pwr::pwr.t.test(n = n, d = es/s, sig.level = alpha, type = "one", alternative = "two.sided")
##
##      One-sample t test power calculation
##
##               n = 100
##               d = 0
##       sig.level = 0.05
##           power = 0.05
##     alternative = two.sided
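The agreement between the ncp-based formula and power.t.test() holds for nonzero effect sizes as well. A minimal sketch, looping over a few illustrative values of es:

```r
# Sketch: verify the ncp-based power formula against power.t.test()
# for several assumed effect sizes (es), not just es = 0.
n <- 100; s <- 1; alpha <- 0.05
df <- n - 1
se <- s / sqrt(n)
crit_t <- qt(1 - alpha/2, df = df)
for (es in c(0, 0.1, 0.2, 0.5)) {
  power_ncp <- 1 - (pt(crit_t, df = df, ncp = es/se) - pt(-crit_t, df = df, ncp = es/se))
  power_ref <- power.t.test(n = n, delta = es, sd = s, sig.level = alpha,
                            type = "one.sample", alternative = "two.sided",
                            strict = TRUE)$power
  stopifnot(isTRUE(all.equal(power_ncp, power_ref)))  # identical up to numerical tolerance
}
```

If the two calculations ever diverged, the stopifnot() call would throw an error; running the loop silently confirms the equivalence of the two approaches.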

Now let us extend the use of noncentrality parameters to the case of the two-one-sided-tests (TOST) used for equivalence testing. First we must decide on a smallest effect size of interest (SESOI) to use as bounds for the equivalence test. Let us assume that we consider a group difference of 0.3, in either direction, to be the smallest effect we care about. Because the population standard deviation s is 1, we can choose whether to think of the group difference as a raw difference or a difference in Cohen’s d. The two will be the same for this example since d = es/s, and s is 1, so d = es/1 = es.

bound_l <- -0.3  # Smallest negative effect size of interest (TOST null hypothesis)
bound_u <-  0.3  # Smallest positive effect size of interest (TOST null hypothesis)

Having defined our bounds, we can calculate the ncp for the tests against the upper and lower bounds (the assumed effect size for each test is the difference between the assumed (null) effect and the bound). We then use these ncp values to calculate the power for the equivalence test (Chow, Shao, and Wang 2008, p. 55, 3rd equation):

ncp_l <- (es + bound_l) / se
ncp_u <- (es + bound_u) / se
crit_t <- qt(alpha, df, lower.tail = FALSE)

power <- ifelse(1 - pt(-crit_t, df, ncp_l, lower.tail = FALSE) - pt(crit_t, df, ncp_u, lower.tail = TRUE) > 0,
1 - pt(-crit_t, df, ncp_l, lower.tail = FALSE) - pt(crit_t, df, ncp_u, lower.tail = TRUE),
0)  # Derived from Chow, Shao & Wang (2008), page 55, third formula, slightly tweaked to allow different SESOI for upper and lower bound.

paste("Power =", power)
##  "Power = 0.817975007468883"

We can verify our calculations against the results from the power_t_TOST() function in TOSTER. The results should be identical.

TOSTER::power_t_TOST(alpha = alpha, n = n, low_eqbound = bound_l, high_eqbound = bound_u, type = "one.sample")
##
##       TOST power calculation
##
##           power = 0.817975
##            beta = 0.182025
##           alpha = 0.05
##               n = 100
##           delta = 0
##              sd = 1
##          bounds = -0.3, 0.3

## Part 2: Extending the ncp approach to mixed designs using PANGEA

Having resolved the issue of how to calculate equivalence test power from the degrees of freedom and noncentrality parameter, we must now turn to the issue of how to calculate noncentrality parameters in the first place. The ncp equals the effect size divided by the effect standard error, or:

$$ncp=\frac{\theta}{\sigma/\sqrt{n}}$$

Where $$\theta$$ is the assumed effect size in the population, $$\sigma$$ is the assumed standard deviation in the population, and $$n$$ is the number of observations.
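Applied to the one-sample setup from Part 1, this formula is a one-liner (the values below are illustrative, using a nonzero effect for concreteness):

```r
theta <- 0.3                         # assumed population effect size
sigma <- 1                           # assumed population standard deviation
n     <- 100                         # number of observations
ncp   <- theta / (sigma / sqrt(n))   # the ncp is the expected t value
ncp
##  3
```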

For simple designs, this formula is easily calculated (just calculate the t value of the assumed effect). However, for mixed model designs, the details of the calculation become more complicated due to the partitioning of variance among the random variables and their interactions. In practice, the details of the calculation will depend on the specifics of the design and the number of random variables involved. For example, the ncp for a two-group fully crossed design, where the effect is assumed to vary randomly across participants and stimuli, is defined as:

$$ncp=\frac{d}{2\sqrt{\frac{V_{P\times C}}{p}+\frac{V_{S\times C}}{q}+\frac{V_E}{2pq}}}$$

(see Table A1 in Westfall, Kenny, and Judd 2014 for a definition of the formula parameters). Fortunately, the Shiny application PANGEA can take care of the ncp calculation for us. All we need to do is specify the planned design and the random variables, and make some assumptions about the variance partitioning coefficients (“VPC,” Westfall, Kenny, and Judd 2014) involved. PANGEA will then return the ncp and Welch-Satterthwaite approximated degrees of freedom. It will also return the power for a selected fixed effect given the specified design, based on a t-test for the effect. This approach can in principle be extended to tests of any fixed linear contrast.

In order to plug the information from PANGEA into the equivalence test power formula, we need to perform two tests in PANGEA: one test against each of the bounds we have defined. For example, if we define d = -0.5 and d = 0.5 as our lower and upper bound, respectively, then we need to compute the ncp in PANGEA for one test where “Effect size (d)” is set to theta + bound_l, and for another test where “Effect size (d)” is set to theta + bound_u. We can then simply plug the ncp and df reported by PANGEA into the power calculation defined in Part 1 to get equivalence test power for a fixed effect in our mixed model design.

As an example of how to execute the whole procedure, suppose we would like to calculate power for the fully crossed design mentioned above. First, let us assume the following input parameters for the tests (see the appendix of Westfall, Kenny, and Judd 2014 for details on how Cohen’s d is defined in this design):

d <- 0    # assumed effect size in units of Cohen's d, using a joint standard deviation over all variance components.
bound_l <- -0.3  # Smallest negative effect size of interest in same units as d (null hypothesis)
bound_u <-  0.3  # Smallest positive effect size of interest in same units as d (null hypothesis)
p <- 100  # number of subjects
q <- 100  # number of stimuli
alpha <- 0.05  # significance threshold
## Assumed variance component ratios, following default partition procedure described in Westfall (2014, 2015):
Ve <- 0.3
Vpxc <- 0.1
Vsxc <- 0.1

We could obtain the ncp and df via the following formulas:

se <- 2 * sqrt( Vpxc/p + Vsxc/q + Ve/(2*p*q) )  # standard error of the estimate (information about standard deviation contained within d)
ncp <- d / se  # calculate t value/non-centrality parameter for the assumed effect size
df <- (q*Vpxc + p*Vsxc + Ve)^2 / ( (q*Vpxc+Ve)^2/(p-1) + (p*Vsxc+Ve)^2/(q-1) + Ve^2/((p-1)*(q-1)))  # Welch-Satterthwaite approximation

ncp_l <- (d + bound_l) / se  # calculate t value/non-centrality parameter corresponding to the smallest negative effect size of interest
ncp_u <- (d + bound_u) / se  # calculate t value/non-centrality parameter corresponding to the smallest positive effect size of interest

We could also extract these values from PANGEA. In either case, we then take the df and ncp and calculate power in exactly the same way as we would for the simple t-test*.

crit_t <- qt(alpha, df, lower.tail = FALSE)
power <- ifelse(1 - pt(-crit_t, df, ncp_l, lower.tail = FALSE) - pt(crit_t, df, ncp_u, lower.tail = TRUE) > 0,
1 - pt(-crit_t, df, ncp_l, lower.tail = FALSE) - pt(crit_t, df, ncp_u, lower.tail = TRUE),
0)  # Derived from Chow, Shao & Wang (2008), page 55, third formula, slightly tweaked to allow different SESOI for upper and lower bound.

power
##  0.9080016

*OBS! Note that the ncp calculated using the formulas above does not perfectly correspond to the ncp calculated by PANGEA! The numbers should be identical. Neither I nor Jake Westfall is currently sure what the source of this divergence is. I have done some informal testing which suggests that the two methods produce very similar numbers, which means that either method will probably be quite accurate for practical purposes. However, I cannot guarantee that there are not some conditions where the two methods might produce very different numbers.

## Calculating an equivalence test for an estimated fixed effect in a mixed model

Having derived a solution for how to calculate the power of a mixed model fixed effect equivalence test, we may reasonably ask how we would calculate the actual equivalence test once the data are in. A relatively straightforward approach is to fit the model using the lme4 package in R. From this model we can extract the effect size estimate and Welch-Satterthwaite degrees of freedom for the fixed effect of interest. We then calculate the TOST in the same way as we would for a simple t-test with no random effects assumed.

Suppose we have the following dataset (counter-balanced design, no true effect), and a SESOI of raw effect size = 0.5:

set.seed(3)

# Dataset
data <- data.frame("participant" = as.factor(rep(1:10, each = 10)),
                   "condition" = as.factor(rep(c(0,1,1,0), 5, each = 5)),
                   "stimulus" = as.factor(rep(1:10, 10)),
                   "response" = rnorm(100))
summary(data)
##   participant condition    stimulus      response
##  1      :10   0:50      1      :10   Min.   :-2.26540
##  2      :10   1:50      2      :10   1st Qu.:-0.72254
##  3      :10             3      :10   Median : 0.03419
##  4      :10             4      :10   Mean   : 0.01104
##  5      :10             5      :10   3rd Qu.: 0.73927
##  6      :10             6      :10   Max.   : 1.73554
##  (Other):40             (Other):40

# Equivalence bounds
bound_u <-  0.5  # Upper equivalence bound
bound_l <- -0.5  # Lower equivalence bound

Now we fit a linear mixed model to the data, assuming a fixed effect of condition, and random effects of participants and stimuli:

library(lme4)
library(lmerTest)

fm <- lmer(response ~ condition + (1 + condition | participant) + (1 + condition | stimulus), data)
## boundary (singular) fit: see help('isSingular')
summary(fm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula:
## response ~ condition + (1 + condition | participant) + (1 + condition |
##     stimulus)
##    Data: data
##
## REML criterion at convergence: 252.8
##
## Scaled residuals:
##      Min       1Q   Median       3Q      Max
## -2.38943 -0.77473  0.01916  0.77860  1.77988
##
## Random effects:
##  Groups      Name        Variance Std.Dev. Corr
##  participant (Intercept) 0.09554  0.3091
##              condition1  0.17918  0.4233   -0.90
##  stimulus    (Intercept) 0.00134  0.0366
##              condition1  0.02356  0.1535    1.00
##  Residual                0.64767  0.8048
## Number of obs: 100, groups:  participant, 10; stimulus, 10
##
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)
## (Intercept)   0.1316     0.1505  8.9493   0.874    0.405
## condition1   -0.2411     0.2149  8.6014  -1.122    0.292
##
## Correlation of Fixed Effects:
##            (Intr)
## condition1 -0.749
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')

The lmerTest package adds a t-test with a Welch-Satterthwaite approximation of the df for each fixed effect in our model (in this case, intercept and condition). This is the same t-test approach that PANGEA calculates power for. It is calculated directly from the estimate and standard error provided in the basic lmer model. There are at least three ways to calculate an equivalence test from the data provided in this model.

First, we can use the contest1D function of the lmerTest package to perform tests centered on the lower and upper equivalence bounds, using the rhs option:

lower <- contest1D(fm, c(0, 1), confint=TRUE, rhs=bound_l)  # get t value for test against lower bound
upper <- contest1D(fm, c(0, 1), confint=TRUE, rhs=bound_u)  # get t value for test against upper bound
lower
##     Estimate Std. Error       df  t value      lower     upper  Pr(>|t|)
## 1 -0.2410955   0.214897 8.601367 1.204784 -0.7306825 0.2484914 0.2603875
upper
##     Estimate Std. Error       df   t value      lower     upper    Pr(>|t|)
## 1 -0.2410955   0.214897 8.601367 -3.448608 -0.7306825 0.2484914 0.007803962

The test provided by contest1D is two-sided, but we can easily recalculate the required one-sided tests from the t values provided (or simply divide the p-values provided by contest1D by 2):

pt(lower$`t value`, lower$df, lower.tail = FALSE)  # test against lower bound
##  0.1301938
lower$`Pr(>|t|)`/2  # test against lower bound
##  0.1301938
pt(upper$`t value`, upper$df, lower.tail = TRUE)  # test against upper bound
##  0.003901981
upper$`Pr(>|t|)`/2  # test against upper bound
##  0.003901981

If both of these tests are significant, so is the test for equivalence. In this case, the test against the lower bound is not significant, which means that we cannot reject the hypothesis that the true value is equal to or more extreme than bound_l; in other words, we failed to establish equivalence.

Instead of relying on lmerTest, we can also calculate the equivalence test directly from the estimated effect, standard error, and degrees of freedom in the basic lme4 model. All we need to do is subtract each bound from the effect size when calculating the t value:

# Reproduce the tests without the use of rhs, by subtracting the bounds from the estimated effect
pt((lower$Estimate-bound_l)/lower$`Std. Error`, lower$df, lower.tail = FALSE)  # test against lower bound
##  0.1301938
pt((upper$Estimate-bound_u)/upper$`Std. Error`, upper$df, lower.tail = TRUE)  # test against upper bound
##  0.003901981

The result should be identical to the solution provided by lmerTest.
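This second approach can be wrapped in a small helper that works for any estimated fixed effect, given its estimate, standard error, and df. This is just an illustrative sketch (tost_fixed_effect is a hypothetical name, not a function from lme4, lmerTest, or TOSTER):

```r
# Hypothetical helper: one-sided TOST p-values for an estimated fixed effect.
tost_fixed_effect <- function(estimate, se, df, bound_l, bound_u) {
  p_lower <- pt((estimate - bound_l) / se, df, lower.tail = FALSE)  # test against lower bound
  p_upper <- pt((estimate - bound_u) / se, df, lower.tail = TRUE)   # test against upper bound
  c(p_lower = p_lower, p_upper = p_upper)
}

# Estimate, standard error, and df for condition1, taken from the model output above:
tost_fixed_effect(-0.2410955, 0.214897, 8.601367, bound_l = -0.5, bound_u = 0.5)
# p_lower ~ 0.130, p_upper ~ 0.0039, matching the values calculated above
```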

Finally, we can compare our upper and lower equivalence bounds to the 90% (or, more generally, 1 - 2*alpha) confidence interval of the effect estimate, again using contest1D:

contest1D(fm, c(0, 1), confint=TRUE, rhs=bound_l, level = 0.9)
##     Estimate Std. Error       df  t value      lower     upper  Pr(>|t|)
## 1 -0.2410955   0.214897 8.601367 1.204784 -0.6371148 0.1549237 0.2603875

If the lower and upper limits of the 90% confidence interval fall within the bounds specified by our SESOI, the equivalence test is significant.
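This check is easy to automate. A minimal sketch (ci_within_bounds is our own illustrative name; the CI limits are taken from the contest1D output above):

```r
# Sketch: declare equivalence when the whole (1 - 2*alpha) CI lies inside the bounds.
ci_within_bounds <- function(ci_lower, ci_upper, bound_l, bound_u) {
  ci_lower > bound_l && ci_upper < bound_u
}

ci_within_bounds(-0.6371148, 0.1549237, bound_l = -0.5, bound_u = 0.5)
##  FALSE
```

As before, the lower CI limit falls outside bound_l, so we cannot declare equivalence.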