1  Simulation example Q.6

In this simulation example, the highlighted data analysis pitfall is sampling variability. Specifically, it shows how sampling variability influences the results of a data analysis.

1.1 Data-generating process

The data-generating process is the specific process that generates the observed data. In reality, this process is unknown and understanding it is, in a general sense, the ultimate aim of science. Here, we assume this process is known and describe it via a directed acyclic graph (DAG) in Fig. 1.1. In this DAG,1 indoor air temperature \((\text{T})\) and biological sex \((\text{S})\) influence thermal sensation vote \((\text{V})\) both directly and indirectly, passing through clothing insulation \((\text{C})\).

Code
dag_coords <-
  data.frame(name = c('T', 'C', 'S', 'V'),
             x    = c(1, 2, 3, 2),
             y    = c(1.2, 1.2, 1.2, 1))

DAG.06 <-
  dagify(C ~ S + T,
         V ~ S + C + T,
         coords = dag_coords)

ggplot(data = DAG.06, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_text(colour='black', size = 8, parse = TRUE,
                family = c('mono'),
                label = c(expression(bold(C)), expression(bold(S)),  expression(bold(T)), expression(bold(V)))) +
  geom_dag_edges(arrow_directed = grid::arrow(length = grid::unit(8, 'pt'), type = 'open'),
                 family = c('mono'), 
                 fontface = c('bold')) + 
  coord_cartesian(xlim = c(1, 3), ylim = c(1, 1.2))  +
  theme_dag()
Fig. 1.1. Graphical representation via DAG of the data-generating process. Indoor air temperature \((\text{T})\) and biological sex \((\text{S})\) influence thermal sensation vote \((\text{V})\) both directly and indirectly, passing through clothing insulation \((\text{C})\).

The DAG in Fig. 1.1 can be written as:

  • \(C \sim f_{C}(S, T)\), read as ‘clothing insulation \((\text{C})\) is some function of biological sex \((\text{S})\) and indoor air temperature \((\text{T})\)’.
  • \(V \sim f_{V}(S, T, C)\), read as ‘thermal sensation vote \((\text{V})\) is some function of biological sex \((\text{S})\), indoor air temperature \((\text{T})\) and clothing insulation \((\text{C})\)’.

1.2 Synthetic data set

In a DAG, an omitted arrow between two variables signifies the absence of direct influence of one variable on another. In contrast, a present arrow does not provide any information about the magnitude of the effect or its functional form. As such, all relationships are qualitative. However, to generate synthetic data, we need to quantitatively define the relationship between the variables. These relationships are defined within the custom function data.sim_Q06.ord(). This function takes as input the sample size N and generates synthetic data according to the DAG in Fig. 1.1.

data.sim_Q06.ord <- function(N) {
#Simulate biological sex
  S <- factor(sample(1:2, size = N, replace = TRUE)) #S = 1 'male'; S = 2 'female'
#Simulate indoor air temperature
  T <- rnorm(N, mean = 23 , sd = 2)
#Simulate clothing insulation
  C <- rlnormTrunc(N,
                   meanlog = ifelse(S == '1',
                                    0.25 + 0.00 - 0.04*T,  #male
                                    0.25 + 0.03 - 0.04*T), #female
                   sdlog = 0.15, min = 0.3, max = 2)
#Simulate thermal sensation vote
  V <- sim_TSV.ord(n = N, sex = S, temp = T, clo = C, b_int = 0, b_sex = c(0,-0.16), b_temp = 0.37, b_clo = 0.88)
  return(data.table(S, T, C, V)) #return a data.table with the simulated values
}

Specifically:

  • Biological sex \((\text{S})\) is randomly sampled with replacement from a vector of factors 1 (i.e., male) and 2 (i.e., female).

  • Indoor air temperature \((\text{T})\) is randomly sampled from a normal distribution. The mean is set to 23°C (i.e., mean = 23) and the standard deviation to 2°C (i.e., sd = 2).

  • Clothing insulation \((\text{C})\) is randomly sampled from a truncated lognormal distribution. The mean on the log scale (i.e., meanlog) is a linear function of biological sex \((\text{S})\) and indoor air temperature \((\text{T})\): for males (i.e., S = 1) it is meanlog = 0.25 - 0.04*T, while for females (i.e., S = 2) it is meanlog = 0.25 + 0.03 - 0.04*T. The standard deviation on the log scale is constant, fixed at 0.15 (i.e., sdlog = 0.15). The parameters min and max are the assumed lower and upper bounds for clothing, fixed at 0.3 clo (i.e., min = 0.3) and 2 clo (i.e., max = 2), respectively.

  • Thermal sensation votes \((\text{V})\) are generated from the custom function sim_TSV.ord(), which is a function of biological sex (sex), indoor air temperature (temp), and clothing insulation (clo). The term b_int is the coefficient for the intercept (set to 0), b_sex is the coefficient for biological sex (set to 0 for males and -0.16 for females), b_temp is the coefficient for indoor air temperature (set to 0.37), and b_clo is the coefficient for clothing insulation (set to 0.88). More information on the custom function sim_TSV.ord() can be found in Appendix A.1.
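The actual implementation of sim_TSV.ord() is given in Appendix A.1; as a rough, hypothetical sketch (the threshold values tau below are illustrative assumptions, not the values used in the appendix), an ordinal vote can be drawn from a cumulative-logit (proportional-odds) model as follows:

```r
#Hypothetical sketch of sim_TSV.ord(); the thresholds tau are illustrative
#assumptions, not the values defined in Appendix A.1
sim_TSV.ord_sketch <- function(n, sex, temp, clo, b_int, b_sex, b_temp, b_clo,
                               tau = c(6.5, 7.5, 8.5, 10, 11, 12)) {
#Linear predictor
  eta <- b_int + b_sex[as.integer(sex)] + b_temp*temp + b_clo*clo
#Cumulative probabilities: Pr(V <= k) = logistic(tau_k - eta)
  gamma <- plogis(outer(eta, tau, function(e, t) t - e))
#Category probabilities as successive differences of the cumulative ones
  p <- cbind(gamma, 1) - cbind(0, gamma)
#Draw one vote per observation from the seven categories
  apply(p, 1, function(pr) sample(1:7, size = 1, prob = pr))
}
```

Called with the parameter values used in data.sim_Q06.ord(), this sketch returns a vector of n integer votes between 1 and 7.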

From this data-generating mechanism, we simulated the target population, which consists of one million observations.

set.seed(2023) #Set random number for reproducibility
#Simulate the population
Q06_population <- data.sim_Q06.ord(N = 1e6)
#Transform V to an ordered factor
Q06_population[, V := factor(V, levels = c('1','2','3','4','5','6','7'), ordered = TRUE)]
#View the data frame
Q06_population
              S        T         C     V
         <fctr>    <num>     <num> <ord>
      1:      1 19.35834 0.5173312     3
      2:      2 20.17534 0.6351932     3
      3:      1 17.25539 0.6350222     4
      4:      1 22.32519 0.5410619     4
      5:      2 23.51410 0.5319473     6
     ---                                
 999996:      1 23.16175 0.5811199     7
 999997:      1 20.21810 0.4694916     4
 999998:      2 21.43980 0.5467828     3
 999999:      1 21.73996 0.4982545     4
1000000:      1 20.66438 0.6083867     3

From this population, we obtained three thousand data sets using simple random sampling: a thousand data sets for each of three sample sizes (30, 300 and 900).

n_sims <- 1e3 #Number of data sets to simulate

#Set random number for reproducibility
set.seed(2023)  
#Generate a vector of random numbers for reproducibility
random.seed <- sample(1:1e5, size = n_sims, replace = FALSE)
#Sample one data set of 30 observations
set.seed(random.seed[1])
sample.random_3e1 <- 
  Q06_population |> 
  slice_sample(n = 3e1) #take a simple random sample of size 30
#View the data frame
sample.random_3e1
         S        T         C     V
    <fctr>    <num>     <num> <ord>
 1:      2 21.09664 0.5414387     2
 2:      1 21.71421 0.5381658     4
 3:      2 23.74820 0.5389922     6
 4:      2 20.05361 0.5974996     4
 5:      1 20.72877 0.5577154     3
 6:      2 24.28257 0.6013693     5
 7:      1 24.92904 0.4196810     4
 8:      1 21.80931 0.6892414     4
 9:      1 21.82005 0.4485176     3
10:      1 26.24905 0.4085024     6
11:      2 27.07587 0.5773195     5
12:      2 23.37962 0.4209909     4
13:      2 24.04006 0.5938105     4
14:      2 24.71866 0.5924425     4
15:      1 26.38131 0.3865448     2
16:      2 20.23951 0.5525658     4
17:      1 25.08493 0.4105008     4
18:      2 21.63454 0.4554378     4
19:      2 22.27104 0.5191736     4
20:      2 23.71114 0.6124570     2
21:      2 28.60947 0.4721329     5
22:      2 22.36287 0.5316676     4
23:      1 20.99654 0.6277547     4
24:      1 23.20989 0.5144766     3
25:      1 21.83795 0.5868481     4
26:      1 22.51528 0.6097533     4
27:      1 23.08223 0.4722961     4
28:      2 23.65405 0.4397346     3
29:      1 26.82511 0.4086332     4
30:      1 21.13123 0.5078074     4
         S        T         C     V
#Sample one data set of 300 observations
set.seed(random.seed[1])
sample.random_3e2 <- 
  Q06_population |> 
  slice_sample(n = 3e2) #take a simple random sample of size 300
#View the data frame
sample.random_3e2
          S        T         C     V
     <fctr>    <num>     <num> <ord>
  1:      2 21.09664 0.5414387     2
  2:      1 21.71421 0.5381658     4
  3:      2 23.74820 0.5389922     6
  4:      2 20.05361 0.5974996     4
  5:      1 20.72877 0.5577154     3
 ---                                
296:      2 22.08967 0.7875628     4
297:      1 23.89738 0.3811996     4
298:      2 21.42748 0.5020544     3
299:      1 21.06568 0.4899180     4
300:      2 19.44767 0.5651442     1
#Sample one data set of 900 observations
set.seed(random.seed[1])
sample.random_9e2 <- 
  Q06_population |> 
  slice_sample(n = 9e2) #take a simple random sample of size 900
#View the data frame
sample.random_9e2
          S        T         C     V
     <fctr>    <num>     <num> <ord>
  1:      2 21.09664 0.5414387     2
  2:      1 21.71421 0.5381658     4
  3:      2 23.74820 0.5389922     6
  4:      2 20.05361 0.5974996     4
  5:      1 20.72877 0.5577154     3
 ---                                
896:      1 22.01716 0.6665051     4
897:      2 25.58033 0.4767277     4
898:      1 21.13034 0.6622662     4
899:      1 20.12777 0.6227197     5
900:      1 21.80400 0.4828775     4

A summary of descriptive statistics for all variables in the data sets is given below; of the three thousand data sets, only one for each of the three sample sizes (i.e., 30, 300 and 900) is presented. Detailed information can be found in Q.6 Summary.

summary(sample.random_3e1)
 S            T               C          V     
 1:15   Min.   :20.05   Min.   :0.3865   1: 0  
 2:15   1st Qu.:21.74   1st Qu.:0.4502   2: 3  
        Median :23.15   Median :0.5349   3: 4  
        Mean   :23.31   Mean   :0.5211   4:18  
        3rd Qu.:24.61   3rd Qu.:0.5910   5: 3  
        Max.   :28.61   Max.   :0.6892   6: 2  
                                         7: 0  
summary(sample.random_3e2)
 S             T               C          V      
 1:156   Min.   :17.45   Min.   :0.3393   1:  7  
 2:144   1st Qu.:21.74   1st Qu.:0.4571   2: 15  
         Median :23.07   Median :0.5126   3: 39  
         Mean   :23.03   Mean   :0.5195   4:157  
         3rd Qu.:24.31   3rd Qu.:0.5792   5: 54  
         Max.   :29.57   Max.   :0.7876   6: 18  
                                          7: 10  
summary(sample.random_9e2)
 S             T               C          V      
 1:447   Min.   :15.49   Min.   :0.3091   1: 14  
 2:453   1st Qu.:21.70   1st Qu.:0.4613   2: 29  
         Median :23.05   Median :0.5175   3:160  
         Mean   :23.02   Mean   :0.5242   4:437  
         3rd Qu.:24.36   3rd Qu.:0.5826   5:185  
         Max.   :29.57   Max.   :0.8901   6: 49  
                                          7: 26  

1.3 Data analysis

In a scientific context, data analysis should be performed to model the phenomenon under scrutiny, not merely to fit curves to data. The data are a means of obtaining information about that phenomenon to answer the specific question of interest. In this example, we assumed that the objective of the data analysis is to answer the following inferential question: ‘Is there a direct association between biological sex and thermal sensation vote?’.

1.3.1 Scientific model

The first step to address a scientific question is to define the scientific model. A scientific model is essentially a representation of a specific phenomenon that is being studied. To achieve this, we defined a DAG. A DAG generally consists of letter-and-arrow pictures summarising the researcher’s scientific knowledge about the phenomenon under study. Since we know the data-generating process (we defined it), the scientific model equates to the data-generating process outlined in Fig. 1.1. In reality, however, it would not.2

1.3.2 Statistical model

A statistical model is a mathematical representation of the scientific model —which in turn represents the data-generating process— that aims to answer the particular question of interest. To address the question, the synthetic data sets are analysed using ordinal regression within a frequentist framework.3

To estimate the effect of interest, we need to select the variables to adjust for: the adjustment set. Assuming that the structure of the scientific model is correct, we can use the adjustmentSets() function to identify the adjustment set that (asymptotically) allows the unbiased estimation of the effect of interest.

adjustmentSets(DAG.06,
               exposure = 'S', 
               outcome = 'V', 
               type = 'minimal', 
               effect = 'direct', 
               max.results = Inf)
{ C, T }

The resulting adjustment set includes both clothing insulation \((\text{C})\) and indoor air temperature \((\text{T})\). This DAG has only one adjustment set, but a more complex DAG can have several.
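To illustrate that a more complex DAG can have several adjustment sets, consider a small hypothetical DAG (unrelated to Fig. 1.1) in which the confounding path X ← A → B → Y can be blocked at either A or B:

```r
#Hypothetical DAG: the backdoor path X <- A -> B -> Y can be blocked by
#adjusting for either A or B, so two minimal adjustment sets are valid
dag_multi <- dagitty::dagitty('dag { A -> X ; A -> B ; B -> Y ; X -> Y }')
adjustmentSets(dag_multi, exposure = 'X', outcome = 'Y', type = 'minimal')
```

Here adjustmentSets() returns both { A } and { B } as minimal sets; the choice between them would then rest on practical grounds (e.g., which variable is measured more reliably).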

The statistical model can then be written as:

\[ \begin{aligned} \mathrm{V}_{i} &\sim \operatorname{Multinomial}(1, \boldsymbol\pi_{i})\\ \operatorname{logit}(\gamma_{ik}) &= \tau_{k} -\eta_{i}\\ \eta_{i} &= \alpha_{0} + \beta_{S} (\mathrm{S}_i) + \beta_{T} \mathrm{T}_{i} + \beta_{C} \mathrm{C}_{i}\\ \end{aligned} \tag{1.1}\]

In Eq. 1.1, the conditional distribution of the response variable \(\mathrm{V}_{i}\) for the \(i^{th}\) observation is assumed to follow an ordered multinomial distribution with probability vector \(\boldsymbol\pi_{i} = \{\pi_{i1},\ ...,\ \pi_{iK}\}\), where \(\pi_{ik} = \mathrm{Pr}(\mathrm{V}_{i}=k)\). The cumulative probability for \(\pi_{ik}\) is \(\gamma_{ik} = \mathrm{Pr}(\mathrm{V}_{i} \le k)\), hence \(\gamma_{ik} = \pi_{i1}\ + ... +\ \pi_{ik}.\) The terms \(\tau_{k}\) are the threshold parameters, with \(k \in \{1,\ ...,\ K-1\}\), and \(\eta_{i}\) is the linear predictor, where \(\alpha_{0}\) is the intercept term4 and \(\beta_{S}\), \(\beta_{T}\), \(\beta_{C}\) are the regression coefficients for biological sex \((\text{S})\), indoor air temperature \((\text{T})\), and clothing insulation \((\text{C})\), respectively.
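As a brief interpretive note (not part of the original analysis), a coefficient on the logit scale can be exponentiated to obtain a cumulative odds ratio. For the data-generating coefficient for females:

```r
#Odds ratio implied by the data-generating coefficient for females (b_sex = -0.16):
#the odds of reporting a warmer vote category are about 15% lower than for males
exp(-0.16)
#[1] 0.8521438
```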

A summary description of the simulation example is provided in Table 1.1.

Table 1.1. Summary description of the simulation example

Pitfall: Forgetting the influence of sampling variability on the results and interpretation of a data analysis.
Type of analysis: Inferential.
Research question: Is there a direct association between biological sex and TSV?
Framework: Frequentist.
Assumptions:
  • Random sample (simple random sampling): everyone in the population has an equal chance of being selected into the sample.
  • Independence of observations: each observation represents independent bits of information.
  • No confounding: the DAG includes all shared causes among the variables.
  • No model error: perfect functional form specification.
  • No measurement error: all variables are measured perfectly.
Variables:
  • Indoor air temperature (T): continuous variable [unit: °C]
  • Thermal resistance of clothing (C): continuous variable [unit: clo]
  • Sex (S): categorical variable [‘1’ male; ‘2’ female]
  • TSV (V): ordinal variable [‘1’ cold; ‘2’ cool; ‘3’ slightly cool; ‘4’ neutral; ‘5’ slightly warm; ‘6’ warm; ‘7’ hot]

1.3.3 Results

The results of the fitted statistical model (defined above) are presented here. The for-loop shown in the code below performs the following operations. First, it samples (using simple random sampling) a data set of 30 observations from the target population. It then performs ordinal regression and stores the estimated coefficient for sex and its 95% confidence interval in the data frame coefs_3e1. This operation is repeated a thousand times, so that coefs_3e1 contains the estimates (point estimate and confidence interval) from a thousand random samples of size 30.

#Create an empty data frame
empty.df <- data.frame(matrix(NA, nrow = n_sims, ncol = 6))
#Rename the data frame columns
colnames(empty.df) <- c('sim.id', 'estimate', 'CI_2.5', 'CI_97.5',
                        'sample.size', 'coverage')

#Sample a thousand data sets of 30 observations and perform ordinal regression 
coefs_3e1 <- empty.df #assign the empty data frame
for (i in 1:n_sims){
  set.seed(random.seed[i]) #set unique seed for each simulation 
#Sample data set from population   
  sample.random <- 
    Q06_population |> 
    slice_sample(n = 3e1) #take a simple random sample of size 30
#Fit model  
  fit <- clm(formula = V ~ 1 + S + T + C,
             data = sample.random,
             link = 'logit', 
             threshold = 'flexible')
#Compile matrix  
  coefs_3e1[i, 1] <- i #simulation ID
  coefs_3e1[i, 2] <- coef(fit)['S2'] #point estimate
  coefs_3e1[i, 3:4] <- confint(fit, level = 0.95, type = 'Wald')['S2', ] #confidence interval (Wald)
  coefs_3e1[i, 5] <- '3e1' #sample size
}
#View the first six rows of the data frame
head(coefs_3e1, 6)
  sim.id   estimate     CI_2.5    CI_97.5 sample.size coverage
1      1  0.4111986 -1.0894662  1.9118634         3e1       NA
2      2  0.6224043 -0.9730538  2.2178624         3e1       NA
3      3 -0.4799724 -1.9708800  1.0109352         3e1       NA
4      4  0.4042922 -1.1084313  1.9170156         3e1       NA
5      5 -1.9331813 -3.6423774 -0.2239852         3e1       NA
6      6  0.4240608 -1.0861795  1.9343011         3e1       NA

The same procedure is repeated for the data sets of sample sizes of 300 and 900, and the results are stored in coefs_3e2 and coefs_9e2, respectively.

Code
#Sample a thousand data sets of 300 observations and perform ordinal regression 
coefs_3e2 <- empty.df #assign the empty data frame
for (i in 1:n_sims){
  set.seed(random.seed[i]) #set unique seed for each simulation 
#Sample data set from population   
  sample.random <- 
    Q06_population |> 
    slice_sample(n = 3e2) #take a simple random sample of size 300
#Fit model
  fit <- clm(formula = V ~ 1 + S + T + C,
             data = sample.random,
             link = 'logit', 
             threshold = 'flexible')
#Compile matrix  
  coefs_3e2[i, 1] <- i #simulation ID
  coefs_3e2[i, 2] <- coef(fit)['S2'] #point estimate
  coefs_3e2[i, 3:4] <- confint(fit, level = 0.95, type = 'Wald')['S2', ] #confidence interval (Wald)
  coefs_3e2[i, 5]  <- '3e2' #sample size
}


#Sample a thousand data sets of 900 observations and perform ordinal regression 
coefs_9e2 <- empty.df #assign the empty data frame
for (i in 1:n_sims){
  set.seed(random.seed[i]) #set unique seed for each simulation 
#Sample data set from population   
  sample.random <- 
    Q06_population |> 
    slice_sample(n = 9e2) #take a simple random sample of size 900
#Fit model  
  fit <- clm(formula = V ~ 1 + S + T + C,
             data = sample.random,
             link = 'logit', 
             threshold = 'flexible')
#Compile matrix    
  coefs_9e2[i, 1] <- i #simulation ID
  coefs_9e2[i, 2] <- coef(fit)['S2'] #point estimate
  coefs_9e2[i, 3:4] <- confint(fit, level = 0.95, type = 'Wald')['S2', ] #confidence interval (Wald)
  coefs_9e2[i, 5]  <- '9e2' #sample size
}

The three data frames (coefs_3e1, coefs_3e2, and coefs_9e2) are merged into one (i.e., coefs), and the coverage is calculated. The coverage is set to 1 if the confidence interval contains the data-generating parameter for sex (i.e., b_sex[2] = -0.16) and 0 otherwise.

b_sex <- c(0,-0.16) # Male, Female
#Combine the three data frames
coefs <- data.table(rbind(coefs_3e1, coefs_3e2, coefs_9e2))
#Calculate coverage 
coefs[, coverage := fifelse(CI_2.5 <= b_sex[2] & b_sex[2] <= CI_97.5, '1', '0')]
#View the data frame
coefs
      sim.id   estimate     CI_2.5     CI_97.5 sample.size coverage
       <int>      <num>      <num>       <num>      <char>   <char>
   1:      1  0.4111986 -1.0894662  1.91186336         3e1        1
   2:      2  0.6224043 -0.9730538  2.21786242         3e1        1
   3:      3 -0.4799724 -1.9708800  1.01093518         3e1        1
   4:      4  0.4042922 -1.1084313  1.91701563         3e1        1
   5:      5 -1.9331813 -3.6423774 -0.22398515         3e1        0
  ---                                                              
2996:    996 -0.1039069 -0.3491340  0.14132018         9e2        1
2997:    997 -0.3088756 -0.5543428 -0.06340832         9e2        1
2998:    998 -0.5290883 -0.7817956 -0.27638097         9e2        0
2999:    999  0.0849292 -0.1634437  0.33330208         9e2        1
3000:   1000 -0.1757759 -0.4226830  0.07113115         9e2        1

The results are then plotted in Fig. 1.2.

Code
ggplot(data = subset(coefs, sim.id <= 1e2), aes(x = sim.id, y = estimate, ymin = CI_2.5, ymax = CI_97.5, colour = coverage)) + 
  geom_hline(yintercept = b_sex[2], alpha = 0.8, linetype = 'dashed', colour = 'blue') + 
  geom_linerange() +
  geom_point(size = 1) +
  scale_colour_manual('', values = c('black', '#FF4040'),
                      breaks=c('1','0')) +
  scale_y_continuous('Estimate (logit scale)') +
  scale_x_continuous('Simulation ID') +
  facet_wrap(.~factor(sample.size, 
                      levels=c('3e1', '3e2', '9e2'), 
                      labels = c('Sample size = 30', 'Sample size = 300', 'Sample size = 900')),
             nrow = 3,
             scales = 'fixed') +
  theme(legend.position = 'none', 
        axis.title.x = element_text(margin = unit(c(2, 0, 0, 0), "mm")), 
        axis.title.y = element_text(margin = unit(c(0, 2, 0, 0), "mm"))
        )
Fig. 1.2. Estimates of the coefficient for sex (only the first hundred of the thousand simulations are shown). The black dots represent the point estimates, and the black lines the 95% confidence intervals. Highlighted in red are the confidence intervals that do not contain the parameter used to generate the data (dashed blue line).

Fig. 1.2 shows the first 100 estimates (point estimate and confidence interval) of the coefficient for sex for the three different sample sizes. The inferential uncertainty shrinks (i.e., the confidence intervals narrow) as the sample size increases from 30 to 900 observations. However, regardless of the sample size, some confidence intervals (the ones in red) do not contain the data-generating parameter (the dashed blue line). Considering all thousand simulations, the frequency with which the calculated confidence intervals cover the data-generating parameter is 92.4%, 95.7% and 95.8% for 30, 300 and 900 observations, respectively. This is expected: by construction, a 95% confidence interval procedure covers the true parameter in approximately 95% of repeated samples. However, once a specific confidence interval has been calculated (in our case, the result of one simulation), no conclusion can be drawn about the probability that it contains the data-generating parameter. In the frequentist approach, probability refers to frequency: the data-generating parameter is a fixed constant, and a calculated confidence interval is also fixed, so a given interval either includes the parameter or does not, with no frequency involved.

Confidence intervals only quantify the uncertainty due to random error (i.e., sampling variability), not systematic error (i.e., bias). According to Gelman et al. (page 56 of Ref. (Gelman et al., 2020)), the ways to account for sources of error that are not in the statistical model are: (i) improving data collection, (ii) expanding the model, and (iii) increasing the stated uncertainty. Concerning the last point, some authors (see Ref. (Lash et al., 2009) and chapter 19 of Ref. (Rothman et al., 2008)) recommend applying quantitative bias analysis to produce intervals around the effect estimate that account for both random and systematic sources of uncertainty.
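The coverage frequencies quoted above can be computed directly from the coefs data frame built earlier (a minimal sketch, assuming coefs is in memory):

```r
#Empirical coverage by sample size: the proportion of the thousand confidence
#intervals that contain the data-generating parameter
coefs[, .(coverage_freq = mean(coverage == '1')), by = sample.size]
```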

When conducting a study, it is important to recognise that (generally) only uncertain conclusions can be drawn from data, even if statistical significance is achieved. In thermal comfort, claims based on data tend to be stated deterministically by placing the results into the ‘significant’ or ‘nonsignificant’ bins. However, this practice, called dichotomania (Greenland, 2017), is harmful because it overlooks the inherent variation in the human response (and, more generally, in the topic under study) as well as the uncertainty involved in statistical inference. Instead, focusing on uncertainty quantification is likely to reduce overly confident assertions that, upon further examination, may lack support from the data. Practical examples of this suggestion can be found in Vasishth and Gelman (Vasishth and Gelman, 2021).

Session info

Version information about R, the OS and attached or loaded packages.

R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22631)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.utf8 
[2] LC_CTYPE=English_United Kingdom.utf8   
[3] LC_MONETARY=English_United Kingdom.utf8
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] EnvStats_2.8.1    ordinal_2023.12-4 lubridate_1.9.3   forcats_1.0.0    
 [5] stringr_1.5.1     dplyr_1.1.4       purrr_1.0.2       readr_2.1.5      
 [9] tidyr_1.3.1       tibble_3.2.1      tidyverse_2.0.0   gt_0.11.0        
[13] data.table_1.15.4 ggplot2_3.5.1     dagitty_0.3-4     ggdag_0.2.12     

loaded via a namespace (and not attached):
 [1] ggrepel_0.9.5       Rcpp_1.0.12         lattice_0.20-45    
 [4] digest_0.6.36       utf8_1.2.4          V8_4.4.2           
 [7] ggforce_0.4.2       R6_2.5.1            evaluate_0.24.0    
[10] pillar_1.9.0        rlang_1.1.4         curl_5.2.1         
[13] rstudioapi_0.16.0   Matrix_1.6-5        rmarkdown_2.27     
[16] labeling_0.4.3      htmlwidgets_1.6.4   igraph_2.0.3       
[19] polyclip_1.10-6     munsell_0.5.1       compiler_4.2.3     
[22] numDeriv_2016.8-1.1 xfun_0.46           pkgconfig_2.0.3    
[25] htmltools_0.5.8.1   tidyselect_1.2.1    gridExtra_2.3      
[28] graphlayouts_1.1.1  viridisLite_0.4.2   fansi_1.0.6        
[31] ucminf_1.2.2        tzdb_0.4.0          withr_3.0.0        
[34] MASS_7.3-58.2       grid_4.2.3          nlme_3.1-162       
[37] jsonlite_1.8.8      gtable_0.3.5        lifecycle_1.0.4    
[40] magrittr_2.0.3      scales_1.3.0        cachem_1.1.0       
[43] cli_3.6.2           stringi_1.8.4       farver_2.1.2       
[46] viridis_0.6.5       xml2_1.3.6          generics_0.1.3     
[49] vctrs_0.6.5         boot_1.3-28.1       tools_4.2.3        
[52] glue_1.7.0          tweenr_2.0.3        hms_1.1.3          
[55] ggraph_2.2.1        fastmap_1.2.0       yaml_2.3.9         
[58] timechange_0.3.0    colorspace_2.1-0    tidygraph_1.3.1    
[61] memoise_2.0.1       knitr_1.48          sass_0.4.9         

  1. We have selected only a few variables of interest to limit the complexity of the DAG and shift the focus to statistical data analysis.↩︎

  2. The scientific model is generally an approximation (a best guess) of the unknown data-generating process. As such, it would have some inaccuracies/errors, which will inevitably affect the data analysis.↩︎

  3. In all simulation examples, we relied on the regression framework (ordinal or beta regression) within the frequentist or Bayesian approach. However, other approaches are available and could be used depending on the question and the data. It is the researcher’s responsibility to justify the choice of the statistical framework (and any decisions made during the modelling process) to answer the specific question of interest.↩︎

  4. Here, the intercept is omitted (i.e., set to zero). Omitting the intercept term allows the full set of thresholds \(\tau_{1},\ ...,\ \tau_{k}\) to be identified.↩︎