4  Simulation example Q.9

This simulation example highlights the data analysis pitfall of clustered/nested observations. Specifically, it shows the implications of clustered/nested observations for the results of the data analysis.

4.1 Data-generating process

The data-generating process is the specific process that generates the observed data. In reality, this process is unknown and understanding it is, in a general sense, the ultimate aim of science. Here, we assume this process is known and describe it via a directed acyclic graph (DAG) in Fig. 4.1. In this DAG,1 indoor air temperature \((\text{T})\) and biological sex \((\text{S})\) influence thermal sensation vote \((\text{V})\) both directly and indirectly, passing through clothing insulation \((\text{C})\). In addition, thermal sensation vote \((\text{V})\) is influenced directly by some unmeasured characteristics of the specific building \((\text{B})\).

Code
dag_coords <-
  data.table(name = c('T', 'C', 'S', 'V', 'B'),
             x    = c(1, 2, 3, 2, 1),
             y    = c(1.2, 1.2, 1.2, 1, 1))

DAG.09 <- dagify(C ~ S,
                 C ~ T,
                 V ~ S + C + T + B,
                 coords = dag_coords)

ggplot(data = DAG.09, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_text(colour='black', size = 8, parse = TRUE,
                family = c('mono'),
                label = c(expression(bold(B)), expression(bold(C)),  expression(bold(S)), expression(bold(T)), expression(bold(V)))) +
  geom_dag_edges(arrow_directed = grid::arrow(length = grid::unit(8, 'pt'), type = 'open'),
                 family = c('mono'), 
                 fontface = c('bold')) + 
  coord_cartesian(xlim = c(1, 3), ylim = c(1, 1.2))  +
  theme_dag()
Fig. 4.1. Graphical representation via DAG of the data-generating process. Indoor air temperature \((\text{T})\) and biological sex \((\text{S})\) influence thermal sensation vote \((\text{V})\) both directly and indirectly, passing through clothing insulation \((\text{C})\). In addition, thermal sensation vote \((\text{V})\) is influenced directly by some unmeasured characteristics of the specific building \((\text{B})\).

The DAG in Fig. 4.1 can be written as:

  • \(C \sim f_{C}(S, T)\), read as ‘clothing insulation \((\text{C})\) is some function of biological sex \((\text{S})\) and indoor air temperature \((\text{T})\)’.
  • \(V \sim f_{V}(S, T, C, B)\), read as ‘thermal sensation vote \((\text{V})\) is some function of biological sex \((\text{S})\), indoor air temperature \((\text{T})\), clothing insulation \((\text{C})\) and some unmeasured characteristics of the building \((\text{B})\)’.

4.2 Synthetic data set

In a DAG, an omitted arrow between two variables signifies the absence of direct influence of one variable on another. In contrast, a present arrow does not provide any information about the magnitude of the effect or its functional form. As such, all relationships are qualitative. However, to generate synthetic data, we need to quantitatively define the relationship between the variables. These relationships are defined within the custom function data.sim_Q09.beta(). This function takes as input the sample size N and the number of buildings n_group, and generates synthetic data according to the DAG in Fig. 4.1.

data.sim_Q09.beta <- function(N, n_group) {
#Simulate the number of people in each of the n_group buildings so that the total number of people is N.
  B_size <- rmultinom(n = 1, size=N, prob = sample(c(0.1), size=n_group, replace = TRUE))[,1]
#Simulate the building random effects 
  B_B <- rnorm(n_group, mean=0 , sd=0.75)
#Simulate indoor air temperature for each building
  T_B <- rnorm(n_group, mean=23 , sd=2)
#Create empty list  
  listID <- list()
  listB <- list()
  listT <- list()
#Replicate group-level variables (e.g., indoor air temperature) for each observation   
  for (i in 1:n_group){
    listB[[i]] <- rep(B_B[i], times=B_size[i])  #replicate B_B[i] for B_size[i] times
    listID[[i]] <- rep((1:n_group)[i], times=B_size[i]) #replicate (1:n_group)[i] for B_size[i] times
    listT[[i]] <- rep(T_B[i], times=B_size[i]) #replicate T_B[i] for B_size[i] times
  }
#Save the lists as vectors   
  B_id <- unlist(listID) #a vector of building identifiers of size N
  B <- unlist(listB) #a vector of building random effects of size N
  T <- unlist(listT) #a vector of indoor air temperature of size N
#Simulate biological sex
  S <- factor(sample(1:2, size=N, replace=TRUE)) #S = 1 'male'; S = 2 'female'
#Simulate clothing insulation  
  C <- rlnormTrunc(N, meanlog = ifelse(S=='1', 0.25 + 0.00 -0.04*T, 0.25 + 0.03 -0.04*T), sdlog =  0.15, min = 0.3 , max = 2) 
#Simulate thermal sensation vote
  V <- sim_TSV.beta(n=N, sex=S, temp=T, clo=C, b_int=-8.65, b_sex=c(0,-0.16), b_temp=0.37, b_clo=0.88, build=B, phi=17.9)
return(data.table(S, T, C, V, B, B_id)) #return a data.table with the simulated values
}

Specifically:

  • The building random effects \((\text{B})\) are randomly sampled from a normal distribution with mean 0 (i.e., mean = 0) and standard deviation 0.75 (i.e., sd = 0.75).

  • Indoor air temperature \((\text{T})\) is randomly sampled from a normal distribution. The mean is set to 23°C (i.e., mean = 23) and the standard deviation to 2°C (i.e., sd = 2). Indoor air temperature is simulated as a between-group variable, assuming that each building has one indoor air temperature.

  • Biological sex \((\text{S})\) is randomly sampled with replacement from a vector of factors 1 (i.e., male) and 2 (i.e., female). Biological sex is simulated as a within-group variable, assuming each observation has a value independent of the specific building.

  • Clothing insulation \((\text{C})\) is randomly sampled from a truncated lognormal distribution. The mean is on the log scale (i.e., meanlog) and is a linear function of biological sex \((\text{S})\) and indoor air temperature \((\text{T})\): for males (i.e., S = 1) it is meanlog = 0.25 + 0.00 -0.04*T, while for females (i.e., S = 2) it is meanlog = 0.25 + 0.03 -0.04*T. The standard deviation is on the log scale and is a constant, fixed at 0.15 (i.e., sdlog = 0.15). The parameters min and max are the assumed lower and upper bounds for clothing, fixed to 0.3 clo (i.e., min = 0.3) and 2 clo (i.e., max = 2), respectively. Clothing insulation is simulated as a within-group variable, assuming each observation has a value independent of the specific building.

  • Thermal sensation votes \((\text{V})\) are generated from the custom function sim_TSV.beta(), which is a function of biological sex (sex), indoor air temperature (temp), and clothing insulation (clo). The term b_int is the coefficient for the intercept (set to -8.65), b_sex is the coefficient for biological sex (set to 0 for males and -0.16 for females), b_temp is the coefficient for indoor air temperature (set to 0.37), b_clo is the coefficient for clothing insulation (set to 0.88), and phi is the precision parameter (set to 17.9). More information on the custom function sim_TSV.beta() can be found in Appendix A.5.
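The actual definition of sim_TSV.beta() is given in Appendix A.5. As a rough, hypothetical sketch (an assumption here, not the appendix code), a generator of this kind can be thought of as drawing beta-distributed votes with a logit-linear mean, using the mean/precision parameterisation shape1 = \(\mu\phi\), shape2 = \((1-\mu)\phi\):

```r
# Hypothetical sketch of a sim_TSV.beta()-style generator; the actual
# definition is in Appendix A.5. It assumes the beta mean/precision
# parameterisation used by the statistical models in Section 4.3.2.
sim_TSV.sketch <- function(n, sex, temp, clo, b_int, b_sex, b_temp, b_clo,
                           build, phi) {
  #Linear predictor on the logit scale (b_sex indexed by factor level)
  eta <- b_int + b_sex[as.integer(sex)] + b_temp * temp + b_clo * clo + build
  mu  <- plogis(eta) #inverse-logit: mean vote on the (0, 1) scale
  rbeta(n, shape1 = mu * phi, shape2 = (1 - mu) * phi)
}
```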

From this data generation mechanism, we simulated the target population, which consists of one million observations from twenty thousand buildings.

set.seed(2023) #set random number for reproducibility
#Simulate the population
Q09_population <- data.sim_Q09.beta(N = 1e6, n_group = 20000)
#View the data frame
Q09_population
              S        T         C         V         B  B_id
         <fctr>    <num>     <num>     <num>     <num> <int>
      1:      2 22.81022 0.4900057 0.8272881 1.2676635     1
      2:      2 22.81022 0.3467289 0.9028027 1.2676635     1
      3:      1 22.81022 0.4034637 0.5864774 1.2676635     1
      4:      2 22.81022 0.5417691 0.6345086 1.2676635     1
      5:      1 22.81022 0.5495862 0.7164846 1.2676635     1
     ---                                                    
 999996:      2 27.34685 0.4707264 0.8875221 0.6904279 20000
 999997:      1 27.34685 0.4239501 0.9774967 0.6904279 20000
 999998:      1 27.34685 0.4790684 0.9720889 0.6904279 20000
 999999:      1 27.34685 0.3614063 0.9445903 0.6904279 20000
1000000:      2 27.34685 0.3819760 0.7213163 0.6904279 20000

From this population, we obtained a data set of ten thousand observations using multistage sampling: we first randomly sampled 500 buildings, and then randomly sampled ten thousand observations from within the sampled buildings.

set.seed(2023) #Set random number for reproducibility
#Sample a vector of 500 unique buildings ID
sample.building <- 
  Q09_population$B_id |> #select the building ID
  unique() |> #remove duplicate values
  sample(size = 5e2) #take a simple random sample of size 500

set.seed(2023) #set random number for reproducibility
#Sample one data set of 10,000 observations
sample.random <- 
  Q09_population |>
  filter(B_id %in% sample.building) |> #filter by the 500 building ID
  slice_sample(n = 1e4) #take a simple random sample of size 10,000 from the 500 buildings

#View the data frame
sample.random
            S        T         C         V          B  B_id
       <fctr>    <num>     <num>     <num>      <num> <int>
    1:      1 26.22338 0.4057381 0.4840411 -1.2220967 16952
    2:      1 25.19498 0.4276205 0.7277264 -0.2434178  8566
    3:      1 19.41889 0.5108458 0.2018259  0.4214767  1467
    4:      2 22.85782 0.3410838 0.5978592  0.6718118  6646
    5:      2 23.16726 0.4505068 0.5852178 -0.1235028  6518
   ---                                                     
 9996:      2 19.15630 0.7439215 0.3920396  0.4903466 10025
 9997:      2 19.85480 0.5591411 0.1787886 -0.5343069  7528
 9998:      2 24.89526 0.5089781 0.7077378  0.5052873 11047
 9999:      1 22.17644 0.4353788 0.6363981  0.2581817 15913
10000:      1 23.58039 0.4962462 0.9294576  0.4319900  9757

A summary of descriptive statistics for all variables in the data set is given below. Detailed information can be found in Q.9 Summary.

summary(sample.random)
 S              T               C                V           
 1:4971   Min.   :18.13   Min.   :0.3015   Min.   :0.005116  
 2:5029   1st Qu.:21.85   1st Qu.:0.4613   1st Qu.:0.377474  
          Median :23.23   Median :0.5167   Median :0.569528  
          Mean   :23.18   Mean   :0.5246   Mean   :0.559469  
          3rd Qu.:24.46   3rd Qu.:0.5803   3rd Qu.:0.749002  
          Max.   :28.83   Max.   :1.0351   Max.   :0.999927  
       B                 B_id      
 Min.   :-2.21596   Min.   :    2  
 1st Qu.:-0.54383   1st Qu.: 5063  
 Median : 0.01985   Median : 9920  
 Mean   :-0.01413   Mean   : 9951  
 3rd Qu.: 0.51081   3rd Qu.:14998  
 Max.   : 2.10181   Max.   :19887  

4.3 Data analysis

In a scientific context, data analysis should be performed to model the phenomenon under scrutiny, not to fit curves to data. The data are a means of obtaining information about that phenomenon to answer the specific question of interest. In this example, we assumed that the objective of the data analysis is to answer the following predictive question: ‘What is the probability that the thermal sensation vote for females at 25°C with a clothing insulation value of 0.6 clo in a typical (i.e., average) building is lower than 0.5 (‘neutral’)?’.

4.3.1 Scientific model

The first step to address a scientific question is to define the scientific model. A scientific model is essentially a representation of a specific phenomenon that is being studied. To achieve this, we defined a DAG. A DAG generally consists of letter-and-arrow pictures summarising the researcher’s scientific knowledge about the phenomenon under study. Since we know the data-generating process (we defined it), the scientific model equates to the data-generating process outlined in Fig. 4.1. In reality, however, it would not.2

4.3.2 Statistical model

A statistical model is a mathematical representation of the scientific model —which in turn represents the data-generating process— that aims to answer the particular question of interest. To address the question, the synthetic data set is analysed using beta regression within a Bayesian framework.3

To estimate the effect of interest, two beta regression models are fitted: one that ignores the clustering present in the data (Model 1) and one that accounts for it (Model 2).

The statistical model for ‘Model 1’ can then be written as:

\[ \begin{aligned} \mathrm{V}_{i} &\sim \operatorname{Beta}(\mu_{i}, \phi)\\ \operatorname{logit}(\mu_{i}) &= \eta_{i}\\ \eta_{i} &= \alpha_{0} + \beta_{S} (\mathrm{S}_i) + \beta_{T} \mathrm{T}_{i} + \beta_{C} \mathrm{C}_{i} \\ \alpha_{0} &\sim \operatorname{Normal}(0, 5)\\ \beta_{S} &\sim \operatorname{Normal}(0, 1)\\ \beta_{T} &\sim \operatorname{Normal}(0, 1)\\ \beta_{C} &\sim \operatorname{Normal}(0, 1)\\ \phi &\sim \operatorname{Exponential}(1)\\ \end{aligned} \tag{4.1}\]

In Eq. 4.1, the conditional distribution of the response variable \(\mathrm{V}_{i}\) for the \(i^{th}\) observation is assumed to follow a beta distribution with mean \(\mu_{i}\) and precision parameter \(\phi\). The parameter \(\eta_{i}\) is the linear predictor, where \(\alpha_{0}\) is the intercept term and \(\beta_{S}\), \(\beta_{T}\), \(\beta_{C}\) are the regression coefficients for biological sex \((\text{S})\), indoor air temperature \((\text{T})\) and clothing insulation \((\text{C})\), respectively. The intercept \(\alpha_{0}\) has a \(\operatorname{Normal}(0, 5)\) prior, the regression coefficients \(\beta_{S}\), \(\beta_{T}\), \(\beta_{C}\) have a \(\operatorname{Normal}(0, 1)\) prior, and the precision parameter \(\phi\) has an \(\operatorname{Exponential}(1)\) prior.
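As a quick illustration of the mean/precision parameterisation (a standard identity, not code from this example), \((\mu, \phi)\) maps to the shape parameters of R's rbeta() as shape1 = \(\mu\phi\) and shape2 = \((1-\mu)\phi\), so the mean is shape1/(shape1 + shape2) \(= \mu\) and the variance is \(\mu(1-\mu)/(1+\phi)\):

```r
#Map the beta (mu, phi) parameterisation to R's (shape1, shape2)
beta_mu_phi <- function(mu, phi) {
  c(shape1 = mu * phi, shape2 = (1 - mu) * phi)
}

p <- beta_mu_phi(mu = 0.6, phi = 17.9)
p['shape1'] / (p['shape1'] + p['shape2']) #recovers mu = 0.6
0.6 * (1 - 0.6) / (1 + 17.9)              #implied variance, approx. 0.0127
```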

The statistical model for ‘Model 2’ can then be written as:

\[ \begin{aligned} \mathrm{V}_{i} &\sim \operatorname{Beta}(\mu_{i}, \phi)\\ \operatorname{logit}(\mu_{i}) &= \eta_{i}\\ \eta_{i} &= \alpha_{0} + \beta_{S} (\mathrm{S}_i) + \beta_{T} \mathrm{T}_{i} + \beta_{C} \mathrm{C}_{i} + \alpha_{\mathrm{BUILD}[i]} \\ \alpha_{0} &\sim \operatorname{Normal}(0, 5)\\ \beta_{S} &\sim \operatorname{Normal}(0, 1)\\ \beta_{T} &\sim \operatorname{Normal}(0, 1)\\ \beta_{C} &\sim \operatorname{Normal}(0, 1)\\ \alpha_{j} &\sim \operatorname{Normal}(0, 1) ~\text{for building}~j = 1, \dots, 500\\ \phi &\sim \operatorname{Exponential}(1)\\ \end{aligned} \tag{4.2}\]

In Eq. 4.2, the only difference from Eq. 4.1 is the presence of \(\alpha_{\mathrm{BUILD}[i]}\) in the linear predictor \(\eta_{i}\). The parameter \(\alpha_{j}\) is the random effect (specifically a random intercept) that accounts for the specific effect of the \(j^{th}\) building \(\mathrm{B}_{j}\), assuming a \(\operatorname{Normal}(0, 1)\) prior.

A summary description of the simulation example is provided in Table 4.1.

Table 4.1. Summary description of the simulation example
Pitfall: Ignoring the implications of clustered/nested observations (i.e., dependent observations) on the results and interpretation of a data analysis.
Type of analysis: Predictive.
Research question: What is the probability that the TSV for females at 25°C with a clothing insulation value of 0.6 clo in a typical (i.e., average) building is lower than 0.5 (‘neutral’)?
Framework: Bayesian.
Assumptions:
  • Random sample (multistage sampling): everyone in the population has an equal chance of being selected into the sample.
  • Limited random variability: large sample size.
  • No confounding: the DAG includes all shared causes among the variables.
  • No model error: perfect functional form specification.
  • No measurement error: all variables are measured perfectly.
Variables:
  • Indoor air temperature (T): continuous variable [unit: °C]
  • Thermal resistance of clothing (C): continuous variable [unit: clo]
  • Biological sex (S): categorical variable [‘1’ male; ‘2’ female]
  • Building ID (B_id): index variable
  • TSV (V): continuous but bounded variable with interval (0, 1) [‘0’ cold; ‘1’ hot]

4.3.3 Results

The results of the two fitted statistical models (defined above) are presented here.

In brms, the intercept can be specified by the syntax y ~ x or y ~ 1 + x in the formula argument. When doing so, the corresponding prior is applied under the presupposition that all the regressors (x) are mean-centred. However, if the regressors are not mean-centred (as in our case), the syntax ... ~ 0 + Intercept + ... should be used. With this syntax, the prior can be defined directly on the actual intercept. More information can be found in the brms reference manual.
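An alternative (a sketch of the general idea, not the approach taken in this example) is to mean-centre the continuous regressors before fitting, so that with the default y ~ 1 + x syntax the intercept corresponds to the expected response at the regressor means and its prior applies where intended. The data frame d below is purely illustrative:

```r
#Sketch (not used in this example): mean-centre continuous regressors so
#that the intercept of V ~ 1 + ... refers to the regressor means.
d <- data.frame(T = c(20, 22, 24, 26),     #illustrative temperatures
                C = c(0.4, 0.5, 0.6, 0.7)) #illustrative clothing values
d$T_c <- d$T - mean(d$T) #centred temperature
d$C_c <- d$C - mean(d$C) #centred clothing insulation
colMeans(d[, c('T_c', 'C_c')]) #both zero after centring
```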

#Fit the beta regression model (Model 1)

Model.1 <-
  brm(data = sample.random,
      family = Beta(link = 'logit'),
      formula = V ~ 0 + Intercept + S + T + C,
      #Define priors for specific parameters or classes of parameters
      prior = c(prior(normal(0, 5), class = b, coef = Intercept),
                prior(normal(0, 1), class = b, coef = S2),
                prior(normal(0, 1), class = b, coef = C),
                prior(normal(0, 1), class = b, coef = T),
                prior(exponential(1), class = phi)),
      iter = 5000, warmup = 1000, 
      chains = 4, cores = 4,
      seed = 2023, #for reproducibility
      file = 'fit/Q9.fit_Model1') #save model fit

#View of the model summary
summary(Model.1)
 Family: beta 
  Links: mu = logit; phi = identity 
Formula: V ~ 0 + Intercept + S + T + C 
   Data: sample.random (Number of observations: 10000) 
  Draws: 4 chains, each with iter = 5000; warmup = 1000; thin = 1;
         total post-warmup draws = 16000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -7.24      0.14    -7.52    -6.95 1.00     5160     7398
S2           -0.12      0.02    -0.15    -0.08 1.00     9304     9550
T             0.31      0.00     0.30     0.32 1.00     5961     8738
C             0.88      0.10     0.68     1.08 1.00     6108     8966

Further Distributional Parameters:
    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi     5.23      0.07     5.10     5.37 1.00     8574     9239

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
#Fit the beta regression model (Model 2)

Model.2 <-
  brm(data = sample.random,
      family = Beta(link = 'logit'),
      formula = V ~ 0 + Intercept + S + T + C  + (1 | B_id),
      #Define priors for specific parameters or classes of parameters
      prior = c(prior(normal(0, 5), class = b, coef = Intercept),
                prior(normal(0, 1), class = b, coef = S2),
                prior(normal(0, 1), class = b, coef = C),
                prior(normal(0, 1), class = b, coef = T),
                prior(exponential(1), class = phi),
                prior(normal(0, 1), class = sd)),
      iter = 5000, warmup = 1000, 
      chains = 4, cores = 4,
      seed = 2023, #for reproducibility
      file = 'fit/Q9.fit_Model2') #save model fit

#View of the model summary
summary(Model.2)
 Family: beta 
  Links: mu = logit; phi = identity 
Formula: V ~ 0 + Intercept + S + T + C + (1 | B_id) 
   Data: sample.random (Number of observations: 10000) 
  Draws: 4 chains, each with iter = 5000; warmup = 1000; thin = 1;
         total post-warmup draws = 16000

Multilevel Hyperparameters:
~B_id (Number of levels: 500) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.78      0.03     0.73     0.83 1.00     1286     3047

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -8.29      0.41    -9.10    -7.50 1.00      665     1575
S2           -0.15      0.01    -0.17    -0.13 1.00    32155    11712
T             0.35      0.02     0.32     0.39 1.00      691     1491
C             0.96      0.06     0.83     1.08 1.00    31834    12409

Further Distributional Parameters:
    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi    17.97      0.26    17.46    18.49 1.00    35041    10764

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The two models can be compared by approximate leave-one-out cross-validation (LOOCV) (Vehtari et al., 2017), where smaller values indicate a better fit. In particular, the LOO information criterion (LOOIC) for the two models and their difference are calculated and summarised in Table 4.2.

Code
#Calculate LOO criterion for Model 1
Model.1 <- add_criterion(Model.1, criterion = 'loo')

#Calculate LOO criterion for Model 2
Model.2 <- add_criterion(Model.2, criterion = 'loo')

#Compare the LOOIC for the two models
LOOIC <- 
  rbind(Model.2 = loo(Model.2)$estimates[3, ], #extract LOOIC and SE for Model 2 
        Model.1 = loo(Model.1)$estimates[3, ]) |>  #extract LOOIC and SE for Model 1 
  data.frame() |>
  rename(c(LOOIC = Estimate, se = SE)) |>
  round(digits = 2)

#Compare the difference of the LOOIC for the two models
LOO <- loo_compare(Model.2, Model.1, criterion = 'loo') #compare the elpd_loo

LOOIC.diff <- 
  cbind(LOOIC_diff = LOO[, 1] * (-2), #calculate the LOOIC difference
        se_diff = LOO[, 2] * 2) |>  #calculate the SE difference
  round(digits = 2)
Table 4.2. Model comparison
Model                  LOOIC      SE      LOOIC.diff¹  SE.diff²
Model 2: GLMM model   -17680.26  137.04        0.00      0.00
Model 1: GLM model     -6067.95  110.81    11612.31    157.58
¹ LOOIC.diff is the difference between the two LOOIC scores.
² SE.diff is the standard error of the LOOIC.diff.

The table shows that Model.2 fits substantially better (smaller LOOIC value) than Model.1, since the difference in LOOIC (i.e., LOOIC.diff) is very large (73.69 times the corresponding standard error, SE.diff).
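The ratio quoted above can be reproduced directly from the values in Table 4.2:

```r
#Size of the LOOIC difference relative to its standard error (Table 4.2)
LOOIC_diff <- 11612.31
se_diff    <- 157.58
round(LOOIC_diff / se_diff, 2) #73.69
```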

Using the function predicted_draws(), we can calculate the predicted posterior distribution of the TSVs for Model.1 and Model.2 given the regressors of interest (i.e., females at 25°C with clothing insulation of 0.6 clo).

#Create a data frame with defined regressors
grid.pred <- expand.grid(S = '2', #female
                         T = 25,
                         C = 0.6)

#Calculate prediction Model 1
set.seed(2023) #set random number for reproducibility
Model.1_pred <- 
  Model.1 |> 
  predicted_draws(newdata = grid.pred)

#Calculate prediction Model 2
set.seed(2023) #set random number for reproducibility
Model.2_pred <- 
  Model.2 |> 
  predicted_draws(newdata = grid.pred, 
                  re_formula = NA) #include no group-level effects

The predicted posterior distribution of the TSVs for Model.2 (i.e., Model.2_pred) represents the predicted TSVs for an average building. These have been calculated by holding the group-level residual (i.e., the group random effect) at its mean of zero, that is, by setting re_formula = NA in the function predicted_draws(). On the other hand, the predicted posterior distribution of the TSVs for Model.1 (i.e., Model.1_pred) does not represent the predicted TSVs for an average building because the model does not account for the variability among buildings. Consequently, with this model it is not possible to answer the question of interest. These predicted probabilities could be thought of, to a certain extent, as those of females at 25°C with clothing insulation of 0.6 clo across buildings on average. However, if these probabilities are of interest, they should be calculated using Model.2. This can be done by adding the term B_id = unique(sample.random$B_id) in grid.pred and by setting re_formula = NULL in the function predicted_draws().
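The across-buildings calculation just described could be sketched as follows (not run here; it reuses the fitted Model.2 and the sampled data from earlier in this section):

```r
#Sketch of the across-buildings prediction: include each sampled building
#in the prediction grid and keep the group-level effects in the draws.
grid.pred_build <- expand.grid(S = '2', #female
                               T = 25,
                               C = 0.6,
                               B_id = unique(sample.random$B_id))

Model.2_pred_build <-
  Model.2 |>
  predicted_draws(newdata = grid.pred_build,
                  re_formula = NULL) #include all group-level effects
```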

The results of the two models are then plotted in Fig. 4.2.

Code
#Custom function to capitalise first letter of a word
foo = function(x){
  paste(toupper(substring(x, 1, 1)),
        tolower(substring(x, 2, nchar(x))),
        sep = '')
}

#Combine the predictions of the two models in one data frame
plot_preds <- 
  bind_rows(
    'Model.2_pred' = Model.2_pred,
    'Model.1_pred' = Model.1_pred,
    .id = 'draw_type') |> 
  mutate(draw_type = fct_inorder(draw_type))

#Calculate the highest predictive density (HPD) for Model 1 and Model 2 
mean_preds <- 
  plot_preds |>
  group_by(draw_type, S) |> 
  mean_hdi(.prediction) |>
  mutate(.interval = 'hpd')

#Calculate the probability (percentage of area) that the predicted TSVs are lower and higher than 0.5 (i.e., neutral TSV)
perc.area <- 
  rbind(data.frame(lower = mean(Model.1_pred$.prediction < 0.5), 
                   higher = mean(Model.1_pred$.prediction > 0.5), 
                   draw_type = 'Model.1_pred'),
        data.frame(lower = mean(Model.2_pred$.prediction < 0.5), 
                   higher = mean(Model.2_pred$.prediction > 0.5), 
                   draw_type = 'Model.2_pred'))

#Plot
ggplot(plot_preds, aes(x = .prediction, fill = draw_type)) +
  stat_halfeye(point_interval = mean_hdi, 
               .width = .95,
               shape = 16, 
               point_size = 2.3, 
               point_color = 'black', point_fill = 'black',
               color = 'black', 
               normalize = 'all', 
               alpha = 0.75) +
  geom_text(data = mean_preds, mapping = aes(x = .prediction, y = 0.95, label = paste(foo(.point), '=', sprintf('%.2f',round(.prediction, digits = 2)))), 
            size = 9/.pt, hjust = 0.5) +
  geom_text(data = mean_preds, mapping = aes(x = .prediction, y = .25, label = paste(sprintf('%.0f%%',.width*100),  toupper(.interval))), 
            size = 9/.pt, hjust = 0.5) +
  geom_text(data = mean_preds, mapping = aes(x = .lower, y = 0.10, label = paste(sprintf('%.2f', .lower))), 
            size = 9/.pt, hjust = 0.5) +
  geom_text(data = mean_preds, mapping = aes(x = .upper, y = 0.10, label = paste(sprintf('%.2f', .upper))), 
            size = 9/.pt, hjust = 0.5) +
  geom_vline(xintercept = 0.5, linetype = 'dashed', linewidth = 0.25, alpha = 0.5) +
  geom_text(data = perc.area, mapping = aes(x = 0.55, y = 0.60, label = paste(sprintf('%.1f%%', lower*100), '<', 0.50, '<', sprintf('%.1f%%', higher*100))), 
            size = 9/.pt, hjust = 0.5) +
  scale_x_continuous(breaks = seq(0, 1, length.out = 7),
                     labels = c('Cold\n(0)', '', '', 'Neutral\n(0.5)', '', '', 'Hot\n(1)'),
                     limits = c(0, 1)) +
  scale_fill_manual('Models', values = c('#F65314', '#7CBB00'), 
                    breaks = c('Model.1_pred', 'Model.2_pred'),
                    labels = c('Model 1 (GLM)', 'Model 2 (GLMM)')) +
  labs(x = expression('Predicted TSV'),
       y = expression('Probability density')) + 
  facet_grid(. ~ factor(draw_type, levels=c('Model.1_pred','Model.2_pred'), labels = c('Model 1 (GLM)', 'Model 2 (GLMM)'))) +
  theme(legend.position = 'none',
        axis.ticks = element_line(linewidth = 0.25), 
        axis.title.x = element_text(margin = unit(c(2, 0, 0, 0), 'mm')), 
        axis.title.y = element_text(margin = unit(c(0, 2, 0, 0), 'mm'))
        )
Fig. 4.2. Predicted posterior distribution of the TSVs ‘Model 1’ (orange) and ‘Model 2’ (green). The black line and dot at the bottom of each distribution represent the highest predictive density (HPD) and the mean, respectively.

Fig. 4.2 shows the predicted posterior distribution of the TSVs for Model.1 (orange) and Model.2 (green), respectively. The black line and dot at the bottom of each distribution represent the highest predictive density (HPD) and the mean, respectively. These intervals (i.e., the black lines) are defined here to span 95% of the distribution and represent the 95% HPD. Using Model.1, the probability that the thermal sensation vote for females at 25°C with a clothing insulation value of 0.6 clo is lower than 0.5 (‘neutral’) is 16.0%. However, this is not the probability for a typical (i.e., average) building, leading to the wrong answer to our question of interest. Model.2, on the other hand, can be used to calculate the predicted TSVs for an average building. From this model, the probability that the thermal sensation vote for females at 25°C with a clothing insulation value of 0.6 clo in a typical (i.e., average) building is lower than 0.5 (‘neutral’) is 2.9%.

Session info

Version information about R, the OS and attached or loaded packages.

R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22631)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.utf8 
[2] LC_CTYPE=English_United Kingdom.utf8   
[3] LC_MONETARY=English_United Kingdom.utf8
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] EnvStats_2.8.1    brms_2.21.0       Rcpp_1.0.12       tidybayes_3.0.6  
 [5] lubridate_1.9.3   forcats_1.0.0     stringr_1.5.1     dplyr_1.1.4      
 [9] purrr_1.0.2       readr_2.1.5       tidyr_1.3.1       tibble_3.2.1     
[13] tidyverse_2.0.0   gt_0.11.0         data.table_1.15.4 ggplot2_3.5.1    
[17] dagitty_0.3-4     ggdag_0.2.12     

loaded via a namespace (and not attached):
 [1] colorspace_2.1-0     estimability_1.5.1   QuickJSR_1.3.1      
 [4] rstudioapi_0.16.0    farver_2.1.2         rstan_2.32.6        
 [7] graphlayouts_1.1.1   ggrepel_0.9.5        svUnit_1.0.6        
[10] fansi_1.0.6          mvtnorm_1.2-4        xml2_1.3.6          
[13] bridgesampling_1.1-2 codetools_0.2-19     cachem_1.1.0        
[16] knitr_1.48           polyclip_1.10-6      bayesplot_1.11.1    
[19] jsonlite_1.8.8       ggdist_3.3.2         ggforce_0.4.2       
[22] compiler_4.2.3       emmeans_1.10.3       backports_1.4.1     
[25] Matrix_1.6-5         fastmap_1.2.0        cli_3.6.2           
[28] tweenr_2.0.3         htmltools_0.5.8.1    tools_4.2.3         
[31] igraph_2.0.3         coda_0.19-4.1        gtable_0.3.5        
[34] glue_1.7.0           reshape2_1.4.4       posterior_1.6.0     
[37] V8_4.4.2             vctrs_0.6.5          nlme_3.1-162        
[40] ggraph_2.2.1         tensorA_0.36.2.1     xfun_0.46           
[43] timechange_0.3.0     lifecycle_1.0.4      MASS_7.3-58.2       
[46] scales_1.3.0         tidygraph_1.3.1      hms_1.1.3           
[49] Brobdingnag_1.2-9    parallel_4.2.3       inline_0.3.19       
[52] yaml_2.3.9           curl_5.2.1           memoise_2.0.1       
[55] gridExtra_2.3        loo_2.8.0            StanHeaders_2.32.10 
[58] sass_0.4.9           stringi_1.8.4        checkmate_2.3.1     
[61] boot_1.3-28.1        pkgbuild_1.4.4       rlang_1.1.4         
[64] pkgconfig_2.0.3      matrixStats_1.3.0    distributional_0.4.0
[67] evaluate_0.24.0      lattice_0.20-45      rstantools_2.4.0    
[70] htmlwidgets_1.6.4    labeling_0.4.3       tidyselect_1.2.1    
[73] plyr_1.8.9           magrittr_2.0.3       R6_2.5.1            
[76] generics_0.1.3       pillar_1.9.0         withr_3.0.0         
[79] abind_1.4-5          arrayhelpers_1.1-0   utf8_1.2.4          
[82] tzdb_0.4.0           rmarkdown_2.27       viridis_0.6.5       
[85] grid_4.2.3           digest_0.6.36        xtable_1.8-4        
[88] RcppParallel_5.1.8   stats4_4.2.3         munsell_0.5.1       
[91] viridisLite_0.4.2   

  1. We have selected only a few variables of interest to limit the complexity of the DAG and shift the focus to statistical data analysis.↩︎

  2. The scientific model is generally an approximation (a best guess) of the unknown data-generating process. As such, it would have some inaccuracies/errors, which will inevitably affect the data analysis.↩︎

  3. In all simulation examples, we relied on the regression framework (ordinal or beta regression) within the frequentist or Bayesian approach. However, other approaches are available and could be used depending on the question and the data. It is the researcher’s responsibility to justify the choice of the statistical framework (and any decisions made during the modelling process) to answer the specific question of interest.↩︎