2  Simulation example Q.7

In this simulation example, the highlighted data analysis pitfall is selection bias (see Table 2.1). Specifically, it shows the influence of non-random (biased) sampling on the results of a data analysis.

2.1 Data-generating process

The data-generating process is the specific process that generates the observed data. In reality, this process is unknown, and understanding it is, in a general sense, the ultimate aim of science. Here, we assume this process is known and describe it via a directed acyclic graph (DAG) in Fig. 2.1. In this DAG,1 indoor air temperature \((\text{T})\) and biological sex \((\text{S})\) influence thermal sensation vote \((\text{V})\) both directly and indirectly, through clothing insulation \((\text{C})\). Participation \((\text{P})\) in the survey is affected by the thermal sensation vote \((\text{V})\). The selection node \(\text{P}\) is boxed to represent explicitly that selection into the sample is conditional on \(\text{P}\).

Code
dag_coords <-
  data.frame(name = c('T', 'C', 'S', 'V', 'P'),
             x    = c(1, 2, 3, 2, 3),
             y    = c(1.2, 1.2, 1.2, 1, 1))

DAG.07 <-
  dagify(C ~ S,
         C ~ T,
         V ~ S + C + T,
         P ~ V,
         coords = dag_coords)

ggplot(data = DAG.07, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_point(x = 3, y = 1, shape = 0, size = 13, stroke = 0.9, color = 'black') +
  geom_dag_text(colour='black', size = 8, parse = TRUE,
                family = c('mono'),
                label = c(expression(bold(C)), expression(bold(P)), expression(bold(S)), expression(bold(T)), expression(bold(V)))
                ) +
  geom_dag_edges(arrow_directed = grid::arrow(length = grid::unit(8, 'pt'), type = 'open'),
                 family = c('mono'), 
                 fontface = c('bold')) + 
  coord_cartesian(xlim = c(1, 3), ylim = c(1, 1.2))  +
  theme_dag()
Fig. 2.1. Graphical representation via a DAG of the data-generating process. Indoor air temperature \((\text{T})\) and biological sex \((\text{S})\) influence thermal sensation vote \((\text{V})\) both directly and indirectly, through clothing insulation \((\text{C})\). Participation \((\text{P})\) in the survey is affected by the thermal sensation vote \((\text{V})\).

The DAG in Fig. 2.1 can be written as:

  • \(C \sim f_{C}(S, T)\), read as ‘clothing insulation \((\text{C})\) is some function of biological sex \((\text{S})\) and indoor air temperature \((\text{T})\)’.
  • \(V \sim f_{V}(S, T, C)\), read as ‘thermal sensation vote \((\text{V})\) is some function of biological sex \((\text{S})\), indoor air temperature \((\text{T})\) and clothing insulation \((\text{C})\)’.
  • \(P \sim f_{P}(V)\), read as ‘participation \((\text{P})\) is some function of thermal sensation vote \((\text{V})\)’.

2.2 Synthetic data set

In a DAG, an omitted arrow between two variables signifies the absence of direct influence of one variable on another. In contrast, a present arrow does not provide any information about the magnitude of the effect or its functional form. As such, all relationships are qualitative. However, to generate synthetic data, we need to quantitatively define the relationship between the variables. These relationships are defined within the custom function data.sim_Q07.ord(). This function takes as input the sample size N and generates synthetic data according to the DAG in Fig. 2.1.

data.sim_Q07.ord <- function(N) {
#Simulate biological sex
  S <- factor(sample(1:2, size = N, replace = TRUE)) #S = 1 'male'; S = 2 'female'
#Simulate indoor air temperature
  T <- rnorm(N, mean = 23 , sd = 2)
#Simulate clothing insulation
  C <- rlnormTrunc(N, meanlog = ifelse(S=='1', 0.25 + 0.00 -0.04*T, 0.25 + 0.03 -0.04*T), sdlog =  0.15, min = 0.3 , max = 2)
#Simulate thermal sensation vote
  V <- sim_TSV.ord(n = N, sex = S, temp = T, clo = C, b_int = 0, b_sex = c(0,-0.16), b_temp = 0.37, b_clo = 0.88)
  return(data.table(S, T, C, V)) #return a data.table with the simulated values
}

Specifically:

  • Biological sex \((\text{S})\) is randomly sampled with replacement from a vector of factors 1 (i.e., male) and 2 (i.e., female).

  • Indoor air temperature \((\text{T})\) is randomly sampled from a normal distribution. The mean is set to 23°C (i.e., mean = 23) and the standard deviation to 2°C (i.e., sd = 2).

  • Clothing insulation \((\text{C})\) is randomly sampled from a truncated lognormal distribution. The mean is on the log scale (i.e., meanlog) and is a linear function of biological sex \((\text{S})\) and indoor air temperature \((\text{T})\): for males (i.e., S = 1) it is meanlog = 0.25 - 0.04*T, while for females (i.e., S = 2) it is meanlog = 0.25 + 0.03 - 0.04*T. The standard deviation on the log scale is a constant, fixed at 0.15 (i.e., sdlog = 0.15). The parameters min and max are the assumed lower and upper bounds for clothing insulation, fixed to 0.3 clo (i.e., min = 0.3) and 2 clo (i.e., max = 2), respectively.

  • Thermal sensation votes \((\text{V})\) are generated from the custom function sim_TSV.ord(), which is a function of biological sex (sex), indoor air temperature (temp), and clothing insulation (clo). The term b_int is the coefficient for the intercept (set to 0), b_sex is the coefficient for biological sex (set to 0 for males and -0.16 for females), b_temp is the coefficient for indoor air temperature (set to 0.37), and b_clo is the coefficient for clothing insulation (set to 0.88). More information on the custom function sim_TSV.ord() can be found in Appendix A.2.
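The coefficients above feed an ordinal (cumulative-logit) generative step. As a sketch of the mechanism plausibly implemented inside sim_TSV.ord() (the exact implementation is given in Appendix A.2; the thresholds \(\tau_{k}\) below are internal to that function and are an assumption of this sketch):

\[ \begin{aligned} \eta_{i} &= b_{int} + b_{sex}[\mathrm{S}_{i}] + b_{temp} \mathrm{T}_{i} + b_{clo} \mathrm{C}_{i}\\ \mathrm{Pr}(V_{i} \le k) &= \operatorname{logit}^{-1}(\tau_{k} - \eta_{i})\\ \end{aligned} \]

from which one of the seven ordered categories is sampled for each observation.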

From this data generation mechanism, we simulated the target population, which consists of one million observations.

set.seed(2023) #set random number for reproducibility
#Simulate the population
Q07_population <- data.sim_Q07.ord(N = 1e6)
#Transform V to an ordered factor
Q07_population[, V := factor(V, levels = c('1','2','3','4','5','6','7'), ordered = TRUE)]
#View the data frame
Q07_population
              S        T         C     V
         <fctr>    <num>     <num> <ord>
      1:      1 19.35834 0.5173312     3
      2:      2 20.17534 0.6351932     3
      3:      1 17.25539 0.6350222     4
      4:      1 22.32519 0.5410619     4
      5:      2 23.51410 0.5319473     6
     ---                                
 999996:      1 23.16175 0.5811199     7
 999997:      1 20.21810 0.4694916     4
 999998:      2 21.43980 0.5467828     3
 999999:      1 21.73996 0.4982545     4
1000000:      1 20.66438 0.6083867     3

From this population, we obtained two data sets of ten thousand observations each: one using simple random sampling and the other using non-random sampling. In the non-random sample, participation \((\text{P})\) in the survey is affected by the thermal sensation vote \((\text{V})\), assuming that people with lower and higher thermal sensation votes are more likely to answer the survey.

To do so, we first create a column of weights in the target population Q07_population, which depends on the thermal sensation vote. The more a thermal sensation vote deviates from neutrality (i.e., V = '4'), the larger the weight. For example, thermal sensation votes of '1' and '7' (i.e., ‘cold’ and ‘hot’) have a weight of 0.9, votes of '2' and '6' (i.e., ‘cool’ and ‘warm’) have a weight of 0.8, and so forth.

Q07_population <- 
  Q07_population |> 
  mutate(weights = case_when(V == '1' ~ 0.9,
                             V == '2' ~ 0.8,
                             V == '3' ~ 0.3,
                             V == '4' ~ 0.2,
                             V == '5' ~ 0.3,
                             V == '6' ~ 0.8,
                             V == '7' ~ 0.9,
                             .default = NA))

Subsequently, the two data sets of ten thousand observations each are sampled. The first sample (i.e., sample.random) will use simple random sampling, whereas the second one (i.e., sample.bias) will use non-random sampling based on the weights,2 where observations with larger weights are more likely to be sampled. As a result, the sample obtained is biased.

set.seed(2023) #set random number for reproducibility 
#Sample one data set of 10,000 observations
sample.random <- 
  Q07_population |> 
  slice_sample(n = 1e4) |>  #take a simple random sample of size 10,000
  select(-weights) #drop column 'weights'

set.seed(2023) #set random number for reproducibility
#Sample one data set of 10,000 observations
sample.bias <- 
  Q07_population |> 
  slice_sample(n = 1e4, weight_by = weights) |> #take a biased sample of size 10,000
  select(-weights) #drop column 'weights'

A summary of descriptive statistics for all variables in the data sets is given below. Detailed information can be found in Q.7 Summary.

summary(sample.random)
 S              T               C          V       
 1:4912   Min.   :16.37   Min.   :0.3004   1: 145  
 2:5088   1st Qu.:21.65   1st Qu.:0.4647   2: 336  
          Median :23.03   Median :0.5209   3:1638  
          Mean   :23.00   Mean   :0.5284   4:4933  
          3rd Qu.:24.36   3rd Qu.:0.5840   5:2229  
          Max.   :29.98   Max.   :1.0358   6: 473  
                                           7: 246  
summary(sample.bias)
 S              T               C          V       
 1:4951   Min.   :15.00   Min.   :0.3017   1: 420  
 2:5049   1st Qu.:21.67   1st Qu.:0.4620   2: 808  
          Median :23.09   Median :0.5167   3:1483  
          Mean   :23.07   Mean   :0.5253   4:3066  
          3rd Qu.:24.44   3rd Qu.:0.5798   5:2246  
          Max.   :30.55   Max.   :0.9754   6:1309  
                                           7: 668  

2.3 Data analysis

In a scientific context, data analysis should be performed to model the phenomenon under scrutiny, not merely to fit curves to data. The data are a means of obtaining information about that phenomenon to answer the specific question of interest. In this example, we assumed that the objective of the data analysis is to answer the following inferential question: ‘What is the (population’s) expected average thermal sensation vote for males exposed to air temperatures between 16°C and 30°C with a clothing insulation value of 0.5 clo?’.

2.3.1 Scientific model

The first step to address a scientific question is to define the scientific model. A scientific model is essentially a representation of a specific phenomenon that is being studied. To achieve this, we defined a DAG. A DAG generally consists of letter-and-arrow pictures summarising the researcher’s scientific knowledge about the phenomenon under study. Since we know the data-generating process (we defined it), the scientific model equates to the data-generating process outlined in Fig. 2.1. In reality, however, it would not.3

2.3.2 Statistical model

A statistical model is a mathematical representation of the scientific model —which in turn represents the data-generating process— that aims to answer the particular question of interest. To address the question, the synthetic data sets are analysed using ordinal regression within a Bayesian framework.4

The statistical model can then be written as:

\[ \begin{aligned} \mathrm{V}_{i} &\sim \operatorname{Multinomial}(1, \boldsymbol\pi_{i})\\ \operatorname{logit}(\gamma_{ik}) &= \tau_{k} -\eta_{i}\\ \eta_{i} &= \alpha_{0} + \beta_{S} (\mathrm{S}_i) + \beta_{T} \mathrm{T}_{i} + \beta_{C} \mathrm{C}_{i}\\ \alpha_{0} &= 0\\ \tau_{k} &\sim \operatorname{Normal}(0, 4)\\ \beta_{S} &\sim \operatorname{Normal}(0, 1)\\ \beta_{T} &\sim \operatorname{Normal}(0, 1)\\ \beta_{C} &\sim \operatorname{Normal}(0, 1)\\ \end{aligned} \tag{2.1}\]

In Eq. 2.1, the conditional distribution of the response variable \(\mathrm{V}_{i}\) for the \(i^{th}\) observation is assumed to follow an ordered multinomial distribution with probability vector \(\boldsymbol\pi_{i} = \{\pi_{i1},\ ...,\ \pi_{iK}\}\), where \(\pi_{ik} = \mathrm{Pr}(V_{i}=k)\). The cumulative probability for \(\pi_{ik}\) is \(\gamma_{ik} = \mathrm{Pr}(V_{i} \le k)\), hence \(\gamma_{ik} = \pi_{i1}\ + ... +\ \pi_{ik}.\) The terms \(\tau_{k}\), with \(k \in \{1,\ ...,\ K-1\}\), are the threshold parameters, and \(\eta_{i}\) is the linear predictor, where \(\alpha_{0}\) is the intercept term5 and \(\beta_{S}\), \(\beta_{T}\), \(\beta_{C}\) are the regression coefficients for biological sex \((\text{S})\), indoor air temperature \((\text{T})\), and clothing insulation \((\text{C})\), respectively. The threshold parameters \(\tau_{k}\) have a \(\operatorname{Normal}(0, 4)\) prior, while the regression coefficients \(\beta_{S}\), \(\beta_{T}\) and \(\beta_{C}\) have a \(\operatorname{Normal}(0, 1)\) prior.
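Spelling out the step from cumulative probabilities back to category probabilities, which Eq. 2.1 leaves implicit:

\[ \begin{aligned} \pi_{ik} &= \gamma_{ik} - \gamma_{i,k-1} = \operatorname{logit}^{-1}(\tau_{k} - \eta_{i}) - \operatorname{logit}^{-1}(\tau_{k-1} - \eta_{i})\\ \end{aligned} \]

with the conventions \(\gamma_{i0} = 0\) and \(\gamma_{iK} = 1\), so that the \(\pi_{ik}\) sum to one.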

A summary description of the simulation example is provided in Table 2.1.

Table 2.1. Summary description of the simulation example
Pitfall Ignore the implications of selection bias (e.g., case-control bias) on the results and interpretation of a data analysis.
Type of analysis Inferential.
Research question What is the (population’s) expected average TSV for males exposed to air temperatures between 16°C and 30°C with a clothing insulation value of 0.5 clo?
Framework Bayesian.
Assumptions Limited random variability: large sample size.
Independence of observations: each observation represents independent bits of information.
No confounding: the DAG includes all shared causes among the variables.
No model error: perfect functional form specification.
No measurement error: all variables are measured perfectly.
Variables Air temperature (T): continuous variable [unit: °C]
Thermal resistance of clothing (C): continuous variable [unit: clo]
Sex (S): categorical variable [‘1’ male; ‘2’ female]
TSV (V): ordinal variable [‘1’ cold; ‘2’ cool; ‘3’ slightly cool; ‘4’ neutral; ‘5’ slightly warm; ‘6’ warm; ‘7’ hot]

2.3.3 Results

The results of the fitted statistical model (defined above) are presented here. The same ordinal model is fitted to the random and biased sample, resulting in the two models fit_sample.random and fit_sample.bias.

#Fit the ordinal regression model (random sample)
fit_sample.random <-
  brm(data = sample.random,
      family = cumulative(logit),
      formula = V ~ 1 + S + T + C,
      #Define priors for specific parameters or classes of parameters
      prior = c(prior(normal(0, 4), class = Intercept),
                prior(normal(0, 1), class = b)),
      iter = 4000, warmup = 1000, 
      chains = 4, cores = 4,
      seed = 2023, #for reproducibility
      file = 'fit/Q7.fit_sample.random') #save model fit

#Fit the ordinal regression model (biased sample)
fit_sample.bias <-
  brm(data = sample.bias,
      family = cumulative(logit),
      formula = V ~ 1 + S + T + C,
      #Define priors for specific parameters or classes of parameters
      prior = c(prior(normal(0, 4), class = Intercept),
                prior(normal(0, 1), class = b)),
      iter = 4000, warmup = 1000, 
      chains = 4, cores = 4,
      seed = 2023, #for reproducibility
      file = 'fit/Q7.fit_sample.bias') #save model fit

To answer the question of interest, we need to define the values of the regressors at which to calculate the average thermal sensation vote: males with a clothing insulation value of 0.5 clo exposed to air temperatures between 16°C and 30°C.

#Create a data frame with defined regressors
grid.pred <- 
  data.table(expand_grid(T = seq(from = 16, to = 30, length.out = 281),
                         S = '1',
                         C = 0.5))
#View the data frame
grid.pred 
         T      S     C
     <num> <char> <num>
  1: 16.00      1   0.5
  2: 16.05      1   0.5
  3: 16.10      1   0.5
  4: 16.15      1   0.5
  5: 16.20      1   0.5
 ---                   
277: 29.80      1   0.5
278: 29.85      1   0.5
279: 29.90      1   0.5
280: 29.95      1   0.5
281: 30.00      1   0.5

Subsequently, using the fitted() function, we can calculate the model expectations for the given set of regressors (i.e., the expected values of the posterior predictive distribution), which, in our case, are the predicted probabilities for the seven categories of thermal sensation votes.

#Calculate the posterior predictions (random sample) 
fitted_sample.random <-
  fitted(fit_sample.random,
         newdata = grid.pred,
         summary = F)

#Calculate the posterior predictions (biased sample)
fitted_sample.bias <-
  fitted(fit_sample.bias,
         newdata = grid.pred,
         summary = F)

#View the structure of the data frame
str(fitted_sample.bias) 
 num [1:12000, 1:281, 1:7] 0.326 0.36 0.302 0.369 0.338 ...
 - attr(*, "dimnames")=List of 3
  ..$ : chr [1:12000] "1" "2" "3" "4" ...
  ..$ : NULL
  ..$ : chr [1:7] "1" "2" "3" "4" ...

When summary = F, for categorical and ordinal models, fitted() returns a 3-dimensional array. The first dimension contains as many values as there were post-warm-up draws across the Hamiltonian Monte Carlo (HMC) chains (i.e., 12,000 post-warm-up6). The second dimension contains as many values as there were observations in newdata (i.e., 281 observations), and the third dimension contains the number of categories (i.e., 7 categories).

We need to perform some data wrangling to transform this 3-dimensional array into a useful format (e.g., a data frame). The code below transforms the 3-dimensional array into a data frame.

#Data wrangling (random sample)
f_sample.random <-
  rbind(fitted_sample.random[, , 1],
        fitted_sample.random[, , 2],
        fitted_sample.random[, , 3],
        fitted_sample.random[, , 4],
        fitted_sample.random[, , 5],
        fitted_sample.random[, , 6],
        fitted_sample.random[, , 7]) |>
  data.frame() |> #convert the results to a data frame
  set_names(pull(grid.pred, T)) |> #rename the columns
  mutate(Vote = rep(1:7, each = n() / 7), #create a column of votes 
         draw = rep(1:12e3, times = 7)) |> #create a column of numeric index for the MCMC draw
  pivot_longer(-c(draw, Vote), #convert to the long format all the data frame's elements but 'draw' and 'Vote'
               names_to = 'Air.Temperature', #add columns name here
               values_to = 'Probability') |> #add values here
  mutate(Air.Temperature = as.numeric(Air.Temperature)) #convert 'Air.Temperature' from character to numeric format


#Data wrangling (biased sample)
f_sample.bias <-
  rbind(fitted_sample.bias[, , 1],
        fitted_sample.bias[, , 2],
        fitted_sample.bias[, , 3],
        fitted_sample.bias[, , 4],
        fitted_sample.bias[, , 5],
        fitted_sample.bias[, , 6],
        fitted_sample.bias[, , 7]) |>
  data.frame() |> #convert the results to a data frame
  set_names(pull(grid.pred, T)) |> #rename the columns
  mutate(Vote = rep(1:7, each = n() / 7), #create a column of votes 
         draw = rep(1:12e3, times = 7)) |> #create a column of numeric index for the MCMC draw
  pivot_longer(-c(draw, Vote), #convert to the long format all the data frame's elements but 'draw' and 'Vote'
               names_to = 'Air.Temperature', #add columns name here
               values_to = 'Probability') |> #add values here
  mutate(Air.Temperature = as.numeric(Air.Temperature)) #convert 'Air.Temperature' from character to numeric format

#View the structure of the data frame
str(f_sample.bias)
tibble [23,604,000 × 4] (S3: tbl_df/tbl/data.frame)
 $ Vote           : int [1:23604000] 1 1 1 1 1 1 1 1 1 1 ...
 $ draw           : int [1:23604000] 1 1 1 1 1 1 1 1 1 1 ...
 $ Air.Temperature: num [1:23604000] 16 16.1 16.1 16.1 16.2 ...
 $ Probability    : num [1:23604000] 0.326 0.322 0.318 0.313 0.309 ...

This results in f_sample.random and f_sample.bias: two data frames of 23,604,000 rows (i.e., \(12,000 \times 281 \times 7\)) and 4 columns (i.e., Vote, draw, Air.Temperature and Probability).
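The stacking performed by rbind() above can be checked on a toy array with the same layout (draws × observations × categories); the dimensions below are made up purely for illustration:

```r
#Toy 3-D array mimicking fitted()'s output: 4 draws x 3 observations x 2 categories
a <- array(seq_len(24) / 24, dim = c(4, 3, 2))
dim(a[, , 1]) #one category slice is a draws x observations matrix
#[1] 4 3
dim(rbind(a[, , 1], a[, , 2])) #stacked slices: (draws * categories) rows
#[1] 8 3
```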

In these data frames, each of the 281 values of Air.Temperature has 7 possible Vote categories (i.e., the 7 categories of TSV), with each category having 12,000 probabilities. However, we need a summary of the probabilities for each of the 7 possible Vote categories at each Air.Temperature. To do so, we can group the data by Air.Temperature and Vote and use the quantile() function to summarise Probability at specific probabilities: 0.5 (i.e., the median), and 0.025 and 0.975 (i.e., the bounds of the 95% credible interval).

data.plot.random.prob <-
  f_sample.random |>
  group_by(Air.Temperature, Vote) |> #group the data by 'Air.Temperature' and 'Vote'
  summarise(Q = list(quantile(Probability, probs = c(0.025, 0.5, 0.975)))) |> #for the grouped data summarise 'Probability' in a list-column
  unnest_wider(Q) #unnest the list-column into columns

data.plot.bias.prob <-
  f_sample.bias |> 
  group_by(Air.Temperature, Vote) |>#group the data by 'Air.Temperature' and 'Vote'
  summarise(Q = list(quantile(Probability, probs = c(0.025, 0.5, 0.975)))) |> #for the grouped data summarise 'Probability' in a list-column
  unnest_wider(Q) #unnest the list-column into columns
#View the data frame
data.plot.bias.prob 
# A tibble: 1,967 × 5
# Groups:   Air.Temperature [281]
   Air.Temperature  Vote  `2.5%`   `50%` `97.5%`
             <dbl> <int>   <dbl>   <dbl>   <dbl>
 1            16       1 0.307   0.343   0.381  
 2            16       2 0.276   0.294   0.313  
 3            16       3 0.184   0.201   0.218  
 4            16       4 0.107   0.120   0.135  
 5            16       5 0.0247  0.0287  0.0332 
 6            16       6 0.00764 0.00903 0.0107 
 7            16       7 0.00280 0.00335 0.00402
 8            16.0     1 0.302   0.339   0.376  
 9            16.0     2 0.276   0.294   0.313  
10            16.0     3 0.186   0.202   0.220  
# ℹ 1,957 more rows

This results in data.plot.random.prob and data.plot.bias.prob: two data frames of 1,967 rows (i.e., \(281 \times 7\)) and 5 columns (i.e., Air.Temperature, Vote, 2.5%, 50% and 97.5%). Subsequently, we can combine the two data frames into one (i.e., data.plot.prob).

#Combine the two data frames
data.plot.prob <-
  rbind(data.plot.random.prob |> mutate(Type  = 'random'), 
        data.plot.bias.prob |> mutate(Type  = 'bias'))

To verify the results and demonstrate that the sample.bias is biased whereas the sample.random is not, we need to compare them with the truth. In a real setting, this would not be possible; however, within a simulation context, it is. To do so, we need to constrain the data-generating process (defined within the custom function data.sim_Q07.ord()) to generate synthetic data with specific characteristics. This is achieved with the custom function truth.data.sim_Q07.ord(), which returns the probabilities for each of the seven categories of thermal sensation for males with a clothing insulation value of 0.5 clo exposed to air temperatures between 16°C and 30°C.

truth.data.sim_Q07.ord <- function(N) {
#Set biological sex
  S <- factor(rep('1', times = N)) #S = 1 'male'; S = 2 'female'
#Set indoor air temperature
  T <- seq(from = 16, to = 30, length.out = N)
#Set clothing insulation
  C <- 0.5
#Simulate thermal sensation vote
  V <- truth.sim_TSV.ord(n = N, sex = S, temp = T, clo = C, b_int = 0, b_sex = c(0,-0.16), b_temp = 0.37, b_clo = 0.88)
  return(data.table(S, T, C, V)) #return a data.table with the simulated values
}

The truth.sim_TSV.ord() function is a modified version of the sim_TSV.ord() function, which returns the probability of each thermal sensation vote. More information can be found in Appendix A.2.

The following code generates the required synthetic data as a data frame (truth) and transforms it into long format (data.plot.truth.prob).

set.seed(2023) #set random number for reproducibility
#Generate the data
truth <- 
  truth.data.sim_Q07.ord(N = 281) |>
  rename(c(Sex = S, Clothing = C, Air.Temperature = T)) #rename columns

#Convert data frame to a long format
data.plot.truth.prob <-
  truth |>
  pivot_longer(-c(Sex, Air.Temperature, Clothing, V), #convert to the long format all the data frame's elements but 'Sex', 'Air.Temperature', 'Clothing' and 'V'
               names_to = 'Vote', #add columns name here
               values_to = 'Probability') #add values here

The results are then plotted in Fig. 2.2.

Code
Type_names <- c('bias' = 'Biased sample', 
                'random' = 'Random sample')
  
Vote_names <-  c('1' = 'Cold', 
                 '2' = 'Cool', 
                 '3' = 'Slightly cool', 
                 '4' = 'Neutral', 
                 '5' = 'Slightly warm', 
                 '6' = 'Warm', 
                 '7' = 'Hot')

ggplot(data = data.plot.prob, aes(x = Air.Temperature, y = `50%`, group = Vote)) +
  geom_line(data = data.plot.truth.prob, mapping = aes(x = Air.Temperature, y = Probability, 
                                                       group = Vote, colour = 'black', fill = 'black',
                                                       linetype = 'black'), linewidth = 0.4) + 
  geom_ribbon(aes(ymin = `2.5%`, ymax = `97.5%`, fill = as.factor(Type)), alpha = 0.3)+
  geom_line(aes(colour = as.factor(Type), linetype = as.factor(Type)), linewidth = 0.4) +
  scale_colour_manual('',
                      values = c('#F65314', '#7CBB00', 'black'), 
                      breaks = c( 'bias', 'random', 'black'),
                      labels = c('Biased sample', 'Random sample', 'Population')) +
  scale_fill_manual('',
                    values = c('#F65314', '#7CBB00', 'transparent'), 
                    breaks = c('bias', 'random', 'black'),
                    labels = c('Biased sample', 'Random sample', 'Population')) +
  scale_linetype_manual('',
                    values = c('solid','solid', 'solid'), 
                    breaks = c('bias', 'random', 'black'),
                    labels = c('Biased sample', 'Random sample', 'Population')) +
  scale_x_continuous(expression('Air temperature'~('°'*C)),
                     breaks = seq(from = 16, to = 30, by = 2),
                     limits = c(16, 30)) +
  scale_y_continuous(expression('Probabilities'~Pr(Y == k)),
                     limits = c(0, 0.6),
                     labels = scales::percent) +
  facet_grid(Vote ~ Type, 
             labeller = labeller(Vote = Vote_names,
                                 Type = Type_names)) +
  theme(legend.position = 'bottom',
        legend.direction = 'horizontal',
        legend.margin = margin(0,0,0,0),
        legend.key = element_blank(),
        axis.title.x = element_text(margin = unit(c(2, 0, 0, 0), "mm")),
        axis.title.y = element_text(margin = unit(c(0, 2, 0, 0), "mm"))) 
Fig. 2.2. Posterior probabilities (and 95% credible intervals) across varying air temperatures for males with a clothing insulation value of 0.5 clo. The orange lines indicate the median probabilities calculated from the biased sample (left side), while the green lines represent the medians calculated from the random sample (right side). The black lines display the probabilities of the population, determined using the data-generating mechanism.

Fig. 2.2 displays the posterior probabilities (and 95% credible intervals) across varying air temperatures for males with a clothing insulation value of 0.5 clo. The orange lines indicate the median probabilities calculated from the biased sample (left side), while the green lines represent the medians calculated from the random sample (right side). The black lines display the probabilities of the population, determined using the data-generating mechanism. The figure shows that, for each category, the probability from the biased sample is not in line with the probability of the population. Conversely, the probabilities derived from the random sample closely align with those of the population. Although the figure provides valuable information, it does not directly answer our question of interest, as we seek the average outcome. The average thermal sensation vote can be calculated from the category probabilities as:

\[ \begin{aligned} \overline{\mathrm{V}} &= \sum_{k=1}^{K}\pi_{k}~k \\ \end{aligned} \tag{2.2}\]

where \(\pi_{k}\) is the probability of a specific category \(k\), \(k \in \{1,\ ...,\ K\}\).
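As a minimal illustration of Eq. 2.2 with a made-up probability vector over the seven categories:

```r
#Hypothetical probabilities for votes 1 to 7 (summing to 1)
pi_k <- c(0.05, 0.10, 0.20, 0.30, 0.20, 0.10, 0.05)
#Average thermal sensation vote as per Eq. 2.2
sum(pi_k * 1:7)
#[1] 4
```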

The following code calculates the average thermal sensation vote for the samples (i.e., data.plot.random and data.plot.bias) and combines them into one data frame (i.e., data.plot). Additionally, it calculates the average vote for the truth (i.e., data.plot.truth.mean).

#Calculate mean probability (random sample)
data.plot.random <-
  f_sample.random |>
  group_by(draw, Air.Temperature) |> #group the data by 'draw' and 'Air.Temperature'
  summarise(Mean.Vote = sum(Probability * Vote)) |> #for the grouped data calculate the mean probability
  ungroup() |> #ungroup the data
  group_by(Air.Temperature) |> #group the data by 'Air.Temperature'
  summarise(Q = list(quantile(Mean.Vote, probs = c(0.025, 0.5, 0.975)))) |> #for the grouped data summarise 'Probability' in a list-column
  unnest_wider(Q) #unnest the list-column into columns

#Calculate mean probability (biased sample)
data.plot.bias <-
  f_sample.bias |>
  group_by(draw, Air.Temperature) |> #group the data by 'draw' and 'Air.Temperature'
  summarise(Mean.Vote = sum(Probability * Vote)) |> #for the grouped data calculate the mean probability
  ungroup() |> #ungroup the data
  group_by(Air.Temperature) |> #group the data by 'Air.Temperature'
  summarise(Q = list(quantile(Mean.Vote, probs = c(0.025, 0.5, 0.975)))) |> #for the grouped data summarise 'Probability' in a list-column
  unnest_wider(Q) #unnest the list-column into columns 

#Combine the two data frames
data.plot <-
  rbind(data.plot.random |> mutate(Type  = 'random'), 
        data.plot.bias |> mutate(Type  = 'bias'))

#Calculate mean probability (truth)
data.plot.truth.mean <-
  truth |>
  pivot_longer(-c(Sex, Air.Temperature, Clothing, V), #convert to the long format all the data frame's elements but 'Sex', 'Air.Temperature', 'Clothing' and 'V'
               names_to = 'Vote', #add columns name here
               values_to = 'Probability') |>#add values here
  group_by(Air.Temperature) |> #group the data by 'Air.Temperature'
  summarise(Mean.Vote = sum(Probability * as.numeric(Vote))) |> #for the grouped data calculate the mean probability
  ungroup() #ungroup the data

The results are then plotted in Fig. 2.3.

Code
ggplot(data = data.plot, aes(x = Air.Temperature, y = `50%`)) +
  geom_line(data = data.plot.truth.mean, mapping = aes(x = Air.Temperature, y = `Mean.Vote`, colour = 'black', fill = 'black'),linewidth = 0.4) + 
  geom_ribbon(aes(ymin = `2.5%`, ymax = `97.5%`, fill = Type), alpha = 0.3)+
  geom_line(aes(colour = Type), linewidth = 0.4) +
  scale_colour_manual('',
                      values = c('#F65314', '#7CBB00', 'black'), 
                      breaks = c( 'bias', 'random', 'black'),
                      labels = c('Biased sample', 'Random sample', 'Population')) +
  scale_fill_manual('',
                    values = c('#F65314', '#7CBB00', 'transparent'), 
                    breaks = c( 'bias', 'random', 'black'),
                    labels = c('Biased sample', 'Random sample', 'Population')) +
  scale_x_continuous('Air temperature (°C)',
                     breaks = seq(from = 16, to = 30, by = 2),
                     limits = c(16, 30)) +
  scale_y_continuous('TSV',
                     breaks = 1:7, 
                     limits = c(1, 7)) +
  facet_grid(. ~ Type, 
             labeller = labeller(Type = Type_names)) +
  theme(legend.position = 'bottom',
        legend.direction = 'horizontal',
        legend.margin = margin(0,0,0,0),
        legend.key = element_blank(),
        axis.title.x = element_text(margin = unit(c(2, 0, 0, 0), "mm")),
        axis.title.y = element_text(margin = unit(c(0, 2, 0, 0), "mm"))) 
Fig. 2.3. Average thermal sensation vote (posterior median and 95% credible intervals) across varying air temperatures for males with a clothing insulation value of 0.5 clo. The orange line indicates the average vote estimated from the biased sample (left side), while the green line represents the estimate from the random sample (right side). The black lines display the population average vote, determined using the data-generating mechanism.

Fig. 2.3 displays the average of the posterior probabilities (and 95% credible intervals) across varying air temperatures for males with a clothing insulation value of 0.5 clo. The orange line indicates the mean calculated from the biased sample (left side), while the green line represents the mean calculated from the random sample (right side). The black line displays the population mean, determined from the data-generating mechanism. The figure shows that the mean from the biased sample does not match the population mean. This discrepancy arises because the biased sample contains a disproportionate number of people who voted at the lower and higher ends of the thermal sensation scale. In contrast, the mean calculated from the random sample is an almost perfect match: the random sample accurately recovers the population mean. For both the biased and random samples, the 95% credible intervals are narrow due to the large sample size (i.e., ten thousand observations). Increasing the sample size only improves precision, not accuracy (see Note 2.1 for more detail). This is evident in the biased sample, where a larger sample size provides a more precise but incorrect answer.
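The mechanics behind this mismatch can be sketched in a few lines of base R, independent of the model fits above. The population of votes, the participation weights, and the sample sizes below are hypothetical choices for illustration only: participation is made proportional to the vote, so people with warmer thermal sensations are over-represented in the biased sample.

```r
set.seed(2023) #set random number seed for reproducibility

#Hypothetical population of thermal sensation votes (1-7, symmetric around 4)
population <- sample(1:7, size = 1e5, replace = TRUE,
                     prob = c(0.05, 0.10, 0.20, 0.30, 0.20, 0.10, 0.05))

#Hypothetical participation weights: probability of responding grows with the vote
weights <- population

random.sample <- sample(population, size = 1e4)                 #simple random sample
biased.sample <- sample(population, size = 1e4, prob = weights) #weighted (biased) sample

mean(population)    #population mean vote (about 4)
mean(random.sample) #close to the population mean
mean(biased.sample) #systematically above the population mean
```

Re-running this sketch with a different seed moves the random-sample mean slightly around the population value, while the biased-sample mean stays offset, however large the sample is drawn.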

In this note, we show how a larger sample size improves precision,7 not accuracy.8 Specifically, following the steps detailed in Section 2.2, we sample two data sets of one thousand observations each: one random and one biased. These two data sets are then analysed following the steps described in Section 2.3.3. The code below performs all of these steps.

Code
set.seed(2023) #set random number seed for reproducibility 
#Sample one data set of 1,000 observations
sample.random1e3 <- 
  Q07_population |> 
  slice_sample(n = 1e3) |>  #take a simple random sample of size 1,000
  select(-weights) #drop column 'weights'

set.seed(2023) #set random number seed for reproducibility
#Sample one data set of 1,000 observations
sample.bias1e3 <- 
  Q07_population |> 
  slice_sample(n = 1e3, weight_by = weights) |> #take a biased sample of size 1,000
  select(-weights) #drop column 'weights'


#Fit the ordinal regression model (random sample 1e3)
fit_sample.random1e3 <-
  brm(data = sample.random1e3,
      family = cumulative(logit),
      formula = V ~ 1 + S + T + C,
      #Define priors for specific parameters or classes of parameters
      prior = c(prior(normal(0, 4), class = Intercept),
                prior(normal(0, 1), class = b)),
      iter = 4000, warmup = 1000, 
      chains = 4, cores = 4,
      seed = 2023, #for reproducibility
      file = 'fit/Q7.fit_sample.random1e3') #save model fit


#Fit the ordinal regression model (biased sample 1e3)
fit_sample.bias1e3 <-
  brm(data = sample.bias1e3,
      family = cumulative(logit),
      formula = V ~ 1 + S + T + C,
      #Define priors for specific parameters or classes of parameters
      prior = c(prior(normal(0, 4), class = Intercept),
                prior(normal(0, 1), class = b)),
      iter = 4000, warmup = 1000, 
      chains = 4, cores = 4,
      seed = 2023, #for reproducibility
      file = 'fit/Q7.fit_sample.bias1e3') #save model fit


#Calculate the posterior predictions (random sample 1e3) 
fitted_sample.random1e3 <-
  fitted(fit_sample.random1e3,
         newdata = grid.pred,
         summary = FALSE) #return individual posterior draws, not a summary

#Calculate the posterior predictions (biased sample 1e3)
fitted_sample.bias1e3 <-
  fitted(fit_sample.bias1e3,
         newdata = grid.pred,
         summary = FALSE) #return individual posterior draws, not a summary


#Data wrangling (random sample 1e3)
f_sample.random1e3 <-
  rbind(fitted_sample.random1e3[, , 1],
        fitted_sample.random1e3[, , 2],
        fitted_sample.random1e3[, , 3],
        fitted_sample.random1e3[, , 4],
        fitted_sample.random1e3[, , 5],
        fitted_sample.random1e3[, , 6],
        fitted_sample.random1e3[, , 7]) |>
  data.frame() |> #convert the results to a data frame
  set_names(pull(grid.pred, T)) |> #rename the columns
  mutate(Vote = rep(1:7, each = n() / 7), #create a column of votes 
         draw = rep(1:12e3, times = 7)) |> #create a column of numeric index for the MCMC draw
  pivot_longer(-c(draw, Vote), #convert to the long format all the data frame's elements but 'draw' and 'Vote'
               names_to = 'Air.Temperature', #add columns name here
               values_to = 'Probability') |> #add values here
  mutate(Air.Temperature = as.numeric(Air.Temperature)) #convert 'Air.Temperature' from character to numeric format


#Data wrangling (biased sample 1e3)
f_sample.bias1e3 <-
  rbind(fitted_sample.bias1e3[, , 1],
        fitted_sample.bias1e3[, , 2],
        fitted_sample.bias1e3[, , 3],
        fitted_sample.bias1e3[, , 4],
        fitted_sample.bias1e3[, , 5],
        fitted_sample.bias1e3[, , 6],
        fitted_sample.bias1e3[, , 7]) |>
  data.frame() |> #convert the results to a data frame
  set_names(pull(grid.pred, T)) |> #rename the columns
  mutate(Vote = rep(1:7, each = n() / 7), #create a column of votes 
         draw = rep(1:12e3, times = 7)) |> #create a column of numeric index for the MCMC draw
  pivot_longer(-c(draw, Vote), #convert to the long format all the data frame's elements but 'draw' and 'Vote'
               names_to = 'Air.Temperature', #add columns name here
               values_to = 'Probability') |> #add values here
  mutate(Air.Temperature = as.numeric(Air.Temperature)) #convert 'Air.Temperature' from character to numeric format


#Calculate mean probability (random sample 1e3)
data.plot.random1e3 <-
  f_sample.random1e3 |>
  group_by(draw, Air.Temperature) |> #group the data by 'draw' and 'Air.Temperature'
  summarise(Mean.Vote = sum(Probability * Vote)) |> #for the grouped data calculate the mean vote (probability-weighted average)
  ungroup() |> #ungroup the data
  group_by(Air.Temperature) |> #group the data by 'Air.Temperature'
  summarise(Q = list(quantile(Mean.Vote, probs = c(0.025, 0.5, 0.975)))) |> #for the grouped data summarise the quantiles of 'Mean.Vote' in a list-column
  unnest_wider(Q) #unnest the list-column into columns


#Calculate mean probability (biased sample 1e3)
data.plot.bias1e3 <-
  f_sample.bias1e3 |>
  group_by(draw, Air.Temperature) |> #group the data by 'draw' and 'Air.Temperature'
  summarise(Mean.Vote = sum(Probability * Vote)) |> #for the grouped data calculate the mean vote (probability-weighted average)
  ungroup() |> #ungroup the data
  group_by(Air.Temperature) |> #group the data by 'Air.Temperature'
  summarise(Q = list(quantile(Mean.Vote, probs = c(0.025, 0.5, 0.975)))) |> #for the grouped data summarise the quantiles of 'Mean.Vote' in a list-column
  unnest_wider(Q) #unnest the list-column into columns 


#Combine the four data frames
data.plot1 <-
  rbind(data.plot.random |> mutate(Type  = 'random', Size = '1e4'), 
        data.plot.bias |> mutate(Type  = 'bias', Size = '1e4'),
        data.plot.random1e3 |> mutate(Type  = 'random', Size = '1e3'), 
        data.plot.bias1e3 |> mutate(Type  = 'bias', Size = '1e3')) |> 
  mutate(Size = ordered(as.factor(Size), levels = c('1e4', '1e3')))

#Plot
Size_names <- c('1e4' = 'Sample size: 10,000', 
                '1e3' = 'Sample size: 1,000')

ggplot(data = data.plot1, aes(x = Air.Temperature, y = `50%`)) +
  geom_line(data = data.plot.truth.mean, mapping = aes(x = Air.Temperature, y = `Mean.Vote`, colour = 'black', fill = 'black'), linewidth = 0.4) + 
  geom_ribbon(aes(ymin = `2.5%`, ymax = `97.5%`, fill = Type), alpha = 0.3) +
  geom_line(aes(colour = Type), linewidth = 0.4) +
  scale_colour_manual('',
                      values = c('#F65314', '#7CBB00', 'black'), 
                      breaks = c( 'bias', 'random', 'black'),
                      labels = c('Biased sample', 'Random sample', 'Population')) +
  scale_fill_manual('',
                    values = c('#F65314', '#7CBB00', 'transparent'), 
                    breaks = c( 'bias', 'random', 'black'),
                    labels = c('Biased sample', 'Random sample', 'Population')) +
  scale_x_continuous('Air temperature (°C)',
                     breaks = seq(from = 16, to = 30, by = 2),
                     limits = c(16, 30)) +
  scale_y_continuous('TSV',
                     breaks = 1:7, 
                     limits = c(1, 7)) +
  facet_grid(Size ~ Type, 
             labeller = labeller(Size = Size_names,
                                 Type = Type_names)) +
  theme(legend.position = 'bottom',
        legend.direction = 'horizontal',
        legend.margin = margin(0,0,0,0),
        legend.key = element_blank(),
        axis.title.x = element_text(margin = unit(c(2, 0, 0, 0), "mm")),
        axis.title.y = element_text(margin = unit(c(0, 2, 0, 0), "mm"))) 
Fig. 2.4. Average posterior probability (and 95% credible intervals) across varying air temperatures for males with a clothing insulation value of 0.5 clo for two sample sizes (10,000 observations (top) and 1,000 observations (bottom)). The orange line indicates the calculated mean probability from a biased sample (left-side), while the green line represents the mean calculated using a random sample (right-side). The black lines display the mean of the probabilities of the population, which is determined using the data-generating mechanism.

Fig. 2.4 displays the average of posterior probabilities (and 95% credible intervals) across varying air temperatures for males with a clothing insulation value of 0.5 clo for two sample sizes (10,000 observations (top) and 1,000 observations (bottom)). The orange line indicates the calculated mean probability from a biased sample (left-side), while the green line represents the mean calculated using a random sample (right-side). The black line displays the mean of the probabilities of the population, which is determined using the data-generating mechanism.

This figure shows that the mean estimates for both the biased and random samples with ten thousand observations (Fig. 2.4, top) have narrower 95% credible intervals than those from the samples of one thousand observations (Fig. 2.4, bottom). This is expected because a larger sample contains less (random) sampling variation than a smaller one, yielding a more precise estimate. However, a larger sample size only improves precision, not accuracy. This is evident in the biased sample (Fig. 2.4, left side), where a larger sample size provides a more precise but incorrect answer.
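The precision/accuracy distinction can also be sketched numerically, independent of the fits above. The vote distribution, the systematic offset, and the sample sizes below are hypothetical choices for illustration: the offset stands in for the systematic error introduced by a biased design, which no increase in sample size can remove.

```r
set.seed(2023) #set random number seed for reproducibility

#Hypothetical population distribution of votes 1-7 (population mean vote = 4)
p <- c(0.05, 0.10, 0.20, 0.30, 0.20, 0.10, 0.05)

#Sampling distribution of the mean vote for a given sample size;
#'offset' stands in for the systematic error of a biased design
sample.means <- function(n, offset = 0, reps = 1000) {
  replicate(reps, mean(sample(1:7, size = n, replace = TRUE, prob = p)) + offset)
}

#Precision: the spread of the sample mean shrinks roughly as 1/sqrt(n)
sd(sample.means(n = 1e3)) #wider
sd(sample.means(n = 1e4)) #about sqrt(10) times narrower

#Accuracy: the systematic offset is untouched by the larger sample
mean(sample.means(n = 1e4, offset = 0.5)) #still about 4.5, not the true 4
```

The tenfold larger sample narrows the spread of the estimate by roughly a factor of three, but the offset estimate remains centred on the wrong value.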

Session info

Version information about R, the OS and attached or loaded packages.

R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22631)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.utf8 
[2] LC_CTYPE=English_United Kingdom.utf8   
[3] LC_MONETARY=English_United Kingdom.utf8
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] EnvStats_2.8.1    brms_2.21.0       Rcpp_1.0.12       tidybayes_3.0.6  
 [5] lubridate_1.9.3   forcats_1.0.0     stringr_1.5.1     dplyr_1.1.4      
 [9] purrr_1.0.2       readr_2.1.5       tidyr_1.3.1       tibble_3.2.1     
[13] tidyverse_2.0.0   gt_0.11.0         data.table_1.15.4 ggplot2_3.5.1    
[17] dagitty_0.3-4     ggdag_0.2.12     

loaded via a namespace (and not attached):
 [1] nlme_3.1-162         matrixStats_1.3.0    rstan_2.32.6        
 [4] tensorA_0.36.2.1     tools_4.2.3          backports_1.4.1     
 [7] utf8_1.2.4           R6_2.5.1             colorspace_2.1-0    
[10] ggdist_3.3.2         withr_3.0.0          tidyselect_1.2.1    
[13] gridExtra_2.3        Brobdingnag_1.2-9    emmeans_1.10.3      
[16] curl_5.2.1           compiler_4.2.3       cli_3.6.2           
[19] arrayhelpers_1.1-0   xml2_1.3.6           sass_0.4.9          
[22] labeling_0.4.3       posterior_1.6.0      scales_1.3.0        
[25] checkmate_2.3.1      mvtnorm_1.2-4        QuickJSR_1.3.1      
[28] digest_0.6.36        StanHeaders_2.32.10  rmarkdown_2.27      
[31] pkgconfig_2.0.3      htmltools_0.5.8.1    fastmap_1.2.0       
[34] htmlwidgets_1.6.4    rlang_1.1.4          rstudioapi_0.16.0   
[37] farver_2.1.2         generics_0.1.3       svUnit_1.0.6        
[40] jsonlite_1.8.8       distributional_0.4.0 inline_0.3.19       
[43] magrittr_2.0.3       loo_2.8.0            bayesplot_1.11.1    
[46] Matrix_1.6-5         munsell_0.5.1        fansi_1.0.6         
[49] viridis_0.6.5        abind_1.4-5          lifecycle_1.0.4     
[52] stringi_1.8.4        yaml_2.3.9           ggraph_2.2.1        
[55] MASS_7.3-58.2        pkgbuild_1.4.4       grid_4.2.3          
[58] ggrepel_0.9.5        parallel_4.2.3       lattice_0.20-45     
[61] graphlayouts_1.1.1   hms_1.1.3            knitr_1.48          
[64] pillar_1.9.0         igraph_2.0.3         boot_1.3-28.1       
[67] estimability_1.5.1   codetools_0.2-19     stats4_4.2.3        
[70] rstantools_2.4.0     glue_1.7.0           evaluate_0.24.0     
[73] V8_4.4.2             RcppParallel_5.1.8   tweenr_2.0.3        
[76] vctrs_0.6.5          tzdb_0.4.0           polyclip_1.10-6     
[79] gtable_0.3.5         cachem_1.1.0         ggforce_0.4.2       
[82] xfun_0.46            xtable_1.8-4         tidygraph_1.3.1     
[85] coda_0.19-4.1        viridisLite_0.4.2    memoise_2.0.1       
[88] timechange_0.3.0     bridgesampling_1.1-2

  1. We have selected only a few variables of interest to limit the complexity of the DAG and shift the focus to statistical data analysis.↩︎

  2. The slice_sample() function automatically standardises the sampling weights (i.e., weights) to sum to 1.↩︎

  3. The scientific model is generally an approximation (a best guess) of the unknown data-generating process. As such, it would have some inaccuracies/errors, which will inevitably affect the data analysis.↩︎

  4. In all simulation examples, we relied on the regression framework (ordinal or beta regression) within the frequentist or Bayesian approach. However, other approaches are available and could be used depending on the question and the data. It is the researcher’s responsibility to justify the choice of the statistical framework (and any decisions made during the modelling process) to answer the specific question of interest.↩︎

  5. Here, the intercept is omitted (i.e., set to zero). Omitting the intercept term allows the full set of thresholds \(\tau_{1},\ ...,\ \tau_{k}\) to be identified.↩︎

  6. The 12,000 post-warmup draws derive from iter = 4000 minus warmup = 1000, multiplied by chains = 4, that is, \((4,000-1,000) \times 4\). These values can be found in the code for fit_sample.random and fit_sample.bias.↩︎

  7. Precision typically refers to how close repeated measurements are to each other.↩︎

  8. Accuracy generally refers to how close measurements are to their true but unknown value.↩︎