This tutorial demonstrates how to simulate a multi-environment trial (MET) dataset which contains appropriate genetic values and non-genetic effects. The genetic values capture genotype by environment (GxE) interaction, while the non-genetic effects capture environmental effects and spatial variation.
The objective of this tutorial is to simulate a basic MET dataset predominately using base R functions. A more complex MET dataset will be simulated in the next practical using FieldSimR, which generates GxE interaction and spatial variation with more realistic structure and complexity.
Summary
The simulation can be summarised by the following steps:
Clean the working environment.
rm(list = ls())
Load required packages.
library(FieldSimR)
library(ggplot2)
library(mbend)
The exact results in this tutorial can be reproduced by initially setting the following seed.
set.seed(123)
Consider a scenario in which 200 wheat genotypes are evaluated for grain yield (t/ha) in field trials across 10 growing environments. The genotypes are assumed to be independent for simplicity, but will be simulated with genetic relatedness in the next tutorial using AlphaSimR. Each trial comprises 20 columns and 20 rows for 400 plots in total, with two complete blocks of all genotypes aligned in the column direction (side-by-side). The overall trait mean is 4 t/ha and the overall trait variance is 0.2 t2/ha2. The average plot-level heritability is 0.3, which reflects the average ratio of genetic variance to total phenotypic variance within environments. The error variance is given by 0.2/0.3-0.2 = 0.47 t2/ha2.
The simulation parameters are:
n_genos <- 200 # no. genotypes
n_envs <- 10 # no. environments
n_reps <- n_blocks <- 2 # no. replicates/blocks per environment
n_cols <- 20 # no. columns per environment
n_rows <- 20 # no. rows per environment
n_plots <- n_cols * n_rows # no. plots per environment == n_genos * n_reps
mu <- 4 # overall trait mean
sigma2 <- 0.2 # overall trait variance
H2 <- 0.3 # average plot-level heritability
sigma2e <- sigma2/H2 - sigma2 # error variance
round(sigma2e, digits = 2)
#> [1] 0.47
The field trial layout for environment 1 is:
The phenotypes are generated based on the linear mixed model:
\(\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{gv} + \mathbf{e}\),
where \(\mathbf{b}\) is the vector of non-genetic effects with design matrix \(\mathbf{X}\), \(\mathbf{gv}\) is the vector of genetic values with design matrix \(\mathbf{Z}\) and \(\mathbf{e}\) is the vector of plot errors.
The simulation involves three steps:
The genetic values are simulated to capture genotype by environment (GxE) interaction, which reflects the differential response of genotypes to different environments. GxE interaction can be broadly categorised as either heterogeneity of genetic variance (changes in genotype scale) or lack of genetic correlation (changes in genotype rank). Readers are referred to Cockerham (1963) and Cooper et al. (1994) for further details.
The genetic values can be simulated two different ways:
The first approach reflects a compound symmetry model for GxE interaction, while the second approach reflects an unstructured model. The unstructured model provides the most flexible way to simulate GxE interaction with realistic structure and complexity.
The first way to obtain the genetic values is to simulate genotype main effects and GxE interaction effects, and then sum these effects.
The genetic values are constructed as the sum of two terms:
\(\mathbf{gv} = \mathbf{1} \otimes \mathbf{g} + \mathbf{{ge}}\),
where \(\mathbf{g}\) is the vector of genotype main effects and \(\mathbf{{ge}}\) is the vector of GxE interaction effects. These effects are generated based on a main effect variance of 0.1 and an interaction variance of 0.1. This allocates half of the overall trait variance to the genotype main effects and half to the GxE interaction effects.
The 200 genotype main effects are sampled from a standard normal distribution, and then shifted to a mean of 4 and scaled to a variance of 0.1.
g_main <- rnorm(n_genos, mean = 0, sd = 1) # sample
g_main <- c(scale(g_main, center = TRUE, scale = TRUE) * sqrt(0.5*sigma2)) + mu # shift and scale
mean(g_main) # overall trait mean
#> [1] 4
var(g_main) # main effect variance
#> [1] 0.1
The genotype main effects are then repeated across all environments.
g_main <- rep(g_main, times = n_envs)
A quick look at the main effects for genotypes 1 to 5 in environments 1 and 2.
GxE interaction is not demonstrated by these effects because there is neither a change in scale between environments nor a change in rank.
The 2,000 GxE interaction effects are also sampled from a standard normal distribution, and then centred and scaled to a variance of 0.1 for each environment.
GE_int <- matrix(rnorm(n_genos*n_envs, mean = 0, sd = 1), ncol = n_envs) # sample
GE_int <- scale(GE_int, center = TRUE, scale = TRUE) * sqrt(0.5*sigma2) # centre and scale
round(var(GE_int), digits = 2) # interaction variance matrix
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 0.10 -0.01 0.01 0.01 -0.01 0.01 0.00 0.00 0.01 0.00
#> [2,] -0.01 0.10 0.00 0.00 0.00 0.00 0.01 -0.01 0.01 0.00
#> [3,] 0.01 0.00 0.10 -0.01 0.00 0.00 0.00 0.01 0.00 0.00
#> [4,] 0.01 0.00 -0.01 0.10 -0.01 0.01 0.00 -0.01 0.01 0.01
#> [5,] -0.01 0.00 0.00 -0.01 0.10 0.02 0.01 0.01 0.00 0.00
#> [6,] 0.01 0.00 0.00 0.01 0.02 0.10 -0.01 -0.01 0.00 0.00
#> [7,] 0.00 0.01 0.00 0.00 0.01 -0.01 0.10 0.00 0.00 0.00
#> [8,] 0.00 -0.01 0.01 -0.01 0.01 -0.01 0.00 0.10 0.00 0.00
#> [9,] 0.01 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.10 0.00
#> [10,] 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.10
ge_int <- c(GE_int) # vectorise
The off-diagonals of the covariance matrix are not strictly zero, which means that the GxE interaction effects are not completely independent between environments. This is a consequence of the sampling process, and will become negligible as the number of genotypes becomes large. Alternatively, a cholesky decomposition can be used to transform the effects, as will be demonstrated in the next section.
A quick look at the interaction effects for genotypes 1 to 5 in environments 1 and 2.
GxE interaction is demonstrated by these effects because there is a change in rank between environments. This type of interaction is commonly referred to as lack of genetic correlation (Cooper et al., 1994). There also appears to be a change in scale between environments, but note that this is not the case when viewing all genotypes because the interaction variance is the same for all environments. Different interaction variances can be set for each environment as required.
The 2,000 genetic values are obtained by summing the genotype main effects and GxE interaction effects.
gv <- g_main + ge_int
A quick look at the genetic values for genotypes 1 to 5 in environments 1 and 2.
GxE interaction is demonstrated by these effects in the same manner as the GxE interaction effects, i.e. lack of genetic correlation.
The between-environment genetic covariance matrix is:
GV <- matrix(gv, ncol = n_envs)
mean(GV) # overall trait mean
#> [1] 4
round(var(GV), digits = 2)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 0.19 0.09 0.10 0.10 0.10 0.10 0.09 0.10 0.10 0.10
#> [2,] 0.09 0.19 0.10 0.09 0.11 0.09 0.11 0.10 0.11 0.10
#> [3,] 0.10 0.10 0.19 0.08 0.10 0.09 0.10 0.11 0.09 0.09
#> [4,] 0.10 0.09 0.08 0.19 0.09 0.10 0.10 0.09 0.11 0.11
#> [5,] 0.10 0.11 0.10 0.09 0.21 0.11 0.12 0.12 0.10 0.11
#> [6,] 0.10 0.09 0.09 0.10 0.11 0.19 0.09 0.09 0.09 0.09
#> [7,] 0.09 0.11 0.10 0.10 0.12 0.09 0.20 0.11 0.11 0.11
#> [8,] 0.10 0.10 0.11 0.09 0.12 0.09 0.11 0.21 0.11 0.10
#> [9,] 0.10 0.11 0.09 0.11 0.10 0.09 0.11 0.11 0.20 0.10
#> [10,] 0.10 0.10 0.09 0.11 0.11 0.09 0.11 0.10 0.10 0.20
round(mean(diag(var(GV))), digits = 2) # overall trait variance
#> [1] 0.2
The between-environment genetic covariance matrix is generated based on the same genetic variance for each environment (given by the sum of the main effect and interaction variances) and the same genetic covariance for each pair of environments (given by the main effect variance). This is indicative of a compound symmetry model for GxE interaction.
The expected diagonals for a compound symmetry model are 0.2, while the expected off-diagonals are 0.1. The minor differences here are a result of the sampling process, since the genotype main effects and GxE interaction effects are not strictly independent. The differences will become negligible as the number of genotypes becomes large.
The between-environment genetic correlation matrix is:
round(cov2cor(var(GV)), digits = 2)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 1.00 0.46 0.52 0.53 0.47 0.53 0.47 0.50 0.53 0.51
#> [2,] 0.46 1.00 0.50 0.46 0.52 0.49 0.54 0.48 0.54 0.50
#> [3,] 0.52 0.50 1.00 0.44 0.51 0.49 0.50 0.54 0.48 0.48
#> [4,] 0.53 0.46 0.44 1.00 0.46 0.55 0.49 0.46 0.54 0.54
#> [5,] 0.47 0.52 0.51 0.46 1.00 0.57 0.59 0.58 0.50 0.53
#> [6,] 0.53 0.49 0.49 0.55 0.57 1.00 0.46 0.47 0.49 0.48
#> [7,] 0.47 0.54 0.50 0.49 0.59 0.46 1.00 0.51 0.52 0.52
#> [8,] 0.50 0.48 0.54 0.46 0.58 0.47 0.51 1.00 0.53 0.51
#> [9,] 0.53 0.54 0.48 0.54 0.50 0.49 0.52 0.53 1.00 0.51
#> [10,] 0.51 0.50 0.48 0.54 0.53 0.48 0.52 0.51 0.51 1.00
The between-environment genetic correlation matrix is generated based on the same genetic correlation for each pair of environments (given by the proportion of main effect variance). This is why the off-diagonals are ~0.5.
The second way to obtain the genetic values is to simulate a between-environment genetic covariance matrix, and then simulate effects based on this matrix. The between-environment genetic covariance matrix is simulated with a separate genetic variance for each environment and a separate genetic covariance/correlation for each pair of environments. This is indicative of an unstructured model for GxE interaction.
The between-environment genetic covariance matrix is constructed as:
\(\mathbf{G_e} = \mathbf{D}^{1/2}_\mathbf{e}\mathbf{C_e}\mathbf{D}^{1/2}_\mathbf{e}\)
where \(\mathbf{D_e}\) is a diagonal genetic variance matrix and \(\mathbf{C_e}\) is a between-environment genetic correlation matrix.
The genetic values are then simulated as:
\(\mathbf{gv} \sim \mbox{N}(\mu\mathbf{1}, \mathbf{G_e} \otimes \mathbf{I})\),
where \(\mu\) is the overall trait mean.
The 10 genetic variances are sampled from an inverse gamma distribution, with shape parameter of 5 and scale parameter of 1. This can be achieved using the inverse of a gamma distribution, with shape parameter of 5 and rate parameter of 1.
de <- 1/rgamma(n_envs, shape = 5, rate = 1) # sample
round(de, digits = 2)
#> [1] 0.09 0.28 0.21 0.18 0.17 0.29 0.30 0.40 0.21 0.19
The genetic variances are then shifted to give an overall trait variance of 0.2.
de <- c(scale(de, center = TRUE, scale = FALSE) + sigma2) # shift
round(de, digits = 2)
#> [1] 0.06 0.25 0.18 0.15 0.14 0.25 0.26 0.37 0.18 0.16
mean(de) # overall trait variance
#> [1] 0.2
The genetic variance matrix is obtained by pasting the variances into a diagonal matrix.
De <- diag(de)
round(De, digits = 2)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 0.06 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
#> [2,] 0.00 0.25 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
#> [3,] 0.00 0.00 0.18 0.00 0.00 0.00 0.00 0.00 0.00 0.00
#> [4,] 0.00 0.00 0.00 0.15 0.00 0.00 0.00 0.00 0.00 0.00
#> [5,] 0.00 0.00 0.00 0.00 0.14 0.00 0.00 0.00 0.00 0.00
#> [6,] 0.00 0.00 0.00 0.00 0.00 0.25 0.00 0.00 0.00 0.00
#> [7,] 0.00 0.00 0.00 0.00 0.00 0.00 0.26 0.00 0.00 0.00
#> [8,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.37 0.00 0.00
#> [9,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.18 0.00
#> [10,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.16
GxE interaction is demonstrated by this matrix because the genetic variances within environments are different. This type of interaction is commonly referred to as heterogeneity of genetic variance, and occurs as a result of changes in scale between environments without changes in rank (Cooper et al., 1994).
The 45 genetic correlations are sampled from on a continuous uniform distribution ranging from 0.3 to 0.7, which produces a mean genetic correlation of 0.5. The correlations will be simulated with more realistic structure and complexity in the next practical.
ce <- runif(n_envs * (n_envs - 1)/2, min = 0.3, max = 0.7) # sample
The between-environment genetic correlation matrix is obtained by pasting the correlations into a symmetric matrix.
Ce <- matrix(1, n_envs, n_envs)
Ce[upper.tri(Ce)] <- ce
Ce[lower.tri(Ce)] <- t(Ce)[lower.tri(Ce)]
The between-environment genetic correlation matrix may require bending to ensure it has desirable properties, e.g. non-singular/invertible. This is a consequence of the (crude) way the correlations were sampled from the uniform distribution.
Ce <- mbend::bend(Ce)$bent
#> Unweighted bending
#> max.iter = 10000
#> small.positive = 1e-04
#> method = hj
#> Found a correlation matrix.
#> Convergence met after 4 iterations.
round(Ce, digits = 2)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 1.00 0.43 0.67 0.33 0.46 0.34 0.59 0.37 0.64 0.37
#> [2,] 0.43 1.00 0.67 0.61 0.32 0.58 0.46 0.30 0.57 0.54
#> [3,] 0.67 0.67 1.00 0.43 0.59 0.48 0.38 0.44 0.44 0.38
#> [4,] 0.33 0.61 0.43 1.00 0.37 0.31 0.61 0.37 0.57 0.32
#> [5,] 0.46 0.32 0.59 0.37 1.00 0.67 0.57 0.52 0.53 0.67
#> [6,] 0.34 0.58 0.48 0.31 0.67 1.00 0.47 0.54 0.46 0.65
#> [7,] 0.59 0.46 0.38 0.61 0.57 0.47 1.00 0.42 0.39 0.62
#> [8,] 0.37 0.30 0.44 0.37 0.52 0.54 0.42 1.00 0.31 0.54
#> [9,] 0.64 0.57 0.44 0.57 0.53 0.46 0.39 0.31 1.00 0.58
#> [10,] 0.37 0.54 0.38 0.32 0.67 0.65 0.62 0.54 0.58 1.00
The between-environment genetic correlation matrix can be displayed using a heatmap. The heatmap is often ordered based on a dendrogram, which is produced via hierarchical clustering. The dendrogram generally places environments that are more similar closer together, but it is important to use the branches to examine similarity. Higher correlations are distinguished with darker colours and lower correlations with lighter colours.
GxE interaction is represented in this matrix because the genetic correlations between environments are less than one. This type of interaction is commonly referred to as lack of genetic correlation, and occurs as a result of changes in genotype rank between environments (Cooper et al., 1994).
The between-environment genetic covariance matrix is obtained by combining the diagonal genetic variance matrix with the between-environment genetic correlation matrix. The diagonals of this matrix are the genetic variances and the off-diagonals are the genetic covariances.
Ge <- sqrt(De) %*% Ce %*% sqrt(De)
The genetic values are simulated based on the between-environment genetic covariance matrix above. This is achieved by first simulating independent genetic values between environments, and then applying a series of transformations.
The 2,000 genetic values are initially sampled from a standard normal distribution, and centred and scaled to unit variance for each environment.
GV <- matrix(rnorm(n_genos*n_envs, mean = 0, sd = 1), ncol = n_envs) # sample
GV <- scale(GV, center = TRUE, scale = TRUE) # centre and scale
round(var(GV), digits = 2)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 1.00 0.17 0.00 0.11 -0.01 -0.06 -0.02 0.03 0.03 -0.06
#> [2,] 0.17 1.00 0.09 0.18 0.09 0.05 0.11 0.02 0.04 -0.02
#> [3,] 0.00 0.09 1.00 -0.01 -0.01 0.01 0.11 0.05 0.00 0.01
#> [4,] 0.11 0.18 -0.01 1.00 -0.11 0.05 -0.03 0.04 0.00 -0.16
#> [5,] -0.01 0.09 -0.01 -0.11 1.00 -0.02 -0.06 0.03 0.05 0.06
#> [6,] -0.06 0.05 0.01 0.05 -0.02 1.00 0.07 0.05 -0.02 0.08
#> [7,] -0.02 0.11 0.11 -0.03 -0.06 0.07 1.00 -0.06 0.10 0.05
#> [8,] 0.03 0.02 0.05 0.04 0.03 0.05 -0.06 1.00 0.06 -0.02
#> [9,] 0.03 0.04 0.00 0.00 0.05 -0.02 0.10 0.06 1.00 0.00
#> [10,] -0.06 -0.02 0.01 -0.16 0.06 0.08 0.05 -0.02 0.00 1.00
The off-diagonals are not strictly zero, which means that the genetic values are not strictly independent between environments. This is a consequence of the sampling process, and can be rectified with a simple transformation.
GV <- GV %*% t(chol(solve(var(GV)))) # transform
round(var(GV), digits = 2)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 1 0 0 0 0 0 0 0 0 0
#> [2,] 0 1 0 0 0 0 0 0 0 0
#> [3,] 0 0 1 0 0 0 0 0 0 0
#> [4,] 0 0 0 1 0 0 0 0 0 0
#> [5,] 0 0 0 0 1 0 0 0 0 0
#> [6,] 0 0 0 0 0 1 0 0 0 0
#> [7,] 0 0 0 0 0 0 1 0 0 0
#> [8,] 0 0 0 0 0 0 0 1 0 0
#> [9,] 0 0 0 0 0 0 0 0 1 0
#> [10,] 0 0 0 0 0 0 0 0 0 1
The genetic values are now independent between environments.
The final transformation applies the between-environment genetic covariance matrix above. The genetic values are then shifted to a mean of 4.
GV <- GV %*% chol(Ge)
round(var(GV), digits = 2) # simulated between-environment genetic covariance matrix
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 0.06 0.05 0.07 0.03 0.04 0.04 0.08 0.06 0.07 0.04
#> [2,] 0.05 0.25 0.14 0.12 0.06 0.15 0.12 0.09 0.12 0.11
#> [3,] 0.07 0.14 0.18 0.07 0.09 0.10 0.08 0.11 0.08 0.06
#> [4,] 0.03 0.12 0.07 0.15 0.05 0.06 0.12 0.09 0.09 0.05
#> [5,] 0.04 0.06 0.09 0.05 0.14 0.13 0.11 0.12 0.08 0.10
#> [6,] 0.04 0.15 0.10 0.06 0.13 0.25 0.12 0.16 0.10 0.13
#> [7,] 0.08 0.12 0.08 0.12 0.11 0.12 0.26 0.13 0.09 0.13
#> [8,] 0.06 0.09 0.11 0.09 0.12 0.16 0.13 0.37 0.08 0.13
#> [9,] 0.07 0.12 0.08 0.09 0.08 0.10 0.09 0.08 0.18 0.10
#> [10,] 0.04 0.11 0.06 0.05 0.10 0.13 0.13 0.13 0.10 0.16
round(Ge, digits = 2) # expected between-environment genetic covariance matrix
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 0.06 0.05 0.07 0.03 0.04 0.04 0.08 0.06 0.07 0.04
#> [2,] 0.05 0.25 0.14 0.12 0.06 0.15 0.12 0.09 0.12 0.11
#> [3,] 0.07 0.14 0.18 0.07 0.09 0.10 0.08 0.11 0.08 0.06
#> [4,] 0.03 0.12 0.07 0.15 0.05 0.06 0.12 0.09 0.09 0.05
#> [5,] 0.04 0.06 0.09 0.05 0.14 0.13 0.11 0.12 0.08 0.10
#> [6,] 0.04 0.15 0.10 0.06 0.13 0.25 0.12 0.16 0.10 0.13
#> [7,] 0.08 0.12 0.08 0.12 0.11 0.12 0.26 0.13 0.09 0.13
#> [8,] 0.06 0.09 0.11 0.09 0.12 0.16 0.13 0.37 0.08 0.13
#> [9,] 0.07 0.12 0.08 0.09 0.08 0.10 0.09 0.08 0.18 0.10
#> [10,] 0.04 0.11 0.06 0.05 0.10 0.13 0.13 0.13 0.10 0.16
GV <- scale(GV, center = TRUE, scale = FALSE) + mu # shift
gv <- c(GV) # vectorise
round(mean(gv), digits = 2) # overall trait mean
#> [1] 4
The genetic values now have the desired structure.
A quick look at the genetic values for genotypes 1 to 5 in environments 1 and 2.
GxE interaction is demonstrated by these effects because there is a change in scale between environments as well as a change in rank.
These genetic values will be used to construct the MET dataset below.
It is possible to obtain genotype main effects and GxE interaction effects for this approach, although these components were not explicitly simulated. The proportion of main effect and interaction variance can also be formally quantified to aid with interpretation.
The genotype main effects are obtained as simple averages across environments.
g_main <- rowMeans(GV)
round(var(g_main), digits = 2) # simulated main effect variance
#> [1] 0.1
round(mean(Ge), digits = 2) # expected main effect variance
#> [1] 0.1
The GxE interaction effects are then obtained after adjusting for the genotype main effects.
ge_int <- c(GV - g_main)
GE_int <- matrix(ge_int, ncol = n_envs)
round(var(GE_int), digits = 2) # simulated interaction variance matrix
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 0.06 -0.02 0.02 0.00 0.00 -0.03 0.00 -0.03 0.02 -0.01
#> [2,] -0.02 0.11 0.03 0.02 -0.05 0.01 -0.02 -0.06 0.00 -0.01
#> [3,] 0.02 0.03 0.08 -0.01 0.00 -0.02 -0.04 -0.02 -0.02 -0.03
#> [4,] 0.00 0.02 -0.01 0.08 -0.02 -0.04 0.02 -0.03 0.01 -0.03
#> [5,] 0.00 -0.05 0.00 -0.02 0.06 0.01 0.00 0.00 0.00 0.01
#> [6,] -0.03 0.01 -0.02 -0.04 0.01 0.11 -0.02 0.01 -0.02 0.01
#> [7,] 0.00 -0.02 -0.04 0.02 0.00 -0.02 0.12 -0.02 -0.03 0.01
#> [8,] -0.03 -0.06 -0.02 -0.03 0.00 0.01 -0.02 0.20 -0.05 0.00
#> [9,] 0.02 0.00 -0.02 0.01 0.00 -0.02 -0.03 -0.05 0.09 0.00
#> [10,] -0.01 -0.01 -0.03 -0.03 0.01 0.01 0.01 0.00 0.00 0.06
round(mean(diag(var(GE_int))), digits = 2) # simulated pooled interaction variance
#> [1] 0.1
round(mean(diag(Ge - mean(Ge))), digits = 2) # expected pooled interaction variance
#> [1] 0.1
The pooled interaction variance can be further split into heterogeneity of genetic variance and lack of genetic correlation (Cooper et al., 1994).
round(sum((sqrt(diag(var(GV))) - mean(sqrt(diag(var(GV)))))^2)/n_envs, digits = 2) # heterogeneity of genetic variance
#> [1] 0.01
round(sum(outer(sqrt(diag(var(GV))), sqrt(diag(var(GV))), "*") * (1 - cov2cor(var(GV))))/n_envs^2, digits = 2) # lack of genetic correlation
#> [1] 0.09
This indicates that 0.1 of the interaction variance corresponds to changes in scale between environments and 0.9 corresponds to changes in rank.
Simulating the non-genetic effects involves three parts:
The 10 environmental main effects are sampled from a standard normal distribution, and then centred and scaled to unit variance. These effects represent large differences in yield between environments.
b_env <- rnorm(n_envs, mean = 0, sd = 1) # sample
b_env <- c(scale(b_env, center = TRUE, scale = TRUE)) # centre and scale
round(b_env, digits = 2)
#> [1] 0.14 0.15 -0.45 0.10 -2.04 1.36 0.45 -0.68 -0.40 1.37
The 20 block effects are sampled from a normal distribution with variance of 0.01, and then centred for each environment. These effects represent minor differences in yield between blocks within environments.
B_block <- matrix(rnorm(n_blocks*n_envs, mean = 0, sd = sqrt(0.01)), ncol = n_envs) # sample
B_block <- scale(B_block, center = TRUE, scale = FALSE) # centre
b_block <- c(B_block) # vectorise
round(b_block, digits = 2)
#> [1] -0.12 0.12 0.09 -0.09 -0.09 0.09 -0.03 0.03 -0.05 0.05 -0.05 0.05
#> [13] 0.03 -0.03 -0.03 0.03 -0.04 0.04 -0.05 0.05
The plot errors are simulated to capture spatial variation, which can be broadly categorised as either global and local trend (smooth spatial trend), random error (noise) or extraneous variation (systematic error). Readers are referred to the previous tutorial for full details on simulating these components.
The plot errors are obtained as the sum of a spatial error term and a random error term. The variance of each term is set to 0.23, which corresponds to half of the error variance in each environment.
The spatial errors are simulated based on a separable first order autoregressive (AR1) process for each environment. The column autocorrelation is set to 0.5 and the row autocorrelation is set to 0.7.
ar1_fn <- function(n, autocor){autocor^abs(outer(1:n, 1:n, "-"))}
ar1_col <- ar1_fn(n = n_cols, autocor = 0.5) # column correlation matrix
ar1_row <- ar1_fn(n = n_rows, autocor = 0.7) # row correlation matrix
S_cor <- kronecker(ar1_col, ar1_row) # error correlation matrix
The autocorrelation parameters are used for simplicity, but will be varied across environments in the next tutorial using FieldSimR.
The spatial errors are sampled from a standard normal distribution, transformed based on the error correlation matrix above, and then centred and scaled to a variance of 0.23 for each environment.
ES <- matrix(rnorm(n_cols*n_rows*n_envs, mean = 0, sd = 1), ncol = n_envs) # sample
ES <- t(chol(S_cor)) %*% ES # transform
ES <- scale(ES, center = TRUE, scale = TRUE) * sqrt(0.5*sigma2e) # centre and scale
round(diag(var(ES)), digits = 2) # spatial error variances
#> [1] 0.23 0.23 0.23 0.23 0.23 0.23 0.23 0.23 0.23 0.23
es <- c(ES) # vectorise
The random errors are sampled from a standard normal distribution, and then centred and scaled to a variance of 0.23 for each environment.
ER <- matrix(rnorm(n_plots*n_envs, mean = 0, sd = 1), ncol = n_envs) # sample
ER <- scale(ER, center = TRUE, scale = TRUE) * sqrt(0.5*sigma2e) # centre and scale
round(diag(var(ER)), digits = 2) # random error variances
#> [1] 0.23 0.23 0.23 0.23 0.23 0.23 0.23 0.23 0.23 0.23
er <- c(ER) # vectorise
The total plot errors are obtained as the sum of the spatial and random errors.
e <- es + er
Generating the phenotypes involves two parts:
The design matrices are used to link the simulated effects to the phenotypes, as determined by the experimental design.
A randomised complete block design is used to allocate genotypes to plots for each environment. A data frame is constructed that contains the plot factors “env”, “block”, “column” and “row”. The data frame is ordered as rows within columns within environments.
plot_df <- data.frame(env = factor(rep(1:n_envs, each = n_plots)),
block = factor(rep(1:n_blocks, each = n_plots/n_blocks)),
col = factor(rep(1:n_cols, each = n_rows)),
row = factor(rep(1:n_rows)))
head(plot_df) # check the top few lines
#> env block col row
#> 1 1 1 1 1
#> 2 1 1 1 2
#> 3 1 1 1 3
#> 4 1 1 1 4
#> 5 1 1 1 5
#> 6 1 1 1 6
A design function is then generated to allocate the genotypes (treatments), such that every genotype occurs exactly once in every block. A different design function is generated for each environment.
plot_df$id <- factor(unlist(lapply(seq_len(n_blocks*n_envs), function(x) sample(1:n_genos))))
head(plot_df)
#> env block col row id
#> 1 1 1 1 1 114
#> 2 1 1 1 2 72
#> 3 1 1 1 3 135
#> 4 1 1 1 4 63
#> 5 1 1 1 5 49
#> 6 1 1 1 6 65
head(with(plot_df, table(id, block, env))[,,1]) # check the resolvability of the design
#> block
#> id 1 2
#> 1 1 1
#> 2 1 1
#> 3 1 1
#> 4 1 1
#> 5 1 1
#> 6 1 1
The field trial layout for environment 1 with genotypes labelled is:
The design matrices are then obtained.
X_env <- model.matrix(~env - 1, data = plot_df)
X_block <- model.matrix(~block:env - 1, data = plot_df)
Z <- model.matrix(~id:env - 1, data = plot_df)
The phenotypes are generated based on the linear mixed model:
y <- X_env %*% b_env + X_block %*% b_block + Z %*% gv + e
The MET dataset is formed by combining the plot and treatment factors with the phenotypes. The true genetic values are also added. This highlights an appealing feature of simulated data, i.e. the true values are known.
met_df <- data.frame(plot_df,
yield = y,
gv = Z %*% gv)
head(met_df)
#> env block col row id yield gv
#> 1 1 1 1 1 114 4.223903 4.213213
#> 2 1 1 1 2 72 4.596717 3.801605
#> 3 1 1 1 3 135 6.218665 4.445786
#> 4 1 1 1 4 63 4.346109 4.269491
#> 5 1 1 1 5 49 5.082635 4.309022
#> 6 1 1 1 6 65 3.590371 3.582597
The field trial layout for environment 1 with phenotypes shaded is:
A quick look at the phenotypes and genetic values for all environments:
Note the differences in phenotypic mean yield (vertical alignment) and variance (vertical spread) between environments.
The simulated plot-level and line-mean heritabilities can be formally quantified to aid with interpretation.
The plot-level heritabilities within environments are:
round(c(by(met_df[,c("yield","gv")], met_df$env, function(x) {cor(x$yield, x$gv)}))^2, digits = 2) # simulated plot-level heritabilities
#> 1 2 3 4 5 6 7 8 9 10
#> 0.19 0.31 0.30 0.23 0.27 0.36 0.36 0.42 0.30 0.25
round(de/(de + sigma2e), digits = 2) # expected plot-level heritabilities
#> [1] 0.12 0.35 0.28 0.24 0.23 0.35 0.36 0.44 0.28 0.25
The minor discrepancies here are a result of the sampling process.
The line-mean heritabilities within environments are:
ave_yield <- with(met_df, tapply(yield, list(id, env), mean))
tmp <- met_df[met_df$block == 1,]
tmp <- tmp[order(tmp$env, tmp$id),]
round(c(by(tmp[,c("yield","gv")], tmp$env, function(x) {cor(x$yield, x$gv)}))^2, digits = 2) # simulated line-mean heritabilities
#> 1 2 3 4 5 6 7 8 9 10
#> 0.24 0.35 0.28 0.17 0.29 0.32 0.42 0.53 0.35 0.20
round(de/(de + sigma2e/n_reps), digits = 2) # expected line-mean heritabilities
#> [1] 0.21 0.52 0.44 0.38 0.38 0.52 0.53 0.61 0.44 0.40
round(c(by(tmp[,c("yield","gv")], tmp$env, function(x) {cor(x$yield, x$gv)})), digits = 2) # simulated prediction accuracies
#> 1 2 3 4 5 6 7 8 9 10
#> 0.49 0.59 0.53 0.41 0.53 0.56 0.65 0.73 0.59 0.45
round(sqrt(de/(de + sigma2e/n_reps)), digits = 2) # expected prediction accuracies
#> [1] 0.46 0.72 0.66 0.62 0.61 0.72 0.73 0.78 0.66 0.63
The line-mean heritability across all environments is:
ave_yield <- with(met_df, tapply(yield, id, mean))
round(cor(ave_yield, g_main)^2, digits = 2) # simulated line-mean heritability
#> [1] 0.83
round(var(g_main)/(var(g_main) + mean(diag(var(GE_int)))/n_envs + sigma2e/(n_envs*n_reps)), digits = 2) # expected line-mean heritability
#> [1] 0.76
round(cor(ave_yield, g_main), digits = 2) # simulated prediction accuracy
#> [1] 0.91
round(sqrt(var(g_main)/(var(g_main) + mean(diag(var(GE_int)))/n_envs + sigma2e/(n_envs*n_reps))), digits = 2) # expected prediction accuracy
#> [1] 0.87