This tutorial demonstrates how to simulate plot data in a plant breeding field trial which contains appropriate genetic and non-genetic effects. The genetic effects capture differences between genotypes, while the non-genetic effects capture environmental effects and spatial variation.

The objective of this tutorial is to simulate basic plot data predominately using base R functions. More complex plot data will be simulated in the next practical using FieldSimR.


Summary

The simulation can be summarised by the following steps:

  1. Simulating plot errors which capture spatial variation
    1. Global and local trend
    2. Random error
    3. Extraneous variation

  1. Simulating appropriate genetic values and non-genetic effects

  2. Generating phenotypes by combining the plot errors with the genetic values and non-genetic effects
    1. Experimental design
    2. Constructing a dataset
    3. Measures of heritability and expected accuracy


Preliminaries

Clean the working environment.

rm(list = ls())

Install FieldSimR from github.

library(devtools)
install_github("crWerner/fieldsimr")

Load required packages.

library(FieldSimR)
library(ggplot2)
library(interp)

The exact results in this tutorial can be reproduced by initially setting the following seed.

set.seed(123)


Simulation details

Consider a scenario in which 200 wheat genotypes are evaluated for grain yield (t/ha) in a single field trial. The genotypes are assumed to be independent for simplicity, but will be simulated with genetic relatedness in the next tutorial using AlphaSimR. The trial comprises 10 columns and 20 rows for 200 plots in total. There are two complete blocks of all genotypes aligned in the column direction (side-by-side). The trial is 80 m long in the column direction and 40 m wide in the row direction, with plots 8m long by 2m wide. The trait mean is 4 t/ha and the trait variance is 0.2 t2/ha2. The plot-level heritability is 0.3, which reflects the ratio of genetic variance to total phenotypic variance. The error variance is given by 0.2/0.3-0.2 = 0.47 t2/ha2.


The simulation parameters are:

n_genos <- 100 # no. genotypes
n_reps <- n_blocks <- 2 # no. replicates/blocks
n_cols <- 10   # no. columns
n_rows <- 20   # no. rows
n_plots <- n_cols * n_rows # no. plots == n_genos * n_reps
mu <- 4        # trait mean
sigma2 <- 0.2  # trait variance 
H2 <- 0.3      # plot-level heritability
sigma2e <- sigma2/H2 - sigma2 # error variance
round(sigma2e, digits = 2)
#> [1] 0.47


The field trial layout is:


The phenotypes are generated based on the linear mixed model:

\(\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{gv} + \mathbf{e}\),

where \(\mathbf{b}\) is a vector of non-genetic effects with design matrix \(\mathbf{X}\), \(\mathbf{gv}\) is a vector of genetic values with design matrix \(\mathbf{Z}\) and \(\mathbf{e}\) is a vector of plot errors.

The simulation involves three steps:

  1. Simulating the plot errors
  2. Simulating the genetic values and non-genetic effects
  3. Generating the phenotypes by combining the plot errors with the genetic values and non-genetic effects.


1. Simulating the plot errors

The plot errors are simulated to capture spatial variation, which reflects heterogeneity across the trial area. Spatial variation can be broadly categorised as either global and local trend (smooth spatial trend), random error (noise) or extraneous variation (systematic error). Readers are referred to Gilmour et al. (1997) and Stefanova et al. (2009) for further details.

The plot errors are constructed as the sum of three terms:

\(\mathbf{e} = \mathbf{se} + \mathbf{re} + \mathbf{ee}\),

where \(\mathbf{se}\) is a vector of spatial errors that capture global and local trend, \(\mathbf{re}\) is a vector of random errors that capture random measurement error/noise and \(\mathbf{ee}\) is a vector of extraneous errors that capture extraneous variation. The plot errors are generated based on a spatial error variance of 0.19, random error variance of 0.19 and an extraneous error variance of 0.09. This allocates 0.4 of the total error variance to the spatial errors, 0.4 to the random errors and 0.2 to the extraneous errors.

Simulating the plot errors involves three parts:

  1. Simulating global and local trend
  2. Simulating random error
  3. Simulating extraneous variation.


a. Simulating global and local trend

The spatial errors capture global and local trend, e.g. large scale fertility gradients across the trial area (Green, 1985) and small scale changes in soil composition between neighbouring plots (Gilmour et al., 1997).

The spatial errors can be simulated two different ways:

  • Constructing an error correlation matrix
  • Applying a smoothing function.


Constructing an error correlation matrix

The first way to simulate the spatial errors is to construct a correlation matrix, and then simulate effects based on this matrix. The correlation matrix is structured as a separable first order autoregressive (AR1) process, which comprises a separate autocorrelation parameter for the column and row dimensions.

The spatial errors are simulated in this case as:

\(\mathbf{se} \sim \mbox{N}\big(\mathbf{0}, \sigma_s^2\boldsymbol{\Sigma}_\mathbf{c}(\rho_c) \otimes \boldsymbol{\Sigma}_\mathbf{r}(\rho_r)\big)\),

where \(\sigma_s^2\) is the spatial error variance, \(\rho_c\) is the column autocorrelation parameter with correlation matrix \(\boldsymbol{\Sigma}_\mathbf{c}\) and \(\rho_r\) is the row autocorrelation parameter with correlation matrix \(\boldsymbol{\Sigma}_\mathbf{r}\). The error correlation matrix will be initially constructed with a column autocorrelation of 0.5 and a row autocorrelation of 0.7. Other autocorrelation parameters will also be demonstrated.

A function is used to construct the column and row correlation matrices, which enables the matrix dimensions and autocorrelation parameters to be changed as required.

ar1_fn <- function(n, autocor){autocor^abs(outer(1:n, 1:n, "-"))}
ar1_col <- ar1_fn(n = n_cols, autocor = 0.5) # column correlation matrix
round(ar1_col, digits = 2)
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#>  [1,] 1.00 0.50 0.25 0.12 0.06 0.03 0.02 0.01 0.00  0.00
#>  [2,] 0.50 1.00 0.50 0.25 0.12 0.06 0.03 0.02 0.01  0.00
#>  [3,] 0.25 0.50 1.00 0.50 0.25 0.12 0.06 0.03 0.02  0.01
#>  [4,] 0.12 0.25 0.50 1.00 0.50 0.25 0.12 0.06 0.03  0.02
#>  [5,] 0.06 0.12 0.25 0.50 1.00 0.50 0.25 0.12 0.06  0.03
#>  [6,] 0.03 0.06 0.12 0.25 0.50 1.00 0.50 0.25 0.12  0.06
#>  [7,] 0.02 0.03 0.06 0.12 0.25 0.50 1.00 0.50 0.25  0.12
#>  [8,] 0.01 0.02 0.03 0.06 0.12 0.25 0.50 1.00 0.50  0.25
#>  [9,] 0.00 0.01 0.02 0.03 0.06 0.12 0.25 0.50 1.00  0.50
#> [10,] 0.00 0.00 0.01 0.02 0.03 0.06 0.12 0.25 0.50  1.00
ar1_row <- ar1_fn(n = n_rows, autocor = 0.7) # row correlation matrix
round(ar1_row, digits = 2)[1:10,1:10] # only print first 10 rows for brevity
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#>  [1,] 1.00 0.70 0.49 0.34 0.24 0.17 0.12 0.08 0.06  0.04
#>  [2,] 0.70 1.00 0.70 0.49 0.34 0.24 0.17 0.12 0.08  0.06
#>  [3,] 0.49 0.70 1.00 0.70 0.49 0.34 0.24 0.17 0.12  0.08
#>  [4,] 0.34 0.49 0.70 1.00 0.70 0.49 0.34 0.24 0.17  0.12
#>  [5,] 0.24 0.34 0.49 0.70 1.00 0.70 0.49 0.34 0.24  0.17
#>  [6,] 0.17 0.24 0.34 0.49 0.70 1.00 0.70 0.49 0.34  0.24
#>  [7,] 0.12 0.17 0.24 0.34 0.49 0.70 1.00 0.70 0.49  0.34
#>  [8,] 0.08 0.12 0.17 0.24 0.34 0.49 0.70 1.00 0.70  0.49
#>  [9,] 0.06 0.08 0.12 0.17 0.24 0.34 0.49 0.70 1.00  0.70
#> [10,] 0.04 0.06 0.08 0.12 0.17 0.24 0.34 0.49 0.70  1.00

The column and row correlation matrices are then Kroneckered to form the error correlation matrix.

S_cor <- kronecker(ar1_col, ar1_row)


The spatial errors are initially sampled from a standard normal distribution.

se <- rnorm(n_cols*n_rows, mean = 0, sd = 1)

The spatial errors are then transformed based on the error correlation matrix generated above.

se <- c(t(chol(S_cor)) %*% se)
round(cov2cor(var(matrix(se, ncol = n_cols))), digits = 2) # simulated column correlation matrix
#>        [,1] [,2] [,3]  [,4]  [,5] [,6] [,7]  [,8]  [,9] [,10]
#>  [1,]  1.00 0.32 0.29 -0.07 -0.06 0.10 0.18 -0.37 -0.32  0.06
#>  [2,]  0.32 1.00 0.82  0.52  0.68 0.68 0.32  0.39  0.09  0.20
#>  [3,]  0.29 0.82 1.00  0.49  0.51 0.49 0.35  0.20  0.22  0.10
#>  [4,] -0.07 0.52 0.49  1.00  0.68 0.46 0.23  0.40  0.27  0.55
#>  [5,] -0.06 0.68 0.51  0.68  1.00 0.73 0.45  0.48  0.25  0.52
#>  [6,]  0.10 0.68 0.49  0.46  0.73 1.00 0.33  0.40  0.07  0.27
#>  [7,]  0.18 0.32 0.35  0.23  0.45 0.33 1.00  0.49  0.39  0.32
#>  [8,] -0.37 0.39 0.20  0.40  0.48 0.40 0.49  1.00  0.55  0.38
#>  [9,] -0.32 0.09 0.22  0.27  0.25 0.07 0.39  0.55  1.00  0.34
#> [10,]  0.06 0.20 0.10  0.55  0.52 0.27 0.32  0.38  0.34  1.00
round(ar1_col, digits = 2) # expected column correlation matrix
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#>  [1,] 1.00 0.50 0.25 0.12 0.06 0.03 0.02 0.01 0.00  0.00
#>  [2,] 0.50 1.00 0.50 0.25 0.12 0.06 0.03 0.02 0.01  0.00
#>  [3,] 0.25 0.50 1.00 0.50 0.25 0.12 0.06 0.03 0.02  0.01
#>  [4,] 0.12 0.25 0.50 1.00 0.50 0.25 0.12 0.06 0.03  0.02
#>  [5,] 0.06 0.12 0.25 0.50 1.00 0.50 0.25 0.12 0.06  0.03
#>  [6,] 0.03 0.06 0.12 0.25 0.50 1.00 0.50 0.25 0.12  0.06
#>  [7,] 0.02 0.03 0.06 0.12 0.25 0.50 1.00 0.50 0.25  0.12
#>  [8,] 0.01 0.02 0.03 0.06 0.12 0.25 0.50 1.00 0.50  0.25
#>  [9,] 0.00 0.01 0.02 0.03 0.06 0.12 0.25 0.50 1.00  0.50
#> [10,] 0.00 0.00 0.01 0.02 0.03 0.06 0.12 0.25 0.50  1.00
round(cov2cor(var(t(matrix(se, ncol = n_cols)))), digits = 2)[1:10,1:10] # simulated row correlation matrix
#>        [,1]  [,2]  [,3] [,4]  [,5]  [,6]  [,7] [,8]  [,9] [,10]
#>  [1,]  1.00  0.73 -0.24 0.01 -0.23 -0.09 -0.28 0.01  0.58  0.15
#>  [2,]  0.73  1.00  0.03 0.03 -0.36 -0.06 -0.12 0.17  0.70  0.28
#>  [3,] -0.24  0.03  1.00 0.36  0.24  0.73  0.68 0.39 -0.16 -0.14
#>  [4,]  0.01  0.03  0.36 1.00  0.77  0.82  0.78 0.56  0.23  0.14
#>  [5,] -0.23 -0.36  0.24 0.77  1.00  0.79  0.72 0.49 -0.06 -0.10
#>  [6,] -0.09 -0.06  0.73 0.82  0.79  1.00  0.87 0.57  0.01 -0.09
#>  [7,] -0.28 -0.12  0.68 0.78  0.72  0.87  1.00 0.81  0.08  0.17
#>  [8,]  0.01  0.17  0.39 0.56  0.49  0.57  0.81 1.00  0.50  0.40
#>  [9,]  0.58  0.70 -0.16 0.23 -0.06  0.01  0.08 0.50  1.00  0.48
#> [10,]  0.15  0.28 -0.14 0.14 -0.10 -0.09  0.17 0.40  0.48  1.00
round(ar1_row, digits = 2)[1:10,1:10] # expected row correlation matrix
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#>  [1,] 1.00 0.70 0.49 0.34 0.24 0.17 0.12 0.08 0.06  0.04
#>  [2,] 0.70 1.00 0.70 0.49 0.34 0.24 0.17 0.12 0.08  0.06
#>  [3,] 0.49 0.70 1.00 0.70 0.49 0.34 0.24 0.17 0.12  0.08
#>  [4,] 0.34 0.49 0.70 1.00 0.70 0.49 0.34 0.24 0.17  0.12
#>  [5,] 0.24 0.34 0.49 0.70 1.00 0.70 0.49 0.34 0.24  0.17
#>  [6,] 0.17 0.24 0.34 0.49 0.70 1.00 0.70 0.49 0.34  0.24
#>  [7,] 0.12 0.17 0.24 0.34 0.49 0.70 1.00 0.70 0.49  0.34
#>  [8,] 0.08 0.12 0.17 0.24 0.34 0.49 0.70 1.00 0.70  0.49
#>  [9,] 0.06 0.08 0.12 0.17 0.24 0.34 0.49 0.70 1.00  0.70
#> [10,] 0.04 0.06 0.08 0.12 0.17 0.24 0.34 0.49 0.70  1.00

The discrepancies here are a result of the sampling process, which does not guarantee the columns and/or rows are independent. Note that the differences will become negligible when the number of columns and rows become large.

Lastly, the spatial errors are centred and scaled to a variance of 0.19.

se <- c(scale(se, center = TRUE, scale = TRUE)) * sqrt(0.4 * sigma2e) # centre and scale
round(mean(se), digits = 2)
#> [1] 0
round(var(se), digits = 2)
#> [1] 0.19


The field layout with the spatial errors shaded is:

The spatial errors are moderately correlated between neighbouring plots.

The theoretical and sample variograms are:


A quick look at what happens when the column and row autocorrelations are set to 0.99.

ar1_col <- ar1_fn(n = n_cols, autocor = 0.99) # create column correlation matrix
ar1_row <- ar1_fn(n = n_rows, autocor = 0.99) # create row correlation matrix
S_cor <- kronecker(ar1_col, ar1_row) # create error correlation matrix
se <- rnorm(n_cols*n_rows, mean = 0, sd = 1) # initially sample spatial errors
se <- c(t(chol(S_cor)) %*% se) # transform spatial errors
se <- c(scale(se, center = TRUE, scale = TRUE)) * sqrt(0.4 * sigma2e) # centre and scale

The spatial errors are now highly correlated between neighbouring plots.

The theoretical and sample variograms are:


A quick look at what happens when the column and row autocorrelations are set to 0.

ar1_col <- ar1_fn(n = n_cols, autocor = 0) # create column correlation matrix
ar1_row <- ar1_fn(n = n_rows, autocor = 0) # create row correlation matrix
S_cor <- kronecker(ar1_col, ar1_row) # create error correlation matrix
se <- rnorm(n_cols*n_rows, mean = 0, sd = 1) # initially sample spatial errors
se <- c(t(chol(S_cor)) %*% se) # transform spatial errors
se <- c(scale(se, center = TRUE, scale = TRUE)) * sqrt(0.4 * sigma2e) # centre and scale

The spatial errors are now independent between neighbouring plots.

The theoretical and sample variograms are:


Applying a smoothing function

The second way to simulate the spatial errors is to generate random values at several points in the field, and then apply a smoothing function between these points. The smoothing function takes the form of bivariate interpolation, which applies piecewise linear interpolation in the column and row dimensions. Readers are referred to Akima (1978) for further details.

The process begins by setting up a blank field array, based on the trial and plot dimensions.


Four interpolation (knot) points are placed just outside the trial boarder. The values at the knot points are sampled from a continuous uniform distribution ranging from -1 to 1.

boarder_df <- data.frame(x = c(-1, 81, -1, 81),
                        y = c(-1, -1, 41, 41),
                        # z = runif(4, min = -1, max = 1))
                        z = c(1, 0.2, -1, -0.2)) # for demonstration
round(boarder_df$z, digits = 2)
#> [1]  1.0  0.2 -1.0 -0.2


The coordinates of the column and row centres are obtained.

max.dim <- max(n_cols, n_rows) # maximum number of columns/rows
col_centre <- 8 * (1:(max.dim + 1) - 0.5)
col_centre
#>  [1]   4  12  20  28  36  44  52  60  68  76  84  92 100 108 116 124 132 140 148
#> [20] 156 164
row_centre <- 2 * (1:(max.dim + 1) - 0.5)
row_centre
#>  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41

There appears to be an issue with the interpolating function below. To overcome this, a square field array is initially constructed and then trimmed after interpolation. This is why maximim is taken above.


Bivariate interpolation is then applied to the continuous array between the trial boarder. A single value is returned for each plot by averaging over the continuous array within the plot boundary. This is achieved using the interp package.

se <- c(t(interp::interp(x = boarder_df$x, # boarder knot point x-coordinates
                        y = boarder_df$y, # boarder knot point y-coordinates
                        z = boarder_df$z, # boarder knot point z-values
                        xo = col_centre,
                        yo = row_centre)$z)[1:n_rows, 1:n_cols])

Lastly, the spatial errors are centred and scaled to a variance of 0.19.

se <- c(scale(se, center = TRUE, scale = TRUE)) * sqrt(0.4 * sigma2e) # centre and scale
round(mean(se), digits = 2)
#> [1] 0
round(var(se), digits = 2)
#> [1] 0.19


The field layout with the spatial errors shaded is:

The trellis plot is:


A quick look at what happens when 10 additional knot points are sampled within the field boarder.

se <- c(t(interp::interp(x = c(boarder_df$x, runif(20, min = 0, max = 80)), # additional knot point x-coordinates
                        y = c(boarder_df$y, runif(20, min = 0, max = 40)), # additional knot point y-coordinates
                        z = c(boarder_df$z, runif(20, min = -1, max = 1)), # additional knot point z-values
                        xo = col_centre,
                        yo = row_centre)$z)[1:n_rows, 1:n_cols])
se <- c(scale(se, center = TRUE, scale = TRUE)) * sqrt(0.4 * sigma2e) # centre and scale

The trellis plot is:


A quick look at what happens when 100 additional knot points are sampled within the field boarder.

se <- c(t(interp::interp(x = c(boarder_df$x, runif(100, min = 0, max = 80)), # additional knot point x-coordinates
                        y = c(boarder_df$y, runif(100, min = 0, max = 40)), # additional knot point y-coordinates
                        z = c(boarder_df$z, runif(100, min = -1, max = 1)), # additional knot point z-values
                        xo = col_centre,
                        yo = row_centre)$z)[1:n_rows, 1:n_cols])
se <- c(scale(se, center = TRUE, scale = TRUE)) * sqrt(0.4 * sigma2e) # centre and scale

The trellis plot is:


b. Simulating random error

The random errors capture measurement error and noise, e.g. intrinsic variability within the plots (Besag, 1977).

The random errors are sampled from a standard normal distribution, and then centred and scaled to a variance of 0.19.

re <- rnorm(n_plots, mean = 0, sd = 1)
re <- c(scale(re, center = TRUE, scale = TRUE)) * sqrt(0.4*sigma2e) # centre and scale
round(mean(re), digits = 2)
#> [1] 0
round(var(re), digits = 2)  # random error variance
#> [1] 0.19


The field layout with the random errors shaded is:


c. Simulating extraneous variation

The extraneous errors capture extraneous variation predominately induced during the conduct of the trial, e.g. serpentine harvesting or spraying and unequal plot dimensions (Stefanova et al., 2009). This type of variation is assumed to be aligned exclusively with the column and row dimensions of the trial. Row errors will be demonstrated below, but note that the extension to column errors is straightforward.

The extraneous row errors are sampled from a standard normal distribution, centred, and then re-ordered to obtain a zig-zag pattern.

# ee <- rnorm(n_rows, mean = 0, sd = 1)
ee <- seq(0.1,2,0.1) # for demonstration 
ee <- c(scale(ee, center = TRUE, scale = FALSE))
ee <- c(t(cbind(sample(ee[ee < 0]), sample(ee[ee >= 0]))))

The extraneous row errors are repeated across all columns, and then scaled to a variance of 0.09.

ee <- rep(ee, times = n_cols)
ee <- c(scale(ee, center = FALSE, scale = TRUE)) * sqrt(0.2*sigma2e) # centre and scale
round(mean(ee), digits = 2)
#> [1] 0
round(var(ee), digits = 2)  # extraneous error variance
#> [1] 0.09


The field layout with the extraneous row errors shaded is:


Total plot errors

The total plot errors are constructed by summing the spatial, random and extraneous errors, and then scaling to a variance of 0.47

e <- se + re + ee
e <- scale(e, scale = TRUE) * sqrt(sigma2e) # scale
round(var(e), digits = 2) # total error variance
#>      [,1]
#> [1,] 0.47

The field layout with the total plot errors shaded is:

The trellis plot is:


2. Simulating the genetic values and non-genetic effects

Genetic values

The genetic values are sampled from a standard normal distribution, and then shifted to the trait mean of 4 and scaled to the trait variance of 0.2. These values represent differences in yield between genotypes that arise due to differences in additive and non-additive genetic effects.

gv <- rnorm(n_genos, mean = 0, sd = 1)
gv <- c(scale(gv, center = TRUE, scale = TRUE)) * sqrt(sigma2) + mu # shift and scale
mean(gv) # trait mean
#> [1] 4
var(gv)  # trait variance
#> [1] 0.2


Non-genetic effects

The non-genetic effects are sampled from a standard normal distribution, and then centred and scaled to a variance of 0.01. These effects represent minor differences in yield between blocks.

b <- rnorm(n_blocks, mean = 0, sd = 1)
b <- c(scale(b, center = TRUE, scale = TRUE)) * sqrt(0.01) # centre and scale
round(b, digits = 2) # block effects
#> [1]  0.07 -0.07

The yield in block 2 is slightly higher on average than in block 1.


3. Generating the phenotypes

Generating the phenotypes involves three parts:

  1. Generating design matrices for the simulated effects
  2. Generating phenotypes from the simulated effects
  3. Obtaining summary measures of heritability and expected accuracy.


a. Generating design matrices

The design matrices are used to link the simulated effects to the phenotypes, as determined by the experimental design.

A randomised complete block design is used to allocate genotypes to plots. A data frame is constructed that contains the plot factors “block”, “column” and “row”. The data frame is ordered as rows within columns.

plot_df <- data.frame(block = factor(rep(1:n_blocks, each = n_plots/n_blocks)),
                      col = factor(rep(1:n_cols, each = n_rows)),
                      row = factor(rep(1:n_rows)))
head(plot_df)
#>   block col row
#> 1     1   1   1
#> 2     1   1   2
#> 3     1   1   3
#> 4     1   1   4
#> 5     1   1   5
#> 6     1   1   6

A design function is then used to randomly allocate the genotypes (treatments), such that every genotype occurs exactly once in every block. This ensures the design is resolvable with respect to blocks.

plot_df$id <- factor(unlist(lapply(seq_len(n_blocks), function(x) sample(1:n_genos))))
head(plot_df)
#>   block col row id
#> 1     1   1   1 12
#> 2     1   1   2 44
#> 3     1   1   3 90
#> 4     1   1   4 20
#> 5     1   1   5 61
#> 6     1   1   6 92
head(with(plot_df, table(id, block))) # frequency of genotypes in each block
#>    block
#> id  1 2
#>   1 1 1
#>   2 1 1
#>   3 1 1
#>   4 1 1
#>   5 1 1
#>   6 1 1


The field trial layout with genotypes labelled is:


A quick look at some design features.

table(tt1 <- with(plot_df, table(id, col))) # frequency of genotypes in each column
#> 
#>   0   1 
#> 800 200
table(tt2 <- with(plot_df, table(id, row))) # frequency of genotypes in each row
#> 
#>    0    1    2 
#> 1806  188    6
t(tt1) %*% tt1 # concurrence of genotypes between columns
#>     col
#> col   1  2  3  4  5  6  7  8  9 10
#>   1  20  0  0  0  0  6  3  1  4  6
#>   2   0 20  0  0  0  3  5  2  6  4
#>   3   0  0 20  0  0  3  5  3  4  5
#>   4   0  0  0 20  0  4  2  9  3  2
#>   5   0  0  0  0 20  4  5  5  3  3
#>   6   6  3  3  4  4 20  0  0  0  0
#>   7   3  5  5  2  5  0 20  0  0  0
#>   8   1  2  3  9  5  0  0 20  0  0
#>   9   4  6  4  3  3  0  0  0 20  0
#>   10  6  4  5  2  3  0  0  0  0 20
t(tt2) %*% tt2 # concurrence of genotypes between rows
#>     row
#> row   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
#>   1  12  1  0  0  0  1  0  1  1  1  1  0  0  0  1  1  0  0  0  0
#>   2   1 10  0  0  0  0  1  0  1  3  0  1  1  0  1  0  0  0  0  1
#>   3   0  0 10  0  1  0  0  1  0  1  0  0  1  1  2  1  2  0  0  0
#>   4   0  0  0 12  0  2  1  0  0  0  0  0  1  1  0  0  0  2  1  0
#>   5   0  0  1  0 10  0  0  1  1  2  0  1  0  1  1  0  0  0  1  1
#>   6   1  0  0  2  0 10  0  0  0  1  0  0  0  0  1  0  2  1  2  0
#>   7   0  1  0  1  0  0 10  1  0  0  2  0  0  0  0  1  1  2  0  1
#>   8   1  0  1  0  1  0  1 10  0  0  0  1  0  0  0  1  0  1  1  2
#>   9   1  1  0  0  1  0  0  0 14  0  1  0  0  1  1  0  0  0  0  0
#>   10  1  3  1  0  2  1  0  0  0 10  0  1  0  0  0  0  1  0  0  0
#>   11  1  0  0  0  0  0  2  0  1  0 10  2  1  0  0  0  1  1  0  1
#>   12  0  1  0  0  1  0  0  1  0  1  2 10  1  1  0  1  0  0  0  1
#>   13  0  1  1  1  0  0  0  0  0  0  1  1 10  0  0  0  0  1  3  1
#>   14  0  0  1  1  1  0  0  0  1  0  0  1  0 10  0  2  1  1  1  0
#>   15  1  1  2  0  1  1  0  0  1  0  0  0  0  0 12  0  0  1  0  0
#>   16  1  0  1  0  0  0  1  1  0  0  0  1  0  2  0 12  0  0  1  0
#>   17  0  0  2  0  0  2  1  0  0  1  1  0  0  1  0  0 10  0  0  2
#>   18  0  0  0  2  0  1  2  1  0  0  1  0  1  1  1  0  0 10  0  0
#>   19  0  0  0  1  1  2  0  1  0  0  0  0  3  1  0  1  0  0 10  0
#>   20  0  1  0  0  1  0  1  2  0  0  1  1  1  0  0  0  2  0  0 10

Some genotypes occur more than once in a row. This is a consequence of the way the genotypes were allocated to plots, with blocks constructed in the column direction only. More appropriate experimental designs are encouraged that block in both dimensions.


Lastly, the design matrices are obtained.

X <- model.matrix(~block - 1, data = plot_df)
Z <- model.matrix(~id - 1, data = plot_df)


b. Generating phenotypes

The phenotypes are generated based on the linear mixed model:

y <- X %*% b + Z %*% gv + e

The final data frame is then formed by combining the plot and treatment factors with the phenotypes. The true genetic values are also added to aid with interpretation.

trial_df <- data.frame(plot_df,
                       yield = y,
                       gv = Z %*% gv)
head(trial_df)
#>   block col row id    yield       gv
#> 1     1   1   1 12 4.174104 3.730541
#> 2     1   1   2 44 4.834834 3.944484
#> 3     1   1   3 90 4.631071 5.174209
#> 4     1   1   4 20 3.458814 3.769645
#> 5     1   1   5 61 2.276513 3.704377
#> 6     1   1   6 92 3.584390 3.983402


The field trial layout with phenotypes shaded is:


A quick look at the phenotypes, grouped by genotype and block number.


c. Measures of heritability and expected accuracy

The simulated plot-level and line-mean heritabilities can be formally quantified to aid with interpretation. The square-root of the latter represents the expected prediction accuracy for the genetic values.

A quick look at the phenotypes and true genetic values.

The plot-level heritability is:

round(cor(trial_df$yield, trial_df$gv)^2, digits = 2) # simulated plot-level heritability
#> [1] 0.24
round(sigma2/(sigma2 + sigma2e), digits = 2)      # expected plot-level heritability
#> [1] 0.3

The minor discrepancies here are a result of the sampling process.


A quick look at the average phenotypes and true genetic values.

The line-mean heritability is:

round(cor(ave_yield[trial_df$id[1:100]], trial_df$gv[1:100])^2, digits = 2) # simulated line-mean heritability
#> [1] 0.38
round(sigma2/(sigma2 + sigma2e/n_reps), digits = 2) # expected line-mean heritability
#> [1] 0.46
round(cor(ave_yield[trial_df$id[1:100]], trial_df$gv[1:100]), digits = 2) # expected prediction accuracy
#> [1] 0.62