Build a trait-environment GLMM formula safely and flexibly

Constructs a model formula for trait-environment analyses in a single step. The function (i) auto-detects trait and environment columns in a long-format table, (ii) adds fixed effects for every trait and every environment variable, (iii) optionally includes all pairwise \(trait \times environment\) interactions, and (iv) appends user-specified random-effects terms. The returned object is a standard R formula that can be passed to glmmTMB, lme4, and other packages that accept lme4-style random-effects terms.

Usage

build_glmm_formula(
  data,
  response = "count",
  species_col = "species",
  site_col = "site_id",
  trait_cols = NULL,
  env_cols = NULL,
  env_exclude = c("site_id", "x", "y", "count", "species"),
  include_interactions = TRUE,
  random_effects = c("(1 | species)", "(1 | site_id)")
)

Arguments

data: data.frame. Long-format observations (e.g., species-by-site), including the response, species ID, site ID, trait columns, and environment columns.
response: character (default "count"). Name of the response variable (e.g., count/abundance).
species_col: character (default "species"). Column name identifying species.
site_col: character (default "site_id"). Column name identifying sites.
trait_cols: NULL or character vector. If NULL (default), traits are auto-detected using the name prefixes ^trait_, ^t_, or ^trt_. If no column matches, falls back to every column that is not in env_exclude and is not the response, species_col, or site_col. Pass explicit names for full control.
env_cols: NULL or character vector. If NULL (default), environment variables are auto-detected using the name prefixes ^env_, ^e_, ^clim_, or ^soil_. If no column matches, falls back to every column that is neither a trait nor excluded.
env_exclude: character vector. Columns to skip during fallback auto-detection (see trait_cols and env_cols). Defaults to c("site_id", "x", "y", "count", "species"). Adjust to your schema.
include_interactions: logical (default TRUE). If TRUE, adds a single block term (traits):(envs) which expands to all pairwise \(trait \times environment\) interactions.
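random_effects: character vector. Random-effect terms to append to the right-hand side (e.g., "(1 | species)"). Use character(0) to omit random effects. The default adds random intercepts for species and site: c("(1 | species)", "(1 | site_id)"). The default is a literal character vector, so update it if your grouping columns have other names.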

Value

A formula with fixed effects (traits + envs + interactions) and any requested random effects, e.g.:


  count ~ trait_cont1 + ... + trait_cat + env1 + ... + envK +
          (trait_cont1 + ... + trait_cat):(env1 + ... + envK) +
          (1 | species) + (1 | site_id)
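
Because the result is an ordinary formula, base R tools can inspect it before fitting. The following is a minimal sketch, not part of the package, that checks whether every variable in a formula exists in the data. The formula and the data frame `dat` are hand-written stand-ins for illustration.

```r
# Hand-written stand-in for a formula returned by build_glmm_formula().
fml <- count ~ trait_cont1 + env1 + trait_cont1:env1 +
  (1 | species) + (1 | site_id)

# Illustrative data that deliberately lacks the site_id column.
dat <- data.frame(
  count = 1:3,
  trait_cont1 = c(0.1, 0.2, 0.3),
  env1 = c(1, 2, 3),
  species = factor(c("a", "b", "a"))
)

# all.vars() also returns the variables used inside random-effects terms.
missing_vars <- setdiff(all.vars(fml), names(dat))
missing_vars
#> [1] "site_id"
```

A non-empty result usually means that random_effects or an explicit column vector refers to a name that is not in data.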

Details

Auto-detection:

Traits: first tries prefixes ^trait_, ^t_, ^trt_. If none match, uses all columns not in env_exclude, not response, not species_col, and not site_col.
Environment: first tries prefixes ^env_, ^e_, ^clim_, ^soil_. If none match, uses remaining non-excluded columns not already assigned as traits.
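
The two rules above can be sketched in a few lines of base R. This is an illustration of the documented behavior under stated assumptions, not the function's actual source; `detect_by_prefix` is a hypothetical helper, and the column names mirror the toy data in the Examples section.

```r
# Hypothetical helper: keep the names that start with any of the prefixes.
detect_by_prefix <- function(cols, prefixes) {
  pattern <- paste0("^(", paste(prefixes, collapse = "|"), ")")
  cols[grepl(pattern, cols)]
}

cols <- c("site_id", "species", "x", "y", "count",
          "trait_cont1", "trait_cont2", "trait_cat", "env1", "env2")
exclude <- c("site_id", "x", "y", "count", "species")

# Traits: prefix match first, otherwise every non-excluded column.
traits <- detect_by_prefix(cols, c("trait_", "t_", "trt_"))
if (length(traits) == 0) traits <- setdiff(cols, exclude)

# Environment: prefix match first, otherwise whatever is left over.
envs <- detect_by_prefix(cols, c("env_", "e_", "clim_", "soil_"))
if (length(envs) == 0) envs <- setdiff(cols, c(traits, exclude))

traits
#> [1] "trait_cont1" "trait_cont2" "trait_cat"
envs
#> [1] "env1" "env2"
```

Here env1 and env2 carry no underscore after the prefix, so they do not match ^env_ and are found only through the fallback. Passing trait_cols and env_cols explicitly avoids relying on naming conventions.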

Interactions: When include_interactions = TRUE, a single block term (t1 + t2 + ...):(e1 + e2 + ...) is inserted. R's formula machinery expands it to all pairwise interactions when the model is fitted. Set include_interactions = FALSE if the resulting design is too large or if you prefer to add targeted interactions yourself.
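
To see how the block term expands, the term labels of a small formula can be inspected with stats::terms(). This sketch uses placeholder variable names (t1, t2, e1, e2) and only the fixed-effects part of the formula.

```r
# Placeholder names: two traits (t1, t2) and two environment variables (e1, e2).
f <- count ~ t1 + t2 + e1 + e2 + (t1 + t2):(e1 + e2)

# terms() expands the block term into individual interaction terms.
labels <- attr(terms(f), "term.labels")
interaction_terms <- labels[grepl(":", labels, fixed = TRUE)]

interaction_terms
length(interaction_terms)  # 2 traits x 2 environment variables = 4 terms
```

The number of interaction terms grows as the product of the number of traits and the number of environment variables, and categorical traits add one coefficient per non-reference level on top of that.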

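Random effects: Terms are appended verbatim, so any lme4-style term can be supplied, including random intercepts and random slopes. For example, c("(1 | species)", "(1 | site_id)") or c("(1 + key_trait | species)"), where key_trait stands for one of your trait columns.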
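
One plausible way to assemble such a formula from its parts is to paste the pieces together and convert the string with as.formula(). The sketch below illustrates the idea with a random slope; it assumes this string-based approach and is not taken from the function's source.

```r
traits <- c("trait_cont1", "trait_cat")
envs <- c("env1", "env2")
random_effects <- c("(1 + trait_cont1 | species)", "(1 | site_id)")

# Single block term covering all trait-by-environment pairs.
interaction_block <- sprintf(
  "(%s):(%s)",
  paste(traits, collapse = " + "),
  paste(envs, collapse = " + ")
)

# Random-effects strings are appended unchanged; character(0) adds nothing.
rhs <- c(traits, envs, interaction_block, random_effects)
fml <- as.formula(paste("count ~", paste(rhs, collapse = " + ")))

fml
all.vars(fml)
```

Because the terms are pasted in unchanged, a typo in a random-effects string surfaces only when as.formula() parses it or when the model is fitted.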

Examples

# Minimal reproducible toy example -----------------------------------------
set.seed(1)
n <- 100
longDF <- data.frame(
  site_id = factor(sample(paste0("s", 1:10), n, TRUE)),
  species = factor(sample(paste0("sp", 1:15), n, TRUE)),
  x = runif(n), y = runif(n),
  count = rpois(n, lambda = 3),
  # traits
  trait_cont1 = rnorm(n),
  trait_cont2 = rnorm(n),
  trait_cat = factor(sample(letters[1:3], n, TRUE)),
  # environments
  env1 = scale(rnorm(n))[, 1],
  env2 = scale(runif(n))[, 1]
)

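# Build a full formula with all trait × environment interactions and the default
# random effects. env1 and env2 lack the env_ prefix, so they are detected through
# the fallback rule (columns that are neither traits nor excluded).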
fml <- build_glmm_formula(longDF)
fml
#> count ~ trait_cont1 + trait_cont2 + trait_cat + env1 + env2 + 
#>     (trait_cont1 + trait_cont2 + trait_cat):(env1 + env2) + (1 | 
#>     species) + (1 | site_id)
#> <environment: 0x563c7bec6e88>

# Example fit (uncomment if glmmTMB is installed)
# mod <- glmmTMB::glmmTMB(fml, data = longDF, family = glmmTMB::tweedie(link = "log"))
# summary(mod)

# Targeted columns & no interactions
fml2 <- build_glmm_formula(
  data = longDF,
  trait_cols = c("trait_cont1", "trait_cont2", "trait_cat"),
  env_cols = c("env1", "env2"),
  include_interactions = FALSE,
  random_effects = character(0)
)
fml2
#> count ~ trait_cont1 + trait_cont2 + trait_cat + env1 + env2
#> <environment: 0x563c7bfc2728>