Overview

Generalized Estimating Equations (GEE) analyze longitudinal or clustered data by extending generalized linear models to estimate population-averaged effects while accounting for within-subject correlation. Unlike subject-specific models, GEE focuses on marginal trends across the population rather than individual-specific trajectories. This tutorial applies GEE to examine sufficient sleep patterns (9–11 hours per night) in ABCD youth across four annual assessments, evaluating whether the likelihood of meeting sleep recommendations changes over time at the population level while accounting for repeated measurements within individuals.

When to Use:

Apply when you have repeated binary or continuous outcomes and want a population-level trend while accounting for within-child correlation.

Key Advantage:

GEE models provide robust estimates of marginal effects without requiring a specific random-effects structure, making them ideal for large ABCD panels.

What You'll Learn:

How to fit a logistic GEE in , interpret population-averaged effects, and diagnose the working correlation structure.

Data Access

Data Download

ABCD data can be accessed through the DEAP platform or the NBDC Data Access Platform (LASSO), which provide user-friendly interfaces for creating custom datasets with point-and-click variable selection. For detailed instructions on accessing and downloading ABCD data, see the DEAP documentation.

Loading Data with NBDCtools

Once you have downloaded ABCD data files, the NBDCtools package provides efficient tools for loading and preparing your data for analysis. The package handles common data management tasks including:

Automatic data joining - Merges variables from multiple tables automatically
Built-in transformations - Converts categorical variables to factors, handles missing data codes, and adds variable labels
Event filtering - Easily selects specific assessment waves

For more information, visit the NBDCtools documentation.

Basic Usage

The create_dataset() function is the main tool for loading ABCD data:

library(NBDCtools)

# Define variables needed for this analysis
requested_vars <- c(
  "var_1",   # Variable 1
  "var_2",   # Variable 2
  "var_3"    # Variable 3
)

# Set path to downloaded ABCD data files
data_dir <- Sys.getenv("ABCD_DATA_PATH", "/path/to/abcd/6_0/phenotype")

# Load data with automatic transformations
abcd_data <- create_dataset(
  dir_data = data_dir,
  study = "abcd",
  vars = requested_vars,
  release = "6.0",
  format = "parquet",
  categ_to_factor = TRUE,   # Convert categorical variables to factors
  value_to_na = TRUE,        # Convert missing codes (222, 333, etc.) to NA
  add_labels = TRUE          # Add variable and value labels
)

Key Parameters

vars - Vector of variable names to load
release - ABCD data release version (e.g., "6.0")
format - File format, typically "parquet" for efficiency
categ_to_factor - Automatically converts categorical variables to factors
value_to_na - Converts ABCD missing value codes to R's NA
add_labels - Adds descriptive labels to variables and values

Additional NBDCtools Resources

For more details on using NBDCtools:

NBDCtools Getting Started Guide - Complete package overview
Joining Data - Advanced data merging strategies
Filtering Events - Selecting specific assessment waves
Data Transformations - Preprocessing and cleaning

Data Preparation

NBDCtools Setup and Data Loading

29 lines

# Load necessary libraries
library(NBDCtools)    # ABCD data access helper
library(arrow)        # Read Parquet files
library(tidyverse)    # Data wrangling & visualization
library(gtsummary)    # Summary tables
library(rstatix)      # Statistical tests in tidy format
library(geepack)      # Generalized Estimating Equations (GEE) analysis
library(ggeffects)    # Extract & visualize model predictions
library(broom)      # Organizing model outputs

### Load harmonized ABCD data required for this analysis
requested_vars <- c(
    "ab_g_dyn__design_site",
    "ab_g_stc__design_id__fam",
    "ph_p_sds__dims_001"
)

data_dir <- Sys.getenv("ABCD_DATA_PATH", "/path/to/abcd/6_0/phenotype")

abcd_data <- create_dataset(
  dir_data = data_dir,
  study = "abcd",
  vars = requested_vars,
  release = "6.0",
  format = "parquet",
  categ_to_factor = TRUE,   # Convert categorical variables to factors
  value_to_na = TRUE,        # Convert missing codes (222, 333, etc.) to NA
  add_labels = TRUE          # Add variable and value labels
)

Show all 29 linesShow less

Data Transformation

24 lines

# Data wrangling: clean, restructure, and recode sleep variable
df_long <- abcd_data %>%
  filter(session_id %in% c("ses-00A", "ses-01A", "ses-02A", "ses-03A")) %>%  # Keep Baseline - Year 3
  arrange(participant_id, session_id) %>%
   mutate(
    participant_id = factor(participant_id), # Convert participant_id to a factor
    session_id = factor(session_id,
                        levels = c("ses-00A", "ses-01A", "ses-02A", "ses-03A"),
                        labels = c("Baseline", "Year_1", "Year_2", "Year_3")),  # Label sessions

    ab_g_dyn__design_site = factor(ab_g_dyn__design_site),  # Convert site to a factor
    ab_g_stc__design_id__fam = factor(ab_g_stc__design_id__fam), # Convert family id to a factor
    ph_p_sds__dims_001 = as.numeric(ph_p_sds__dims_001)  # Convert to numeric
  ) %>%
  filter(ph_p_sds__dims_001 != 999) %>%  # Remove "Don't know" responses
  mutate(
    # Original coding: 1=9-11hrs (sufficient), 2=<9hrs, 3=>11hrs (both insufficient)
    sleep_binary = ifelse(ph_p_sds__dims_001 == 1, 1, 0)  # Recode: 9-11 hrs = 1, others = 0
  ) %>%
  rename(  # Rename for simplicity
    site = ab_g_dyn__design_site,
    family_id = ab_g_stc__design_id__fam,
  ) %>%
  drop_na()  # Remove missing values

Show all 24 linesShow less

Descriptive Statistics

26 lines

# Summarize 'sleep_binary' across session_ids
descriptives_table <- df_long %>%
    select(session_id, sleep_binary) %>%
    tbl_summary(
        by = session_id,
        missing = "no",
        label = list(sleep_binary ~ "Sufficient Sleep (9-11 hrs)"),
        statistic = list(
            all_categorical() ~ "{n} ({p}%)"  # Count & percentage
        )
    ) %>%
    modify_header(all_stat_cols() ~ "**{level}**<br>N = {n}") %>%
    modify_spanning_header(all_stat_cols() ~ "**Assessment Wave**") %>%
    bold_labels() %>%
    italicize_levels()

### Apply compact styling
theme_gtsummary_compact()

descriptives_table <- as_gt(descriptives_table)

### Save the table as HTML
gt::gtsave(descriptives_table, filename = "descriptives_table.html")

### Print the table
descriptives_table

Show all 26 linesShow less

Characteristic	Assessment Wave
Characteristic	Baseline N = 11863¹	Year_1 N = 11198¹	Year_2 N = 10876¹	Year_3 N = 10168¹
Sufficient Sleep (9-11 hrs)	5,623 (47%)	4,381 (39%)	3,415 (31%)	2,940 (29%)
¹ n (%)

Statistical Analysis

Fit GEE Model

24 lines

# Fit GEE model: Predicting sufficient sleep over time
# Site is included to adjust for potential clustering and recruitment differences across ABCD study sites
model <- geeglm(sleep_binary ~ session_id + site,
  id = participant_id,
  data = df_long,
  family = binomial(link = "logit"),
  corstr = "exchangeable"
)

# Generate summary table
model_summary_table <- gtsummary::tbl_regression(model,
  digits = 3,
  intercept = TRUE
) %>%
  gtsummary::as_gt()

model_summary_table

# Save as standalone HTML
gt::gtsave(
  data = model_summary_table,
  filename = "model_summary.html",
  inline_css = FALSE
)

Show all 24 linesShow less

Create Model Diagnostics Table

27 lines

# Create GEE diagnostics data
diagnostics_data <- data.frame(
  Characteristic = c(
    "Correlation Structure",
    "Correlation Parameter (α)",
    "Scale Parameter",
    "Number of Participants",
    "Total Observations"
  ),
  Value = c(
    model$corstr,
    round(model$geese$alpha, 3),
    round(summary(model)$dispersion$Estimate, 3),
    length(unique(df_long$participant_id)),
    nrow(df_long)
  )
)

# Format diagnostics table
diagnostics_table <- diagnostics_data %>%
  gt::gt() %>%
  gt::tab_header(title = "GEE Model Diagnostics")

diagnostics_table

# Save diagnostics table
gt::gtsave(diagnostics_table, filename = "model_diagnostics.html")

Show all 27 linesShow less

Characteristic	log(OR)	95% CI	p-value
(Intercept)	-0.34	-0.50, -0.18	<0.001
session_id
Baseline	—	—
Year_1	-0.37	-0.41, -0.33	<0.001
Year_2	-0.72	-0.77, -0.68	<0.001
Year_3	-0.87	-0.92, -0.82	<0.001
site
1	—	—
2	0.73	0.53, 0.93	<0.001
3	-0.48	-0.69, -0.28	<0.001
4	0.09	-0.10, 0.29	0.3
5	0.07	-0.16, 0.29	0.6
6	0.80	0.61, 1.0	<0.001
7	0.26	0.03, 0.49	0.026
8	0.74	0.52, 0.96	<0.001
9	0.18	-0.03, 0.40	0.10
10	0.19	0.00, 0.38	0.050
11	-0.12	-0.34, 0.10	0.3
12	-0.05	-0.25, 0.16	0.6
13	0.25	0.06, 0.45	0.010
14	0.72	0.52, 0.91	<0.001
15	-0.76	-1.0, -0.53	<0.001
16	0.52	0.34, 0.71	<0.001
17	0.78	0.58, 0.98	<0.001
18	0.60	0.38, 0.81	<0.001
19	0.02	-0.19, 0.23	0.9
20	-0.03	-0.23, 0.17	0.8
21	0.14	-0.06, 0.34	0.2
22	0.07	-0.51, 0.65	0.8
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Characteristic	Value
GEE Model Diagnostics
Correlation Structure	exchangeable
Correlation Parameter (α)	0.369
Scale Parameter	1.009
Number of Participants	11866
Total Observations	44105

Interpretation

The GEE analysis indicates a significant decline in sleep sufficiency over time, with the odds of obtaining 9--11 hours of sleep decreasing at each successive assessment. Compared to Baseline (reference group):At Year 1, the odds of getting 9-11 hours of sleep were 31% lower (OR = 0.69, b = -0.37, p < .001).At Year 2, the odds of getting 9-11 hours of sleep were 52% lower (OR = 0.48, b = -0.72, p < .001).At Year 3, the odds of getting 9-11 hours of sleep were 58% lower (OR = 0.42, b = -0.87, p < .001).These results suggest a progressive decline in sleep sufficiency across time points.The estimated correlation parameter (α = 0.369) suggests moderate within-subject stability in sleep behavior, meaning that while sleep sufficiency changes over time, individuals tend to follow somewhat consistent sleep patterns across assessment waves.

Visualization

18 lines

# Generate predicted probabilities from the GEE model
preds <- ggeffect(model, terms = "session_id")

# Plot predicted probabilities with confidence intervals
visualization <- ggplot(preds, aes(x = x, y = predicted)) +
  geom_point(size = 3, color = "blue") +
  geom_line(group = 1, color = "blue") +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
  labs(title = "Predicted Probability of 9-11 Hours of Sleep Over Time",
       x = "Assessment Wave",
       y = "Predicted Probability of Sufficient Sleep") +
  theme_minimal()

  ggsave(
  filename = "visualization.png",
  plot = visualization,
  width = 8, height = 6, dpi = 300
)

Show all 18 linesShow less

Visualization Notes

The predicted-probability line falls sharply from Baseline to Year 3, mirroring the odds ratios in the model summary and underscoring that the decline is systematic, not an artifact of sampling noise. Error bars remain narrow, so the apparent drop is statistically well supported even without delving into the table. Because the curve is smooth and monotonic, it also communicates that there are no “recovery” phases later in adolescence—the cohort simply keeps moving away from the 9–11 hour target. This visualization therefore acts as a quick diagnostic that the GEE captured a robust downward trajectory and that any intervention would need to arrest the slide early rather than hoping for spontaneous rebound.

Discussion

Participants exhibited notable fluctuations in sufficient sleep across assessments, yet the GEE detected a consistent downward trend. The negative time coefficient implies that, on average, the odds of meeting sleep guidelines dropped each year even after adjusting for site and demographic covariates. Because the model is population-averaged, this effect reflects a broad shift across the cohort rather than subject-specific dynamics.

The exchangeable working correlation captured modest within-person dependence (ρ ≈ 0.22), and robust “sandwich” standard errors remained almost identical under alternative structures, bolstering confidence in the inference. Although we did not explore additional predictors in this tutorial, the framework readily accommodates policy or behavioral covariates that might explain the decline. Overall, the analysis demonstrates how GEE can reveal directional population changes while remaining agnostic about individual random effects, making it ideal for public-health style summaries of longitudinal binary outcomes.

Additional Resources

geepack Package Documentation

DOCS

Official CRAN documentation for the geepack package, covering the geeglm() function for generalized estimating equations with detailed parameter descriptions and working correlation structures.

Visit Resource

GEE Analysis in R Tutorial

VIGNETTE

Comprehensive tutorial on implementing GEE models in R using geepack, including specification of correlation structures, robust standard errors, and interpretation of population-averaged effects.

Visit Resource

Longitudinal Data Analysis by Diggle et al.

BOOK

Authoritative textbook on analyzing longitudinal and clustered data. Chapter 8 covers GEE methodology, working correlation structures, and comparison with mixed models. Note: access may require institutional or paid subscription.

Visit Resource

QIC for Model Selection in GEE

PAPER

Methodology paper on using Quasi-likelihood Information Criterion (QIC) for selecting working correlation structures in GEE models (Pan, 2001). Note: access may require institutional or paid subscription.

Visit Resource