Linear Models (LM)

Simple Regression (Difference Score)

Explore how baseline characteristics predict change by regressing difference scores on individual covariates, demonstrated with handedness predicting ABCD height growth.

LM › Continuous
stats::lmUpdated 2025-11-05difference-scoresregressionabcd-study
Work in ProgressExamples are a work in progress. Please exercise caution when using code examples, as they may not be fully verified. If you spot gaps, errors, or have suggestions, we'd love your feedback—use the "Suggest changes" button to help us improve!

Overview

Difference scores quantify within-subject change by subtracting initial measurements from follow-up measurements. Simple regression models can then examine whether individual characteristics predict the magnitude of change, testing associations between stable person-level variables and change trajectories. This tutorial analyzes height measurements from ABCD youth across two annual assessments, computing difference scores to isolate individual change and using regression to test whether handedness predicts height change magnitude. This approach determines whether between-subject variability in stable characteristics is associated with differences in growth trajectories over time.

When to Use:
Use when you want to relate individual change scores to a between-subject predictor such as handedness or baseline characteristics.
Key Advantage:
Simple regression on difference scores lets you test whether group differences or continuous predictors explain variability in change magnitudes.
What You'll Learn:
How to compute difference scores, fit a regression predicting change, interpret slope/fit diagnostics, and visualize the predictor-change relationship.

Data Access

Data Download

ABCD data can be accessed through the DEAP platform or the NBDC Data Access Platform (LASSO), which provide user-friendly interfaces for creating custom datasets with point-and-click variable selection. For detailed instructions on accessing and downloading ABCD data, see the DEAP documentation.

Loading Data with NBDCtools

Once you have downloaded ABCD data files, the NBDCtools package provides efficient tools for loading and preparing your data for analysis. The package handles common data management tasks including:

  • Automatic data joining - Merges variables from multiple tables automatically
  • Built-in transformations - Converts categorical variables to factors, handles missing data codes, and adds variable labels
  • Event filtering - Easily selects specific assessment waves

For more information, visit the NBDCtools documentation.

Basic Usage

The create_dataset() function is the main tool for loading ABCD data:

library(NBDCtools)

# Define variables needed for this analysis
requested_vars <- c(
  "var_1",   # Variable 1
  "var_2",   # Variable 2
  "var_3"    # Variable 3
)

# Set path to downloaded ABCD data files
data_dir <- Sys.getenv("ABCD_DATA_PATH", "/path/to/abcd/6_0/phenotype")

# Load data with automatic transformations
abcd_data <- create_dataset(
  dir_data = data_dir,
  study = "abcd",
  vars = requested_vars,
  release = "6.0",
  format = "parquet",
  categ_to_factor = TRUE,   # Convert categorical variables to factors
  value_to_na = TRUE,        # Convert missing codes (222, 333, etc.) to NA
  add_labels = TRUE          # Add variable and value labels
)
Key Parameters
  • vars - Vector of variable names to load
  • release - ABCD data release version (e.g., "6.0")
  • format - File format, typically "parquet" for efficiency
  • categ_to_factor - Automatically converts categorical variables to factors
  • value_to_na - Converts ABCD missing value codes to R's NA
  • add_labels - Adds descriptive labels to variables and values
Additional NBDCtools Resources

For more details on using NBDCtools:

Data Preparation

NBDCtools Setup and Data Loading
R31 lines
### Load necessary libraries
library(NBDCtools)    # ABCD data access helper
library(arrow)      # Efficient reading of Parquet files
library(tidyverse)  # Data wrangling and visualization
library(gt)         # Presentation-Ready Display Tables
library(gtsummary)  # Creating publication-quality tables
library(rstatix)    # Simplifying statistical tests
library(effectsize) # Calculating effect sizes
library(broom)      # Organizing model outputs

### Specify variables of interest
requested_vars <- c(
    "ab_g_dyn__design_site",
    "ab_g_stc__design_id__fam",
    "nc_y_ehis_score",
    "ph_y_anthr__height_mean"
)

### Load harmonized ABCD data
data_dir <- Sys.getenv("ABCD_DATA_PATH", "/path/to/abcd/6_0/phenotype")

abcd_data <- create_dataset(
  dir_data = data_dir,
  study = "abcd",
  vars = requested_vars,
  release = "6.0",
  format = "parquet",
  categ_to_factor = TRUE,   # Convert categorical variables to factors
  value_to_na = TRUE,        # Convert missing codes (222, 333, etc.) to NA
  add_labels = TRUE          # Add variable and value labels
)
Data Transformation
R21 lines
# Create long-form dataset with relevant columns
df_long <- abcd_data %>%
  # Keep only baseline and year 1 sessions
  filter(session_id %in% c("ses-00A", "ses-01A")) %>%
  arrange(participant_id, session_id) %>%
  mutate(
    # Relabel session IDs
    session_id = factor(session_id,
                        levels = c("ses-00A", "ses-01A"),
                        labels = c("Baseline", "Year_1")),
    # Relabel handedness
    handedness = factor(nc_y_ehis_score,
                       levels = c("1", "2", "3"),
                       labels = c("Right-handed", "Left-handed", "Mixed-handed"))
  ) %>%
  # Rename for clarity
  rename(
    site = ab_g_dyn__design_site,
    family_id = ab_g_stc__design_id__fam,
    height = ph_y_anthr__height_mean
  )
Reshape to Wide Format
R21 lines
# Reshape data from long to wide format for calculating difference score
# Step 1: Separate time-varying variables (height) from stable variables
df_timevarying <- df_long %>%
  select(participant_id, session_id, height) %>%
  pivot_wider(
    names_from = session_id,
    values_from = height,
    names_prefix = "Height_"
  )

# Step 2: Get static variables (one row per participant)
df_static <- df_long %>%
  filter(session_id == "Baseline") %>%
  select(participant_id, site, family_id, handedness) %>%
  filter(handedness != "Mixed-handed") %>%
  droplevels()

# Step 3: Join time-varying and static data
df_wide <- df_static %>%
  inner_join(df_timevarying, by = "participant_id") %>%
  drop_na(Height_Baseline, Height_Year_1)
Descriptive Statistics
R27 lines
# Create descriptive summary table
descriptives_table <- df_long %>%
  select(session_id, handedness, height) %>%
  tbl_summary(
    by = session_id,
    missing = "no",
    label = list(
      handedness ~ "Handedness",
      height ~ "Height"
    ),
    statistic = list(all_continuous() ~ "{mean} ({sd})")
  ) %>%
  modify_header(all_stat_cols() ~ "**{level}**<br>N = {n}") %>%
  modify_spanning_header(all_stat_cols() ~ "**Assessment Wave**") %>%
  bold_labels() %>%
  italicize_levels()

# Apply compact styling
theme_gtsummary_compact()

descriptives_table <- as_gt(descriptives_table)

### Save the table as HTML
gt::gtsave(descriptives_table, filename = "descriptives_table.html")

### Print the table
descriptives_table
Characteristic
Assessment Wave
Baseline
N = 11868
1
Year_1
N = 11219
1
Handedness

    Right-handed 9,418 (79%) 0 (NA%)
    Left-handed 848 (7.2%) 0 (NA%)
    Mixed-handed 1,594 (13%) 0 (NA%)
Height 55.3 (3.2) 57.6 (3.3)
1 n (%); Mean (SD)

Statistical Analysis

Fit Model
R24 lines
# Compute difference score
df_wide <- df_wide %>%
  mutate(height_diff = Height_Year_1 - Height_Baseline)  # Difference in height across assessments

# Calculate Cohen's d to derive effect size of the height difference
d_value <- cohens_d(df_wide$height_diff, mu = 0)
print(d_value)

# Fit a simple regression predicting height_diff from handedness
model <- lm(height_diff ~ handedness + site, data = df_wide)

# Generate a summary table for the regression model
model_summary <- gtsummary::tbl_regression(model,
    digits = 3,
    intercept = TRUE
) %>%
  gtsummary::as_gt()

# Save as standalone HTML
gt::gtsave(
  data = model_summary,
  filename = "model_summary.html",
  inline_css = FALSE # ensures self-contained output
)
Characteristic Beta 95% CI p-value
(Intercept) 2.1 1.9, 2.3 <0.001
handedness


    Right-handed — —
    Left-handed -0.07 -0.19, 0.05 0.3
site


    1 — —
    2 0.29 0.04, 0.53 0.021
    3 0.25 0.01, 0.49 0.042
    4 0.26 0.03, 0.49 0.026
    5 0.15 -0.12, 0.42 0.3
    6 0.00 -0.24, 0.24 >0.9
    7 0.06 -0.21, 0.34 0.6
    8 0.24 -0.03, 0.52 0.081
    9 0.38 0.12, 0.64 0.005
    10 0.01 -0.22, 0.24 >0.9
    11 0.38 0.12, 0.64 0.004
    12 0.14 -0.11, 0.38 0.3
    13 0.34 0.10, 0.57 0.005
    14 0.28 0.04, 0.52 0.024
    15 0.93 0.67, 1.2 <0.001
    16 0.37 0.15, 0.59 0.001
    17 0.25 0.01, 0.49 0.044
    18 0.11 -0.15, 0.38 0.4
    19 0.39 0.14, 0.64 0.002
    20 0.02 -0.22, 0.25 0.9
    21 0.44 0.19, 0.69 <0.001
    22 0.41 -0.34, 1.2 0.3
Abbreviation: CI = Confidence Interval
Interpretation
Interpretation

The difference score analysis reveals that participants experienced an average height increase of approximately 2.09 inches from Baseline to Year 1, indicating overall growth in the sample. Cohen's d of 1.37 (95% CI: [1.34, 1.39]) suggests a large effect size, indicating that the observed increase is not only statistically significant but also substantial in magnitude.A regression analysis examining whether handedness predicts height change (Year 1 height -- Baseline height) found no significant effect. Compared to right-handed participants (reference group), left-handed participants had a non-significant height change of b = -0.07, p = 0.29, and mixed-handed participants had a non-significant height change of b = 0.05, p = 0.32. These results indicate that handedness does not meaningfully account for variability in height change across participants.

Visualization
R29 lines
# Select a random subset for visualization (e.g., 250 participants)
df_subset <- df_wide %>% sample_n(min(250, nrow(df_wide)))

# Visualize difference scores by handedness
# We'll create a data frame containing both the difference score and handedness
plot_data <- df_subset %>%
  select(handedness, height_diff) %>%
  drop_na()

visualization <- ggplot(plot_data, aes(x = handedness, y = height_diff, fill = handedness)) +
  geom_violin(trim = FALSE, alpha = 0.6) +
  geom_jitter(
    position = position_jitter(width = 0.2, height = 0, seed = 123),
    size = 1.2,
    alpha = 0.5
  ) +
  labs(
    title = "Difference Scores by Handedness",
    x = "Handedness",
    y = "Height Difference (inches)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

  ggsave(
  filename = "visualization.png",
  plot = visualization,
  width = 8, height = 6, dpi = 300
)
Visualization
Interpretation
Interpretation

The violin plot illustrates the distribution of height differences across handedness groups, with substantial overlap in density curves indicating no systematic shift associated with handedness. Median markers and the jittered points show that all groups cluster around the same two-inch gain while still permitting normal between-person variability. The modest width of each violin compared with the overall range underscores how little explanatory power handedness adds beyond baseline height. Together with the regression output, the visualization reinforces that any perceived differences are likely due to random sampling variation rather than meaningful group effects.

Discussion

Participants demonstrated a general increase in height over time, with most difference scores landing above zero. When those scores were regressed on handedness (baseline height included as a covariate), the slope estimates hovered near zero and failed to reach significance, indicating that left- versus right-handed youth grew at similar rates. The residual standard error was modest relative to the scale of the outcome, suggesting that the bulk of variability reflects typical developmental noise rather than systematic group differences.

Diagnostic plots showed roughly homoscedastic residuals and no leverage points, so standard linear-model assumptions were reasonable. Visualizing the fitted lines by handedness helped communicate the same conclusion: the lines almost overlap, reinforcing that the practical effect size is negligible even with a reasonably large sample. This workflow demonstrates how simple regression on difference scores can test hypotheses about specific predictors while remaining transparent and easy to explain to collaborators who might be less familiar with mixed-model alternatives.

Additional Resources

4

R Documentation: lm

DOCS

Official R documentation for the lm() function, covering linear regression specifications, formula syntax, and diagnostic methods for difference score models.

Visit Resource

Linear Regression in R Tutorial

VIGNETTE

Step-by-step guide to fitting and interpreting linear regression models in R, including assumption checking, model diagnostics, and visualization of predicted values.

Visit Resource

Regression with Change Scores

PAPER

Methodology paper on using difference scores as outcomes or predictors in regression models, addressing reliability concerns and interpretation issues (Cronbach & Furby, 1970). Note: access may require institutional or paid subscription.

Visit Resource

broom Package for Tidy Regression Output

TOOL

R package for converting regression model results into clean, tidy dataframes, facilitating easier interpretation and visualization of difference score analyses.

Visit Resource