Title: | Variable Selection and Inference for Partial Semiparametric Linear Mixed-Effects Model |
---|---|
Description: | Implements a partial linear semiparametric mixed-effects model (PLSMM) featuring a random intercept and applies a lasso penalty to both the fixed effects and the coefficients associated with the nonlinear function. The model also accommodates interactions between the nonlinear function and a grouping variable, allowing for the capture of group-specific nonlinearities. Nonlinear functions are modeled using a set of bases functions. Estimation is conducted using a penalized Expectation-Maximization algorithm, and the package offers flexibility in choosing between various information criteria for model selection. Post-selection inference is carried out using a debiasing method, while inference on the nonlinear functions employs a bootstrap approach. |
Authors: | Sami Leon [aut, cre, cph] |
Maintainer: | Sami Leon <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.1.0 |
Built: | 2025-03-09 03:56:15 UTC |
Source: | https://github.com/sami-leon/plsmmlasso |
The create_bases
function generates a matrix of bases functions for modeling time.
The bases function matrix include Fourier basis functions and polynomial basis functions.
create_bases(t, keep = NULL)
create_bases(t, keep = NULL)
t |
A vector representing the timepoints for which basis functions are generated. |
keep |
(Optional) A vector specifying the indices of the basis functions to retain. If not provided, all generated basis functions are retained. |
The function constructs two types of basis functions: Fourier basis functions and polynomial basis functions.
Fourier basis functions are constructed based on the maximum timepoint (max_t
) and the input timepoints t
.
Polynomial basis functions are constructed with degrees ranging from 0.1 to 2, incrementing by 0.02.
A list with the following components:
bases |
A matrix containing the generated bases function. |
selected_bases |
A vector containing the indices of the selected basis functions after applying |
t <- seq(0, 10, by = 0.5) bases <- create_bases(t) selected_bases <- create_bases(t) selected_bases[[1]] selected_bases[[2]]
t <- seq(0, 10, by = 0.5) bases <- create_bases(t) selected_bases <- create_bases(t) selected_bases[[1]] selected_bases[[2]]
This function debias the lasso coefficients estimated from the plsmm_lasso
function
and computes p-values.
debias_plsmm(x, y, series, plsmm_output, a = 1, Z = NULL)
debias_plsmm(x, y, series, plsmm_output, a = 1, Z = NULL)
x |
A matrix of predictor variables. |
y |
A continuous vector of response variable. |
series |
A variable representing different series or groups in the data modeled as a random intercept. |
plsmm_output |
Output object obtained from the |
a |
A scalar that adjusts the variance of the random intercept |
Z |
(Optional) Pre-computed correction score matrix. If provided, it will be used directly for debiasing. |
The original data is decorrelated, and a correction score matrix is computed. The correction scores are a measure of correlation between the predictors in the data. The debiasing process utilizes these scores to compute debiased estimates of the coefficients, along with associated p-values.
A data frame containing debiased coefficients, standard errors, confidence intervals, and p-values.
set.seed(123) data_sim = simulate_group_inter(N = 50, n_mvnorm = 3, grouped = TRUE, timepoints = 3:5, nonpara_inter = TRUE, sample_from = seq(0,52,13), cos = FALSE, A_vec = c(1, 1.5)) sim = data_sim$sim x = as.matrix(sim[,-1:-3]) y = sim$y series = sim$series t = sim$t bases = create_bases(t) lambda <- 0.0046 gamma <- 0.00000001 plsmm_output <- plsmm_lasso(x, y, series, t, name_group_var = "group", bases$bases, gamma = gamma, lambda = lambda, timexgroup = TRUE, criterion = "BIC" ) debias_plsmm(x, y, series, plsmm_output)
set.seed(123) data_sim = simulate_group_inter(N = 50, n_mvnorm = 3, grouped = TRUE, timepoints = 3:5, nonpara_inter = TRUE, sample_from = seq(0,52,13), cos = FALSE, A_vec = c(1, 1.5)) sim = data_sim$sim x = as.matrix(sim[,-1:-3]) y = sim$y series = sim$series t = sim$t bases = create_bases(t) lambda <- 0.0046 gamma <- 0.00000001 plsmm_output <- plsmm_lasso(x, y, series, t, name_group_var = "group", bases$bases, gamma = gamma, lambda = lambda, timexgroup = TRUE, criterion = "BIC" ) debias_plsmm(x, y, series, plsmm_output)
The filter_nonzero_bases
function filters out bases functions that are essentially zero.
Bases functions with a sum of absolute values less than a threshold () are considered as essentially zero and are filtered out.
filter_nonzero_bases(bases)
filter_nonzero_bases(bases)
bases |
A matrix containing the bases functions. |
A list with the following components:
filtered_bases |
A matrix containing the filtered bases functions, removing those that are essentially zero. |
selected_bases |
A vector containing the indices of the selected bases functions after filtering. |
bases <- matrix(c(0, 0.1, 0.2, 0, 0, 0.3, 0, 0, 0), nrow = 3) filtered_bases <- filter_nonzero_bases(bases)
bases <- matrix(c(0, 0.1, 0.2, 0, 0, 0.3, 0, 0, 0), nrow = 3) filtered_bases <- filter_nonzero_bases(bases)
This function plots the observed data, the estimated mean trajectories, and
the estimated nonlinear functions from the output of plsmm_lasso
.
plot_fit( x, y, series, t, name_group_var, plsmm_output, predicted = FALSE, show_obs = FALSE )
plot_fit( x, y, series, t, name_group_var, plsmm_output, predicted = FALSE, show_obs = FALSE )
x |
A matrix of predictors. |
y |
A continuous vector of response variable. |
series |
A variable representing different series or groups in the data modeled as a random intercept. |
t |
A numeric vector indicating the time points. |
name_group_var |
A character string specifying the name of the grouping variable. |
plsmm_output |
Output object obtained from the |
predicted |
Logical indicating whether to plot predicted values. If |
show_obs |
Logical. If |
If predicted
is TRUE
the function uses the model from plsmm_output
to predict unobserved time points on a continuous grid of time.
Two plots:
The first plot shows the observed data and the estimated mean trajectories.
The second plot shows the estimated nonlinear functions.
set.seed(123) data_sim <- simulate_group_inter( N = 50, n_mvnorm = 3, grouped = TRUE, timepoints = 3:5, nonpara_inter = TRUE, sample_from = seq(0, 52, 13), cos = FALSE, A_vec = c(1, 1.5) ) sim <- data_sim$sim x <- as.matrix(sim[, -1:-3]) y <- sim$y series <- sim$series t <- sim$t bases <- create_bases(t) lambda <- 0.0046 gamma <- 0.00000001 plsmm_output <- plsmm_lasso(x, y, series, t, name_group_var = "group", bases$bases, gamma = gamma, lambda = lambda, timexgroup = TRUE, criterion = "BIC" ) plot_fit(x, y, series, t, name_group_var = "group", plsmm_output)
set.seed(123) data_sim <- simulate_group_inter( N = 50, n_mvnorm = 3, grouped = TRUE, timepoints = 3:5, nonpara_inter = TRUE, sample_from = seq(0, 52, 13), cos = FALSE, A_vec = c(1, 1.5) ) sim <- data_sim$sim x <- as.matrix(sim[, -1:-3]) y <- sim$y series <- sim$series t <- sim$t bases <- create_bases(t) lambda <- 0.0046 gamma <- 0.00000001 plsmm_output <- plsmm_lasso(x, y, series, t, name_group_var = "group", bases$bases, gamma = gamma, lambda = lambda, timexgroup = TRUE, criterion = "BIC" ) plot_fit(x, y, series, t, name_group_var = "group", plsmm_output)
Fits a partial linear semiparametric mixed effects model (PLSMM) via penalized maximum likelihood.
plsmm_lasso( x, y, series, t, name_group_var = NULL, bases, gamma, lambda, timexgroup, criterion, nonpara = FALSE, cvg_tol = 0.001, max_iter = 100, verbose = FALSE )
plsmm_lasso( x, y, series, t, name_group_var = NULL, bases, gamma, lambda, timexgroup, criterion, nonpara = FALSE, cvg_tol = 0.001, max_iter = 100, verbose = FALSE )
x |
A matrix of predictor variables. |
y |
A continuous vector of response variable. |
series |
A variable representing different series or groups in the data modeled as a random intercept. |
t |
A numeric vector indicating the timepoints. |
name_group_var |
A character string specifying the name of the grouping variable in the |
bases |
A matrix of bases functions. |
gamma |
The regularization parameter for the nonlinear effect of time. |
lambda |
The regularization parameter for the fixed effects. |
timexgroup |
Logical indicating whether to use a time-by-group interaction.
If |
criterion |
The information criterion to be computed. Options are "BIC", "BICC", or "EBIC". |
nonpara |
Logical. If TRUE, the |
cvg_tol |
Convergence tolerance for the algorithm. |
max_iter |
Maximum number of iterations allowed for convergence. |
verbose |
Logical indicating whether to print convergence details at each iteration. Default is |
This function fits a PLSMM with a lasso penalty on the fixed effects and the coefficient associated with the bases functions. It uses the Expectation-Maximization (EM) algorithm for estimation. The bases functions represent a nonlinear effect of time.
The model includes a random intercept for each level of the variable specified by series
. Additionally, if timexgroup
is
set to TRUE
, the model includes a time-by-group interaction, allowing each group of name_group_var
to have its own estimate
of the nonlinear function, which can capture group-specific nonlinearities over time. If name_group_var
is set to NULL
only
one nonlinear function for the whole data is being used
The algorithm iteratively updates the estimates until convergence or until the maximum number of iterations is reached.
A list containing the following components:
lasso_output |
A list with the fitted values for the fixed effect and nonlinear effect. The estimated coeffcients for the fixed effects and nonlinear effect. The indices of the used bases functions. |
se |
Estimated standard deviation of the residuals. |
su |
Estimated standard deviation of the random intercept. |
out_phi |
Data frame containing the estimated individual random intercept. |
ni |
Number of timepoitns per observations. |
hyperparameters |
Data frame with lambda and gamma values. |
converged |
Logical indicating if the algorithm converged. |
crit |
Value of the selected information criterion. |
set.seed(123) data_sim <- simulate_group_inter( N = 50, n_mvnorm = 3, grouped = TRUE, timepoints = 3:5, nonpara_inter = TRUE, sample_from = seq(0, 52, 13), cos = FALSE, A_vec = c(1, 1.5) ) sim <- data_sim$sim x <- as.matrix(sim[, -1:-3]) y <- sim$y series <- sim$series t <- sim$t bases <- create_bases(t) lambda <- 0.0046 gamma <- 0.00000001 plsmm_output <- plsmm_lasso(x, y, series, t, name_group_var = "group", bases$bases, gamma = gamma, lambda = lambda, timexgroup = TRUE, criterion = "BIC" ) # fixed effect coefficients plsmm_output$lasso_output$theta # fixed effect fitted values plsmm_output$lasso_output$x_fit # nonlinear functions coefficients plsmm_output$lasso_output$alpha # nonlinear functions fitted values plsmm_output$lasso_output$out_f # standard deviation of residuals plsmm_output$se # standard deviation of random intercept plsmm_output$su # series specific random intercept plsmm_output$out_phi
set.seed(123) data_sim <- simulate_group_inter( N = 50, n_mvnorm = 3, grouped = TRUE, timepoints = 3:5, nonpara_inter = TRUE, sample_from = seq(0, 52, 13), cos = FALSE, A_vec = c(1, 1.5) ) sim <- data_sim$sim x <- as.matrix(sim[, -1:-3]) y <- sim$y series <- sim$series t <- sim$t bases <- create_bases(t) lambda <- 0.0046 gamma <- 0.00000001 plsmm_output <- plsmm_lasso(x, y, series, t, name_group_var = "group", bases$bases, gamma = gamma, lambda = lambda, timexgroup = TRUE, criterion = "BIC" ) # fixed effect coefficients plsmm_output$lasso_output$theta # fixed effect fitted values plsmm_output$lasso_output$x_fit # nonlinear functions coefficients plsmm_output$lasso_output$alpha # nonlinear functions fitted values plsmm_output$lasso_output$out_f # standard deviation of residuals plsmm_output$se # standard deviation of random intercept plsmm_output$su # series specific random intercept plsmm_output$out_phi
Simulate a partial linear semiparametric mixed-effects model.
simulate_group_inter( N = 50, n_mvnorm = 100, grouped = TRUE, timepoints = 3:5, nonpara_inter = TRUE, sample_from, cos = FALSE, A_vec = c(1, 1.5) )
simulate_group_inter( N = 50, n_mvnorm = 100, grouped = TRUE, timepoints = 3:5, nonpara_inter = TRUE, sample_from, cos = FALSE, A_vec = c(1, 1.5) )
N |
Number of subjects. |
n_mvnorm |
Number of covariate generates from a multivariate normal distribution. |
grouped |
Logical indicating whether to include grouping effect. |
timepoints |
Vector specifying timepoints for each subject. |
nonpara_inter |
Logical indicating whether the nonparametric function is specific to each group. |
sample_from |
Vector of time points to sample from. |
cos |
Logical indicating whether to use cosine function for second group. |
A_vec |
Vector of amplitudes for the nonlinear functions for each group. |
A list with three components:
sim |
Simulated data frame. |
phi |
Individual random intercepts. |
f_val |
Values of the nonlinear functions. |
simulate_group_inter( N = 50, n_mvnorm = 100, grouped = TRUE, timepoints = 3:5, nonpara_inter = TRUE, sample_from = seq(0, 10, by = 0.1), cos = FALSE, A_vec = c(1, 1.5) )
simulate_group_inter( N = 50, n_mvnorm = 100, grouped = TRUE, timepoints = 3:5, nonpara_inter = TRUE, sample_from = seq(0, 10, by = 0.1), cos = FALSE, A_vec = c(1, 1.5) )
This function conducts a test of overall equality of two nonlinear functions and generates confidence bands for the estimated difference of the nonlinear functions using a bootstrap method.
test_f( x, y, series, t, name_group_var, plsmm_output, n_boot = 1000, predicted = FALSE, show_obs = FALSE, verbose = TRUE )
test_f( x, y, series, t, name_group_var, plsmm_output, n_boot = 1000, predicted = FALSE, show_obs = FALSE, verbose = TRUE )
x |
A matrix of predictors. |
y |
A continuous vector of response variable. |
series |
A variable representing different series or groups in the data modeled as a random intercept. |
t |
A numeric vector indicating the time points. |
name_group_var |
A character string specifying the name of the grouping variable. |
plsmm_output |
Output object obtained from the |
n_boot |
Numeric specifying the number of bootstrap samples (default is 1000). |
predicted |
Logical indicating whether to plot predicted values. If |
show_obs |
Logical. If |
verbose |
Logical indicating whether to display bootstrap progress. Default is |
The function generate bootstrap samples and estimate the nonlinear functions for each n_boot
sample.
These bootstrap estimates are then used to compute the L2-norm test of equality and the joint confidence bands.
A plot showing the estimated difference and confidence bands of the nonlinear functions.
A list containing:
overall_test_results |
Results from the L2-norm test of equality. |
CI_f |
Confidence intervals values for the difference of the estimated functions used for plotting. |
set.seed(123) data_sim <- simulate_group_inter( N = 50, n_mvnorm = 3, grouped = TRUE, timepoints = 3:5, nonpara_inter = TRUE, sample_from = seq(0, 52, 13), cos = FALSE, A_vec = c(1, 1.5) ) sim <- data_sim$sim x <- as.matrix(sim[, -1:-3]) y <- sim$y series <- sim$series t <- sim$t bases <- create_bases(t) lambda <- 0.0046 gamma <- 0.00000001 plsmm_output <- plsmm_lasso(x, y, series, t, name_group_var = "group", bases$bases, gamma = gamma, lambda = lambda, timexgroup = TRUE, criterion = "BIC" ) test_f_results <- test_f(x, y, series, t, name_group_var = "group", plsmm_output, n_boot = 10 ) test_f_results[[1]] test_f_results[[2]]
set.seed(123) data_sim <- simulate_group_inter( N = 50, n_mvnorm = 3, grouped = TRUE, timepoints = 3:5, nonpara_inter = TRUE, sample_from = seq(0, 52, 13), cos = FALSE, A_vec = c(1, 1.5) ) sim <- data_sim$sim x <- as.matrix(sim[, -1:-3]) y <- sim$y series <- sim$series t <- sim$t bases <- create_bases(t) lambda <- 0.0046 gamma <- 0.00000001 plsmm_output <- plsmm_lasso(x, y, series, t, name_group_var = "group", bases$bases, gamma = gamma, lambda = lambda, timexgroup = TRUE, criterion = "BIC" ) test_f_results <- test_f(x, y, series, t, name_group_var = "group", plsmm_output, n_boot = 10 ) test_f_results[[1]] test_f_results[[2]]
This function tunes a penalized partial linear semiparametric mixed-model (PLSMM) by performing a grid search over a set of hyperparameters to find the best model based on a given criterion.
tune_plsmm( x, y, series, t, name_group_var, bases, gamma_vec, lambda_vec, timexgroup, criterion, ... )
tune_plsmm( x, y, series, t, name_group_var, bases, gamma_vec, lambda_vec, timexgroup, criterion, ... )
x |
A matrix of predictors. |
y |
A continuous vector of response variable. |
series |
A variable representing different series or groups in the data modeled as a random intercept. |
t |
A numeric vector indicating the time points. |
name_group_var |
A character string specifying the name of the grouping variable. |
bases |
A matrix of bases functions. |
gamma_vec |
A vector of values for the regularization parameter for the coefficients of the nonlinear functions. |
lambda_vec |
A vector of values for the regularization parameter for the coefficients of the fixed effects. |
timexgroup |
Logical indicating whether to use a time-by-group interaction.
If |
criterion |
A character string specifying the criterion to be optimized ( |
... |
Additional arguments to be passed to the |
This function performs a grid search over the hyperparameters specified by lambda_vec
and gamma_vec
to find the best-fitted PLSMM based on the given criterion. It fits PLSMMs using the
plsmm_lasso
function for each combination of hyperparameters and retains only the models that
have converged. The best model is selected based on the minimum value of the specified criterion.
A PLSMM object representing the best-tuned model based on the specified criterion.
set.seed(123) data_sim <- simulate_group_inter( N = 50, n_mvnorm = 3, grouped = TRUE, timepoints = 3:5, nonpara_inter = TRUE, sample_from = seq(0, 52, 13), cos = FALSE, A_vec = c(1, 1.5) ) sim = data_sim$sim x = as.matrix(sim[,-1:-3]) y = sim$y series = sim$series t = sim$t bases = create_bases(t) lambdas <- c(0.0046, 0.0001) gammas <- 0.00000001 tuned_plsmm <- tune_plsmm(x, y, series, t, name_group_var = "group", bases$bases, gamma_vec = gammas, lambda_vec = lambdas, timexgroup = TRUE, criterion = "BIC" )
set.seed(123) data_sim <- simulate_group_inter( N = 50, n_mvnorm = 3, grouped = TRUE, timepoints = 3:5, nonpara_inter = TRUE, sample_from = seq(0, 52, 13), cos = FALSE, A_vec = c(1, 1.5) ) sim = data_sim$sim x = as.matrix(sim[,-1:-3]) y = sim$y series = sim$series t = sim$t bases = create_bases(t) lambdas <- c(0.0046, 0.0001) gammas <- 0.00000001 tuned_plsmm <- tune_plsmm(x, y, series, t, name_group_var = "group", bases$bases, gamma_vec = gammas, lambda_vec = lambdas, timexgroup = TRUE, criterion = "BIC" )