Big-memory workflows with bigPLScox
Frédéric Bertrand
Cedric, Cnam, Paris
frederic.bertrand@lecnam.net
2025-11-01
Source: vignettes/bigPLScox.Rmd
Motivation
A central feature of bigPLScox is the ability to
operate on file-backed bigmemory::big.matrix objects. This
vignette demonstrates how to prepare such datasets, fit models with
big_pls_cox() and big_pls_cox_gd(), and
integrate them with cross-validation helpers. The examples complement
the introductory vignette “Getting started with bigPLScox”.
Preparing a big.matrix
We simulate a moderately large design matrix and persist it to disk
via bigmemory::filebacked.big.matrix(). Using file-backed
storage allows models to train on datasets that exceed the available
RAM.
library(bigPLScox)
library(bigmemory)
set.seed(2024)
n_obs <- 5000
n_pred <- 100
X_dense <- matrix(rnorm(n_obs * n_pred), nrow = n_obs)
time <- rexp(n_obs, rate = 0.2)
status <- rbinom(n_obs, 1, 0.7)
big_dir <- tempfile("bigPLScox-")
dir.create(big_dir)
X_big <- filebacked.big.matrix(
  nrow = n_obs,
  ncol = n_pred,
  backingpath = big_dir,
  backingfile = "X.bin",
  descriptorfile = "X.desc"
)
# The `init` argument only accepts a scalar, so copy the simulated
# data into the file-backed matrix explicitly.
X_big[, ] <- X_dense
The resulting big.matrix can be reopened in future
sessions via its descriptor file. All big-memory modelling functions
accept either an in-memory matrix or a big.matrix
reference.
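For instance, a later session can reattach the matrix from its descriptor file without loading the data into RAM; a minimal sketch using bigmemory::attach.big.matrix():
# Reattach the file-backed matrix via its descriptor file.
X_big_again <- attach.big.matrix(file.path(big_dir, "X.desc"))
dim(X_big_again)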
Fitting big-memory models
big_pls_cox() runs the classical PLS-Cox algorithm while
streaming data from disk.
fit_big <- big_pls_cox(
  X = X_big,
  time = time,
  status = status,
  ncomp = 5
)
head(fit_big$scores)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [2,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [3,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [4,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [5,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [6,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
str(fit_big)
#> List of 9
#> $ scores : num [1:5000, 1:5] -4.45e-06 -4.45e-06 -4.45e-06 -4.45e-06 -4.45e-06 ...
#> $ loadings: num [1:100, 1:5] -0.1 -0.1 -0.1 -0.1 -0.1 ...
#> $ weights : num [1:100, 1:5] -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 ...
#> $ center : num [1:100] 0.982 0.982 0.982 0.982 0.982 ...
#> $ scale : num [1:100] 1.65e-07 1.65e-07 1.65e-07 1.65e-07 1.65e-07 ...
#> $ cox_fit :List of 10
#> ..$ coefficients : num [1:5] NA NA 1.82e+41 NA NA
#> ..$ var : num [1:5, 1:5] 0 0 0 0 0 0 0 0 0 0 ...
#> ..$ loglik : num [1:2] -26230 -26230
#> ..$ score : num 8.72e-20
#> ..$ iter : int 1
#> ..$ linear.predictors: num [1:5000] 0.000392 0.000392 0.000392 0.000392 0.000392 ...
#> ..$ residuals : Named num [1:5000] 0.8605 0.7913 -0.0129 0.1737 0.9958 ...
#> .. ..- attr(*, "names")= chr [1:5000] "1" "2" "3" "4" ...
#> ..$ means : num [1:5] -4.45e-06 -1.67e-19 -1.67e-32 -1.71e-46 -1.15e-59
#> ..$ method : chr "efron"
#> ..$ class : chr "coxph"
#> $ keepX : int [1:5] 0 0 0 0 0
#> $ time : num [1:5000] 1.024 1.5182 0.1047 5.9911 0.0424 ...
#> $ status : num [1:5000] 1 1 0 1 1 1 1 1 0 1 ...
#> - attr(*, "class")= chr "big_pls_cox"
The gradient-descent variant big_pls_cox_gd() uses
stochastic optimisation and is well-suited for very large datasets.
fit_big_gd <- big_pls_cox_gd(
  X = X_big,
  time = time,
  status = status,
  ncomp = 5,
  max_iter = 100,
  tol = 1e-4
)
head(fit_big_gd$scores)
str(fit_big_gd)
Both functions return objects that expose the latent scores and loading vectors, so downstream visualisations and diagnostics work exactly as for their in-memory counterparts.
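For example, the first two latent components of the streaming fit can be inspected with base graphics; a quick diagnostic sketch using the objects created above:
# Score plot of the first two PLS components, coloured by event status.
plot(fit_big$scores[, 1], fit_big$scores[, 2],
  col = status + 1, pch = 20,
  xlab = "Component 1", ylab = "Component 2"
)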
Cross-validation on big matrices
Cross-validation for big-memory models is supported through the list interface. This enables streaming each fold directly from disk.
set.seed(2024)
data_big <- list(x = X_big, time = time, status = status)
cv_big <- cv.coxgpls(
  data_big,
  nt = 5,
  ncores = 1,
  ind.block.x = c(10, 40)
)
cv_big$opt_nt
For large experiments, consider combining
foreach::foreach() with
doParallel::registerDoParallel() to parallelise the folds.
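A minimal sketch, assuming two workers and the data_big list from above:
# Register a parallel backend so cross-validation folds run concurrently.
library(doParallel)
cl <- parallel::makeCluster(2)
registerDoParallel(cl)
cv_big_par <- cv.coxgpls(data_big, nt = 5, ncores = 2, ind.block.x = c(10, 40))
parallel::stopCluster(cl)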
Timing snapshot
The native C++ solvers substantially reduce wall-clock time compared
to fitting through the R interface alone. The bench package
provides convenient instrumentation; the chunk below runs only when
bench is available.
if (requireNamespace("bench", quietly = TRUE)) {
  bench::mark(
    streaming = big_pls_cox(X_big, time, status, ncomp = 5, keepX = 0),
    gd = big_pls_cox_gd(X_big, time, status, ncomp = 5, max_iter = 150),
    iterations = 5,
    check = FALSE
  )
}
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 streaming    31.5ms   31.8ms      31.4    8.81MB     7.86
#> 2 gd           13.5ms   13.8ms      71.9    6.79MB    47.9
Deviance residuals with big matrices
Once a model has been fitted, we can evaluate deviance residuals using the new C++ backend. Supplying the linear predictor directly avoids recomputing it in R and works with any matrix backend.
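The backend itself is documented in the reference manual; as an illustration of what it computes, the standard deviance transform can also be applied in base R to the martingale residuals stored on the fit (a sketch, assuming the fit_big and status objects from above):
# Deviance residuals from martingale residuals m and event indicator d:
#   dev_i = sign(m_i) * sqrt(-2 * (m_i + d_i * log(d_i - m_i)))
mart <- fit_big$cox_fit$residuals
inner <- mart + ifelse(status > 0, status * log(status - mart), 0)
dev_res <- sign(mart) * sqrt(pmax(0, -2 * inner))
head(dev_res)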
Cleaning up
Temporary backing files can be removed after the analysis. In production pipelines you will typically keep the descriptor file alongside the binary data.
rm(X_big)
file.remove(file.path(big_dir, "X.bin"))
#> [1] TRUE
file.remove(file.path(big_dir, "X.desc"))
#> [1] TRUE
unlink(big_dir, recursive = TRUE)
Additional resources
- help(big_pls_cox) and help(big_pls_cox_gd) document all tuning parameters for the big-memory solvers.
- The benchmarking vignette demonstrates how to measure the performance improvements obtained with file-backed matrices.
- Consider persisting fitted objects with saveRDS() to avoid recomputing large models when iterating on analyses, as sketched below.
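A minimal sketch of such caching (the file path is illustrative):
# Cache the fitted model and reload it in a later session instead of refitting.
model_path <- file.path(tempdir(), "fit_big.rds")
saveRDS(fit_big, model_path)
fit_big_cached <- readRDS(model_path)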