Benchmarking bigPLScox
Frédéric Bertrand
Cedric, Cnam, Paris
frederic.bertrand@lecnam.net
2025-11-01
Source: vignettes/bigPLScox-benchmarking.Rmd
Motivation
High-dimensional survival datasets can be computationally demanding.
bigPLScox implements algorithms that scale to large
numbers of predictors and observations via component-based models,
sparse penalties, and stochastic gradient descent routines. This
vignette demonstrates how to benchmark the package against baseline
approaches using the bench package.
We focus on simulated data to illustrate reproducible comparisons
between the classical coxgpls() solver, its big-memory
counterparts, and the survival::coxph() implementation.
Setup
The examples below require a recent version of bench together with survival; the later sections also call bigmemory and plsRcox.
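Assuming these packages are installed, a typical setup chunk attaches the ones whose functions are called directly (bigmemory and plsRcox are accessed through the :: operator and only need to be installed):
# attach the packages used throughout this vignette
library(bigPLScox)
library(bench)
library(survival)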
The helper dataCox() simulates survival outcomes with
right censoring. We work with a moderately sized problem here, but
larger values for n and p can be used to
stress test performance.
set.seed(2024)
sim_design <- dataCox(
  n = 2000,
  lambda = 2,
  rho = 1.5,
  x = matrix(rnorm(2000 * 50), ncol = 50),
  beta = c(1, 3, rep(0, 48)),
  cens.rate = 5
)
cox_data <- list(
  x = as.matrix(sim_design[, -(1:3)]),
  time = sim_design$time,
  status = sim_design$status
)
X_big <- bigmemory::as.big.matrix(cox_data$x)
Running the benchmark
We compare the classical Cox proportional hazards model with
coxgpls() and the big-memory solvers big_pls_cox() and
big_pls_cox_gd(). The bench::mark() helper executes each
estimator multiple times and records timing statistics alongside
memory-allocation information.
bench_res <- bench::mark(
  coxgpls = coxgpls(
    cox_data$x,
    cox_data$time,
    cox_data$status,
    ncomp = 5,
    ind.block.x = c(3, 10)
  ),
  big_pls = big_pls_cox(X_big, cox_data$time, cox_data$status, ncomp = 5),
  big_pls_gd = big_pls_cox_gd(X_big, cox_data$time, cox_data$status, ncomp = 5, max_iter = 100),
  survival = coxph(Surv(cox_data$time, cox_data$status) ~ cox_data$x, ties = "breslow"),
  iterations = 100,
  check = FALSE
)
bench_res
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 coxgpls 19.71ms 21.55ms 47.4 30.97MB 1291.
#> 2 big_pls 7.96ms 8.77ms 111. 3.54MB 12.3
#> 3 big_pls_gd 12.06ms 12.76ms 78.2 2.75MB 5.88
#> 4 survival 27.74ms 28.83ms 34.6 13.39MB 17.0
bench_summary <- bench_res[, c("expression", "median", "itr/sec")]
bench_summary
#> # A tibble: 4 × 3
#> expression median `itr/sec`
#> <bch:expr> <bch:tm> <dbl>
#> 1 coxgpls 21.55ms 47.4
#> 2 big_pls 8.77ms 111.
#> 3 big_pls_gd 12.76ms 78.2
#> 4 survival 28.83ms 34.6
The resulting tibble reports elapsed time, memory allocations, and
garbage collection statistics for each estimator. The
itr/sec column is often the most useful indicator when
comparing multiple implementations. The bench_summary
object summarises the median runtime and iterations per second.
Visualising the results
bench provides ggplot-based helpers to visualise the
distributions of elapsed time and memory usage.
plot(bench_res, type = "jitter")
Additional geometries, such as ridge plots, are available via
autoplot(bench_res, type = "ridge").
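For instance, assuming ggplot2 and the ggridges package are installed, the ridge layout can be drawn directly from the benchmark object:
# bench registers an autoplot() method for bench_mark objects;
# the "ridge" type additionally requires ggridges
library(ggplot2)
autoplot(bench_res, type = "ridge")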
Accuracy on a shared test set
We use a common train/test split across three estimators and compare their predictive performance using the survival concordance index. The test set is identical for each method to ensure a fair comparison.
set.seed(4242)
split_id <- sample(c("train", "test"), size = nrow(cox_data$x), replace = TRUE, prob = c(0.7, 0.3))
train_idx <- split_id == "train"
test_idx <- !train_idx
train_x <- cox_data$x[train_idx, , drop = FALSE]
test_x <- cox_data$x[test_idx, , drop = FALSE]
train_time <- cox_data$time[train_idx]
test_time <- cox_data$time[test_idx]
train_status <- cox_data$status[train_idx]
test_status <- cox_data$status[test_idx]
big_fit <- big_pls_cox(bigmemory::as.big.matrix(train_x), train_time, train_status, ncomp = 5)
gd_fit <- big_pls_cox_gd(bigmemory::as.big.matrix(train_x), train_time, train_status, ncomp = 5, max_iter = 120)
pls_fit <- plsRcox::plsRcox(train_x, time = train_time, event = train_status, nt = 5, verbose = FALSE)
lp_big <- predict(big_fit, newdata = test_x, type = "link")
lp_gd <- predict(gd_fit, newdata = test_x, type = "link")
lp_pls <- predict(pls_fit, newdata = test_x)
surv_test <- survival::Surv(test_time, test_status)
acc_results <- data.frame(
  method = c("big_pls_cox", "big_pls_cox_gd", "plsRcox"),
  c_index = c(
    survival::concordance(surv_test ~ lp_big)$concordance,
    survival::concordance(surv_test ~ lp_gd)$concordance,
    survival::concordance(surv_test ~ lp_pls)$concordance
  )
)
acc_results
#> method c_index
#> 1 big_pls_cox 0.06829377
#> 2 big_pls_cox_gd 0.57433234
#> 3 plsRcox 0.06821958
All methods operate on the same train/test split, and the concordance index allows a direct comparison with the reference implementation from plsRcox.
Exporting benchmark results
Use write.csv() to store the benchmarking table as part of a
reproducible pipeline. For larger studies, consider varying the number
of latent components, the sparsity constraints, or the dataset
dimensions; a bench::press() sketch follows the export chunk below.
if (!dir.exists("inst/benchmarks/results")) {
dir.create("inst/benchmarks/results", recursive = TRUE)
}
write.csv(bench_res[, 1:9], file = "inst/benchmarks/results/benchmarking-demo.csv", row.names = FALSE)
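To explore larger parameter grids, bench::press() can re-run the benchmark over several settings. The sketch below reuses X_big and cox_data from above and only varies the number of latent components:
# illustrative sketch: benchmark big_pls_cox() for several ncomp values
press_res <- bench::press(
  ncomp = c(2, 5, 10),
  bench::mark(
    big_pls = big_pls_cox(X_big, cox_data$time, cox_data$status, ncomp = ncomp),
    iterations = 20,
    check = FALSE
  )
)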
Reusing the benchmarking scripts
The package also ships with standalone scripts under
inst/benchmarks/ that mirror this vignette while exposing
additional configuration points. Run them from the repository root
as:
Rscript inst/benchmarks/cox-benchmark.R
Rscript inst/benchmarks/benchmark_bigPLScox.R
Rscript inst/benchmarks/cox_pls_benchmark.R
Each script accepts environment variables to adjust the problem size
and stores results under inst/benchmarks/results/ with
time-stamped filenames.
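As a purely hypothetical illustration (the real variable names are documented in each script), such a script might read its configuration along these lines:
# hypothetical sketch: problem size and output directory taken from the
# environment, with the defaults used in this vignette as fallbacks
n_obs   <- as.integer(Sys.getenv("BENCH_N", unset = "2000"))
n_vars  <- as.integer(Sys.getenv("BENCH_P", unset = "50"))
out_dir <- Sys.getenv("BENCH_OUT", unset = "inst/benchmarks/results")
Under that assumption the script could be launched as, for example, BENCH_N=10000 Rscript inst/benchmarks/cox-benchmark.R.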