Motivation

High-dimensional survival datasets can be computationally demanding. bigPLScox implements algorithms that scale to large numbers of predictors and observations via component-based models, sparse penalties, and stochastic gradient descent routines. This vignette demonstrates how to benchmark the package against baseline approaches using the bench package.

We focus on simulated data to illustrate reproducible comparisons between the classical coxgpls() solver, its big-memory counterparts, and the survival::coxph() implementation.

Setup

The examples below require recent versions of bench and survival; the big-memory solvers additionally use bigmemory, and the accuracy comparison calls plsRcox.
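
A minimal setup chunk attaching these packages might look like the following (plsRcox is only called through the :: prefix later on, so it does not need to be attached):

library(bigPLScox)   # coxgpls(), big_pls_cox(), big_pls_cox_gd(), dataCox()
library(survival)    # coxph(), Surv(), concordance()
library(bench)       # bench::mark() timing helpers
library(bigmemory)   # big.matrix backing for the large-scale solvers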

The helper dataCox() simulates survival outcomes with right censoring. We work with a moderately sized problem here, but larger values for n and p can be used to stress test performance.

set.seed(2024)
sim_design <- dataCox(
  n = 2000,
  lambda = 2,
  rho = 1.5,
  x = matrix(rnorm(2000 * 50), ncol = 50),
  beta = c(1, 3, rep(0, 48)),
  cens.rate = 5
)

cox_data <- list(
  x = as.matrix(sim_design[, -(1:3)]),
  time = sim_design$time,
  status = sim_design$status
)
X_big <- bigmemory::as.big.matrix(cox_data$x)
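
Before timing anything, it can be helpful to confirm the realised censoring level, since a very high or very low event rate affects both fitting time and the later accuracy comparison. A quick check (output not shown):

table(cox_data$status)   # 0 = censored, 1 = event
mean(cox_data$status)    # observed event rate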

Running the benchmark

We compare coxgpls(), the two big-memory solvers big_pls_cox() and big_pls_cox_gd(), and the classical Cox proportional hazards model fitted with survival::coxph(). The bench::mark() helper executes each estimator multiple times and records timing statistics alongside memory usage information.

bench_res <- bench::mark(
  coxgpls = coxgpls(
    cox_data$x,
    cox_data$time,
    cox_data$status,
    ncomp = 5,
    ind.block.x = c(3, 10)
  ),
  big_pls = big_pls_cox(X_big, cox_data$time, cox_data$status, ncomp = 5),
  big_pls_gd = big_pls_cox_gd(X_big, cox_data$time, cox_data$status, ncomp = 5, max_iter = 100),
  survival = coxph(Surv(cox_data$time, cox_data$status) ~ cox_data$x, ties = "breslow"),
  iterations = 100,
  check = FALSE
)
bench_res
#> # A tibble: 4 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 coxgpls     19.71ms  21.55ms      47.4   30.97MB  1291.  
#> 2 big_pls      7.96ms   8.77ms     111.     3.54MB    12.3 
#> 3 big_pls_gd  12.06ms  12.76ms      78.2    2.75MB     5.88
#> 4 survival    27.74ms  28.83ms      34.6   13.39MB    17.0

bench_summary <- bench_res[, c("expression", "median", "itr/sec")]
bench_summary
#> # A tibble: 4 × 3
#>   expression   median `itr/sec`
#>   <bch:expr> <bch:tm>     <dbl>
#> 1 coxgpls     21.55ms      47.4
#> 2 big_pls      8.77ms     111. 
#> 3 big_pls_gd  12.76ms      78.2
#> 4 survival    28.83ms      34.6

The resulting tibble reports elapsed time, memory allocations, and garbage collection statistics for each estimator. The itr/sec column is often the most useful indicator when comparing multiple implementations. The bench_summary object summarises the median runtime and iterations per second.
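
Relative numbers are often easier to read than raw timings. bench can rescale the summary so that the fastest method is reported as 1:

summary(bench_res, relative = TRUE)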

Visualising the results

bench provides ggplot2-based helpers to visualise the distributions of elapsed time and memory usage.

plot(bench_res, type = "jitter")

Additional geometries, such as ridge plots, are available via autoplot(bench_res, type = "ridge").
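
For example, assuming ggplot2 (and the ggridges package required by the ridge geometry) is installed, the plot can be customised like any other ggplot object:

library(ggplot2)
autoplot(bench_res, type = "ridge") +
  labs(title = "Runtime distributions per estimator")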

Accuracy on a shared test set

We use a common train/test split and compare the predictive performance of three estimators using the survival concordance index. The test set is identical for each method to ensure a fair comparison.

set.seed(4242)
split_id <- sample(c("train", "test"), size = nrow(cox_data$x), replace = TRUE, prob = c(0.7, 0.3))
train_idx <- split_id == "train"
test_idx <- !train_idx

train_x <- cox_data$x[train_idx, , drop = FALSE]
test_x <- cox_data$x[test_idx, , drop = FALSE]
train_time <- cox_data$time[train_idx]
test_time <- cox_data$time[test_idx]
train_status <- cox_data$status[train_idx]
test_status <- cox_data$status[test_idx]

big_fit <- big_pls_cox(bigmemory::as.big.matrix(train_x), train_time, train_status, ncomp = 5)
gd_fit <- big_pls_cox_gd(bigmemory::as.big.matrix(train_x), train_time, train_status, ncomp = 5, max_iter = 120)
pls_fit <- plsRcox::plsRcox(Xplan = train_x, time = train_time, event = train_status, nt = 5, verbose = FALSE)

lp_big <- predict(big_fit, newdata = test_x, type = "link")
lp_gd <- predict(gd_fit, newdata = test_x, type = "link")
lp_pls <- predict(pls_fit, newdata = test_x)

surv_test <- survival::Surv(test_time, test_status)
acc_results <- data.frame(
  method = c("big_pls_cox", "big_pls_cox_gd", "plsRcox"),
  c_index = c(
    survival::concordance(surv_test ~ lp_big)$concordance,
    survival::concordance(surv_test ~ lp_gd)$concordance,
    survival::concordance(surv_test ~ lp_pls)$concordance
  )
)
acc_results
#>           method    c_index
#> 1    big_pls_cox 0.06829377
#> 2 big_pls_cox_gd 0.57433234
#> 3        plsRcox 0.06821958

All methods operate on the same train/test split, and the concordance index allows a direct comparison with the reference implementation from plsRcox.
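
Keep in mind that the concordance index is symmetric under a sign flip of the linear predictor (c becomes 1 - c), so a solver that returns risk scores with the opposite sign convention can be assessed on the negated predictor, for example:

survival::concordance(surv_test ~ I(-lp_big))$concordance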

Exporting benchmark results

Use write.csv() to store the benchmarking table as part of a reproducible pipeline; the list-columns returned by bench::mark() (such as the raw timings) need to be dropped or converted to character first. For larger studies, consider varying the number of latent components, the sparsity constraints, or the dataset dimensions.

if (!dir.exists("inst/benchmarks/results")) {
  dir.create("inst/benchmarks/results", recursive = TRUE)
}
bench_export <- as.data.frame(bench_res[, 1:9])
bench_export$expression <- as.character(bench_export$expression)
write.csv(bench_export, file = "inst/benchmarks/results/benchmarking-demo.csv", row.names = FALSE)

Reusing the benchmarking scripts

The package also ships with standalone scripts under inst/benchmarks/ that mirror this vignette while exposing additional configuration points. Run them from the repository root as:

Rscript inst/benchmarks/cox-benchmark.R
Rscript inst/benchmarks/benchmark_bigPLScox.R
Rscript inst/benchmarks/cox_pls_benchmark.R

Each script accepts environment variables to adjust the problem size and stores results under inst/benchmarks/results/ with time-stamped filenames.