Scope

This vignette documents the shipped release-validation study for SelectBoost.quantile. The goal is not to claim universal superiority, but to show how the current prototype behaves against two direct lasso baselines. Three methods are compared:

  • plain quantile lasso (lasso)
  • cross-validated quantile lasso with a 1-SE penalty rule (lasso_tuned)
  • selectboost_quantile() (selectboost) with tau-aware screening, stronger penalty tuning, complementary-pairs stability selection, capped neighborhoods, and a hybrid support score

The included benchmark artifacts were generated with:

  • scenarios from default_quantile_benchmark_scenarios()
  • tau = c(0.25, 0.5, 0.75)
  • 4 Monte Carlo replications per scenario
  • selectboost_quantile(..., B = 8, step_num = 0.5, screen = "auto", tune_lambda = "cv", lambda_rule = "one_se", lambda_inflation = 1.25, complementary_pairs = TRUE, max_group_size = 15, nlambda = 8)
  • stable support extracted with the hybrid summary score at threshold = 0.55
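
To make the complementary-pairs setting concrete, here is a minimal base-R sketch of the general idea: each of the B draws splits the rows into two disjoint halves, a selector runs on each half, and selection frequencies are thresholded. This is not the package's implementation. The selector toy_select() and the helper complementary_pairs_freq() are hypothetical stand-ins, the simulated data are made up, and in the package the 0.55 threshold applies to the hybrid summary score rather than to raw frequencies.

```r
set.seed(1)
n <- 120
p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 2 * x[, 1] - 1.5 * x[, 2] + rnorm(n)  # only variables 1 and 2 matter

# Hypothetical selector: keep the k variables with the largest
# absolute marginal correlation with y (a placeholder for a real fit).
toy_select <- function(x, y, k = 3) {
  score <- abs(cor(x, y))[, 1]
  order(score, decreasing = TRUE)[seq_len(k)]
}

# Complementary pairs: every draw uses two disjoint half-samples,
# so B draws give 2 * B fits in total.
complementary_pairs_freq <- function(x, y, B = 8, select = toy_select) {
  n <- nrow(x)
  half <- floor(n / 2)
  counts <- numeric(ncol(x))
  for (b in seq_len(B)) {
    idx <- sample.int(n)
    first <- idx[seq_len(half)]
    second <- idx[half + seq_len(half)]
    for (rows in list(first, second)) {
      sel <- select(x[rows, , drop = FALSE], y[rows])
      counts[sel] <- counts[sel] + 1
    }
  }
  counts / (2 * B)
}

freq <- complementary_pairs_freq(x, y, B = 8)
stable <- which(freq >= 0.55)  # variables selected in at least 55% of fits
```

With strong signals on the first two columns, those variables should be selected in nearly every half-sample, while the third slot rotates among noise variables and rarely reaches the threshold.
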
summary_path <- system.file(
  "extdata",
  "validation",
  "quantile_benchmark_release_summary.csv",
  package = "SelectBoost.quantile"
)
raw_path <- system.file(
  "extdata",
  "validation",
  "quantile_benchmark_release_raw.csv",
  package = "SelectBoost.quantile"
)

resolve_validation_path <- function(installed_path, filename) {
  if (nzchar(installed_path) && file.exists(installed_path)) {
    return(installed_path)
  }

  candidates <- c(
    file.path("inst", "extdata", "validation", filename),
    file.path("..", "inst", "extdata", "validation", filename)
  )
  candidates <- candidates[file.exists(candidates)]
  if (!length(candidates)) {
    stop("Could not locate shipped validation artifact: ", filename, call. = FALSE)
  }
  candidates[[1]]
}

summary_path <- resolve_validation_path(summary_path, "quantile_benchmark_release_summary.csv")
raw_path <- resolve_validation_path(raw_path, "quantile_benchmark_release_raw.csv")

validation_summary <- utils::read.csv(summary_path, stringsAsFactors = FALSE)
validation_raw <- utils::read.csv(raw_path, stringsAsFactors = FALSE)

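# Scenario names carry a "_tau_..." suffix; strip it to recover the family.
validation_summary$family <- sub("_tau_.*$", "", validation_summary$scenario)
validation_summary$is_high_dim <- grepl("^high_dim", validation_summary$scenario)
# F1 is computed from the averaged counts per scenario, not as an average of
# per-replication F1 values; it is NA when all three counts are zero.
validation_summary$mean_f1 <- with(
  validation_summary,
  ifelse(
    (2 * mean_tp + mean_fp + mean_fn) > 0,
    2 * mean_tp / (2 * mean_tp + mean_fp + mean_fn),
    NA_real_
  )
)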
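
The mean_f1 column above is built from the averaged counts (mean_tp, mean_fp, mean_fn), which is generally not the same number as the average of per-replication F1 scores. The sketch below illustrates the difference with made-up counts for four replications; the helper f1_score() and the data frame reps are hypothetical and do not come from the shipped artifacts.

```r
# F1 = 2 * TP / (2 * TP + FP + FN), with NA when the denominator is zero.
f1_score <- function(tp, fp, fn) {
  denom <- 2 * tp + fp + fn
  ifelse(denom > 0, 2 * tp / denom, NA_real_)
}

# Hypothetical counts for four replications of one scenario.
reps <- data.frame(
  tp = c(4, 4, 2, 1),
  fp = c(0, 1, 0, 6),
  fn = c(0, 0, 2, 3)
)

# What the vignette computes: F1 of the averaged counts.
f1_of_means <- f1_score(mean(reps$tp), mean(reps$fp), mean(reps$fn))

# The alternative: average the per-replication F1 scores.
mean_of_f1 <- mean(f1_score(reps$tp, reps$fp, reps$fn))

round(c(f1_of_means = f1_of_means, mean_of_f1 = mean_of_f1), 3)
```

In this toy case the two summaries differ (about 0.647 versus 0.684), so the F1 values reported here are best compared with each other rather than with per-replication averages from other studies.
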

Overall summary

The first table averages the scenario-level summaries across the full shipped grid, including the n < p stress regime.

overall <- aggregate(
  cbind(mean_tpr, mean_fdr, mean_f1, failure_rate, mean_runtime_sec) ~ method,
  data = validation_summary,
  FUN = mean
)

knitr::kable(overall, digits = 3)
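|method      | mean_tpr| mean_fdr| mean_f1| failure_rate| mean_runtime_sec|
|:-----------|--------:|--------:|-------:|------------:|----------------:|
|lasso       |    0.856|    0.655|   0.484|            0|            0.005|
|lasso_tuned |    0.900|    0.735|   0.383|            0|            0.069|
|selectboost |    0.734|    0.063|   0.808|            0|            3.794|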

Across the full grid, lasso_tuned has the highest average true-positive rate (0.900), but it also carries the highest average false-discovery rate (0.735). The current selectboost_quantile() release is markedly more conservative: it gives up some recall (0.734), but in exchange it cuts the average false-discovery rate to 0.063 across the shipped benchmark grid and yields the best average F1 score (0.808).

Correlated but not high-dimensional regimes

The high_dim family is an intentionally hard n < p stress test and is examined separately in a later section. Excluding that regime gives a cleaner view of the correlated and misspecified-noise settings that the current prototype handles more naturally.

stable_regimes <- subset(validation_summary, !is_high_dim)

stable_overall <- aggregate(
  cbind(mean_tpr, mean_fdr, mean_f1, failure_rate, mean_runtime_sec) ~ method,
  data = stable_regimes,
  FUN = mean
)

knitr::kable(stable_overall, digits = 3)
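|method      | mean_tpr| mean_fdr| mean_f1| failure_rate| mean_runtime_sec|
|:-----------|--------:|--------:|-------:|------------:|----------------:|
|lasso       |    0.886|    0.625|   0.521|            0|            0.005|
|lasso_tuned |    0.936|    0.719|   0.406|            0|            0.040|
|selectboost |    0.758|    0.062|   0.822|            0|            3.894|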

On these non-high-dimensional settings, the shipped study shows a consistent pattern:

  • lasso_tuned has the highest mean recall
  • selectboost_quantile() has the lowest mean false-discovery rate by a large margin
  • selectboost_quantile() also has the highest mean F1 score on the shipped grid
  • selectboost_quantile() remains slower than either lasso baseline, which is expected because it perturbs, subsamples, and refits repeatedly

The family-level breakdown is below.

family_summary <- aggregate(
  cbind(mean_tpr, mean_fdr, mean_f1) ~ family + method,
  data = stable_regimes,
  FUN = mean
)

knitr::kable(family_summary, digits = 3)
|family          |method      | mean_tpr| mean_fdr| mean_f1|
|:---------------|:-----------|--------:|--------:|-------:|
|block_corr      |lasso       |    0.806|    0.682|   0.454|
|heavy_tail      |lasso       |    0.986|    0.597|   0.565|
|heteroskedastic |lasso       |    0.861|    0.643|   0.503|
|high_corr       |lasso       |    0.778|    0.600|   0.518|
|moderate_corr   |lasso       |    1.000|    0.601|   0.565|
|block_corr      |lasso_tuned |    0.917|    0.754|   0.345|
|heavy_tail      |lasso_tuned |    1.000|    0.692|   0.449|
|heteroskedastic |lasso_tuned |    0.903|    0.696|   0.437|
|high_corr       |lasso_tuned |    0.861|    0.722|   0.383|
|moderate_corr   |lasso_tuned |    1.000|    0.729|   0.414|
|block_corr      |selectboost |    0.778|    0.164|   0.783|
|heavy_tail      |selectboost |    0.792|    0.014|   0.876|
|heteroskedastic |selectboost |    0.611|    0.075|   0.708|
|high_corr       |selectboost |    0.708|    0.059|   0.797|
|moderate_corr   |selectboost |    0.903|    0.000|   0.948|

plot_df <- stable_regimes
method_levels <- c("lasso", "lasso_tuned", "selectboost")
cols <- c("lasso" = "#4C78A8", "lasso_tuned" = "#F58518", "selectboost" = "#54A24B")
plot(
  plot_df$mean_fdr,
  plot_df$mean_f1,
  col = cols[plot_df$method],
  pch = 19,
  xlab = "Mean FDR",
  ylab = "Mean F1",
  main = "Validation Summary by Scenario"
)
legend(
  "bottomleft",
  legend = method_levels,
  col = cols[method_levels],
  pch = 19,
  bty = "n"
)

High-dimensional stress regime

The high_dim family remains difficult, but it is no longer a failure mode in the earlier sense of selecting almost everything. The improved SelectBoost workflow now returns much sparser and more stable supports than either lasso baseline: on the shipped study its mean support size is about 3.9 variables, against 22.2 for lasso and 26.3 for lasso_tuned.

high_dim <- subset(validation_summary, is_high_dim)

high_dim_overall <- aggregate(
  cbind(mean_tpr, mean_fdr, mean_f1, failure_rate, mean_support_size) ~ method,
  data = high_dim,
  FUN = mean
)

knitr::kable(high_dim_overall, digits = 3)
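|method      | mean_tpr| mean_fdr| mean_f1| failure_rate| mean_support_size|
|:-----------|--------:|--------:|-------:|------------:|-----------------:|
|lasso       |    0.708|    0.804|   0.300|            0|            22.167|
|lasso_tuned |    0.722|    0.817|   0.270|            0|            26.250|
|selectboost |    0.611|    0.067|   0.738|            0|             3.917|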

The main remaining tradeoff is recall: selectboost_quantile() is much cleaner than the lasso baselines in high_dim (mean FDR 0.067 versus 0.804 and 0.817), but it is still more conservative and can miss weaker signals (mean TPR 0.611 versus 0.708 and 0.722). Even so, on the shipped study it achieves the best mean F1 score in that regime (0.738) because it avoids the large false-positive burden of the lasso baselines. The remaining recall gap is the main reason the package is best described as a polished v2 prototype rather than a finished methodological endpoint.
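
As a rough back-of-the-envelope check on that false-positive burden, the sketch below multiplies the mean FDR by the mean support size for each method, using the values from the high_dim table above. This is only an approximation: the product of two averages is not the average number of false positives per replication, and the column names approx_fp and approx_tp are introduced here for illustration.

```r
# Values copied from the high_dim summary table.
high_dim_table <- data.frame(
  method = c("lasso", "lasso_tuned", "selectboost"),
  mean_fdr = c(0.804, 0.817, 0.067),
  mean_support_size = c(22.167, 26.250, 3.917)
)

# Approximate false positives: share of false discoveries times support size.
high_dim_table$approx_fp <- with(high_dim_table, mean_fdr * mean_support_size)

# The remainder of the support is, approximately, true positives.
high_dim_table$approx_tp <- with(high_dim_table, mean_support_size - approx_fp)

print(high_dim_table, digits = 3)
```

Under this approximation the lasso baselines carry roughly 18 and 21 false positives per fit, while selectboost_quantile() carries well under one.
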

failure_rows <- subset(validation_summary, failure_rate > 0)
if (nrow(failure_rows)) {
  knitr::kable(failure_rows[, c(
    "scenario",
    "method",
    "failure_rate",
    "mean_tpr",
    "mean_fdr",
    "mean_support_size"
  )], digits = 3)
} else {
  cat("No method failures were recorded in the shipped study.\n")
}
#> No method failures were recorded in the shipped study.

Reproducing the study

From a source checkout, regenerate benchmark artifacts into a temporary directory with:

out_dir <- file.path(tempdir(), "SelectBoost.quantile-validation")
system2(
  "Rscript",
  c("inst/scripts/run_quantile_benchmark.R", out_dir, "4", "0.55")
)

The script loads the local package automatically when run from a source tree. It writes raw results, aggregated summaries, and a sessionInfo() record to the chosen output directory. If no output directory is supplied, it defaults to a subdirectory of tempdir(). In the current source tree, the rerun uses the defaults defined in the package benchmark helper: screening, stronger lambda tuning, complementary-pairs stability selection, the neighborhood cap, and the hybrid support score.
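
For scripted reruns, the call can be wrapped so that a missing script or a non-zero exit status is reported instead of passing silently. The wrapper below is a hedged sketch: run_release_benchmark() is a hypothetical helper, not part of the package, and it assumes that the second and third script arguments are the replication count and the support threshold, which is inferred from the values 4 and 0.55 used in this study rather than confirmed from the script itself.

```r
run_release_benchmark <- function(
    script = file.path("inst", "scripts", "run_quantile_benchmark.R"),
    out_dir = file.path(tempdir(), "SelectBoost.quantile-validation"),
    replications = 4,   # assumed meaning of the second script argument
    threshold = 0.55) { # assumed meaning of the third script argument
  if (!file.exists(script)) {
    stop("Benchmark script not found; run from the package source root: ",
         script, call. = FALSE)
  }
  # Quote paths so directories containing spaces survive the shell.
  args <- c(
    shQuote(script),
    shQuote(out_dir),
    as.character(replications),
    as.character(threshold)
  )
  status <- system2("Rscript", args)
  if (!identical(as.integer(status), 0L)) {
    stop("Benchmark script exited with status ", status, call. = FALSE)
  }
  # Return whatever files the script wrote, without assuming their names.
  list.files(out_dir, full.names = TRUE)
}

# Example (commented out because the benchmark takes a while to run):
# files <- run_release_benchmark()
```

Returning the file listing rather than hard-coded paths avoids guessing the names the script gives to its outputs.
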