Prepares a large-scale feature matrix for stochastic gradient descent by applying optional normalisation, stratified sampling, and batching rules.

Usage

bigscale(
  formula = survival::Surv(time = time, status = status) ~ .,
  data,
  norm.method = "standardize",
  strata.size = 20,
  batch.size = 1,
  features.mean = NULL,
  features.sd = NULL,
  parallel.flag = FALSE,
  num.cores = NULL,
  bigmemory.flag = FALSE,
  num.rows.chunk = 1e+06,
  col.names = NULL,
  type = "short"
)

Arguments

formula

Formula used to extract the outcome and predictors that should be included in the scaled design matrix.

data

Input data source containing the variables referenced in formula.

norm.method

Normalisation strategy (for example centring or standardising columns) applied to the feature matrix.

strata.size

Number of observations to retain from each stratum when constructing stratified batches.

batch.size

Total size of each mini-batch produced by the scaling routine.

features.mean

Optional vector of column means that can be reused to normalise multiple data sets in a consistent manner.

features.sd

Optional vector of column standard deviations that pairs with features.mean during scaling.
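To see why reusing features.mean and features.sd matters, here is a minimal sketch in base R (it does not call bigscale() itself). A second data set is standardised with the statistics of the first, so both end up on the same scale. The data frames train and test are made-up illustrations, not package data.

```r
# Hypothetical toy data; not taken from the package.
train <- data.frame(age = c(40, 50, 60, 70), dose = c(1, 2, 3, 6))
test  <- data.frame(age = c(45, 65),         dose = c(2, 4))

# Column statistics computed once, on the training data only.
train_mean <- colMeans(train)
train_sd   <- apply(train, 2, stats::sd)

# Apply the same training statistics to both data sets.
train_scaled <- scale(train, center = train_mean, scale = train_sd)
test_scaled  <- scale(test,  center = train_mean, scale = train_sd)

# The training columns now have mean 0 and standard deviation 1.
# The test columns generally do not, because they were scaled with
# the training statistics rather than their own.
round(colMeans(train_scaled), 10)
round(apply(train_scaled, 2, stats::sd), 10)
```

The Value section lists features.mean and features.sd among the returned metadata, so the statistics from one call are presumably what a later call would receive through these arguments.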

parallel.flag

Logical flag signalling whether the scaling work should be parallelised across cores.

num.cores

Number of processor cores allocated when parallel.flag is TRUE.

bigmemory.flag

Logical flag specifying whether intermediate results should be stored in bigmemory-backed matrices.

num.rows.chunk

Chunk size used when streaming data from on-disk objects into memory.
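As a rough illustration of what a row chunk size means, the sketch below splits a row range into consecutive chunks in base R. It is not the package's implementation; n_rows and chunk_rows are hypothetical values, with chunk_rows playing the role of num.rows.chunk.

```r
# Hypothetical sizes, chosen small for illustration.
n_rows     <- 10
chunk_rows <- 4   # plays the role of num.rows.chunk

# First and last row of each chunk; the last chunk may be shorter.
starts <- seq(1, n_rows, by = chunk_rows)
ends   <- pmin(starts + chunk_rows - 1, n_rows)

chunks <- data.frame(start = starts, end = ends)
chunks
```

Each row of chunks describes one block of rows that could be read and processed before moving to the next, so memory use is bounded by the chunk size rather than by the full data set.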

col.names

Optional character vector assigning column names to the generated design matrix.

type

Type of model or preprocessing target being prepared, such as survival or regression.

Value

A scaled design matrix of class scaler, along with metadata describing the transformation that was applied:

time.indices: indices of the time variable

cens.indices: indices of the censored variables

features.indices: indices of the features

time.sd: standard deviation of the time variable

time.mean: mean of the time variable

features.sd: standard deviation of the features

features.mean: mean of the features

nr: number of rows

nc: number of columns

col.names: column names

See also

bigSurvSGD.na.omit() for fitting models that use the scaled features.

Examples

data(micro.censure, package = "bigPLScox")
surv_data <- stats::na.omit(
  micro.censure[, c("survyear", "DC", "sexe", "Agediag")]
)
scaled <- bigscale(
  survival::Surv(survyear, DC) ~ .,
  data = surv_data,
  norm.method = "standardize",
  batch.size = 16
)
#> Warning: Strata size times batch size is greater than number of observations.
#>  This package resizes them to strata size = 20 and batch size = 4
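
The warning above appears when strata.size times batch.size exceeds the number of observations (here 20 * 16 = 320). A rough pre-check can be written in base R; fits_in_data() is a hypothetical helper and n_obs is an assumed row count, neither taken from the package. How the package picks its resized values is not documented here, so the last line only shows the largest batch size that satisfies the constraint for a given strata size.

```r
# Hypothetical helper, not part of the package: TRUE when the
# requested strata and batch sizes fit within the available rows.
fits_in_data <- function(n_obs, strata_size, batch_size) {
  strata_size * batch_size <= n_obs
}

n_obs <- 90   # assumed row count, for illustration only

fits_in_data(n_obs, strata_size = 20, batch_size = 16)  # FALSE: 320 > 90
fits_in_data(n_obs, strata_size = 20, batch_size = 4)   # TRUE:  80 <= 90

# Largest batch size that still fits for a strata size of 20.
n_obs %/% 20
```

Running a check like this before calling bigscale() makes it possible to choose the sizes explicitly instead of relying on the automatic resize.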