Prepares a large-scale feature matrix for stochastic gradient descent by applying optional normalisation, stratified sampling, and batching rules.
Usage
bigscale(
formula = survival::Surv(time = time, status = status) ~ .,
data,
norm.method = "standardize",
strata.size = 20,
batch.size = 1,
features.mean = NULL,
features.sd = NULL,
parallel.flag = FALSE,
num.cores = NULL,
bigmemory.flag = FALSE,
num.rows.chunk = 1e+06,
col.names = NULL,
type = "short"
)
Arguments
- formula
formula used to extract the outcome and predictors that should be included in the scaled design matrix.
- data
Input data source containing the variables referenced in formula.
- norm.method
Normalisation strategy (for example centring or standardising columns) applied to the feature matrix.
- strata.size
Number of observations to retain from each stratum when constructing stratified batches.
- batch.size
Total size of each mini-batch produced by the scaling routine.
- features.mean
Optional vector of column means that can be reused to normalise multiple data sets in a consistent manner.
- features.sd
Optional vector of column standard deviations that pairs with features.mean during scaling.
- parallel.flag
Logical flag signalling whether the scaling work should be parallelised across cores.
- num.cores
Number of processor cores allocated when parallel.flag is TRUE.
- bigmemory.flag
Logical flag specifying whether intermediate results should be stored in bigmemory-backed matrices.
- num.rows.chunk
Chunk size used when streaming data from on-disk objects into memory.
- col.names
Optional character vector assigning column names to the generated design matrix.
- type
Type of model or preprocessing target being prepared, such as survival or regression.
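The features.mean and features.sd arguments exist so that several data sets can be normalised consistently. The idea can be illustrated with a minimal base R sketch that does not call bigscale() at all: the matrices train and test and the variables features_mean and features_sd are hypothetical names introduced only for this illustration.

```r
# Hypothetical training and test feature matrices (not package data).
set.seed(1)
train <- matrix(rnorm(40, mean = 5, sd = 2), nrow = 10,
                dimnames = list(NULL, paste0("x", 1:4)))
test <- matrix(rnorm(20, mean = 5, sd = 2), nrow = 5,
               dimnames = list(NULL, paste0("x", 1:4)))

# Estimate the column statistics once, on the training data only.
features_mean <- colMeans(train)
features_sd <- apply(train, 2, stats::sd)

# Apply the same centring and scaling to both data sets.
train_scaled <- scale(train, center = features_mean, scale = features_sd)
test_scaled <- scale(test, center = features_mean, scale = features_sd)
```

Only the training columns end up with mean 0 and standard deviation 1; the test columns are expressed on the training scale, which is usually what is wanted when a fitted model is applied to new data.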
Value
A scaled design matrix of the scaler class along with metadata describing the transformation that was applied.
- time.indices: indices of the time variable
- cens.indices: indices of the censored variables
- features.indices: indices of the features
- time.sd: standard deviation of the time variable
- time.mean: mean of the time variable
- features.sd: standard deviation of the features
- features.mean: mean of the features
- nr: number of rows
- nc: number of columns
- col.names: column names
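Storing the means and standard deviations alongside the scaled matrix makes standardisation reversible. The following base R sketch shows the arithmetic on a hypothetical vector x; it assumes plain standardisation, (x - mean) / sd, and does not depend on how the returned object is actually structured.

```r
# Hypothetical feature column on its original scale.
x <- c(2, 4, 6, 8)

# Statistics of the kind reported as features.mean and features.sd.
x_mean <- mean(x)
x_sd <- stats::sd(x)

# Forward transform: standardise.
z <- (x - x_mean) / x_sd

# Backward transform: recover the original scale from the stored statistics.
x_back <- z * x_sd + x_mean

isTRUE(all.equal(x_back, x))
```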
See also
bigSurvSGD.na.omit() for fitting models that use the scaled
features.
Examples
data(micro.censure, package = "bigPLScox")
surv_data <- stats::na.omit(
micro.censure[, c("survyear", "DC", "sexe", "Agediag")]
)
scaled <- bigscale(
survival::Surv(survyear, DC) ~ .,
data = surv_data,
norm.method = "standardize",
batch.size = 16
)
#> Warning: Strata size times batch size is greater than number of observations.
#> This package resizes them to strata size = 20 and batch size = 4
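The warning above is triggered when strata.size multiplied by batch.size exceeds the number of observations. The sketch below shows one way such a check could work. The helper resize_batch(), the count of 80 observations, and the integer-division rule are all assumptions made for illustration; the source does not state the rule the package actually uses.

```r
# Hypothetical helper: shrink batch_size so that
# strata_size * batch_size does not exceed n_obs.
resize_batch <- function(n_obs, strata_size, batch_size) {
  if (strata_size * batch_size > n_obs) {
    # Assumed rule: keep strata_size, fit as many strata as possible.
    batch_size <- max(1, n_obs %/% strata_size)
  }
  list(strata.size = strata_size, batch.size = batch_size)
}

# Assumed number of complete observations (hypothetical value).
sizes <- resize_batch(n_obs = 80, strata_size = 20, batch_size = 16)
sizes$batch.size
```

Under these assumptions, 20 * 16 = 320 exceeds 80, so the batch size is reduced to 80 %/% 20 = 4 while the strata size stays at 20.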