Partial least squares regression beta models with k-fold cross validation

This function implements k-fold cross validation on complete or incomplete datasets for partial least squares beta regression models.

Usage

PLS_beta_kfoldcv(
  dataY,
  dataX,
  nt = 2,
  limQ2set = 0.0975,
  modele = "pls",
  family = NULL,
  K = nrow(dataX),
  NK = 1,
  grouplist = NULL,
  random = FALSE,
  scaleX = TRUE,
  scaleY = NULL,
  keepcoeffs = FALSE,
  keepfolds = FALSE,
  keepdataY = TRUE,
  keepMclassed = FALSE,
  tol_Xi = 10^(-12),
  weights,
  method,
  link = NULL,
  link.phi = NULL,
  type = "ML",
  verbose = TRUE
)

Arguments

dataY: response (training) dataset
dataX: predictor(s) (training) dataset
nt: number of components to be extracted
limQ2set: limit value for the Q2
modele: name of the PLS glm or PLS beta model to be fitted ("pls", "pls-glm-Gamma", "pls-glm-gaussian", "pls-glm-inverse.gaussian", "pls-glm-logistic", "pls-glm-poisson", "pls-glm-polr", "pls-beta"). Use modele="pls-glm-family" to enable the family option.
family: a description of the error distribution and link function to be used in the model. This can be a character string naming a family function, a family function or the result of a call to a family function. (See family for details of family functions.) To use the family option, set modele="pls-glm-family". User-defined families can also be used; see Details.
K: number of groups
NK: number of times the group division is made
grouplist: to specify the members of the K groups
random: should the K groups be made randomly?
scaleX: scale the predictor(s): must be set to TRUE for modele="pls" and should be for PLS glm models.
scaleY: scale the response: Yes/No. Ignored since it is not always possible for glm responses.
keepcoeffs: shall the coefficients for each model be returned
keepfolds: shall the groups' composition be returned
keepdataY: shall the observed value of the response for each one of the predicted value be returned
keepMclassed: shall the number of misclassified observations be returned (unavailable)
tol_Xi: minimal value for Norm2(Xi) and \(\mathrm{det}(pp' \times pp)\) if there are missing values in dataX. It defaults to \(10^{-12}\)
weights: an optional vector of 'prior weights' to be used in the fitting process. Should be NULL or a numeric vector.
method: logistic, probit, complementary log-log or cauchit (corresponding to a Cauchy latent variable).
link: character specification of the link function in the mean model (mu). Currently, "logit", "probit", "cloglog", "cauchit", "log", "loglog" are supported. Alternatively, an object of class "link-glm" can be supplied.
link.phi: character specification of the link function in the precision model (phi). Currently, "identity", "log", "sqrt" are supported. The default is "log" unless formula is of type y~x where the default is "identity" (for backward compatibility). Alternatively, an object of class "link-glm" can be supplied.
type: character specification of the type of estimator. Currently, maximum likelihood ("ML"), ML with bias correction ("BC"), and ML with bias reduction ("BR") are supported.
verbose: should info messages be displayed?

Value

results_kfolds

list of NK elements, one per group division. Each element is a list of K matrices of size about nrow(dataX)/K * nt with the predicted values for a growing number of components.

folds

list of NK elements, one per group division. Each element is a list of K vectors of length about nrow(dataX) with the numbers of the rows of dataX that were used as a training set.

dataY_kfolds

list of NK elements, one per group division. Each element is a list of K matrices of size about nrow(dataX)/K * 1 with the observed values of the response.

call

the call of the function
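
The nested layout described above can be walked with ordinary list tools. The following is a minimal sketch on mock data, not output from the function: the object cv, its values and the helper press_one_division are made up for illustration, and the sketch assumes that each prediction matrix has one column per number of components. The Examples section passes the whole returned object to kfolds2CVinfos_beta instead.

```r
# Mock object mimicking the documented layout (NK = 1, K = 2, nt = 2).
# All values are invented for illustration.
cv <- list(
  results_kfolds = list(list(
    matrix(c(0.20, 0.30, 0.25, 0.28), nrow = 2),  # fold 1: 2 rows x nt columns
    matrix(c(0.50, 0.40, 0.45, 0.42), nrow = 2)   # fold 2
  )),
  dataY_kfolds = list(list(
    matrix(c(0.22, 0.31), ncol = 1),              # observed response, fold 1
    matrix(c(0.48, 0.41), ncol = 1)               # observed response, fold 2
  ))
)

# Hypothetical helper: sum of squared prediction errors per number of
# components, accumulated over the K folds of one group division.
press_one_division <- function(pred_list, obs_list) {
  per_fold <- Map(function(pred, obs) {
    # subtracting a vector from a matrix works column by column
    colSums((pred - as.vector(obs))^2)
  }, pred_list, obs_list)
  Reduce(`+`, per_fold)
}

# One vector of length nt per group division
press <- lapply(seq_along(cv$results_kfolds), function(i) {
  press_one_division(cv$results_kfolds[[i]], cv$dataY_kfolds[[i]])
})
press[[1]]
```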

Details

Each group is predicted in turn from a model fitted on the K-1 other groups. Leave-one-out cross validation is thus obtained for K==nrow(dataX).
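
The group division can be pictured with a few lines of base R. This is only a sketch of the general K-fold idea: make_folds is a hypothetical helper, and the way the package itself assigns rows to groups is not described on this page.

```r
# Hypothetical helper, not part of the package: deal row indices into K groups.
make_folds <- function(n, K, random = FALSE) {
  idx <- if (random) sample.int(n) else seq_len(n)
  # assign the (possibly shuffled) indices to K groups of near-equal size
  unname(split(idx, rep(seq_len(K), length.out = n)))
}

n <- 10
folds <- make_folds(n, K = 5)
for (k in seq_along(folds)) {
  test_rows  <- folds[[k]]                       # the group being predicted
  train_rows <- setdiff(seq_len(n), test_rows)   # the K-1 other groups
  # a model would be fitted on train_rows and used to predict test_rows
}

# With K equal to n, every group holds a single row: leave-one-out.
loo <- make_folds(n, K = n)
all(lengths(loo) == 1)
```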

There are seven predefined models with predefined link functions:

"pls": ordinary pls models
"pls-glm-Gamma": glm Gamma with inverse link pls models
"pls-glm-gaussian": glm gaussian with identity link pls models
"pls-glm-inverse.gaussian": glm inverse gaussian with square inverse link pls models
"pls-glm-logistic": glm binomial with logit link pls models
"pls-glm-poisson": glm poisson with log link pls models
"pls-glm-polr": glm polr with logit link pls models

Using the family option and setting modele="pls-glm-family" allows changing the family and link function in the same way as for the glm function. As a consequence, user-specified families can also be used.

The gaussian family accepts the links (as names) identity, log and inverse.
The binomial family accepts the links logit, probit, cauchit (corresponding to logistic, normal and Cauchy CDFs respectively), log and cloglog (complementary log-log).
The Gamma family accepts the links inverse, identity and log.
The poisson family accepts the links log, identity, and sqrt.
The inverse.gaussian family accepts the links 1/mu^2, inverse, identity and log.
The quasi family accepts the links logit, probit, cloglog, identity, inverse, log, 1/mu^2 and sqrt.
The function power can be used to create a power link function.
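
As a brief illustration of the family and link combinations listed above, the sketch below builds a few family objects with the stats package that ships with R. It only constructs the objects; passing them on through the family argument together with modele="pls-glm-family" is left out, and the variable names are arbitrary.

```r
# Family objects from the stats package with a non-default link.
fam_gamma_log   <- Gamma(link = "log")
fam_binom_prob  <- binomial(link = "probit")
# power() creates a link object that quasi() accepts
fam_quasi_power <- quasi(link = power(1/3))

fam_gamma_log$family   # family name stored in the object
fam_binom_prob$link    # link name stored in the object

# The three forms named in the description of the family argument:
# a character string, a family function, or the result of calling one.
family_as_string   <- "poisson"
family_as_function <- poisson
family_as_call     <- poisson(link = "sqrt")
```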

Non-NULL weights can be used to indicate that different observations have different dispersions (with the values in weights being inversely proportional to the dispersions); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations.
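
The second reading of the weights (each response being the mean of w_i unit-weight observations) can be checked on a small simulated example. The sketch uses stats::glm with a gaussian family purely to illustrate the prior-weights convention; the data are simulated and nothing here shows how PLS_beta_kfoldcv handles weights internally.

```r
set.seed(1)

# Four distinct predictor values, observed w_i times each
x_group <- c(1, 2, 3, 4)
w       <- c(3L, 5L, 2L, 4L)

# Unit-weight observations
x_full <- rep(x_group, times = w)
y_full <- 1 + 2 * x_full + rnorm(length(x_full))

# Group means of the response, in the order of x_group
y_mean <- as.numeric(tapply(y_full, x_full, mean))

# Fit on every observation versus fit on the means with prior weights w
fit_full <- glm(y_full ~ x_full, family = gaussian())
fit_mean <- glm(y_mean ~ x_group, family = gaussian(), weights = w)

# The estimated coefficients agree up to numerical tolerance
all.equal(unname(coef(fit_full)), unname(coef(fit_mean)))
```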

Note

Works for complete and incomplete datasets.

References

Frédéric Bertrand, Nicolas Meyer, Michèle Beau-Faller, Karim El Bayed, Izzie-Jacques Namer, Myriam Maumy-Bertrand (2013). Régression Bêta PLS. Journal de la Société Française de Statistique, 154(3):143-159. https://ojs-test.apps.ocp.math.cnrs.fr/index.php/J-SFdS/article/view/215

Author

Frédéric Bertrand
frederic.bertrand@lecnam.net
https://fbertran.github.io/homepage/

Examples


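if (FALSE) { # \dontrun{
# Gasoline yield data from the betareg package; the response is a proportion
data("GasolineYield", package = "betareg")
yGasolineYield <- GasolineYield$yield
XGasolineYield <- GasolineYield[, 2:5]
# Cross-validate a PLS beta regression with 3 components
# (K defaults to nrow(dataX), i.e. leave-one-out)
bbb <- PLS_beta_kfoldcv(yGasolineYield, XGasolineYield, nt = 3, modele = "pls-beta")
# Summarise the cross validation results
kfolds2CVinfos_beta(bbb)
} # }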