| Title: | Empirical Sample Complexity Bounds |
|---|---|
| Description: | Provides tools for estimating empirical sample complexity bounds for supervised learning tasks. The package supports simulation-based estimates of generalization curves, parametric extrapolation of empirical sample complexity bounds, theoretical bounds based on Vapnik-Chervonenkis dimension, and optional monotone Gaussian process extrapolation for users who install the external 'cmdstanr' workflow. For more details, see Carter and Choi (2024) <doi:10.31219/osf.io/evrcj>. |
| Authors: | Perry Carter [aut, cre] (ORCID: <https://orcid.org/0000-0002-4684-6533>), Dahyun Choi [aut] (ORCID: <https://orcid.org/0000-0002-2628-1467>) |
| Maintainer: | Perry Carter <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.7.0 |
| Built: | 2026-06-24 09:31:31 UTC |
| Source: | https://github.com/pjesscarter/scr |
Utility function to generate accuracy metrics, for use with estimate_accuracy()
acc_sim(n, method, p, dat, model, eta, nsample, outcome, power, effect_size, powersims, alpha, split, predictfn, replacement, ...)
n |
An integer giving the desired sample size for which the target function is to be calculated. |
method |
An optional string stating the distribution from which data is to be generated. Default is i.i.d. uniform sampling. Currently also supports "Class Imbalance". Can also take a function outputting a vector of probabilities if the user wishes to specify a custom distribution. |
p |
If method is 'Class Imbalance', gives the degree of weight placed on the positive class. |
dat |
A rectangular |
model |
A function giving the model to be estimated |
eta |
A real number between 0 and 1 giving the probability of misclassification error in the training data. |
nsample |
A positive integer giving the number of samples to be generated for each value of $n$. Larger values give more accurate results. |
outcome |
A string giving the name of the outcome variable. |
power |
A logical indicating whether experimental power based on the predictions should also be reported |
effect_size |
If |
powersims |
If |
alpha |
If |
split |
A logical indicating whether the data was passed as a single data frame or separately. |
predictfn |
An optional user-defined function giving a custom predict method. If also using a user-defined model, the |
replacement |
A logical flag indicating whether sampling should be performed with replacement. |
... |
Additional model parameters to be specified by the user. |
A data frame giving performance metrics for the specified sample size.
Wrapper function for fitting an extrapolation model to a single object of class "scb_data". Allows fitting of custom functions, but for general use interpolate_scb() should be used instead.
conduct_interpolation(scbobject, epsilon, delta, maxN, delta_formula, epsilon_formula, delta_lower_bounds = NULL, epsilon_lower_bounds = NULL, delta_upper_bounds = NULL, epsilon_upper_bounds = NULL, delta_start = NULL, epsilon_start = NULL)
scbobject |
An object of class "scb_data" for interpolation to be conducted on. |
epsilon |
A real number between 0 and 1 giving the targeted maximum out-of-sample (OOS) error rate |
delta |
A real number between 0 and 1 giving the targeted maximum probability of observing an OOS error rate higher than |
maxN |
A positive integer giving the largest N for which extrapolation is to be conducted. |
delta_formula |
Formula of the form Delta ~ model(n,...) giving the NLS model to be applied to the delta curve. |
epsilon_formula |
Formula of the form Epsilon ~ model(n,...) giving the NLS model to be applied to the epsilon curve. |
delta_lower_bounds |
Optional named vector of lower bounds for the delta model parameters. |
epsilon_lower_bounds |
Optional named vector of lower bounds for the epsilon model parameters. |
delta_upper_bounds |
Optional named vector of upper bounds for the delta model parameters. |
epsilon_upper_bounds |
Optional named vector of upper bounds for the epsilon model parameters. |
delta_start |
Optional named vector of starting values for the delta model parameters. |
epsilon_start |
Optional named vector of starting values for the epsilon model parameters. |
A named list containing the interpolated dataframe, the original input data frame, and the given values of epsilon, delta, and maxN.
interpolate_scb() is the main wrapper for interpolation on a list.
Utility function for creating a custom classification function for use in SCB calculations.
create_scb_model(model_fun, extra_args = list(), split = FALSE, arg_map = list(formula = "formula", data = "data", x = "x", y = "y"))
model_fun |
A binary classification model supplied by the user, e.g. glm |
extra_args |
A list of additional default arguments to be passed to the classification model, e.g. family = binomial(link = "logit") |
split |
Logical indicating whether the model expects a single data argument or separate x/y values. |
arg_map |
Named list giving mappings for names of formula, data, x, and y arguments expected in |
A function taking supplied arguments that can be passed to other package functions such as estimate_accuracy()
Utility function for creating a custom prediction function for use in SCB calculations.
create_scb_prediction(predict_fun, extra_args = list(), transform_fn = identity, arg_map = list(m = "m", newdata = "newdata"))
predict_fun |
A binary prediction model supplied by the user, e.g. predict.glm |
extra_args |
A list of additional default arguments to be passed to the prediction function, e.g. type = "response" |
transform_fn |
Function giving the transformation, if any, to be applied to the prediction output to generate binary predictions. Defaults to identity() |
arg_map |
Named list giving mappings for names of |
A prediction function taking supplied arguments that can be passed to other package functions such as estimate_accuracy()
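As an illustration of what a transform_fn might do, the sketch below thresholds predicted probabilities from a logistic regression into a two-level factor, mirroring the thresholding used in the gendata() example elsewhere in this manual. It uses only base R; `to_class` and the simulated `train` data are hypothetical names introduced for this example, and the 0.5 cutoff is an assumption rather than a package default.

```r
set.seed(3)
# Simulated training data with a binary outcome (illustrative only).
train <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
train$y <- rbinom(100, 1, plogis(1.5 * train$x1 - train$x2))
m <- glm(y ~ x1 + x2, data = train, family = binomial())

# Candidate transform_fn: threshold probabilities at 0.5 and
# return a factor with fixed levels "0" and "1".
to_class <- function(p) factor(ifelse(p > 0.5, 1, 0), levels = c("0", "1"))

probs <- predict(m, newdata = train, type = "response")
pred <- to_class(probs)
print(table(predicted = pred, observed = train$y))
```

Fixing the factor levels keeps the output shape stable even when every prediction falls in one class, which can happen at small sample sizes.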
Estimate sample complexity bounds for a binary classification algorithm using either simulated or user-supplied data.
estimate_accuracy(formula, model, data = NULL, dim = NULL, maxn = NULL, sparse = FALSE, density = NULL, upperlimit = NULL, nsample = 30, steps = 50, eta = 0.05, delta = 0.05, epsilon = 0.05, predictfn = NULL, subsample_size = NULL, nboot = 1L, power = FALSE, effect_size = NULL, powersims = NULL, alpha = 0.05, parallel = TRUE, coreoffset = 0, packages = list(), method = c("Uniform", "Class Imbalance"), p = NULL, minn = ifelse(is.null(data), ifelse(is.null(x), (dim + 1), (ncol(x) + 1)), (ncol(data) + 1)), x = NULL, y = NULL, backend = c("multisession", "multicore", "cluster", "sequential"), replacement = TRUE, ...)
formula |
A |
model |
A binary classification model supplied by the user. Must take arguments |
data |
Optional. A rectangular |
dim |
Required if |
maxn |
Required if |
sparse |
Optional. A logical giving whether to generate sparse data, if data was not given. |
density |
Real number between 0 and 1 giving the proportion of nonzero entries in the sparse matrix. Used only if sparse is TRUE. |
upperlimit |
Optional. A positive integer giving the maximum sample size to be simulated, if data was supplied. |
nsample |
A positive integer giving the number of samples to be generated for each value of $n$. Larger values give more accurate results. |
steps |
A positive integer giving the interval of values of $n$ for which simulations should be conducted. Larger values give more accurate results. |
eta |
A real number between 0 and 1 giving the probability of misclassification error in the training data. |
delta |
A real number between 0 and 1 giving the targeted maximum probability of observing an OOS error rate higher than |
epsilon |
A real number between 0 and 1 giving the targeted maximum out-of-sample (OOS) error rate |
predictfn |
An optional user-defined function giving a custom predict method. If also using a user-defined model, the |
subsample_size |
An integer giving the size of the initial 'pilot' sample to be simulated. If left as NULL, the input data size will be used (benchmark SCB). |
nboot |
An integer giving the number of SCB bootstraps to be performed. |
power |
A logical indicating whether experimental power based on the predictions should also be reported |
effect_size |
If |
powersims |
If |
alpha |
If |
parallel |
Boolean indicating whether or not to use parallel processing. |
coreoffset |
If |
packages |
A list of packages that need to be loaded in order to run |
method |
An optional string stating the distribution from which data is to be generated. Default is i.i.d. uniform sampling. Can also take a function outputting a vector of probabilities if the user wishes to specify a custom distribution. |
p |
If method is 'Class Imbalance', gives the degree of weight placed on the positive class. |
minn |
Optional argument to set a different minimum n than the dimension of the algorithm. Useful with e.g. regularized regression models such as elastic net. |
x |
Optional argument for methods that take separate predictor and outcome data. Specifies a matrix-like object containing predictors. Note that if used, the x and y objects are bound together columnwise; this must be handled in the user-supplied helper function. |
y |
Optional argument for methods that take separate predictor and outcome data. Specifies a vector-like object containing outcome values. Note that if used, the x and y objects are bound together columnwise; this must be handled in the user-supplied helper function. |
backend |
One of the parallel backends used by |
replacement |
A logical flag indicating whether sampling should be performed with replacement. |
... |
Additional arguments that need to be passed to |
A list containing two named elements. Raw gives the exact output of the simulations, while Summary gives a table of accuracy metrics, including the achieved levels of epsilon and delta given the specified values. Alternative values can be calculated using getpac().
plot.scb_data(), to represent simulations visually; getpac(), to calculate summaries for alternate values of epsilon and delta without conducting a new simulation; and gendata(), to generate synthetic datasets.
# See the package README for an end-to-end example.
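To make the Summary columns more concrete, the following sketch computes achieved delta and epsilon from a vector of simulated out-of-sample accuracies at a single sample size. It assumes that delta is the share of simulations whose error rate exceeds epsilon, and that epsilon is the (1 - delta) quantile of the error rate. That reading follows the argument descriptions above but is an assumption about the package internals; `achieved_pac` is a hypothetical helper, not a package function.

```r
# Hypothetical helper: summarise simulated accuracies at one sample size.
achieved_pac <- function(accuracy, epsilon = 0.05, delta = 0.05) {
  err <- 1 - accuracy
  list(
    # Share of simulations with an error rate above the epsilon target.
    Delta = mean(err > epsilon),
    # Error rate not exceeded in a (1 - delta) share of simulations.
    Epsilon = unname(stats::quantile(err, probs = 1 - delta))
  )
}

set.seed(42)
# 200 illustrative accuracies, clipped to the unit interval.
acc <- pmin(1, pmax(0, rnorm(200, mean = 0.93, sd = 0.03)))
res <- achieved_pac(acc, epsilon = 0.05, delta = 0.05)
print(res)
```

Because both quantities are computed from the stored simulation output, changing the targets does not require rerunning the simulations, which is the role getpac() plays.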
Utility function to fit an extrapolation model, for use with conduct_interpolation()
fit_and_predict(formula, data, N_grid, maxN_obs, start = NULL, lower = NULL, upper = NULL)
formula |
A formula object giving the model to be fit. |
data |
A data frame giving the data the model is to be fit on. |
N_grid |
An integer vector of N values to conduct interpolation and extrapolation on. |
maxN_obs |
A positive integer giving the largest N in the observed data. |
start |
Optional named vector of starting values for the model parameters. |
lower |
Optional named vector of lower bounds for the model parameters. |
upper |
Optional named vector of upper bounds for the model parameters. |
A fitted model object of the chosen type.
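The kind of fit described here can be sketched with base R's nls() using bounded parameters. The example below is not the package's implementation: the two-parameter logistic form, the parameter names `b` and `c`, the simulated data, and the "Observed"/"Extrapolated" labelling are assumptions chosen for illustration.

```r
set.seed(1)
# Simulated delta curve: the chance of exceeding the error target falls with n.
obs <- data.frame(n = seq(20, 400, by = 20))
obs$Delta <- 1 / (1 + exp((obs$n - 150) / 30)) + rnorm(nrow(obs), sd = 0.02)

# Bounded nonlinear least squares; bounds require algorithm = "port".
fit <- nls(
  Delta ~ 1 / (1 + exp((n - b) / c)),   # b = midpoint, c = scale (illustrative)
  data = obs,
  start = list(b = 100, c = 20),
  lower = c(b = 1, c = 1),
  upper = c(b = 1000, c = 500),
  algorithm = "port"
)

# Predict on a grid that extends beyond the largest observed n.
N_grid <- seq(20, 1000, by = 10)
pred <- predict(fit, newdata = data.frame(n = N_grid))
maxN_obs <- max(obs$n)

# First grid point at or below a 0.05 target, and whether it lies
# inside the observed range or relies on extrapolation.
first_below <- N_grid[which(pred <= 0.05)[1]]
status <- if (first_below <= maxN_obs) "Observed" else "Extrapolated"
print(c(first_below = first_below))
print(status)
```

Supplying start, lower, and upper as named vectors keeps the optimiser inside a plausible region, which matters when the observed curve covers only the early, steep part of the learning curve.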
Fits the monotone-integrated Gaussian process extrapolator described in the
paper appendix. This is the nonparametric curve-fitting option only; it does
not generate the resampled accuracy curves. Use estimate_accuracy() first,
then pass the resulting scb_data object to interpolate_scb_gp().
fit_gp_scb_curve(x, y, curve = c("delta", "epsilon"), maxN = NULL, M_grid = 120L, epsilon0 = NULL, stan_file = NULL, seed = 2027, chains = 4L, parallel_chains = chains, iter_warmup = 800L, iter_sampling = 800L, adapt_delta = 0.97, max_treedepth = 12L, refresh = 100L, ci = 0.9, init = NULL, ...)
x |
Numeric vector of training sample sizes. |
y |
Numeric vector of observed curve values in |
curve |
Character string; either |
maxN |
Largest sample size on the prediction grid. Defaults to the
largest observed value in |
M_grid |
Number of evenly spaced prediction-grid points before observed sample sizes are added exactly to the grid. |
epsilon0 |
Baseline error rate for the epsilon curve. If |
stan_file |
Optional path to a Stan model. By default the Stan file
shipped in |
seed, chains, parallel_chains, iter_warmup, iter_sampling, adapt_delta, max_treedepth, refresh |
Sampling controls passed to |
ci |
Credible interval level for posterior summaries. |
init |
Optional initialization function or list passed to |
... |
Additional arguments passed to the |
The implementation uses a Gaussian process prior on an unconstrained latent field, applies a softplus transform to obtain a nonnegative derivative, integrates that derivative on a fixed grid, and maps the integrated latent curve through the paper's delta or epsilon link function.
This function requires the optional packages cmdstanr and
posterior, and a working CmdStan installation. These packages are not hard
dependencies of scR, so the core package remains lightweight.
An object of class scR_gp_curve containing the fitted Stan object,
posterior summaries on the prediction grid, and the Stan data list.
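The softplus-and-integrate construction described in the details can be illustrated without Stan. In the sketch below, a deterministic curve stands in for one draw of the latent field (it is not a Gaussian process draw), and the paper's delta or epsilon link function is omitted because it is not specified in this manual. All names are hypothetical.

```r
# Softplus maps any real value to a strictly positive one.
softplus <- function(z) log1p(exp(z))

# Fixed grid (rescaled sample sizes) and a stand-in latent field.
grid_pts <- seq(0, 1, length.out = 120)
latent <- sin(6 * grid_pts) - 0.5

# Nonnegative derivative obtained from the unconstrained latent values.
deriv <- softplus(latent)

# Cumulative trapezoid rule: integrate the derivative along the grid.
k <- length(deriv)
increments <- diff(grid_pts) * (deriv[-k] + deriv[-1]) / 2
curve_latent <- c(0, cumsum(increments))

# The integrated curve is increasing even though the latent field is not.
print(range(diff(curve_latent)))
```

Monotonicity is enforced by construction here, since every increment is an integral of a positive function, so no constraint has to be imposed on the latent field itself.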
Simulate data with appropriate structure to be used in estimating sample complexity bounds
gendata(model, dim, maxn, predictfn = NULL, varnames = NULL, sparse = FALSE, density = NULL, ...)
model |
A binary classification model supplied by the user. Must take arguments |
dim |
Gives the horizontal dimension of the data (number of predictor variables) to be generated. |
maxn |
Gives the vertical dimension of the data (number of observations) to be generated. |
predictfn |
An optional user-defined function giving a custom predict method. If also using a user-defined model, the |
varnames |
An optional character vector giving the names of variables to be used for the generated data |
sparse |
Logical indicating whether sparse matrix generation should be used to save on memory. Defaults to false for better accuracy. |
density |
Real number between 0 and 1 giving the proportion of nonzero entries in the sparse matrix. Used only if sparse is TRUE. |
... |
Additional arguments that need to be passed to |
A data.frame containing the simulated data.
estimate_accuracy(), to estimate sample complexity bounds given the generated data
mylogit <- function(formula, data) {
  structure(
    suppressWarnings(glm(formula = formula, data = data, family = binomial())),
    class = c("svrclass", "glm")
  )
}
mypred <- function(m, newdata) {
  out <- predict.glm(m, newdata, type = "response")
  factor(ifelse(out > 0.5, 1, 0), levels = c("0", "1"))
}
set.seed(1)
dat <- gendata(mylogit, dim = 2, maxn = 20, predictfn = mypred)
head(dat)
Recalculate achieved sample complexity bounds given different parameter inputs
getpac(table, epsilon = 0.05, delta = 0.05)
table |
A list containing an element named |
epsilon |
A real number between 0 and 1 giving the targeted maximum out-of-sample (OOS) error rate |
delta |
A real number between 0 and 1 giving the targeted maximum probability of observing an OOS error rate higher than |
A list containing two named elements. Raw gives the exact output of the simulations, while Summary gives a table of accuracy metrics, including the achieved levels of epsilon and delta given the specified values. Alternative values can be calculated using getpac() again.
plot.scb_data(), to represent simulations visually; getpac(), to calculate summaries for alternate values of epsilon and delta without conducting a new simulation; and gendata(), to generate synthetic datasets.
# Recalculate a stored scb_data object with alternate epsilon and delta values.
Wrapper function for fitting an extrapolation model to a list of objects of class "scb_data" using nonlinear least squares.
interpolate_scb(data_list, delta_interp_fun = c("logis", "logis5", "logis4", "declin"), epsilon_interp_fun = c("gompertz", "exp_plateau", "weibull", "quad_plateau"), epsilon, delta, maxN, delta_lower_bounds = NULL, epsilon_lower_bounds = NULL, delta_upper_bounds = NULL, epsilon_upper_bounds = NULL, delta_start = NULL, epsilon_start = NULL)
data_list |
A list of objects of class "scb_data" for interpolation to be conducted on. |
delta_interp_fun |
The interpolation/extrapolation function to be used for the delta curve. Defaults to standard logistic, but 4 and 5 parameter as well as declining logistic functions are also supported. |
epsilon_interp_fun |
The interpolation/extrapolation function to be used for the epsilon curve. Defaults to gompertz, but exponential-plateau, weibull, and quadratic plateau functions are also supported. |
epsilon |
A real number between 0 and 1 giving the targeted maximum out-of-sample (OOS) error rate |
delta |
A real number between 0 and 1 giving the targeted maximum probability of observing an OOS error rate higher than |
maxN |
A positive integer giving the largest N for which extrapolation is to be conducted. |
delta_lower_bounds |
Optional named vector of lower bounds for the delta model parameters. |
epsilon_lower_bounds |
Optional named vector of lower bounds for the epsilon model parameters. |
delta_upper_bounds |
Optional named vector of upper bounds for the delta model parameters. |
epsilon_upper_bounds |
Optional named vector of upper bounds for the epsilon model parameters. |
delta_start |
Optional named vector of starting values for the delta model parameters. |
epsilon_start |
Optional named vector of starting values for the epsilon model parameters. |
A named list containing the interpolated dataframe, the original input data frame, and the given values of epsilon, delta, and maxN.
conduct_interpolation() can be used to fit custom curves or single simulations without confidence intervals.
Convenience wrapper around fit_gp_scb_curve() for objects produced by
estimate_accuracy(). The returned object has a structure similar to the
parametric output from interpolate_scb(), but represents posterior summaries
from a single monotone Gaussian process fit rather than bootstrap envelopes
over many parametric fits.
interpolate_scb_gp(scbobject, epsilon = 0.05, delta = 0.05, maxN, curve = c("both", "delta", "epsilon"), ...)
scbobject |
An object of class |
epsilon |
Target maximum generalization error. |
delta |
Target maximum probability of exceeding |
maxN |
Largest sample size on the prediction grid. |
curve |
Which curve to fit: |
... |
Additional arguments passed to |
An object of class empirical_scb_gp.
Utility function to define the least-squares loss function to be optimized for simvcd()
loss(h, ngrid, xi, a = 0.16, a1 = 1.2, a11 = 0.14927)
h |
A positive real number giving the current guess at VC dimension |
ngrid |
Vector of sample sizes for which the bounding function is estimated. |
xi |
Vector of estimated values of the bounding function, usually obtained from |
a |
Scaling coefficient for the bounding function. Defaults to the value given by Vapnik, Levin and Le Cun 1994. |
a1 |
Scaling coefficient for the bounding function. Defaults to the value given by Vapnik, Levin and Le Cun 1994. |
a11 |
Scaling coefficient for the bounding function. Defaults to the value given by Vapnik, Levin and Le Cun 1994. |
A real number giving the estimated value of the MSE given the current guess.
simvcd(), the user-facing function for simulating VC dimension, and risk_bounds(), to generate estimates for xi.
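The least-squares idea behind this loss can be sketched in base R. The code below assumes the bounding function takes the form reported by Vapnik, Levin and Le Cun (1994), with `a`, `a1`, and `a11` playing the roles of that paper's three constants; this mapping is an assumption, and `vc_bound` and `vc_loss` are hypothetical stand-ins rather than the package's own functions.

```r
# Assumed bounding function Phi(n / h): equal to 1 below tau = 0.5,
# and decreasing in tau = n / h above it.
vc_bound <- function(n, h, a = 0.16, a1 = 1.2, a11 = 0.14927) {
  tau <- n / h
  tt <- pmax(tau, 0.5)  # avoids NaN below the cutoff; those entries become 1
  lg <- log(2 * tt) + 1
  phi <- a * lg / (tt - a11) * (sqrt(1 + a1 * (tt - a11) / lg) + 1)
  ifelse(tau < 0.5, 1, phi)
}

# Mean squared error between estimated values xi and the bound at guess h.
vc_loss <- function(h, ngrid, xi, ...) {
  mean((xi - vc_bound(ngrid, h, ...))^2)
}

# Noise-free values generated from a known h, then recovered by
# one-dimensional minimisation of the loss.
ngrid <- seq(10, 500, by = 10)
xi <- vc_bound(ngrid, h = 10)
h_hat <- optimize(vc_loss, interval = c(1, 100), ngrid = ngrid, xi = xi)$minimum
print(h_hat)
```

In practice the xi values come from simulation and carry noise, so the recovered h is an estimate whose precision depends on the number of design points and simulations per point.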
Plot a monotone Gaussian process sample-complexity fit
## S3 method for class 'empirical_scb_gp'
plot(x, plot_type = c("Delta", "Epsilon"), include_legend = TRUE, include_title = FALSE, ...)
x |
An object returned by |
plot_type |
Which curve to plot. |
include_legend |
Logical; whether to include a legend. |
include_title |
Logical; whether to include a title. |
... |
Ignored. |
A ggplot2 plot.
Visualizes bootstrap-estimated empirical sample complexity bounds (SCB) for either delta or epsilon from an empirical_scb_list object.
## S3 method for class 'empirical_scb_list'
plot(x, truedata, alpha = 0.05, plot_type = c("Delta", "Epsilon"), include_legend = TRUE, include_title = TRUE, ...)
x |
An object of class |
truedata |
A bootstrapped list of benchmark simulations, each of class |
alpha |
Numeric between 0 and 1. Significance level used to compute bootstrap confidence intervals (default: |
plot_type |
Character string. Determines which SCB to plot: |
include_legend |
Logical. Whether to display a legend (default: |
include_title |
Logical. Whether to include a title (default: |
... |
Additional arguments passed to methods. |
A ggplot object displaying either the SCB-Delta or SCB-Epsilon curve with bootstrap confidence bands.
interpolate_scb(), to prepare the input data.
This method plots performance metrics estimated using estimate_accuracy() for objects of class "scb_data".
## S3 method for class 'scb_data'
plot(x, metrics = c("Accuracy", "Precision", "Recall", "Fscore", "Delta", "Epsilon", "Power"), plottype = c("ggplot", "plotly"), letters = c("greek", "latin"), ...)
x |
An object of class |
metrics |
A character vector containing the metrics to display in the plot. Can include any of "Accuracy", "Precision", "Recall", "Fscore", "Delta", "Epsilon", "Power". |
plottype |
A string indicating the graphics system to use. Must be either |
letters |
A string specifying whether |
... |
Additional arguments passed to methods. |
A ggplot or plot_ly object, depending on the value of plottype.
estimate_accuracy(), which generates the data used for plotting.
# Plot objects returned by estimate_accuracy().
Utility function to generate data points for estimation of the VC Dimension of a user-specified binary classification algorithm given a specified sample size.
risk_bounds(x, l, m, model, predictfn = NULL, sparse = FALSE, density = NULL, ...)
x |
An integer giving the desired sample size for which the target function is to be approximated. |
l |
A positive integer giving dimension (number of input features) of the model. |
m |
A positive integer giving the number of simulations to be performed at each design point (sample size value). Higher values give more accurate results but increase computation time. |
model |
A binary classification model supplied by the user. Must take arguments |
predictfn |
An optional user-defined function giving a custom predict method. If also using a user-defined model, the |
sparse |
Logical indicating whether sparse matrix generation should be used to save on memory. Defaults to false for better accuracy. |
density |
Real number between 0 and 1 giving the proportion of nonzero entries in the sparse matrix. Used only if sparse is TRUE. |
... |
Additional model parameters to be specified by the user. |
A real number giving the estimated value of Xi(n), the bounding function.
Calculate sample complexity bounds for a classifier given target accuracy
scb(vcd = NULL, epsilon = NULL, delta = NULL, eta = NULL, theor = TRUE, ...)
vcd |
The Vapnik-Chervonenkis dimension (VCD) of the chosen classifier. If |
epsilon |
A real number between 0 and 1 giving the targeted maximum out-of-sample (OOS) error rate |
delta |
A real number between 0 and 1 giving the targeted maximum probability of observing an OOS error rate higher than |
eta |
A real number between 0 and 1 giving the probability of misclassification error in the training data. |
theor |
A Boolean indicating whether the theoretical VCD is to be used. If |
... |
Arguments to be passed to |
A real number giving the sample complexity bound for the specified parameters.
simvcd(), to calculate VCD for a chosen model
scb(vcd = 7, epsilon = 0.05, delta = 0.05, eta = 0.05)
Estimate the Vapnik-Chervonenkis (VC) dimension of an arbitrary binary classification algorithm.
simvcd(model, dim, m = 1000, k = 1000, maxn = 5000, parallel = TRUE, coreoffset = 0, predictfn = NULL, a = 0.16, a1 = 1.2, a11 = 0.14927, minn = (dim + 1), sparse = FALSE, density = NULL, backend = c("multisession", "multicore", "cluster", "sequential"), packages = list(), ...)
model |
A binary classification model supplied by the user. Must take arguments |
dim |
A positive integer giving dimension (number of input features) of the model. |
m |
A positive integer giving the number of simulations to be performed at each design point (sample size value). Higher values give more accurate results but increase computation time. |
k |
A positive integer giving the number of design points (sample size values) for which the bounding function is to be estimated. Higher values give more accurate results but increase computation time. |
maxn |
Gives the vertical dimension of the data (number of observations) to be generated. |
parallel |
Boolean indicating whether or not to use parallel processing. |
coreoffset |
If |
predictfn |
An optional user-defined function giving a custom predict method. If also using a user-defined model, the |
a |
Scaling coefficient for the bounding function. Defaults to the value given by Vapnik, Levin and Le Cun 1994. |
a1 |
Scaling coefficient for the bounding function. Defaults to the value given by Vapnik, Levin and Le Cun 1994. |
a11 |
Scaling coefficient for the bounding function. Defaults to the value given by Vapnik, Levin and Le Cun 1994. |
minn |
Optional argument to set a different minimum n than the dimension of the algorithm. Useful with e.g. regularized regression models such as elastic net. |
sparse |
Logical indicating whether sparse matrix generation should be used to save on memory. Defaults to false for better accuracy. |
density |
Real number between 0 and 1 giving the proportion of nonzero entries in the sparse matrix. Used only if sparse is TRUE. |
backend |
One of the parallel backends used by |
packages |
A |
... |
Additional arguments that need to be passed to |
A real number giving the estimated value of the VC dimension of the supplied model.
scb(), to calculate sample complexity bounds given estimated VCD.
# Use small values of m, k, and maxn for quick smoke tests.
# Use larger values in applied work.
Summarize a monotone Gaussian process sample-complexity fit
## S3 method for class 'empirical_scb_gp'
summary(object, ...)
object |
An object returned by |
... |
Ignored. |
A list summarizing the first grid point where each fitted curve meets its target.
For an empirical_scb_list object, finds
the mean-fit SCB crossing (where the average bootstrap curve first drops below the target),
a lower bound on that crossing (the smallest N at which the lower CI envelope crosses), and
an upper bound on the crossing (the smallest N at which the upper CI envelope crosses).
## S3 method for class 'empirical_scb_list'
summary(object, alpha = 0.05, ...)
object |
An object of class |
alpha |
Numeric in |
... |
Additional args (ignored). |
Invisibly, a list with the following components:
delta |
List with SCB_N (mean-curve crossing), status ("Observed"/"Extrapolated", or NA if not reached), and CI_lower_N, CI_upper_N (bounds on the crossing). |
epsilon |
Same four elements for the epsilon target. |
alpha |
The CI level. |
initial |
The initial subsample size. |
plot.empirical_scb_list, getpac
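The crossing logic summarized for empirical_scb_list objects can be sketched as follows. The example assumes that the CI envelopes are pointwise bootstrap quantiles and that a crossing is the first grid value at or below the target; both are assumptions about the method, and `first_crossing` and the simulated curves are hypothetical.

```r
# Hypothetical helper: first grid value at which a curve is at or below the target.
first_crossing <- function(N, curve, target) {
  idx <- which(curve <= target)
  if (length(idx) == 0) NA_real_ else N[idx[1]]
}

set.seed(7)
N <- seq(50, 1000, by = 50)
# 200 illustrative bootstrap delta curves (rows) evaluated on the grid N (columns).
boot <- t(replicate(200, 1 / (1 + exp((N - runif(1, 250, 350)) / 60))))

alpha <- 0.05
mean_curve <- colMeans(boot)
lower_env <- apply(boot, 2, quantile, probs = alpha / 2)
upper_env <- apply(boot, 2, quantile, probs = 1 - alpha / 2)

target <- 0.05
SCB_N <- first_crossing(N, mean_curve, target)
CI_lower_N <- first_crossing(N, lower_env, target)
CI_upper_N <- first_crossing(N, upper_env, target)
print(c(CI_lower_N = CI_lower_N, SCB_N = SCB_N, CI_upper_N = CI_upper_N))
```

Because the lower envelope sits below the mean curve, it reaches the target at a smaller sample size, so the lower envelope yields the lower bound on the crossing and the upper envelope yields the upper bound.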