How it works
The generateData function calls the low-level data generation
components to create sets of simulated data. The following
components are called to create aspects of the simulated trial data:
- createTreatments(…): Creates a dataset of all possible treatment regimes to be allocated to subjects
- allocateTreatments(…): Allocates treatments to subjects in the simulated study
- createCovariates(…): Creates a set of fixed covariates for a simulated population
- createParameters(…): Creates simulated fixed and between subject parameters for subjects in each replicate
- createResponse(…): Creates a simulated response variable based on available derived data
- createMCAR(…): Adds a simulated “missing” flag to the data
- createDropout(…): Adds a simulated “missing” flag to the data based on a dropout function
- createInterims(…): Assigns subjects in the study to interim analyses
- createDirectories(…): Creates the ReplicateData directory under the current working directory
- writeData(…): Writes out the simulation replicate data in CSV format
The generateData function iteratively builds and combines the data
components for each replicate, and stores the data in the
“ReplicateData” subdirectory of the working directory. This data can
then be analyzed using a call to the analyzeData function.
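
The replicate loop described above can be sketched in base R. This is only an illustration of the idea, not MSToolkit's implementation: the helper `simulateReplicate`, the file-name pattern and the Emax-style response are assumptions made for the example, and a temporary directory stands in for the working directory.

```r
# Minimal base-R sketch of the replicate loop (not MSToolkit's own code).
set.seed(1)

# Hypothetical helper: builds one replicate as a data frame
simulateReplicate <- function(subjects, doses) {
  # Allocate each subject to a treatment with equal probability
  trt  <- sample(seq_along(doses), size = subjects, replace = TRUE)
  dose <- doses[trt]
  # Illustrative Emax-style response plus residual noise (assumed model)
  resp <- 2 + (dose * 10) / (dose + 50) + rnorm(subjects, sd = 1)
  data.frame(SUBJ = seq_len(subjects), TRT = trt, DOSE = dose, RESP = resp)
}

# A temporary directory stands in for the working directory
outDir <- file.path(tempdir(), "ReplicateData")
dir.create(outDir, showWarnings = FALSE, recursive = TRUE)

replicateN <- 3
for (i in seq_len(replicateN)) {
  repData <- simulateReplicate(subjects = 20, doses = c(0, 5, 25, 50, 100))
  # The file-name pattern is an assumption made for this sketch
  write.csv(repData, file.path(outDir, sprintf("replicate%04d.csv", i)),
            row.names = FALSE)
}

# One CSV file per replicate is now available for later analysis
writtenFiles <- list.files(outDir, pattern = "^replicate[0-9]{4}\\.csv$")
```

Each file holds one simulated trial, so a later analysis step can read the replicates back one at a time with `read.csv`.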
Arguments
The generateData function takes a number of arguments which are
passed down to the various lower-level functions.
Required Arguments
Argument Name | Description |
---|---|
replicateN | Specifies how many replicates / simulated trials to generate |
subjects | Total number of subjects for the whole design. The default behaviour is to allocate subjects to each treatment with equal probability, which does not guarantee equal allocation. See treatSubj below for further details of treatment allocation methods. |
treatDoses | Specifies the doses to be used in simulations. MSToolkit was designed to evaluate the operating characteristics of clinical trials, but its functionality can be extended to simulate non-clinical trials by thinking of “doses” as other factors which vary between individuals within a simulation replicate. In the generated dataset TRT defines the treatment arm to which a subject is allocated, and for parallel group designs this has a corresponding, unique value of DOSE. In the case of crossover trials, however, TRT is the treatment sequence to which a subject is allocated and must have a corresponding treatment sequence. |
respEqn | Specifies the linear predictor for generating outcome values. This should be a valid R expression or function. The expression can be written directly in generateData, or an R function defined outside of generateData can be called. This function must return a vector of equal length to the number of rows in the generated data: one value per subject, or one value per observation (TIME) within each subject. |
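
The two ways of supplying respEqn, and the requirement that it returns one value per row, can be illustrated with a small base-R sketch. The helper `evalRespEqn`, the toy dataset and the Emax parameter names (`E0`, `ED50`, `EMAX`) are assumptions made for this example, and passing the expression as a character string is likewise an assumption about the input form; the code does not reproduce how MSToolkit evaluates the equation internally.

```r
# Sketch: evaluating a response equation against a generated dataset.
# `evalRespEqn` is a hypothetical helper, not an MSToolkit function.
evalRespEqn <- function(respEqn, data) {
  if (is.function(respEqn)) {
    out <- respEqn(data)
  } else {
    # Evaluate the expression with the dataset columns in scope
    out <- eval(parse(text = respEqn), envir = data)
  }
  # The result must supply one value per row of the data
  if (length(out) != nrow(data)) {
    stop("respEqn must return one value per row of the data")
  }
  out
}

# Toy dataset; column names and parameter values are illustrative assumptions
simData <- data.frame(
  SUBJ = 1:4,
  DOSE = c(0, 25, 50, 100),
  E0   = 2,
  ED50 = 50,
  EMAX = 10
)

# Option 1: expression written directly
respFromExpr <- evalRespEqn("E0 + (DOSE * EMAX) / (DOSE + ED50)", simData)

# Option 2: function defined separately and passed in
emaxFun <- function(data) with(data, E0 + (DOSE * EMAX) / (DOSE + ED50))
respFromFun <- evalRespEqn(emaxFun, simData)
```

Both routes give the same vector of linear predictor values, one per row of `simData`.
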
Optional Arguments
Argument Name | Description |
---|---|
treatSubj, treatProp | treatSubj specifies the precise number of subjects to allocate to each treatment (the sum of the elements of this vector must equal subjects above). treatProp is a vector of proportions specifying how subjects are to be allocated to each treatment (the vector must have the same length as the number of treatments, i.e. the length of treatDoses above, and must sum to 1). Specify only one of these arguments. treatProp defines the probabilities of allocating each treatment and does not guarantee that the exact proportion will be allocated to a given treatment. treatSubj, on the other hand, allocates exactly the specified number to each treatment. If the sum of the number of subjects in treatSubj does not equal subjects (above) then the sum of treatSubj is used in place of subjects. |
treatType, treatSeq, treatPeriod | If treatType is “crossover” then treatSeq should contain the treatment sequences for subjects to be allocated to. Each subject is then randomly allocated to one of the treatment sequences unless treatSubj is specified as above. treatPeriod defines the timing of observations / response values. If any times are less than zero then it is assumed that DOSE=0 for these measurements (i.e. we assume a placebo run-in). For times greater than or equal to zero, DOSE is as specified in treatDoses. |
genParNames, genParMean, genParVCov | genParNames defines the names to be used for the data generation model parameters for calculations and in the output dataset. genParMean and genParVCov define the mean value for these parameters and the variance-covariance matrix defining how these parameters will vary across trial replicates. By default we assume that genParVCov = 0 (i.e. parameters have fixed values across trial replicates). See the Simulation Overview page for more information. |
genParBtwNames, genParBtwMean, genParBtwVCov, genParBtwCrit, genParErrStruc | These parameters define how between-subject variability is to be included for the parameters used in respEqn. Variables defined in genParNames which also appear in genParBtwNames will have values generated from a (multivariate) Normal distribution with mean genParBtwMean and variance-covariance matrix genParBtwVCov. By default we assume that genParBtwMean = 0 for all parameters, i.e. the parameters used in respEqn will have means specified by genParMean (with between-replicate variability specified by genParVCov) and will vary between subjects with covariance genParBtwVCov. This process mirrors the usual hierarchical model construct with fixed and random effects. genParBtwCrit applies ranges to the values generated (similar to conCovCrit). If genParErrStruc is specified as “additive” or “proportional” then the subject-specific variation is added to the fixed effect values in an appropriate way. “additive” simply adds the values, while “proportional” adds the subject-specific variation to the logged fixed effect value and then exponentiates. If genParErrStruc is “none” then the two values are returned separately to the generated dataset for the user to combine and use in an appropriate way. |
respDist, respVCov, respInvLink, respErrStruc, respCrit, respDigits | These parameters define the distributional properties for the generated response variable. respEqn gives the linear predictor for response, defining how treatments, doses, covariates, time etc. relate to the mean response for an individual. This linear predictor can then be used within a Normal distribution to define continuous response variables or, with the appropriate link function (specified in respInvLink), can be used with binomial or Poisson distributions to create binary or count data. If we are creating continuous response outcomes then we can specify the residual (or within-subject) variability (respVCov), how this variability is added to the values from respEqn (respErrStruc) and whether the generated residual values need to be constrained within certain ranges (respCrit). Finally, we can specify the number of significant digits for the generated response (respDigits). MSToolkit version 2.0.0 uses only one value for residual error, although future versions will extend this to allow multiple residual error parameters to be created. |
interimSubj | interimSubj defines how subjects will be assigned to interim analysis data subsets. This should be a vector of cumulative proportions, e.g. c(0.3,0.6) or c(0.25,0.5,0.75). MSToolkit will partition the dataset and allocate subjects randomly to one of the interim analysis subsets. |
mcarProp, mcarRule, dropFun | These parameters define how missing data are to be generated and the rules for dropping subjects. dropFun can be any valid R function and so can use dataset covariates, parameters and responses as drivers for the dropout function. |
conCovNames, conCovMean, conCovVCov, conCovCrit, conCovMaxDraws | These parameters define how continuous covariates are to be generated across subjects within replicates. Values are drawn from (multivariate) Normal distributions. conCovCrit specifies ranges or criteria for each covariate value. If the number of draws from the distribution exceeds conCovMaxDraws before an acceptable value is found then a warning is given. |
disCovNames, disCovVals, disCovProb, disCovProbArray | These parameters define how discrete covariates are to be generated across subjects within replicates. Values of the discrete covariates are specified in disCovVals, and these values are then generated in the proportions given by disCovProb, or by disCovProbArray if the user wishes to specify associations between discrete covariate values. |
extCovNames, extCovFile, extCovSubset, extCovRefCol, extCovSameRow, extCovDataId | Covariate values can be sampled from an external file (e.g. an existing database in an ASCII file). These parameters define which variables to sample from the external file, the name of that file and whether to subset the data in that file before sampling. Users can choose to bring into the generated dataset a reference variable identifying which rows of the external datafile have been sampled (in order to check data values). It is also possible to specify whether to sample covariate values independently (default) or whether to sample covariate values from within the same row of the external file, thus preserving correlations between covariates without making normality assumptions. If a value is given for extCovDataId then this is used to identify covariate values from each unique ID within the external datafile. |
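
Three of the mechanisms described in the table above can be illustrated in base R: probabilistic versus exact treatment allocation (treatProp versus treatSubj), the “additive” and “proportional” ways of combining fixed and subject-specific effects (genParErrStruc), and assignment to interim subsets from cumulative proportions (interimSubj). This is a sketch under stated assumptions rather than MSToolkit's own code: all variable names and numeric values are made up for the example, and assigning interims through independent uniform draws is only one possible reading of the random allocation described above.

```r
# Base-R illustrations of three mechanisms (not MSToolkit's implementation).
set.seed(42)
subjects   <- 12
treatments <- 1:3

# treatProp-style allocation: each subject is drawn independently,
# so the realised counts per treatment can differ from the proportions
treatProp <- c(0.5, 0.25, 0.25)
trtByProp <- sample(treatments, size = subjects, replace = TRUE,
                    prob = treatProp)

# treatSubj-style allocation: exact counts per treatment, in random order
treatSubj <- c(6, 3, 3)
trtExact  <- sample(rep(treatments, times = treatSubj))

# Combining a fixed effect with subject-specific variation
fixedEffect <- 10                                   # assumed value
subjEffect  <- rnorm(subjects, mean = 0, sd = 0.2)  # assumed variability
additivePar     <- fixedEffect + subjEffect            # "additive"
proportionalPar <- exp(log(fixedEffect) + subjEffect)  # "proportional"

# interimSubj-style cumulative proportions: two cut points give three subsets
# (assumption: each subject is assigned via an independent uniform draw)
interimSubj <- c(0.3, 0.6)
interim     <- findInterval(runif(subjects), interimSubj) + 1

# Exact allocation reproduces treatSubj; probabilistic allocation may not
exactCounts <- as.vector(table(factor(trtExact, levels = treatments)))
propCounts  <- as.vector(table(factor(trtByProp, levels = treatments)))
```

Comparing `exactCounts` with `propCounts` shows why only one of treatSubj and treatProp should be supplied: the first fixes the group sizes, while the second fixes only the allocation probabilities. The proportional combination keeps the parameter strictly positive, which the additive form does not guarantee.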