How it works

The generateData function calls the low-level data generation components to create sets of simulated data. The following components are called, each creating one aspect of the simulated trial data:

  • createTreatments(…): Creates a dataset of all possible treatment regimes to be allocated to subjects
  • allocateTreatments(…): Allocates treatments to subjects in the simulated study
  • createCovariates(…): Creates a set of fixed covariates for a simulated population
  • createParameters(…): Creates simulated fixed and between subject parameters for subjects in each replicate
  • createResponse(…): Creates a simulated response variable based on available derived data
  • createMCAR(…): Adds a simulated “missing” flag to the data, with values missing completely at random
  • createDropout(…): Adds a simulated “missing” flag to the data based on a dropout function
  • createInterims(…): Assigns subjects in the study to interim analyses
  • createDirectories(…): Creates the ReplicateData directory under the current working directory
  • writeData(…): Writes out the simulation replicate data in CSV format

The generateData function iteratively builds and combines the data components for each replicate, and stores the data in the “ReplicateData” subdirectory of the working directory. This data can then be analyzed using a call to the analyzeData function.

Arguments

The generateData function takes a number of arguments, which are passed down to the various lower-level functions.

Required Arguments

Argument Name Description
replicateN Specifies how many replicates / simulated trials to generate
subjects TOTAL number of subjects for the whole design. The default behaviour is to allocate subjects to each treatment with equal probability, which may not guarantee equal allocation. See treatSubj below for further details of treatment allocation methods.
treatDoses Specifies the doses to be used in simulations. MSToolkit was designed to evaluate the operating characteristics of clinical trials, but its functionality can be extended to simulate non-clinical trials by thinking of “doses” as other factors which vary between individuals within a simulation replicate.
In the generated dataset, TRT defines the treatment arm to which a subject is allocated, and for parallel group designs this has a corresponding, unique value of DOSE. In crossover trials, however, TRT is the treatment sequence to which a subject is allocated, and each value must have a corresponding treatment sequence.
respEqn Specifies the linear predictor for generating outcome values. This should be a valid R expression or function. The expression can be written directly in the generateData call, or an R function defined outside of generateData can be supplied. This function must return a vector with length equal to the number of rows in the generated data: one value per subject, or one value per observation (TIME) within each subject.
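As a rough illustration of the respEqn requirement described above, the following base R sketch writes an Emax-style linear predictor as a function and checks that it returns one value per row. The data frame simData, its columns, and the emaxPredictor helper are hypothetical stand-ins, and the calling convention (a single data frame argument) is an assumption rather than documented MSToolkit behaviour.

```r
# Hypothetical stand-in for the data generated for one replicate:
# one row per subject, with DOSE and the model parameters as columns.
simData <- data.frame(
  SUBJ = 1:6,
  DOSE = c(0, 0, 15, 15, 30, 30),
  E0   = 2,
  ED50 = 50,
  EMAX = 10
)

# Assumed calling convention: the function receives the data and returns
# a numeric vector with one element per row.
emaxPredictor <- function(data) {
  with(data, E0 + (DOSE * EMAX) / (DOSE + ED50))
}

pred <- emaxPredictor(simData)

# The key requirement from the text: one value per row of the data.
stopifnot(length(pred) == nrow(simData))
```

The same predictor could equally be supplied as an expression written in terms of the column names; the function form is mainly useful when the calculation is too long to write inline.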

Optional Arguments

Argument Name Description
treatSubj
treatProp
treatSubj specifies the exact number of subjects to allocate to each treatment; the elements of this vector should sum to subjects above. treatProp is a vector of allocation proportions, one per treatment (so the same length as treatDoses above), which must sum to 1. Specify only one of these arguments.
treatProp defines the probability of allocation to each treatment, so it does not guarantee that exactly the specified proportion of subjects is allocated to a given treatment. treatSubj, on the other hand, allocates exactly the specified number to each treatment. If the sum of treatSubj does not equal subjects (above), then the sum of treatSubj is used in place of subjects.
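The difference between the two allocation methods can be sketched in base R as follows. This is only an illustration of the idea, not MSToolkit's internal code; the variable names trtByProp and trtBySubj are hypothetical.

```r
set.seed(1)                      # for a reproducible illustration only
subjects   <- 100
treatDoses <- c(0, 15, 30)

# treatProp-style allocation: each subject is assigned independently with
# the given probabilities, so the arm sizes vary from run to run.
treatProp <- c(0.2, 0.3, 0.5)
trtByProp <- sample(seq_along(treatDoses), size = subjects,
                    replace = TRUE, prob = treatProp)
table(trtByProp)

# treatSubj-style allocation: fixed arm sizes, assigned in random order.
treatSubj <- c(20, 30, 50)
trtBySubj <- sample(rep(seq_along(treatDoses), times = treatSubj))
table(trtBySubj)
```

Under the first approach the arm sizes only approximate 20, 30 and 50 in any one replicate, whereas under the second they are always exactly 20, 30 and 50.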
treatType
treatSeq
treatPeriod
If treatType is “crossover” then treatSeq should contain the treatment sequences to which subjects can be allocated. Each subject is then randomly allocated to one of the treatment sequences unless treatSubj is specified as above.
treatPeriod defines the timing of observations / response values. If any times are less than zero then it is assumed that DOSE=0 for these measurements (i.e. we assume a placebo run-in). For times greater than or equal to zero DOSE is as specified in treatDoses.
genParNames
genParMean
genParVCov
genParNames defines the names to be used for the data generation model parameters for calculations and in the output dataset.
genParMean and genParVCov define the mean value for these parameters and the variance-covariance matrix defining how these parameters will vary across trial replicates. By default we assume that genParVCov = 0 (i.e. parameters have fixed values across trial replicates). See the Simulation Overview page for more information.
genParBtwNames
genParBtwMean
genParBtwVCov
genParBtwCrit
genParErrStruc
These parameters define how between subject variability is to be included for the parameters used in respEqn.
Variables defined in genParNames which also appear in genParBtwNames will have values generated from a (multivariate) Normal distribution with mean genParBtwMean and variance-covariance matrix genParBtwVCov. By default we assume that genParBtwMean = 0 for all parameters i.e. the parameters used in respEqn will have means specified by genParMean (with between replicate variability specified by genParVCov) and will vary between subjects with covariance genParBtwVCov. This process mirrors the usual hierarchical model construct with fixed and random effects.
genParBtwCrit applies ranges to the values generated (similar to conCovCrit below).
If genParErrStruc is specified as “additive” or “proportional” then the subject specific variation is added to the fixed effect values in an appropriate way. “additive” simply adds the values, while “proportional” adds the subject specific variation to the logged fixed effect value and then exponentiates. If genParErrStruc is “none” then the two values are returned separately to the generated dataset for the user to combine and use in an appropriate way.
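The “additive” and “proportional” combinations described above can be written out in a few lines of base R. This is a minimal sketch of the arithmetic only; the values and the names fixed, btwSd and eta are hypothetical, and a single parameter is used for simplicity.

```r
set.seed(2)
nSubj <- 5
fixed <- 50                      # fixed effect value for one replicate
btwSd <- 0.3                     # hypothetical between subject standard deviation
eta   <- rnorm(nSubj, mean = 0, sd = btwSd)   # subject specific deviations

# "additive": subject value = fixed effect + subject specific deviation
additive <- fixed + eta

# "proportional": add the deviation to the logged fixed effect,
# then exponentiate
proportional <- exp(log(fixed) + eta)
```

The proportional form is equivalent to fixed * exp(eta), so the subject level values stay positive whenever the fixed effect is positive.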
respDist
respVCov
respInvLink
respErrStruc
respCrit
respDigits
These parameters define the distributional properties for the generated response variable.
respEqn gives the linear predictor for the response, defining how treatments, doses, covariates, time etc. relate to the mean response for an individual. This linear predictor can then be used within a Normal distribution to define continuous response variables or, with the appropriate link function (specified in respInvLink), can be used with binomial or Poisson distributions to create binary or count data.
If we are creating continuous response outcomes then we can specify the residual (or within subject) variability, how this variability is added to the values from respEqn through respErrStruc and whether the generated residual values need to be constrained within certain ranges (given by respCrit).
Finally, we can specify the number of significant digits for the generated response. MSToolkit version 2.0.0 uses only a single value for residual error, although future versions will extend this to allow multiple residual error parameters to be created.
interimSubj interimSubj defines how subjects will be assigned to interim analysis data subsets. This should be a vector of cumulative proportions e.g. c(0.3,0.6) or c(0.25,0.5,0.75). MSToolkit will partition the dataset and allocate subjects randomly to one of the interim analysis subsets.
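One way to picture the partitioning by cumulative proportions is the base R sketch below. It is an assumption about the general idea, not MSToolkit's implementation; in particular, labelling the subjects beyond the last cut-off as subset 3 is a choice made here for illustration.

```r
set.seed(3)
nSubj       <- 200
interimSubj <- c(0.3, 0.6)       # cumulative proportions, as in the text

# Give every subject a random position 1..nSubj, then cut those positions
# at the cumulative proportions converted to subject counts.
position <- sample(nSubj)
cutoffs  <- round(interimSubj * nSubj)        # 60 and 120 subjects

# Positions 1-60 -> subset 1, 61-120 -> subset 2, the rest -> subset 3
# (the label for the remainder is an assumption of this sketch).
interim <- findInterval(position, cutoffs, left.open = TRUE) + 1
table(interim)
```

With 200 subjects this places 60 subjects in the first subset, 60 in the second, and leaves the remaining 80 for the final analysis only.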
mcarProp
mcarRule
dropFun
These parameters define how missing data is to be generated and rules for dropping subjects. dropFun can be any valid R function and so can use dataset covariates, parameters and responses as drivers for the dropout function.
conCovNames
conCovMean
conCovVCov
conCovCrit
conCovMaxDraws
These parameters define how continuous covariates are to be generated across subjects within replicates. Values are drawn from (multivariate) Normal distributions. conCovCrit specifies ranges or criteria for each covariate value. If the number of draws from the distribution exceeds conCovMaxDraws before an acceptable value is found then a warning is given.
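The interaction between a range criterion and a maximum number of draws can be sketched in base R as below. This is a simplified, univariate illustration under stated assumptions: drawWithinRange is a hypothetical helper, the criterion is a simple lower and upper bound, and the way draws are counted here may differ from how MSToolkit counts them.

```r
set.seed(4)

# Draw n values from Normal(mean, sd), redrawing any value outside
# [lower, upper] until all are acceptable or maxDraws rounds are used up.
drawWithinRange <- function(n, mean, sd, lower, upper, maxDraws = 100) {
  values <- rnorm(n, mean, sd)
  draws  <- 1
  bad    <- values < lower | values > upper
  while (any(bad) && draws < maxDraws) {
    values[bad] <- rnorm(sum(bad), mean, sd)
    bad   <- values < lower | values > upper
    draws <- draws + 1
  }
  if (any(bad)) warning("maxDraws reached before all values met the criteria")
  values
}

# Ages with mean 55 and standard deviation 10, restricted to 18-65.
age <- drawWithinRange(50, mean = 55, sd = 10, lower = 18, upper = 65)
range(age)
```

Note that restricting the range in this way truncates the distribution, so the mean and variance of the accepted values will differ from the values requested.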
disCovNames
disCovVals
disCovProb
disCovProbArray
These parameters define how discrete covariates are to be generated across subjects within replicates. The possible values of each discrete covariate are specified in disCovVals, and these values are then generated in the proportions given by disCovProb, or by disCovProbArray if users wish to specify associations between discrete covariate values.
extCovNames
extCovFile
extCovSubset
extCovRefCol
extCovSameRow
extCovDataId
Covariate values can be sampled from an external file (e.g. an existing database in an ASCII file). These parameters define which variables to sample from the external file, the name of that file and whether to subset the data in that file before sampling. Users can choose to bring into the generated dataset a reference variable identifying which rows of the external datafile have been sampled (in order to check data values). It is also possible to specify whether to sample covariate values independently (default) or whether to sample covariate values from within the same row of the external file, thus preserving correlations between covariates without making normality assumptions. If a value is given for extCovDataId then this is used to identify covariate values from each unique ID within the external datafile.
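The difference between independent and same-row sampling from an external dataset can be illustrated with the base R sketch below. The data frame extData and its columns are hypothetical stand-ins for the contents of an external file, and the code shows the general idea only, not MSToolkit's implementation.

```r
set.seed(5)

# Hypothetical external covariate data; in practice this would be read
# from a file, e.g. with read.csv().
extData <- data.frame(
  WT  = c(60, 72, 85, 90, 101),
  AGE = c(25, 34, 47, 58, 66)
)
nSubj <- 8

# Independent sampling: each covariate column is resampled separately,
# which breaks any WT/AGE relationship present in the file.
independent <- data.frame(
  WT  = sample(extData$WT,  nSubj, replace = TRUE),
  AGE = sample(extData$AGE, nSubj, replace = TRUE)
)

# Same-row sampling: sample row indices once and take all covariates
# from those rows, so observed combinations are preserved. The ROW
# column plays the role of a reference variable for checking values.
rows    <- sample(nrow(extData), nSubj, replace = TRUE)
sameRow <- data.frame(ROW = rows, extData[rows, ], row.names = NULL)
```

In sameRow every WT and AGE pair exists in the external data, whereas independent can contain combinations, such as the lowest weight with the highest age, that were never observed together.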