Commit b31941a

Updating simstudy.Rmd
1 parent efc2612 commit b31941a

1 file changed

+17
-17
lines changed

vignettes/simstudy.Rmd

Lines changed: 17 additions & 17 deletions
@@ -44,11 +44,11 @@ The `simstudy` package is a collection of functions that allows users to generat
4444

4545
Simulation using `simstudy` has two fundamental steps. The user (1) **defines** the data elements of a data set and (2) **generates** the data based on these definitions. Additional functionality exists to simulate observed or randomized **treatment assignment/exposures**, to create **longitudinal/panel** data, to create **multi-level/hierarchical** data, to create datasets with **correlated variables** based on a specified covariance structure, to **merge** datasets, to create data sets with **missing** data, and to create non-linear relationships with underlying **spline** curves.
4646

47-
The overarching philosophy of `simstudy` is to create data generating processes that mimic the typical models used to fit those types of data. So, the parameterization of some of the data generating processes may not follow the standard parameterizations for the specific distributions. For example, in `simstudy` *gamma*-distributed data are generated based on the specification of a mean $\mu$ (or $log(\mu)$) and a dispersion $d$, rather than shape $\alpha$ and rate $\beta$ parameters that more typically characterize the *gamma* distribution. When we estimate the parameters, we are modeling $\mu$ (or some function of $(\mu)$), so we should explicitly recover the `simstudy` parameters used to generate the model - illuminating the relationship between the underlying data generating processes and the models.
47+
The overarching philosophy of `simstudy` is to create data generating processes that mimic the typical models used to fit those types of data. So, the parameterization of some of the data generating processes may not follow the standard parameterizations for the specific distributions. For example, in `simstudy` *gamma*-distributed data are generated based on the specification of a mean $\mu$ (or $\log(\mu)$) and a dispersion $d$, rather than shape $\alpha$ and rate $\beta$ parameters that more typically characterize the *gamma* distribution. When we estimate the parameters, we are modeling $\mu$ (or some function of $\mu$), so we should explicitly recover the `simstudy` parameters used to generate the data, which illuminates the relationship between the underlying data generating processes and the models.
4848

4949
## Overview
5050

51-
This introduction provides a brief overview to the basics of defining and generating data, including treatment or exposure variables. Subsequent sections in this vignette provide more details on these processes. For information on more elaborate data generating mechanisms, please refer to other package vignettes that provide more detailed descriptions.
51+
This introduction provides a brief overview of the basics of defining and generating data, including treatment or exposure variables. Subsequent sections in this vignette provide more details on these processes. For information on more elaborate data generating mechanisms, please refer to other vignettes in this package that provide more detailed descriptions.
5252

5353
### Defining the Data
5454

@@ -74,7 +74,7 @@ def <- defData(def,varname="visits", dist="poisson",
7474
formula="1.5 - 0.2 * age + 0.5 * female", link="log")
7575
```
7676

77-
The data definition table includes a row for each variable that is to be generated, and has the following fields: **varname**, **formula**, **variance**, **dist**, and **link**. **varname** provides the name of the variable to be generated. **formula** is either a value or string representing any valid R formula (which can include function calls) that in most cases defines the mean of the distribution. **variance** is a value or string that specifies either the variance or other parameter that characterizes the distribution; the default is 0. **dist** is defines the distribution of the variable to be generated; the default is *normal*. The **link** is a function that defines the relationship of the formula with the mean value, and can either *identity*, *log*, or *logit*, depending on the distribution; the default is *identity*.
77+
The data definition table includes a row for each variable that is to be generated, and has the following fields: `varname`, `formula`, `variance`, `dist`, and `link`. `varname` provides the name of the variable to be generated. `formula` is either a value or string representing any valid R formula (which can include function calls) that in most cases defines the mean of the distribution. `variance` is a value or string that specifies either the variance or other parameter that characterizes the distribution; the default is 0. `dist` defines the distribution of the variable to be generated; the default is *normal*. The `link` is a function that defines the relationship of the formula with the mean value, and can be either *identity*, *log*, or *logit*, depending on the distribution; the default is *identity*.
7878

7979
If using `defData` to create the definition table, the first call to `defData` without specifying a definition name (in this example the definition name is *def*) creates a **new** data.table with a single row. An additional row is added to the table `def` each time the function `defData` is called. Each of these calls is the definition of a new field in the data set that will be generated.
8080

@@ -204,59 +204,59 @@ knitr::kable(d, align = "lllllccc")
204204

205205
#### beta
206206

207-
A *beta* distribution is a continuous data distribution that takes on values between $0$ and $1$. The *formula* specifies the mean $\mu$ (with the 'identity' link) or the log-odds of the mean (with the 'logit' link). The scalar value in the 'variance' represents the dispersion value $d$. The variance $\sigma^2$ for a beta distributed variable will be $\mu (1- \mu)/(1 + d)$. Typically, the beta distribution is specified using two shape parameters $\alpha$ and $\beta$, where $\mu = \alpha/(\alpha + \beta)$ and $\sigma^2 = \alpha\beta/[(\alpha + \beta)^2 (\alpha + \beta + 1)]$.
207+
A *beta* distribution is a continuous data distribution that takes on values between $0$ and $1$. The `formula` specifies the mean $\mu$ (with the 'identity' link) or the log-odds of the mean (with the 'logit' link). The scalar value in the 'variance' represents the dispersion value $d$. The variance $\sigma^2$ for a beta distributed variable will be $\mu (1- \mu)/(1 + d)$. Typically, the beta distribution is specified using two shape parameters $\alpha$ and $\beta$, where $\mu = \alpha/(\alpha + \beta)$ and $\sigma^2 = \alpha\beta/[(\alpha + \beta)^2 (\alpha + \beta + 1)]$.
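
The relationship between the mean/dispersion form and the usual shape parameters can be checked numerically. The sketch below uses only base R and assumes the mapping $\alpha = \mu d$, $\beta = (1 - \mu) d$, which is one parameterization consistent with the mean and variance stated above; whether `simstudy` uses exactly this mapping internally is an assumption, and the names `mu`, `d`, `alpha`, and `beta` are illustrative.

```r
# Assumed mapping (not taken from the simstudy source):
#   alpha = mu * d, beta = (1 - mu) * d
# which gives mean = mu and variance = mu * (1 - mu) / (1 + d).
set.seed(11)
mu <- 0.3
d <- 9
alpha <- mu * d
beta <- (1 - mu) * d

x <- rbeta(1e5, shape1 = alpha, shape2 = beta)

var_theory <- mu * (1 - mu) / (1 + d)
c(mean = mean(x), var = var(x), var_theory = var_theory)
```

The sample mean and variance should land close to `mu` and `var_theory`.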
208208

209209
#### binary
210210

211-
A *binary* distribution is a discrete data distribution that takes values $0$ or $1$. (It is more conventionally called a *Bernoulli* distribution, or is a *binomial* distribution with a single trial $n=1$.) The *formula* represents the probability (with the 'identity' link) or the log odds (with the 'logit' link) that the variable takes the value of 1. The mean of this distribution is $p$, and variance $\sigma^2$ is $p(1-p)$.
211+
A *binary* distribution is a discrete data distribution that takes values $0$ or $1$. (It is more conventionally called a *Bernoulli* distribution, or is a *binomial* distribution with a single trial $n=1$.) The `formula` represents the probability (with the 'identity' link) or the log odds (with the 'logit' link) that the variable takes the value of 1. The mean of this distribution is $p$, and variance $\sigma^2$ is $p(1-p)$.
212212

213213
#### binomial
214214

215215
A *binomial* distribution is a discrete data distribution that represents the count of the number of successes given a number of trials. The formula specifies the probability of success $p$, and the variance field is used to specify the number of trials $n$. Given a value of $p$, the mean $\mu$ of this distribution is $np$, and the variance $\sigma^2$ is $np(1-p)$.
216216

217217
#### categorical
218218

219-
A *categorical* distribution is a discrete data distribution taking on values from $1$ to $K$, with each value representing a specific category, and there are $K$ categories. The categories may or may not be ordered. For a categorical variable with $k$ categories, the *formula* is a string of probabilities that sum to 1, each separated by a semi-colon: $(p_1 ; p_2 ; ... ; p_k)$. $p_1$ is the probability of the random variable falling in category $1$, $p_2$ is the probability of category $2$, etc. The probabilities can be specified as functions of other variables previously defined. The *variance* and *link* fields do not apply to the *categorical* distribution.
219+
A *categorical* distribution is a discrete data distribution taking on values from $1$ to $K$, with each value representing one of $K$ categories. The categories may or may not be ordered. For a categorical variable with $K$ categories, the `formula` is a string of probabilities that sum to 1, each separated by a semi-colon: $(p_1 ; p_2 ; \dots ; p_K)$. $p_1$ is the probability of the random variable falling in category $1$, $p_2$ is the probability of category $2$, etc. The probabilities can be specified as functions of other variables previously defined. The `link` options are *identity* or *logit*. The `variance` field does not apply to the *categorical* distribution.
220220

221221
#### exponential
222222

223-
An *exponential* distribution is a continuous data distribution that takes on non-negative values. The *formula* represents the mean $\theta$ (with the 'identity' link) or log of the mean (with the 'log' link). The *variance* argument does not apply to the *exponential* distribution. The variance $\sigma^2$ is $\theta^2$.
223+
An *exponential* distribution is a continuous data distribution that takes on non-negative values. The `formula` represents the mean $\theta$ (with the 'identity' link) or log of the mean (with the 'log' link). The `variance` argument does not apply to the *exponential* distribution. The variance $\sigma^2$ is $\theta^2$.
224224

225225
#### gamma
226226

227-
A *gamma* distribution is a continuous data distribution that takes on non-negative values. The *formula* specifies the mean $\mu$ (with the 'identity' link) or the log of the mean (with the 'log' link). The *variance* field represents a dispersion value $d$. The variance $\sigma^2$ is is $d \mu^2$.
227+
A *gamma* distribution is a continuous data distribution that takes on non-negative values. The `formula` specifies the mean $\mu$ (with the 'identity' link) or the log of the mean (with the 'log' link). The `variance` field represents a dispersion value $d$. The variance $\sigma^2$ is $d \mu^2$.
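
To see how a mean and dispersion relate to the more familiar shape and rate, the following base R sketch assumes the mapping $\alpha = 1/d$ and $\beta = 1/(d\mu)$, which yields mean $\alpha/\beta = \mu$ and variance $\alpha/\beta^2 = d\mu^2$. The helper `gamma_shape_rate` is hypothetical and is not part of `simstudy`.

```r
# Hypothetical helper: convert (mean, dispersion) to (shape, rate).
# Assumed mapping: shape = 1 / d, rate = 1 / (d * mu).
gamma_shape_rate <- function(mu, d) {
  list(shape = 1 / d, rate = 1 / (d * mu))
}

set.seed(123)
mu <- 4
d <- 0.5
p <- gamma_shape_rate(mu, d)

x <- rgamma(1e5, shape = p$shape, rate = p$rate)

# Sample moments next to the values implied by the text above
c(mean = mean(x), var = var(x), var_theory = d * mu^2)
```

Fitting a gamma GLM with a log link to data generated this way would then be expected to recover $\log(\mu)$ as the intercept, which is the point of the mean-based parameterization.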
228228

229229
#### mixture
230230

231-
The *mixture* distribution is a mixture of other predefined variables, which can be defined based on any of the other available distributions. The formula is a string structured with a sequence of variables $x_i$ and probabilities $p_i$: $x_1 | p_1 + … + x_n | p_n$. All of the $x_i$'s are required to have been previously defined, and the probabilities must sum to $1$ (i.e. $\sum_1^n p_i = 1$). The result of generating from a mixture is the value $x_i$ with probability $p_i$. The *variance* and *link* fields do not apply to the *mixture* distribution.
231+
The *mixture* distribution is a mixture of other predefined variables, which can be defined based on any of the other available distributions. The formula is a string structured with a sequence of variables $x_i$ and probabilities $p_i$: $x_1 | p_1 + \dots + x_n | p_n$. All of the $x_i$'s are required to have been previously defined, and the probabilities must sum to $1$ (i.e. $\sum_{i=1}^n p_i = 1$). The result of generating from a mixture is the value $x_i$ with probability $p_i$. The `variance` and `link` fields do not apply to the *mixture* distribution.
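
The selection rule described above (take the value $x_i$ with probability $p_i$) can be illustrated with a two-component case in base R. This is only a sketch of the idea, not the `simstudy` implementation; the component variables `x1` and `x2` and the probabilities in `p` are made up for the example.

```r
set.seed(1)
n <- 1e5

# Two previously "defined" variables (illustrative choices)
x1 <- rnorm(n, mean = 0, sd = 1)
x2 <- rnorm(n, mean = 10, sd = 1)

# Mixing probabilities; they must sum to 1
p <- c(0.3, 0.7)

# For each observation, pick component 1 or 2 with probabilities p
pick <- sample(1:2, size = n, replace = TRUE, prob = p)
mix <- ifelse(pick == 1, x1, x2)

# The mixture mean is the probability-weighted average of component means
c(mean = mean(mix), mean_theory = sum(p * c(0, 10)))
```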
232232

233233
#### negBinomial
234234

235-
A *negative binomial* distribution is a discrete data distribution that represents the number of successes that occur in a sequence of *Bernoulli* trials before a specified number of failures occurs. It is often used to model count data more generally when a *Poisson* distribution is not considered appropriate; the variance of the negative binomial distribution is larger than the *Poisson* distribution. The *formula* specifies the mean $\mu$ or the log of the mean. The variance field represents a dispersion value $d$. The variance $\sigma^2$ will be $\mu + d\mu^2$.
235+
A *negative binomial* distribution is a discrete data distribution that represents the number of successes that occur in a sequence of *Bernoulli* trials before a specified number of failures occurs. It is often used to model count data more generally when a *Poisson* distribution is not considered appropriate; the variance of the negative binomial distribution is larger than that of the *Poisson* distribution. The `formula` specifies the mean $\mu$ or the log of the mean. The variance field represents a dispersion value $d$. The variance $\sigma^2$ will be $\mu + d\mu^2$.
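
The variance formula $\mu + d\mu^2$ matches the mean-based form of base R's `rnbinom`, whose variance is `mu + mu^2 / size`, if one takes `size = 1 / d`. The sketch below checks this numerically; treating `size = 1 / d` as the link to the `simstudy` dispersion is an assumption made for illustration.

```r
set.seed(7)
mu <- 5
d <- 0.4

# Assumed correspondence: size = 1 / d, so var = mu + mu^2 / size = mu + d * mu^2
x <- rnbinom(1e5, size = 1 / d, mu = mu)

var_theory <- mu + d * mu^2
c(mean = mean(x), var = var(x), var_theory = var_theory)
```

Setting `d` close to 0 makes the variance approach `mu`, the Poisson case.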
236236

237237
#### nonrandom
238238

239-
Deterministic data can be "generated" using the *nonrandom* distribution. The *formula* explicitly represents the value of the variable to be generated, without any uncertainty. The *variance* and *link* fields do not apply to *nonrandom* data generation.
239+
Deterministic data can be "generated" using the *nonrandom* distribution. The `formula` explicitly represents the value of the variable to be generated, without any uncertainty. The `variance` and `link` fields do not apply to *nonrandom* data generation.
240240

241241
#### normal
242242

243-
A *normal* or *Gaussian* distribution is a continuous data distribution that takes on values between $-\infty$ and $\infty$. The *formula* represents the mean $\mu$ and the *variance* represents $\sigma^2$. The *link* field is not applied to the *normal* distribution.
243+
A *normal* or *Gaussian* distribution is a continuous data distribution that takes on values between $-\infty$ and $\infty$. The `formula` represents the mean $\mu$ and the `variance` represents $\sigma^2$. The `link` field is not applied to the *normal* distribution.
244244

245245
#### noZeroPoisson
246246

247-
The *noZeroPoisson* distribution is a discrete data distribution that takes on positive integers. This is a truncated *poisson* distribution that excludes $0$. The *formula* specifies the parameter $\lambda$ (link is 'identity') or log(\lambda) (*link* is log). The *variance* field does not apply to this distribution. The mean $\mu$ of this distribution is $\lambda/(1-e^{-\lambda})$ and the variance $\sigma^2$ is $(\lambda + \lambda^2)/(1-e^{-\lambda}) - \lambda^2/(1-e^{-\lambda})^2$. We are not typically interested in modeling data drawn from this distribution (except in the case of a *hurdle model*), but it is useful to generate positive count data where it is not desirable to have any $0$ values.
247+
The *noZeroPoisson* distribution is a discrete data distribution that takes on positive integers. This is a truncated *poisson* distribution that excludes $0$. The `formula` specifies the parameter $\lambda$ (link is 'identity') or $\log(\lambda)$ (`link` is log). The `variance` field does not apply to this distribution. The mean $\mu$ of this distribution is $\lambda/(1-e^{-\lambda})$ and the variance $\sigma^2$ is $(\lambda + \lambda^2)/(1-e^{-\lambda}) - \lambda^2/(1-e^{-\lambda})^2$. We are not typically interested in modeling data drawn from this distribution (except in the case of a *hurdle model*), but it is useful to generate positive count data where it is not desirable to have any $0$ values.
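
The mean and variance formulas above can be checked by simulation. The base R sketch below produces zero-truncated Poisson draws by discarding zeros from ordinary Poisson draws; this rejection approach is chosen for simplicity and is not necessarily how `simstudy` generates the data.

```r
set.seed(42)
lambda <- 2

# Draw from a Poisson and keep only the positive values
draws <- rpois(2e5, lambda)
x <- draws[draws > 0]

# Theoretical moments of the zero-truncated Poisson, as given in the text
mu_theory <- lambda / (1 - exp(-lambda))
var_theory <- (lambda + lambda^2) / (1 - exp(-lambda)) -
  lambda^2 / (1 - exp(-lambda))^2

c(mean = mean(x), mu_theory = mu_theory,
  var = var(x), var_theory = var_theory)
```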
248248

249249
#### poisson
250250

251-
The *poisson* distribution is a discrete data distribution that takes on non-negative integers. The *formula* specifies the mean $\lambda$ (link is 'identity') or log of the mean (*link* is log). The *variance* field does not apply to this distribution. The variance $\sigma^2$ is $\lambda$ itself.
251+
The *poisson* distribution is a discrete data distribution that takes on non-negative integers. The `formula` specifies the mean $\lambda$ (link is 'identity') or log of the mean (`link` is log). The `variance` field does not apply to this distribution. The variance $\sigma^2$ is $\lambda$ itself.
252252

253253
#### uniform
254254

255-
A $uniform$ distribution is a continuous data distribution that takes on values from $a$ to $b$, where $b$ > $a$, and they both lie anywhere on the real number line. The *formula* is a string with the format "a;b", where *a* and *b* are scalars or functions of previously defined variables. The *variance* and *link* arguments do not apply to the *uniform* distribution.
255+
A *uniform* distribution is a continuous data distribution that takes on values from $a$ to $b$, where $b > a$, and they both lie anywhere on the real number line. The `formula` is a string with the format "a;b", where *a* and *b* are scalars or functions of previously defined variables. The `variance` and `link` arguments do not apply to the *uniform* distribution.
256256

257257
#### uniformInt
258258

259-
A $uniform integer$ distribution is a discrete data distribution that takes on values from $a$ to $b$, where $b$ > $a$, and they both lie anywhere on the integer number line. The *formula* is a string with the format "a;b", where *a* and *b* are scalars or functions of previously defined variables. The *variance* and *link* arguments do not apply to the *uniform integer* distribution.
259+
A *uniform integer* distribution is a discrete data distribution that takes on values from $a$ to $b$, where $b > a$, and they both lie anywhere on the integer number line. The `formula` is a string with the format "a;b", where *a* and *b* are scalars or functions of previously defined variables. The `variance` and `link` arguments do not apply to the *uniform integer* distribution.
260260

261261
## Adding data to an existing data table
262262

@@ -283,7 +283,7 @@ dd
283283

284284
### defCondition and addCondition
285285

286-
In certain situations, it might be useful to define a data distribution conditional on previously generated data in a way that is more complex than might be easily handled by a single formula. `defCondition` creates a special table of definitions and the new variable is added to an existing data set by calling `addCondition`. `defCondition` specifies a condition argument that will be based on a variable that already exists in the data set. The new variable can take on any `simstudy` distribution specified with the appropriate *formula*, *variance*, and *link* arguments.
286+
In certain situations, it might be useful to define a data distribution conditional on previously generated data in a way that is more complex than might be easily handled by a single formula. `defCondition` creates a special table of definitions and the new variable is added to an existing data set by calling `addCondition`. `defCondition` specifies a condition argument that will be based on a variable that already exists in the data set. The new variable can take on any `simstudy` distribution specified with the appropriate `formula`, `variance`, and `link` arguments.
287287

288288
In this example, the slope of a regression line of $y$ on $x$ varies depending on the value of the predictor $x$:
289289
