<!-- /usr/lib/R/site-library/DESeq2/script/simulation.Rmd, from r-bioc-deseq2 1.18.1-1 -->
# Assessment of DESeq2 through simulation

Michael Love, Wolfgang Huber, Simon Anders

<!-- this document built with knit() in R -->
<!-- pandoc -o simulation.pdf simulation.md -->

```{r options, echo=FALSE, results="hide"}
opts_chunk$set(dev="pdf")
```

Document compiled on:

```{r time, echo=FALSE}
Sys.time()
```

This document and associated files provide the code and plots for the
simulation section of the *DESeq2* paper, so that these results can be
easily updated over time. For more details, read the paper at the
following URL:

http://dx.doi.org/10.1101/002832

## Differential expression analysis

We assessed the sensitivity and specificity of various algorithms
using simulation to complement an analysis on real data. The following
Negative Binomial simulation samples (mean, dispersion) pairs from the
joint distribution of estimated means and dispersions from the Pickrell
et al. dataset. The true differences between two groups are drawn from
either *z*, *0* or *-z*, where the *0* component represents 80% of the
genes. The absolute value of the effect size *z* for the 20% of genes
with differential expression is varied, as is the total sample size m
(such that each group has m/2 samples). 10,000 genes were simulated,
and each combination of parameters was repeated 6 times.  The code to
generate these results is in `simulateDE.R` and the code to run each
algorithm is in `runScripts.R`.
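A rough sketch of this simulation design follows; the means and dispersions here are hypothetical stand-ins for the values sampled from the Pickrell fits, and the actual code is in `simulateDE.R`.

```r
# Sketch of the simulation design, not the actual simulateDE.R code:
# mu0 and disp are hypothetical stand-ins for the Pickrell-derived
# (mean, dispersion) pairs.
set.seed(1)
n <- 1000                      # genes (the paper's simulation uses 10,000)
m <- 6                         # total samples, m/2 per group
z <- 1                         # |log2 fold change| for the DE genes
mu0  <- rexp(n, rate=1/100)    # stand-in for sampled baseline means
disp <- 0.2                    # stand-in for matched dispersions
# true effects drawn from {-z, 0, z}, with 0 for 80% of genes
beta <- sample(c(-z, 0, z), n, replace=TRUE, prob=c(0.1, 0.8, 0.1))
condition <- rep(c(0, 1), each=m/2)
mu <- mu0 * 2^outer(beta, condition)   # group 2 means shifted by 2^beta
counts <- matrix(rnbinom(n*m, mu=mu, size=1/disp), nrow=n)
```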

Note: *DSS* denotes running *DSS* and then Benjamini-Hochberg
adjustment on *p*-values. *DSS-FDR* denotes the native FDR estimation
of the *DSS* software. 
*SAMseq* denotes running *SAMseq* with *p*-value
estimation and Benjamini-Hochberg adjustment for FDR.
*SAMseq-FDR* denotes the native FDR estimation and no *p*-value
estimation. *EBSeq* likewise only produces FDR.

```{r loadDE}
load("results_simulateDE.RData")
res$m <- factor(res$m)
levels(res$m) <- paste0("m=",levels(res$m))
res$effSize <- factor(res$effSize)
levels(res$effSize) <- c("fold change 2","fold change 3","fold change 4")
res$algorithm <- factor(res$algorithm)
resClean <- res[!is.na(res$oneminusspecpvals),]
```

```{r simulateDE, fig.width=7, fig.height=5, fig.cap="Sensitivity and specificity on simulated datasets."}
library("ggplot2")
p <- ggplot(resClean, aes(y=sensitivity, x=oneminusspecpvals,
                          color=algorithm, shape=algorithm))
p + geom_point() + theme_bw() + facet_grid(effSize ~ m) +
  scale_shape_manual(values=1:9) +
  xlab("1 - specificity (false positive rate)") + 
  coord_cartesian(xlim=c(-.003,.035)) + 
  geom_vline(xintercept=.01) + 
  scale_x_continuous(breaks=c(0,.02))
```

Use of simulation to assess the sensitivity and specificity of
algorithms across combinations of sample size and effect size. The
sensitivity was calculated as the fraction of genes with adjusted
*p*-value less than 0.1 among the genes with true differences between
group means. The specificity was calculated as the fraction of genes
with *p*-value greater than 0.01 among the genes with no true
differences between group means.  The *p*-value was chosen instead of
the adjusted *p*-value, as this allows for comparison against the
expected fraction of *p*-values less than a critical value given the
uniformity of *p*-values under the null hypothesis. 
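Written out as a small helper, these two definitions amount to the following (illustrative only; `pvalue`, `padj`, and `isDE` are hypothetical vectors holding one simulated dataset's results):

```r
# Illustrative versions of the two quantities plotted above; pvalue, padj,
# and isDE are hypothetical stand-ins for one simulated dataset's results.
evalDE <- function(pvalue, padj, isDE) {
  c(sensitivity  = mean(padj[isDE] < 0.1, na.rm=TRUE),
    oneminusspec = mean(pvalue[!isDE] < 0.01, na.rm=TRUE))
}
```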

```{r simulateDEPrec, fig.width=7, fig.height=5, fig.cap="Sensitivity and precision on simulated datasets."}
library("ggplot2")
p <- ggplot(res, aes(y=sensitivity, x=oneminusprec,
                     color=algorithm, shape=algorithm))
p + geom_point() + theme_bw() + facet_grid(effSize ~ m) +
  scale_shape_manual(values=1:11) +
  xlab("1 - precision (false discovery rate)") + 
  coord_cartesian(xlim=c(-.03, .3)) + 
  geom_vline(xintercept=.1)
``` 

Sensitivity and precision of algorithms across combinations of sample
size and effect size. The sensitivity was calculated as the fraction
of genes with adjusted *p*-value less than 0.1 among the genes with true
differences between group means.  The precision was calculated as the
fraction of genes with true differences between group means among
those with adjusted *p*-value less than 0.1.
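The empirical false discovery rate (one minus precision) at a given threshold can likewise be computed directly (illustrative; `padj` and `isDE` are hypothetical vectors):

```r
# Illustrative: 1 - precision (empirical FDR) among genes called at
# padj < alpha; padj and isDE are hypothetical stand-ins.
oneminusprec <- function(padj, isDE, alpha=0.1) {
  called <- which(padj < alpha)
  if (length(called) == 0) return(NA_real_)
  mean(!isDE[called])
}
```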

```{r simulateDESensStratify, fig.width=7, fig.height=5, fig.cap="Sensitivity dependence on mean count."}
library("reshape")
id.vars <- c("algorithm","effSize","m")
measure.vars <- c("sens0to20","sens20to100","sens100to300","sensmore300")
melted <- melt(res[,c(id.vars,measure.vars)], id.vars=id.vars,
               measure.vars=measure.vars)
names(melted) <- c(id.vars, "aveexp", "sensitivity")
levels(melted$aveexp) <- c("<20","20-100","100-300",">300")
p <- ggplot(melted, aes(y=sensitivity, x=aveexp, group=algorithm,
                        color=algorithm, shape=algorithm))
p + stat_summary(fun.y="mean", geom="line") +
  stat_summary(fun.y="mean", geom="point") +
  theme_bw() + facet_grid(effSize ~ m) +
  scale_shape_manual(values=1:11) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("mean counts")
```

The sensitivity of algorithms across combinations of sample size and
effect size, and further stratified by the mean of counts of the
differentially expressed genes in the simulation data. Points indicate
the average over 6 replicates. Algorithms all show a similar
dependence of sensitivity on the mean of counts. The height of the
sensitivity curve should be compared with the previous plot indicating
the total sensitivity and specificity of each algorithm.

## Performance in the presence of outliers

The following plots examine the effect of outliers on differential
calls by the two Negative-Binomial-based methods *DESeq2* and *edgeR*.
*DESeq2* was run with default settings, after turning off
gene filtering, and after turning off outlier replacement. *edgeR*
was run with default settings, and after using the *robust* option.
The code to generate these results is in `simulateOutliers.R`.

```{r loadOut}
load("results_simulateOutliers.RData")
# with fewer than 7 replicates per group, DESeq2 does not replace outliers
res <- res[!(res$algorithm == "DESeq2-noRepl" & res$m < 14),]
# with 7 or more replicates per group, DESeq2 does not filter outliers
res <- res[!(res$algorithm == "DESeq2-noFilt" & res$m >= 14),]
res$m <- factor(res$m)
levels(res$m) <- paste0("m=",levels(res$m))
res$percentOutlier <- 100 * res$percentOutlier
res$percentOutlier <- factor(res$percentOutlier)
levels(res$percentOutlier) <- paste0(levels(res$percentOutlier),"% outlier")
``` 

Because the sensitivity-specificity curve is evaluated using the
*p*-value, we use the following code to pick out the point on the
sensitivity-specificity curve with the largest *p*-value such that the
nominal adjusted *p*-value is less than 0.1.

```{r}
resSensPadj <- res[res$senspadj < .1,]
resSensPadj <- resSensPadj[nrow(resSensPadj):1,]
resSensPadj <- resSensPadj[!duplicated(with(resSensPadj,
                                            paste(algorithm, m, percentOutlier))),]
summary(resSensPadj$senspadj)
```

```{r simulateOutliersSens, fig.width=8, fig.height=4, fig.cap="Sensitivity and specificity in presence of outliers."}
library("ggplot2")
p <- ggplot(res, aes(x=oneminusspec, y=sensitivity, color=algorithm))
p + scale_x_continuous(breaks=c(0,.1,.2)) + 
  scale_y_continuous(breaks=c(0,.2,.4,.6,.8)) + 
  geom_line() + theme_bw() +
  facet_grid(m ~ percentOutlier) + xlab("1 - specificity") + 
  coord_cartesian(xlim=c(-.03, .25), ylim=c(-.05, .9)) + 
  geom_point(aes(x=oneminusspec, y=sensitivity, shape=algorithm),
            data=res[res$precpadj == .1,])
``` 

Sensitivity-specificity curves for detecting true differences in the
presence of outliers. Negative Binomial counts were simulated for 4000
genes and total sample sizes (m) of 10 and 20, for a two-group
comparison. 80% of the simulated genes had no true differential
expression, while 20% of the genes had true logarithmic (base 2)
fold changes of -1 or 1. The number of genes with simulated outliers
was increased from 0% to 15%. The outliers were constructed for a
gene by multiplying the count of a single sample by 100.  Sensitivity
and specificity were calculated by thresholding on *p*-values. Points
indicate an adjusted *p*-value of 0.1. DESeq2 filters genes with
potential outliers for samples with 3 to 6 replicates, while replacing
outliers for samples with 7 or more replicates, hence the filtering
can be turned off for the top row (m=10) and the replacement can be
turned off for the bottom row (m=20).
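The outlier mechanism described above amounts to the following (a sketch with a hypothetical count matrix; the actual code is in `simulateOutliers.R`):

```r
# Sketch of the outlier construction: for each selected gene, multiply one
# randomly chosen sample's count by 100 (counts is a hypothetical matrix).
set.seed(1)
counts <- matrix(rnbinom(4000 * 10, mu=100, size=5), nrow=4000)
pOut <- 0.1   # fraction of genes given an outlier (varied 0% to 15% above)
outGene   <- sample(nrow(counts), round(pOut * nrow(counts)))
outSample <- sample(ncol(counts), length(outGene), replace=TRUE)
idx <- cbind(outGene, outSample)
counts[idx] <- 100L * counts[idx]
```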

```{r simulateOutliersPrec, fig.width=8, fig.height=4, fig.cap="FDR and target FDR in presence of outliers."}
p <- ggplot(res, aes(x=precpadj, y=oneminusprec, color=algorithm))
p + scale_x_continuous(breaks=c(0,.1,.2)) + 
  scale_y_continuous(breaks=c(0,.1,.2)) + 
  geom_line() + theme_bw() + 
  facet_grid(m ~ percentOutlier) + 
  geom_abline(intercept=0,slope=1) +
  xlab("adjusted p-value") + ylab("1 - precision (FDR)") + 
  coord_cartesian(xlim=c(-.03, .25), ylim=c(-.05, .25)) + 
  geom_point(aes(x=precpadj, y=oneminusprec, shape=algorithm),
             data=res[res$precpadj == .1,])
```

Outlier handling: One minus the precision (false discovery rate)
plotted over various thresholds of adjusted *p*-value. Shown are the
results for the same simulation with outliers described in the
previous figure. Points indicate an adjusted *p*-value of 0.1.

## Accuracy of log fold change estimates

The following simulations used Negative Binomial random variables
with mean and dispersion pairs sampled from the joint
distribution of mean-dispersion estimates from the Pickrell data. In
addition, true differences between two groups were randomly generated,
according to the following models, diagrammed below. The
accuracy of four methods for estimating the log fold change between
groups were compared by the root mean squared error (RMSE) and the 
mean absolute error (MAE). The four methods were chosen for their
particular focus on the log fold change estimate.
The code to generate these results is in `simulateLFCAccuracy.R`.
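For reference, the two error measures are simply:

```r
# The two accuracy measures used in the plots below, comparing estimated
# against true log2 fold changes
rmse <- function(est, true) sqrt(mean((est - true)^2))
mae  <- function(est, true) mean(abs(est - true))
```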

```{r lfcAccuracyHist, fig.width=6, fig.height=6, fig.cap="Examples of simulated log2 fold changes."}
par(mfrow=c(2,2),mar=c(3,3,3,1))
n <- 1000
brks <- seq(from=-4,to=4,length.out=20)
trimit <- function(x) x[x > -4 & x < 4] # for visualization only
myhist <- function(x, ...) hist(x, breaks=brks, border="white",
                                col="blue", xlab="", ylab="", ...)
myhist(trimit(rnorm(n)), main="bell")
myhist(trimit(c(rnorm(n * 8/10), runif(n * 2/10, -4, 4))), main="slab bell")
myhist(c(rep(0, n * 8/10), runif(n * 2/10, -4, 4)), main="slab spike")
myhist(c(rep(0, n * 8/10), sample(c(-2, 2), n * 2/10, TRUE)), main="spike spike")
```

Benchmarking LFC estimation: Models for simulating logarithmic (base
2) fold changes. For the bell model, true logarithmic fold changes
were drawn from a Normal with mean 0 and variance 1. For the slab bell
model, true logarithmic fold changes were drawn for 80% of genes
from a Normal with mean 0 and variance 1 and for 20% of genes from a
Uniform distribution with range from -4 to 4. For the slab spike
model, true logarithmic fold changes were drawn similarly to the slab
bell model except the Normal is replaced with a spike of logarithmic
fold changes at 0. For the spike spike model, true logarithmic fold
changes were drawn according to a spike of logarithmic fold changes at
0 (80%) and a spike randomly sampled from -2 or 2 (20%). These
spikes represent fold changes of 1/4 and 4, respectively.

```{r lfcAccuracy, fig.width=7, fig.height=5, fig.cap="Root mean squared error in estimating log2 fold changes."}
load("results_simulateLFCAccuracy.RData")
library("ggplot2")
library("Hmisc")
p <- ggplot(data=res, aes(x=m, y=RMSE, color=method, shape=method))
p + stat_summary(fun.y=mean, geom="point") + 
  stat_summary(fun.y=mean, geom="line") + 
  stat_summary(fun.data=mean_cl_normal, geom="errorbar") + 
  theme_bw() + xlab("total sample size") + facet_wrap(~ type) + 
  scale_x_continuous(breaks=unique(res$m))
```

Root mean squared error (RMSE) for estimating logarithmic fold changes
under the four models of logarithmic fold changes and varying total
sample size m. Simulated Negative Binomial counts were generated for
two groups and for 1000 genes. Points and error bars are drawn for the
mean and 95% confidence interval over 10 replications.

```{r lfcAccuracyDE, fig.width=6, fig.height=2.5, fig.cap="Root mean squared error for only differentially expressed genes."}
p <- ggplot(data=res[grepl("spike",res$type),],
            aes(x=m, y=DiffRMSE, color=method, shape=method))
p + stat_summary(fun.y=mean, geom="point") + 
  stat_summary(fun.y=mean, geom="line") + 
  stat_summary(fun.data=mean_cl_normal, geom="errorbar") + 
  theme_bw() + xlab("total sample size") +  ylab("RMSE only of DE genes") + 
  facet_wrap(~ type) + 
  scale_x_continuous(breaks=unique(res$m))
``` 

Root mean squared error (RMSE) of logarithmic fold change estimates,
only considering genes with non-zero true logarithmic fold change. For
the same simulation, shown here is the error only for the 20% of
genes with non-zero true logarithmic fold changes (for bell and slab
bell all genes have non-zero logarithmic fold change).

```{r lfcAccuracyMAE, fig.width=7, fig.height=5, fig.cap="Mean absolute error in estimating log2 fold changes."}
p <- ggplot(data=res, aes(x=m, y=MAE, color=method, shape=method))
p + stat_summary(fun.y=mean, geom="point") + 
  stat_summary(fun.y=mean, geom="line") + 
  stat_summary(fun.data=mean_cl_normal, geom="errorbar") + 
  theme_bw() + xlab("total sample size") +  ylab("MAE") + 
  facet_wrap(~ type) + 
  scale_x_continuous(breaks=unique(res$m))
``` 

Mean absolute error (MAE) of logarithmic fold change
estimates. Results for the same simulation, but here using mean
absolute error in place of root mean squared error. Mean absolute
error places less weight on the largest errors.

```{r lfcAccuracyDiffMAE, fig.width=6, fig.height=2.5, fig.cap="Mean absolute error for only differentially expressed genes."}
p <- ggplot(data=res[grepl("spike",res$type),],
            aes(x=m, y=DiffMAE, color=method, shape=method))
p + stat_summary(fun.y=mean, geom="point") + 
  stat_summary(fun.y=mean, geom="line") + 
  stat_summary(fun.data=mean_cl_normal, geom="errorbar") + 
  theme_bw() + xlab("total sample size") +  ylab("MAE only of DE genes") + 
  facet_wrap(~ type) + 
  scale_x_continuous(breaks=unique(res$m))
```

Mean absolute error (MAE) of logarithmic fold change estimates, only
considering those genes with non-zero true logarithmic fold change.


## Transformations and distances for recovery of true clusters

The following simulation evaluated a set of methods for
transformation, and for calculating distances between vectors of
counts, for their performance in recapturing true clusters in
simulated data. Negative Binomial counts were generated in
four groups, each with four samples. These groups were generated with
20% of genes given Normally-distributed log fold changes from a centroid. 
The standard deviation of the Normal for the non-null genes
was varied to make the clustering easier or more difficult.
The mean of the centroid and
the dispersion of the counts were drawn as pairs from the joint
distribution of estimates from the Pickrell et al.
dataset. As the Pickrell dataset has high dispersion (RNA-Seq counts
of lymphoblast cells across a population of individuals), simulations
were also considered wherein the dispersion was 0.1 and 0.25 times the
Pickrell dispersions. Hierarchical clustering with complete linkage was
used to separate the samples into four predicted clusters, using a
variety of combinations of transformation and distance. These
predicted clusters were then compared to the true clusters according to
the simulation using the adjusted Rand Index. Furthermore, two
variations were considered, one in which the size factors between
conditions were equal and one in which the size factors within each
group were [1, 1, 1/3, 3].
The code to generate these results is in `simulateCluster.R`.
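A condensed sketch of this evaluation, using only the log2 pseudocount transform and a hand-rolled adjusted Rand Index (hypothetical data and parameter values; the full version with all transformations and distances is in `simulateCluster.R`):

```r
# Sketch: recover four true groups by hierarchical clustering and score the
# result with the adjusted Rand Index (hypothetical data; log2(x+1) only).
comb2 <- function(x) x * (x - 1) / 2
ari <- function(a, b) {       # standard Hubert-Arabie adjusted Rand Index
  tab <- table(a, b)
  sumij <- sum(comb2(tab))
  expct <- sum(comb2(rowSums(tab))) * sum(comb2(colSums(tab))) /
    comb2(length(a))
  maxid <- (sum(comb2(rowSums(tab))) + sum(comb2(colSums(tab)))) / 2
  (sumij - expct) / (maxid - expct)
}
set.seed(1)
trueCluster <- rep(1:4, each=4)       # four groups, four samples each
n <- 2000
centroid <- rexp(n, rate=1/100)       # stand-in for Pickrell-derived means
lfc <- matrix(0, n, 4)                # per-group log2 fold changes
de <- sample(n, n/5)                  # 20% of genes move off the centroid
lfc[de, ] <- rnorm(length(de) * 4, sd=1)
mu <- centroid * 2^lfc[, trueCluster]
counts <- matrix(rnbinom(length(mu), mu=mu, size=5), nrow=n)
d <- dist(t(log2(counts + 1)))        # Euclidean distance between samples
predCluster <- cutree(hclust(d, method="complete"), k=4)
ari(predCluster, trueCluster)
```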

```{r simulateCluster, fig.width=8, fig.height=5, fig.cap="Clustering accuracy over the size of group differences."}
load("results_simulateCluster.RData")
library("ggplot2")
library("Hmisc")
res$sizeFactor <- factor(res$sizeFactor)
levels(res$sizeFactor) <- paste("size factors", levels(res$sizeFactor))
res$dispScale <- factor(res$dispScale)
levels(res$dispScale) <- paste(levels(res$dispScale),"x dispersion")
p <- ggplot(res, aes(x=rnormsd, y=ARI, color=method, shape=method))
p + stat_summary(fun.y=mean, geom="point", aes(shape=method)) + 
  stat_summary(fun.y=mean, geom="line") + 
  stat_summary(fun.data=mean_cl_normal, geom="errorbar") + 
  facet_grid(sizeFactor ~ dispScale, scale="free") + theme_bw() + 
  ylab("adjusted Rand Index") + xlab("SD of group differences")
```

Adjusted Rand Index from clusters using various transformation and
distances compared to the true clusters from simulation. The methods
assessed were Euclidean distance on counts normalized by size factor,
log2 of normalized counts plus a pseudocount of 1, and after applying
the rlog and variance stabilizing transformation. Additionally, the
Poisson Distance from the PoiClaClu package was used for hierarchical
clustering. The points indicate the mean from 20 simulations and the
bars are 95 percent confidence intervals.

## Genes expressed in only one condition

As discussed by
[Robinson and Smyth](http://biostatistics.oxfordjournals.org/content/9/2/321.long)
and by [Rapaport et al.](http://genomebiology.com/2013/14/9/R95),
it is desirable that, in the situation in which a gene is only
expressed in one condition, the statistical significance increase with
the signal to noise ratio of expression in the expressed condition.
For example, Rapaport et al. plot the $-\log_{10}(p)$ for *p* values over
the signal to noise ratio.
In the following code chunk we demonstrate that *DESeq2* has increasing
$-\log_{10}(p)$ in a comparison of two conditions in which one group
has all zero counts, e.g.: $\{0,0,0\}$ vs $\{10,10,10\}$.

```{r onlyonecondition, fig.width=4, fig.height=4, fig.cap="Simulated gene expressed in only one condition"}
library("DESeq2")
m <- 6
disp <- 0.1
ii <- seq(from=1,to=4,length=7)
coldata <- DataFrame(x=factor(rep(1:2,each=m/2)))
pvals <- sapply(ii, function(i) {
  mat <- matrix(as.integer(c(rep(0,m/2),rep(10^i,m/2))),nrow=1)
  dds <- DESeqDataSetFromMatrix(mat, coldata, ~ x)
  sizeFactors(dds) <- rep(1,m)
  dispersions(dds) <- disp
  results(nbinomWaldTest(dds))$pvalue
})
plot(10^ii, -log10(pvals), log="x", type="b", xaxt="n",
     xlab="group 2 counts", ylab="-log10 pvalue")
axis(1,10^(1:4),10^(1:4))
```

## Session information

The session information saved in the `simulateDE` script:

```{r}
sessInfo
```