/usr/lib/R/site-library/DESeq2/script/simulation.Rmd is in r-bioc-deseq2 1.18.1-1.
This file is owned by root:root, with mode 0o644.
The actual contents of the file can be viewed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 | # Assessment of DESeq2 through simulation
Michael Love, Wolfgang Huber, Simon Anders
<!-- this document built with knit() in R -->
<!-- pandoc -o simulation.pdf simulation.md -->
```{r options, echo=FALSE, results="hide"}
opts_chunk$set(dev="pdf")
```
Document compiled on:
```{r time, echo=FALSE}
Sys.time()
```
This document and associated files provide the code and plots for the
simulation section of the *DESeq2* paper, so that these results can be
easily updated over time. For more details, read the paper at the
following URL:
http://dx.doi.org/10.1101/002832
## Differential expression analysis
We assessed the sensitivity and specificity of various algorithms
using simulation to complement an analysis on real data. The following
Negative Binomial simulation samples (mean, dispersion) pairs from the
joint distribution of estimated means and dispersions from the Pickrell
et al dataset. The true differences between two groups are drawn from
either *z*, *0* or *-z*, where the *0* component represents 80% of the
genes. The absolute value of the effect size *z* for the 20% of genes
with differential expression is varied, as is the total sample size m
(such that each group has m/2 samples). 10,000 genes were simulated,
and each combination of parameters was repeated 6 times. The code to
generate these results is in `simulateDE.R` and the code to run each
algorithm is in `runScripts.R`.
Note: *DSS* denotes running *DSS* and then Benjamini-Hochberg
adjustment on *p*-values. *DSS-FDR* denotes the native FDR estimation
of the *DSS* software.
*SAMseq* denotes running *SAMseq* with *p*-value
estimation and Benjamini-Hochberg adjustment for FDR.
*SAMseq-FDR* denotes the native FDR estimation and no *p*-value
estimation. *EBSeq* likewise only produces FDR.
```{r loadDE}
load("results_simulateDE.RData")
res$m <- factor(res$m)
levels(res$m) <- paste0("m=",levels(res$m))
res$effSize <- factor(res$effSize)
levels(res$effSize) <- c("fold change 2","fold change 3","fold change 4")
res$algorithm <- factor(res$algorithm)
resClean <- res[!is.na(res$oneminusspecpvals),]
```
```{r simulateDE, fig.width=7, fig.height=5, fig.cap="Sensitivity and specificity on simulated datasets."}
library("ggplot2")
p <- ggplot(resClean, aes(y=sensitivity, x=oneminusspecpvals,
color=algorithm, shape=algorithm))
p + geom_point() + theme_bw() + facet_grid(effSize ~ m) +
scale_shape_manual(values=1:9) +
xlab("1 - specificity (false positive rate)") +
coord_cartesian(xlim=c(-.003,.035)) +
geom_vline(xintercept=.01) +
scale_x_continuous(breaks=c(0,.02))
```
Use of simulation to assess the sensitivity and specificity of
algorithms across combinations of sample size and effect size. The
sensitivity was calculated as the fraction of genes with adjusted
*p*-value less than 0.1 among the genes with true differences between
group means. The specificity was calculated as the fraction of genes
with *p*-value greater than 0.01 among the genes with no true
differences between group means. The *p*-value was chosen instead of
the adjusted *p*-value, as this allows for comparison against the
expected fraction of *p*-values less than a critical value given the
uniformity of *p*-values under the null hypothesis.
```{r simulateDEPrec, fig.width=7, fig.height=5, fig.cap="Sensitivity and precision on simulated datasets."}
library("ggplot2")
p <- ggplot(res, aes(y=sensitivity, x=oneminusprec,
color=algorithm, shape=algorithm))
p + geom_point() + theme_bw() + facet_grid(effSize ~ m) +
scale_shape_manual(values=1:11) +
xlab("1 - precision (false discovery rate)") +
coord_cartesian(xlim=c(-.03, .3)) +
geom_vline(xintercept=.1)
```
Sensitivity and precision of algorithms across combinations of sample
size and effect size. The sensitivity was calculated as the fraction
of genes with adjusted *p*-value less than 0.1 among the genes with true
differences between group means. The precision was calculated as the
fraction of genes with true differences between group means among
those with adjusted *p*-value less than 0.1.
```{r simulateDESensStratify, fig.width=7, fig.height=5, fig.cap="Sensitivity dependence on mean count."}
library("reshape")
id.vars <- c("algorithm","effSize","m")
measure.vars <- c("sens0to20","sens20to100","sens100to300","sensmore300")
melted <- melt(res[,c(id.vars,measure.vars)], id.vars=id.vars,
measure.vars=measure.vars)
names(melted) <- c(id.vars, "aveexp", "sensitivity")
levels(melted$aveexp) <- c("<20","20-100","100-300",">300")
p <- ggplot(melted, aes(y=sensitivity, x=aveexp, group=algorithm,
color=algorithm, shape=algorithm))
p + stat_summary(fun.y="mean", geom="line") +
stat_summary(fun.y="mean", geom="point") +
theme_bw() + facet_grid(effSize ~ m) +
scale_shape_manual(values=1:11) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("mean counts")
```
The sensitivity of algorithms across combinations of sample size and
effect size, and further stratified by the mean of counts of the
differentially expressed genes in the simulation data. Points indicate
the average over 6 replicates. Algorithms all show a similar
dependence of sensitivity on the mean of counts. The height of the
sensitivity curve should be compared with the previous plot indicating
the total sensitivity and specificity of each algorithm.
## Performance in the presence of outliers
The following plots examine the affect of outliers on differential
calls by the two Negative-Binomial-based methods *DESeq2* and *edgeR*.
*DESeq2* was run with default settings, after turning off
gene filtering, and after turning off outlier replacement. *edgeR*
was run with default settings, and after using the *robust* option.
The code to generate these results is in `simulateOutliers.R`.
```{r loadOut}
load("results_simulateOutliers.RData")
# when < 7 replicates DESeq does not replace
res <- res[!(res$algorithm == "DESeq2-noRepl" & res$m < 14),]
# when >= 7 replicates DESeq does not filter
res <- res[!(res$algorithm == "DESeq2-noFilt" & res$m >= 14),]
res$m <- factor(res$m)
levels(res$m) <- paste0("m=",levels(res$m))
res$percentOutlier <- 100 * res$percentOutlier
res$percentOutlier <- factor(res$percentOutlier)
levels(res$percentOutlier) <- paste0(levels(res$percentOutlier),"% outlier")
```
Because the sensitivity-specificity curve is evaluated using the p
value, we use fhe following code to pick out the point on the
sensitivity-specificity curve with largest p value such that the
nominal adjusted *p*-value is less than 0.1.
```{r}
resSensPadj <- res[res$senspadj < .1,]
resSensPadj <- resSensPadj[nrow(resSensPadj):1,]
resSensPadj <- resSensPadj[!duplicated(with(resSensPadj,
paste(algorithm, m, percentOutlier))),]
summary(resSensPadj$senspadj)
```
```{r simulateOutliersSens, fig.width=8, fig.height=4, fig.cap="Sensitivity and specificity in presence of outliers."}
library("ggplot2")
p <- ggplot(res, aes(x=oneminusspec, y=sensitivity, color=algorithm))
p + scale_x_continuous(breaks=c(0,.1,.2)) +
scale_y_continuous(breaks=c(0,.2,.4,.6,.8)) +
geom_line() + theme_bw() +
facet_grid(m ~ percentOutlier) + xlab("1 - specificity") +
coord_cartesian(xlim=c(-.03, .25), ylim=c(-.05, .9)) +
geom_point(aes(x=oneminusspec, y=sensitivity, shape=algorithm),
data=res[res$precpadj == .1,])
```
Sensitivity-specificity curves for detecting true differences in the
presence of outliers. Negative Binomial counts were simulated for 4000
genes and total sample sizes (m) of 10 and 20, for a two-group
comparison. 80% of the simulated genes had no true differential
expression, while for 20% of the genes true logarithmic (base 2)
fold changes of -1 or 1. The number of genes with simulated outliers
was increased from 0% to 15%. The outliers were constructed for a
gene by multiplying the count of a single sample by 100. Sensitivity
and specificity were calculated by thresholding on *p*-values. Points
indicate an adjusted *p*-value of 0.1. DESeq2 filters genes with
potential outliers for samples with 3 to 6 replicates, while replacing
outliers for samples with 7 or more replicates, hence the filtering
can be turned off for the top row (m=10) and the replacement can be
turned off for the bottom row (m=20).
```{r simulateOutliersPrec, fig.width=8, fig.height=4, fig.cap="FDR and target FDR in presence of outliers."}
p <- ggplot(res, aes(x=precpadj, y=oneminusprec, color=algorithm))
p + scale_x_continuous(breaks=c(0,.1,.2)) +
scale_y_continuous(breaks=c(0,.1,.2)) +
geom_line() + theme_bw() +
facet_grid(m ~ percentOutlier) +
geom_abline(intercept=0,slope=1) +
xlab("adjusted p-value") + ylab("1 - precision (FDR)") +
coord_cartesian(xlim=c(-.03, .25), ylim=c(-.05, .25)) +
geom_point(aes(x=precpadj, y=oneminusprec, shape=algorithm),
data=res[res$precpadj == .1,])
```
Outlier handling: One minus the precision (false discovery rate)
plotted over various thresholds of adjusted *p*-value. Shown is the
results for the same simulation with outliers described in the
previous figure. Points indicate an adjusted *p*-value of 0.1.
## Accuracy of log fold change estimates
The following simulations used Negative Binomial random variables
with mean and dispersion pairs samples from the joint
distribution of mean-dispersion estimates from the Pickrell data. In
addition, true differences between two groups were randomly generated,
according to the following models, diagrammed below. The
accuracy of four methods for estimating the log fold change between
groups were compared by the root mean squared error (RMSE) and the
mean absolute error (MAE). The four methods were chosen for their
particular focus on the logs fold change estimate.
The code to generate these results is in `simulateLFCAccuracy.R`.
```{r lfcAccuracyHist, fig.width=6, fig.height=6, fig.cap="Examples of simulated log2 fold changes."}
par(mfrow=c(2,2),mar=c(3,3,3,1))
n <- 1000
brks <- seq(from=-4,to=4,length.out=20)
trimit <- function(x) x[x > -4 & x < 4] # for visualization only
myhist <- function(x, ...) hist(x, breaks=brks, border="white",
col="blue", xlab="", ylab="", ...)
myhist(trimit(rnorm(n)), main="bell")
myhist(trimit(c(rnorm(n * 8/10), runif(n * 2/10, -4, 4))), main="slab bell")
myhist(c(rep(0, n * 8/10), runif(n * 2/10, -4, 4)), main="slab spike")
myhist(c(rep(0, n * 8/10), sample(c(-2, 2), n * 2/10, TRUE)), main="spike spike")
```
Benchmarking LFC estimation: Models for simulating logarithmic (base
2) fold changes. For the bell model, true logarithmic fold changes
were drawn from a Normal with mean 0 and variance 1. For the slab bell
model, true logarithmic fold changes were drawn for 80% of genes
from a Normal with mean 0 and variance 1 and for 20% of genes from a
Uniform distribution with range from -4 to 4. For the slab spike
model, true logarithmic fold changes were drawn similarly to the slab
bell model except the Normal is replaced with a spike of logarithmic
fold changes at 0. For the spike spike model, true logarithmic fold
changes were drawn according to a spike of logarithmic fold changes at
0 (80%) and a spike randomly sampled from -2 or 2 (20%). These
spikes represent fold changes of 1/4 and 4, respectively.
```{r lfcAccuracy, fig.width=7, fig.height=5, fig.cap="Root mean squared error in estimating log2 fold changes."}
load("results_simulateLFCAccuracy.RData")
library("ggplot2")
library("Hmisc")
p <- ggplot(data=res, aes(x=m, y=RMSE, color=method, shape=method))
p + stat_summary(fun.y=mean, geom="point") +
stat_summary(fun.y=mean, geom="line") +
stat_summary(fun.data=mean_cl_normal, geom="errorbar") +
theme_bw() + xlab("total sample size") + facet_wrap(~ type) +
scale_x_continuous(breaks=unique(res$m))
```
Root mean squared error (RMSE) for estimating logarithmic fold changes
under the four models of logarithmic fold changes and varying total
sample size m. Simulated Negative Binomial counts were generated for
two groups and for 1000 genes. Points and error bars are drawn for the
mean and 95% confidence interval over 10 replications.
```{r lfcAccuracyDE, fig.width=6, fig.height=2.5, fig.cap="Root mean squared error for only differentially expressed genes."}
p <- ggplot(data=res[grepl("spike",res$type),],
aes(x=m, y=DiffRMSE, color=method, shape=method))
p + stat_summary(fun.y=mean, geom="point") +
stat_summary(fun.y=mean, geom="line") +
stat_summary(fun.data=mean_cl_normal, geom="errorbar") +
theme_bw() + xlab("total sample size") + ylab("RMSE only of DE genes") +
facet_wrap(~ type) +
scale_x_continuous(breaks=unique(res$m))
```
Root mean squared error (RMSE) of logarithmic fold change estimates,
only considering genes with non-zero true logarithmic fold change. For
the same simulation, shown here is the error only for the 20% of
genes with non-zero true logarithmic fold changes (for bell and slab
bell all genes have non-zero logarithmic fold change).
```{r lfcAccuracyMAE, fig.width=7, fig.height=5, fig.cap="Mean absolute error in estimating log2 fold changes."}
p <- ggplot(data=res, aes(x=m, y=MAE, color=method, shape=method))
p + stat_summary(fun.y=mean, geom="point") +
stat_summary(fun.y=mean, geom="line") +
stat_summary(fun.data=mean_cl_normal, geom="errorbar") +
theme_bw() + xlab("total sample size") + ylab("MAE") +
facet_wrap(~ type) +
scale_x_continuous(breaks=unique(res$m))
```
Mean absolute error (MAE) of logarithmic fold change
estimates. Results for the same simulation, however here using median
absolute error in place of root mean squared error. Mean absolute
error places less weight on the largest errors.
```{r lfcAccuracyDiffMAE, fig.width=6, fig.height=2.5, fig.cap="Mean absolute error for only differentially expressed genes."}
p <- ggplot(data=res[grepl("spike",res$type),],
aes(x=m, y=DiffMAE, color=method, shape=method))
p + stat_summary(fun.y=mean, geom="point") +
stat_summary(fun.y=mean, geom="line") +
stat_summary(fun.data=mean_cl_normal, geom="errorbar") +
theme_bw() + xlab("total sample size") + ylab("MAE only of DE genes") +
facet_wrap(~ type) +
scale_x_continuous(breaks=unique(res$m))
```
Mean absolute error (MAE) of logarithmic fold change estimates, only
considering those genes with non-zero true logarithmic fold change.
## Transformations and distances for recovery of true clusters
The following simulation evaluated a set of methods for
transformation, and for calculating distances betweeen vectors of
counts, for their performance in recapturing true clusters in
simulated data. Negative Binomial counts were generated in
four groups, each with four samples. These groups were generated with
20% of genes given Normally-distributed log fold changes from a centroid.
The standard deviation of the Normal for the non-null genes
was varied to make the clustering easier or more difficult.
The mean of the centroid and
the dispersion of the counts were drawn as pairs from the joint
distribution of estimates from the Pickrell et al
dataset. As the Pickrell dataset has high dispersion (RNA-Seq counts
of lymphoblast cells across a population of individuals), simulations
were also considered wherein the dispersion was 0.1 and 0.25 times the
Pickrell dispersions. Hierarchical clustering with complete linkage was
used to separate the samples into four predicted clusters, using a
variety of combinations of transformation and distance. These
predicted clusters were then compared to the true clusters according to
the simulation using the adjusted Rand Index. Furthermore, two
variations were considered, one in which the size factors between
conditions were equal and one in which the size factors within each
group were [1, 1, 1/3, 3].
The code to generate these results is in `simulateCluster.R`.
```{r simulateCluster, fig.width=8, fig.height=5, fig.cap="Clustering accuracy over the size of group differences."}
load("results_simulateCluster.RData")
library("ggplot2")
library("Hmisc")
res$sizeFactor <- factor(res$sizeFactor)
levels(res$sizeFactor) <- paste("size factors", levels(res$sizeFactor))
res$dispScale <- factor(res$dispScale)
levels(res$dispScale) <- paste(levels(res$dispScale),"x dispersion")
p <- ggplot(res, aes(x=rnormsd, y=ARI, color=method, shape=method))
p + stat_summary(fun.y=mean, geom="point", aes(shape=method)) +
stat_summary(fun.y=mean, geom="line") +
stat_summary(fun.data=mean_cl_normal, geom="errorbar") +
facet_grid(sizeFactor ~ dispScale, scale="free") + theme_bw() +
ylab("adjusted Rand Index") + xlab("SD of group differences")
```
Adjusted Rand Index from clusters using various transformation and
distances compared to the true clusters from simulation. The methods
assessed were Euclidean distance on counts normalized by size factor,
log2 of normalized counts plus a pseudocount of 1, and after applying
the rlog and variance stabilizing transformation. Additionally, the
Poisson Distance from the PoiClaClu package was used for hierarchical
clustering. The points indicate the mean from 20 simulations and the
bars are 95 percent confidence intervals.
## Genes expressed in only one condition
As discussed by
[Robinson and Smyth](http://biostatistics.oxfordjournals.org/content/9/2/321.long)
and by [Rapaport et al.](http://genomebiology.com/2013/14/9/R95),
it is desirable that, in the situation in which a gene is only
expressed in one condition, the statistical significance increase with
the signal to noise ratio of expression in the expressed condition.
For example, Rapaport et al. plot the $-\log_{10}(p)$ for *p* values over
the signal to noise ratio.
In the following code chunk we demontrate that *DESeq2* has increasing
$-\log_{10}(p)$ in a comparison of two conditions in which one group
has all zero counts, e.g.: $\{0,0,0\}$ vs $\{10,10,10\}$.
```{r onlyonecondition, fig.width=4, fig.height=4, fig.cap="Simulated gene expressed in only one condition"}
library("DESeq2")
m <- 6
disp <- 0.1
ii <- seq(from=1,to=4,length=7)
coldata <- DataFrame(x=factor(rep(1:2,each=m/2)))
pvals <- sapply(ii, function(i) {
mat <- matrix(as.integer(c(rep(0,m/2),rep(10^i,m/2))),nrow=1)
dds <- DESeqDataSetFromMatrix(mat, coldata, ~ x)
sizeFactors(dds) <- rep(1,m)
dispersions(dds) <- disp
results(nbinomWaldTest(dds))$pvalue
})
plot(10^ii, -log10(pvals), log="x", type="b", xaxt="n",
xlab="group 2 counts", ylab="-log10 pvalue")
axis(1,10^(1:4),10^(1:4))
```
## Session information
The session information saved in the `simulateDE` script:
```{r}
sessInfo
```
|