-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathdifferential-transcript-airway.Rmd
341 lines (280 loc) · 12.5 KB
/
differential-transcript-airway.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
---
title: Differential transcript expression with Swish on the airway dataset
author: Michael Love
date: 8/1/2019
---
In this short tutorial, we will analyze the *airway* dataset using the
Swish non-parametric method for differential expression. In particular
we will look for DTE (differential transcript expression), in response
to treatment with dexamethasone, controlling for differences at
baseline across four donors. More information on the *airway* dataset
can be found [here](http://bioconductor.org/packages/airway). The
Swish method is described
in [this paper](https://doi.org/10.1093/nar/gkz622).
To use Swish, one needs to have quantified RNA-seq reads, including
the generation of inferential replicate counts, and then imported this
data into R/Bioconductor. All the packages listed here (except
`airway2`) can be installed from Bioconductor using, e.g.:
```
BiocManager::install(c("tximeta","fishpond"))
```
### tximeta to import quantification files
We have quantified the data using *Salmon* with 20 Gibbs samples that
will serve as inferential replicates. The Gibbs samples were generated
because we used the argument `--numGibbsSamples 20`.
We used *Snakemake* to run the jobs, and the exact Snakemake command
that was used can be found here:
<https://github.com/mikelove/airway2/blob/master/inst/extdata/Snakefile>
Now we import the data into R/Bioconductor using *tximeta*.
If you want to follow along, the quantification files have been
uploaded to GitHub in the following repository:
<https://github.com/mikelove/airway2>
So in our case the quantification files and a plaintext file with
the information about the samples are located within the following
directory.
```{r include=FALSE}
knitr::opts_chunk$set(cache=TRUE)
```
```{r}
dir <- "airway2/inst/extdata"
sra <- read.delim(file.path(dir,"SraRunTable.txt"))
sra[1:3,]
```
(The `inst/extdata` part of the path above is specific to how data is
stored in an R package, `airway2` being an R package. However there's
no need to have this directory structure in a typical data analysis.)
The only step needed to run *tximeta* is to make a *data.frame* that
has two or more columns. It must have a column `files` that points to
the `quant.sf` files, and it must have a column `names` with the names
for the samples. Additional columns will describe the samples
(e.g. condition, batch, donor, etc.).
```{r}
files <- file.path(dir,"quants",sra$Run,"quant.sf")
lvls <- c("Untreated", "Dexamethasone")
coldata <- data.frame(
files=files,
names=sra$Run,
donor=sra$cell_line,
condition=factor(sra$treatment, lvls)
)
coldata[1:3,]
```
We need to load the following R packages for this script:
```{r}
library(tximeta)
library(fishpond) # for the Swish method
suppressPackageStartupMessages(library(SummarizedExperiment))
```
Now we load in the quantification data:
```{r}
se <- tximeta(coldata)
```
### Running Swish to detect DTE
The following lines perform the Swish analysis on the transcript
level. The `quiet=TRUE` is just to suppress the progress bar in the R
output.
```{r}
y <- se
y <- scaleInfReps(y, quiet=TRUE)
y <- labelKeep(y)
y <- y[ mcols(y)$keep, ]
set.seed(1) # for reproducibility
y <- swish(y,
x="condition", # the condition to compare
pair="donor", # within donor pairs
nperms=16, # default is 30 perms
quiet=TRUE)
```
Two special notes about the `swish` call above:
1. If we had two groups with no donor information, we would just leave
out the `pair` argument. If we had batches, we would use `cov`
instead of `pair` to denote the batch indicator.
2. Here we have only `2^4 = 16` possible permutations. We will see this is
sufficient to detect many DE transcripts. If we had 3 paired
samples, it likely would not provide sufficient permutations to
detect DE transcripts. In general, the Swish default is to use 30
permutations on paired or unpaired data (30 permutations are used
if `nperms` isn't specified, and this was the setting evaluated in
the Swish paper).
Now we can examine the p-value distribution. It's important that a
method have a "well-behaved" p-value distribution in order for the
false discovery rate to be properly controlled by the p-value
adjustment method (necessary but not sufficient). Given that a dataset
has a mix of some features with small or no changes across condition,
and some features with changes that can be estimated with relative
precision, the set of features with small or no changes will produce
uniform bars from 0 to 1, and then --- if there are features which are
DE --- we should also see an enrichment of small p-values on the left
side of the histogram.
```{r}
hist(mcols(y)$pvalue, col="grey50", border="white")
```
It's useful to examine an MA plot, where the blue points indicate a
nominal FDR bound of 5%. For more details on annotating the MA plot
with identifiers, see the
[Swish vignette](https://bioconductor.org/packages/fishpond).
```{r}
plotMASwish(y)
```
We can tabulate the significant transcripts at a nominal 5% FDR by the
sign of the log2 fold change:
```{r}
with(mcols(y),
table(sig=qvalue < .05, sign.lfc=sign(log2FC))
)
```
We create two rankings of the transcripts, the significant
down-regulated transcripts (`lo`) and the significant up-regulated
transcripts (`hi`), each ranked by the LFC. We will create two vectors
which give the _row number_ of the down- and up-regulated transcripts.
The `order` function ranks from small to large so for `lo` we multiply
the LFC by an indicator if the transcript is significant. For `hi` we
flip the sign and do the same:
```{r}
sig <- mcols(y)$qvalue < .05
lo <- order(mcols(y)$log2FC * sig)
hi <- order(-mcols(y)$log2FC * sig)
```
We can now use `lo` and `hi` to look at the top transcripts. If we
want to restrict to only the significant ones, we can chop these
ranked row number vectors:
```{r}
lo <- head(lo, sum(sig & mcols(y)$log2FC < 0))
hi <- head(hi, sum(sig & mcols(y)$log2FC > 0))
```
The top up-regulated and down-regulated transcripts can be plotted
with inferential replicate data shown as boxes in a boxplot. The
orange boxes represent the dexamethasone treated samples, and the blue
are the untreated. Again the boxes show the range of the inferential
replicates for each sample for this transcript. They are grouped by
donor pair (alternating black and grey bars at the bottom of the plot
indicate pairs):
```{r}
plotInfReps(y, idx=hi[1], x="condition", cov="donor", xaxis=FALSE)
plotInfReps(y, idx=lo[1], x="condition", cov="donor", xaxis=FALSE)
```
(The `xaxis=FALSE` argument here suppresses the numbering of samples
on the x-axis, the numbers being more useful for when there are
biological replicates.)
### Comparison with DESeq2 at transcript level
In the [Swish paper](https://doi.org/10.1093/nar/gkz622), we find that
for DTE, across a range of sample sizes from n=4 to n=20 per group,
Swish outperforms other popular methods in terms of sensitivity while
controlling FDR across transcripts, even for those transcripts with
high inferential uncertainty. DESeq2, which was designed for
gene-level observed counts, tended to have too high FDR for those
transcripts with high inferential uncertainty (which makes sense as
this and other such methods do not make use of any information about
inferential uncertainty coming from the quantification method).
Here we also run DESeq2 and examine transcripts for which DESeq2 gives
a very small adjusted p-value, while Swish does not. The equivalent
DESeq2 design formula for the above `swish` call is `~donor +
condition`:
```{r}
library(DESeq2)
dds <- DESeqDataSet(se, ~donor + condition)
dds <- dds[rownames(y),]
dds <- DESeq(dds, fitType="local", quiet=TRUE)
res <- results(dds)
```
We can compare the number of transcripts at the nominal 5% FDR threshold:
```{r}
table(deseq2=res$padj < .05, swish=sig)
```
DESeq2 is calling more than double the number of transcripts that
Swish is calling. There are also a minority of transcripts
that Swish calls but not DESeq2. We found that Swish had comparable
power to parametric methods even at n=4 samples per group, so likely
some of the excess of calls from DESeq2 could be from transcripts with
high inferential uncertainty.
Another factor is that we only have 4 paired samples, and so Swish is
likely calling only those transcripts in which all 4 paired samples
demonstrated a _consistent_ change upon treatment, whereas DESeq2 may
call some transcripts with large changes in a subset of the 4 pairs,
and small or no change for the other pairs. Swish is based on the
SAMseq method, which had the appropriate publication title
"Finding consistent patterns: A nonparametric approach for identifying
differential expression in RNA-Seq data" --
[Li and Tibshirani (2011)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4605138/).
We will now take a look at some of the outlier transcripts in the
following plot, which have a very small adjusted p-value from DESeq2
but a mid range q-value from Swish:
```{r}
plot(mcols(y)$qvalue, -log10(res$padj), ylim=c(0,30),
xlab="Swish q-value", ylab="DESeq2 -log10 adj p-value", cex=.3)
DESeq2.extra <- res$padj < 1e-2 & mcols(y)$qvalue > .2
table(DESeq2.extra)
```
To demonstrate that some extra calls from DESeq2 have elevated
inferential uncertainty, we compute the Inferential Relative Variance
(InfRV) as defined in the Swish paper. It is defined as `(var -
mean)/mean` over the inferential replicates for a given sample, with
some parameters to keep the value positive and bounded for small
`mean`. Below we plot the mean of InfRV over samples, then group the
transcripts by whether they were in `DESeq2.extra`.
```{r fig.width=3}
y <- computeInfRV(y)
with(mcols(y), boxplot(log10(meanInfRV) ~ DESeq2.extra, col="grey"))
```
To demonstrate that some of the DESeq2 calls are for transcripts where
not all the 4 pairs have consistent effects from treatment, we can
construct the following contingency tables, showing that all of the
Swish calls have all 4 pairs in the same direction according to a
simple calculation of LFC, however DESeq2 has ~500 DTE calls which do
not have consistent sign of LFC across the 4 pairs.
```{r}
n.cts <- counts(dds, normalized=TRUE)
lfc <- log2(n.cts[,c(2,4,6,8)] + 1) - log2(n.cts[,c(1,3,5,7)] + 1)
consistent <- apply(lfc, 1, function(x) all(sign(x) == sign(x[1])))
table(consistent, Swish=sig)
table(consistent, DESeq2=res$padj < .05)
```
Note that if we had larger numbers of pairs, it would not necessarily
be the case that _all_ of the Swish calls would have consistent sign of
LFC for _all_ the pairs. However, with only 4 pairs, we see that this
happened to be the case, for this dataset.
If we look at 5 particular transcripts with very low adjusted p-value
from DESeq2 and a high q-value from Swish, we see that the inferential
replicates are helping to avoid calling these transcripts as
significant. These may represent cases where a transcript is not
actually expressed, but instead the quantification method is not sure
where to place the reads among this and other similar transcripts. At
the least, we would need more samples with less uncertainty on the
quantification (higher sequencing depth, more uniform coverage, longer
reads) in order to be sure that these transcripts are truly DE, and
not false positives for DESeq2.
```{r, fig.width=6, fig.height=14}
idx <- which(res$padj < 1e-5 & mcols(y)$qvalue > .4)
par(mfrow=c(5,2))
for (i in 1:5) {
plotInfReps(y, idx=idx[i], x="condition", cov="donor", xaxis=FALSE)
plotCounts(dds, gene=idx[i], "condition", transform=FALSE)
}
```
We can also examine some of the transcripts where Swish calls them as
DE but DESeq2 does not. The general pattern seems to be that for one
pair, the effect is much larger than for the other three pairs. For
paired data, Swish focuses on the sign of the treatment effect, and
whether this is stable across inferential replicates. So it is less
sensitive to whether the log fold change is nearly the same across all
donors, and more sensitive to the direction of the effect and whether
it is consistent across samples and inferential replicates.
```{r, fig.width=6, fig.height=10}
idx <- which(mcols(y)$qvalue < .05 & res$padj > .5)
par(mfrow=c(3,2))
for (i in 1:3) {
plotInfReps(y, idx=idx[i], x="condition", cov="donor", xaxis=FALSE)
plotCounts(dds, gene=idx[i], "condition", transform=FALSE)
}
```
### More details
For more details, consult the
[Swish vignette](https://bioconductor.org/packages/fishpond).
as well as the help pages for individual functions, e.g. `?swish`. You
can also post questions to the
[Bioconductor support site](https://support.bioconductor.org),
always making sure to post your code.
### Session info
```{r}
sessionInfo()
```