This file is indexed.

/usr/lib/R/site-library/ShortRead/doc/Overview.Rnw is in r-bioc-shortread 1.36.0-1.

This file is owned by root:root, with mode 0o644.

The actual contents of the file can be viewed below.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
%\VignetteIndexEntry{An introduction to ShortRead}
%\VignetteDepends{BiocStyle}
%\VignetteKeywords{Short read, I/0, quality assessment}
%\VignettePackage{ShortRead}
\documentclass[]{article}

<<style, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex()
@ 

\newcommand{\ShortRead}{\Biocpkg{ShortRead}}

\title{An Introduction to \Rpackage{ShortRead}}
\author{Martin Morgan}
\date{Modified: 21 October, 2013. Compiled: \today}

\begin{document}

\maketitle

<<preliminaries>>=
library("ShortRead")
@ 

The \Rpackage{ShortRead} package provides functionality for working
with FASTQ files from high throughput sequence analysis. The package
also contains functions for legacy (single-end, ungapped) aligned
reads; for working with BAM files, please see the \Biocpkg{Rsamtools},
\Biocpkg{GenomicRanges}, \Biocpkg{GenomicAlignments} and related packages.

\section{Sample data}

Sample FASTQ data are derived from ArrayExpress record
\href{http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-1147/}{E-MTAB-1147}.
Paired-end FASTQ files were retrieved and then sampled to 20,000
records with
<<sample, eval=FALSE>>=
sampler <- FastqSampler('E-MTAB-1147/fastq/ERR127302_1.fastq.gz', 20000)
set.seed(123); ERR127302_1 <- yield(sampler)
sampler <- FastqSampler('E-MTAB-1147/fastq/ERR127302_2.fastq.gz', 20000)
set.seed(123); ERR127302_2 <- yield(sampler)
@ 

\section{Functionality}

Functionality is summarized in Table~\ref{tab:fastq}. 

\begin{table}
  \centering
  \begin{tabular}{lll}
    \hline
    \multicolumn{3}{l}{Input} \\
    & \Rfunction{FastqStreamer} & Iterate through FASTQ files in chunks \\
    & \Rfunction{FastqSampler} & Draw random samples from FASTQ files \\
    & \Rfunction{readFastq} & Read an entire FASTQ file into memory \\
    & \Rfunction{writeFastq} & Write FASTQ objects to a connection (file) \\
    \multicolumn{3}{l}{Sequence and quality summary} \\
    & \Rfunction{alphabetFrequency} & Nucleotide or quality score use per read\\
    & \Rfunction{alphabetByCycle} & Nucleotide or quality score use by cycle\\
    & \Rfunction{alphabetScore} & Whole-read quality summary\\
    & \Rfunction{encoding} & Character / `phred' score mapping \\
    \multicolumn{3}{l}{Quality assessment} \\
    & \Rfunction{qa} & Visit FASTQ files to collect QA statistics \\
    & \Rfunction{report} & Generate a quality assessment report \\
    \multicolumn{3}{l}{Filtering and trimming} \\
    & \Rfunction{srFilter} & Pre-defined and bespoke filters \\
    & \Rfunction{trimTails}, etc. & Trim low-quality nucleotides \\
    & \Rfunction{narrow} & Remove leading / trailing nucleotides \\
    & \Rfunction{tables} & Summarize read occurrence \\
    & \Rfunction{srduplicated}, etc. & Identify duplicate reads \\
    & \Rfunction{filterFastq} & Filter reads from one file to another\\
    \hline
  \end{tabular}
  \caption{Key functions for working with FASTQ files}
  \label{tab:fastq}
\end{table}

\paragraph{Input} FASTQ files are large so processing involves
iteration in `chunks' using \Rfunction{FastqStreamer}
<<stream, eval=FALSE>>=
strm <- FastqStreamer("a.fastq.gz")
repeat {
    fq <- yield(strm)
    if (length(fq) == 0)
        break
    ## process chunk
}
@
or drawing a random sample from the file
<<sampler, eval=FALSE>>=
sampler <- FastqSampler("a.fastq.gz")
fq <- yield(sampler)
@ 
\noindent The default size for both streams and samples is 1M records;
this volume of data fits easily into memory. Small FASTQ files can be
read in to memory in their entirety using \Rfunction{readFastq}; we do
this for our sample data
<<readFastq>>=
fl <- system.file(package="ShortRead", "extdata", "E-MTAB-1147", 
                  "ERR127302_1_subset.fastq.gz")
fq <- readFastq(fl)
@ 

The result of data input is an instance of class \Rclass{ShortReadQ}
(Table~\ref{tab:classes}). 
\begin{table}
  \centering
  \begin{tabular}{ll}
    \hline
    \Rclass{DNAStringSet} & (\Biocpkg{Biostrings}) Short read sequences \\
    \Rclass{FastqQuality}, etc. & Quality encodings \\
    \Rclass{ShortReadQ} & Reads, quality scores, and ids \\
    \hline
  \end{tabular}
  \caption{Primary data types in the \Biocpkg{ShortRead} package}
  \label{tab:classes}
\end{table}
This class contains reads, their quality scores, and optionally the id
of the read.
<<ShortReadQ>>=
fq
fq[1:5]
head(sread(fq), 3)
head(quality(fq), 3)
@ 
\noindent The reads are represented as \Rclass{DNAStringSet}
instances, and can be manipulated with the rich tools defined in the
\Biocpkg{Biostrings} package. The quality scores are represented by a
class that represents the quality encoding inferred from the file; the
encoding in use can be discovered with
<<encoding>>=
encoding(quality(fq))
@ 
\noindent The primary source of documentation for these classes is
\Rcode{?ShortReadQ} and \Rcode{?QualityScore}.

\section{Common work flows}

\subsection{Quality assessment}

FASTQ files are often used for basic quality assessment, often to
augment the purely technical QA that might be provided by the
sequencing center with QA relevant to overall experimental design. A QA report is generated by creating a vector of paths to FASTQ files
<<qa-files, eval=FALSE>>=
fls <- dir("/path/to", "*fastq$", full=TRUE)
@ 
\noindent collecting statistics over the files
<<qa-qa, eval=FALSE>>=
qaSummary <- qa(fls, type="fastq")
@ 
\noindent and creating and viewing a report
<<qa-view, eval=FALSE>>=
browseURL(report(qaSummary))
@ 
\noindent By default, the report is based on a sample of 1M
reads.

These QA facilities are easily augmented by writing custom functions
for reads sampled from files, or by explorting the elements of the
object returned from \Rcode{qa()}, e.g., for an analysis of
ArrayExpress experiment E-MTAB-1147:
<<qa-files, echo=FALSE>>=
load("qa_E-MTAB-1147.Rda")
@ 
<<qa-elements>>=
qaSummary
@ 
%% 
For instance, the count of reads in each lane is summarized in the
\Robject{readCounts} element, and can be displayed as
<<qa-readCounts>>=
head(qaSummary[["readCounts"]])
head(qaSummary[["baseCalls"]])
@ 
%% 
The \Robject{readCounts} element contains a data frame with 1 row and
3 columns (these dimensions are indicated in the parenthetical
annotation of \Robject{readCounts} in the output of
\Rcode{qaSummary}). The rows represent different lanes. The columns
indicated the number of reads, the number of reads surviving the
Solexa filtering criteria, and the number of reads aligned to the
reference genome for the lane. The \Robject{baseCalls} element
summarizes base calls in the unfiltered reads.

The functions that produce the report tables and graphics are 
internal to the package. They can be accessed through calling 
ShortRead:::functionName where functionName is one of the functions
listed below, organized by report section. 
\begin{description}
\item [] Run Summary : .ppnCount, .df2a, .laneLbl, .plotReadQuality
\item [] Read Distribution : .plotReadOccurrences, .freqSequences
\item [] Cycle Specific : .plotCycleBaseCall, .plotCycleQuality
\item [] Tile Performance : .atQuantile, .colorkeyNames, .plotTileLocalCoords, .tileGeometry,
.plotTileCounts, .plotTileQualityScore
\item [] Alignment : .plotAlignQuality
\item [] Multiple Alignment : .plotMultipleAlignmentCount
\item [] Depth of Coverage : .plotDepthOfCoverage
\item [] Adapter Contamination : .ppnCount
\end{description}

\subsection{Filtering and trimming}

It is straight-forward to create filters to eliminate reads or to trim
reads based on diverse characteristics. The basic structure is to open
a FASTQ file, iterate through chunks of the file performing filtering
or trimming steps, and appending the filtered data to a new file.
<<filter-scheme>>=
myFilterAndTrim <- 
    function(fl, destination=sprintf("%s_subset", fl))
{
    ## open input stream
    stream <- open(FastqStreamer(fl))
    on.exit(close(stream))
    
    repeat {
        ## input chunk
        fq <- yield(stream)
        if (length(fq) == 0)
            break
        
        ## trim and filter, e.g., reads cannot contain 'N'...
        fq <- fq[nFilter()(fq)]  # see ?srFilter for pre-defined filters
        ## trim as soon as 2 of 5 nucleotides has quality encoding less 
        ## than "4" (phred score 20)
        fq <- trimTailw(fq, 2, "4", 2)
        ## drop reads that are less than 36nt
        fq <- fq[width(fq) >= 36]
        
        ## append to destination
        writeFastq(fq, destination, "a")
    }
}
@ 
\noindent This is memory efficient and flexible. Care must be taken to
coordinate pairs of FASTQ files representing paired-end reads, to
preserve order.

\section{Using \Rpackage{ShortRead} for data exploration}

\subsection{Data I/O}

\ShortRead{} provides a variety of methods to read data into \R{}, in
addition to \Rfunction{readAligned}. 

\subsubsection{\Rfunction{readXStringColumns}}

\Rfunction{readXStringColumns} reads a column of DNA or other
sequence-like data. For instance, the Solexa files
\texttt{s\_N\_export.txt} contain lines with the following
information:
<<export>>=
## location of file
exptPath <- system.file("extdata", package="ShortRead")
sp <- SolexaPath(exptPath)
pattern <- "s_2_export.txt"
fl <- file.path(analysisPath(sp), pattern)
strsplit(readLines(fl, n=1), "\t")
length(readLines(fl))
@ 
% 
Column 9 is the read, and column 10 the ASCII-encoded Solexa Fastq
quality score; there are 1000 lines (i.e., 1000 reads) in this sample
file. 

Suppose the task is to read column 9 as a \Rclass{DNAStringSet} and
column 10 as a \Rclass{BStringSet}. \Rclass{DNAStringSet} is a class
that contains IUPAC-encoded DNA strings (IUPAC code allows for
nucleotide ambiguity); \Rclass{BStringSet} is a class that contains
any character with ASCII code 0 through 255. Both of these classes are
defined in the \Rpackage{Biostrings}
package. \Rfunction{readXStringColumns} allows us to read in columns
of text as these classes.

Important arguments for \Rfunction{readXStringColumns} are the
\Rcode{dirPath} in which to look for files, the \Rcode{pattern} of
files to parse, and the \Rcode{colClasses} of the columns to be
parsed. The \Rcode{dirPath} and \Rcode{pattern} arguments are like
\Rcode{list.files}. \Rcode{colClasses} is like the corresponding
argument to \Rfunction{read.table}: it is a \Rclass{list} specifying
the class of each column to be read, or \Robject{NULL} if the column
is to be ignored. In our case there are 21 columns, and we would like
to read in columns 9 and 10. Hence
<<colClasses>>=
colClasses <- rep(list(NULL), 21)
colClasses[9:10] <- c("DNAString", "BString")
names(colClasses)[9:10] <- c("read", "quality")
@ 
% 
We use the class of the type of sequence (e.g., \Rclass{DNAString} or
\Rclass{BString}), rather than the class of the set that we will
create ( e.g., \Rclass{DNAStringSet} or \Rclass{BStringSet}).
Applying names to \Robject{colClasses} is not required, but makes
subsequent manipulation easier. We are now ready to read our file
<<readXStringColumns>>=
cols <- readXStringColumns(analysisPath(sp), pattern, colClasses)
cols
@ 
% 
The file has been parsed, and appropriate data objects were created.

A feature of \Rfunction{readXStringColumns} and other input functions
in the \Rpackage{ShortRead} package is that all files matching
\Rcode{pattern} in the specified \Rcode{dirPath} will be read into
a single object. This provides a convenient way to, for instance,
parse all tiles in a Solexa lane into a single \Rclass{DNAStringSet}
object.

There are several advantages to reading columns as \Rclass{XStringSet}
objects. These are more compact than the corresponding character
representation:
<<size>>=
object.size(cols$read)
object.size(as.character(cols$read))
@ 
% 
They are also created much more quickly. And the \Rclass{DNAStringSet} and
related classes are used extensively in \Rpackage{ShortRead},
\Rpackage{Biostrings}, \Rpackage{BSgenome} and other packages relevant
to short read technology.


\subsection{Sorting}

Short reads can be sorted using \Rfunction{srsort}, 
or the permutation required to bring the short read into 
lexicographic order can be determined using
\Rfunction{srorder}. These functions are different from
\Rfunction{sort} and \Rfunction{order} because the result is
independent of the locale, and they operate quickly on
\Rclass{DNAStringSet} and \Rclass{BStringSet} objects.

The function \Rfunction{srduplicated} identifies duplicate reads. This
function returns a logical vector, similar to \Rfunction{duplicated}.
The negation of the result from \Rfunction{srduplicated} is useful to
create a collection of unique reads. An experimental scenario where
this might be useful is when the sample preparation involved PCR. In this
case, replicate reads may be due to artifacts of sample preparation,
rather than differential representation of sequence in the sample
prior to PCR.

\subsection{Summarizing read occurrence}

The \Rfunction{tables} function summarizes read occurrences, for instance,
<<tables>>=
tbls <- tables(fq)
names(tbls)
tbls$top[1:5]
head(tbls$distribution)
@ 
%% 
The \Robject{top} component returned by \Robject{tables} is a list
tallying the most commonly occurring sequences in the short
reads. Knowledgeable readers will recognize the top-occurring read as a
close match to one of the manufacturer adapters.

The \Robject{distribution} component returned by \Robject{tables} is a
data frame that summarizes how many reads (e.g.,
\Sexpr{tbls[["distribution"]][1,"nReads"]}) are represented exactly
\Sexpr{tbls[["distribution"]][1,"nOccurrences"]} times.

\subsection{Finding near matches to short sequences}

Facilities exist for finding reads that are near matches to specific
sequences, e.g., manufacturer adapter or primer
sequences. \Rfunction{srdistance} reports the edit distance between
each read and a reference sequence. \Rfunction{srdistance} is
implemented to work efficiently for reference sequences whose length
is of the same order as the reads themselves (10's to 100's of bases).
To find reads close to the most common read in the example above, one
might say
<<srdistance>>=
dist <- srdistance(sread(fq), names(tbls$top)[1])[[1]]
table(dist)[1:10]
@ 
%% 
`Near' matches can be filtered, e.g.,
<<aln-not-near>>=
fqSubset <- fq[dist>4]
@

A different strategy can be used to tally or eliminate reads that consist
predominantly of a single nucleotide. \Rfunction{alphabetFrequency}
calculates the frequency of each nucleotide (in DNA strings) or letter
(for other string sets) in each read. Thus one could identify and
eliminate reads with more than 30 adenine nucleotides with
<<polya>>=
countA <- alphabetFrequency(sread(fq))[,"A"] 
fqNoPolyA <- fq[countA < 30]
@ 
%% 
\Rfunction{alphabetFrequency}, which simply counts nucleotides, is
much faster than \Rfunction{srdistance}, which performs full pairwise
alignment of each read to the subject.

Users wanting to use \R{} for whole-genome alignments or more flexible
pairwise aligment are encouraged to investigate the
\Rpackage{Biostrings} package, especially the \Rclass{PDict} class and
\Rfunction{matchPDict} and \Rfunction{pairwiseAlignment} functions.

\section{Legacy support for early file formats}

The \Biocpkg{ShortRead} package contains functions and classes to
support early file formats and ungapped alignments. Help pages are
flagged as `legacy'; versions of \Biocpkg{ShortRead} prior to 1.21
(\Bioconductor{} version 2.13) contain a vignette illustrating common
work flows with these file formats.

%---------------------------------------------------------
% SessionInfo
%---------------------------------------------------------
\section{sessionInfo}
<<sessionInfo, results=tex, print=TRUE>>=
toLatex(sessionInfo())
@ 

\end{document}