This file is indexed.

/usr/lib/R/site-library/Biobase/doc/ExpressionSetIntroduction.Rnw is in r-bioc-biobase 2.14.0-1.

This file is owned by root:root, with mode 0o644.

The actual contents of the file can be viewed below.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
%\VignetteDepends{Biobase}
%\VignetteIndexEntry{An introduction to Biobase and ExpressionSets}
%\VignetteKeywords{tutorial, environment, graphics, ExpressionSet}
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[12pt]{article}
\usepackage{amsmath,fullpage}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}
\usepackage{theorem}
\usepackage{float}
\usepackage{ifthen}

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\newcommand{\R}{{\textsf{R}}}
\newcommand{\code}[1]{{\texttt{#1}}}
\newcommand{\term}[1]{{\emph{#1}}}
\newcommand{\Rpackage}[1]{\textsf{#1}}
\newcommand{\Rfunction}[1]{\texttt{#1}}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\textit{#1}}}

%% Excercises and Questions
\usepackage{theorem}
\theoremstyle{break} \newtheorem{Ex}{Exercise}
\theoremstyle{break} \newtheorem{Q}{Question}
%% And solution or answer
\newenvironment{solution}{%
  \begin{center}
    \begin{minipage}{0.75\textwidth}
      %%     \color{blue}
    }{
    \end{minipage}
  \end{center}
  \bigskip%
}

\bibliographystyle{plainnat}
\title{An Introduction to Bioconductor's \Rclass{ExpressionSet} Class}

\author{Seth Falcon, Martin Morgan, and Robert Gentleman}
\date{6 October, 2006; revised  9 February, 2007}

\begin{document}

<<init, echo=FALSE>>=
options(width=65)
@ 

\maketitle

\section{Introduction}
\Rpackage{Biobase} is part of the Bioconductor project, and is used by
many other packages. \Rpackage{Biobase} contains standardized data
structures to represent genomic data. The \Rclass{ExpressionSet} class
is designed to combine several different sources of information into a
single convenient structure. An \Rclass{ExpressionSet} can be
manipulated (e.g., subsetted, copied) conveniently, and is the input
or output from many Bioconductor functions.

The data in an \Rclass{ExpressionSet} is complicated, consisting of
expression data from microarray experiments (\code{assayData};
\code{assayData} is used to hint at the methods used to access
different data components, as we will see below), `meta-data'
describing samples in the experiment (\code{phenoData}), annotations
and meta-data about the features on the chip or technology used for
the experiment (\code{featureData}, \code{annotation}), information
related to the protocol used for processing each sample (and usually
extracted from manufacturer files, \code{protocolData}), and a
flexible structure to describe the experiment
(\code{experimentData}). The \Rclass{ExpressionSet} class coordinates
all of this data, so that you do not usually have to worry about the
details. However, an \Rclass{ExpressionSet} needs to be created in the
first place, and creation can be complicated.

In this introduction we learn how to create and manipulate
\Rclass{ExpressionSet} objects, and practice some basic \R{} skills.

\section{Preliminaries}

\subsection{Installing Packages}

If you are reading this document and have not yet installed any
software on your computer, visit \url{http://bioconductor.org} and
follow the instructions for installing \R{} and Bioconductor.  Once
you have installed \R{} and Bioconductor, you are ready to go with
this document. In the future, you might find that you need to install
one or more additional packages. The best way to do this is to start
an \R{} session and evaluate commands like
<<install-pkg, eval=false>>=
source("http://bioconductor.org/biocLite.R")
biocLite(c("Biobase"))
@ 
%% 

\subsection{Loading Packages}

The definition of the \Rclass{ExpressionSet} class along with many
methods for manipulating \Rclass{ExpressionSet} objects are defined in
the \Rpackage{Biobase} package.  In general, you need to load class
and method definitions before you use them.  When using Bioconductor,
this means loading \R{} packages using \Rfunction{library} or
\Rfunction{require}.

<<loadlib, results=hide>>=
library("Biobase")
@ 

\begin{Ex}
What happens when you try to load a package that is not installed?
\end{Ex}
\begin{solution}
When using \Rfunction{library}, you get an error message.  With
\Rfunction{require}, the return value is \Robject{FALSE} and a warning
is printed.
\end{solution}

\section{Building an ExpressionSet From .CEL and other files}

Many users have access to .CEL or other files produced by microarray
chip manufacturer hardware. Usually the strategy is to use a
Bioconductor package such as \Rpackage{affyPLM}, \Rpackage{affy},
\Rpackage{oligo}, or \Rpackage{limma}, to read these files.
These Bioconductor packages have functions (e.g.,
\Rfunction{ReadAffy}, \Rfunction{expresso}, or \Rfunction{justRMA} in
\Rpackage{affy}) to read CEL files and perform preliminary
preprocessing, and to represent the resulting data as an
\Rclass{ExpressionSet} or other type of object.  Suppose the result
from reading and preprocessing CEL or other files is named
\Robject{object}, and \Robject{object} is different from
\Rclass{ExpressionSet}; a good bet is to try, e.g.,
<<convert,eval=FALSE>>=
library(convert)
as(object, "ExpressionSet")
@ 
%% 
It might be the case that no converter is available. The path then is
to extract relevant data from \Robject{object} and use this to create
an \Rclass{ExpressionSet} using the instructions below.

\section{Building an ExpressionSet From Scratch}

As mentioned in the introduction, the data from many high-throughput
genomic experiments, such as microarray experiments, usually consist
of several conceptually distinct parts: assay data, phenotypic
meta-data, feature annotations and meta-data, and a description of the
experiment. We'll construct each of these components, and then
assemble them into an \Rclass{ExpressionSet}.

\subsection{Assay data}

One important part of the experiment is a matrix of `expression'
values.  The values are usually derived from microarrays of one sort
or another, perhaps after initial processing by manufacturer software
or Bioconductor packages. The matrix has $F$ rows and $S$ columns,
where $F$ is the number of features on the chip and $S$ is the number
of samples.

A likely scenario is that your assay data is in a 'tab-delimited' text
file (as exported from a spreadsheet, for instance) with rows
corresponding to features and columns to samples.  The strategy is to
read this file into \R{} using the \Rfunction{read.table} command,
converting the result to a \Rclass{matrix}. A typical command
to read
a tab-delimited file that includes column `headers' is
<<read-table-geneData>>=
dataDirectory <- system.file("extdata", package="Biobase")
exprsFile <- file.path(dataDirectory, "exprsData.txt")
exprs <- as.matrix(read.table(exprsFile, header=TRUE, sep="\t",
                              row.names=1,
                              as.is=TRUE))
@ 
%% 
The first two lines create a file path pointing to where the assay
data is stored; replace these with a character string pointing to your
own file, e.g,
<<exprsFile,eval=FALSE>>=
exprsFile <- "c:/path/to/exprsData.txt"
@ 
%% 
(Windows users: note the use of \verb+/+ rather than \verb+\+; this
is because \R{} treats the \verb+\+ character as an `escape' sequence to
change the meaning of the subsequent character). See the help pages
for \Rfunction{read.table} for more detail.  A common variant is that
the character separating columns is a comma (``comma-separated
values'', or ``csv'' files), in which case the \Robject{sep} argument
might be \Robject{sep=","}.

It is always important to verify that the data you have read matches
your expectations. At a minimum, check the class and dimensions of
\Robject{geneData} and take a peak at the first several rows
<<geneData-peak>>=
class(exprs)
dim(exprs)
colnames(exprs)
head(exprs[,1:5])
@ 
%% 

At this point, we can create a minimal \Rclass{ExpressionSet} object using the
\code{ExpressionSet} constructor:
<<ExpressionSet-basic>>=
minimalSet <- ExpressionSet(assayData=exprs)
@ 
% 
We'll get more benefit from expression sets by creating a richer
object that coordinates phenotypic and other data with our expression
data, as outlined in the following sections.

\subsection{Phenotypic data}

Phenotypic data summarizes information about the samples (e.g., sex,
age, and treatment status; referred to as `covariates').  The
information describing the samples can be represented as a table with
$S$ rows and $V$ columns, where $V$ is the number of covariates.
An example of phenotypic data can be input with
<<pData>>=
pDataFile <- file.path(dataDirectory, "pData.txt")
pData <- read.table(pDataFile,
                    row.names=1, header=TRUE, sep="\t")
dim(pData)
rownames(pData)
summary(pData)
@ 
%% 
There are three columns of data, and 26 rows. Note that the number of
rows of phenotypic data match the number of columns of expression
data, and indeed that the row and column names are identically
ordered:
<<geneCovariate-geneData-name-match>>=
all(rownames(pData)==colnames(exprs))
@ 
%% 
This is an essential feature of the relationship between the
assay and phenotype data; \Rclass{ExpressionSet} will complain if
these names do not match.

Phenotypic data can take on a number of different forms. For instance,
some covariates might reasonably be represented as numeric values.
Other covariates (e.g., gender, tissue type, or cancer status) might
better be represented as \Robject{factor} objects (see the help page
for \Robject{factor} for more information). It is especially important
that the phenotypic data are encoded correctly; the \Robject{colClasses}
argument to \Rfunction{read.table} can be helpful in correctly
inputing (and ignoring, if desired) columns from the file.

\begin{Ex}
  What class does \Rfunction{read.table} return?
\end{Ex}

\begin{Ex}
  Determine the column names of \Robject{pData}.  Hint:
  \Robject{apropos("name")}.
\end{Ex}
\begin{solution}
<<colnames>>=
names(pData)
@ 
\end{solution}

\begin{Ex}
  Use \Rfunction{sapply} to determine the classes of each column of
  \Robject{pData}.  Hint: read the help page for \Rfunction{sapply}.
\end{Ex}
\begin{solution}
<<sapplyClasses>>=
sapply(pData, class)
@ 
\end{solution}

\begin{Ex}
  What is the sex and Case/Control status of the 15th and 20th
  samples?  And for the sample(s) with \Robject{score} greater than
  $0.8$.
\end{Ex}
\begin{solution}
<<simpleSubsetting>>=
pData[c(15, 20), c("gender", "type")]
pData[pData$score>0.8,]
@ 
\end{solution}

Investigators often find that the meaning of simple column names does
not provide enough information about the covariate -- What is the
cryptic name supposed to represent? What units are the covariates
measured in? We can create a data frame containing such meta-data (or
read the information from a file using \Rfunction{read.table}) with
<<metadata-create>>=
metadata <- data.frame(labelDescription=
                       c("Patient gender", 
                         "Case/control status", 
                         "Tumor progress on XYZ scale"),
                       row.names=c("gender", "type", "score"))
@ 
%% 
This creates a \Rclass{data.frame} object with a single column called
\code{labelDescription}, and with row names identical to the column
names of the \Rclass{data.frame} containing the phenotypic data. The
column \Robject{labelDescription} \emph{must} be present; other columns
are optional.

Bioconductor's \Rpackage{Biobase} package provides a class called
\Rclass{AnnotatedDataFrame} that conveniently stores and manipulates
the phenotypic data and its metadata in a coordinated fashion. Create
and view an \Rclass{AnnotatedDataFrame} instance with:
<<AnnotatedDataFrame>>=
phenoData <- new("AnnotatedDataFrame", 
                 data=pData, varMetadata=metadata)
phenoData
@ 
%% 
Some useful operations on an \Rclass{AnnotatedDataFrame} include
\Rmethod{sampleNames}, \Rmethod{pData} (to extract the original
\Robject{pData} \Rclass{data.frame}), and \Rmethod{varMetadata}. In addition,
\Rclass{AnnotatedDataFrame} objects can be subset much like a
\Rclass{data.frame}:
<<AnnotatedDataFrame-subset>>=
head(pData(phenoData))
phenoData[c("A","Z"),"gender"]
pData(phenoData[phenoData$score>0.8,])
@ 
%% 

\subsection{Annotations and feature data}

Meta-data on features is as important as meta-data on samples, and can
be very large and diverse.  A single chip design (i.e., collection of
features) is likely to be used in many different experiments, and it
would be inefficient to repeatedly collect and coordinate the same
meta-data for each \Rclass{ExpressionSet} instance. Instead, the ideas
is to construct specialized meta-data packages for each type of chip
or instrument.  Many of these packages are available from the
Bioconductor web site.  These packages contain information such as the
gene name, symbol and chromosomal location. There are other meta-data
packages that contain the information that is provided by other
initiatives such as GO and KEGG.  The \Rpackage{annotate} and
\Rpackage{AnnotationDbi} packages provides basic data manipulation
tools for the meta-data packages.

The appropriate way to create annotation data for features is very
straight-forward: we provide a character string identifying the type of
chip used in the experiment. For instance, the data we are using is
from the Affymetrix hgu95av2 chip:
<<annotation>>=
annotation <- "hgu95av2"
@ 
%% 
It is also possible to record information about features that are
unique to the experiment (e.g., flagging particularly relevant
features).  This is done by creating or modifying an
\Robject{AnnotatedDataFrame} like that for \Robject{phenoData} but
with row names of the \Rclass{AnnotatedDataFrame} matching rows of the
assay data.

\subsection{Experiment description}

Basic description about the experiment (e.g., the investigator or lab
where the experiment was done, an overall title, and other notes) can
be recorded by creating a \Rclass{MIAME} object. One way to create a
\Rclass{MIAME} object is to use the \Rfunction{new} function:
<<R.MIAME>>=
experimentData <- new("MIAME",
  name="Pierre Fermat",
  lab="Francis Galton Lab",
  contact="pfermat@lab.not.exist",
  title="Smoking-Cancer Experiment",
  abstract="An example ExpressionSet",
  url="www.lab.not.exist",
  other=list(
    notes="Created from text files"
  ))
@ 
%% 
Usually, \Rfunction{new} takes as arguments  the class name and
pairs of names and values corresponding to different slots in the
class; consult the help page for \Rclass{MIAME} for details of
available slots.

\subsection{Assembling an \Rclass{ExpressionSet}}

An \Rclass{ExpressionSet} object is created by assembling its component
parts and callng the \code{ExpressionSet} constructor:
<<ExpressionSetFinally>>=
exampleSet <- ExpressionSet(assayData=exprs, 
                  phenoData=phenoData, 
                  experimentData=experimentData,
                  annotation="hgu95av2")
@ 
%% 
Note that the names on the right of each equal sign can refer to any
object of appropriate class for the argument. See the help page for
\Rclass{ExpressionSet} for more information.  

We created a rich data object to coordinate diverse sources of
information. Less rich objects can be created by providing less
information. As mentioned earlier, a minimal expression set can be
created with
<<ExpressionSet-minimal>>=
minimalSet <- ExpressionSet(assayData=exprs)
@ 
%% 
Of course this object has no information about phenotypic or feature
data, or about the chip used for the assay.

\section{\Rclass{ExpressionSet} Basics}

Now that you have an \Rclass{ExpressionSet} instance, let's explore
some of the basic operations.  You can get an overview of the
structure and available methods for \Rclass{ExpressionSet} objects by
reading the help page:
<<helpExpressionSet, eval=FALSE>>=
help("ExpressionSet-class")
@ 

When you print an \Rclass{ExpressionSet} object, a brief summary of
the contents of the object is displayed (displaying the entire object
would fill your screen with numbers):
<<showExpressionSet>>=
exampleSet
@ 

\subsection{Accessing Data Elements}

A number of accessor functions are available to extract data from an
\Rclass{ExpressionSet} instance.
You can access the columns of the phenotype data (an
\Rclass{AnnotatedDataFrame} instance) using \verb+$+:
<<usingDollar>>=
exampleSet$gender[1:5]
exampleSet$gender[1:5] == "Female"
@ 
%% 
You can retrieve the names of the features using
\Rfunction{featureNames}.  For many microarray datasets, the feature
names are the probe set identifiers.
<<featureNames>>=
featureNames(exampleSet)[1:5]
@ 
%% 
The unique identifiers of the samples in the data set are available
via the \Rfunction{sampleNames} method.  The \Rfunction{varLabels}
method lists the column names of the phenotype data:
<<sampleNames>>=
sampleNames(exampleSet)[1:5]
varLabels(exampleSet)
@ 
%% 
Extract the expression matrix and the \Rclass{AnnotatedDataFrame} of
sample information using \Rfunction{exprs} and \Rfunction{phenoData},
respectively:
<<exprs>>=
mat <- exprs(exampleSet)
dim(mat)
adf <- phenoData(exampleSet)
adf
@ 

\subsubsection{Subsetting}

Probably the most useful operation to perform on
\Rclass{ExpressionSet} objects is subsetting.  Subsetting an
\Rclass{ExpressionSet} is very similar to subsetting the expression
matrix that is contained within the \Rclass{ExpressionSet}, the first
argument subsets the features and the second argument subsets the
samples.  Here are some examples: Create a new \Rclass{ExpressionSet}
consisting of the 5 features and the first 3 samples:
<<first10>>=
vv <- exampleSet[1:5, 1:3]
dim(vv)
featureNames(vv)
sampleNames(vv)
@ 
%% 
Create a subset consisting of only the male samples:
<<males>>=
males <- exampleSet[ , exampleSet$gender == "Male"]
males
@ 

\section{What was used to create this document}

The version number of \R{} and the packages and their versions that were
used to generate this document are listed below.

<<echo=FALSE,results=tex>>=
toLatex(sessionInfo())
@
 
\end{document}