This file is indexed.

/usr/lib/R/site-library/genefilter/doc/howtogenefilter.Rnw is in r-bioc-genefilter 1.60.0-1.

This file is owned by root:root, with mode 0o644.

The actual contents of the file can be viewed below.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
%
% NOTE -- ONLY EDIT howtogenefilter.Rnw!!!
% howtogenefilter.tex file will get overwritten.
%
%\VignetteIndexEntry{Using the genefilter function to filter genes from a microarray dataset}
%\VignetteDepends{Biobase, genefilter, class}
%\VignetteKeywords{Expression Analysis}
%\VignettePackage{genefilter}
\documentclass{article}

\usepackage{hyperref}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\classdef}[1]{%
  {\em #1}
}

\begin{document}
\title{Using the genefilter function to filter genes from a microarray dataset}

\maketitle

\section*{Introduction}

The {\em genefilter} package can be used to filter (select) genes from
a microarray dataset according to a variety of different
filtering mechanisms.
Here, we will consider the example dataset
in the \verb+sample.ExpressionSet+ example from the {\em Biobase} package.
This experiment has 26 samples, and there are 500 genes and 3
covariates. The covariates are named \verb+sex+, \verb+type+ and
\verb+score+. The first two have two levels and the last one is
continuous.

<<>>=
library("Biobase")
library("genefilter")
data(sample.ExpressionSet)
varLabels(sample.ExpressionSet)
table(sample.ExpressionSet$sex)
table(sample.ExpressionSet$type)
@
%$

One dichotomy that can be of interest for subsequent analyses is whether the filter is
\emph{specific} or \emph{non-specific}. Here, specific means that we are
filtering with reference to sample metadata, for example, \texttt{type}. For example, if
we want to select genes that are differentially expressed in the two
groups defined by \texttt{type}, that is a specific filter.
If on the other hand we want to select genes that are expressed in more
than 5 samples, that is an example of a non--specific filter.

First, let us see how to perform a non--specific filter.
Suppose we want to select genes that have an expression measure above 200 in at
least 5 samples. To do that we use the function \verb+kOverA+.

There are three steps that must be performed.
\begin{enumerate}
\item Create function(s) implementing the filtering criteria.
\item Assemble it (them) into a (combined) filtering function.
\item Apply the filtering function to the expression matrix.
\end{enumerate}

<<>>=
f1 <- kOverA(5, 200)
ffun <- filterfun(f1)
wh1 <- genefilter(exprs(sample.ExpressionSet), ffun)
sum(wh1)
@

Here \verb+f1+ is a function that implies our ``expression measure above 200 in at
least 5 samples'' criterion, the function \verb+ffun+ is the filtering
function (which in this case consists of only one criterion), and we apply it using \verb+genefilter+.
There were \Sexpr{sum(wh1)} genes that satisfied the criterion and passed the filter.

As an example for a specific filter, let us select genes that are differentially
expressed in the groups defined by \verb+type+.

<<>>=
f2 <- ttest(sample.ExpressionSet$type, p=0.1)
wh2 <- genefilter(exprs(sample.ExpressionSet), filterfun(f2))
sum(wh2)
@
%$
Here, \texttt{ttest} is a function from the \texttt{genefilter}
package which provides a suitable wrapper around \texttt{t.test} from
package \textit{stats}. Now we see that there are \Sexpr{sum(wh2)}
genes that satisfy the selection criterion. 

Suppose that we want to combine the two filters. We want those genes
for which at least 5 have an expression measure over 200 \emph{and} which also are differentially
expressed between the groups defined by \verb+type+.

<<>>=
ffun_combined <- filterfun(f1, f2)
wh3 <- genefilter(exprs(sample.ExpressionSet), ffun_combined)
sum(wh3)
@

Now we see that there are only \Sexpr{sum(wh3)} genes  that satisfy both conditions.

%%FIXME: need to replace this with something else
%Our last example is to select genes that are
%differentially expressed in at least one of the three groups defined
%by \verb+cov3+.
%To do that we use an Anova filter. This filter uses an analysis of
%variance appraoch (via the \verb+lm+) function to test the hypothesis
%that at least one of the three group means is different from the other
%%two. The test is applied, then the $p$--value computed. We select
%those genes that have a low $p$--value.
%
%<<>>=
%Afilter <- Anova(eset$cov3)
%aff <- filterfun(Afilter)
%wh4 <- genefilter(exprs(eset), aff)
%sum(wh4)
%
%@
%%$
%We see that there are 14 genes that pass this filter and that are
%candidates for further exploration.


\section*{Selecting genes that appear useful for prediction}

The function \texttt{knnCV} defined below performs $k$--nearest neighbour classification
using leave--one--out cross--validation.
At the same time it aggregates the genes that were selected. The
function returns the predicted classifications as its returned
value. However, there is an additional side effect. The number of
times that each gene was used (provided it was at least one) are
recorded and stored in the environment of the aggregator \verb+Agg+.
These can subsequently be retrieved and used for other purposes.

<<aggregate>>=

 knnCV <- function(EXPR, selectfun, cov, Agg, pselect = 0.01, Scale=FALSE) {
   nc <- ncol(EXPR)
   outvals <- rep(NA, nc)
   for(i in 1:nc) {
      v1 <- EXPR[,i]
      expr <- EXPR[,-i]
      glist <- selectfun(expr, cov[-i], p=pselect)
      expr <- expr[glist,]
      if( Scale ) {
        expr <- scale(expr)
        v1 <- as.vector(scale(v1[glist]))
      }
      else
         v1 <- v1[glist]
      out <- paste("iter ",i, " num genes= ", sum(glist), sep="")
      print(out)
      Aggregate(row.names(expr), Agg)
      if( length(v1) == 1)
         outvals[i] <- knn(expr, v1, cov[-i], k=5)
      else
          outvals[i] <- knn(t(expr), v1, cov[-i], k=5)
    }
    return(outvals)
  }
@
%$

<<aggregate>>=
 gfun <- function(expr, cov, p=0.05) {
    f2 <- ttest(cov, p=p)
    ffun <- filterfun(f2)
    which <- genefilter(expr, ffun)
  }

@

Next we show how to use this function on the dataset
\verb+geneData+.

<<aggregate, results=hide>>=
  library("class")

  ##scale the genes
  ##genescale is a slightly more flexible "scale"
  ##work on a subset -- for speed only
  geneData <- genescale(exprs(sample.ExpressionSet)[1:75,], 1)

  Agg <- new("aggregator")

  testcase <- knnCV(geneData, gfun, sample.ExpressionSet$type, 
         Agg, pselect=0.05)
@ 
<<aggregate>>=
sort(sapply(aggenv(Agg), c), decreasing=TRUE)
@
%$
The environment \verb+Agg+ contains, for each gene,
the number of times it was selected in the cross-validation.


\section*{Session Information}

The version number of R and packages loaded for generating the vignette were:

<<echo=FALSE,results=tex>>=
toLatex(sessionInfo())
@

\end{document}