% Source: /usr/lib/R/site-library/gdata/doc/unknown.Rnw (r-cran-gdata 2.13.3-1)
%\VignetteIndexEntry{Working with Unknown Values}
%\VignettePackage{gdata}
%\VignetteKeywords{unknown, missing, manip}

\documentclass[a4paper]{report}
\usepackage{Rnews}
\usepackage[round]{natbib}
\bibliographystyle{abbrvnat}

\usepackage{Sweave}
\SweaveOpts{strip.white=all, keep.source=TRUE}

\begin{document}

\begin{article}

\title{Working with Unknown Values}
\subtitle{The \pkg{gdata} package}
\author{by Gregor Gorjanc}

\maketitle

This vignette has been published as \cite{Gorjanc}.

\section{Introduction}

Unknown or missing values can be represented in various ways. For example,
SAS uses \code{.}~(dot), while \R{} uses \code{NA}, which we can read as
Not Available. When we import data into \R{}, say via \code{read.table} or
its derivatives, conversion of blank fields to \code{NA} (according to
\code{read.table} help) is done for \code{logical}, \code{integer},
\code{numeric} and \code{complex} classes. Additionally, the
\code{na.strings} argument can be used to specify values that should also
be converted to \code{NA}. Conversely, the argument \code{na} in
\code{write.table} and its derivatives defines the value that will replace
\code{NA} in exported data. There are also other ways to import data into
and export data from \R{}, as described in the \emph{R Data Import/Export}
manual \citep{RImportExportManual}. However, none of these approaches
makes it possible to define unknown value(s) for a particular column, even
though an unknown value in one column may be a valid value in another.
For example, I have seen many datasets where values such as 0, -9, 999 and
specific dates are used as column-specific unknown values.
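
The limitation can be seen in a small base \R{} sketch (the data and the
object names \code{txt}, \code{dat} and \code{out} are made up for
illustration). Here \code{-9} is meant to mark an unknown \code{weight},
but it is a valid \code{score}; \code{na.strings} nevertheless converts it
in both columns, and \code{na} in \code{write.table} writes one common
value for every \code{NA} on export:

<<sketch01>>=
## Illustrative data: -9 is unknown in 'weight' but valid in 'score'
txt <- "id weight score
1 -9 10
2 71 -9
3 NA 12"
dat <- read.table(text=txt, header=TRUE,
                  na.strings=c("NA", "-9"))
## na.strings applies to all columns, so the valid score is lost too
is.na(dat)
## On export, every NA is replaced by the same value
out <- capture.output(write.table(dat, na="-9", quote=FALSE,
                                  row.names=FALSE))
out
@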

This note describes a set of functions in package \pkg{gdata}\footnote{
  package version 2.3.1} \citep{WarnesGdata}: \code{isUnknown},
\code{unknownToNA} and \code{NAToUnknown}, which can help with testing for
unknown values and conversions between unknown values and \code{NA}. All
three functions are generic (S3) and were tested (at the time of writing)
to work with: \code{integer}, \code{numeric}, \code{character},
\code{factor}, \code{Date}, \code{POSIXct}, \code{POSIXlt}, \code{list},
\code{data.frame} and \code{matrix} classes.

\section{Description with examples}

The following examples show simple usage of these functions on
\code{numeric} and \code{factor} classes, where the value \code{0} (besides
\code{NA}) should be treated as an unknown value:

<<ex01>>=
library("gdata")
xNum <- c(0, 6, 0, 7, 8, 9, NA)
isUnknown(x=xNum)
@

The default unknown value in \code{isUnknown} is \code{NA}, which means
that the output is the same as that of \code{is.na}, at least for atomic
classes. However, we can pass the argument \code{unknown} to define which
values should be treated as unknown:

<<ex02>>=
isUnknown(x=xNum, unknown=0)
@

This call does not flag \code{NA}, because \code{NA} is no longer among the
unknown values. We get the expected answer by adding \code{NA} to the
argument \code{unknown}:

<<ex03>>=
isUnknown(x=xNum, unknown=c(0, NA))
@
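
For atomic vectors these tests correspond roughly to matching with
\code{\%in\%} in base \R{}. The following is only a sketch of that analogy
(the objects \code{v}, \code{onlyZero}, \code{zeroOrNA} and \code{v2} are
illustrative), not the code used inside \code{isUnknown}:

<<sketch02>>=
v <- c(0, 6, 0, 7, 8, 9, NA)
## Only 0 is matched; NA gives FALSE
(onlyZero <- v %in% 0)
## Adding NA to the set of unknown values also matches NA
(zeroOrNA <- v %in% c(0, NA))
## Replacing the matched positions turns the unknown values into NA
v2 <- v
v2[onlyZero] <- NA
v2
@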

Now, we can change all unknown values to \code{NA} with \code{unknownToNA}.
There is clearly no need to add \code{NA} here. This step is very handy
after importing data from an external source, where many different unknown
values might be used. The argument \code{warning=TRUE} can be used if there
is a need to be warned about ``original'' \code{NA}s:

<<ex04>>=
(xNum2 <- unknownToNA(x=xNum, unknown=0))
@

Prior to export from \R{}, we might want to change unknown values
(\code{NA} in \R{}) to some other value. Function \code{NAToUnknown} can be
used for this:

<<ex05>>=
NAToUnknown(x=xNum2, unknown=999)
@

Converting \code{NA} to a value that already exists in \code{x} issues an
error, but \code{force=TRUE} can be used to overcome this if needed. Be
warned, however, that there is no way back from this step:

<<ex06>>=
NAToUnknown(x=xNum2, unknown=7, force=TRUE)
@
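
Why there is no way back can be seen with a base \R{} sketch of the same
kind of replacement (the objects \code{w} and \code{wasNA} are
illustrative). Once \code{NA} is replaced by a value that is already
present, the former \code{NA}s cannot be told apart from genuine
observations unless their positions were recorded beforehand:

<<sketch03>>=
w <- c(NA, 6, NA, 7, 8, 9, NA)
## Record where the NAs are before they are overwritten
wasNA <- is.na(w)
w[wasNA] <- 7
w
## Positions holding 7 now mix former NAs with the genuine 7
which(w == 7)
which(wasNA)
@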

The examples below show the peculiarities of class \code{factor}.
\code{unknownToNA} removes the \code{unknown} value from the levels and,
conversely, \code{NAToUnknown} adds it with a warning. Additionally,
\code{"NA"} is properly distinguished from \code{NA}. It can also be seen
that the argument \code{unknown} in functions \code{isUnknown} and
\code{unknownToNA} need not match the class of \code{x} (otherwise a
factor would have to be used), as the test is done internally with
\code{\%in\%}, which nicely resolves coercion issues.

<<ex07>>=
(xFac <- factor(c(0, "BA", "RA", "BA", NA, "NA")))
isUnknown(x=xFac)
isUnknown(x=xFac, unknown=0)
isUnknown(x=xFac, unknown=c(0, NA))
isUnknown(x=xFac, unknown=c(0, "NA"))
isUnknown(x=xFac, unknown=c(0, "NA", NA))

(xFac <- unknownToNA(x=xFac, unknown=0))
(xFac <- NAToUnknown(x=xFac, unknown=0))
@
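
The coercion behaviour of \code{\%in\%} can be checked directly in base
\R{}. This sketch uses an illustrative factor \code{f} and is not the
package code. It shows that the numeric \code{0} matches the level
\code{"0"}, and that the string \code{"NA"} and a missing value are matched
separately. The last lines mimic the removal of the unknown level, here
with \code{droplevels}:

<<sketch04>>=
f <- factor(c(0, "BA", "RA", "BA", NA, "NA"))
## The factor is compared as character, so 0 matches level "0"
(isZero <- f %in% 0)
## "NA" is an ordinary level, NA is a missing value
(isStringNA <- f %in% "NA")
(isMissing <- f %in% NA)
## Set the matches to NA and drop the level that is no longer used
f[isZero] <- NA
levels(droplevels(f))
@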

These two examples with classes \code{numeric} and \code{factor} are fairly
simple and we could get the same results with one or two lines of \R{}
code. The real benefit of the set of functions presented here is in
\code{list} and \code{data.frame} methods, where \code{data.frame} methods
are merely wrappers for \code{list} methods.

We need additional flexibility for \code{list}/\code{data.frame} methods,
due to possibly having multiple unknown values that can be different among
\code{list} components or \code{data.frame} columns. For these two methods,
the argument \code{unknown} can be either a \code{vector} or \code{list},
both possibly named. Of course, greater flexibility (defining multiple
unknown values per component/column) can be achieved with a \code{list}.

When a \code{vector}/\code{list} object passed to the argument
\code{unknown} is not named, the first value/component of the
\code{vector}/\code{list} matches the first component/column of the
\code{list}/\code{data.frame}. This can be quite error prone, especially
with a \code{vector}. Therefore, I encourage the use of a \code{list}. If
the \code{vector}/\code{list} passed to the argument \code{unknown} is
named, its names are matched to the names of the \code{list} or
\code{data.frame}. If the lengths of \code{unknown} and the \code{list} or
\code{data.frame} do not match, recycling occurs.

The example below illustrates the application of the described functions to
a list composed of the previously defined and modified numeric
(\code{xNum}) and factor (\code{xFac}) objects. First, function
\code{isUnknown} is used with \code{0} as an unknown value. Note that we
get \code{FALSE} for \code{NA}s, as was the case when \code{unknown=0} was
used with the numeric vector above.

<<ex08>>=
(xList <- list(a=xNum, b=xFac))
isUnknown(x=xList, unknown=0)
@

We need to add \code{NA} as an unknown value. However, we do not get the
expected result this way!

<<ex09>>=
isUnknown(x=xList, unknown=c(0, NA))
@

This is due to matching of values in the argument \code{unknown} and
components in a \code{list}; i.e., \code{0} is used for component \code{a}
and \code{NA} for component \code{b}.  Therefore, it is less error prone
and more flexible to pass a \code{list} (preferably a named list) to the
argument \code{unknown}, as shown below.

<<ex10>>=
(xList1 <- unknownToNA(x=xList,
                       unknown=list(b=c(0, "NA"),
                                    a=0)))
@
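
The difference between positional and named matching can be mimicked in
base \R{} with \code{Map}, which walks over the components and the unknown
values in parallel. This is a sketch with illustrative objects
(\code{lst}, \code{unk}, \code{byPosition} and \code{byName}), not the
implementation of the \code{list} method:

<<sketch05>>=
lst <- list(a=c(0, 6, NA), b=c(0, 7, NA))
## Unnamed vector: 0 is paired with 'a' and NA with 'b'
byPosition <- Map(`%in%`, lst, c(0, NA))
byPosition
## Named list: reorder by component name, so each component
## gets its own set of unknown values
unk <- list(b=c(0, NA), a=c(0, NA))
byName <- Map(`%in%`, lst, unk[names(lst)])
byName
@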

Changing \code{NA}s to some other value (only one per component/column) can
be accomplished as follows:

<<ex11>>=
NAToUnknown(x=xList1,
            unknown=list(b="no", a=0))
@

A named component \code{.default} of a \code{list} passed to the argument
\code{unknown} has a special meaning: it matches a component/column with
that name and any other component/column not defined in \code{unknown}. As
such, it is very useful if the number of components/columns with the same
unknown value(s) is large. Consider a wide \code{data.frame} named
\code{df}. Now \code{.default} can be used to define the unknown value for
several columns:

<<ex12, echo=FALSE>>=
df <- data.frame(col1=c(0, 1, 999, 2),
                 col2=c("a", "b", "c", "unknown"),
                 col3=c(0, 1, 2, 3),
                 col4=c(0, 1, 2, 2))
@

<<ex13>>=
tmp <- list(.default=0,
            col1=999,
            col2="unknown")
(df2 <- unknownToNA(x=df,
                    unknown=tmp))
@
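
The fallback role of \code{.default} can be sketched in base \R{} as a
lookup by column name that falls back to \code{.default} when no entry of
that name exists. The helper \code{pickUnknown} and the objects
\code{wide}, \code{wide2} and \code{unkWide} are hypothetical and only
illustrate the idea; the package may resolve names differently:

<<sketch06>>=
wide <- data.frame(col1=c(0, 1, 999, 2),
                   col2=c("a", "b", "c", "unknown"),
                   col3=c(0, 1, 2, 3),
                   col4=c(0, 1, 2, 2))
unkWide <- list(.default=0, col1=999, col2="unknown")
## Hypothetical helper: use the entry named after the column if
## there is one, otherwise fall back to the .default entry
pickUnknown <- function(name, unknown) {
  if (name %in% names(unknown)) unknown[[name]] else unknown[[".default"]]
}
wide2 <- wide
for (nm in names(wide2)) {
  hit <- wide2[[nm]] %in% pickUnknown(nm, unkWide)
  wide2[[nm]][hit] <- NA
}
wide2
@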

If there is a need to work only on some components/columns, you can of
course ``skip'' the others with standard \R{} mechanisms, i.e.,
by subsetting \code{list} or \code{data.frame} objects:

<<ex14>>=
df2 <- df
cols <- c("col1", "col2")
tmp <- list(col1=999,
            col2="unknown")
df2[, cols] <- unknownToNA(x=df[, cols],
                           unknown=tmp)
df2
@

\section{Summary}

Functions \code{isUnknown}, \code{unknownToNA} and \code{NAToUnknown}
provide a useful interface to work with various representations of
unknown/missing values. Their use is meant primarily for shaping the data
after importing to or before exporting from \R{}. I welcome any comments or
suggestions.

% \bibliography{refs}

\begin{thebibliography}{1}
\providecommand{\natexlab}[1]{#1}
\providecommand{\url}[1]{\texttt{#1}}
\expandafter\ifx\csname urlstyle\endcsname\relax
  \providecommand{\doi}[1]{doi: #1}\else
  \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi

\bibitem[Gorjanc(2007)]{Gorjanc}
G.~Gorjanc.
\newblock Working with unknown values: the gdata package.
\newblock \emph{R News}, 7\penalty0 (1):\penalty0 24--26, 2007.
\newblock URL \url{http://CRAN.R-project.org/doc/Rnews/Rnews_2007-1.pdf}.

\bibitem[{R Development Core Team}(2006)]{RImportExportManual}
{R Development Core Team}.
\newblock \emph{R Data Import/Export}, 2006.
\newblock URL \url{http://cran.r-project.org/manuals.html}.
\newblock ISBN 3-900051-10-0.

\bibitem[Warnes(2006)]{WarnesGdata}
G.~R. Warnes.
\newblock \emph{gdata: Various R programming tools for data manipulation},
  2006.
\newblock URL
  \url{http://cran.r-project.org/src/contrib/Descriptions/gdata.html}.
\newblock R package version 2.3.1. Includes R source code and/or documentation
  contributed by Ben Bolker, Gregor Gorjanc and Thomas Lumley.

\end{thebibliography}

\address{Gregor Gorjanc\\
  University of Ljubljana, Slovenia\\
\email{gregor.gorjanc@bfro.uni-lj.si}}

\end{article}

\end{document}