/usr/lib/R/site-library/gdata/doc/unknown.Rnw is in r-cran-gdata 2.18.0-1.1.
This file is owned by root:root, with mode 0o644.
%\VignetteIndexEntry{Working with Unknown Values}
%\VignettePackage{gdata}
%\VignetteKeywords{unknown, missing, manip}
\documentclass[a4paper]{report}
\usepackage{Rnews}
\usepackage[round]{natbib}
\bibliographystyle{abbrvnat}
\usepackage{Sweave}
\SweaveOpts{strip.white=all, keep.source=TRUE}
\begin{document}
\begin{article}
\title{Working with Unknown Values}
\subtitle{The \pkg{gdata} package}
\author{by Gregor Gorjanc}
\maketitle
This vignette has been published as \cite{Gorjanc}.
\section{Introduction}
Unknown or missing values can be represented in various ways. For example,
SAS uses \code{.}~(dot), while \R{} uses \code{NA}, which we can read as
Not Available. When we import data into \R{}, say via \code{read.table} or
its derivatives, conversion of blank fields to \code{NA} (according to
\code{read.table} help) is done for \code{logical}, \code{integer},
\code{numeric} and \code{complex} classes. Additionally, the
\code{na.strings} argument can be used to specify values that should also
be converted to \code{NA}. Conversely, the argument \code{na} in
\code{write.table} and its derivatives defines the value that will replace
\code{NA} in exported data. There are also other ways to import data into
and export data from \R{}, as described in the \emph{R Data Import/Export}
manual \citep{RImportExportManual}. However, all of these approaches lack
the possibility to define unknown value(s) for a particular column, even
though an unknown value in one column may be a valid value in another
column. For example, I have seen many datasets where values such as 0, -9,
999 and specific dates are used as column-specific unknown values.
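To make the limitation concrete, the following sketch uses a small
hypothetical dataset (the column names and values are invented for
illustration) in which \code{-9} marks an unknown \code{weight} but is a
valid \code{temp}. Because \code{na.strings} applies to every column, the
valid temperature is converted as well:
<<sketch01>>=
## Hypothetical data: -9 is "unknown" for weight, but valid for temp
raw <- c("id weight temp",
         "1 52 -9",
         "2 -9 4",
         "3 61 -2")
dat <- read.table(text=raw, header=TRUE, na.strings="-9")
dat
## Both columns now hold an NA, although only weight should
is.na(dat$weight)
is.na(dat$temp)
@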
This note describes a set of functions in package \pkg{gdata}\footnote{
package version 2.3.1} \citep{WarnesGdata}: \code{isUnknown},
\code{unknownToNA} and \code{NAToUnknown}, which can help with testing for
unknown values and conversions between unknown values and \code{NA}. All
three functions are generic (S3) and were tested (at the time of writing)
to work with: \code{integer}, \code{numeric}, \code{character},
\code{factor}, \code{Date}, \code{POSIXct}, \code{POSIXlt}, \code{list},
\code{data.frame} and \code{matrix} classes.
\section{Description with examples}
The following examples show simple usage of these functions on
\code{numeric} and \code{factor} classes, where value \code{0} (besides
\code{NA}) should be treated as an unknown value:
<<ex01>>=
library("gdata")
xNum <- c(0, 6, 0, 7, 8, 9, NA)
isUnknown(x=xNum)
@
The default unknown value in \code{isUnknown} is \code{NA}, which means
that the output is the same as that of \code{is.na}, at least for atomic
classes. However, we can pass the argument \code{unknown} to define which
values should be treated as unknown:
<<ex02>>=
isUnknown(x=xNum, unknown=0)
@
This skipped \code{NA}, but we can get the expected answer after
appropriately adding \code{NA} into the argument \code{unknown}:
<<ex03>>=
isUnknown(x=xNum, unknown=c(0, NA))
@
Now, we can change all unknown values to \code{NA} with \code{unknownToNA}.
There is clearly no need to add \code{NA} here. This step is very handy
after importing data from an external source, where many different unknown
values might be used. The argument \code{warning=TRUE} can be used if there
is a need to be warned about ``original'' \code{NA}s:
<<ex04>>=
(xNum2 <- unknownToNA(x=xNum, unknown=0))
@
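As a minimal sketch of the \code{warning} argument (the vector
\code{xWarn} is introduced only for this illustration), the call below
should still return the converted vector, but also warn that \code{xWarn}
already contained an \code{NA} before the conversion:
<<sketch02>>=
library("gdata")
## Illustrative vector with one 0 (unknown) and one "original" NA
xWarn <- c(0, 6, NA)
(xWarn2 <- unknownToNA(x=xWarn, unknown=0, warning=TRUE))
@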
Prior to export from \R{}, we might want to change unknown values
(\code{NA} in \R{}) to some other value. Function \code{NAToUnknown} can be
used for this:
<<ex05>>=
NAToUnknown(x=xNum2, unknown=999)
@
Converting \code{NA} to a value that already exists in \code{x} raises an
error, but \code{force=TRUE} can be used to override this check if needed.
Be warned, however, that there is no way back from this step, since the
converted values can no longer be told apart from the original ones:
<<ex06>>=
NAToUnknown(x=xNum2, unknown=7, force=TRUE)
@
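The error raised without \code{force=TRUE} can be observed in a small
sketch. Here \code{xDup} is a hypothetical vector, and \code{tryCatch} is
used only so that the error message is captured instead of stopping the
session:
<<sketch03>>=
library("gdata")
## 7 is already a valid value in xDup, so using it for NA is ambiguous
xDup <- c(6, 7, NA)
msg <- tryCatch(NAToUnknown(x=xDup, unknown=7),
                error=function(e) conditionMessage(e))
msg
@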
The examples below show the peculiarities of class \code{factor}.
\code{unknownToNA} removes the \code{unknown} value from the levels and,
conversely, \code{NAToUnknown} adds it with a warning. Additionally,
\code{"NA"} is properly distinguished from \code{NA}. It can also be seen
that the argument \code{unknown} in functions \code{isUnknown} and
\code{unknownToNA} need not match the class of \code{x} (otherwise a factor
would have to be used), as the test is internally done with \code{\%in\%},
which nicely resolves coercion issues.
<<ex07>>=
(xFac <- factor(c(0, "BA", "RA", "BA", NA, "NA")))
isUnknown(x=xFac)
isUnknown(x=xFac, unknown=0)
isUnknown(x=xFac, unknown=c(0, NA))
isUnknown(x=xFac, unknown=c(0, "NA"))
isUnknown(x=xFac, unknown=c(0, "NA", NA))
(xFac <- unknownToNA(x=xFac, unknown=0))
(xFac <- NAToUnknown(x=xFac, unknown=0))
@
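The coercion performed by \code{\%in\%} can be checked directly in base
\R{}. In this sketch (the object \code{f} is made up for the illustration)
the numeric \code{0} matches the factor level \code{"0"}, and \code{NA}
matches only the real missing value, not the level \code{"NA"}:
<<sketch04>>=
## Factor with a "0" level, a real NA and the character string "NA"
f <- factor(c("0", "BA", NA, "NA"))
## match(), which underlies %in%, converts factors to character and
## coerces both sides to a common type before comparing
f %in% 0
f %in% NA
f %in% "NA"
@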
These two examples with classes \code{numeric} and \code{factor} are fairly
simple and we could get the same results with one or two lines of \R{}
code. The real benefit of the set of functions presented here is in
\code{list} and \code{data.frame} methods, where \code{data.frame} methods
are merely wrappers for \code{list} methods.
We need additional flexibility for \code{list}/\code{data.frame} methods,
due to possibly having multiple unknown values that can be different among
\code{list} components or \code{data.frame} columns. For these two methods,
the argument \code{unknown} can be either a \code{vector} or \code{list},
both possibly named. Of course, greater flexibility (defining multiple
unknown values per component/column) can be achieved with a \code{list}.
When a \code{vector}/\code{list} object passed to the argument
\code{unknown} is not named, the first value/component of the
\code{vector}/\code{list} matches the first component/column of the
\code{list}/\code{data.frame}. This can be quite error prone, especially
with \code{vector}s. Therefore, I encourage the use of a \code{list}. If
the \code{vector}/\code{list} passed to the argument \code{unknown} is
named, its names are matched to the names of the \code{list} or
\code{data.frame}. If the lengths of \code{unknown} and the \code{list} or
\code{data.frame} do not match, recycling occurs.
The example below illustrates the application of the described functions to
a list which is composed of the previously defined and modified numeric
(\code{xNum}) and factor (\code{xFac}) objects. First, function
\code{isUnknown} is used with \code{0} as an unknown value. Note that we
get \code{FALSE} for \code{NA}s, as was the case in the earlier
\code{numeric} example.
<<ex08>>=
(xList <- list(a=xNum, b=xFac))
isUnknown(x=xList, unknown=0)
@
We need to add \code{NA} as an unknown value. However, we do not get the
expected result this way!
<<ex09>>=
isUnknown(x=xList, unknown=c(0, NA))
@
This is due to matching of values in the argument \code{unknown} and
components in a \code{list}; i.e., \code{0} is used for component \code{a}
and \code{NA} for component \code{b}. Therefore, it is less error prone
and more flexible to pass a \code{list} (preferably a named list) to the
argument \code{unknown}, as shown below.
<<ex10>>=
(xList1 <- unknownToNA(x=xList,
                       unknown=list(b=c(0, "NA"),
                                    a=0)))
@
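The same named \code{list} approach can be sketched for \code{isUnknown},
here on a small stand-alone list (\code{yList} and its contents are
illustrative only). Listing \code{NA} explicitly for each component is
expected to flag both the zeros and the missing values:
<<sketch05>>=
library("gdata")
## Illustrative list with a numeric and a factor component
yList <- list(a=c(0, 6, NA),
              b=factor(c("0", "BA", NA)))
## Names are matched to components, so the order does not matter
(yUnk <- isUnknown(x=yList,
                   unknown=list(b=c(0, NA),
                                a=c(0, NA))))
@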
Changing \code{NA}s to some other value (only one per component/column) can
be accomplished as follows:
<<ex11>>=
NAToUnknown(x=xList1,
            unknown=list(b="no", a=0))
@
A named component \code{.default} of a \code{list} passed to argument
\code{unknown} has a special meaning as it will match a component/column
with that name and any other not defined in \code{unknown}. As such it is
very useful if the number of components/columns with the same unknown
value(s) is large. Consider a wide \code{data.frame} named \code{df}. Now
\code{.default} can be used to define the unknown value for several columns:
<<ex12, echo=FALSE>>=
df <- data.frame(col1=c(0, 1, 999, 2),
                 col2=c("a", "b", "c", "unknown"),
                 col3=c(0, 1, 2, 3),
                 col4=c(0, 1, 2, 2))
@
<<ex13>>=
tmp <- list(.default=0,
            col1=999,
            col2="unknown")
(df2 <- unknownToNA(x=df,
                    unknown=tmp))
@
If there is a need to work only on some components/columns, you can of
course ``skip'' columns with standard \R{} mechanisms, i.e.,
by subsetting \code{list} or \code{data.frame} objects:
<<ex14>>=
df2 <- df
cols <- c("col1", "col2")
tmp <- list(col1=999,
            col2="unknown")
df2[, cols] <- unknownToNA(x=df[, cols],
                           unknown=tmp)
df2
@
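Putting the pieces together, an import/clean/export round trip might look
like the following sketch. The data, the column names and the unknown
codes are hypothetical; only \code{read.table}, \code{write.table} and the
functions described above are used:
<<sketch06>>=
library("gdata")
## Hypothetical raw data: -9 is unknown in weight, 999 is unknown in temp
raw2 <- c("weight temp",
          "52 -9",
          "-9 4",
          "61 999")
dat2 <- read.table(text=raw2, header=TRUE)
## Import side: one unknown value per column, matched by name
codes <- list(weight=-9, temp=999)
(clean <- unknownToNA(x=dat2, unknown=codes))
## ... analysis on 'clean' would go here ...
## Export side: restore the column-specific codes and write to the console
out <- NAToUnknown(x=clean, unknown=codes)
write.table(out, quote=FALSE, row.names=FALSE)
@
Note that the valid temperature \code{-9} is assumed to survive the round
trip, because each unknown code is applied only to its own column.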
\section{Summary}
Functions \code{isUnknown}, \code{unknownToNA} and \code{NAToUnknown}
provide a useful interface to work with various representations of
unknown/missing values. Their use is meant primarily for shaping the data
after importing to or before exporting from \R{}. I welcome any comments or
suggestions.
% \bibliography{refs}
\begin{thebibliography}{1}
\providecommand{\natexlab}[1]{#1}
\providecommand{\url}[1]{\texttt{#1}}
\expandafter\ifx\csname urlstyle\endcsname\relax
\providecommand{\doi}[1]{doi: #1}\else
\providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi
\bibitem[Gorjanc(2007)]{Gorjanc}
G.~Gorjanc.
\newblock Working with unknown values: the gdata package.
\newblock \emph{R News}, 7\penalty0 (1):\penalty0 24--26, 2007.
\newblock URL \url{http://CRAN.R-project.org/doc/Rnews/Rnews_2007-1.pdf}.
\bibitem[{R Development Core Team}(2006)]{RImportExportManual}
{R Development Core Team}.
\newblock \emph{R Data Import/Export}, 2006.
\newblock URL \url{http://cran.r-project.org/manuals.html}.
\newblock ISBN 3-900051-10-0.
\bibitem[Warnes(2006)]{WarnesGdata}
G.~R. Warnes.
\newblock \emph{gdata: Various R programming tools for data manipulation},
2006.
\newblock URL
\url{http://cran.r-project.org/src/contrib/Descriptions/gdata.html}.
\newblock R package version 2.3.1. Includes R source code and/or documentation
contributed by Ben Bolker, Gregor Gorjanc and Thomas Lumley.
\end{thebibliography}
\address{Gregor Gorjanc\\
University of Ljubljana, Slovenia\\
\email{gregor.gorjanc@bfro.uni-lj.si}}
\end{article}
\end{document}