% Source: /usr/lib/R/site-library/snpStats/doc/tdt-vignette.Rnw, as shipped in r-bioc-snpstats 1.28.0+dfsg-1.
%\documentclass[a4paper,12pt]{article}
\documentclass[12pt]{article}
\usepackage{fullpage}
% \usepackage{times}
%\usepackage{mathptmx}
%\renewcommand{\ttdefault}{cmtt}
\usepackage{graphicx}
\usepackage[pdftex,
bookmarks,
bookmarksopen,
pdfauthor={David Clayton},
pdftitle={TDT and snpStats Vignette}]
{hyperref}
\title{TDT vignette\\Use of snpStats in family--based studies}
\author{David Clayton}
\date{\today}
\usepackage{Sweave}
\SweaveOpts{echo=TRUE, pdf=TRUE, eps=FALSE}
\begin{document}
\setkeys{Gin}{width=1.0\textwidth}
%\VignetteIndexEntry{TDT tests}
%\VignettePackage{snpStats}
\maketitle
\section*{Pedigree data}
The {\tt snpStats} package contains some tools for analysis of
family-based studies. These assume that a subject support file
provides the information necessary to reconstruct pedigrees in the
well-known format used in the {\it LINKAGE} package. Each line of the
support file must contain an identifier of the {\em pedigree} to which
the individual belongs, together with an identifier of the subject within
the pedigree and the within-pedigree identifiers for the subject's father and
mother. Usually this information, together with phenotype data, will
be contained in a dataframe whose rownames link to the rownames
of the {\tt SnpMatrix} containing the genotype data. The following
commands read some illustrative data on 3,017 subjects and 43
(autosomal) SNPs\footnote{These data are on a much smaller scale than
would arise in genome-wide studies, but serve to illustrate the
available tools. Note, however, that execution speeds are quite adequate for
genome-wide data.}. The data consist of a dataframe containing the
subject and pedigree information ({\tt pedData}) and a {\tt
SnpMatrix} containing the genotype data ({\tt genotypes}):
<<family-data>>=
require(snpStats)
data(families)
genotypes
head(pedData)
@
The first family comprises four individuals: two parents and two
sibling offspring. The parents are ``founders'' in the pedigree, {\it
i.e.} there are no data for their parents, so that their {\tt father}
and {\tt mother} identifiers are set to {\tt NA}. This differs from
the convention in the {\it LINKAGE} package, which would code these as
zero. Otherwise coding is as in {\it LINKAGE}: {\tt sex} is coded 1 for
male and 2 for female, and disease status ({\tt affected}) is coded
1 for unaffected and 2 for affected.
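
As a minimal sketch of this coding, the following chunk builds a toy
pedigree by hand and recodes the founders' parental identifiers to the
zero convention of {\it LINKAGE}. The data are invented for
illustration and are not taken from {\tt pedData}; the column name
{\tt member} and the objects {\tt toy}, {\tt founder} and {\tt
  linkage} are assumptions made for this example only.
<<toy-pedigree>>=
## Hypothetical four-person family: two founders and two affected offspring
toy <- data.frame(
  familyid = c(1, 1, 1, 1),
  member   = c(1, 2, 3, 4),    # assumed name for the within-pedigree id
  father   = c(NA, NA, 1, 1),  # NA marks a founder
  mother   = c(NA, NA, 2, 2),
  sex      = c(1, 2, 1, 2),    # 1 = male, 2 = female
  affected = c(1, 1, 2, 2)     # 1 = unaffected, 2 = affected
)
## Founders have neither parent recorded
founder <- is.na(toy$father) & is.na(toy$mother)
## Recode to the LINKAGE convention, where founders' parents are zero
linkage <- toy
linkage$father[founder] <- 0
linkage$mother[founder] <- 0
linkage
@
Keeping {\tt NA} rather than zero in the working dataframe means that
a missing parent cannot be mistaken for a genuine identifier.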
\section*{Checking for mis-inheritances}
The function {\tt misinherits} identifies non-Mendelian inheritances in
the data. It returns a logical matrix with one row for each subject
who has any mis-inheritance and one column for each SNP that was ever
mis-inherited.
<<mis-inheritances>>=
mis <- misinherits(data=pedData, snp.data=genotypes)
dim(mis)
@
Thus, 114 of the subjects and 37 of the SNPs had at least one
mis-inheritance. The following commands count mis-inheritances per
subject and per SNP, and plot the frequency distribution of each
count:
<<per-subj-snp,fig=TRUE>>=
per.subj <- apply(mis, 1, sum, na.rm=TRUE)
per.snp <- apply(mis, 2, sum, na.rm=TRUE)
par(mfrow = c(1, 2))
hist(per.subj, main='Histogram per Subject',
     xlab='Mis-inheritances per subject')
hist(per.snp, main='Histogram per SNP',
     xlab='Mis-inheritances per SNP')
@
Note that mis-inheritances must be ascribed to offspring, although the
error may lie with the parent data. The following commands first
extract the pedigree identifiers for mis-inheriting subjects and go on
to chart the numbers of mis-inheritances per family:
<<per-family,fig=TRUE>>=
fam <- pedData[rownames(mis), "familyid"]
per.fam <- tapply(per.subj, fam, sum)
par(mfrow = c(1, 1))
hist(per.fam, main='Histogram per Family',
     xlab='Mis-inheritances per family')
@
None of the above analyses suggests serious problems with the data,
although there are clearly a few genotyping errors.
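
To make the notion of a mis-inheritance concrete, the chunk below
sketches a Mendelian consistency check for a single SNP in one
father--mother--child trio. It is an illustration only and is not the
algorithm used by {\tt misinherits}. It assumes, purely for this
example, that genotypes are coded as the number of copies (0, 1 or 2)
of one allele; the helper names {\tt transmittable} and {\tt
  mendelian.ok} are hypothetical.
<<toy-mendel>>=
## Alleles (0 or 1 copies) that a parent with genotype g can transmit
transmittable <- function(g) {
  switch(as.character(g),
         "0" = 0,
         "1" = c(0, 1),
         "2" = 1,
         stop("genotype must be 0, 1 or 2"))
}
## A child genotype is consistent if it equals the sum of one
## transmittable allele from each parent
mendelian.ok <- function(father, mother, child) {
  child %in% outer(transmittable(father), transmittable(mother), "+")
}
mendelian.ok(0, 0, 1)  # FALSE: neither parent carries the allele
mendelian.ok(1, 1, 2)  # TRUE: each parent can transmit one copy
@
A check of this kind flags the trio as a whole, which is why an error
in a parental genotype is reported against the offspring.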
\section*{TDT tests}
At present, the package only allows testing of
discrete disease phenotypes in case--parent trios, essentially the
Transmission/Disequilibrium Test (TDT). This is carried out by the
function {\tt tdt.snp}, which returns the same class of object as that
returned by {\tt single.snp.tests}; allelic (1~df) and genotypic
(2~df) tests are computed. The following commands compute the tests,
display the $p$-values, and draw a quantile--quantile plot of the
1~df chi-squared statistics:
<<tdt-tests,fig=TRUE,keep.source=TRUE>>=
tests <- tdt.snp(data = pedData, snp.data = genotypes)
cbind(p.values.1df = p.value(tests, 1),
p.values.2df = p.value(tests, 2))
qq.chisq(chi.squared(tests, 1), df = 1)
@
Since these SNPs were all in a region of known association, the
overdispersion of test statistics is not surprising. Note that,
because each family had two affected offspring, there were twice as
many parent-offspring trios as families. In the above tests, the
contributions of the two trios in each family to the test statistic
have been assumed to be independent. When there is {\em linkage}
between the genetic locus and disease trait, this assumption is
incorrect and an alternative variance estimate can be used by
specifying {\tt robust=TRUE} in the call. However, in practice,
linkage is very rarely strong enough to require this correction.
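
For orientation, the classical allelic TDT can be written in closed
form. Among heterozygous parents of affected offspring, let $b$ be the
number of transmissions of one allele and $c$ the number of
transmissions of the other. Then
\[
\chi^2 = \frac{(b - c)^2}{b + c}
\]
is referred to a chi-squared distribution on 1~df. The chunk below is
a minimal sketch of this arithmetic in base R with made-up counts. It
is not a description of how {\tt tdt.snp} is implemented, and the names
{\tt b}, {\tt c.count}, {\tt tdt.chisq} and {\tt tdt.p} are
illustrative only.
<<toy-tdt>>=
## Hypothetical transmission counts from heterozygous parents
b <- 60        # transmissions of the first allele (made-up)
c.count <- 40  # transmissions of the second allele (made-up)
## Classical 1 df TDT statistic and its p-value
tdt.chisq <- (b - c.count)^2 / (b + c.count)
tdt.p <- pchisq(tdt.chisq, df = 1, lower.tail = FALSE)
c(chi.squared = tdt.chisq, p.value = tdt.p)
@
Only heterozygous parents contribute to $b$ and $c$, since a
homozygous parent transmits the same allele whatever the disease
status of the child.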
\end{document}