/usr/lib/R/site-library/phangorn/doc/phangorn-specials.Rnw is in r-cran-phangorn 1.99.14-1.
This file is owned by root:root, with mode 0o644.
The actual contents of the file can be viewed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 | %\VignetteIndexEntry{Advanced features}
%\VignetteKeywords{Documentation}
%\VignettePackage{phangorn}
%\VignetteEngine{Sweave}
\documentclass[12pt]{article}
% setwd("/home/kschliep/Desktop/phangorn/vignettes")
% Sweave("phangorn-specials.Rnw")
% tools::texi2dvi("phangorn-specials.tex", pdf=TRUE)
\usepackage{times}
\usepackage{hyperref}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in
\newcommand{\R}{\textsf{R}}
\newcommand{\pml}{\Robject{pml}}
\newcommand{\phangorn}{\Rpackage{phangorn}}
\newcommand{\ape}{\Rpackage{ape}}
\newcommand{\multicore}{\Rpackage{multicore}}
\newcommand{\term}[1]{\emph{#1}}
\newcommand{\mref}[2]{\htmladdnormallinkfoot{#2}{#1}}
\begin{document}
% Ross Ihakas extenstion for nicer representation
\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em}
\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
<<echo=FALSE>>=
options(width=70)
foo <- packageDescription("phangorn")
@
\title{Special features of phangorn (Version \Sexpr{foo$Version})} %$
\author{\mref{mailto:klaus.schliep@gmail.com}{Klaus P. Schliep}}
\maketitle
\nocite{Paradis2012}
\section*{Introduction}
This document illustrates some of the \phangorn{} \cite{Schliep2011} specialised features which are useful but maybe not as well-known or just not (yet) described elsewhere. This is mainly interesting for someone who wants to explore different models or set up some simulation studies. We show how to construct data objects for different character states other than nucleotides or amino acids or how to set up different models to estimate transition rate.
The vignette \emph{Trees} describes in detail how to estimate phylogenies from nucleotide or amino acids.
\section{User defined data formats}\label{sec:USER}
To better understand how to define our own data type it is useful to know a bit more about the internal representation of \Robject{phyDat} objects. The internal representation of \Robject{phyDat} object is very similar to \Robject{factor} objects.
As an example we will show here several possibilities to define nucleotide data with gaps defined as a fifth state. Ignoring gaps or coding them as ambiguous sites - as it is done in most programs, also in phangorn as default - may be misleading (see Warnow(2012)\cite{Warnow2012}). When the number of gaps is low and the gaps are missing at random coding gaps as separate state may be not important.
Let assume we have given a matrix where each row contains a character vector of a taxonomical unit:
<<echo=TRUE>>=
library(phangorn)
data = matrix(c("r","a","y","g","g","a","c","-","c","t","c","g",
"a","a","t","g","g","a","t","-","c","t","c","a",
"a","a","t","-","g","a","c","c","c","t","?","g"),
dimnames = list(c("t1", "t2", "t3"),NULL), nrow=3, byrow=TRUE)
data
@
Normally we would transform this matrix into an phyDat object and gaps are handled as ambiguous character like "?".
<<>>=
gapsdata1 = phyDat(data)
gapsdata1
@
Now we will define a "USER" defined object and have to supply a vector levels of the character states for the new data, in our case the for nucleotide states and the gap. Additional we can define ambiguous states which can be any of the states.
<<echo=TRUE>>=
gapsdata2 = phyDat(data, type="USER", levels=c("a","c","g","t","-"),
ambiguity = c("?", "n"))
gapsdata2
@
This is not yet what we wanted as two sites of our alignment, which contain the ambiguous characters "r" and "y", got deleted.
To define ambiguous characters like "r" and "y" explicitly we have to supply a contrast matrix similar to contrasts for factors.
<<echo=TRUE>>=
contrast = matrix(data = c(1,0,0,0,0,
0,1,0,0,0,
0,0,1,0,0,
0,0,0,1,0,
1,0,1,0,0,
0,1,0,1,0,
0,0,0,0,1,
1,1,1,1,0,
1,1,1,1,1),
ncol = 5, byrow = TRUE)
dimnames(contrast) = list(c("a","c","g","t","r","y","-","n","?"),
c("a", "c", "g", "t", "-"))
contrast
gapsdata3 = phyDat(data, type="USER", contrast=contrast)
gapsdata3
@
Here we defined "n" as a state which can be any nucleotide but not a gap "-" and "?" can be any state including a gap.
These data can be used in all functions available in \phangorn{} to compute distance matrices or perform parsimony and maximum likelihood analysis.
\section{Estimation of non-standard transition rate matrices}
In the last section \ref{sec:USER} we described how to set up user defined data formats. Now we describe how to estimate transition matrices with pml.
Again for nucleotide data the most common models can be called directly in the \Rfunction{optim.pml} function (e.g. "JC69", "HKY", "GTR" to name a few). Table \ref{models} lists all the available nucleotide models, which can estimated directly in \Rfunction{optim.pml}. For amino acids several transition matrices are available ("WAG", "JTT", "LG", "Dayhoff", "cpREV", "mtmam", "mtArt", "MtZoa", "mtREV24", "VT","RtREV", "HIVw", "HIVb", "FLU", "Blossum62", "Dayhoff\_DCMut" and "JTT-DCMut") or can be estimated with \Rfunction{optim.pml}. For example Mathews et al. (2010) \cite{Mathews2010} used this function to estimate a phytochrome amino acid transition matrix.
We will now show how to estimate a rate matrix with different transition ($\alpha$) and transversion ratio ($\beta$) and a fixed rate to the gap state ($\gamma$) - a kind of Kimura two-parameter model (K81) for nucleotide data with gaps as fifth state (see table \ref{gaps}).
\begin{table}[htbp]
\centering
\begin{tabular}{l|lllll}
& a & c & g & t & - \\
\hline
a & & & & & \\
c & $\beta$ & & & & \\
g & $\alpha$ & $\beta$ & & & \\
t & $\beta$ & $\alpha$ & $\beta$ & & \\
- & $\gamma$ & $\gamma$ & $\gamma$ & $\gamma$ & \\
\end{tabular}
\caption{Rate matrix K to optimise. }\label{gaps}
\end{table}
The parameters subs accepts a vector of consecutive integers and at least one element has to be zero (these gets the reference rate of 1).
<<>>=
tree = unroot(rtree(3))
fit = pml(tree, gapsdata3)
fit = optim.pml(fit, optQ=TRUE, subs=c(1,0,1,2,1,0,2,1,2,2),
control=pml.control(trace=0))
fit
@
Here are some conventions how the models are estimated: \\
If a model is supplied the base frequencies bf and rate matrix Q are optimised according to the model (nucleotides) or the adequate rate matrix and frequencies are chosen (for amino acids).
If optQ=TRUE and neither a model or subs are supplied than a symmetric (optBf=FALSE) or reversible model (optBf=TRUE, i.e. the GTR for nucleotides) is estimated. This can be slow if the there are many character states, e.g. for amino acids.
\begin{table}[htbp]
\centering
\begin{tabular}{|llllr|}
\hline
model & optQ & optBf & subs & df \\
\hline
JC & FALSE & FALSE & $c(0, 0, 0, 0, 0, 0)$ & 0 \\
F81 & FALSE & TRUE & $c(0, 0, 0, 0, 0, 0)$ & 3 \\
K80 & TRUE & FALSE & $c(0, 1, 0, 0, 1, 0)$ & 1 \\
HKY & TRUE & TRUE & $c(0, 1, 0, 0, 1, 0)$ & 4 \\
TrNe & TRUE & FALSE & $c(0, 1, 0, 0, 2, 0)$ & 2 \\
TrN & TRUE & TRUE & $c(0, 1, 0, 0, 2, 0)$ & 5 \\
TPM1 & TRUE & FALSE & $c(0, 1, 2, 2, 1, 0)$ & 2 \\
K81 & TRUE & FALSE & $c(0, 1, 2, 2, 1, 0)$ & 2 \\
TPM1u & TRUE & TRUE & $c(0, 1, 2, 2, 1, 0)$ & 5 \\
TPM2 & TRUE & FALSE & $c(1, 2, 1, 0, 2, 0)$ & 2 \\
TPM2u & TRUE & TRUE & $c(1, 2, 1, 0, 2, 0)$ & 5 \\
TPM3 & TRUE & FALSE & $c(1, 2, 0, 1, 2, 0)$ & 2 \\
TPM3u & TRUE & TRUE & $c(1, 2, 0, 1, 2, 0)$ & 5 \\
TIM1e & TRUE & FALSE & $c(0, 1, 2, 2, 3, 0)$ & 3 \\
TIM1 & TRUE & TRUE & $c(0, 1, 2, 2, 3, 0)$ & 6 \\
TIM2e & TRUE & FALSE & $c(1, 2, 1, 0, 3, 0)$ & 3 \\
TIM2 & TRUE & TRUE & $c(1, 2, 1, 0, 3, 0)$ & 6 \\
TIM3e & TRUE & FALSE & $c(1, 2, 0, 1, 3, 0)$ & 3 \\
TIM3 & TRUE & TRUE & $c(1, 2, 0, 1, 3, 0)$ & 6 \\
TVMe & TRUE & FALSE & $c(1, 2, 3, 4, 2, 0)$ & 4 \\
TVM & TRUE & TRUE & $c(1, 2, 3, 4, 2, 0)$ & 7 \\
SYM & TRUE & FALSE & $c(1, 2, 3, 4, 5, 0)$ & 5 \\
GTR & TRUE & TRUE & $c(1, 2, 3, 4, 5, 0)$ & 8 \\
\hline
\end{tabular}
\caption{DNA models available in phangorn, how they are defined and number of parameters to estimate. }\label{models}
\end{table}
\section{Codon substitution models}
A special case of the transition rates are codon models. \phangorn{} now offers the possibility to estimate the $d_N/d_S$ ratio (sometimes called ka/ks), for an overview see \cite{Yang2006}. These functions extend the option to estimates the $d_N/d_S$ ratio for pairwise sequence comparison as it is available through the function \Rfunction{kaks} in \Rpackage{seqinr}. The transition rate between between codon $i$ and $j$ is defined as follows:
\begin{eqnarray}
q_{ij}=\left\{
\begin{array}{l@{\quad}l}
0 & \textrm{if i and j differ in more than one position} \\
\pi_j & \textrm{for synonymous transversion} \\
\pi_j\kappa & \textrm{for synonymous transition} \\
\pi_j\omega & \textrm{for non-synonymous transversion} \\
\pi_j\omega\kappa & \textrm{for non synonymous transition}
\end{array}
\right. \nonumber
\end{eqnarray}
where $\omega$ is the $d_N/d_S$ ratio, $\kappa$ the transition transversion ratio and $\pi_j$ is the the equilibrium frequencies of codon $j$.
For $\omega\sim1$ the an amino acid change is neutral, for $\omega < 1$ purifying selection and $\omega > 1$ positive selection.
There are four models available:
"codon0", where both parameter $\kappa$ and $\omega$ are fixed to 1, "codon1" where both parameters are estimated and "codon2" or "codon3" where $\kappa$ or $\omega$ is fixed to 1.
We compute the $d_N/d_S$ for some sequences given a tree using the ML functions \Rfunction{pml} and \Rfunction{optim.pml}. First we have to transform the the nucleotide sequences into codons (so far the algorithms always takes triplets).
<<echo=TRUE>>=
library(phangorn)
primates = read.phyDat("primates.dna", format="phylip", type="DNA")
tree <- NJ(dist.ml(primates))
dat <- phyDat(as.character(primates), "CODON")
fit <- pml(tree, dat)
fit0 <- optim.pml(fit, control = pml.control(trace = 0))
fit1 <- optim.pml(fit, model="codon1", control=pml.control(trace=0))
fit2 <- optim.pml(fit, model="codon2", control=pml.control(trace=0))
fit3 <- optim.pml(fit, model="codon3", control=pml.control(trace=0))
anova(fit0, fit2, fit3, fit1)
@
The models described here all assume equal frequencies for each codon (=1/61). One can optimise the codon frequencies setting the option to optBf=TRUE. As the convergence of the 61 parameters the convergence is likely slow set the maximal iterations to a higher value than the default (e.g. control = pml.control(maxit=50)).
\section{Generating trees}
\phangorn{} has several functions to generate tree topologies, which may are interesting for simulation studies. \Rfunction{allTrees} computes all possible bifurcating tree topologies either rooted or unrooted for up to 10 taxa. One has to keep in mind that the number of trees is growing exponentially, use \Rfunction(howmanytrees) from \ape{} as a reminder.
%<<echo=TRUE>>=
%trees = allTrees(5)
%@
<<label=plotAll,include=FALSE>>=
trees = allTrees(5)
par(mfrow=c(3,5), mar=rep(0,4))
for(i in 1:15)plot(trees[[i]], cex=1, type="u")
@
\begin{figure}
\begin{center}
<<label=figAll,fig=TRUE,echo=FALSE>>=
<<plotAll>>
@
\end{center}
\caption{all (15) unrooted trees with 5 taxa}
\label{fig:NJ}
\end{figure}
\Rfunction{nni} returns a list of all trees which are one nearest neighbor interchange away.
<<echo=TRUE>>=
trees = nni(trees[[1]])
@
\Rfunction{rNNI} and \Rfunction{rSPR} generate trees which are a defined number of NNI (nearest neighbor interchange) or SPR (subtree pruning and regrafting) away.
\bibliographystyle{plain}
\bibliography{phangorn}
\section{Session Information}
The version number of \R{} and packages loaded for generating the vignette were:
<<echo=FALSE,results=tex>>=
toLatex(sessionInfo())
@
\end{document}
|