<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<!-- Copyright (c) 2014 Stijn van Dongen -->
<head>
<meta name="keywords" content="manual">
<style type="text/css">
/* START aephea.base.css */
body
{ text-align: justify;
margin-left: 0%;
margin-right: 0%;
}
a:link { text-decoration: none; }
a:active { text-decoration: none; }
a:visited { text-decoration: none; }
a:link { color: #1111aa; }
a:active { color: #1111aa; }
a:visited { color: #111166; }
a.local:link { color: #11aa11; }
a.local:active { color: #11aa11; }
a.local:visited { color: #116611; }
a.intern:link { color: #1111aa; }
a.intern:active { color: #1111aa; }
a.intern:visited { color: #111166; }
a.extern:link { color: #aa1111; }
a.extern:active { color: #aa1111; }
a.extern:visited { color: #661111; }
a.quiet:link { color: black; }
a.quiet:active { color: black; }
a.quiet:visited { color: black; }
div.verbatim
{ font-family: monospace;
margin-top: 1em;
margin-bottom: 1em;
font-size: 10pt;
margin-left: 2em;
white-space: pre;
}
div.indent
{ margin-left: 8%;
margin-right: 0%;
}
.right { text-align: right; }
.left { text-align: left; }
.nowrap { white-space: nowrap; }
.item_leader
{ position: relative;
margin-left: 8%;
}
.item_compact { position: absolute; vertical-align: baseline; }
.item_cascade { position: relative; }
.item_leftalign { text-align: left; }
.item_rightalign
{ width: 2em;
text-align: right;
}
.item_compact .item_rightalign
{ position: absolute;
width: 52em;
right: -2em;
text-align: right;
}
.item_text
{ position: relative;
margin-left: 3em;
}
.smallcaps { font-size: smaller; text-transform: uppercase }
/* END aephea.base.css */
body { font-family: "Garamond", "Gill Sans", "Verdana", sans-serif; }
body
{ text-align: justify;
margin-left: 8%;
margin-right: 8%;
}
</style>
<title>Work flows and protocols for mcl and friends</title>
</head>
<body>
<p style="text-align:right">
16 May 2014
<a class="local" href="clmprotocols.ps"><b>clmprotocols</b></a>
14-137
</p>
<div class=" itemize " style="margin-top:1em; font-size:100%">
<div class=" item_compact"><div class=" item_rightalign nowrap " style="right:-3em">1.</div></div>
<div class=" item_text " style="margin-left:4em">
<a class="intern" href="#name">NAME</a>
</div>
<div class=" item_compact"><div class=" item_rightalign nowrap " style="right:-3em">2.</div></div>
<div class=" item_text " style="margin-left:4em">
<a class="intern" href="#description">DESCRIPTION</a>
</div>
<div class=" item_compact"><div class=" item_rightalign nowrap " style="right:-3em">3.</div></div>
<div class=" item_text " style="margin-left:4em">
<a class="intern" href="#internal">Network representation</a>
</div>
<div class=" item_compact"><div class=" item_rightalign nowrap " style="right:-3em">4.</div></div>
<div class=" item_text " style="margin-left:4em">
<a class="intern" href="#large">Loading large networks</a>
</div>
<div class=" item_compact"><div class=" item_rightalign nowrap " style="right:-3em">5.</div></div>
<div class=" item_text " style="margin-left:4em">
<a class="intern" href="#formatconversion">Converting between formats</a>
</div>
<div class=" item_compact"><div class=" item_rightalign nowrap " style="right:-3em">6.</div></div>
<div class=" item_text " style="margin-left:4em">
<a class="intern" href="#jobs">Using threading and job dispatching</a>
</div>
<div class=" item_compact"><div class=" item_rightalign nowrap " style="right:-3em">7.</div></div>
<div class=" item_text " style="margin-left:4em">
<a class="intern" href="#blast">Clustering similarity graphs encoded in BLAST results</a>
</div>
<div class=" item_compact"><div class=" item_rightalign nowrap " style="right:-3em">8.</div></div>
<div class=" item_text " style="margin-left:4em">
<a class="intern" href="#array">Clustering expression data</a>
</div>
<div class=" item_compact"><div class=" item_rightalign nowrap " style="right:-3em">9.</div></div>
<div class=" item_text " style="margin-left:4em">
<a class="intern" href="#degree">Reducing node degrees in the graph</a>
</div>
<div class=" item_compact"><div class=" item_rightalign nowrap " style="right:-3em">10.</div></div>
<div class=" item_text " style="margin-left:4em">
<a class="intern" href="#seealso">SEE ALSO</a>
</div>
<div class=" item_compact"><div class=" item_rightalign nowrap " style="right:-3em">11.</div></div>
<div class=" item_text " style="margin-left:4em">
<a class="intern" href="#author">AUTHOR</a>
</div>
</div>
<a name="name"></a>
<h2>NAME</h2>
<p style="margin-bottom:0" class="asd_par">
clmprotocols — Work flows and protocols for mcl and friends</p>
<a name="description"></a>
<h2>DESCRIPTION</h2>
<p style="margin-top:0em; margin-bottom:0em">
A guide to doing analysis with mcl and its helper programs.
</p>
<a name="internal"></a>
<h2>Network representation</h2>
<p style="margin-bottom:0" class="asd_par">
The clustering program <b>mcl</b> expects the name of a file as its first argument.
If the <b>--abc</b> option is used, the file is assumed to adhere to a
simple format where a network is specified edge by edge, one line and one
edge at a time.
Each line describes an edge as two labels and a numerical value, all
separated by white space. The labels and the value respectively identify the
two nodes and the edge weight. The format is called <span class="smallcaps">ABC</span>-format,
where '<span class="smallcaps">A</span>' and '<span class="smallcaps">B</span>' represent the two labels and '<span class="smallcaps">C</span>' represents the
edge weight. The latter is optional; if omitted the edge weight is set to one.
If <span class="smallcaps">ABC</span>-format is used, the output is returned as a listing of clusters,
each cluster given as a line of white-space separated labels.
</p>
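<p style="margin-bottom:0" class="asd_par">
As a minimal sketch of the format, the shell fragment below writes a three-edge
example and counts the lines that do not look like <span class="smallcaps">ABC</span>-format.
The helper name <tt>abc_bad_lines</tt>, the labels, and the choice to skip empty lines
are assumptions made for illustration. The check is not part of the mcl suite and
may not match exactly what <b>mcl</b> itself accepts.
</p>
<div class="verbatim">
```shell
# Hypothetical helper, not part of mcl: print the number of lines that do
# not consist of two labels plus an optional numeric weight.
abc_bad_lines() {
  awk 'NF == 0 { next }
       NF == 1 || NF > 3 { bad++; next }
       NF == 3 { if ($3 !~ /^[-+]?[0-9]*[.]?[0-9]+([eE][-+]?[0-9]+)?$/) bad++ }
       END { print bad + 0 }' "$1"
}

tmp=$(mktemp -d)
# Three edges; the last one omits the weight, which then defaults to one.
printf 'cat hat 0.2\nhat bat 0.16\nbat cat\n' > "$tmp/data.abc"
abc_bad_lines "$tmp/data.abc"
rm -r "$tmp"
```
</div>
<p style="margin-top:0em; margin-bottom:0em">
A count of zero suggests the file can be passed to <b>mcl</b> with <b>--abc</b>.
</p>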
<p style="margin-bottom:0" class="asd_par">
MCL can also utilize a second representation, which is a stringent and
unambiguous format for both input and output.
This is called <i>matrix format</i> and it is required when using other
programs in the mcl suite, for example when comparing and analysing
clusterings using <a class="local sibling" href="clm.html">clm</a> or when extracting and transforming
networks using <a class="local sibling" href="mcx.html">mcx</a>.
Native mode (matrix format) is entered simply by <i>not</i> specifying
<b>--abc</b>.
</p>
<p style="margin-bottom:0" class="asd_par">
The recommended approach using <b>mcl</b> is to convert an external format to
<span class="smallcaps">ABC</span>-format. The program <a class="local sibling" href="mcxload.html">mcxload</a> reads the latter and creates a
native network file and a dictionary file that maps network nodes to
labels. All applications in the MCL suite, including <b>mcl</b> itself, can read
this native network file format. Label output can be obtained using
<a class="local sibling" href="mcxdump.html">mcxdump</a>. The workflow is thus:
</p>
<div class="verbatim"> # External format has been converted to file data.abc (abc format)
mcxload -abc data.abc --stream-mirror -write-tab data.tab -o data.mci
mcl data.mci -I 1.4
mcl data.mci -I 2
mcl data.mci -I 4
mcxdump -icl out.data.mci.I14 -tabr data.tab -o dump.data.mci.I14
mcxdump -icl out.data.mci.I20 -tabr data.tab -o dump.data.mci.I20
mcxdump -icl out.data.mci.I40 -tabr data.tab -o dump.data.mci.I40</div>
<p style="margin-top:0em; margin-bottom:0em">
In this example the cluster output is stored in native format and dumped to
labels using mcxdump. The stored output can now be used to learn more about
the clusterings. An example is the following, where <a class="local sibling" href="clm.html">clm</a> is applied
in mode <b>dist</b> to gauge the distance between different clusterings.
</p>
<div class="verbatim"> clm dist --chain out.data.mci.I{14,20,40}</div>
<a name="large"></a>
<h2>Loading large networks</h2>
<p style="margin-bottom:0" class="asd_par">
If you deal with very large networks (say with hundreds of millions
of edges), it is recommended to use binary format (cf. <a class="local sibling" href="mcxio.html">mcxio</a>).
This is achieved by adding <tt>--write-binary</tt> to the mcxload
command line. The resulting file is no longer human-readable but
will be faster to read by a factor of between ten and twenty
compared to standard <span class="smallcaps">MCL</span>-edge network format, and by a factor of around fifty
compared to label format.
All <span class="smallcaps">MCL</span>-edge programs are able to read binary format, and speed
of reading will be somewhere in the order of millions of edges
per second, compared to, for example, roughly 100K edges
per second for label format.
</p>
<p style="margin-bottom:0" class="asd_par">
Memory usage for mcxload can be lowered
by replacing the option <tt>--stream-mirror</tt> with <tt>-ri max</tt>.
</p>
<a name="formatconversion"></a>
<h2>Converting between formats</h2>
<p style="margin-bottom:0" class="asd_par">
<b>Converting label format to tabular format</b><br>
Label format, two or three (including weight) columns:
</p>
<div class="verbatim"> mcxload -abc data.abc --stream-mirror -write-tab data.tab -o data.mci
mcxdump -imx data.mci -tab data.tab --dump-table</div>
<p style="margin-top:0em; margin-bottom:0em">Simple Interaction File (SIF) format:</p>
<div class="verbatim"> mcxload -sif data.sif --stream-mirror -write-tab data.tab -o data.mci
mcxdump -imx data.mci -tab data.tab --dump-table</div>
<p style="margin-top:0em; margin-bottom:0em">
These two examples are very similar, and differ only in the way the input to <b>mcxload</b> is specified.
</p>
<a name="jobs"></a>
<h2>Using threading and job dispatching</h2>
<p style="margin-bottom:0" class="asd_par">
The programs <a class="local sibling" href="mcxarray.html">mcxarray</a>, <a class="local sibling" href="mcxclcf.html">mcx clcf</a>, <a class="local sibling" href="mcxctty.html">mcx ctty</a>,
<a class="local sibling" href="mcxdiameter.html">mcx diameter</a> and <a class="local sibling" href="clminfo2.html">clm info2</a> can all make use of both
threading and job dispatching. The clustering program <a class="local sibling" href="mcl.html">mcl</a>
can only use threading.
</p>
<p style="margin-bottom:0" class="asd_par">
Instructing these programs to use threads is easy. It just requires
supplying <b>-t</b> <b>&lt;num&gt;</b>, e.g. use <b>-t</b> <b>4</b> to generate four threads.
It is only sensible to use <tt>&lt;num&gt;</tt> threads on a machine that has at least <tt>&lt;num&gt;</tt> CPUs.
It is additionally recommended that a threaded program has exclusive access to those CPUs
and does not have to contend with other jobs.
</p>
<p style="margin-bottom:0" class="asd_par">
For the aforementioned programs it is additionally possible to split the computational
load over multiple machines. If <tt>&lt;N&gt;</tt> machines are available then <tt>&lt;N&gt;</tt> jobs should
be started. Each job should have an identical parameter <b>-J</b> <b>N</b> (e.g. <b>-J</b> <b>10</b>),
and varying parameters <b>-j</b> <b>0</b>, <b>-j</b> <b>1</b>, ... <b>-j</b> <b>N-1</b> (e.g. <b>-j</b> <b>9</b>).
It is possible to use threads in each individual job, but the number of threads should be
identical across all jobs issued. Output should typically be directed using a convention
such as <b>-o</b> <b>out.0</b>, <b>-o</b> <b>out.1</b>, ... <b>-o</b> <b>out.9</b>.
</p>
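<p style="margin-bottom:0" class="asd_par">
The dispatch convention described above can be sketched with a small shell loop.
The helper <tt>print_jobs</tt> is hypothetical and only prints the command lines
rather than running them. The <tt>-imx data.mci</tt> input argument in the usage line
is an assumption borrowed from the other examples in this document.
</p>
<div class="verbatim">
```shell
# Hypothetical helper: print one command line per job, each with the same
# -J value, a distinct -j value and a distinct output file.
# Usage: print_jobs N command [arguments...]
print_jobs() {
  total=$1
  shift
  i=0
  while [ "$i" -lt "$total" ]; do
    echo "$* -J $total -j $i -o out.$i"
    i=$((i + 1))
  done
}

# Three jobs, each using four threads.
print_jobs 3 mcx diameter -imx data.mci -t 4
```
</div>
<p style="margin-top:0em; margin-bottom:0em">
Each printed line would then be started on its own machine, for example through a job scheduler.
</p>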
<p style="margin-bottom:0" class="asd_par">
After all jobs have finished the outputs must be combined to form the final answer.
The manner in which this is done is dependent on the program used.
With the example output above this would be done as follows. It can be seen
that <a class="local sibling" href="clminfo2.html">clm info2</a> is not yet supported by <a class="local sibling" href="mcxcollect.html">mcx collect</a> and requires somewhat
idiosyncratic processing.
</p>
<div class="verbatim"> # mcx diameter:
mcx collect --add-column -o out.diameter out.{0,1,2,3,4,5,6,7,8,9}
# mcx ctty:
mcx collect --add-column -o out.ctty out.{0,1,2,3,4,5,6,7,8,9}
# mcx clcf:
mcx collect --add-column -o out.clcf out.{0,1,2,3,4,5,6,7,8,9}
# mcxarray:
mcx collect --add-matrix -o out.array out.{0,1,2,3,4,5,6,7,8,9}
# clm info2:
clxdo add_table out.{0,1,2,3,4,5,6,7,8,9} > out.info2</div>
<a name="blast"></a>
<h2>Clustering similarity graphs encoded in BLAST results</h2>
<p style="margin-bottom:0" class="asd_par">
A specific instance of the workflow above is the clustering of proteins based on
their sequence similarities. In the most typical scenario the external
format is BLAST output, which needs to be transformed to <span class="smallcaps">ABC</span> format.
In the examples below the input is in columnar blast format
obtained with the blast -m8 option.
It requires a version of <b>mcl</b> at least as recent as <tt>09-061</tt>.
First we create an <span class="smallcaps">ABC</span>-formatted file using the external columnar BLAST
format, which is assumed to be in a file called <tt>seq.cblast</tt>.
</p>
<div class="verbatim"> cut -f 1,2,11 seq.cblast > seq.abc</div>
<p style="margin-top:0em; margin-bottom:0em">
The columnar format in the file <tt>seq.cblast</tt> has, for a given BLAST hit,
the sequence labels in the first two columns and the associated E-value in
column 11. It is parsed by the standard UNIX cut(1) utility. The format
must have been created with the BLAST -m8 option so that no comment lines
are present. Alternatively these can be filtered out using grep.
</p>
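<p style="margin-bottom:0" class="asd_par">
The grep route mentioned above might look as follows. This is a sketch that
assumes comment lines start with <tt>#</tt>, as they do in commented columnar BLAST output.
The helper name <tt>blast_to_abc</tt> and the sample hit are made up for illustration.
</p>
<div class="verbatim">
```shell
# Hypothetical helper: drop comment lines, then keep the two sequence
# labels (columns 1 and 2) and the E-value (column 11).
blast_to_abc() {
  grep -v '^#' "$1" | cut -f 1,2,11
}

tmp=$(mktemp -d)
# One made-up comment line and one made-up hit with twelve tab-separated columns.
printf '# made-up comment line\n' > "$tmp/seq.cblast"
printf 'q1\ts1\t98.5\t100\t1\t0\t1\t100\t1\t100\t1e-50\t200\n' >> "$tmp/seq.cblast"
blast_to_abc "$tmp/seq.cblast" > "$tmp/seq.abc"
cat "$tmp/seq.abc"
rm -r "$tmp"
```
</div>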
<p style="margin-top:0em; margin-bottom:0em">
The newly created <tt>seq.abc</tt> file is loaded by <a class="local sibling" href="mcxload.html">mcxload</a>,
which writes both a network file <tt>seq.mci</tt> and a dictionary
file <tt>seq.tab</tt>.
</p>
<div class="verbatim"> mcxload -abc seq.abc --stream-mirror --stream-neg-log10 -stream-tf 'ceil(200)' \
    -o seq.mci -write-tab seq.tab</div>
<p style="margin-top:0em; margin-bottom:0em">
The <tt>--stream-mirror</tt> option ensures that the resulting network will be
undirected, as recommended when using <b>mcl</b>. Omitting this option would
result in a directed network as BLAST E-values generally differ between two
sequences. The default course of action for <a class="local sibling" href="mcxload.html">mcxload</a> is to use the
best value found between a pair of labels. The next option,
<tt>--stream-neg-log10</tt>, transforms the numerical values in the input (the BLAST
E-values) by taking the logarithm in base 10 and subsequently negating the
sign. Finally, <tt>-stream-tf 'ceil(200)'</tt> caps the transformed values so that any E-value below
1e-200 is set to a maximum allowed edge weight of 200.
</p>
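<p style="margin-bottom:0" class="asd_par">
The arithmetic of the weight transformation can be mimicked on a plain
<span class="smallcaps">ABC</span> stream with awk. This is an illustration only, not how
<b>mcxload</b> is implemented: the helper name <tt>neg_log10_cap</tt> is hypothetical,
mirroring is not reproduced, and mapping an E-value of zero to the cap is an assumption.
</p>
<div class="verbatim">
```shell
# Hypothetical helper: read label, label, E-value from standard input and
# print the weight min(-log10(E), 200) in the third column.
neg_log10_cap() {
  awk -v cap=200 '{
    # Assumption: an E-value of 0 has no finite logarithm and gets the cap.
    if ($3 + 0 == 0) { w = cap } else { w = -log($3) / log(10) }
    if (w > cap) { w = cap }
    printf "%s\t%s\t%.2f\n", $1, $2, w
  }'
}

# Made-up edges: an ordinary hit, a hit below 1e-200, and an E-value of 0.
printf 'a b 1e-5\na c 1e-300\nb c 0\n' | neg_log10_cap
```
</div>
<p style="margin-top:0em; margin-bottom:0em">
The first edge receives weight 5, the other two are capped at 200.
</p>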
<p style="margin-bottom:0" class="asd_par">
To obtain clusterings from <tt>seq.mci</tt> and <tt>seq.tab</tt> one has two
choices. The first is to generate an abstract clustering representation
and from that obtain the label output, as follows.
Below the <b>-o</b> option is not used, so mcl will create meaningful and
unique output names by itself. The default way of doing this is to prepend
the prefix <tt>out.</tt> and to append a suffix encoding the inflation value
used, with inflation encoded using two digits of precision and the decimal
separator removed.
</p>
<div class="verbatim"> mcl seq.mci -I 1.4
mcl seq.mci -I 2
mcl seq.mci -I 4
mcl seq.mci -I 6
mcxdump -icl out.seq.mci.I14 -tabr seq.tab -o dump.seq.mci.I14
mcxdump -icl out.seq.mci.I20 -tabr seq.tab -o dump.seq.mci.I20
mcxdump -icl out.seq.mci.I40 -tabr seq.tab -o dump.seq.mci.I40
mcxdump -icl out.seq.mci.I60 -tabr seq.tab -o dump.seq.mci.I60</div>
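<p style="margin-bottom:0" class="asd_par">
The naming rule described above can be reproduced for the inflation values used here
with a small helper. The name <tt>mcl_out_name</tt> is hypothetical, and the sketch assumes
inflation values with at most one decimal; how <b>mcl</b> encodes other values has not been
checked here.
</p>
<div class="verbatim">
```shell
# Hypothetical helper: predict the default output name from the input file
# name and the inflation value, e.g. seq.mci and 1.4 give out.seq.mci.I14.
mcl_out_name() {
  digits=$(awk -v i="$2" 'BEGIN { printf "%d", i * 10 + 0.5 }')
  printf 'out.%s.I%s\n' "$1" "$digits"
}

for I in 1.4 2 4 6; do
  mcl_out_name seq.mci "$I"
done
```
</div>
<p style="margin-top:0em; margin-bottom:0em">
This is convenient when a script has to locate the cluster files for a series of inflation values.
</p>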
<p style="margin-top:0em; margin-bottom:0em">
Now the file <tt>out.seq.mci.I14</tt> and its associates can be used, for example,
to compute the distances between the encoded clusterings with
<b>clm dist</b>, to compute a set of strictly reconciled nested clusterings
with <b>clm order</b>, or to compute an efficiency criterion with
<b>clm info</b>.
</p>
<p style="margin-bottom:0" class="asd_par">
Alternatively, label output can be obtained directly from <b>mcl</b>
as follows.</p>
<div class="verbatim"> mcl seq.mci -I 1.4 -use-tab seq.tab
mcl seq.mci -I 2 -use-tab seq.tab
mcl seq.mci -I 4 -use-tab seq.tab
mcl seq.mci -I 6 -use-tab seq.tab</div>
<a name="array"></a>
<h2>Clustering expression data</h2>
<p style="margin-bottom:0" class="asd_par">
The clustering of expression data constitutes another workflow. In this case the
external format usually is a tabular file format containing labels for genes
or probes and numerical values measuring the expression values or fold
changes across a series of conditions or experiments. Such tabular files can
be processed by <a class="local sibling" href="mcxarray.html">mcxarray</a>, which comes installed with <b>mcl</b>. The
program computes correlations (either Pearson or Spearman) between genes,
and creates an edge between genes if their correlation exceeds the specified
cutoff. From this <a class="local sibling" href="mcxarray.html">mcxarray</a> creates both a network file and a
dictionary file. In the example below, the file <tt>expr.data</tt> is
in tabular format with one row of column headers (e.g. tags for
experiments) and one column of row identifiers (e.g. probe or gene identifiers).
</p>
<div class="verbatim"> mcxarray -data expr.data -skipr 1 -skipc 1 -o expr.mci -write-tab expr.tab --pearson -co 0.7 -tf 'abs(),add(-0.7)'
</div>
<p style="margin-top:0em; margin-bottom:0em">
This uses the Pearson correlation, ignoring values below 0.7.
The remaining values in the interval <tt>[0.7,1]</tt> are remapped to the interval
<tt>[0,0.3]</tt>. This is recommended so that the edge weights will have
increased contrast between them, as <b>mcl</b> is affected by relative differences
(ratios) between edge weights rather than absolute differences. To illustrate
this, values 0.75 and 0.95 are mapped to 0.05 and 0.25, with a ratio of
0.79 before and 0.2 after the remapping.
The network file <tt>expr.mci</tt> and the dictionary file <tt>expr.tab</tt> can
now be used as before.
</p>
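<p style="margin-bottom:0" class="asd_par">
The effect of the shift on the weight ratio can be checked with a line of awk.
The helper name <tt>contrast</tt> is hypothetical; it prints the ratio of two
correlations before and after subtracting the cutoff.
</p>
<div class="verbatim">
```shell
# Hypothetical helper: ratio of two correlations before and after the
# cutoff is subtracted. Arguments: lower value, higher value, cutoff.
contrast() {
  awk -v lo="$1" -v hi="$2" -v co="$3" 'BEGIN {
    printf "%.2f %.2f\n", lo / hi, (lo - co) / (hi - co)
  }'
}

contrast 0.75 0.95 0.7    # prints "0.79 0.20"
```
</div>
<p style="margin-top:0em; margin-bottom:0em">
The smaller the second number relative to the first, the more contrast the remapping has added.
</p>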
<p style="margin-bottom:0" class="asd_par">
It is possible to investigate the effect of the correlation cutoff as follows.
First a network is generated at a very low threshold, and this network
is analysed using <b>mcx query</b>.
</p>
<div class="verbatim"> mcxarray -data expr.data -skipr 1 -skipc 1 -o expr20.mcx --write-binary --pearson -co 0.2 -tf 'abs()'
mcx query -imx expr20.mcx --vary-correlation
</div>
<p style="margin-top:0em; margin-bottom:0em">
The output is in a tabular format describing the properties of the network
at increasing correlation thresholds. Examples are the size of the biggest
component, the number of orphan nodes (not connected to any other node), and
the mean and median node degrees.
A good way to choose the cutoff is to balance the number of orphan nodes (singletons)
and the median node degree. Both should preferably not be too high.
For example the number of orphan nodes should be
less than ten percent of the total number of nodes,
and the median node degree should be at most one hundred neighbours.
</p>
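<p style="margin-bottom:0" class="asd_par">
The two rules of thumb can be written down as a small check. The helper
<tt>cutoff_ok</tt> is hypothetical and takes numbers read off the table by hand;
it does not parse the output of <b>mcx query</b>, and the example figures are made up.
</p>
<div class="verbatim">
```shell
# Hypothetical helper: apply the two rules of thumb for a given threshold.
# Arguments: total node count, orphan node count, median node degree.
cutoff_ok() {
  awk -v n="$1" -v orphans="$2" -v median="$3" 'BEGIN {
    if (orphans * 10 >= n) { print "too many orphans"; exit }
    if (median > 100)      { print "median degree too high"; exit }
    print "ok"
  }'
}

# Made-up figures: 20000 nodes, 1500 orphans, median degree 80.
cutoff_ok 20000 1500 80
```
</div>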
<a name="degree"></a>
<h2>Reducing node degrees in the graph</h2>
<p style="margin-top:0em; margin-bottom:0em">
A good way to lower node degrees in a network is to require that
an edge is among the best <i>k</i> edges (those of highest weight) for
<i>both</i> nodes incident to the edge, for some value of <i>k</i>. This is
achieved by using <tt>knn(k)</tt> in the argument to the <b>-tf</b> option to
mcl or <b>mcx alter</b>.
To give an example, a graph was formed on translations in Ensembl release 57 on 2.6M nodes.
The similarities were obtained from BLAST scores,
leading to a graph with a total edge count of 300M, with
best-connected nodes of degree respectively
11148, 9083, 9070, 9019 and 8988, and with mean node degree 233.
These degrees are unreasonably high.
The graph was subjected to <b>mcx query</b> to investigate the effect of
varying k-NN parameters. A good heuristic is to choose a value
that does not significantly change the number of singletons in the input graph.
In the example it meant that <b>-tf</b> <b>'knn(160)'</b> was feasible, leading
to a mean node degree of 98.
</p>
<p style="margin-bottom:0" class="asd_par">
A second approach to reduce node degrees is to employ the <b>-ceil-nb</b> option.
This ranks nodes by node degree, highest first. Nodes are considered
in order of rank, and edges of low weight are removed from the graph until
a node satisfies the node degree threshold specified by <b>-ceil-nb</b>.
</p>
<a name="seealso"></a>
<h2>SEE ALSO</h2>
<p style="margin-top:0em; margin-bottom:0em">
<a class="local sibling" href="mcxio.html">mcxio</a>.
</p>
<a name="author"></a>
<h2>AUTHOR</h2>
<p style="margin-top:0em; margin-bottom:0em">
Stijn van Dongen.</p>
</body>
</html>