<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <title>Decoding</title>
  <style type="text/css">
     pre { font-size: medium; background: #f0f8ff; padding: 2mm; border-style: ridge ; color: teal}
     code {font-size: medium; color: teal}
  </style>
</head>
 <body>

<a name="top">INDEX</a>
<p>(This is under construction.)
<!-- ==================================================================== -->
<ol>
   <li><a href="#0"><font color="red">Miscellaneous</font></a>
   <ul>
   <li><a href="#00">Generating lattices and N-best lists, and some facts about them</a>
   <li><a href="#01">Explanation of various fields in an ARPA format LM</a>
   <li><a href="#02">Generating matchseg files</a>
   <li><a href="#03">Explanation of some SPHINX-II decoder flags</a>
   </ul>
   </ol>
<!-- ==================================================================== -->

<a name="0"></a>
<a name="00"></a>
<!------------------------------------------------------------------------->
<center><h4><font color="red">MISCELLANEOUS</font></h4></center>
<TABLE width="100%" bgcolor="#ffffff">
<td>GENERATING LATTICES AND N-BEST LISTS, AND SOME FACTS</td>
</table>
<!------------------------------------------------------------------------->
<p>
<b><u>Generating lattices:</u></b>
<p>
Lattices can be generated by including the flag
<pre>
-outlatdir [directory in which you want to write lattices]
</pre>
in the arguments of the s3decode binary. For each
utterance, the decoder will then write a lattice in the directory
you have specified. The lattice will be named utteranceid.lat.gz,
and the contents of the file can be viewed by running
"zcat filename" from the command line on a Unix machine.
<p>
If the utterance name in the ctl file includes directory names too, then
you have the option of including or excluding them from the lattice
filenames by appending or omitting the string ,CTL in the argument
you give to -outlatdir. The string is appended directly
after the argument, without a space. Thus, if the argument you give is
<pre>
-outlatdir current
</pre>
the extended argument would be
<pre>
-outlatdir current,CTL
</pre>
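<p>
To illustrate with a made-up utterance name: if the ctl file entry is
dir1/utt001, the two variants would name the lattice as follows:
<pre>
-outlatdir current        -->  current/utt001.lat.gz
-outlatdir current,CTL    -->  current/dir1/utt001.lat.gz
</pre>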
<p>
<b><u>Generating N-best lists from lattices:</u></b>
<p>
N-best lists can be generated from the lattices by using
the binary s3astar. It works much like the decoder (it takes
the same control file as the decoder, and its inlatdir is the
same as the outlatdir that the decoder used). You must
additionally provide an nbestdir where the N-best files are
written. The number of hypotheses in any N-best list can be
specified using the -nbest argument (the default value is 200;
note that just because you ask for 200 hypotheses it does not mean that
you will get 200 hypotheses. If the lattice holds fewer than 200
possible hypotheses, you'll get fewer). The N-best files
will look like matchseg outputs.
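<p>
A hypothetical s3astar run, matching the decoder run above (again, the
model and dictionary arguments it also needs are omitted):
<pre>
s3astar -ctl test.ctl -inlatdir ./lattices -nbestdir ./nbest -nbest 200 [...other arguments...]
</pre>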
<p>
<p>

<b><u>Example of a lattice and explanation of format:</u></b>
<p>
The lattice has three
distinct sections. In the first, all the nodes in the graph are listed
with their associated words and begin and end times. In the second section,
the acoustic scores associated with each of the nodes are listed. In the
final section, the scores associated with the edges between any two words
are listed. The lattice also has additional lines of information giving the
total number of nodes in the graph, the ids of the first and last nodes,
and text describing the format of the lines in the lattice. In addition,
the lattice may contain lines that begin with a "#". These are comments.
<p>
Here are examples from each of the components of the lattice. Explanations
are interspersed.
<pre>
.....
Nodes 1949 (NODEID WORD STARTFRAME FIRST-ENDFRAME LAST-ENDFRAME)
0 ++GARBAGE++ 254 256 256
1 ++LAUGH++ 254 256 256
2 ++N++ 254 256 256
3 ++GARBAGE++ 253 255 255
4 ++LAUGH++ 253 255 255
5 ++N++ 253 255 255
6 ++GARBAGE++ 252 254 254
7 ++LAUGH++ 252 254 254
8 ++N++ 252 254 254
9 ++GARBAGE++ 251 253 253
10 ++N++ 251 253 253
11 A 251 253 253
12 ++GARBAGE++ 250 252 252
13 ++N++ 250 252 252
14 A 250 252 252
15 ++N++ 249 251 251
16 HAVE 245 250 253
17 ARE(2) 245 250 250
18 HAVE 244 249 249
19 GO 244 249 251  
.....
</pre>
Node no. 16 is the word HAVE; it begins at frame 245 and can end
anywhere between frames 250 and 253.
<pre>
.....
#
Initial 1948
Final 82
.....
</pre>
Nodes are written out in *reverse* order in the lattice. As a
result, the node that is written out last is actually the *first*
node in the lattice. Nodes are also not written in strictly reverse
sequential order since, due to the "stretch" in the ending frames
of different nodes, it is difficult to determine a precise sequence
for all but the first node. As a result, in this lattice, the
first node was node number 1948 (the one written out last), but
the last node was actually node 82.
<pre>
.....
#
BestSegAscr 13865 (NODEID ENDFRAME ASCORE)
1948 2 -172014
1948 3 -207858
1948 4 -220188
1947 5 -351673 
.....
</pre>
While a node can end at many different frames, the acoustic score
associated with the node when it ends at a particular frame will be
different from that associated with it when it ends at a different
frame. This portion of the lattice shows this information. In this example,
when node number 1948 ends at frame 2 it has an acoustic score of -172014;
when it ends at frame 3, the acoustic score is -207858, etc. Note, however,
that this acoustic score is only the best score and is not really useful,
since the true score for the node depends on the path being considered,
due to the existence of cross-word triphones.
<pre>
.....
#
Edges (FROM-NODEID TO-NODEID ASCORE)
33 23 -243293
33 20 -297751
35 23 -1599007
37 23 -1923161     
.....
</pre>
The true acoustic score for any word is dependent on the word following
it in the path. We therefore associate this score with the *edge* leading
from that word to the following word. There can be many edges leading
out of a node even at a given frame. Each of these edges is likely to
have a different score than the other edges.
In the above portion of the lattice we are given the information that
the edge from node 33 to node 23 has the score -243293, the edge from
node 33 to node 20 has score -297751 and so on. Keep in mind that there
can be only one edge between any two nodes, even though a node can
end at many different frames. This is because only one of these possible
ending frames will permit a proper edge to the unique starting frame
of the next word.
<pre>
.....
1948 1440 -2083713
1948 1399 -220188
End 
.....
</pre>
The lattice ends here.
<p>
Note also that a lattice is actually a *tree*, and so the left
context of any node is fixed. The variations in the acoustic scores
of words are therefore due only to the right contexts, since any node in
a tree can have only one predecessor. However, what sphinx3 writes
out is not a lattice but actually a DAG, a directed acyclic
graph: nodes representing the same word
in the lattice are merged if they have identical time stamps.
What you see in the "lattice" file is actually this DAG and not a
tree-structured lattice at all.
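<p>
As a concrete illustration of the node format, here is a minimal C sketch
(not part of sphinx3) that reads the Nodes section of an uncompressed
lattice file and prints each node. It assumes exactly the line layout shown
above and does almost no error handling:
<pre>
#include &lt;stdio.h>

int main (int argc, char *argv[])
{
    FILE *fp;
    char line[1024], word[256];
    int id, sf, fef, lef, n, in_nodes = 0;

    if (argc != 2 || (fp = fopen (argv[1], "r")) == NULL) {
        fprintf (stderr, "usage: %s lattice-file\n", argv[0]);
        return 1;
    }
    while (fgets (line, sizeof(line), fp) != NULL) {
        if (line[0] == '#')                     /* comment line */
            continue;
        if (!in_nodes) {                        /* look for the Nodes header */
            if (sscanf (line, "Nodes %d", &n) == 1)
                in_nodes = 1;
            continue;
        }
        /* NODEID WORD STARTFRAME FIRST-ENDFRAME LAST-ENDFRAME */
        if (sscanf (line, "%d %255s %d %d %d", &id, word, &sf, &fef, &lef) != 5)
            break;                              /* end of the Nodes section */
        printf ("node %d: %s starts at %d, ends between %d and %d\n",
                id, word, sf, fef, lef);
    }
    fclose (fp);
    return 0;
}
</pre>
(Remember to zcat the .lat.gz file first; this sketch reads plain text.)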
<p>


<u>An important consideration in combining lattices from different
sources:</u><br>
if you had two parallel paths of this kind:
<p>
<pre>
   ......> WORD1 ------> WORD2
   ......> WORD1 ------> WORD2
</pre>
(say, WORD1 = "and" and WORD2 = "the")
<p>
you CAN merge them to
<pre>
   .....> WORD1 ------> WORD2
</pre>
<p>
*If* you are using CI models! Then the two parallel edges (the dashed
edges) would have close to identical scores, so you could just take
the higher score.
But if you are using CD models, here's what will happen: the edges from
path1 and path2 will have *different* scores in the lattice *even* if
both WORD1 and WORD2 begin and end at exactly the same time instants in
both cases.
This is because the word preceding WORD1 in the two cases would have been
different, so the cross-word triphone score of the first phone in the word
would have been different. E.g.
<pre>
   OK......> WORD1(and) ------> WORD2(the)
   BIT......> WORD1(and) ------> WORD2(the)
</pre>
The word preceding "and" in the first path is "ok", so the cross-word
triphone at the beginning of "and" in the first path is A(EY,N).
The preceding word is "bit" in path 2, so the cross-word triphone at the
beginning of "and" is A(T,N). The score of the edge between
"and" and "the" would reflect this and be different in the two paths.
All this happens even when you are working with only a *single*
lattice (e.g. the MFC lattice). Any heuristic, like using the highest
score for the merged path (nodes/edges), is likely to backfire for this
reason (but would have to be experimentally tested).
If path1 is (say) from an MFC-based lattice and path2 is (say) from a
PLP-based one, this problem is compounded by the additional problem of
how to come up with correct scaling factors for the scores.

<p>
<a name="0"></a>
<a name="01"></a>
<!------------------------------------------------------------------------->
<center><h4><font color="red">MISCELLANEOUS</font></h4></center>
<TABLE width="100%" bgcolor="#ffffff">
<td>EXPLANATION OF VARIOUS FIELDS IN AN ARPA FORMAT LM</td>
</table>
<!------------------------------------------------------------------------->
At the top of the LM are the lines:
<pre>
\data\
ngram 1=NUM1
ngram 2=NUM2
ngram 3=NUM3
</pre>

This means that there are NUM1 unigrams, NUM2 bigrams
and NUM3 trigrams in the LM.

Then you have a line
<pre>
\1-grams:
</pre>
This means that all following lines are unigrams, until you encounter
a line "\2-grams:" or an "\end\" marker.
The \end\ marker marks the end of the ARPA LM.

All unigrams have the form
<pre>
NUMa WORD NUMb
</pre>
NUMa is the log probability of the unigram for the word WORD.
NUMb is the back-off weight associated with that word.


For bigrams, entries may be
<pre>
NUMa WORD1 WORD2 NUMb
</pre>
or
<pre>
NUMa WORD1 WORD2
</pre>
The first form of entry is used when the LM also has trigrams.
If it is only a bigram LM, the entries will be of the second
form.

Here, NUMa is the log prob of the bigram P(WORD2 | WORD1) and NUMb is
the back-off weight for the word pair (WORD1 WORD2).

The general N-gram entry is of the form
<pre>
NUMa WORD1 WORD2 ... WORDN NUMb
</pre>
or, if N is the highest order in the model (so there is no back-off weight),
<pre>
NUMa WORD1 WORD2 ... WORDN
</pre>
All logarithms are base 10.
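<p>
For concreteness, here is a small made-up fragment of a bigram-only ARPA LM
(the words and numbers are purely illustrative):
<pre>
\data\
ngram 1=3
ngram 2=2

\1-grams:
-1.0000 &lt;s> -0.3010
-0.6990 the -0.2218
-1.3010 &lt;/s> 0.0000

\2-grams:
-0.3979 &lt;s> the
-0.5229 the &lt;/s>

\end\
</pre>
Since this LM has no trigrams, the bigram entries take the second form
above, with no trailing back-off weight.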

To prune the LM, you can delete all N-gram entries where the difference
between the probability entry for that N-gram and the predicted probability
for the N-gram obtained by backing off is very small. The predicted
probability is (of course),
for trigrams:
P(C|A,B) ~ P(C|B) * backoffwt(A,B),
and for bigrams:
P(C|B) ~ P(C) * backoffwt(B).

Pruning is most easily done only on the highest-order N-grams, since
deleting a lower-order N-gram deletes the back-off weight for that N-gram
as well and affects our prediction for the higher-order N-grams. For example,
if we pruned P(B|A) out of the LM, then backoffwt(A,B) would
also get pruned out. This affects the estimate of P(C|A,B), and the
pruning heuristic would have to take this into account.
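<p>
A sketch of this pruning test in C, working directly in log10 (so the
products above become sums); the function and variable names are
illustrative, not from any Sphinx tool:
<pre>
#include &lt;math.h>

/* Return nonzero if the trigram entry for (A,B,C) can be pruned.
   logp_abc = log10 P(C|A,B) as stored in the LM,
   logp_bc  = log10 P(C|B), bowt_ab = back-off weight for (A,B). */
int prunable (double logp_abc, double logp_bc, double bowt_ab,
              double threshold)
{
    /* Backed-off prediction: log10 P(C|A,B) ~ log10 P(C|B) + bowt(A,B) */
    double predicted = logp_bc + bowt_ab;

    /* Prune when the stored entry is very close to the prediction */
    return fabs (logp_abc - predicted) &lt; threshold;
}
</pre>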
<p>
<a name="0"></a>
<a name="02"></a>
<!------------------------------------------------------------------------->
<center><h4><font color="red">MISCELLANEOUS</font></h4></center>
<TABLE width="100%" bgcolor="#ffffff">
<td>GENERATING MATCHSEG FILES, FORMAT OF MATCHSEG OUTPUT</td>
</table>
<!------------------------------------------------------------------------->
Matchseg files can be generated by using the flag -matchsegfn [filename]
in the arguments of the decoder binary.
<pre>
SB01 
    S 36683016 
    T -27154407 
    A -21711975 
    L -858796 
    0 -603268 0 &#60;s> 
    17 -814154 -49595 SHOW 
    40 -594450 -11806 THE(3) 
    51 -463637 -23023 &#60;sil> 
    65 -2880525 -88849 ITS(2) 
    133 -1203392 -72333 DATES 
    171 -806587 -5753 FOR 
    185 -898344 -26773 ALL 
    210 -2459017 -71603 DEPLOYED 
    260 -1765774 -94176 CEP
    302 -1791387 -76622 EVERETT(2) 
    338 -843218 -62384 THEIR 
    355 -848925 -17748 HOME 
    376 -1482528 -2325 PORT 
    411 -1058809 -79875 SPEED 
    435 -786647 -37608 BY 
    453 -1771960 -32704 FIVE 
    482 -363323 -23023 &#60;sil> 
    489 -276030 -82596 &#60;/s> 
    514
</pre>

This is a hypothesis in the "matchseg format". This output usually comes in
a single line, but it has been rearranged above to make it easier to read.
The first word (or "field") is the filename; S is a scaling factor (to
prevent integers from wrapping around due to underflow; it can be thought
of as a normalization factor for likelihoods). T is the total likelihood of
the utterance, A is the acoustic likelihood, and L is the LM likelihood
(these are all log likelihoods, hence the large numbers). From then on the
format is beginning_frame_number acoustic_score lm_score WORD,
beginning_frame_number acoustic_score lm_score WORD, ... and so on. At the
end, the LAST frame number of the utterance is written (514 in this case).
<br>
The hypothesis itself is the string of all words between &#60;s> and &#60;/s>.
(The hypothesis combination program needs two or more
such matchseg files [in the same order] and outputs a matchseg file
which is the best-path hypothesis in the graph constructed from
the input matchseg files.)
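<p>
To make the field layout concrete, here is a small C sketch (not part of
any Sphinx tool) that walks one matchseg line and prints just the word
sequence, assuming exactly the layout described above:
<pre>
#include &lt;stdio.h>
#include &lt;string.h>

int main (void)
{
    /* A shortened version of the example hypothesis above */
    char line[] = "SB01 S 36683016 T -27154407 A -21711975 L -858796 "
                  "0 -603268 0 &lt;s> 17 -814154 -49595 SHOW "
                  "489 -276030 -82596 &lt;/s> 514";
    char *tok = strtok (line, " \t\n");   /* utterance id */

    /* Skip the S, T, A and L fields and the values that follow them */
    tok = strtok (NULL, " \t\n");
    while (tok != NULL && tok[1] == '\0' && strchr ("STAL", tok[0]) != NULL) {
        (void) strtok (NULL, " \t\n");    /* the score value */
        tok = strtok (NULL, " \t\n");
    }

    /* Remaining tokens come in (frame, ascore, lscore, WORD) groups;
       a single trailing token is the last frame of the utterance */
    while (tok != NULL) {
        char *ascr = strtok (NULL, " \t\n");
        char *lscr = strtok (NULL, " \t\n");
        char *word = strtok (NULL, " \t\n");
        if (ascr == NULL || lscr == NULL || word == NULL)
            break;                        /* tok was the final frame number */
        printf ("%s ", word);             /* prints &lt;s> and &lt;/s> too */
        tok = strtok (NULL, " \t\n");
    }
    printf ("\n");
    return 0;
}
</pre>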
<p>
<a name="0"></a>
<a name="03"></a>
<!------------------------------------------------------------------------->
<center><h4><font color="red">MISCELLANEOUS</font></h4></center>
<TABLE width="100%" bgcolor="#ffffff">
<td>EXPLANATION OF SOME SPHINX-II DECODER FLAGS</td>
</table>
<!------------------------------------------------------------------------->
<p>
<b> -compress </b>: <i>compress excess background frames</i><br>
<b> -compressprior </b>: <i>compress excess background frames based on prior utt</i>
<br>Typical silence-compression code is as follows:
<pre>
   if (silcomp == COMPRESS_PRIOR) {
        j = 0;                          /* next compressed frame index */
        for (i = 0; i < nfr; i++) {
            /* histo_add_c0() adds this frame's C0 (log energy) to a
               histogram and returns whether the frame should be kept */
            if (histo_add_c0 (mfc[i][0])) {
                if (i != j)
                    memcpy (mfc[j], mfc[i], sizeof(float)*CEP_SIZE);

                comp2rawfr[j++] = i;    /* map compressed index to raw index */
            }
            /* Else skip the frame, don't copy across */
        }
        nfr = j;                        /* compressed frame count */
    }
</pre>

<p>The "silence" frames are actually deleted.</p>

Note: this is not a good idea when you are using models trained in the
standard manner using SPHINX-III.
Deleting silence frames completely during decoding (regardless of whether
they are put back in the seg file later) is bad: we train cross-word
triphones with silence as context explicitly. There are usually hundreds of
such triphones in the model set. If there are no silence frames at all in
the sequence of frames being decoded, the cross-word triphones with silence
never get a chance to be used. Note that in most model sets, the silence
and breath models are usually the best-trained models.
<p>

<hr>
<em> last modified: 22 Nov. 2000 </em>
</body>
</html>