<!DOCTYPE HTML PUBLIC "-//w3c//DTD HTML 4.01//EN">
<html>
<head>
  <title>Sphinx-3 FAQ</title>
  <style type="text/css">
     pre { font-size: medium; background: #f0f8ff; padding: 2mm; border-style: ridge ; color: teal}
     code {font-size: medium; color: teal}
  </style>
</head>
 <body>
 
    <H1><center><U>Sphinx-3 FAQ</U></center></H1>
    <center>
      Rita Singh<br>
      Sphinx Speech Group<br>
      School of Computer Science<br>
      Carnegie Mellon University<br>
      Pittsburgh, PA 15213<br>
    </center>

<p>This document is constantly under construction. You will find the most up-to-date version of this document <a href="http://www.speech.cs.cmu.edu/sphinxman/">here</a>.</p>
<a NAME="top"></a>INDEX (this document is under construction...)
<ol>
<li><a href="#1">Data-preparation for training acoustic models</a>
<li><a href="#2">Selecting modeling parameters</a>
<li><a href="#13">Feature computation</a>
<li><a href="#15">Modeling filled pauses and non-speech events</a>
<li><a href="#3">Training speed</a>

<li><a href="#5">Questions specific to log files</a>
<li><a href="#6">Vector-quantization for discrete and semi-continuous 
models</a>
<li><a href="#14">Flat-initialization</a>
<li><a href="#7">Updating existing models</a>
<li><a href="#8">Utterance, word and phone segmentations</a>
<li><a href="#9">Force-alignment(Viterbi alignment)</a>
<li><a href="#10">Baum-Welch iterations and associated likelihoods</a>
<li><a href="#11">Dictionaries, pronunciations and phone-sets</a>
<li><a href="#12">Decision tree building and parameter sharing</a>


<li><a href="#4">Post-training disasters</a>
<li><a href="#16">Why is my recognition accuracy poor?<a>
<li><a href="#17">Interpreting SPHINX-II file formats<a>
<li><a href="#23">Interpreting SPHINX-III file formats<a>
<li><a href="#18">Hypothesis combination<a>
<li><a href="#19">Language model<a>
<li><a href="#20">Training context-dependent models with untied states<a>
<li><a href="#21">Acoustic likelihoods and scores<a>
<li><a href="#22">Decoding problems<a>
<li><a href="#23">(Added at 20040910 by Arthur Chan) Why Sphinx III's performance is poorer than recognizer X? <a>
</ol>


<hr>

<a NAME="1"></a>
<TABLE width="100%" bgcolor="#ffffff">
<td>DATA-PREPARATION FOR TRAINING ACOUSTIC MODELS</td>
</table>

<p>
Q<font color="#483d8b">. How can I tell the size of my speech corpus
in hours? Can I use it all for training?</font>
<p>
A. You can only train with utterances for which you have transcripts.
You cannot usually tell the size of your corpus from the number of utterances
you have. Sometimes utterances are very long, and at other times they may
be as short as a single word or sound. The best way to estimate the size
of your corpus in hours is to look at the total size in bytes of all utterance
files which you can use to train your models. Speech data are usually stored
in integer format. Assuming that is so and ignoring any headers that your
file might have, an approximate estimate of the size of your corpus in
hours can be obtained from the following parameters of the speech data:
<br>&nbsp;
<table BORDER=0 CELLSPACING=0 WIDTH="700" >
<tr>
<td><b>Sampling Rate:&nbsp;</b></td>

<td>If this is <i>S</i> kilohertz, then there are <i>S</i> x 1000 samples
(integers) in every second of your data.</td>
</tr>

<tr>
<td><b>Sample Size:</b></td>

<td>If your sampling is "8bit", then every integer has 1 byte associated
with it. If it is "16bit", then every integer in your data has 2 bytes associated
with it.</td>
</tr>

<tr>
<td><b>Hour Size:</b></td>

<td>3600 seconds in an hour</td>
</tr>
</table>

<p>Here's a quick reference table:
<br>&nbsp;
<table BORDER=0 CELLSPACING=0 CELLPADDING=0 COLS=4 WIDTH="500" >
<tr>
<td>No. of bytes</td>

<td>Sampling rate</td>

<td>Sample size</td>

<td>Hours of data</td>
</tr>

<tr>
<td>X</td>

<td>8khz</td>

<td>8bit</td>

<td>X / (8000*1*3600)</td>
</tr>

<tr>
<td>X</td>

<td>16khz</td>

<td>16bit</td>

<td>X / (16000*2*3600)</td>
</tr>
</table>
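
<p>As a quick illustration of the arithmetic above, here is a minimal sketch
(in C, with made-up example numbers) that converts a total byte count into
hours of audio. The function name and the 1.5 GB figure are only illustrative;
plug in your own sampling rate and sample size.
<pre>
#include &lt;stdio.h&gt;

/* Estimate corpus duration in hours from the total size in bytes,
   ignoring any file headers. */
double corpus_hours(double total_bytes, double sample_rate_hz,
                    int bytes_per_sample)
{
    /* bytes -> samples -> seconds -> hours */
    return total_bytes / (sample_rate_hz * bytes_per_sample * 3600.0);
}

int main(void)
{
    /* Example: 1.5 GB of 16 kHz, 16-bit (2 bytes/sample) raw audio */
    printf("%.2f hours\n", corpus_hours(1.5e9, 16000.0, 2));
    return 0;
}
</pre>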

<p>Q: <font color="#483d8b">I have about 7000 utterances or 12 hours of
speech in my training set. I found force-aligned transcripts for all but 114
utterances, and those 114 utterances have no transcripts that I can find.
Should I leave them out of the training? I don't think it will make that
much difference, as it is only 1.4% of the data. Also, for this much data, how
many senones should I use?</font>
<p>A: Leave out utterances for which you don't have transcripts (unless
you have very little data in the first place, in which case listen to the
audio and transcribe it yourself). In this case, just leave them out.
<p>Rule-of-thumb figures for the number of senones that you should be training
are given in the following table:
<br>&nbsp;
<center><table BORDER COLS=2 WIDTH="300" >
<tr>
<td>Amount of training data(hours)</td>

<td>No. of senones</td>
</tr>

<tr ALIGN=LEFT>
<td>1-3</td>

<td>500-1000</td>
</tr>

<tr>
<td>4-6</td>

<td>1000-2500</td>
</tr>

<tr>
<td>6-8</td>

<td>2500-4000</td>
</tr>

<tr>
<td>8-10</td>

<td>4000-5000</td>
</tr>

<tr>
<td>10-30&nbsp;</td>

<td>5000-5500</td>
</tr>

<tr>
<td>30-60</td>

<td>5500-6000</td>
</tr>

<tr>
<td>60-100&nbsp;</td>

<td>6000-8000</td>
</tr>

<tr>
<td>Greater than 100</td>

<td>8000 are enough</td>
</tr>
</table></center>

<p>Q: <font color="#483d8b">What is force-alignment? Should I force-align
my transcripts before I train?</font>
<p>A: The process of force-alignment takes an existing transcript and
finds out which, among the many pronunciations for the words occurring in
the transcript, are the correct pronunciations. So when you refer to "force-aligned"
transcripts, you are also inevitably referring to a *dictionary* with reference
to which the transcripts have been force-aligned. So if you have two dictionaries
and one has the word "PANDA" listed as:
<br>&nbsp;
<table BORDER=0 CELLSPACING=0 CELLPADDING=0 WIDTH="300" >
<tr>
<td><tt><font size=+1>PANDA</font></tt></td>

<td><tt><font size=+1>P AA N D AA</font></tt></td>
</tr>

<tr>
<td><tt><font size=+1>PANDA(2)</font></tt></td>

<td><tt><font size=+1>P AE N D AA</font></tt></td>
</tr>

<tr>
<td><tt><font size=+1>PANDA(3)</font></tt></td>

<td><tt><font size=+1>P AA N D AX</font></tt></td>
</tr>
</table>

<p>and the other one has the same word listed as
<br>&nbsp;
<table BORDER=0 CELLSPACING=0 CELLPADDING=0 WIDTH="300" >
<tr>
<td><tt><font size=+1>PANDA</font></tt></td>

<td><tt><font size=+1>P AE N D AA</font></tt></td>
</tr>

<tr>
<td><tt><font size=+1>PANDA(2)</font></tt></td>

<td><tt><font size=+1>P AA N D AX</font></tt></td>
</tr>

<tr>
<td><tt><font size=+1>PANDA(3)</font></tt></td>

<td><tt><font size=+1>P AA N D AA</font></tt></td>
</tr>
</table>

<p>If you force-aligned using the first dictionary and your transcript
now looks like:
<br>I&nbsp; SAW&nbsp; A&nbsp; PANDA(3)&nbsp; BEAR,
<br>but you then train with that transcript using the second dictionary,
you would be giving the <i>wrong</i> pronunciation to the
trainer. You would be telling the trainer that the pronunciation for the
word PANDA in your corpus is "<tt><font size=+1>P AA N D AA</font></tt>"&nbsp;
instead of the correct one, which should have been "<tt><font size=+1>P
AA N D AX</font></tt>". The data corresponding to the phone <tt><font size=+1>AX</font></tt> will
now be wrongly used to train the phone <tt><font size=+1>AA</font></tt>.
<p>What you should really do is collect your transcripts, use only the
first listed pronunciation in your training dictionary, train CI models,
and use *those CI models* to force-align your transcripts against the training
dictionary. Then go all the way back and re-train your CI models with the
new transcripts.
<br>&nbsp;
<p>Q: <font color="#483d8b">I don't have transcripts. How can I force-align?</font>
<p>A: You cannot force-align any transcript that you do not have.
<br>&nbsp;
<p>Q: <font color="#483d8b">I am going to first train a set of coarse models
to force-align the transcripts. So I should submit transcripts with beginning
and end silences marked to the trainer for the coarse models. Currently I am
keeping all the fillers, such as UM, BREATH, NOISE etc. in my transcripts,
but wrapped with "+". Do you think the trainer will consider them as fillers
instead of normal words?</font>
<p>A: According to the trainer, ANY word listed in the dictionary in terms
of any phone or sequence of phones is a valid word. BUT the decision tree
builder ignores any +word+ phone as a noise phone and does not build decision
trees for that phone. So while training, mark the fillers as ++anything++
in the transcript and then see that either the filler dictionary or the
main dictionary has a mapping of the form
<p>++anything++ +something+
<p>where +something+ is a phone listed in your phonelist.
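<p>For example, a filler dictionary might contain entries like the following
(the filler names and phones here are only illustrative; use the fillers that
actually appear in your transcripts and phones that exist in your phonelist):
<pre>
++UM++       +UM+
++BREATH++   +BREATH+
++NOISE++    +NOISE+
</pre>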

<p>Q: <font color="#483d8b">I have a huge collection of data recorded under
different conditions. I would like to train good speaker-independent models
using this (or a subset of this) data. How should I select my data? I also
suspect that some of the transcriptions are not very accurate, but I can't
figure out which ones are inaccurate without listening to all the data.
</font>
<p>
A. If the broad acoustic conditions are similar (for example, if all your
data has been recorded off TV shows), it is best to use all the data you can
get for training speaker-independent, bandwidth-independent,
gender-independent models. If you suspect that some of the data you are
using might be bad for some reason, then during the Baum-Welch iterations
you can monitor the likelihoods corresponding to each utterance and discard
the really low-likelihood utterances. This will filter out the badly
recorded or badly transcribed data.
<p>
Q:<font color="#483d8b">
What is the purpose of the 4th field in the control file:
<p>
newfe/mfcc/sw02001 24915 25019 /phase1/disc01/sw2001.txt_a_249.15_250.19
<p>
Should I leave the /phase1/disc01... as is or should it be formatted
differently? I'm not sure where/why this field is used so I can't make a
good guess as to what it should be.
</font>
<p>
A. The fourth field in the control file is simply an utterance identifier.
So long as that field and the entry at the end of the corresponding
utterance in the transcript file are the same, you can have anything
written there and the training will go through. It is only a very
convenient tag. The particular format that you see for the fourth field is
just an "informative" way of tagging. Usually we use file paths and names
along with other file attributes that are of interest to us.
<p>
Q:<font color="#483d8b">
I am trying to train with Switchboard data. Switchboard data is mulaw
encoded. Do we have generic tools for converting from stereo mulaw to
standard raw file?
</font>
<p>
A. NIST provides a tool called w_edit which lets you specify the output
format, the desired channel to decode, and the beginning and ending sample
that you would like decoded. ch_wave, a part of the Edinburgh speech tools,
does this decoding as well (send mail to awb@cs.cmu.edu for more
information on this). Here is a conversion table for converting 8 bit mu-law
to 16 bit PCM. The usage should be clear from the table: linear_value =
linear[mu_law_value] (i.e. if your mu-law value is 16, the PCM value is
linear[16]).
<pre>

------------- mu-law to PCM conversion table-----------------------
  static short int linear[256] = {-32124, -31100, -30076, -29052,
 -28028, -27004, -25980, -24956, -23932, -22908, -21884, -20860,
 -19836, -18812, -17788, -16764, -15996, -15484, -14972, -14460,
 -13948, -13436, -12924, -12412, -11900, -11388, -10876, -10364,
 -9852, -9340, -8828, -8316, -7932, -7676, -7420, -7164, -6908,
 -6652, -6396, -6140, -5884, -5628, -5372, -5116, -4860, -4604,
 -4348, -4092, -3900, -3772, -3644, -3516, -3388, -3260, -3132,
 -3004, -2876, -2748, -2620, -2492, -2364, -2236, -2108, -1980,
 -1884, -1820, -1756, -1692, -1628, -1564, -1500, -1436, -1372,
 -1308, -1244, -1180, -1116, -1052, -988, -924, -876, -844, -812,
 -780, -748, -716, -684, -652, -620, -588, -556, -524, -492, -460,
 -428, -396, -372, -356, -340, -324, -308, -292, -276, -260, -244,
 -228, -212, -196, -180, -164, -148, -132, -120, -112, -104, -96,
 -88, -80, -72, -64, -56, -48, -40, -32, -24, -16, -8, 0, 32124,
  31100, 30076, 29052, 28028, 27004, 25980, 24956, 23932, 22908,
  21884, 20860, 19836, 18812, 17788, 16764, 15996, 15484, 14972,
  14460, 13948, 13436, 12924, 12412, 11900, 11388, 10876, 10364,
  9852, 9340, 8828, 8316, 7932, 7676, 7420, 7164, 6908, 6652, 6396,
  6140, 5884, 5628, 5372, 5116, 4860, 4604, 4348, 4092, 3900, 3772,
  3644, 3516, 3388, 3260, 3132, 3004, 2876, 2748, 2620, 2492, 2364,
  2236, 2108, 1980, 1884, 1820, 1756, 1692, 1628, 1564, 1500, 1436,
  1372, 1308, 1244, 1180, 1116, 1052, 988, 924, 876, 844, 812,
  780, 748, 716, 684, 652, 620, 588, 556, 524, 492, 460, 428, 396,
  372, 356, 340, 324, 308, 292, 276, 260, 244, 228, 212, 196, 180,
  164, 148, 132, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40,
  32, 24, 16, 8, 0};
------------- mu-law to PCM conversion table-----------------------
</pre>
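<p>As a rough sketch of how the table would be used in code (assuming the
"linear" array above is in scope), converting a buffer of 8-bit mu-law
samples to 16-bit PCM is just one table lookup per sample:
<pre>
#include &lt;stddef.h&gt;

/* Decode n mu-law bytes into 16-bit PCM samples using the table above. */
void mulaw_to_pcm(const unsigned char *in, short *out, size_t n)
{
    for (size_t i = 0; i &lt; n; i++)
        out[i] = linear[in[i]];
}
</pre>
For stereo data, first de-interleave (or select) the channel you want, then
decode it with a loop like this.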
<p>

<a href="#top">back to top
<hr>
<a NAME="2"></a>
<TABLE width="100%" bgcolor="#ffffff">
<td>SELECTING MODELING PARAMETERS</td>
</table>

<p>Q:<font color="#483d8b">How many senones should I train?</font>
<p>A: Rule-of-thumb figures for the number of senones that you should be
training are given in the following table:
<br>&nbsp;
<center><table BORDER COLS=2 WIDTH="300" >
<tr>
<td>Amount of training data(hours)</td>

<td>No. of senones</td>
</tr>

<tr ALIGN=LEFT>
<td>1-3</td>

<td>500-1000</td>
</tr>

<tr>
<td>4-6</td>

<td>1000-2500</td>
</tr>

<tr>
<td>6-8</td>

<td>2500-4000</td>
</tr>

<tr>
<td>8-10</td>

<td>4000-5000</td>
</tr>

<tr>
<td>10-30&nbsp;</td>

<td>5000-5500</td>
</tr>

<tr>
<td>30-60</td>

<td>5500-6000</td>
</tr>

<tr>
<td>60-100&nbsp;</td>

<td>6000-8000</td>
</tr>

<tr>
<td>Greater than 100</td>

<td>8000 are enough</td>
</tr>
</table></center>

<p>Q:<font color="#483d8b"> How many states-per-hmm should I specify for
my training?</font>
<p>A: If you have "difficult" speech (noisy/spontaneous/damaged), use 3-state
HMMs with a noskip topology. For clean speech you may choose any
odd number of states, depending on the amount of data you have and the
type of acoustic units you are training. If you are training word models,
for example, you might be better off using 5 states or more. 3-5 states
are good for shorter acoustic units like phones. You cannot currently train
1-state HMMs with the SPHINX.
<p>Remember that the topology is also related to the frame rate and the
minimum expected duration of your basic sound units. For example, the phoneme
"T" rarely lasts more than 10-15 ms. If your frame rate is 100 frames per
second, "T" will therefore be represented in no more than 3 frames. If
you use a 5-state noskip topology, this would force the recognizer to use
at least 5 frames to model the phone. Even a 7-state topology that permits
skips between alternate states would force the recognizer to visit at least
4 of these states, thereby requiring the phone to be at least 4 frames long.
Both would be erroneous. Give this point very serious thought before you
decide on your HMM topology. If you are not convinced, send us a mail and
we'll help you out.
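<p>Here is the duration arithmetic from the previous paragraph written out as
a tiny illustrative program (the frame rate and state count are just the
example values used above):
<pre>
#include &lt;stdio.h&gt;

int main(void)
{
    double frame_ms   = 1000.0 / 100.0;  /* 100 frames/sec => 10 ms per frame */
    int    min_states = 5;               /* 5-state noskip topology           */

    /* A noskip topology must visit every state at least once, so the
       shortest sound it can model is min_states frames long. */
    printf("minimum modellable duration: %.0f ms\n", min_states * frame_ms);
    return 0;   /* prints 50 ms -- too long for a 10-15 ms "T" */
}
</pre>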

<p>Q:<font color="#483d8b">I have two sets of models, A and B. The set A
has been trained with 10,000 tied states (or senones) and B has been
trained with 5,000 senones. If I want to compare the recognition results
on a third database using A and B, does this difference in the number of
senones matter?</font>
<p>
A. If A and B have been optimally trained (i.e. the amount of data
available for training each has been well considered), then the difference
in the number of tied states used should not matter.
<p>
<a href="#top">back to top
<hr>
<a NAME="3"></a>
<TABLE width="100%" bgcolor="#ffffff">
<td>TRAINING SPEED</td>
</table>

<p><br>Q: <font color="#483d8b">I am trying to train models on a single
machine. I just want to train a set of coarse models for forced alignment.
The Baum-Welch iterations are very slow. In 24 hours, they have only gone
through 800 utterances. I have 16,000 utterances in total. At this speed,
it will take 20 days for the first iteration of Baum-Welch; considering
the convergence ratio to be 0.01, it will take several months to obtain
the first CI-HMM, let alone CD-HMM. Is there any way to speed this up?</font>
<p>A: If you start from flat-initialized models, the first two iterations
of Baum-Welch will always be very slow. This is because all paths through
the utterance are similar and the algorithm has to consider all of them.
In the later iterations, when the various state distributions begin to
differ from each other, the computation speeds up a great deal.
<p>Given the observed speed of your machine, you cannot possibly hope to
train your models on a single machine. You may think of assigning a lower
value to the "topn" argument of the bw executable, but since you are training
CI models, changing the topn value from its default (99) to any smaller
number will not affect the speed, since there is only at best 1 Gaussian
per state anyway throughout the computation.
<p>Try to get more machines to share the jobs. There is a -npart option
to help you partition your training data. Alternatively, you can shorten
your training set, since you only want to use the models for forced alignment.
Models trained with about 10 hours of data will do the job just as well.
<h4>
<a href="#top">back to top</a></h4>
<hr WIDTH="100%">
<a NAME="4"></a>
<TABLE width="100%" bgcolor="#ffffff">
<td>POST-TRAINING DISASTERS</td>
</table>
<p>
Q: <font color="#483d8b">
I've trained with clean speech. However, when I try to decode
noisy speech with my models, the decoder just dies. Shouldn't it
give at least some junk hypothesis?
<p></font>
A. Adding
noise to the test data increases the mismatch between the models and the
test data. So if the models are not really well trained (and hence
not very generalizable to slightly different data), the decoder dies.
There are multiple reasons for this:
<ul>
<li> The decoder cannot find any valid complete paths during decoding. All paths that lead to a valid termination may get pruned out.
<li> The likelihood of the data may be so poor that the decoder goes
   into underflow. This happens if even *only one* of your models is
   very badly trained. The likelihood of this one model becomes very
   small and the resulting low likelihood gets inverted to a very large
   positive number because the decoder uses integer arithmetic, and this
   results in segmentation errors, arithmetic errors, etc.
</ul>
One way to solve this problem is just to retrain with
noisy data.

<p>
Q: <font color="#483d8b">
I've trained models but I am not able to decode. The decoder settings
seem to be ok. It just dies when I try to decode.</font>
<p>
A. If all the flag settings are fine, then the decoder is probably dying because
the acoustic models are bad. There are multiple reasons for this:
<br>a) All paths that lead to a valid termination may get pruned out.
<br>b) The likelihood of the data may be so poor that the decoder goes
   into underflow. This happens if even *only one* of your models is
   very badly trained. The likelihood of this one model becomes very
   small and the resulting low likelihood gets inverted to a very large
   positive number because the decoder uses integer arithmetic, and this
   results in segmentation errors, arithmetic errors, etc.
<p>
You'll probably have to retrain the models in a better way. Force-align
properly, make sure that all phones and triphones that you do train are well
represented in your training data, use more data for training if you can,
check your dictionaries and use correct pronunciations, etc.
<p>



Q: <font color="#483d8b">
I started from one set of models, and trained
further using another bunch of data. This data looked more like my test
data, and there was a fair amount of it. So my models should have improved.
When I use these models for recognition, however, the performance of the
system is awful. What went wrong?</font>
<p>A: The settings used to train your base models may have differed in one
or more ways from the settings you used while training with the new data.
The most dangerous setting mismatch is the agc setting (max/none). Check the
other settings too, and finally make sure that during decoding you use
the same agc (and other relevant settings like varnorm and cmn) as during
training.
<h4>
<a href="#top">back to top</a></h4>
<hr>

<a name="5"></a>
<table width="100%" bgcolor="#ffffff">
<td>QUESTIONS SPECIFIC TO LOG-FILE OUTPUTS</td></table>
Q. <font color="#483d8b">My decode log file gives the following message:
<p>
ERROR: "../feat.c", line 205: Header size field: -1466929664(a8906e00);
filesize: 1(00000001)
<br>================
<br>exited with status 0
</font>
<p>
A. The feature files are byte swapped!
<p>
<p>
Q. <font color="#483d8b">During force-alignment, the log file has many messages which say "Final state not reached" 
and the corresponding transcripts do not get force-aligned. What's wrong?
</font>
<p>
A. The message means that the utterance likelihood was very low, meaning in
turn that the sequence of words in your transcript for the corresponding
feature file given to the force-aligner is rather unlikely. The most common
reasons are that you may have the wrong model settings or the transcripts
being considered may be inaccurate. For more on this go to <a
href="#9">Viterbi-alignment</a> 
<p>

Q.<font color="#483d8b">
I am trying to do flat-initialization for training CI models. The cp_parm
program is complaining about the -feat option. The original script did
not specify a -feat option; however, the cp_parm program complained that the
default option was unimplemented. I've made
several attempts at specifying a -feat option with no luck. Below is the
output of two runs. Can you give me an idea of what is happening here?
<p>
Default (no -feat passed) produces:
<pre>
-feat     c[1..L-1]d[1..L-1]c[0]d[0]dd[0]dd[1..L-1]
c[1..L-1]d[1..L-1]c[0]d[0]dd[0]dd[1..L-1]
ERROR: "../feat.c", line 121: Unimplemented feature
c[1..L-1]d[1..L-1]c[0]d[0]dd[0]dd[1..L-1]
ERROR: "../feat.c", line 122: Implemented features are:
        c/1..L-1/,d/1..L-1/,c/0/d/0/dd/0/,dd/1..L-1/
        c/1..L-1/d/1..L-1/c/0/d/0/dd/0/dd/1..L-1/
        c/0..L-1/d/0..L-1/dd/0..L-1/
        c/0..L-1/d/0..L-1/
INFO: ../s3gau_io.c(128): Read
[path]/model_parameters/new_fe.ci_continuous_flatinitial
/globalmean [1x1x1 array]
gau 0 <= 0
gau 1 <= 0
gau 2 <= 0
</pre>
This is the error message if I attempt to specify the -feat option:
<pre>
 -feat     c[1..L-1]d[1..L-1]c[0]d[0]dd[0]dd[1..L-1]
....

ERROR: "../feat.c", line 121: Unimplemented feature
c[1..L-1]d[1..L-1]c[0]d[0]dd[0]dd[1..L-1]
ERROR: "../feat.c", line 122: Implemented features are:
        c/1..L-1/,d/1..L-1/,c/0/d/0/dd/0/,dd/1..L-1/
        c/1..L-1/d/1..L-1/c/0/d/0/dd/0/dd/1..L-1/
        c/0..L-1/d/0..L-1/dd/0..L-1/
        c/0..L-1/d/0..L-1/
</pre>
</font>
<p>
A. The last three lines in the case when you do not specify the -feat
option say that cp_parm is going through and the mean vector labelled
"0" is being copied to state 0, state 1, state 2....  The same "0" vector
is being copied because this is a flat initialization, where all means,
variances etc. are given equal, flat values. At this point, these errors in
the log files can just be ignored.
<p>
Q.<font color="#483d8b">
 I am trying to make linguistic questions for state tying. The program keeps
failing because it can't allocate enough memory. Our machines are rather
large with 512MB and 1 to 2 GB swap space. Does it make sense that it
really doesn't have enough memory, or is it more likely something else
failed? Below is the log from this program.
<pre>
 -varfn
[path]/model_parameters/new_fe.ci_continuous/variances \
 -mixwfn
[path]/model_parameters/new_fe.ci_continuous/mixture_weights \
 -npermute 168 \
 -niter 0 \
 -qstperstt 20 \
.....
.....
.....
INFO: ../s3gau_io.c(128): Read
/sphx_train/hub97/training/model_parameters/new_fe.ci_continuous/means
[153x1x1 array]
INFO: ../s3gau_io.c(128): Read
/sphx_train/hub97/training/model_parameters/new_fe.ci_continuous/variances
[153x1x1 array]
FATAL_ERROR: "../ckd_alloc.c", line 109: ckd_calloc_2d failed for caller at
../main.c(186) at ../ckd_alloc.c(110)
</pre>
</font>
<p>
A. make_quests searches 2^npermute combinations several times for the
optimal clustering of states. For this, it has to store 2^npermute
values (for the comparison). So, setting -npermute to anything greater
than 8 or 10 makes the program very slow, and anything over 28 will
make the program fail. We usually use a value of 8.
<p>
Q.<font color="#483d8b">
I'm getting a message about end of data beyond end of file from agg_seg
during vector-quantization. I assume this means the .ctl file references
a set of data beyond the end of the file. Should I ignore this?
</font>
<p>
A. Yes, for agg_seg, if it's going through in spite of the message. agg_seg
only collects samples of feature vectors to use for quantization through
kmeans. No, for the rest of the training, because it may cause random
problems. The entry in the control file and the corresponding transcript
have to be removed if you cannot correct them for some reason.
<p>
<h4><a href="#top">back to top</a></h4>
<hr>





<a name="6"></a>
<table width="100%" bgcolor="#ffffff">
<td>VECTOR-QUANTIZATION FOR DISCRETE AND SEMI-CONTINUOUS MODELS</td></table>
<p>
Q.<font color="#483d8b">
I have a question about VQ. When you look at the 39-dimensional [cep +
d-cep + dd-cep ] vector, it's clear that each part (cep, d-cep, dd-cep)
will have quite a different dynamic range and different mean.  How should
we account for this when doing DISCRETE HMM modeling? Should we make a
separate codebook for each? If so, how should we "recombine" when
recognizing? Or should we rescale the d-cep and dd-cep up so they can
"compete" with the "larger" cep numbers in contributing to the overall VQ?

<p>
In other words, suppose we want to train a complete discrete HMM system -
is there a way to incorporate the d-cep and dd-cep features into the
system to take advantage of their added information? If we just
concatenate them all into one long vector and do standard VQ, the d-cep
and dd-cep won't have much of an influence as to which VQ codebook entry
matches best an incoming vector. Perhaps we need to scale up the d-cep
and dd-cep features so they have the same dynamic range as the cep
features? Is there a general strategy that people have done in the past
to make this work? Or do we have to "bite the bullet" and move up to
semi-continuous HMM modeling?
</font>
<p>

A:
You *could* add d-cep and dd-cep with the cepstra into one long feature.
However, this is always inferior to modeling them as separate feature
streams (unless you use codebooks with many thousand codewords).
<p>
Secondly, for any cepstral vector, the dynamic range and value of
c[12], for example, is much smaller (by orders of magnitude) than that of c[1]
and doesn't affect the quantization at all. In fact, almost all the
quantization is done on the basis of the first few cepstra with the
largest dynamic ranges. This does not affect system performance in
a big way. One of the reasons is that the classification information
in the features that do not affect VQ much is also not too great.
<p>
However, if you do really want to be careful with dynamic ranges, you
could perform VQ using Mahalanobis distances instead of Euclidean
distances. In the Mahalanobis distance, each dimension is weighted by
the inverse of the standard deviation of that component of the data
vectors; e.g. c[12] would be weighted by (1/std_dev(c[12])). The
standard deviations could be computed either over the entire data
set (based on the global variance) or on a per-cluster basis (you
use the standard deviation of each of the clusters you obtain during
VQ to weight the distance from the mean of that cluster). Each of these
two has a slightly different philosophy, and could result in slightly
different results.
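<p>A minimal sketch of such a weighted distance (assuming you already have
per-dimension standard deviations, either global or per-cluster) might look
like this; it is only meant to illustrate the weighting, not the SPHINX VQ
code itself:
<pre>
/* Squared Mahalanobis-style distance with diagonal weighting:
   each dimension is scaled by the inverse of its standard deviation. */
double weighted_sq_dist(const float *x, const float *mean,
                        const float *stddev, int dim)
{
    double d = 0.0;
    for (int i = 0; i &lt; dim; i++) {
        double z = (x[i] - mean[i]) / stddev[i];
        d += z * z;
    }
    return d;
}
</pre>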
<p>
A third thing you could do is to compute a Gaussian mixture with your
data, and classify each data vector (or extended data vector, if you
prefer to combine cepstra/dcep/ddcep into a single vector) as belonging
to one of your gaussians. You then use the mean of that Gaussian as the
codeword representing that vector. Dynamic ranges of data will not be an
issue at all in this case.
<p>
Note: In the sphinx, for semi-continuous modeling, a separate codebook is
made for each of the four feature streams: 12c,24d,3energy,12dd. Throughout
the training, the four streams are handled independently of each other and
so in the end we have four sets of mixture weights corresponding to each
senone or hmm state. The sphinx does not do discrete modeling directly.
<p>
Q<font color="#483d8b">.
For vector-quantization, should the control file entries correspond
exactly to the transcript file entries?
</font>
<p>
A. For the vq, the order of files in the ctl need not match the order of
transcripts. However, for the rest of the training, the way our system
binaries are configured, there has to be an exact match. The vq does not
look at the transcript file. It just groups data vectors (which are
considered without reference to the transcripts).
<p>
Q<font color="#483d8b">.
What is the difference between the -stride flag in agg-seg and kmeans-init?
</font>
<p>
A. -stride in agg-seg samples the feature vectors at every stride-th interval;
the sampled vectors are then used for VQ. In the kmeans-init program its
function is the same, but this time it operates on the vectors
already accumulated by agg-seg, so we usually set it to 1.
<p>
Q<font color="#483d8b">.
Regarding the size of the VQ codebook: is there anything to say that
a size of 256 is optimal? Would increasing the size affect the speed of
decoding? </font>
<p>
A. For more diverse acoustic environments, having a larger codebook size
would result in better models and better recognition. We have been using
256 codewords primarily for use with the SPHINX-II decoder, since for
historical reasons it does not handle larger codebook sizes.  The original
SPHINX-II used a single-byte integer to index the codewords. The largest
number possible was therefore 256. The format conversion code which
converts models from SPHINX-III format to SPHINX-II format accordingly
requires that your models be trained with a codebook size of 256.
<p>
 The standard Sphinx-III decoder, however, can handle larger codebooks.
Increasing the codebook size would slow down decoding, since
the number of mixture weights would be higher for each HMM state.
<p>
Q<font color="#483d8b">.
I am trying to do VQ. It just doesn't go through. What could be wrong?
</font>
<p>
A. It's hard to say without looking at the log files. If a log file is
  not being generated, check for machine/path problems. If it is being
  generated, here are the common causes you can check for:
<ol>
<li> byte-swap of the feature files
<li> negative file lengths
<li> bad vectors in the feature files, such as those computed from headers
<li> the presence of very short files (a vector or two long)
</ol>
<h4><a href="#top">back to top</a></h4>
<hr>

<a NAME="7"></a>
<TABLE width="100%" bgcolor="#ffffff">
<td>UPDATING EXISTING MODELS</td>
</table>

<p>
Q<font color="#483d8b">.I have 16 gaussian/state continuous models, which
took a lot of time to train. Now I have some more data and would like to
update the models. Should I train all over again starting with the tied
mdef file (the trees)?
</font>
<p>
A. Training from the trees up to 16 or 32 Gaussians per state takes a lot of
time. If you have more data from the same domain or thereabouts, and just
want to update your acoustic models, then you are probably better off
starting with the current 16 or 32 Gaussians/state models and running a few
iterations of Baum-Welch from there on with *all* the data you have. While
there would probably be some improvement if you started from the trees, I
don't think it would be very different from iterating from the current
models.
<p>
You *would* get better models if you actually built the trees all over
again using all the data you have (since they would now consider more
triphones), but that would take a long time.
<p>

Q<font color="#483d8b">.I have a set of models A, which have a few filler
phones. I want to use additional data from another corpus to adapt the
model set A to get a new adapted model set B. However, the corpus for B has
many other filler phones which are not the same as the filler models in set
A. What do I do to be able to adapt?</font>
<p>
A. Edit the filler dictionary and insert the fillers you want to train.
Map each filler in B to a filler phone (or a sequence of phones) in model
set A. For example:
<pre>
++UM++       AX M
++CLICK++    +SMACK+
++POP++      +SMACK+
++HMM++      HH M
++BREATH++   +INHALE+
++RUSTLE++   +INHALE+
</pre>

On the LHS, list the fillers in B. On the RHS, put in the corresponding
fillers (or phones) in A. In this case, it will be a many-to-one mapping
from B to A.
<p>
To force-align, add the above filler transcriptions to the *main* dictionary
used to force-align.
<h4><a href="#top">back to top</a></h4>
<hr>

<a NAME="8"></a>
<TABLE width="100%" bgcolor="#ffffff">
<td>UTTERANCE, WORD AND PHONE SEGMENTATIONS</td>
</table>
<p>
Q<font color="#483d8b">.
How do I use the sphinx-3 decoder to get phone segmentations?
</font>
<p>
A. The decoder works at the sentence level and outputs word-level
segmentations. If your "words" are phones, you have a phone decoder and you
can use the -matchsegfn flag to write the phone segmentations into a
file. If your words are not phones (and are proper words), then write out
matchseg files (using the -matchsegfn option rather than the
-matchfn option), pull out all the words from the output matchseg
files *including all noises and silences*, and then run a force-alignment
on the corresponding pronunciation transcript to get the phone
segmentation.  You will have to remove the &lt;s&gt;, &lt;sil&gt; and &lt;/s&gt; markers
before you force-align though, since the aligner introduces them perforce.
<p>

Q<font color="#483d8b">.
How do I obtain word segmentations
corresponding to my transcripts?</font>
<p>
A. You can use the SPHINX decoder to obtain phone or word level segmentations.
   Replacing the flag -matchfn &lt;matchfile&gt; with -matchsegfn &lt;matchfile&gt;
   in your decode options will generate the hypotheses along with
   word segmentations in the matchfile. You can run a phone-level decode
   in a similar way to obtain phone segmentations.

<p>
Q<font color="#483d8b">. The recordings in my training corpus are very
long (about 30 minutes each or more). Is there an easy way to break them up
into smaller utterances?</font>
<p>
A. One easy way to segment is to build a language model from the transcripts
of the utterances you are trying to segment, and decode over 50 sec. sliding
windows to obtain the
word boundaries. Following this, the utterances can be segmented
(say) at approx. 30 sec. slots. Silence or breath markers are good breaking
points. 
<p>
There are other, better ways to segment, but
they are meant to do a good job in situations where you do not have
the transcripts for your recordings (eg. for speech that you are about
to decode). They will certainly be applicable in situations where you
do have transcripts, but aligning your transcripts to the segments would
involve some extra work.
<p>
<h4><a href="#top">back to top</a></h4>
<hr>

<a NAME="9"></a>
<TABLE width="100%" bgcolor="#ffffff">
<td>FORCE-ALIGNMENT (VITERBI ALIGNMENT)</td>
</table>
<p>
Q<font color="#483d8b">.
Will the forced-aligner care if I leave the (correct)
alternate pronunciation markers in the transcript? Or do I need to
 remove them?
</font>
<p>
A. The force-aligner strips off the alternate pronunciation markers and
re-chooses the correct pronunciation from the dictionary.
<p>

Q<font color="#483d8b">.Some utterances in my corpus just don't get
force-aligned. The aligner dies on them and produces no output. What's
wrong?</font>
<p>
A. Firstly, let's note that "force-alignment" is CMU-specific jargon. The
force-aligner usually dies on some 1% of the files. If the models are good,
it dies in fewer cases.  Force-alignment fails for various reasons - you
may have spurious phones in your dictionary or may not have any dictionary 
entry for one or more words in the transcript, the
models you are using may have been trained on acoustic conditions which do
not match the conditions in the corpus you are trying to align, you may
have trained initial models with transcripts which are not force-aligned
(this is a standard practice) and for some reason one or more of the models
may have zero parameter values, you may have bad transcriptions or may be
giving the wrong transcript for your feature files, there may be too much
noise in the current corpus, etc. The aligner does not check whether your
list of feature files and the transcript file entries are in the same
order. Make sure that you have them in order, where there is a one-to-one
correspondence between the two files. If these files are not aligned, the
aligner will not align most utterances. The ones that do get aligned will
be aligned out of sheer luck, and the alignments will be wrong.
<p>
There may be another reason for alignment failure: if you are
force-aligning using a phoneset which is a subset of the phones for which
you have context-dependent models (such that the dictionary which was used
to train your models has been mapped onto a dictionary with fewer
phones), then for certain acoustic realizations of your phones, the
context-dependent models may not be present. This causes the aligner to
back off to context-independent (CI) models, giving poor likelihoods. When
the likelihoods are too poor, the alignment fails.  Here's a possible
complication: sometimes in this situation, the backoff to CI models does
not work well (for various reasons which we will not discuss here). If you
find that too many of your utterances are not getting force-aligned and
suspect that this may be due to the fact that you are using a subset of the
phone-set in the models used for alignment, then an easy solution is to
temporarily restore the full phoneset in your dictionary for
force-alignment, and once it is done, revert to the smaller set for
training, without changing the order of the dictionary entries.
<p>
After Viterbi alignment, if you are still left with enough transcripts
to train, then it is a good idea to go ahead and train your new models.
The new models can be used to redo the force-alignment, and this
would result in many more utterances getting successfully aligned. You can,
of course, iterate the process of training and force-alignment if 
getting most of the utterances to train is important to you. Note that
force-alignment is not necessary if a recognizer uses phone networks
for training. However, having an explicit aligner has many uses and offers
a lot of flexibility in many situations.
<p>
Q<font color="#483d8b">.
I have a script for force-alignment with continuous models. I want to
force-align with some semi-continuous models that I have. What needs to
change in my script?
</font>
<p>
A. In the script for force-alignment, apart from the
paths and model file names, the model type has to be changed from ".cont"
to ".semi" and the feature type has to be changed to "s2_4x", if you have
4-stream semi-continuous models.
<p>
Q<font color="#483d8b">.
I'm using the sphinx-2 force-aligner to do some aligning. It basically works,
but it seems way too happy about inserting a SIL phone between words (when
there clearly isn't any silence).  I've tried to compensate for this by
playing with -silpen, but it didn't help. Why does the aligner insert so
many spurious silences?
</font>
<p>
A. The problem may be due to many factors.
Here's a checklist that might help you track down the problem:
<ol>
<li> Is there an agc mismatch between your models and your force-aligner
settings? If you have trained your models with agc "max" then you must
not set agc to "none" during force-alignment (and vice-versa).
<li>Listen to the words which are wrongly followed by the SIL phone after
  force-alignment. If such a word clearly does not have any silence following
  it in the utterance, then check the pronunciation of the word in
  your dictionary. If the pronunciation is not really correct (for
example, if you have a flapped "R " in place of a retroflexed "R " or a "Z "
in place
  of an "S ", quite likely to happen if the accent is non-native), the
  aligner is likely to make an error and insert a silence or noise word
  in the vicinity of that word.
<li> Are your features being computed exactly the same way as the features
  that were used to train the acoustic models that you are using to
  force-align? Your parametrization can go wrong even if you are using
  the *same* executable to compute features now as you used for training
the models. If, for example, your training features were computed at the
standard analysis rate of 100 frame/sec with 16khz, 16bit sampling,
  and if you are now assuming either an 8khz sampling rate or 8 bit data in
your code, you'll get twice as many frames as you should for any given
utterance.  With features computed at this rate, the force-aligner will
just get silence-happy.  
<li> Are the acoustic conditions and *speech*
bandwidth of the data you are force-aligning the same as those for which
you have acoustic models? For example, if you are trying to force-align the
data recorded directly off your TV with models built with telephone data,
then even if your sampling rate is the same in both cases, the alignment
will not be good.  
<li> Are your beams too narrow? Beams should typically
be of the order
   of 1e-40 to 1e-80. You might mistakenly have them set at a much higher
value (which means much *narrower* beams).
</ol>
<p>

<h4><a href="#top">back to top</a></h4>
<hr>
<a NAME="10"></a>
<TABLE width="100%" bgcolor="#ffffff">
<td>BAUM-WELCH ITERATIONS AND ASSOCIATED LIKELIHOODS</td>
</table>
<p>
Q<font color="#483d8b">.
How many iterations of Baum-Welch should I run for CI/CD-untied/CD-tied training?</font>
<p>
A. 6-10 iterations are good enough for each. It is better to check the ratio
of total likelihoods from the previous iteration to the current one to
decide if a desired convergence ratio has been achieved. The scripts provided
with the SPHINX package keep track of these ratios to automatically decide
how many iterations to run, based on a "desired" convergence ratio that
you must provide. If you run too many iterations, the models get overfitted to
the training data. You must decide if you want this to happen or not.
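<p>As an illustration (not the exact formula used by the training scripts), a
convergence test based on total log-likelihoods could look like this:
<pre>
#include &lt;math.h&gt;

/* Returns 1 when the relative improvement in total log-likelihood
   between two Baum-Welch iterations falls below the desired ratio
   (e.g. 0.01), i.e. when training can be stopped. */
int has_converged(double prev_loglik, double cur_loglik, double ratio)
{
    double improvement = (cur_loglik - prev_loglik) / fabs(prev_loglik);
    return improvement &lt; ratio;
}
</pre>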
<p>
Q<font color="#483d8b">.
The training data likelihoods at the end of my current iteration
of Baum-Welch training are identical to the likelihoods at the end of
the previous iteration. What's wrong and why are they not changing?
</font>
<p>
A. The most likely reason is that for some reason the acoustic models
did not get updated on your disk at the end of the previous iteration.
When you begin with  the same acoustic models again and again, the likelihoods
end up being the same every time.
<p>
Q<font color="#483d8b">.
 The total likelihood at the end of my current Baum-Welch iteration is
actually lower than the likelihood at the end of the previous iteration.
Should this happen?</font>
<p>

A. Theoretically, the likelihoods must increase monotonically. However,
this condition holds only when the training data size is constant. In every
iteration (especially if your data comes from difficult acoustic
conditions), the Baum-Welch algorithm may fail in the backward pass on some
random subset of the utterances. Since the effective training data size is
no longer constant, the likelihoods may actually decrease at the end of the
current iteration, compared to the previous likelihoods.
However, this should not happen very often. If it does, then you
might have to check out your transcripts and if they are fine, you might
have to change your training strategy in some appropriate manner.
<p>



Q<font color="#483d8b">.  In my training, as the forward-backward
(Baum-Welch) iterations progress, there are more and more error messages in
the log file saying that the backward pass failed on the given
utterance. This should not happen since the algorithm guarantees that the
models get better with every iteration. What's wrong?
</font>
<p>
A. As the models get better, the "bad" utterances are better identified
through their very low likelihoods, and the backward pass fails on them.
The data may be bad due to many reasons, the most common one being
noise. The solution is to train coarser models, or train fewer triphones by
setting the "maxdesired" flag to a lower number (of triphones) when making
the untied mdef file, which lists the triphones you want to train.  If this
is happening during CI training, check your transcripts to see if the
within-utterance silences and non-speech sounds are transcribed in
appropriate places, and if your transcriptions are correct. Also check if
your data has difficult acoustic conditions, as in noisy recordings with
non-stationary noise.  If all is well and the data is very noisy and you
can't do anything about it, then reduce the number of states in your HMMs
to 3 and train models with a noskip topology. If the utterances still die,
you'll just have to live with it. Note that as more and more utterances
die, more and more states in your mdef file are "not seen" during training.
The log files will therefore have more and more messages to this effect.
<p>
Q<font color="#483d8b">.
My baum-welch training is really slow! Is there something I can do to speed it
up, apart from getting a faster processor?
</font>
<p>
A. In the first iteration, the models begin from flat distributions, and
so the first iteration is usually very very slow. As the models get better
in subsequent iterations, the training speeds up. There are other reasons 
why the iterations could be slow: the transcripts may not be
force-aligned or the data may be noisy. For the same amount of training
data, clean speech training gets done much faster than noisy speech training.
The noisier the speech, the slower the training. If you have not
force-aligned, the solution is to train CI models, force-align and retrain.
If the data are noisy, try reducing the number of HMM states and/or
not allowing skipped states in the HMM topology. Force-alignment also
filters out bad transcripts and very noisy utterances.
<p>
Q<font color="#483d8b">.
The first iteration of Baum-Welch through my data has an error:
<pre>
 INFO: ../main.c(757): Normalizing var
 ERROR: "../gauden.c", line 1389: var (mgau=0, feat=2, density=176,
component=1) < 0
</pre>
 Is this critical?
</font>
<p>
A. This happens because we use the following formula to estimate
variances:
<p>
    variance = avg(x<sup>2</sup>) - [avg(x)]<sup>2</sup>
<p>
There are a few weighting terms included (the baum-welch "gamma" weights), but
they are immaterial to this discussion.

The *correct* way to estimate variances is
<p>
    variance = avg[(x - avg(x))<sup>2</sup>]
<p>
The two formulae are equivalent, of course, but the first one is
far more sensitive to arithmetic precision errors in the
computer and can result in negative variances. The second formula is
too expensive to compute (we need one pass through the data to compute
avg(x), and another to compute the variance). So we use the first one in
SPHINX, and we therefore sometimes get errors of the kind shown above.
<p>
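The following sketch (illustrative only, not SPHINX code) contrasts the two
formulae in single precision. With data consisting of large, nearly identical
values (the sample values below are made up), the one-pass formula can return
a slightly negative variance while the two-pass formula stays non-negative.
<pre>
#include &lt;stdio.h&gt;

int main(void)
{
    /* a small clump of large, nearly identical values (made-up example) */
    float x[4] = {10000.01f, 10000.02f, 10000.03f, 10000.04f};
    int   n = 4, i;

    /* one-pass formula used by the trainer: var = avg(x^2) - [avg(x)]^2 */
    float sum = 0.0f, sumsq = 0.0f;
    for (i = 0; i != n; i++) {
        sum   += x[i];
        sumsq += x[i] * x[i];
    }
    float mean = sum / n;
    float var1 = sumsq / n - mean * mean;   /* can come out negative */

    /* two-pass formula: var = avg((x - avg(x))^2), needs a second pass */
    float var2 = 0.0f;
    for (i = 0; i != n; i++)
        var2 += (x[i] - mean) * (x[i] - mean);
    var2 /= n;

    printf("one-pass var = %e, two-pass var = %e\n", var1, var2);
    return 0;
}
</pre>
<p>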
The error is not critical (things will continue to work), but may be
indicative of other problems, such as bad initialization, or isolated
clumps of data with almost identical values (i.e. bad data).
<p>

Another thing that usually points to bad initialization is that you may
have mixture-weight counts that are exactly zero (in the case of
semi-continuous models) or the gaussians may have zero means and variances
(in the case of continuous models) after the first iteration.
<p>
If you are computing semi-continuous models, check to make sure the initial
means and variances are OK. Also check to see if all the cepstra files are
being read properly.
<p>
<h4><a href="#top">back to top</a></h4>
<hr>


<a NAME="11"></a>
<TABLE width="100%" bgcolor="#ffffff">
<td>DICTIONARIES, PRONUNCIATIONS AND PHONE-SETS</td>
</table>
<p>
Q<font color="#483d8b">.I've been using a script from someone that removes the stress markers
in cmudict as well as removes the deleted stops. This script is removing the
(2) or (3) markers that occur after multiple pronunciations of the same
word. That is,
<pre>
 A  EY
 A  AX
</pre>
is produced instead of
<pre>
A    EY
A(2) AX
</pre>
What is the consequence of removing this multiple pronunciation marker? Will
things still work?
</font>
<p>
A.
The (2), (3) etc. are important for the training. It is the only way
the trainer knows which pronunciation of the word has been used in
the utterance, and that is what the force-aligner decides for the
rest of the training. So, once the force-alignment is done, the
rest of the training has to go through with the same dictionary,
and neither the pronunciations nor the pronunciation markers should
change.
<p>
Independently of this, the script that you are using should be renumbering
the dictionary pronunciations in the manner required by the trainer in
order for you to use it for training and decoding.  Pronunciation markers
are required both during training and during decoding.

<p>
Q<font color="#483d8b">.I have trained a set of models, and one of the phones I have trained
models for is "TS" (as in CATS = K AE TS). Now I want to remove the phone
TS from the dictionary and do not want to retain its models. What are the
issues involved?
</font>
<p>
A. You can change every instance of the phone "TS" in your decode
   dictionary to "T S". In that case, you need not explicitly remove the
models for TS from your model set. Those models will not be considered
during decoding. However, if you just remove TS from the decode dictionary
and use the models that you have, many of the new triphones involving T and
S would not have corresponding models (since they were not there during
training). This will adversely affect recognition performance. You can
compose models for these new triphones from the existing set of models by
making a new tied-mdef file with the new decode dictionary that you want to
use. This is still not as good as training explicitly for those triphones,
but is better than not having the triphones at all.  The ideal thing to do
would be to train models without "TS" in the training dictionary as well,
because replacing TS with T S will create new triphones. Data will get
redistributed and this will affect the decision trees for all phones,
especially T and S. When decision trees get affected, state tying gets
affected, and so the models for all phones turn out to be slightly
different.  

<p>
Q<font color="#483d8b">. What is a filler dictionary? What is its format?
</font>
<p>

A. A filler dictionary is like any dictionary, with a word and its
pronunciation listed on a line. The only difference is that the word is
what *you* choose to call a non-speech event, and its pronunciation is
given using whatever filler phones you have models for (or are building
models for). So if you have models for the
phone +BREATH+, then you can compose a filler dictionary to look
like
<p>
++BREATHING++   +BREATH+
<p>
or
<p>
BREATH_SOUND    +BREATH+
<p>
or...
<p>
The left hand entry can be anything (we usually just write the
phone with two plus signs on either side - but that's only a
convention).
<p>
Here's an example of what a typical filler dictionary looks like:

<pre>
++BREATH++                     +BREATH+
++CLICKS++                     +CLICKS+
++COUGH++                      +COUGH+
++LAUGH++                      +LAUGH+
++SMACK++                      +SMACK+
++UH++                         +UH+
++UHUH++                       +UHUH+
++UM++                         +UM+
++FEED++                       +FEED+
++NOISE++                      +NOISE+
</pre>

When using this with SPHINX-III, just make sure that there are no extra
spaces after the second-column word, and no extra empty lines at the end of
the dictionary.

<h4><a href="#top">back to top</a></h4>
<hr>

<a NAME="12"></a>
<TABLE width="100%" bgcolor="#ffffff">
<td>DECISION-TREE BUILDING AND PARAMETER SHARING</td>
</table>
<p>
Q<font color="#483d8b">.
In HTK,
after we do decision-tree-driven state-clustering, we run a "model
compression" step, whereby any triphones which now (after clustering) point
to the same sequence of states are mapped, so that they are effectively the
same physical model.  This would seem to have the benefit of reducing the
recognition lattice size (although we've never verified that HVite actually
does this.)  Do you know if Sphinx 3.2 also has this feature?
</font>
<p>
A. The Sphinx does not need to do any compression because it does not
physically duplicate any distributions. All state-tying is done through
a mapping table (the mdef file), which points each state to the appropriate
distributions.
<p>

Q<font color="#483d8b">.
The log file for bldtree gives the following error:
<pre>
INFO: ../main.c(261): 207 of 207 models have observation count greater than
0.000010
FATAL_ERROR: "../main.c", line 276: Fewer state weights than states
</pre>
</font>
<p>
A. The -stwt flag has fewer arguments than the number of HMM states that
   you are modeling in the current training. The -stwt flag needs a string
of numbers equal to the number of HMM states; for example, if you were
using 5-state HMMs, the flag could be given as "-stwt 1.0 0.3 0.1 0.01
0.001". Each of these numbers specifies the weight to be given to state
distributions during tree building, beginning with the *current*
state. The second number specifies the weight to be given to the states
*immediately adjacent* to the current state (if there are any), 
the third number specifies the weight to be given to adjacent states 
*one removed* from the immediately adjacent one (if there are any), 
and so on.
<p>
<h4><a href="#top">back to top</a></h4>
<hr>

<a NAME="13"></a>
<TABLE width="100%" bgcolor="#ffffff">
<td>FEATURE COMPUTATION</td>
</table>
<p>
Q<font color="#483d8b">.
How appropriate are the standard frame specifications for feature
computation? I am using the default values but the features look a bit
"shifted" with respect to the speech waveform. Is this a bug?
</font>
<p>
A. There are two factors here: the frame *size* and the frame *rate*.
Analysis frame size is typically 25 ms. Frame rate is 100 frames/sec.  In
other words, we get one frame every 10 ms (a nice round number), but we
may need to adjust boundaries a little bit because of the frame size (a 5ms
event can get smeared over three frames - it could occur in the tail end of
one frame, the middle of the next one, and the beginning of the third, for
the 10ms frame shifts). The feature vectors sometimes look shifted with
respect to the speech samples. However, there is no shift between the
frames and the speech data. Any apparent shift is due to smearing. We do
frequently get an additional frame at the end of the utterance because we
pad zeros, if necessary, after the final samples in order to fill up the
final frame.
<p>
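As an aside, the bookkeeping behind that extra padded frame can be sketched as
follows. This is only an illustration of one common convention (25 ms windows,
10 ms shifts, zero-padding of the final partial frame), not the actual
front-end code, and the utterance length below is made up.
<pre>
#include &lt;stdio.h&gt;

int main(void)
{
    int srate  = 16000;
    int nsamp  = 48321;             /* made-up utterance length in samples */
    int flen   = srate / 40;        /* 25 ms window -> 400 samples */
    int fshift = srate / 100;       /* 10 ms shift  -> 160 samples */

    /* one frame per shift; the final partial frame is kept and zero-padded */
    int nframes = (nsamp > flen) ? 1 + (nsamp - flen + fshift - 1) / fshift : 1;

    printf("%d samples -> %d frames (window %d, shift %d)\n",
           nsamp, nframes, flen, fshift);
    return 0;
}
</pre>
<p>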
Q<font color="#483d8b">.
How do I find the center frequencies of the Mel filters?
</font>
<p>
A. The mel function we use to find the mel frequency for any frequency x is 
<p>
(2595.0*(float32)log10(1.0+x/700.0))
<p>
Substitute x with the upper and lower frequencies of your analysis band,
subtract the two results, and divide by the number of filters you have + 1.
Twice the number you get after division is the bandwidth of each filter on
the mel axis. The number you get after division, added to the lower
frequency (in mel), is the center frequency of the first filter. The rest
of the center frequencies can be found from the fact that the filters are
equally spaced on the mel frequency axis and overlap by half their
bandwidth. These center frequencies can be
transformed back to normal frequency using the inverse mel function
<p>
(700.0*((float32)pow(10.0,x/2595.0) - 1.0))
<p>
where x is now the center frequency.
<p>
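Here is a small sketch of this recipe (not the actual front-end code; the band
edges and filter count below are example values only, not necessarily the
SPHINX defaults):
<pre>
#include &lt;stdio.h&gt;
#include &lt;math.h&gt;

static double mel(double f)     { return 2595.0 * log10(1.0 + f / 700.0); }
static double mel_inv(double m) { return 700.0 * (pow(10.0, m / 2595.0) - 1.0); }

int main(void)
{
    double lower = 130.0, upper = 6800.0;   /* example band edges in Hz */
    int    nfilt = 40, i;                   /* example number of filters */

    /* spacing on the mel axis; each filter's bandwidth is twice this */
    double step = (mel(upper) - mel(lower)) / (nfilt + 1);

    for (i = 1; i != nfilt + 1; i++)
        printf("filter %2d: center frequency = %8.2f Hz\n",
               i, mel_inv(mel(lower) + i * step));
    return 0;
}
</pre>
<p>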
Q<font color="#483d8b">.
Does the front-end executable compute difference features?
</font>
<p>
A. No. The difference features are computed during runtime by the  SPHINX-III
trainer and decoders.
<p>
Q<font color="#483d8b">.
What would be the consequence of widening the analysis windows beyond 25ms
for feature computation?
</font>
<p>
A. Analysis windows are currently 25 ms wide, with 10 ms shifts. Widening
them would have many undesirable consequences:
<ul>
<li> spectra will get more smoothed, and so you'll lose information for
   recognition
<li> smaller phones would get completely obliterated
<li> deltas will no longer be so informative (remember that in the dropping-
   every-other-frame experiment they are computed before dropping), as
   the time lags considered for their computation will be larger
<li> We engineer the system so that HMM states can capture steady-state
   regions of speech, and speech is steady only over sections of about
   25 ms. A longer frame captures multiple acoustic events within the same
   frame and represents none of them accurately, so the states no longer
   capture the kind of classification information we want them to, and the
   resulting models recognize poorly.
</ul>
<p>
Q<font color="#483d8b">.
I ran the new wave2feat program with
srate = 16000 and nfft = 256, and the program
crashed. When I changed nfft to 512, it worked. Why is that?
</font>
<p>
A. At a sampling rate of 16000 samples/sec, a 25ms frame has 400 samples.
If you try to fill these into 256 locations of allocated memory (in
the FFT) you will have a segmentation fault. There *could* have been
a check for this in the FFT code, but the default for 16kHz has been
set correctly to be 512, so this was considered unnecessary.
<p>
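A small sketch of the arithmetic (the sampling rate and window length are the
ones from the question):
<pre>
#include &lt;stdio.h&gt;

int main(void)
{
    int    srate = 16000;                       /* samples per second */
    double wlen  = 0.025;                       /* 25 ms analysis window */
    int    nsamp = (int)(srate * wlen + 0.5);   /* 400 samples per frame */

    int nfft = 1;                               /* smallest power of two */
    while (nsamp > nfft)                        /* that holds a full frame */
        nfft *= 2;

    printf("frame = %d samples, minimum FFT size = %d\n", nsamp, nfft);
    return 0;
}
</pre>
<p>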

<p><h4><a href="#top">back to top</a></h4>
<hr>
<a NAME="15"></a>
<TABLE width="100%" bgcolor="#ffffff">
<td>MODELING FILLED PAUSES AND NON-SPEECH EVENTS</td>
</table>
<p>
Q<font color="#483d8b">.
Can you explain the difference between putting the words as fillers ++()++ 
instead of just putting them in the normal dictionary?  My dictionary
currently contains pronunciations for UH-HUH, UH-HUH(2) and UH-HUH(3).
Should all of these effectively be merged to ++UH-HUH++ and mapped to
a single filler phone like +UH-HUH+?
</font>
<p>
A. Putting them as normal words in the dictionary should not matter if
you are training CI models. However, at the CD stage when the list of
training triphones is constructed, the phones corresponding to the (++  ++)
entries are mapped by the trainer to silence. For example the triphone
constructed from the utterance
<p> ++UM++ A ++AH++
<p>
would be AX(SIL,SIL)
and not AX(+UM+,AA) [if you  have mapped ++UM++ to +UM+ and ++AH++ to the
phone AA for training, in one of the training dictionaries]
<p>
Also, when you put ++()++ in the main dictionary and map it to some sequence
of phones other than a single +()+ phone, you cannot build a model for
the filler. For example UH-HUH may be mapped to AH HH AX , AX HH AX etc
in the main dict, and when you train, the instances of UH-HUH just
contribute to the models for AH, AX or HH and the corresponding triphones.
On the other hand, if you map ++UH-HUH++ to +UH-HUH+, you can have the
instances contribute exclusively to the phone +UH-HUH+. The decision
to keep the filler as a normal word in the training dictionary and
assign alternate pronunciations to it OR to model it exclusively by
a filler phone must be judiciously made keeping the requirements of
your task in mind.
<p>
During decoding and in the language model, the filler words ++()++ are
treated very differently from the other words. The scores associated with
them are computed in a different manner, taking certain additional
insertion penalties into account.
<p>
Also, the SPHINX-II decoder is incapable of using a new filler unless there
is an exclusive model for it (this is not the case with the SPHINX-III
decoder). If there isn't, it will treat the filler as a normal dictionary
word and will ignore it completely if it is not there in the language model
(which usually doesn't have fillers), causing a significant loss in
accuracy for some tasks.

<p>
Q<font color="#483d8b">.
My training data contains no filler words (lipsmack, cough etc.) Do
you think I should retrain trying to insert fillers during forced alignment
so that I could train on them? Since what I have is spontaneous speech, I
can't imagine that in all 20000 utterances there are no filled pauses etc.
</font>
<p>
A. Don't use falign to insert those fillers. The forced aligner has a
tendency to arbitrarily introduce fillers all over the place. My guess is
that you will lose about 5%-10% relative by not having the fillers to
model. If you are going to use the SPHINX-III decoder, however, you can
compose some important fillers like "UH" and "UM" as "AX" or "AX HH" or "AX
M" and use them in the fillerdict.  However, the SPHINX-II decoder cannot
handle this. If possible, try listening to some utterances and see if
you can insert about 50 samples of each filler - that should be enough to
train them crudely.
<p>
Q<font color="#483d8b">.
How is SIL different from the other fillers? Is there any special reason why
I should designate the filler phones as +()+? What if I *want* to make
filler triphones?</font>
<p>
A. Silence is special in that it forms contexts for
triphones, but doesn't have its own triphones (i.e., triphones for which
it is the central phone). The fillers neither form contexts
nor occur as independent triphones. If you want to build triphones 
for a filler, then the filler must be designated as a proper phone 
without the "+" in the dictionaries.
<p>
Q<font color="#483d8b">.
What is the meaning of the two columns in the fillerdict? I want to reduce the
number of fillers in my training.
</font>
<p>
A. In a filler dictionary, we map all non-speech sounds to some
phones, and we then train models for those phones.
For example, we may say
<pre>
++GUNSHOT++     +GUNSHOT+
</pre>
The meaning is the same as "the pronunciation of the word ++GUNSHOT++ in
the transcripts must be interpreted to be +GUNSHOT+".
Now if I have five more filler words in my transcripts:
<pre>
++FALLINGWATER++
++LAUGH++
++BANG++
++BOMBING++
++RIFLESHOT++
</pre>
Then I know that the sounds of ++BANG++, ++BOMBING++ and ++RIFLESHOT++
are somewhat similar, so I can reduce the number of filler phones
to be modelled by modifying the entries in the filler dict
to look like
<pre>
++GUNSHOT++     +GUNSHOT+
++BANG++        +GUNSHOT+
++BOMBING++     +GUNSHOT+
++RIFLESHOT++   +GUNSHOT+
++FALLINGWATER++        +WATERSOUND+
++LAUGH++       +LAUGHSOUND+
</pre>
so we have to build models only for the phones +GUNSHOT+, +WATERSOUND+ and
+LAUGHSOUND+ now.
<p>
<hr>
<a NAME="16"></a>
<TABLE width="100%" bgcolor="#ffffff">
<td>WHY IS MY RECOGNITION ACCURACY POOR?</td>
</table>
<p>
Q<font color="#483d8b">.
I am using acoustic models that were provided with the SPHINX package on
opensource. The models seem to be really bad. Why is my recognition accuracy so poor?</font>
<p>

A. The reason why you are getting poor recognition with the current models
is that they are not trained with data from your recording setup. While
they have been trained with a large amount of data, the acoustic conditions
specific to your recording setup may not have been encountered during
training, and so the models may not generalize to your
recordings. Training under matched conditions makes a bigger difference
to recognition performance than noise does. There may be other factors, such
as feature set or agc mismatch. Check to see if you are indeed using all
the models provided for decoding. For noisy data, it is important to enter
all the relevant noise models (filler models) provided in the noise
dictionary that is being used during decoding.
<p>
To improve the performance, the models must be adapted to the kind of data
you are trying to recognize. If it is possible, collect about 30 minutes
(or more if you can) of data from your setup, transcribe them carefully,
and adapt the existing models using this data. This will definitely improve
the recognition performance on your task.
<p>
It may also be that your task has a small, closed vocabulary. In that case
having a large number of words in the decode dictionary and language model
may actually cause acoustic confusions which are entirely avoidable.
All you have to do in this situation is to retain *only* the words in your
vocabulary in the decode dictionary. If you can build a language model
from text that is representative of the kind of language you are likely to
encounter in your task, it will boost performance considerably.
<p>
It may also be that you have accented speech for which correct pronunciations
are not present in the decode dictionary. Check to see if that is the
case; if it is, it would help to revise the dictionary pronunciations and
add new variants to the existing pronunciations. Also check that all the
words you are trying to recognize are present in your recognition
dictionary.
<p>
If you suspect that noise is a huge problem, then try using some noise
compensation algorithm on your data prior to decoding. Spectral subtraction is
a popular noise compensation method, but it does not always work.
<p>
All this, of course, assumes that the signals you are recording or trying
to recognize are not distorted or clipped due to hardware problems in your
setup. Look especially at the utterances that are very badly recognized,
by actually viewing a display of the speech signals. In fact, this is the
first thing that you should check.

<hr>
<a NAME="23"></a>
<TABLE width="100%" bgcolor="#ffffff">
<td>WHY IS SPHINX III'S PERFORMANCE POORER THAN RECOGNIZER X?</td>
</table>
<p>
Q<font color="#483d8b">.
Sphinx III's default acoustic and language models do not seem to be
able to handle tasks like dictation.  Why?  </font>
<p>

<p> 

(By Arthur Chan at 20040910) The design of a speech recognizer is largely
shaped by the goals of the recognizer.  In the case of CMU Sphinx,
most of the effort was driven by DARPA research in the 1990s.  The
broadcast news models were trained for the so-called eval97 task,
in which transcriptions had to be produced for broadcast news.

This explains why the models don't really work well for a task like
dictation: the data was simply not collected for dictation.

Commercial speech applications also require a lot of task-specific tuning
and application engineering. For example, most commercial dictation
engines use more carefully processed training material to train the
acoustic model and language model. They also apply techniques such as
speaker adaptation.  CMU, unfortunately, did not have enough resources to
carry out this kind of work.
</p> 

<hr>
<a NAME="17"></a>
<TABLE width="100%" bgcolor="#ffffff">
<td>INTERPRETING SPHINX-II FILE FORMATS</td>
</table>
You can read more about the SPHINX-II file formats <a href="http://www.speech.cs.cmu.edu/sphinxman/scriptman1.html#4">here</a>
<p>
Q<font color="#483d8b">. (this question is reproduced as it was asked!)
    I was trying to read the SphinxII HMM files (Not in a very good
format). I read your provided  "SCHMM_format" file with your distribution.
But, life is never that easy, the cHMM files format should have been very
easy and straight forward..!!!
<br>
From your file ...
<br>
<u>chmm FILES</u>
<br>
There is one *.chmm file per ci phone. Each stores the transition matrix
associated with that particular ci phone in following binary format.
(Note all triphones associated with a ci phone share its transition matrix)
<pre>
(all numbers are 4 byte integers):

-10     (a  header to indicate this is a tmat file)
256     (no of codewords)
5       (no of emitting states)
6       (total no. of states, including non-emitting state)
1       (no. of initial states. In fbs8 a state sequence can only begin
         with state[0]. So there is only 1 possible initial state)
0       (list of initial states. Here there is only one, namely state 0)
1       (no. of terminal states. There is only one non-emitting terminal
state)                                                                     
5       (id of terminal state. This is 5 for a 5 state HMM)
14      (total no. of non-zero transitions allowed by topology)
[0 0 (int)log(tmat[0][0]) 0]   (source, dest, transition prob, source id)
[0 1 (int)log(tmat[0][1]) 0]
[1 1 (int)log(tmat[1][1]) 1]
[1 2 (int)log(tmat[1][2]) 1]
[2 2 (int)log(tmat[2][2]) 2]
[2 3 (int)log(tmat[2][3]) 2]
[3 3 (int)log(tmat[3][3]) 3]
[3 4 (int)log(tmat[3][4]) 3]
[4 4 (int)log(tmat[4][4]) 4]
[4 5 (int)log(tmat[4][5]) 4]
[0 2 (int)log(tmat[0][2]) 0]
[1 3 (int)log(tmat[1][3]) 1]
[2 4 (int)log(tmat[2][4]) 2]
[3 5 (int)log(tmat[3][5]) 3]
</pre>
There are thus 65 integers in all, and so each *.chmm file should be
65*4 = 260 bytes in size.
<br>
 ...
<br>
that should have been easy enough, until I was surprised with the fact that
the probabilities are all written in long (4 bytes) format although the
float is also 4 bytes so no space reduction is achieved, Also they are        
stored LOG and not linear although overflow considerations (the reasons for
taking the logs are during run time not in the files...)
<br>
All this would be normal and could be achieved ....
<br>
but when I opened the example files I found very strange data that would
not represent any linear or logarithmic or any format of probability values
That is if we took the file "AA.chmm" we would find that the probabilities
from state 0 to any other state are written in hex as follows:
<pre>
(0,0) 00 01 89 C7
(0,1) 00 01 83 BF
(0,2) 00 01 10 AA
</pre>
As I recall that these probabilities should all summate to "1".
Please, show me how this format would map to normal probabilities like 0.1,
0.6, 0.3 ...                                                                  
</font>
<p>
A.
First - we store integers for historic reasons. This is no longer
the case in the Sphinx-3 system. The sphinx-2 is eventually going
to be replaced by sphinx-3, so we are not modifying that system. One of the
original reasons for storing everything in integer format was that
integer arithmetic is faster than floating point arithmetic in most
computers. However, this was not the only reason.
<br>
Second - we do not actually store *probabilities* in the chmm files.
Instead we store *expected counts* (which have been returned by the
baum-welch algorithm). These have to be normalized by summing and
dividing by the sum.
<br>
Finally - the numbers you have listed below translate to the following
integers:
<pre>
>(0,0) 00 01 89 C7
This number translates to 100807

>(0,1) 00 01 83 BF
This number translates to 99263

>(0,2) 00 01 10 AA                                                             
This number translates to 69802
</pre>
These numbers are the logarithmic version of the floating point counts
with one simple variation - the logbase is not "e" or 10; it is 1.0001.
This small base was used for reasons of precision - larger bases would
result in significant loss of precision when the logarithmized number was
truncated to integer.
<p>
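As a quick check of this (a sketch, not part of the SPHINX tools), the first
value quoted above can be converted back to a linear count:
<pre>
#include &lt;stdio.h&gt;
#include &lt;math.h&gt;

int main(void)
{
    int stored = 100807;                         /* 0x000189C7 from AA.chmm */
    double count = pow(1.0001, (double)stored);  /* undo the logbase-1.0001 */
    printf("stored %d -> linear count %.1f\n", stored, count);
    return 0;
}
</pre>
<p>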

Q<font color="#483d8b">. The problem that I am facing is that I already
have an HMM model trained & generated using the Entropic HTK and I want to
try to use your decoder with this model.  So I am trying to build a
conversion tool to convert from the HTK format to your format. In HTK
format, the transition matrix is all stored in probabilities!!  So how do I
convert these probabilities into your "expected counts".
</font>
<p>
A. You can take logbase 1.0001 of the HTK probs, truncate and
store them.
<p>
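A minimal sketch of that conversion (the probability value below is just an
example):
<pre>
#include &lt;stdio.h&gt;
#include &lt;math.h&gt;

int main(void)
{
    double p = 0.6;                              /* example HTK transition prob */
    int stored = (int)(log(p) / log(1.0001));    /* log base 1.0001, truncated */
    printf("p = %.2f -> stored integer %d\n", p, stored);
    return 0;
}
</pre>
<p>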


<hr>
<a NAME="18"></a>
<TABLE width="100%" bgcolor="#ffffff">
<td>HYPOTHESIS COMBINATION</td>
</table>
<p>
Q<font color="#483d8b">.
In the hypothesis-combination code, all the scaling factors of recombined
hypotheses are 0. Why is this so?
</font>
<p>
A. Since the hypothesis combination code does not perform rescaling of scores
during combination, there is no scaling involved. Hence the scaling factor
comes out to be 0. This is usually what happens with any method that
rescores lattices without actually recomputing acoustic scores.  Second,
the code uses *scaled* scores for recombination. This is because different
features have different dynamic ranges, and therefore the likelihoods
obtained with different features are different. In order to be able to
compare different acoustic features, their likelihoods would have to be
normalized somehow. Ideally, one would find some global normalization
factor for each feature, and normalize the scores for that feature using
this factor. However, since we do not have global normalization factors, we
simply use the local normalization factor that the decoder has
determined. This has the added advantage that we do not have to rescale any
likelihoods. So the true probability of a word simply remains
LMscore+acoustic score.  The correct way of finding scaling factors
(esp. in the case of combination at the lattice level, which is more
complex than combination at the hypothesis level) is a problem that, if
solved properly, will give us even greater improvements with combination.

<p>
Q<font color="#483d8b">.
Given the scaling factor, the acoustic and LM
likelihood of a word in the two hypotheses to be combined, how do we decide
which one should appear in the recombined hypothesis? For example, the
word "SHOW" appears in both hypotheses but in different frames (one is in
the 40th and the other in the 39th) - these words are merged - but how should we
decide the beginning frame of the word "SHOW" in the recombined hypothesis,
and why does it become the 40th frame after recombination?
</font>
<p>
A. This is another problem with "merging" nodes as we do it. Every time we
merge two nodes, and permit some difference in the boundaries of the words
being merged, the boundaries of the merged node become unclear. The manner
in which we have chosen the boundaries of the merged node is just one of
many ways of doing it, none of which have any clear advantage over the other.
It must be noted though that if we chose the larger of the two boundaries
(e.g if we merge WORD(10,15) with WORD(9,14) to give us WORD(9,15)), the
resultant merged node gets "wider". This can be a problem when we are
merging many different hypotheses as some of the nodes can get terribly
wide (when many of the hypotheses have a particular word, but with widely
varying boundaries), resulting in loss of performance. This is an issue
that must be cleared up for lattice combination.

<p>
Q<font color="#483d8b">.
I noticed that the acoustic likelihood of some merged
word doesn't change from the original hypothesis to the recombined
hypothesis, For example, the word "FOR"  has acoustic likelihood to be
-898344 in one hyp and -757404 in another hypothesis, they all appear in
the 185th frame in both hyp. but in the recombined hyp, the word "FOR"
appears at 185th frame with likelihood -757404, the same as in one of the
hypotheses. These likelihoods should have been combined, but it appears
that they haven't been combined. Why not?
</font>

<p>
A. The scores you see are *log* scores. So, when we
combine -757404 with -898344, we actually compute

log(e^-757404 + e^-898344).

But e^-757404 >> e^-898344, so

e^-757404 + e^-898344 = e^-757404  to within many decimal places. As a result,
the combined score is simply -757404.

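<p>
For completeness, here is a sketch (not the actual combination code) of how
two such log-domain scores can be added without ever leaving the log domain,
using the scores from the question:
<pre>
#include &lt;stdio.h&gt;
#include &lt;math.h&gt;

/* log(e^a + e^b), computed so that the exponentials never overflow */
static double log_add(double a, double b)
{
    double hi = (a > b) ? a : b;
    double lo = (a > b) ? b : a;
    return hi + log1p(exp(lo - hi));  /* exp(lo - hi) underflows to 0 when lo is much smaller */
}

int main(void)
{
    /* prints -757404.000000: the smaller term vanishes to within precision */
    printf("%f\n", log_add(-757404.0, -898344.0));
    return 0;
}
</pre>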
<hr>
<a NAME="19"></a>
<TABLE width="100%" bgcolor="#ffffff">
<td>LANGUAGE MODEL</td>
</table>
<p>
Q<font color="#483d8b">.
How can the backoff weights of a language model be positive?
</font>
<p>
A. Here's how we can explain positive numbers in place of backoff weights
in the LM:
The numbers you see in the ARPA-format LM (used in CMU) are not
probabilities. They are log base 10 numbers, so you have log10(probs)
and log10(backoffweights). Backoff weights are NOT probabilities.
<p>
Consider a 4 word vocab 
<p>
A B C D.
<p>
Let their unigram probabilities be
<pre>
A     0.4
B     0.3
C     0.2
D     0.1
</pre>
which sum to 1. (no [UNK] here).
<p>
Consider the context A. Suppose in the LM text we only observed the
strings AA and AB. To account for the unseen strings AC and AD (in
this case) we will perform some discounting (using whatever method we
want to). So after discounting, let us say the probabilities of the
seen strings are:
P(A|A) = 0.2 <br>
P(B|A) = 0.3  <br>

So, since we've never seen AC or AD, we approximate P(C|A) with <br>
P(C|A) = bowt(A) * P(C)<br>

and P(D|A) with<br>
P(D|A) = bowt(A) * P(D)<br>

So we should get<br>
P(A|A) + P(B|A) + P(C|A) + P(D|A) = bowt(A)*(P(C)+P(D)) + P(A|A) + P(B|A) 
<br>
= bowt(A) * (0.1+0.2) + 0.2 + 0.3<br>
= 0.5 + 0.3*bowt(A)<br>

<p>
But the sum P(A|A)..P(D|A) must be 1
<p>
So obviously bowt(A) > 1
<p>
And log(bowt(A)) will be positive.
<p>
Backoff weights can thus in general be greater than or less than 1.
With larger and fuller LM training data, where most n-grams are seen,
they are mostly less than 1.
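<p>
Completing the arithmetic of the toy example above (a sketch only):
<pre>
#include &lt;stdio.h&gt;
#include &lt;math.h&gt;

int main(void)
{
    double seen_bigram_mass    = 0.2 + 0.3;   /* P(A|A) + P(B|A) after discounting */
    double unseen_unigram_mass = 0.2 + 0.1;   /* P(C) + P(D) */

    /* choose bowt(A) so that P(A|A)+P(B|A)+P(C|A)+P(D|A) sums to 1 */
    double bowt = (1.0 - seen_bigram_mass) / unseen_unigram_mass;

    printf("bowt(A) = %.4f, log10(bowt(A)) = %.4f\n", bowt, log10(bowt));
    /* prints bowt(A) = 1.6667, log10(bowt(A)) = 0.2218, i.e. a positive entry in the LM */
    return 0;
}
</pre>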
<hr>
<a NAME="20"></a>
<TABLE width="100%" bgcolor="#ffffff">
<td>TRAINING CONTEXT-DEPENDENT MODELS WITH UNTIED STATES</td>
</table>
<p>
Q<font color="#483d8b">.
 During the cd-untied phase, things are failing miserably. A very long list
of "..senones that never occur in the input data" is being generated. 
The result of this large list is that the means file ends up with a
large number of zeroed vectors. What could be the reason?
</font>
<p>
A. The number of triphones listed in the untied model definition file could
be far greater than the actual number of triphones present in your training
corpus. This could happen if the model-definition file is being created
off the dictionary, without any <i>effective</i> reference to the transcripts (<i>e.g.</i>, minimum required occurrence in the transcripts = 0), and with a
large value for the default number of triphones, OR if, by mistake, you are 
using a pre-existing model-definition file that was created off a much larger
corpus.
<p>
<a NAME="21"></a>
<TABLE width="100%" bgcolor="#ffffff">
<td>ACOUSTIC LIKELIHOODS AND SCORES</td>
</table>
<p>
Q<font color="#483d8b">.
Acoustic likelihoods for words as written out by the decoder and 
(force-)aligner are both positive and negative, while they are
exclusively negative in the lattices. How is this possible?
</font>
<p>

A. The acoustic likelihoods for each word as seen in the decoder and 
aligner outputs are scaled at each frame by the maximum score for
that frame. The final (total) scaling factor is written out in
the decoder MATCHSEG output as the number following the letter "S". "T" is
the total score without the scaling factor. The real score is
the sum of S and T. The real score for each word is written out
in the logfile only if you ask for the backtrace (otherwise that
table is not printed). In the falign output, only the real scores
are written. The real scores of words are both positive and negative,
and large numbers because they use a very small logbase (1.0001 is
the default value for both the decoder and the aligner).
<p>
In the lattices, only the scaled scores are stored and total scaling
factor is not written out. This would not affect any rescoring
of a lattice, but might affect (positively or negatively) the combination of lattices
because the scaling factors may be different for each lattice.
<p>
Q<font color="#483d8b">.
In the following example
<pre>
Decoding:
         WORD        Sf    Ef      Ascore      Lmscore
         SHOW        11    36     1981081      -665983
         LOCATIONS   37    99      -13693      -594246
         AND        100   109     -782779      -214771
         C-RATINGS  110   172     1245973      -608433

falign:
          Sf    Ef     Ascore         WORD
          11    36    2006038         SHOW
          37    99     -37049         LOCATIONS
         100   109    -786216         AND
         110   172    1249480         C-RATINGS
</pre>
We see that the score from decoding and falign are different
even for words that begin and end at the same frames. Why is this so?
I am confused about the difference in the ABSOLUTE score (the one without
normalization by maximum in each frame) from decode and falign.
In the above example, the absolute score for the word "locations" (with lc
"show", rc "and") beginning at frame no. 37 and ending at frame no. 99
is -13693 from the decode (I get the number from the decode log file, with
backtrace on), while the score for
exactly the same word, same time instants and same context is different
in falign output( -37049 ). Can this be due to the DAG merge?
</font>
<p>
A. There are several reasons why falign and decoder scores
can be different. One, as you mention, is artificially
introduced DAG edges. However, in the forward pass
there is no DAG creation, so AM scores obtained from
the FWDVIT part of the decode will not have DAG creation
related artefacts.
Other possible reasons for differences in falign and
decode scores are differences in logbase, differences
in beam sizes, differences in floor values for the HMM
parameters etc.
Even when all these parameters are identical the scores
can be different because the decoder must consider many
other hypotheses in its pruning strategy and may prune
paths through the hypothesized  transcript differently
from the forced aligner. The two scores will only be
identical if word, phone and state level segmentations
are all identical in the two cases. Otherwise they
can be different (although not in a big way).
Unfortunately, the decoder does not output state segmentations,
so you can't check on this. You could check phone segmentations
to make sure they are identical in both cases. If they are not,
that will explain it. If they are, the possibility of different
state segmentations still exists.

<p>
Q<font color="#483d8b">.
The decoder outputs the following for utterance 440c0206:
<br>
FWDXCT: 440c0206 S 31362248 T -28423295 A -23180713 L -536600 0 -2075510
0 &#60s> 56 -661272 -54550 TO(3) 73 -4390222 -89544 AUGUST 158 -2868798
-113158 INSIDER 197 -1833121 -1960 TRADING 240 -2867326 -74736 RECENTLY
292 -941077 -55738 ON 313 -1669590 -12018 THE(2) 347 -1454081 -63248
QUARTER 379 -4419716 -71648 JUMPED 511
<br>
Now let's say I have the same utts, and the same models, but a different
processing (e.g. some compensation is now applied), and in this expt I
get:
<br> 
FWDXCT: 440c0206 S 36136210 T -25385610 A -21512567 L -394130 0 -1159384
0 &#60s> 56 -711513 -63540 TWO 73 -1679116 -52560 OTHER 103 -2163915 -52602
ISSUES 152 -2569731 -51616 BEGAN 197 -1952266 -22428 TRADING 240
-5408397 -74736 RECENTLY 333 -3049232 -47562 REPORTED(2) 395 -2819013 -
29086 &#60/s> 511
<br>
Let's say I want to see if this compensation scheme increased the
likelihood of the utterance. can I just compare the acoustic scores
(after the "A") directly, or do I have to take the scaling ("S") into
account somehow (e.g. add it back in (assuming its applied in the log
domain))? 
</font>
<p>
A. You have to add the scaling factor (the number after "S") to the
acoustic likelihood (the number after "A") to get the total likelihood.
In the example above, the first run gives 31362248 - 23180713 = 8181535
and the second gives 36136210 - 21512567 = 14623643, so the compensated
run has the higher total acoustic likelihood. You can then compare such
scores across different runs with the same acoustic models.
<p>
<a NAME="22"></a>
<TABLE width="100%" bgcolor="#ffffff">
<td>DECODING PROBLEMS</td>
</table>
<p>
Q<font color="#483d8b">.
I am trying to use the opensource SPHINX-II decoder with semicontinuous
models. The decoder sometimes takes very long to decode an utterance, and
occasionally just hangs. What could be the problem?
</font>
<p>
A. Check your -agc*, -normmean and -bestpath flags.
It is important to set the AGC/CMN flags to the same setting as was used
to train the models. Otherwise, the decoder makes more mistakes.
When this happens, when it tries to create a DAG for rescoring
for the bestpath (which is enabled by setting "-bestpath TRUE")
it gets trapped while creating the DAG and spends inordinate
amounts of time on it (sometimes never succeeding at all).
Even if the AGC/CMN flags are correct, this can happen on bad utterances.
Set -bestpath FALSE and check if the problem persists for the
correct AGC/CMN settings. If it does, there might be a problem with your
acoustic models.
<p>
<a NAME="23"></a>
<TABLE width="100%" bgcolor="#ffffff">
<td>INTERPRETING SPHINX-III FILE FORMATS</td>
</table>
<p>
Q<font color="#483d8b">.
What's up with the s3 mixw files? The values seem all over the place. To
get mixw in terms of numbers that sum to one, do you have to sum up all
the mixw and divide by the total? Any idea why it is done this way? Is there
a sphinx function to return normalized values? Not that it's hard to
write, but no need reinventing the wheel...
here's an example of 1gau mixw file, using printp to view contents:
<pre>
--------with -norm no
mixw 5159 1 1
mixw [0 0] 1.431252e+04

        1.431e+04 
mixw [1 0] 3.975112e+04

        3.975e+04 
mixw [2 0] 2.254014e+04

        2.254e+04 
mixw [3 0] 2.578259e+04

        2.578e+04 
mixw [4 0] 1.262872e+04

        1.263e+04 

-with -norm yes
mixw 5159 1 1
mixw [0 0] 1.431252e+04

        1.000e+00 
mixw [1 0] 3.975112e+04

        1.000e+00 
mixw [2 0] 2.254014e+04

        1.000e+00 
mixw [3 0] 2.578259e+04

        1.000e+00 
mixw [4 0] 1.262872e+04

        1.000e+00
</pre>
</font>
<p>
A. In s3, we have mixtures of gaussians for each state. Each gaussian has
a different mixture weight. When there is only 1 gaussian/state the
mixture weight is 1. However, instead of writing the number 1 we
write a number like "1.431252e+04", which is basically the number of
times the state occurred in the corpus. This number is
useful in other places during training (interpolation, adaptation, tree
building, etc.). The number following
"mixw", like "mixw 5159" below, merely tells you the total number of
mixture weights (equal to the total number of tied states for 1 gau/state models).

So
<pre>
--------with -norm no
mixw 5159 1 1
mixw [0 0] 1.431252e+04

        1.431e+04 
</pre>
implies you haven't summed all weights and divided by total

and
<pre>
--------with -norm yes
mixw 5159 1 1
mixw [0 0] 1.431252e+04

        1.0000..
</pre>
implies you *have* summed and divided by total (here  you
have only one mixw to do it on per state), and so get a mixw of 1.
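<p>
Conceptually, normalization just divides each state's raw counts by their sum.
Here is a small sketch (the counts below are made up, for a hypothetical
4-gaussian state):
<pre>
#include &lt;stdio.h&gt;

int main(void)
{
    /* made-up raw mixture-weight counts for one tied state */
    double counts[4] = {3200.0, 1100.0, 540.0, 160.0};
    double sum = 0.0;
    int i;

    for (i = 0; i != 4; i++)
        sum += counts[i];
    for (i = 0; i != 4; i++)
        printf("gaussian %d: count %7.1f -> weight %.4f\n",
               i, counts[i], counts[i] / sum);
    return 0;
}
</pre>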






<hr>
    <H2></H2><!-- Just to provide some space -->
    
    <address>Maintained by <a href="mailto:egouvea+sourceforge@cs.cmu.edu">Evandro B. Gouv&ecirc;a</a></address>
    <!-- Created: Fri  Nov 17 17:05:14 EST 2000 -->
    <!-- hhmts start -->
Last modified: Wed Jul 26 13:18:28 Eastern Daylight Time 2006
<!-- hhmts end -->
</body>
</html>