
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<!-- Created by GNU Texinfo 5.2, http://www.gnu.org/software/texinfo/ -->
<head>
<title>Festival Speech Synthesis System: Utterance structure</title>

<meta name="description" content="Festival Speech Synthesis System: Utterance structure">
<meta name="keywords" content="Festival Speech Synthesis System: Utterance structure">
<meta name="resource-type" content="document">
<meta name="distribution" content="global">
<meta name="Generator" content="makeinfo">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link href="index.html#Top" rel="start" title="Top">
<link href="Index.html#Index" rel="index" title="Index">
<link href="Index.html#SEC_Contents" rel="contents" title="Table of Contents">
<link href="Utterances.html#Utterances" rel="up" title="Utterances">
<link href="Utterance-types.html#Utterance-types" rel="next" title="Utterance types">
<link href="Utterances.html#Utterances" rel="prev" title="Utterances">
<style type="text/css">
<!--
a.summary-letter {text-decoration: none}
blockquote.smallquotation {font-size: smaller}
div.display {margin-left: 3.2em}
div.example {margin-left: 3.2em}
div.indentedblock {margin-left: 3.2em}
div.lisp {margin-left: 3.2em}
div.smalldisplay {margin-left: 3.2em}
div.smallexample {margin-left: 3.2em}
div.smallindentedblock {margin-left: 3.2em; font-size: smaller}
div.smalllisp {margin-left: 3.2em}
kbd {font-style:oblique}
pre.display {font-family: inherit}
pre.format {font-family: inherit}
pre.menu-comment {font-family: serif}
pre.menu-preformatted {font-family: serif}
pre.smalldisplay {font-family: inherit; font-size: smaller}
pre.smallexample {font-size: smaller}
pre.smallformat {font-family: inherit; font-size: smaller}
pre.smalllisp {font-size: smaller}
span.nocodebreak {white-space:nowrap}
span.nolinebreak {white-space:nowrap}
span.roman {font-family:serif; font-weight:normal}
span.sansserif {font-family:sans-serif; font-weight:normal}
ul.no-bullet {list-style: none}
-->
</style>


</head>

<body lang="en" bgcolor="#FFFFFF" text="#000000" link="#0000FF" vlink="#800080" alink="#FF0000">
<a name="Utterance-structure"></a>
<div class="header">
<p>
Next: <a href="Utterance-types.html#Utterance-types" accesskey="n" rel="next">Utterance types</a>, Up: <a href="Utterances.html#Utterances" accesskey="u" rel="up">Utterances</a> &nbsp; [<a href="Index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Index.html#Index" title="Index" rel="index">Index</a>]</p>
</div>
<hr>
<a name="Utterance-structure-1"></a>
<h3 class="section">14.1 Utterance structure</h3>

<a name="index-utterance-2"></a>
<a name="index-TTS-processes"></a>
<p>Festival&rsquo;s basic object for synthesis is the <em>utterance</em>.  An
utterance represents some chunk of text that is to be rendered as speech.  In
general you may think of it as a sentence, but in many cases it won&rsquo;t
actually conform to the standard linguistic syntactic form of a
sentence.  In general the process of text to speech is to take an
utterance which contains a simple string of characters and convert it
step by step, filling out the utterance structure with more information
until a waveform is built that says what the text contains.
</p>
<p>The processes involved in conversion are, in general, as follows:
</p><dl compact="compact">
<dt><em>Tokenization</em></dt>
<dd><p>Converting the string of characters into a list of tokens.  Typically
this means whitespace separated tokens of the original text string.
</p></dd>
<dt><em>Token identification</em></dt>
<dd><p>Identification of general types for the tokens.  Usually this is trivial,
but some work is required to identify tokens of digits as years, dates,
numbers etc.
</p></dd>
<dt><em>Token to word</em></dt>
<dd><p>Convert each token to zero or more words, expanding numbers,
abbreviations etc.
</p></dd>
<dt><em>Part of speech</em></dt>
<dd><p>Identify the syntactic part of speech for the words.
</p></dd>
<dt><em>Prosodic phrasing</em></dt>
<dd><p>Chunk utterance into prosodic phrases.
</p></dd>
<dt><em>Lexical lookup</em></dt>
<dd><p>Find the pronunciation of each word from a lexicon/letter-to-sound
rule system, including phonetic and syllable structure.
</p></dd>
<dt><em>Intonational accents</em></dt>
<dd><p>Assign intonation accents to appropriate syllables.
</p></dd>
<dt><em>Assign duration</em></dt>
<dd><p>Assign duration to each phone in the utterance.
</p></dd>
<dt><em>Generate F0 contour (tune)</em></dt>
<dd><p>Generate tune based on accents etc.
</p></dd>
<dt><em>Render waveform</em></dt>
<dd><p>Render waveform from phones, duration and F0 target values.  This
itself may take several steps, including unit selection (be they
diphones or other sized units), imposition of desired prosody
(duration and F0) and waveform reconstruction.
</p></dd>
</dl>
<p>The number of steps and what actually happens may vary and is dependent
on the particular voice selected and the utterance&rsquo;s <em>type</em>,
see below.
</p>
<p>Each of these steps in Festival is achieved by a <em>module</em> which
will typically add new information to the utterance structure.
</p>
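<p>As a rough illustration of this module pipeline, the following Python
sketch treats an utterance as a structure that starts with only a text
string and is filled out by each module in turn.  It is not Festival code:
the function names <code>tokenize</code>, <code>token_to_word</code> and
<code>synthesize</code>, the dictionary layout and the digit table are all
assumptions made for this example, and only the tokenization and token to
word steps are imitated, very crudely.
</p>

```python
# Toy digit table, standing in for real number expansion (an assumption).
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}


def tokenize(utt):
    # Tokenization: whitespace separated tokens of the original text.
    utt["Token"] = utt["text"].split()
    return utt


def token_to_word(utt):
    # Token to word: each token becomes zero or more words.
    words = []
    for token in utt["Token"]:
        stripped = token.strip(".,;:!?")
        if stripped and all(ch in DIGITS for ch in stripped):
            # A token of digits is read out digit by digit.
            words.extend(DIGITS[ch] for ch in stripped)
        elif stripped:
            words.append(stripped.lower())
        # A punctuation-only token contributes no words at all.
    utt["Word"] = words
    return utt


def synthesize(text, modules):
    # Start from a bare string and let each module add information.
    utt = {"text": text}
    for module in modules:
        utt = module(utt)
    return utt


utt = synthesize("Room 42 , please.", [tokenize, token_to_word])
print(utt["Token"])
print(utt["Word"])
```

<p>Each function here only adds a new entry to the structure and leaves
the earlier ones in place, which mirrors the idea that later steps build on
the information produced by earlier ones.
</p>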
<a name="index-Utterance-structure"></a>
<a name="index-Items"></a>
<a name="index-Relations"></a>
<p>An utterance structure consists of a set of <em>items</em> which may be
part of one or more <em>relations</em>.  Items represent things like words
and phones, though may also be used to represent less concrete objects
like noun phrases, and nodes in metrical trees.  An item contains a set
of features (name and value pairs).  Relations are typically simple lists of
items or trees of items.  For example the <code>Word</code> relation is a
simple list of items each of which represents a word in the utterance.
Those words will also be in other relations, such as the
<em>SylStructure</em> relation where the word will be the top of a tree
structure containing its syllables and segments.
</p>
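<p>The idea that one item can sit in several relations at once can be
sketched in Python as follows.  This is a minimal model for illustration
only, not Festival&rsquo;s implementation: the classes <code>Item</code> and
<code>Relation</code>, their methods, and the phone names used for the word
are assumptions of this example.
</p>

```python
class Item:
    """A set of features (name and value pairs)."""

    def __init__(self, **features):
        self.features = dict(features)


class Relation:
    """A named structure over items: a simple list, or a list of trees."""

    def __init__(self, name):
        self.name = name
        self.roots = []      # top-level items, in order
        self.children = {}   # item -> list of daughter items
        self.parents = {}    # item -> parent item

    def append(self, item, parent=None):
        # With no parent the item joins the top-level list;
        # otherwise it becomes the last daughter of that parent.
        if parent is None:
            self.roots.append(item)
        else:
            self.children.setdefault(parent, []).append(item)
            self.parents[item] = parent
        return item

    def daughters(self, item):
        return self.children.get(item, [])

    def parent(self, item):
        return self.parents.get(item)


# The same word item is placed in two relations.
word = Item(name="hello")
word_rel = Relation("Word")
word_rel.append(word)

syl_structure = Relation("SylStructure")
syl_structure.append(word)
syl1 = syl_structure.append(Item(name="syl", stress=0), parent=word)
syl2 = syl_structure.append(Item(name="syl", stress=1), parent=word)
for phone in ("hh", "ax"):
    syl_structure.append(Item(name=phone), parent=syl1)
for phone in ("l", "ow"):
    syl_structure.append(Item(name=phone), parent=syl2)

# Walk word, then syllables, then segments inside SylStructure.
segments = [seg.features["name"]
            for syl in syl_structure.daughters(word)
            for seg in syl_structure.daughters(syl)]
print(segments)
```

<p>In this model the word has no daughters when viewed through the
<code>Word</code> relation but is the root of a tree when viewed through
<code>SylStructure</code>; the structure belongs to the relation, not to the
item.
</p>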
<p>Unlike previous versions of the system, items (then called stream items)
do not belong to any one particular relation (or stream); they are merely
part of the relations they are within.  Importantly, this allows much more
general relations to be made over items than was allowed in the previous
system.  This new architecture is the continuation of our goal
of providing a general efficient structure for representing complex
interrelated utterance objects.
</p>
<a name="index-Festival-relations"></a>
<p>The architecture is fully general and new items and relations may
be defined at run time, such that new modules may use any relations
they wish.  However, within our standard English (and other) voices
we have used a specific set of relations, as follows.
</p><dl compact="compact">
<dt><em>Token</em></dt>
<dd><p>a list of trees.  This is first formed as a list of tokens found
in a character text string.  Each root&rsquo;s daughters are the <code>Word</code>s
that the token is related to.
</p></dd>
<dt><em>Word</em></dt>
<dd><p>a list of words.  These items will also appear as daughters (leaf nodes)
of the <code>Token</code> relation.  They may also appear in the <code>Syntax</code>
relation (as leaves) if the parser is used.  They will also be leaves
of the <code>Phrase</code> relation.
</p></dd>
<dt><em>Phrase</em></dt>
<dd><p>a list of trees.  This is a list of phrase roots whose daughters are
the <code>Word</code>s within those phrases.
</p></dd>
<dt><em>Syntax</em></dt>
<dd><p>a single tree.  This, if the probabilistic parser is called, is a syntactic
binary branching tree over the members of the <code>Word</code> relation.
</p></dd>
<dt><em>SylStructure</em></dt>
<dd><p>a list of trees.  This links the <code>Word</code>, <code>Syllable</code> and
<code>Segment</code> relations.  Each <code>Word</code> is the root of a tree
whose immediate daughters are its syllables; their daughters in
turn are its segments.
</p></dd>
<dt><em>Syllable</em></dt>
<dd><p>a list of syllables.  Each member will also be in the
<code>SylStructure</code> relation.  In that relation its parent will be the
word it is in and its daughters will be the segments that are in it.
Syllables are also in the <code>Intonation</code> relation giving links to
their related intonation events.
</p></dd>
<dt><em>Segment</em></dt>
<dd><p>a list of segments (phones).  Each member (except silences) will be a leaf
node in the <code>SylStructure</code> relation.  These may also be in the
<code>Target</code> relation linking them to F0 target points.
</p></dd>
<dt><em>IntEvent</em></dt>
<dd><p>a list of intonation events (accents and boundaries).  These are related
to syllables through the <code>Intonation</code> relation as leaves of that
relation.  Thus their parent in the <code>Intonation</code> relation is the
syllable these events are attached to.
</p></dd>
<dt><em>Intonation</em></dt>
<dd><p>a list of trees relating syllables to intonation events.  Roots of
the trees in <code>Intonation</code> are <code>Syllables</code> and their daughters
are <code>IntEvents</code>.
</p></dd>
<dt><em>Wave</em></dt>
<dd><p>a single item with a feature called <code>wave</code> whose value
is the generated waveform.
</p></dd>
</dl>
<p>This is a non-exhaustive list: some modules may add other relations,
and not all utterances will have all these relations, but the above
is the general case.
</p>
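<p>The links between these relations can be restated as a few checks over a
toy utterance.  The Python sketch below is illustrative only: items are
plain strings, list relations are Python lists, tree relations are
dictionaries mapping a parent to its daughters, and the helper names
<code>leaves</code> and <code>check</code>, the phone names, and the use of
<code>pau</code> to mark a silence are assumptions of this example.  The
<code>Syntax</code>, <code>Target</code> and <code>Wave</code> relations are
left out for brevity.
</p>

```python
def leaves(tree):
    """Daughters that have no daughters of their own, in tree order."""
    return [d for kids in tree.values() for d in kids if d not in tree]


def check(utt):
    """Return the names of any cross-relation properties that do not hold."""
    syl = utt["SylStructure"]
    checks = {
        "Token leaves are the Words":
            leaves(utt["Token"]) == utt["Word"],
        "Phrase leaves are the Words":
            leaves(utt["Phrase"]) == utt["Word"],
        "Word daughters are the Syllables":
            [s for w in utt["Word"] for s in syl.get(w, [])]
            == utt["Syllable"],
        "non-silence Segments are SylStructure leaves":
            leaves(syl) == [s for s in utt["Segment"] if s != "pau"],
        "Intonation roots are Syllables":
            all(s in utt["Syllable"] for s in utt["Intonation"]),
        "Intonation leaves are the IntEvents":
            leaves(utt["Intonation"]) == utt["IntEvent"],
    }
    return [name for name, ok in checks.items() if not ok]


# A hand-built utterance for the text "hi there".
utt = {
    "Token": {"tok1": ["hi"], "tok2": ["there"]},
    "Word": ["hi", "there"],
    "Phrase": {"phrase1": ["hi", "there"]},
    "SylStructure": {"hi": ["syl1"], "there": ["syl2"],
                     "syl1": ["hh", "ay"], "syl2": ["dh", "eh", "r"]},
    "Syllable": ["syl1", "syl2"],
    "Segment": ["pau", "hh", "ay", "dh", "eh", "r"],  # the silence is in no tree
    "Intonation": {"syl2": ["accent1", "boundary1"]},
    "IntEvent": ["accent1", "boundary1"],
}

problems = check(utt)
print(problems)
```

<p>An empty result means the toy utterance is consistent with the
descriptions above.  Since modules may add other relations and not every
utterance has all of these, such checks describe the general case rather
than a guarantee.
</p>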



</body>
</html>