/usr/share/doc/happy/html/sec-glr-misc.html

<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>3.4. Further information</title><link rel="stylesheet" type="text/css" href="fptools.css"><meta name="generator" content="DocBook XSL Stylesheets V1.78.1"><link rel="home" href="index.html" title="Happy User Guide"><link rel="up" href="sec-glr.html" title="Chapter 3. Generalized LR Parsing"><link rel="prev" href="sec-glr-semantics.html" title="3.3. Including semantic results"><link rel="next" href="sec-AttributeGrammar.html" title="Chapter 4. Attribute Grammars"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="3" align="center">3.4. Further information</th></tr><tr><td width="20%" align="left"><a accesskey="p" href="sec-glr-semantics.html">Prev</a> </td><th width="60%" align="center">Chapter 3. Generalized LR Parsing</th><td width="20%" align="right"> <a accesskey="n" href="sec-AttributeGrammar.html">Next</a></td></tr></table><hr></div><div class="sect1"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sec-glr-misc"></a>3.4. Further information</h2></div></div></div><p>
      Other useful information...
      </p><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sec-glr-misc-examples"></a>3.4.1. The GLR examples</h3></div></div></div><p>
	The directory <code class="literal">examples/glr</code> contains several examples
	from the small to the large. Please consult these or use them as a
	base for your experiments.
        </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sec-glr-misc-graphs"></a>3.4.2. Viewing forests as graphs</h3></div></div></div><p>
	If you run the examples with <span class="application">GHC</span>, each
	run will produce a file <code class="literal">out.daVinci</code>. This is a
	graph in the format expected by the <span class="emphasis"><em>daVinci</em></span>
	graph visualization tool.
	(See <a class="ulink" href="http://www.informatik.uni-bremen.de/~davinci/" target="_top">http://www.informatik.uni-bremen.de/~davinci/</a>
	for more information. Educational use licenses are currently
	available without charge.)
        </p><p>
	We highly recommend looking at graphs of parse results - it really
	helps to understand the results.
	The graphs files are created with Sven Panne's library for
	communicating with <span class="emphasis"><em>daVinci</em></span>, supplemented
	with some extensions due to Callaghan. Copies of this code are
	included in the examples directory, for convenience.
	If you are trying to view large and complex graphs, contact Paul
	Callaghan (there are tools and techniques to make the graphs more
	manageable).
	</p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sec-glr-misc-applications"></a>3.4.3. Some Applications of GLR parsing</h3></div></div></div><p>
	GLR parsing (and related techniques) aren't just for badly written
	grammars or for things like natural language (NL) where ambiguity is
	inescapable. There are applications where ambiguity can represent
	possible alternatives in pattern-matching tasks, and the flexibility
	of these parsing techniques and the resulting graphs support deep
	analyses. Below, we briefly discuss some examples, a mixture from
	our recent work and from the literature.
        </p><div class="variablelist"><dl class="variablelist"><dt><span class="term">Gene sequence analysis</span></dt><dd><p>
	       Combinations of structures within gene sequences can be
	       expressed as a grammar, for example a "start" combination
	       followed by a "promoter" combination then the gene proper.
	       A recent undergraduate project has used this GLR implementation
	       to detect candiate matches in data, and then to filter these
	       matches with a mixture of local and global information.
	       </p></dd><dt><span class="term">Rhythmic structure in poetry</span></dt><dd><p>
	       Rhythmic patterns in (English) poetry obey certain rules,
	       and in more modern poetry can break rules in particular ways
	       to achieve certain effects. The standard rhythmic patterns
	       (eg. iambic pentameter) can be encoded as a grammar, and
	       deviations from the patterns also encoded as rules.
	       The neutral reading can be parsed with this grammar, to
	       give a forest of alternative matches. The forest can be
	       analysed to give a preferred reading, and to highlight
	       certain technical features of the poetry.
	       An undergraduate project in Durham has used this implementation
	       for this purpose, with promising results.
	       </p></dd><dt><span class="term">Compilers -- instruction selection</span></dt><dd><p>
	       Recent work has phrased the translation problem in
	       compilers from intermediate representation to an
	       instruction set for a given processor as a matching
	       problem. Different constructs at the intermediate
	       level can map to several combinations of machine
	       instructions. This knowledge can be expressed as a
	       grammar, and instances of the problem solved by
	       parsing. The parse forest represents competing solutions,
	       and allows selection of optimum solutions according
	       to various measures.
	       </p></dd><dt><span class="term">Robust parsing of ill-formed input</span></dt><dd><p>
	       The extra flexibility of GLR parsing can simplify parsing
	       of formal languages where a degree of `informality' is allowed.
	       For example, Html parsing. Modern browsers contain complex
	       parsers which are designed to try to extract useful information
	       from Html text which doesn't follow the rules precisely,
	       eg missing start tags or missing end tags.
	       Html with missing tags can be written as an ambiguous grammar,
	       and it should be a simple matter to extract a usable
	       interpretation from a forest of parses.
	       Notice the technique: we widen the scope of the grammar,
	       parse with GLR, then extract a reasonable solution.
	       This is arguably simpler than pushing an LR(1) or LL(1)
	       parser past its limits, and also more maintainable.
	       </p></dd><dt><span class="term">Natural Language Processing</span></dt><dd><p>
	       Ambiguity is inescapable in the syntax of most human languages.
	       In realistic systems, parse forests are useful to encode
	       competing analyses in an efficient way, and they also provide
	       a framework for further analysis and disambiguation. Note
	       that ambiguity can have many forms, from simple phrase
	       attachment uncertainty to more subtle forms involving mixtures
	       of word senses. If some degree of ungrammaticality is to be
	       tolerated in a system, which can be done by extending the
	       grammar with productions incorporating common forms of
	       infelicity, the degree of ambiguity increases further. For
	       systems used on arbitrary text, such as on newspapers,
	       it is not uncommon that many sentences permit several
	       hundred or more analyses. With such grammars, parse forest
	       techniques are essential.
	       Many recent NLP systems use such techniques, including
	       the Durham's earlier LOLITA system - which was mostly
	       written in Haskell.
	       </p></dd></dl></div></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sec-glr-misc-workings"></a>3.4.4. Technical details</h3></div></div></div><p>
	The original implementation was developed by Ben Medlock,
	as his undergraduate final year project,
	using ideas from Peter Ljungloef's Licentiate thesis
	(see <a class="ulink" href="http://www.cs.chalmers.se/~peb/parsing" target="_top">http://www.cs.chalmers.se/~peb/parsing</a>, and
	we recommend the thesis for its clear analysis of parsing
	algorithms).
	Ljungloef's version produces lists of parse trees, but Medlock
	adapted this to produce an explicit graph containing parse structure
	information. He also incorporated
	the code into <span class="application">Happy</span>.
        </p><p>
	After Medlock's graduation, Callaghan extended the code to
	incorporate semantic information, and made several improvements
	to the original code, such as improved local packing and
	support for hidden left recursion. The performance of the
	code was significantly improved, after changes of representation
	(eg to a chart-style data structure)
	and technique. Medlock's code was also used in several student
	projects, including analysis of gene sequences (Fischer) and
	analysis of rhythmic patterns in poetry (Henderson).
        </p><p>
	The current code implements the standard GLR algorithm extended
	to handle hidden left recursion. Such recursion, as in the grammar
	below from Rekers [1992], causes the standard algorithm to loop
	because the empty reduction <code class="literal">A -&gt; </code> is always
	possible and the LR parser will not change state. Alternatively,
	there is a problem because an unknown (at the start of parsing)
	number of <code class="literal">A</code>
	items are required, to match the number of <code class="literal">i</code>
	tokens in the input.
        </p><pre class="programlisting">
	S -&gt; A Q i | +
	A -&gt;
        </pre><p>
	The solution to this is not surprising. Problematic recursions
	are detected as zero-span reductions in a state which has a
	<code class="literal">goto</code> table entry looping to itself. A special
	symbol is pushed to the stack on the first such reduction,
	and such reductions are done at most once for any token
	alternative for any input position.
	When popping from the stack, if the last token being popped
	is such a special symbol, then two stack tails are returned: one
	corresponding to a conventional pop (which removes the
	symbol) and the other to a duplication of the special symbol
	(the stack is not changed, but a copy of the symbol is returned).
	This allows sufficient copies of the empty symbol to appear
	on some stack, hence allowing the parse to complete.
	</p><p>
	The forest is held in a chart-style data structure, and this supports
	local ambiguity packing (chart parsing is discussed in Ljungloef's
	thesis, among other places).
	A limited amount of packing of live stacks is also done, to avoid
	some repetition of work.
	</p><p>
	[Rekers 1992] Parser Generation for Interactive Environments,
	PhD thesis, University of Amsterdam, 1992.
        </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sec-glr-misc-filter"></a>3.4.5. The <code class="option">--filter</code> option</h3></div></div></div><p>
	You might have noticed this GLR-related option. It is an experimental
	feature intended to restrict the amount of structure retained in the
	forest by discarding everything not required for the semantic
	results. It may or it may not work, and may be fixed in a future
	release.
        </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sec-glr-misc-limitations"></a>3.4.6. Limitations and future work</h3></div></div></div><p>
	The parser supports hidden left recursion, but makes no attempt
	to handle cyclic grammars that have rules which do not consume any
	input. If you have a grammar like this, for example with rules like
	<code class="literal">S -&gt; S</code> or
	<code class="literal">S -&gt; A S | x; A -&gt; empty</code>, the implementation will
	loop until you run out of stack - but if it will happen, it often
	happens quite quickly!
	</p><p>
	The code has been used and tested frequently over the past few years,
	including being used in several undergraduate projects. It should be
	fairly stable, but as usual, can't be guaranteed bug-free. One day
	I will write it in Epigram!
	</p><p>
	If you have suggestions for improvements, or requests for features,
	please contact Paul
	Callaghan. There are some changes I am considering, and some
	views and/or encouragement from users will be much appreciated.
	Further information can be found on Callaghan's
	<a class="ulink" href="http://www.dur.ac.uk/p.c.callaghan/happy-glr" target="_top">GLR parser
	page</a>.
        </p></div><div class="sect2"><div class="titlepage"><div><div><h3 class="title"><a name="sec-glr-misc-acknowledgements"></a>3.4.7. Thanks and acknowledgements</h3></div></div></div><p>
	Many thanks to the people who have used and tested this software
	in its various forms, including Julia Fischer, James Henderson, and
	Aarne Ranta.
        </p></div></div><div class="navfooter"><hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="sec-glr-semantics.html">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="sec-glr.html">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="sec-AttributeGrammar.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">3.3. Including semantic results </td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top"> Chapter 4. Attribute Grammars</td></tr></table></div></body></html>
happy 1.19.4-1 / usr / share / doc / happy / html / sec-glr-misc.html