/usr/share/gocode/src/gopkg.in/neurosnap/sentences.v1/test_files/english/kiss

The sentence is a fundamental and relatively well-understood unit in theoretical and
computational linguistics.
{{sentence_break}}
Syntactic processing is generally performed on a sentenceby-sentence
basis, and many processes are constrained by this abstract concept in that
they may be active inside a sentence but not between consecutive sentences.
{{sentence_break}}
Among
these, we find collocations, idioms, anaphora, as well as variable binding and quantification.
{{sentence_break}}
Given that these processes are crucially constrained by sentence boundaries,
the successful determination of these boundaries is a prerequisite for proper sentence
processing.
{{sentence_break}}
Sentence boundary detection is not a trivial task, though.
{{sentence_break}}
Graphemes often
serve more than one purpose in writing systems.
{{sentence_break}}
The period, which is employed as sentence
boundary marker, is no exception to this rule.
{{sentence_break}}
It is also used to mark abbreviations,
initials, ordinal numbers, and ellipses.
{{sentence_break}}
Moreover, a period can be used to mark an abbreviation
and a sentence boundary at the same time.
{{sentence_break}}
In such cases, the second period
is haplologically omitted and only one period is used as end-of-sentence and abbreviation
marker.1 Sentence boundary detection thus has to be considered as an instance of
ambiguity resolution.
{{sentence_break}}
The ambiguity of the period is illustrated by example (1).
{{sentence_break}}
∗ Sprachwissenschaftliches Institut, Ruhr-Universitat Bochum, 44780 Bochum, Germany, E-mail: ¨
tibor@linguistics.rub.de
† Sprachwissenschaftliches Institut, Ruhr-Universitat Bochum, 44780 Bochum, Germany, E-mail: ¨
strunk@linguistics.rub.de
1 See Nunberg (1990) for a linguistic discussion of punctuation and the ambiguity of the period.
{{sentence_break}}
°c 2006 Association for Computational Linguistics
Computational Linguistics Volume 20, Number 1
Example 1
CELLULAR COMMUNICATIONS INC. sold 1,550,000 common shares at $21.75 each yesterday,
according to lead underwriter L.F. Rothschild & Co. (cited from WSJ 05/29/1987)
Periods that form part of an abbreviation but are taken to be end-of-sentence markers
or vice versa do not only introduce errors in the determination of sentence boundaries.
{{sentence_break}}
Segmentation errors propagate into further components which rely on accurate
sentence segmentation and subsequent analyses are most likely affected negatively.
{{sentence_break}}
Walker et al.
{{sentence_break}}
(2001), for example, stress the importance of correct sentence boundary
disambiguation for machine translation and Kiss and Strunk (2002b) show that errors
in sentence boundary detection lead to a higher error rate in part-of-speech tagging.
{{sentence_break}}
In this paper, we present an approach to sentence boundary detection that builds
on language-independent methods and determines sentence boundaries with high accuracy.
{{sentence_break}}
It does not make use of additional annotations, part-of-speech tagging, or precompiled
lists to support sentence boundary detection but extracts all necessary data
from the corpus to be segmented.
{{sentence_break}}
Also, it does not use orthographic information as primary
evidence and is thus suited to process single-case text.
{{sentence_break}}
It focuses on robustness
and flexibility in that it can be applied with good results to a variety of languages without
any further adjustments.
{{sentence_break}}
At the same time, the modular structure of the proposed
system makes it possible in principle to integrate language-specific methods and clues
to further improve its accuracy.
{{sentence_break}}
The basic algorithm has been determined experimentally
on the basis of an unannotated development corpus of English.
{{sentence_break}}
We have applied
the resulting system to further corpora of English text as well as to corpora from ten
other languages: Brazilian Portuguese, Dutch, Estonian, French, German, Italian, Norwegian,
Spanish, Swedish, and Turkish.
{{sentence_break}}
Without further additions or amendments to
the system produced through experimentation on the development corpus, the mean
accuracy of sentence boundary detection on newspaper corpora in eleven languages is
98.74 %.
{{sentence_break}}
Sentence boundary detection has received some attention in the past decade, as can
be witnessed by approaches such as Grefenstette and Tapanainen (1994), Palmer and
Hearst (1997), Reynar and Ratnaparkhi (1997), and Mikheev (2002), amongst others.
{{sentence_break}}
It
is a common property of these approaches that they have been applied to a small range
of languages only, usually covering one to three languages.
{{sentence_break}}
Although these methods
are very successful when applied to these individual languages, one of which is usually
English, concentrating on a small set of languages is problematic in two respects.
{{sentence_break}}
First, it remains unclear how well the suggested methods operate if they are tested on a
wider sample of languages.
{{sentence_break}}
Second, the complexity of the task itself may be underestimated
if only familiar languages are considered.
{{sentence_break}}
Mikheev (2002, page 314), for example,
proposes that an ideal upper bound error rate for sentence boundary detection could be
0.02 % because he estimates disagreement of human annotators on this task at about 1 in
5000 cases.
{{sentence_break}}
But Mikheev’s assessment seems to presume that language-specific systems
for sentence boundary detection should be compared to the proficiency of individual
speakers who are fluent in the languages under evaluation.
{{sentence_break}}
Surely, a Russian speaker
with no knowledge of German would reach an accuracy of 99.98 % only by chance, if
at all, on the task of segmenting a German text into sentences.
{{sentence_break}}
If the present system is
compared to the performance of human speakers, it should be compared to an idealized
speaker who is able to detect sentence boundaries accurately even for languages that
he/she has no knowledge of.
{{sentence_break}}
Under these circumstances, the estimated lower bound
for sentence boundaries detection will presumably be higher than Mikheev’s estimate.
{{sentence_break}}
2
Kiss and Strunk Unsupervised Sentence Boundary Detection
We approach sentence boundary detection by first determining possible abbreviations
in the text.
{{sentence_break}}
Quantitatively, abbreviations are a major source of ambiguities in
sentence boundary detection since they often constitute up to 30 % of the possible candidates
for sentence boundaries in running text; see section 6.1.
{{sentence_break}}
While the concept ‘sentence
boundary’ remains elusive in that typical, cross-linguistically valid, as well as robust
properties of a sentence boundary cannot easily be defined, the same does not hold
for abbreviations.
{{sentence_break}}
The end of a sentence cannot easily be characterized as either appearing
after a particular word, between two particular words, after a particular word class,
or in between two particular word classes.
{{sentence_break}}
But, as we will show, an abbreviation can be
cross-linguistically characterized in such a way.
{{sentence_break}}
This is so because the end-of-sentence
marker cannot be linked to an intrinsic property of the sentence while a period marking
an abbreviation can be related to the abbreviated word itself.
{{sentence_break}}
It is our basic assumption that abbreviations are collocations of the truncated word
and the following period, and hence, that methods for the detection of collocations can
be successfully applied to abbreviation detection.
{{sentence_break}}
Firth (1957, page 181) characterizes
the collocations of a word as “statements of the habitual or customary places of that
word.” In languages that mark abbreviations with a following period, one could say
that the abbreviation is habitually made up of a truncated word (or sequence of words)
and a following period.
{{sentence_break}}
But this might even be too weak a formulation.
{{sentence_break}}
While typical
elements of a collocation can also appear together with other words, the abbreviation is
strongly tied to the following period.
{{sentence_break}}
Ideally, in the absence of homography and typing
errors, an abbreviation should always end in a final period.
{{sentence_break}}
Hence, we characterize an
abbreviation as a very strict collocation and use standard techniques for the detection of
collocations.
{{sentence_break}}
These techniques will be modified appropriately to account for the stricter
tie between an abbreviated word and the following period.
{{sentence_break}}
It should be clear from the
outset that abbreviations cannot simply be handled by listing them because they form
a productive and hence open word class; see also Muller, Amerl, and Natalis (1980, ¨
pages 52f.)
{{sentence_break}}
and Mikheev (2002, page 291).
{{sentence_break}}
We corroborate this fact with an experiment
in section 6.4.4.
{{sentence_break}}
We offer a formal characterization of abbreviations in terms of three major properties,
which only rely on the candidate word type itself and not on the local context in
which an instance of the candidate type appears.
{{sentence_break}}
First, as was already mentioned, an abbreviation
looks like a very tight collocation in that the abbreviated word preceding the
period and the period itself form a close bond.
{{sentence_break}}
Second, abbreviations have the tendency
to be rather short.
{{sentence_break}}
This does not mean that we have to assume a fixed upper bound for
the length of a possible abbreviation, but that the likelihood of being an abbreviation
declines if candidates become longer.
{{sentence_break}}
Using the length of a candidate as a counterbalance
to the collocational bond between candidate and final period allows our method
to identify quite long abbreviations, as long as the collocational bond between the candidate
type and the period is very strong.
{{sentence_break}}
As a third characteristic property, we have
identified the occurrence of word-internal periods contained in many abbreviations.
{{sentence_break}}
While we have determined the aforementioned properties experimentally, we believe
that they indeed represent crucial traits of abbreviations.
{{sentence_break}}
Using just these three characteristics, our system is able to detect abbreviations with
a very high mean accuracy of 99.38 % on newspaper corpora in eleven languages.
{{sentence_break}}
The
effectiveness of the three properties is further corroborated by an experiment we have
carried out with a log-linear classifier; compare section 6.4.6.
{{sentence_break}}
The reported figure does
not include initials and ordinal numbers because these subclasses of abbreviations cannot
be discovered using these characteristics and have to be treated differently.
{{sentence_break}}
The complete
system with special heuristics for initials and ordinal numbers achieves an accuracy
of 99.20 % for the detection of abbreviations, initials, and ordinal numbers.
{{sentence_break}}
The determination of abbreviation types already yields a large percentage of all sentence
boundaries because all periods occurring after non-abbreviation types can be classified
as end-of-sentence markers.
{{sentence_break}}
Such a disambiguation on the type level, however, is
insufficient by itself because it still has to be determined for every period following an
abbreviation whether it serves as a sentence boundary marker at the same time.
{{sentence_break}}
This
observation suggests a treatment of sentence boundary detection which is both type
and token-based.
{{sentence_break}}
We define a classifier as type-based if it uses global evidence, for example,
the distribution of a type in a corpus, to classify a type as a whole.
{{sentence_break}}
In contrast,
a token-based classifier determines a class for each individual token based on its local
context.
{{sentence_break}}
The detection of initials and of ordinal numbers, which are represented by
digits followed by a period in several languages, also requires the application of tokenbased
methods because these subclasses of abbreviations are problematic for type-based
methods.
{{sentence_break}}
The detection of sentence boundaries and abbreviations thus lends itself to a twostage
approach combining type-based and token-based classifiers.
{{sentence_break}}
In the first stage, a
resolution is performed on the type level to detect abbreviation types and ordinary
word types.
{{sentence_break}}
After this stage, the corpus receives an intermediate annotation where all
instances of abbreviations detected by the first stage are marked as such with the tag
<A> and all ellipses with the tag <E>.
{{sentence_break}}
All periods following non-abbreviations are assumed
to be sentence boundary markers and receive the annotation <S>.
{{sentence_break}}
The second,
token-based stage employs additional heuristics on the basis of the intermediate annotation
to refine and correct the output of the first classifier for each individual token.
{{sentence_break}}
The token-based classifier is particularly suited to determine abbreviations and ellipses
at the end of sentence giving them the final annotation <A><S> or <E><S>.
{{sentence_break}}
But it is
4
Kiss and Strunk Unsupervised Sentence Boundary Detection
also used to correct the intermediate annotation by detecting initials and ordinal numbers
which cannot easily be recognized with type-based methods and thus often receive
the wrong annotation from the first stage.
{{sentence_break}}
The overall architecture of the present system,
which we have baptized Punkt (German for period), is given in Figure 1.
{{sentence_break}}
The present article is structured as follows: Likelihood ratios can be considered the
heart of the present proposal.
{{sentence_break}}
Both the type-based and the token-based classifiers make
use of likelihood ratios to determine collocational bonds between a possible abbreviation
and its final period, between the sentence boundary period and a word following it,
and between words which surround a period.
{{sentence_break}}
Section 2 introduces the concept of a likelihood
ratio and discusses the specific properties of the likelihood ratios employed by
Punkt.
{{sentence_break}}
Section 3 describes the type-based classification stage, while section 4 introduces
the token-based re-classification methods.
{{sentence_break}}
Section 5 gives an account of how Punkt was
developed and how we determined some necessary parameters.
{{sentence_break}}
The experiments carried
out with the present system are discussed in section 6.
{{sentence_break}}
In section 7, we compare
Punkt to other sentence boundary detection systems proposed in the literature.
{{sentence_break}}
2 Likelihood Ratios
Punkt employs likelihood ratios to determine collocational ties in the type-based as well
as in the token-based stage.
{{sentence_break}}
The usefulness of likelihood ratios for collocation detection
has been made explicit in Dunning (1993) and has been confirmed by an evaluation of
various collocation detection methods carried out in Evert and Krenn (2001).
{{sentence_break}}
Kiss and
Strunk (2002a) and (2002b) characterize abbreviations as collocations and use Dunning’s
log-likelihood ratio (log λ) to detect them on the type level.
{{sentence_break}}
The present proposal differs
from Kiss and Strunk’s earlier suggestion in employing a highly modified log-likelihood
ratio for abbreviation detection in the type-based stage.
{{sentence_break}}
The reasons for this divergence
will be discussed in section 2.1.
{{sentence_break}}
In the token-based stage, we employ Dunning’s original
log λ, but add an additional constraint to make it one-sided.
{{sentence_break}}
This version of log λ will
be described in section 2.2.

{{sentence_break}}
golang-gopkg-neurosnap-sentences.v1-dev 1.0.6-1 / usr / share / gocode / src / gopkg.in / neurosnap / sentences.v1 / test_files / english / kiss_s.txt