This file is indexed.

/usr/share/gocode/src/gopkg.in/neurosnap/sentences.v1/test_files/english/punct.txt is in golang-gopkg-neurosnap-sentences.v1-dev 1.0.6-1.

This file is owned by root:root, with mode 0o644.

The actual contents of the file can be viewed below.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
Standard automatic speech recognizers output unstructured
streams of words. They neither perform
a proper segmentation of the output into sentences,
nor predict punctuation symbols. The unavailable
punctuations and sentence boundaries in transcribed
speech texts create barriers to many subsequent
processing tasks, such as summarization,
information extraction, question answering and
machine translation. Thus, the segmentation of
long texts is necessary in many real applications.
For example, in speech-to-speech translation,
continuously transcribed speech texts need to be
segmented before being fed into subsequent machine
translation systems (Takezawa et al., 1998;
Nakamura, 2009). This is because current machine
translation (MT) systems perform the translation
at the sentence level, where various models
used in MT are trained over segmented sentences
and many algorithms inside MT have an exponential
complexity with regard to the length of inputs.
The punctuation prediction problem has attracted
research interest in both the speech processing
community and the natural language processing
community. Most previous work primarily
exploits local features in their statistical models
such as lexicons, prosodic cues and hidden
event language model (HELM) (Liu et al., 2005;
Matusov et al., 2006; Huang and Zweig, 2002;
Stolcke and Shriberg, 1996). The word-level models
integrating local features have narrow views
about the input and could not achieve satisfied
performance due to the limited context information
access (Favre et al., 2008). Naturally,
global contexts are required to model the punctuation
prediction, especially for long-range dependencies.
For instance, in English question sentences,
the ending question mark is long-range dependent
on the initial phrases (Lu and Ng, 2010),
such as “could you” in Figure 1. There has been
some work trying to incorporate syntactic features
to broaden the view of hypotheses in the punctuation
prediction models (Roark et al., 2006; Favre
et al., 2008). In their methods, the punctuation
prediction is treated as a separated post-procedure
of parsing, which may suffer from the problem of
error propagation. In addition, these approaches
are not able to incrementally process inputs and
are not efficient for very long inputs, especially in
the cases of long transcribed speech texts from
presentations where the number of streaming
words could be larger than hundreds or thousands.
In this paper, we propose jointly performing
punctuation prediction and transition-based dependency
parsing over transcribed speech text.
When the transition-based parsing consumes the
stream of words left to right with the shift-reduce
decoding algorithm, punctuation symbols are predicted
for each word based on the contexts of the
parsing tree. Two models are proposed to cause
the punctuation prediction to interact with the
transition actions in parsing. One is to conduct
transition actions of parsing followed by punctuation
predictions in a cascaded way. The other is to
associate the conventional transition actions of
parsing with punctuation perditions, so that predicted
punctuations are directly inferred from the
parsing tree. Our models have linear complexity
and are capable of handling streams of words with
any length. In addition, the computation of models
use a rich set of syntactic features, which can improve
the complicated punctuation predictions
from a global view, especially for the long range
dependencies.
Figure 1 shows an example of how parsing
helps punctuation prediction over the transcribed
speech text. As illustrated in Figure 1(b), two
commas are predicted when their preceding words
act as the adverbial modifiers (advmod) during
parsing. The period after the word “menu” is predicted
when the parsing of an adverbial clause
modifier (advcl) is completed. The question mark
at the end of the input is determined when a direct
object modifier (dobj) is identified, together with
the long range clue that the auxiliary word occurs
before the nominal subject (nsubj). Eventually,
two segmentations are formed according to the
punctuation prediction results, shown in Figure
1(c).
The training data used for our models is adapted
from Treebank data by excluding all punctuations
but keeping the punctuation contexts, so that it can
simulate the unavailable annotated transcribed
speech texts. In decoding, beam search is used to
get optimal punctuation prediction results. We
conduct experiments on both IWSLT data and
TDT4 test data sets. The experimental results
show that our method can achieve higher performance
than the CRF-based baseline method.
The paper is structured as follows: Section 2
conducts a survey of related work. The transitionbased
dependency parsing is introduced in Section
3. We explain our approach to predicting punctuations
for transcribed speech texts in Section 4.
Section 5 gives the results of our experiment. The
conclusion and future work are given in Section 6.
Sentence boundary detection and punctuation prediction
have been extensively studied in the
speech processing field and have attracted research
interest in the natural language processing
field as well. Most previous work exploits local
features for the task. Kim and Woodland (2001),
Huang and Zweig (2002), Christensen et al.
(2001), and Liu et al. (2005) integrate both prosodic
features (pitch, pause duration, etc.) and lexical
features (words, n-grams, etc.) to predict
punctuation symbols during speech recognition,
where Huang and Zweig (2002) uses a maximum
entropy model, Christensen et al. (2001) focus on
finite state and multi-layer perceptron methods,
and Liu et al. (2005) uses conditional random
fields. However, in some scenarios the prosodic
cues are not available due to inaccessible original
raw speech waveforms. Matusov et al. (2006) integrate
segmentation features into the log-linear
model in the statistical machine translation (SMT)
framework to improve the translation performance
when translating transcribed speech texts.
Lu and Ng (2010) uses dynamic conditional random
fields to perform both sentence boundary and
sentence type prediction. They achieved promising
results on both English and Chinese transcribed
speech texts. The above work only exploits local features, so they were limited to capturing
long range dependencies for punctuation
prediction.
It is natural to incorporate global knowledge,
such as syntactic information, to improve punctuation
prediction performance. Roark et al. (2006)
use a rich set of non-local features including parser
scores to re-rank full segmentations. Favre et
al. (2008) integrate syntactic information from a
PCFG parser into a log-linear and combine it with
local features for sentence segmentation. The
punctuation prediction in these works is performed
as a post-procedure step of parsing, where
a parse tree needs to be built in advance. As their
parsing over the stream of words in transcribed
speech text is exponentially complex, their approaches
are only feasible for short input processing.
Unlike these works, we incorporate punctuation
prediction into the parsing which process
left to right input without length limitations.
Numerous dependency parsing algorithms
have been proposed in the natural language processing
community, including transition-based
and graph-based dependency parsing. Compared
to graph-based parsing, transition-based parsing
can offer linear time complexity and easily leverage
non-local features in the models (Yamada and
Matsumoto, 2003; Nivre et al., 2006b; Zhang and
Clark, 2008; Huang and Sagae, 2010). Starting
with the work from (Zhang and Nivre, 2011), in
this paper we extend transition-based dependency
parsing from the sentence-level to the stream of
words and integrate the parsing with punctuation
prediction.
Joint POS tagging and transition-based dependency
parsing are studied in (Hatori et al.,
2011; Bohnet and Nivre, 2012). The improvements
are reported with the joint model compared
to the pipeline model for Chinese and other richly
inflected languages, which shows that it also
makes sense to jointly perform punctuation prediction
and parsing, although these two tasks of
POS tagging and punctuation prediction are different
in two ways: 1). The former usually works
on a well-formed single sentence while the latter
needs to process multiple sentences that are very
lengthy. 2). POS tags are must-have features to
parsing while punctuations are not. The parsing
quality in the former is more sensitive to the performance
of the entire task than in the latter.