/usr/share/doc/libsegment-java/readme.html is in libsegment-java 1.3.5~svn57+dfsg-1.1.

This file is owned by root:root, with mode 0o644.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"><html><head><title>Readme</title><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>


<h1>Segment</h1>

<h3>Version @project.fullversion@, Date @build.date@</h3>
<hr>

<h2>Table of Contents</h2>
<ul>
	<li><a href="#introduction">Introduction</a></li>
	<li><a href="#requirements">Requirements</a></li>
	<li><a href="#running">Running</a></li>
	<li><a href="#transformation">Transformation</a></li>
	<li><a href="#performance">Performance</a></li>
	<li><a href="#testing">Testing</a></li>
	<li><a href="#formats">Data formats</a>
		<ul>
			<li><a href="#input">Input</a></li>
			<li><a href="#output">Output</a></li>
			<li><a href="#srxformat">SRX file</a></li>
		</ul>
	</li>
	<li><a href="#algorithm">Algorithm</a></li>
	<li><a href="#legacyalgorithms">Legacy algorithms</a>
		<ul>
			<li><a href="#accuratealgorithm">Accurate algorithm</a></li>
			<li><a href="#fastalgorithm">Fast algorithm</a></li>
		</ul>
	</li>
	<li><a href="#resources">Resources</a></li>
</ul>

<hr>

<h2><a name="introduction">Introduction</a></h2>
<p>
Segment is a program that splits text into segments, for example sentences.
Splitting rules are read from an SRX file, which is the standard format for this task
(see <a href="#resources">Resources</a>). 
</p>

<h2><a name="requirements">Requirements</a></h2>
<p>
To run the project, Java Runtime Environment (JRE) 1.5 is required.
To build the project from source, Java Software Development Kit (JDK) 1.5 and the Ant tool are required. 
The program should run on any operating system supported by Java. 
The helper startup scripts were written for Unix and Windows.
</p>

<h2><a name="running">Running</a></h2>
<p>
To run the program, use the bin/segment script. 
For example, on Linux, from the main project directory, execute:<br/>
<code>bin/segment</code><br/>
On Windows, from the main project directory, execute:<br/>
<code>bin\segment</code><br/>
If the script does not work on your operating system, the program can be run 
directly using Java; look inside the bin/segment script for clues on how to do it.
</p>
<p>
Source text is read from standard input and the resulting segments are written
to standard output, one per line. 
Without parameters, the text is split using simple, built-in rules.
To get help on command line parameters, run:<br/>
<code>bin/segment -h</code><br/>
The most common command line is probably:<br/>
<code>bin/segment -s rules.srx -l language -i in.txt -o out.txt</code><br/>
where rules.srx is a file containing splitting rules, language is the input file
language code, in.txt is the input file and out.txt is the output file.
To control the output format, use the -b and -e parameters, which define the
strings written before and after each segment (the -e string replaces the 
standard end of line character).
</p>

<h2><a name="performance">Performance</a></h2>
<p>
To evaluate performance, use the bin/segment -p option. It can measure
segmentation time on any data, and it can also generate test data.
To generate random text, use the --generate-text option with the text length in kilobytes as a parameter. 
To generate a random SRX file, use the --generate-srx option with the rule count and rule length, separated by a comma, as a parameter.
To repeat the segmentation process, use the -2 option.
Another option that controls how the text is handled is -r, which instructs the application to preload the whole text into memory before segmentation (some algorithms require it). 
The size of the read buffer, and therefore memory usage, can be controlled with the --buffer-length option.  
The performance analysis displays the segmentation time. A common usage example:<br/>
<code>bin/segment -p -2 --generate-text 100 --generate-srx 10,10</code><br/>
</p>

<h2><a name="transformation">Transformation</a></h2>
<p>
To automatically convert a rule file between an old SRX version and the current SRX version, there is a transformation tool, 
invoked by the bin/segment -t command. By default it reads SRX from standard input and writes the
transformed SRX to standard output. Usage example:<br/>
<code>bin/segment -t -i old.srx -o new.srx</code><br/>
The tool accepts some command line parameters; use bin/segment -h for details.
Underneath it uses an XSLT stylesheet, which can be found in the resources directory and used 
separately with any XSLT processor.
</p>

<h2><a name="testing">Testing</a></h2>
<p>
The program has integrated unit tests. To run them, execute:<br/>
<code>bin/segment --test</code>
</p>

<h2><a name="formats">Data formats</a></h2>

<h3><a name="input">Input</a></h3>
<p>
Plain text, UTF-8 encoded.
</p>

<h3><a name="output">Output</a></h3>
<p>
Plain text, UTF-8 encoded. Some operating system consoles, the Windows 
command prompt for example, use a different encoding, so special characters 
will not be displayed correctly there. Output files can be opened in text editors,
because most of them handle UTF-8 encoded files correctly.
Each segment is prefixed with the string set with the -b option (empty by default) 
and suffixed with the string set with the -e option (new line character by default).
</p>
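<p>
As a rough illustration of the -b and -e behaviour described above, the following minimal Java sketch 
writes each segment between a begin string and an end string. The class and method names are 
hypothetical and are not taken from the Segment source code.
</p>

```java
import java.util.List;

/** Hypothetical helper illustrating the -b / -e output format; not part of Segment. */
final class OutputFormatSketch {

    /**
     * Joins segments, writing "begin" before and "end" after each one.
     * The defaults described above correspond to begin = "" and end = "\n".
     */
    static String format(List<String> segments, String begin, String end) {
        StringBuilder out = new StringBuilder();
        for (String segment : segments) {
            out.append(begin).append(segment).append(end);
        }
        return out.toString();
    }
}
```

<p>
With the default values this yields one segment per line; with a begin string of "[" and an end string of "]" 
all segments end up on a single line, each wrapped in brackets.
</p>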

<h3><a name="srxformat">SRX file</a></h3>
<p>
A valid SRX document as defined in the SRX specification 
(see <a href="#resources">Resources</a>). 
Both versions 1.0 and 2.0 are supported, although version 2.0 is preferred. 
Currently input is treated as plain text; formatting is not handled specially 
(contrary to the specification). Example SRX files can be found in the example/ directory.
</p>
<p>
The document contains a header and a body. 
</p>
<p>
The header is currently mostly ignored; only the "cascade" attribute is
read. It determines whether only the first matching language rule is 
applied (cascade="no"), or all language rules that match the language code 
are applied, in the same order as they occur in the SRX file (cascade="yes").
</p>
<p>
The body contains language rules and map rules. Language rules contain 
break (break="yes") and exception (break="no") rules. Each of those rules
can consist of two regular expression elements, &lt;beforebreak&gt; and 
&lt;afterbreak&gt;, which must match the text before and after the break position respectively 
for the rule to be applied.  
Map rules specify which language rules will be used to segment the text, 
according to the text language.
</p>
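<p>
The interaction between map rules and the "cascade" attribute can be sketched in Java as follows. 
This is only an illustration under stated assumptions: each map rule is modelled as a regular expression 
over the language code plus the name of a language rule, and all class and method names are hypothetical 
rather than taken from the Segment source code.
</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of language rule selection; not the Segment implementation. */
final class CascadeSketch {

    /** Assumed model of one map rule: language code pattern -> language rule name. */
    static final class LanguageMap {
        final Pattern languagePattern;
        final String languageRuleName;

        LanguageMap(String languagePattern, String languageRuleName) {
            this.languagePattern = Pattern.compile(languagePattern);
            this.languageRuleName = languageRuleName;
        }
    }

    /**
     * Returns the names of the language rules to apply, in document order.
     * With cascade == false only the first matching map rule is used;
     * with cascade == true every matching map rule contributes.
     */
    static List<String> selectLanguageRules(List<LanguageMap> maps,
                                            String languageCode,
                                            boolean cascade) {
        List<String> selected = new ArrayList<String>();
        for (LanguageMap map : maps) {
            if (map.languagePattern.matcher(languageCode).matches()) {
                selected.add(map.languageRuleName);
                if (!cascade) {
                    break; // cascade="no": stop at the first match
                }
            }
        }
        return selected;
    }
}
```

<p>
For example, with map rules "en.*" and ".*" (in that order) and the language code "en_US", 
cascade="no" selects only the first language rule, while cascade="yes" selects both.
</p>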

<h2><a name="algorithm">Algorithm</a></h2>
<p>
The algorithm works as follows:
	<ol>
		<li>
			A rule matcher list is created based on the SRX file and language. Each rule 
   			matcher is responsible for matching the before-break and after-break regular 
   			expressions of one break rule.
   		</li>
   		<li>
   			Each rule matcher is matched against the text. If its rule is not found, the 
   			rule matcher is removed from the list.
   		</li>
   		<li>
   			The rule matcher with the earliest break position in the text is selected.
   		</li>
   		<li>
			The list of exception rules corresponding to the break rule is retrieved.
		</li>
		<li>
			If none of the exception rules matches at the break position, 
   			the text is split there and a new segment is created. In addition, 
   			all rule matchers are moved so that they start after the end of the new segment 
		   (which is the same as the break position of the matched rule).
		</li> 
		<li>
			All rule matchers whose break position is not past the break position 
   			of the last matched rule are moved forward until they pass it.
   		</li>
		<li>
			If no segment was found, the whole process is repeated.
		</li>
	</ol>
</p>
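<p>
The steps above can be sketched in Java roughly as follows. This is a simplified, non-streaming 
illustration and not the actual Segment code: it assumes that each break rule carries its own list of 
exception rules, it builds on java.util.regex, and all class and method names are hypothetical.
</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the break/exception matching loop; not the Segment implementation. */
final class SegmentLoopSketch {

    /** A rule given by the regexes that must match just before and just after a break position. */
    static final class Rule {
        final Pattern finder; // finds candidates: before-break text followed by a lookahead
        final Pattern before; // before-break regex anchored to the end of the preceding text
        final Pattern after;  // after-break regex
        final List<Rule> exceptions = new ArrayList<Rule>(); // assumption: attached per break rule

        Rule(String beforeBreak, String afterBreak) {
            finder = Pattern.compile("(?:" + beforeBreak + ")(?=" + afterBreak + ")");
            before = Pattern.compile("(?:" + beforeBreak + ")\\z");
            after = Pattern.compile(afterBreak);
        }

        /** True if this rule matches around the given position. */
        boolean matchesAt(String text, int position) {
            return before.matcher(text.substring(0, position)).find()
                    && after.matcher(text.substring(position)).lookingAt();
        }
    }

    static List<String> segment(String text, List<Rule> breakRules) {
        // Steps 1-2: one matcher per break rule; rules that never match are dropped.
        List<Rule> rules = new ArrayList<Rule>();
        List<Matcher> matchers = new ArrayList<Matcher>();
        for (Rule rule : breakRules) {
            Matcher m = rule.finder.matcher(text);
            if (m.find()) {
                rules.add(rule);
                matchers.add(m);
            }
        }
        List<String> segments = new ArrayList<String>();
        int segmentStart = 0;
        while (!matchers.isEmpty()) {
            // Step 3: pick the matcher with the earliest break position.
            int best = 0;
            for (int i = 1; i < matchers.size(); i++) {
                if (matchers.get(i).end() < matchers.get(best).end()) {
                    best = i;
                }
            }
            int breakPos = matchers.get(best).end();
            // Steps 4-5: split unless an exception rule matches at the break position.
            boolean excepted = false;
            for (Rule exception : rules.get(best).exceptions) {
                if (exception.matchesAt(text, breakPos)) {
                    excepted = true;
                    break;
                }
            }
            if (!excepted && breakPos > segmentStart) {
                segments.add(text.substring(segmentStart, breakPos));
                segmentStart = breakPos;
            }
            // Step 6: advance every matcher until it passes the break position.
            for (int i = matchers.size() - 1; i >= 0; i--) {
                Matcher m = matchers.get(i);
                boolean found = true;
                while (found && m.end() <= breakPos) {
                    found = m.find();
                }
                if (!found) {
                    matchers.remove(i);
                    rules.remove(i);
                }
            }
        }
        if (segmentStart < text.length()) {
            segments.add(text.substring(segmentStart)); // remaining text is the last segment
        }
        return segments;
    }
}
```

<p>
For instance, with a break rule whose before-break pattern is <code>\.</code> and whose after-break pattern is 
<code>\s</code>, plus an exception rule <code>Mr\.</code> / <code>\s</code>, the sketch does not split after "Mr." but does split 
after other sentence-ending periods that are followed by whitespace.
</p>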
<p>
In the streaming version of this algorithm, a character buffer is searched. 
When the end of the buffer is reached, or the break position is in the margin 
(break position &gt; buffer size - margin), and there is more text, 
the buffer is moved forward in the text until it starts after the last found segment. 
When this happens, the rule matchers are reinitialized and the text is searched again.
The streaming version has the limitation that the read buffer must be at least as long 
as any segment in the text.
</p>
<p>
This algorithm uses lookbehind extensively, but Java does not permit
infinite (unbounded) regular expressions in lookbehind, so some patterns must be finitized. 
For example, the a* pattern will be changed to something like a{0,100}.
</p>
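<p>
A naive version of such finitization could look like the Java sketch below. It is only an illustration 
and not the routine used by Segment: the class and method names are hypothetical, and it handles just the 
plain * and + quantifiers outside character classes and escapes (possessive quantifiers, {n,} ranges, 
\Q...\E quoting and nested character classes are not handled).
</p>

```java
/** Hypothetical, simplified finitizer; not the Segment implementation. */
final class FinitizeSketch {

    /** Replaces unbounded * and + quantifiers with {0,max} and {1,max}. */
    static String finitize(String pattern, int max) {
        StringBuilder out = new StringBuilder();
        boolean inClass = false; // true while inside [...]
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\\' && i + 1 < pattern.length()) {
                // Copy an escaped character verbatim, e.g. \* or \+
                out.append(c).append(pattern.charAt(++i));
            } else if (inClass) {
                if (c == ']') {
                    inClass = false;
                }
                out.append(c); // * and + are literals inside a character class
            } else if (c == '[') {
                inClass = true;
                out.append(c);
            } else if (c == '*') {
                out.append("{0,").append(max).append('}');
            } else if (c == '+') {
                out.append("{1,").append(max).append('}');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

<p>
The finitized pattern can then be placed inside a lookbehind, for example 
<code>"(?&lt;=" + FinitizeSketch.finitize("a*b", 100) + ")c"</code>. The price is that repetitions longer than 
the chosen limit are no longer matched.
</p>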

<h2><a name="legacyalgorithms">Legacy algorithms</a></h2>

<h3><a name="accuratealgorithm">Accurate algorithm</a></h3>
<p>
This is the first algorithm implemented to perform the segmentation task. It is stable, but it does 
not work on text streams, and in a real-world scenario with few break rules
and many exception rules it is several times slower than the other algorithms.
</p>
<p>
At the beginning, the rule matcher list is created based on the SRX file and language. 
Each rule matcher is responsible for matching the before-break and after-break 
regular expressions of one rule (break or exception). 
Then each rule matcher is matched against the text. If its rule is not found, the 
rule matcher is removed from the list. Next, the first matching rule (in terms of 
break position) is selected. If it is a break rule, the text is split. 
Finally, all the rules that are behind the last matched rule are matched again until 
they pass it. The whole process is repeated until a matching break rule is found
or there are no more rules on the list.
</p>

<h3><a name="fastalgorithm">Fast algorithm</a></h3>
<p>
This algorithm creates a single large regular expression incorporating all 
break rules. This regular expression is then matched against the text. Every time
a match is found, all exception rules corresponding to the matched break rule 
are checked at that position. If no exception rule matches, the text is split.
</p>
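<p>
One way to picture the single large regular expression is the Java sketch below. It is an assumption-laden 
illustration, not the Segment code: every break rule becomes one alternative wrapped in a named group 
(r0, r1, ...) so that the matched rule can be identified, exception rules are assumed to be attached to 
their break rule, and all class and method names are hypothetical.
</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the combined-regex approach; not the Segment implementation. */
final class FastAlgorithmSketch {

    static final class Rule {
        final String before;
        final String after;
        final List<Rule> exceptions = new ArrayList<Rule>(); // assumption: attached per break rule

        Rule(String before, String after) {
            this.before = before;
            this.after = after;
        }

        /** True if this rule matches around the given position. */
        boolean matchesAt(String text, int position) {
            return Pattern.compile("(?:" + before + ")\\z").matcher(text.substring(0, position)).find()
                    && Pattern.compile(after).matcher(text.substring(position)).lookingAt();
        }
    }

    static List<String> segment(String text, List<Rule> breakRules) {
        List<String> segments = new ArrayList<String>();
        if (breakRules.isEmpty()) {
            segments.add(text);
            return segments;
        }
        // Build (?<r0>before0)(?=after0)|(?<r1>before1)(?=after1)|...
        StringBuilder big = new StringBuilder();
        for (int i = 0; i < breakRules.size(); i++) {
            Rule rule = breakRules.get(i);
            if (i > 0) {
                big.append('|');
            }
            big.append("(?<r").append(i).append('>').append(rule.before)
               .append(")(?=").append(rule.after).append(')');
        }
        Matcher m = Pattern.compile(big.toString()).matcher(text);
        int start = 0;
        while (m.find()) {
            int breakPos = m.end();
            // Identify which break rule produced this match via its named group.
            Rule matched = null;
            for (int i = 0; i < breakRules.size() && matched == null; i++) {
                if (m.group("r" + i) != null) {
                    matched = breakRules.get(i);
                }
            }
            // Check the exception rules of the matched break rule at this position.
            boolean excepted = false;
            for (Rule exception : matched.exceptions) {
                if (exception.matchesAt(text, breakPos)) {
                    excepted = true;
                    break;
                }
            }
            if (!excepted && breakPos > start) {
                segments.add(text.substring(start, breakPos));
                start = breakPos;
            }
        }
        if (start < text.length()) {
            segments.add(text.substring(start)); // remaining text is the last segment
        }
        return segments;
    }
}
```

<p>
Named groups are used here only so that capturing groups inside the individual rule patterns do not 
disturb the lookup; a rule pattern that itself defines a group named r0, r1 and so on would clash with this scheme.
</p>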
<p>
To create the streaming version of the algorithm, the ReaderCharacterSequence class 
was implemented. 
It implements the character sequence interface but reads the text from a stream 
into an internal buffer. It does not work perfectly: the buffer has limited size,
so, for example, not all subsequences can be read from it.
</p>
<p>
This algorithm uses lookbehind extensively, but Java does not permit
infinite (unbounded) regular expressions in lookbehind, so some patterns are finitized. 
For example, the a* pattern will be changed to something like a{0,100}.
</p>

<h2><a name="resources">Resources</a></h2>
<p>
<ul>
	<li>
		<a href="http://www.lisa.org">LISA (The Localization Industry Standards Association)</a>. 
		The full SRX specification can be found on this page.
	</li>
</ul>
</p>

<hr>

<p>
This project was written for the Poleng company, but it is now distributed as Free / Open Source Software. 
The results were used to write my Master's Thesis. Happy using :)
</p>
<p>&nbsp;&nbsp; -- Jarek Lipski</p>


</body></html>