This file is indexed.

/usr/share/bibledit-gtk/site/gtk/reference/menu/menu-preferences/filters/tec.html is in bibledit-gtk-data 4.9-1.

This file is owned by root:root, with mode 0o644.

The actual contents of the file can be viewed below.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <link href="../../../../../bibledit.css" rel="stylesheet" type="text/css" /><!-- 

Copyright (©) 2003-2011 Teus Benschop and Contributors to the Wiki.

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3
or any later version published by the Free Software Foundation;
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
Texts.  A copy of the license is included in the section entitled "GNU
Free Documentation License" in the file FDL.

-->
    <title></title>
  </head>
  <body>
    <div id="menu">
      <ul>
        <li>
          <a href="../../../../../home.html">1 Bibledit</a>
        </li>
        <li>
          <a href="../filters.html">Filters</a>
        </li>
        <li style="list-style: none; display: inline">
          <hr />
        </li>
        <li>TECkit
        </li>
        <li>
          <a href="regex.html">regex</a>
        </li>
        <li>
          <a href="sed.html">sed</a>
        </li>
      </ul>
    </div>
    <div id="content">
      <h1>
        TECkit
      </h1>
      <h2>
        <a name="TOC-TECkit-Language-Reference" href="" id="TOC-TECkit-Language-Reference"></a>TECkit Language Reference
      </h2><br />
      <p>
        This reference is not a full reference, but a partial one. It describes only that part of the TECkit language that is relevant for how Bibledit uses it. It was assembled while looking at and re-using the information provided with the TECkit package. This information is used with permission of the author of TECkit, and of SIL.
      </p>
      <h3>
        <a name="introduction" href="" id="introduction"></a>Introduction
      </h3>
      <p>
        The TECkit language is built around simple mapping rules where a Unicode character on the left-hand side of the rule is mapped to or from a Unicode character on the right-hand side. From this basic structure, mapping rules can be extended by the use of character sequences rather than single characters on either side; by the addition of contextual constraints (environments) determining when a rule should apply; and by the use of character classes, optional and repeatable elements, grouping and alternation to express more complex patterns to be matched and processed.
      </p>
      <p>
        The TECkit package, as used in Bibledit, is applied to text processing operations entirely dealing with Unicode data.
      </p>
      <h3>
        <a name="filestructureandconventions" href="" id="filestructureandconventions"></a>File structure and conventions
      </h3>
      <p>
        A TECkit description file is strictly line-oriented; every statement is confined to a single logical line. To allow long rules to be broken across several lines, for easier editing, TECkit interprets a final backslash (\) as a “continuation character”; however, only quite complex mappings are likely to need rules that cannot readily be expressed in a single source line.
      </p>
      <p>
        The semicolon (;) introduces a comment that continues to the end of the (physical) line. TECkit ignores everything following a semicolon, unless it is in a quoted string.
      </p>
      <p>
        Built-in keywords in the TECkit mapping language are not case-sensitive; the compiler will accept any mixture of upper and lower case. This also applies to Unicode character names. More about this later. However, the names of character classes defined in the file itself are case-sensitive, and must be used in a consistent form. More about the character classes later too.
      </p>
      <p>
        Where “strings” are called for, these may be either single- or double-quoted. There is no mechanism to “escape” quote marks embedded in the string; therefore, a single-quoted string can contain double-quote characters, and vice versa, but it is not possible to include both single and double quotes in the same quoted string.
      </p>
      <p>
        Unicode character codes are expressed either numerically or using Unicode character names, converted into unique “identifiers” by replacing spaces and hyphens with underscores. TECkit knows a great lot Unicode character names. The preferred form is to write “U+xxxx”, where xxxx represents four to six hexadecimal digits. Normal decimal or hex numbers are also permitted.
      </p>
      <p>
        Characters may also be expressed as quoted literals. If the mapping source is Unicode text, then they may be used only for Unicode character values. It is never legal to use quoted literals on both sides of the mapping.
      </p>
      <p>
        A complete TECkit mapping description consists of a header section followed by one or more mapping passes. The simplest Unicode mapping descriptions will contain just one Unicode pass, but for some complex mappings it may be necessary to perform pre- and/or post-processing such as character reordering in other passes. The LHS code space of each pass must correspond to the RHS code space of the pass before it.
      </p>
      <h3>
        <a name="headerinformation" href="" id="headerinformation"></a>Header information
      </h3>
      <p>
        The mapping file begins with header information, which consists of a number of pieces of information about the encoding and mapping, each specified by a keyword followed by a quoted string:
      </p>
      <p>
        <span style="font-weight:bold">EncodingName</span>
      </p>
      <p>
        A name that uniquely identifies this mapping table.
      </p>
      <p>
        <span style="font-weight:bold">DescriptiveName</span>
      </p>
      <p>
        A string that describes the mapping.
      </p>
      <p>
        <span style="font-weight:bold">Version</span>
      </p>
      <p>
        The version of the mapping description.
      </p>
      <p>
        <span style="font-weight:bold">Contact</span>
      </p>
      <p>
        Contact information.
      </p>
      <p>
        <span style="font-weight:bold">RegistrationAuthority</span>
      </p>
      <p>
        The organization responsible for the encoding.
      </p>
      <p>
        <span style="font-weight:bold">RegistrationName</span>
      </p>
      <p>
        The name and version of the mapping.
      </p>
      <p>
        <span style="font-weight:bold">Copyright</span>
      </p>
      <p>
        Copyright information.
      </p>
      <p>
        Only the encoding name is required.
      </p>
      <p>
        An alternative form of header should be used for mapping descriptions that do transliterations entirely within Unicode. Instead of EncodingName and DescriptiveName, the following four fields are used:
      </p>
      <p>
        <span style="font-weight:bold">LHSName</span>
      </p>
      <p>
        Canonical name of the “source” encoding or left-hand side of the description.
      </p>
      <p>
        <span style="font-weight:bold">RHSName</span>
      </p>
      <p>
        Canonical name of the “target” encoding or right-hand side of the description.
      </p>
      <p>
        <span style="font-weight:bold">LHSDescription</span>
      </p>
      <p>
        Description for the left-hand side of the mapping.
      </p>
      <p>
        <span style="font-weight:bold">RHSDescription</span>
      </p>
      <p>
        Description for the right-hand side of the mapping.
      </p>
      <p>
        Note that while we sometimes think of the left-hand side of the description as “source” and the right-hand side as “target”, TECkit descriptions and mapping tables are bi-directional, and thus these roles can equally well be exchanged.
      </p>
      <p>
        Finally, the file header can include “flags” that specify certain features of the encoding for both the left- and right-hand sides of the mapping.
      </p>
      <p>
        <span style="font-weight:bold">LHSFlags ( list-of-flags )</span>
      </p>
      <p>
        Features of the LHS encoding.
      </p>
      <p>
        <span style="font-weight:bold">RHSFlags ( list-of-flags )</span>
      </p>
      <p>
        Features of the RHS encoding.
      </p>
      <p>
        For each side of the mapping, zero or more of the following flags can be specified:
      </p>
      <p>
        <span style="font-weight:bold">ExpectsNFC</span>
      </p>
      <p>
        Input on this side of the mapping should be in fully-composed form.
      </p>
      <p>
        <span style="font-weight:bold">ExpectsNFD</span>
      </p>
      <p>
        Input on this side of the mapping should be in fully-decomposed form.
      </p>
      <p>
        <span style="font-weight:bold">GeneratesNFC</span>
      </p>
      <p>
        Output on this side of the mapping is fully-composed.
      </p>
      <p>
        <span style="font-weight:bold">GeneratesNFD</span>
      </p>
      <p>
        Output on this side of the mapping is fully-decomposed.
      </p>
      <p>
        <span style="font-weight:bold">VisualOrder</span>
      </p>
      <p>
        This side of the mapping deals with visual (rather than logical) text order.
      </p>
      <p>
        The “<span style="font-weight:bold">expects</span>” flags can be used to specify that Unicode input to this side of the mapping should be normalized before it is presented to the actual mapping rules. By specifying a normalization form for the Unicode side of a mapping description, the author can write mapping rules assuming a particular canonical representation. The TECkit engine will take care of normalizing the input text so that it matches the expectation of the rules.
      </p>
      <p>
        The “<span style="font-weight:bold">generates</span>” flags allow the mapping author to declare which normalization form will be produced by the mapping rules. However, as it can be difficult to ensure the accuracy of this, TECkit does not “trust” this flag, but always explicitly normalizes the output if requested by the application using the mapping.
      </p>
      <p>
        A typical example of the header information might be:
      </p>
      <pre>
  EncodingName          "Bibledit-NdebeleDiglot-2008"
</pre>
      <pre>
  DescriptiveName       "Simple rules for transliteration"
</pre>
      <pre>
  Version               "1"
</pre>
      <pre>
  Contact               "mailto:author@domain.org"
</pre>
      <pre>
  RegistrationAuthority "Bibledit International Ltd."
</pre>
      <pre>
  RegistrationName      "Bibledit Ndebele Diglot"
</pre>
      <pre>
  Copyright             "(c)2008 The Author (released under GPL3"
</pre>
      <pre>
  LHSFlags              ()
</pre>
      <pre>
  RHSFlags              (ExpectsNFD GeneratesNFD)
</pre>
      <h3>
        <a name="mappingpasses" href="" id="mappingpasses"></a>Mapping passes
      </h3>
      <p>
        The heart of a mapping description is the series of mapping passes that relate characters or sequences on the LHS to those on the RHS. In simple cases there is just one pass.
      </p>
      <p>
        Each pass begins with a header line that declares the encoding space in which it operates:
      </p>
      <pre>
  pass( pass-type )
</pre>
      <p>
        where pass-type is one of:
      </p>
      <pre>
  Byte
</pre>
      <pre>
  Unicode
</pre>
      <pre>
  Byte_Unicode
</pre>
      <pre>
  Unicode_Byte
</pre>
      <p>
        As Bibledit only works with UTF-8 encoded data, the pass-type should always be Unicode.
      </p>
      <p>
        There are also special “normalization pass” types that can be used in special cases. To create a normalization pass, specify pass-type as one of:
      </p>
      <pre>
  NFC_fwd
</pre>
      <pre>
  NFD_fwd
</pre>
      <pre>
  NFC_rev
</pre>
      <pre>
  NFD_rev
</pre>
      <pre>
  NFC
</pre>
      <pre>
  NFD
</pre>
      <p>
        As the names suggest, these apply the NFC or NFD Unicode normalization forms as part of the forward, reverse, or both processing “pipelines”. Most mappings will not need to include explicit normalization passes, as the ExpectsNFC or ExpectsNFD flag can be used to request pre-normalization of Unicode data before any mapping rules are applied, and applications using TECkit can explicitly request either NFC or NFD data when mapping to Unicode. The only reason to use a normalization pass in a mapping description would be to ensure that data is in a particular normalization form somewhere in the middle of a multi-pass Unicode transduction.
      </p>
      <h3>
        <a name="classdefinitions" href="" id="classdefinitions"></a>Class definitions
      </h3>
      <p>
        Character classes may be used to make the mapping description more readable and concise; suitable class definitions allow a single rule to express a whole set of related mappings. They are typically used in contextual constraints or as elements of rules that reorder character sequences.
      </p>
      <p>
        Classes are defined with the UniClass statement:
      </p>
      <pre>
  UniClass [ name ] = ( unicodeSequence )
</pre>
      <p>
        Class names, always enclosed in square brackets, are “identifiers” that may contain letters, digits, and the underscore character; they may not begin with a digit. Unlike the keywords of the TECkit language, they are case-sensitive. The Unicode sequence is a space-separated list of character codes, similar to those used in mapping rules (see below), with the addition of a “range” notation: two character codes separated by .. represent the complete set of characters from the first to the second (inclusive).
      </p>
      <p>
        Some examples:
      </p>
      <pre>
  UniClass [control] = ( U+0000..U+001f U+007f )
</pre>
      <pre>
  UniClass [letter] = ( U+0041..U+005a U+0061..U+007a )
</pre>
      <h3>
        <a name="defaultsforunmappedcharacters" href="" id="defaultsforunmappedcharacters"></a>Defaults for unmapped characters
      </h3>
      <p>
        In Unicode passes, any characters not explicitly matched by mapping rules will be output unchanged.
      </p>
      <h3>
        <a name="mappingrules" href="" id="mappingrules"></a>Mapping rules
      </h3>
      <p>
        The actual mapping from Unicode to Unicode is expressed as a list of mapping rules. A mapping description actually contains two complete sets of mapping rules, one set that matches characters in the first Unicode text and generates Unicode, and the other set that match Unicode characters in the second text and generate Unicode for the first text. However, in most cases it is simplest to express both mappings at once, using bi-directional rules where either side of the rule can act as “match” with the other being “replacement”.
      </p>
      <p>
        The general form of a mapping rule is:
      </p>
      <pre>
  lhsSeq [ / lhsContext ] operator rhsSeq [ / rhsContext ]
</pre>
      <p>
        Here, <span style="font-style:italic">operator</span> indicates whether this rule is to be used only when mapping from the left-hand side to the right, from the right-hand side to the left, or (the most common case) in both directions:
      </p>
      <pre>
  &lt;&gt;   bidirectional mapping rule
</pre>
      <pre>
  &gt;    unidirectional LHS-to-RHS rule
</pre>
      <pre>
  &lt;    unidirectional RHS-to-LHS rule
</pre>
      <p>
        The <span style="font-style:italic">lhsSeq</span> and <span style="font-style:italic">rhsSeq</span> parts of the rule are simple lists of character codes. These may be expressed as decimal numbers or as hexadecimal (prefixed with 0x). In Unicode sequences, characters may also be listed by their Unicode character names as found in <a href="http://www.unicode.org/Public/UNIDATA/UnicodeData.txt" rel="nofollow">http://www.unicode.org/Public/UNIDATA/UnicodeData.txt</a>, with all non-alphanumeric characters in the names converted to underscores; thus, for example, <span style="font-style:italic">thai_character_ko_kai</span> may be used instead of 0x0E01 to make the mapping description file more self-documenting. The Unicode character names are not case-sensitive.
      </p>
      <p>
        During the mapping operation, whichever of <span style="font-style:italic">lhsSeq</span> or <span style="font-style:italic">rhsSeq</span> corresponds to the input side of the rule can be considered a “match string”, with the other being its “replacement”. The context associated with the match string, if any, acts as a constraint on the application of the rule. Any context associated with the replacement is irrelevant; it would be used when mapping in the other direction.
      </p>
      <p>
        Character class references may be used in the match and replacement sequences, although for clarity it may be better to list each individual character mapping. If a class is used on the replacement side of a rule, it must correspond to a class on the match side, and the resulting rules will map each character in the match class to the equivalent character in the replacement class. The classes must contain the same number of characters. Item tags, see below, may be used to associate the replacement class item with its corresponding match item; in the absence of such tags, items are matched by position within the match and replacement strings.
      </p>
      <p>
        For contextually constrained mappings, the <span style="font-style:italic">lhsContext</span> and <span style="font-style:italic">rhsContext</span> parts of the mapping rule are used. These use a “slash … underscore” notation:
      </p>
      <pre>
  / preContextSeq _ postContextSeq
</pre>
      <p>
        The match and replace strings and the pre- and post-contexts may be simple sequences of character codes, or may be more complex expressions using the following “regular expression” elements:
      </p>
      <pre>
  [cls]    match any character from the class cls
</pre>
      <pre>
  .        match any single character
</pre>
      <pre>
  #        match beginning or end of input text
</pre>
      <pre>
  ^item    ‘not item’: match anything except the given item
</pre>
      <p>
        The 'not item' applies to single items only; negated groups are not supported.
      </p>
      <pre>
  (...)    grouping (for optionality or repeat counts)
</pre>
      <pre>
  |        alternation (within group): match either preceding or following sequence
</pre>
      <pre>
  {a,b}    match preceding item minimum a times, maximum b (0 ≤ a ≤ b ≤ 15)
</pre>
      <pre>
  ?        match preceding item 0 or 1 times
</pre>
      <pre>
  *        match preceding item 0 to 15 times
</pre>
      <pre>
  +        match preceding item 1 to 15 times
</pre>
      <pre>
  =tag     tag preceding item for match/replacement association
</pre>
      <pre>
  @tag     duplicate the tagged item (including groups) from LHS
</pre>
      <p>
        The @tag can only occur on RHS. It is typically used to implement reordering.
      </p>
      <p>
        A couple of notes on the use of regular expressions and context constraints:
      </p>
      <ul>
        <li>Repeat counts or optionality may be applied to parenthesized groups as well as to individual items.
        </li>
        <li>It is meaningless to specify context on the replacement side of a unidirectional rule; contextual constraints apply to the matching process on the input side of the conversion.
        </li>
        <li>The special ‘#’ code is only meaningful as the first item in the pre-context or the last item in the post-context; in effect, there is an “end of text” pseudo-character before the first real character of input, and one after the last, which can only match this code.
        </li>
        <li>A negated item is still a “concrete” item that matches a real character in the input or the “end of text” pseudo-character.
        </li>
        <li>No repeatable item can ever match more than 15 times; unlike standard regular expressions, the <span style="font-style:italic">star</span> and <span style="font-style:italic">plus</span> operators have a fixed upper bound. In principle, a repeatable element within a repeatable group will permit a higher total number of repetitions.
        </li>
      </ul>
      <p>
        Rules are tested from the most to the least specific, where a longer rule (counting the length of context as well as the actual match string) is considered more specific than a shorter one. If there are two equally long rules that could match at a particular place in the input, the first one listed in the mapping description file will be used.
      </p>
      <p>
        The maximum potential length of any pre-context (considering all repeat counts) in a pass, plus the maximum potential match string, plus the maximum potential post-context, must not exceed 255 characters. Similarly, the maximum output that can be generated from any rule is limited to 255 characters.
      </p>
      <h3>
        <a name="macros" href="" id="macros"></a>Macros
      </h3>
      <p>
        TECkit supports a simple macro facility; this may be used to define symbols that act as “shorthand” for frequently-used fragments of a mapping description, such as character classes that are needed in multiple passes, or sequences used in the context of multiple rules.
      </p>
      <p>
        A macro is defined with a line of the form:
      </p>
      <pre>
  Define name &lt;arbitrary TECkit source&gt;
</pre>
      <p>
        Following such a line, wherever such names are found in the description, these are treated as representing the specified source texts.
      </p>
      <p>
        This is particularly useful when the context is complex, perhaps involving several alternatives or multiple repeatable items; suitably descriptive macro names may also serve to make the mapping description more self-documenting.
      </p>
      <p>
        Another use for macros is to provide more convenient names for Unicode characters. This can help make mapping descriptions more readable.
      </p>
      <p>
        Note that macros must be defined before they are used, including any use in the definition of other macros; thus, it is legitimate to say:
      </p>
      <pre>
Define NUL   0x00
</pre>
      <pre>
Define DEL   0x7F
</pre>
      <pre>
Define ASCII NUL..DEL
</pre>
      <p>
        But with the definitions rearranged so that NUL and DEL are not defined when they are used in the definition of ASCII, even if they are defined subsequently, the result will be a compile-time error:
      </p>
      <pre>
Define ASCII NUL..DEL
</pre>
      <pre>
Define NUL   0x00
</pre>
      <pre>
Define DEL   0x7F
</pre>
      <pre>
ByteClass[asc] = (ASCII)
</pre>
      <p>
        This will generate an error on the ByteClass line, because the identifiers NUL and DEL found in the expansion of ASCII will be considered undefined.
      </p>
      <h3>
        <a name="unicodeonlymappings" href="" id="unicodeonlymappings"></a>Unicode-only mappings
      </h3>
      <p>
        The TECkit system, while targeted primarily at byte/Unicode conversion, is used by Bibledit for Unicode mapping operations. A mapping description need not contain a Byte_Unicode pass at all. If it contains only Unicode passes, both input and output are Unicode data.
      </p>
      <h3>
        <a name="example" href="" id="example"></a>Example
      </h3>
      <p>
        A simple example of a TECkit mapping may help you to start writing your own quickly.
      </p>
      <pre>
EncodingName "Example"<br />pass (Unicode)<br />; This simple example transliterates the Latin characters a and b<br />; to the Greek α and β.<br />U+0061  &gt;  U+03B1<br />U+0062  &gt;  U+03B2
</pre>
      <p>
        End of example.
      </p><br />
    </div>
  </body>
</html>