/usr/share/bibledit-gtk/site/gtk/reference/menu/menu-preferences/filters/tec.html is in bibledit-gtk-data 4.9-1.
This file is owned by root:root, with mode 0o644.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<link href="../../../../../bibledit.css" rel="stylesheet" type="text/css" /><!--
Copyright (©) 2003-2011 Teus Benschop and Contributors to the Wiki.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3
or any later version published by the Free Software Foundation;
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
Texts. A copy of the license is included in the section entitled "GNU
Free Documentation License" in the file FDL.
-->
<title></title>
</head>
<body>
<div id="menu">
<ul>
<li>
<a href="../../../../../home.html">1 Bibledit</a>
</li>
<li>
<a href="../filters.html">Filters</a>
</li>
<li style="list-style: none; display: inline">
<hr />
</li>
<li>TECkit
</li>
<li>
<a href="regex.html">regex</a>
</li>
<li>
<a href="sed.html">sed</a>
</li>
</ul>
</div>
<div id="content">
<h1>
TECkit
</h1>
<h2>
<a name="TOC-TECkit-Language-Reference" href="" id="TOC-TECkit-Language-Reference"></a>TECkit Language Reference
</h2><br />
<p>
This is a partial reference: it describes only the part of the TECkit language that is relevant to how Bibledit uses it. It was assembled from, and re-uses, the information provided with the TECkit package. This information is used with permission of the author of TECkit, and of SIL.
</p>
<h3>
<a name="introduction" href="" id="introduction"></a>Introduction
</h3>
<p>
The TECkit language is built around simple mapping rules where a Unicode character on the left-hand side of the rule is mapped to or from a Unicode character on the right-hand side. From this basic structure, mapping rules can be extended by the use of character sequences rather than single characters on either side; by the addition of contextual constraints (environments) determining when a rule should apply; and by the use of character classes, optional and repeatable elements, grouping and alternation to express more complex patterns to be matched and processed.
</p>
<p>
Bibledit uses the TECkit package only for text processing operations on Unicode data.
</p>
<h3>
<a name="filestructureandconventions" href="" id="filestructureandconventions"></a>File structure and conventions
</h3>
<p>
A TECkit description file is strictly line-oriented; every statement is confined to a single logical line. To allow long rules to be broken across several lines, for easier editing, TECkit interprets a final backslash (\) as a “continuation character”; however, only quite complex mappings are likely to need rules that cannot readily be expressed in a single source line.
</p>
<p>
The semicolon (;) introduces a comment that continues to the end of the (physical) line. TECkit ignores everything following a semicolon, unless it is in a quoted string.
</p>
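<p>
As a small sketch of both conventions (the rules themselves are only illustrative), the fragment below shows a comment on its own line, a comment after a rule, and a rule continued over two physical lines with a final backslash:
</p>
<pre>
; a comment on a line of its own
U+0061 &gt; U+03B1   ; a comment after a rule
U+0062 \
    &gt; U+03B2
</pre>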
<p>
Built-in keywords in the TECkit mapping language are not case-sensitive; the compiler will accept any mixture of upper and lower case. The same applies to Unicode character names, which are described under Mapping rules. However, the names of character classes defined in the file itself are case-sensitive and must be used in a consistent form; see Class definitions.
</p>
<p>
Where “strings” are called for, these may be either single- or double-quoted. There is no mechanism to “escape” quote marks embedded in the string; therefore, a single-quoted string can contain double-quote characters, and vice versa, but it is not possible to include both single and double quotes in the same quoted string.
</p>
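<p>
For instance, the two header lines below each contain one kind of quote mark inside a string delimited by the other kind. The string values are made up for illustration:
</p>
<pre>
DescriptiveName 'Rules for "straight" quotes'
Contact "The mapping's author"
</pre>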
<p>
Unicode character codes are expressed either numerically or using Unicode character names, converted into unique “identifiers” by replacing spaces and hyphens with underscores. TECkit knows a great many Unicode character names. The preferred numeric form is “U+xxxx”, where xxxx represents four to six hexadecimal digits. Ordinary decimal or hexadecimal numbers are also permitted.
</p>
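<p>
The following sketch writes four simple rules, each using a different notation for the character codes. The choice of characters is only illustrative, and the names in the last rule are assumed to be spelled as in the Unicode character database:
</p>
<pre>
U+0061 &gt; U+03B1                                   ; preferred U+xxxx form
0x0062 &gt; 0x03B2                                   ; hexadecimal numbers
100 &gt; 948                                         ; decimal numbers (U+0064 and U+03B4)
latin_small_letter_g &gt; greek_small_letter_gamma   ; Unicode character names
</pre>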
<p>
Characters may also be expressed as quoted literals. If the mapping source is Unicode text, then they may be used only for Unicode character values. It is never legal to use quoted literals on both sides of the mapping.
</p>
<p>
A complete TECkit mapping description consists of a header section followed by one or more mapping passes. The simplest Unicode mapping descriptions will contain just one Unicode pass, but for some complex mappings it may be necessary to perform pre- and/or post-processing such as character reordering in other passes. The LHS code space of each pass must correspond to the RHS code space of the pass before it.
</p>
<h3>
<a name="headerinformation" href="" id="headerinformation"></a>Header information
</h3>
<p>
The mapping file begins with header information, which consists of a number of pieces of information about the encoding and mapping, each specified by a keyword followed by a quoted string:
</p>
<p>
<span style="font-weight:bold">EncodingName</span>
</p>
<p>
A name that uniquely identifies this mapping table.
</p>
<p>
<span style="font-weight:bold">DescriptiveName</span>
</p>
<p>
A string that describes the mapping.
</p>
<p>
<span style="font-weight:bold">Version</span>
</p>
<p>
The version of the mapping description.
</p>
<p>
<span style="font-weight:bold">Contact</span>
</p>
<p>
Contact information.
</p>
<p>
<span style="font-weight:bold">RegistrationAuthority</span>
</p>
<p>
The organization responsible for the encoding.
</p>
<p>
<span style="font-weight:bold">RegistrationName</span>
</p>
<p>
The name and version of the mapping.
</p>
<p>
<span style="font-weight:bold">Copyright</span>
</p>
<p>
Copyright information.
</p>
<p>
Only the encoding name is required.
</p>
<p>
An alternative form of header should be used for mapping descriptions that do transliterations entirely within Unicode. Instead of EncodingName and DescriptiveName, the following four fields are used:
</p>
<p>
<span style="font-weight:bold">LHSName</span>
</p>
<p>
Canonical name of the “source” encoding or left-hand side of the description.
</p>
<p>
<span style="font-weight:bold">RHSName</span>
</p>
<p>
Canonical name of the “target” encoding or right-hand side of the description.
</p>
<p>
<span style="font-weight:bold">LHSDescription</span>
</p>
<p>
Description for the left-hand side of the mapping.
</p>
<p>
<span style="font-weight:bold">RHSDescription</span>
</p>
<p>
Description for the right-hand side of the mapping.
</p>
<p>
Note that while we sometimes think of the left-hand side of the description as “source” and the right-hand side as “target”, TECkit descriptions and mapping tables are bi-directional, and thus these roles can equally well be exchanged.
</p>
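<p>
A header in this alternative form might look like the sketch below. The names and descriptions are hypothetical values chosen for illustration:
</p>
<pre>
LHSName "Example-Latin"
RHSName "Example-Greek"
LHSDescription "Text in Latin script"
RHSDescription "Text in Greek script"
</pre>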
<p>
Finally, the file header can include “flags” that specify certain features of the encoding for both the left- and right-hand sides of the mapping.
</p>
<p>
<span style="font-weight:bold">LHSFlags ( list-of-flags )</span>
</p>
<p>
Features of the LHS encoding.
</p>
<p>
<span style="font-weight:bold">RHSFlags ( list-of-flags )</span>
</p>
<p>
Features of the RHS encoding.
</p>
<p>
For each side of the mapping, zero or more of the following flags can be specified:
</p>
<p>
<span style="font-weight:bold">ExpectsNFC</span>
</p>
<p>
Input on this side of the mapping should be in fully-composed form.
</p>
<p>
<span style="font-weight:bold">ExpectsNFD</span>
</p>
<p>
Input on this side of the mapping should be in fully-decomposed form.
</p>
<p>
<span style="font-weight:bold">GeneratesNFC</span>
</p>
<p>
Output on this side of the mapping is fully-composed.
</p>
<p>
<span style="font-weight:bold">GeneratesNFD</span>
</p>
<p>
Output on this side of the mapping is fully-decomposed.
</p>
<p>
<span style="font-weight:bold">VisualOrder</span>
</p>
<p>
This side of the mapping deals with visual (rather than logical) text order.
</p>
<p>
The “<span style="font-weight:bold">expects</span>” flags can be used to specify that Unicode input to this side of the mapping should be normalized before it is presented to the actual mapping rules. By specifying a normalization form for the Unicode side of a mapping description, the author can write mapping rules assuming a particular canonical representation. The TECkit engine will take care of normalizing the input text so that it matches the expectation of the rules.
</p>
<p>
The “<span style="font-weight:bold">generates</span>” flags allow the mapping author to declare which normalization form will be produced by the mapping rules. However, as it can be difficult to ensure the accuracy of this, TECkit does not “trust” this flag, but always explicitly normalizes the output if requested by the application using the mapping.
</p>
<p>
A typical example of the header information might be:
</p>
<pre>
EncodingName "Bibledit-NdebeleDiglot-2008"
</pre>
<pre>
DescriptiveName "Simple rules for transliteration"
</pre>
<pre>
Version "1"
</pre>
<pre>
Contact "mailto:author@domain.org"
</pre>
<pre>
RegistrationAuthority "Bibledit International Ltd."
</pre>
<pre>
RegistrationName "Bibledit Ndebele Diglot"
</pre>
<pre>
Copyright "(c)2008 The Author (released under GPL3)"
</pre>
<pre>
LHSFlags ()
</pre>
<pre>
RHSFlags (ExpectsNFD GeneratesNFD)
</pre>
<h3>
<a name="mappingpasses" href="" id="mappingpasses"></a>Mapping passes
</h3>
<p>
The heart of a mapping description is the series of mapping passes that relate characters or sequences on the LHS to those on the RHS. In simple cases there is just one pass.
</p>
<p>
Each pass begins with a header line that declares the encoding space in which it operates:
</p>
<pre>
pass( pass-type )
</pre>
<p>
where pass-type is one of:
</p>
<pre>
Byte
</pre>
<pre>
Unicode
</pre>
<pre>
Byte_Unicode
</pre>
<pre>
Unicode_Byte
</pre>
<p>
As Bibledit only works with UTF-8 encoded data, the pass-type should always be Unicode.
</p>
<p>
There are also special “normalization pass” types that can be used in special cases. To create a normalization pass, specify pass-type as one of:
</p>
<pre>
NFC_fwd
</pre>
<pre>
NFD_fwd
</pre>
<pre>
NFC_rev
</pre>
<pre>
NFD_rev
</pre>
<pre>
NFC
</pre>
<pre>
NFD
</pre>
<p>
As the names suggest, these apply the NFC or NFD Unicode normalization forms as part of the forward, reverse, or both processing “pipelines”. Most mappings will not need to include explicit normalization passes, as the ExpectsNFC or ExpectsNFD flag can be used to request pre-normalization of Unicode data before any mapping rules are applied, and applications using TECkit can explicitly request either NFC or NFD data when mapping to Unicode. The only reason to use a normalization pass in a mapping description would be to ensure that data is in a particular normalization form somewhere in the middle of a multi-pass Unicode transduction.
</p>
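<p>
As a rough sketch of such a multi-pass description, the normalization pass below sits between two Unicode passes. The rules are illustrative only, and it is assumed that a normalization pass contains no rules of its own:
</p>
<pre>
EncodingName "Example-Normalization"
pass(Unicode)
; e followed by an apostrophe becomes the precomposed U+00E9
U+0065 U+0027 &gt; U+00E9
pass(NFD)
; this pass only decomposes the text, so U+00E9 becomes U+0065 U+0301
pass(Unicode)
; this rule now also applies to the e that came from U+00E9
U+0065 &gt; U+03B5
</pre>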
<h3>
<a name="classdefinitions" href="" id="classdefinitions"></a>Class definitions
</h3>
<p>
Character classes may be used to make the mapping description more readable and concise; suitable class definitions allow a single rule to express a whole set of related mappings. They are typically used in contextual constraints or as elements of rules that reorder character sequences.
</p>
<p>
Classes are defined with the UniClass statement:
</p>
<pre>
UniClass [ name ] = ( unicodeSequence )
</pre>
<p>
Class names, always enclosed in square brackets, are “identifiers” that may contain letters, digits, and the underscore character; they may not begin with a digit. Unlike the keywords of the TECkit language, they are case-sensitive. The Unicode sequence is a space-separated list of character codes, similar to those used in mapping rules (see below), with the addition of a “range” notation: two character codes separated by .. represent the complete set of characters from the first to the second (inclusive).
</p>
<p>
Some examples:
</p>
<pre>
UniClass [control] = ( U+0000..U+001f U+007f )
</pre>
<pre>
UniClass [letter] = ( U+0041..U+005a U+0061..U+007a )
</pre>
<h3>
<a name="defaultsforunmappedcharacters" href="" id="defaultsforunmappedcharacters"></a>Defaults for unmapped characters
</h3>
<p>
In Unicode passes, any characters not explicitly matched by mapping rules will be output unchanged.
</p>
<h3>
<a name="mappingrules" href="" id="mappingrules"></a>Mapping rules
</h3>
<p>
The actual mapping from Unicode to Unicode is expressed as a list of mapping rules. A mapping description actually contains two complete sets of mapping rules: one set matches characters in the left-hand side text and generates the right-hand side text, and the other matches characters in the right-hand side text and generates the left-hand side text. However, in most cases it is simplest to express both mappings at once, using bi-directional rules where either side of the rule can act as “match” with the other being “replacement”.
</p>
<p>
The general form of a mapping rule is:
</p>
<pre>
lhsSeq [ / lhsContext ] operator rhsSeq [ / rhsContext ]
</pre>
<p>
Here, <span style="font-style:italic">operator</span> indicates whether this rule is to be used only when mapping from the left-hand side to the right, from the right-hand side to the left, or (the most common case) in both directions:
</p>
<pre>
&lt;&gt; bidirectional mapping rule
</pre>
<pre>
&gt; unidirectional LHS-to-RHS rule
</pre>
<pre>
&lt; unidirectional RHS-to-LHS rule
</pre>
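<p>
The sketch below uses all three operators in one small Latin/Greek mapping. The choice of characters is only an illustration of when a one-way rule might be wanted:
</p>
<pre>
U+006B &lt;&gt; U+03BA   ; k and kappa map to each other in both directions
U+0063 &gt; U+03BA      ; c also becomes kappa, but only from LHS to RHS
U+0073 &lt;&gt; U+03C3   ; s and sigma map to each other in both directions
U+0073 &lt; U+03C2      ; final sigma becomes s, but only from RHS to LHS
</pre>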
<p>
The <span style="font-style:italic">lhsSeq</span> and <span style="font-style:italic">rhsSeq</span> parts of the rule are simple lists of character codes. These may be expressed as decimal numbers or as hexadecimal (prefixed with 0x). In Unicode sequences, characters may also be listed by their Unicode character names as found in <a href="http://www.unicode.org/Public/UNIDATA/UnicodeData.txt" rel="nofollow">http://www.unicode.org/Public/UNIDATA/UnicodeData.txt</a>, with all non-alphanumeric characters in the names converted to underscores; thus, for example, <span style="font-style:italic">thai_character_ko_kai</span> may be used instead of 0x0E01 to make the mapping description file more self-documenting. The Unicode character names are not case-sensitive.
</p>
<p>
During the mapping operation, whichever of <span style="font-style:italic">lhsSeq</span> or <span style="font-style:italic">rhsSeq</span> corresponds to the input side of the rule can be considered a “match string”, with the other being its “replacement”. The context associated with the match string, if any, acts as a constraint on the application of the rule. Any context associated with the replacement is irrelevant; it would be used when mapping in the other direction.
</p>
<p>
Character class references may be used in the match and replacement sequences, although for clarity it may be better to list each individual character mapping. If a class is used on the replacement side of a rule, it must correspond to a class on the match side, and the resulting rules will map each character in the match class to the equivalent character in the replacement class. The classes must contain the same number of characters. Item tags, see below, may be used to associate the replacement class item with its corresponding match item; in the absence of such tags, items are matched by position within the match and replacement strings.
</p>
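<p>
A minimal sketch of a class-to-class rule, assuming two classes of equal size defined inside the pass; the class names are made up for this example:
</p>
<pre>
pass(Unicode)
UniClass [lower] = ( U+0061..U+007a )
UniClass [upper] = ( U+0041..U+005a )
; each character in [lower] maps to the character at the same position in [upper]
[lower] &gt; [upper]
</pre>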
<p>
For contextually constrained mappings, the <span style="font-style:italic">lhsContext</span> and <span style="font-style:italic">rhsContext</span> parts of the mapping rule are used. These use a “slash … underscore” notation:
</p>
<pre>
/ preContextSeq _ postContextSeq
</pre>
<p>
The match and replace strings and the pre- and post-contexts may be simple sequences of character codes, or may be more complex expressions using the following “regular expression” elements:
</p>
<pre>
[cls] match any character from the class cls
</pre>
<pre>
. match any single character
</pre>
<pre>
# match beginning or end of input text
</pre>
<pre>
^item ‘not item’: match anything except the given item
</pre>
<p>
The 'not item' applies to single items only; negated groups are not supported.
</p>
<pre>
(...) grouping (for optionality or repeat counts)
</pre>
<pre>
| alternation (within group): match either preceding or following sequence
</pre>
<pre>
{a,b} match preceding item minimum a times, maximum b (0 ≤ a ≤ b ≤ 15)
</pre>
<pre>
? match preceding item 0 or 1 times
</pre>
<pre>
* match preceding item 0 to 15 times
</pre>
<pre>
+ match preceding item 1 to 15 times
</pre>
<pre>
=tag tag preceding item for match/replacement association
</pre>
<pre>
@tag duplicate the tagged item (including groups) from LHS
</pre>
<p>
The @tag can only occur on RHS. It is typically used to implement reordering.
</p>
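<p>
A reordering rule built from tags might look like the following sketch. The class names and ranges are assumptions chosen for illustration (Thai vowels that are written before their consonant), and the rule is kept unidirectional:
</p>
<pre>
pass(Unicode)
UniClass [prevowel] = ( U+0E40..U+0E44 )
UniClass [cons] = ( U+0E01..U+0E2E )
; tag the vowel as v and the consonant as c, then output them in swapped order
[prevowel]=v [cons]=c &gt; @c @v
</pre>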
<p>
A couple of notes on the use of regular expressions and context constraints:
</p>
<ul>
<li>Repeat counts or optionality may be applied to parenthesized groups as well as to individual items.
</li>
<li>It is meaningless to specify context on the replacement side of a unidirectional rule; contextual constraints apply to the matching process on the input side of the conversion.
</li>
<li>The special ‘#’ code is only meaningful as the first item in the pre-context or the last item in the post-context; in effect, there is an “end of text” pseudo-character before the first real character of input, and one after the last, which can only match this code.
</li>
<li>A negated item is still a “concrete” item that matches a real character in the input or the “end of text” pseudo-character.
</li>
<li>No repeatable item can ever match more than 15 times; unlike standard regular expressions, the <span style="font-style:italic">star</span> and <span style="font-style:italic">plus</span> operators have a fixed upper bound. In principle, a repeatable element within a repeatable group will permit a higher total number of repetitions.
</li>
</ul>
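<p>
As an illustration of repeat counts and optional items, here are two hypothetical rules; they are a sketch of the notation rather than rules taken from an existing mapping:
</p>
<pre>
; a run of 2 to 15 spaces becomes a single space
U+0020{2,15} &gt; U+0020
; "colour" and "color" both become "color": the u (U+0075) is optional
U+0063 U+006F U+006C U+006F U+0075? U+0072 &gt; U+0063 U+006F U+006C U+006F U+0072
</pre>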
<p>
Rules are tested from the most to the least specific, where a longer rule (counting the length of context as well as the actual match string) is considered more specific than a shorter one. If there are two equally long rules that could match at a particular place in the input, the first one listed in the mapping description file will be used.
</p>
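<p>
The sketch below combines a contextual constraint with this ordering of rules. The class name is made up, and the rules are only meant to illustrate the notation:
</p>
<pre>
pass(Unicode)
UniClass [letter] = ( U+0061..U+007a )
; s that is not followed by a letter becomes final sigma; the negated
; item also matches the "end of text" pseudo-character
U+0073 / _ ^[letter] &gt; U+03C2
; any other s becomes sigma; this rule is shorter, so it is only
; used where the longer rule above does not match
U+0073 &gt; U+03C3
</pre>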
<p>
The maximum potential length of any pre-context (considering all repeat counts) in a pass, plus the maximum potential match string, plus the maximum potential post-context, must not exceed 255 characters. Similarly, the maximum output that can be generated from any rule is limited to 255 characters.
</p>
<h3>
<a name="macros" href="" id="macros"></a>Macros
</h3>
<p>
TECkit supports a simple macro facility; this may be used to define symbols that act as “shorthand” for frequently-used fragments of a mapping description, such as character classes that are needed in multiple passes, or sequences used in the context of multiple rules.
</p>
<p>
A macro is defined with a line of the form:
</p>
<pre>
Define name &lt;arbitrary TECkit source&gt;
</pre>
<p>
After such a line, wherever the name occurs in the description, it is treated as representing the specified source text.
</p>
<p>
This is particularly useful when the context is complex, perhaps involving several alternatives or multiple repeatable items; suitably descriptive macro names may also serve to make the mapping description more self-documenting.
</p>
<p>
Another use for macros is to provide more convenient names for Unicode characters. This can help make mapping descriptions more readable.
</p>
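<p>
For example, the following sketch defines two macros as short names for Greek letters and then uses them in rules. The macro names are arbitrary:
</p>
<pre>
Define ALPHA U+03B1
Define BETA U+03B2
U+0061 &lt;&gt; ALPHA
U+0062 &lt;&gt; BETA
</pre>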
<p>
Note that macros must be defined before they are used, including any use in the definition of other macros; thus, it is legitimate to say:
</p>
<pre>
Define NUL 0x00
</pre>
<pre>
Define DEL 0x7F
</pre>
<pre>
Define ASCII NUL..DEL
</pre>
<p>
But if the definitions are rearranged so that NUL and DEL are not yet defined when they are used in the definition of ASCII, the result will be a compile-time error, even though they are defined afterwards:
</p>
<pre>
Define ASCII NUL..DEL
</pre>
<pre>
Define NUL 0x00
</pre>
<pre>
Define DEL 0x7F
</pre>
<pre>
ByteClass[asc] = (ASCII)
</pre>
<p>
This will generate an error on the ByteClass line, because the identifiers NUL and DEL found in the expansion of ASCII will be considered undefined.
</p>
<h3>
<a name="unicodeonlymappings" href="" id="unicodeonlymappings"></a>Unicode-only mappings
</h3>
<p>
The TECkit system, while targeted primarily at byte/Unicode conversion, is used by Bibledit for Unicode mapping operations. A mapping description need not contain a Byte_Unicode pass at all. If it contains only Unicode passes, both input and output are Unicode data.
</p>
<h3>
<a name="example" href="" id="example"></a>Example
</h3>
<p>
A simple example of a TECkit mapping may help you to start writing your own quickly.
</p>
<pre>
EncodingName "Example"
pass (Unicode)
; This simple example transliterates the Latin characters a and b
; to the Greek α and β.
U+0061 &gt; U+03B1
U+0062 &gt; U+03B2
</pre>
<p>
End of example.
</p><br />
</div>
</body>
</html>