/usr/share/doc/ne/html/The-Encoding-Mess.html is in ne-doc 2.5-1.
<html lang="en">
<head>
<title>The Encoding Mess - ne's manual</title>
<meta http-equiv="Content-Type" content="text/html">
<meta name="description" content="ne's manual">
<meta name="generator" content="makeinfo 4.13">
<link title="Top" rel="start" href="index.html#Top">
<link rel="prev" href="Motivations-and-Design.html#Motivations-and-Design" title="Motivations and Design">
<link rel="next" href="History.html#History" title="History">
<link href="http://www.gnu.org/software/texinfo/" rel="generator-home" title="Texinfo Homepage">
<meta http-equiv="Content-Style-Type" content="text/css">
<style type="text/css"><!--
pre.display { font-family:inherit }
pre.format { font-family:inherit }
pre.smalldisplay { font-family:inherit; font-size:smaller }
pre.smallformat { font-family:inherit; font-size:smaller }
pre.smallexample { font-size:smaller }
pre.smalllisp { font-size:smaller }
span.sc { font-variant:small-caps }
span.roman { font-family:serif; font-weight:normal; }
span.sansserif { font-family:sans-serif; font-weight:normal; }
--></style>
</head>
<body>
<div class="node">
<a name="The-Encoding-Mess"></a>
<p>
Next: <a rel="next" accesskey="n" href="History.html#History">History</a>,
Previous: <a rel="previous" accesskey="p" href="Motivations-and-Design.html#Motivations-and-Design">Motivations and Design</a>,
Up: <a rel="up" accesskey="u" href="index.html#Top">Top</a>
<hr>
</div>
<h2 class="chapter">8 The Encoding Mess</h2>
<p><a name="index-UTF_002d8-240"></a><a name="index-ISO_002d8859-family-241"></a><a name="index-ISO_002d8859_002d1-242"></a>
The original <code>ne</code> handled 8-bit text files, and assumed that every
byte coming from the keyboard could be output to the terminal. No other
assumption was made—for instance, the up/down casing functions did not
assume a particular encoding for non-US-ASCII characters. This choice
had a significant advantage: <code>ne</code> could easily handle several
different encodings, with only minor nuisances for the end user.
<p>Since version 1.30, <code>ne</code> supports UTF-8. It can use UTF-8 for its
input/output, and it can also interpret one or more buffers as containing
UTF-8 encoded text, acting accordingly. Note that the buffer content is
actual UTF-8 text: <code>ne</code> does not use wide characters. As a
positive side effect, <code>ne</code> can fully support the ISO-10646
standard, while non-UTF-8 texts still occupy exactly one byte per
character.
<p>More precisely, <em>any</em> piece of text in <code>ne</code> is classified as
US-ASCII, 8-bit or UTF-8. A US-ASCII text contains only US-ASCII
characters. An 8-bit text sports a one-to-one correspondence between
characters and bytes, whereas a UTF-8 text is interpreted as UTF-8. Of
course, this raises a difficult question: <em>when</em> should a buffer be
classified as UTF-8?
<p>Character encodings are a mess. There is nothing we can do to change
this fact, as character encodings are <em>metadata that modify data
semantics</em>. The same file may represent different texts of different
lengths when interpreted with different encodings. Thus, there is no safe
way of guessing the encoding of a file.
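<p>As a small illustration of this point (a Python sketch, not part of
<code>ne</code>; the byte string is an arbitrary example), the same five
bytes decode to four characters as UTF-8 and to five characters as
ISO-8859-1:

```python
# Five bytes: "caf" followed by the two-byte UTF-8 sequence for e-acute.
data = b"caf\xc3\xa9"

# Interpreted as UTF-8, the last two bytes form a single character.
as_utf8 = data.decode("utf-8")

# Interpreted as ISO-8859-1, every byte is a character of its own.
as_latin1 = data.decode("iso-8859-1")

print(len(data), len(as_utf8), len(as_latin1))  # prints "5 4 5"
```

<p>Both decodings succeed, so nothing in the bytes themselves says which
text was intended.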
<p><code>ne</code> stays on the safe side: it will never try to convert a file
from one encoding to another. It can, however, interpret the data
contained in a buffer according to an encoding: in other words,
encodings are truly treated as metadata. You can switch off UTF-8
at any time, and see the same buffer as a standard 8-bit file.
<p>Moreover, <code>ne</code> uses a <em>lazy</em> approach to the problem: first of
all, unless the UTF-8 automatic detection flag is set
(see <a href="UTF8Auto.html#UTF8Auto">UTF8Auto</a>), no attempt is ever made to consider a file as UTF-8
encoded. Every file, clip, command line, etc., is first scanned for
non-US-ASCII characters: if it is made entirely of US-ASCII characters,
it is classified as US-ASCII. A US-ASCII piece of text is compatible
with anything else: it may be pasted into any buffer, or, if it is a
buffer, it may accept any form of text. Buffers classified as US-ASCII
are distinguished by an ‘<samp><span class="samp">A</span></samp>’ on the status bar.
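<p>The classification step can be sketched roughly as follows. This is an
illustrative Python sketch, not <code>ne</code>'s actual code: the function
name <code>classify</code>, its <code>utf8_auto</code> parameter and the
returned labels are hypothetical, and it assumes that automatic detection
amounts to checking whether the bytes form valid UTF-8, which this chapter
does not spell out.

```python
def classify(data: bytes, utf8_auto: bool = True) -> str:
    """Classify a piece of text as "ASCII", "UTF-8" or "8-bit" (sketch)."""
    # First scan: text made only of US-ASCII bytes is compatible with anything.
    if data.isascii():
        return "ASCII"
    # Assumption: with automatic detection on, valid UTF-8 is taken as UTF-8.
    if utf8_auto:
        try:
            data.decode("utf-8")
            return "UTF-8"
        except UnicodeDecodeError:
            pass
    # Otherwise every byte is treated as one character.
    return "8-bit"


print(classify(b"hello"))
print(classify(b"caf\xc3\xa9"))
print(classify(b"caf\xc3\xa9", utf8_auto=False))
```

<p>With the flag off, the sketch never reports UTF-8, mirroring the rule that
no attempt is made to consider a file as UTF-8 encoded unless automatic
detection is enabled.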
<p>As soon as a user action forces a choice of encoding (e.g., an accented
character is typed, or a UTF-8-encoded clip is pasted), <code>ne</code> fixes
the mode to 8-bit or UTF-8 (when there is a choice, this depends on
the value of the <a href="UTF8Auto.html#UTF8Auto">UTF8Auto</a> flag). Of course, in some cases this may
be impossible, and in that case an error is reported.
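<p>The compatibility rule behind this behaviour can be sketched as below. This
is a hypothetical Python illustration, not <code>ne</code>'s implementation:
the function <code>combine</code> and the string labels are invented for the
example, and it assumes that the impossible cases are exactly those that
would mix 8-bit and UTF-8 text, which the manual does not state explicitly.

```python
def combine(buffer_enc: str, text_enc: str) -> str:
    """Return the buffer's encoding after inserting text (sketch)."""
    # A US-ASCII buffer accepts any text and takes on its encoding.
    if buffer_enc == "ASCII":
        return text_enc
    # US-ASCII text, or text in the buffer's own encoding, changes nothing.
    if text_enc in ("ASCII", buffer_enc):
        return buffer_enc
    # Assumption: mixing 8-bit and UTF-8 is the case reported as an error.
    raise ValueError(f"cannot insert {text_enc} text into a {buffer_enc} buffer")


print(combine("ASCII", "UTF-8"))
print(combine("8-bit", "ASCII"))
```

<p>In this sketch a buffer leaves the US-ASCII state only when non-US-ASCII
text arrives, which is the lazy behaviour described in this chapter.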
<p>All this happens behind the scenes, and it is designed so that in 99% of
cases there is no need to think about encodings. In any case, should
<code>ne</code>'s behaviour not match your needs, you can always change the
level of UTF-8 support at run time.
</body></html>