
<html lang="en">
<head>
<title>The Encoding Mess - ne's manual</title>
<meta http-equiv="Content-Type" content="text/html">
<meta name="description" content="ne's manual">
<meta name="generator" content="makeinfo 4.13">
<link title="Top" rel="start" href="index.html#Top">
<link rel="prev" href="Motivations-and-Design.html#Motivations-and-Design" title="Motivations and Design">
<link rel="next" href="History.html#History" title="History">
<link href="http://www.gnu.org/software/texinfo/" rel="generator-home" title="Texinfo Homepage">
<meta http-equiv="Content-Style-Type" content="text/css">
<style type="text/css"><!--
  pre.display { font-family:inherit }
  pre.format  { font-family:inherit }
  pre.smalldisplay { font-family:inherit; font-size:smaller }
  pre.smallformat  { font-family:inherit; font-size:smaller }
  pre.smallexample { font-size:smaller }
  pre.smalllisp    { font-size:smaller }
  span.sc    { font-variant:small-caps }
  span.roman { font-family:serif; font-weight:normal; } 
  span.sansserif { font-family:sans-serif; font-weight:normal; } 
--></style>
</head>
<body>
<div class="node">
<a name="The-Encoding-Mess"></a>
<p>
Next:&nbsp;<a rel="next" accesskey="n" href="History.html#History">History</a>,
Previous:&nbsp;<a rel="previous" accesskey="p" href="Motivations-and-Design.html#Motivations-and-Design">Motivations and Design</a>,
Up:&nbsp;<a rel="up" accesskey="u" href="index.html#Top">Top</a>
<hr>
</div>

<h2 class="chapter">8 The Encoding Mess</h2>

<p><a name="index-UTF_002d8-240"></a><a name="index-ISO_002d8859-family-241"></a><a name="index-ISO_002d8859_002d1-242"></a>
The original <code>ne</code> handled 8-bit text files, and assumed that every
byte coming from the keyboard could be output to the terminal. No other
assumption was made&mdash;for instance, the up/down casing functions did not
assume a particular encoding for non-US-ASCII characters. This choice
had a significant advantage: <code>ne</code> could easily handle several
different encodings, with minor nuisances for the end user.

   <p>Since version 1.30, <code>ne</code> supports UTF-8. It can use UTF-8 for its
input/output, and it can also interpret one or more buffers as containing
UTF-8 encoded text, acting accordingly. Note that the buffer content is
actual UTF-8 text&mdash;<code>ne</code> does not use wide characters. As a
positive side effect, <code>ne</code> can fully support the ISO-10646
standard, but nonetheless non-UTF-8 texts occupy exactly one byte per
character.

   <p>More precisely, <em>any</em> piece of text in <code>ne</code> is classified as
US-ASCII, 8-bit or UTF-8. A US-ASCII text contains only US-ASCII
characters. An 8-bit text sports a one-to-one correspondence between
characters and bytes, whereas a UTF-8 text is interpreted in UTF-8.  Of
course, this raises a difficult question: <em>when</em> should a buffer be
classified as UTF-8?

   <p>Character encodings are a mess. There is nothing we can do to change
this fact, as character encodings are <em>metadata that modify data
semantics</em>. The same file may represent different texts of different
lengths when interpreted with different encodings. Thus, there is no safe
way of guessing the encoding of a file.
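
   <p>As a small illustration of this point (a Python sketch, not part of
<code>ne</code>; the variable names are made up for the example), the same two
bytes can be read as one character or as two, depending on the encoding assumed:

```python
# The same two bytes, read under two different encodings.
data = bytes([0xC3, 0xA8])

as_utf8 = data.decode("utf-8")         # one character: e with grave accent
as_latin1 = data.decode("iso-8859-1")  # two characters: A with tilde, diaeresis

# The byte sequence is identical; only the interpretation changes.
print(len(data), len(as_utf8), len(as_latin1))
```

   <p>Both decodings succeed, so the bytes alone cannot tell which text was intended.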

   <p><code>ne</code> stays on the safe side: it will never try to convert a file
from one encoding to another. It can, however, interpret data
contained in a buffer depending on an encoding: in other words,
encodings are truly treated as metadata. You can switch off UTF-8
at any time, and see the same buffer as a standard 8-bit file.

   <p>Moreover, <code>ne</code> uses a <em>lazy</em> approach to the problem: first of
all, unless the UTF-8 automatic detection flag is set
(see <a href="UTF8Auto.html#UTF8Auto">UTF8Auto</a>), no attempt is ever made to consider a file as UTF-8
encoded.  Every file, clip, command line, etc., is first scanned for
non-US-ASCII characters: if it is entirely made of US-ASCII characters,
it is classified as US-ASCII. A US-ASCII piece of text is compatible
with anything else&mdash;it may be pasted in any buffer, or, if it is a
buffer, it may accept any form of text. Buffers classified as US-ASCII
are distinguished by an &lsquo;<samp><span class="samp">A</span></samp>&rsquo; on the status bar.
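
   <p>The classification described above might be sketched as follows. This is a
Python illustration under stated assumptions, not <code>ne</code>'s actual code:
the function name <code>classify</code>, the parameter <code>utf8_auto</code>
(standing in for the UTF8Auto flag) and the returned labels are all hypothetical,
and the validity check used when the flag is set is a guess at one reasonable
behaviour.

```python
def classify(data, utf8_auto=True):
    """Label a bytes object as "ascii", "utf8" or "8bit" (hypothetical helper)."""
    # Pure US-ASCII text is compatible with both other classes.
    if data.isascii():
        return "ascii"
    # Assumption: UTF-8 is only considered when automatic detection is on,
    # and only if the bytes form a valid UTF-8 sequence.
    if utf8_auto:
        try:
            data.decode("utf-8")
            return "utf8"
        except UnicodeDecodeError:
            pass
    # Anything else is treated as one byte per character.
    return "8bit"
```

   <p>With this sketch, <code>classify(b"plain text")</code> yields
<code>"ascii"</code>, while the single byte <code>0xE8</code> is not valid UTF-8
on its own and is therefore labelled <code>"8bit"</code>.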

   <p>As soon as a user action forces a choice of encoding (e.g., an accented
character is typed, or a UTF-8-encoded clip is pasted), <code>ne</code> fixes
the mode to 8-bit or UTF-8 (when there is a choice, this depends on
the value of the <a href="UTF8Auto.html#UTF8Auto">UTF8Auto</a> flag). Of course, in some cases this may
be impossible, and in that case an error will be reported.
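
   <p>One way to picture the compatibility rule is the sketch below, again in
Python and again hypothetical: <code>paste_result</code> and its labels are
invented for illustration, and the error case shown (mixing 8-bit and UTF-8
text) is an assumption about which combinations are impossible.

```python
def paste_result(buffer_enc, clip_enc):
    """Return the buffer's class after a paste, or raise if incompatible.

    Both arguments are one of "ascii", "8bit" or "utf8" (hypothetical labels).
    """
    # A US-ASCII buffer accepts any text and takes on the clip's class.
    if buffer_enc == "ascii":
        return clip_enc
    # A US-ASCII clip, or one of the same class, leaves the buffer unchanged.
    if clip_enc == "ascii" or clip_enc == buffer_enc:
        return buffer_enc
    # Assumption: 8-bit and UTF-8 text cannot be mixed without conversion,
    # so this is the case in which an error would be reported.
    raise ValueError("incompatible encodings: " + buffer_enc + " and " + clip_enc)
```

   <p>Since no conversion is ever performed, the only safe outcome for the
remaining combination is to refuse the operation.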

   <p>All this happens behind the scenes, and it is designed so that in 99% of
cases there is no need to think about encodings. In any case, should
<code>ne</code>'s behaviour not match your needs, you can always change the
level of UTF-8 support at run time.

   </body></html>