<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<!-- Created by GNU Texinfo 5.1, http://www.gnu.org/software/texinfo/ -->
<head>
<title>The recode reference manual: Library</title>

<meta name="description" content="The recode reference manual: Library">
<meta name="keywords" content="The recode reference manual: Library">
<meta name="resource-type" content="document">
<meta name="distribution" content="global">
<meta name="Generator" content="makeinfo">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link href="index.html#Top" rel="start" title="Top">
<link href="Concept-Index.html#Concept-Index" rel="index" title="Concept Index">
<link href="Charset-and-Surface-Index.html#SEC_Contents" rel="contents" title="Table of Contents">
<link href="index.html#Top" rel="up" title="Top">
<link href="Universal.html#Universal" rel="next" title="Universal">
<link href="Invoking-recode.html#Debugging" rel="previous" title="Debugging">
<style type="text/css">
<!--
a.summary-letter {text-decoration: none}
blockquote.smallquotation {font-size: smaller}
div.display {margin-left: 3.2em}
div.example {margin-left: 3.2em}
div.indentedblock {margin-left: 3.2em}
div.lisp {margin-left: 3.2em}
div.smalldisplay {margin-left: 3.2em}
div.smallexample {margin-left: 3.2em}
div.smallindentedblock {margin-left: 3.2em; font-size: smaller}
div.smalllisp {margin-left: 3.2em}
kbd {font-style:oblique}
pre.display {font-family: inherit}
pre.format {font-family: inherit}
pre.menu-comment {font-family: serif}
pre.menu-preformatted {font-family: serif}
pre.smalldisplay {font-family: inherit; font-size: smaller}
pre.smallexample {font-size: smaller}
pre.smallformat {font-family: inherit; font-size: smaller}
pre.smalllisp {font-size: smaller}
span.nocodebreak {white-space:nowrap}
span.nolinebreak {white-space:nowrap}
span.roman {font-family:serif; font-weight:normal}
span.sansserif {font-family:sans-serif; font-weight:normal}
ul.no-bullet {list-style: none}
-->
</style>


</head>

<body lang="en" bgcolor="#FFFFFF" text="#000000" link="#0000FF" vlink="#800080" alink="#FF0000">
<a name="Library"></a>
<div class="header">
<p>
Next: <a href="Universal.html#Universal" accesskey="n" rel="next">Universal</a>, Previous: <a href="Invoking-recode.html#Invoking-recode" accesskey="p" rel="previous">Invoking recode</a>, Up: <a href="index.html#Top" accesskey="u" rel="up">Top</a> &nbsp; [<a href="Charset-and-Surface-Index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Concept-Index.html#Concept-Index" title="Index" rel="index">Index</a>]</p>
</div>
<a name="A-recoding-library"></a>
<h2 class="chapter">4 A recoding library</h2>

<a name="index-recoding-library"></a>
<p>The program named <code>recode</code> is just an application of its recoding
library.  The recoding library is available separately for other C
programs.  A good way to acquire some familiarity with the recoding
library is to get acquainted with the <code>recode</code> program itself.
</p>
<p>To use the recoding library once it is installed, a C program needs to
have a line:
</p>
<div class="example">
<pre class="example">#include &lt;recode.h&gt;
</pre></div>

<p>near its beginning, and the user should have &lsquo;<samp>-lrecode</samp>&rsquo; on the
linking call, so modules from the recoding library are found.
</p>
<p>The library is still under development.  As it stands, it contains four
identifiable sets of routines: the outer level functions, the request
level functions, the task level functions and the charset level functions.
They are discussed in separate sections.
</p>
<p>To use the recoding library effectively in most applications, it should
rarely be necessary to study anything beyond the main initialisation
function at outer level and the various functions at request level.
</p>
<table class="menu" border="0" cellspacing="0">
<tr><td align="left" valign="top">&bull; <a href="#Outer-level" accesskey="1">Outer level</a>:</td><td>&nbsp;&nbsp;</td><td align="left" valign="top">Outer level functions
</td></tr>
<tr><td align="left" valign="top">&bull; <a href="#Request-level" accesskey="2">Request level</a>:</td><td>&nbsp;&nbsp;</td><td align="left" valign="top">Request level functions
</td></tr>
<tr><td align="left" valign="top">&bull; <a href="#Task-level" accesskey="3">Task level</a>:</td><td>&nbsp;&nbsp;</td><td align="left" valign="top">Task level functions
</td></tr>
<tr><td align="left" valign="top">&bull; <a href="#Charset-level" accesskey="4">Charset level</a>:</td><td>&nbsp;&nbsp;</td><td align="left" valign="top">Charset level functions
</td></tr>
<tr><td align="left" valign="top">&bull; <a href="#Errors" accesskey="5">Errors</a>:</td><td>&nbsp;&nbsp;</td><td align="left" valign="top">Handling errors
</td></tr>
</table>

<hr>
<a name="Outer-level"></a>
<div class="header">
<p>
Next: <a href="#Request-level" accesskey="n" rel="next">Request level</a>, Previous: <a href="#Library" accesskey="p" rel="previous">Library</a>, Up: <a href="#Library" accesskey="u" rel="up">Library</a> &nbsp; [<a href="Charset-and-Surface-Index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Concept-Index.html#Concept-Index" title="Index" rel="index">Index</a>]</p>
</div>
<a name="Outer-level-functions"></a>
<h3 class="section">4.1 Outer level functions</h3>

<a name="index-outer-level-functions"></a>
<p>The outer level functions mainly prepare the whole recoding library for
use, or perform actions which are unrelated to specific recodings.  Here is
an example of a program which does not really do anything useful.
</p>
<div class="example">
<pre class="example">#include &lt;stdlib.h&gt;
#include &lt;stdbool.h&gt;
#include &lt;recode.h&gt;

const char *program_name;

int
main (int argc, char *const *argv)
{
  program_name = argv[0];
  RECODE_OUTER outer = recode_new_outer (true);

  recode_delete_outer (outer);
  exit (0);
}
</pre></div>

<a name="index-RECODE_005fOUTER-structure"></a>
<p>The header file <code>&lt;recode.h&gt;</code> declares an opaque <code>RECODE_OUTER</code>
structure, which the programmer should use for allocating a variable in
his program (let&rsquo;s assume the programmer is a male, here, no prejudice
intended).  This &lsquo;<samp>outer</samp>&rsquo; variable is given as a first argument to
all outer level functions.
</p>
<a name="index-stdbool_002eh-header"></a>
<a name="index-bool-data-type"></a>
<p>The <code>&lt;recode.h&gt;</code> header file uses the Boolean type set up by the
system header file <code>&lt;stdbool.h&gt;</code>.  But this header file is still
fairly new in C standards, and likely does not exist everywhere.  If your
system does not offer this system header file yet, the proper compilation
of the <code>&lt;recode.h&gt;</code> file could be guaranteed through the replacement
of the <code>&lt;stdbool.h&gt;</code> inclusion line by:
</p>
<div class="example">
<pre class="example">typedef enum {false = 0, true = 1} bool;
</pre></div>

<p>People wanting wider portability, or Autoconf lovers, might arrange their
<samp>configure.in</samp> for being able to write something more general, like:
</p>
<div class="example">
<pre class="example">#if STDC_HEADERS
# include &lt;stdlib.h&gt;
#endif

/* Some systems do not define EXIT_*, even with STDC_HEADERS.  */
#ifndef EXIT_SUCCESS
# define EXIT_SUCCESS 0
#endif
#ifndef EXIT_FAILURE
# define EXIT_FAILURE 1
#endif
/* The following test is to work around the gross typo in systems like Sony
   NEWS-OS Release 4.0C, whereby EXIT_FAILURE is defined to 0, not 1.  */
#if !EXIT_FAILURE
# undef EXIT_FAILURE
# define EXIT_FAILURE 1
#endif

#if HAVE_STDBOOL_H
# include &lt;stdbool.h&gt;
#else
typedef enum {false = 0, true = 1} bool;
#endif

#include &lt;recode.h&gt;

const char *program_name;

int
main (int argc, char *const *argv)
{
  program_name = argv[0];
  RECODE_OUTER outer = recode_new_outer (true);

  recode_delete_outer (outer);
  exit (EXIT_SUCCESS);
}
</pre></div>

<p>but we will not insist on such details in the examples to come.
</p>
<ul>
<li> Initialisation functions
<a name="index-initialisation-functions_002c-outer"></a>

<div class="example">
<pre class="example">RECODE_OUTER recode_new_outer (<var>auto_abort</var>);
bool recode_delete_outer (<var>outer</var>);
</pre></div>

<a name="index-recode_005fnew_005fouter"></a>
<a name="index-recode_005fdelete_005fouter"></a>
<p>The recoding library absolutely needs to be initialised before being used,
and <code>recode_new_outer</code> has to be called once, first.  The function
returns the <var>outer</var> variable it initialises, and accepts a Boolean
argument telling whether or not the library should automatically issue
diagnostics on standard error and abort the whole program on errors.  When
<var>auto_abort</var> is <code>true</code>, the library later issues diagnostics
itself, and aborts the calling program on errors.  This is merely a
convenience: when this parameter is <code>false</code>, the calling program
must check the return value of every other call to the recoding library
functions and, when an error is detected, issue a diagnostic and abort
processing itself.
</p>
<p>Regardless of the setting of <var>auto_abort</var>, all recoding library
functions return a success status.  Most functions are geared for returning
<code>false</code> for an error, and <code>true</code> if everything went fine.
Functions returning structures or strings return <code>NULL</code> instead
of the result, when the result cannot be produced.  If <var>auto_abort</var>
is selected, functions either return <code>true</code>, or do not return at all.
</p>
<p>As in the example above, <code>recode_new_outer</code> is called only once in
most cases.  Calling <code>recode_new_outer</code> implies some overhead, so
calling it more than once should preferably be avoided.
</p>
<p>The termination function <code>recode_delete_outer</code> reclaims the memory
allocated by <code>recode_new_outer</code> for a given <var>outer</var> variable.
Calling <code>recode_delete_outer</code> prior to program termination is more
aesthetic than useful, as all memory resources are automatically reclaimed
when the program ends.  You may spare this terminating call if you prefer.
</p>
</li><li> The <code>program_name</code> declaration

<a name="index-program_005fname-variable"></a>
<p>As we just explained, the user may set the <code>recode</code> library so that,
in case of error, it issues the diagnostic itself and aborts the
whole processing.  This capability may be quite convenient.  When this
feature is used, the aborting routine includes the name of the running
program in the diagnostic.  On the other hand, when this feature is not
used, the library merely returns error codes, giving the library user fuller
control over all this.  This behaviour is more like what usual libraries
do: they return codes and never abort.  However, I would rather not force
library users to check all return codes themselves by leaving
no other choice.  In most simple applications, letting the library diagnose
and abort is much easier, and quite welcome.  It is precisely because
both possibilities exist that the <code>program_name</code> variable is needed: it
may be used by the library <em>when</em> the user sets it to diagnose itself.
</p></li></ul>
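
<p>When <var>auto_abort</var> is <code>false</code>, every status has to be checked by
the caller.  The following is only a minimal sketch of such checking, not
code taken from the library: the helper name <code>recode_stream_checked</code>
and its numeric status codes are assumptions made for illustration, and the
request level calls it relies on are described in the next section.
</p>
<div class="example">
<pre class="example">#include &lt;stdio.h&gt;
#include &lt;stdbool.h&gt;
#include &lt;recode.h&gt;

/* The library may refer to this variable when diagnosing by itself.  */
const char *program_name = "recode-demo";

/* Hypothetical helper, not a library function.  Recode IN into OUT
   according to REQUEST_STRING, checking the status of every step.
   Return 0 on success, or a small code naming the step which failed.  */
int
recode_stream_checked (const char *request_string, FILE *in, FILE *out)
{
  int status = 0;

  /* With auto_abort false, failures only show in returned values.  */
  RECODE_OUTER outer = recode_new_outer (false);
  if (outer == NULL)
    return 1;

  RECODE_REQUEST request = recode_new_request (outer);
  if (request == NULL)
    status = 1;                 /* initialisation failed */
  else
    {
      if (!recode_scan_request (request, request_string))
        status = 2;             /* request string not understood */
      else if (!recode_file_to_file (request, in, out))
        status = 3;             /* the recoding itself failed */
      recode_delete_request (request);
    }

  recode_delete_outer (outer);
  return status;
}
</pre></div>

<p>A caller might then turn any nonzero result into its own diagnostic and
exit status, for example after calling
<code>recode_stream_checked (&quot;ibmpc..latin1&quot;, stdin, stdout)</code>.
</p>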

<hr>
<a name="Request-level"></a>
<div class="header">
<p>
Next: <a href="#Task-level" accesskey="n" rel="next">Task level</a>, Previous: <a href="#Outer-level" accesskey="p" rel="previous">Outer level</a>, Up: <a href="#Library" accesskey="u" rel="up">Library</a> &nbsp; [<a href="Charset-and-Surface-Index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Concept-Index.html#Concept-Index" title="Index" rel="index">Index</a>]</p>
</div>
<a name="Request-level-functions"></a>
<h3 class="section">4.2 Request level functions</h3>

<a name="index-request-level-functions"></a>
<p>The request level functions are meant to cover most recoding needs
programmers may have; they should provide all usual functionality.
Their API is almost stable by now.  To get started with request level
functions, here is a full example of a program whose sole job is to filter
<code>ibmpc</code> code on its standard input into <code>latin1</code> code on its
standard output.
</p>
<div class="example">
<pre class="example">#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;stdbool.h&gt;
#include &lt;recode.h&gt;

const char *program_name;

int
main (int argc, char *const *argv)
{
  program_name = argv[0];
  RECODE_OUTER outer = recode_new_outer (true);
  RECODE_REQUEST request = recode_new_request (outer);
  bool success;

  recode_scan_request (request, &quot;ibmpc..latin1&quot;);

  success = recode_file_to_file (request, stdin, stdout);

  recode_delete_request (request);
  recode_delete_outer (outer);

  exit (success ? 0 : 1);
}
</pre></div>

<a name="index-RECODE_005fREQUEST-structure"></a>
<p>The header file <code>&lt;recode.h&gt;</code> declares a <code>RECODE_REQUEST</code> structure,
which the programmer should use for allocating a variable in his program.
This <var>request</var> variable is given as a first argument to all request
level functions, and in most cases, may be considered as opaque.
</p>
<ul>
<li> Initialisation functions
<a name="index-initialisation-functions_002c-request"></a>

<div class="example">
<pre class="example">RECODE_REQUEST recode_new_request (<var>outer</var>);
bool recode_delete_request (<var>request</var>);
</pre></div>

<a name="index-recode_005fnew_005frequest"></a>
<a name="index-recode_005fdelete_005frequest"></a>
<p>No <var>request</var> variable may be used in other request level
functions of the recoding library before having been initialised by
<code>recode_new_request</code>.  There may be many such <var>request</var>
variables, in which case, they are independent of one another and
they all need to be initialised separately.  To avoid memory leaks, a
<var>request</var> variable should not be initialised a second time without
calling <code>recode_delete_request</code> to &ldquo;un-initialise&rdquo; it.
</p>
<p>As with <code>recode_delete_outer</code>, the call to <code>recode_delete_request</code>
prior to program termination, in the example above, may be left out.
</p>
</li><li> Fields of <code>struct recode_request</code>
<a name="index-recode_005frequest-structure"></a>

<p>Here are the fields of a <code>struct recode_request</code> which may be
meaningfully changed, once a <var>request</var> has been initialised by
<code>recode_new_request</code>, but before it gets used.  It is not very frequent,
in practice, that these fields need to be changed.  To access the fields,
you need to include <samp>recodext.h</samp> <em>instead</em> of <samp>recode.h</samp>,
in which case there also is a greater chance that you need to recompile
your programs if a new version of the recoding library gets installed.
</p>
<dl compact="compact">
<dt><code>verbose_flag</code></dt>
<dd><a name="index-verbose_005fflag"></a>
<p>This field is initially <code>false</code>.  When set to <code>true</code>, the
library will echo to stderr the sequence of elementary recoding steps
needed to achieve the requested recoding.
</p>
</dd>
<dt><code>diaeresis_char</code></dt>
<dd><a name="index-diaeresis_005fchar"></a>
<p>This field is initially the ASCII value of a double quote <kbd>&quot;</kbd>,
but it may also be the ASCII value of a colon <kbd>:</kbd>.  In <code>texte</code>
charset, some countries use double quotes to mark diaeresis, while other
countries prefer colons.  This field contains the diaeresis character
for the <code>texte</code> charset.
</p>
</dd>
<dt><code>make_header_flag</code></dt>
<dd><a name="index-make_005fheader_005fflag"></a>
<p>This field is initially <code>false</code>.  When set to <code>true</code>, it
indicates that the program is merely trying to produce a recoding table in
source form rather than completing any actual recoding.  In such a case,
the optimisation of step sequence can be attempted much more aggressively.
If the step sequence cannot be reduced to a single step, table production
will fail.
</p>
</dd>
<dt><code>diacritics_only</code></dt>
<dd><a name="index-diacritics_005fonly"></a>
<p>This field is initially <code>false</code>.  For the <code>HTML</code> and <code>LaTeX</code>
charsets, it is often convenient to recode the diacriticized characters
only, while leaving alone other HTML code using ampersands or angular
brackets, or LaTeX code using backslashes.  Set the field to <code>true</code>
for getting this behaviour.  In the other charset, one can edit text as
well as HTML or LaTeX directives.
</p>
</dd>
<dt><code>ascii_graphics</code></dt>
<dd><a name="index-ascii_005fgraphics"></a>
<p>This field is initially <code>false</code>, and relates to characters 176 to
223 in the <code>ibmpc</code> charset, which are used to draw boxes.  When set
to <code>true</code>, while getting out of <code>ibmpc</code>, ASCII characters are
selected so as to graphically approximate these boxes.
</p></dd>
</dl>

</li><li> Study of request strings

<div class="example">
<pre class="example">bool recode_scan_request (<var>request</var>, &quot;<var>string</var>&quot;);
</pre></div>

<a name="index-recode_005fscan_005frequest"></a>
<p>The main role of a <var>request</var> variable is to describe a set of
recoding transformations.  Function <code>recode_scan_request</code> studies
the given <var>string</var>, and stores an internal representation of it into
<var>request</var>.  Note that <var>string</var> may be a full-fledged <code>recode</code>
request, possibly including surface specifications, intermediary
charsets, sequences, aliases or abbreviations (see <a href="Invoking-recode.html#Requests">Requests</a>).
</p>
<p>The internal representation automatically receives some pre-conditioning
and optimisation, so the <var>request</var> may then later be used many times
to achieve many actual recodings.  Calling
<code>recode_scan_request</code> many times with the same <var>string</var> would
not be efficient; it is better to keep many <var>request</var> variables instead.
</p>
</li><li> Actual recoding jobs

<p>Once the <var>request</var> variable holds the description of a recoding
transformation, a few functions use it for achieving an actual recoding.
Either input or output of a recoding may be a string, an in-memory buffer,
or a file.
</p>
<p>Functions with names like
<code>recode_<var>input-type</var>_to_<var>output-type</var></code> request an actual
recoding, and are described below.  It is easy to remember which arguments
each function accepts, once a few simple principles are grasped for each
possible <var>type</var>.  However, one of the recoding functions escapes these
principles and is discussed separately, first.
</p>
<div class="example">
<pre class="example">char *recode_string (<var>request</var>, <var>string</var>);
</pre></div>

<a name="index-recode_005fstring"></a>
<p>The function <code>recode_string</code> recodes <var>string</var> according
to <var>request</var>, and directly returns the resulting recoded string
freshly allocated, or <code>NULL</code> if the recoding could not succeed for
some reason.  When this function is used, it is the responsibility of
the programmer to ensure that the memory used by the returned string is
later reclaimed.
</p>
<a name="index-recode_005fstring_005fto_005fbuffer"></a>
<a name="index-recode_005fstring_005fto_005ffile"></a>
<a name="index-recode_005fbuffer_005fto_005fbuffer"></a>
<a name="index-recode_005fbuffer_005fto_005ffile"></a>
<a name="index-recode_005ffile_005fto_005fbuffer"></a>
<a name="index-recode_005ffile_005fto_005ffile"></a>
<div class="example">
<pre class="example">bool recode_string_to_buffer (<var>request</var>,
  <var>input_string</var>,
  &amp;<var>output_buffer</var>, &amp;<var>output_length</var>, &amp;<var>output_allocated</var>);
bool recode_string_to_file (<var>request</var>,
  <var>input_string</var>,
  <var>output_file</var>);
bool recode_buffer_to_buffer (<var>request</var>,
  <var>input_buffer</var>, <var>input_length</var>,
  &amp;<var>output_buffer</var>, &amp;<var>output_length</var>, &amp;<var>output_allocated</var>);
bool recode_buffer_to_file (<var>request</var>,
  <var>input_buffer</var>, <var>input_length</var>,
  <var>output_file</var>);
bool recode_file_to_buffer (<var>request</var>,
  <var>input_file</var>,
  &amp;<var>output_buffer</var>, &amp;<var>output_length</var>, &amp;<var>output_allocated</var>);
bool recode_file_to_file (<var>request</var>,
  <var>input_file</var>,
  <var>output_file</var>);
</pre></div>

<p>All these functions return a <code>bool</code> result, <code>false</code> meaning that
the recoding was not successful, often because of reversibility issues.
The name of each function indicates which type it reads and which
type it produces.  Let&rsquo;s discuss these three types in turn.
</p>
<dl compact="compact">
<dt>string</dt>
<dd>
<p>A string is merely an in-memory buffer which is terminated by a <code>NUL</code>
character (using as many bytes as needed), instead of being described
by a byte length.  For input, a pointer to the buffer is given through
one argument.
</p>
<p>It is notable that there are no <code>to_string</code> functions.  Only one
function recodes into a string, and it is <code>recode_string</code>, which
has already been discussed separately, above.
</p>
</dd>
<dt>buffer</dt>
<dd>
<p>A buffer is a sequence of bytes held in computer memory.  For input, two
arguments provide a pointer to the start of the buffer and its byte size.
Note that for charsets using many bytes per character, the size is given
in bytes, not in characters.
</p>
<p>For output, three arguments provide the addresses of three variables, which
will receive the buffer pointer, the used buffer size in bytes, and the
allocated buffer size in bytes.  If at the time of the call, the buffer
pointer is <code>NULL</code>, then the allocated buffer size should also be zero,
and the buffer will be allocated afresh by the recoding functions.  However,
if the buffer pointer is not <code>NULL</code>, it should be already allocated,
and the allocated buffer size then gives its size.  If the allocated size
gets exceeded as the recoding proceeds, the buffer will be automatically
reallocated bigger, probably elsewhere, and the allocated buffer size will
be adjusted accordingly.
</p>
<p>The second variable, giving the in-memory buffer size, will receive the
exact byte size which was needed for the recoding.  A <code>NUL</code> character
is guaranteed at the end of the produced buffer, but is not counted in the
byte size of the recoding.  Beyond that <code>NUL</code>, there might be some
extra space after the recoded data, extending to the allocated buffer size.
</p>
</dd>
<dt>file</dt>
<dd>
<a name="index-recode_005ffilter_005fopen_002c-not-available"></a>
<a name="index-recode_005ffilter_005fclose_002c-not-available"></a>
<p>A file is a sequence of bytes held outside computer memory, but
buffered through it.  For input, one argument provides a pointer to a
file already opened for read.  The file is then read and recoded from its
current position until the end of the file, effectively swallowing it in
memory if the destination of the recoding is a buffer.  For reading a file
filtered through the recoding library, but only a little bit at a time, one
should rather use <code>recode_filter_open</code> and <code>recode_filter_close</code>
(these two functions are not yet available).
</p>
<p>For output, one argument provides a pointer to a file already opened
for write.  The result of the recoding is written to that file starting
at its current position.
</p></dd>
</dl>
</li></ul>
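
<p>The output convention for buffers can be followed in the sketch below.
It does not call the recoding library: <code>model_to_buffer</code> and
<code>model_caller</code> are hypothetical names, and the stand-in merely copies
its input while handling the three output variables the way this section
describes.  Releasing the buffer with <code>free</code> is likewise an assumption,
suitable for this stand-in because it allocates with <code>realloc</code>.
</p>
<div class="example">
<pre class="example">#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;stdbool.h&gt;

/* Hypothetical stand-in using the same output arguments as the
   recode_*_to_buffer functions.  It only copies INPUT, so the buffer
   handling can be studied without any actual recoding.  */
bool
model_to_buffer (const char *input, size_t input_length,
                 char **output_buffer, size_t *output_length,
                 size_t *output_allocated)
{
  size_t needed = input_length + 1;     /* room for the trailing NUL */

  if (*output_buffer == NULL || *output_allocated &lt; needed)
    {
      /* Create or grow the buffer; it may move elsewhere in memory.  */
      char *bigger = realloc (*output_buffer, needed);
      if (bigger == NULL)
        return false;
      *output_buffer = bigger;
      *output_allocated = needed;
    }

  memcpy (*output_buffer, input, input_length);
  (*output_buffer)[input_length] = '\0'; /* present, but not counted */
  *output_length = input_length;
  return true;
}

/* Caller side: start from a NULL pointer and a zero allocated size,
   reuse the same three variables for a second call, then release the
   buffer once.  Return the last used length, or 0 on failure.  */
size_t
model_caller (void)
{
  char *buffer = NULL;
  size_t length = 0;
  size_t allocated = 0;
  size_t result = 0;

  if (model_to_buffer ("abc", 3, &amp;buffer, &amp;length, &amp;allocated)
      &amp;&amp; model_to_buffer ("a longer text", 13,
                          &amp;buffer, &amp;length, &amp;allocated))
    result = length;

  free (buffer);
  return result;
}
</pre></div>

<p>With a real request, the stand-in call would give way to a function such as
<code>recode_buffer_to_buffer</code>, which takes the <var>request</var> as its first
argument and is handed the same three addresses.
</p>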

<a name="index-recode_005fformat_005ftable"></a>
<p>The following special function is still subject to change:
</p>
<div class="example">
<pre class="example">void recode_format_table (<var>request</var>, <var>language</var>, &quot;<var>name</var>&quot;);
</pre></div>

<p>and is left undocumented for now.
</p>
<hr>
<a name="Task-level"></a>
<div class="header">
<p>
Next: <a href="#Charset-level" accesskey="n" rel="next">Charset level</a>, Previous: <a href="#Request-level" accesskey="p" rel="previous">Request level</a>, Up: <a href="#Library" accesskey="u" rel="up">Library</a> &nbsp; [<a href="Charset-and-Surface-Index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Concept-Index.html#Concept-Index" title="Index" rel="index">Index</a>]</p>
</div>
<a name="Task-level-functions"></a>
<h3 class="section">4.3 Task level functions</h3>
<a name="index-task-level-functions"></a>

<p>The task level functions are used internally by the request level
functions; they allow more explicit control over files and memory
buffers holding input and output to recoding processes.  The interface
specification of task level functions is still subject to change a bit.
</p>
<p>To get started with task level functions, here is a full example of a
program whose sole job is to filter <code>ibmpc</code> code on its standard input
into <code>latin1</code> code on its standard output.  That is, this program has
the same goal as the one from the previous section, but goes about it
a bit differently.
</p>
<div class="example">
<pre class="example">#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;stdbool.h&gt;
#include &lt;recodext.h&gt;

const char *program_name;

int
main (int argc, char *const *argv)
{
  program_name = argv[0];
  RECODE_OUTER outer = recode_new_outer (false);
  RECODE_REQUEST request = recode_new_request (outer);
  RECODE_TASK task;
  bool success;

  recode_scan_request (request, &quot;ibmpc..latin1&quot;);

  task = recode_new_task (request);
  /* An empty name selects standard input or standard output.  */
  task-&gt;input.name = &quot;&quot;;
  task-&gt;output.name = &quot;&quot;;
  success = recode_perform_task (task);

  recode_delete_task (task);
  recode_delete_request (request);
  recode_delete_outer (outer);

  exit (success ? 0 : 1);
}
</pre></div>

<a name="index-RECODE_005fTASK-structure"></a>
<p>The header file <code>&lt;recode.h&gt;</code> declares a <code>RECODE_TASK</code>
structure, which the programmer should use for allocating a variable in
his program.  This <code>task</code> variable is given as a first argument to
all task level functions.  The programmer ought to change and possibly
consult a few fields in this structure, using special functions.
</p>
<ul>
<li> Initialisation functions
<a name="index-initialisation-functions_002c-task"></a>

<a name="index-recode_005fnew_005ftask"></a>
<a name="index-recode_005fdelete_005ftask"></a>
<div class="example">
<pre class="example">RECODE_TASK recode_new_task (<var>request</var>);
bool recode_delete_task (<var>task</var>);
</pre></div>

<p>No <var>task</var> variable may be used in other task level functions
of the recoding library without having first been initialised with
<code>recode_new_task</code>.  There may be many such <var>task</var> variables,
in which case, they are independent of one another and they all need to be
initialised separately.  To avoid memory leaks, a <var>task</var> variable should
not be initialised a second time without calling <code>recode_delete_task</code> to
&ldquo;un-initialise&rdquo; it.  The function <code>recode_new_task</code> accepts a
<var>request</var> argument and associates the request to the task.  In fact,
a task is essentially a set of recoding transformations with the
specification for its current input and its current output.
</p>
<p>The <var>request</var> variable may be scanned before or after the call to
<code>recode_new_task</code>; it does not matter so far.  Immediately after
initialisation, before further changes, the <var>task</var> variable associates
<var>request</var> with empty in-memory buffers for both input and output.
The output buffer will later get allocated automatically on the fly,
as needed, by various task processors.
</p>
<p>Even if a call to <code>recode_delete_task</code> is not strictly mandatory
before ending the program, it is cleaner to always include it.  Moreover,
in some future version of the recoding library, it might become required.
</p>
</li><li> Fields of <code>struct recode_task</code>
<a name="index-task_005frequest-structure"></a>

<p>Here are the fields of a <code>struct recode_task</code> which may be meaningfully
changed, once a <var>task</var> has been initialised by <code>recode_new_task</code>.
In fact, fields are expected to change.  Once again, to access the fields,
you need to include <samp>recodext.h</samp> <em>instead</em> of <samp>recode.h</samp>,
in which case there also is a greater chance that you need to recompile
your programs if a new version of the recoding library gets installed.
</p>
<dl compact="compact">
<dt><code>request</code></dt>
<dd>
<p>The field <code>request</code> points to the current recoding request, but may
be changed as needed between recoding calls, for example when there is
a need to achieve the construction of a resulting text made up of many
pieces, each being recoded differently.
</p>
</dd>
<dt><code>input.name</code></dt>
<dt><code>input.file</code></dt>
<dd>
<p>If <code>input.name</code> is not <code>NULL</code> at start of a recoding, this is
a request that a file by that name be first opened for reading and later
automatically closed once the whole file has been read.  If the file name is
not <code>NULL</code> but an empty string, it means that standard input is to
be used.  The opened file pointer is then held in <code>input.file</code>.
</p>
<p>If <code>input.name</code> is <code>NULL</code> and <code>input.file</code> is not, then
<code>input.file</code> should point to a file already opened for read, which
is meant to be recoded.
</p>
</dd>
<dt><code>input.buffer</code></dt>
<dt><code>input.cursor</code></dt>
<dt><code>input.limit</code></dt>
<dd>
<p>When both <code>input.name</code> and <code>input.file</code> are <code>NULL</code>, three
pointers describe an in-memory buffer containing the text to be recoded.
The buffer extends from <code>input.buffer</code> to <code>input.limit</code>,
yet the text to be recoded only extends from <code>input.cursor</code> to
<code>input.limit</code>.  In most situations, <code>input.cursor</code> starts with
the value that <code>input.buffer</code> has.  (Its value will internally advance
as the recoding goes, until it reaches the value of <code>input.limit</code>.)
</p>
</dd>
<dt><code>output.name</code></dt>
<dt><code>output.file</code></dt>
<dd>
<p>If <code>output.name</code> is not <code>NULL</code> at start of a recoding, this
is a request that a file by that name be opened for write and later
automatically closed after the recoding is done.  If the file name is
not <code>NULL</code> but an empty string, it means that standard output is to
be used.  The opened file pointer is then held in <code>output.file</code>.
If several passes with intermediate files are needed to produce the
recoding, the <code>output.name</code> file is opened only for the final pass.
</p>
<p>If <code>output.name</code> is <code>NULL</code> and <code>output.file</code> is not, then
<code>output.file</code> should point to a file already opened for write, which
will receive the result of the recoding.
</p>
</dd>
<dt><code>output.buffer</code></dt>
<dt><code>output.cursor</code></dt>
<dt><code>output.limit</code></dt>
<dd>
<p>When both <code>output.name</code> and <code>output.file</code> are <code>NULL</code>, three
pointers describe an in-memory buffer meant to receive the text, once it
is recoded.  The buffer is already allocated from <code>output.buffer</code>
to <code>output.limit</code>.  In most situations, <code>output.cursor</code> starts
with the value that <code>output.buffer</code> has.  Once the recoding is done,
<code>output.cursor</code> will point at the next free byte in the buffer,
just after the recoded text, so another recoding could be called without
changing any of these three pointers, for appending new information to it.
The number of recoded bytes in the buffer is the difference between
<code>output.cursor</code> and <code>output.buffer</code>.
</p>
<p>Each time <code>output.cursor</code> reaches <code>output.limit</code>, the buffer
is reallocated to a larger size, possibly at a different location in memory;
<code>output.buffer</code> is always kept up to date.  It is still possible to call a
task level function with no output buffer at all to start with, in which
case all three fields should have <code>NULL</code> as a value.  This is the
situation immediately after a call to <code>recode_new_task</code>.
</p>
</dd>
<dt><code>strategy</code></dt>
<dd><a name="index-strategy"></a>
<a name="index-RECODE_005fSTRATEGY_005fUNDECIDED"></a>
<p>This field, which is of type <code>enum recode_sequence_strategy</code>, tells
how various recoding steps (passes) will be interconnected.  Its initial
value is <code>RECODE_STRATEGY_UNDECIDED</code>, which is a constant defined in
the header file <samp>&lt;recodext.h&gt;</samp>.  Other possible values are:
</p>
<dl compact="compact">
<dt><code>RECODE_SEQUENCE_IN_MEMORY</code></dt>
<dd><a name="index-RECODE_005fSEQUENCE_005fIN_005fMEMORY"></a>
<p>Keep intermediate recodings in memory.
</p></dd>
<dt><code>RECODE_SEQUENCE_WITH_FILES</code></dt>
<dd><a name="index-RECODE_005fSEQUENCE_005fWITH_005fFILES"></a>
<p>Do not fork, use intermediate files.
</p></dd>
<dt><code>RECODE_SEQUENCE_WITH_PIPE</code></dt>
<dd><a name="index-RECODE_005fSEQUENCE_005fWITH_005fPIPE"></a>
<p>Fork processes connected with <code>pipe(2)</code>.
</p></dd>
</dl>

<p>The best for now is to leave this field alone, and let the recoding
library decide its strategy, as many combinations have not been tested yet.
</p>
</dd>
<dt><code>byte_order_mark</code></dt>
<dd><a name="index-byte_005forder_005fmark"></a>
<p>This field, which is preset to <code>true</code>, indicates that a byte order
mark is to be expected at the beginning of any canonical <code>UCS-2</code>
or <code>UTF-16</code> text, and that such a byte order mark should also be
produced for these charsets.
</p>
</dd>
<dt><code>fail_level</code></dt>
<dd><a name="index-fail_005flevel"></a>
<p>This field, which is of type <code>enum recode_error</code> (see <a href="#Errors">Errors</a>),
sets the error level at which task level functions should report a failure.
If a detected error is equal to or greater than <code>fail_level</code>,
the function will eventually return <code>false</code> instead of <code>true</code>.
The preset value for this field is <code>RECODE_NOT_CANONICAL</code>, which means
that, if not reset to another value, the library will report failure on
<em>any</em> error.
</p>
</dd>
<dt><code>abort_level</code></dt>
<dd><a name="index-abort_005flevel"></a>
<a name="index-RECODE_005fMAXIMUM_005fERROR"></a>
<p>This field, which is of type <code>enum recode_error</code> (see <a href="#Errors">Errors</a>), sets
the error level at which task level functions should immediately interrupt
their processing.  If a detected error is equal to or greater than
<code>abort_level</code>, the function returns immediately, but the returned
value (<code>true</code> or <code>false</code>) is still decided from the setting
of <code>fail_level</code>, not <code>abort_level</code>.  The preset value for this
field is <code>RECODE_MAXIMUM_ERROR</code>, which means that, if not reset to
another value, the library will never interrupt a recoding task.
</p>
</dd>
<dt><code>error_so_far</code></dt>
<dd><a name="index-error_005fso_005ffar"></a>
<p>This field, which is of type <code>enum recode_error</code> (see <a href="#Errors">Errors</a>),
maintains the maximum error level met so far while the recoding task
was proceeding.  The preset value is <code>RECODE_NO_ERROR</code>.
</p></dd>
</dl>
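<p>The way <code>output.buffer</code>, <code>output.cursor</code> and
<code>output.limit</code> evolve as output accumulates can be illustrated
with a small model.  This is only a sketch under stated assumptions:
<code>struct out_buffer</code> and <code>out_append</code> are hypothetical
names invented for the illustration, the doubling growth policy is a guess,
and none of this is the library's own code.
</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the three output pointers; this is not
   the structure used by the recoding library.  */
struct out_buffer
{
  char *buffer;                 /* start of the allocated area, or NULL */
  char *cursor;                 /* next free byte */
  char *limit;                  /* one past the end of the allocated area */
};

/* Append SIZE bytes from DATA, enlarging the area whenever the cursor
   would pass the limit.  Return 1 on success, 0 if memory is exhausted.
   The doubling policy is an assumption made for this sketch.  */
static int
out_append (struct out_buffer *out, const char *data, size_t size)
{
  size_t used, room;

  assert (out != NULL);
  if (size == 0)
    return 1;
  assert (data != NULL);

  used = out->buffer ? (size_t) (out->cursor - out->buffer) : 0;
  room = out->buffer ? (size_t) (out->limit - out->cursor) : 0;

  if (size > room)
    {
      size_t capacity = out->buffer ? (size_t) (out->limit - out->buffer) : 0;
      size_t wanted = capacity ? capacity : 16;
      char *moved;

      while (wanted - used < size)
        wanted *= 2;
      moved = realloc (out->buffer, wanted);
      if (moved == NULL)
        return 0;
      /* The area may have moved: refresh all three pointers.  */
      out->buffer = moved;
      out->cursor = moved + used;
      out->limit = moved + wanted;
    }

  memcpy (out->cursor, data, size);
  out->cursor += size;
  return 1;
}
```

<p>Starting with all three pointers set to <code>NULL</code>, successive
calls append to the same area, and the number of bytes stored so far is
always the difference between <code>cursor</code> and <code>buffer</code>.
</p>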

</li><li> Task execution
<a name="index-task-execution"></a>

<a name="index-recode_005fperform_005ftask"></a>
<a name="index-recode_005ffilter_005fopen"></a>
<a name="index-recode_005ffilter_005fclose"></a>
<div class="example">
<pre class="example">recode_perform_task (<var>task</var>);
recode_filter_open (<var>task</var>, <var>file</var>);
recode_filter_close (<var>task</var>);
</pre></div>

<p>The function <code>recode_perform_task</code> reads as much input as possible,
and recodes all of it to the prescribed output, given a properly initialised
<var>task</var>.
</p>
<p>Functions <code>recode_filter_open</code> and <code>recode_filter_close</code> are
only planned for now.  They are meant to read input in piecemeal ways.
Even if functionality already exists informally in the library, it has
not been made available yet through such interface functions.
</p></li></ul>

<hr>
<a name="Charset-level"></a>
<div class="header">
<p>
Next: <a href="#Errors" accesskey="n" rel="next">Errors</a>, Previous: <a href="#Task-level" accesskey="p" rel="previous">Task level</a>, Up: <a href="#Library" accesskey="u" rel="up">Library</a> &nbsp; [<a href="Charset-and-Surface-Index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Concept-Index.html#Concept-Index" title="Index" rel="index">Index</a>]</p>
</div>
<a name="Charset-level-functions"></a>
<h3 class="section">4.4 Charset level functions</h3>
<a name="index-charset-level-functions"></a>

<a name="index-internal-functions"></a>
<p>Many functions are internal to the recoding library.  Some of them
have been made external and available, for the <code>recode</code> program
had to retain all its previous functionality while being transformed
into a mere application of the recoding library.  These functions are
not really documented here for the time being, as we hope that many of
them will vanish over time.  Once this set of routines stabilises,
it would be convenient to document them as an API for handling charset
names and contents.
</p>
<a name="index-find_005fcharset"></a>
<a name="index-list_005fall_005fcharsets"></a>
<a name="index-list_005fconcise_005fcharset"></a>
<a name="index-list_005ffull_005fcharset"></a>
<div class="example">
<pre class="example">RECODE_CHARSET find_charset (<var>name</var>, <var>cleaning-type</var>);
bool list_all_charsets (<var>charset</var>);
bool list_concise_charset (<var>charset</var>, <var>list-format</var>);
bool list_full_charset (<var>charset</var>);
</pre></div>

<hr>
<a name="Errors"></a>
<div class="header">
<p>
Previous: <a href="#Charset-level" accesskey="p" rel="previous">Charset level</a>, Up: <a href="#Library" accesskey="u" rel="up">Library</a> &nbsp; [<a href="Charset-and-Surface-Index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Concept-Index.html#Concept-Index" title="Index" rel="index">Index</a>]</p>
</div>
<a name="Handling-errors"></a>
<h3 class="section">4.5 Handling errors</h3>
<a name="index-error-handling"></a>
<a name="index-handling-errors"></a>

<a name="index-error-messages"></a>
<p>The <code>recode</code> program, while using the <code>recode</code> library, needs to
control whether recoding problems are reported or not, and then reflect
these in the exit status.  The program should also instruct the library
whether the recoding should be abruptly interrupted when an error is
met (so sparing processing when it is known in advance that a wrong
result would be discarded anyway), or if it should proceed nevertheless.
Here is how the library groups errors into levels, listed here in order
of increasing severity.
</p>
<dl compact="compact">
<dt><code>RECODE_NO_ERROR</code></dt>
<dd><a name="index-RECODE_005fNO_005fERROR"></a>

<p>No error was met on previous library calls.
</p>
</dd>
<dt><code>RECODE_NOT_CANONICAL</code></dt>
<dd><a name="index-RECODE_005fNOT_005fCANONICAL"></a>
<a name="index-non-canonical-input_002c-error-message"></a>

<p>The input text was using one of the many alternative codings for some
phenomenon, but not the one <code>recode</code> would have canonically generated.
So, if the reverse recoding is later attempted, it would produce a text
having the same <em>meaning</em> as the original text, yet not being byte
identical.
</p>
<p>For example, a <code>Base64</code> block in which end-of-lines appear elsewhere
than at every 76 characters is not canonical.  An e-circumflex in TeX
which is coded as &lsquo;<samp>\^{e}</samp>&rsquo; instead of &lsquo;<samp>\^e</samp>&rsquo; is not canonical.
</p>
</dd>
<dt><code>RECODE_AMBIGUOUS_OUTPUT</code></dt>
<dd><a name="index-RECODE_005fAMBIGUOUS_005fOUTPUT"></a>
<a name="index-ambiguous-output_002c-error-message"></a>

<p>It has been discovered that if the reverse recoding was attempted on
the text output by this recoding, we would not obtain the original text,
only because an ambiguity was generated by accident in the output text.
This ambiguity would then cause the wrong interpretation to be taken.
</p>
<p>Here are a few examples.  If the <code>Latin-1</code> sequence &lsquo;<samp>e^</samp>&rsquo;
is converted to Easy French and back, the result will be interpreted
as e-circumflex and so, will not reflect the intent of the original two
characters.  Recoding an <code>IBM-PC</code> text to <code>Latin-1</code> and back,
where the input text contained an isolated <kbd>LF</kbd>, will have a spurious
<kbd>CR</kbd> inserted before the <kbd>LF</kbd>.
</p>
<p>Currently, there are many cases in the library where the production of
ambiguous output is not properly detected, as this detection is sometimes
difficult to perform, or to perform speedily.
</p>
</dd>
<dt><code>RECODE_UNTRANSLATABLE</code></dt>
<dd><a name="index-RECODE_005fUNTRANSLATABLE"></a>
<a name="index-untranslatable-input_002c-error-message"></a>

<p>One or more input characters could not be recoded, because there is just
no representation for them in the output charset.
</p>
<p>Here are a few examples.  Non-strict mode often allows <code>recode</code> to
compute on-the-fly mappings for unrepresentable characters, but strict
mode prohibits such attribution of reversible translations: so strict
mode might often trigger such an error.  Most <code>UCS-2</code> codes used to
represent Asian characters cannot be expressed in various Latin charsets.
</p>
</dd>
<dt><code>RECODE_INVALID_INPUT</code></dt>
<dd><a name="index-RECODE_005fINVALID_005fINPUT"></a>
<a name="index-invalid-input_002c-error-message"></a>

<p>The input text does not comply with the coding it is declared to hold.  So,
there is no way by which a reverse recoding would reproduce this text,
because <code>recode</code> should never produce invalid output.
</p>
<p>Here are a few examples.  In strict mode, <code>ASCII</code> text is not allowed
to contain characters with the eighth bit set.  <code>UTF-8</code> encodings
ought to be minimal<a name="DOCF7" href="#FOOT7"><sup>7</sup></a>.
</p>
</dd>
<dt><code>RECODE_SYSTEM_ERROR</code></dt>
<dd><a name="index-RECODE_005fSYSTEM_005fERROR"></a>
<a name="index-system-detected-problem_002c-error-message"></a>

<p>The underlying system reported an error while the recoding was going on,
likely an input/output error.
(This error symbol is currently unused in the library.)
</p>
</dd>
<dt><code>RECODE_USER_ERROR</code></dt>
<dd><a name="index-RECODE_005fUSER_005fERROR"></a>
<a name="index-misuse-of-recoding-library_002c-error-message"></a>

<p>The programmer or user requested something the recoding library is unable
to provide, or used the API wrongly.
(This error symbol is currently unused in the library.)
</p>
</dd>
<dt><code>RECODE_INTERNAL_ERROR</code></dt>
<dd><a name="index-RECODE_005fINTERNAL_005fERROR"></a>
<a name="index-internal-recoding-bug_002c-error-message"></a>

<p>Something really wrong, which should normally never happen, was detected
within the recoding library.  This might be due to genuine bugs in the
library, or maybe due to un-initialised or overwritten arguments to
the API.
(This error symbol is currently unused in the library.)
</p>
</dd>
<dt><code>RECODE_MAXIMUM_ERROR</code></dt>
<dd><a name="index-RECODE_005fMAXIMUM_005fERROR-1"></a>

<p>This error code should never be returned; it is only used internally as
a sentinel for the list of all possible error codes.
</p></dd>
</dl>
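<p>The requirement that <code>UTF-8</code> encodings be minimal, mentioned
under <code>RECODE_INVALID_INPUT</code>, can be made concrete with a short
check.  The sketch below is not taken from the library; the function names
<code>utf8_minimal_length</code> and <code>utf8_is_minimal</code> are
hypothetical, and the check covers only sequence length, so it does not
reject surrogate code points.
</p>

```c
#include <assert.h>
#include <stddef.h>

/* Return the number of bytes a minimal UTF-8 encoding needs for CODE,
   or 0 when CODE lies outside the Unicode range.  */
static size_t
utf8_minimal_length (unsigned long code)
{
  if (code < 0x80)
    return 1;
  if (code < 0x800)
    return 2;
  if (code < 0x10000)
    return 3;
  if (code < 0x110000)
    return 4;
  return 0;
}

/* Decode the single sequence starting at TEXT, with SIZE bytes available.
   Return 1 if it is well formed and uses the shortest possible encoding,
   0 otherwise.  */
static int
utf8_is_minimal (const unsigned char *text, size_t size)
{
  unsigned long code;
  size_t length, index;

  assert (text != NULL);
  if (size == 0)
    return 0;

  /* The lead byte announces the sequence length and holds the top bits.  */
  if (text[0] < 0x80)
    { code = text[0]; length = 1; }
  else if ((text[0] & 0xE0) == 0xC0)
    { code = text[0] & 0x1F; length = 2; }
  else if ((text[0] & 0xF0) == 0xE0)
    { code = text[0] & 0x0F; length = 3; }
  else if ((text[0] & 0xF8) == 0xF0)
    { code = text[0] & 0x07; length = 4; }
  else
    return 0;

  if (length > size)
    return 0;

  /* Each continuation byte must look like 10xxxxxx and adds six bits.  */
  for (index = 1; index < length; index++)
    {
      if ((text[index] & 0xC0) != 0x80)
        return 0;
      code = (code << 6) | (text[index] & 0x3F);
    }

  /* An overlong sequence uses more bytes than the code point requires.  */
  return utf8_minimal_length (code) == length;
}
```

<p>With this check, the two bytes <code>0xC3 0xAA</code> (e-circumflex) are
accepted, while <code>0xC0 0x80</code>, an overlong form of the null
character, is rejected.
</p>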

<a name="index-error-level-threshold"></a>
<a name="index-threshold-for-error-reporting"></a>
<p>One should be able to set the error level threshold for returning failure
at end of recoding, and also the threshold for immediate interruption.
If several errors occur while the recoding proceeds, none of them severe
enough to interrupt the recoding, then the most severe error is retained,
while others are forgotten<a name="DOCF8" href="#FOOT8"><sup>8</sup></a>.  So, in case of an error,
the possible actions currently are:
</p>
<ul>
<li> do nothing and let go, returning success at end of recoding,
</li><li> just let go for now, but return failure at end of recoding,
</li><li> interrupt recoding right away and return failure now.
</li></ul>

<p>See <a href="#Task-level">Task level</a>, and particularly the description of the fields
<code>fail_level</code>, <code>abort_level</code> and <code>error_so_far</code>, for more
information about how errors are handled.
</p>
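<p>The interplay of the two thresholds with the retained error can be
sketched with a small stand-alone model.  Everything named
<code>sketch_</code> or <code>SKETCH_</code> below is hypothetical: the
enumeration merely mirrors the order of the levels listed above, and the
two functions only imitate the documented behaviour, they are not the
library's implementation.
</p>

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in mirroring the documented order of severity; the real
   enum recode_error is declared by the library headers.  */
enum sketch_error
{
  SKETCH_NO_ERROR,
  SKETCH_NOT_CANONICAL,
  SKETCH_AMBIGUOUS_OUTPUT,
  SKETCH_UNTRANSLATABLE,
  SKETCH_INVALID_INPUT,
  SKETCH_SYSTEM_ERROR,
  SKETCH_USER_ERROR,
  SKETCH_INTERNAL_ERROR,
  SKETCH_MAXIMUM_ERROR
};

/* Hypothetical task holding only the three error related fields.  */
struct sketch_task
{
  enum sketch_error fail_level;
  enum sketch_error abort_level;
  enum sketch_error error_so_far;
};

/* Record ERROR, keeping only the most severe level met so far.
   Return 1 when processing should be interrupted right away.  */
static int
sketch_note_error (struct sketch_task *task, enum sketch_error error)
{
  assert (task != NULL);
  if (error > task->error_so_far)
    task->error_so_far = error;
  return error >= task->abort_level;
}

/* Walk over COUNT detected errors, stopping early if the abort level is
   reached.  Return 1 for success, 0 for failure; the verdict depends on
   fail_level only, never on abort_level.  */
static int
sketch_run (struct sketch_task *task,
            const enum sketch_error *errors, size_t count)
{
  size_t index;

  for (index = 0; index < count; index++)
    if (sketch_note_error (task, errors[index]))
      break;
  return task->error_so_far < task->fail_level;
}
```

<p>In this model, a task whose <code>fail_level</code> is
<code>SKETCH_INVALID_INPUT</code> still succeeds after meeting an
untranslatable character, whereas one left at
<code>SKETCH_NOT_CANONICAL</code> fails on any error at all.
</p>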

<div class="footnote">
<hr>
<h4 class="footnotes-heading">Footnotes</h4>

<h3><a name="FOOT7" href="#DOCF7">(7)</a></h3>
<p>The minimality of a <code>UTF-8</code> encoding
is guaranteed on output, but currently, it is not checked on input.</p>
<h3><a name="FOOT8" href="#DOCF8">(8)</a></h3>
<p>Another approach would have been
to define the level symbols as masks instead, and to give masks to
threshold setting routines, and to retain all errors; yet I have never
met such a need in practice myself, and so I fear it would be overkill.
On the other hand, it might be interesting to maintain counters about
how many times each kind of error occurred.</p>
</div>



</body>
</html>