<html>
<head>
<title>netrik hacker's manual: file loader</title>
</head>
<body>

<h1 align="center">netrik hacker's manual<br />>========================&lt;</h1>

<p>
[This file contains a description of the file loading module. See hacking.txt or
<a href="hacking.html">hacking.html</a> for an overview of the manual.]
</p>

<h2>load.c</h2>

<p>
Every open input resource has a descriptor associated with it. The descriptor
is created when the resource is opened with init_load().
</p>

<p>
The struct for this descriptor is declared in load.h (a rough sketch is shown
after the list), and contains:
<ul> <li>
 The URL of the resource is stored in "url", which points to a "struct Url"
 containing the (split-up) effective URL.
</li><li>
 "type" is an enum of type "Res_type", describing what kind of resource it is.
 (RES_STDIN for stdin, RES_FILE for local files, RES_HTTP for HTTP connections
 handled by builtin HTTP code, or RES_PIPE for piping data from wget; it may
 also be set to RES_FAIL during the loading process, indicating that the
 loading failed)
</li><li>
 The union "handle" contains data necessary to handle the resource itself,
 e.g. a file handle.
</li><li>
 "buf" is a pointer to a string, where every loaded data block is stored for
 processing.
</li><li>
 "buf_end" contains the current end of the data in the buffer. (Not always the
 whole buffer is filled, especially at EOF.)
</li><li>
 "buf_ptr" points to the current read position inside the buffer.
</li><li>
 The "user_break" flag indicates when loading was interrupted by a user break
 and the document wasn't loaded completely.
</li></ul>
</p>
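
<p>
For illustration, the descriptor might be sketched roughly as below. (This is
only a sketch derived from the description above; the actual declaration in
load.h may use different names and types.)
</p>

<p><pre>
/* Illustrative sketch of the resource descriptor -- not the actual load.h. */
enum Res_type { RES_STDIN, RES_FILE, RES_HTTP, RES_PIPE, RES_FAIL };

struct Resource {
   struct Url    *url;     /* split-up effective URL */
   enum Res_type  type;    /* what kind of resource this is */
   union {
      FILE *file;          /* stream for files, stdin, and pipes */
      int   socket;        /* socket fd for the builtin HTTP code */
   } handle;
   char *buf;              /* buffer holding the current data block */
   char *buf_end;          /* end of the valid data inside buf */
   char *buf_ptr;          /* current read position inside buf */
   int   user_break;       /* set if loading was interrupted by the user */
};
</pre></p>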

<a name="initLoad" id="initLoad">

<h3>init_load()</h3>

<p>
init_load() takes two arguments: a string containing the target URL to be
loaded, and a "struct Url" containing a base URL. (Normally the URL of the file
visible in the pager up to now.) Besides creating the descriptor, it also
allocates the "buf[]" and prepares the resource for reading data.
</p>

<p>
An additional "form" argument can be given to submit the data of an HTML form
along with the HTTP request. See
<a href="hacking-links.html#submit">Submitting</a> in
hacking-links.* for details on this.
</p>

<p>
Calling this function with only a base URL and no target means that a page is
to be reloaded from history.
</p>

<p>
For all other target URLs, first the effective URL is determined. This is done
by <a href="#mergeUrls">merge_urls()</a>, which first splits the
URL up into its components using
<a href="#splitUrl">split_url()</a>, and then merges it with the
base URL components. A pointer to the resulting merged URL is stored in
"res->url".
</p>

<p>
If merging the URLs failed for some reason, the resource type is set to
"RES_FAIL", and init_load() returns normally. Thus, there is no special error
handling necessary in the calling function; all necessary handling is done in
<a href="#load">load()</a>.
</p>

<p>
If the URL is "-", meaning load from stdin, "res->url->proto.type" is set to
PT_INTERNAL (meaning no relative links are allowed, and the page is not to be
kept in history).
</p>

<p>
No URL merging is done if a page from history is reloaded. The split URL passed
as the base is already the destination URL.
</p>

<p>
With the destination URL determined, the resource is opened appropriately,
depending
on what resource type is set in the URL.
</p>

<p>
If it is an HTTP URL, by default
<a href="#initHttp">http_init_load()</a> is used to open a connection
to the server and read the HTTP headers. "res->type" is set to "RES_HTTP".
</p>

<p>
Alternatively, "cfg.wget" may be set (using --no-builtin-http), in which case
"wget" is used instead of the builtin HTTP code.
This is done by init_wget(), which starts wget with the correct
filename, and initiates reading its standard output through a pipe. The popen()
function returns our pipe end as a stream; we use this as the input.
"res->type" is set to "RES_PIPE".
</p>

<p>
FTP URLs are always fetched using wget.
</p>

<p>
Local files are simply opened as the input stream, and "res->type" is set to
"RES_FILE".
</p>

<p>
Things are more complicated, however, as local files are allowed to be
compressed with gzip or bzip2, which is handled transparently, i.e. the
extension ".gz" or ".bz2" needn't be present in the given URL. Thus we have to
check for the alternative file names in a loop, and memorize the first one that
exists.
</p>

<p>
If we have a compressed file name (presently recognized only by the file
extension) -- regardless of whether the complete file name was given or
transparently extended as described above -- we do not read the file directly,
but pipe it from gunzip/bunzip2, just as FTP is piped from wget. The check for
compressed files is done in a second loop, independent of the file name loop.
</p>
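
<p>
As an illustration, the two steps might look roughly like the sketch below.
open_local() and all its details are made up for this example; the actual code
in load.c is organized differently.
</p>

<p><pre>
/* Sketch only: transparent handling of (possibly compressed) local files. */
#include &lt;stdio.h>
#include &lt;string.h>
#include &lt;unistd.h>

enum Res_type { RES_FILE, RES_PIPE, RES_FAIL };

static int has_suffix(const char *s, const char *suf) {
   size_t ls = strlen(s), lf = strlen(suf);
   if (ls &lt; lf)
      return 0;
   return strcmp(s + ls - lf, suf) == 0;
}

static FILE *open_local(const char *path, enum Res_type *type) {
   static const char *suffix[] = { "", ".gz", ".bz2" };
   char name[4096];
   int i;

   /* first loop: try the plain name, then the ".gz" and ".bz2" variants */
   for (i = 0; i &lt; 3; ++i) {
      snprintf(name, sizeof(name), "%s%s", path, suffix[i]);
      if (access(name, R_OK) == 0)
         break;
   }
   if (i == 3) {
      *type = RES_FAIL;
      return NULL;
   }

   /* second step: compressed files (recognized by the extension) are piped
      from gunzip/bunzip2 instead of being read directly */
   if (has_suffix(name, ".gz") || has_suffix(name, ".bz2")) {
      char cmd[4200];
      snprintf(cmd, sizeof(cmd), "%s -c '%s'",
               has_suffix(name, ".bz2") ? "bunzip2" : "gunzip", name);
      *type = RES_PIPE;
      return popen(cmd, "r");
   }

   *type = RES_FILE;
   return fopen(name, "r");
}
</pre></p>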

<p>
Internal URLs presently always mean loading from stdin. (Error pages also use
"PT_INTERNAL", but init_load() returns immediately after setting it, so this
needn't be handled in the switch.) The standard input is reopened as a stream,
and this one is used as the input. As we need some way to read user commands
from the terminal, "stderr" is reopened in place of "stdin". (Some programs
reopen "/dev/tty" instead -- no idea which one is better...) This way the
normal "stdin" descriptor points to a terminal again, while the input file is
read from the pipe that is (hopefully) connected to the original standard
input. "res->type" is set to "RES_STDIN".
</p>

<p>
Things get a bit more complicated if there was no protocol specification in the
URL, and no base was supplied (meaning the URL has to be treated as an absolute
URL in any case). In this case we have to guess whether it is a local file or
an HTTP URL.
</p>

<p>
First we try to open a local file with the path returned by merge_urls().
(Should be identical to the given URL.) This is achieved by handling PT_UNKNOWN
together with PT_FILE.
</p>

<p>
If there is no local file of the given name (and no compressed one either, see
above),
and the URL could be an HTTP URL without the protocol specification (which is
the case if the string doesn't start with '/'), we try to load it as HTTP. As
these URLs are split in another way, we have to prepend the "http://" to the
URL string and call merge_urls() again. Afterwards, we proceed just as with any
other HTTP URL. (Fallthrough to PT_HTTP.)
</p>

<p>
After opening an HTTP connection (regardless of whether it was PT_HTTP or
PT_UNKNOWN recognized as HTTP), we need to check for HTTP redirects. If a
redirect is set (a "Location:" header exists), the loading has to be repeated
with the redirect URL. The new "main_url" is taken from the "Location:" value;
the "base_url" is set to "res->url", i.e. the original absolute target URL, so
the redirect will be relative to that; the "form" is also cleared, as redirects
always use a simple GET. With this new URL(s), the whole process is repeated.
In this reiteration the protocol type always has to be PT_HTTP; redirections to
other resource types are not allowed. All this can happen up to five times;
if it happens more often, it's probably a redirection loop, and we abort.
</p>
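
<p>
In outline, the redirect handling amounts to a loop of the following shape. All
names and call signatures here (including the merge_urls() arguments) are
schematic, not taken from load.c.
</p>

<p><pre>
/* Schematic redirect loop -- helper names and signatures are made up. */
int tries;

for (tries = 0; tries &lt; 5; ++tries) {
   res->url = merge_urls(main_url, base_url, form);
   open_http(res);                    /* connect, send request, parse headers */

   location = redirect_target(res);   /* value of the "Location:" header, if any */
   if (location == NULL)
      break;                          /* no redirect -- done */

   main_url = location;               /* retry with the redirect target ... */
   base_url = res->url;               /* ... relative to the original target URL */
   form     = NULL;                   /* redirects always use a plain GET */
}

if (tries == 5)
   res->type = RES_FAIL;              /* probably a redirection loop: give up */
</pre></p>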

<p>
If opening a file fails, init_load() returns immediately, only setting
"res->type" to "RES_FAIL" and "res->url->proto" to "PT_INTERNAL". "RES_FAIL" is
then handled appropriately in load(); this way, the caller needn't bother about
it. (Except for an additional error message and an error code set in
<a href="hacking-layout.html#parseSyntax">parse_syntax()</a>; see
<a href="hacking-layout.html#warn">Warning Messages</a> in
hacking-layout.* for that.) "PT_INTERNAL" means that the page isn't to be kept
in history, that following relative URLs isn't allowed etc. Thus, the link/page
history handling functions also do not need special handling for the error
pages.
</p>

<p>
HTTP loading errors inside http_init_load() are handled the same way.
</p>

</a>    <!--initLoad-->

<a name="load" id="load">

<h3>load()</h3>

<p>
Data is read by calling load() with the resource descriptor returned by
init_load().
</p>

<p>
Every call to load() reads one data block (of size BUF_SIZE) into a buffer. The
reading function
(<a href="hacking-layout.html#parseSyntax">parse_syntax()</a> or
<a href="#parseHeader">parse_header()</a>) then processes the data, keeping track of the current read
position inside the buffer by "res->buf_ptr"; when it reaches the end of the
data block, load() is called again to read the next block.
</p>

<p>
If "res->type" is "RES_HTTP", read() is used to read a data block from the
socket; otherwise, fread() is used to read data from the input stream. (It
doesn't matter if this stream is a normal file (RES_FILE), stdin (RES_STDIN),
or a pipe (RES_PIPE).)
</p>
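
<p>
A minimal sketch of this dispatch, reusing the descriptor sketch from above.
(BUF_SIZE is the block size mentioned earlier; the return convention is an
assumption, not the real load().)
</p>

<p><pre>
/* Sketch only -- simplified from the description above. */
int load(struct Resource *res) {
   ssize_t got;

   if (res->type == RES_FAIL) {       /* opening failed, or a previous error */
      res->buf_end = res->buf;        /* return an empty buffer, like at EOF */
      res->buf_ptr = res->buf;
      return 0;
   }

   if (res->type == RES_HTTP)         /* builtin HTTP: read from the socket */
      got = read(res->handle.socket, res->buf, BUF_SIZE);
   else                               /* file, stdin, or pipe: read from the stream */
      got = fread(res->buf, 1, BUF_SIZE, res->handle.file);

   if (got &lt; 0)
      got = 0;                        /* treat read errors like EOF here */

   res->buf_end = res->buf + got;     /* mark the end of the valid data ... */
   res->buf_ptr = res->buf;           /* ... and reset the read position */
   return got;
}
</pre></p>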

<p>
"RES_FAIL" means that opening the input resource failed for some reason, or an
error emerged in a previous load() call. In this case, no data is read; an
empty buffer is returned, which normally would mean EOF. This causes
<a href="hacking-layout.html#parseSyntax">parse_syntax()</a> to
generate an empty page, or stops parsing at this point if some data already has
been read before the error occurred. For the latter case, some (little)
additional handling is necessary in parse_syntax(), to ensure that an
appropriate error message is printed and an error code returned to main().
(Causing a keywait before starting the pager.)
</p>

<p>
The data block is stored in the "buf[]" referenced by the descriptor, and
"buf_end" is set to the end of the data inside the buffer; "buf_ptr" is set to
the beginning of the buffer.
</p>

</a>    <!--load-->

<a name="uninitLoad" id="uninitLoad">

<h3>uninit_load()</h3>

<p>
uninit_load() is used to close the input stream, and free the memory used to
read the file.
</p>

<p>
If the input was a pipe created by popen() (RES_PIPE), it needs to be closed
with pclose() instead of fclose(). This function also returns the exit code from
wget, which is necessary to decide whether the load via wget was successful.
</p>

<p>
If the input was an HTTP socket, close() is used instead of fclose().
</p>

<p>
After closing the stream, the input buffer and the "res" struct are freed.
</p>
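
<p>
Schematically (again based on the descriptor sketch from above, not on the
actual code):
</p>

<p><pre>
/* Sketch only: closing the input and releasing the descriptor. */
void uninit_load(struct Resource *res) {
   switch (res->type) {
   case RES_PIPE:                     /* wget / gunzip / bunzip2 pipe */
      if (pclose(res->handle.file) != 0) {
         /* non-zero exit code from wget: the load was not successful */
      }
      break;
   case RES_HTTP:                     /* builtin HTTP: plain socket fd */
      close(res->handle.socket);
      break;
   case RES_FILE:
   case RES_STDIN:
      fclose(res->handle.file);
      break;
   default:                           /* RES_FAIL: nothing was opened */
      break;
   }

   free(res->buf);                    /* free the input buffer ... */
   free(res);                         /* ... and the descriptor itself */
}
</pre></p>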

</a>    <!--uninitLoad-->

<h2>url.c</h2>

<p>
This file contains a couple of functions for handling URLs, which are used
chiefly by the file loading module.
</p>

<a name="splitUrl" id="splitUrl">

<h3>split_url()</h3>

<p>
To allow operating on the URLs and loading the addressed files, the URL string
given by the user or a link needs to be split up into components. split_url()
parses the URL string, and returns the components via a pointer to a newly
allocated "struct Url".
</p>

<p>
The Url struct is used for all following processing steps. It contains the
following data (a rough sketch is shown below):
<ul><li>
 The protocol (resource type) specification string in "proto.str"
</li><li>
 The protocol type as an "enum Protocol" in "proto.type"
</li><li>
 The host name of an HTTP or FTP URL in "host"
</li><li>
 The port number (HTTP/FTP) in "port"
</li><li>
 The complete directory name "dir" (path without file name)
</li><li>
 The file name in "name"
</li><li>
 The parameters for CGI scripts in "params"
</li><li>
 The fragment identifier (anchor) in "frag"
</li><li>
 "full_url", which stores a complete URL string (but for the fragment
 identifier)
</li><li>
 "path", which stores a part of the complete URL, consisting of directory, file
 name, and CGI parameters
</li><li>
 for merged URLs, the "absolute" flag tells whether the URL was generated from
 an absolute URL, or was relative to a previous page (necessary to decide
 whether the document was from the same site when moving through page history)
</li><li>
 for merged URLs, the "local" flag tells whether the URL points to a local
 anchor (in the same document)
</li></ul>
    
Before the URL is actually parsed, various special characters that are not
allowed directly in a URL are escaped. (Note: While escaping some invalid
characters in a URL we get is merely convenient, various control chars *have* to
be escaped to prevent security problems!)
</p>
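
<p>
As an illustration, the fields listed above might be declared roughly like
this. (Sketch only; enum values beyond the ones mentioned in this manual are
guessed, and the actual declaration may differ.)
</p>

<p><pre>
/* Illustrative sketch of the split-URL structure -- not the actual declaration. */
enum Protocol { PT_UNKNOWN, PT_FILE, PT_HTTP, PT_FTP, PT_INTERNAL };

struct Url {
   struct {
      char          *str;    /* protocol specification string, e.g. "http" */
      enum Protocol  type;   /* protocol type */
   } proto;
   char *host;               /* host name (HTTP/FTP) */
   int   port;               /* port number (HTTP/FTP) */
   char *dir;                /* directory part of the path, ending in '/' */
   char *name;               /* file name */
   char *params;             /* CGI parameters */
   char *frag;               /* fragment identifier (anchor) */
   char *full_url;           /* complete URL string, except for the fragment */
   char *path;               /* pointer into full_url: dir + name + params */
   int   absolute;           /* merged URLs: generated from an absolute URL? */
   int   local;              /* merged URLs: does it point to a local anchor? */
};
</pre></p>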

<p>
The escaping is done by building a new URL string in a loop, storing for each
character of the original string either the character itself if it is OK, or
the %hh representation (three chars) if it needs to be escaped. Storing is done
using snprintf, which is overkill in case of unescaped characters, but used for
both cases for consistency. The "size" parameter for snprintf is determined by
the string length minus the already stored length. (We use snprintf(), although
theoretically no overflow is possible here -- but isn't this always the case
with buffer overflows?...)
</p>
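
<p>
A sketch of such an escaping loop (the set of escaped characters shown here is
illustrative, not the exact one used in url.c):
</p>

<p><pre>
/* Sketch only: build an escaped copy of the URL string. */
#include &lt;stdio.h>
#include &lt;stdlib.h>
#include &lt;string.h>

char *escape_url(const char *url) {
   size_t max = 3 * strlen(url) + 1;   /* worst case: every char becomes %hh */
   char *out = malloc(max);
   size_t stored = 0;
   const char *p;

   for (p = url; *p != '\0'; ++p) {
      unsigned char c = (unsigned char) *p;
      /* control chars, space, '"' and '\'' get the %hh form; '\'' has to be
         escaped because the URL may be passed through a shell (wget etc.) */
      if (c &lt; 0x20 || c == ' ' || c == '"' || c == '\'')
         stored += snprintf(out + stored, max - stored, "%%%02X", c);
      else
         stored += snprintf(out + stored, max - stored, "%c", c);
   }
   return out;
}
</pre></p>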

<p>
We need to escape '\'' although it is a valid URL char, as it would end the
shell quoting when loading with wget/gunzip/bunzip2 otherwise.
</p>

<p>
On the other hand, we do not escape the characters designated as "national" in
the URI standard -- while the standard discourages using them directly in a URL
(as it may cause trouble when passing URLs in non-digital form), there is no
reason not to process them digitally once we get them.
</p>

<p>
The parser is very similar to the HTML parser in
<a href="hacking-layout.html#parseSyntax">parse_syntax.c</a> (see
<a href="hacking-layout.html#parsing">Parsing</a> in
hacking-layout.*): The URL string is processed char by char in a loop. In every
iteration, one char is examined, and action is taken (in a switch statement)
depending on what character it is, and in what mode the parser currently is.
</p>

<p>
There is a parsing mode for every URL component. (There are also some
additional ones for constructs like the "://" after the protocol
specification.) Every time we encounter some special character separating
different components, the mode is switched to the one fitting the next
component, and its beginning position (normally the char after the one causing
the mode change) is stored in "word_start". Everything between the previous
"word_start" and the current char is stored to the respective split URL field
of the component parsed up to now, using store_component().
</p>

<p><pre>
url: "http://domain:80/dir1/dir2/name.ext?params#fragment"
             ^     ^url_char
             word_start
parse_mode: PM_HOST
</pre></p>

<p><pre>
url: "http://domain:80/dir1/dir2/name.ext?params#fragment"
           url_char^^
                    word_start
parse_mode: PM_PORT
components->host: "domain"
</pre></p>

<p>
If the separating char is not the one that would introduce the component
normally following now, that means that component is missing, and we
immediately have to proceed with the next one.
</p>

<p><pre>
url: "http://domain/dir1/dir2/name.ext#fragment"
                              url_char^^
                                       word_start
parse_mode: PM_PARAMS
</pre></p>

<p>
This is done by setting "recycle", meaning that the current character is to be
parsed again. In this new iteration the parser will see the separating char
again, thus introducing a second mode change. store_component() will store a
NULL string for that component, and the parser will go on with parsing the next
one.
</p>

<p><pre>
url: "http://domain/dir1/dir2/name.ext#fragment"
                              url_char^^
                                       word_start
parse_mode: PM_FRAG
components->params: NULL
</pre></p>

<p>
The first mode change is a bit more tricky: At the beginning of the URL, we do
not know if it is a fully qualified one (starting with a protocol
specification), or a relative URL without a protocol. At first we assume that
it starts with the protocol. If a ":" follows the first word, our guess was
right, and we proceed normally with the host.
</p>

<p><pre>
url: "http://domain:80/dir1/dir2/name.ext?params#fragment"
  url_char^^
           word_start
parse_mode: PM_PROTO_END1
components->proto.str: "http"
</pre></p>

<p>
If any other separating char occurs instead, we have to skip protocol, host,
and port, and switch immediately to path parsing. We also do a "recycle" then,
as the current char needs to be parsed in "PM_PATH" mode.
</p>

<p><pre>
url: "dir1/dir2/name.ext?params#fragment"
          ^url_char
      ^word_start
parse_mode: PM_PATH
</pre></p>

<p>
Path parsing is also a bit more complicated, as the directory name and the file
name are stored separately. For that purpose, while parsing the path we keep
track of where the last '/' was (in "name_start"); everything before it
(inclusive) belongs to the directory, and what follows it is the file name.
</p>

<p><pre>
url: "http://domain:80/dir1/dir2/name.ext?params#fragment"
                      ^          ^       ^url_char
             word_start          name_start
                      |&lt;- dir ->|
                                 |&lt;name>|
parse_mode: PM_PARAMS
components->dir: "/dir1/dir2/"
components->name: "name.ext"
</pre></p>

<p>
There is one exception to this: If the last part is either "." or "..", then it
is *not* a file name, but also part of the directory name. The whole path is
stored as the directory name (and file name is left empty), after appending a
'/' character, to prevent problems in further processing steps. (Which rely on
the path ending in '/'.)
</p>

<p>
"full_url" and "path" aren't filled in split_url(), as this URL won't be used
directly; they are only necessary for the final URL created in
<a href="#mergeUrls">merge_urls()</a>. The "absolute" and
"local" flags are also set only in merge_urls().
</p>

<p>
If an error occurs during URL parsing (either the protocol specification
contains an unknown protocol type, or an unexpected character is encountered),
split_url() sets "proto.type" to "PT_INTERNAL" and immediately returns. As
"PT_INTERNAL" normally can't ever be generated in split_url(), this is how the
caller knows an error occurred. Misusing "PT_INTERNAL" for this is surely a bit
confusing; it has a big advantage, though: The created (empty) pages are
correctly handled as temporary by all the page history functions without
needing any exception handling.
</p>

</a>    <!--splitUrl-->

<a name="mergeUrls" id="mergeUrls">

<h3>merge_urls()</h3>

<p>
When loading a relative URL using the ":e" command, or when following a link,
the absolute URL of the target needs to be determined by combining the URL of
the current page with the given relative URL; and for both relative and
absolute URLs, fields not given in the URL(s) need to be set to default values.
All that is done by merge_urls().
</p>

<p>
This function takes a base URL supplied as a "Url" struct (already split up),
and a main URL given as a string and split up from within merge_urls(); it
returns a newly allocated "Url" struct pointer. Base being NULL means to treat
the main URL as an absolute one.
</p>

<p>
merge_urls() also takes an optional "form" parameter. This string contains the
(URL encoded) data of a form to be submitted to the HTTP server as part of the
URL. ("GET" method.) If present, it is stored in the resulting URL in place of
any other CGI parameters.
</p>

<p>
Apart from a few exceptions, merging is done component by component. If a
component is present in the main URL, it is taken from there; if it's not,
either it is taken from "base_url", or a default value is used if no "base_url"
is given. After the first component was taken from "main_url", "base_url" is no
longer used; all following components have to be specified, or default values
are taken. This is achieved by setting "base_url" to NULL.
</p>

<p>
The "port" component has no test on its own -- it is always taken from where
the "host" is taken.
</p>
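
<p>
As a fragment from a hypothetical merging function, handling of the host (and
with it the port) could look like this; the real merge_urls() may be structured
differently:
</p>

<p><pre>
/* Illustrative fragment -- not the actual merge_urls() code. */
if (main_url->host != NULL) {
   url->host = strdup(main_url->host);
   url->port = main_url->port;
   base_url  = NULL;             /* from now on, the base URL is no longer used */
} else if (base_url != NULL) {
   url->host = strdup(base_url->host);
   url->port = base_url->port;   /* the port is always taken along with the host */
} else {
   url->host = NULL;             /* default value */
   url->port = 80;               /* hypothetical default */
}
</pre></p>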

<p>
The handling of "dir" is a bit more complicated: If "main_url" contains a
relative "dir" (not starting with '/'), and a "base_url" is given, the new
"dir" has to be created by joining both. (Concatenating the one in "main_url"
to the one in "base_url".)
</p>

<p>
A relative path starting with "." and/or ".." needs additional handling: The "."
is just a dummy and can be simply skipped.
</p>

<p><pre>
base: netrik/test/
merge:            ./../test/forms.html
                  ^^skip
 
base: netrik/test/
merge:            ../test/forms.html
</pre></p>

<p>
The ".." is also skipped, but additionally requires discarding the last
component of the base path:
</p>

<p><pre>
             vvvvv discard
base: netrik/test/
merge:            ../test/forms.html
                  ^^^skip
 
base: netrik/
merge:       test/forms.html
</pre></p>

<p>
This procedure needs to be repeated if the merge path has more than one
occurrence of "..". It is only stopped when there are no more components to
discard in the base path; the remaining ".." occurrences are kept in this case.
</p>
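
<p>
A sketch of this step (collapse_dots() is a made-up name, and the string
handling here is simplified compared to the actual code):
</p>

<p><pre>
/* Sketch only: strip leading "./" and "../" from the relative merge path,
   discarding one trailing component of the base dir per "../". */
#include &lt;string.h>

static void collapse_dots(char *base_dir, const char **merge_path) {
   const char *m = *merge_path;

   for (;;) {
      if (strncmp(m, "./", 2) == 0) {       /* "." is just a dummy: skip it */
         m += 2;
      } else if (strncmp(m, "../", 3) == 0) {
         size_t len = strlen(base_dir), i;
         char *cut = NULL;
         if (len == 0)
            break;                          /* no base left: keep the ".." */
         for (i = 0; i + 1 &lt; len; ++i)      /* last '/' before the trailing one */
            if (base_dir[i] == '/')
               cut = base_dir + i + 1;
         if (cut != NULL)
            *cut = '\0';                    /* discard the last base component */
         else
            base_dir[0] = '\0';             /* only one component was left */
         m += 3;                            /* and skip the "../" */
      } else {
         break;                             /* nothing more to collapse */
      }
   }
   *merge_path = m;
}
</pre></p>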

<p>
(The standard actually requires handling "." and ".." occurring *anywhere*
inside the merged URL; however, we do not see any point in adding a
considerable amount of extra code for handling such nonsensical cases.
Especially as most servers can cope with passed "." and ".." components
anyway -- we didn't handle them at all for a long time, and there were seldom
any problems.)
</p>

<p>
There is also an exception regarding the default value if
nothing was supplied: For local (or unknown) URLs there is no default dir (the
current directory is used), while HTTP and other protocols use the root dir
('/') as the default.
</p>

<p>
Handling of CGI parameters works as the other components, except that if "form"
is present, this is stored as "url->params", and both the params in the main
and base URLs are ignored. The form data string is stored just as if it were part
of the main URL -- the string is copied and "base_url" is set to NULL. This
ensures that form submits are always handled as new documents ("local" flag not
set), even if the submit URL is otherwise identical to the current URL, and
that any fragment identifier from the old URL is discarded.
</p>

<p>
After merging the URLs, the "full_url" and "path" components have to be set.
The "full_url" is created by concatenating all components (and separators),
except for the fragment identifier. (Concatenating is done by the str_append()
function created for that. Maybe we should try to use asprintf() or something
instead?) "path" is simply a pointer to the starting position of the "dir"
(and/or following) component(s) inside the "full_url" string. Thus it contains
the directory, file name, and CGI parameters.
</p>

<p>
The "absolute" flag is determined after merging the protocol specification. If
"base_url" is NULL at this point, we know we have an absolute URL: Either it
was NULL from the beginning (meaning absolute in any case), or it was reset
when merging the protocol, because "main_url" contains a protocol specification
-- also meaning an absolute URL.
</p>

<p>
Setting the "local" flag is very similar: If, after all components except the
fragment identifier have been merged, "base_url" is still not NULL, we know
that "main_url" didn't contain any components up to now; it *can* only consist
of a fragment identifier, meaning it references a local anchor.
</p>

<p>
If an error occurred during URL splitting, merge_urls() only prints an
additional error message, and proceeds normally. The result is that the
"PT_INTERNAL" indicating the error is stored in the merged URL, and can be
handled by the caller, just as for split_url() itself.
</p>

</a>    <!-- mergeUrls -->

<a name="freeUrl" id="freeUrl">

<h3>free_url()</h3>

<p>
This function destroys a split URL structure by freeing the memory used by all
the component strings, and afterwards the struct itself.
</p>

</a>

<h2>http.c</h2>

<p>
The functions for handling HTTP resources are a bit more complicated, and
reside in a source file on their own.
</p>

<a name="initHttp" id="initHttp">

<h3>http_init_load()</h3>

<p>
http_init_load() is called from init_load(). It prepares the handle, and opens
the HTTP connection. This includes looking up the IP address, opening a socket,
connecting to the desired server, creating and sending an HTTP request for the
desired page, and reading/parsing the HTTP header of the file returned by the
server.
</p>

<p>
Most of the work is done in get_http_socket(), which is called from
http_init_load() after creating the HTTP handle.
</p>

<p>
First, this function checks for a possible proxy to send the request to.
</p>

<p>
Having this, it creates the connect structure "sap", and then looks up the IP
address of the connect server -- which is the proxy if one is present, or the
target host otherwise. This, among other data, is stored in "sap".
</p>

<p>
The next step is opening a socket using socket(), and then establishing a TCP
connection to the desired server using the "sap" structure prepared before.
</p>
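
<p>
In outline this is the usual socket setup sequence. The sketch below uses the
modern getaddrinfo() interface; the actual code and its "sap" structure may
well look different.
</p>

<p><pre>
/* Sketch only: open a TCP connection to host:port. */
#include &lt;netdb.h>
#include &lt;string.h>
#include &lt;sys/socket.h>
#include &lt;unistd.h>

static int open_tcp(const char *host, const char *port) {
   struct addrinfo hints, *list, *ai;
   int fd = -1;

   memset(&amp;hints, 0, sizeof(hints));
   hints.ai_family   = AF_UNSPEC;
   hints.ai_socktype = SOCK_STREAM;

   if (getaddrinfo(host, port, &amp;hints, &amp;list) != 0)      /* look up the IP address */
      return -1;

   for (ai = list; ai != NULL; ai = ai->ai_next) {
      fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
      if (fd &lt; 0)
         continue;
      if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)  /* TCP connect */
         break;
      close(fd);
      fd = -1;
   }
   freeaddrinfo(list);
   return fd;        /* connected socket, or -1 on failure */
}
</pre></p>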

<p>
Finally, the HTTP request is constructed in get_http_cmd(), and submitted over
the socket.
</p>

<p>
get_http_cmd() normally just puts together a very simple HTTP request, which
consists only of the request line with the path (or, for proxies, the full
URL), and a "Host:" header containing the target host. (This is required in
HTTP/1.1, to allow multiple hosts on one IP.)
</p>
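
<p>
For a plain page load, the resulting request boils down to something like the
snippet below. (The HTTP version, buffer handling, and variable names are
illustrative, not copied from http.c.)
</p>

<p><pre>
/* Sketch of the minimal request described above -- not the actual get_http_cmd(). */
char request[2048];

snprintf(request, sizeof(request),
         "GET %s HTTP/1.0\r\n"    /* request line: path, or full URL for a proxy */
         "Host: %s\r\n"           /* target host, to allow multiple hosts per IP */
         "\r\n",
         proxy != NULL ? url->full_url : url->path,
         url->host);
</pre></p>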

<p>
Things get slightly more interesting when some form data is to be POSTed. (GET
needn't be handled here, as the form data is already encoded into the URL
before calling http_init_load() in this case.) Besides adding the
"Content-Type:" and "Content-Encoding:" header fields, the form data has to be
submitted inside the body of the request. For that purpose, the form data is
encoded using
<a href="hacking-links.html#mimeEncode">mime_encode()</a> or
<a href="hacking-links.html#urlEncode">url_encode()</a> (see
<a href="hacking-links.html">hacking-links.*</a>) -- depending on
the desired encoding stored in the "method" field of the form item -- and
stored in "form_data", which is a string submitted at the end of the request.
</p>

<p>
During the whole execution time of get_http_socket(), user breaks are enabled,
so that all the slow functions (DNS lookup, connecting to server, submitting
request), which are often even waiting for a timeout on failure, can be
interrupted.
</p>

<p>
After the connection has been established with get_http_socket(),
parse_header() is used to read and parse the HTTP headers.
</p>

<p>
Errors in HTTP loading are handled by setting "RES_FAIL" and "PT_INTERNAL",
just like file loading errors are handled in init_load().
</p>

<a name="parseHeader" id="parseHeader">

<h3>parse_header()</h3>

<p>
The parser is very similar to the one in
<a href="hacking-layout.html#parseSyntax">parse_syntax()</a> (see
<a href="hacking-layout.html#parsing">Parsing</a> in
<a href="hacking-layout.html">hacking-layout.*</a>): The input is
processed character-wise by an FSM parser, where each char is read from a
buffer which is refilled (more or less) transparently each time all chars have
been processed.
</p>

<p>
The parser first skips the status line (we do not care about the return code,
for now...), and afterwards parses all header lines, extracting the name and
value of each, and storing them inside the "headers" structure of the HTTP
handle so they can be used later. (Presently, the only one used is the
"Location:" header, which is checked in
<a href="#initLoad">init_load()</a> to handle redirections.)
</p>

<p>
The parser tries to be as tolerant as possible about broken (or unknown...)
syntax. To that end, '\r' characters are completely ignored, so both the
DOS-like '\r\n' line ends and the unix-like '\n' work. Illegal characters in the
header name are just ignored, as well as a missing space after the ':'
separating header name and value, and spaces at the beginning of a line which
can't be a folding.
</p>

<p>
There are a couple of both correct and broken example files in the test/
directory. Use them by piping the contents to tcplisten and pointing netrik to
the listening port. (You can also concatenate some HTML file to avoid "No data"
errors.)
</p>

<p>
If the data (from the TCP socket) ends before the whole header has been parsed,
this is treated as an HTTP loading error, except when a user break was
performed, in which case parse_header() simply returns without taking any
further action.
</p>

</a>    <!-- parseHeader -->

</a>    <!-- initHTTP -->

<h2>User Breaks</h2>

<p>
The loading of a file or HTTP page can be interrupted by sending SIGINT.
interrupt.c contains a couple of functions to facilitate the SIGINT handling.
Those are called from various places in load.c and http.c; main() calls
init_int() at startup, which sets some constants used in the other functions.
</p>

<p>
At the beginning of init_load(), hold_int() is called. This function uses
sigprocmask() to block SIGINT. The signal is put on hold, i.e. it doesn't have
its usual effect of aborting the program anymore, but it's also not discarded;
instead, it's stored, and awaits its release.
</p>

<p>
The signal is released using enable_load(), which is called in two places:
during read() or fread() in load(), and during get_http_socket() in
http_init_load(). (These are the functions which may take fairly long, and need
a way to be interrupted directly.)
</p>

<p>
enable_load() sets int_handler() as the handler for SIGINT (using sigaction()),
and then unblocks the signal with sigprocmask(). However, not only newly
arriving signals will invoke int_handler() now; if a signal was sent between
hold_int() and enable_int(), it will be delivered after enable_int(), too. This
way, every SIGINT during the whole loading process will cause a break, not only
those sent during the periods where they can be handled.
</p>

<p>
Before calling enable_load(), a return point is set using setjmp(), and stored
in "label_int". int_handler() does nothing but immediately jump to that
return point with longjmp(). That means that when an interrupt occurs before or
during the read()/fread() in load(), the signal handler jumps to another
position in load(). Here, "res->user_break" is set to indicate the interrupt,
and load() returns, instead of continuing the read(). Likewise in
http_init_load().
</p>
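
<p>
The whole mechanism can be summarized in a short sketch. (The sketch uses the
sigsetjmp()/siglongjmp() variants of the setjmp()/longjmp() idea described
above, and the helper bodies are illustrative -- the actual interrupt.c differs
in detail.)
</p>

<p><pre>
/* Sketch of the SIGINT handling described above -- not the actual interrupt.c. */
#include &lt;setjmp.h>
#include &lt;signal.h>
#include &lt;string.h>

static sigjmp_buf label_int;

static void int_handler(int sig) {
   (void) sig;
   siglongjmp(label_int, 1);          /* jump back to the point set in load() */
}

static void hold_int(void) {          /* block SIGINT: put it on hold */
   sigset_t set;
   sigemptyset(&amp;set);
   sigaddset(&amp;set, SIGINT);
   sigprocmask(SIG_BLOCK, &amp;set, NULL);
}

static void enable_load(void) {       /* install the handler, then unblock SIGINT */
   struct sigaction sa;
   sigset_t set;

   memset(&amp;sa, 0, sizeof(sa));
   sa.sa_handler = int_handler;
   sigemptyset(&amp;sa.sa_mask);
   sigaction(SIGINT, &amp;sa, NULL);

   sigemptyset(&amp;set);
   sigaddset(&amp;set, SIGINT);
   sigprocmask(SIG_UNBLOCK, &amp;set, NULL);   /* pending SIGINTs are delivered now */
}

/* schematic use inside load(): */
if (sigsetjmp(label_int, 1) == 0) {
   enable_load();
   /* ... read() / fread() ... */
} else {
   res->user_break = 1;               /* we only get here via siglongjmp() */
}
hold_int();                           /* put SIGINT on hold again */
</pre></p>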

<p>
At the end of load() and http_init_load(), hold_int() is called, so SIGINT will
be put on hold again till the next call of load().
</p>

<p>
uninit_load() calls disable_int(). This function uses sigaction() to set the
SIGINT handler to SIG_IGN, and then unblocks the signal. Thus, any SIGINTs
arriving after file loading has finished are discarded. They are only enabled
again in init_load(), when the next file load begins.
</p>

</body>
</html>