<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

<head>
<meta http-equiv="Content-Language" content="en" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta name="author" content="Mikio Hirabayashi" />
<meta name="keywords" content="Hyper Estraier, Estraier, full-text search, web crawler" />
<meta name="description" content="Crawler Guide of Hyper Estraier" />
<link rel="contents" href="./" />
<link rel="alternate" href="cguide-ja.html" hreflang="ja" title="the Japanese version" />
<link rel="stylesheet" href="common.css" />
<link rel="icon" href="icon16.png" />
<link rev="made" href="mailto:mikio@users.sourceforge.net" />
<title>Crawler Guide of Hyper Estraier Version 1</title>
</head>

<body>

<h1>Crawler Guide</h1>

<div class="note">Copyright (C) 2004-2007 Mikio Hirabayashi</div>
<div class="note">Last Update: Tue, 06 Mar 2007 12:05:18 +0900</div>
<div class="navi">[<span class="void">English</span>/<a href="cguide-ja.html" hreflang="ja">Japanese</a>] [<a href="index.html">HOME</a>]</div>

<hr />

<h2 id="tableofcontents">Table of Contents</h2>

<ol>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#tutorial">Tutorial</a></li>
<li><a href="#estwaver">Crawler Command</a></li>
</ol>

<hr />

<h2 id="introduction">Introduction</h2>

<p>This guide describes usage of Hyper Estraier's web crawler.  If you haven't read the <a href="uguide-en.html">user's guide</a> and the <a href="nguide-en.html">P2P guide</a> yet, now is a good moment to do so.</p>

<p><code>estcmd</code> can index files on the local file system only.  Files on remote hosts can be indexed by mounting them via NFS or SMB, but the innumerable web sites on the Internet cannot be mounted that way.  Web crawlers such as <code>wget</code> can prefetch such files to the local disk, but doing so involves high overhead and wastes much disk space.</p>

<p>The command <code>estwaver</code> is useful to crawl arbitrary web sites and to index their documents directly.  <code>estwaver</code> supports not only depth-first and width-first order but also similarity-oriented order, in which documents similar to the specified seed documents are crawled preferentially.</p>

<hr />

<h2 id="tutorial">Tutorial</h2>

<p>The first step is to create the crawler root directory, which contains a configuration file and some databases.  The following command creates <code>casket</code>, the crawler root directory:</p>

<pre>estwaver init casket
</pre>

<p>By default, the configuration starts crawling at the project page of Hyper Estraier.  Let's try it as is:</p>

<pre>estwaver crawl casket
</pre>

<p>Documents are then fetched one after another and indexed.  To stop the operation, press <code>Ctrl-C</code> on the terminal.</p>

<p>When the operation finishes, there is a directory <code>_index</code> in the crawler root directory.  It is an index which can be handled with <code>estcmd</code> and so on.  Let's search the index with the following command:</p>

<pre>estcmd search -vs casket/_index "hyper estraier"
</pre>

<p>If you want to resume the crawling operation, run <code>estwaver crawl</code> again.</p>

<hr />

<h2 id="estwaver">Crawler Command</h2>

<p>This section describes the specification of <code>estwaver</code>, whose purpose is to index documents on the Web.</p>

<h3>Synopsis and Description</h3>

<p><code>estwaver</code> is an aggregation of sub commands.  The first argument specifies the name of a sub command; the remaining arguments are parsed according to that sub command.  The argument <code>rootdir</code> specifies the crawler root directory, which contains the configuration file and so on.</p>

<dl>
<dt><kbd>estwaver init [-apn|-acc] [-xs|-xl|-xh] [-sv|-si|-sa] <var>rootdir</var></kbd></dt>
<dd>Create the crawler root directory.</dd>
<dd>If -apn is specified, N-gram analysis is performed for European text as well.</dd>
<dd>If -acc is specified, character category analysis is performed instead of N-gram analysis.</dd>
<dd>If -xs is specified, the index is tuned to register fewer than 50000 documents.</dd>
<dd>If -xl is specified, the index is tuned to register more than 300000 documents.</dd>
<dd>If -xh is specified, the index is tuned to register more than 1000000 documents.</dd>
<dd>If -sv is specified, scores are stored as void.</dd>
<dd>If -si is specified, scores are stored as 32-bit integers.</dd>
<dd>If -sa is specified, scores are stored as-is and marked not to be tuned when searching.</dd>
</dl>
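
<p>For example, the following invocation (a sketch combining the options above) creates a crawler root directory tuned for more than 300000 documents, with scores stored as 32-bit integers:</p>

<pre>estwaver init -xl -si casket
</pre>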

<dl>
<dt><kbd>estwaver crawl [-restart|-revisit|-revcont] <var>rootdir</var></kbd></dt>
<dd>Start crawling.</dd>
<dd>If -restart is specified, crawling is restarted from the seed documents.</dd>
<dd>If -revisit is specified, collected documents are revisited.</dd>
<dd>If -revcont is specified, collected documents are revisited and then crawling is continued.</dd>
</dl>
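
<p>For example, the following revisits the documents collected so far in <code>casket</code> and then continues crawling:</p>

<pre>estwaver crawl -revcont casket
</pre>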

<dl>
<dt><kbd>estwaver unittest <var>rootdir</var></kbd></dt>
<dd>Perform unit tests.</dd>
</dl>

<dl>
<dt><kbd>estwaver fetch [-proxy <var>host</var> <var>port</var>] [-tout <var>num</var>] [-il <var>lang</var>] <var>url</var></kbd></dt>
<dd>Fetch a document.</dd>
<dd><var>url</var> specifies the URL of a document.</dd>
<dd>-proxy specifies the host name and the port number of the proxy server.</dd>
<dd>-tout specifies timeout in seconds.</dd>
<dd>-il specifies the preferred language.  By default, it is English.</dd>
</dl>
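
<p>For example, the following sketch fetches the project page of Hyper Estraier through a hypothetical proxy server <code>proxy.example.com</code> on port 8080, with a timeout of 10 seconds:</p>

<pre>estwaver fetch -proxy proxy.example.com 8080 -tout 10 http://hyperestraier.sourceforge.net/
</pre>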

<p>All sub commands return 0 if the operation succeeds, or 1 otherwise.  A running crawler closes the database and finishes when it catches signal 1 (SIGHUP), 2 (SIGINT), 3 (SIGQUIT), or 15 (SIGTERM).</p>
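
<p>For example, a long crawl can be run in the background and later told to finish cleanly by sending one of those signals.  This is a minimal sketch assuming a POSIX-like shell, where <code>$!</code> holds the process ID of the last background job:</p>

<pre>estwaver crawl casket &amp;
sleep 600
kill -TERM $!    # the crawler closes the database and exits
</pre>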

<p>When crawling finishes, there is a directory <code>_index</code> in the crawler root directory.  It is an index which can be used by <code>estcmd</code> and so on.</p>

<h3>Contents of the Crawler Root Directory</h3>

<p>The crawler root directory contains the following files and directories.</p>

<ul>
<li><kbd>_conf</kbd> : configuration file.</li>
<li><kbd>_log</kbd> : log file.</li>
<li><kbd>_meta</kbd> : database file for meta data.</li>
<li><kbd>_queue</kbd> : priority queue of URLs to be crawled.</li>
<li><kbd>_trace/</kbd> : tracking records of crawled URLs.</li>
<li><kbd>_index/</kbd> : index directory.</li>
<li><kbd>_tmp/</kbd> : directory for temporary files.</li>
</ul>

<h3>Configuration File</h3>

<p>The configuration file is composed of lines; each line contains the name of a variable and its value, separated by "<code>:</code>".  The default configuration is as follows.</p>

<pre>seed: 1.5|http://hyperestraier.sourceforge.net/uguide-en.html
seed: 1.0|http://hyperestraier.sourceforge.net/pguide-en.html
seed: 1.0|http://hyperestraier.sourceforge.net/nguide-en.html
seed: 0.0|http://qdbm.sourceforge.net/
proxyhost:
proxyport:
interval: 500
timeout: 30
strategy: 0
inherit: 0.4
seeddepth: 0
maxdepth: 20
masscheck: 500
queuesize: 50000
replace: ^http://127.0.0.1/{{!}}http://localhost/
allowrx: ^http://
denyrx: \.(css|js|csv|tsv|log|md5|crc|conf|ini|inf|lnk|sys|tmp|bak)$
denyrx: \.(zip|tar|tgz|gz|bz2|tbz2|z|lha|lzh)(\?.*)?$
denyrx: ://(localhost|[a-z]*\.localdomain|127\.0\.0\.1)/
noidxrx: /\?[a-z]=[a-z](;|$)
urlrule: \.est${{!}}text/x-estraier-draft
urlrule: \.(eml|mime|mht|mhtml)${{!}}message/rfc822
typerule: ^text/x-estraier-draft${{!}}[DRAFT]
typerule: ^text/plain${{!}}[TEXT]
typerule: ^(text/html|application/xhtml\+xml)${{!}}[HTML]
typerule: ^message/rfc822${{!}}[MIME]
language: 0
textlimit: 128
seedkeynum: 256
savekeynum: 32
threadnum: 10
docnum: 10000
period: 10000s
revisit: 7d
cachesize: 256
#nodeserv: 1|http://admin:admin@localhost:1978/node/node1
#nodeserv: 2|http://admin:admin@localhost:1978/node/node2
#nodeserv: 3|http://admin:admin@localhost:1978/node/node3
logfile: _log
loglevel: 2
draftdir:
entitydir:
postproc:
</pre>

<p>The meaning of each variable is as follows.</p>

<ul>
<li><kbd>seed</kbd> : specifies the weight and the URL of a seed document, separated by "<code>|</code>".  This directive can appear more than once.</li>
<li><kbd>proxyhost</kbd> : specifies the host name of the proxy server.</li>
<li><kbd>proxyport</kbd> : specifies the port number of the proxy server.</li>
<li><kbd>interval</kbd> : specifies the waiting interval between requests (in milliseconds).</li>
<li><kbd>timeout</kbd> : specifies the timeout of each request (in seconds).</li>
<li><kbd>strategy</kbd> : specifies the strategy of the crawling path (0:balanced, 1:similarity, 2:depth, 3:width, 4:random).</li>
<li><kbd>inherit</kbd> : specifies the inheritance ratio of similarity from the parent document.</li>
<li><kbd>seeddepth</kbd> : specifies the maximum depth of seed documents.</li>
<li><kbd>maxdepth</kbd> : specifies the maximum depth of recursion.</li>
<li><kbd>masscheck</kbd> : specifies the standard value for checking mass sites.</li>
<li><kbd>queuesize</kbd> : specifies the maximum number of records in the priority queue.</li>
<li><kbd>replace</kbd> : specifies a regular expression and a replacement string to normalize URLs.  This directive can appear more than once.</li>
<li><kbd>allowrx</kbd> : specifies a regular expression for URLs allowed to be visited.  This directive can appear more than once.</li>
<li><kbd>denyrx</kbd> : specifies a regular expression for URLs denied to be visited.  This directive can appear more than once.</li>
<li><kbd>noidxrx</kbd> : specifies a regular expression for URLs not to be indexed.  This directive can appear more than once.</li>
<li><kbd>urlrule</kbd> : specifies a URL rule (a regular expression and a media type).  This directive can appear more than once.</li>
<li><kbd>typerule</kbd> : specifies a media type rule (a regular expression and a filter command).  This directive can appear more than once.</li>
<li><kbd>language</kbd> : specifies the preferred language (0:English, 1:Japanese, 2:Chinese, 3:Korean, 4:misc).</li>
<li><kbd>textlimit</kbd> : specifies the text size limit (in kilobytes).</li>
<li><kbd>seedkeynum</kbd> : specifies the total number of keywords for seed documents.</li>
<li><kbd>savekeynum</kbd> : specifies the number of keywords saved for each document.</li>
<li><kbd>threadnum</kbd> : specifies the number of threads running in parallel.</li>
<li><kbd>docnum</kbd> : specifies the number of documents to collect.</li>
<li><kbd>period</kbd> : specifies the running time period (with unit suffix s:seconds, m:minutes, h:hours, d:days).</li>
<li><kbd>revisit</kbd> : specifies the revisit span (with unit suffix s:seconds, m:minutes, h:hours, d:days).</li>
<li><kbd>cachesize</kbd> : specifies the maximum size of the index cache (in megabytes).</li>
<li><kbd>nodeserv</kbd> : specifies the ID number and the URL of a node server, separated by "<code>|</code>".  This directive can appear more than once.</li>
<li><kbd>logfile</kbd> : specifies the path of the log file (relative or absolute).</li>
<li><kbd>loglevel</kbd> : specifies the logging level (1:debug, 2:information, 3:warning, 4:error, 5:none).</li>
<li><kbd>draftdir</kbd> : specifies the path of the draft directory (relative or absolute).</li>
<li><kbd>entitydir</kbd> : specifies the path of the entity directory (relative or absolute).</li>
<li><kbd>postproc</kbd> : specifies the postprocessor command for retrieved files.</li>
</ul>

<p><code>allowrx</code>, <code>denyrx</code>, and <code>noidxrx</code> are evaluated in the order in which they appear.  Alphabetical characters are matched case-insensitively.</p>
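
<p>For example, replacing the default <code>allowrx</code> line with a site-specific pattern confines the crawl to a single site.  The following is a minimal sketch of the directives you might change, using the hypothetical site <code>www.example.com</code>; it also skips common image files and stops after 1000 documents or one hour:</p>

<pre>seed: 1.0|http://www.example.com/
allowrx: ^http://www\.example\.com/
denyrx: \.(png|gif|jpg|jpeg)(\?.*)?$
docnum: 1000
period: 1h
</pre>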

<p>Arbitrary filter commands can be specified with <code>typerule</code>.  The interface of a filter command is the same as with the <code>-fx</code> option of <code>estcmd gather</code>.  For example, the following specifies how to process PDF documents.</p>

<pre>typerule: ^application/pdf${{!}}H@/usr/local/share/hyperestraier/filter/estfxpdftohtml
</pre>

<hr />

</body>

</html>

<!-- END OF FILE -->