/usr/share/pyshared/zope/index/text/textindex.txt

Text Indexes
============

Text indexes combine an inverted index and a lexicon to support text
indexing and searching.  A text index can be created without passing
any arguments:

    >>> from zope.index.text.textindex import TextIndex
    >>> index = TextIndex()

By default, it uses an "Okapi" inverted index and a lexicon with a
pipeline consisting of a simple word splitter, a case normalizer,
and a stop-word remover.
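
To make the pipeline idea concrete, here is a minimal standalone sketch
of the three stages applied in order.  It does not use zope.index; the
function names and the tiny stop-word list are hypothetical, and the
real lexicon's splitter and stop-word list may behave differently.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for
# illustration; the real lexicon ships its own list).
STOP_WORDS = frozenset(["a", "and", "the"])


def split_words(text):
    """Split text into runs of word characters."""
    return re.findall(r"\w+", text)


def normalize_case(words):
    """Lower-case every word so matching ignores case."""
    return [w.lower() for w in words]


def remove_stop_words(words):
    """Drop words that appear in the stop-word list."""
    return [w for w in words if w not in STOP_WORDS]


def run_pipeline(text):
    """Split the text, then pass the words through each later stage."""
    words = split_words(text)
    for stage in (normalize_case, remove_stop_words):
        words = stage(words)
    return words


terms = run_pipeline("The quick brown fox jumps over the lazy dog")
```

Case is normalized before stop words are removed, so "The" and "the"
are both dropped by the same lower-case entry in the list.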

We index text using the `index_doc` method:

    >>> index.index_doc(1, u"the quick brown fox jumps over the lazy dog")
    >>> index.index_doc(2,
    ...    u"the brown fox and the yellow fox don't need the retriever")
    >>> index.index_doc(3, u"""
    ... The Conservation Pledge
    ... =======================
    ... 
    ... I give my pledge, as an American, to save, and faithfully
    ... to defend from waste, the natural resources of my Country; 
    ... it's soils, minerals, forests, waters and wildlife.
    ... """)
    >>> index.index_doc(4, u"Fran\xe7ois") 
    >>> word = (
    ...     u"\N{GREEK SMALL LETTER DELTA}"
    ...     u"\N{GREEK SMALL LETTER EPSILON}"
    ...     u"\N{GREEK SMALL LETTER LAMDA}"
    ...     u"\N{GREEK SMALL LETTER TAU}"
    ...     u"\N{GREEK SMALL LETTER ALPHA}"
    ...     )
    >>> index.index_doc(5, word + u"\N{EM DASH}\N{GREEK SMALL LETTER ALPHA}")
    >>> index.index_doc(6, u"""
    ... What we have here, is a failure to communicate.
    ... """)
    >>> index.index_doc(7, u"""
    ... Hold on to your butts!
    ... """)
    >>> index.index_doc(8, u"""
    ... The Zen of Python, by Tim Peters
    ... 
    ... Beautiful is better than ugly.
    ... Explicit is better than implicit.
    ... Simple is better than complex.
    ... Complex is better than complicated.
    ... Flat is better than nested.
    ... Sparse is better than dense.
    ... Readability counts.
    ... Special cases aren't special enough to break the rules.
    ... Although practicality beats purity.
    ... Errors should never pass silently.
    ... Unless explicitly silenced.
    ... In the face of ambiguity, refuse the temptation to guess.
    ... There should be one-- and preferably only one --obvious way to do it.
    ... Although that way may not be obvious at first unless you're Dutch.
    ... Now is better than never.
    ... Although never is often better than *right* now.
    ... If the implementation is hard to explain, it's a bad idea.
    ... If the implementation is easy to explain, it may be a good idea.
    ... Namespaces are one honking great idea -- let's do more of those!
    ... """)

Then we can search using the `apply` method, which takes a search
string:

    >>> [(k, "%.4f" % v) for (k, v) in index.apply(u'brown fox').items()]
    [(1, '0.6153'), (2, '0.6734')]

    >>> [(k, "%.4f" % v) for (k, v) in index.apply(u'quick fox').items()]
    [(1, '0.6153')]

    >>> [(k, "%.4f" % v) for (k, v) in index.apply(u'brown python').items()]
    []

    >>> [(k, "%.4f" % v) for (k, v) in index.apply(u'dalmatian').items()]
    []

    >>> [(k, "%.4f" % v) for (k, v) in index.apply(u'brown or python').items()]
    [(1, '0.2602'), (2, '0.2529'), (8, '0.0934')]

    >>> [(k, "%.4f" % v) for (k, v) in index.apply(u'butts').items()]
    [(7, '0.6948')]

The outputs are mappings from document ids to float scores. Items
with higher scores are more relevant.
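
Because a result is a mapping rather than a ranked list, callers
usually sort it themselves.  The following sketch uses a plain dict as
a stand-in for the returned mapping; the helper name `rank` is
hypothetical, and the scores are copied from the u'brown fox' query
above.

```python
def rank(results):
    """Return (doc_id, score) pairs, best match first.

    Ties are broken by ascending document id so the order is stable.
    """
    return sorted(results.items(), key=lambda item: (-item[1], item[0]))


# Plain dict standing in for the mapping returned by a search.
scores = {1: 0.6153, 2: 0.6734}
ranked = rank(scores)
best_doc_id = ranked[0][0]
```

Here document 2 ranks first because it has the higher score.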

We can use unicode characters in search strings.

    >>> [(k, "%.4f" % v) for (k, v) in index.apply(u"Fran\xe7ois").items()]
    [(4, '0.7427')]

    >>> [(k, "%.4f" % v) for (k, v) in index.apply(word).items()]
    [(5, '0.7179')]

We can use globbing in search strings.

    >>> [(k, "%.3f" % v) for (k, v) in index.apply('fo*').items()]
    [(1, '2.179'), (2, '2.651'), (3, '2.041')]
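
One way to picture globbing is as an expansion of the pattern against
the words the lexicon already knows, followed by a search for any of
the matching words.  The sketch below shows only the expansion step,
using the standard library's `fnmatch` module; it is an illustration
under that assumption, not the lexicon's actual implementation, and
`expand_glob` and the sample vocabulary are hypothetical.

```python
import fnmatch


def expand_glob(pattern, vocabulary):
    """Return the known words that match a shell-style pattern."""
    return sorted(w for w in vocabulary if fnmatch.fnmatchcase(w, pattern))


# Hypothetical slice of the vocabulary built from the documents above.
vocabulary = {"fox", "forests", "brown", "quick", "face"}
matches = expand_glob("fo*", vocabulary)
```

With this vocabulary, 'fo*' expands to "forests" and "fox", which is
consistent with documents 1, 2 and 3 all appearing in the result above.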

Text indexes support basic statistics:

    >>> index.documentCount()
    8
    >>> index.wordCount()
    114
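
As a rough mental model for these statistics, the toy inverted index
below maps each term to the set of documents containing it.  The class
and its method names are hypothetical, and it assumes that the word
count means the number of distinct terms; it is not the Okapi index
used by `TextIndex`.

```python
from collections import defaultdict


class ToyInvertedIndex:
    """Toy inverted index: each term maps to the ids of its documents."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> set of doc ids
        self._doc_ids = set()

    def index_doc(self, doc_id, terms):
        """Record every term of the document in the postings."""
        self._doc_ids.add(doc_id)
        for term in terms:
            self._postings[term].add(doc_id)

    def document_count(self):
        """Number of documents that have been indexed."""
        return len(self._doc_ids)

    def word_count(self):
        """Number of distinct terms seen so far (assumed meaning)."""
        return len(self._postings)


toy = ToyInvertedIndex()
toy.index_doc(1, ["quick", "brown", "fox"])
toy.index_doc(2, ["brown", "fox", "yellow", "fox"])
```

Repeated terms such as "fox" count once, so this toy index reports two
documents and four words.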

If we index the same document twice, first with an empty value and
then with a normal value, it should still work:

    >>> index2 = TextIndex()
    >>> index2.index_doc(1, [])
    >>> index2.index_doc(1, ["Zorro"])
    >>> [(k, "%.4f" % v) for (k, v) in index2.apply("Zorro").items()]
    [(1, '0.4545')]


Tracking Changes
================

The first time we index a document, the `_totaldoclen` of the
underlying index is updated:

    >>> index = TextIndex()
    >>> index.index._totaldoclen()
    0
    >>> index.index_doc(100, u"a new funky value")
    >>> index.index._totaldoclen()
    3

If we index it a second time with the same text, the total length of
the underlying index should not change:

    >>> index.index_doc(100, u"a new funky value")
    >>> index.index._totaldoclen()
    3

But if we change the text, the length changes too:

    >>> index.index_doc(100, u"an even newer funky value")
    >>> index.index._totaldoclen()
    5

The same applies to `unindex_doc`: unindexing a document removes its
length from the total, and unindexing a document that is not indexed
should not change the state of any index.

    >>> index.unindex_doc(100)
    >>> index.index._totaldoclen()
    0

    >>> index.unindex_doc(100)
    >>> index.index._totaldoclen()
    0
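
The bookkeeping shown in this section can be sketched with a small
standalone class.  This is a toy model under stated assumptions, not
the code inside `TextIndex`: the class name `LengthTracker` is
hypothetical, and documents are passed as term lists that are assumed
to have already been through the lexicon.

```python
class LengthTracker:
    """Toy stand-in that tracks the summed length of indexed documents."""

    def __init__(self):
        self._doc_terms = {}   # doc id -> list of terms
        self.total_length = 0

    def index_doc(self, doc_id, terms):
        """Index or reindex a document, adjusting the total length."""
        terms = list(terms)
        old = self._doc_terms.get(doc_id)
        if old == terms:
            return  # unchanged content: nothing to update
        if old is not None:
            self.total_length -= len(old)
        self._doc_terms[doc_id] = terms
        self.total_length += len(terms)

    def unindex_doc(self, doc_id):
        """Remove a document; unknown ids are silently ignored."""
        old = self._doc_terms.pop(doc_id, None)
        if old is not None:
            self.total_length -= len(old)


tracker = LengthTracker()
tracker.index_doc(100, ["new", "funky", "value"])
length_after_first = tracker.total_length
tracker.index_doc(100, ["new", "funky", "value"])
length_after_repeat = tracker.total_length
tracker.index_doc(100, ["an", "even", "newer", "funky", "value"])
length_after_change = tracker.total_length
tracker.unindex_doc(100)
tracker.unindex_doc(100)  # second call is a no-op
length_after_unindex = tracker.total_length
```

The term lists are chosen so that the lengths mirror the values in the
examples above: 3, then 3 again, then 5, and finally 0.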