Indexers

Greenstone uses three indexing tools to index collections: MG, MGPP and Lucene. The index files are stored in the folder collect/collect-name/index.

MG

MG is the original indexer used by Greenstone, and is described in the book "Managing Gigabytes". It does section-level indexing and compresses the source documents. MG is implemented in C.

Search type: plain

The level information is specified in the index definition. Search and retrieval can be done only at document level or at section level. For the demo collection, the following index definition makes MG create three separate physical indexes:

indexes document:text section:text section:Title

document:text contains the text of each document, section:text contains the text of each section, and section:Title contains the title of each section.

The document and section parts determine the granularity of the searching and of the items retrieved. The document index returns a list of document numbers; the two section indexes return section numbers.
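The level:field structure of the index definition above can be illustrated with a short sketch. This is not Greenstone code; the function name parse_mg_indexes is hypothetical, and the sketch only assumes that each entry has the form level:field, as in the demo collection line.

```python
def parse_mg_indexes(line):
    """Split an 'indexes' line into (level, field) pairs.

    Hypothetical helper for illustration; not part of Greenstone.
    """
    parts = line.split()
    if not parts or parts[0] != "indexes":
        raise ValueError("expected a line starting with 'indexes'")
    pairs = []
    for spec in parts[1:]:
        # "section:Title" -> level "section", field "Title"
        level, _, field = spec.partition(":")
        pairs.append((level, field))
    return pairs


specs = parse_mg_indexes("indexes document:text section:text section:Title")
for level, field in specs:
    print(f"{field} is indexed at {level} level")
```

Each pair corresponds to one physical index, which is why the demo collection ends up with three of them.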

MG uses a document-level index rather than a word-level index, so it cannot do phrase searching or proximity searching directly.

MG can do:

compressed text
case folding
stemming
Boolean (AND OR NOT) or ranked searches (but not both at once)
phrase searching, in a limited form: Greenstone runs an AND search with MG and then post-processes the results to find the phrase
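The phrase-searching workaround in the list above can be sketched as follows. This is a minimal illustration of the general technique (AND search on a document-level index, then a scan of the candidates), not the code Greenstone or MG actually runs; all function names and the sample documents are made up for the example.

```python
import re


def tokenize(text):
    # Lower-casing stands in for case folding.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_doc_index(docs):
    """Document-level index: word -> set of document numbers (no positions)."""
    index = {}
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index.setdefault(word, set()).add(doc_id)
    return index


def phrase_search(phrase, docs, index):
    words = tokenize(phrase)
    if not words:
        return []
    # Step 1: AND search. Keep documents that contain every word.
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    # Step 2: post-process. Scan each candidate for the words in sequence.
    hits = []
    n = len(words)
    for doc_id in sorted(candidates):
        tokens = tokenize(docs[doc_id])
        if any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1)):
            hits.append(doc_id)
    return hits


docs = {
    1: "Managing gigabytes of text",
    2: "Gigabytes of data need managing",
    3: "Text compression",
}
index = build_doc_index(docs)
print(phrase_search("managing gigabytes", docs, index))
```

Documents 1 and 2 both pass the AND step, but only document 1 survives the post-processing step, because the index itself records no word positions.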

Command line for running an MG query, using either mgquery or the new Java Queryer program:

mgquery -f <indexdir> -t <textdir>
java org.greenstone.mg.Queryer <basedir> <indexdir> <textdir> [-h]

where indexdir and textdir are the paths to the index files and the compressed text files, without the filename extension, e.g. collect/demo/index/dte/demo

MGPP

MGPP is a reimplementation of MG that provides word-level indexes and enables proximity, phrase and field searching. MGPP is implemented in C++. MGPP user guide

Search type: plain, fielded

An MGPP index definition looks like this:

indexes allfields, text, Title

The index specification can include the keywords allfields, text or metadata, along with any metadata elements that can be found in the collection, for example "Title", "Subject" or "Organization". The main advantage of MGPP over MG is that searches can be done across multiple fields. For example, a search such as "Smith in Author and snail in Title" can be done.
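To make the fielded search example concrete, here is a small sketch of what "Smith in Author and snail in Title" means. It scans in-memory records rather than using an index, so it shows only the query semantics, not how MGPP evaluates the query; the records and the function name fielded_and_search are invented for the illustration.

```python
def fielded_and_search(records, criteria):
    """Return ids of records in which every (field, word) criterion matches."""
    hits = []
    for rec_id, fields in records.items():
        if all(
            word.lower() in fields.get(field, "").lower().split()
            for field, word in criteria
        ):
            hits.append(rec_id)
    return sorted(hits)


# Made-up sample records.
records = {
    1: {"Author": "John Smith", "Title": "The garden snail"},
    2: {"Author": "Jane Smith", "Title": "Birds"},
    3: {"Author": "A. Jones", "Title": "Snail farming"},
}

# "Smith in Author and snail in Title"
print(fielded_and_search(records, [("Author", "Smith"), ("Title", "snail")]))
```

Only record 1 satisfies both conditions: record 2 matches on Author only, and record 3 on Title only.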

The order of the indexes determines the order in which they are presented in the index list, with the first entry being the default index.
Only one physical index is built (index/idx); document and section levels, and text and metadata fields, are all handled by that one index.

MGPP can do:

compressed text
phrase search
fields search
case folding, stemming, accent folding
proximity searching
Boolean operators:
- & AND
- | OR
- ! NOT
- with () for precedence
wildcard queries, such as "comput*".
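Phrase and proximity searching in the list above depend on the index being word-level, that is, on recording where each word occurs. The following sketch shows the general idea with a positional index; it is an illustration under that assumption and does not reflect MGPP's actual data structures. The names build_positional_index and within are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positional_index(docs):
    """Word-level index: word -> {document number -> [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word][doc_id].append(pos)
    return index


def within(index, first, second, max_gap):
    """Documents where `second` occurs 1..max_gap positions after `first`."""
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    hits = []
    for doc_id in sorted(set(first_postings) & set(second_postings)):
        later = set(second_postings[doc_id])
        if any(
            p + gap in later
            for p in first_postings[doc_id]
            for gap in range(1, max_gap + 1)
        ):
            hits.append(doc_id)
    return hits


docs = {
    1: "digital library software",
    2: "library of digital maps",
    3: "digital public library",
}
index = build_positional_index(docs)
print(within(index, "digital", "library", 1))  # adjacent words: a phrase
print(within(index, "digital", "library", 2))  # proximity within two words
```

With a gap of 1 the query behaves like a two-word phrase search; widening the gap turns the same lookup into a proximity search, with no need to rescan the document text.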

Command line for running an MGPP query:

Usage: java org.greenstone.mgpp.Queryer <basedir> <indexdir> <textdir>

Lucene

Lucene is a Java-based, full-featured text indexing and searching system developed by Apache. Lucene home page

Lucene builds word-level indexes at separate levels.

Each index is built at a single level, either the document level or the section level. A collection that needs both document and section searching therefore requires two separate indexes.

The document level index is physically stored at index/didx, while the section level index is physically stored at index/sidx.

Search type: plain, fielded

Lucene can be used to perform incremental collection building in Greenstone. When new documents are added to a Lucene collection, only the new documents need to be built rather than the whole collection, which greatly reduces the building time.
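The idea behind incremental building can be sketched in a few lines. This is a conceptual illustration only: it uses neither the Lucene API nor Greenstone's build scripts, and the function add_documents and the sample documents are invented for the example.

```python
def add_documents(index, indexed_ids, docs):
    """Index only the documents whose ids are not yet in indexed_ids.

    Returns the list of ids that were newly indexed.
    """
    newly_indexed = []
    for doc_id, text in docs.items():
        if doc_id in indexed_ids:
            continue  # already built, skip it
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
        indexed_ids.add(doc_id)
        newly_indexed.append(doc_id)
    return newly_indexed


index, indexed_ids = {}, set()

# First build: one document.
first = add_documents(index, indexed_ids, {"a": "snail farming"})

# Later build: the collection now holds two documents, but only "b" is new.
second = add_documents(
    index, indexed_ids, {"a": "snail farming", "b": "garden snail"}
)
print(first, second)
```

The second call does work proportional to the new document only, which is where the saving in building time comes from.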

Lucene can do:

ranked searching – best results returned first
query types: phrase queries, wildcard queries, proximity queries
fielded searching (e.g., title, author, contents)
sorting by any field
multiple-index searching with merged results
case folding (default), stemming

Command line for running a Lucene query:

Usage: lucene_query.pl full-index-dir [query] [-fuzziness value] [-filter filter_string]
[-sort sort_field] [-dco AND|OR] [-startresults number -endresults number] [-out out_file]

Related info

A GDBM database is used to record indexes.

Collection importing

import.pl collect-name

Imports the original files and "metadata.xml" (metadata assigned by users) into the archives folder.

Collection building

buildcol.pl collect-name

Collection building takes three steps:

compress and store the text
build the indexes
save the metadata and generated classifiers into the database

MG takes 4 passes to build the collection

2 passes for the text compression

Pass 1: docs → mg_passes -T1
mg_compression.dict (create the dictionary)
Pass 2: docs → mg_passes -T2

2 passes for indexing

Pass 1: docs → mg_passes -T1 (create the index dictionary)
Pass 2: docs → mg_passes -T2 (invert text - word position)
mg_weight_build
mg_invert_dict (standard dictionary)
mg_stem_idx

MGPP takes 4 passes to build the collection

2 passes for the text compression

Pass 1 : docs → mgpp_passes -T1
Pass 2 : mgpp_passes -T2

2 passes for indexing

Pass 1 : docs → mgpp_passes -I1
Pass 2 : mgpp_passes -I2

Lucene takes 3 passes to build the collection

1 pass for text storage (as XML files; the Lucene indexer doesn't store text)
1 pass for indexing (a Perl script calls Java code to generate the indexes)
1 pass for storing metadata and classifiers in the database
