Indexers

Greenstone can index a collection with one of three indexers: MG, MGPP or Lucene. The index files are stored in the folder collect/collect-name/index.

MG

MG is the original indexer used by Greenstone and is described in the book "Managing Gigabytes". It indexes at the section level and compresses the source documents. MG is implemented in C.

indexes document:text section:text section:Title 

document:text contains the text of each document, section:text contains the text of each section, and section:Title contains the title of each section.

The document and section parts determine the granularity of the searching and of the items retrieved. The document index returns a list of document numbers, the two section indexes return section numbers.
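To make the level:source structure of the index specification concrete, here is a minimal sketch that splits the example line into its parts. The function name parse_mg_indexes is hypothetical and is not part of Greenstone; it only illustrates how each entry pairs a granularity (document or section) with the material being indexed.

```python
def parse_mg_indexes(line):
    """Split an 'indexes' line into (level, source) pairs.

    Hypothetical helper for illustration only; not a Greenstone API.
    """
    keyword, *specs = line.split()
    if keyword != "indexes":
        raise ValueError("expected a line starting with 'indexes'")
    pairs = []
    for spec in specs:
        # "section:Title" -> level "section", source "Title"
        level, _, source = spec.partition(":")
        pairs.append((level, source))
    return pairs


index_pairs = parse_mg_indexes("indexes document:text section:text section:Title")
for level, source in index_pairs:
    print(f"{source} is indexed at the {level} level")
```

The level part is what decides whether a search returns document numbers or section numbers.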

MG uses a document-level index rather than a word-level index, so it cannot do phrase searching or proximity searching.
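The difference between the two index types can be illustrated with a toy example. This is a conceptual sketch only, with made-up sample documents; it does not reflect how MG or MGPP actually store their indexes. A document-level index records only which documents contain a term, while a word-level index also records word positions, which is the information a phrase search needs.

```python
from collections import defaultdict

# Made-up sample documents for illustration.
docs = {
    1: "managing gigabytes of text",
    2: "gigabytes managing text",
}

# Document-level index: term -> set of document numbers.
doc_index = defaultdict(set)
# Word-level index: term -> {document number: [word positions]}.
word_index = defaultdict(lambda: defaultdict(list))

for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        doc_index[term].add(doc_id)
        word_index[term][doc_id].append(pos)


def docs_with_all_terms(terms):
    """Boolean AND; a document-level index is enough for this."""
    return set.intersection(*(doc_index[t] for t in terms))


def docs_with_phrase(terms):
    """Phrase match; requires the positions held in the word-level index."""
    hits = set()
    for doc_id, positions in word_index[terms[0]].items():
        for start in positions:
            # Every following term must appear at the next consecutive position.
            if all((start + i) in word_index[t].get(doc_id, [])
                   for i, t in enumerate(terms)):
                hits.add(doc_id)
    return hits


phrase = ["managing", "gigabytes"]
and_hits = docs_with_all_terms(phrase)
phrase_hits = docs_with_phrase(phrase)
print("documents containing both words:", sorted(and_hits))
print("documents containing the phrase:", sorted(phrase_hits))
```

Both documents contain the two words, but only the first contains them adjacent and in order, and the document-level index alone cannot tell the two cases apart.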

MG can do:

Command line for running an MG query, using either mgquery or the new Java Queryer program:

mgquery -f <indexdir> -t <textdir>
java org.greenstone.mg.Queryer <basedir> <indexdir> <textdir> [[-h]]

Here indexdir and textdir are the paths to the index files and the compressed text files, without the filename extension, e.g. collect/demo/index/dte/demo.
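As a rough sketch, the extension-less paths for the mgquery arguments could be assembled as below. Two things here are assumptions rather than facts from this page: that the file stem equals the collection name (as it does in the demo example), and that the compressed text sits in a subdirectory called text. The function mg_query_args is hypothetical, and the sketch only builds the argument list without running anything.

```python
from pathlib import PurePosixPath


def mg_query_args(collection, index_subdir, text_subdir="text"):
    """Build an mgquery argument list (hypothetical helper, nothing is executed).

    Assumes the file stem equals the collection name and that the
    compressed text is kept under index/<text_subdir>.
    """
    base = PurePosixPath("collect") / collection / "index"
    index_stem = base / index_subdir / collection  # no filename extension
    text_stem = base / text_subdir / collection    # no filename extension
    return ["mgquery", "-f", str(index_stem), "-t", str(text_stem)]


args = mg_query_args("demo", "dte")
print(" ".join(args))
```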

MGPP

MGPP is a reimplementation of MG that provides word-level indexes and enables proximity, phrase and field searching. MGPP is implemented in C++. MGPP user guide

indexes allfields, text, Title

The index specification can include the keywords allfields, text or metadata, along with any metadata elements found in the collection, for example "Title", "Subject" or "Organization". The main advantage of MGPP over MG is that a single search can span multiple fields. For example, a search such as "Smith in Author and snail in Title" can be done.
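The "Smith in Author and snail in Title" search can be pictured as an intersection of per-field matches. The sketch below is a conceptual illustration with made-up records and a hypothetical search_all function; it is not MGPP's query syntax or implementation.

```python
# Made-up sample records for illustration.
records = {
    1: {"Author": "John Smith", "Title": "The garden snail"},
    2: {"Author": "Jane Smith", "Title": "River fish"},
    3: {"Author": "Ann Lee", "Title": "Snail habitats"},
}

# Fielded index: (field, term) -> set of document numbers.
field_index = {}
for doc_id, fields in records.items():
    for field, value in fields.items():
        for term in value.lower().split():
            field_index.setdefault((field, term), set()).add(doc_id)


def search_all(*clauses):
    """Return documents matching every (term, field) clause."""
    result = None
    for term, field in clauses:
        matches = field_index.get((field, term.lower()), set())
        result = matches if result is None else result & matches
    return result or set()


hits = search_all(("Smith", "Author"), ("snail", "Title"))
print("matching documents:", sorted(hits))
```

Only record 1 has "Smith" as an author term and "snail" as a title term, so it is the only hit.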

MGPP can do:

Command line for running an MGPP query:

Usage: java org.greenstone.mgpp.Queryer <basedir> <indexdir> <textdir>

Lucene

Lucene is a Java-based, full-featured text indexing and searching system developed by Apache. Lucene home page

The document level index is physically stored at index/didx, while the section level index is physically stored at index/sidx.
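A short sketch of how the two index levels might map onto the full-index-dir argument of lucene_query.pl, whose usage line appears in this section. It assumes that full-index-dir is the didx or sidx directory under collect/collect-name/index. The helper lucene_query_args is hypothetical and only assembles the arguments. The option names are taken from the usage line.

```python
from pathlib import PurePosixPath

# Index level -> directory name, as described in the text.
LEVEL_DIRS = {"document": "didx", "section": "sidx"}


def lucene_query_args(collection, level, query, dco="AND", start=None, end=None):
    """Build a lucene_query.pl argument list (hypothetical helper, nothing is run).

    Assumes full-index-dir is collect/<collection>/index/<didx|sidx>.
    """
    index_dir = PurePosixPath("collect") / collection / "index" / LEVEL_DIRS[level]
    args = ["lucene_query.pl", str(index_dir), query, "-dco", dco]
    # -startresults and -endresults are given together in the usage line.
    if start is not None and end is not None:
        args += ["-startresults", str(start), "-endresults", str(end)]
    return args


section_args = lucene_query_args("demo", "section", "snail", dco="OR", start=1, end=20)
print(" ".join(section_args))
```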

Lucene can do

Command line for running a Lucene query:

Usage: lucene_query.pl full-index-dir [query] [-fuzziness value] [-filter filter_string]
[-sort sort_field] [-dco AND|OR] [-startresults number -endresults number] [-out out_file]

import.pl collect-name

Imports the original files and "metadata.xml" (metadata assigned by users) into the archives folder.

buildcol.pl collect-name

Three steps to finish collection building

MG takes 4 passes to build the collection

2 passes for the text compression

2 passes for indexing

MGPP takes 4 passes to build the collection

2 passes for the text compression

2 passes for indexing

Lucene takes 3 passes to build the collection
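The two commands in this section, import.pl followed by buildcol.pl, could be driven from a small script along the following lines. This is a sketch under assumptions: it presumes both scripts are on the PATH in a configured Greenstone environment, and the function names are hypothetical. By default it only prints the commands (a dry run) and executes nothing.

```python
import subprocess


def build_commands(collection):
    """Return the import and build commands, in the order they are run."""
    return [["import.pl", collection], ["buildcol.pl", collection]]


def build_collection(collection, dry_run=True):
    """Print the commands, and run them only when dry_run is False.

    Assumes import.pl and buildcol.pl are on the PATH.
    """
    commands = build_commands(collection)
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence if a step fails.
            subprocess.run(cmd, check=True)
    return commands


planned = build_collection("demo")
```

Running the import step first matters because buildcol.pl works from the imported files rather than from the original documents.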