Greenstone can index collections with three indexing tools: MG, MGPP and Lucene. The index is physically located in the folder collect/collect-name/index.
MG is the original indexer used by Greenstone and is described in the book "Managing Gigabytes". It does section-level indexing and compresses the source documents. MG is implemented in C.
indexes document:text section:text section:Title
document:text contains the text of each document, section:text contains the text of each section, and section:Title contains the title of each section.
The document and section parts determine the granularity of searching and of the items retrieved: the document index returns a list of document numbers, while the two section indexes return section numbers.
MG uses a document-level index rather than a word-level index, so it cannot do phrase searching or proximity searching.
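The effect of the three index specifications can be illustrated with a small Python sketch. This is not MG's implementation: the toy collection, the function names and the numbering scheme are all assumptions made for illustration. It shows only that the same query returns document numbers from a document index and section numbers from a section index, and that an index without word positions cannot tell a phrase from its reversed word order.

```python
from collections import defaultdict

# Hypothetical toy collection: each document is a list of (title, text) sections.
DOCS = [
    [("Snails", "garden snail shells"), ("Slugs", "slugs have no shell")],
    [("Shell scripts", "a shell runs commands")],
]

def build_indexes(docs):
    """Build three toy indexes named after the index specifications above."""
    names = ("document:text", "section:text", "section:Title")
    indexes = {name: defaultdict(set) for name in names}
    section_num = 0
    for doc_num, sections in enumerate(docs, start=1):
        for title, text in sections:
            section_num += 1  # sections are numbered across the whole collection
            for word in text.lower().split():
                indexes["document:text"][word].add(doc_num)
                indexes["section:text"][word].add(section_num)
            for word in title.lower().split():
                indexes["section:Title"][word].add(section_num)
    return indexes

def search(indexes, index_name, query):
    """AND together the postings of every query word (no word positions)."""
    postings = [indexes[index_name].get(w, set()) for w in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

indexes = build_indexes(DOCS)
print(search(indexes, "document:text", "shell"))  # document numbers
print(search(indexes, "section:text", "shell"))   # section numbers
```

Because the postings hold only document or section numbers, the queries "no shell" and "shell no" return the same result in this sketch, which is why phrase and proximity searching need a word-level index.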
MG can do:
Command lines for running an MG query with mgquery, or with the new Java Queryer program:
mgquery -f <indexdir> -t <textdir>
java org.greenstone.mg.Queryer <basedir> <indexdir> <textdir> [-h]
where indexdir and textdir are the paths to the index files and the compressed text files respectively, without the filename extension, e.g. collect/demo/index/dte/demo
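As a rough sketch, the mgquery argument list can be assembled in Python before being passed to a process launcher. The helper name mgquery_command is hypothetical, and the extension check is a crude assumption: it treats any dot in the last path component as a filename extension. Only the -f and -t options shown above are used.

```python
from pathlib import PurePosixPath

def mgquery_command(indexdir, textdir):
    """Return the mgquery argument list for two extension-less paths."""
    for path in (indexdir, textdir):
        # Crude check: a dot in the last path component counts as an extension.
        if PurePosixPath(path).suffix:
            raise ValueError(f"expected a path without filename extension: {path}")
    return ["mgquery", "-f", indexdir, "-t", textdir]
```

The returned list can be handed to subprocess.run; the text path is whatever location holds the compressed text of the collection.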
MGPP is a reimplementation of MG that provides word-level indexes and enables proximity, phrase and field searching. MGPP is implemented in C++. See the MGPP user guide for details.
indexes allfields, text, Title
The index specification can include the keywords allfields, text or metadata, along with metadata elements that can be found in the collection, for example "Title", "Subject", "Organization", etc. The main advantage of MGPP over MG is that searches can be done across multiple fields. For example, a search such as "Smith in Author and snail in Title" can be done.
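To show what a word-level, per-field index makes possible, here is a minimal Python sketch. It is not how MGPP stores its data: the records, the function names and the nested-dictionary layout are assumptions for illustration. Each field maps words to the positions where they occur, which is enough to answer both the fielded query from the example and a phrase query.

```python
from collections import defaultdict

# Hypothetical records; the field names follow the examples in the text.
RECORDS = {
    1: {"Author": "John Smith", "Title": "The garden snail", "text": "snail shells are spiral"},
    2: {"Author": "Ann Smith", "Title": "Shell scripts", "text": "spiral galaxies and snail mail"},
    3: {"Author": "Bob Jones", "Title": "Snail farming", "text": "farming the garden snail"},
}

def build_word_level_index(records):
    """Map field -> word -> document number -> list of word positions."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for doc_num, fields in records.items():
        for field, value in fields.items():
            for pos, word in enumerate(value.lower().split()):
                index[field][word][doc_num].append(pos)
    return index

def term_search(index, field, word):
    """Documents whose given field contains the word."""
    return set(index.get(field, {}).get(word.lower(), {}))

def phrase_search(index, field, phrase):
    """Documents whose given field contains the words at consecutive positions."""
    words = phrase.lower().split()
    if not words:
        return set()
    hits = set()
    first = index.get(field, {}).get(words[0], {})
    for doc_num, positions in first.items():
        for start in positions:
            if all(start + i in index[field].get(w, {}).get(doc_num, [])
                   for i, w in enumerate(words[1:], start=1)):
                hits.add(doc_num)
                break
    return hits

index = build_word_level_index(RECORDS)
# "Smith in Author and snail in Title"
print(term_search(index, "Author", "Smith") & term_search(index, "Title", "snail"))
print(phrase_search(index, "text", "garden snail"))
```

A proximity search could be built the same way by allowing a bounded gap between positions instead of requiring consecutive ones.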
MGPP can do:
Command line for running an MGPP query:
Usage: java org.greenstone.mgpp.Queryer <basedir> <indexdir> <textdir>
Lucene is a Java-based, full-featured text indexing and searching system developed by Apache. See the Lucene home page for details.
The document level index is physically stored at index/didx, while the section level index is physically stored at index/sidx.
Lucene can do:
Command line for running a Lucene query:
Usage: lucene_query.pl full-index-dir [query] [-fuzziness value] [-filter filter_string] [-sort sort_field] [-dco AND|OR] [-startresults number -endresults number] [-out out_file]
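The usage line can be turned into an argument list programmatically. The following Python sketch uses only the options listed in the usage line. The function name, its parameter names, and the assumption that full-index-dir can be formed as collect/collect-name/index/didx or collect/collect-name/index/sidx relative to the Greenstone directory are illustrative, not part of Greenstone.

```python
# Index folders named in the text: didx = document level, sidx = section level.
INDEX_DIRS = {"document": "didx", "section": "sidx"}

def lucene_query_command(collection, level, query=None, *, fuzziness=None,
                         filter_string=None, sort_field=None, dco=None,
                         results=None, out_file=None):
    """Build a lucene_query.pl argument list from the documented options."""
    # Assumption: the index directory is given relative to the Greenstone home.
    cmd = ["lucene_query.pl", f"collect/{collection}/index/{INDEX_DIRS[level]}"]
    if query is not None:
        cmd.append(query)
    if fuzziness is not None:
        cmd += ["-fuzziness", str(fuzziness)]
    if filter_string is not None:
        cmd += ["-filter", filter_string]
    if sort_field is not None:
        cmd += ["-sort", sort_field]
    if dco is not None:
        if dco not in ("AND", "OR"):
            raise ValueError("dco must be 'AND' or 'OR'")
        cmd += ["-dco", dco]
    if results is not None:
        # The usage line groups these two options, so they are set together.
        start, end = results
        cmd += ["-startresults", str(start), "-endresults", str(end)]
    if out_file is not None:
        cmd += ["-out", out_file]
    return cmd

print(lucene_query_command("demo", "section", "snail", dco="AND", results=(1, 20)))
```

Passing the list to a launcher such as subprocess.run avoids shell quoting problems when the query contains spaces.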
import.pl collect-name
Imports the original files and "metadata.xml" (metadata assigned by users) into the archives folder
buildcol.pl collect-name
Three steps to finish collection building
MG takes 4 passes to build the collection
2 passes for the text compression
2 passes for indexing
MGPP takes 4 passes to build the collection
2 passes for the text compression
2 passes for indexing
Lucene takes 3 passes to build the collection
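The import and build steps described above run in a fixed order: import.pl first, then buildcol.pl. A minimal Python sketch of that sequence follows. The function names and the injectable runner parameter are assumptions for illustration, and the sketch assumes both scripts are on the PATH.

```python
import subprocess

def build_commands(collection):
    """Return the import and build commands in the order they must run."""
    return [["import.pl", collection], ["buildcol.pl", collection]]

def run_build(collection, runner=subprocess.run):
    """Run each step in order, stopping at the first failure."""
    for cmd in build_commands(collection):
        # check=True makes subprocess.run raise if a step exits with an error.
        runner(cmd, check=True)
```

The runner parameter exists only so the sequence can be exercised without Greenstone installed, for example by passing a function that records the commands instead of running them.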