Indexers

Greenstone can index collections with three tools: MG, MGPP and Lucene. The index files are stored in the folder collect/collect-name/index.

MG

MG is the original indexer used by Greenstone and is described in the book "Managing Gigabytes". It performs section-level indexing and compresses the source documents. MG is implemented in C.

  • Search type: plain
  • The level information is specified in the index definition. Search and retrieval can be done only at document level or at section level. For the demo collection, the definition below makes MG create three separate physical indexes:
indexes document:text section:text section:Title

document:text contains the text of each document, section:text contains the section text, and section:Title contains the title of each section.

The document and section parts determine the granularity of the searching and of the items retrieved. The document index returns a list of document numbers; the two section indexes return section numbers.
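The level:field structure of the definition above can be illustrated with a minimal Python sketch. The function name parse_mg_indexes is hypothetical and is not part of Greenstone; the sketch only assumes that entries are separated by whitespace and that the level is either document or section, as described above.

```python
def parse_mg_indexes(spec):
    """Split an MG 'indexes' value into (level, field) pairs."""
    pairs = []
    for entry in spec.split():
        # The part before the first colon is the level, the rest is the field.
        level, _, field = entry.partition(":")
        if level not in ("document", "section") or not field:
            raise ValueError(f"unexpected index entry: {entry!r}")
        pairs.append((level, field))
    return pairs


demo = parse_mg_indexes("document:text section:text section:Title")
for level, field in demo:
    print(f"{level}-level index over {field}")
```

Each pair corresponds to one physical index, which is why the demo collection ends up with three.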

MG uses a document-level index rather than a word-level index, so it cannot do phrase searching or proximity searching directly from the index.

MG can do:

  • compressed text
  • case folding
  • stemming
  • Boolean (AND OR NOT) or ranked searches (but not both at once)
  • phrase searching by emulation: Greenstone runs an AND search with MG and then post-processes the results to find the phrase
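The last item can be sketched in Python as follows. This is a toy model, not Greenstone's or MG's code: the in-memory dictionary index, the tokenizer and all function names are assumptions made for illustration. It shows why a document-level index needs a second step: the AND search only proves that all words occur somewhere in a document, so the text has to be scanned to confirm that they are adjacent.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"\w+", text.lower())


def build_doc_index(docs):
    """Document-level index: word -> set of document ids (no positions)."""
    index = {}
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index.setdefault(word, set()).add(doc_id)
    return index


def phrase_search(phrase, docs, index):
    words = tokenize(phrase)
    if not words:
        return []
    # Step 1: AND search, keep documents that contain every word.
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    # Step 2: post-process, keep documents where the words are adjacent.
    hits = []
    n = len(words)
    for doc_id in sorted(candidates):
        tokens = tokenize(docs[doc_id])
        if any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1)):
            hits.append(doc_id)
    return hits


docs = {
    1: "Managing gigabytes of text",
    2: "Gigabytes of managing overhead",
    3: "Managing text",
}
index = build_doc_index(docs)
result = phrase_search("managing gigabytes", docs, index)
print(result)
```

Documents 1 and 2 both pass the AND step, but only document 1 survives the post-processing step.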

Command line for running the MG query program or the newer Java Queryer program:

mgquery -f <indexdir> -t <textdir>
java org.greenstone.mg.Queryer <basedir> <indexdir> <textdir> [[-h]]

where indexdir and textdir are the paths to the index files and to the compressed text files, without the filename extension, e.g. collect/demo/index/dte/demo
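As a small illustration, the mgquery usage above can be assembled in Python. The helper name mgquery_command is hypothetical, and the text path collect/demo/index/text/demo is an assumed example value, not something stated on this page; only the index path comes from the example above.

```python
import shlex


def mgquery_command(indexdir, textdir):
    """Return the argument list for the mgquery usage shown above."""
    for path in (indexdir, textdir):
        # Both arguments are file stems (no extension), not directories.
        if path.endswith("/"):
            raise ValueError("expected a file stem, not a directory: " + path)
    return ["mgquery", "-f", indexdir, "-t", textdir]


cmd = mgquery_command(
    "collect/demo/index/dte/demo",   # index stem from the example above
    "collect/demo/index/text/demo",  # assumed location of the compressed text
)
print(shlex.join(cmd))
```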

MGPP

MGPP is a reimplementation of MG that provides word-level indexes and enables proximity, phrase and field searching. MGPP is implemented in C++. See the MGPP user guide.

  • Search type: plain, fielded
  • An MGPP index definition looks like:
indexes allfields, text, Title

The index specification can include the keywords allfields, text or metadata, along with metadata elements that can be found in the collection, for example "Title", "Subject", "Organization" etc. The main advantage of MGPP over MG is that searches can be done across multiple fields. For example, a search such as "Smith in Author and snail in Title" can be done.

  • The order of the indexes determines the order in which they are presented in the index list, with the first entry being the default index.
  • Only one physical index is built (index/idx); document/section levels and text/metadata fields are all handled by that one index.
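The multi-field search mentioned above ("Smith in Author and snail in Title") can be sketched with a toy index keyed by field and word. This is only an illustration of the idea of one index serving several fields; it is not MGPP's on-disk format, and the function names and sample documents are made up.

```python
def build_field_index(docs):
    """One index for all fields: (field, word) -> set of document ids."""
    index = {}
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for word in text.lower().split():
                index.setdefault((field, word), set()).add(doc_id)
    return index


def field_and_search(index, terms):
    """terms is a list of (word, field) pairs; every pair must match."""
    sets = [index.get((field, word.lower()), set()) for word, field in terms]
    return sorted(set.intersection(*sets)) if sets else []


docs = {
    "d1": {"Author": "Smith", "Title": "The garden snail"},
    "d2": {"Author": "Smith", "Title": "Farming"},
    "d3": {"Author": "Jones", "Title": "Snail farming"},
}
index = build_field_index(docs)
result = field_and_search(index, [("Smith", "Author"), ("snail", "Title")])
print(result)
```

Only d1 has Smith as an author and snail in its title, so it is the single hit.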

MGPP can do:

  • compressed text
  • phrase search
  • field search
  • case folding, stemming, accent folding
  • proximity searching
  • Boolean operators:
    • & AND
    • | OR
    • ! NOT
    • with () for precedence
  • wildcard queries, such as "comput*".
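To make the operator list concrete, here is a minimal Python sketch of a Boolean query evaluator over a toy word index. It is not MGPP's parser. The class name, the toy index, and in particular the precedence (! binds tightest, then &, then |) are assumptions; the page itself only says that parentheses control precedence. Wildcards such as comput* are matched with Python's fnmatch module.

```python
import fnmatch
import re

# A token is one operator character or a run of other non-space characters.
TOKEN = re.compile(r"\s*([&|!()]|[^\s&|!()]+)")


class QueryEvaluator:
    def __init__(self, index, all_docs):
        self.index = index            # word -> set of document ids
        self.all_docs = set(all_docs)  # needed to evaluate NOT

    def evaluate(self, query):
        self.tokens = TOKEN.findall(query)
        self.pos = 0
        result = self._expr()
        if self.pos != len(self.tokens):
            raise ValueError("unexpected token: " + self.tokens[self.pos])
        return result

    def _peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def _next(self):
        tok = self._peek()
        self.pos += 1
        return tok

    def _expr(self):  # expr := term ('|' term)*
        result = self._term()
        while self._peek() == "|":
            self._next()
            result = result | self._term()
        return result

    def _term(self):  # term := factor ('&' factor)*
        result = self._factor()
        while self._peek() == "&":
            self._next()
            result = result & self._factor()
        return result

    def _factor(self):  # factor := '!' factor | '(' expr ')' | word
        tok = self._next()
        if tok == "!":
            return self.all_docs - self._factor()
        if tok == "(":
            result = self._expr()
            if self._next() != ")":
                raise ValueError("missing ')'")
            return result
        if tok is None or tok in ("&", "|", ")"):
            raise ValueError("expected a search term")
        return self._lookup(tok)

    def _lookup(self, pattern):
        # Union the postings of every indexed word matching the pattern.
        hits = set()
        for word, docs in self.index.items():
            if fnmatch.fnmatchcase(word, pattern.lower()):
                hits |= docs
        return hits


index = {"computer": {1}, "computing": {2}, "snail": {2, 3}, "garden": {3}}
evaluator = QueryEvaluator(index, {1, 2, 3})
print(evaluator.evaluate("comput* & snail"))
print(evaluator.evaluate("snail & !comput*"))
```

Set operations map directly onto the operators: & is intersection, | is union, and ! is the complement against all documents.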

Command line for running an MGPP query:

Usage: java org.greenstone.mgpp.Queryer <basedir> <indexdir> <textdir>

Lucene

Lucene is a Java-based, full-featured text indexing and searching system developed by Apache. See the Lucene home page.

  • Lucene builds word-level indexes, with a separate index for each level.
  • Each index is built at a single level, either the document level or the section level. A collection with both document and section indexes therefore requires two separate indexes.

The document level index is physically stored at index/didx, while the section level index is physically stored at index/sidx.

  • Search type: plain, fielded
  • Lucene can be used to perform incremental collection building in Greenstone. When new documents are added to a Lucene collection, only the new documents need to be built rather than the whole collection, which greatly reduces the building time.
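The idea behind incremental building can be shown with a toy Python sketch. It does not reproduce how Greenstone or Lucene track documents; the function name, the in-memory index and the set of already indexed ids are assumptions used only to show that a second build touches just the new documents.

```python
def incremental_build(index, indexed_ids, docs):
    """Index only documents not seen before; return the ids that were built."""
    new_ids = [doc_id for doc_id in docs if doc_id not in indexed_ids]
    for doc_id in new_ids:
        for word in docs[doc_id].lower().split():
            index.setdefault(word, set()).add(doc_id)
        indexed_ids.add(doc_id)
    return new_ids


index, indexed_ids = {}, set()
docs = {"a": "snail farming", "b": "garden snail"}
first = incremental_build(index, indexed_ids, docs)

docs["c"] = "computer gardening"  # a new document arrives
second = incremental_build(index, indexed_ids, docs)
print(first, second)
```

The second call processes one document instead of three, which is where the saving in building time comes from.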

Lucene can do:

  • ranked searching – best results returned first
  • query types: phrase queries, wildcard queries, proximity queries
  • fielded searching (e.g., title, author, contents)
  • sorting by any field
  • multiple-index searching with merged results
  • case folding (default), stemming

Command line for running a Lucene query:

Usage: lucene_query.pl full-index-dir [query] [-fuzziness value] [-filter filter_string]
[-sort sort_field] [-dco AND|OR] [-startresults number -endresults number] [-out out_file]
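
The usage line can be combined with the index locations given earlier (index/didx for documents, index/sidx for sections). The following Python sketch builds such a command line; the helper name, its parameters and the example values ("snail farming", Title) are hypothetical, and only options that appear in the usage line are used.

```python
import shlex

# Directory names taken from the description of the Lucene indexes above.
LEVEL_DIRS = {"document": "didx", "section": "sidx"}


def lucene_query_command(collection_dir, level, query,
                         sort_field=None, default_op=None,
                         start=None, end=None):
    """Assemble a lucene_query.pl argument list for one index level."""
    index_dir = f"{collection_dir}/index/{LEVEL_DIRS[level]}"
    cmd = ["lucene_query.pl", index_dir, query]
    if sort_field:
        cmd += ["-sort", sort_field]
    if default_op:
        if default_op not in ("AND", "OR"):
            raise ValueError("-dco takes AND or OR")
        cmd += ["-dco", default_op]
    if start is not None and end is not None:
        cmd += ["-startresults", str(start), "-endresults", str(end)]
    return cmd


cmd = lucene_query_command("collect/demo", "section", "snail farming",
                           sort_field="Title", default_op="AND")
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps a multi-word query as one argument when the command is later passed to a process launcher.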

  • A GDBM database is used to record indexes
  • Collection importing
import.pl collect-name

Imports the original files and "metadata.xml" (metadata assigned by users) into the archives folder.

  • Collection building
buildcol.pl collect-name

Collection building takes three steps:

  • compressing and storing the text
  • indexing
  • saving the metadata and the generated classifiers into the database
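
The import and build commands above are run in sequence, which can be scripted as in this Python sketch. The function name build_collection and the injectable runner are assumptions made for illustration, and the sketch assumes that import.pl and buildcol.pl are on the PATH of an already configured Greenstone environment. The example call uses a dry-run runner, so nothing is executed.

```python
import subprocess


def build_collection(name, runner=subprocess.run):
    """Run import.pl and then buildcol.pl for one collection."""
    commands = [["import.pl", name], ["buildcol.pl", name]]
    for cmd in commands:
        # check=True makes subprocess.run stop the sequence on a failure.
        runner(cmd, check=True)
    return commands


def dry_run(cmd, check=True):
    """Stand-in runner that prints the command instead of executing it."""
    print("would run:", " ".join(cmd))


planned = build_collection("demo", runner=dry_run)
```

Importing has to finish before building starts, because buildcol.pl works on the files that import.pl places in the archives folder.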

MG takes 4 passes to build the collection

Two passes for text compression

  • Pass 1: docs → mg_passes -T1
  • mg_compression.dict (create the dictionary)
  • Pass 2: docs → mg_passes -T2

Two passes for indexing

  • Pass 1: docs → mg_passes -T1 (create the index dictionary)
  • Pass 2: docs → mg_passes -T2 (invert text - word position)
  • mg_weight_build
  • mg_invert_dict (standard dictionary)
  • mg_stem_idx

MGPP takes 4 passes to build the collection

Two passes for text compression

  • Pass 1 : docs → mgpp_passes -T1
  • Pass 2 : mgpp_passes -T2

Two passes for indexing

  • Pass 1 : docs → mgpp_passes -I1
  • Pass 2 : mgpp_passes -I2

Lucene takes 3 passes to build the collection

  • 1 pass for text storage (as XML files; the Lucene indexer doesn't store text)
  • 1 pass for indexing (a Perl script calls Java code to generate the indexes)
  • 1 pass for storing the metadata and classifiers into the database
en/developer/indexers.1522100376.txt.gz · Last modified: 2018/03/26 21:39 by kjdon