More about indexing

How do I change the search results order?
The order of search results depends on the kind of query you are running. For simple (MG) collections, search results are either ranked (for a 'some' or ranked search) or in build order (for an 'all' or boolean search). MG cannot do ranking and boolean searching at the same time. For advanced (MGPP) collections, search results can be in ranked or build order as above, but the order does not depend on the kind of search you are doing: boolean searches may be ranked.

Build order is the seemingly random order in which documents are processed during import. This can be changed by using the sortmeta option to import.pl. If a metadata element is specified here, then documents will be sorted during import by that metadata.

This option can be passed to import.pl (in GLI Expert mode), or specified in the collect.cfg file. Note that it needs to be added to collect.cfg manually, for example:

    sortmeta dc.Date

It cannot be added to the config file using GLI at this stage.
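
As a rough illustration of what sorting by metadata does to build order, here is a minimal Python sketch. It is not Greenstone code: the document records, the `build_order` helper and the handling of documents that lack the metadata (they sort first here) are all assumptions made for the example.

```python
# Toy documents with made-up metadata values; in Greenstone the metadata
# would be gathered by import.pl, not written by hand like this.
docs = [
    {"id": "doc_a", "dc.Date": "2003-05-01"},
    {"id": "doc_b", "dc.Date": "1998-11-20"},
    {"id": "doc_c", "dc.Date": "2001-02-14"},
]

def build_order(documents, sortmeta=None):
    """Return document ids in the order they would be processed.

    Without sortmeta the incoming order is kept; with it, documents are
    sorted by that metadata value (missing values sort first: an assumption).
    """
    if sortmeta is None:
        return [d["id"] for d in documents]
    ordered = sorted(documents, key=lambda d: d.get(sortmeta, ""))
    return [d["id"] for d in ordered]

default_order = build_order(docs)
date_order = build_order(docs, sortmeta="dc.Date")
print(default_order)
print(date_order)
```

A boolean search that returns hits "in build order" would then list matching documents in the second order rather than the first.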

For MGPP collections, in advanced searching mode, the query form has a drop-down box specifying "display search results in ranked/natural order". If you have sorted the documents by metadata, then you may like to change the text for this box, e.g. have it display "ranked/date order". To achieve this, add the following line to the collection's collect.cfg file:

    collectionmacro query:textnatural "date"

For Lucene collections, sorting does not depend on build order. The user can sort by rank, or by any of the fields that have been indexed (apart from text and allfields). This will be offered automatically by Greenstone. If you are indexing sections and want sorting by field to work, you need to make sure that each section has the appropriate metadata. Use the "-sections_index_document_metadata unless_section_metadata_exists" option to buildcol.pl to give each section all the metadata from the top level document.

What's the difference between MG, MGPP, Lucene?
Greenstone gives you a choice of three indexing tools to index your collection. MG is the default indexer. MGPP and Lucene can be used by turning on "Enable Advanced Searching" in the "Search Types" section of the "Design" panel in the Librarian Interface.


 * MG: This is the original indexer used by Greenstone, developed mainly by Alistair Moffat and described in the classic book Managing Gigabytes. It does section level indexing, and searches can be boolean or ranked (not both at once). For each index specified in the collection, a separate physical index is created. For phrase searching, Greenstone does an "AND" search on all the terms, then scans the resulting hits to see if the phrase is present. It has been extensively tested on very large collections (many GB of text). See also: MG in Greenstone.
 * MGPP: This new version of MG (MG plus plus) was developed by the New Zealand Digital Library Project. It does word level indexing, which allows fielded, phrase and proximity searching to be handled by the indexer. Boolean searches can be ranked. Only a single index is created for a Greenstone collection: document/section levels and text/metadata fields are all handled by the one index. For collections with many indexes, this results in a smaller collection size than using MG. For large collections, searching may be a bit slower due to the index being word level rather than section level. See also: MGPP user guide.
 * Lucene: Lucene was developed by the Apache Software Foundation. It handles field and proximity searching, but only at a single level (e.g. complete documents or individual sections, but not both). Therefore document and section indexes for a collection require two separate indexes. It provides a similar range of search functionality to MGPP with the addition of single-character wildcards and range searching. It was added to Greenstone to facilitate incremental collection building, which MG and MGPP can't provide. See also: Lucene home page.
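
The phrase-searching approach described for MG (an "AND" search on all the terms, followed by a scan of the hits) can be sketched in Python as below. This is only a toy model of the idea, not MG's or Greenstone's actual code: the whitespace tokenizer, the in-memory section-level index and all function names are assumptions made for illustration.

```python
def tokenize(text):
    # Assumption: lowercase and split on whitespace; real indexers do more.
    return text.lower().split()

def build_index(sections):
    """Section-level index: term -> set of section ids containing it."""
    index = {}
    for sec_id, text in sections.items():
        for term in set(tokenize(text)):
            index.setdefault(term, set()).add(sec_id)
    return index

def and_search(terms, index):
    """Sections that contain every term, in any position."""
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

def phrase_search(phrase, sections, index):
    """AND search first, then scan each candidate for the exact phrase."""
    terms = tokenize(phrase)
    n = len(terms)
    hits = []
    for sec_id in sorted(and_search(terms, index)):
        words = tokenize(sections[sec_id])
        # Slide a window of n words over the section text.
        if any(words[i:i + n] == terms for i in range(len(words) - n + 1)):
            hits.append(sec_id)
    return hits

sections = {
    "s1": "managing gigabytes of text",
    "s2": "gigabytes worth managing",
    "s3": "text about indexing",
}
index = build_index(sections)
candidates = and_search(["managing", "gigabytes"], index)
phrase_hits = phrase_search("managing gigabytes", sections, index)
print(sorted(candidates), phrase_hits)
```

Because a section-level index records only which sections contain a term, not where, the second scanning step is needed. A word-level index such as the one MGPP builds stores positions, which is why it can handle phrase searching in the indexer itself.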

Can Greenstone search Arabic OCR text?
(Thanks to Graeme Foster for this info)

Because Arabic is a cursive script, it requires a different set of characters (presentation forms) when it is displayed on the screen. In very crude terms, this ensures that each character joins up correctly; it is crude because a character looks different depending upon whether it appears at the start, middle or end of the word.

For example:

The character ڀ "The Letter Beheh" has UNICODE U+0680

The same character in its different presentation forms:

 * Isolated ﭚ has UNICODE U+FB5A
 * Final ﭛ has UNICODE U+FB5B
 * Initial ﭜ has UNICODE U+FB5C
 * Medial ﭝ has UNICODE U+FB5D

In addition to this there is the merging of multiple letters together when presenting the script.

The problem: when the data is saved, it should not be saved in any presentation form. To quote the UNICODE FAQ on this matter:

Q: Can one use the Arabic presentation forms in a data file?

A: It is strongly discouraged and not recommended because it does not guarantee data integrity and interoperability. Data files should include only the Arabic script code values that are defined in Row 6, U+0600 to U+06FF.

The issue is that when the data is stored in presentation form, the words will not be matched when doing a search. This should be understandable when you realise that the underlying UNICODE is very different (even if the word searched for is presented identically).
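
One possible way to repair such data before it is imported is Unicode compatibility normalization, which maps presentation forms back to the base Arabic letters. The sketch below uses Python's standard `unicodedata` module; it is a preprocessing idea that sits outside Greenstone, and the helper name `to_base_arabic` is made up for the example. Note that NFKC also rewrites other compatibility characters (for example full-width Latin letters), so it is worth checking the effect on a sample of your own text first.

```python
import unicodedata

def to_base_arabic(text):
    # NFKC replaces compatibility characters, including Arabic presentation
    # forms, with their base code points.
    return unicodedata.normalize("NFKC", text)

base_beheh = "\u0680"  # ARABIC LETTER BEHEH
presentation_forms = {
    "isolated": "\uFB5A",
    "final": "\uFB5B",
    "initial": "\uFB5C",
    "medial": "\uFB5D",
}

normalized = {name: to_base_arabic(ch) for name, ch in presentation_forms.items()}
for name, ch in normalized.items():
    print(name, "U+%04X" % ord(ch))

# Ligatures are expanded too: LAM WITH ALEF (isolated form) becomes LAM + ALEF.
lam_alef = to_base_arabic("\uFEFB")
print(["U+%04X" % ord(c) for c in lam_alef])
```

After this step, a stored word and a search term typed with base letters consist of the same code points, so a plain string match can succeed.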