Searching

In Greenstone, you can dictate how users will be able to search each collection. You can select one of three indexers to determine how documents are indexed, and you can create indexes based on any number of metadata fields and the text of the documents.

Greenstone also provides sub-collection and super-collection searching functionality through partition indexes and cross-collection searching, respectively.

Search Indexers

The search indexer determines how documents in a collection will be indexed. The search indexer can be changed in the GLI by clicking the "Change…" button in the top right of the "Search Indexes" section of the Design panel.

These indexers are available:

  • MG: MG is the original indexer used by Greenstone, developed mainly by Alistair Moffat and described in the classic book Managing Gigabytes. It does section level indexing, and searches can be boolean or ranked (not both at once). For each index specified in the collection, a separate physical index is created. For phrase searching, Greenstone does an "AND" search on all the terms, then scans the resulting hits to see if the phrase is present. It has been extensively tested on very large collections (many GB of text).
  • MGPP: MGPP (MG plus plus), the new version of MG, was developed by the New Zealand Digital Library Project. It does word level indexing, which allows fielded, phrase and proximity searching to be handled by the indexer. Boolean searches can be ranked. Only a single index is created for a Greenstone collection: document/section levels and text/metadata fields are all handled by the one index. For collections with many indexes, this results in a smaller collection size than using MG. For large collections, searching may be a bit slower due to the index being word level rather than section level.
  • Lucene: Lucene was developed by the Apache Software Foundation. It handles field and proximity searching, but only at a single level (e.g. complete documents or individual sections, but not both). Therefore document and section indexes for a collection require two separate indexes. It provides a similar range of search functionality to MGPP with the addition of single-character wildcards and range searching. It was added to Greenstone to facilitate incremental collection building, which MG and MGPP can't provide.
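The two-step phrase search described for MG (an "AND" search on all terms, followed by a scan of the hits) can be sketched as follows. This is a minimal illustration in Python, not MG's actual code: the whitespace tokenizer, the in-memory index, and the names build_index and phrase_search are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def phrase_search(phrase, docs, index):
    """AND the terms via the index, then scan the candidates for the phrase."""
    terms = phrase.lower().split()
    if not terms:
        return []
    # Step 1: boolean AND over every term in the phrase.
    candidates = set.intersection(*(index.get(t, set()) for t in terms))
    # Step 2: keep only candidates where the terms occur as a contiguous run.
    n = len(terms)
    hits = []
    for doc_id in sorted(candidates):
        words = docs[doc_id].lower().split()
        if any(words[i:i + n] == terms for i in range(len(words) - n + 1)):
            hits.append(doc_id)
    return hits


# Hypothetical sample documents.
docs = {
    "d1": "managing gigabytes of text",
    "d2": "gigabytes of managing overhead",
    "d3": "text only",
}
index = build_index(docs)
result = phrase_search("managing gigabytes", docs, index)
```

Here d2 contains both words and so survives the AND step, but it is dropped by the scan because the words are not adjacent. A word level index such as MGPP's can answer the phrase query without that second pass.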

Changing the indexer affects how the indexes are built, and may affect search functionality. The following table compares the indexers' features, which are explained in the sections below.

Comparison of indexers
Search Indexer Index Options Index Levels allfields Index Incremental Building Wild card Hot key
Stem Case Accent CJK Document Section Paragraph Inclusive Set by
MG X X X X X X Index
MGPP X X X X X X X Collection X X
Lucene X* X* X X X X Collection X X X
*This option cannot be turned off

Search Indexes

Search indexes specify what metadata fields are searchable. In the GLI, search indexes can be specified in the "Search Indexes" section of the Design panel.

The Assigned Indexes list shows what indexes are currently assigned to the collection. To add an index, click New Index… A popup window appears with a list of sources, which includes text and metadata. Select which sources you want to index. The Select All and Select None buttons will check or uncheck all of the items in the list, respectively. Once a new index has been defined, click Add Index to add it to the collection. Add Index will only become active once the settings describe a new index that is not already assigned to the collection.

For MG indexes, you also need to choose the granularity of the index (document, section, or paragraph), using the "Indexing level:" menu. Each index can only be set to one index level.

For MGPP and Lucene indexes, index granularity is determined globally, not per index. The possible levels (document and section) are displayed on the main "Search Indexes" pane, and can be added to the collection by ticking the checkboxes. You can also select which level is the default for searching.

If either MGPP or Lucene is in use, an "allfields" index and an "Add All" button are available. The "allfields" index provides combined searching over all other assigned indexes, without having to specify a separate index that contains all sources. To add this index, check the "Add combined searching over all assigned indexes (allfields)" check box and click "Add Index". The "Add All" button adds individual indexes for all metadata and text sources, i.e. every metadata field has its own index.

To edit an index, select it and click "Edit Index". A similar dialog to the "New Index" one is shown. To remove an index, select it from the list of Assigned indexes and click "Remove Index".

The order in which the indexes are specified in the Assigned Indexes list is the order they appear in the drop down menu in the search area and on the search page. Use the "Move Up" and "Move Down" buttons to change this ordering.

The default search index, tagged with "[Default Index]" in the "Assigned Indexes" list, can be set by selecting an index from the list and clicking "Set Default". If no default index is set, the first one in the list will be used as the default.

Indexing Options

There are some additional options controlling how the indexes are built. Some of these may not be available for a particular indexer, in which case they will be greyed out.

Stemming and case-folding may be enabled or disabled for MG and MGPP indexes. If enabled, stemmed and case-folded indexes will be created, and the user will have the option of searching with case folding and stemming on or off. If disabled, searching will be case-sensitive and unstemmed, and the options will not be displayed on the preferences page of the collection.

Accent-folding is available for MGPP indexes. This works in a similar way to case-folding, but instead of lower and upper case letters matching, letters with diacritics match those without.

A Lucene index is always accent-folded; no option to switch this on and off will be displayed to the user on the collection's preferences page.
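To make the idea of folding concrete, here is a small Python sketch of case-folding and accent-folding applied to a term before it is looked up. It is an illustration under assumptions, not the code used by MGPP or Lucene: the function names are hypothetical, and the accent folding relies on Unicode decomposition from the standard unicodedata module.

```python
import unicodedata


def fold_case(term):
    """Make upper and lower case letters compare equal."""
    return term.casefold()


def fold_accents(term):
    """Decompose characters, then drop the combining marks (diacritics)."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_key(term, case_fold=True, accent_fold=True):
    """Return the form of a term that would be stored in a folded index."""
    if accent_fold:
        term = fold_accents(term)
    if case_fold:
        term = fold_case(term)
    return term
```

With both options on, index_key("Café") and index_key("cafe") produce the same key, so either spelling finds the other. Letters that have no Unicode decomposition, such as "ø", are not changed by this approach and would need an explicit mapping table.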

Chinese, Japanese and Korean text is often not segmented into individual words. As indexing relies on word breaks being present in the text, this results in an unsearchable index. Setting the "CJK Text Segmentation" option will add spaces between each Chinese/Japanese/Korean character in the text and in search terms, so that character level searching is carried out. The table below summarizes the functionality of each option.

Index Option | Description
Stemming | Generate a stemmed index, which enables searching on word stems. For example, searching for "farm" will also match "farms", "farming" and "farmers" (useful for English and French only).
Casefolding | Generate a case-folded index, which enables case-insensitive searching.
Accent folding | Generate an accent-folded index, which enables searching for words ignoring accents.
CJK Text Segmentation | Segment CJK (Chinese, Japanese, Korean) text. Currently involves inserting a space between each CJK character. This is necessary for searching in CJK text unless the text has been segmented into words already.
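The character-level segmentation described for CJK text can be sketched in Python as below. This is only an approximation of the idea: the Unicode ranges cover common Hiragana, Katakana, CJK ideographs and Hangul syllables and are an assumption of this example, not the ranges Greenstone uses.

```python
import re

# Simplified CJK ranges (an assumption for this sketch):
# Hiragana and Katakana, CJK Unified Ideographs, Hangul syllables.
CJK_CHAR = re.compile(r"([\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af])")


def segment_cjk(text):
    """Surround every CJK character with spaces, then collapse whitespace."""
    spaced = CJK_CHAR.sub(r" \1 ", text)
    return " ".join(spaced.split())
```

For example, segment_cjk("数字图书馆") gives "数 字 图 书 馆". Applying the same function to the query string means a search for "图书" becomes a search for the two adjacent characters, which is why the option has to be applied to both the text and the search terms.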

Setting Index Name

Index names can be modified (after the collection is built) in the "Search" section under the Format panel. This section contains a table listing each search index, level, and partition. Here you can enter the text to be used for the names in the drop-down lists on the search page. This pane only allows you to set the text for one language: the current language used by GLI.

Partition Indexes

Indexes are built on particular text or metadata sources. The search space can be further controlled by partitioning the indexes, either by language or by a predetermined filter. Partition indexes can be set in the "Partition Indexes" section under the Design panel. Partition indexes are a way to create a "sub-collection" within the collection for search purposes.

The "Partition Indexes" view has three tabs: "Define Filters", "Assign Partitions" and "Assign Languages". For more on how to create partitions, visit the partition indexes page.

Searching a collection

MGPP

In collections built with MGPP, users can also use hotkeys in their queries. By appending a hotkey to the end of a word in a query, you can explicitly set case-folding and stemming for that word, regardless of whether or not those features have been turned "on" in the search options.

Hotkeys in MGPP
Hotkey | Description | Example
#s | ignore word endings | catch#s will match catch, catcher, catches, catching, etc.
#u | whole word must match | catch#u will only match catch
#i | ignore case differences | NASA#i will match nasa, NASA, nASA
#c | upper/lower case must match | NASA#c will only match NASA

You can also use more than one hotkey: catch#si will match catch, Catch, catches, CATCHES, etc. The most useful function of hotkeys, however, is the ability to apply them only to certain words in the query. For example, if the search options are set to ignore case, a search query for NASA#c shuttle will match documents that include NASA and shuttle, or NASA and Shuttle. The case for NASA must match, while the case for shuttle need not.
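The per-word override behaviour of hotkeys can be sketched as a small parser. This Python example is an illustration only: the function names and the dictionary layout are hypothetical, and it models just the four hotkeys listed above, not MGPP's real query parser.

```python
def parse_term(token, default_stem=False, default_ignore_case=True):
    """Split 'word#flags' and resolve stemming/case settings for that word."""
    word, _, flags = token.partition("#")
    # Start from the collection-wide search options.
    stem, ignore_case = default_stem, default_ignore_case
    # Each hotkey overrides one option for this word only.
    for flag in flags:
        if flag == "s":
            stem = True
        elif flag == "u":
            stem = False
        elif flag == "i":
            ignore_case = True
        elif flag == "c":
            ignore_case = False
        else:
            raise ValueError(f"unknown hotkey: {flag!r}")
    return {"word": word, "stem": stem, "ignore_case": ignore_case}


def parse_query(query, **defaults):
    """Parse every whitespace-separated term of a query."""
    return [parse_term(token, **defaults) for token in query.split()]
```

Calling parse_query("NASA#c shuttle") with case-insensitive defaults yields a case-sensitive NASA and a case-insensitive shuttle, which matches the example in the text.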

Lucene

Searches in collections built with the Lucene indexer are always case- and accent-insensitive, so searching for "Thøgersen" will match "thogersen" and vice versa. You can read about Lucene query syntax in the Lucene Query Syntax documentation listed under Additional Resources. Some helpful hints are listed below.

Modifier | Description | Examples
"[phrase]" | phrase search | "cat jam" will only return documents that include the phrase "cat jam"
* | multiple letter wildcard; includes zero or more occurrences of any character | cat* will return cat, catch, cataract, etc.; j*s will return jams, jackpots, jeans, etc.
? | single letter wildcard; includes exactly one occurrence of any character | cat? will return cats; j?m will return jam, jim, jem
OR or || | Boolean operator; the default operator if multiple words are used in a query | cat OR jam and cat||jam return documents that include either "cat" or "jam"
AND | Boolean operator; document must contain both words | cat AND jam will only return documents that include both the terms cat and jam
+[word] | required operator; the word next to the plus must be in the document | +cat jam will return documents that must include "cat" and may or may not include "jam"
NOT or - | Boolean operator; excludes documents that contain the term after the NOT | cat NOT jam and cat -jam search for documents that contain "cat" and do not contain "jam"
() | grouping | (cat OR jam) NOT dog searches for documents that contain either "cat" or "jam" and do not contain "dog"
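The difference between the two wildcards can be tried out with Python's standard fnmatch module, whose * and ? have the same "zero or more" and "exactly one" meanings. This is only an analogy for experimenting with patterns, not how Lucene expands wildcards internally; the word list and the expand function are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical vocabulary of indexed terms.
words = ["cat", "cats", "catch", "cataract", "jam", "jim", "jem", "jams", "jeans"]


def expand(pattern, vocabulary):
    """Return the terms in the vocabulary that match a wildcard pattern."""
    return [word for word in vocabulary if fnmatchcase(word, pattern)]


star = expand("cat*", words)      # zero or more trailing characters
question = expand("cat?", words)  # exactly one trailing character
```

Note that cat* includes "cat" itself, because * also matches zero characters, while cat? does not.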

Search Results

The order of search results is dependent on the kind of query you are running. For simple (MG) collections, search results are either ranked (for a 'some' or ranked search) or in build order (for an 'or' or boolean search). MG cannot perform ranking and boolean searching at the same time. For advanced (MGPP) collections, search results can be in ranked or build order as above, but this doesn't depend on the kind of search you are doing. Boolean searches may be ranked.

Build order is the seemingly random order in which documents are processed during import. This can be changed by using the sortmeta option to import.pl. If a metadata element is specified here, then documents will be sorted during import by that metadata.

This option can be specified as an option to import.pl (in GLI Expert mode), or specified in the collect.cfg file. Note that it needs to be added manually to collect.cfg, for example:

    sortmeta dc.Date

It cannot be added to the config file using GLI at this stage.

NOTE: For Greenstone 2.86, to use the sortmeta option with import.pl, you also need to specify -sort or -reverse_sort with ArchivesInfPlugin.
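The effect of sortmeta on build order can be pictured with the following Python sketch. It is a conceptual model only: the document structure and the build_order function are hypothetical, and placing documents that lack the metadata at the end is an assumption of this example rather than documented Greenstone behaviour.

```python
def build_order(docs, sortmeta=None, reverse=False):
    """Return documents in the order they would be added to the index."""
    if sortmeta is None:
        # No sortmeta: keep whatever order the documents were processed in.
        return list(docs)
    have = [d for d in docs if sortmeta in d["meta"]]
    missing = [d for d in docs if sortmeta not in d["meta"]]
    have.sort(key=lambda d: d["meta"][sortmeta], reverse=reverse)
    # Assumption: documents without the metadata go last.
    return have + missing


# Hypothetical documents in processing order.
docs = [
    {"id": "a", "meta": {"dc.Date": "2004"}},
    {"id": "b", "meta": {}},
    {"id": "c", "meta": {"dc.Date": "1999"}},
]
by_date = [d["id"] for d in build_order(docs, "dc.Date")]
```

Because boolean results are returned in build order, sorting at import time by a date field makes those results come back in date order.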

For MGPP collections, in advanced searching mode, the query form has a drop-down box specifying "display search results in ranked/natural order". If you have sorted the documents by metadata, you may want to change the text for this box, e.g. to have it display "ranked/date order". To achieve this, add the following line to the collection's collect.cfg file:

    collectionmacro query:textnatural "date"

For Lucene collections, sorting does not depend on build order. The user can sort by rank, or by any of the fields that have been indexed (apart from text and allfields). This will be offered automatically by Greenstone. If you are indexing sections and want sorting by field to work, you need to make sure that each section has the appropriate metadata. Use the "-sections_index_document_metadata unless_section_metadata_exists" option to buildcol.pl to give each section all the metadata from the top level document.

Faceted Search Results

In Greenstone 3, if you use SOLR as your search indexer, you can have faceted searching. This means you can filter search results based on other metadata. The facet options need to be set up manually in the collectionConfig.xml file, as GLI does not yet allow you to enter them.

Add <facet> elements into the <search> element, in a similar fashion to the index elements:

    <facet name="dls.Organization">
      <displayItem name="name" key="Organization.buttonname"/>
    </facet>

To control how many values for each facet are shown at once, add the following option into your search format element. The default number of rows is 8.

<gsf:option name="facetTableRows" value="4"/>
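What a facet does to a result list can be sketched in Python as follows. This is a conceptual model and not SOLR's implementation: the result structure and function names are hypothetical, and the rows parameter only loosely mirrors the facetTableRows option by limiting how many values are listed.

```python
from collections import Counter


def facet_counts(results, field, rows=8):
    """Count how many results carry each value of a metadata field."""
    counts = Counter(
        doc["meta"][field] for doc in results if field in doc["meta"]
    )
    # Show only the most frequent values, up to the row limit.
    return counts.most_common(rows)


def apply_facet(results, field, value):
    """Narrow the results to those with the chosen facet value."""
    return [doc for doc in results if doc["meta"].get(field) == value]


# Hypothetical search results.
results = [
    {"id": "d1", "meta": {"dls.Organization": "FAO"}},
    {"id": "d2", "meta": {"dls.Organization": "UNESCO"}},
    {"id": "d3", "meta": {"dls.Organization": "FAO"}},
    {"id": "d4", "meta": {}},
]
top = facet_counts(results, "dls.Organization", rows=1)
```

The counts are what a user would see next to each facet value, and selecting a value corresponds to apply_facet.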

Cross-collection searching

Greenstone has a facility for “cross-collection searching,” which allows several collections to be searched at once, with the results combined behind the scenes as though you were searching a single unified collection. Any subset of the collections can be searched.
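One way to picture results being "combined behind the scenes" is a merge of the ranked lists returned by each collection. The Python sketch below is an illustration under assumptions, not Greenstone's method: it assumes each collection returns its hits already sorted by descending score and that scores are comparable across collections, neither of which is stated here. The function name and data layout are hypothetical.

```python
import heapq


def merge_results(per_collection, limit=10):
    """Merge per-collection ranked hits into one list, best score first."""
    tagged = (
        [(score, coll, doc_id) for doc_id, score in hits]
        for coll, hits in per_collection.items()
    )
    # heapq.merge requires each input to be sorted in the same direction.
    merged = heapq.merge(*tagged, key=lambda item: item[0], reverse=True)
    return [(coll, doc_id, score) for score, coll, doc_id in merged][:limit]


# Hypothetical ranked hits from two collections.
per_collection = {
    "demo": [("d1", 0.9), ("d2", 0.4)],
    "papers": [("p1", 0.7), ("p2", 0.1)],
}
combined = merge_results(per_collection)
```

Each merged entry keeps its collection name, so the result list can still link every hit back to the collection it came from.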

Formatting cross-collection search results

SQL Search forms

There are two SQL search forms: simple and advanced.

To provide an SQL search form on your collection's search page, you'll need to build your collection with SQLITE as the database in use rather than GDBM or JDBM. Further, you will have to include "sqlform" in the SearchTypes format feature. When previewing, go to Preferences, activate SQL Fielded form, and choose if you want a simple or advanced query form.

The steps are:

  1. In GLI, go to Design > Browsing Classifiers. Press the button near the top right marked "Change" to change the Database in Use to "SQLITE".
  2. Rebuild your collection.
  3. In the meantime, you can go to Format > Format Features. In the Choose Features box, select SearchTypes. In the HTML Format String box, append "sqlform" to the comma-separated list.
  4. Preview the collection.
  5. Go to Preferences.
  6. In the Search Preferences section, under "Query Style", choose "SQL fielded with fields".
  7. Press the Set Preferences button and click on the Search link in the navbar. Now the SQL fielded search form should display.

Additional Resources

Lucene Query Syntax