Indexing Using SOLR

In Greenstone 3 (version 3.06rc1 onwards), you can use SOLR as the collection indexer in place of MG, MGPP or Lucene.

See http://lucene.apache.org/solr/ for more details about SOLR.

There is only rudimentary, partial support in GLI for building a Greenstone collection with solr as the indexer. At present, GLI will preserve solr-specific elements, such as the option subelement of index (shown in the example further down this page), as well as solr and facet elements. Building a SOLR collection in GLI has two drawbacks in Greenstone 3.06rc1:

  • Building a solr collection in GLI will stop the Greenstone server before building the collection and restart it once the collection has been rebuilt.
  • Java 7 is needed to successfully build a solr collection in GLI, at least on Windows. For this to work, you first need JDK 7 installed on your machine, with the JAVA_HOME environment variable set and JAVA_HOME/bin added to your PATH environment variable. Secondly, you need to move your Greenstone 3 installation's packages/jre out of the way before running GLI, so that GLI finds your Java 7 instead.

Accessing Solr Admin

  • If the Greenstone server and client run on the same PC, open http://127.0.0.1:8383/solr in a browser.
  • If the Greenstone server is remote, first forward the port over SSH by running ssh -L 8383:127.0.0.1:8383 greenstoneserver on the client machine (where greenstoneserver is the host name of your server), then open the same URL in a browser.

Using analyzers specifically suited to different languages

Solr is based on Lucene. In 3.06rc1, Lucene and Solr were upgraded from version 3.3.0 to 4.7.2, which gives access to stemmers and filters for many languages. You can create a solr collection and configure the language of each indexable field as follows:

  1. Create a collection in GLI and gather the documents. Assign metadata.
  2. Close GLI.
  3. Use a text editor to manually edit the collection's etc/collectionConfig.xml configuration file.
  4. Locate the search element and make sure to change its type attribute to solr.
  5. Next, for each indexed field that you do not want analyzed by the default "text_en_splitting", add an option element inside its index element, with the name attribute set to solrfieldtype and the value attribute set to the analyzer for your chosen language, such as text_ja (which uses the Japanese Kuromoji analyzer) or text_es (which uses the Snowball Porter stemming filter for Spanish). For example:
  <search type="solr">
      <level name="section">
          <displayItem lang="en" name="name">chapter</displayItem>
      </level>
      <level name="document">
          <displayItem lang="en" name="name">book</displayItem>
      </level>
      <defaultLevel name="section"/>
      <index name="allfields">
          <displayItem lang="en" name="name">all fields</displayItem>
          <option name="solrfieldtype" value="text_ja" />
      </index>
      <index name="text">
          <displayItem lang="en" name="name">text</displayItem>
          <option name="solrfieldtype" value="text_es" />
      </index>
      <index name="dc.Title,Title">
          <displayItem lang="en" name="name">titles</displayItem>
      </index>
      <index name="dc.Subject">
          <displayItem lang="en" name="name">subjects</displayItem>
      </index>
      <index name="dls.Organization">
          <displayItem lang="en" name="name">organisations</displayItem>
          <option name="solrfieldtype" value="text_es" />
      </index>
      ...

The above example uses the Spanish analyzer on the full-text index and the dls.Organization metadata field, the Japanese analyzer on the combined allfields index, and the default English analyzer on the remaining indexed metadata fields.

The analyzers that will be used for each language are defined in the file ext/solr/conf/schema.xml(.in) located in your Greenstone 3.06 installation. For instance, Japanese uses the Kuromoji analyzer by default, which is optimised to allow natural searching in Japanese. Spanish by default has been set up to use the SnowballPorterFilter.
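
As an illustration of what such a definition looks like, a Spanish field type built on the Snowball Porter filter might be declared roughly as follows. This is a sketch modelled on the stock Solr 4.x example schema, not a copy of the file shipped with Greenstone, so check your own schema.xml(.in) for the actual definition. The stop word file lang/stopwords_es.txt is assumed to be present in the Solr conf directory.

  <!-- Sketch only: the text_es definition in your installation may differ -->
  <fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
          <!-- Split the text into tokens -->
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <!-- Lowercase every token -->
          <filter class="solr.LowerCaseFilterFactory"/>
          <!-- Drop common Spanish words (assumed stop word file) -->
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" format="snowball" />
          <!-- Stem the remaining tokens with the Snowball stemmer for Spanish -->
          <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
      </analyzer>
  </fieldType>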

Diego Spano has investigated this Spanish analyzer's stemming abilities and has found that it does not always produce the expected results. Diego has read that Hunspell may be a better analyzer for Spanish. Hunspell is also available for many other languages. Instructions on how to modify the ext/solr/conf/schema.xml(.in) file to use Hunspell for a language instead are at http://wiki.apache.org/solr/HunspellStemFilterFactory. An example for Polish is at http://solr.pl/en/2012/04/02/solr-4-0-and-polish-language-analysis/

You will need to modify the ext/solr/conf/schema.xml(.in) file before building a new solr collection that will use the modifications.
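
A Hunspell-based field type for Spanish could look something like the sketch below. The field type name text_es_hunspell and the dictionary file names es_ES.dic and es_ES.aff are hypothetical: Hunspell dictionaries are not described on this page as being part of Greenstone, so you would need to obtain them yourself and place them where Solr can load them (typically the same conf directory as schema.xml). Treat this as a starting point to adapt, not as a tested configuration.

  <!-- Sketch only: field type name and dictionary file names are assumptions -->
  <fieldType name="text_es_hunspell" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <!-- Hunspell stemming driven by a dictionary (.dic) and affix (.aff) file pair -->
          <filter class="solr.HunspellStemFilterFactory" dictionary="es_ES.dic" affix="es_ES.aff" ignoreCase="true" />
      </analyzer>
  </fieldType>

Defining a separate field type, rather than changing text_es itself, leaves the existing Spanish analyzer available for comparison. An index would then select the new type by its name in the solrfieldtype option.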

If you want to use your own analyzer, package its Filter and FilterFactory classes in a jar archive and place the jar file in the WEB-INF/lib directory of the solr.war archive located at ./packages/tomcat/webapps/solr.war.
You also need to describe your analyzer in ext/solr/conf/schema.xml.in. The following example sets up a Russian morphology analyzer:

  <fieldType name="text_ru_morph" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ru.txt" format="snowball" />
          <filter class="org.apache.lucene.morphology.russian.RussianFilterFactory"/>
      </analyzer>
  </fieldType>

In the example above, the tokenizer first splits the search string into words (tokens).
The LowerCaseFilterFactory filter then converts every character in each token to lowercase, which simplifies the later analysis.
Next, the StopFilterFactory filter removes tokens that represent common words.
Finally, the custom filter reduces each remaining token to its normalized form.
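
Once a custom field type like this is defined in the schema, a collection could presumably select it in the same way as the built-in language analyzers, through the solrfieldtype option in etc/collectionConfig.xml. The fragment below is a sketch under that assumption. It reuses the text index from the earlier collectionConfig.xml example together with the text_ru_morph name from the field type definition.

  <index name="text">
      <displayItem lang="en" name="name">text</displayItem>
      <!-- Assumption: custom field types are selected like the built-in ones -->
      <option name="solrfieldtype" value="text_ru_morph" />
  </index>

As with any schema change, the collection would need to be rebuilt after the schema file and the jar are in place for the new analyzer to take effect.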