Indexing Using SOLR

In Greenstone 3 (version 3.06rc1 onwards), you can use SOLR as the collection indexer (in place of MG/MGPP/Lucene).

See http://lucene.apache.org/solr/ for more details about SOLR.

There is only rudimentary, partial support in GLI for building a Greenstone collection with solr as the indexer. At present, GLI will preserve solr-specific elements, such as the option subelement of index elements (see the example below), as well as solr and facet elements. Building a SOLR collection in GLI has two drawbacks in Greenstone 3.06rc1:

  • Building a solr collection in GLI will stop the Greenstone server before building the collection and restart it once the collection has been rebuilt.
  • Java 7 is needed to successfully build a solr collection in GLI, at least on Windows. For this to work, you will first need a JDK 7 installed on your machine, with the JAVA_HOME environment variable set and JAVA_HOME/bin added to your PATH environment variable. Secondly, you will need to move your Greenstone 3 installation's packages/jre directory out of the way before running GLI, so that GLI finds your Java 7 instead.

Accessing Solr Admin

If the Greenstone server and client are on the same PC, open http://127.0.0.1:8383/solr in your browser.

If the Greenstone server is remote, you will need to forward the port over SSH: ssh -L 8383:127.0.0.1:8383 greenstoneserver

Using analyzers specifically suited to different languages

Solr is based on Lucene. In 3.06rc1, lucene and solr have been upgraded from v3.3.0 to v4.7.2, which gives access to stemmers and filters for many languages. You can create a solr collection and configure the language of each indexable field as follows:

  1. Create a collection in GLI and gather the documents. Assign metadata.
  2. Close GLI.
  3. Use a text editor to manually edit the collection's etc/collectionConfig.xml configuration file.
  4. Locate the search element and make sure to change its type attribute to solr.
  5. Next, for each indexed field that you do not want analyzed by the default "text_en_splitting", create an option element and set its solrfieldtype attribute to the analyzer for your chosen language, such as text_ja (which will use the Japanese Kuromoji analyzer) or text_es (which uses the PorterStemmer for Spanish).
  <search type="solr">
        <level name="section">
            <displayItem lang="en" name="name">chapter</displayItem>
        </level>
        <level name="document">
            <displayItem lang="en" name="name">book</displayItem>
        </level>
        <defaultLevel name="section"/>
        <index name="allfields">
          <displayItem lang="en" name="name">all fields</displayItem>
          <option name="solrfieldtype" value="text_ja" />
        </index>
        <index name="text">
            <displayItem lang="en" name="name">text</displayItem>
            <option name="solrfieldtype" value="text_es" />
        </index>
        <index name="dc.Title,Title">
            <displayItem lang="en" name="name">titles</displayItem>
        </index>
        <index name="dc.Subject">
            <displayItem lang="en" name="name">subjects</displayItem>
        </index>
        <index name="dls.Organization">
            <displayItem lang="en" name="name">organisations</displayItem>
            <option name="solrfieldtype" value="text_es" />
        </index>
    ...

The above example will use the Spanish analyzer on the full-text and dls.Organization metadata fields, the Japanese analyzer for the combined allfields index, and the default English analyzer for the remaining indexed metadata fields.

The analyzers that will be used for each language are defined in the file ext/solr/conf/schema.xml(.in) located in your Greenstone 3.06 installation. For instance, Japanese uses the Kuromoji analyzer by default, which is optimised to allow natural searching in Japanese. Spanish by default has been set up to use the SnowballPorterFilter.
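As an illustration, a language-specific field type in schema.xml(.in) chains a tokenizer and a series of filters. The following is only a sketch of what the Spanish text_es definition may look like, based on the statement above that Spanish uses the SnowballPorterFilter; the exact definition, including the stopword file path, may differ, so check your own installation's ext/solr/conf/schema.xml(.in):

 <!-- Sketch of a Spanish field type; verify against your installation's
      ext/solr/conf/schema.xml(.in) before relying on it. -->
 <fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" format="snowball"/>
         <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
     </analyzer>
 </fieldType>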

If you want to use your own analyzer, you need to place your FilterFactory and Filter Solr classes in a jar archive. The jar file should be placed in the WEB-INF/lib directory of the solr.war archive located at ./packages/tomcat/webapps/solr.war. You also need to describe your analyzer in ext/solr/conf/schema.xml.in. An example of implementing a Russian morphology analyzer:

 <fieldType name="text_ru_morph" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ru.txt" format="snowball"/>
         <filter class="org.apache.lucene.morphology.russian.RussianFilterFactory"/>
     </analyzer>
 </fieldType>

In the example above, the search string is first split into words (tokens) by the tokenizer. Next, all characters in each token are converted to lowercase by the LowerCaseFilterFactory filter, to simplify the subsequent analysis. In the next stage, the StopFilterFactory filter removes tokens that represent common words. The last stage obtains the normalized form of each word (token) using the custom filter.

Diego Spano has investigated this analyzer's stemming abilities and found that it does not always produce the expected results. He has read that Hunspell may be a better analyzer for Spanish; Hunspell is also available for many other languages. Instructions on modifying the ext/solr/conf/schema.xml(.in) file to use Hunspell for a language instead are at http://wiki.apache.org/solr/HunspellStemFilterFactory. An example for Polish is at http://solr.pl/en/2012/04/02/solr-4-0-and-polish-language-analysis/
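For reference, the Hunspell approach replaces the stemming filter with solr.HunspellStemFilterFactory, which reads a Hunspell dictionary and affix file. The following is only a sketch: the field type name and the dictionary file names (es_ES.dic, es_ES.aff) are illustrative, not files shipped with Greenstone, and the dictionary files must be placed where the schema can find them (see the HunspellStemFilterFactory wiki page for details):

 <!-- Sketch only: the dictionary and affix file names below are examples;
      supply your own Hunspell dictionary for the target language. -->
 <fieldType name="text_es_hunspell" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.HunspellStemFilterFactory" dictionary="es_ES.dic" affix="es_ES.aff" ignoreCase="true"/>
     </analyzer>
 </fieldType>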

You will need to modify the ext/solr/conf/schema.xml(.in) file before building a new solr collection that will use the modifications.

en/user_advanced/solr.1423479584.txt.gz · Last modified: 2016/08/12 07:38 (external edit)