In Greenstone 3 (version 3.06rc1 onwards), you can use Solr as the collection indexer (in place of MG/MGPP/Lucene).
See http://lucene.apache.org/solr/ for more details about Solr.
There is only rudimentary, partial support in GLI for building a Greenstone collection with Solr as the indexer. At present, GLI will preserve solr-specific elements, such as the option subelement of index mentioned below, as well as solr and facet elements.

Building a Solr collection in GLI has 2 drawbacks in Greenstone 3.06rc1:
1. Solr in 3.06rc1 needs Java 7, but GLI may pick up the Java bundled with Greenstone. Move packages/jre out of the way before running GLI, so that GLI finds your Java 7 instead.
2. GLI needs to be able to reach the Greenstone server on port 8383 when working with a Solr collection. If the server is running on a remote machine, forward the port first:

   ssh -L 8383:127.0.0.1:8383 <greenstone-server-machine>
Solr is based on Lucene. In 3.06rc1, Lucene and Solr have been upgraded from v3.3.0 to v4.7.2, as this gives access to stemmers and filters for many languages. You can create a Solr collection and configure the language of each indexable field as follows:
1. In your collection's collectionConfig.xml, find the search element and make sure to change its type attribute to solr.
2. For each index whose language you want to configure, add an option subelement, setting its name attribute to solrfieldtype and its value attribute to the analyzer for your chosen language, such as text_ja (which will use the Japanese Kuromoji analyzer) or text_es (which uses the PorterStemmer for Spanish).

For example:

   <search type="solr">
      <level name="section">
         <displayItem lang="en" name="name">chapter</displayItem>
      </level>
      <level name="document">
         <displayItem lang="en" name="name">book</displayItem>
      </level>
      <defaultLevel name="section"/>
      <index name="allfields">
         <displayItem lang="en" name="name">all fields</displayItem>
         <option name="solrfieldtype" value="text_ja" />
      </index>
      <index name="text">
         <displayItem lang="en" name="name">text</displayItem>
         <option name="solrfieldtype" value="text_es" />
      </index>
      <index name="dc.Title,Title">
         <displayItem lang="en" name="name">titles</displayItem>
      </index>
      <index name="dc.Subject">
         <displayItem lang="en" name="name">subjects</displayItem>
      </index>
      <index name="dls.Organization">
         <displayItem lang="en" name="name">organisations</displayItem>
         <option name="solrfieldtype" value="text_es" />
      </index>
      ...
The above example will use the Spanish analyzer on the full-text and dls.Organization metadata indexes, the Japanese analyzer on the combined allfields index, and the default English analyzer for the remaining indexed metadata fields.
The analyzers that will be used for each language are defined in the file ext/solr/conf/schema.xml(.in)
located in your Greenstone 3.06 installation. For instance, Japanese uses the Kuromoji analyzer by default, which is optimised to allow natural searching in Japanese. Spanish by default has been set up to use the SnowballPorterFilter.
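In a Solr 4.x schema, the Spanish field type will be defined along the following lines (a sketch only; the exact tokenizer, stopword file and filter attributes in your installation's schema.xml(.in) may differ):

   <fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <!-- stopword file path is an assumption; check your schema.xml(.in) -->
         <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" format="snowball"/>
         <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
      </analyzer>
   </fieldType>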
Diego Spano has investigated this Spanish analyzer's stemming abilities and found that it does not always produce the expected results. He has read that Hunspell may be a better analyzer for Spanish; Hunspell is also available for many other languages. Instructions on how to modify the ext/solr/conf/schema.xml(.in) file to use Hunspell for a language instead are at http://wiki.apache.org/solr/HunspellStemFilterFactory. An example for Polish is at http://solr.pl/en/2012/04/02/solr-4-0-and-polish-language-analysis/
You will need to modify the ext/solr/conf/schema.xml(.in) file before building any new Solr collection that is to use the modifications.
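As a sketch, assuming Spanish Hunspell dictionary files named es_ES.dic and es_ES.aff (these are not shipped with Greenstone and would need to be obtained separately and placed where the schema can find them), the Spanish field type might become:

   <fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <!-- es_ES.dic and es_ES.aff are assumed file names; supply your own dictionaries -->
         <filter class="solr.HunspellStemFilterFactory" dictionary="es_ES.dic" affix="es_ES.aff" ignoreCase="true"/>
      </analyzer>
   </fieldType>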
If you want to use your own analyzer, you need to have the Solr FilterFactory and Filter classes placed in a jar archive. The jar file should be placed in the WEB-INF/lib directory of the solr.war archive located at ./packages/tomcat/webapps/solr.war. You also need to describe your analyzer in ext/solr/conf/schema.xml.in. An example of implementing the Russian morphology analyzer:
<fieldType name="text_ru_morph" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ru.txt" format="snowball" /> <filter class="org.apache.lucene.morphology.russian.RussianFilterFactory"/> </analyzer> </fieldType>
In the example above, the search string is split into words (tokens) by the tokenizer.
Next, all characters in each token are converted to lowercase by the LowerCaseFilterFactory filter to simplify further analysis.
In the next stage, the StopFilterFactory filter removes tokens that represent common words (stopwords).
The last stage reduces each word (token) to its normalised form using the custom filter, RussianFilterFactory.
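Once the jar is in place and the field type has been described in the schema, the custom analyzer can be selected from a collection's collectionConfig.xml in the same way as the built-in ones, for example:

   <index name="text">
      <displayItem lang="en" name="name">text</displayItem>
      <!-- the value matches the name of the custom fieldType defined in schema.xml.in -->
      <option name="solrfieldtype" value="text_ru_morph" />
   </index>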