Indexing Using SOLR
In Greenstone 3 (version 3.06rc1 onwards), you can use SOLR as the collection indexer (in place of MG/MGPP/Lucene).
See http://lucene.apache.org/solr/ for more details about SOLR.
There is only rudimentary, partial support in GLI for building a Greenstone collection with solr as the indexer. At present, GLI will preserve solr-specific elements, such as the option subelement of index mentioned below, as well as solr and facet elements. Building a SOLR collection in GLI has 2 drawbacks in Greenstone 3.06rc1:
- Building a solr collection in GLI will stop the Greenstone server before building the collection and restart it when the collection has been rebuilt. Newer versions of GS3 no longer stop and restart the server when building a solr collection, so this is no longer a problem there.
- Java 7+ is needed to successfully build a solr collection in GLI, at least on Windows. For this to work, you will first need to have JDK 7 installed on your machine (with the JAVA_HOME environment variable set up, and with JAVA_HOME/bin added to your PATH environment variable). Secondly, you will need to move your Greenstone 3 installation's packages/jre out of the way before running GLI, so that GLI finds your Java 7 instead.
Accessing Solr Admin
- In the case of a Greenstone server and client running on the same PC, open up http://127.0.0.1:8383/solr in your browser
- In the case of a remote Greenstone server, you need to forward ports. Assuming the default GS3 port 8383, on a Linux terminal you'd do:
ssh -L 8383:127.0.0.1:8383 <greenstone-server-machine>
Using analyzers specifically suited to different languages
Solr is based on Lucene. In 3.06rc1, lucene and solr have been upgraded from v 3.3.0 to 4.7.2, as this gives access to stemmers and filters for many languages. You can create a solr collection and configure the language of each indexable field as follows:
- Create a collection in GLI and gather the documents. Assign metadata.
- Close GLI.
- Use a text editor to manually edit the collection's etc/collectionConfig.xml configuration file.
- Locate the search element and make sure to change its type attribute to solr.
- Next, for each indexed field that you do not want analyzed by the default "text_en_splitting", add an option element named solrfieldtype to that index and set its value attribute to the analyzer for your chosen language, such as text_ja (which will use the Japanese Kuromoji analyzer) or text_es (which uses the Snowball Porter stemmer for Spanish).
<search type="solr">
  <level name="section">
    <displayItem lang="en" name="name">chapter</displayItem>
  </level>
  <level name="document">
    <displayItem lang="en" name="name">book</displayItem>
  </level>
  <defaultLevel name="section"/>
  <index name="allfields">
    <displayItem lang="en" name="name">all fields</displayItem>
    <option name="solrfieldtype" value="text_ja" />
  </index>
  <index name="text">
    <displayItem lang="en" name="name">text</displayItem>
    <option name="solrfieldtype" value="text_es" />
  </index>
  <index name="dc.Title,Title">
    <displayItem lang="en" name="name">titles</displayItem>
  </index>
  <index name="dc.Subject">
    <displayItem lang="en" name="name">subjects</displayItem>
  </index>
  <index name="dls.Organization">
    <displayItem lang="en" name="name">organisations</displayItem>
    <option name="solrfieldtype" value="text_es" />
  </index>
  ...
The above example uses the Spanish analyzer on the full-text and dls.Organization metadata fields, the Japanese analyzer for the combined allfields index, and the default English analyzer for the remaining indexed metadata fields.
The analyzers that will be used for each language are defined in the file ext/solr/conf/schema.xml(.in) located in your Greenstone 3.06 installation. For instance, Japanese uses the Kuromoji analyzer by default, which is optimised to allow natural searching in Japanese. Spanish by default has been set up to use the SnowballPorterFilter.
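As a rough illustration of what such a per-language definition looks like, here is a minimal sketch of a Spanish field type built around Solr's SnowballPorterFilterFactory. It is not copied from Greenstone's schema.xml(.in): the exact filter chain and the stopword file path are assumptions, so check the definition shipped with your installation.

```xml
<!-- Sketch only: the filter chain and stopword path are assumptions,
     not Greenstone's shipped text_es definition. -->
<fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumes a Spanish stopword list exists at this path in the Solr conf directory -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" format="snowball"/>
    <!-- Snowball Porter stemmer configured for Spanish -->
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
  </analyzer>
</fieldType>
```

The name attribute of the fieldType is what an index's solrfieldtype option in collectionConfig.xml refers to.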
Diego Spano has investigated this Spanish analyzer's stemming abilities and has found that it does not always produce the expected results. Diego has read that Hunspell may be a better analyzer for Spanish. Hunspell is also available for many other languages. Instructions on how to modify the ext/solr/conf/schema.xml(.in) file to use Hunspell for a language instead are at http://wiki.apache.org/solr/HunspellStemFilterFactory. An example for Polish is at http://solr.pl/en/2012/04/02/solr-4-0-and-polish-language-analysis/
You will need to modify the ext/solr/conf/schema.xml(.in) file before building a new solr collection that will use the modifications.
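A Hunspell-based field type for Spanish might look roughly like the sketch below. The field type name text_es_hunspell and the dictionary file names es_ES.dic and es_ES.aff are hypothetical: Hunspell dictionaries are not described here as part of Greenstone, so you would need to obtain them separately and place them where Solr can load them (typically alongside schema.xml). This has not been tested against a Greenstone installation.

```xml
<!-- Sketch only: the field type name and dictionary file names are hypothetical. -->
<fieldType name="text_es_hunspell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumes a Spanish stopword list exists at this path in the Solr conf directory -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" format="snowball"/>
    <!-- dictionary and affix must point at Hunspell files you supply yourself -->
    <filter class="solr.HunspellStemFilterFactory" dictionary="es_ES.dic" affix="es_ES.aff" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Defining this under a new name, rather than editing text_es in place, would let you compare the two stemmers on different indexes of the same collection.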
If you want to use your own analyzer, you need to package its FilterFactory and Filter Solr classes in a jar archive. Place the jar file in the WEB-INF/lib directory of the solr.war archive located at ./packages/tomcat/webapps/solr.war. You also need to describe your analyzer in ext/solr/conf/schema.xml.in. An example of implementing the Russian morphology analyzer:
<fieldType name="text_ru_morph" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ru.txt" format="snowball" />
    <filter class="org.apache.lucene.morphology.russian.RussianFilterFactory"/>
  </analyzer>
</fieldType>
In the example above, the tokenizer first splits the search string into words (tokens).
The LowerCaseFilterFactory filter then converts all characters in each token to lowercase, which simplifies the analysis that follows.
Next, the StopFilterFactory filter removes tokens that represent common words.
Finally, the custom filter reduces each word (token) to its normalized form.
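To apply the custom field type to a collection, an index in etc/collectionConfig.xml can presumably reference it by name through the solrfieldtype option, in the same way as text_ja and text_es earlier on this page. The fragment below is a sketch that assumes the text_ru_morph field type has been added to schema.xml.in and that the text index is the one you want analyzed in Russian:

```xml
<!-- Sketch: goes inside <search type="solr"> in etc/collectionConfig.xml.
     Assumes text_ru_morph is defined in ext/solr/conf/schema.xml.in. -->
<index name="text">
  <displayItem lang="en" name="name">text</displayItem>
  <option name="solrfieldtype" value="text_ru_morph" />
</index>
```

The collection needs to be rebuilt after this change so that the field is indexed with the new analyzer.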