This version (2014/04/14 11:52) is a draft.

Language Considerations

When creating collections of documents in some languages, certain technical issues should be considered to ensure optimal searching and browsing capabilities for users.

Searching Arabic OCR text

Because Arabic is a cursive script, it is displayed on screen using a different set of characters from the one used to store it; a character looks different depending on whether it appears at the start, middle or end of a word.

For example:

The character ڀ "The Letter Beheh" has the Unicode code point U+0680

The presentation forms of the same character are:

  • Isolated ﭚ has Unicode U+FB5A
  • Final ﭛ has Unicode U+FB5B
  • Initial ﭜ has Unicode U+FB5C
  • Medial ﭝ has Unicode U+FB5D

In addition, multiple letters may be merged together into ligatures when the script is displayed.

Data should not be saved in any presentation form, as stated on the Unicode website:

"[Using the Arabic presentation forms in a data file] is strongly discouraged and not recommended because it does not guarantee data integrity and interoperability. Data files should include only the Arabic letters in the Arabic block (U+0600..U+06FF) or the Arabic Supplement block (U+0750..U+077F)." Unicode Middle Eastern Scripts and Languages FAQ

The issue is that when data is stored in presentation forms, words will not be matched during a search. This is because the underlying Unicode code points are very different, even if the word searched for is displayed identically.
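
One way to repair text that was saved in presentation forms is Unicode compatibility normalization. The following is a minimal sketch in Python, not part of Greenstone; the function name to_base_letters is made up for illustration. It relies on the fact that the four Beheh presentation forms listed above carry compatibility decompositions to U+0680, so NFKC normalization maps them back to the base letter.

```python
import unicodedata


def to_base_letters(text: str) -> str:
    """Map compatibility characters, including Arabic presentation
    forms, back to their base letters using NFKC normalization.

    Hypothetical helper for illustration; not a Greenstone function.
    """
    return unicodedata.normalize("NFKC", text)


# The four presentation forms of Beheh from the list above.
forms = {
    "isolated": "\uFB5A",
    "final": "\uFB5B",
    "initial": "\uFB5C",
    "medial": "\uFB5D",
}

for name, ch in forms.items():
    folded = to_base_letters(ch)
    # Each form maps back to the base letter U+0680.
    print(f"{name}: U+{ord(ch):04X} -> U+{ord(folded):04X}")
```

Note that NFKC is not specific to Arabic: it also rewrites other compatibility characters (for example Latin ligatures and full-width forms), so it is worth checking that this is acceptable for the collection before applying it to all text.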

Accent-folding Diacritics

In any collection with diacritics in searchable or browsable metadata, two problems arise: users (particularly international ones) may be unable or reluctant to search for these terms because the characters are not on their keyboards, and the sort order may appear wrong.

In such cases the filter_text() function can be used to map "unwanted" characters onto less problematic ones. This How-to is intended to aid with the necessary modifications.
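
To illustrate the general idea of accent folding, here is a minimal Python sketch. It is not the filter_text() function itself and makes no claim about how that function is implemented; the name fold_accents is made up for illustration. It assumes that folding means decomposing each character (NFD) and then dropping the combining marks.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip combining diacritical marks from text.

    Hypothetical helper for illustration: decompose to NFD so that
    accented letters become base letter + combining mark, then drop
    every character that has a non-zero combining class.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents("Māori café"))  # prints: Maori cafe
```

Letters that have no canonical decomposition, such as ø, pass through this sketch unchanged, so a real mapping would probably need an explicit table for such characters.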

Adding collection-specific service text strings

Text strings for services and similar components are defined in Java properties files. These live in web/WEB-INF/classes. Each class may have its own file; otherwise it will use a superclass's file. For example, on an MGPP search page, the text string in the submit button "Search" is defined by the key TextQuery.submit. The MGPP TextQuery service is defined by the GS2MGPPSearch class. Greenstone will first look for a properties file called GS2MGPPSearch.properties. If the key is not found there, it will look in the properties files of all the superclasses. At the time of writing, these are: AbstractGS2FieldSearch, AbstractGS2TextSearch, AbstractTextSearch, AbstractSearch.

Within the search for each class's file, Java's resource bundle loading applies: it tries to load the current language first, then the default.

For example, say the default language is English (en) and the current language is Maori (mi). To find the TextQuery.submit string, we ask Java to load the GS2MGPPSearch resource bundle. Java will look for GS2MGPPSearch_mi.properties, then GS2MGPPSearch_en.properties, then GS2MGPPSearch.properties. If the key is found, the search stops. Otherwise we ask Java to load the AbstractGS2FieldSearch resource bundle, and try all languages there. So GS2MGPPSearch.properties will be used in preference to AbstractGS2FieldSearch_mi.properties.
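
The search order described above can be sketched as follows. This is a Python model of the order as this page describes it, not Greenstone's actual Java code and not a full account of Java's resource bundle rules; the function name candidate_files is made up for illustration.

```python
from typing import List

# Class chain for the MGPP TextQuery service, most specific first,
# as listed on this page.
CLASS_CHAIN = [
    "GS2MGPPSearch",
    "AbstractGS2FieldSearch",
    "AbstractGS2TextSearch",
    "AbstractTextSearch",
    "AbstractSearch",
]


def candidate_files(class_chain: List[str], current_lang: str,
                    default_lang: str) -> List[str]:
    """Return properties file names in the order they would be tried.

    Hypothetical helper: for each class, try the current language,
    then the default language, then the unsuffixed file, before
    moving on to the next superclass.
    """
    files: List[str] = []
    for cls in class_chain:
        for suffix in (f"_{current_lang}", f"_{default_lang}", ""):
            name = f"{cls}{suffix}.properties"
            if name not in files:  # avoid duplicates when current == default
                files.append(name)
    return files


order = candidate_files(CLASS_CHAIN, "mi", "en")
for name in order[:4]:
    print(name)
```

The first four names printed are the three GS2MGPPSearch files followed by AbstractGS2FieldSearch_mi.properties, which matches the preference stated above.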

To make collection-specific text strings, make a copy of the most specific file(s), in this case GS2MGPPSearch.properties (or whichever language version(s) you are changing). Add the copies to your collection's resources folder (greenstone3/web/sites/localsite/collect/collname/resources), then modify the strings you want to change. Note that you need to define all the strings from the original: if Greenstone finds, for example, GS2MGPPSearch.properties in the collection, it won't load GS2MGPPSearch.properties from the default area.
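
Because a collection-level file replaces the default one entirely, it can help to check that no keys were dropped while editing. The following Python sketch is an assumption-laden illustration, not a Greenstone tool: the function names are made up, the key TextQuery.name in the sample data is hypothetical, and the parser only handles simple key=value lines (real Java properties files also allow ":" separators, line continuations and escapes).

```python
from typing import Dict, List


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and comments.

    Hypothetical, simplified parser; it does not implement the full
    Java properties file format.
    """
    entries: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            entries[key.strip()] = value.strip()
    return entries


def missing_keys(original_text: str, override_text: str) -> List[str]:
    """Return keys defined in the original file but absent from the override."""
    original = parse_properties(original_text)
    override = parse_properties(override_text)
    return sorted(set(original) - set(override))


# Sample data: TextQuery.submit appears on this page; TextQuery.name is
# a made-up key used only to show a missing entry.
original = "# default strings\nTextQuery.submit=Search\nTextQuery.name=Text Search\n"
override = "TextQuery.submit=Find\n"

print(missing_keys(original, override))  # prints: ['TextQuery.name']
```

In practice the two strings would be read from the default file and from the copy in the collection's resources folder.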

Entering non-English metadata in GLI

Metadata in the GLI should be entered in UTF-8. If your system doesn't allow typing directly in UTF-8 (your metadata looks like ??? in GLI), then type your metadata in another application such as Notepad, save it as UTF-8, then open it again and copy and paste it into GLI. If the metadata has been properly entered in UTF-8, it should appear correctly in a browser once the collection is built.

If your metadata appears as square boxes in GLI, then you will need to use a different font to display it. You can change the font in GLI by going to File→Preferences. The font that you will need to use depends on what language you are using and what fonts are installed on your computer. A good one to try is Arial Unicode MS, PLAIN, 12.