— | en:user:language_considerations [2023/03/13 01:46] (current) – created - external edit 127.0.0.1 | ||
====== Language Considerations ======

When creating collections of documents in some languages, certain technical issues should be considered to ensure optimal searching and browsing capabilities for users.


===== Searching Arabic OCR text =====

Because Arabic is a cursive script, each letter is displayed in a different shape depending on whether it appears at the start, middle or end of a word, or on its own. Unicode encodes these shapes as separate presentation-form characters in addition to the base letter.

For example:

The character ڀ (the letter Beheh) has the Unicode code point U+0680.

The same character in its different presentation forms is:
  * Isolated ﭚ has the code point U+FB5A
  * Final ﭛ has the code point U+FB5B
  * Initial ﭜ has the code point U+FB5C
  * Medial ﭝ has the code point U+FB5D

In addition, multiple letters may be merged into a single combined shape when the script is presented.

When the data is saved, it should not be saved in any presentation form, as stated on the Unicode website:

//"[...] recommended because it does not guarantee data integrity and interoperability. Data files should include only the Arabic letters in the Arabic block (U+0600..U+06FF) or the Arabic Supplement block (U+0750..U+077F)."//

The issue is that, when the data is stored in presentation form, a search will not match the words. This is because the underlying Unicode code points are very different, even if the stored word and the search term are presented identically.
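
The mismatch can be illustrated with a short Java sketch. It is only an illustration, not part of Greenstone: the class and method names (''ArabicForms'', ''hasPresentationForms'', ''toBaseLetters'') are made up for this example. It assumes that compatibility normalization (NFKC) from the standard ''java.text.Normalizer'' class is an acceptable way to map presentation forms back to base letters. Note that NFKC also rewrites other compatibility characters, so it may change more than the Arabic text.

```java
import java.text.Normalizer;

// Hypothetical helper class for this example only.
final class ArabicForms {

    /** True if the text contains a code point from the Arabic Presentation Forms-A or -B blocks. */
    static boolean hasPresentationForms(String text) {
        return text.codePoints().anyMatch(cp -> {
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            return block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
        });
    }

    /** Maps presentation forms back to base letters using NFKC normalization. */
    static String toBaseLetters(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String stored = "\uFB5A"; // Beheh, isolated presentation form
        String query = "\u0680";  // Beheh, base letter

        System.out.println(stored.equals(query));                // false: the search misses
        System.out.println(hasPresentationForms(stored));        // true
        System.out.println(toBaseLetters(stored).equals(query)); // true: the search matches
    }
}
```

A check like ''hasPresentationForms'' could be run over OCR output before building a collection, to find files that need cleaning.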

===== Accent-folding Diacritics =====
In any collection with [[http://
that (international) users might not be able to search (or at least are inhibited from searching) for these terms, since they don't have those characters on their keyboards, or that the sorting behavior might appear improper.

In such cases the //
onto less problematic ones. This how-to is intended to aid with the necessary modifications.
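
As a general illustration of what accent folding does, here is a minimal Java sketch. It is not the Greenstone mechanism referred to above; the class name ''AccentFold'' and its ''fold'' method are invented for this example. It assumes that decomposing the text (NFD) and dropping the combining marks is a sufficient folding rule.

```java
import java.text.Normalizer;

// Hypothetical helper class for this example only.
final class AccentFold {

    /** Decomposes each character (NFD), then removes the combining marks. */
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("M\u0101ori")); // Maori
        System.out.println(fold("caf\u00e9"));  // cafe
    }
}
```

Letters that have no decomposition, such as ø or ß, pass through unchanged with this approach, so a real folding table may need extra mappings.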


===== Adding collection-specific service text strings =====
Text strings for services etc. are defined in Java properties files. These live in web/

Within each class's file search, Java's resource bundle loading applies: it tries to load the file for the current language first, then the default.

For example, say the default language is English (en) and the current language is Maori (mi). To find the TextQuery.submit string, we ask Java to load the GS2MGPPSearch resource bundle. Java will look for GS2MGPPSearch_mi.properties,
So GS2MGPPSearch.properties will be used in preference to AbstractGS2FieldSearch_mi.properties.
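
The lookup order described above can be sketched in Java as follows. This is a simplified illustration built on the standard ''java.util.ResourceBundle'' class, not Greenstone's actual code. The ''BundleLookup'' class, its ''find'' and ''demo'' methods, the temporary directory and the string values are all assumptions made for the example. Only the bundle names and the TextQuery.submit key come from the text above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.MissingResourceException;
import java.util.ResourceBundle;

// Hypothetical helper class for this example only.
final class BundleLookup {

    /** Tries each bundle name in order (most specific class first) and returns the first value found. */
    static String find(List<String> bundleNames, String key, Locale locale, ClassLoader loader) {
        for (String name : bundleNames) {
            try {
                // Java tries name_<language> first, then falls back to the default file.
                ResourceBundle bundle = ResourceBundle.getBundle(name, locale, loader);
                if (bundle.containsKey(key)) {
                    return bundle.getString(key);
                }
            } catch (MissingResourceException e) {
                // No properties file at all for this class; try the next one.
            }
        }
        return null;
    }

    /** Builds two throwaway properties files and looks up TextQuery.submit for Maori (mi). */
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("bundles");
            Files.write(dir.resolve("GS2MGPPSearch.properties"),
                    "TextQuery.submit=default string\n".getBytes(StandardCharsets.US_ASCII));
            Files.write(dir.resolve("AbstractGS2FieldSearch_mi.properties"),
                    "TextQuery.submit=mi string\n".getBytes(StandardCharsets.US_ASCII));
            try (URLClassLoader loader = new URLClassLoader(new URL[] { dir.toUri().toURL() })) {
                return find(Arrays.asList("GS2MGPPSearch", "AbstractGS2FieldSearch"),
                        "TextQuery.submit", Locale.forLanguageTag("mi"), loader);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // default string
    }
}
```

Because the search stops at the first class whose bundle defines the key, the default-language file of the more specific class wins over the Maori file of the more general one.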

To make collection-specific text strings, make a copy of the most specific file(s), in this case GS2MGPPSearch.properties (or whichever language version(s) you are changing). Add these into your collection'
Note that you need to define all strings from the original: if Greenstone finds, e.g., GS2MGPPSearch.properties in the collection, it won't load GS2MGPPSearch.properties from the default area.
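
Since a collection copy replaces the default file rather than merging with it, it can help to check that no strings were dropped. The following Java sketch shows one way to do that with the standard ''java.util.Properties'' class. The ''OverrideCheck'' class, its methods and the ''Example.other'' key are hypothetical, and in practice the two ''Properties'' objects would be loaded from the real files.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class for this example only.
final class OverrideCheck {

    /** Returns the keys defined in the original file but absent from the collection copy. */
    static Set<String> missingKeys(Properties original, Properties collectionCopy) {
        Set<String> missing = new TreeSet<>(original.stringPropertyNames());
        missing.removeAll(collectionCopy.stringPropertyNames());
        return missing;
    }

    /** Parses properties-file text; stands in for reading a file from disk. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties original = parse("TextQuery.submit=Search\nExample.other=Other\n");
        Properties copy = parse("TextQuery.submit=Find\n");
        System.out.println(missingKeys(original, copy)); // [Example.other]
    }
}
```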


===== Entering non-English metadata in GLI =====
Metadata in the GLI should be entered in UTF-8. If your system doesn'

If your metadata appears as square boxes in GLI, you will need to use a different font to display it. You can change the font in GLI by going to File->
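
Regarding the UTF-8 requirement in this section: if metadata is prepared outside GLI, it may be worth confirming that it really is valid UTF-8 before importing it. The Java sketch below shows one possible check using the standard ''java.nio.charset'' decoder. It is not a GLI feature, and the ''Utf8Check'' class and its ''isValidUtf8'' method are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for this example only.
final class Utf8Check {

    /** True if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "M\u0101ori".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(latin1)); // false
    }
}
```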
en/user/language_considerations.txt · Last modified: 2023/03/13 01:46 by 127.0.0.1