
More about indexing

From GreenstoneWiki

Can Greenstone search Arabic OCR text?

(Thanks to Graeme Foster for this info)

Arabic is a cursive script, so each letter is drawn differently depending on whether it stands alone or appears at the start, middle or end of a word. To display the text with the letters joined up correctly, a separate set of "presentation form" characters is used, distinct from the code point of the letter itself. (This is a very crude description, but it is enough to explain the search problem.)

For example:

The character ڀ "The Letter Beheh" has UNICODE U+0680

The presentation forms of the same character are:

Isolated ﭚ has UNICODE U+FB5A
Final ﭛ has UNICODE U+FB5B
Initial ﭜ has UNICODE U+FB5C
Medial ﭝ has UNICODE U+FB5D
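The relationship between the base letter and its presentation forms can be checked against the Unicode character database. The following is a minimal sketch using Python's standard unicodedata module; the names FORMS, describe and presentation_form_table are made up for this illustration and are not part of Greenstone.

```python
import unicodedata

# Code points for the letter Beheh and its four presentation forms.
BASE = "\u0680"
FORMS = {
    "isolated": "\uFB5A",
    "final": "\uFB5B",
    "initial": "\uFB5C",
    "medial": "\uFB5D",
}

def describe(ch):
    """Return the code point of ch plus its compatibility decomposition, if any."""
    decomposition = unicodedata.decomposition(ch) or "(no decomposition)"
    return f"U+{ord(ch):04X} {decomposition}"

def presentation_form_table():
    """Map each form name (hypothetical labels) to its description."""
    return {name: describe(ch) for name, ch in FORMS.items()}

if __name__ == "__main__":
    print("base    ", describe(BASE))
    for name, text in presentation_form_table().items():
        print(f"{name:8}", text)
```

Each presentation form carries a compatibility decomposition that points back to U+0680, which is how software can tell that the four shapes are the same letter.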

In addition, some sequences of letters are merged into a single combined shape when the script is presented.

The problem: data should not be saved in any presentation form. To quote the UNICODE FAQ on this matter:

Q: Can one use the Arabic presentation forms in a data file?

A: It is strongly discouraged and not recommended because it does not guarantee data integrity and interoperability. Data files should include only the Arabic script code values that are defined in Row 6, U+0600 to U+06FF.

The issue is that when the data is stored in presentation form, the words will not be matched when doing a search. This is understandable once you realise that the underlying UNICODE code points are very different, even if the word searched for looks identical on screen.
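One possible workaround, offered here only as a sketch and not as something Greenstone does for you, is to convert the OCR text back to the base Arabic letters before it is imported and indexed. The example below assumes Python and its standard unicodedata module; the function names to_base_letters and contains_presentation_forms are hypothetical. Note that NFKC normalisation also rewrites other compatibility characters (full-width digits, for example), so check that this is acceptable for your data.

```python
import unicodedata

def contains_presentation_forms(text):
    """True if any character lies in Arabic Presentation Forms-A or -B.

    Range check only: U+FB50..U+FDFF and U+FE70..U+FEFC.
    """
    return any(
        0xFB50 <= ord(c) <= 0xFDFF or 0xFE70 <= ord(c) <= 0xFEFC
        for c in text
    )

def to_base_letters(text):
    """Replace presentation forms with the base letters via NFKC normalisation."""
    return unicodedata.normalize("NFKC", text)

if __name__ == "__main__":
    stored = "\uFB5C"   # Beheh saved in its initial presentation form
    query = "\u0680"    # Beheh as a search term would normally be encoded

    print(stored == query)                    # the raw code points differ
    print(to_base_letters(stored) == query)   # they agree after normalisation
```

Running the same conversion over both the stored text and the search term means the two are compared on base code points rather than on display shapes.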