How to handle diacritics

Problem statement

In any collection whose searchable or browsable metadata contains diacritics, two problems can occur: (international) users may be unable to search for such terms, or at least be hindered in doing so, because their keyboards lack those characters; and the sort order may appear wrong.

In such cases the filter_text() function can be used to map "unwanted" characters onto less problematic ones. This How-to is intended to aid with the necessary modifications.

NOTE: Since version 2.71, MGPP comes with an "accent folding" option. This is described here [Shaoqun Wu]

Preliminaries

The following is tailored to v2.62 but should apply to other (possibly only newer) versions as well. For the classifier modifications, AZCompactList serves as the example.

For users to still be able to enter diacritics into the search form, additional modifications to some macros are required. Depending on what you want to offer your users, this may not be necessary in every case. Besides, it gets quite complicated if mgpp is used (maybe I'll add that later).

The instructions are organized as follows:

file name [line number/position]:

description

 original code snippet

modify to:

 modified code snippet

File names are given relative to the installation's perllib directory ($GSDLHOME/perllib).

Instructions

Perl modules

mgbuildproc.pm [49-55] and mgppbuildproc.pm [142-148]:

Comment out the existing stub so that the filter_text() function from basebuildproc.pm is applied to both mg and mgpp collections.

 sub filter_text {
     # $self->filter_text ($field, $new_text);
     # don't want to do anything for this version, however,
     # in a particular collection you might want to override
     # this method to post-process certain fields depending on
     # the field, or whether we are outputting it for indexing
 }

modify to:

#sub filter_text {
#    # $self->filter_text ($field, $new_text);
#    # don't want to do anything for this version, however,
#    # in a particular collection you might want to override
#    # this method to post-process certain fields depending on
#    # the field, or whether we are outputting it for indexing
#}

basebuildproc.pm [near end of file]:

Use this filter_text() function for both mg and mgpp.

add:

sub filter_text {
    # only filter if we are indexing
    return unless shift->{'indexing_text'};

    &sorttools::filter_characters($_[1]);
}

sorttools.pm [43-45]:

Filter characters for sorting.

    if ($metaname eq "Language") {
        $metavalue = $iso639::fromiso639{$metavalue};
        return $metavalue;
    }

    my $lang;
    if (defined $doc_obj) {
        $lang = $doc_obj->get_metadata_element ($doc_obj->get_top_section(), 'Language');
    }
    $lang = 'en' unless defined $lang;

modify to:

    if ($metaname eq "Language") {
        $metavalue = $iso639::fromiso639{$metavalue};
        return $metavalue;
    }

    # filter characters for sorting
    filter_characters($metavalue, 1);

    my $lang;
    if (defined $doc_obj) {
        $lang = $doc_obj->get_metadata_element ($doc_obj->get_top_section(), 'Language');
    }
    $lang = 'en' unless defined $lang;

sorttools.pm [near end of file, but before '1;']:

This is where the action takes place.

add:

# filter characters for searching and sorting
# NOTE: keep in sync with the filter_characters
# function in _query:dummypagescriptextra_

# for searching (will also affect sorting)
my %indx_charmap = (
    'à' => 'a',   # A WITH GRAVE
    'á' => 'a',   # A WITH ACUTE
    'â' => 'a',   # A WITH CIRCUMFLEX
    'ã' => 'a',   # A WITH TILDE
    'å' => 'a',   # A WITH RING ABOVE
    'æ' => 'a',   # AE
    'ç' => 'c',   # C WITH CEDILLA
    'è' => 'e',   # E WITH GRAVE
    'é' => 'e',   # E WITH ACUTE
    'ê' => 'e',   # E WITH CIRCUMFLEX
    'ë' => 'e',   # E WITH DIAERESIS
    'ì' => 'i',   # I WITH GRAVE
    'í' => 'i',   # I WITH ACUTE
    'î' => 'i',   # I WITH CIRCUMFLEX
    'ï' => 'i',   # I WITH DIAERESIS
    'ð' => 'dh',  # ETH
    'ñ' => 'n',   # N WITH TILDE
    'ò' => 'o',   # O WITH GRAVE
    'ó' => 'o',   # O WITH ACUTE
    'ô' => 'o',   # O WITH CIRCUMFLEX
    'õ' => 'o',   # O WITH TILDE
    'ø' => 'o',   # O WITH STROKE
    'ù' => 'u',   # U WITH GRAVE
    'ú' => 'u',   # U WITH ACUTE
    'û' => 'u',   # U WITH CIRCUMFLEX
    'ý' => 'y',   # Y WITH ACUTE
    'þ' => 'th',  # THORN
    'ÿ' => 'y',   # Y WITH DIAERESIS

    # now we want to break umlauts
    'ä' => 'ae',  # A WITH DIAERESIS
    'ö' => 'oe',  # O WITH DIAERESIS
    'ü' => 'ue',  # U WITH DIAERESIS
    'ß' => 'ss'   # SHARP S
);

# for sorting in classifier lists (won't affect searching)
my %sort_charmap = (
    'ä' => 'a',   # A WITH DIAERESIS
    'ö' => 'o',   # O WITH DIAERESIS
    'ü' => 'u',   # U WITH DIAERESIS
    'ß' => 'ss'   # SHARP S
);

# "duplicate" maps for utf8 representations
$indx_charmap{&unicode::ascii2utf8(\$_)} = $indx_charmap{$_}
    foreach keys %indx_charmap;
$sort_charmap{&unicode::ascii2utf8(\$_)} = $sort_charmap{$_}
    foreach keys %sort_charmap;

# join charmaps ('sort' overwriting 'indx')
my %charmap = (
    'indx' => { %indx_charmap },
    'sort' => { %indx_charmap, %sort_charmap }
);

# create OR lists of characters
my %or_chars = (
    'indx' => join('|' => sort keys %{$charmap{'indx'}}),
    'sort' => join('|' => sort keys %{$charmap{'sort'}})
);

# and now the actual filter function -- simple, eh? ;-)
# with a true second argument %sort_charmap is taken
# into account (taking precedence over %indx_charmap)
sub filter_characters {
    my $arg = $_[1] ? 'sort' : 'indx';

    my $chr = $or_chars{$arg};
    my $map = $charmap{$arg};

    $_[0] =~ s/($chr)/$map->{$1}/g;
}
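Since the page script in the macro files has to stay in sync with this function, it may help to see the two modes side by side in JavaScript. The following is only an illustrative mirror of the Perl logic above, not Greenstone code: the names indxCharmap, sortCharmap and filterCharacters are hypothetical, and only a handful of map entries are reproduced.

```javascript
// Illustrative mirror of the Perl maps above (abbreviated); names are hypothetical.
const indxCharmap = {
  "\u00e9": "e",   // E WITH ACUTE
  "\u00e4": "ae",  // A WITH DIAERESIS
  "\u00f6": "oe",  // O WITH DIAERESIS
  "\u00fc": "ue",  // U WITH DIAERESIS
  "\u00df": "ss"   // SHARP S
};

const sortCharmap = {
  "\u00e4": "a",   // A WITH DIAERESIS
  "\u00f6": "o",   // O WITH DIAERESIS
  "\u00fc": "u",   // U WITH DIAERESIS
  "\u00df": "ss"   // SHARP S
};

// join charmaps ('sort' entries overwrite 'indx' entries)
const charmap = {
  indx: { ...indxCharmap },
  sort: { ...indxCharmap, ...sortCharmap }
};

// With a true second argument the sort map is used, otherwise the index map.
function filterCharacters(text, forSorting) {
  const map = charmap[forSorting ? "sort" : "indx"];
  const pattern = new RegExp(Object.keys(map).join("|"), "g");
  return text.replace(pattern, (ch) => map[ch]);
}

console.log(filterCharacters("m\u00fcller"));        // index mode: "mueller"
console.log(filterCharacters("m\u00fcller", true));  // sort mode:  "muller"
```

Note that, unlike the Perl function, this sketch returns a new string instead of modifying its argument in place.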

classify/AZCompactList.pm [555-557]:

Filter characters before sorting.

 sub alpha_numeric_cmp
 {
     my ($self,$a,$b) = @_;
 
     my $title_a = $self->{'reclassifylist'}->{$a};
     my $title_b = $self->{'reclassifylist'}->{$b};
 
     if ($title_a =~ m/^(\d+(\.\d+)?)/)
     {
         my $val_a = $1;
         if ($title_b =~ m/^(\d+(\.\d+)?)/)
         {
             my $val_b = $1;
             if ($val_a != $val_b)
             {
                 return ($val_a <=> $val_b);
             }
         }
     }
 
     return ($title_a cmp $title_b);
 }

modify to:

 sub alpha_numeric_cmp
 {
     my ($self,$a,$b) = @_;
 
     my $title_a = $self->{'reclassifylist'}->{$a};
     my $title_b = $self->{'reclassifylist'}->{$b};
 
     if ($title_a =~ m/^(\d+(\.\d+)?)/)
     {
         my $val_a = $1;
         if ($title_b =~ m/^(\d+(\.\d+)?)/)
         {
             my $val_b = $1;
             if ($val_a != $val_b)
             {
                 return ($val_a <=> $val_b);
             }
         }
     }
 
     # this makes it quite slow -- but we need it to get proper sorting
     &sorttools::filter_characters($title_a, 1);
     &sorttools::filter_characters($title_b, 1);
 
     return ($title_a cmp $title_b);
 }
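The comment in the modified comparator notes that filtering inside the comparison is slow: every title is filtered again on each comparison. One common way around this, not part of the original instructions, is to compute the filtered key once per title and sort on the cached keys (decorate, sort, undecorate). The sketch below shows the idea in JavaScript, the language of the page script; sortFold and sortByFoldedTitle are hypothetical names, only the four sort_charmap entries are included, and porting the idea to the Perl classifier is left open.

```javascript
// Hypothetical sort-mode folding, limited to the %sort_charmap entries.
const SORT_MAP = {
  "\u00e4": "a",   // A WITH DIAERESIS
  "\u00f6": "o",   // O WITH DIAERESIS
  "\u00fc": "u",   // U WITH DIAERESIS
  "\u00df": "ss"   // SHARP S
};

function sortFold(title) {
  return title.replace(/[\u00e4\u00f6\u00fc\u00df]/g, (ch) => SORT_MAP[ch]);
}

function sortByFoldedTitle(titles) {
  return titles
    .map((title) => ({ title, key: sortFold(title) }))  // decorate: fold once per title
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0))
    .map((entry) => entry.title);                        // undecorate: keep original titles
}

// "\u00f6l" is sorted as if it were "ol", so it lands between "ober" and "zebra".
console.log(sortByFoldedTitle(["zebra", "\u00f6l", "apfel", "ober"]));
```

The original titles are returned unchanged; only the comparison uses the folded keys.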

classify/AZCompactList.pm [689-691]:

Get proper sorting for browsing list nodes.

 # first split up the list into separate A-Z and 0-9 classifications
     foreach my $classification (@$classlistref) {
         my $title = $self->{'reclassifylist'}->{$classification};
 
         $title =~ s/&(.){2,4};//g; # remove any HTML special chars
         $title =~ s/^\W+//g; # remove leading non-word chars

modify to:

 # first split up the list into separate A-Z and 0-9 classifications
     foreach my $classification (@$classlistref) {
         my $title = $self->{'reclassifylist'}->{$classification};
 
         # we should do filtering here as well
         &sorttools::filter_characters($title, 1);
 
         $title =~ s/&(.){2,4};//g; # remove any HTML special chars
         $title =~ s/^\W+//g; # remove leading non-word chars

Macro files

In your collection's extra.dm or in the appropriate system macro files (applies to mg only):

Override/modify the _queryform_ macro so that the filter_characters() function is called when the query is submitted.
Add said function to the page script (NOTE: This function has to correspond to the Perl function above!).
Have these scripts appear on the about page, too.

add/modify:

 #############
 package query
 #############
 
 _queryform_ {
 <!-- query form (\_query:plainqueryform\_) -->
  <form name=QueryForm method=get action="_gwcgi_" onSubmit="filter_characters();">
 <input type=hidden name="a" value="q">
 <input type=hidden name="r" value="1">
 <input type=hidden name="hs" value="1">
 <input type=hidden name="e" value="_decodedcompressedoptions_">
 _queryformcontent_
 _optdatesearch_
 
 </form>
 <!-- end of query form -->
 }
 
 _dummypagescriptextra_{
 function initialize() \{
 \}
 
 function filter_characters() \{
   var query = document.QueryForm.q;
 
   var oldval = query.value.toLowerCase();
   var newval = "";
 
   var chars = [
     ["224", "a"],   // A WITH GRAVE
     ["225", "a"],   // A WITH ACUTE
     ["226", "a"],   // A WITH CIRCUMFLEX
     ["227", "a"],   // A WITH TILDE
     ["229", "a"],   // A WITH RING ABOVE
     ["230", "a"],   // AE
     ["231", "c"],   // C WITH CEDILLA
     ["232", "e"],   // E WITH GRAVE
     ["233", "e"],   // E WITH ACUTE
     ["234", "e"],   // E WITH CIRCUMFLEX
     ["235", "e"],   // E WITH DIAERESIS
     ["236", "i"],   // I WITH GRAVE
     ["237", "i"],   // I WITH ACUTE
     ["238", "i"],   // I WITH CIRCUMFLEX
     ["239", "i"],   // I WITH DIAERESIS
     ["240", "dh"],  // ETH
     ["241", "n"],   // N WITH TILDE
     ["242", "o"],   // O WITH GRAVE
     ["243", "o"],   // O WITH ACUTE
     ["244", "o"],   // O WITH CIRCUMFLEX
     ["245", "o"],   // O WITH TILDE
     ["248", "o"],   // O WITH STROKE
     ["249", "u"],   // U WITH GRAVE
     ["250", "u"],   // U WITH ACUTE
     ["251", "u"],   // U WITH CIRCUMFLEX
     ["253", "y"],   // Y WITH ACUTE
     ["254", "th"],  // THORN
     ["255", "y"],   // Y WITH DIAERESIS
 
     // now we want to break umlauts
     ["228", "ae"],  // A WITH DIAERESIS
     ["246", "oe"],  // O WITH DIAERESIS
     ["252", "ue"],  // U WITH DIAERESIS
     ["223", "ss"]   // SHARP S
   ]
 
   for (var i = 0; i < oldval.length; i++) \{
     var c = oldval.substr(i, 1);
     var z = 0;
 
     for (var j = 0; j < chars.length; j++) \{
       if (z != 1 && c == String.fromCharCode(chars[j][0])) \{
         newval += chars[j][1];
         z = 1;
       \}
     \}
 
     if (z != 1) \{
       newval += c;
     \}
   \}
 
   query.value = newval;
 \}
 }
 
 #############
 package about
 #############
 
 _pagescriptextra_ {_query:dummypagescriptextra_}
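The page script above scans the whole table for every character of the query. A possible simplification, offered only as a sketch, is to build a lookup object once and keep the folding logic in a pure function that can be tried outside the browser. FOLD_TABLE, FOLD_MAP and foldQuery are hypothetical names; the table repeats the character codes from the macro above. When pasted into a macro file, the braces would need to be escaped as `\{` and `\}`, as in the original script.

```javascript
// Same character codes as in the page script above: [code, replacement].
var FOLD_TABLE = [
  [224, "a"], [225, "a"], [226, "a"], [227, "a"], [229, "a"], [230, "a"],
  [231, "c"],
  [232, "e"], [233, "e"], [234, "e"], [235, "e"],
  [236, "i"], [237, "i"], [238, "i"], [239, "i"],
  [240, "dh"], [241, "n"],
  [242, "o"], [243, "o"], [244, "o"], [245, "o"], [248, "o"],
  [249, "u"], [250, "u"], [251, "u"],
  [253, "y"], [254, "th"], [255, "y"],
  // umlauts and sharp s are expanded rather than stripped
  [228, "ae"], [246, "oe"], [252, "ue"], [223, "ss"]
];

// Build the lookup object once: character -> replacement.
var FOLD_MAP = {};
FOLD_TABLE.forEach(function (pair) {
  FOLD_MAP[String.fromCharCode(pair[0])] = pair[1];
});

// Lowercase the text and replace every mapped character; others pass through.
function foldQuery(text) {
  var lower = text.toLowerCase();
  var out = "";
  for (var i = 0; i < lower.length; i++) {
    var c = lower.charAt(i);
    out += Object.prototype.hasOwnProperty.call(FOLD_MAP, c) ? FOLD_MAP[c] : c;
  }
  return out;
}

console.log(foldQuery("M\u00fcller Stra\u00dfe"));  // "mueller strasse"
```

In the form handler this would be used as `query.value = foldQuery(query.value);`, which leaves the behaviour of the original filter_characters() unchanged as far as this sketch goes.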

Epilogue

Hopefully, this can be of help to somebody. If you encounter problems or errors, or have suggestions for improvement, feel free to edit this article.

TODO

Add macro modifications for mgpp.
Address further problems:
- Elimination of articles in sorttools.pm.
- Tidying up of use of white space in classify/AZCompactList.pm.
Upload complete files.
