//**This page is in the 'old' namespace, and was imported from our previous wiki. 
We recommend checking for more up-to-date information using the search box.**//

====== How to handle diacritics ======
====Problem statement==== 
In any collection with [[http://en.wikipedia.org/wiki/Diacritic|diacritics]] in searchable or browsable metadata, two problems can occur: 
(international) users may be unable to search for these terms (or at least be discouraged from doing so) because their keyboards lack
 those characters, and the sort order may appear wrong.

In such cases the //filter_text()// function can be used to map "unwanted" characters onto less problematic ones. This How-to is intended to aid with the necessary modifications.

NOTE: Since version 2.71, MGPP comes with an "accent folding" option. This is described [[Customizing_collections#How_to_handle_diacritics | here]] [Shaoqun Wu]

====Preliminaries==== 
The following is tailored to v2.62 but should apply to other (probably only newer) versions as well. As an example of the classifier modifications, the changes to AZCompactList are given.

For users to still be able to enter diacritics into the search form, additional modifications to some macros are required. This may not be necessary in every case, depending on what you want to offer your users. Besides, it gets quite complicated if mgpp is used (the mgpp macro changes are not covered yet, see TODO).

The instructions are organized as follows:

**file name** [line number/position]:
  * description
<code>
 original code snippet
</code>
modify to:
<code>
 modified code snippet
</code>

File names are given relative to the installation's perllib directory ($GSDLHOME/perllib).

====Instructions==== 
===Perl modules===
**mgbuildproc.pm** [49-55] and **mgppbuildproc.pm** [142-148]:
  * Comment out the empty //filter_text()// stub so that the //filter_text()// function from basebuildproc.pm is applied to both mg and mgpp collections.
<code>
 sub filter_text {
     # $self->filter_text ($field, $new_text);
     # don't want to do anything for this version, however,
     # in a particular collection you might want to override
     # this method to post-process certain fields depending on
     # the field, or whether we are outputting it for indexing
 }
</code>
modify to:
<code>
#sub filter_text {
#    # $self->filter_text ($field, $new_text);
#    # don't want to do anything for this version, however,
#    # in a particular collection you might want to override
#    # this method to post-process certain fields depending on
#    # the field, or whether we are outputting it for indexing
#}
</code>


**basebuildproc.pm** [near end of file]:
  * Use this //filter_text()// function for both mg and mgpp.
add:
<code>
sub filter_text {
    # only filter if we are indexing
    return unless shift->{'indexing_text'};

    &sorttools::filter_characters($_[1]);
}
</code>


**sorttools.pm** [43-45]:
  * Filter characters for sorting.
<code>
    if ($metaname eq "Language") {
        $metavalue = $iso639::fromiso639{$metavalue};
        return $metavalue;
    }

    my $lang;
    if (defined $doc_obj) {
        $lang = $doc_obj->get_metadata_element ($doc_obj->get_top_section(), 'Language');
    }
    $lang = 'en' unless defined $lang;
</code>
modify to:
<code>
    if ($metaname eq "Language") {
        $metavalue = $iso639::fromiso639{$metavalue};
        return $metavalue;
    }

    # filter characters for sorting
    filter_characters($metavalue, 1);

    my $lang;
    if (defined $doc_obj) {
        $lang = $doc_obj->get_metadata_element ($doc_obj->get_top_section(), 'Language');
    }
    $lang = 'en' unless defined $lang;
</code>


**sorttools.pm** [near end of file, but before ''1;'']:
  * This is where the action takes place ;-)
add:
<code>
# filter characters for searching and sorting
# NOTE: keep in sync with the filter_characters
# function in _query:dummypagescriptextra_

# for searching (will also affect sorting)
my %indx_charmap = (
    'à' => 'a',   # A WITH GRAVE
    'á' => 'a',   # A WITH ACUTE
    'â' => 'a',   # A WITH CIRCUMFLEX
    'ã' => 'a',   # A WITH TILDE
    'å' => 'a',   # A WITH RING ABOVE
    'æ' => 'a',   # AE
    'ç' => 'c',   # C WITH CEDILLA
    'è' => 'e',   # E WITH GRAVE
    'é' => 'e',   # E WITH ACUTE
    'ê' => 'e',   # E WITH CIRCUMFLEX
    'ë' => 'e',   # E WITH DIAERESIS
    'ì' => 'i',   # I WITH GRAVE
    'í' => 'i',   # I WITH ACUTE
    'î' => 'i',   # I WITH CIRCUMFLEX
    'ï' => 'i',   # I WITH DIAERESIS
    'ð' => 'dh',  # ETH
    'ñ' => 'n',   # N WITH TILDE
    'ò' => 'o',   # O WITH GRAVE
    'ó' => 'o',   # O WITH ACUTE
    'ô' => 'o',   # O WITH CIRCUMFLEX
    'õ' => 'o',   # O WITH TILDE
    'ø' => 'o',   # O WITH STROKE
    'ù' => 'u',   # U WITH GRAVE
    'ú' => 'u',   # U WITH ACUTE
    'û' => 'u',   # U WITH CIRCUMFLEX
    'ý' => 'y',   # Y WITH ACUTE
    'þ' => 'th',  # THORN
    'ÿ' => 'y',   # Y WITH DIAERESIS

    # now we want to break umlauts
    'ä' => 'ae',  # A WITH DIAERESIS
    'ö' => 'oe',  # O WITH DIAERESIS
    'ü' => 'ue',  # U WITH DIAERESIS
    'ß' => 'ss'   # SHARP S
);

# for sorting in classifier lists (won't affect searching)
my %sort_charmap = (
    'ä' => 'a',   # A WITH DIAERESIS
    'ö' => 'o',   # O WITH DIAERESIS
    'ü' => 'u',   # U WITH DIAERESIS
    'ß' => 'ss'   # SHARP S
);

# "duplicate" maps for utf8 representations
$indx_charmap{&unicode::ascii2utf8(\$_)} = $indx_charmap{$_}
    foreach keys %indx_charmap;
$sort_charmap{&unicode::ascii2utf8(\$_)} = $sort_charmap{$_}
    foreach keys %sort_charmap;

# join charmaps ('sort' overwriting 'indx')
my %charmap = (
    'indx' => { %indx_charmap },
    'sort' => { %indx_charmap, %sort_charmap }
);

# create OR lists of characters
my %or_chars = (
    'indx' => join('|' => sort keys %{$charmap{'indx'}}),
    'sort' => join('|' => sort keys %{$charmap{'sort'}})
);

# and now the actual filter function -- simple, eh? ;-)
# with a true second argument %sort_charmap is taken
# into account (taking precedence over %indx_charmap)
sub filter_characters {
    my $arg = $_[1] ? 'sort' : 'indx';

    my $chr = $or_chars{$arg};
    my $map = $charmap{$arg};

    $_[0] =~ s/($chr)/$map->{$1}/g;
}
</code>
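
How the two maps interact can be hard to see from the Perl alone. The following is a minimal standalone sketch in JavaScript, not part of Greenstone: the names ''indxCharmap'', ''sortCharmap'' and ''filterCharacters'' are chosen here for illustration only, and the maps are abbreviated. It mirrors the logic above, where the 'sort' table is the 'indx' table with the sort-specific entries laid over it. Unlike the Perl function, which modifies its argument in place, this sketch returns a new string.

```javascript
// Abbreviated copies of the two Perl maps (illustration only).
const indxCharmap = { "é": "e", "ø": "o", "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss" };
const sortCharmap = { "ä": "a", "ö": "o", "ü": "u", "ß": "ss" };

// 'sort' entries overwrite 'indx' entries, as in the joined %charmap.
const charmap = {
  indx: { ...indxCharmap },
  sort: { ...indxCharmap, ...sortCharmap },
};

// Returns the filtered string; a true second argument selects the 'sort' table.
function filterCharacters(text, forSorting) {
  const map = charmap[forSorting ? "sort" : "indx"];
  const pattern = new RegExp(Object.keys(map).join("|"), "g");
  return text.replace(pattern, (ch) => map[ch]);
}

console.log(filterCharacters("müller", false)); // "mueller"
console.log(filterCharacters("müller", true));  // "muller"
```

For indexing, umlauts are expanded ("ue"), whereas for sorting they are reduced to the base letter ("u"), so that such entries sort together with the base letter in classifier lists.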


**classify/AZCompactList.pm** [555-557]:
  * Filter characters before sorting.
<code>
 sub alpha_numeric_cmp
 {
     my ($self,$a,$b) = @_;
 
     my $title_a = $self->{'reclassifylist'}->{$a};
     my $title_b = $self->{'reclassifylist'}->{$b};
 
     if ($title_a =~ m/^(\d+(\.\d+)?)/)
     {
         my $val_a = $1;
         if ($title_b =~ m/^(\d+(\.\d+)?)/)
         {
             my $val_b = $1;
             if ($val_a != $val_b)
             {
                 return ($val_a <=> $val_b);
             }
         }
     }
 
     return ($title_a cmp $title_b);
 }
</code>
modify to:
<code>
 sub alpha_numeric_cmp
 {
     my ($self,$a,$b) = @_;
 
     my $title_a = $self->{'reclassifylist'}->{$a};
     my $title_b = $self->{'reclassifylist'}->{$b};
 
     if ($title_a =~ m/^(\d+(\.\d+)?)/)
     {
         my $val_a = $1;
         if ($title_b =~ m/^(\d+(\.\d+)?)/)
         {
             my $val_b = $1;
             if ($val_a != $val_b)
             {
                 return ($val_a <=> $val_b);
             }
         }
     }
 
     # this makes it quite slow -- but we need it to get proper sorting
     &sorttools::filter_characters($title_a, 1);
     &sorttools::filter_characters($title_b, 1);
 
     return ($title_a cmp $title_b);
 }
</code>
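
To illustrate why the titles are filtered before the final comparison, here is a small standalone JavaScript sketch. It is not Greenstone code: ''foldForSorting'', ''plainCmp'' and ''foldedCmp'' are hypothetical names, and the table is an abbreviated stand-in for the 'sort' map. Under a plain code-based comparison, "ä" sorts after "z"; after filtering it sorts with "a".

```javascript
// Abbreviated 'sort' table (illustration only).
const sortMap = { "ä": "a", "ö": "o", "ü": "u", "ß": "ss", "é": "e" };

// Replace every mapped character by its sort replacement.
function foldForSorting(title) {
  return title.replace(/[äöüßé]/g, (ch) => sortMap[ch]);
}

// Plain comparison by character code.
function plainCmp(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Comparison on filtered titles, analogous to the modified alpha_numeric_cmp.
function foldedCmp(a, b) {
  return plainCmp(foldForSorting(a), foldForSorting(b));
}

const titles = ["zebra", "ärger", "apfel"];
console.log([...titles].sort(plainCmp));  // [ 'apfel', 'zebra', 'ärger' ]
console.log([...titles].sort(foldedCmp)); // [ 'apfel', 'ärger', 'zebra' ]
```

Note that the filtering runs on every comparison, which is the cost the comment in the Perl code refers to.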


**classify/AZCompactList.pm** [689-691]:
  * Get proper sorting for browsing list nodes.
<code>
 # first split up the list into separate A-Z and 0-9 classifications
     foreach my $classification (@$classlistref) {
         my $title = $self->{'reclassifylist'}->{$classification};
 
         $title =~ s/&(.){2,4};//g; # remove any HTML special chars
         $title =~ s/^\W+//g; # remove leading non-word chars
</code>
modify to:
<code>
 # first split up the list into separate A-Z and 0-9 classifications
     foreach my $classification (@$classlistref) {
         my $title = $self->{'reclassifylist'}->{$classification};
 
         # we should do filtering here as well
         &sorttools::filter_characters($title, 1);
 
         $title =~ s/&(.){2,4};//g; # remove any HTML special chars
         $title =~ s/^\W+//g; # remove leading non-word chars
</code>
===Macro files===

In your collection's **extra.dm** or in the appropriate system macro files (applies to **mg only**):
  - Overwrite/modify the _queryform_ macro in order to have the //filter_characters()// function called when submitting the query.
  - Add said function to the page script (NOTE: this function must be kept in sync with the Perl function above!).
  - Have these scripts appear for the about page, too.
add/modify:
<code>
 #############
 package query
 #############
 
 _queryform_ {
 <!-- query form (\_query:plainqueryform\_) -->
  <form name=QueryForm method=get action="_gwcgi_" onSubmit="filter_characters();">
 <input type=hidden name="a" value="q">
 <input type=hidden name="r" value="1">
 <input type=hidden name="hs" value="1">
 <input type=hidden name="e" value="_decodedcompressedoptions_">
 _queryformcontent_
 _optdatesearch_
 
 </form>
 <!-- end of query form -->
 }
 
 _dummypagescriptextra_{
 function initialize() \{
 \}
 
 function filter_characters() \{
   var query = document.QueryForm.q;
 
   var oldval = query.value.toLowerCase();
   var newval = "";
 
   var chars = [
     ["224", "a"],   // A WITH GRAVE
     ["225", "a"],   // A WITH ACUTE
     ["226", "a"],   // A WITH CIRCUMFLEX
     ["227", "a"],   // A WITH TILDE
     ["229", "a"],   // A WITH RING ABOVE
     ["230", "a"],   // AE
     ["231", "c"],   // C WITH CEDILLA
     ["232", "e"],   // E WITH GRAVE
     ["233", "e"],   // E WITH ACUTE
     ["234", "e"],   // E WITH CIRCUMFLEX
     ["235", "e"],   // E WITH DIAERESIS
     ["236", "i"],   // I WITH GRAVE
     ["237", "i"],   // I WITH ACUTE
     ["238", "i"],   // I WITH CIRCUMFLEX
     ["239", "i"],   // I WITH DIAERESIS
     ["240", "dh"],  // ETH
     ["241", "n"],   // N WITH TILDE
     ["242", "o"],   // O WITH GRAVE
     ["243", "o"],   // O WITH ACUTE
     ["244", "o"],   // O WITH CIRCUMFLEX
     ["245", "o"],   // O WITH TILDE
     ["248", "o"],   // O WITH STROKE
     ["249", "u"],   // U WITH GRAVE
     ["250", "u"],   // U WITH ACUTE
     ["251", "u"],   // U WITH CIRCUMFLEX
     ["253", "y"],   // Y WITH ACUTE
     ["254", "th"],  // THORN
     ["255", "y"],   // Y WITH DIAERESIS
 
     // now we want to break umlauts
     ["228", "ae"],  // A WITH DIAERESIS
     ["246", "oe"],  // O WITH DIAERESIS
     ["252", "ue"],  // U WITH DIAERESIS
     ["223", "ss"]   // SHARP S
   ]
 
   for (var i = 0; i < oldval.length; i++) \{
     var c = oldval.substr(i, 1);
     var z = 0;
 
     for (var j = 0; j < chars.length; j++) \{
       if (z != 1 && c == String.fromCharCode(chars[j][0])) \{
         newval += chars[j][1];
         z = 1;
       \}
     \}
 
     if (z != 1) \{
       newval += c;
     \}
   \}
 
   query.value = newval;
 \}
 }
 
 #############
 package about
 #############
 
 _pagescriptextra_ {_query:dummypagescriptextra_}
</code>
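
The folding logic of the page script can be tried outside the browser if it is separated from the form access. The following is a minimal sketch under that assumption: ''foldQuery'' is a hypothetical name, the table is abbreviated, and the macro escaping (''\{'', ''\}'') is left out because this is plain JavaScript rather than macro file content.

```javascript
// Abbreviated table in the same [char code, replacement] form as the page script.
const chars = [
  [233, "e"],   // E WITH ACUTE
  [228, "ae"],  // A WITH DIAERESIS
  [246, "oe"],  // O WITH DIAERESIS
  [252, "ue"],  // U WITH DIAERESIS
  [223, "ss"],  // SHARP S
];

// Build a lookup keyed by the actual character instead of scanning the table.
const lookup = new Map(chars.map(([code, repl]) => [String.fromCharCode(code), repl]));

// Lowercase the query, then replace every mapped character.
function foldQuery(query) {
  let out = "";
  for (const c of query.toLowerCase()) {
    out += lookup.has(c) ? lookup.get(c) : c;
  }
  return out;
}

console.log(foldQuery("Müller Straße")); // "mueller strasse"
```

Because the index was built with the same replacements, a query folded this way should match the filtered terms, provided both tables are kept in sync.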
====Epilogue==== 
Hopefully, this can be of help to somebody ;-) 
If you encounter problems or errors, or have suggestions for
 improvement, feel free to edit this article. (Jens, 21 March 2006)

====TODO==== 
  * Add macro modifications for mgpp.
  * Address further problems:
    * Elimination of articles in **sorttools.pm**.
    * Tidying up of use of white space in **classify/AZCompactList.pm**.
  * Upload complete files.