How to handle diacritics

Problem statement
In any collection with diacritics in its searchable or browsable metadata, two problems arise: (international) users may be unable to search for the affected terms, or at least be hindered in doing so, because their keyboards lack those characters; and the sorting behaviour may appear improper.

In such cases the filter_text function can be used to map "unwanted" characters onto less problematic ones. This How-to is intended to aid with the necessary modifications.

NOTE: Since version 2.71, MGPP comes with an "accent folding" option. This is described here [Shaoqun Wu]

Preliminaries
The following is tailored to v2.62 but should apply to other (probably only newer) versions as well. For the classifier modifications, AZCompactList serves as the example.

For users to still be able to enter diacritics into the search form, additional modifications to some macros are required. However, this might not be necessary in every case, depending on what you want to offer to your users. Besides, this gets quite complicated if mgpp is used (maybe I'll add that later).

The instructions are organized as follows:

file name [line number/position]:
 * description

    original code snippet

modify to:

    modified code snippet

File names are given relative to the installation's perllib directory ($GSDLHOME/perllib).

Perl modules
mgbuildproc.pm [49-55] and mgppbuildproc.pm [142-148]:
 * Comment out in order to have the filter_text function from basebuildproc.pm applied to both mg and mgpp collections.

    sub filter_text {
        # $self->filter_text ($field, $new_text);
        # don't want to do anything for this version, however,
        # in a particular collection you might want to override
        # this method to post-process certain fields depending on
        # the field, or whether we are outputting it for indexing
    }

modify to:

    #sub filter_text {
    #    # $self->filter_text ($field, $new_text);
    #    # don't want to do anything for this version, however,
    #    # in a particular collection you might want to override
    #    # this method to post-process certain fields depending on
    #    # the field, or whether we are outputting it for indexing
    #}

basebuildproc.pm [near end of file]:
 * Use this filter_text function for both mg and mgpp.

add:

    sub filter_text {
        # only filter if we are indexing
        return unless shift->{'indexing_text'};
        &sorttools::filter_characters($_[1]);
    }

sorttools.pm [43-45]:
 * Filter characters for sorting.

    if ($metaname eq "Language") {
        $metavalue = $iso639::fromiso639{$metavalue};
        return $metavalue;
    }

    my $lang;
    if (defined $doc_obj) {
        $lang = $doc_obj->get_metadata_element ($doc_obj->get_top_section, 'Language');
    }
    $lang = 'en' unless defined $lang;

modify to:

    if ($metaname eq "Language") {
        $metavalue = $iso639::fromiso639{$metavalue};
        return $metavalue;
    }

    # filter characters for sorting
    filter_characters($metavalue, 1);

    my $lang;
    if (defined $doc_obj) {
        $lang = $doc_obj->get_metadata_element ($doc_obj->get_top_section, 'Language');
    }
    $lang = 'en' unless defined $lang;

sorttools.pm [near end of file, but before ]:
 * This is where the action takes place ;-)

add:

    # filter characters for searching and sorting
    # NOTE: keep in sync with the filter_characters
    # function in _query:dummypagescriptextra_

    # for searching (will also affect sorting)
    my %indx_charmap = (
        'à' => 'a',   # A WITH GRAVE
        'á' => 'a',   # A WITH ACUTE
        'â' => 'a',   # A WITH CIRCUMFLEX
        'ã' => 'a',   # A WITH TILDE
        'å' => 'a',   # A WITH RING ABOVE
        'æ' => 'a',   # AE
        'ç' => 'c',   # C WITH CEDILLA
        'è' => 'e',   # E WITH GRAVE
        'é' => 'e',   # E WITH ACUTE
        'ê' => 'e',   # E WITH CIRCUMFLEX
        'ë' => 'e',   # E WITH DIAERESIS
        'ì' => 'i',   # I WITH GRAVE
        'í' => 'i',   # I WITH ACUTE
        'î' => 'i',   # I WITH CIRCUMFLEX
        'ï' => 'i',   # I WITH DIAERESIS
        'ð' => 'dh',  # ETH
        'ñ' => 'n',   # N WITH TILDE
        'ò' => 'o',   # O WITH GRAVE
        'ó' => 'o',   # O WITH ACUTE
        'ô' => 'o',   # O WITH CIRCUMFLEX
        'õ' => 'o',   # O WITH TILDE
        'ø' => 'o',   # O WITH STROKE
        'ù' => 'u',   # U WITH GRAVE
        'ú' => 'u',   # U WITH ACUTE
        'û' => 'u',   # U WITH CIRCUMFLEX
        'ý' => 'y',   # Y WITH ACUTE
        'þ' => 'th',  # THORN
        'ÿ' => 'y',   # Y WITH DIAERESIS
        # now we want to break umlauts
        'ä' => 'ae',  # A WITH DIAERESIS
        'ö' => 'oe',  # O WITH DIAERESIS
        'ü' => 'ue',  # U WITH DIAERESIS
        'ß' => 'ss'   # SHARP S
    );

    # for sorting in classifier lists (won't affect searching)
    my %sort_charmap = (
        'ä' => 'a',   # A WITH DIAERESIS
        'ö' => 'o',   # O WITH DIAERESIS
        'ü' => 'u',   # U WITH DIAERESIS
        'ß' => 'ss'   # SHARP S
    );

    # "duplicate" maps for utf8 representations
    $indx_charmap{&unicode::ascii2utf8(\$_)} = $indx_charmap{$_}
        foreach keys %indx_charmap;
    $sort_charmap{&unicode::ascii2utf8(\$_)} = $sort_charmap{$_}
        foreach keys %sort_charmap;

    # join charmaps ('sort' overwriting 'indx')
    my %charmap = (
        'indx' => { %indx_charmap },
        'sort' => { %indx_charmap, %sort_charmap }
    );

    # create OR lists of characters
    my %or_chars = (
        'indx' => join('|' => sort keys %{$charmap{'indx'}}),
        'sort' => join('|' => sort keys %{$charmap{'sort'}})
    );

    # and now the actual filter function -- simple, eh? ;-)
    # with a true second argument %sort_charmap is taken
    # into account (taking precedence over %indx_charmap)
    sub filter_characters {
        my $arg = $_[1] ? 'sort' : 'indx';
        my $chr = $or_chars{$arg};
        my $map = $charmap{$arg};
        $_[0] =~ s/($chr)/$map->{$1}/g;
    }
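The two-map design of filter_characters can be tried out outside Greenstone with a small JavaScript sketch. It is only an illustration and not part of the Greenstone code base: the names INDX_MAP, SORT_MAP and foldCharacters are made up for this example, the maps are shortened, and the input is lowercased first (an assumption, borrowed from what the page-script function further down does).

```javascript
// Shortened, hypothetical versions of %indx_charmap and %sort_charmap.
const INDX_MAP = { "é": "e", "ø": "o", "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss" };
const SORT_MAP = { "ä": "a", "ö": "o", "ü": "u", "ß": "ss" };

// Joined maps: for sorting, SORT_MAP entries overwrite INDX_MAP entries.
const CHARMAP = {
  indx: { ...INDX_MAP },
  sort: { ...INDX_MAP, ...SORT_MAP },
};

// A true second argument selects the 'sort' map, as in the Perl function.
function foldCharacters(text, forSorting) {
  const map = CHARMAP[forSorting ? "sort" : "indx"];
  let out = "";
  for (const ch of text.toLowerCase()) {
    // Replace mapped characters, pass everything else through unchanged.
    out += Object.prototype.hasOwnProperty.call(map, ch) ? map[ch] : ch;
  }
  return out;
}

console.log(foldCharacters("Müller"));       // mueller
console.log(foldCharacters("Müller", true)); // muller
console.log(foldCharacters("Straße"));       // strasse
```

For indexing, umlauts are expanded ("ü" becomes "ue") so that both spellings can be found; for sorting they collapse onto the base letter so that "Müller" files next to "Muller".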

classify/AZCompactList.pm [555-557]:
 * Filter characters before sorting.

    sub alpha_numeric_cmp {
        my ($self,$a,$b) = @_;

        my $title_a = $self->{'reclassifylist'}->{$a};
        my $title_b = $self->{'reclassifylist'}->{$b};

        if ($title_a =~ m/^(\d+(\.\d+)?)/) {
            my $val_a = $1;
            if ($title_b =~ m/^(\d+(\.\d+)?)/) {
                my $val_b = $1;
                if ($val_a != $val_b) {
                    return ($val_a <=> $val_b);
                }
            }
        }

        return ($title_a cmp $title_b);
    }

modify to:

    sub alpha_numeric_cmp {
        my ($self,$a,$b) = @_;

        my $title_a = $self->{'reclassifylist'}->{$a};
        my $title_b = $self->{'reclassifylist'}->{$b};

        if ($title_a =~ m/^(\d+(\.\d+)?)/) {
            my $val_a = $1;
            if ($title_b =~ m/^(\d+(\.\d+)?)/) {
                my $val_b = $1;
                if ($val_a != $val_b) {
                    return ($val_a <=> $val_b);
                }
            }
        }

        # this makes it quite slow -- but we need it to get proper sorting
        &sorttools::filter_characters($title_a, 1);
        &sorttools::filter_characters($title_b, 1);

        return ($title_a cmp $title_b);
    }
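Filtering inside the comparison function runs on every single comparison, which is where the slowdown comes from. One possible way to limit the cost is to compute the filtered key once per title and sort on that key. The JavaScript sketch below only illustrates this idea; it is not how AZCompactList is written, it leaves out the numeric-prefix handling, and foldForSort, sortTitles and the shortened map are assumptions made for the example.

```javascript
// Hypothetical, shortened stand-in for the 'sort' character map.
const SORT_MAP = new Map([["ä", "a"], ["ö", "o"], ["ü", "u"], ["ß", "ss"], ["é", "e"]]);

// Lowercase the title and replace every mapped character.
function foldForSort(title) {
  return Array.from(title.toLowerCase(), (ch) =>
    SORT_MAP.has(ch) ? SORT_MAP.get(ch) : ch
  ).join("");
}

// Decorate each title with its folded key, sort on the key, then undecorate.
function sortTitles(titles) {
  return titles
    .map((title) => ({ title, key: foldForSort(title) })) // fold once per title
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0))
    .map((entry) => entry.title);
}

const titles = ["Zebra", "Öl", "Apfel", "Ärger", "Ofen"];
console.log([...titles].sort()); // [ 'Apfel', 'Ofen', 'Zebra', 'Ärger', 'Öl' ]
console.log(sortTitles(titles)); // [ 'Apfel', 'Ärger', 'Ofen', 'Öl', 'Zebra' ]
```

Without folding, the titles starting with an umlaut end up after "Zebra"; with folding they sort next to their base letters.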

classify/AZCompactList.pm [689-691]:
 * Get proper sorting for browsing list nodes.

    # first split up the list into separate A-Z and 0-9 classifications
    foreach my $classification (@$classlistref) {
        my $title = $self->{'reclassifylist'}->{$classification};

        $title =~ s/&(.){2,4};//g; # remove any HTML special chars
        $title =~ s/^\W+//g; # remove leading non-word chars

modify to:

    # first split up the list into separate A-Z and 0-9 classifications
    foreach my $classification (@$classlistref) {
        my $title = $self->{'reclassifylist'}->{$classification};

        # we should do filtering here as well
        &sorttools::filter_characters($title, 1);

        $title =~ s/&(.){2,4};//g; # remove any HTML special chars
        $title =~ s/^\W+//g; # remove leading non-word chars
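Why the filter call has to come before the two substitutions can be illustrated with a small JavaScript sketch. It assumes, for illustration only, that a title's browsing node is chosen from the first character left after the clean-up steps; bucketFor and the shortened map are hypothetical, and the real classifier logic is more involved.

```javascript
// Hypothetical, shortened stand-in for the 'sort' character map.
const SORT_MAP = new Map([["ä", "a"], ["ö", "o"], ["ü", "u"], ["ß", "ss"], ["é", "e"]]);

// Returns the browsing node ("A".."Z" or "0-9") a title would fall under,
// or "" if it starts with neither a letter nor a digit.
function bucketFor(title, filterFirst) {
  let t = title.toLowerCase();
  if (filterFirst) {
    t = Array.from(t, (ch) => (SORT_MAP.has(ch) ? SORT_MAP.get(ch) : ch)).join("");
  }
  t = t.replace(/&(.){2,4};/g, ""); // remove any HTML special chars
  t = t.replace(/^\W+/, "");        // remove leading non-word chars
  const first = t.charAt(0);
  if (/^[a-z]$/.test(first)) return first.toUpperCase();
  if (/^[0-9]$/.test(first)) return "0-9";
  return "";
}

console.log(bucketFor("Öl", true));  // O
console.log(bucketFor("Öl", false)); // L (the leading "ö" is stripped as a non-word char)
```

In JavaScript, `\W` matches every non-ASCII character. Whether the Perl code behaves the same way depends on how the strings are encoded, so the unfiltered result should be read as an illustration of the risk rather than a statement about Greenstone's exact behaviour.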

Macro files
In your collection's extra.dm or in the appropriate system macro files (applies to mg only):
 * 1. Overwrite/modify the _queryform_ macro in order to have the filter_characters function called when submitting the query.
 * 2. Add said function to the page script (NOTE: This function has to correspond to the Perl function filter_characters in sorttools.pm!).
 * 3. Have these scripts appear for the about page, too.

add/modify:

    package query

    _queryform_ {
      _queryformcontent_
      _optdatesearch_
    }

    _dummypagescriptextra_{
      function initialize() \{
      \}

      function filter_characters() \{
        var query = document.QueryForm.q;

        var oldval = query.value.toLowerCase();
        var newval = "";

        // pairs of [character code, replacement]
        var chars = [
          ["224", "a"],  // A WITH GRAVE
          ["225", "a"],  // A WITH ACUTE
          ["226", "a"],  // A WITH CIRCUMFLEX
          ["227", "a"],  // A WITH TILDE
          ["229", "a"],  // A WITH RING ABOVE
          ["230", "a"],  // AE
          ["231", "c"],  // C WITH CEDILLA
          ["232", "e"],  // E WITH GRAVE
          ["233", "e"],  // E WITH ACUTE
          ["234", "e"],  // E WITH CIRCUMFLEX
          ["235", "e"],  // E WITH DIAERESIS
          ["236", "i"],  // I WITH GRAVE
          ["237", "i"],  // I WITH ACUTE
          ["238", "i"],  // I WITH CIRCUMFLEX
          ["239", "i"],  // I WITH DIAERESIS
          ["240", "dh"], // ETH
          ["241", "n"],  // N WITH TILDE
          ["242", "o"],  // O WITH GRAVE
          ["243", "o"],  // O WITH ACUTE
          ["244", "o"],  // O WITH CIRCUMFLEX
          ["245", "o"],  // O WITH TILDE
          ["248", "o"],  // O WITH STROKE
          ["249", "u"],  // U WITH GRAVE
          ["250", "u"],  // U WITH ACUTE
          ["251", "u"],  // U WITH CIRCUMFLEX
          ["253", "y"],  // Y WITH ACUTE
          ["254", "th"], // THORN
          ["255", "y"],  // Y WITH DIAERESIS
          // now we want to break umlauts
          ["228", "ae"], // A WITH DIAERESIS
          ["246", "oe"], // O WITH DIAERESIS
          ["252", "ue"], // U WITH DIAERESIS
          ["223", "ss"]  // SHARP S
        ];

        // replace each character that has an entry in the table
        for (var i = 0; i < oldval.length; i++) \{
          var c = oldval.substr(i, 1);
          var z = 0;
          for (var j = 0; j < chars.length; j++) \{
            if (z != 1 && c == String.fromCharCode(chars[j][0])) \{
              newval += chars[j][1];
              z = 1;
            \}
          \}
          if (z != 1) \{
            newval += c;
          \}
        \}

        query.value = newval;
      \}
    }

    package about

    _pagescriptextra_ {_query:dummypagescriptextra_}

Epilogue
Hopefully, this can be of help to anybody ;-) If you encounter problems or errors or have suggestions for improvement feel free to edit this article or drop me a line on my talk page. cheers, jens (talk) 16:29, 21 March 2006 (PST)

TODO

 * Add macro modifications for mgpp.
 * Address further problems:
   * Elimination of articles in sorttools.pm.
   * Tidying up of use of white space in classify/AZCompactList.pm.
 * Upload complete files.