+====== Getting the most out of your documents ======
+Collections can be individualised to make the information they contain accessible in different ways. This chapter describes how Greenstone extracts information from documents and presents it to the user: the document processing (Section [[#plugins|plugins]]) and classification structures (Section [[#classifiers|classifiers]]), and user interface tools (Sections [[#formatting_greenstone_output|formatting_greenstone_output]] and [[#controlling_the_greenstone_user_interface|controlling_the_greenstone_user_interface]]).
+===== Plugins =====
+Plugins parse the imported documents and extract metadata from them. For example, the html plugin converts html pages to the Greenstone archive format and extracts metadata which is explicit in the document format—such as titles, enclosed by //<title></title>// tags.
+Plugins are written in the Perl language. They all derive from a basic plugin called //BasPlug//, which performs universally-required operations like creating a new Greenstone archive document to work with, assigning an object identifier (OID), and handling the sections in a document. Plugins are kept in the //perllib/plugins// directory.
+To find out more about any plugin, type //pluginfo.pl plugin-name// at the command prompt. (You need to invoke the appropriate //setup// script first, if you haven't already, and on Windows you need to type //perl -S pluginfo.pl plugin-name// if your environment is not set up to associate files ending in //.pl// with Perl.) This displays information about the plugin on the screen: what plugin-specific options it takes, and what general options are allowed.
+You can easily write new plugins that process document formats not handled by existing plugins, format documents in some special way, or extract a new kind of metadata.
+==== General Options ====
+Table <tblref table_options_applicable_to_all_plugins> shows options that are accepted by any plugin derived from BasPlug.
+<tblcaption table_options_applicable_to_all_plugins|Options applicable to all plugins></tblcaption>
+|< - 132 397 >|
+| ''<!--i-->input_encoding<!--/i-->'' | Character encoding of the source documents. The default is to work out the character encoding of each individual document automatically. It is sometimes useful to set this value explicitly: for example, if you know that all your documents are plain ASCII, setting the input encoding to //ascii// greatly increases the speed at which your collection is imported and built. There are many possible values. Use //pluginfo.pl BasPlug// to get a complete list. |
+| ''<!--i-->default_encoding<!--/i-->'' | The encoding that is used if //input_encoding// is //auto// and automatic encoding detection fails. |
+| ''<!--i-->process_exp<!--/i-->'' | A Perl regular expression to match against filenames (for example, to locate a certain kind of file extension). This dictates which files a plugin processes. Each plugin has a default (//HTMLPlug//'s default is //(?i).html?//—that is, anything with the extension //.htm// or //.html//). |
+| ''<!--i-->block_exp<!--/i-->'' | A regular expression to match against filenames that are not to be passed on to subsequent plugins. This can prevent annoying error messages about files you aren't interested in. Some plugins have default blocking expressions—for example, //HTMLPlug// blocks files with //.gif//, //.jpg//, //.jpeg//, //.png//, //.rtf// and //.css// extensions. |
+| ''<!--i-->cover_image<!--/i-->'' | Look for a //.jpg// file (with the same name as the file being processed) and associate it with the document as a cover image. |
+| ''<!--i-->extract_acronyms<!--/i-->'' | Extract acronyms from documents and add them as metadata to the corresponding Greenstone archive documents. |
+| ''<!--i-->markup_acronyms<!--/i-->'' | Add acronym information into document text. |
+| ''<!--i-->extract_language<!--/i-->'' | Identify each document's language and associate it as metadata. Note that this is done automatically if //input_encoding// is //auto//. |
+| ''<!--i-->default_language<!--/i-->'' | If automatic language extraction fails, language metadata is set to this value. |
+| ''<!--i-->first<!--/i-->'' | Extract a comma-separated list of the first stretch of text and add it as //FirstNNN// metadata (often used as a substitute for //Title//). |
+| ''<!--i-->extract_email<!--/i-->'' | Extract E-mail addresses and add them as document metadata. |
+| ''<!--i-->extract_date<!--/i-->'' | Extract dates relating to the content of historical documents and add them as //Coverage// metadata. |
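The effect of a //process_exp// style expression can be explored with a short sketch. This is Python rather than Greenstone's Perl, the helper name ''matches'' and the stricter pattern are illustrative assumptions, and using an unanchored search is also an assumption about how the expression is applied.

```python
import re

# Default quoted in the table above for HTMLPlug.
DEFAULT_EXP = r"(?i).html?"
# Hypothetical stricter variant: escaped dot, anchored at the end of the name.
STRICT_EXP = r"(?i)\.html?$"


def matches(expression, filename):
    """Return True if the expression matches anywhere in the filename."""
    return re.search(expression, filename) is not None


for name in ["index.HTM", "page.html", "xhtml-notes.txt"]:
    print(name, matches(DEFAULT_EXP, name), matches(STRICT_EXP, name))
```

Because the dot in the default expression is unescaped and nothing anchors it, a name such as ''xhtml-notes.txt'' also matches under these assumptions; escaping the dot and anchoring the pattern avoids that.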
+==== Document processing plugins ====
+<tblcaption table_greenstone_plugins|Greenstone plugins></tblcaption>
+|< - 60 72 236 85 76 >|
+| | | **Purpose** | **File types** | **Ignores files** |
+| **General** | //ArcPlug// | Processes files named in the file //archives.inf//, which is used to communicate between the import and build processes. Must be included (unless //import.pl// will not be used). | //—// | //—// |
+| | //RecPlug// | Recurses through a directory structure by checking to see whether a filename is a directory and if so, inserting all files in the directory into the plugin pipeline. Assigns metadata if //-use_metadata_files// option is set and //metadata.xml// files are present. | //—// | //—// |
+| | //GAPlug// | Processes Greenstone archive files generated by //import.pl.// Must be included (unless //import.pl// will not be used). | //.xml// | //—// |
+| | //TEXTPlug// | Processes plain text by placing it between //<pre> </pre>// tags (treating it as preformatted). | //.txt, .text// | //—// |
+| | //HTMLPlug// | Processes html, replacing hyperlinks appropriately. If the linked document is not in the collection, an intermediate page is inserted warning the user they are leaving the collection. Extracts readily available metadata such as //Title//. | //.htm, .html, .cgi, .php, .asp, .shm, .shtml// | //.gif, .jpeg, .jpg, .png, .css, .rtf// |
+| | //WordPlug// | Processes Microsoft Word documents, extracting author and title where available, and keeping diagrams and pictures in their proper places. The conversion utilities used by this plugin sometimes produce html that is poorly formatted, and we recommend that you provide the original documents for viewing when building collections of Word files. However, the text that is extracted from the documents is adequate for searching and indexing purposes. | //.doc// | //.gif, .jpeg, .jpg, .png, .css, .rtf// |
+| | //PDFPlug// | Processes PDF documents, extracting the first line of text as a title. The //pdftohtml// program fails on some PDF files. What happens is that the conversion process takes an exceptionally long time, and often an error message relating to the conversion process appears on the screen. If this occurs, the only solution that we can offer is to remove the offending document from the collection and re-import. | //.pdf// | //.gif, .jpeg, .jpg, .png, .css, .rtf// |
+| | //PSPlug// | Processes PostScript documents, optionally extracting date, title and page number metadata. | //.ps// | //.eps// |
+| | //EMAILPlug// | Processes E-mail messages, recognising author, subject, date, etc. This plugin does not yet handle MIME-encoded E-mails properly; although legible, they often look rather strange. | Must end in digits or digits followed by //.Email// | //—// |
+| | //BibTexPlug// | Processes bibliography files in //BibTex// format | //.bib// | //—// |
+| | //ReferPlug// | Processes bibliography files in //refer// format | //.bib// | //—// |
+| | //SRCPlug// | Processes source code files | //Makefile, Readme, .c, .cc, .cpp, .h, .hpp, .pl, .pm, .sh// | //.o, .obj, .a, .so, .dll// |
+| | //ImagePlug// | Processes image files for creating a library of images. Only works on UNIX. | //.jpeg, .jpg, .gif, .png, .bmp, .xbm, .tif, .tiff// | //—// |
+| | //SplitPlug// | Like BasPlug and ConvertToPlug, this plugin should not be called directly, but may be inherited by plugins that need to process files containing several documents. | //—// | //—// |
+| | //FOXPlug// | Processes FoxBASE dbt files | //.dbt, .dbf// | //—// |
+| | //ZIPPlug// | Uncompresses //gzip//, //bzip//, //zip//, and //tar// files, provided the appropriate Gnu tools are available. | //.gzip, .bzip, .zip, .tar, .gz, .bz, .tgz, .taz// | //—// |
+| **Collection <br/>Specific** | //PrePlug// | Processes html output using PRESCRIPT, splitting documents into pages for the Computer Science Technical Reports collection. | //.html, .html.gz// | //—// |
+| | //GBPlug// | Processes Project Gutenberg etext—which includes manually-entered title information. | //.txt.gz, .html, .htm// | //—// |
+| | //TCCPlug// | Processes E-mail documents from Computists' Weekly | Must begin with //tcc or cw// | //—// |
+Document processing plugins are used by the collection-building software to parse each source document in a way that depends on its format. A collection's configuration file lists all plugins that are used when building it. During the import operation, each file or directory is passed to each plugin in turn until one is found that can process it—thus earlier plugins take priority over later ones. If no plugin can process the file, a warning is printed (to standard error) and processing passes to the next file. (This is where the //block_exp// option can be useful—to prevent these error messages for files that might be present but don't need processing.) During building, the same procedure is used, but the //archives// directory is processed instead of the //import// directory.
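The first-match behaviour of the pipeline can be sketched as follows. This is a simplified Python model, not Greenstone's Perl code: the class ''SketchPlugin'', the function ''run_pipeline'' and the regular expressions are all hypothetical.

```python
import re
import sys


class SketchPlugin:
    """Hypothetical stand-in for a document processing plugin."""

    def __init__(self, name, process_exp, block_exp=None):
        self.name = name
        self.process_re = re.compile(process_exp)
        self.block_re = re.compile(block_exp) if block_exp else None

    def read(self, filename):
        """Return 'blocked' or 'processed' if this plugin claims the file, else None."""
        if self.block_re and self.block_re.search(filename):
            return "blocked"
        if self.process_re.search(filename):
            return "processed"
        return None


def run_pipeline(plugins, filename):
    """Offer the file to each plugin in order; the first one to claim it wins."""
    for plugin in plugins:
        outcome = plugin.read(filename)
        if outcome is not None:
            return plugin.name, outcome
    # No plugin claimed the file: warn on standard error and move on.
    print(f"warning: no plugin could process {filename}", file=sys.stderr)
    return None


# Assumed patterns, loosely based on the file types listed in the table above.
pipeline = [
    SketchPlugin("HTMLPlug", r"(?i)\.html?$", r"(?i)\.(gif|jpe?g|png|css|rtf)$"),
    SketchPlugin("TEXTPlug", r"(?i)\.te?xt$"),
]

for name in ["index.html", "logo.gif", "notes.txt", "data.bin"]:
    print(name, run_pipeline(pipeline, name))
```

In this model a blocked file is swallowed silently by the plugin that blocks it, which is why a blocking expression suppresses the warning that an unclaimed file would otherwise produce.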
+The standard Greenstone plugins are listed in Table <tblref table_greenstone_plugins>. Recursion is necessary to traverse directory hierarchies. Although the import and build programs do not perform explicit recursion, some plugins cause indirect recursion by passing files or directory names into the plugin pipeline. For example, the standard way of recursing through a directory hierarchy is to specify //RecPlug//, which does exactly this. If present, it should be the last element in the pipeline. Only the first two plugins in Table <tblref table_greenstone_plugins> cause indirect recursion.
+Some plugins are written for specific collections that have a document format not found elsewhere, like the E-text used in the Gutenberg collection. These collection-specific plugins are found in the collection's //perllib/plugins// directory. Collection-specific plugins can be used to override general plugins with the same name.
+Some document-processing plugins use external programs that parse specific proprietary formats—for example, Microsoft Word—into either plain text or html. A general plugin called //ConvertToPlug// invokes the appropriate conversion program and passes the result to either //TEXTPlug// or //HTMLPlug//. We describe this in more detail shortly.
+Some plugins have individual options, which control what they do in finer detail than the general options allow. Table <tblref table_plugin-specific_options> describes them.
+<tblcaption table_plugin-specific_options|Plugin-specific options></tblcaption>
+|< - 132 132 265 >|
+| | **Option** | **Purpose** |
+| //HTMLPlug// | //nolinks// | Do not trap links within the collection. This speeds up the import/build process, but any links in the collection will be broken. |
+| | //description_tags// | Interpret tagged document files as described in the subsection below. |
+| | //keep_head// | Do not strip out html headers. |
+| | //no_metadata// | Do not seek any metadata (this may speed up the import/build process). |
+| | //metadata_fields// | Takes a comma-separated list of metadata types (defaults to //Title//) to extract. To rename the metadata in the Greenstone archive file, use //tag<newname>// where //tag// is the html tag sought and //newname// its new name. |
+| | //hunt_creator_metadata// | Find as much metadata as possible about authorship and put it in the Greenstone archive document as //Creator// metadata. You also need to include //Creator// using the //metadata_fields// option. |
+| | //file_is_url// | Use this option if a web mirroring program has been used to create the structure of the documents to be imported. |
+| | //assoc_files// | Gives a Perl regular expression that describes file types to be treated as associated files. The default types are //.jpg//, //.jpeg//, //.gif//, //.png//, //.css// |
+| | //rename_assoc_files// | Rename files associated with documents. During this process the directory structure of any associated files will become much shallower (useful if a collection must be stored in limited space). |
+| //HTMLPlug// and <br/>//TEXTPlug// | //title_sub// | Perl substitution expression to modify titles. |
+| //PSPlug// | //extract_date// | Extract the creation date from the PostScript header and store it as metadata. |
+| | //extract_title// | Extract the document title from the PostScript header and store it as title metadata. |
+| | //extract_pages// | Extract the page numbers from the PostScript document and add them to the appropriate sections as metadata with the tag //Pages.// |
+| //RecPlug// | //use_metadata_files// | Assign metadata from a file as described in the subsection below. |
+| //ImagePlug// | Various options | See //ImagePlug.pm.// |
+| //SRCPlug// | //remove_prefix// | Gives a Perl regular expression of a leading pattern which is to be removed from the filename. Default behaviour is to remove the whole path. |
+==== Plugins to import proprietary formats ====
+Proprietary formats pose difficult problems for any digital library system. Although documentation may be available about how they work, they are subject to change without notice, and it is difficult to keep up with changes. Greenstone has adopted the policy of using GPL (GNU General Public License) conversion utilities written by people dedicated to the task. Utilities to convert Word and PDF formats are included in the //packages// directory. These all convert documents to either text or html. Then //HTMLPlug// and //TEXTPlug// are used to further convert them to the Greenstone archive format. //ConvertToPlug// is used to include the conversion utilities. Like //BasPlug//, it is never called directly. Rather, plugins written for individual formats are derived from it as illustrated in Figure <imgref figure_plugin_inheritance_hierarchy>. //ConvertToPlug// uses Perl's dynamic inheritance scheme to inherit from either //TEXTPlug// or //HTMLPlug//, depending on the format to which a source document has been converted.
+<imgcaption figure_plugin_inheritance_hierarchy|Plugin inheritance hierarchy></imgcaption>
+{{..:images:dev_fig_9.gif?229x193&direct}}
+When //ConvertToPlug// receives a document, it calls //gsConvert.pl// (found in //GSDLHOME/bin/script//) to invoke the appropriate conversion utility. Once the document has been converted, it is returned to //ConvertToPlug//, which invokes the text or html plugin as appropriate. Any plugin derived from //ConvertToPlug// has an option //convert_to//, whose argument is either //text// or //html//, to specify which intermediate format is preferred. Text is faster, but html generally looks better, and includes pictures.
+Sometimes there are several conversion utilities for a particular format, and //gsConvert// may try different ones on a given document. For example, the preferred Word conversion utility //wvWare// does not cope with anything less than Word 6, and a program called //AnyToHTML//, which essentially just extracts whatever text strings can be found, is called to convert Word 5 documents.
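The idea of trying several conversion utilities in turn can be illustrated with a small fallback loop. This is a sketch in Python, not the logic of //gsConvert.pl//: the function names ''convert_with_fallback'', ''strict_converter'' and ''text_scraper'' are hypothetical stubs and do not call any real conversion program.

```python
def strict_converter(path):
    """Stub for a preferred utility that rejects documents it cannot handle."""
    if "word5" in path:
        raise ValueError("unsupported document version")
    return "<html>converted " + path + "</html>"


def text_scraper(path):
    """Stub for a last-resort utility that only pulls out text strings."""
    return "plain text from " + path


def convert_with_fallback(path, converters):
    """Try each (name, function) pair in order; return the first success."""
    errors = []
    for name, convert in converters:
        try:
            return name, convert(path)
        except Exception as exc:  # a failed utility just means "try the next one"
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no converter succeeded: " + "; ".join(errors))


converters = [("strict_converter", strict_converter), ("text_scraper", text_scraper)]
print(convert_with_fallback("report.doc", converters))
print(convert_with_fallback("report-word5.doc", converters))
```

Ordering the list from most to least preferred keeps the better-looking output whenever the preferred utility succeeds.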
+The steps involved in adding a new external document conversion utility are:
+  - Install the new conversion utility so that it is accessible by Greenstone (put it in the //packages// directory).
+  - Alter //gsConvert.pl// to use the new conversion utility. This involves adding a new clause to the //if// statement in the //main// function, and adding a function that calls the conversion utility.
+  - Write a top-level plugin that inherits from //ConvertToPlug// to catch the format and pass it on.
+==== Assigning metadata from a file ====
+The standard plugin //RecPlug// also incorporates a way of assigning metadata to documents from manually (or automatically) created XML files. We describe this in some detail, so that you can create metadata files in the appropriate format. If the //use_metadata_files// option is specified, //RecPlug// uses an auxiliary metadata file called //metadata.xml//. Figure <imgref figure_xml_format> shows the XML Document Type Definition (DTD) for the metadata file format, while Figure <imgref figure_xml_format_1> shows an example //metadata.xml// file.
+<imgcaption figure_xml_format|(a) XML format: Document Type Definition (DTD)></imgcaption>
+<code>
+<!DOCTYPE GreenstoneDirectoryMetadata [
+   <!ELEMENT DirectoryMetadata (FileSet*)>
+   <!ELEMENT FileSet (FileName+,Description)>
+   <!ELEMENT FileName (#PCDATA)>
+   <!ELEMENT Description (Metadata*)>
+   <!ELEMENT Metadata (#PCDATA)>
+   <!ATTLIST Metadata name CDATA #REQUIRED>
+   <!ATTLIST Metadata mode (accumulate|override) "override">
+]>
+</code>
+<imgcaption figure_xml_format_1|(b) XML format: example metadata file></imgcaption>
+<code>
+<?xml version="1.0" ?>
+<!DOCTYPE GreenstoneDirectoryMetadata SYSTEM
+"http://greenstone.org/dtd/GreenstoneDirectoryMetadata/1.0/GreenstoneDirectoryMetadata.dtd">
+<DirectoryMetadata>
+   <FileSet>
+       <FileName>nugget.*</FileName>
+       <Description>
+           <Metadata name="Title">Nugget Point Lighthouse</Metadata>
+           <Metadata name="Place" mode="accumulate">Nugget Point</Metadata>
+       </Description>
+   </FileSet>
+   <FileSet>
+       <FileName>nugget-point-1.jpg</FileName>
+       <Description>
+           <Metadata name="Title">Nugget Point Lighthouse, The Catlins</Metadata>
+           <Metadata name="Subject">Lighthouse</Metadata>
+       </Description>
+   </FileSet>
+</DirectoryMetadata>
+</code>
+The example file contains two metadata structures. In each one, the //FileName// element describes the files to which the metadata applies, in the form of a regular expression. Thus //<FileName>nugget.*</FileName>// indicates that the first metadata record applies to every file whose name starts with “nugget”.((Note that in Greenstone, regular expressions are interpreted in the Perl language, which is subtly different from some other conventions. For example, “*” matches zero or more occurrences of the previous character, while “.” matches any character, so //nugget.*// matches any string with prefix “nugget,” whether or not it contains a period after the prefix. To insist on a period you would need to escape it, and write //nugget\..*// instead.)) For these files, //Title// metadata is set to “Nugget Point Lighthouse.”
+Metadata elements are processed in the order in which they appear. The second structure above sets //Title// metadata for the file named //nugget-point-1.jpg// to “Nugget Point Lighthouse, The Catlins,” overriding the previous specification. It also adds a //Subject// metadata field.
+Sometimes metadata is multi-valued and new values should accumulate, rather than overriding previous ones. The //mode=accumulate// attribute does this. It is applied to //Place// metadata in the first specification above, which will therefore be multi-valued. To revert to a single metadata element, write //<Metadata name="Place" mode="override">New Zealand</Metadata>//. In fact, you could omit this mode specification because every element overrides unless otherwise specified. To accumulate metadata for some field, //mode=accumulate// must be specified in every occurrence.
+When its //use_metadata_files// option is set, //RecPlug// checks each input directory for an XML file called //metadata.xml// and applies its contents to all the directory's files and subdirectories.
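The override and accumulate rules described above can be sketched in Python using the example file. This is not RecPlug's implementation: the function ''metadata_for'' is hypothetical, and matching the whole filename against each //FileName// expression is an assumption.

```python
import re
import xml.etree.ElementTree as ET

# The example metadata file from above, without the XML declaration and DOCTYPE.
METADATA_XML = """
<DirectoryMetadata>
   <FileSet>
       <FileName>nugget.*</FileName>
       <Description>
           <Metadata name="Title">Nugget Point Lighthouse</Metadata>
           <Metadata name="Place" mode="accumulate">Nugget Point</Metadata>
       </Description>
   </FileSet>
   <FileSet>
       <FileName>nugget-point-1.jpg</FileName>
       <Description>
           <Metadata name="Title">Nugget Point Lighthouse, The Catlins</Metadata>
           <Metadata name="Subject">Lighthouse</Metadata>
       </Description>
   </FileSet>
</DirectoryMetadata>
"""


def metadata_for(filename, xml_text):
    """Collect metadata for one file, processing FileSets in document order."""
    result = {}
    root = ET.fromstring(xml_text)
    for fileset in root.findall("FileSet"):
        patterns = [element.text for element in fileset.findall("FileName")]
        # Assumption: a FileSet applies when any expression matches the whole name.
        if not any(re.fullmatch(pattern, filename) for pattern in patterns):
            continue
        for item in fileset.findall("Description/Metadata"):
            name = item.get("name")
            value = (item.text or "").strip()
            if item.get("mode", "override") == "accumulate":
                result.setdefault(name, []).append(value)
            else:
                result[name] = [value]  # override replaces any earlier values
    return result


print(metadata_for("nugget-point-1.jpg", METADATA_XML))
print(metadata_for("nugget-point-2.jpg", METADATA_XML))
```

For //nugget-point-1.jpg// both structures apply, so the later //Title// replaces the earlier one while //Place// keeps its accumulated value.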
+The //metadata.xml// mechanism that is embodied in RecPlug is just one way of specifying metadata for documents. It is easy to write different plugins that accept metadata specifications in completely different formats.
+==== Tagging document files ====
+Source documents often need to be structured into sections and subsections, and this information needs to be communicated to Greenstone so that it can preserve the hierarchical structure. Also, metadata—typically the title—might be associated with each section and subsection.
+The simplest way of doing this is often simply to edit the source files. The HTML plugin has a //description_tags// option that processes tags in the text like this:
+<code>
+<!--
+<Section>
+   <Description>
+       <Metadata name="Title"> Realizing human rights for poor people: Strategies for achieving the international development targets</Metadata>
+   </Description>
+-->
+</code>
+//(text of section goes here)//
+<code>
+<!--
+</Section>
+-->
+</code>
+The %%<!-- … -->%% markers are used because they indicate comments in HTML; thus these section tags will not affect document formatting. In the //Description// part other kinds of metadata can be specified, but this is not done for the style of collection we are describing here. Also, the tags can be nested, so the line marked //text of section goes here// above can itself include further subsections, such as
+//(text of first part of section goes here)//
+<code>
+<!--
+<Section>
+   <Description>
+       <Metadata name="Title"> The international development targets</Metadata>
+   </Description>
+-->
+</code>
+//(text of subsection goes here)//
+<code>
+<!--
+</Section>
+-->
+</code>
+//(text of last part of section goes here)//
+This functionality is inherited by any plugins that use HTMLPlug. In particular, the Word plugin converts its input to HTML form, and so exactly the same way of specifying metadata can be used in Word (and RTF) files. (This involves a bit of work behind the scenes, because when Word documents are converted to HTML care is normally taken to neutralize HTML's special interpretation of stray “<” and “>” signs; we have arranged to override this in the case of the above specifications.) Note that exactly the same format as above is used, even in Word files, including the surrounding “%%<!--%%” and “%%-->%%”. Font and spacing are ignored.
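The nesting of section tags can be pictured with a small parser. This is an illustrative Python sketch, not HTMLPlug's code: the function ''parse_sections'' is hypothetical, and it assumes every opening tag carries a //Title// and that tags are well nested.

```python
import re

# Matches either an opening comment (capturing the Title) or a closing comment.
TOKEN = re.compile(
    r'<!--\s*<Section>.*?<Metadata name="Title">(?P<title>.*?)</Metadata>.*?-->'
    r"|<!--\s*</Section>\s*-->",
    re.DOTALL,
)


def parse_sections(text):
    """Build a tree of sections; each node has a title, its own text, and children."""
    root = {"title": None, "text": "", "children": []}
    stack = [root]
    position = 0
    for match in TOKEN.finditer(text):
        # Text before a tag belongs to the section that is currently open.
        stack[-1]["text"] += text[position:match.start()]
        position = match.end()
        if match.group("title") is not None:
            node = {"title": match.group("title").strip(), "text": "", "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif len(stack) > 1:
            stack.pop()
    stack[-1]["text"] += text[position:]
    return root


SAMPLE = """<!--
<Section>
   <Description>
       <Metadata name="Title">Outer section</Metadata>
   </Description>
-->
(text of first part of section goes here)
<!--
<Section>
   <Description>
       <Metadata name="Title">Inner subsection</Metadata>
   </Description>
-->
(text of subsection goes here)
<!--
</Section>
-->
(text of last part of section goes here)
<!--
</Section>
-->
"""

tree = parse_sections(SAMPLE)
outer = tree["children"][0]
inner = outer["children"][0]
print(outer["title"], "->", inner["title"])
```

The stack mirrors the nesting: text between an inner closing tag and the outer closing tag is attached to the outer section again.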
+===== Classifiers =====
+Classifiers are used to create a collection's browsing indexes. Examples are the //dlpeople// collection's //Titles A-Z// index, and the //Subject//, //How to//, //Organisation// and //Titles A-Z// indexes in the Humanity Development Library, of which the Demo collection is a subset. The navigation bar near the top of the screenshots in Figures 3 and 8a includes the //search// function, which is always provided, followed by buttons for any classifiers that have been defined. The information used to support browsing is stored in the collection information database, and is placed there by classifiers that are called during the final phase of //buildcol.pl//.
+<imgcaption figure_azlist_classifier|//AZList// classifier></imgcaption>
+{{..:images:dev_fig_11.png?390x182&direct}}
+Classifiers, like plugins, are specified in a collection's configuration file. For each one there is a line starting with the keyword //classify// and followed by the name of the classifier and any options it takes. The basic collection configuration file discussed in Section [[#configuration_file|configuration_file]] includes the line //classify AZList -metadata Title//, which makes an alphabetic list of titles by taking all those with a //Title// metadata field, sorting them and splitting them into alphabetic ranges. An example is shown in Figure <imgref figure_azlist_classifier>.
+<imgcaption figure_list_classifier|//List// classifier></imgcaption>
+{{..:images:dev_fig_12.png?394x143&direct}}
+A simpler classifier, called //List//, illustrated in Figure <imgref figure_list_classifier>, creates a sorted list of a given metadata element and displays it without any alphabetic subsections. An example is the //how to// metadata in the Demo collection, which is produced by a line //classify List -metadata Howto// in the collection configuration file.((Note that more recent versions of the Demo collection use a Hierarchy classifier to display the how to metadata. In this case they will be displayed slightly differently from what is shown in Figure <imgref figure_list_classifier>.)) Another general-purpose list classifier is //DateList//, illustrated in Figure <imgref figure_datelist_classifier>, which generates a selection list of date ranges. (The //DateList// classifier is also used in the Greenstone Archives collection.)
+<imgcaption figure_datelist_classifier|//DateList// classifier></imgcaption>
+{{..:images:dev_fig_13.png?390x159&direct}}
+Other classifiers generate browsing structures that are explicitly hierarchical. Hierarchical classifications are useful for subject classifications and subclassifications, and organisational hierarchies. The Demo collection's configuration file contains the line //classify Hierarchy -hfile sub.txt -metadata Subject -sort Title//, and Figure <imgref figure_hierarchy_classifier> shows the subject hierarchy browser that it produces. The bookshelf with a bold title is the one currently being perused; above it you can see the subject classification to which it belongs. In this example the hierarchy for classification is stored in a simple text format in //sub.txt//.
+<imgcaption figure_hierarchy_classifier|//Hierarchy// classifier></imgcaption>
+{{..:images:dev_fig_14.png?394x198&direct}}
+All classifiers generate a hierarchical structure that is used to display a browsing index. The lowest levels (i.e. leaves) of the hierarchy are usually documents, but in some classifiers they are sections. The internal nodes of the hierarchy are either //VList//, //HList//, or //DateList//. A //VList// is a list of items displayed vertically down the page, like the “how to” index in the Demo collection (see Figure <imgref figure_list_classifier>). An //HList// is displayed horizontally. For example, the //AZList// display in Figure <imgref figure_azlist_classifier> is a two-level hierarchy of internal nodes consisting of an //HList// (giving the A-Z selector) whose children are //VLists//, and their children, in turn, are documents. A //DateList// (Figure <imgref figure_datelist_classifier>) is a special kind of //VList// that allows selection by year and month.
+The lines used to specify classifiers in collection configuration files contain a //metadata// argument that identifies the metadata by which the documents are classified and sorted. Any document in the collection that does not have this metadata defined will be omitted from the classifier (but it is still indexed, and consequently searchable). If no //metadata// argument is specified, all documents are included in the classifier, in the order in which they are encountered during the building process. This is useful if you want a list of all documents in your collection.
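The two-level structure produced by an //AZList// style classifier can be sketched as follows. This Python model is a simplification: the function ''az_list'' and the document dictionaries are hypothetical, and it makes one bucket per initial letter, whereas the text above says that the sorted list is split into alphabetic ranges.

```python
from itertools import groupby


def az_list(documents, metadata="Title"):
    """Return an HList of VLists, one per initial letter of the metadata value."""
    # Documents without the metadata are left out of the classifier.
    selected = [doc for doc in documents if doc.get(metadata)]
    selected.sort(key=lambda doc: doc[metadata].lower())
    buckets = []
    for letter, group in groupby(selected, key=lambda doc: doc[metadata][0].upper()):
        buckets.append(
            {"type": "VList", "label": letter, "children": [doc["id"] for doc in group]}
        )
    return {"type": "HList", "children": buckets}


documents = [
    {"id": "d1", "Title": "Bees"},
    {"id": "d2", "Title": "apples"},
    {"id": "d3"},  # no Title, so omitted from the browsing index
    {"id": "d4", "Title": "Ants"},
]

index = az_list(documents)
for bucket in index["children"]:
    print(bucket["label"], bucket["children"])
```

The document without a //Title// never appears in the index, which corresponds to the omission rule described above.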
+<tblcaption table_greenstone_classifiers|Greenstone classifiers></tblcaption>
+|< - 132 92 305 >|
+| | **Argument** | **Purpose** |
+| //Hierarchy// | | Hierarchical classification |
+| | //hfile// | Classification file |
+| | //metadata// | Metadata element to test against //hfile// identifier |
+| | //sort// | Metadata element used to sort documents within leaves (defaults to //Title//) |
+| | //buttonname// | Name of the button used to access this classifier (defaults to value of metadata argument) |
+| //List// | | Alphabetic list of documents |
+| | //metadata// | Include documents containing this metadata element |
+| | //buttonname// | Name of button used to access this classifier (defaults to value of metadata argument) |
+| //SectionList// | | List of sections in documents |
+| //AZList// | | List of documents split into alphabetical ranges |
+| | //metadata// | Include all documents containing this metadata element |
+| | //buttonname// | Name of button used to access this classifier (defaults to value of metadata argument) |
+| //AZSectionList// | | Like //AZList// but includes every section of the document |
+| //DateList// | | Similar to //AZList// but sorted by date |
+The current set of classifiers is listed in Table <tblref table_greenstone_classifiers>. Just as you can use the //pluginfo.pl// program to find out about any plugin, there is a //classinfo.pl// program that gives you information about any classifier, and the options it provides.
+All classifiers accept the argument //buttonname//, which defines what is written on the Greenstone navigation button that invokes the classifier (it defaults to the name of the metadata argument). Buttons are provided for each Dublin Core metadata type, and for some other types of metadata.
+Each classifier receives an implicit name from its position in the configuration file. For example, the third classifier specified in the file is called CL3. This is used to name the collection information database fields that define the classifier hierarchy.
+Collection-specific classifiers can be written, and are stored in the collection's //perllib/classify// directory. The Development Library has a collection-specific classifier called //HDLList//, which is a minor variant of //AZList//.
+==== List classifiers ====
+The various flavours of list classifier are shown below.
+  * //SectionList//: like //List// but the leaves are sections rather than documents. All document sections are included except the top level. This is used to create lists of sections (articles, chapters or whatever) such as in the Computists' Weekly collection (available through //nzdl.org//), where each issue is a single document and comprises several independent news items, each in its own section.
+  * //AZList//: generates a two-level hierarchy comprising an //HList// whose children are //VLists//, whose children are documents. The //HList// is an A-Z selector that divides the documents into alphabetic ranges. Documents are sorted alphabetically by metadata, and the resulting list is split into ranges.
+  * //AZSectionList//: like //AZList// but the leaves are sections rather than documents.
+  * //DateList//: like //AZList// except that the top-level //HList// allows selection by year and its children are //DateLists// rather than //VLists//. The metadata argument defaults to //Date//.
+==== The hierarchy classifier ====
+All classifiers are hierarchical. However, the list classifiers described above have a fixed number of levels, whereas the “hierarchy” classifiers described in this section have an arbitrary number of levels. Hierarchy classifiers are more complex to specify than list classifiers.
+<imgcaption figure_part_of_the_file_sub|Part of the file //sub.txt//></imgcaption>
+<code>
+1         1         "General reference"
+1.2       1.2       "Dictionaries, glossaries, language courses, terminology
+2         2         "Sustainable Development, International cooperation, Pro
+2.1       2.1       "Development policy and theory, international cooperatio
+2.2       2.2       "Development, national planning, national plans"
+2.3       2.3       "Project planning and evaluation (incl. project managem
+2.4       2.4       "Regional development and planning incl. regional profil
+2.5       2.5       "Nongovernmental organisations (NGOs) in general, self-
+2.6       2.6       "Organisations, institutions, United Nations (general, d
+2.6.1     2.6.1     "United Nations"
+2.6.2     2.6.2     "International organisations"
+2.6.3     2.6.3     "Regional organisations"
+2.6.5     2.6.5     "European Community - European Union"
+2.7       2.7       "Sustainable Development, Development models and example
+2.8       2.8       "Basic Human Needs"
+2.9       2.9       "Hunger and Poverty Alleviation"
+</code>
+The //hfile// argument gives the name of a file, like that in Figure <imgref figure_part_of_the_file_sub>, which defines the metadata hierarchy. Each line describes one classification, and the descriptions have three parts:
+  * Identifier, which matches the value of the metadata (given by the //metadata// argument) to the classification.
+  * Position-in-hierarchy marker, in multi-part numeric form, e.g. 2, 2.12, 2.12.6.
+  * The name of the classification. (If this contains spaces, it should be placed in quotation marks.)
+Figure <imgref figure_part_of_the_file_sub> is part of the //sub.txt// file used to create the subject hierarchy in the Development Library (and the Demo collection). This example is a slightly confusing one because the number representing the hierarchy appears twice on each line. The metadata type //Hierarchy// is represented in documents with values in hierarchical numeric form, which accounts for the first occurrence. It is the second occurrence that is used to determine the hierarchy that the hierarchy browser implements.
+The //hierarchy// classifier has an optional argument, //sort//, which determines how the documents at the leaves are ordered. Any metadata can be specified as the sort key. The default is to produce the list in the order in which the building process encounters the documents. Ordering at internal nodes is determined by the order in which things are specified in the //hfile// argument.
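+To make the three-part line format concrete, here is a minimal Python sketch that reads lines in the style of //sub.txt//. It is an illustration only, not the parser Greenstone uses: the function names //parse_hfile// and //children_of// are hypothetical, and the sample lines are the complete (untruncated) entries from the figure.

```python
import shlex


def parse_hfile(lines):
    """Split each line into (identifier, position, name).

    The position marker "2.6.1" becomes the tuple (2, 6, 1).
    shlex handles names that are wrapped in quotation marks.
    """
    entries = []
    for line in lines:
        fields = shlex.split(line)
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"expected three parts, got {len(fields)}: {line!r}")
        identifier, position, name = fields
        entries.append((identifier, tuple(int(p) for p in position.split(".")), name))
    return entries


def children_of(entries, parent):
    """Names directly below the given position, in the order the file lists them."""
    return [name for _, position, name in entries if position[:-1] == parent]


sample = [
    '2.6.1   2.6.1   "United Nations"',
    '2.6.2   2.6.2   "International organisations"',
    '2.8     2.8     "Basic Human Needs"',
    '2.9     2.9     "Hunger and Poverty Alleviation"',
]
entries = parse_hfile(sample)
```

+Because //children_of// preserves file order, the sketch also reflects the rule that ordering at internal nodes follows the order of the lines in the file.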
+==== How classifiers work ====
+Classifiers are Perl objects, derived from //BasClas.pm//, and are stored in the //perllib/classify// directory. They are used when the collection is built. When they are executed, the following four steps occur.
+  - The //new// method creates the classifier object.
+  - The //init// method initialises the object with parameters such as metadata type, button name and sort criterion.
+  - The //classify// method is invoked once for each document, and stores information about the classification made within the classifier object.
+  - The //get_classify_info// method returns the locally stored classification information to the build process, which then writes it to the collection information database for use when the collection is displayed at runtime.
+The //classify// method retrieves each document's OID, the metadata value on which the document is to be classified, and, where necessary, the metadata value on which the documents are to be sorted. The //get_classify_info// method performs all sorting and classifier-specific processing. For example, in the case of the //AZList// classifier, it splits the list into ranges.
+The build process initialises the classifiers as soon as the //builder// object is created. Classifications are created during the build phase, when the information database is created, by //classify.pm//, which resides in Greenstone's //perllib// directory.
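+The four steps can be sketched as follows. This is a Python stand-in for the life cycle described above, not the Perl interface of //BasClas.pm//: the method names mirror those in the text, but the class name, the argument lists and the shape of the returned data are assumptions.

```python
class ListClassifierSketch:
    """Hypothetical stand-in for a simple list classifier."""

    def __init__(self):
        # Step 1 ("new"): create the classifier object with empty local storage.
        self.entries = []
        self.metadata = None
        self.sort = None

    def init(self, metadata, sort=None):
        # Step 2: record the metadata to classify on and the optional sort key.
        self.metadata = metadata
        self.sort = sort

    def classify(self, doc):
        # Step 3: called once per document; store OID, value and sort value.
        value = doc["metadata"].get(self.metadata)
        if value is None:
            return  # in this sketch, documents without the metadata are skipped
        sort_value = doc["metadata"].get(self.sort, "") if self.sort else ""
        self.entries.append((sort_value, doc["oid"], value))

    def get_classify_info(self):
        # Step 4: do the sorting and hand the result back to the build process.
        entries = sorted(self.entries, key=lambda e: e[0]) if self.sort else self.entries
        return {"metadata": self.metadata, "contains": [oid for _, oid, _ in entries]}


documents = [
    {"oid": "D1", "metadata": {"Howto": "Build a well", "Title": "Wells"}},
    {"oid": "D2", "metadata": {"Title": "No how-to here"}},
    {"oid": "D3", "metadata": {"Howto": "Dry fruit", "Title": "Fruit"}},
]
classifier = ListClassifierSketch()
classifier.init(metadata="Howto", sort="Title")
for document in documents:
    classifier.classify(document)
info = classifier.get_classify_info()
```

+Note that //classify// only collects values; all ordering is deferred to //get_classify_info//, which matches the division of labour described above.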
+<tblcaption table_items_appearing_in_format_strings|Items appearing in format strings></tblcaption>
+|< - 132 397 >|
+| //[Text]// | The document's text |
+| //[link] … [/link]// | The html to link to the document itself |
+| //[icon]// | An appropriate icon (e.g. the little text icon in a //Search Results// string) |
+| //[num]// | The document number (useful for debugging). |
+| //[metadata-name]// | The value of this metadata element for the document, e.g. //[Title]// |
+===== Formatting Greenstone output =====
+The web pages you see when using Greenstone are not pre-stored but are generated “on the fly” as they are needed. The appearance of many aspects of the pages is controlled using “format strings.” Format strings belong in the collection configuration file, introduced by the keyword //format// followed by the name of the element to which the format applies. There are two different kinds of page element that are controlled by format strings. The first comprises the items on the page that show documents or parts of documents. The second comprises the lists produced by classifiers and searches. All format strings are interpreted at the time that pages are displayed. Since they take effect as soon as any changes in //collect.cfg// are saved, experimenting with format strings is quick and easy.
+Table <tblref table_the_format_options> shows the format statements that affect the way documents look. The //DocumentButtons// option controls which buttons are displayed on a document page. Here, //string// is a list of buttons (separated by |), possible values being //Detach//, //Highlight//, //Expand Text//, and //Expand Contents//. Reordering the list reorders the buttons.
+<tblcaption table_the_format_options|The //format// options></tblcaption>
+|< - 246 283 >|
+| //format DocumentImages true/false// | If //true//, display a cover image at the top left of the document page (default //false//). |
+| //format DocumentHeading formatstring// | If //DocumentImages// is //false//, the format string controls how the document header shown at the top left of the document page looks (default //[Title]//). |
+| //format DocumentContents true/false// | Display table of contents (if document is hierarchical), or next/previous section arrows and “page k of n” text (if not). |
+| //format DocumentButtons string// | Controls the buttons that are displayed on a document page (default //Detach|Highlight//). |
+| //format DocumentText formatstring// | Format of the text to be displayed on a document page: default \\ ''<!--i--><center><table width=537><!--/i-->'' \\ ''<!--i--><tr><td>[Text]</td></tr><!--/i-->'' \\ ''<!--i--></table></center><!--/i-->'' |
+| //format DocumentArrowsBottom true/false// | Display next/previous section arrows at bottom of document page (default //true//). |
+| //format DocumentUseHTML true/false// | If //true//, each document is displayed inside a separate frame. The Preferences page will also change slightly, adding options applicable to a collection of html documents, including the ability to go directly to the original source document (anywhere on the Web) rather than to the Greenstone copy. |
+==== Formatting Greenstone lists ====
+Format strings that control how lists look can apply at different levels of the display structure. They can alter all lists of a certain type within a collection (for example //DateList//), or all parts of a list (for example all the entries in the //Search// list), or specific parts of a certain list (for example, the vertical list part of an //AZList// classifier on title).
+Following the keyword //format// is a two-part keyword, only one part of which is mandatory. The first part identifies the list to which the format applies. The list generated by a search is called //Search//, while the lists generated by classifiers are called //CL1, CL2, CL3,…// for the first, second, third,… classifier specified in //collect.cfg//. The second part of the keyword is the part of the list to which the formatting is to apply—either //HList// (for horizontal list, like the A-Z selector in an //AZList//), //VList// (for vertical list, like the list of titles under an //AZList//), or //DateList//. For example:
+> //format CL4VList ...//                                  applies to all //VLists// in CL4
+> //format CL2HList ...//                                applies to all //HLists// in CL2
+> //format CL1DateList ...//                      applies to all //DateLists// in CL1
+> //format SearchVList ...//                        applies to the Search Results list
+> //format CL3 ...//                                                    applies to all nodes in CL3, unless otherwise specified
+> //format VList ...//                                                applies to all //VLists// in all classifiers, unless otherwise specified
+The “...” in these examples stands for html format specifications that control the information, and its layout, that appear on web pages displaying the classifier. As well as html specifications, any metadata may appear within square brackets: its value is interpolated in the indicated place. Also, any of the items in Table <tblref table_items_appearing_in_format_strings> may appear in format strings. The syntax for the strings also includes a conditional statement, which is illustrated in an example below.
+Recall that all classifiers produce hierarchies. Each level of the hierarchy is displayed in one of four possible ways. We have already encountered //HList//, //VList//, and //DateList//. There is also //Invisible//, which is how the very top levels of hierarchies are displayed—because the name of the classifier is already shown separately on the Greenstone navigation bar.
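+One way to picture how a two-part keyword such as //CL4VList// relates to the more general //CL4// and //VList// forms is a lookup from most specific to least specific. The Python sketch below is an illustration under assumptions: the function name //find_format// is hypothetical, the //VList// format string is invented for the example, and the relative precedence of a list-only key (//CL4//) over a part-only key (//VList//) is a guess, since the text only says that each applies “unless otherwise specified.”

```python
def find_format(formats, list_name, part, default=""):
    """Return the format string that applies to one part of one list.

    Assumed lookup order: "CL4VList", then "CL4", then "VList",
    then a built-in default.
    """
    for key in (f"{list_name}{part}", list_name, part):
        if key in formats:
            return formats[key]
    return default


formats = {
    "CL4VList": "<br>[link][Howto][/link]",
    "VList": "<td>[link][Title][/link]</td>",  # hypothetical general format
}
specific = find_format(formats, "CL4", "VList")
general = find_format(formats, "CL2", "VList")
fallback = find_format(formats, "CL2", "HList", default="built-in default")
```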
+==== Examples of classifiers and format strings ====
+<imgcaption figure_excerpt_from_the_demo_collection_collect|%!-- id:674 --%Excerpt from the Demo collection's //collect.cfg// %!-- withLineNumber --%></imgcaption>
+<code 1>
+classify Hierarchy -hfile sub.txt -metadata Subject -sort Title
+classify AZList       -metadata Title
+classify Hierarchy -hfile org.txt -metadata Organisation -sort Title
+classify List           -metadata Howto
+format SearchVList " <td valign=top>[link][icon][/link]</td><td>{If}
+                                       {[parent(All': '):Title],[parent(All': '):Title]:}[link][Title][/link]</td> "
+format CL4Vlist                 "<br>[link][Howto][/link] "
+format DocumentImages   true
+format DocumentText       "<h3>[Title]</h3>\\n\\n<p>[Text]"
+format DocumentButtons " Expand Text|Expand contents|Detach|Highlight"
+</code>
+Figure <imgref figure_excerpt_from_the_demo_collection_collect> shows part of the collection configuration file for the Demo collection. We use this as an example because it has several classifiers that are richly formatted. Note that statements in collection configuration files must not contain newline characters; in the figure, longer lines are broken up for readability.
+Line 4 specifies the Demo collection's //How To// classifier. This is the fourth in the collection configuration file, and is therefore referred to as CL4. The corresponding format statement is line 7 of Figure <imgref figure_excerpt_from_the_demo_collection_collect>. The “how to” information is generated from the //List// classifier, and its structure is the plain list of titles shown in Figure <imgref figure_list_classifier>. The titles are linked to the documents themselves: clicking a title brings up the relevant document. The children of the hierarchy's top level are displayed as a //VList// (vertical list), which lists the sections vertically. As the associated //format// statement indicates, each element of the list is on a new line (“ //<br>// ”) and contains the //Howto// text, hyperlinked to the document itself.
+Line 1 specifies the Demo collection's //Subject// classification, referred to as CL1 (the first in the configuration file), and Line 3 the //Organisation// classification CL3. Both are generated by the //Hierarchy// classifier and therefore comprise a hierarchical structure of //VLists//.
+Line 2 specifies the remaining classification for the Demo collection, //Titles A-Z// (CL2). Note that there are no corresponding format strings for the classifiers CL1 to CL3. Greenstone has built-in defaults for each format string type, so it is not necessary to set a format string unless you want to override the default.
+<imgcaption figure_formatting_the_document|%!-- id:679 --%Formatting the document ></imgcaption>
+{{..:images:dev_fig_17.gif?394x277&direct}}
+This accounts for the four //classify// lines in Figure <imgref figure_excerpt_from_the_demo_collection_collect>. Coincidentally, there are also four //format// lines. We have already discussed one, the //CL4Vlist// one. The remaining three are the first type of format string, documented in Table <tblref table_the_format_options>. For example, line 8 places the cover image at the top left of each document page. Line 9 formats the actual document text, with the title of the relevant chapter or section preceding the text itself. These are illustrated in Figure <imgref figure_formatting_the_document>.
+<imgcaption figure_formatting_the_search_results|%!-- id:681 --%Formatting the search results ></imgcaption>
+{{..:images:dev_fig_18.gif?396x243&direct}}
+Line 5 of Figure <imgref figure_excerpt_from_the_demo_collection_collect> is a rather complicated specification that formats the query result list returned by a search, whose parts are illustrated in Figure <imgref figure_formatting_the_search_results>. A simplified version of the format string is
+<code>
+<td valign=top>[link][icon][/link]</td>
+<td>[link][Title][/link]</td>
+</code>
+This is designed to appear as a table row, which is how the query results list is formatted. It gives a small icon linked to the text, as usual, and the document title, hyperlinked to the document itself.
+In this collection, documents are hierarchical. In fact, the above hyperlink anchor evaluates to the title of the section returned by the query. However, it would be better to augment it with the title of the enclosing section, the enclosing chapter, and the book in which it occurs. There is a special metadata item, //parent//, which is not stored in documents but is implicit in any hierarchical document, that produces such a list. It either returns the parent document, or, if used with the qualifier //All//, the list of hierarchically enclosing parents, separated by a character string that can be given after the //All// qualifier. Thus
+<code>
+<td valign=top>[link][icon][/link]</td>
+<td>{[parent(All': '):Title]: }[link][Title][/link]</td>
+</code>
+has the effect of producing a list containing the book title, chapter title, etc. that enclose the target section, separated by colons, with a further colon followed by a hyperlink to the target section's title.
+Unfortunately, if the target is itself a book, there is no parent and so an empty string will appear followed by a colon. To circumvent such problems you can use //if// and //or … else// statements in a format string:
+<code>
+{If}{[metadata], action-if-non-null, action-if-null}
+{Or}{action, else another-action, else another-action, etc}
+</code>
+In either case curly brackets are used to signal that the statements should be interpreted and not just printed out as text. The //If// tests whether the metadata is empty and takes the first clause if not, otherwise the second one (if it exists). Any metadata item can be used, including the special metadata //parent//. The //Or// statement evaluates each action in turn until one is found that is non-null. That one is sent to the output and the remaining actions are skipped.
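+The semantics of metadata interpolation and of the //If// and //Or// statements can be modelled in a few lines. The Python sketch below implements the behaviour described above rather than Greenstone's own parser: the function names are hypothetical, the clauses are passed as separate arguments instead of being parsed out of a format string, and special items such as //[link]// and //[icon]// are outside its scope.

```python
import re

# Matches "[Name]" but not closing items such as "[/link]".
ITEM = re.compile(r"\[([^\[\]/]+)\]")


def interpolate(template, metadata):
    """Replace each [Name] with the metadata value, or nothing if it is unset."""
    return ITEM.sub(lambda match: metadata.get(match.group(1), ""), template)


def format_if(test, if_non_null, if_null, metadata):
    """{If}{test, if_non_null, if_null}: pick a clause by whether test is empty."""
    chosen = if_non_null if interpolate(test, metadata) else if_null
    return interpolate(chosen, metadata)


def format_or(actions, metadata):
    """{Or}{a, b, ...}: output the first action that evaluates to non-null."""
    for action in actions:
        result = interpolate(action, metadata)
        if result:
            return result
    return ""


section = {"Title": "Farming", "parent(All': '):Title": "Book: Chapter"}
book = {"Title": "Book"}
key = "[parent(All': '):Title]"
with_parent = format_if(key, key + ": ", "", section)
without_parent = format_if(key, key + ": ", "", book)
first_non_null = format_or(["[Subject]", "[Title]", "Untitled"], book)
```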
+Returning to line 5 of Figure <imgref figure_excerpt_from_the_demo_collection_collect>, the full format string is
+<code>
+<td valign=top>[link][icon][/link]</td>
+<td>{If}{[parent(All': '):Title],
+           [parent(All': '):Title]:}
+      [link][Title][/link]</td>
+</code>
+This precedes the //parent// specification with a conditional that checks whether the result is empty and only outputs the parent string when it is present. Incidentally, //parent// can be qualified by //Top// instead of //All//, which gives the top-level document name that encloses a section—in this case, the book name. No separating string is necessary with //Top//.
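+The difference between the //All// and //Top// qualifiers can be sketched as follows. This Python illustration assumes that the enclosing titles are available as a list ordered from the book down to the immediate parent; the function names and sample titles are hypothetical, and the output is plain text rather than the table cells of the real format string.

```python
def parent_all(ancestor_titles, separator):
    """[parent(All'sep'):Title]: every enclosing title, joined by the separator."""
    return separator.join(ancestor_titles)


def parent_top(ancestor_titles):
    """[parent(Top):Title]: only the top-level title; no separator is needed."""
    return ancestor_titles[0] if ancestor_titles else ""


def result_label(ancestor_titles, title):
    """Mimic the conditional: emit the parents and a colon only when parents exist."""
    parents = parent_all(ancestor_titles, ": ")
    prefix = f"{parents}: " if parents else ""
    return prefix + title


nested = result_label(["Book", "Chapter 2"], "Section 2.1")
top_level = result_label([], "Book")
```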
+Some final examples illustrate other features. The //DateList// in Figure <imgref figure_datelist_classifier> is used in the //Dates// classification of the Computists' Weekly collection (which happens to be the second classifier, CL2). The classifier and format specifications are shown below. The //DateList// classifier differs from //AZList// in that it always sorts by //Date// metadata, and the bottom branches of the browsing hierarchy use //DateList// instead of //VList//, which causes the year and month to be added at the left of the document listings.
+<code>
+classify AZSectionList metadata=Creator
+format CL2Vlist "<td>[link][icon][/link]</td>
+                 <td>[Creator]</td>
+                 <td>&nbsp;&nbsp;[Title]</td>
+                 <td>[parent(Top):Date]</td> "
+</code>
+The format specification shows these //VLists// in the appropriate way.
+The format-string mechanism is flexible but tricky to learn. The best way is by studying existing collection configuration files.
+==== Linking to different document versions ====
+Using the //[link] … [/link]// mechanism in a format string inserts a hyperlink to the text of a document, and when the link is clicked the html version of the document is displayed. In some collections, it is useful to be able to display other versions of the document. For example, in a collection of Microsoft Word documents, it is nice to be able to display the Word version of each document rather than the html that is extracted from it; similarly for PDF documents.
+The key to being able to show different versions of a document is to embed the necessary information about where the other versions reside into the Greenstone archive form of the document. The information is represented in the form of metadata. Recall that putting
+<code>
+[link][Title][/link]
+</code>
+into a format string creates a link to the html form of the document, whose anchor text is the document's title. The Word and PDF plugins both generate //srclink// metadata so that if you put
+<code>
+[srclink][Title][/srclink]
+</code>
+into a format string, a link is created to the Word or PDF form of the document; again the anchor in this example is the document's title. In order that the appropriate icon can be displayed for Word and PDF documents, these plugins also generate //srcicon// metadata so that
+<code>
+[srclink][srcicon][/srclink]
+</code>
+creates a link which is labeled by the standard Word or PDF icon (whichever is appropriate), rather than the document's title.
+===== Controlling the Greenstone user interface =====
+The entire Greenstone user interface is controlled by macros which reside in the //GSDLHOME/macros// directory. They are written in a language designed especially for Greenstone, and are used at run time to generate web pages. Translating the macro language into html is the last step in displaying a page. Thus changes to a macro file affect the display immediately, making experimentation quick and easy. All macro files used by Greenstone are listed in //GSDLHOME/etc/main.cfg// and are loaded every time it starts. One exception to this is when using the Windows Local Library; in this case it is necessary to restart the process.
+Web pages are generated on the fly for a number of reasons, and the macro system is how Greenstone implements the necessary flexibility. Pages can be presented in many languages, and a different macro file is used to store all the interface text in each language. When Greenstone displays a page the macro interpreter checks a language variable and loads the page in the appropriate language (this does not, unfortunately, extend to translating document content). Also, the values of certain display variables, like the number of documents found by a search, are not known ahead of time; these are interpolated into the page text in the form of macros.
+==== The macro file format ====
+Macro files have a //.dm// extension. Each file defines one or more //packages//, each containing a series of macros used for a single purpose. As with classifiers and plugins, there is a base from which to build macros, called //base.dm//; this file defines the basic content of a page.
+Macros have names that begin and end with an underscore, and their content is defined using curly brackets. Content can be plain text, html (including links to Java applets and JavaScript), macro names, or any combination of these. This macro from //base.dm// defines the content of a page in the absence of any overriding macro:
+<code>
+_content_ {<p><h2>Oops</h2>_textdefaultcontent_}
+</code>
+The page will read “Oops” at the top, followed by the content of //_textdefaultcontent_//, which is defined, in English, to be //The requested page could not be found. Please use your browsers 'back' button or the above home button to return to the Greenstone Digital Library,// and in other languages to be a suitable translation of this sentence.
+//_textdefaultcontent_// and //_content_// both reside in the //global// package because they are required by all parts of the user interface. Macros can use macros from other packages as content, but they must prefix their names with their package name. For example,
+<code>
+_collectionextra_ {This collection contains _about:numdocs_ documents. It was last built _about:builddate_ days ago.}
+</code>
+comes from //english.dm//, and is used as the default description of a collection. It is part of the //global// package, but //_numdocs_// and //_builddate_// are both in the //about// package—hence the //about:// preceding their names.
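+Package-qualified names can be illustrated with a toy macro expander. The Python sketch below is not Greenstone's macro interpreter: the function name //expand//, the dictionary keyed by (package, name), the sample value for //numdocs// and the rule that an unqualified name falls back to the //Global// package are all assumptions made for the example.

```python
import re

# Matches _name_ and _package:name_.
MACRO = re.compile(r"_(?:([A-Za-z]+):)?([A-Za-z]+)_")


def expand(text, macros, package="Global", depth=0):
    """Recursively replace macro names in text with their definitions."""
    if depth > 20:
        raise RecursionError("macro definitions appear to be circular")

    def lookup(match):
        qualified = match.group(1) is not None
        target = match.group(1) if qualified else package
        body = macros.get((target, match.group(2)))
        if body is None and not qualified:
            # Assumed fallback: unqualified names may live in the Global package.
            target = "Global"
            body = macros.get((target, match.group(2)))
        if body is None:
            return match.group(0)  # leave unknown macros untouched
        return expand(body, macros, target, depth + 1)

    return MACRO.sub(lookup, text)


macros = {
    ("Global", "collectionextra"): "This collection contains _about:numdocs_ documents.",
    ("about", "numdocs"): "42",  # hypothetical value for illustration
}
page = expand("_collectionextra_", macros)
```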
+Macros often contain conditional statements. They resemble the format string conditional described above, though their appearance is slightly different. The basic format is //_If_(x,y,z)//, where //x// is a condition, //y// is the macro content to use if that condition is true, and //z// the content if it is false. Comparison operators are the same as the simple ones used in Perl (less than, greater than, equals, not equals). This example from //base.dm// is used to determine how to display the top of a collection's //about// page:
+<code>
+_imagecollection_ {
+       _If_("_iconcollection_" ne "",
+                 <a href="_httppageabout_">
+                         <img src="_iconcollection_" border=0>
+                         </a>,
+                 _imagecollectionv_)
+}
+</code>
+This looks rather obscure. //_iconcollection_// resolves to the filename of an image, or to the empty string if the collection doesn't have an icon. To paraphrase the above code: if there is a collection image, display it as a hyperlink to the //About this Collection// page (referred to by //_httppageabout_//); otherwise use the alternative display //_imagecollectionv_//.
+Macros can take arguments. Here is a second definition for the //_imagecollection_// macro which immediately follows the definition given above in the //base.dm// file:
+<code>
+_imagecollection_[v=1]{_imagecollectionv_}
+</code>
+The argument //[v=1]// specifies that the second definition is used when Greenstone is running in text-only mode. The language macros work similarly—apart from //english.dm//, because it is the default, all language macros specify their language as an argument. For example,
+<code>
+_textimagehome_ {Home Page}
+</code>
+appears in the English language macro file, whereas the German version is
+<code>
+_textimagehome_ [l=de] {Hauptseite}
+</code>
+The English and German versions are in the same package, though they are in separate files (package definitions may span more than one file). Greenstone uses its //l// argument at run time to determine which language to display.
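+Choosing between several definitions of the same macro can be sketched as a match on arguments. This Python illustration is an assumption-laden model rather than Greenstone's implementation: the function name //select_definition// is hypothetical, and the rule that the definition with the most matching arguments wins, with the argument-free definition as the default, is a guess consistent with the examples above.

```python
def select_definition(definitions, args):
    """Pick the body of the most specific definition whose arguments all match.

    definitions: list of (params, body) pairs, e.g. ({"l": "de"}, "Hauptseite").
    args: the arguments in force when the page is displayed, e.g. {"l": "de"}.
    """
    best = None
    for params, body in definitions:
        if all(args.get(name) == value for name, value in params.items()):
            if best is None or len(params) > len(best[0]):
                best = (params, body)
    return best[1] if best else None


textimagehome = [
    ({}, "Home Page"),            # default (English) definition
    ({"l": "de"}, "Hauptseite"),  # German definition
]
german = select_definition(textimagehome, {"l": "de"})
english = select_definition(textimagehome, {"l": "en"})
```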
+<imgcaption figure_part_of_the_aboutdm_macro_file|%!-- id:714 --%Part of the //about.dm// macro file ></imgcaption>
+<code>
+package about
+##############################################
+# about page content
+###############################################
+_pagetitle_ {_collectionname_}
+_content_ {
+<center>
+_navigationbar_
+</center>
+_query:queryform_
+<p>_iconblankbar_
+<p>_textabout_
+_textsubcollections_
+<h3>_help:textsimplehelpheading_</h3>
+_help:simplehelp_
+}
+_textabout_ {
+<h3>_textabcol_</h3>
+_Global:collectionextra_
+}
+</code>
+As a final example, Figure <imgref figure_part_of_the_aboutdm_macro_file> shows an excerpt from the macro file //about.dm// that is used to generate the “About this collection” page for each collection. It shows three macros being defined, //_pagetitle_//, //_content_// and //_textabout_//.
+==== Using macros ====
+Macros are powerful, and can be a little obscure. However, with a good knowledge of html and a bit of practice, they become a quick and easy way to customise your Greenstone site.
+For example, suppose you wanted to create a static page that looked like your current Greenstone site. You could create a new package, called //static//, for example, in a new file, and override the //_content_// macro. Add the new filename to the list of macros in //GSDLHOME/etc/main.cfg// which Greenstone loads every time it is invoked. Finally, access the new page by using your regular Greenstone URL and appending the arguments //?a=p&p=static// (e.g. //http:%%//%%servername/cgi-bin/library?a=p&p=static//).
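+As a small illustration of the URL form described above, the following Python sketch appends the //a=p// and //p=static// arguments to a library URL. The function name //static_page_url// is hypothetical, and //servername// is the same placeholder used in the text.

```python
from urllib.parse import urlencode


def static_page_url(library_url, package):
    """Build the URL that displays the page defined by the given package."""
    return f"{library_url}?{urlencode({'a': 'p', 'p': package})}"


url = static_page_url("http://servername/cgi-bin/library", "static")
```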
+To change the “look and feel” of Greenstone you can edit the //base// and //style// packages. To change the Greenstone home page, edit the //home// package (this is described in the //Greenstone Digital Library Installer's Guide//). To change the query page, edit //query.dm//.
+Experiment freely with macros. Changes appear instantly, because macros are interpreted as pages are displayed. The macro language is a useful tool that can be used to make your Greenstone site your own.
+===== The packages directory =====
+<tblcaption table_the_packages_directory_1|The //packages// directory></tblcaption>
+|< - 132 217 180 >|
+| **Directory** | **Package** | **URL** |
+| //mg// | mg, short for “Managing Gigabytes.” Compression, indexing and search software used to manage textual information in Greenstone collections. | //www.citri.edu.au/mg// |
+| //wget// | Web mirroring software for use with Greenstone. Written in C. | //www.tuwien.ac.at/~prikryl/wget.html// |
+| //w3mir// | A web mirroring program written in Perl. This is not Greenstone's preferred mirroring program because it relies on a specific outdated version of a certain Perl module (which is distributed in the //w3mir// directory). | //www.math.uio.no/~janl/w3mir// |
+| //windows// | Packages used when running under Windows. | //—// |
+| //windows/gdbm// | Version of the GNU Database Manager created for Windows. Gdbm comes as a standard part of Linux. | //—// |
+| //windows/crypt// | Encryption program used for passwords for Greenstone's administrative functions. | //—// |
+| //windows/stlport// | Standard Template Library, for use when compiling Greenstone with certain Windows compilers. | //—// |
+| //wv// | Microsoft Word converter (for building collections from Word documents) slimmed down for Greenstone. | //sourceforge.net/projects/wvware// |
+| //pdftohtml// | PDF converter used when building collections from PDF documents. | //www.ra.informatik.uni-stuttgart.de/~gosho/pdftohtml// |
+| //yaz// | Z39.50 client program being used for research in making Greenstone Z39.50 compliant. Progress is reported in the //README.gsdl// file. | //www.indexdata.dk// |
+The //packages// directory, whose contents are shown in Table <tblref table_the_packages_directory_1>, holds all the code that Greenstone uses but that was written by other research teams. All software distributed with Greenstone has been released under the GNU General Public License. The executables produced by these packages are placed in the Greenstone //bin// directory. Each package is stored in a directory of its own. Their functions vary widely, from indexing and compression to converting Microsoft Word documents to html. Each package has a README file which gives more information about it.