Getting the most out of your documents

Collections can be individualised to make the information they contain accessible in different ways. This chapter describes how Greenstone extracts information from documents and presents it to the user: document processing (Section plugins), classification structures (Section classifiers), and user interface tools (Sections formatting_greenstone_output and controlling_the_greenstone_user_interface).

Plugins

Plugins parse the imported documents and extract metadata from them. For example, the html plugin converts html pages to the Greenstone archive format and extracts metadata which is explicit in the document format—such as titles, enclosed by <title></title> tags.

Plugins are written in the Perl language. They all derive from a basic plugin called BasPlug, which performs universally-required operations like creating a new Greenstone archive document to work with, assigning an object identifier (OID), and handling the sections in a document. Plugins are kept in the perllib/plugins directory.

To find out more about any plugin, type pluginfo.pl plugin-name at the command prompt. (You need to invoke the appropriate setup script first, if you haven't already, and on Windows you need to type perl -S pluginfo.pl plugin-name if your environment is not set up to associate files ending in .pl with the Perl interpreter.) This displays information about the plugin on the screen: what plugin-specific options it takes, and what general options are allowed.

You can easily write new plugins that process document formats not handled by existing plugins, format documents in some special way, or extract a new kind of metadata.

General Options

Table <tblref table_options_applicable_to_all_plugins> shows options that are accepted by any plugin derived from BasPlug.

<tblcaption table_options_applicable_to_all_plugins|Options applicable to all plugins></tblcaption>

^ Option ^ Purpose ^
| input_encoding | Character encoding of the source documents. The default is to automatically work out the character encoding of each individual document. It is sometimes useful to set this value, though: for example, if you know that all your documents are plain ASCII, setting the input encoding to ascii greatly increases the speed at which your collection is imported and built. There are many possible values. Use pluginfo.pl BasPlug to get a complete list. |
| default_encoding | The encoding that is used if input_encoding is auto and automatic encoding detection fails. |
| process_exp | A Perl regular expression to match against filenames (for example, to locate a certain kind of file extension). This dictates which files a plugin processes. Each plugin has a default (HTMLPlug's default is (?i).html?, that is, anything with the extension .htm or .html). |
| block_exp | A regular expression to match against filenames that are not to be passed on to subsequent plugins. This can prevent annoying error messages about files you aren't interested in. Some plugins have default blocking expressions; for example, HTMLPlug blocks files with .gif, .jpg, .jpeg, .png, .rtf and .css extensions. |
| cover_image | Look for a .jpg file (with the same name as the file being processed) and associate it with the document as a cover image. |
| extract_acronyms | Extract acronyms from documents and add them as metadata to the corresponding Greenstone archive documents. |
| markup_acronyms | Add acronym information into document text. |
| extract_language | Identify each document's language and associate it as metadata. Note that this is done automatically if input_encoding is auto. |
| default_language | If automatic language extraction fails, language metadata is set to this value. |
| first | Extract a comma-separated list of the first stretch of text and add it as FirstNNN metadata (often used as a substitute for Title). |
| extract_email | Extract E-mail addresses and add them as document metadata. |
| extract_date | Extract dates relating to the content of historical documents and add them as Coverage metadata. |

Document processing plugins

<tblcaption table_greenstone_plugins|Greenstone plugins></tblcaption>

^ ^ Plugin ^ Purpose ^ File types ^ Ignores files ^
| General | ArcPlug | Processes files named in the file archives.inf, which is used to communicate between the import and build processes. Must be included (unless import.pl will not be used). | | |
| | RecPlug | Recurses through a directory structure by checking to see whether a filename is a directory and if so, inserting all files in the directory into the plugin pipeline. Assigns metadata if the -use_metadata_files option is set and metadata.xml files are present. | | |
| | GAPlug | Processes Greenstone archive files generated by import.pl. Must be included (unless import.pl will not be used). | .xml | |
| | TEXTPlug | Processes plain text by placing it between <pre> </pre> tags (treating it as preformatted). | .txt, .text | |
| | HTMLPlug | Processes html, replacing hyperlinks appropriately. If the linked document is not in the collection, an intermediate page is inserted warning the user they are leaving the collection. Extracts readily available metadata such as Title. | .htm, .html, .cgi, .php, .asp, .shm, .shtml | .gif, .jpeg, .jpg, .png, .css, .rtf |
| | WordPlug | Processes Microsoft Word documents, extracting author and title where available, and keeping diagrams and pictures in their proper places. The conversion utilities used by this plugin sometimes produce html that is poorly formatted, and we recommend that you provide the original documents for viewing when building collections of Word files. However, the text that is extracted from the documents is adequate for searching and indexing purposes. | .doc | .gif, .jpeg, .jpg, .png, .css, .rtf |
| | PDFPlug | Processes PDF documents, extracting the first line of text as a title. The pdftohtml program fails on some PDF files: the conversion process takes an exceptionally long time, and often an error message relating to the conversion process appears on the screen. If this occurs, the only solution that we can offer is to remove the offending document from the collection and re-import. | .pdf | .gif, .jpeg, .jpg, .png, .css, .rtf |
| | PSPlug | Processes PostScript documents, optionally extracting date, title and page number metadata. | .ps | .eps |
| | EMAILPlug | Processes E-mail messages, recognising author, subject, date, etc. This plugin does not yet handle MIME-encoded E-mails properly: although legible, they often look rather strange. | Must end in digits or digits followed by .Email | |
| | BibTexPlug | Processes bibliography files in BibTex format. | .bib | |
| | ReferPlug | Processes bibliography files in refer format. | .bib | |
| | SRCPlug | Processes source code files. | Makefile, Readme, .c, .cc, .cpp, .h, .hpp, .pl, .pm, .sh | .o, .obj, .a, .so, .dll |
| | ImagePlug | Processes image files for creating a library of images. Only works on UNIX. | .jpeg, .jpg, .gif, .png, .bmp, .xbm, .tif, .tiff | |
| | SplitPlug | Like BasPlug and ConvertToPlug, this plugin should not be called directly, but may be inherited by plugins that need to process files containing several documents. | | |
| | FOXPlug | Processes FoxBASE dbt files. | .dbt, .dbf | |
| | ZIPPlug | Uncompresses gzip, bzip, zip, and tar files, provided the appropriate Gnu tools are available. | .gzip, .bzip, .zip, .tar, .gz, .bz, .tgz, .taz | |
| Collection Specific | PrePlug | Processes html output using PRESCRIPT, splitting documents into pages for the Computer Science Technical Reports collection. | .html, .html.gz | |
| | GBPlug | Processes Project Gutenberg etext, which includes manually-entered title information. | .txt.gz, .html, .htm | |
| | TCCPlug | Processes E-mail documents from Computists' Weekly. | Must begin with tcc or cw | |

Document processing plugins are used by the collection-building software to parse each source document in a way that depends on its format. A collection's configuration file lists all plugins that are used when building it. During the import operation, each file or directory is passed to each plugin in turn until one is found that can process it—thus earlier plugins take priority over later ones. If no plugin can process the file, a warning is printed (to standard error) and processing passes to the next file. (This is where the block_exp option can be useful—to prevent these error messages for files that might be present but don't need processing.) During building, the same procedure is used, but the archives directory is processed instead of the import directory.
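The first-match behaviour of the plugin pipeline can be illustrated with a small sketch. This is written in Python purely for illustration (Greenstone's plugins are Perl), and the class, the function names, the regular expressions and the assumption that the blocking expression is tested before the processing expression are all hypothetical stand-ins rather than Greenstone's actual code.

```python
import re


class Plugin:
    """Hypothetical stand-in for a plugin with process_exp and block_exp options."""

    def __init__(self, name, process_exp, block_exp=None):
        self.name = name
        self.process_exp = re.compile(process_exp)
        self.block_exp = re.compile(block_exp) if block_exp else None

    def read(self, filename):
        """Return 'blocked', 'processed', or None if this plugin ignores the file."""
        # Assumption: a blocked file is swallowed silently and not passed on.
        if self.block_exp and self.block_exp.search(filename):
            return "blocked"
        if self.process_exp.search(filename):
            return "processed"
        return None


def run_pipeline(plugins, filename):
    """Offer the file to each plugin in turn; earlier plugins take priority."""
    for plugin in plugins:
        outcome = plugin.read(filename)
        if outcome is not None:
            return plugin.name, outcome
    # No plugin claimed the file: the import process would print a warning.
    return None, "warning: no plugin could process " + filename


# Illustrative expressions only; use pluginfo.pl to see the real defaults.
PIPELINE = [
    Plugin("HTMLPlug", r"(?i)\.html?$", r"(?i)\.(gif|jpe?g|png|css|rtf)$"),
    Plugin("TEXTPlug", r"(?i)\.te?xt$"),
]
```

With this pipeline, index.HTML is processed by the first plugin, logo.gif is blocked without a warning, and a file that matches nothing falls through to the warning.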

The standard Greenstone plugins are listed in Table <tblref table_greenstone_plugins>. Recursion is necessary to traverse directory hierarchies. Although the import and build programs do not perform explicit recursion, some plugins cause indirect recursion by passing files or directory names into the plugin pipeline. For example, the standard way of recursing through a directory hierarchy is to specify RecPlug, which does exactly this. If present, it should be the last element in the pipeline. Only the first two plugins in Table <tblref table_greenstone_plugins> cause indirect recursion.

Some plugins are written for specific collections that have a document format not found elsewhere, like the E-text used in the Gutenberg collection. These collection-specific plugins are found in the collection's perllib/plugins directory. Collection-specific plugins can be used to override general plugins with the same name.

Some document-processing plugins use external programs that parse specific proprietary formats—for example, Microsoft Word—into either plain text or html. A general plugin called ConvertToPlug invokes the appropriate conversion program and passes the result to either TEXTPlug or HTMLPlug. We describe this in more detail shortly.

Some plugins have individual options, which control what they do in finer detail than the general options allow. Table <tblref table_plugin-specific_options> describes them.

<tblcaption table_plugin-specific_options|Plugin-specific options></tblcaption>

^ Plugin ^ Option ^ Purpose ^
| HTMLPlug | nolinks | Do not trap links within the collection. This speeds up the import/build process, but any links in the collection will be broken. |
| | description_tags | Interpret tagged document files as described in the subsection below. |
| | keep_head | Do not strip out html headers. |
| | no_metadata | Do not seek any metadata (this may speed up the import/build process). |
| | metadata_fields | Takes a comma-separated list of metadata types (defaults to Title) to extract. To rename the metadata in the Greenstone archive file, use tag<newname> where tag is the html tag sought and newname its new name. |
| | hunt_creator_metadata | Find as much metadata as possible about authorship and put it in the Greenstone archive document as Creator metadata. You also need to include Creator using the metadata_fields option. |
| | file_is_url | Use this option if a web mirroring program has been used to create the structure of the documents to be imported. |
| | assoc_files | Gives a Perl regular expression that describes file types to be treated as associated files. The default types are .jpg, .jpeg, .gif, .png, .css |
| | rename_assoc_files | Rename files associated with documents. During this process the directory structure of any associated files will become much shallower (useful if a collection must be stored in limited space). |
| HTMLPlug and TEXTPlug | title_sub | Perl substitution expression to modify titles. |
| PSPlug | extract_date | Extract the creation date from the PostScript header and store it as metadata. |
| | extract_title | Extract the document title from the PostScript header and store it as title metadata. |
| | extract_pages | Extract the page numbers from the PostScript document and add them to the appropriate sections as metadata with the tag Pages. |
| RecPlug | use_metadata_files | Assign metadata from a file as described in the subsection below. |
| ImagePlug | Various options | See ImagePlug.pm. |
| SRCPlug | remove_prefix | Gives a Perl regular expression of a leading pattern which is to be removed from the filename. Default behaviour is to remove the whole path. |

Plugins to import proprietary formats

Proprietary formats pose difficult problems for any digital library system. Although documentation may be available about how they work, they are subject to change without notice, and it is difficult to keep up with changes. Greenstone has adopted the policy of using GPL (GNU General Public License) conversion utilities written by people dedicated to the task. Utilities to convert Word and PDF formats are included in the packages directory. These all convert documents to either text or html; HTMLPlug and TEXTPlug then convert the result to the Greenstone archive format. ConvertToPlug is used to include the conversion utilities. Like BasPlug, it is never called directly. Rather, plugins written for individual formats are derived from it as illustrated in Figure <imgref figure_plugin_inheritance_hierarchy>. ConvertToPlug uses Perl's dynamic inheritance scheme to inherit from either TEXTPlug or HTMLPlug, depending on the format to which a source document has been converted.

<imgcaption figure_plugin_inheritance_hierarchy|Plugin inheritance hierarchy></imgcaption>

When ConvertToPlug receives a document, it calls gsConvert.pl (found in GSDLHOME/bin/script) to invoke the appropriate conversion utility. Once the document has been converted, it is returned to ConvertToPlug, which invokes the text or html plugin as appropriate. Any plugin derived from ConvertToPlug has an option convert_to, whose argument is either text or html, to specify which intermediate format is preferred. Text is faster, but html generally looks better, and includes pictures.

Sometimes there are several conversion utilities for a particular format, and gsConvert may try different ones on a given document. For example, the preferred Word conversion utility wvWare does not cope with anything less than Word 6, and a program called AnyToHTML, which essentially just extracts whatever text strings can be found, is called to convert Word 5 documents.

The steps involved in adding a new external document conversion utility are:

  1. Install the new conversion utility so that it is accessible by Greenstone (put it in the packages directory).
  2. Alter gsConvert.pl to use the new conversion utility. This involves adding a new clause to the if statement in the main function, and adding a function that calls the conversion utility.
  3. Write a top-level plugin that inherits from ConvertToPlug to catch the format and pass it on.

Assigning metadata from a file

The standard plugin RecPlug also incorporates a way of assigning metadata to documents from manually (or automatically) created XML files. We describe this in some detail, so that you can create metadata files in the appropriate format. If the use_metadata_files option is specified, RecPlug uses an auxiliary metadata file called metadata.xml. Figure <imgref figure_xml_format> shows the XML Document Type Definition (DTD) for the metadata file format, while Figure <imgref figure_xml_format_1> shows an example metadata.xml file.

<imgcaption figure_xml_format|XML format: (a) Document Type Definition (DTD); (b) Example metadata file></imgcaption>

<!DOCTYPE GreenstoneDirectoryMetadata [
   <!ELEMENT DirectoryMetadata (FileSet*)>
   <!ELEMENT FileSet (FileName+,Description)>
   <!ELEMENT FileName (#PCDATA)>
   <!ELEMENT Description (Metadata*)>
   <!ELEMENT Metadata (#PCDATA)>
   <!ATTLIST Metadata name CDATA #REQUIRED>
   <!ATTLIST Metadata mode (accumulate|override) "override">
]>

<imgcaption figure_xml_format_1|(b) Example metadata file></imgcaption>

<?xml version="1.0" ?>
<!DOCTYPE GreenstoneDirectoryMetadata SYSTEM
"http://greenstone.org/dtd/GreenstoneDirectoryMetadata/1.0/GreenstoneDirectoryMetadata.dtd">
<DirectoryMetadata>
   <FileSet>
       <FileName>nugget.*</FileName>
       <Description>
           <Metadata name="Title">Nugget Point Lighthouse</Metadata>
           <Metadata name="Place" mode="accumulate">Nugget Point</Metadata>
       </Description>
   </FileSet>
   <FileSet>
       <FileName>nugget-point-1.jpg</FileName>
       <Description>
           <Metadata name="Title">Nugget Point Lighthouse, The Catlins</Metadata>
           <Metadata name="Subject">Lighthouse</Metadata>
       </Description>
   </FileSet>
</DirectoryMetadata>

The example file contains two metadata structures. In each one, the FileName element describes the files to which the metadata applies, in the form of a regular expression. Thus <FileName>nugget.*</FileName> indicates that the first metadata record applies to every file whose name starts with “nugget”. For these files, Title metadata is set to “Nugget Point Lighthouse.”

Metadata elements are processed in the order in which they appear. The second structure above sets Title metadata for the file named nugget-point-1.jpg to “Nugget Point Lighthouse, The Catlins,” overriding the previous specification. It also adds a Subject metadata field.

Sometimes metadata is multi-valued and new values should accumulate, rather than overriding previous ones. The mode="accumulate" attribute does this. It is applied to Place metadata in the first specification above, which will therefore be multi-valued. To revert to a single metadata element, write <Metadata name="Place" mode="override">New Zealand</Metadata>. In fact, you could omit this mode specification because every element overrides unless otherwise specified. To accumulate metadata for some field, mode="accumulate" must be specified in every occurrence.

When its use_metadata_files option is set, RecPlug checks each input directory for an XML file called metadata.xml and applies its contents to all the directory's files and subdirectories.
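The override and accumulate rules can be made concrete with a short sketch. It is written in Python for illustration only and is not RecPlug's code; the function name metadata_for is hypothetical, and the sketch assumes that each FileName expression must match the whole filename. It processes the FileSet elements in order, so later override values replace earlier ones while accumulate values are appended.

```python
import re
import xml.etree.ElementTree as ET

# The example metadata.xml from above, without the DOCTYPE line for brevity.
METADATA_XML = """
<DirectoryMetadata>
  <FileSet>
    <FileName>nugget.*</FileName>
    <Description>
      <Metadata name="Title">Nugget Point Lighthouse</Metadata>
      <Metadata name="Place" mode="accumulate">Nugget Point</Metadata>
    </Description>
  </FileSet>
  <FileSet>
    <FileName>nugget-point-1.jpg</FileName>
    <Description>
      <Metadata name="Title">Nugget Point Lighthouse, The Catlins</Metadata>
      <Metadata name="Subject">Lighthouse</Metadata>
    </Description>
  </FileSet>
</DirectoryMetadata>
"""


def metadata_for(filename, xml_text=METADATA_XML):
    """Return {name: [values]} for one file, applying FileSets in document order."""
    result = {}
    root = ET.fromstring(xml_text.strip())
    for fileset in root.findall("FileSet"):
        patterns = [fn.text.strip() for fn in fileset.findall("FileName")]
        # Assumption: the expression has to match the entire filename.
        if not any(re.fullmatch(pattern, filename) for pattern in patterns):
            continue
        for meta in fileset.findall("Description/Metadata"):
            name = meta.get("name")
            value = (meta.text or "").strip()
            if meta.get("mode", "override") == "accumulate":
                result.setdefault(name, []).append(value)  # keep earlier values
            else:
                result[name] = [value]  # discard earlier values
    return result
```

Calling metadata_for("nugget-point-1.jpg") yields the overridden Title, the accumulated Place and the added Subject, while any other file whose name starts with “nugget” keeps the first Title.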

The metadata.xml mechanism that is embodied in RecPlug is just one way of specifying metadata for documents. It is easy to write different plugins that accept metadata specifications in completely different formats.

Tagging document files

Source documents often need to be structured into sections and subsections, and this information needs to be communicated to Greenstone so that it can preserve the hierarchical structure. Also, metadata—typically the title—might be associated with each section and subsection.

Often the simplest way of doing this is to edit the source files. The HTML plugin has a description_tags option that processes tags in the text like this:

<!--
<Section>
   <Description>
       <Metadata name="Title"> Realizing human rights for poor people: Strategies for achieving the international development targets</Metadata>
   </Description>
-->

(text of section goes here)

<!--
</Section>
-->

The <!-- … --> markers are used because they indicate comments in HTML; thus these section tags will not affect document formatting. In the Description part other kinds of metadata can be specified, but this is not done for the style of collection we are describing here. Also, the tags can be nested, so the line marked text of section goes here above can itself include further subsections, such as

(text of first part of section goes here)

<!--
<Section>
   <Description>
       <Metadata name="Title"> The international development targets</Metadata>
   </Description>
-->

(text of subsection goes here)

<!--
</Section>
-->

(text of last part of section goes here)

This functionality is inherited by any plugins that use HTMLPlug. In particular, the Word plugin converts its input to HTML form, and so exactly the same way of specifying metadata can be used in Word (and RTF) files. (This involves a bit of work behind the scenes, because when Word documents are converted to HTML care is normally taken to neutralize HTML's special interpretation of stray “<” and “>” signs; we have arranged to override this in the case of the above specifications.) Note that exactly the same format as above is used, even in Word files, including the surrounding “<!--” and “-->”. Font and spacing are ignored.
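To show how nested section tags of this kind map onto a document hierarchy, here is a minimal sketch in Python. It is not HTMLPlug's implementation: the function parse_sections and the dictionary layout are hypothetical, and the sketch assumes that every opening comment contains exactly one Description block as in the examples above.

```python
import re

# Matches either an opening comment (capturing its Description) or a closing one.
TOKEN = re.compile(
    r"<!--\s*<Section>\s*<Description>(?P<desc>.*?)</Description>\s*-->"
    r"|<!--\s*</Section>\s*-->",
    re.DOTALL,
)
META = re.compile(r'<Metadata\s+name="([^"]+)">(.*?)</Metadata>', re.DOTALL)


def parse_sections(html):
    """Build a tree of sections; each node has metadata, own text and children."""
    root = {"metadata": {}, "text": "", "children": []}
    stack = [root]
    pos = 0
    for match in TOKEN.finditer(html):
        # Text between two tags belongs to the innermost open section.
        stack[-1]["text"] += html[pos:match.start()]
        pos = match.end()
        if match.group("desc") is not None:  # opening tag
            section = {
                "metadata": {name: value.strip()
                             for name, value in META.findall(match.group("desc"))},
                "text": "",
                "children": [],
            }
            stack[-1]["children"].append(section)
            stack.append(section)
        elif len(stack) > 1:  # closing tag
            stack.pop()
    stack[-1]["text"] += html[pos:]
    return root
```

Applied to the nested example above, the outer section receives the first and last parts of the text, and the subsection becomes its only child.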

Classifiers

Classifiers are used to create a collection's browsing indexes. Examples are the dlpeople collection's Titles A-Z index, and the Subject, How to, Organisation and Titles A-Z indexes in the Humanity Development Library, of which the Demo collection is a subset. The navigation bar near the top of the screenshots in Figures 3 and 8a includes the search function, which is always provided, followed by buttons for any classifiers that have been defined. The information used to support browsing is stored in the collection information database, and is placed there by classifiers that are called during the final phase of buildcol.pl.

<imgcaption figure_azlist_classifier|AZList classifier></imgcaption>

Classifiers, like plugins, are specified in a collection's configuration file. For each one there is a line starting with the keyword classify and followed by the name of the classifier and any options it takes. The basic collection configuration file discussed in Section configuration_file includes the line classify AZList -metadata Title, which makes an alphabetic list of titles by taking all those with a Title metadata field, sorting them and splitting them into alphabetic ranges. An example is shown in Figure <imgref figure_azlist_classifier>.

<imgcaption figure_list_classifier|List classifier></imgcaption>

A simpler classifier, called List, illustrated in Figure <imgref figure_list_classifier>, creates a sorted list of a given metadata element and displays it without any alphabetic subsections. An example is the “how to” metadata in the Demo collection, which is produced by a line classify List -metadata Howto in the collection configuration file. Another general-purpose list classifier is DateList, illustrated in Figure <imgref figure_datelist_classifier>, which generates a selection list of date ranges. (The DateList classifier is also used in the Greenstone Archives collection.)

<imgcaption figure_datelist_classifier|DateList classifier></imgcaption>

Other classifiers generate browsing structures that are explicitly hierarchical. Hierarchical classifications are useful for subject classifications and subclassifications, and organisational hierarchies. The Demo collection's configuration file contains the line classify Hierarchy -hfile sub.txt -metadata Subject -sort Title, and Figure <imgref figure_hierarchy_classifier> shows the subject hierarchy browser that it produces. The bookshelf with a bold title is the one currently being perused; above it you can see the subject classification to which it belongs. In this example the hierarchy for classification is stored in a simple text format in sub.txt.

<imgcaption figure_hierarchy_classifier|Hierarchy classifier></imgcaption>

All classifiers generate a hierarchical structure that is used to display a browsing index. The lowest levels (i.e. leaves) of the hierarchy are usually documents, but in some classifiers they are sections. The internal nodes of the hierarchy are either VList, HList, or DateList. A VList is a list of items displayed vertically down the page, like the “how to” index in the Demo collection (see Figure <imgref figure_list_classifier>). An HList is displayed horizontally. For example, the AZList display in Figure <imgref figure_azlist_classifier> is a two-level hierarchy of internal nodes consisting of an HList (giving the A-Z selector) whose children are VLists, and their children, in turn, are documents. A DateList (Figure <imgref figure_datelist_classifier>) is a special kind of VList that allows selection by year and month.

The lines used to specify classifiers in collection configuration files contain a metadata argument that identifies the metadata by which the documents are classified and sorted. Any document in the collection that does not have this metadata defined will be omitted from the classifier (but it is still indexed, and consequently searchable). If no metadata argument is specified, all documents are included in the classifier, in the order in which they are encountered during the building process. This is useful if you want a list of all documents in your collection.

<tblcaption table_greenstone_classifiers|Greenstone classifiers></tblcaption>

^ Classifier ^ Argument ^ Purpose ^
| Hierarchy | | Hierarchical classification |
| | hfile | Classification file |
| | metadata | Metadata element to test against hfile identifier |
| | sort | Metadata element used to sort documents within leaves (defaults to Title) |
| | buttonname | Name of the button used to access this classifier (defaults to value of metadata argument) |
| List | | Alphabetic list of documents |
| | metadata | Include documents containing this metadata element |
| | buttonname | Name of button used to access this classifier (defaults to value of metadata argument) |
| SectionList | | List of sections in documents |
| AZList | | List of documents split into alphabetical ranges |
| | metadata | Include all documents containing this metadata element |
| | buttonname | Name of button used to access this classifier (defaults to value of metadata argument) |
| AZSectionList | | Like AZList but includes every section of the document |
| DateList | | Similar to AZList but sorted by date |

The current set of classifiers is listed in Table <tblref table_greenstone_classifiers>. Just as you can use the pluginfo.pl program to find out about any plugin, there is a classinfo.pl program that gives you information about any classifier, and the options it provides.

All classifiers accept the argument buttonname, which defines what is written on the Greenstone navigation button that invokes the classifier (it defaults to the name of the metadata argument). Buttons are provided for each Dublin Core metadata type, and for some other types of metadata.

Each classifier receives an implicit name from its position in the configuration file. For example, the third classifier specified in the file is called CL3. This is used to name the collection information database fields that define the classifier hierarchy.

Collection-specific classifiers can be written, and are stored in the collection's perllib/classify directory. The Development Library has a collection-specific classifier called HDLList, which is a minor variant of AZList.

List classifiers

The various flavours of list classifier are shown below.

  • SectionList—like List but the leaves are sections rather than documents. All document sections are included except the top level. This is used to create lists of sections (articles, chapters or whatever) such as in the Computists' Weekly collection (available through nzdl.org ), where each issue is a single document and comprises several independent news items, each in its own section.
  • AZList—generates a two-level hierarchy comprising an HList whose children are VLists, whose children are documents. The HList is an A-Z selector that divides the documents into alphabetic ranges. Documents are sorted alphabetically by metadata, and the resulting list is split into ranges.
  • AZSectionList—like AZList but the leaves are sections rather than documents.
  • DateList—like AZList except that the top-level HList allows selection by year and its children are DateLists rather than VLists. The metadata argument defaults to Date.
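As an illustration of the two-level structure that AZList produces, the following Python sketch sorts documents by a metadata value and groups them under an HList of VLists. It is a simplification under stated assumptions, not the AZList classifier itself: the function az_list and the dictionary keys are hypothetical, and each range here is a single initial letter, whereas the real classifier splits the sorted list into alphabetic ranges.

```python
from itertools import groupby


def az_list(documents, metadata="Title"):
    """documents: iterable of (oid, metadata dict). Returns an HList of VLists."""
    # Documents without the chosen metadata are omitted from the classifier.
    classified = [(meta[metadata], oid) for oid, meta in documents if meta.get(metadata)]
    classified.sort(key=lambda pair: pair[0].lower())

    hlist = {"type": "HList", "children": []}
    # Simplification: one VList per initial letter.
    for letter, group in groupby(classified, key=lambda pair: pair[0][0].upper()):
        hlist["children"].append({
            "type": "VList",
            "title": letter,
            "children": [oid for _, oid in group],
        })
    return hlist
```

A document that lacks the metadata element does not appear in the result, which mirrors the behaviour described earlier for documents without the classifying metadata.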

The hierarchy classifier

All classifiers are hierarchical. However, the list classifiers described above have a fixed number of levels, whereas the “hierarchy” classifiers described in this section have an arbitrary number of levels. Hierarchy classifiers are more complex to specify than list classifiers.

<imgcaption figure_part_of_the_file_sub|Part of the file sub.txt></imgcaption>

1           1       "General reference"
1.2       1.2       "Dictionaries, glossaries, language courses, terminology
2           2       "Sustainable Development, International cooperation, Pro
2.1       2.1       "Development policy and theory, international cooperatio
2.2       2.2       "Development, national planning, national plans"
2.3       2.3       "Project planning and evaluation (incl. project managem
2.4       2.4       "Regional development and planning incl. regional profil
2.5       2.5       "Nongovernmental organisations (NGOs) in general, self-
2.6       2.6       "Organisations, institutions, United Nations (general, d
2.6.1   2.6.1       "United Nations"
2.6.2   2.6.2       "International organisations"
2.6.3   2.6.3       "Regional organisations"
2.6.5   2.6.5       "European Community - European Union"
2.7       2.7       "Sustainable Development, Development models and example
2.8       2.8       "Basic Human Needs"
2.9       2.9       "Hunger and Poverty Alleviation"

The hfile argument gives the name of a file, like that in Figure <imgref figure_part_of_the_file_sub>, which defines the metadata hierarchy. Each line describes one classification, and the descriptions have three parts:

  • Identifier, which matches the value of the metadata (given by the metadata argument) to the classification.
  • Position-in-hierarchy marker, in multi-part numeric form, e.g. 2, 2.12, 2.12.6.
  • The name of the classification. (If this contains spaces, it should be placed in quotation marks.)

Figure <imgref figure_part_of_the_file_sub> is part of the sub.txt file used to create the subject hierarchy in the Development Library (and the Demo collection). This example is a slightly confusing one because the number representing the hierarchy appears twice on each line. The metadata type Hierarchy is represented in documents with values in hierarchical numeric form, which accounts for the first occurrence. It is the second occurrence that is used to determine the hierarchy that the hierarchy browser implements.
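The three-part line format can be read into a tree with a few lines of code. The sketch below is in Python and is illustrative only: parse_hfile is a hypothetical name, the sample entries marked “Hypothetical” are invented so that the example is self-contained, and the code assumes that a parent's line always appears before its children's lines.

```python
import shlex

# Sample in the same three-column layout; "Hypothetical ..." names are invented.
SAMPLE_HFILE = '''
1      1      "General reference"
2      2      "Hypothetical top-level subject"
2.6    2.6    "Hypothetical organisations heading"
2.6.1  2.6.1  "United Nations"
2.6.2  2.6.2  "International organisations"
2.8    2.8    "Basic Human Needs"
'''


def parse_hfile(text):
    """Return the top-level nodes of the hierarchy described by the file text."""
    nodes = {}   # position marker -> node
    roots = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # shlex honours the quotation marks around names containing spaces.
        identifier, position, name = shlex.split(line)
        node = {"id": identifier, "position": position, "name": name, "children": []}
        nodes[position] = node
        parent = position.rpartition(".")[0]  # "2.6.1" -> "2.6"; "2" -> ""
        if parent in nodes:
            nodes[parent]["children"].append(node)
        else:
            roots.append(node)
    return roots
```

Children keep the order in which they are listed in the file, matching the rule that ordering at internal nodes follows the hfile.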

The hierarchy classifier has an optional argument, sort, which determines how the documents at the leaves are ordered. Any metadata can be specified as the sort key. The default is to produce the list in the order in which the building process encounters the documents. Ordering at internal nodes is determined by the order in which things are specified in the hfile argument.

How classifiers work

Classifiers are Perl objects, derived from BasClas.pm, and are stored in the perllib/classify directory. They are used when the collection is built. When they are executed, the following four steps occur.

  1. The new method creates the classifier object.
  2. The init method initialises the object with parameters such as metadata type, button name and sort criterion.
  3. The classify method is invoked once for each document, and stores information about the classification made within the classifier object.
  4. The get_classify_info method returns the locally stored classification information to the build process, which it then writes to the collection information database for use when the collection is displayed at runtime.

The classify method retrieves each document's OID, the metadata value on which the document is to be classified, and, where necessary, the metadata value on which the documents are to be sorted. The get_classify_info method performs all sorting and classifier-specific processing. For example, in the case of the AZList classifier, it splits the list into ranges.
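The four steps can be mirrored in a small Python class. This is an analogue for illustration, not the interface of BasClas.pm: the class name, argument names and the shape of the returned dictionary are all assumptions, and only the division of work between the methods follows the description above.

```python
class ListClassifierSketch:
    """Python analogue of the classifier life cycle (not Greenstone's Perl API)."""

    def __init__(self):
        # Step 1: "new" creates an empty classifier object.
        self.metadata = None
        self.buttonname = None
        self.entries = []

    def init(self, metadata, buttonname=None):
        # Step 2: record parameters such as metadata type and button name.
        self.metadata = metadata
        self.buttonname = buttonname or metadata
        self.entries = []

    def classify(self, oid, doc_metadata):
        # Step 3: called once per document; just remember what is needed later.
        value = doc_metadata.get(self.metadata)
        if value is not None:
            self.entries.append((value, oid))

    def get_classify_info(self):
        # Step 4: do all sorting here and hand the result back to the build process.
        ordered = sorted(self.entries, key=lambda entry: entry[0].lower())
        return {
            "button": self.buttonname,
            "type": "VList",
            "documents": [oid for _, oid in ordered],
        }
```

Keeping classify cheap and deferring the sorting to get_classify_info reflects the split described above, where the latter performs all sorting and classifier-specific processing.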

The build process initialises the classifiers as soon as the builder object is created. Classifications are created during the build phase, when the information database is created, by classify.pm, which resides in Greenstone's perllib directory.

<tblcaption table_items_appearing_in_format_strings|Items appearing in format strings></tblcaption>

| [Text] | The document's text |
| [link] … [/link] | The html to link to the document itself |
| [icon] | An appropriate icon (e.g. the little text icon in a Search Results string) |
| [num] | The document number (useful for debugging). |
| [metadata-name] | The value of this metadata element for the document, e.g. [Title] |

Formatting Greenstone output

The web pages you see when using Greenstone are not pre-stored but are generated “on the fly” as they are needed. The appearance of many aspects of the pages is controlled using “format strings.” Format strings belong in the collection configuration file, introduced by the keyword format followed by the name of the element to which the format applies. There are two different kinds of page element that are controlled by format strings. The first comprises the items on the page that show documents or parts of documents. The second comprises the lists produced by classifiers and searches. All format strings are interpreted at the time that pages are displayed. Since they take effect as soon as any changes in collect.cfg are saved, experimenting with format strings is quick and easy.
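The idea of interpreting a format string at display time can be sketched as a simple substitution. The Python below is a rough illustration, not Greenstone's formatting code: apply_format and the document dictionary are hypothetical, only [Text] and metadata items are expanded, missing metadata is assumed to expand to nothing, and [link], [icon] and [num] are deliberately left untouched.

```python
import re

# Items this sketch does not try to expand.
UNHANDLED = {"link", "icon", "num"}


def apply_format(format_string, document):
    """Replace [Text] and [metadata-name] items with values from the document."""

    def substitute(match):
        name = match.group(1)
        if name in UNHANDLED:
            return match.group(0)  # leave the item as written
        if name == "Text":
            return document.get("text", "")
        # Assumption: a missing metadata element expands to an empty string.
        return document.get("metadata", {}).get(name, "")

    return re.sub(r"\[(\w+)\]", substitute, format_string)
```

For example, the string `<b>[Title]</b>: [Text]` applied to a document with Title “Doc” and text “Hello” gives `<b>Doc</b>: Hello`.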

Table <tblref table_the_format_options> shows the format statements that affect the way documents look. The DocumentButtons option controls which buttons are displayed on a document page. Here, string is a list of buttons separated by |; the possible values are Detach, Highlight, Expand Text, and Expand Contents. Reordering the list reorders the buttons.

<tblcaption table_the_format_options|The format options></tblcaption>

< - 246 283 >
format DocumentImages true/false If true, display a cover image at the top left of the document page (default false).
format DocumentHeading formatstring If DocumentImages is false, the format string controls how the document header shown at the top left of the document page looks (default [Title]).
format DocumentContents true/false Display table of contents (if document is hierarchical), or next/previous section arrows and “page k of n” text (if not).
format DocumentButtons string Controls the buttons that are displayed on a document page (default Detach|Highlight).
format DocumentText formatstring Format of the text to be displayed on a document page: default
<center><table width=537>
<tr><td>[Text]</td></tr>
</table></center>
format DocumentArrowsBottom true/false Display next/previous section arrows at bottom of document page (default true).
format DocumentUseHTML true/false If true, each document is displayed inside a separate frame. The Preferences page will also change slightly, adding options applicable to a collection of html documents, including the ability to go directly to the original source document (anywhere on the Web) rather than to the Greenstone copy.

Formatting Greenstone lists

Format strings that control how lists look can apply at different levels of the display structure. They can alter all lists of a certain type within a collection (for example DateList), or all parts of a list (for example all the entries in the Search list), or specific parts of a certain list (for example, the vertical list part of an AZList classifier on title).

Following the keyword format is a two-part keyword, only one part of which is mandatory. The first part identifies the list to which the format applies. The list generated by a search is called Search, while the lists generated by classifiers are called CL1, CL2, CL3,… for the first, second, third,… classifier specified in collect.cfg. The second part of the keyword is the part of the list to which the formatting is to apply—either HList (for horizontal list, like the A-Z selector in an AZList), VList (for vertical list, like the list of titles under an AZList), or DateList. For example:

format CL4VList … applies to all VLists in CL4
format CL2HList … applies to all HLists in CL2
format CL1DateList … applies to all DateLists in CL1
format SearchVList … applies to the Search Results list
format CL3 … applies to all nodes in CL3, unless otherwise specified
format VList … applies to all VLists in all classifiers, unless otherwise specified

The “…” in these examples stands for html format specifications that control what information appears on the web pages displaying the classifier, and how it is laid out. As well as html specifications, any metadata name may appear within square brackets: its value is interpolated at that place. Also, any of the items in Table <tblref table_items_appearing_in_format_strings> may appear in format strings. The syntax for the strings also includes a conditional statement, which is illustrated in an example below.

Recall that all classifiers produce hierarchies. Each level of the hierarchy is displayed in one of four possible ways. We have already encountered HList, VList, and DateList. There is also Invisible, which is how the very top levels of hierarchies are displayed—because the name of the classifier is already shown separately on the Greenstone navigation bar.
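The “unless otherwise specified” wording in the examples above implies a lookup from the most specific format statement to the most general. The Python sketch below illustrates that idea only. The function name resolve_format is hypothetical, and the relative order of a classifier-wide statement (such as CL3) and a list-type-wide statement (such as VList) is an assumption, since the text does not say which of the two wins.

```python
def resolve_format(formats, classifier, list_type, default=""):
    """Return the most specific format string that applies to one list node.

    The lookup order below is an assumption made for illustration.
    """
    for key in (classifier + list_type, classifier, list_type):
        if key in formats:
            return formats[key]
    # Nothing was specified, so fall back to a built-in default.
    return default
```

For example, with statements for CL4VList, CL3 and VList, a VList node in CL1 would pick up the generic VList string, while any node in CL3 would pick up the CL3 string.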

Examples of classifiers and format strings

<imgcaption figure_excerpt_from_the_demo_collection_collect|%!– id:674 –%Excerpt from the Demo collection's collect.cfg %!– withLineNumber –%></imgcaption>

classify Hierarchy -hfile sub.txt -metadata Subject -sort Title
classify AZList       -metadata Title
classify Hierarchy -hfile org.txt -metadata Organisation -sort Title
classify List           -metadata Howto
format SearchVList "<td valign=top>[link][icon][/link]</td><td>{If}
                    {[parent(All': '):Title],[parent(All': '):Title]:}[link][Title][/link]</td>"
format CL4Vlist                 "<br>[link][Howto][/link]"
format DocumentImages   true
format DocumentText       "<h3>[Title]</h3>\\n\\n<p>[Text]"
format DocumentButtons "Expand Text|Expand contents|Detach|Highlight"

Figure <imgref figure_excerpt_from_the_demo_collection_collect> shows part of the collection configuration file for the Demo collection. We use this as an example because it has several classifiers that are richly formatted. Note that statements in collection configuration files must not contain newline characters; in the figure, longer lines are broken up for readability.

Line 4 specifies the Demo collection's How To classifier. This is the fourth in the collection configuration file, and is therefore referred to as CL4. The corresponding format statement is line 7 of Figure <imgref figure_excerpt_from_the_demo_collection_collect>. The “how to” information is generated from the List classifier, and its structure is the plain list of titles shown in Figure <imgref figure_list_classifier>. The titles are linked to the documents themselves: clicking a title brings up the relevant document. The children of the hierarchy's top level are displayed as a VList (vertical list), which lists the sections vertically. As the associated format statement indicates, each element of the list is on a new line (“<br>”) and contains the Howto text, hyperlinked to the document itself.

Line 1 specifies the Demo collection's Subject classification, referred to as CL1 (the first in the configuration file), and Line 3 the Organisation classification CL3. Both are generated by the Hierarchy classifier and therefore comprise a hierarchical structure of VLists.

Line 2 specifies the remaining classification for the Demo collection, Titles A-Z (CL2). Note that there are no corresponding format strings for the classifiers CL1 to CL3. Greenstone has built-in defaults for each format string type, so it is not necessary to set a format string unless you want to override the default.

<imgcaption figure_formatting_the_document|%!– id:679 –%Formatting the document ></imgcaption>

This accounts for the four classify lines in Figure <imgref figure_excerpt_from_the_demo_collection_collect>. There are also five format statements. We have already discussed one, the CL4Vlist one, and the SearchVList statement is discussed below. The remaining three are the first type of format string, documented in Table <tblref table_the_format_options>. For example, line 8 places the cover image at the top left of each document page. Line 9 formats the actual document text, with the title of the relevant chapter or section preceding the text itself. These are illustrated in Figure <imgref figure_formatting_the_document>.

<imgcaption figure_formatting_the_search_results|%!– id:681 –%Formatting the search results ></imgcaption>

Line 5 of Figure <imgref figure_excerpt_from_the_demo_collection_collect> is a rather complicated specification that formats the query result list returned by a search, whose parts are illustrated in Figure <imgref figure_formatting_the_search_results>. A simplified version of the format string is

<td valign=top>[link][icon][/link]</td>
<td>[link][Title][/link]</td>

This is designed to appear as a table row, which is how the query results list is formatted. It gives a small icon linked to the text, as usual, and the document title, hyperlinked to the document itself.

In this collection, documents are hierarchical. In fact, the above hyperlink anchor evaluates to the title of the section returned by the query. However, it would be better to augment it with the title of the enclosing section, the enclosing chapter, and the book in which it occurs. There is a special metadata item, parent, which is not stored in documents but is implicit in any hierarchical document, that produces such a list. It either returns the parent document, or, if used with the qualifier All, the list of hierarchically enclosing parents, separated by a character string that can be given after the All qualifier. Thus

<td valign=top>[link][icon][/link]</td>
<td>{[parent(All': '):Title]: }[link][Title][/link]</td>

has the effect of producing a list containing the book title, chapter title, etc. that enclose the target section, separated by colons, with a further colon followed by a hyperlink to the target section's title.

Unfortunately, if the target is itself a book, there is no parent and so an empty string will appear followed by a colon. To circumvent such problems you can use If and Or conditional statements in a format string:

{If}{[metadata], action-if-non-null, action-if-null}
{Or}{action, else another-action, else another-action, etc}

In either case curly brackets are used to signal that the statements should be interpreted and not just printed out as text. The If tests whether the metadata is empty and takes the first clause if not, otherwise the second one (if it exists). Any metadata item can be used, including the special metadata parent. The Or statement evaluates each action in turn until one is found that is non-null. That one is sent to the output and the remaining actions are skipped.
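The behaviour of the two statements described above can be mimicked in a few lines of Python. This is a sketch of the semantics as the text describes them, not Greenstone's own code; the function names fmt_if and fmt_or are hypothetical, and metadata values are modelled as plain strings with the empty string (or None) standing for "null".

```python
def fmt_if(value, action_if_non_null, action_if_null=""):
    # {If}{[metadata], action-if-non-null, action-if-null}
    # The second action is optional and defaults to producing nothing.
    return action_if_non_null if value else action_if_null


def fmt_or(*actions):
    # {Or}{action, else another-action, ...}
    # The first non-null action is output; the rest are skipped.
    for action in actions:
        if action:
            return action
    return ""
```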

Returning to line 5 of Figure <imgref figure_excerpt_from_the_demo_collection_collect>, the full format string is

<td valign=top>[link][icon][/link]</td>
<td>{If}{[parent(All': '):Title],
           [parent(All': '):Title]:}
      [link][Title][/link]</td>

This precedes the parent specification with a conditional that checks whether the result is empty and only outputs the parent string when it is present. Incidentally, parent can be qualified by Top instead of All, which gives the top-level document name that encloses a section—in this case, the book name. No separating string is necessary with Top.
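The effect of the parent qualifiers and the conditional can be traced with a small Python sketch. It assumes that the enclosing titles are available as a list ordered from the outermost book to the innermost section; the function names and the sample titles used in the usage note are hypothetical, and the hyperlink markup is left out so that only the text is shown.

```python
def parent_all(ancestor_titles, separator):
    # Mimics [parent(All'sep'):Title]: every enclosing title, outermost first.
    return separator.join(ancestor_titles)


def parent_top(ancestor_titles):
    # Mimics [parent(Top):Title]: only the top-level enclosing title.
    return ancestor_titles[0] if ancestor_titles else ""


def search_row_label(ancestor_titles, title):
    parents = parent_all(ancestor_titles, ": ")
    # Only emit the parent list and its trailing colon when it is non-empty,
    # which is what the {If} in the full format string achieves.
    prefix = parents + ": " if parents else ""
    return prefix + title
```

With the ancestors ["Farming snails", "Chapter 2"] and the section title "Feeding", the label reads "Farming snails: Chapter 2: Feeding". For a top-level book with no ancestors, only its own title appears, with no stray colon.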

Some final examples illustrate other features. The DateList in Figure <imgref figure_datelist_classifier> is used in the Dates classification of the Computists' Weekly collection (which happens to be the second classifier, CL2). The classifier and format specifications are shown below. The DateList classifier differs from AZList in that it always sorts by Date metadata, and the bottom branches of the browsing hierarchy use DateList instead of VList, which causes the year and month to be added at the left of the document listings.

classify AZSectionList metadata=Creator
format CL2Vlist "<td>[link][icon][/link]</td> <td>[Creator]</td> <td>&nbsp;&nbsp;[Title]</td> <td>[parent(Top):Date]</td>"

The format specification shows these VLists in the appropriate way.

The format-string mechanism is flexible but tricky to learn. The best way is by studying existing collection configuration files.

Linking to different document versions

Using the [link] … [/link] mechanism in a format string inserts a hyperlink to the text of a document, and when the link is clicked the html version of the document is displayed. In some collections, it is useful to be able to display other versions of the document. For example, in a collection of Microsoft Word documents, it is nice to be able to display the Word version of each document rather than the html that is extracted from it; similarly for PDF documents.

The key to being able to show different versions of a document is to embed the necessary information about where the other versions reside into the Greenstone archive form of the document. The information is represented in the form of metadata. Recall that putting

[link][Title][/link]

into a format string creates a link to the html form of the document, whose anchor text is the document's title. The Word and PDF plugins both generate srclink metadata so that if you put

[srclink][Title][/srclink]

into a format string, a link is created to the Word or PDF form of the document; again the anchor in this example is the document's title. In order that the appropriate icon can be displayed for Word and PDF documents, these plugins also generate srcicon metadata so that

[srclink][srcicon][/srclink]

creates a link which is labeled by the standard Word or PDF icon (whichever is appropriate), rather than the document's title.

Controlling the Greenstone user interface

The entire Greenstone user interface is controlled by macros which reside in the GSDLHOME/macros directory. They are written in a language designed especially for Greenstone, and are used at run time to generate web pages. Translating the macro language into html is the last step in displaying a page. Thus changes to a macro file affect the display immediately, making experimentation quick and easy. All macro files used by Greenstone are listed in GSDLHOME/etc/main.cfg and are loaded every time it starts. One exception to this is when using the Windows Local Library; in this case it is necessary to restart the process.

Web pages are generated on the fly for a number of reasons, and the macro system is how Greenstone implements the necessary flexibility. Pages can be presented in many languages, and a different macro file is used to store all the interface text in each language. When Greenstone displays a page the macro interpreter checks a language variable and loads the page in the appropriate language (this does not, unfortunately, extend to translating document content). Also, the values of certain display variables, like the number of documents found by a search, are not known ahead of time; these are interpolated into the page text in the form of macros.

The macro file format

Macro files have a .dm extension. Each file defines one or more packages, each containing a series of macros used for a single purpose. Like classifiers and plugins, there is a basis from which to build macros, called base.dm; this file defines the basic content of a page.

Macros have names that begin and end with an underscore, and their content is defined using curly brackets. Content can be plain text, html (including links to Java applets and JavaScript), macro names, or any combination of these. This macro from base.dm defines the content of a page in the absence of any overriding macro:

_content_ {<p><h2>Oops</h2>_textdefaultcontent_}

The page will read “Oops” at the top, followed by _textdefaultcontent_, which is defined in English to be “The requested page could not be found. Please use your browsers 'back' button or the above home button to return to the Greenstone Digital Library”, and in other languages to be a suitable translation of this sentence.

_textdefaultcontent_ and _content_ both reside in the global package because they are required by all parts of the user interface. Macros can use macros from other packages as content, but they must prefix their names with their package name. For example,

_collectionextra_ {This collection contains _about:numdocs_ documents. It was last built _about:builddate_ days ago.}

comes from english.dm, and is used as the default description of a collection. It is part of the global package, but _numdocs_ and _builddate_ are both in the about package—hence the about: preceding their names.
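How package-qualified names are resolved can be illustrated with a short Python sketch. It is a simplification under stated assumptions rather than Greenstone's macro interpreter: the expand function and the table layout are hypothetical, macro names are taken to consist of letters only, an unqualified name is looked up in the current package and then in Global, and the values given for numdocs and builddate in the usage note are made up.

```python
import re

# Matches _name_ or _package:name_ (letters only, for simplicity).
MACRO = re.compile(r"_(?:([A-Za-z]+):)?([A-Za-z]+)_")


def expand(content, macros, current_package="Global"):
    """Recursively replace macro names in content with their definitions."""

    def lookup(match):
        package, name = match.group(1), match.group(2)
        # A qualified name is looked up only in the package it names.
        candidates = [package] if package else [current_package, "Global"]
        for candidate in candidates:
            body = macros.get((candidate, name))
            if body is not None:
                return expand(body, macros, candidate)
        # Unknown macros are left in place in this sketch.
        return match.group(0)

    return MACRO.sub(lookup, content)
```

Given a table that holds _collectionextra_ in Global and the values "11" and "3" for numdocs and builddate in about, expanding "_collectionextra_" yields "This collection contains 11 documents. It was last built 3 days ago."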

Macros often contain conditional statements. They resemble the format string conditional described above, though their appearance is slightly different. The basic format is _If_(x,y,z), where x is a condition, y is the macro content to use if that condition is true, and z the content if it is false. Comparison operators are the same as the simple ones used in Perl (less than, greater than, equals, not equals). This example from base.dm is used to determine how to display the top of a collection's about page:

_imagecollection_ {
       _If_("_iconcollection_" ne "",
                 <a href="_httppageabout_">
                         <img src="_iconcollection_" border=0>
                         </a>,
                 _imagecollectionv_)
}

This looks rather obscure. _iconcollection_ resolves to the empty string if the collection doesn't have an icon, and to the filename of an image otherwise. To paraphrase the above code: if there is a collection image, display it as a link to the About this Collection page (referred to by _httppageabout_); otherwise use the alternative display _imagecollectionv_.

Macros can take arguments. Here is a second definition for the _imagecollection_ macro which immediately follows the definition given above in the base.dm file:

_imagecollection_[v=1]{_imagecollectionv_}

The argument [v=1] specifies that the second definition is used when Greenstone is running in text-only mode. The language macros work similarly—apart from english.dm, because it is the default, all language macros specify their language as an argument. For example,

_textimagehome_ {Home Page}

appears in the English language macro file, whereas the German version is

_textimagehome_ [l=de] {Hauptseite}

The English and German versions are in the same package, though they are in separate files (package definitions may span more than one file). Greenstone uses its l argument at run time to determine which language to display.
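Choosing between several definitions of the same macro can be sketched as follows. This Python fragment is an illustration only: pick_definition is a hypothetical name, each definition is modelled as a pair of required arguments and body, and the rule that the matching definition with the most arguments wins is an assumption, not something the text states.

```python
def pick_definition(definitions, page_args):
    """Choose one macro body from several definitions of the same macro.

    definitions: list of (required_args, body) pairs, where required_args
    is a dict such as {"v": "1"} and {} marks the default definition.
    page_args: the arguments in force for the page being generated.
    """
    best = None
    for required, body in definitions:
        matches = all(page_args.get(k) == v for k, v in required.items())
        # Assumption: prefer the matching definition with the most arguments.
        if matches and (best is None or len(required) > len(best[0])):
            best = (required, body)
    return best[1] if best else None
```

With a default definition and a second one tagged {"v": "1"}, a page generated with v=1 gets the second body, and every other page gets the default. Language selection through the l argument would work the same way.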

<imgcaption figure_part_of_the_aboutdm_macro_file|%!– id:714 –%Part of the about.dm macro file ></imgcaption>

package about
##############################################
# about page content
###############################################
_pagetitle_ {_collectionname_}
_content_ {
<center>
_navigationbar_
</center>
_query:queryform_
<p>_iconblankbar_
<p>_textabout_
_textsubcollections_
<h3>_help:textsimplehelpheading_</h3>
_help:simplehelp_
}
_textabout_ {
<h3>_textabcol_</h3>
_Global:collectionextra_
}

As a final example, Figure <imgref figure_part_of_the_aboutdm_macro_file> shows an excerpt from the macro file about.dm that is used to generate the “About this collection” page for each collection. It shows three macros being defined, _pagetitle_, _content_ and _textabout_.

Using macros

Macros are powerful, and can be a little obscure. However, with a good knowledge of html and a bit of practice, they become a quick and easy way to customise your Greenstone site.

For example, suppose you wanted to create a static page that looked like your current Greenstone site. You could create a new package, called static, for example, in a new file, and override the _content_ macro. Add the new filename to the list of macros in GSDLHOME/etc/main.cfg which Greenstone loads every time it is invoked. Finally, access the new page by using your regular Greenstone URL and appending the arguments ?a=p&p=static (e.g. http://servername/cgi-bin/library?a=p&p=static).
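The URL for such a page can be assembled programmatically. The Python sketch below builds it from the library address and the package name; the helper name page_url is hypothetical, and the server address in the usage note is the placeholder used in the text above.

```python
from urllib.parse import urlencode


def page_url(library_url, package):
    # a=p asks for a page; p names the package whose _content_ is shown.
    return library_url + "?" + urlencode({"a": "p", "p": package})
```

Calling page_url("http://servername/cgi-bin/library", "static") gives the address quoted in the example above.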

To change the “look and feel” of Greenstone you can edit the base and style packages. To change the Greenstone home page, edit the home package (this is described in the Greenstone Digital Library Installer's Guide). To change the query page, edit query.dm.

Experiment freely with macros. Changes appear instantly, because macros are interpreted as pages are displayed. The macro language is a useful tool that can be used to make your Greenstone site your own.

The packages directory

<tblcaption table_the_packages_directory_1|The packages directory></tblcaption>

< - 132 217 180 >
^ Package ^ Description ^ URL ^
| mg | mg, short for “Managing Gigabytes.” Compression, indexing and search software used to manage textual information in Greenstone collections. | www.citri.edu.au/mg |
| wget | Web mirroring software for use with Greenstone. Written in C++ | www.tuwien.ac.at/~prikryl/wget.html |
| w3mir | A web mirroring program written in Perl. This is not Greenstone's preferred mirroring program because it relies on a specific outdated version of a certain Perl module (which is distributed in the w3mir directory). | www.math.uio.no/~janl/w3mir |
| windows | Packages used when running under Windows. | |
| windows/gdbm | Version of the Gnu Database Manager created for Windows. Gdbm comes as a standard part of Linux. | |
| windows/crypt | Encryption program used for passwords for Greenstone's administrative functions. | |
| windows/stlport | Standard Template Library, for use when compiling Greenstone with certain Windows compilers. | |
| wv | Microsoft Word converter (for building collections from Word documents) slimmed down for Greenstone. | sourceforge.net/projects/wvware |
| pdftohtml | PDF converter used when building collections from PDF documents. | www.ra.informatik.uni-stuttgart.de/~gosho/pdftohtml |
| yaz | Z39.50 client program being used for research in making Greenstone Z39.50 compliant. Progress is reported in the README.gsdl file. | www.indexdata.dk |

The packages directory, whose contents are shown in Table <tblref table_the_packages_directory_1>, holds all the code that Greenstone uses but that was written by other research teams. All software distributed with Greenstone has been released under the GNU General Public License. The executables produced by these packages are placed in the Greenstone bin directory. Each package is stored in a directory of its own. Their functions vary widely, from indexing and compression to converting Microsoft Word documents to html. Each package has a README file which gives more information about it.

1)
Note that in Greenstone, regular expressions are interpreted in the Perl language, which is subtly different from some other conventions. For example, “*” matches zero or more occurrences of the previous character, while “.” matches any character—so nugget.* matches any string with prefix “nugget,” whether or not it contains a period after the prefix. To insist on a period you would need to escape it, and write nugget\..* instead.
2)
Note that more recent versions of the Demo collection use a Hierarchy classifier to display the how to metadata. In this case they will be displayed slightly differently to what is shown in Figure <imgref figure_list_classifier>.
legacy/manuals/en/develop/getting_the_most_out_of_your_documents.txt · Last modified: 2023/03/13 01:46 by 127.0.0.1