Greenstone tutorial exercise

Sample files: niupepa.zip
Devised for Greenstone version: 2.60|3.06
Modified for Greenstone version: 2.86|3.08

Scanned image collection

Here we build a small replica of Niupepa, the Maori Newspaper collection, using five newspapers taken from two newspaper series. It allows full text searching and browsing by title and date. When a newspaper is viewed, a preview image and its corresponding plain text are presented side by side, with a "go to page" navigation feature at the top of the page.

The collection involves a mixture of plugins, classifiers, and format statements. The bulk of the work is done by PagedImagePlugin, a plugin designed precisely for the kind of data we have in this example. For each document, an "item" file is prepared that specifies a list of image files that constitute the document, tagged with their page number and (optionally) accompanied by a text file containing the machine-readable version of the image, which is used for full text searching. Three newspapers in our collection (all from the series "Te Whetu o Te Tau") have text representations, and two (from "Te Waka o Te Iwi") have images only. Item files can also specify metadata. In our example the newspaper series is recorded as ex.Title and its date of publication as ex.Date. Issue ex.Volume and ex.Number metadata is also recorded, where appropriate. This metadata is extracted as part of the building process.
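To make the idea of an item file concrete, here is a minimal Python sketch that parses a hypothetical one. It assumes a plain-text layout in which metadata lines look like `<Name>value` and page lines look like `page:image:text`, with the text file optional. The file names and values are invented for illustration, so check the PagedImagePlugin documentation and the sample .item files for the real syntax.

```python
# Hypothetical item file: the layout and file names are assumptions.
SAMPLE_ITEM = """\
<Title>Te Whetu o Te Tau
<Date>18580601
<Volume>1
<Number>1
1:page_01.gif:page_01.txt
2:page_02.gif:page_02.txt
3:page_03.gif:
"""


def parse_item(text):
    """Return (metadata dict, list of page dicts) from item-file text."""
    metadata, pages = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("<"):
            # Metadata line: <Name>value
            name, _, value = line[1:].partition(">")
            metadata[name] = value.strip()
        else:
            # Page line: page number, image file, optional text file
            fields = line.split(":")
            fields += [""] * (3 - len(fields))
            pages.append({
                "page": fields[0],
                "image": fields[1],
                "text": fields[2] or None,  # None means image only
            })
    return metadata, pages


metadata, pages = parse_item(SAMPLE_ITEM)
print(metadata)
for page in pages:
    print(page["page"], page["image"], page["text"])
```

In this sketch a page without a text file simply carries `None`, which corresponds to the image-only newspapers that cannot be found by full text search.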

Start a new collection called Paged Images and fill out the fields with appropriate information: it is a collection sourced from an excerpt of Niupepa documents.

In the Gather panel, open the sample_files → niupepa → sample_items folder and drag the two subfolders into your collection on the right-hand side. A popup window asks whether you want to add PagedImagePlugin to the collection: click <Add Plugin>, because this plugin will be needed to process the item files.

PagedImagePlugin will process the item files, creating a document for each one with a separate section for each page listed. Thumbnail and screen-resolution sized images of each page image will be generated.

Go to the Create panel, build the collection and preview the result. Search for "waka" and view one of the titles listed (all three appear as Te Whetu o Te Tau). Browse by titles and view one of the Te Waka o Te Iwi newspapers. Note that only the Te Whetu o Te Tau newspapers have text; Te Waka o Te Iwi papers don't.

This collection was built with Greenstone's default settings. You can locate items of interest, but the information is less clearly and attractively presented than in the full Niupepa collection.

Grouping documents by series title and displaying dates within each group

Under titles, documents from the same series are repeated without any distinguishing features such as date, volume or number. It would be better to group them by series title and display other information within each group. This can be accomplished using the -bookshelf_type option to the List classifier, and tuning the classifier's format statement.

In the Design panel, under the Browsing Classifiers section, delete the List classifier for ex.Source. This classifier is of little use in this collection.

Select the classifier for dc.Title;ex.Title and click <Configure Classifier...>. Set bookshelf_type to always. This will create a bookshelf for each Title in the collection. Note, setting this option to duplicate_only will only create a bookshelf when more than one document shares a Title.
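The difference between the two settings can be pictured with a small Python sketch. It is only a model of the grouping behaviour described above, not Greenstone's implementation; the function name `build_title_list` and the one-off title are made up for the example.

```python
def build_title_list(documents, bookshelf_type):
    """Model how documents are grouped into bookshelves by title.

    documents: list of (title, label) pairs.
    bookshelf_type: "always" or "duplicate_only".
    A bookshelf is returned as (title, [labels]); a document that is
    not placed in a bookshelf is returned as its bare label.
    """
    groups = {}
    for title, label in documents:
        groups.setdefault(title, []).append(label)

    entries = []
    for title, labels in groups.items():
        if bookshelf_type == "always" or len(labels) > 1:
            entries.append((title, labels))  # shown as a bookshelf
        else:
            entries.extend(labels)           # shown as a plain document
    return entries


docs = [
    ("Te Waka o Te Iwi", "Volume 1 Number 1"),
    ("Te Waka o Te Iwi", "Volume 1 Number 2"),
    ("A Single Issue", "Volume 1 Number 1"),  # hypothetical one-off title
]

print(build_title_list(docs, "always"))
print(build_title_list(docs, "duplicate_only"))
```

In the collection built here both series contain more than one newspaper, so either setting would produce a bookshelf for each series.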

Build the collection, and preview the titles list.

Now we change the format statement for titles to display more information about the documents. In the Format Features section of the Format panel, select the dc.Title;ex.Title classifier (CL1) in the Choose Feature list. Click <Add Format> to add this format statement to your collection. Edit the contents of the dc.Title;ex.Title classifier format statement by removing the following in the documentNode template:

<td valign="top">
  <gsf:link type="source">
    <gsf:choose-metadata>
      <gsf:metadata name="thumbicon"/>
      <gsf:metadata name="srcicon"/>
    </gsf:choose-metadata>
  </gsf:link>
</td>
<td valign="top">
  <gsf:link type="document">
    <xsl:call-template name="choose-title"/>
  </gsf:link>
  <gsf:switch>
    <gsf:metadata name="Source"/>
    <gsf:when test="exists"> (<gsf:metadata name="Source"/>) </gsf:when>
  </gsf:switch>
</td>

In its place, insert the following (which can be copied from sample_files → niupepa → formats → titles_tweak_gs3.txt):

<td valign="top">
  Volume: <gsf:metadata name="Volume"/>
  Number: <gsf:metadata name="Number"/>
  Date: <gsf:metadata format="formatDate" name="Date"/>
</td>

Then, in the classifierNode template for VLists, replace the final <td> table cell element with the following, which can also be copied from the file titles_tweak_gs3.txt:

<td valign="top"> <xsl:call-template name="choose-title"/> (<gsf:metadata name="numleafdocs"/>) </td>

Preview the new titles list.
As a consequence of using the bookshelf_type option of the List classifier, bookshelf icons appear when titles are browsed. This revised format statement has the effect of specifying in brackets how many items are contained within a bookshelf, for classifier nodes. For document nodes, Title is not displayed. Instead, Volume, Number and Date information are displayed.

Browsing documents by Date

Back in the Design panel, under the Browsing Classifiers section, add a DateList classifier, leaving its metadata option set to ex.Date.

In the Format Features section of the Format panel, select DateList in the Choose Feature list, and click <Add Format> to add this format statement to your collection. In the documentNode template of the new DateList feature, replace:

<gsf:switch>
  <gsf:metadata name="Source"/>
  <gsf:when test="exists"> (<gsf:metadata name="Source"/>) </gsf:when>
</gsf:switch>

with this, which can also be copied from the file titles_tweak_gs3.txt:

</td> <td valign="top"> <xsl:call-template name="choose-date"/>

The above refers to the "choose-date" template, which we are about to create. Select the global format statement in the Format Features section and append the following definition of the "choose-date" template (which can be copied from sample_files → niupepa → formats → global_tweak_gs3.txt):

<gsf:template name="choose-date">
  <gsf:choose-metadata>
    <gsf:metadata format="formatDate" name="dc.Date"/>
    <gsf:metadata format="formatDate" name="exp.Date"/>
    <gsf:metadata format="formatDate" name="ex.dc.Date"/>
    <gsf:metadata format="formatDate" name="Date"/>
    <gsf:default>undated</gsf:default>
  </gsf:choose-metadata>
</gsf:template>

Build the collection, and preview the dates list.

The dates list groups documents by date. Greenstone's internal date format is YYYYMMDD, for example 18580601, and the DateList classifier relies on this format to parse date metadata and generate a correctly ordered date list. The date is displayed in a more readable form because the format statement adds a format="formatDate" attribute to the Date metadata.
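Why the internal format matters can be seen in a short Python sketch: eight-digit YYYYMMDD strings sort chronologically even when compared as plain text, and they can be converted to a readable form for display. The display style used here is an arbitrary assumption and is not necessarily what formatDate produces; apart from 18580601, the sample dates are invented.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def display_date(internal):
    """Turn a YYYYMMDD string into a readable date (assumed style)."""
    year = int(internal[0:4])
    month = int(internal[4:6])
    day = int(internal[6:8])
    return f"{day} {MONTHS[month - 1]} {year}"


# Plain string sorting already yields chronological order.
dates = ["18580715", "18580601", "18571015"]
ordered = sorted(dates)

for value in ordered:
    print(value, "->", display_date(value))
```

A format such as DD/MM/YYYY would not have this property, because sorting it as text would order by day before year.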

Back in the global format statement, edit the display of the date metadata to remove the special date-formatting, so that it looks like:

<gsf:template name="choose-date">
  <gsf:choose-metadata>
    <gsf:metadata name="dc.Date"/>
    <gsf:metadata name="exp.Date"/>
    <gsf:metadata name="ex.dc.Date"/>
    <gsf:metadata name="Date"/>
    <gsf:default>undated</gsf:default>
  </gsf:choose-metadata>
</gsf:template>

Refresh in the web browser to view the new dates list. The dates are now shown in internal format.

Change the format statement back to reinstate the nicely formatted dates. This can be done by selecting global in the assigned format statements panel and clicking <Undo> a few times.

Searching at page level

The newspaper documents are split into sections, one per page. For large documents, it is useful to be able to search on sections rather than documents. This allows users to more easily locate the relevant information in the document.
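The distinction between the two search levels can be illustrated with a toy Python model. It is not how Greenstone builds or queries its indexes; the titles and page texts below are invented placeholders, and the function simply reports either whole newspapers or individual pages that contain a term.

```python
# Invented sample data: page texts are placeholders, not real OCR output.
NEWSPAPERS = [
    {"title": "Paper A", "pages": {1: "word aroha word", 2: "word word", 3: "aroha"}},
    {"title": "Paper B", "pages": {5: "word word"}},
]


def search(newspapers, term, level):
    """Return matches at "document" or "section" level."""
    term = term.lower()
    hits = []
    for paper in newspapers:
        matching_pages = [
            number for number, text in paper["pages"].items()
            if term in text.lower().split()
        ]
        if not matching_pages:
            continue
        if level == "document":
            hits.append(paper["title"])  # one hit per newspaper
        else:
            # one hit per matching page
            hits.extend((paper["title"], number) for number in matching_pages)
    return hits


print(search(NEWSPAPERS, "aroha", "document"))
print(search(NEWSPAPERS, "aroha", "section"))
```

The document-level result says only that the term occurs somewhere in a newspaper, while the section-level result points at the specific pages.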

Go to the Search Indexes section of the Design panel. Remove the ex.Source index and, if not already the case, check the section checkbox to build the indexes on section level as well as document level. Make section level the default by selecting its Default radio button.

Set the display text used for the level drop-down menu by going to the Search section on the Format panel. Set the document level text to "newspaper", and the section level text to "page".

Build and preview the collection.
Choose form search. Compare searching at "newspaper" level with searching at "page" level. A useful search term for this collection is "aroha".

You might notice that newspaper level search results only display the newspaper Title, and not any volume information, while page level search results only show the Title of the page (the page number), and not the Title of the newspaper. We'll modify the format statement to show Volume and Number information, and for page results, the newspaper title as well as the page number.

In the Format Features section, select Search in Choose Feature to adjust how search results are displayed.
The extracted Title for the current section is specified as <gsf:metadata name="Title"/> while the Title for the parent section is <gsf:metadata name="Title" select="parent"/>. Since the same SearchVList format statement is used when searching both whole newspapers and newspaper pages, we need to make sure it works in both cases.
Replace the lines comprising the final <td> table cell element with the following format statement (it can be copied and pasted from the file sample_files → niupepa → formats → search_tweak_gs3.txt):

<td>
  <gsf:switch>
    <gsf:metadata name="Title" select="parent"/>
    <gsf:when test="exists">
      <gsf:metadata name="Title" select="parent"/>
      Volume:<gsf:metadata name="Volume" select="parent"/>
      Number:<gsf:metadata name="Number" select="parent"/>
      - Page:<gsf:metadata name="Title"/>
    </gsf:when>
    <gsf:otherwise>
      <gsf:metadata name="Title"/>
      Volume:<gsf:metadata name="Volume"/>
      Number:<gsf:metadata name="Number"/>
    </gsf:otherwise>
  </gsf:switch>
  <gsf:choose-metadata>
    <gsf:metadata name="Date" select="parent" format="formatDate"/>
    <gsf:metadata name="Date" format="formatDate"/>
    <gsf:default>undated</gsf:default>
  </gsf:choose-metadata>
</td>
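The branching in this format statement may be easier to follow as ordinary code. The Python sketch below mirrors it: if a parent Title exists, the result is a page and the newspaper's details are taken from the parent; otherwise the result is a whole newspaper. The dictionaries, the function name and the sample values are illustrative stand-ins, not a Greenstone API, and the date is left in internal format rather than passed through formatDate.

```python
def result_label(node, parent=None):
    """Build the text shown for one search result.

    node and parent are plain dicts of metadata; parent is None
    for a newspaper-level result.
    """
    if parent is not None and parent.get("Title"):
        # Page-level hit: newspaper details come from the parent.
        label = (
            f"{parent['Title']} Volume:{parent.get('Volume', '')} "
            f"Number:{parent.get('Number', '')} - Page:{node['Title']}"
        )
        date = parent.get("Date") or node.get("Date")
    else:
        # Newspaper-level hit: details come from the node itself.
        label = (
            f"{node['Title']} Volume:{node.get('Volume', '')} "
            f"Number:{node.get('Number', '')}"
        )
        date = node.get("Date")
    return f"{label} {date or 'undated'}"


# Sample values for illustration only.
newspaper = {"Title": "Te Whetu o Te Tau", "Volume": "1", "Number": "2", "Date": "18580601"}
page = {"Title": "5"}

print(result_label(newspaper))
print(result_label(page, newspaper))
```

Because a single statement handles both cases, the same SearchVList format works whether the search was run at newspaper level or at page level.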

Preview the search results. Items display newspaper Title, Volume, Number and Date, and pages also display the page number.

The collection you have just built involves a fairly complex document structure. There are two series of newspapers, Te Waka and Te Whetu.

In the Te Waka series there are two actual newspapers, Volume 1 Numbers 1 and 2. Number 1 has 4 pages, numbered 1, 2, 3, 4; Number 2 has 4 pages, numbered 5, 6, 7, 8. The page numbers increase consecutively through each volume, despite the fact that the volume is divided into different Numbers. Each page in the Te Waka series is represented by a single file, a GIF image of the page.

The Te Whetu series has three actual newspapers, Volume 1 Numbers 1, 2, and 3. Number 1 has 4 pages, numbered 1, 2, 3, 4; Number 2 has 5 pages, numbered 5, 6, 7, 8, 9; Number 3 has 5 pages, numbered 10, 11, 12, 13, 14. Again the page numbers increase consecutively through each volume. Each page in this series is represented by two files, a GIF image of the page and a text file containing the OCR’d text that appears on it.

The key to this structure is in the respective .item files. Here is a synopsis of the information they contain:

(9-1-1) Te Waka Volume 1 Number 1
    p.1 gif
    p.2 gif
    p.3 gif
    p.4 gif
(9-1-2) Te Waka Volume 1 Number 2
    p.5 gif
    p.6 gif
    p.7 gif
    p.8 gif
(10-1-1) Te Whetu Volume 1 Number 1
    p.1 gif text
    p.2 gif text
    p.3 gif text
    p.4 gif text
(10-1-2) Te Whetu Volume 1 Number 2
    p.5 gif text
    …
    p.9 gif text
(10-1-3) Te Whetu Volume 1 Number 3
    p.10 gif text
    …
    p.14 gif text
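The same structure can be written down as data, which makes the consecutive page numbering easy to check. This Python sketch only restates the synopsis above; the variable and function names are arbitrary.

```python
# (identifier, series, volume, number, first page, last page, has text)
ISSUES = [
    ("9-1-1", "Te Waka", 1, 1, 1, 4, False),
    ("9-1-2", "Te Waka", 1, 2, 5, 8, False),
    ("10-1-1", "Te Whetu", 1, 1, 1, 4, True),
    ("10-1-2", "Te Whetu", 1, 2, 5, 9, True),
    ("10-1-3", "Te Whetu", 1, 3, 10, 14, True),
]


def pages_are_consecutive(issues, series):
    """Check that page numbers run on from one Number to the next."""
    expected = 1
    for _ident, name, _volume, _number, first, last, _has_text in issues:
        if name != series:
            continue
        if first != expected:
            return False
        expected = last + 1
    return True


def count_pages_and_files(issues):
    """Count pages and page files (one GIF per page, plus text if present)."""
    pages = files = 0
    for _ident, _name, _volume, _number, first, last, has_text in issues:
        count = last - first + 1
        pages += count
        files += count * (2 if has_text else 1)
    return pages, files


print(pages_are_consecutive(ISSUES, "Te Waka"))
print(pages_are_consecutive(ISSUES, "Te Whetu"))
print(count_pages_and_files(ISSUES))
```

With the figures given above this comes to 22 pages and 36 page files, not counting the .item files themselves.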

Copyright © 2005-2016 by the New Zealand Digital Library Project at the University of Waikato, New Zealand
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation License.”