Greenstone tutorial exercise
Scanned image collection
Here we build a small replica of Niupepa, the Maori Newspaper collection, using five newspapers taken from two newspaper series. It allows full text searching and browsing by title and date. When a newspaper is viewed, a preview image and its corresponding plain text are presented side by side, with a "go to page" navigation feature at the top of the page.
The collection involves a mixture of plugins, classifiers, and format statements. The bulk of the work is done by PagedImagePlugin, a plugin designed precisely for the kind of data we have in this example. For each document, an "item" file is prepared that lists the image files that constitute the document. Each image is tagged with its page number and (optionally) accompanied by a text file containing the machine-readable version of the image, which is used for full text searching. Three newspapers in our collection (all from the series "Te Whetu o Te Tau") have text representations, and two (from "Te Waka o Te Iwi") have images only. Item files can also specify metadata. In our example the newspaper series is recorded as ex.Title and the date of publication as ex.Date. Where appropriate, each issue's ex.Volume and ex.Number metadata are also recorded. This metadata is extracted as part of the building process.
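To make the item-file idea concrete, here is a minimal Python sketch that parses one. It assumes a plain-text layout in which `<Name>value` lines carry document metadata and each `page:image:text` line describes one page, with the text file optional. The function name, the sample file names and the sample values are hypothetical, and the parser is only an illustration, not PagedImagePlugin's own code.

```python
def parse_item(text):
    """Split item-file text into (metadata dict, list of page records)."""
    metadata, pages = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("<"):
            # Metadata line such as "<Title>Te Whetu o Te Tau".
            name, _, value = line[1:].partition(">")
            metadata[name] = value
        else:
            # Page line "number:image:text"; the text part may be empty or absent.
            parts = line.split(":")
            parts += [""] * (3 - len(parts))
            number, image, text_file = parts[:3]
            pages.append({"page": number, "image": image, "text": text_file or None})
    return metadata, pages


# Hypothetical sample: file names and values are made up for illustration.
SAMPLE = """\
<Title>Te Whetu o Te Tau
<Date>18580601
<Volume>1
<Number>1
1:page_01.gif:page_01.txt
2:page_02.gif:
"""

metadata, pages = parse_item(SAMPLE)
print(metadata["Title"], len(pages))
```

A page record whose text entry is None corresponds to an image-only page, as in the Te Waka o Te Iwi newspapers.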
-
Start a new collection called Paged Images and fill out the fields with appropriate information: it is a collection sourced from an excerpt of Niupepa documents.
-
In the Gather panel, open the sample_files → niupepa → sample_items folder and drag the two subfolders into your collection on the right-hand side. A popup window asks whether you want to add PagedImagePlugin to the collection: click <Add Plugin>, because this plugin will be needed to process the item files.
-
Some of the files you have just dragged in are the newspaper images; others are text files that contain the text extracted from these images. We want these to be processed by PagedImagePlugin, not ImagePlugin or TEXTPlugin. Switch to the Document Plugins section of the Design panel and delete ImagePlugin and TEXTPlugin.
-
Open up the configuration window for PagedImagePlugin by double-clicking on the plugin. Switch on its screenview configuration option by checking the box. The source images we use were scanned at high resolution and are large files for a browser to download. The screenview option generates smaller screen-resolution images of each page when the collection is built. Click <OK>.
-
Now go to the Create panel, build the collection and preview the result. Search for "waka" and view one of the titles listed (all three appear as Te Whetu o Te Tau). Browse by Titles and view one of the Te Waka o Te Iwi newspapers. Note that only the Te Whetu o Te Tau newspapers have text; Te Waka o Te Iwi papers don't.
This collection was built with Greenstone's default settings. You can locate items of interest, but the information is less clearly and attractively presented than in the full Niupepa collection.
Grouping documents by series title and displaying dates within each group
Under Titles, documents from the same series are repeated without any distinguishing features such as date, volume or number. It would be better to group them by series title and display other information within each group. This can be accomplished by using an AZCompactList classifier rather than AZList, and tuning the classifier's format statement.
-
In the Design panel, under the Browsing Classifiers section, delete the AZList classifiers for ex.Source and ex.Title.
-
Now add an AZCompactList classifier, setting its metadata option to ex.Title, and add a DateList classifier, setting its metadata option to ex.Date.
-
Build the collection, and preview the Titles list and the Dates list.
-
Now we change the format statement for Titles to display more information about the documents. In the Format Features section of the Format panel, select the ex.Title classifier in the Choose Feature list, and VList in the Affected Component list. Click <Add Format> to add this format statement to your collection. Delete the contents of the HTML Format String box, and add the following text. (This format statement can be copied and pasted from the file sample_files → niupepa → formats → titles_tweak.txt.)
<td valign="top">[link][icon][/link]</td>
<td valign="top">
{If}{[numleafdocs],[ex.Title] ([numleafdocs]),
Volume [ex.Volume] Number [ex.Number] Date [ex.Date]}
</td>
-
Refresh in the web browser to view the new Titles list.
As a consequence of using the AZCompactList classifier, bookshelf icons appear when titles are browsed. The revised format statement shows in brackets how many items each bookshelf contains. It works by exploiting the fact that only bookshelf nodes define [numleafdocs] metadata. For document nodes, the Title is not displayed; instead, the Volume, Number and Date information is shown.
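The branching in this format statement can be sketched in Python. This is only an illustration of the {If} logic, assuming each node is a plain dictionary of metadata; the function name and the sample date value are hypothetical, and Greenstone does not evaluate format statements this way internally.

```python
def titles_label(node):
    """Mimic the {If} test: bookshelves carry numleafdocs, documents do not."""
    count = node.get("numleafdocs")
    if count:
        # Bookshelf node: series title plus the number of documents inside.
        return f"{node['ex.Title']} ({count})"
    # Document node: show the issue details instead of the title.
    return (f"Volume {node.get('ex.Volume', '')} "
            f"Number {node.get('ex.Number', '')} "
            f"Date {node.get('ex.Date', '')}")


bookshelf = {"ex.Title": "Te Whetu o Te Tau", "numleafdocs": 3}
# The date below is a sample value, not taken from the actual item file.
document = {"ex.Title": "Te Whetu o Te Tau", "ex.Volume": "1",
            "ex.Number": "2", "ex.Date": "18580601"}

print(titles_label(bookshelf))
print(titles_label(document))
```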
-
The Dates list groups documents by date. A numeric date is displayed at the end of each document title, for example 18580601. This is the Greenstone internal date format, which the DateList classifier needs in order to parse date metadata correctly and generate an ordered date list. However, you can make the date look nice by adding a [format:] macro to the date metadata.
In the Format Features section of the Format panel, select the DateList classifier. Replace the last line
<td>{Or}{[dc.Date],[exp.Date],[ex.Date]}</td>
with
<td>{Or}{[dc.Date],[exp.Date],[format:ex.Date]}</td>
Refresh in the web browser to view the new Dates list.
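The two roles of the date value (a sortable internal form and a readable display form) can be illustrated with a short Python sketch. It assumes the internal format is an eight-digit yyyymmdd string, as in the example 18580601. The function name is hypothetical and the display layout chosen here is an assumption; the exact text that Greenstone's [format:] produces may differ.

```python
from datetime import datetime

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def pretty_date(raw):
    """Turn a yyyymmdd string such as '18580601' into a readable date."""
    parsed = datetime.strptime(raw, "%Y%m%d")
    return f"{parsed.day} {MONTHS[parsed.month - 1]} {parsed.year}"


# Because the internal form is year, month, day, a plain string sort
# is also a chronological sort. The second date is a made-up sample.
dates = ["18580601", "18571015"]
ordered = sorted(dates)

print(ordered)
print(pretty_date("18580601"))
```

Keeping the stored value in its internal form and formatting it only at display time is what lets the classifier order the list while the reader still sees a friendly date.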
Displaying scanned images and suppressing dummy text
When you reach a newspaper, only its associated text is displayed. When either of the Te Waka o Te Iwi newspapers is accessed, the document view presents the message "This document has no text." No scanned image information (screen-view resolution or otherwise) is shown, even though it has been computed and stored with the document. This can be fixed by a format statement that modifies the default behaviour for DocumentText.
-
In the Format Features section of the Format panel, select the DocumentText format statement. The default format string displays the document's plain text, which, if there is none, is set to "This document has no text." Change this to the following text. (This format statement can be copied and pasted from the file sample_files → niupepa → formats → doc_tweak.txt)
<table><tr>
<td valign=top>[srclink][screenicon][/srclink]</td>
<td valign=top>[Text]</td>
</tr></table>
Including [screenicon] has the effect of embedding the screen-sized image generated by switching the screenview option on in PagedImagePlugin. It is hyperlinked to the original image by the construct [srclink]...[/srclink]. This is a large image but it may be scaled by your browser.
This modification displays the screenview image, but does nothing about the dummy text "This document has no text.", which will still be displayed. To get rid of this, edit the DocumentText format statement again and replace
<td valign=top>[Text]</td>
with
{If}{[NoText] ne '1',<td valign=top>[Text]</td>}
-
Preview the collection and view one of the Te Waka o Te Iwi documents. The line "This document has no text." should now be gone.
Searching at page level
-
The newspaper documents are split into sections, one per page. For large documents, it is useful to be able to search on sections rather than documents. This allows users to more easily locate the relevant information in the document.
-
Go to the Search Indexes section of the Design panel. Remove the ex.Source index. Check the section checkbox to build the indexes on section level as well as document level. Make section level the default by selecting its Default radio button.
-
Build and preview the collection.
-
Set the display text used for the level drop-down menu by going to the Search section on the Format panel. Set the document level text to "newspaper", and the section level text to "page".
Refresh in your web browser. Compare searching at "newspaper" level with searching at "page" level. A useful search term for this collection is "aroha".
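The difference between the two levels can be pictured with a toy inverted index in Python. This is a conceptual sketch only: the identifiers and page texts are hypothetical placeholders, and it does not reflect how Greenstone's indexer is implemented.

```python
from collections import defaultdict


def build_index(newspapers, level):
    """Map each word to the units (newspapers or pages) that contain it."""
    index = defaultdict(set)
    for paper_id, pages in newspapers.items():
        for page_no, text in pages.items():
            # At document level a hit names the whole newspaper;
            # at section level it names the individual page.
            unit = paper_id if level == "document" else (paper_id, page_no)
            for word in text.lower().split():
                index[word].add(unit)
    return index


# Placeholder page texts, keyed by page number within each newspaper.
NEWSPAPERS = {
    "paper-1": {1: "aroha whetu", 2: "waka iwi"},
    "paper-2": {5: "aroha tau", 6: "aroha waka"},
}

doc_hits = build_index(NEWSPAPERS, "document")["aroha"]
page_hits = build_index(NEWSPAPERS, "section")["aroha"]

print(sorted(doc_hits))
print(sorted(page_hits))
```

At "newspaper" level the query returns two results, one per newspaper; at "page" level it returns three, each pointing directly at the page that contains the word.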
-
You will notice that when searching for individual pages, the newspaper image is displayed in the search results. As these images are very large, this is not very useful. Go to the Format Features section of the Format panel in the Librarian Interface, choose All Features in the Choose Feature list, and select the VList format statement from the list of assigned format statements. Remove the second line from the HTML Format String:
<td valign="top">[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td>
While we are here, let's remove the filename from the display. Remove the following from the last line:
{If}{[ex.Source],<br><i>([ex.Source])</i>}
Preview the collection; the search results should be back to normal.
-
Now you will notice that page level search results only show the Title of the page (the page number), and not the Title of the newspaper. We'll modify the format statement to show the newspaper title as well as the page number, and let's add Volume and Number information too.
In the Format Features section, select Search in Choose Feature, and VList in Affected Component. Click <Add Format> to add this format to the collection. The previous changes modified VList, so they will apply to all VLists that don't have specific format statements. These next changes are made to SearchVList, so they will apply only to search results.
The extracted Title for the current section is specified as [ex.Title] while the Title for the parent section is [parent:ex.Title]. Since the same SearchVList format statement is used when searching both whole newspapers and newspaper pages, we need to make sure it works in both cases.
Set the format statement to the following text (it can be copied and pasted from the file sample_files → niupepa → formats → search_tweak.txt):
<td valign="top">[link][icon][/link]</td>
<td valign="top">
{If}{[parent:ex.Title],[parent:ex.Title] Volume [parent:ex.Volume] Number [parent:ex.Number]: Page [ex.Title],
[ex.Title] Volume [ex.Volume] Number [ex.Number]}
<br/><i>({Or}{[parent:ex.Date],[ex.Date],undated})</i>
</td>
Preview the search results. Items display newspaper title, Volume, Number and Date, and pages also display the page number.
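The way this single format statement serves both levels can be sketched in Python. The sketch assumes a page result has access to its parent newspaper's metadata, while a newspaper result has no parent. The function name and the sample date are hypothetical, and the code illustrates only the {If} and {Or} fallbacks, not Greenstone's own evaluation.

```python
def search_label(node, parent=None):
    """Build the result text for a newspaper (no parent) or a page (with parent)."""
    if parent and parent.get("ex.Title"):
        # Page result: newspaper details first, then the page's own title.
        heading = (f"{parent['ex.Title']} Volume {parent['ex.Volume']} "
                   f"Number {parent['ex.Number']}: Page {node['ex.Title']}")
    else:
        # Newspaper result: the node's own details.
        heading = (f"{node['ex.Title']} Volume {node['ex.Volume']} "
                   f"Number {node['ex.Number']}")
    # {Or}: the first available value wins, ending with a literal fallback.
    date = (parent or {}).get("ex.Date") or node.get("ex.Date") or "undated"
    return f"{heading}<br/><i>({date})</i>"


# The date is a sample value, not taken from the actual item file.
newspaper = {"ex.Title": "Te Whetu o Te Tau", "ex.Volume": "1",
             "ex.Number": "1", "ex.Date": "18580601"}
page = {"ex.Title": "3"}

print(search_label(newspaper))
print(search_label(page, newspaper))
```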
The collection you have just built involves a fairly complex document structure. There are two series of newspapers, Te Waka and Te Whetu.
In the Te Waka series there are two actual newspapers, Volume 1 Numbers 1 and 2. Number 1 has 4 pages, numbered 1, 2, 3, 4; Number 2 has 4 pages, numbered 5, 6, 7, 8. The page numbers increase consecutively through each volume, despite the fact that the volume is divided into different Numbers. Each page in the Te Waka series is represented by a single file, a GIF image of the page.
The Te Whetu series has three actual newspapers, Volume 1 Numbers 1, 2, and 3. Number 1 has 4 pages, numbered 1, 2, 3, 4; Number 2 has 5 pages, numbered 5, 6, 7, 8, 9; Number 3 has 5 pages, numbered 10, 11, 12, 13, 14. Again the page numbers increase consecutively through each volume. Each page in this series is represented by two files, a GIF image of the page and a text file containing the OCR’d text that appears on it.
The key to this structure is in the respective .item files. Here is a synopsis of the information they contain:
(9-1-1) Te Waka Volume 1 Number 1
    p.1 gif
    p.2 gif
    p.3 gif
    p.4 gif
(9-1-2) Te Waka Volume 1 Number 2
    p.5 gif
    p.6 gif
    p.7 gif
    p.8 gif
(10-1-1) Te Whetu Volume 1 Number 1
    p.1 gif text
    p.2 gif text
    p.3 gif text
    p.4 gif text
(10-1-2) Te Whetu Volume 1 Number 2
    p.5 gif text
    …
    p.9 gif text
(10-1-3) Te Whetu Volume 1 Number 3
    p.10 gif text
    …
    p.14 gif text
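The same structure can be written down as a small Python data model, which makes the page-numbering rule easy to check. The dictionary layout and function names are hypothetical; the page ranges and the presence of text files are taken from the description above.

```python
# One entry per .item file; "text" records whether pages have OCR text files.
COLLECTION = {
    "Te Waka": [
        {"item": "9-1-1", "volume": 1, "number": 1, "pages": list(range(1, 5)), "text": False},
        {"item": "9-1-2", "volume": 1, "number": 2, "pages": list(range(5, 9)), "text": False},
    ],
    "Te Whetu": [
        {"item": "10-1-1", "volume": 1, "number": 1, "pages": list(range(1, 5)), "text": True},
        {"item": "10-1-2", "volume": 1, "number": 2, "pages": list(range(5, 10)), "text": True},
        {"item": "10-1-3", "volume": 1, "number": 3, "pages": list(range(10, 15)), "text": True},
    ],
}


def pages_run_consecutively(issues):
    """True if page numbers continue across the issues of a volume without gaps."""
    numbers = [page for issue in issues for page in issue["pages"]]
    return numbers == list(range(numbers[0], numbers[0] + len(numbers)))


def page_file_count(issues):
    """Count page files: one GIF per page, plus one text file where present."""
    return sum(len(issue["pages"]) * (2 if issue["text"] else 1) for issue in issues)


for series, issues in COLLECTION.items():
    print(series, pages_run_consecutively(issues), page_file_count(issues))
```

The count covers only the page files (images and text); the .item files themselves are not included.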