Greenstone tutorial exercise

Prerequisite: Scanned image collection
Sample files: niupepa.zip
Devised for Greenstone version: 2.70|3.06
Modified for Greenstone version: 2.87|3.08

Advanced scanned image collection

In this exercise we build upon the collection created in the Scanned image collection exercise. We add a newspaper by creating an item file for it, add two more newspapers that use the extended XML item file format, and modify the formatting.

Adding another newspaper to the collection

Another newspaper has been scanned and OCRed, but has no item file. We will add this newspaper into the collection, and create an item file for it.

  1. In the Librarian Interface, open up the Paged Image collection that was created in exercise Scanned image collection if it is not already open (File → Open...).

  1. In the Gather panel, add the folder sample_files → niupepa → new_papers → 12 to your collection.

    Inside the 12 folder you can see that there are 4 images and 4 text files.

  1. Create an item file for the new newspaper. Have a look at an existing item file to see the format. Start up a text editor (e.g. WordPad) and open a new document. Add some metadata. The Title for this newspaper is "Te Haeata 1859-1862". The Volume is 3, the Number is 6, and the Date is "18610902". (Greenstone's date format is yyyymmdd.) Metadata must be added in the form:

    <Metadata name>Metadata value

    For this document, the metadata looks like:

    <Title>Te Haeata 1859-1862
    <Date>18610902
    <Volume>3
    <Number>6

  1. For each page, add a line in the file in the following format:

    pagenum:imagefile:textfile

    For example, the first page entry would look like

    1:images/12_3_6_1.gif:text/12_3_6_1.txt

    Note that if there is no text file, you can leave that space blank. You need to add a line for each page in the document. Make sure you increment the page number as well as the image number for each line. (The full text for this file can be copied from sample_files → niupepa → formats → 12_3_6.item.)

  1. Save the file using Filename 12_3_6.item, and save as a plain text document. (If you are using Windows, make sure the file doesn't accidentally end up getting saved as 12_3_6.item.txt.) Back in the Gather panel of the Librarian Interface, locate the new file in the Workspace tree, and drag it into the collection, adding it to the 12 folder.

  1. Build the collection and preview. Check that your new document has been added.
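The item file described in the steps above can also be generated rather than typed by hand. The following is a minimal sketch, not part of Greenstone: the function name `build_item_file` is hypothetical, and the code assumes that every page follows the naming pattern of the first-page example (`images/12_3_6_N.gif` and `text/12_3_6_N.txt`). Compare the result with sample_files → niupepa → formats → 12_3_6.item before relying on it.

```python
from datetime import datetime


def build_item_file(metadata, stem, page_count, image_ext="gif"):
    """Return the text of a simple (non-XML) item file.

    Hypothetical helper: it assumes pages are named <stem>_<n> inside
    images/ and text/ subfolders, as in the tutorial's first-page example.
    """
    # Greenstone dates are yyyymmdd; strptime raises ValueError for
    # anything that does not parse as a real calendar date.
    if "Date" in metadata:
        datetime.strptime(metadata["Date"], "%Y%m%d")

    # Document-level metadata: one "<Name>value" line per entry.
    lines = [f"<{name}>{value}" for name, value in metadata.items()]

    # One "pagenum:imagefile:textfile" line per page.
    for page in range(1, page_count + 1):
        lines.append(
            f"{page}:images/{stem}_{page}.{image_ext}:text/{stem}_{page}.txt"
        )
    return "\n".join(lines) + "\n"


item_text = build_item_file(
    {
        "Title": "Te Haeata 1859-1862",
        "Date": "18610902",
        "Volume": "3",
        "Number": "6",
    },
    stem="12_3_6",
    page_count=4,
)
print(item_text)
```

To produce the file itself, write `item_text` to `12_3_6.item` as plain text, for example with `open("12_3_6.item", "w", encoding="utf-8")`.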

XML based item file

There are two styles of item files. The first, which was used in the previous section, uses a simple text based format, and consists of a list of metadata for the document, and a list of pages. This format allows specification of document level metadata, and a single list of pages.

The second style is an extended format, and uses XML. It allows a hierarchy of pages, and metadata specification at the page level as well as at the document level. In this section, we add in two newspapers which use XML-based item files.

  1. In the Gather panel, add the folder sample_files → niupepa → new_papers → xml (you need to add the xml folder, not the 23 folder) to your collection.

  1. Open up the file xml → 23 → 23__2.item and have a look at the XML. This is Number 2 of the newspaper titled Matariki 1881. The contents of this document have been grouped into two sections: Supplementary Material, which contains an Abstract, and Newspaper Pages, which contains the page images (and OCR text).

  1. Build and preview the collection. The xml style items have been included, but the document display for these items is not very nice.
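When looking through an XML item file such as 23__2.item, it can help to print just its element hierarchy. The sketch below uses Python's standard `xml.etree.ElementTree` module and makes no assumption about Greenstone's element names; the `outline` function is hypothetical, and the sample string is placeholder XML, not the item file schema.

```python
import xml.etree.ElementTree as ET


def outline(xml_text):
    """Return one indented line per element: tag name plus its attributes."""

    def walk(element, depth):
        attrs = " ".join(f'{k}="{v}"' for k, v in element.attrib.items())
        yield "  " * depth + element.tag + (f" [{attrs}]" if attrs else "")
        for child in element:
            yield from walk(child, depth + 1)

    return list(walk(ET.fromstring(xml_text), 0))


# Placeholder XML only: these element names are NOT the Greenstone schema.
sample = '<doc><group name="a"><leaf n="1"/><leaf n="2"/></group></doc>'
print("\n".join(outline(sample)))
```

To inspect a real item file, pass its contents instead of `sample`, for example `outline(open(path, "rb").read())`, where `path` points at the item file in your collection.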

Using process_exp to control document processing

  1. Paged documents can be presented with a hierarchical table of contents, or with next and previous page arrows and a "go to page" box (as we have done so far). The display type is specified by the documenttype (hierarchy|paged) option to PagedImagePlugin. The next and previous arrows suit documents that are a linear sequence of pages, while the table of contents suits hierarchically organised documents.

    Ordinarily, a Greenstone collection would have one plugin per document type, and all documents of that type get the same processing. In this case, we want to treat the XML-based item files differently from the text-based item files. We can achieve this by adding two PagedImagePlugin plugins to the collection, and configuring them differently.

  1. Go to the Document Plugins section of the Design panel, and add a new PagedImagePlugin plugin. Enable the create_screenview option, set the documenttype option to hierarchy and set the process_exp option to xml.*\.item$ and click OK.

  1. Move this PagedImagePlugin plugin above the original one in the Assigned Plugins list.

  1. The XML-based newspapers have been grouped into a folder called xml. This enables us to process these files differently, by using the process_exp option, which all plugins support. The first PagedImagePlugin in the list looks for item files underneath the xml folder. These documents will be processed as 'hierarchical' documents. Item files that don't match the process expression (i.e. aren't underneath the xml folder) will be passed on to the second PagedImagePlugin, and these are treated as 'paged' documents.

    Rebuild and preview the collection. Compare the document display for a paged document e.g. Te Waka o Te Iwi, Vol. 1, No. 1 with a hierarchical document, e.g. Matariki 1881, No. 1.
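The way the two plugins divide the item files can be sketched with the same regular expression. This is an illustration only, under stated assumptions: Greenstone evaluates process_exp as a Perl regular expression against the file path (Python's `re` behaves the same way for this particular pattern), the `\.item$` expression for the second plugin is an assumed stand-in for its default, and the `import/...` paths are made up for the example.

```python
import re

# Plugins in the order they appear in the Assigned Plugins list.
PLUGINS = [
    ("hierarchy", re.compile(r"xml.*\.item$")),  # the process_exp set in this exercise
    ("paged", re.compile(r"\.item$")),           # assumed stand-in for the default
]


def route(path):
    """Return the documenttype of the first plugin whose expression matches."""
    for documenttype, pattern in PLUGINS:
        if pattern.search(path):
            return documenttype
    return None  # not an item file; left for other plugins


for path in (
    "import/xml/23/23__2.item",
    "import/12/12_3_6.item",
    "import/12/images/12_3_6_1.gif",
):
    print(path, "->", route(path))
```

Because the expression is not anchored at the start, any path that contains "xml" somewhere before the .item extension would match the first plugin, which is why the order of the two plugins and the folder naming both matter.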

Switching between images and text

We can modify the document display to switch between the text version and the screenview and full size versions. We do this using a combination of format statements and macro files.

  1. First of all we will add a macro file to the collection. Close the collection in the Librarian Interface. In a file browser outside of Greenstone, locate the Paged Image collection in your Greenstone installation: Greenstone → collect → pagedima.

    Also in a file browser, locate the file sample_files → niupepa → macros → extra.dm. Copy this file and paste it into the macros folder inside the pagedima collection.

  1. Back in the Librarian Interface, open up the collection again, and go to the Format Features section of the Format panel.

  1. Select AllowExtendedOptions in the Choose Feature list, and click <Add Format>. Tick the Enabled checkbox. This gives us more control over the layout of the page—in this case, we want to replace the standard DETACH and NO HIGHLIGHTING buttons with buttons that switch between images and text.

  1. Select the DocumentHeading format item and set it to the following text (which can be copied from sample_files → niupepa → formats → adv_doc_heading.txt):

    <div class="heading_title">{Or}{[parent(Top):ex.Title],[ex.Title]}</div>
    <div class="buttons" id="toc_buttons">
    {If}{[srcicon],_document:viewfullsize_}
    {If}{[screenicon],_document:viewpreview_}
    {If}{[NoText] ne '1',_document:viewtext_}
    </div>
    <div class="toc">[DocTOC]</div>

    {Or}{[parent(Top):ex.Title],[ex.Title]} outputs the newspaper Title metadata. This is only stored at the top level of the document, so if we are at a subsection, we need to get it from the top ([parent(Top):ex.Title]). Note that we can't just use [parent:ex.Title] as this retrieves the Title from the immediate parent node, which may not be the top node of the document.

    _document:viewpreview_, _document:viewfullsize_, _document:viewtext_ are macros defined in extra.dm which output buttons for preview, fullsize and text versions, respectively. We choose which buttons to display based on what metadata and text the document has. (Note: you can view the macros by going to the Collection Specific Macros section of the Format panel.)

    [DocTOC] is the document table of contents or "go to page" navigation element. Since we are using extended options, we need to explicitly specify this for it to appear in the page.

    The different pieces are surrounded by <div> elements, so that the appropriate styling information can be used.

  1. Select the DocumentText format statement and set it to the following text (which can be copied from sample_files → niupepa → formats → adv_doc_text.txt):

    {If}{_cgiargp_ eq 'fullsize',[srcicon],
    {If}{_cgiargp_ eq 'preview',[screenicon],
    {If}{[NoText] ne '1',[Text],[screenicon]}}}

    This format statement changes the display based on the "p" argument (_cgiargp_). This is not used normally for document display, so we can use it here to switch between full size image ([srcicon]), preview size image ([screenicon]) and text ([Text]) versions of each page.

  1. Preview the collection. View some of the documents—once you have reached a newspaper page, you should get fullsize, preview and text options.
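The conditional logic in the DocumentHeading and DocumentText format statements above can be read as the following Python sketch. It is a paraphrase for illustration, not how Greenstone evaluates format statements: the function names are hypothetical, a page is modelled as a plain dictionary of metadata, and `p` stands for the value of the "p" argument (_cgiargp_).

```python
def heading_buttons(page):
    """Which buttons the DocumentHeading statement would show for a page."""
    shown = []
    if page.get("srcicon"):         # {If}{[srcicon],_document:viewfullsize_}
        shown.append("viewfullsize")
    if page.get("screenicon"):      # {If}{[screenicon],_document:viewpreview_}
        shown.append("viewpreview")
    if page.get("NoText") != "1":   # {If}{[NoText] ne '1',_document:viewtext_}
        shown.append("viewtext")
    return shown


def document_body(p, page):
    """Which version of the page the DocumentText statement would display."""
    if p == "fullsize":
        return "srcicon"
    if p == "preview":
        return "screenicon"
    # No recognised "p" value: show the text if there is any,
    # otherwise fall back to the preview image.
    if page.get("NoText") != "1":
        return "Text"
    return "screenicon"


scanned_page = {"srcicon": "page.gif", "screenicon": "page_screen.gif"}
image_only_page = dict(scanned_page, NoText="1")

print(heading_buttons(scanned_page), document_body("", scanned_page))
print(heading_buttons(image_only_page), document_body("", image_only_page))
```

The nesting order in the format statement matters in the same way as the order of the `if` tests here: an explicit "fullsize" or "preview" request wins, and the text is only the default view.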


Copyright © 2005-2016 by the New Zealand Digital Library Project at the University of Waikato, New Zealand
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation License.”