Greenstone tutorial exercise

Prerequisite: Scanned image collection
Sample files: niupepa.zip
Devised for Greenstone version: 2.70|3.06
Modified for Greenstone version: 2.86|3.08

Advanced scanned image collection

In this exercise we build upon the collection created in the Scanned image collection exercise. We add a new newspaper by creating an item file for it, add a new newspaper using the extended XML item file format, and modify the formatting.

Adding another newspaper to the collection

Another newspaper has been scanned and OCRed, but has no item file. We will add this newspaper into the collection, and create an item file for it.

  1. In the Librarian Interface, open the Paged Image collection that was created in the Scanned image collection exercise, if it is not already open (File → Open...).

  1. In the Gather panel, add the folder sample_files → niupepa → new_papers → 12 to your collection.

    Inside the 12 folder you can see that there are 4 images and 4 text files.

  1. Create an item file for the new newspaper. Have a look at an existing item file to see the format. Start up a text editor (e.g. WordPad) and open a new document. Add some metadata. The Title for this newspaper is "Te Haeata 1859-1862". The Volume is 3, the Number is 6, and the Date is "18610902". (Greenstone's date format is yyyymmdd.) Metadata must be added in the form:

    <Metadata name>Metadata value

    For this document, the metadata looks like:

    <Title>Te Haeata 1859-1862
    <Date>18610902
    <Volume>3
    <Number>6

  1. For each page, add a line in the file in the following format:

    pagenum:imagefile:textfile

    For example, the first page entry would look like

    1:images/12_3_6_1.gif:text/12_3_6_1.txt

    If a page has no text file, you can leave the text file field blank. You need to add a line for each page in the document. Make sure you increment the page number as well as the image number for each line. (The full text for this file can be copied from sample_files → niupepa → formats → 12_3_6.item.)

  1. Save the file as a plain text document with the filename 12_3_6.item. (If you are using Windows, make sure the file doesn't accidentally get saved as 12_3_6.item.txt.) Back in the Gather panel of the Librarian Interface, locate the new file in the Workspace tree, and drag it into the collection, adding it to the 12 folder.

  1. Build the collection and preview. Check that your new document has been added.
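
The item file described in the steps above can also be generated with a short script. The following Python sketch is an illustration only and is not part of Greenstone: the function name build_item_file is hypothetical, and it assumes that every page follows the naming pattern of the first page entry (images/12_3_6_N.gif and text/12_3_6_N.txt) and that the newspaper has 4 pages, matching the 4 images in the 12 folder. Compare the result with sample_files → niupepa → formats → 12_3_6.item before relying on it.

```python
def build_item_file(metadata, prefix, page_count, has_text=True):
    """Return the contents of a simple text-format item file.

    metadata   -- list of (name, value) pairs, written as <name>value
    prefix     -- shared file name stem, e.g. "12_3_6" (assumed pattern)
    page_count -- number of pages; one pagenum:imagefile:textfile line each
    has_text   -- if False, the text file field is left blank
    """
    lines = [f"<{name}>{value}" for name, value in metadata]
    for page in range(1, page_count + 1):
        image = f"images/{prefix}_{page}.gif"
        text = f"text/{prefix}_{page}.txt" if has_text else ""
        lines.append(f"{page}:{image}:{text}")
    return "\n".join(lines) + "\n"


# Metadata values taken from the exercise text.
metadata = [
    ("Title", "Te Haeata 1859-1862"),
    ("Date", "18610902"),  # yyyymmdd
    ("Volume", "3"),
    ("Number", "6"),
]

item_text = build_item_file(metadata, "12_3_6", 4)
print(item_text)
```

To use the output, save the printed text as a plain text file named 12_3_6.item and add it to the 12 folder as described above.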

XML based item file

There are two styles of item files. The first, which was used in the previous section, uses a simple text-based format and consists of a list of metadata for the document followed by a list of pages. This format allows only document-level metadata and a single flat list of pages.
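
To make the structure of the simple format concrete, the following Python sketch reads such a file into a metadata dictionary and a flat page list. It is an illustration under stated assumptions and is not how Greenstone itself processes item files: the function name parse_item_text is hypothetical, and the sketch assumes that metadata lines start with `<` and that page lines have the form pagenum:imagefile:textfile with an optionally blank text file.

```python
def parse_item_text(text):
    """Split simple item file contents into (metadata, pages).

    metadata -- dict mapping metadata name to value (document level only)
    pages    -- list of (pagenum, imagefile, textfile) tuples;
                textfile is None when the field is blank
    """
    metadata = {}
    pages = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("<"):
            # Metadata line: <Name>Value
            name, _, value = line[1:].partition(">")
            metadata[name] = value
        else:
            # Page line: pad so a missing text file field does not fail
            parts = (line.split(":", 2) + ["", ""])[:3]
            pagenum, image, textfile = parts
            pages.append((pagenum, image, textfile or None))
    return metadata, pages


sample = """<Title>Te Haeata 1859-1862
<Date>18610902
1:images/12_3_6_1.gif:text/12_3_6_1.txt
2:images/12_3_6_2.gif:
"""

metadata, pages = parse_item_text(sample)
print(metadata)
print(pages)
```

The result has one dictionary for the whole document and one flat list of pages, which mirrors the limits of this format: there is no place for per-page metadata or for grouping pages into sections.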

The second style is an extended format, and uses XML. It allows a hierarchy of pages, and metadata specification at the page level as well as at the document level. In this section, we add in two newspapers which use XML-based item files.

  1. In the Gather panel, add the folder sample_files → niupepa → new_papers → xml (you need to add the xml folder, not the 23 folder) to your collection.

  1. Open up the file xml → 23 → 23__2.item and have a look at the XML. This is Number 2 of the newspaper titled Matariki 1881. The contents of this document have been grouped into two sections: Supplementary Material, which contains an Abstract, and Newspaper Pages, which contains the page images (and OCR text).

  1. Build and preview the collection. Check that the XML-style items have been included.


Copyright © 2005-2016 by the New Zealand Digital Library Project at the University of Waikato, New Zealand
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation License.”