Greenstone tutorial exercise

Prerequisite: Looking at a multimedia collection
Sample files: beatles.zip
Devised for Greenstone version: 2.60|3.06
Modified for Greenstone version: 2.86|3.08

Building a multimedia collection

We will proceed to reconstruct from scratch the Beatles collection that you have just looked at. We develop the collection using a small subset of the material, purely to speed up the repeated rebuilding that is involved.

  1. Start a new collection (File → New...) called small beatles, basing it on the default -- New Collection --. (Basing it on the existing Advanced Beatles collection would make your life far easier, but we want you to learn how to build it from scratch!)

  1. Copy the files and folders provided in

    sample_files → beatles → advbeat_small

    into your new collection. Do this by opening up advbeat_small, selecting the eight items within it (from discography to beatles_midi.zip), and dragging them across. Because some of these files are in MP3 and MARC formats you will be asked whether to include MP3Plugin and MARCPlugin in your collection. Click <Add Plugin>.

  1. Change to the Enrich panel and browse around the files. There is no metadata—yet. Recall that you can double-click files to view them.

    (There are no MIDI files in the collection: these require more advanced customisation because there is no MIDI plugin. We will deal with them later.)

  1. Change to the Create panel and build the collection.

  1. Preview the result.

Manually correcting metadata

  1. You might want to correct some of the metadata—for example, the atrocious misspelling in the titles "MAGICAL MISTERY TOUR." These documents are in the discography section, with filenames that contain the same misspelling. Locate one of them in the Enrich panel. Notice that the extracted metadata element ex.Title is now filled in, and misspelt. You cannot correct this element, for it is extracted from the file and will be re-extracted every time the collection is re-built.

  1. Instead, add dc.Title metadata for these two files: "Magical Mystery Tour." In the Enrich panel, open the discography folder and drill down to the individual files. Set the dc.Title value for the two offending items.

  1. Build the collection again, and preview it.

    Extracted metadata is unreliable. But it is very cheap! On the other hand, manually assigned metadata is reliable, but expensive. The previous section of this exercise has shown how to aim for the best of both worlds by using extracted metadata but correcting it when it is wrong.
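The "best of both worlds" idea can be pictured with a small Python sketch. It is only an illustration, not Greenstone's own code: the function name `display_title` and the dictionary layout are made up for this example, and it assumes the collection prefers dc.Title over ex.Title wherever titles are shown.

```python
def display_title(metadata):
    """Return the manually assigned title if present, else the extracted one."""
    # Assumption: dc.Title (manual) takes priority over ex.Title (extracted).
    for field in ("dc.Title", "ex.Title"):
        value = metadata.get(field, "").strip()
        if value:
            return value
    return "(untitled)"

# One corrected document and one that still relies on extracted metadata.
corrected = {"ex.Title": "MAGICAL MISTERY TOUR", "dc.Title": "Magical Mystery Tour"}
untouched = {"ex.Title": "ANTHOLOGY 1"}

print(display_title(corrected))
print(display_title(untouched))
```

Only the documents whose extracted title is wrong need a manual value; every other document keeps using the cheap extracted one.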

Browsing by media type

  1. First let's remove the List classifier for filenames, which isn't very useful, and replace it with a browsing structure that groups documents by category (discography, lyrics, audio etc.). Categories are defined by manually assigned metadata.

    Build the collection again and preview it.

Note how we assigned dc.Format metadata to all documents in the collection with a minimum of labour. We did this by capitalizing on the folder structure of the original information. Even though we complained earlier about how messy this folder structure is, you can still take advantage of it when assigning metadata.
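To see why folder-level assignment saves so much labour, here is a rough Python sketch of metadata inheritance. It is an assumption-laden model rather than Greenstone's implementation: the folder names other than discography, the file paths, and the helper `inherited_format` are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical folder-level assignments: folder path -> dc.Format value.
FOLDER_FORMAT = {
    "discography": "Discography",
    "lyrics": "Lyrics",
}

def inherited_format(file_path, folder_format=FOLDER_FORMAT):
    """Return the dc.Format of the nearest enclosing folder that has one."""
    # parents yields the closest folder first, then works outwards.
    for folder in PurePosixPath(file_path).parents:
        value = folder_format.get(str(folder))
        if value is not None:
            return value
    return None  # no enclosing folder carries dc.Format

for path in ("discography/albums/help.htm", "lyrics/help.htm", "readme.txt"):
    print(path, "->", inherited_format(path))
```

One value set on a folder reaches every file beneath it, however deeply nested.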

Using switch statements

  1. Alongside the Audio files there is an MP3 icon, which plays the audio when you click it. There is also a document icon, which doesn't make much sense with either audio or image files. We can modify the format statement to display different icons depending on the value of the dc.Format metadata field.

    To make this easier for you we have prepared a plain text file that contains the new text. In WordPad open the following file. (Do not use Notepad, because Notepad does not display the line breaks correctly.)

    sample_files → beatles → format_tweaks → audio_tweak_3.txt

    The gsf:switch statement allows you to display different things depending on the value (or existence) of a metadata field. This switch statement is based on the value of dc.Format. If dc.Format equals Audio, then the source icon will be displayed, linking to the source document. (For the MP3 files, the MP3 icon will display, and clicking the icon will play the MP3.) If dc.Format equals Images, the image thumbnail will appear, linking to the full-size image. If dc.Format equals Supplementary, both the source and document icons will appear, linking to the source document and the document display page, respectively. Finally, the gsf:otherwise statement says what to do in all other cases.

    Preview the result. You may need to click the browser's <Reload> button to force it to re-load the page.

  1. While we're at it, let's remove the source filename from where it appears after each document.

    Preview the result. (You don't need to rebuild the collection.)
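If the switch statement is hard to follow, it may help to read it as an ordinary chain of tests. The Python sketch below mirrors the behaviour described above for each dc.Format value. The function name `icons_for` and the tuple layout are invented for illustration, and "document icon" simply stands for the standard document icon.

```python
def icons_for(dc_format):
    """Return (link target, icon) pairs shown beside one document."""
    if dc_format == "Audio":
        return [("source", "srcicon")]
    if dc_format == "Images":
        return [("source", "thumbicon")]
    if dc_format == "Supplementary":
        # Two icons: one links to the source file, one to the document page.
        return [("source", "srcicon"), ("document", "document icon")]
    # Plays the role of gsf:otherwise: any value not matched above.
    return [("document", "document icon")]

for value in ("Audio", "Images", "Supplementary", "Lyrics"):
    print(value, icons_for(value))
```

The first matching test wins, and the final return statement catches everything else.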

Using AZCompactList rather than List

  1. There are sometimes several documents with the same title. For example, All My Loving appears both as lyrics and tablature (under ALL MY LOVING). The titles browser might be improved by grouping these together under a bookshelf icon. This is a job for an AZCompactList. In a previous tutorial we showed how to use the bookshelf_type option in the List classifier to group documents with the same metadata value (dc.Format in that case) in one bookshelf. Here we use AZCompactList instead.

    Build the collection again and preview it. Both items for All My Loving now appear under the same bookshelf. However, many entries haven't been amalgamated because of non-uniform titles: for example A Hard Day's Night appears as several different variants. We will learn below how to amalgamate these.
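The grouping can be sketched in a few lines of Python. This is a simplified model, not the classifier's code: it assumes documents are grouped only when their title values are identical, and the variant title marked below is hypothetical.

```python
from collections import defaultdict

def bookshelves(docs):
    """Group (title, category) pairs by identical title value."""
    groups = defaultdict(list)
    for title, category in docs:
        groups[title].append(category)
    return dict(groups)

docs = [
    ("ALL MY LOVING", "Lyrics"),
    ("ALL MY LOVING", "Tablature"),
    ("A HARD DAY'S NIGHT", "Lyrics"),
    ("A HARD DAY'S NIGHT (film)", "Discography"),  # hypothetical variant title
]

for title, members in bookshelves(docs).items():
    kind = "bookshelf" if len(members) > 1 else "document"
    print(f"{kind}: {title} {members}")
```

The two All My Loving entries share a bookshelf, while the two spellings of A Hard Day's Night stay apart because their titles differ.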

Making bookshelves show how many items they contain

  1. Make the bookshelves show how many documents they contain by modifying the VList classifierNode template of the browse format feature in the Format Features section of the Format panel. Insert the highlighted statements:

    <gsf:template match="classifierNode[@classifierStyle = 'VList']">
    ...
    <gsf:metadata name="Title"/>
    </td>
    <td valign="top">
    (<gsf:metadata name="numleafdocs"/>)
    </td>

    ...
    </gsf:template>

    The complete format statement for the VList classifierNode template of the browse format feature can be copied from sample_files → beatles → format_tweaks → show_num_docs_3.txt.

    Preview the result. (You don't need to build the collection.) Bookshelves in the titles and browse classifiers should show how many documents they contain.

Adding a Phind phrase browser

  1. In the Browsing Classifiers section on the Design panel, add a Phind classifier. Leave the settings at their defaults: this generates a phrase browsing classifier that sources its phrases from Title and text.

    Build the collection again and preview it. Select the new Phrase browse option from the navigation bar. Enter a single word in the text box, such as band. The phrase browser will present you with phrases found in the collection containing the search term. This can provide a useful way of browsing a very large collection. Note that even though it is called a phrase browser, only single terms can be used as the starting point for browsing.
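To get a feel for what a phrase browser returns, here is a deliberately naive Python sketch. It is not the algorithm Phind uses: it simply counts short word sequences that contain the search term. The function name, the `max_words` limit, and the sample texts are all assumptions made for this example.

```python
from collections import Counter
import re

def phrases_containing(term, texts, max_words=3):
    """Count word sequences (2..max_words words) that include the term."""
    counts = Counter()
    term = term.lower()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        for n in range(2, max_words + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if term in gram:
                    counts[" ".join(gram)] += 1
    return counts

# Hypothetical sample text standing in for titles in the collection.
texts = [
    "Sgt. Pepper's Lonely Hearts Club Band",
    "Sgt. Pepper's Lonely Hearts Club Band (Reprise)",
]

for phrase, count in phrases_containing("band", texts).most_common():
    print(count, phrase)
```

A single starting word fans out into the longer phrases that contain it, which is the browsing pattern described above.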

Branding the collection with an image

  1. To complete the collection, let's give it a new image for the link from the main page. Go to the General section of the Format panel. Use the browse button of URL to 'about page' image: to select the following image:

    sample_files → beatles → advbeat_large → images → beatlesmm.png

    Preview the collection, and make sure the new image appears.

Using UnknownPlugin

In this section we incorporate the MIDI files. Greenstone has no MIDI plugin (yet). But that doesn't mean you can't use MIDI files!

  1. UnknownPlugin is a useful generic plugin. It knows nothing about any given format but can be tailored to process particular document types—like MIDI—based on their filename extension, and set basic metadata.

    In the Document Plugins section of the Design panel:

    In this collection, all MIDI files are contained in the file beatles_midi.zip. ZIPPlugin (already in the list of default plugins) is used to unpack the files and pass them down the list of plugins until they reach UnknownPlugin.

  1. Build the collection and preview it. Unfortunately, the MIDI files don't appear as Audio under the browse button. That's because they haven't been assigned dc.Format metadata.
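The flow described in this section (unpack the zip file, then let a generic plugin claim files by extension) can be modelled roughly in Python. This is a sketch under stated assumptions, not how ZIPPlugin or UnknownPlugin are written: the function name, the file names inside the archive, and the metadata dictionary are hypothetical, and the FileFormat value MIDI is assumed to be the one configured for the plugin.

```python
import io
import os
import zipfile

def claim_by_extension(zip_bytes, extensions, file_format):
    """List archive members with a matching extension, tagged with a format."""
    claimed = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            ext = os.path.splitext(name)[1].lower().lstrip(".")
            if ext in extensions:
                # Only basic metadata is set; nothing is read from the file.
                claimed.append({"Source": name, "FileFormat": file_format})
    return claimed

# Build a tiny archive in memory so the example is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("help.mid", b"")   # hypothetical file names
    archive.writestr("notes.txt", b"")

midi_docs = claim_by_extension(buffer.getvalue(), {"mid", "midi"}, "MIDI")
print(midi_docs)
```

Notice that the claimed file carries no dc.Format value, which is why it does not show up as Audio until that metadata is assigned.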

Cleaning up a title browser using regular expressions

We now clean up the titles browser.

  1. We are going to use the removesuffix classifier option. The aim is to amalgamate variants of titles by stripping away extraneous text. For example, we would like to treat "ANTHOLOGY 1", "ANTHOLOGY 2" and "ANTHOLOGY 3" the same for grouping purposes. To achieve this:

    Build the collection and preview the result. Observe how many more times similar titles have been amalgamated under the same bookshelf. Test your understanding of regular expressions by trying to rationalize the amalgamations. (Note: [[:punct:]] stands for any punctuation character.)

One powerful use of regular expressions in the exercise was to clean up the titles browser. Perhaps the best way of doing this would be to have proper title metadata. The metadata extracted from HTML files is messy and inconsistent, and this was reflected in the original titles browser. Defining proper title metadata would be simple but rather laborious. Instead, we have opted to use regular expressions in the AZCompactList classifier to clean up the title metadata. This is difficult to understand, and a bit fiddly to do, but if you can cope with its idiosyncrasies it provides a quick way to clean up the extracted metadata and avoid having to enter a large amount of metadata.
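If you would like to experiment with suffix stripping outside Greenstone, the Python sketch below shows the general idea. The pattern is hypothetical and is not the expression used in this exercise. Python's re module has no [[:punct:]] class, so string.punctuation stands in for it here.

```python
import re
import string

# Approximation of [[:punct:]] for Python's re module.
PUNCT = re.escape(string.punctuation)

# Hypothetical suffix pattern: a trailing number, or whitespace followed
# by a punctuation character and everything after it.
SUFFIX = re.compile(rf"(\s+\d+|\s+[{PUNCT}].*)$")

def grouping_key(title):
    """Return the title with the suffix removed, for grouping purposes only."""
    return SUFFIX.sub("", title)

titles = [
    "ANTHOLOGY 1",
    "ANTHOLOGY 2",
    "ANTHOLOGY 3",
    "A HARD DAY'S NIGHT (film)",  # hypothetical variant title
]

for title in titles:
    print(f"{title!r} -> {grouping_key(title)!r}")
```

The apostrophe in DAY'S is left alone because it does not follow whitespace. Getting details like this right is what makes the regular expression fiddly.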

Using different icons for different media types

To put the finishing touches to our collection, we add some decorative features.

  1. Close the collection in the Librarian Interface (File → Close).

  1. Using your file browser outside Greenstone, locate the folder

    sample_files → beatles → advbeat_large

  1. Open up another file browser, and locate the small beatles collection in your Greenstone installation:

    Greenstone3 → web → sites → localsite → collect → smallbea

    smallbea is the folder name generated by Greenstone for this collection. You can determine what the folder name is for a collection by looking at the title bar of the Librarian Interface: the folder name is displayed in brackets after the collection name.

  1. Using the file browser, copy the images folder from the advbeat_large folder into the smallbea folder. (It's OK to overwrite the existing images folder: the image in it is included in the folder being copied.) The images folder includes some useful icons.

  1. Open the collection in GLI again and update the previously edited portion of the documentNode format statement of the browse format feature (in Format Features on the Format panel) as shown below. You can copy the new text from the file sample_files → beatles → format_tweaks → multi_icons_3.txt.

    Change:

    <td valign="top">
    <gsf:switch>
    <gsf:metadata name="dc.Format"/>
    <gsf:when test='equals' test-value='Audio'>
    <gsf:link type="source"><gsf:metadata name="srcicon"/></gsf:link>
    </gsf:when>
    <gsf:when test='equals' test-value='Images'>
    <gsf:link type="source"><gsf:metadata name="thumbicon"/></gsf:link>
    </gsf:when>
    <gsf:when test='equals' test-value='Supplementary'>
    <gsf:link type="source"><gsf:metadata name="srcicon"/></gsf:link> <gsf:link type="document"><gsf:icon type="document"/></gsf:link>
    </gsf:when>
    <gsf:otherwise>
    <gsf:link type="document"><gsf:icon type="document"/></gsf:link>
    </gsf:otherwise>
    </gsf:switch>

    </td>

    to this:

    <td valign="top">
    <gsf:switch>
    <gsf:metadata name="dc.Format"/>
    <gsf:when test="equals" test-value="Lyrics">
    <gsf:link type="document">
    <gsf:icon file="lyrics.gif" select="collection" />
    </gsf:link>
    </gsf:when>
    <gsf:when test="equals" test-value="Discography">
    <gsf:link type="document">
    <gsf:icon file="disc.gif" select="collection" />
    </gsf:link>
    </gsf:when>
    <gsf:when test="equals" test-value="Tablature">
    <gsf:link type="document">
    <gsf:icon file="tab.gif" select="collection" />
    </gsf:link>
    </gsf:when>
    <gsf:when test="equals" test-value="MARC">
    <gsf:link type="document">
    <gsf:icon file="marc.gif" select="collection" />
    </gsf:link>
    </gsf:when>
    <gsf:when test="equals" test-value="Images">
    <gsf:link type="source">
    <gsf:metadata name="thumbicon"/>
    </gsf:link>
    </gsf:when>
    <gsf:when test="equals" test-value="Supplementary">
    <gsf:link type="source">
    <gsf:metadata name="srcicon"/>
    </gsf:link>
    </gsf:when>
    <gsf:when test="equals" test-value="Audio">
    <gsf:link type="source">
    <gsf:switch>
    <gsf:metadata name="FileFormat"/>
    <gsf:when test="equals" test-value="MIDI">
    <gsf:icon file="midi.gif" select="collection" />
    </gsf:when>
    <gsf:otherwise>
    <gsf:metadata name="srcicon"/>
    </gsf:otherwise>
    </gsf:switch>
    </gsf:link>
    </gsf:when>
    </gsf:switch>
    </td>

  1. Preview your collection as before. Now different icons are used for discography, lyrics, tablature, and MARC metadata. Even MP3 and MIDI audio file types are distinguished.
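The revised format statement nests one switch inside another. As a reading aid, here is the same decision written as a Python sketch. The function `choose_icon` and its return shape are invented for illustration. The icon file names and metadata values are taken from the statement above.

```python
# dc.Format values that map straight to an icon in the collection's images folder.
COLLECTION_ICONS = {
    "Lyrics": "lyrics.gif",
    "Discography": "disc.gif",
    "Tablature": "tab.gif",
    "MARC": "marc.gif",
}

def choose_icon(dc_format, file_format=None):
    """Return (link type, icon), or None when no branch matches."""
    if dc_format in COLLECTION_ICONS:
        return ("document", COLLECTION_ICONS[dc_format])
    if dc_format == "Images":
        return ("source", "thumbicon")
    if dc_format == "Supplementary":
        return ("source", "srcicon")
    if dc_format == "Audio":
        # Inner switch on FileFormat: MIDI gets its own icon.
        icon = "midi.gif" if file_format == "MIDI" else "srcicon"
        return ("source", icon)
    return None  # the outer switch above has no gsf:otherwise branch

print(choose_icon("Audio", "MIDI"))
print(choose_icon("Audio", "MP3"))
print(choose_icon("Tablature"))
```

The outer test picks the media category, and only the Audio branch looks at FileFormat to tell MIDI from MP3.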

Building a full-size version of the collection

  1. To finish, let's now build a larger version of the collection. To do this:


Copyright © 2005-2016 by the New Zealand Digital Library Project at the University of Waikato, New Zealand
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation License.”