Greenstone tutorial exercise

Sample files: tudor.zip
Devised for Greenstone version: 2.60
Modified for Greenstone version: 2.74

A large collection of HTML files—Tudor

You will need the files in the sample_files → tudor folder.

  1. Invoke the Greenstone Librarian Interface (from the Windows Start menu) and start a new collection called tudor (use the File menu), based on the default -- New Collection --.

  1. In the Gather panel, open the tudor folder in sample_files.

  1. Drag englishhistory.net from the left-hand side to the right to include it in your tudor collection. (This material is from Marilee Hanson's Tudor England Collection at http://englishhistory.net/tudor.html, distributed with her permission.)

  1. Switch to the Create panel and click <Build Collection>.

  1. When building has finished, preview the collection.

Extracting more metadata from the HTML

  1. The browsing facilities in this collection (Titles and Filenames) are based entirely on extracted metadata. Return to the Enrich panel in the Librarian Interface and examine the metadata that has been extracted for some of the files.

  1. Many HTML documents contain metadata in <meta> tags in the <head> of the page. Open the englishhistory.net → tudor → monarchs → boleyn.html file by navigating to it in the tree on the left-hand side and double-clicking it. This opens it in a web browser. View the HTML source of the page (View → Source in Internet Explorer, View → Page Source in Mozilla). You will notice that this page has page_topic, content and author metadata.

  1. By default, HTMLPlugin only looks for Title metadata. Configure the plugin so that it looks for the other metadata too. Switch to the Design panel and select the Document Plugins section. Select the plugin HTMLPlugin line and click <Configure Plugin...>. A popup window appears. Switch on the metadata_fields option, and set the value to

    Title,Author,Page_topic,Content

    Click <OK>.

  1. Switch to the Create panel and rebuild the collection. Go back to the Enrich panel and look at the extracted metadata for some of the HTML files in englishhistory.net → tudor → monarchs. The new metadata should now be visible.
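
To make the idea of <meta> tag extraction concrete, here is a minimal Python sketch of pulling a chosen list of fields out of a page's <head>. It is only an illustration and is not how HTMLPlugin is implemented; the class name, the function name and the sample HTML (including its values) are assumptions made up for this example. The field list mirrors the metadata_fields value set in the step above.

```python
from html.parser import HTMLParser

# Fields to look for, mirroring the metadata_fields value used in this exercise.
WANTED_FIELDS = ["Title", "Author", "Page_topic", "Content"]


class MetaExtractor(HTMLParser):
    """Hypothetical helper: collects <title> text and <meta name=... content=...> values."""

    def __init__(self, wanted):
        super().__init__()
        # Map lowercased names to the spelling used in the field list,
        # so that name="author" is stored as "Author".
        self.wanted = {name.lower(): name for name in wanted}
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in self.wanted and attr.get("content"):
                self.metadata[self.wanted[name]] = attr["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Use the <title> element text as Title if that field was requested.
        if self._in_title and "title" in self.wanted and data.strip():
            self.metadata.setdefault(self.wanted["title"], data.strip())


def extract_metadata(html_text, wanted=WANTED_FIELDS):
    parser = MetaExtractor(wanted)
    parser.feed(html_text)
    parser.close()
    return parser.metadata


# Made-up sample page; the values are placeholders, not the content of boleyn.html.
SAMPLE_HTML = """<html><head>
<title>Example page</title>
<meta name="author" content="Example Author">
<meta name="page_topic" content="Example topic">
<meta name="content" content="Example description">
<meta name="keywords" content="not requested, so ignored">
</head><body><p>Body text</p></body></html>"""

metadata = extract_metadata(SAMPLE_HTML)
for field, value in metadata.items():
    print(f"{field}: {value}")
```

Only the names in the field list are kept, which is why the keywords tag in the sample is dropped, much as the plugin only extracts the fields it has been told to look for.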

Blocking the stray images

You've probably noticed that the collection contains a few stray image files as well as the HTML documents. These images should not appear as documents. Many of the HTML documents include images, and Greenstone attempts to determine which images belong to HTML pages so that only the other images are considered for inclusion in the collection. In this case it hasn't been completely successful. (This is because the web site from which these files were downloaded occasionally departs from the usual convention of hierarchical structuring.)
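
The following Python sketch shows one plausible way to work out which images "belong" to HTML pages: collect the src of every <img> tag, resolve it against the page's folder, and treat any image that no page points to as a stray. This is an assumption-laden illustration, not Greenstone's actual smart_block logic; the class and function names, the file paths and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser
import posixpath


class ImageRefCollector(HTMLParser):
    """Hypothetical helper: records the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def referenced_images(page_path, html_text):
    """Return the image paths a page refers to, resolved against the page's folder."""
    collector = ImageRefCollector()
    collector.feed(html_text)
    collector.close()
    page_dir = posixpath.dirname(page_path)
    return {
        posixpath.normpath(posixpath.join(page_dir, src))
        for src in collector.sources
    }


# Made-up miniature collection: one page and three image files.
pages = {
    "tudor/monarchs/page.html": '<img src="../images/a.jpg"><img src="b.gif">',
}
image_files = ["tudor/images/a.jpg", "tudor/monarchs/b.gif", "tudor/other/c.jpg"]

# Images that some page refers to are blocked; the rest are left over as strays.
blocked = set()
for path, html_text in pages.items():
    blocked |= referenced_images(path, html_text)

stray = [name for name in image_files if name not in blocked]
print("blocked:", sorted(blocked))
print("stray:", stray)
```

In this sketch, an image that no page reference resolves to is left unblocked, which is the kind of leftover that shows up as a stray image in a collection.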

  1. Switch back to the Document Plugins section of the Design panel. Beside plugin HTMLPlugin you will see -smart_block. This option attempts to identify images in the HTML pages and block them from inclusion, but in this case it's not smart enough! Configure plugin HTMLPlugin again, scroll down to locate the smart_block option, and switch it off.

  1. Rebuild and preview the collection. The collection is exactly as before, except that the stray images no longer appear. This works because plugins operate as a pipeline: each file is passed to the plugins in turn until one is found that can process it. By default (i.e. without smart_block) HTMLPlugin blocks all images, so none of them is passed further down the pipeline. This is appropriate for this collection.
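
The pipeline behaviour described in the last step can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Greenstone code: HtmlStage and ImageStage are hypothetical stand-ins for plugins, and the extension lists and file names are invented for the example.

```python
import posixpath

HTML_EXTENSIONS = {".htm", ".html"}
IMAGE_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png"}


def extension(filename):
    return posixpath.splitext(filename)[1].lower()


class HtmlStage:
    """Hypothetical stand-in for an HTML plugin that blocks every image."""
    name = "HtmlStage"

    def handle(self, filename):
        ext = extension(filename)
        if ext in HTML_EXTENSIONS:
            return "processed"
        if ext in IMAGE_EXTENSIONS:
            return "blocked"  # swallowed here, never offered to later stages
        return None  # not handled: pass the file on


class ImageStage:
    """Hypothetical stand-in for a later plugin that would accept images."""
    name = "ImageStage"

    def handle(self, filename):
        return "processed" if extension(filename) in IMAGE_EXTENSIONS else None


def run_pipeline(stages, filename):
    """Offer the file to each stage in turn; stop at the first one that claims it."""
    for stage in stages:
        outcome = stage.handle(filename)
        if outcome is not None:
            return stage.name, outcome
    return None, "unprocessed"


pipeline = [HtmlStage(), ImageStage()]
for filename in ["monarchs/page.html", "images/portrait.jpg", "notes.txt"]:
    print(filename, run_pipeline(pipeline, filename))
```

Because the HTML stage comes first and claims every image by blocking it, the image never reaches the later stage, so it cannot end up in the collection as a document of its own.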

Looking at different views of the files in the Gather and Enrich panels

  1. Switch to the Gather panel and in the right-hand side open englishhistory.net → tudor.

  1. Change the Show Files menu for the right-hand side from All Files to HTM & HTML. Notice the files displayed above are filtered accordingly, to show only files of this type.

  1. Change the Show Files menu to Images. Again, the files shown above alter.

  1. Now set the Show Files menu back to All Files; otherwise you may get confused later. Remember, if the Gather or Enrich panels do not seem to be showing all your files, this filter could be the cause.


Copyright © 2005 2006 2007 by the New Zealand Digital Library Project at the University of Waikato, New Zealand
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation License.”