Greenstone tutorial exercises (November 2008)

Suitable for Greenstone version 2.74 and above

If you are working from a Greenstone CD-ROM, sample files for these exercises are in the folder sample_files; otherwise they can be downloaded from sourceforge.

The text sometimes uses Windows terminology, but the exercises work equally well on other systems if you make appropriate changes to the pathnames.

Working with a pre-packaged collection (UNAIDS)
Installing a pre-packaged Greenstone collection
Browsing around a Greenstone collection
Searching within a Greenstone collection
Leaving the Greenstone digital library
Exercise: Use the UNAIDS collection to answer these questions
Working with a pre-packaged collection (Digital Libraries in Education)
Installing a pre-packaged collection
Browsing around a Greenstone collection
Exercise: Read the Help page; then answer these questions
Exercise: Use the How to build a digital library collection to answer these questions.
Installing Greenstone
Installing Greenstone on a Windows system
Installing ImageMagick on a Windows system
Installing Ghostscript on a Windows system
Updating a Greenstone installation
Removing Greenstone from a Windows system
Reinstalling Greenstone on a Windows system
Amalgamating different Greenstone collections
Installing the Greenstone language pack (2.62 and earlier)
Enabling other languages (2.63 and later)
Installing the Classic Interface Pack (2.63 and later)
Building a small collection of HTML files
Running the Greenstone Librarian Interface
Starting a new collection
Adding documents to the collection
Building the collection
Viewing the extracted metadata
Viewing the internal links and external links
Setting up a shortcut in the Librarian interface
A simple image collection
Adding Title and Description metadata
Change Format Features to display new metadata
Changing the size of image thumbnails
Adding a browsing classifier based on Description metadata
Creating a searchable index based on Description metadata
A collection of Word and PDF files
Viewing the extracted metadata
Manually adding metadata to documents in a collection
Document Plugins
Search indexes
Browsing classifiers
Renaming the search indexes
Classifying on multiple metadata
Formatting the Word and PDF collection
Tidying up the default format statement
Linking to Greenstone version or original version of documents
Making bookshelves show how many items they contain
Displaying multi-valued metadata
Advanced multi-valued metadata
Enhanced PDF handling
Modes in the Librarian Interface
Splitting PDFs into sections
Using image format
Using process_exp to control document processing (advanced)
Opening PDF files with query terms highlighted
Enhanced Word document handling
Using Windows native scripting
Modes in the Librarian Interface
Defining styles
Removing pre-defined table of contents
Extracting document properties as metadata
Exporting a collection to CD-ROM/DVD
A large collection of HTML files—Tudor
Extracting more metadata from the HTML
Blocking the stray images
Looking at different views of the files in the Gather and Enrich panels
Enhanced collection of HTML files—Tudor
Adding hierarchically-structured metadata and a Hierarchy classifier
Adding a hierarchical phrase browser (PHIND)
Partitioning the full-text index based on metadata values
Controlling the building process
Formatting the HTML collection—Tudor
Section tagging for HTML documents
Downloading files from the web
Pointing to documents on the web
Bibliographic collection
Using fielded searching
Exploding the database
Reformatting the collection to use the exploded metadata
CDS/ISIS collection
Customization: macro files and stylesheets
Collection specific customisation
Changing the colour of the page title and page text
Make your own Greenstone home page
How to determine which images to replace (advanced)
Looking at a multimedia collection
Building a multimedia collection
Manually correcting metadata
Browsing by media type
Suppressing dummy text
Using AZCompactList rather than AZList
Making bookshelves show how many items they contain
Adding a Phind phrase browser
Branding the collection with an image
Using UnknownPlugin
Cleaning up a title browser using regular expressions
Using non-standard macro files
Using different icons for different media types
Changing the collection's background image
Building a full-size version of the collection
Adding an image collage browser
Scanned image collection
Grouping documents by series title and displaying dates within each group
Displaying scanned images and suppressing dummy text
Searching at page level
Advanced scanned image collection
Adding another newspaper to the collection
XML based item file
Using process_exp to control document processing
Switching between images and text
Open Archives Initiative (OAI) collection
Tweaking the presentation with format statements
Downloading over OAI
Downloading using the Librarian Interface
Downloading using the command line
Use METS as Greenstone's Internal Representation
Moving a collection from DSpace to Greenstone
Adding indexing and browsing capabilities to match DSpace's
Moving a collection from Greenstone to DSpace
Using Greenstone from the command line
Editing metadata sets
Running GEMS
Creating a new metadata set
Adding a new element to a metadata set

Working with a pre-packaged collection (UNAIDS)

Devised for Greenstone version: UNAIDS 2.0 CD-ROM

You will need the Greenstone UNAIDS CD-ROM

Installing a pre-packaged Greenstone collection

  1. On inserting the UNAIDS CD-ROM, for many computers installation will begin automatically. If not, "auto-run"—a configurable setting under Windows—is disabled on your computer and you need to double-click Setup.exe on the CD-ROM.

    My Computer → UNAIDS20 → Setup.exe

  1. The InstallShield Wizard begins to install the UNAIDS pre-packaged collection. Select the English language and click <OK>.

  1. On the welcome screen, click the <Next> button.

  1. Choose Run from CD-ROM (standard) as the setup type. This is the default and is already selected. Then click <Next>.

  1. Click <Next> again to install the UNAIDS collection in the default folder, which is C:\Program Files\UNAIDS Library 2.0 [CD-ROM].

    Installation Wizard copies the required files from CD-ROM to disk

  1. Click <OK> to confirm completion of UNAIDS collection (twice).

    InstallShield quits—the UNAIDS Library is installed.

CD-ROMs like this one that contain pre-packaged Greenstone collections do not include the full Greenstone software. Instead they embody a mini version of Greenstone that allows you to view the collection but not to build new ones.

Browsing around a Greenstone collection

  1. Launch the prebuilt library by clicking:

    Start → All Programs → UNAIDS Library 2.0 [CD-ROM] → UNAIDS Library 2.0 (Standard Version).

    To access Greenstone through the Local Library Server, it is sometimes necessary to turn off the proxy settings of the browser. Greenstone normally detects this and pops up a window alerting you to the problem.

  1. Click <Enter Library> in the dialog box and your browser (typically Internet Explorer by default) will display the Greenstone home page.

  1. Within the web browser, click titles a-z (in the centre of the navigation bar near the top of the page).

  1. Access the first book in the list of titles by clicking the book icon next to the title:

    About UNAIDS.

  1. Use the scroll bar to view the full length of the page.

  1. In the table of contents near the top, click the page icon next to the heading Guiding principles of UNAIDS to view this section.

  1. Click the page icon next to the heading Global and local impact to view the next section.

This style of interaction can be continued to further expand and contract folders and switch to a different section.

  1. To fully expand the contents of this introduction chapter, click Expand Document or Chapter in the upper left portion of the page, under the picture of the document's front cover.

  1. You can return to the currently selected page of document titles by clicking the book icon next to the title of the book at the top of the table of contents (this signifies closing the book). You can also get to the document titles using titles a-z in the navigation bar, in this case to the titles beginning with A-D.

    If the table of contents is open at the top level—showing all the chapters—then clicking Expand Document or Chapter expands the full document. For long documents, which take some time to load in, Greenstone seeks confirmation for this action: clicking 'continue' loads the full document.

  1. Browse around and peruse some other documents in the collection.

Searching within a Greenstone collection

  1. Access the search page by clicking search in the navigation bar.

  1. In the query box under Search for chapters in any language which contain some of the words, enter the term gender then click <Begin Search>.

    After a short pause, the web browser loads a fresh page showing the results of the search.

  1. Click the page icon for the first matching document in the result set (Five Year Implementation Review of the Vienna Declaration and Programme of Action) to view the document. Because the search was at the chapter level, you are taken directly to the matching chapter within the document.

  1. Experiment further with searching, and with the interface in general. For example, there is a detailed Help page. It contains a Preferences section through which you can control some search settings.

    The Preferences options in the UNAIDS collection are intentionally minimalist. Most collections have a separate Preferences button that offers more features.

    The home page of the UNAIDS library collection cycles through a sequence of front cover images, updated every 5 seconds or so. Clicking a particular image takes you directly to that document.

Leaving the Greenstone digital library

  1. There are two ways of leaving Greenstone:

    1. Exit from the Greenstone Software server. Click on the Greenstone Software in the task bar, then choose Exit from the Browser Selection and Settings menu (or click on the exit hotspot, the red cross at the top right). The Greenstone Software exits, but your web browser continues to run.

    1. Exit from your web browser. Leave your web browser in the usual way. The Greenstone server detects when you exit from the browser and generates a popup window that asks whether to close down the server as well. (The reason is that other people may be using Greenstone over the network, and should not be rudely terminated.)

Exercise: Use the UNAIDS collection to answer these questions



Working with a pre-packaged collection (Digital Libraries in Education)

Devised for Greenstone version: IITE Digital Libraries in Education CD-ROM

You will need the Greenstone Digital Libraries in Education CD-ROM

Installing a pre-packaged collection

  1. Insert your CD-ROM for the course Digital libraries in education into a Windows computer. If the installation process does not start up straightaway (because the AutoPlay feature is disabled on your computer), navigate to your CD-ROM/DVD drive (normally D:), open the folder prebuilt, and double click on Setup.exe.

  1. During installation you are offered a choice of folder to install in: we recommend the default, which is C:\GSDL.

  1. You are also presented with the option to run Greenstone from the CD-ROM or to copy the entire CD-ROM. We recommend the latter: please check the box that says Install all collection files. It will take at least a couple of minutes to copy the files across.

  1. Finally, the installer offers to install the Netscape browser for you. Do not request this except in the unlikely event that you do not already have a web browser on your computer.

CD-ROMs like this one that contain pre-packaged Greenstone collections do not include the full Greenstone software. Instead they embody a mini version of Greenstone that allows you to view the collection but not to build new ones.

Browsing around a Greenstone collection

  1. To run Greenstone, open the Windows Start menu, Programs, and select Greenstone, then the sub-menu item Digital Libraries in Education: then <Enter Library>.

  1. Click the Digital libraries in Education collection's icon. This takes you to the collection's home page, often called the "about" page.

    The home page contains an access bar with buttons called search, contents, authors a-z, modules, and acronyms. This access bar is the key to finding information in any Greenstone collection.

  1. Click <authors a-z>. A list of bookshelf icons appears. Click the one called Marchionini, G. to see the two course readings by Gary Marchionini.

  1. One of these items is a PDF file and the other is an HTML file. Click them both in turn to open up the documents.

  1. Click the <contents> button in the access bar. This shows two bookshelves, one for this Study Guide and the other for the Course Readings. Choose one and look at what it contains.

  1. Clicking a bookshelf that is open closes it. Close the bookshelf you have just opened and then choose the other one and examine its contents.

  1. Click <acronyms> in the access bar and find the meaning of the acronym "LOM".

  1. Click <search> and search for the word "LOM". Check out the difference between searching text and searching titles (use the pull-down box on the search page).

  1. Click the collection icon Digital Libraries in Education at the top left. This takes you back to the collection's about page.

    Beneath the access bar on the collection's about page is a search box (just the same as the one that appears on the search page), a description of the collection under the heading About this collection, and instructions on how to find information in this collection.

    Above the access bar is the collection's icon, saying Digital Libraries in Education. On the right is an icon saying about, above which are three buttons, home, help, and preferences.

  1. Click <home>. This returns you to the Greenstone home page.

  1. Return to the collection (by clicking its icon), and click <help>. This gives more information about how to access the collection.

  1. Click <preferences>. This takes you to a page where you can change some of the settings.

  1. Now explore the collection by navigating freely around it. Click liberally: all images that appear on the screen are clickable. If you hold the mouse stationary over an image, most browsers will soon pop up a brief "mouse-over" message that tells you what will happen if you click. Experiment! Choose common words like "the" or "and" to search for—that should evoke some response, and nothing will break. (Note: unlike many search systems, Greenstone indexes all words, including these ones.)

Exercise: Read the Help page; then answer these questions

Exercise: Use the How to build a digital library collection to answer these questions.

Most of these questions would be rather difficult to answer from the printed book.



Installing Greenstone

Devised for Greenstone version: 2.60
Modified for Greenstone version: 2.74

Installing Greenstone on a Windows system

There are various ways of getting Greenstone:

  1. From a UNESCO CD-ROM (version 2.70) or an FAO IMARK CD-ROM (which carries an earlier version, 2.51)

    These CD-ROMs contain the Greenstone software, plus documented example collections, four language interfaces (English, French, Spanish, and Russian), the Export to CD-ROM package, the ImageMagick graphics package, the Java runtime environment, and an installer that installs all of these.

  1. From the IITE Digital Libraries in Education CD-ROM, or a Greenstone workshop CD-ROM

    In addition to all the above software, these CD-ROMs contain the tutorial exercises and a set of sample files to be used for these exercises. CD-ROMs with Greenstone version 2.62 or earlier also include the Greenstone Language Pack, which gives reader's interfaces in many languages (currently about 40). This has its own installer which you have to invoke separately, after you have installed Greenstone. CD-ROMs with version 2.70 or later now come with reader's interfaces in all available languages. Textual images have been removed from the interface; they are now done using CSS (Cascading Style Sheets). The Greenstone Language Pack is no longer needed. Instead, these CD-ROMs come with the Classic Interface Pack, which contains the old text images for use with a backwards compatibility macro file.

    All these CD-ROMs contain the full Greenstone software, which allows you to view collections and build new ones. They are not the same as CD-ROMs that contain a pre-packaged Greenstone collection, which only allow you to view that collection.

  1. From http://www.greenstone.org

    Most people download the Windows distribution from http://www.greenstone.org, which contains the latest version of Greenstone. There are several optional modules that must be downloaded separately (to avoid a single massive download): documented example collections, the Export to CD-ROM package (Greenstone 2.70 and earlier), the Language Pack (Greenstone 2.62 and earlier) and Classic Interface Pack (Greenstone 2.63 and later). There is also the set of sample files used in these exercises. (To reduce the download size the documented example collections are distributed in unbuilt form and need to be built.)

    You need Java to run Greenstone. You might already have it; otherwise download it from http://java.sun.com. To work with image collections, you need ImageMagick (from http://www.imagemagick.org).

Most Greenstone CD-ROMs start the installation process as soon as they are inserted into the drive, assuming that the AutoPlay feature is enabled on your computer. If installation does not begin by itself, locate the file setup.exe and double click it to start the installation process. (On the IMARK CD-ROM this file resides in the folder software_tools → Greenstone). If you download Greenstone over the web, what you get is the installer—just double-click it.

If Greenstone has been installed on your computer before, you should completely remove the old version before installing a new one. (However, you need not remove any pre-packaged collections that you may have installed.) To do this, see Updating a Greenstone installation.

Here is what you need to do to install Greenstone. Older versions of the installer follow much the same sequence but use slightly different wording.

To invoke the Greenstone Reader's interface, go to the Greenstone Digital Library Software item under Programs on the Windows Start menu and select Greenstone Digital Library. To invoke the Librarian interface, go to the same item and select Greenstone Librarian Interface.

Installing ImageMagick on a Windows system

Once Greenstone has been installed, you should ensure that ImageMagick is installed on your computer if you wish to build any image collections. If you are installing from a Greenstone CD-ROM, you will be asked whether you want to install ImageMagick: say Yes. If you are not, you will need to download ImageMagick (from http://www.imagemagick.org). To install this program you must have Windows "Administrator" privileges. (If you do not have Windows Administrator privileges, the ImageMagick installer will give a cryptic error complaining that it failed to set a particular Windows registry value. If this happens you can continue your work with Greenstone, but you will not be able to build collections of images.)

The remaining steps are straightforward, and, as before, we recommend the default settings. Here is what you need to do.

Installing Ghostscript on a Windows system

If you wish to do advanced conversion of PDF and Postscript documents (as described in exercise Enhanced PDF handling), you will need to install Ghostscript. If you are installing from a Greenstone CD-ROM you will automatically be prompted for this; the procedure is analogous to that described above for ImageMagick. If not, you will need to download Ghostscript from http://www.cs.wisc.edu/~ghost/ (follow the link to the current stable release).

If you are not sure whether you will need Ghostscript or not, you might as well install it anyway—it will do no harm.



Updating a Greenstone installation

Prerequisite: Installing Greenstone
Devised for Greenstone version: 2.60
Modified for Greenstone version: 2.74

These tutorial exercises assume that you are using Greenstone 2.60 or above.

Before updating to a new version of Greenstone, ensure that the computer is not running the Greenstone Librarian Interface or the Greenstone local library server. Normally, quitting your web browser, or quitting the Librarian Interface, also quits the server.

Removing Greenstone from a Windows system

Completely remove the existing version before you install a new version of Greenstone.

  1. Ensure that you are not running Greenstone.

  1. Remove the old version by going to the Windows Control Panel (from the Settings item on the Start menu). Click Add or Remove Programs, select Greenstone Digital Library Software, and Remove it. (To do this you may need Windows "Administrator" privileges.)

  1. At the end of this procedure you will be asked whether you would like all your Greenstone collections to be removed: you should probably say No if you wish to preserve your work.

Occasionally, problems are encountered if older Greenstone installations are not fully removed. To clean up your system, move your Greenstone collect folder, which contains all your collections, to the desktop. Then check for the folder C:\Program Files\gsdl or C:\Program Files\Greenstone, which is where Greenstone is usually installed, and remove it completely if it exists.

Reinstalling Greenstone on a Windows system

  1. The reinstallation procedure is exactly the same as the original installation procedure, described in Installing Greenstone. If you already have ImageMagick, you do not need to install it again.

There have been some superficial changes to the installation procedure in moving to Greenstone Version 2.60, because it uses a different installer program.

There is another important difference that you should be aware of: Versions 2.60 and above are installed in the folder Program Files\Greenstone, whereas prior versions were placed in the folder Program Files\gsdl (both are default locations that you could have changed during installation). When upgrading to Version 2.60, if you want to save existing collections you must explicitly move the contents of your collect folder from the old place to the new one. Future Greenstone versions will be installed in the new place, Program Files\Greenstone, so this problem will not happen again.

Amalgamating different Greenstone collections

  1. If you have previously installed the Greenstone Digital Library software in a non-standard place, you should amalgamate your collections by moving them from the collect folder in the old place into the folder Program Files\Greenstone\collect.

  1. If you have installed collections from pre-packaged Greenstone CD-ROMs, they reside in a different place: C:\GSDL\collect. To amalgamate these with your main Greenstone installation, move them into the folder Program Files\Greenstone\collect. The mini version of Greenstone that is associated with the pre-packaged collections is no longer necessary. To uninstall it, select Uninstall on the Greenstone menu of the Windows Start menu.
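For readers who prefer to script the move, the amalgamation steps above can be sketched in Python. This is only an illustration under stated assumptions: the function name amalgamate and the collection folder names are hypothetical, and the demonstration runs on throwaway temporary folders rather than a real installation. Collections whose names already exist in the destination are left where they are so that nothing is overwritten.

```python
import shutil
import tempfile
from pathlib import Path


def amalgamate(old_collect: Path, new_collect: Path) -> list:
    """Move each collection folder from old_collect into new_collect.

    A collection whose name already exists in new_collect is left in place,
    so nothing is overwritten. Returns the names that were moved.
    """
    new_collect.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() builds the full list first, so moving while looping is safe.
    for item in sorted(old_collect.iterdir()):
        target = new_collect / item.name
        if item.is_dir() and not target.exists():
            shutil.move(str(item), str(target))
            moved.append(item.name)
    return moved


# Demonstration on temporary folders; all collection names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "GSDL" / "collect"
    new = Path(tmp) / "Greenstone" / "collect"
    (old / "unaids").mkdir(parents=True)
    (old / "dlie").mkdir()
    (new / "demo").mkdir(parents=True)

    moved = amalgamate(old, new)
    remaining = sorted(p.name for p in new.iterdir())
    print("moved:", moved)
    print("now in new collect folder:", remaining)
```

On a real system you would pass the actual folders, for example C:\GSDL\collect and Program Files\Greenstone\collect, and make sure Greenstone is not running first.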

Installing the Greenstone language pack (2.62 and earlier)

If you go to the Preferences page of any Greenstone collection, and look at the Interface language menu, you will probably find that only English, Spanish, French and Russian interfaces are installed.

  1. Locate the Greenstone Language Pack (glp-x.xx.exe/glp-x.xx-linux.bin/glp-x.xx-macOSx.command). This may be on the CD-ROM from which you installed Greenstone, or you may have to download it from http://www.greenstone.org.

  1. Run the executable file (double click it on Windows); this will start the installer. Accept all the defaults.

  1. Restart the Greenstone Digital Library and look at the interface language menu again. Now you should see about 40 different languages.

Enabling other languages (2.63 and later)

If you have downloaded Greenstone from the web, then all the languages will be enabled by default. However, if you have installed Greenstone from a UNESCO CD-ROM, then only English, French, Spanish and Russian will be enabled.

  1. To enable a new language, edit the file greenstone → etc → main.cfg. Look for the appropriate "Language" line, and uncomment it (i.e. remove the # from the start). Check that the required encoding is also enabled.

    For example, suppose that we want to enable Turkish. The "Language" line for Turkish looks like:

    #Language shortname=tr longname=Turkish default_encoding=windows-1254

    To enable it, we remove the #, i.e. make it look like:

    Language shortname=tr longname=Turkish default_encoding=windows-1254

    The default encoding for Turkish is windows-1254. So we look for the windows-1254 Encoding line:

    Encoding shortname=windows-1254 "longname=Turkish (Windows-1254)" map=win1254.ump

    This is already enabled (no # at the start) so we don't need to do anything else.
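If you need to enable languages on several machines, the same edit can be scripted. The following Python sketch works on the text of main.cfg rather than on the file itself. The function name enable_language is hypothetical, and the sample text contains only the two lines quoted above, not a full configuration file.

```python
import re


def enable_language(cfg_text: str, shortname: str) -> str:
    """Uncomment the 'Language' line for the given shortname, if it is commented out."""
    pattern = re.compile(
        r"^[ \t]*#[ \t]*(Language[ \t]+shortname=" + re.escape(shortname) + r"\b.*)$",
        re.MULTILINE,
    )
    # Keep everything after the leading '#', drop the '#' itself.
    return pattern.sub(r"\1", cfg_text)


# The two lines quoted in the exercise; a real main.cfg contains many more.
sample = (
    "#Language shortname=tr longname=Turkish default_encoding=windows-1254\n"
    'Encoding shortname=windows-1254 "longname=Turkish (Windows-1254)" map=win1254.ump\n'
)

updated = enable_language(sample, "tr")
print(updated)
```

Running the function a second time leaves the text unchanged, so it is safe to apply to a file in which the language is already enabled. Make a backup copy of main.cfg before writing any changes back.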

Installing the Classic Interface Pack (2.63 and later)

Greenstone now comes with all languages enabled. The generated HTML uses text and CSS rather than images for the navigation bar and for the home, help, and preferences buttons. The Classic Interface Pack is not needed if you want to use Greenstone in another language. It is only needed if you want to revert to the old-style HTML with text images. This may be useful if you have customized your Greenstone, or if you require compatibility with Netscape 4.

  1. Locate the Classic Interface Pack (gcip-x.xx.zip). This may be on the CD-ROM from which you installed Greenstone, or you may have to download it from http://www.greenstone.org.

  1. The classic interface pack is a zip file containing the old text images, such as classifier buttons. Unzip the zip file into the images directory of your Greenstone installation.

  1. Enable the use of the old-style macros by editing greenstone → etc → main.cfg: replace "nav_css.dm" with "nav_ns4.dm" in the "macrofiles" list.

  1. Restart the Greenstone Digital Library. It should now be using the old text images.
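The macro file substitution in the third step can also be done with a short script. This Python sketch operates on the text of main.cfg. The function name use_classic_macros is hypothetical, and the sample "macrofiles" entry is a shortened stand-in for illustration, not a copy of the real list.

```python
import re


def use_classic_macros(cfg_text: str) -> str:
    """Replace nav_css.dm with nav_ns4.dm wherever it appears as a whole filename."""
    return re.sub(r"(?<![\w.-])nav_css\.dm(?![\w.-])", "nav_ns4.dm", cfg_text)


# Hypothetical, shortened macrofiles entry; a real main.cfg lists many more files.
sample = "macrofiles base.dm style.dm \\\n           nav_css.dm home.dm\n"

updated = use_classic_macros(sample)
print(updated)
```

The lookaround checks keep the replacement from touching a longer filename that merely contains nav_css.dm.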



Building a small collection of HTML files

Sample files: simple_html.zip
Devised for Greenstone version: 2.60
Modified for Greenstone version: 2.74

You will need some HTML files, such as those in the simple_html folder in sample_files.

Running the Greenstone Librarian Interface

  1. Start the Greenstone Librarian Interface:

    Start → All Programs → Greenstone Digital Library Software v2.74 → Greenstone Librarian Interface

    After a short pause a startup screen appears, and then after a slightly longer pause the main Greenstone Librarian Interface appears. (A command prompt is also opened in the background.)

Starting a new collection

  1. Start a new collection within the Librarian Interface:

    File → New...

  1. You will create a collection based on a few HTML web pages from the Tudor collection.

    A window pops up. Fill it out with appropriate values—for example,

    Collection title: Small HTML Collection
    Description of content: A small collection of HTML pages.

    Leave the setting for Base this collection on: at its default: -- New Collection --, and click <OK>.

  1. Next you must gather together the files that will constitute the collection. A suitable set has been prepared ahead of time in sample_files → simple_html. Using the left-hand side of the Librarian Interface's Gather panel, interactively navigate to the sample_files folder.

Adding documents to the collection

  1. Now drag the simple_html folder from the left-hand side and drop it on the right. The progress bar at the bottom shows some activity. Gradually, duplicates of all the files will appear in the collection panel.

    You can inspect the files that have been copied by double-clicking on the folder in the right-hand side.

  1. Since this is our first collection, we won't complicate matters by manually assigning metadata or altering the collection's design. Instead we rely on default behaviour. So pass directly to the Create panel by clicking its tab.

Building the collection

  1. To start building the collection, click the <Build Collection> button.

  1. Once the collection has built successfully, a window pops up to confirm this. Click <OK>.

  1. Click the <Preview Collection> button to look at the end result. This loads the relevant page into your web browser (starting it up if necessary).

Viewing the extracted metadata

  1. Back in the Librarian Interface, click the Enrich tab to view the metadata associated with the documents in the collection.

  1. Presently there is no manually assigned metadata, but the act of building the collection has extracted metadata from the documents. Double click the simple_html folder to expand its content. Then single-click aragon.html to display all its metadata in the right-hand side of the panel. The initial fields, starting "dc.", are empty. These are Dublin Core metadata fields for manually entered data.

  1. Use the scroll bar on the extreme right to view the bottom part of the list. There you will see fields starting "ex." that express the extracted metadata: for example ex.Title, based on the text within the HTML Title tags, and ex.Language, the document's language (represented using the ISO standard 2-letter mnemonic) which Greenstone determines by analyzing the document's text.

  1. Close the collection by clicking File → Close. This automatically saves the collection to disk.
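To get a feel for where a value like ex.Title comes from, here is a minimal Python sketch that pulls the text between the HTML Title tags out of a page. It is an illustration of the idea only, not Greenstone's own extraction code. The class and function names, and the sample page, are made up for this example.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text that appears between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html: str) -> str:
    """Return the page title with runs of whitespace collapsed, or '' if none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.title_parts).split())


# Made-up sample page for illustration.
page = "<html><head><title>Katharine of\n Aragon</title></head><body><p>...</p></body></html>"
print(extract_title(page))
```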

Viewing the internal links and external links

  1. Hyperlinks in a Greenstone collection work like this. If the link is to a document that is also in the collection, clicking it takes you to that document in the collection. If the link is to a document that is not in the collection, clicking it takes you to that document on the web.

    Open boleyn.html and look for the link to Katharine of Aragon (in the 5th paragraph of the Biography section). This links to a document inside the collection, aragon.html. View this document by clicking the link. For an external link, click letters written by Katharine (in the Primary Sources section). This takes you out on to the web. (A warning message is displayed first. You can get rid of the warning by adding the line cgiarg shortname=el argdefault=direct to the Greenstone → etc → main.cfg file.)
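The internal/external rule described above can be expressed as a small decision function. The Python sketch below is a deliberate simplification for illustration and does not reproduce how Greenstone resolves links internally. The function name classify_link and the example.org address are hypothetical.

```python
from urllib.parse import urlparse


def classify_link(href: str, collection_files: set) -> str:
    """Return 'internal' if href names a file in the collection, else 'external'."""
    parsed = urlparse(href)
    # Simplification: any link with a web address is treated as external.
    if parsed.scheme in ("http", "https") or parsed.netloc:
        return "external"
    filename = parsed.path.rsplit("/", 1)[-1]
    return "internal" if filename in collection_files else "external"


# Two of the files from the simple_html sample folder.
collection_files = {"aragon.html", "boleyn.html"}

for href in ["aragon.html", "http://www.example.org/letters.html"]:
    print(href, "->", classify_link(href, collection_files))
```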

Setting up a shortcut in the Librarian interface

  1. To set up a shortcut to the source files, in the Gather panel navigate to the folder in your local file space that contains the files you want to use—in our case, the sample_files folder. Select this folder and then right-click it, and choose Create Shortcut from the menu. In the Name field, enter the name you want the shortcut to have, or accept the default sample_files. Click <OK>. Close all the folders in the file tree in the left-hand pane, and you will see the shortcut to your source files.



A simple image collection

Sample files: images.zip
Devised for Greenstone version: 2.60
Modified for Greenstone version: 2.74

  1. In the Librarian Interface, start a new collection (File → New...) called backdrop. Fill out the fields with appropriate information. For Base this collection on:, select the item Simple image collection (image-e) from the pull-down menu.

    When you base a collection on an existing one, it inherits all the settings of the old one, including which metadata sets (if any) the collection uses.

  1. Copy the images provided in sample_files → images into your newly-formed collection.

  1. Change to the Create panel and build the collection.

  1. Preview the result.

  1. Click on Browse in the navigation bar to view a list of the photos ordered by filename and presented as a thumbnail accompanied by some basic data about the image. The structure of this collection is the same as Simple image collection (image-e), but the content is different.

  1. Back in the Librarian Interface, change to the Enrich panel and view the extracted metadata for Bear.jpg.

Adding Title and Description metadata

  1. We work with just the first three files (Bear.jpg, Cat.jpg and Cheetah.jpg) to get a flavour of what is possible. First, set each file's dc.Title field to be the same as its filename but without the filename extension:

    Click on Bear.jpg so its metadata fields are available, then click on its dc.Title field on the right-hand side. Type in Bear.

    Repeat the process for Cat.jpg and Cheetah.jpg.

  1. Add a description for each image as dc.Description metadata.

    What description should you enter? To remind yourself of a file's content, the Librarian Interface lets you open files by double-clicking them. It launches the appropriate application based on the filename extension: Word for .doc files, Acrobat for .pdf files, and so on.

    Double-click Bear.jpg: on Windows, the image will normally be displayed by Microsoft's Photo Editor (although this depends on how your computer has been set up).

    Back in the Enrich panel, make sure that Bear.jpg is selected in the collection tree on the left-hand side. Enter the text Bear in the Rocky Mountains as the value for the dc.Description field.

    Repeat this process for Cat.jpg and Cheetah.jpg, adding a suitable description for each.

  1. Go to the Create panel and click <Build Collection>. Once it has finished building, preview the collection. You will not notice anything new. That's because we haven't changed the design of the collection to take advantage of the new metadata.

Change Format Features to display new metadata

  1. Now we customize the collection's appearance. Go to the Format panel and select Format Features from the left-hand list. Leave the feature selection controls at their default values, so that All Features is selected for Choose Feature, and VList is selected as the Affected Component. In the HTML Format String, edit the text as follows:

    • Change _ImageName_: to Title:
    • Change [Image] to [dc.Title]
    • After [dc.Title]<br> add Description: [dc.Description]<br>

    Metadata names are case-sensitive in Greenstone: it is important that you capitalize "Title" and "Description" (and don't capitalize "dc").

  1. The new format statement is displayed in the list of assigned format statements. The first substitution alters the fragment of text that appears to the right of the thumbnail image, the second alters the item of metadata that follows it. The addition displays the description after the Title.

  1. Preview the collection by clicking the <Preview Collection> button. When you click on Browse in the navigation bar the presentation has changed to "Title: Bear" and so on. Each image's description should appear beside the thumbnail, following the title.

After the first three items, the Title and Description become blank because we have only assigned Dublin Core metadata to these first three. To get a full listing, enter all the metadata.
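
To see why the later entries come out blank, it can help to think of a format statement as a template whose bracketed names are filled in from each item's metadata. The Python sketch below is a simplified model, not Greenstone's renderer: the render function and the empty metadata record are assumptions for illustration, and it ignores {If} and {Or}.

```python
import re

def render(format_string, metadata):
    """Replace each [name] placeholder with its metadata value, or '' if unset."""
    return re.sub(
        r"\[([^\]]+)\]",
        lambda match: metadata.get(match.group(1), ""),
        format_string,
    )

# Simplified version of the edited format string from this exercise.
fmt = "Title: [dc.Title]<br>Description: [dc.Description]<br>"

bear = {"dc.Title": "Bear", "dc.Description": "Bear in the Rocky Mountains"}
unassigned = {}  # hypothetical image with no Dublin Core metadata yet

print(render(fmt, bear))
print(render(fmt, unassigned))
```

The second call leaves the labels in place but fills nothing in, which matches the blank Title and Description seen for images that have no dc metadata.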

Changes in the Format panel take effect immediately, and you can see the result straight away by clicking <Preview Collection>. If you modify anything in the Gather, Enrich or Design panels, you will need to rebuild the collection.

Changing the size of image thumbnails

  1. Let's change the size of the thumbnail image and make it smaller. Thumbnail images are created by the ImagePlugin plug-in, so we need to access its configuration settings. To do this, switch to the Design panel and select Document Plugins from the list on the left. Double-click ImagePlugin to pop up a window that shows its settings. (Alternatively, select ImagePlugin with a single click and then click <Configure Plugin...> further down the screen.) Currently all options are off, so standard defaults are used. Select thumbnailsize, set it to 50, and click <OK>.

  1. Build and preview the collection.

  1. Once you have seen the result of the change, return to the Design panel, select the configuration options for ImagePlugin, and switch the thumbnailsize option off so that the thumbnail reverts to its normal size when the collection is re-built.

Adding a browsing classifier based on Description metadata

  1. Now we'll add a new browsing option based on the descriptions. In the Design panel, select Browsing Classifiers from the left-hand list. Set the menu item for Select classifier to add to AZList; then click <Add Classifier...>.

  1. A window pops up to control the classifier's options. Set the metadata option to dc.Description and click <OK>.

  1. Build the collection, and preview it. Choose the new Descriptions link that appears in the navigation bar.

Only three items are shown, because only items with the relevant metadata (dc.Description in this case) appear in the list. The original browse list includes all photos in the collection because it is based on ex.Image, extracted metadata that reflects an image's filename, which is set for all images in the collection.
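
The difference between the two browse lists can be modelled as a filter on which items carry the chosen metadata. The following Python sketch is an assumption-laden illustration, not the AZList code: the browse_list function, the Other.jpg entry, and the Cat.jpg description are invented for the example.

```python
def browse_list(documents, field):
    """Return file names that have `field`, ordered by that field's value."""
    entries = [
        (metadata[field], name)
        for name, metadata in documents.items()
        if metadata.get(field)  # items without the metadata are left out
    ]
    return [name for _, name in sorted(entries)]

# Hypothetical metadata records; only the Bear description is from the exercise.
documents = {
    "Bear.jpg": {"ex.Image": "Bear.jpg", "dc.Description": "Bear in the Rocky Mountains"},
    "Cat.jpg": {"ex.Image": "Cat.jpg", "dc.Description": "Cat on a sofa"},
    "Other.jpg": {"ex.Image": "Other.jpg"},  # no description assigned
}

print(browse_list(documents, "dc.Description"))
print(browse_list(documents, "ex.Image"))
```

Browsing by dc.Description lists only the described images, while browsing by ex.Image lists every image because that metadata is extracted for all of them.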

Creating a searchable index based on Description metadata

  1. Now we'll add an index so that the collection can be searched by descriptions. Switch to the Design panel and select Search Indexes from the left-hand list. Click the <New Index> button. Select dc.Description from the list of metadata to include in the index, leave Indexing level: at its default, "document", and click <Add Index>.

  1. Switch to the Create panel, build the collection, then preview it. There is now a Search button in the navigation bar. As an example, search for the term "bear" in the document:dc.Description index (which is the only index at this point).

  1. To change the text that is displayed for the index (document:dc.Description), go to the Format panel back in the Librarian Interface. Select Search from the left-hand list. This panel allows you to change the text that is displayed on the search form. Change the Display text for the document:dc.Description index to "descriptions" (or other suitable text). Go back to the browser and reload the search page. Your new text will appear in the search form.



A collection of Word and PDF files

Sample files: Word_and_PDF.zip
Devised for Greenstone version: 2.60
Modified for Greenstone version: 2.74

You will need some source files like those in the sample_files → Word_and_PDF folder.

  1. Start a new collection called reports (File → New...) and base it on -- New Collection --.

  1. Copy all the files from sample_files → Word_and_PDF → Documents into the collection. You can select multiple files by clicking on the first one and shift-clicking on the last one, and drag them all across together. (This is the normal technique of multiple selection.)

  1. Switch to the Create panel, and build and preview the collection.

Viewing the extracted metadata

  1. Again, this collection contains no manually assigned metadata. All the information that appears—title and filename—is extracted automatically from the documents themselves. Because of this the quality of some of the title metadata is suspect.

  1. Back in the Librarian Interface, click the Enrich tab to view the automatically extracted metadata. You will need to scroll down to see the extracted metadata, which begins with "ex.".

  1. Check whether the ex.Title metadata is correct for some of the documents by opening them. You can open a document from the Librarian Interface by double clicking on it.

  1. The extracted Title metadata for some documents is incorrect. For example, the Titles for pdf01.pdf and word03.doc (the same document in different formats) have missed out the second line. The Title for pdf03.pdf has the wrong text altogether.

Manually adding metadata to documents in a collection

  1. In the Enrich panel, manually add Dublin Core dc.Title metadata to those documents which have incorrect ex.Title metadata. Select word03.doc and double-click to open it. Copy the title of this document ("Greenstone: A comprehensive open-source digital library software system") and return to the Librarian Interface. Scroll up or down in the metadata table until you can see dc.Title. Click in the value box and paste in the metadata.

  1. Now add dc.Creator information for the same document. You can add more than one value for the same field: when you press Enter in a metadata value field, a new empty field of the same type will be generated. Add each author separately as dc.Creator metadata.

  1. Close the document (in Microsoft Word) when you have finished copying metadata from it. External programs opened when viewing documents must be closed before building the collection, otherwise errors can occur.

  1. Next add dc.Title and dc.Creator metadata for a few of the other documents.

  1. You will notice as you add more values, they appear in the Existing values for ... box below the metadata table. If you are adding the same metadata value to more than one document, you can select it from this list. For example, pdf01.pdf and word03.doc share the same Title; and many documents have common authors.

If you build and preview your collection at this point, you will see that the Titles list now shows your new Titles. However, the dc.Creator metadata is not displayed. You need to alter the collection design to use this metadata.

Document Plugins

  1. In the Librarian Interface, look at the Document Plugins section of the Design panel, by clicking on this in the list to the left. Here you can add, configure or remove plugins to be used in the collection. There is no need to remove any plugins, but doing so will speed up processing a little. In this case we have only Word, PDF, RTF, and PostScript documents, and can remove the ZIPPlugin, TEXTPlugin, HTMLPlugin, EmailPlugin, ImagePlugin, ISISPlugin and NulPlugin plugins. To delete a plugin, select it and click <Remove Plugin>. GreenstoneXMLPlugin and MetadataXMLPlugin are required for any type of source collection and should not be removed.

Search indexes

  1. The next step in the Design panel is Search Indexes. These specify what parts of the collection are searchable (e.g. searching by title and author). Delete the ex.Source index, which is not particularly useful, by selecting it and clicking <Remove Index>.

  1. Modify the ex.Title index to include dc.Title by selecting the index in the Assigned Indexes box and clicking <Edit Index>. Select dc.Title from the list of metadata, and click <Replace Index>. Searching this index will search both dc.Title and ex.Title metadata. If you want to restrict searching to just the manually added dc.Title metadata, edit the index again and deselect ex.Title from the list of metadata.

  1. You can add indexes based on any metadata. Add a new index based on dc.Creator by clicking <New Index>. Select dc.Creator in the list of metadata, and click <Add Index>.

Browsing classifiers

  1. The Browsing Classifiers section adds "classifiers," which provide the collection with browsing functions. Go to this section and observe that Greenstone has provided two classifiers, AZLists based on ex.Title and ex.Source metadata. These correspond to the Titles and Filenames buttons on the collection's access bar.

    Remove the ex.Source classifier by selecting it and clicking <Remove Classifier>.

  1. Modify the ex.Title classifier to use dc.Title instead. Select the classifier and click <Configure Classifier...>. In the metadata box, select dc.Title instead of ex.Title. Click <OK>.

  1. Now add an AZCompactList classifier for dc.Creator. Select AZCompactList from the Select classifier to add drop-down list and click <Add Classifier...>. A popup window Configuring Arguments appears. Select dc.Creator from the metadata drop-down list and click <OK>.

    AZCompactList is like AZList, except that values that appear multiple times in the hierarchy are automatically grouped together and a new node, shown as a bookshelf icon, is formed.

  1. Switch to the Create panel, and build and preview the collection.

  1. Check that all the facilities work properly. There should be three full-text indexes, called text, dc.Title, and dc.Creator. The Titles list should display all the documents to which you have assigned dc.Title metadata (and only those documents). The Creators list should show one bookshelf for each author you have assigned as dc.Creator, and clicking on that bookshelf should take you to all the documents they authored.

Renaming the search indexes

  1. The default display text for the indexes in the drop-down list on the search page contains the content of the index. Now we will change this display text to make it nicer. Go to the Format panel by clicking its tab. This panel is split into several sections, each controlling some aspect of collection presentation.

  1. Select Search in the left-hand list. This section allows you to modify what text is displayed for the drop-down lists in the search form (indexes, subcollections, levels etc). Set the Display text for the dc.Title,Title index to be "titles", and that for the dc.Creator index to be "creators". Preview the collection by clicking <Preview Collection>. The search form should display the new text.

Classifying on multiple metadata

  1. The new Titles list shows only those documents which have been assigned dc.Title metadata. For many documents, extracted Titles may be fine, and it is impractical to add the same metadata again as dc.Title. Fortunately there is a way we can use both metadata types in one classifier: specify a list of metadata names in the classifier.

  1. In the Browsing Classifiers section of the Design panel, select the AZList for dc.Title in the Assigned Classifiers box and click <Configure Classifier...>. Note you can achieve the same result by double clicking on the classifier.

  1. In the metadata field, type ",ex.Title" after the "dc.Title"—i.e. make it read

    dc.Title,ex.Title

  1. If you have already done the Enhanced Word document handling exercise, some of the documents will have extracted ex.Creator metadata, and some will have dc.Creator. To use both of these in the Creators classifier, make a similar change to the AZCompactList: make the metadata field read dc.Creator,ex.Creator.

    Build the collection again and preview it. Now all of the documents should appear in the Titles list (and extracted Creators should appear in the Creators list).
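
The effect of listing several metadata names in a classifier can be sketched as follows. This is a simplified Python model under stated assumptions, not the classifier's real code: the titles_list function, the other.doc entry, and the extracted title strings are hypothetical.

```python
def titles_list(documents, names):
    """List (title, file) pairs, using the first listed metadata name each file has."""
    shown = []
    for filename, metadata in documents.items():
        for name in names.split(","):
            if metadata.get(name):
                shown.append((metadata[name], filename))
                break  # earlier names in the list take priority
    return sorted(shown)

# Hypothetical records: word03.doc has a manual title, other.doc only an extracted one.
documents = {
    "word03.doc": {
        "dc.Title": "Greenstone: A comprehensive open-source digital library software system",
        "ex.Title": "Greenstone: A comprehensive",
    },
    "other.doc": {"ex.Title": "An extracted title"},
}

print(titles_list(documents, "dc.Title"))
print(titles_list(documents, "dc.Title,ex.Title"))
```

With dc.Title alone only word03.doc is listed; with dc.Title,ex.Title both files appear, and word03.doc still uses its manually assigned title.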

We will play around with the format statements and customize the outlook of this collection in the Formatting the Word and PDF collection exercise.



Formatting the Word and PDF collection

Prerequisite: A collection of Word and PDF files
Devised for Greenstone version: 2.70w
Modified for Greenstone version: 2.74

In this exercise, we play around with the format statements in the Word and PDF collection.

  1. Open the reports collection in the Librarian Interface and go to the Format Features section of the Format panel.

Tidying up the default format statement

  1. In this part of the exercise, we make the format statement simpler without changing the resulting display.

    Greenstone's default format statement is complex because it is designed to produce something reasonable under almost any conditions, and also because for practical reasons it needs to be backwards compatible with legacy collections. For this collection, we don't need all of the complexity.

    Make sure that the VList format statement is selected in the list of formats.

    The default VList format statement looks like the following:

    <td valign="top">[link][icon][/link]</td>
    <td valign="top">[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td>
    <td valign="top">[highlight]
    {Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
    [/highlight]{If}{[ex.Source],<br><i>([ex.Source])</i>}</td>

    This format statement is the default used for any vertical list, such as search results, classifiers, and document table of contents.

    {Or}{[ex.thumbicon],[ex.srcicon]} chooses ex.thumbicon metadata if it's there, otherwise chooses ex.srcicon metadata. If neither is present, nothing is displayed. For this collection there is no ex.thumbicon metadata so the choice is not needed.

    Replace {Or}{[ex.thumbicon],[ex.srcicon]} with [ex.srcicon].

    There is no exp.Title metadata, so remove that element from {Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}.

    The resulting format statement looks like the following:

    <td valign=top>[link][icon][/link]</td>
    <td valign=top>[ex.srclink][ex.srcicon][ex./srclink]</td>
    <td valign=top>[highlight]
    {Or}{[dc.Title],[ex.Title],Untitled} [/highlight] {If}{[ex.Source],<br><i>([ex.Source])</i>}</td>

    Preview the collection to make sure the display hasn't changed. You shouldn't notice any difference when looking at search results, classifiers etc.

Linking to Greenstone version or original version of documents

  1. For collections with documents that undergo a conversion process during importing (e.g. Word, PDF, PowerPoint documents, but not text, HTML documents), the original file is stored in the collection along with the converted version. The default VList format statement links to both versions:

    [link][icon][/link] links to the Greenstone HTML version, while [ex.srclink][ex.srcicon][ex./srclink] links to the original.

    Choose SearchVList in Format Features by selecting Search from the Choose Feature drop down list, and VList from the Affected Component list. Click <Add Format> to add the SearchVList format statement into the list of assigned formats. Experiment with removing either of the two links from the format statement.

    To see the results of your changes, preview the collection and do a search. You are making changes to SearchVList, which means the changes will only apply to search results.

    Storing and displaying the original allows users to see the correct format, but requires the user to have the relevant program installed. It also increases the size of the collection. The Greenstone version can be viewed in a browser, but may not look as nice.

Making bookshelves show how many items they contain

  1. Next, we'll customize the format for the Creators list. Classifier bookshelves have only a few pieces of metadata to display: [ex.Title] and [numleafdocs]. Whatever metadata the classifier has been built on, the bookshelf label is always stored as [ex.Title]. This is why a Creator is printed out for each bookshelf even though [dc.Creator] is not specified in the format statement. [numleafdocs] is only defined for bookshelves, so this metadata can be used in an {If} statement to make bookshelves and documents display differently in the list.

    Make each bookshelf in the Creator classifier show how many entries it contains. In the Format Features section of the Format panel, select the CL2 AZCompactList classifier which is based on dc.Creator metadata from the Choose Feature drop down list, and VList from the Affected Component list. Click the <Add Format> button to add this format into the list of assigned formats. Note that it gets added as CL2VList in this list: it is the VList format for the second (CL2) classifier.

    Append the following text to the bottom of the format statement:

    {If}{[numleafdocs],<td><i>([numleafdocs])</i></td>}

    Preview the collection. Click on the Creators list and notice that the bookshelves now display how many documents they contain.

    This revised format statement has the effect of specifying in brackets how many items are contained within a bookshelf. Since only bookshelves define [numleafdocs], only they will display this. By modifying CL2VList instead of VList, the change will only apply to the second classifier (Creators).
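
The {If} test on [numleafdocs] behaves like an ordinary conditional. The Python sketch below mimics it for a single list entry; it is an illustration only, and the count_cell function and the two sample records are assumptions rather than anything Greenstone defines.

```python
def count_cell(node):
    """Mimic {If}{[numleafdocs],<td><i>([numleafdocs])</i></td>} for one list entry."""
    count = node.get("numleafdocs")
    if count:
        return f"<td><i>({count})</i></td>"
    return ""  # documents have no numleafdocs, so nothing is appended

# Hypothetical entries: a bookshelf carries numleafdocs, a document does not.
bookshelf = {"ex.Title": "Some Author", "numleafdocs": 3}
document = {"ex.Title": "Some Report", "ex.Source": "word03.doc"}

print(repr(count_cell(bookshelf)))
print(repr(count_cell(document)))
```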

Displaying multi-valued metadata

  1. Next we modify the document entries in the Creator classifier to display all authors. Back in Format Features, select the CL2VList format in the list of assigned formats. After {If}{[ex.Source],<br> in the format statement, add [sibling:dc.Creator].

    [ex.Source] is not defined for bookshelves, so can also be used to differentiate bookshelves and documents.

    The resulting format statement looks like:

    <td valign=top>[link][icon][/link]</td>
    <td valign=top>[ex.srclink][ex.srcicon][ex./srclink]</td>
    <td valign=top>[highlight]
    {Or}{[dc.Title],[ex.Title],Untitled}[/highlight]
    {If}{[ex.Source],<br>[sibling:dc.Creator]
    <i>([ex.Source])</i>}</td>
    {If}{[numleafdocs],<td><i>([numleafdocs])</i></td>}

    This will display the Greenstone link, the link to the original, then the Title. For bookshelves, it will also display how many documents the bookshelf contains. For documents, it will display all the Authors (Creators), and the source document. [sibling:dc.Creator] displays all the Creator metadata for the document, separated by a space (" "), while [dc.Creator] displays only the first author. Preview the Creators list and make sure that all authors are displayed for documents.

  1. You can change the separator between the authors. Modify the format statement, and replace [sibling:dc.Creator] with [sibling(All'<br/>'):dc.Creator]. This will add a new line after each author (<br/> specifies a line break in HTML). Preview the Creators list.

    If you have done exercise Enhanced Word document handling, the collection will have both dc.Creator and ex.Creator metadata. To display both, you can use

    [sibling:dc.Creator] [sibling:ex.Creator]

    To display dc.Creator if it is present, otherwise display ex.Creator, use

    {Or}{[sibling:dc.Creator],[sibling:ex.Creator]}
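
The difference between [dc.Creator], [sibling:dc.Creator] and the {Or} fallback can be pictured with a short Python sketch. It is a rough model, not Greenstone's code: the helper names first_value and sibling and the author names are invented for illustration.

```python
def first_value(metadata, name):
    """Like [dc.Creator]: only the first value of a multi-valued field."""
    values = metadata.get(name, [])
    return values[0] if values else ""

def sibling(metadata, name, separator=" "):
    """Like [sibling:name]: all values of the field, joined by a separator."""
    return separator.join(metadata.get(name, []))

# Hypothetical document with three manually assigned authors.
doc = {"dc.Creator": ["Author One", "Author Two", "Author Three"]}

print(first_value(doc, "dc.Creator"))
print(sibling(doc, "dc.Creator"))
print(sibling(doc, "dc.Creator", "<br/>"))

# Like {Or}{[sibling:dc.Creator],[sibling:ex.Creator]}: use dc if present, else ex.
print(sibling(doc, "dc.Creator") or sibling(doc, "ex.Creator"))
```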

Advanced multi-valued metadata

  1. You may notice that AZCompactList has two options after the metadata option: firstvalueonly and allvalues. Manually added metadata can be used to replace or enhance automatically extracted metadata, and these options control exactly which pieces of metadata a document is classified by.

    For example, say we have two documents. Document 1 has four Creators specified (dc.Creator = dcA, dc.Creator = dcB, ex.Creator = exA, ex.Creator = exB), while document 2 has three (ex.Creator = exA, ex.Creator = exB, ex.Creator = exC). The following table shows which metadata values each document is classified by, for the different classifier options:

    AZCompactList options                              Document 1          Document 2
    -metadata dc.Creator,ex.Creator                    dcA, dcB            exA, exB, exC
    -metadata dc.Creator,ex.Creator -firstvalueonly    dcA                 exA
    -metadata dc.Creator,ex.Creator -allvalues         dcA, dcB, exA, exB  exA, exB, exC

  1. Now we set the firstvalueonly option for the Creators classifier. Switch to the Browsing Classifiers section of the Design panel, select the AZCompactList for dc.Creator metadata in the Assigned Classifiers box and click <Configure Classifier...>. Select the firstvalueonly option.

    Rebuild and preview the collection. Now the Creators list classifies documents based on the first author appearing in the dc.Creator metadata.

    If you set the metadata field of AZCompactList to dc.Creator,ex.Creator in the A collection of Word and PDF files exercise, now the Creators list will classify based on the first author appearing in either the dc.Creator metadata or the ex.Creator metadata.
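
The table of classifier options above can be reproduced with a small Python model. This sketch only encodes the behaviour the table describes; the function name classification_values and the data layout are assumptions, and it is not taken from the AZCompactList source.

```python
def classification_values(metadata, names, firstvalueonly=False, allvalues=False):
    """Return the values a document is classified by.

    metadata: list of (name, value) pairs in document order.
    names: metadata names in the order given to -metadata.
    """
    if allvalues:
        # Every value of every listed name.
        return [v for n in names for (k, v) in metadata if k == n]
    for n in names:
        values = [v for (k, v) in metadata if k == n]
        if values:
            # Only the first listed name that has any values is used.
            return values[:1] if firstvalueonly else values
    return []

names = ["dc.Creator", "ex.Creator"]
doc1 = [("dc.Creator", "dcA"), ("dc.Creator", "dcB"),
        ("ex.Creator", "exA"), ("ex.Creator", "exB")]
doc2 = [("ex.Creator", "exA"), ("ex.Creator", "exB"), ("ex.Creator", "exC")]

for doc in (doc1, doc2):
    print(classification_values(doc, names))
    print(classification_values(doc, names, firstvalueonly=True))
    print(classification_values(doc, names, allvalues=True))
```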



Enhanced PDF handling

Sample files: Word_and_PDF.zip
Devised for Greenstone version: 2.70
Modified for Greenstone version: 2.74
Greenstone converts PDF files to HTML using third-party software: pdftohtml.pl. This lets users view these documents even if they don't have the PDF software installed. Unfortunately, sometimes the formatting of the resulting HTML files is not so good.

This exercise explores some extra options to the PDF plugin which may produce a nicer version for display. Some of these options use the standard pdftohtml program, others use ImageMagick and Ghostscript to convert the file to a series of images. Ghostscript is a program that can convert PostScript and PDF files to other formats. You can download it from http://www.cs.wisc.edu/~ghost/ (follow the link to the current stable release).

  1. In the Librarian Interface, start a new collection called "PDF collection" and base it on -- New Collection --.

    In the Gather panel, drag just the PDF documents from sample_files → Word_and_PDF → Documents into the new collection. Also drag in the PDF documents from sample_files → Word_and_PDF → difficult_pdf.

    Go to the Create panel and build the collection. Examine the output from the build process. You will notice that one of the documents could not be processed. The following messages are shown: "The file pdf05-notext.pdf was recognised but could not be processed by any plugin.", and "3 were processed and included in the collection. 1 was rejected".

  1. Preview the collection and view the documents. pdf05-notext.pdf does not appear as it could not be processed. pdf06-weirdchars.pdf was processed but looks very strange. The other PDF documents appear as one long document, with no sections.

Modes in the Librarian Interface

The Librarian Interface can operate in different modes. The default mode is Librarian mode. We can use Expert mode to work out why the PDF file could not be processed.

  1. Use the Preferences... item on the File menu to switch to Expert mode and then build the collection again. The Create panel looks different in Expert mode because it gives more options: locate the <Build Collection> button, near the bottom of the window, and click it. Now a message appears saying that the file could not be processed, and why. Amongst all the output, we get the following message: "Error: PDF contains no extractable text. Could not convert pdf05-notext.pdf to HTML format". pdftohtml.pl cannot convert a PDF file to HTML if the PDF file has no extractable text.

  1. We recommend that you switch back to Librarian mode for subsequent exercises, to avoid confusion.

Splitting PDFs into sections

  1. In the Document Plugins section of the Design panel, configure PDFPlugin. Switch on the use_sections option.

    In the Search Indexes section, check the section checkbox to build the indexes on section level as well as document level.

    Build and preview the collection. View the text versions of some of the PDF documents. Note that these are now split into a series of pages, and a "go to page" box is provided. The format is still a bit ugly though, and pdf05-notext.pdf is still not processed.

Using image format

  1. If conversion to HTML doesn't produce the result you like, PDF documents can be converted to a series of images, one per page. This requires ImageMagick and Ghostscript to be installed.

  1. In the Document Plugins section, configure PDFPlugin. Set the convert_to option to one of the image types, e.g. pagedimg_jpg. Switch off the use_sections option, as it is not used with image conversion.

  1. Build the collection and preview. All PDF documents (including pdf05-notext.pdf) have been processed and divided into sections, but each section displays "This document has no text.". For the conversion to images for PDF documents, no text is extracted.

  1. In order to view the documents properly, you will need to modify the format statement. In the Format Features section on the Format panel, select the DocumentText format statement. Replace

    [Text]

    with

    [srcicon]

  1. Preview the collection. Images from the document are now displayed instead of the extracted text. Both pdf05-notext.pdf and pdf06-weirdchars.pdf display nicely now.

    In this collection, we only have PDF documents and they have all been converted to images. If we had other document types in the collection, we should use a different format statement, such as:

    {If}{[parent:FileFormat] eq PDF,[srcicon],[Text]}

    FileFormat is an extracted metadata item which shows the format of the source document. We can use this to test whether the documents are PDF or not: for PDF documents, display [srcicon], for other documents, display [Text].

Using process_exp to control document processing (advanced)

  1. Processing all of the PDF documents using an image type may not give the best result for your collection. The images will look nice, but as no text is extracted, searching the full text will not be available for these documents. The best solution would be to process most of the PDF files as HTML, and only use the image format where HTML doesn't work.

  1. We achieve this by putting the problem files into a separate folder, and adding another PDFPlugin plugin with different options.

  1. Go to the Gather panel. Make a new folder called "notext": right click in the collection panel and select New folder from the menu. Change the Folder Name to "notext", and click <OK>.

    Move the two PDF files that have problems with HTML (pdf05-notext.pdf and pdf06-weirdchars.pdf) into this folder by drag and drop. We will set up the plugins so that PDF files in this notext folder are processed differently to the other PDF files.

  1. Change to Library Systems Specialist mode so that you can add two of the same plugin, and use regular expressions in the plugin options (File → Preferences... → Mode).

    For version 2.71, you'll need to close GLI now then restart it to get the list of plugins to update properly.

  1. Switch to the Document Plugins section of the Design panel. Add a second PDF plugin by selecting PDFPlugin from the Select plugin to add: drop-down list, and clicking <Add Plugin...>. This plugin will come after the first PDF plugin, so we configure it to process PDF documents as HTML. Set the convert_to option to html, and switch on the use_sections option. Click <OK>.

  1. Configure the first PDF plugin, and set the process_exp option to 'notext.*\.pdf'.

  1. The two PDF plugins should have options like the following:

    plugin PDFPlugin -convert_to pagedimg_jpg -process_exp "notext.*\.pdf"
    plugin PDFPlugin -convert_to html -use_sections

    The pagedimg_jpg version must come earlier in the list than the html version. The process_exp for the first PDFPlugin will process any PDF files in the notext directory. The second PDFPlugin will process any PDF files that are not processed by the first one.

    Note that all plugins have the process_exp option, and this can be used to customize which documents are processed by which plugin. This option is only visible in Library Systems Specialist and Expert modes.

    Change back to Librarian mode.

  1. Edit the DocumentText format statement. PDF files processed as HTML will not have images to display, so we need to make sure they get text displayed instead. Change [srcicon] to {If}{[NoText] eq "1",[srcicon],[Text]}.

  1. Build and preview the collection. All PDF documents should look relatively nice. Try searching this collection. You will be able to search for the PDFs that were converted to HTML (try e.g. "bibliography"), but not the ones that were converted to images (try searching for "FAO" or "METS").
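
The way the two PDFPlugin entries share out the files can be sketched as a first-match search through the plugin list. The Python below is an illustration under assumptions: Greenstone evaluates process_exp as a Perl regular expression, the second pattern is a stand-in for PDFPlugin's default (which is not given in this exercise), and the file paths are written with forward slashes for simplicity.

```python
import re

# Plugins in list order. The first pattern is the one set in this exercise;
# the second is an ASSUMED stand-in for the plugin's default pattern.
plugins = [
    ("PDFPlugin -convert_to pagedimg_jpg", r"notext.*\.pdf"),
    ("PDFPlugin -convert_to html -use_sections", r"\.pdf$"),
]

def choose_plugin(path):
    """Return the first plugin whose process_exp matches the file path."""
    for label, process_exp in plugins:
        if re.search(process_exp, path):
            return label
    return None

for path in ("notext/pdf05-notext.pdf", "notext/pdf06-weirdchars.pdf", "pdf01.pdf"):
    print(path, "->", choose_plugin(path))
```

Because the first matching plugin wins, swapping the two entries would send every PDF to the HTML conversion, which is why the image version has to be listed first.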

Opening PDF files with query terms highlighted

  1. Next we'll customize the SearchVList format statement to highlight the query terms in a PDF file when it is opened from the search result list. This requires Acrobat Reader version 7.0 or higher, and currently only works on a Microsoft Windows platform.

  1. The search terms are kept in the macro variable _cgiargq_, and we append #search="_cgiargq_" to the end of a PDF file link to pass the query terms to the PDF file.

    PDFPlugin renames each PDF file as doc.pdf and saves it in a unique directory for that document, so we use

    _httpcollection_/index/assoc/[archivedir]/doc.pdf

    to refer to the PDF source file. (However, if you used the -keep_original_filename option to PDFPlugin when building the collection, the original name of the PDF file is kept, and we use

    _httpcollection_/index/assoc/[archivedir]/[Source]

    instead to locate the PDF source file.)

  1. Select SearchVList from the list of assigned formats. We need to test whether the file is a PDF file before linking to doc.pdf, using {If}{[ex.FileFormat] eq 'PDF',,}. For PDF files, we use the above format instead of the [ex.srclink] and [ex./srclink] variables to link to the file.

    The resulting format statement is:

    <td valign="top">[link][icon][/link]</td>
    <td valign="top">{If}{[ex.FileFormat] eq 'PDF', <a href=\"_httpcollection_/index/assoc/[archivedir]/doc.pdf#search=&quot;_cgiargq_&quot;\">[ex.srcicon]</a>,
    [ex.srclink][ex.srcicon][ex./srclink]}
    </td>
    <td valign="top">[highlight]
    {Or}{[dc.Title],[ex.Title],Untitled}
    [/highlight]{If}{[ex.Source],<br><i>([ex.Source])</i>}</td>

    When the PDF icons are clicked in the search results, Acrobat will open the file with the search window open, and the query terms highlighted.
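
To make the link construction above concrete, here is a minimal Python sketch of the same logic. It is not how Greenstone evaluates format statements. The function names are hypothetical, _httpcollection_ is left as a literal macro, and the arguments stand in for [archivedir], _cgiargq_, [Source] and [ex.FileFormat]. The archive directory HASH01 in the usage lines is an invented example value.

```python
import html


def pdf_search_link(archivedir, query, source=None):
    """Build the PDF link with the query appended as #search="...".

    Pass source only when -keep_original_filename was used; otherwise the
    file is assumed to have been renamed to doc.pdf.
    """
    filename = source if source else "doc.pdf"
    return f'_httpcollection_/index/assoc/{archivedir}/{filename}#search="{query}"'


def source_cell(file_format, archivedir, query):
    """Mirror the {If} test: a custom link for PDFs, the usual srclink otherwise."""
    if file_format == "PDF":
        # Escape the double quotes as &quot; so they can sit inside href="...".
        href = html.escape(pdf_search_link(archivedir, query))
        return '<a href="' + href + '">[ex.srcicon]</a>'
    return "[ex.srclink][ex.srcicon][ex./srclink]"


print(pdf_search_link("HASH01", "king"))
print(source_cell("PDF", "HASH01", "king"))
```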



Enhanced Word document handling

Prerequisite: A collection of Word and PDF files
Devised for Greenstone version: 2.70w
Modified for Greenstone version: 2.74
The standard way Greenstone processes Word documents is to convert them to HTML format using a third-party program, wvWare. This sometimes doesn't do a very good job of conversion. If you are using Windows, and have Microsoft Word installed, you can take advantage of Windows native scripting to do a better job of conversion. If the original document was hierarchically structured using Word styles, these can be used to structure the resulting HTML. Word document properties can also be extracted as metadata.

  1. In your digital library, preview the reports collection. Look at the HTML versions of the Word documents and notice how they have no structure: they have been converted to flat documents.

Using Windows native scripting

  1. In the Librarian Interface, open up the reports collection. Switch to the Design panel and select the Document Plugins section on the left-hand side. Double click the WordPlugin plugin and switch on the windows_scripting option.

    In the Search Indexes section, check the section checkbox to build the indexes on section level as well as document level.

  1. Build the collection. You will notice that the Microsoft Word program is started up for each Word document—the document is saved as HTML from Word itself, to get a better conversion. Preview the collection. In the Titles list, notice that word03.doc and word06.doc now have a book icon, rather than a page icon. These now appear with hierarchical structure.

    The default behaviour for WordPlugin with windows_scripting is to section the document based on "Heading 1", "Heading 2", "Heading 3" styles. If you open up the word03.doc or word06.doc documents in Word, you will see that the sections use these Heading styles.

    Note, to view style information in Word, you can select Format → Styles and Formatting from the menu, and a side bar will appear on the right hand side. Click on a section heading and the formatting information will be displayed in this side bar.

  1. Some of the documents do not use styles (e.g. word01.doc) and no structure can be extracted from them. Some documents use user-defined styles. WordPlugin can be configured to use these styles instead of Heading 1, Heading 2 etc. Next we will configure WordPlugin to use the styles found in word05.doc.

Modes in the Librarian Interface

  1. The Librarian Interface can operate in four modes. Go to File → Preferences... → Mode and see the four modes and what functionality they provide access to. Librarian is the default mode.

  1. Change the mode to Library Systems Specialist because you will need to use regular expressions to set up the style options in the next part of the exercise.

Defining styles

  1. Open up word05.doc in Word (by double-clicking on it in the Gather pane), and examine the title and section heading styles. You will see that various user-defined header styles are set such as:

    • ManualTitle: Title of the manual
    • ChapterTitle: Level 1 section heading
    • SectionHeading: Level 2 section heading
    • SubsectionHeading: Level 3 section heading
    • AppendixTitle: Appendix section title

  1. In the Document Plugins section of the Design panel, select WordPlugin and click <Configure Plugin...>. Four types of header can be set:

    • level1_header (level1Header1|level1Header2|...)
    • level2_header (level2Header1|level2Header2|...)
    • level3_header (level3Header1|level3Header2|...)
    • title_header (titleHeader1|titleHeader2|...)

    These header options define which styles should be considered as title, level 1, level 2 and level 3 styles.

    Ensure that the windows_scripting option is checked, and set the options as follows (spaces in the Word styles are removed when converting to HTML styles, and these options must match the HTML styles):

    level1_header: (ChapterTitle|AppendixTitle)
    level2_header: SectionHeading
    level3_header: SubsectionHeading
    title_header: ManualTitle

    If you can't see these options in the WordPlugin configuration pane, check that you are in Library Systems Specialist mode as described above.

    Once these are set, click <OK>.

  1. Close any documents that are still open in Word, as this can prevent the build process from completing correctly.

  1. Build the collection and preview it. Look in particular at word05.doc. You will see that this document is now also hierarchically structured.

    If you have documents with different formatting styles, you can use (...|...) to specify all of the different styles.
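
As a rough illustration of how the four header options partition style names, the following Python sketch treats each option value as a regular expression and reports which level a style falls under. The names HEADER_OPTIONS, html_style_name and classify_style are hypothetical, and matching against the whole style name is an assumption made for this sketch rather than documented WordPlugin behaviour.

```python
import re

# Option values from the exercise, keyed by the level they define.
HEADER_OPTIONS = {
    "title": r"ManualTitle",
    "level1": r"(ChapterTitle|AppendixTitle)",
    "level2": r"SectionHeading",
    "level3": r"SubsectionHeading",
}


def html_style_name(word_style):
    """Spaces in Word style names are removed in the HTML style names."""
    return word_style.replace(" ", "")


def classify_style(word_style):
    """Return the header level a style maps to, or None for body styles."""
    style = html_style_name(word_style)
    for level, pattern in HEADER_OPTIONS.items():
        # Assumption: the pattern must match the whole style name.
        if re.fullmatch(pattern, style):
            return level
    return None


for name in ["ManualTitle", "Appendix Title", "SubsectionHeading", "BodyText"]:
    print(name, "->", classify_style(name))
```

Whole-name matching matters in this sketch: with a substring search, SubsectionHeading would also satisfy the level 2 pattern SectionHeading.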

Removing pre-defined table of contents

  1. If you look at the HTML versions of word05.doc and word06.doc, you will see that each now has two tables of contents. One is generated by Greenstone based on the document's styles, the other was already defined in the Word document. WordPlugin can be configured to remove predefined tables of contents and tables of figures. The tables must be defined with Word styles in order for this to work.

  1. To remove the tables of contents and figures from word06.doc and the table of contents from word05.doc, switch on the delete_toc option in WordPlugin. Set the toc_header option to (MsoToc1|MsoToc2|MsoToc3|MsoTof|TOA). In these documents, the tables of contents and list of figures use these style names. Click <OK>.

  1. Build and preview the collection. Both word05.doc and word06.doc should now have only one table of contents.

  1. Switch the Librarian Interface back to Librarian mode (File → Preferences... → Mode).

Extracting document properties as metadata

  1. When the windows_scripting option is set, Word document properties can be extracted as metadata. By default, only the Title will be extracted. Other properties can be extracted using the metadata_fields option.

  1. In the Enrich panel, look at the metadata that has been extracted for word05.doc and word06.doc. Now open the documents in Word and look at what properties have been set (File → Properties). They have Title, Author, Subject, and Keywords properties. WordPlugin can be configured to look for these properties and extract them.

  1. In the Design panel, under Document Plugins, configure WordPlugin once again. Switch on the configuration option metadata_fields. Set the value to

    Title,Author<Creator>,Subject,Keywords<Subject>

    This will make WordPlugin try to extract Title, Author, Subject and Keywords metadata. Title and Subject will be saved with the same name, while Author will be saved as Creator metadata, and Keywords as Subject metadata.

  1. Make sure you have closed all the documents that were opened, then rebuild the collection.

  1. Look at the metadata for the two documents again in the Enrich panel. You should now see ex.Creator and ex.Subject metadata items. This metadata can now be used in display or browsing classifiers etc.
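
The metadata_fields value used above is a small mapping language: a bare name is saved under the same name, and Name<Other> is saved as Other. The Python sketch below parses the value into a dictionary to make that explicit. The function name parse_metadata_fields is hypothetical, and this is an illustration of the notation, not WordPlugin's own parser.

```python
def parse_metadata_fields(spec):
    """Map each document property to the metadata name it is saved under."""
    mapping = {}
    for item in spec.split(","):
        # "Author<Creator>" splits into "Author" and "Creator>".
        prop, _, rest = item.strip().partition("<")
        prop = prop.strip()
        target = rest.rstrip(">").strip() or prop
        mapping[prop] = target
    return mapping


fields = parse_metadata_fields("Title,Author<Creator>,Subject,Keywords<Subject>")
for prop, target in fields.items():
    print(f"{prop} is saved as {target}")
```

Note that both Subject and Keywords end up as Subject metadata, so a document with both properties set can carry more than one Subject value.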



Exporting a collection to CD-ROM/DVD

Devised for Greenstone version: 2.60
Modified for Greenstone version: 2.74

To publish a collection on CD-ROM or DVD, Greenstone's Export to CD-ROM module must be installed. This is included with CD-ROM distributions, and all distributions 2.70w and later. It must be installed separately for non-CD-ROM versions of Greenstone, version 2.70 and earlier (see Installing Greenstone).

  1. Launch the Greenstone Librarian Interface if it is not already running.

  1. Choose File → Write CD/DVD image.... In the resulting popup window, select the collection or collections that you wish to export by ticking their check boxes. You can optionally enter a name for the CD-ROM: this is the name that will appear in the menu when the CD-ROM is run. If a name is not entered, the default Greenstone Collections will be used. You can also specify whether the resulting CD-ROM will install files onto the host machine when used or not. Click <Write CD/DVD image> to start the export process.

    The necessary files for export are written to:

    Greenstone → tmp → exported_xxx

    where xxx will be similar to the name you have entered. If you didn't specify a name for the CD-ROM, then the folder name will be exported_collections.

    You need to use your own computer's software to write these on to CD-ROM. On Windows XP this ability is built into the operating system: assuming you have a CD-ROM or DVD writer, insert a blank disk into the drive and drag the contents of exported_xxx into the folder that represents the disk.

    The result will be a self-installing Windows Greenstone CD-ROM or DVD, which starts the installation process as soon as it is placed in the drive.



A large collection of HTML files—Tudor

Sample files: tudor.zip
Devised for Greenstone version: 2.60
Modified for Greenstone version: 2.74

You will need the files in the sample_files → tudor folder.

  1. Invoke the Greenstone Librarian Interface (from the Windows Start menu) and start a new collection called tudor (use the File menu), based on the default -- New Collection --.

  1. In the Gather panel, open the tudor folder in sample_files.

  1. Drag englishhistory.net from the left-hand side to the right to include it in your tudor collection. (This material is from Marilee Hanson's Tudor England Collection at http://englishhistory.net/tudor.html, distributed with her permission.)

  1. Switch to the Create panel and click <Build Collection>.

  1. When building has finished, preview the collection.

Extracting more metadata from the HTML

  1. The browsing facilities in this collection (Titles and Filenames) are based entirely on extracted metadata. Return to the Enrich panel in the Librarian Interface and examine the metadata that has been extracted for some of the files.

  1. Many HTML documents contain metadata in <meta> tags in the <head> of the page. Open up the englishhistory.net → tudor → monarchs → boleyn.html file by navigating to it in the tree on the left hand side, and double clicking it. This will open it in a web browser. View the HTML source of the page (View → Source in Internet Explorer, View → Page Source in Mozilla). You will notice that this page has page_topic, content and author metadata.

  1. By default, HTMLPlugin only looks for Title metadata. Configure the plugin so that it looks for the other metadata too. Switch to the Design panel and select the Document Plugins section. Select the plugin HTMLPlugin line and click <Configure Plugin...>. A popup window appears. Switch on the metadata_fields option, and set the value to

    Title,Author,Page_topic,Content

    Click <OK>.

  1. Switch to the Create panel and rebuild the collection. Go back to the Enrich panel and look at the extracted metadata for some of the HTML files in englishhistory.net → tudor → monarchs. The new metadata should now be visible.
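
The kind of <meta> tag extraction configured above can be sketched with Python's standard html.parser module. This is only an illustration of the idea, not HTMLPlugin's implementation: the class and function names are hypothetical, the sample page is invented, case-insensitive name matching is an assumption, and the sketch reads only <meta> tags (it does not handle the <title> element).

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the content of <meta name=...> tags whose name is wanted."""

    def __init__(self, wanted):
        super().__init__()
        # Assumption: names are compared without regard to case, and values
        # are stored under the spelling given in the field list.
        self.wanted = {name.strip().lower(): name.strip() for name in wanted}
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in self.wanted and attributes.get("content"):
            self.found[self.wanted[name]] = attributes["content"]


def extract_meta(html_text, fields):
    """Return {field: value} for the comma-separated field list."""
    collector = MetaCollector(fields.split(","))
    collector.feed(html_text)
    return collector.found


# Invented sample page, for illustration only.
SAMPLE = """<html><head>
<title>Sample page</title>
<meta name="page_topic" content="Sample topic">
<meta name="author" content="A. N. Author">
<meta name="generator" content="ignored">
</head><body></body></html>"""

print(extract_meta(SAMPLE, "Title,Author,Page_topic,Content"))
```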

Blocking the stray images

You've probably noticed that the collection contains a few stray image files, as well as the HTML documents. This is a mistake. The issue is that many of the HTML documents include images, and although Greenstone attempts to determine which images belong to HTML pages and only considers other images for inclusion in the collection, in this case it hasn't been completely successful. (This is because the web site from which these files were downloaded occasionally departs from the usual convention of hierarchical structuring.)

  1. Switch back to the Document Plugins section of the Design panel. Beside plugin HTMLPlugin you will see -smart_block. This is the option that attempts to identify images in the HTML pages and block them from inclusion—in this case, it's not smart enough! Configure plugin HTMLPlugin again, scroll down the page to locate the smart_block option, and switch it off.

  1. Rebuild and preview the collection. The collection is exactly as before except that these stray images are suppressed. What is happening is that plug-ins operate as a pipeline: files are passed to each one in turn until one is found that can process it. By default (i.e. without smart_block) the HTML plug-in blocks all images, which is appropriate for this collection.

Looking at different views of the files in the Gather and Enrich panels

  1. Switch to the Gather panel and in the right-hand side open englishhistory.net → tudor.

  1. Change the Show Files menu for the right-hand side from All Files to HTM & HTML. Notice the files displayed above are filtered accordingly, to show only files of this type.

  1. Change the Show Files menu to Images. Again, the files shown above alter.

  1. Now return the Show Files setting back to All Files, otherwise you may get confused later. Remember, if the Gather or Enrich panels do not seem to be showing all your files, this could be the problem.



Enhanced collection of HTML files—Tudor

Prerequisite: A large collection of HTML files—Tudor
Devised for Greenstone version: 2.60
Modified for Greenstone version: 2.74

We return to the Tudor collection and add metadata that expresses a subject hierarchy. Then we build a classifier that exploits it by allowing readers to browse the documents about Monarchs, Relatives, Citizens, and Others separately.

Adding hierarchically-structured metadata and a Hierarchy classifier

  1. Open up your tudor collection (the original version, not the webtudor version), switch to the Enrich panel and select the citizens folder (a subfolder of englishhistory.net → tudor). Set its dc.Subject and Keywords metadata to Tudor period|Citizens. The vertical bar ("|") is a hierarchy marker. Selecting a folder and adding metadata has the effect of setting this metadata value for all files contained in this folder, its subfolders, and so on. A popup alerts you to this fact. Click <OK> to close the popup.

  1. Repeat for the monarchs and relative folders, setting their dc.Subject and Keywords metadata to Tudor period|Monarchs and Tudor period|Relatives respectively. Note that the hierarchy appears in the Existing values for dc.Subject and Keywords area.

    If you don't want to see the popup each time you add folder level metadata, tick the Do not show this warning again checkbox; it won't be displayed again.

  1. Finally, select all remaining files—the ones that are not in the citizens, monarchs, or relative folders—by selecting the first and shift-clicking the last. Set their dc.Subject and Keywords metadata to Tudor period|Others: this is done in a single operation (there is a short delay before it completes).

    When multiple files are selected in the left hand collection tree, all metadata values for all files are shown on the right hand side. Items that are common to all files are displayed in black—e.g. dc.Subject and Keywords—while others that pertain to only one or some of the files are displayed in grey—e.g. any extracted metadata.

    Metadata inherited from a parent folder is indicated by a folder icon to the left of the metadata name. Select one of the files in the relative folder to see this.

  1. Switch to the Design panel and select Browsing Classifiers from the left-hand list. Set the menu item for Select classifier to add to Hierarchy; then click <Add Classifier...>.

  1. A window pops up to control the classifier's options. Change the metadata to dc.Subject and Keywords and then click <OK>.

  1. For tidiness' sake, remove the classifier for Source metadata (included by default) from the list of currently assigned classifiers, because this adds little to the collection.

  1. Now switch to the Create panel, build the collection, and preview it. Choose the new Subjects link that appears in the navigation bar, and click the bookshelves to navigate around the four-entry hierarchy that you have created.
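
The role of the vertical bar as a hierarchy marker can be illustrated with a short Python sketch that nests the four values assigned above into a tree. The function name build_hierarchy is hypothetical, and the sketch models only the shape of the resulting bookshelf structure, not how the Hierarchy classifier is implemented.

```python
def build_hierarchy(values):
    """Nest each 'A|B' metadata value into a dict-of-dicts tree."""
    tree = {}
    for value in values:
        node = tree
        for part in value.split("|"):
            # Descend one level per component, creating shelves as needed.
            node = node.setdefault(part, {})
    return tree


SUBJECTS = [
    "Tudor period|Citizens",
    "Tudor period|Monarchs",
    "Tudor period|Relatives",
    "Tudor period|Others",
]

tree = build_hierarchy(SUBJECTS)
for top, children in tree.items():
    print(top)
    for child in sorted(children):
        print("   ", child)
```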

Adding a hierarchical phrase browser (PHIND)

Next we'll add an interactive hierarchical phrase browsing classifier to this collection.

  1. Switch to the Design panel and choose the Browsing Classifiers item from the left-hand list.

  1. Choose Phind from the Select classifier to add menu. Click <Add Classifier...>. A window pops up asking for configuration options: leave the values at their preset defaults (this will base the phrase index on the full text) and click <OK>.

  1. Build the collection again, preview it, and try out the new Phrases option in the navigation bar. An interesting PHIND search term for this collection is "king". Note that even though it is called a phrase browser, only single terms can be used as the starting point for browsing.

Partitioning the full-text index based on metadata values

Next we partition the full-text index into four separate pieces. To do this we first define four subcollections obtained by "filtering" the documents according to a criterion based on their dc.Subject and Keywords metadata. Then an index is assigned to each subcollection. This will enable users to restrict a search to a subset of the documents.

  1. Switch to the Design panel, and click Partition Indexes. This feature is disabled because you are operating in Librarian mode (this is indicated in the title bar at the top of the window).

  1. Switch to Library Systems Specialist mode by going to Preferences... (on the File menu) and clicking <Mode>. Read about the other modes too.

  1. Return to the Partition Indexes section of the Design panel. Ensure that the Define Filters tab is selected (the default). Define a subcollection filter with name monarchs that matches against dc.Subject and Keywords, and type Monarchs as the regular expression to match with. Click <Add Filter>. This filter includes any file whose dc.Subject and Keywords metadata contains the word Monarchs.

  1. Define another filter, relatives, which matches dc.Subject and Keywords against the word Relatives. Define a third and fourth, citizens and others, which match it against the words Citizens and Others respectively.

  1. Having defined the subcollection filters, we partition the index into corresponding parts. Click the Assign Partitions tab. Select the citizens subcollection and click <Add Partition>. Next select monarchs, and click <Add Partition>. Repeat for the other two subcollections, so that you end up with four partitions, one based on each subcollection filter.

    The order they appear in the Assigned Subcollection Partitions list is the order they will appear in the drop down menu on the search page. You can change the order by using the <Move Up> and <Move Down> buttons.

  1. Build and preview the collection.

  1. The search page includes a pulldown menu that allows you to select one of these partitions for searching. For example, try searching the relatives partition for mary and then search the monarchs partition for the same thing.

  1. To allow users to search the collection as a whole as well as each subcollection individually, return to the Partition Indexes section of the Design panel and select the Assign Partitions tab. Select all four subcollections by checking their boxes and click <Add Partition>.

  1. To ensure that the combined index appears first in the list on the reader's web page, use the <Move Up> button to get it to the top of the list here in the Design panel. Then build and preview the collection.

  1. Search for a common term (like the) in all five index partitions, and check that the numbers of words (not documents) add up.

  1. The text in the drop down box on the search page is based on the filters each partition was built on. To change the text that is displayed, go to the Search section of the Format panel. The single filter partitions have sensible default text, but the combined partition does not. Set the Display text for the combined partition to "all". Preview the collection.

  1. In the Librarian Interface, return to Librarian mode, using Preferences... (on the File menu).
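
The relationship between filters and partitions in this exercise can be modelled in Python as follows. This is a simplified sketch under stated assumptions: the document list is invented, the names FILTERS and partition are hypothetical, and a filter is taken to match when its regular expression is found anywhere in the metadata value.

```python
import re

FIELD = "dc.Subject and Keywords"

# Filter name -> regular expression, as defined in the exercise.
FILTERS = {
    "monarchs": "Monarchs",
    "relatives": "Relatives",
    "citizens": "Citizens",
    "others": "Others",
}

# Invented documents, for illustration only.
DOCS = [
    {"file": "a.html", FIELD: "Tudor period|Monarchs"},
    {"file": "b.html", FIELD: "Tudor period|Monarchs"},
    {"file": "c.html", FIELD: "Tudor period|Relatives"},
    {"file": "d.html", FIELD: "Tudor period|Citizens"},
    {"file": "e.html", FIELD: "Tudor period|Others"},
]


def partition(docs, filter_names):
    """Return the documents matched by at least one of the named filters."""
    patterns = [re.compile(FILTERS[name]) for name in filter_names]
    return [
        doc for doc in docs
        if any(p.search(doc.get(FIELD, "")) for p in patterns)
    ]


single = {name: len(partition(DOCS, [name])) for name in FILTERS}
combined = len(partition(DOCS, list(FILTERS)))
print(single, combined)
```

Because each document here carries exactly one of the four subjects, the sizes of the single-filter partitions add up to the size of the combined one, which is the same kind of check the exercise asks you to make with word counts.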

Controlling the building process

Finally we look at how the building process can be controlled. Developing a new collection usually involves numerous cycles of building, previewing, adjusting some enrich and design features, and so on. While prototyping, it is best to temporarily reduce the number of documents in the collection. This can be accomplished through the maxdocs parameter to the building process.

  1. Switch to the Create panel and view the options that are displayed in the top portion of the screen. Select maxdocs and set its numeric counter to 3. Now build.

  1. Preview the newly rebuilt collection's Titles page. Previously this listed more than a dozen pages per letter of the alphabet, but now there are just three—the first three files encountered by the building process.

  1. Go back to the Create panel and turn off the maxdocs option. Rebuild the collection so that all the documents are included.



Formatting the HTML collection—Tudor

Prerequisite: A large collection of HTML files—Tudor
Devised for Greenstone version: 2.60
Modified for Greenstone version: 2.74
  1. Open up your tudor collection, go to the Format panel (by clicking on its tab) and select Format Features from the left-hand list. Leave the editing controls at their default value, so that Choose Feature displays All Features and VList is selected as the Affected Component. The text in the HTML Format String box reads as follows:

    <td valign=top>[link][icon][/link]</td>
    <td valign=top>[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]} [ex./srclink]</td>
    <td valign=top>[highlight]
    {Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
    [/highlight]{If}{[ex.Source],<br><i>([ex.Source])</i>}</td>

    This displays something that looks like this:

    A discussion of question five from Tudor Quiz: Henry VIII
    (quizstuff.html)

    for a particular document whose Title metadata is A discussion of question five from Tudor Quiz: Henry VIII and whose Source metadata is quizstuff.html.

    This format appears in the search results list, in the Titles list, and also when you get down to individual documents in the Subjects hierarchy. This is Greenstone's default format statement.

Greenstone's default format statement is complex because it is designed to produce something reasonable under almost any conditions, and also because for practical reasons it needs to be backwards compatible with legacy collections.
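
One source of that robustness is the {Or} construct, which can be read as "use the first of these that has a value". The Python sketch below models that reading for the title line of the default statement. The function name first_available and the sample metadata are hypothetical, and this is an interpretation of the format statement rather than Greenstone's own code.

```python
def first_available(metadata, names, default="Untitled"):
    """Return the first non-empty value among names, else the default text."""
    for name in names:
        value = metadata.get(name)
        if value:
            return value
    return default


TITLE_ORDER = ["dc.Title", "exp.Title", "ex.Title"]

# Invented metadata, for illustration only.
print(first_available({"ex.Title": "Extracted title"}, TITLE_ORDER))
print(first_available({"dc.Title": "Assigned title", "ex.Title": "Extracted title"}, TITLE_ORDER))
print(first_available({}, TITLE_ORDER))
```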

  1. Delete the contents of the HTML Format String box and replace it with this simpler version:

    <td>[link][icon][/link]</td>
    <td>[ex.Title]<br>
        <i>([ex.Source])</i>
    </td>

    Preview the result (you don't need to build the collection, because changes to format statements take effect immediately). Look at some search results and at the Titles list. They are just the same as before! Under most circumstances this far simpler format statement is entirely equivalent to Greenstone's more complex default.

    But there's a problem. In the Subjects browser, a mysterious "()" appears beneath the subject of each bookshelf. What is printed for these bookshelves is governed by the same format statement, and though bookshelf nodes of the hierarchy have associated Title metadata (their title is the name of the metadata value associated with that bookshelf), they do not have ex.Source metadata, so it comes out blank.

  1. In the Format Features section of the Format panel, the Choose Feature menu (just above Affected Component menu) displays All Features. That implies that the same format is used for the search results, titles, and all nodes in the subject hierarchy—including internal nodes (that is, bookshelves). The Choose Feature menu can be used to restrict a format statement to a specific one of these lists. We will override this format statement for the hierarchical subject classifier. In the Choose Feature menu, scroll down to the item that says

    CL2: Hierarchy -metadata dc.Subject and Keywords

    and select it. This is the format statement that affects the second classifier (i.e., "CL2"), which is a Hierarchy classifier based on dc.Subject and Keywords metadata.

    Click <Add Format> to add this format statement to the collection.

    Edit the HTML Format String box below to read

    <td>[link][icon][/link]</td>
    <td>[ex.Title]</td>

  1. Preview the Subjects list in the collection. First, the offending "()" has disappeared from the bookshelves. Second, when you get down to a list of documents in the subject hierarchy, the filename does not appear beside the title, because ex.Source is not specified in the format statement and this format statement applies to all nodes in the subject classifier. Note that the search results and titles lists have not changed: they still display the filename underneath the title.

  1. Let's change the search results format so that dc.Subject and Keywords metadata is displayed here instead of the filename. In the Choose Feature menu (under Format Features on the Format panel), scroll down to the item Search and select it. Click <Add Format> to add this format statement to the collection. Change the HTML Format String box below to read

    <td>[link][icon][/link]</td>
    <td>[ex.Title]<br>
        [dc.Subject]
    </td>

  1. To insert the [dc.Subject], position the cursor at the appropriate point and either type it in, or select it from the Insert Variable... drop down menu. This menu shows many of the things that you can put in square brackets in the format statement.

  1. Preview the collection. Documents in the search results list will be displayed like this:

    A discussion of question five from Tudor Quiz: Henry VIII
    Tudor period|Others

    (The vertical bar appears because this dc.Subject and Keywords metadata is hierarchical metadata. Unfortunately there is no way to get at individual components of the hierarchy. For most metadata, such as title and author, this isn't a problem.)

  1. Finally, let's return to the Subjects hierarchy and learn how to do different things to the bookshelves and to the documents themselves. In the Choose Feature menu, re-select the item

    CL2: Hierarchy -metadata dc.Subject and Keywords

    Edit the HTML Format String box below to read

    <td>[link][icon][/link]</td>
    <td>{If}{[numleafdocs],<b>Bookshelf title:</b> [ex.Title],
             <b>Title:</b> [ex.Title]}
    </td>

    Again, you can insert the items in square brackets by selecting them from the Insert Variable... drop down box.

    The If statement tests the value of the variable numleafdocs. This variable is only set for internal nodes of the hierarchy, i.e. bookshelves, and gives the number of documents below that node. If it is set we take the first branch, otherwise we take the second. Commas are used to separate the branches. The curly brackets serve to indicate that the If is special—otherwise the word "If" itself would be output.

  1. Preview the collection and examine the subject hierarchy again to see the effect of your changes. Bookshelves should say Bookshelf title: and then the title, while documents will display Title: and the title. Note that the number of documents in the bookshelf is not displayed: we are using [numleafdocs] to test what kind of item in the list we are at, but we are not displaying it.
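
The branching in the last format statement, and the stray "()" seen earlier, can both be modelled with a few lines of Python. This is a sketch of the behaviour described in the exercise, not of Greenstone's format engine: the function names and the sample nodes are hypothetical, and a node is represented as a plain dictionary of its metadata.

```python
def render_simple(node):
    """Model of the simplified statement: title, then the source in brackets."""
    title = node.get("ex.Title", "")
    source = node.get("ex.Source", "")  # missing metadata comes out blank
    return f"{title}<br><i>({source})</i>"


def render_subject_cell(node):
    """Model of {If}{[numleafdocs],...}: bookshelves take the first branch."""
    title = node.get("ex.Title", "")
    if node.get("numleafdocs"):
        return f"<b>Bookshelf title:</b> {title}"
    return f"<b>Title:</b> {title}"


# Invented nodes, for illustration only.
bookshelf = {"ex.Title": "Monarchs", "numleafdocs": 12}
document = {"ex.Title": "A sample document", "ex.Source": "sample.html"}

print(render_simple(bookshelf))
print(render_subject_cell(bookshelf))
print(render_subject_cell(document))
```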



Section tagging for HTML documents

Devised for Greenstone version: 2.70w
Modified for Greenstone version: 2.74
  1. In a browser, take a look at the Greenstone demo collection. Browse to one of the documents. This collection is based on HTML files, but they appear structured in the collection. This is because these HTML files were tagged by hand into sections.

  1. Using a text editor (e.g. WordPad) open up one of the HTML files from the demo collection: Greenstone → collect → demo → import → fb33fe → fb33fe.htm. You will see some HTML comments which contain section information for Greenstone. They look like:

    <!--
    <Section>
      <Description>
        <Metadata name="Title">Farming snails 1: Learning about snails;
        Building a pen; Food and shelter plants</Metadata>
      </Description>
    -->

    <!--
    </Section>
    <Section>
      <Description>
        <Metadata name="Title">Dew and rain</Metadata>
      </Description>
    -->

    When Greenstone encounters a <Section> tag in one of these comments, it will start a new subsection of the document. This will be closed when a </Section> tag is encountered. Metadata can also be added for each section—in this case, Title metadata has been added for each section. In the browser, find the Farming snails 1 document in the demo collection (through the Titles browser). Look at its table of contents and compare it to the <Section> tags in the HTML document.

  1. Add a new Section into this document. For example, let's add a new subsection into the Introduction chapter. In the text editor, add the following just after the Section tag for the Introduction section:

    <!--
    <Section>
      <Description>
        <Metadata name="Title">Snails are good to eat.</Metadata>
      </Description>
    -->

    Then just before the next section tag (What do you need to start?), add the following:

    <!--
    </Section>
    -->

    The effect of these changes is to make a new subsection inside the Introduction chapter.

  1. Open the Greenstone demo collection in the Librarian Interface. In the Document Plugins section of the Design panel, note that HTMLPlugin has the description_tags option set. This option is needed when <Section> tags are used in the source documents.

  1. Build and preview the collection. Look at the Farming snails 1 document again and check that your new section has been added.
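
Before rebuilding, it can help to check that the <Section> and </Section> tags you have added nest as intended. The Python sketch below scans HTML comments and prints an indented outline of the section titles it finds. It is a hypothetical helper, not part of Greenstone, and it assumes the tags are written exactly as in the examples above. The sample text is a cut-down stand-in for the edited file.

```python
import re

COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)
TOKEN = re.compile(
    r'<(/?)Section>|<Metadata name="Title">(.*?)</Metadata>', re.DOTALL
)


def section_outline(html_text):
    """Return (depth, title) pairs for Title metadata found in comments."""
    outline, depth = [], 0
    for comment in COMMENT.finditer(html_text):
        for token in TOKEN.finditer(comment.group(1)):
            if token.group(2) is not None:
                # Collapse line breaks inside a title into single spaces.
                outline.append((depth, " ".join(token.group(2).split())))
            elif token.group(1):
                depth -= 1  # </Section>
            else:
                depth += 1  # <Section>
    return outline


# Cut-down stand-in for the edited file, for illustration only.
SAMPLE = """<!--
<Section>
  <Description>
    <Metadata name="Title">Introduction</Metadata>
  </Description>
-->
<p>text</p>
<!--
<Section>
  <Description>
    <Metadata name="Title">Snails are good to eat.</Metadata>
  </Description>
-->
<p>text</p>
<!--
</Section>
-->
<!--
</Section>
<Section>
  <Description>
    <Metadata name="Title">What do you need to start?</Metadata>
  </Description>
-->
"""

for depth, title in section_outline(SAMPLE):
    print("  " * (depth - 1) + title)
```

If the new subsection appears at the same depth as Introduction in the outline, a closing tag is probably missing or misplaced.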



Downloading files from the web

Devised for Greenstone version: 2.60
Modified for Greenstone version: 2.74

The Greenstone Librarian Interface's Download panel allows you to download individual files, parts of websites, and indeed whole websites, from the web.

  1. Start a new collection called webtudor, and base it on -- New Collection --.

  1. In a web browser, visit http://englishhistory.net, follow the link to Tudor England, and click <Enter>. You should be at the URL

    http://englishhistory.net/tudor.html