Greenstone tutorial exercise
A collection of Word and PDF files
You will need some source files like those in the sample_files → Word_and_PDF folder.
-
Start a new collection called reports (File → New...) and base it on -- New Collection --.
-
Copy all the files from sample_files → Word_and_PDF → Documents into the collection. You can select multiple files by clicking on the first one and shift-clicking on the last one, and drag them all across together. (This is the normal technique of multiple selection.)
-
Switch to the Create panel, and build and preview the collection.
Viewing the extracted metadata
-
Again, this collection contains no manually assigned metadata. All the information that appears (title and filename) is extracted automatically from the documents themselves. Because of this, the quality of some of the title metadata is suspect.
-
Back in the Librarian Interface, click the Enrich tab to view the automatically extracted metadata. You will need to scroll down to see the extracted metadata, whose names begin with "ex.".
-
Check whether the ex.Title metadata is correct for some of the documents by opening them. You can open a document from the Librarian Interface by double-clicking on it.
-
The extracted Title metadata for some documents is incorrect. For example, the Titles for pdf01.pdf and word03.doc (the same document in different formats) omit the second line of the title. The Title for pdf03.pdf has the wrong text altogether.
Manually adding metadata to documents in a collection
-
In the Enrich panel, manually add Dublin Core dc.Title metadata to those documents which have incorrect ex.Title metadata. Select word03.doc and double-click to open it. Copy the title of this document ("Greenstone: A comprehensive open-source digital library software system") and return to the Librarian Interface. Scroll up or down in the metadata table until you can see dc.Title. Click in the value box and paste in the metadata.
-
Now add dc.Creator information for the same document. You can add more than one value for the same field: when you press Enter in a metadata value field, a new empty field of the same type will be generated. Add each author separately as dc.Creator metadata.
-
Close the document (in Microsoft Word) when you have finished copying metadata from it. External programs opened when viewing documents must be closed before building the collection, otherwise errors can occur.
-
Next add dc.Title and dc.Creator metadata for a few of the other documents.
-
As you add more values, you will notice that they appear in the Existing values for ... box below the metadata table. If you are adding the same metadata value to more than one document, you can select it from this list. For example, pdf01.pdf and word03.doc share the same Title, and many documents have authors in common.
If you build and preview your collection at this point, you will see that the Titles list now shows your new Titles. However, the dc.Creator metadata is not displayed. You need to alter the collection design to use this metadata.
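The way metadata accumulates in the Enrich panel can be pictured with a small sketch. This is not Greenstone code: the MetadataTable class and the author names are hypothetical, chosen only to illustrate that one field can hold several values per document and that the Existing values box collects the distinct values already entered.

```python
from collections import defaultdict


class MetadataTable:
    """Hypothetical in-memory model of per-document metadata (not Greenstone code)."""

    def __init__(self):
        # filename -> field name -> list of values (a field may hold several values)
        self._records = defaultdict(lambda: defaultdict(list))

    def add(self, filename, field, value):
        """Add one value; adding to the same field again creates a further value."""
        values = self._records[filename][field]
        if value not in values:  # ignore exact duplicates on the same document
            values.append(value)

    def values(self, filename, field):
        """Return the values of one field for one document, in entry order."""
        return list(self._records.get(filename, {}).get(field, []))

    def existing_values(self, field):
        """Return every distinct value entered for a field, sorted for display."""
        seen = set()
        for fields in self._records.values():
            seen.update(fields.get(field, []))
        return sorted(seen)


table = MetadataTable()
title = "Greenstone: A comprehensive open-source digital library software system"

# The same Title is shared by two files, as in the exercise.
table.add("word03.doc", "dc.Title", title)
table.add("pdf01.pdf", "dc.Title", title)

# Placeholder author names; each author is a separate dc.Creator value.
table.add("word03.doc", "dc.Creator", "Author One")
table.add("word03.doc", "dc.Creator", "Author Two")

print(table.values("word03.doc", "dc.Creator"))
print(table.existing_values("dc.Title"))
```

Storing each author as its own value, rather than as one combined string, is what later allows a per-author browsing list to be built.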
Document Plugins
-
In the Librarian Interface, look at the Document Plugins section of the Design panel by clicking on it in the list on the left. Here you can add, configure, or remove the plugins used in the collection. You do not have to remove any plugins, but removing unused ones speeds up processing a little. This collection contains only Word, PDF, RTF, and PostScript documents, so you can remove the ZIPPlug, TEXTPlug, HTMLPlug, EMAILPlug, ImagePlug, ISISPlug and NULPlug plugins. To delete a plugin, select it and click <Remove Plugin>. GAPlug and MetadataXMLPlug are required for any type of source collection and should not be removed.
Search indexes
-
The next step in the Design panel is Search Indexes. These specify which parts of the collection are searchable (e.g. searching by title and author). Delete the ex.Source index, which is not particularly useful, by selecting it and clicking <Remove Index>.
-
Modify the ex.Title index to include dc.Title by selecting the index in the Assigned Indexes box and clicking <Edit Index>. Select dc.Title from the list of metadata, and click <Replace Index>. Searching this index will search both dc.Title and ex.Title metadata. If you want to restrict searching to just the manually added dc.Title metadata, edit the index again and deselect ex.Title from the list of metadata.
-
You can add indexes based on any metadata. Add a new index based on dc.Creator by clicking <New Index>. Select dc.Creator in the list of metadata, and click <Add Index>.
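To make the effect of combining metadata in one index more concrete, here is a minimal sketch. It is not how Greenstone builds its indexes: build_index and search are hypothetical helpers, and the ex.Title values are placeholders. It only illustrates that an index built over dc.Title and ex.Title matches words from either field, while an index over dc.Title alone ignores the extracted titles.

```python
import re
from collections import defaultdict


def build_index(documents, fields):
    """Map each lowercased word to the set of files whose listed fields contain it."""
    index = defaultdict(set)
    for filename, metadata in documents.items():
        for field in fields:
            for value in metadata.get(field, []):
                for word in re.findall(r"\w+", value.lower()):
                    index[word].add(filename)
    return index


def search(index, word):
    """Return the sorted list of files matching a single word."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical metadata; the ex.Title values are placeholders, not real extractions.
documents = {
    "word03.doc": {
        "dc.Title": ["Greenstone: A comprehensive open-source digital library software system"],
        "ex.Title": ["Greenstone"],
    },
    "pdf03.pdf": {
        "ex.Title": ["Placeholder extracted title"],
    },
}

combined = build_index(documents, ["dc.Title", "ex.Title"])
manual_only = build_index(documents, ["dc.Title"])

print(search(combined, "placeholder"))     # found through ex.Title
print(search(manual_only, "placeholder"))  # empty: ex.Title is not indexed
print(search(combined, "library"))         # found through dc.Title
```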
Browsing classifiers
-
The Browsing Classifiers section adds "classifiers," which provide the collection with browsing functions. Go to this section and observe that Greenstone has provided two classifiers: AZLists based on ex.Title and ex.Source metadata. These correspond to the Titles and Filenames buttons on the collection's access bar.
Remove the ex.Source classifier by selecting it and clicking <Remove Classifier>.
-
Modify the ex.Title classifier to use dc.Title instead. Select the classifier and click <Configure Classifier...>. In the metadata box, select dc.Title instead of ex.Title. Click <OK>.
-
Now add an AZCompactList classifier for dc.Creator. Select AZCompactList from the Select classifier to add drop-down list and click <Add Classifier...>. A popup window Configuring Arguments appears. Select dc.Creator from the metadata drop-down list and click <OK>.
AZCompactList is like AZList, except that values that appear multiple times in the hierarchy are automatically grouped together and a new node, shown as a bookshelf icon, is formed.
-
Switch to the Create panel, and build and preview the collection.
-
Check that all the facilities work properly. There should be three search indexes, called text, dc.Title,Title, and dc.Creator. The Titles list should display all the documents to which you have assigned dc.Title metadata (and only those documents). The Creators list should show one bookshelf for each author you have assigned as dc.Creator, and clicking on that bookshelf should take you to all the documents that author wrote.
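The grouping behaviour described for AZCompactList can be sketched as follows. This is an illustration, not Greenstone's implementation: compact_list is a hypothetical function, the author names are placeholders, and the rule that a value needs more than one document to become a bookshelf is an assumption taken from the description above.

```python
from collections import defaultdict


def compact_list(documents, field):
    """Group files by each value of `field`; values shared by several files
    are marked as bookshelves (assumed threshold: more than one file)."""
    groups = defaultdict(list)
    for filename, metadata in documents.items():
        for value in metadata.get(field, []):
            groups[value].append(filename)

    nodes = []
    for value in sorted(groups):
        members = sorted(groups[value])
        nodes.append({
            "label": value,
            "bookshelf": len(members) > 1,
            "documents": members,
        })
    return nodes


# Placeholder creators; a document with two authors appears under both.
documents = {
    "pdf01.pdf": {"dc.Creator": ["Author One", "Author Two"]},
    "word03.doc": {"dc.Creator": ["Author One", "Author Two"]},
    "pdf03.pdf": {"dc.Creator": ["Author Three"]},
}

nodes = compact_list(documents, "dc.Creator")
for node in nodes:
    kind = "bookshelf" if node["bookshelf"] else "single entry"
    print(node["label"], kind, node["documents"])
```

Because each author was entered as a separate dc.Creator value, a document with several authors is reachable from each of their entries.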
Renaming the search indexes
-
The default display text for each index in the drop-down list on the search page simply names what the index contains (for example, dc.Creator). Now we will change this display text to make it more readable. Go to the Format panel by clicking its tab. This panel is split into several sections, each controlling some aspect of collection presentation.
-
Select Search in the left-hand list. This section allows you to modify the text displayed for the drop-down lists in the search form (indexes, subcollections, levels, etc.). Set the Display text for the dc.Title,Title index to "titles", and that for the dc.Creator index to "creators". Preview the collection by clicking <Preview Collection>. The search form should display the new text.
Classifying on multiple metadata
-
The new Titles list shows only those documents which have been assigned dc.Title metadata. For many documents, extracted Titles may be fine, and it is impractical to add the same metadata again as dc.Title. Fortunately there is a way we can use both metadata types in one classifier: specify a list of metadata names in the classifier.
-
In the Browsing Classifiers section of the Design panel, select the AZList for dc.Title in the Assigned Classifiers box and click <Configure Classifier...>. Note that you can achieve the same result by double-clicking on the classifier.
-
In the metadata field, type ",ex.Title" after the "dc.Title"—i.e. make it read
dc.Title,ex.Title
-
If you have already done the Enhanced Word document handling exercise, some of the documents will have extracted ex.Creator metadata, and some will have dc.Creator. To use both of these in the Creators classifier, make a similar change to the AZCompactList: make the metadata field read dc.Creator,ex.Creator.
Build the collection again and preview it. Now all of the documents should appear in the Titles list (and extracted Creators should appear in the Creators list).
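The effect of listing several metadata names in one classifier can be sketched as an ordered fallback. This is a hedged illustration, not Greenstone's code: first_available and titles_list are hypothetical helpers, the titles are placeholders, and the assumption is that the names are tried in the order listed and the first one with a value is used.

```python
def first_available(metadata, field_list):
    """Return the first value found, trying the comma-separated fields in order."""
    for field in field_list.split(","):
        values = metadata.get(field.strip(), [])
        if values:
            return values[0]
    return None


def titles_list(documents, field_list):
    """Build (title, filename) entries for files that have any listed field."""
    entries = []
    for filename, metadata in documents.items():
        title = first_available(metadata, field_list)
        if title is not None:
            entries.append((title, filename))
    return sorted(entries, key=lambda entry: entry[0].lower())


# Placeholder titles: one file has a manual title, the other only an extracted one.
documents = {
    "word03.doc": {"dc.Title": ["Manual title"], "ex.Title": ["Extracted title"]},
    "pdf03.pdf": {"ex.Title": ["Another extracted title"]},
}

print(titles_list(documents, "dc.Title"))           # only the manually titled file
print(titles_list(documents, "dc.Title,ex.Title"))  # both files; dc.Title wins when present
```

Putting dc.Title first means a manually corrected title takes precedence, while documents without one still appear under their extracted title.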
We will experiment with the format statements and customize the appearance of this collection in the Formatting the Word and PDF collection exercise.