Greenstone tutorial exercise

Prerequisite: A collection of Word and PDF files
Devised for Greenstone version: 2.70w|3.06
Modified for Greenstone version: 2.87|3.11

Enhanced Word document handling

The standard way Greenstone processes Word documents is to convert them to HTML format using a third-party program, wvWare. The quality of this conversion is sometimes poor. If you are using Windows and have Microsoft Word installed, you can take advantage of Windows native scripting to get a better conversion. If the original document was hierarchically structured using Word styles, these can be used to structure the resulting HTML. Word document properties can also be extracted as metadata.

  1. In your digital library, preview the reports collection. Look at the HTML versions of the Word documents and notice how they have no structure: they have been converted to flat documents.

Using Windows native scripting

  1. In the Librarian Interface, open up the reports collection. Switch to the Design panel and select the Document Plugins section on the left-hand side. Double-click the WordPlugin plugin and switch on the windows_scripting option.

    In the Search Indexes section, make sure the section checkbox is ticked, so that indexes are built at section level as well as document level.

  1. Build the collection. You will notice that the Microsoft Word program is started up for each Word document—the document is saved as HTML from Word itself, to get a better conversion. Preview the collection. In the titles list, notice that word03.doc and word06.doc now have a book icon, rather than a page icon. These now appear with hierarchical structure.

    The default behaviour for WordPlugin with windows_scripting is to section the document based on "Heading 1", "Heading 2", "Heading 3" styles. If you open up the word03.doc or word06.doc documents in Word, you will see that the sections use these Heading styles.

    Note: to view style information in Word 2003, select Format → Styles and Formatting from the menu, and a side bar will appear on the right-hand side. (In Word 2007 and later, find the Change Styles button on the far right of the menu ribbon. Click on the tiny Expand icon to its bottom right to display the styles side bar.) Click on a section heading and the formatting information will be displayed in this side bar.

  1. Some of the documents do not use styles (e.g. word01.doc) and no structure can be extracted from them. Some documents use user-defined styles. WordPlugin can be configured to use these styles instead of Heading 1, Heading 2 etc. Next we will configure WordPlugin to use the styles found in word05.doc.
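
The default sectioning on "Heading 1", "Heading 2" and "Heading 3" styles described in this section can be pictured with a small Python sketch. This is only an illustration of the idea, not Greenstone's own code: the function name build_outline, the dictionary layout and the sample paragraphs are all assumptions made for the example.

```python
import re

def build_outline(paragraphs):
    """Nest (style, text) pairs into sections by Heading level.

    Paragraphs whose style is not "Heading 1/2/3" are attached as body
    text to the most recently opened section.
    """
    root = {"title": None, "level": 0, "text": [], "children": []}
    stack = [root]  # stack[-1] is the section currently being filled
    for style, text in paragraphs:
        match = re.fullmatch(r"Heading ([123])", style)
        if match is None:
            stack[-1]["text"].append(text)
            continue
        level = int(match.group(1))
        # Close sections at the same or a deeper level before opening a new one.
        while stack[-1]["level"] >= level:
            stack.pop()
        section = {"title": text, "level": level, "text": [], "children": []}
        stack[-1]["children"].append(section)
        stack.append(section)
    return root

# Hypothetical document: two top-level sections, one with a subsection.
doc = [
    ("Heading 1", "Introduction"),
    ("Normal", "Some body text."),
    ("Heading 2", "Background"),
    ("Heading 1", "Methods"),
]
outline = build_outline(doc)
```

A document with no Heading styles at all ends up with every paragraph attached to the root and no children, which corresponds to the flat result seen for documents such as word01.doc.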

Modes in the Librarian Interface

  1. The Librarian Interface operates in three modes. Go to File → Preferences... → Mode and see the modes and what functionality they provide access to. Librarian is the default mode. Check that this is indeed the currently active mode.

Defining styles

  1. Open up word05.doc in Word (by double-clicking on it in the Gather pane), and examine the title and section heading styles. You will see that various user-defined header styles are set, rather than the built-in Heading styles.

  1. In the Document Plugins section of the Design panel, select WordPlugin and click <Configure Plugin...>. Four types of header can be set:

    • level1_header (level1Header1|level1Header2|...)
    • level2_header (level2Header1|level2Header2|...)
    • level3_header (level3Header1|level3Header2|...)
    • title_header (titleHeader1|titleHeader2|...)

    These header options define which styles should be considered as title, level 1, level 2 and level 3 styles.

    Ensure that the windows_scripting option is checked, and set the four header options to the values shown below. Spaces in the Word style names are removed when they are converted to HTML styles, and these options must match the HTML styles:

    level1_header: (ChapterTitle|AppendixTitle)
    level2_header: SectionHeading
    level3_header: SubsectionHeading
    title_header: ManualTitle

    Once these are set, click <OK>.

  1. Close any documents that are still open in Word, as this can prevent the build process from completing correctly.

  1. Build the collection and preview it. Look in particular at word05.doc. You will see that this document is now also hierarchically structured.

    If you have documents with different formatting styles, you can use (...|...) to specify all of the different styles.
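
To make the relationship between Word style names and the header options concrete, here is a minimal Python sketch. It assumes that each option is treated as a regular expression that must match the whole HTML style name; the tutorial does not say exactly how WordPlugin applies the pattern, so treat that as an assumption. The Word style names with spaces (such as "Chapter Title") and the helper names are hypothetical, chosen only to show the space removal.

```python
import re

# Header options as set in this exercise.
HEADER_OPTIONS = {
    "title_header": "ManualTitle",
    "level1_header": "(ChapterTitle|AppendixTitle)",
    "level2_header": "SectionHeading",
    "level3_header": "SubsectionHeading",
}

def html_style_name(word_style):
    """Drop spaces, as happens to Word style names on conversion to HTML."""
    return word_style.replace(" ", "")

def classify(word_style):
    """Return the header option a Word style would match, or None."""
    name = html_style_name(word_style)
    for option, pattern in HEADER_OPTIONS.items():
        # Assumption: the pattern must match the complete style name.
        if re.fullmatch(pattern, name):
            return option
    return None

# Hypothetical Word style names, written as they might appear in Word.
print(classify("Chapter Title"))
print(classify("Subsection Heading"))
print(classify("Normal"))
```

With whole-name matching, "Subsection Heading" is picked up by level3_header only, even though its name contains "SectionHeading".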

Removing pre-defined table of contents

  1. If you look at the HTML version of word06.doc, you will see that it now has two tables of contents. One is generated by Greenstone based on the document's styles; the other was already defined in the Word document. WordPlugin can be configured to remove predefined tables of contents and tables of figures. The tables must be defined with Word styles in order for this to work.

  1. To remove the tables of contents and figures from word06.doc, switch on the delete_toc option in WordPlugin. Set the toc_header option to (MsoToc1|MsoToc2|MsoToc3|MsoTof|TOA). In this document, the table of contents and list of figures use these style names. Click <OK>.

  1. Build and preview the collection. word06.doc should now have only one table of contents.
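
The effect of delete_toc together with toc_header can be sketched in Python as a filter over styled paragraphs. This is an illustration under assumptions, not Greenstone's implementation: the function strip_toc, the (style, text) representation and the sample paragraphs (including the MsoNormal style name) are invented for the example.

```python
import re

# The toc_header value used in this exercise.
TOC_HEADER = "(MsoToc1|MsoToc2|MsoToc3|MsoTof|TOA)"

def strip_toc(paragraphs, toc_header=TOC_HEADER):
    """Keep only paragraphs whose style name does not match toc_header."""
    pattern = re.compile(toc_header)
    return [(style, text) for style, text in paragraphs
            if not pattern.fullmatch(style)]

# Hypothetical converted paragraphs as (style, text) pairs.
page = [
    ("MsoToc1", "1 Introduction"),
    ("MsoTof", "Figure 1"),
    ("MsoNormal", "Body text."),
]
kept = strip_toc(page)
```

Only paragraphs carrying one of the listed styles are dropped, which is why a table of contents typed by hand without Word styles would be left in place.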

Extracting document properties as metadata

  1. When the windows_scripting option is set, Word document properties can be extracted as metadata. By default, only the Title will be extracted. Other properties can be extracted using the metadata_fields option.

  1. In the Enrich panel, look at the metadata that has been extracted for word05.doc and word06.doc. Now open the documents in Word and look at what properties have been set (File → Properties for Word 2003. In Word 2007/2010, click the Word Icon on the top left, then choose Prepare → Properties. In Word 2013, File → Info; the Properties section is on the right.) They have Title, Author, Subject, and Keywords properties. WordPlugin can be configured to look for these properties and extract them.

  1. In the Design panel, under Document Plugins, configure WordPlugin once again. Switch on the metadata_fields configuration option and set its value to the following, making sure not to enter any trailing spaces:

    Title,Author<Creator>,Subject,Keywords<Subject>

    This will make WordPlugin try to extract Title, Author, Subject and Keywords metadata. Title and Subject will be saved with the same name, while Author will be saved as Creator metadata, and Keywords as Subject metadata.

  1. Make sure you have closed all the documents that were opened, then rebuild the collection.

  1. Look at the metadata for the two documents again in the Enrich panel. You should now see ex.Creator and ex.Subject metadata items. This metadata can now be used for display, in browsing classifiers, and so on.
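
The metadata_fields value used in this section reads as a comma-separated list in which Name<Other> means "extract the property Name and save it as Other". A minimal Python sketch of that reading follows. The function parse_metadata_fields is hypothetical and is not part of Greenstone; it only mirrors the behaviour described in the text.

```python
def parse_metadata_fields(spec):
    """Map each Word property name to the metadata name it is saved under."""
    mapping = {}
    for item in spec.split(","):
        if item.endswith(">") and "<" in item:
            # "Author<Creator>" -> property "Author", saved as "Creator"
            prop, _, target = item[:-1].partition("<")
        else:
            # "Title" -> saved under the same name
            prop, target = item, item
        mapping[prop] = target
    return mapping

fields = parse_metadata_fields("Title,Author<Creator>,Subject,Keywords<Subject>")
for prop, target in fields.items():
    print(f"{prop} -> ex.{target}")
```

This sketch deliberately does not strip whitespace, so a stray trailing space would become part of a name. That is one plausible reason for the warning about trailing spaces, although the tutorial does not say how WordPlugin itself treats them.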

Processing docx files

  1. Drag and drop the sample_files → Word_and_PDF → extra_docx → testword.docx file, or any Word document you have with a docx file extension, into the collection. In the Document Plugins section of the Design panel, use the <Move Down> button to move the UnknownConverterPlugin below the WordPlugin in the document plugin pipeline. Build the collection. With windows_scripting turned on, docx files, the newer Word document format, will now also be processed during the build. Preview the collection and look at the document view of the newly added Word document to see what the generated HTML version of the file looks like. (testword.docx is a very basic docx file, containing a few sentences and an image.)

  1. Now turn off windows_scripting in the Design panel and rebuild the collection. All the documents should still be processed, because Greenstone's document plugin pipeline now includes an UnknownConverterPlugin that is configured by default to use Apache Tika to extract text from Word documents (including docx files). Preview the collection and revisit the document view of the docx file. This time the HTML produced should look very different: much more basic. This is because Tika supports extracting text from many document formats, including Word documents, but is not optimised for HTML presentation. However, it does mean that full-text searching is available for docx files in an out-of-the-box Greenstone installation.

    So, at a pinch, you can always use Greenstone's default document plugin setup to process a collection that includes docx files. This at least supports full-text searching of their contents, even if the document view (the HTML view) of docx files processed with Tika is not formatted as well as the original source document. Presentation may be of secondary importance anyway, since by default Greenstone provides a link to the original source document in its original format (in this case, the docx file).

    Above, we moved the UnknownConverterPlugin that uses Apache Tika below the WordPlugin in the document plugin pipeline, so that WordPlugin gets the first attempt at every Word document it recognises. Apache Tika can always process Word documents, but we prefer WordPlugin to try first, including on the newer docx files, which it can handle on Windows machines with Word installed and windows_scripting turned on. Turning off windows_scripting tells WordPlugin not to use Word to convert doc(x) files to HTML, so WordPlugin cannot process docx files. The unprocessed docx file is then passed further down the pipeline to the UnknownConverterPlugin, which can process it because it is pre-configured to use Apache Tika to extract text from Word documents.
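
The fall-through behaviour of the plugin pipeline described in this section can be modelled with a short Python sketch. It is a toy model under stated assumptions, not Greenstone's code: the functions make_word_plugin, unknown_converter_plugin and process are hypothetical stand-ins, and file types are recognised by extension only.

```python
def make_word_plugin(windows_scripting):
    """Stand-in for WordPlugin with windows_scripting on or off."""
    def plugin(filename):
        name = filename.lower()
        if name.endswith(".doc"):
            return "WordPlugin"
        if name.endswith(".docx") and windows_scripting:
            return "WordPlugin"
        return None  # not processed: the file is passed down the pipeline
    return plugin

def unknown_converter_plugin(filename):
    """Stand-in for UnknownConverterPlugin configured to use Apache Tika."""
    if filename.lower().endswith((".doc", ".docx")):
        return "UnknownConverterPlugin (Tika)"
    return None

def process(filename, pipeline):
    """Offer the file to each plugin in order; the first to accept wins."""
    for plugin in pipeline:
        handled_by = plugin(filename)
        if handled_by is not None:
            return handled_by
    return None

with_scripting = [make_word_plugin(True), unknown_converter_plugin]
without_scripting = [make_word_plugin(False), unknown_converter_plugin]

print(process("testword.docx", with_scripting))
print(process("testword.docx", without_scripting))
```

In this model, placing unknown_converter_plugin first would make it claim every Word file before the WordPlugin stand-in is consulted, which is the situation the <Move Down> step avoids.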


Copyright © 2005-2019 by the New Zealand Digital Library Project at the University of Waikato, New Zealand
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation License.”