This version (2014/04/14 11:59) is a draft.


A plugin for importing Microsoft Word documents.

  • Processes files with extensions: .doc, .docx, .dot
    Perl regular expressions: (?i)\.(docx?|dot)$
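To illustrate which filenames the default expression selects, here is a minimal sketch in Python. It is not part of Greenstone (the plugin itself evaluates the expression in Perl); the `is_processed` helper and the sample filenames are hypothetical, and the sketch assumes Python's `re` module treats this particular pattern the same way Perl does.

```python
import re

# Default process_exp quoted above; (?i) makes the match case-insensitive.
PROCESS_EXP = re.compile(r"(?i)\.(docx?|dot)$")


def is_processed(filename: str) -> bool:
    """Hypothetical helper: True if the filename matches the default expression."""
    return PROCESS_EXP.search(filename) is not None


# Sample filenames invented for illustration.
names = ["report.doc", "Thesis.DOCX", "template.dot", "notes.dotx", "paper.pdf"]
matched = [name for name in names if is_processed(name)]
print(matched)
```

Because the expression is anchored with `$`, an extension such as .dotx is not matched by the default; a custom process_exp would be needed to include it.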

The following table lists all of the configuration options available for WordPlugin.

WordPlugin Options
process_exp A Perl regular expression to match against filenames. Matching filenames will be processed by this plugin. For example, using '(?i).html?\$' matches all documents ending in .htm or .html (case-insensitive). Default: (?i)\.(docx?|dot)$
description_tags Split document into sub-sections where <Section> tags occur. '-keep_head' will have no effect when this option is set.
windows_scripting Use Microsoft Windows scripting technology (Visual Basic for Applications) to get Word to convert the document to HTML, rather than relying on the open source package wvWare. Causes the Word application to open on screen if it is not already running.
metadata_fields Retrieves metadata from the HTML document converted by VB scripting. Takes a comma-separated list of metadata fields to attempt to extract. Use 'tag<tagname>' to have the contents of the first <tagname> pair put in a metadata element called 'tagname'. Capitalise this as you want the metadata capitalised in Greenstone, since the tag extraction is case-insensitive. Default: Title
level1_header Possible user-defined styles for the level 1 header in the HTML document (equivalent to <h1>).
level2_header Possible user-defined styles for the level 2 header in the HTML document (equivalent to <h2>).
level3_header Possible user-defined styles for the level 3 header in the HTML document (equivalent to <h3>).
title_header Possible user-defined styles for the title header.
delete_toc Remove any table of contents, list of figures, etc. from the converted HTML file. Styles for these are specified by the toc_header option.
toc_header Possible user-defined header styles for the table of contents, table of figures, etc., to be removed if delete_toc is set.
Options Inherited from AutoLoadConverters
openoffice_conversion Use Open Office to convert Microsoft Office source documents to HTML. This option is only available if you have Open Office installed, and have downloaded the Open Office extension.
Options Inherited from ConvertBinaryFile
convert_to (REQUIRED) Plugin converts to TEXT or HTML or various types of Image (e.g. JPEG, GIF, PNG). Default: auto
keep_original_filename Keep the original filename for the associated file, rather than converting to doc.pdf, doc.doc etc.
title_sub Substitution expression to modify string stored as Title. Used by, for example, PDFPlugin to remove "Page 1", etc from text used as the title.
apply_fribidi Run the "fribidi" Unicode Bidirectional Algorithm program over the converted file (for right-to-left text).
use_strings If set, a simple strings function will be called to extract text if the conversion utility fails.
Options Inherited from AutoExtractMetadata
first Comma separated list of numbers of characters to extract from the start of the text into a set of metadata fields called 'FirstN', where N is the size. For example, the values "3,5,7" will extract the first 3, 5 and 7 characters into metadata fields called "First3", "First5" and "First7".
Options Inherited from AcronymExtractor
extract_acronyms Extract acronyms from within text and set as metadata.
markup_acronyms Add acronym metadata into document text.
Options Inherited from KeyphraseExtractor
extract_keyphrases Extract keyphrases automatically with Kea (default settings).
extract_keyphrases_kea4 Extract keyphrases automatically with Kea 4.0 (default settings). Kea 4.0 is a new version of Kea that has been developed for controlled indexing of documents in the domain of agriculture.
extract_keyphrase_options Options for keyphrase extraction with Kea. For example: mALIWEB - use ALIWEB extraction model; n5 - extract 5 keyphrases; eGBK - use GBK encoding.
Options Inherited from EmailAddressExtractor
extract_email Extract email addresses as metadata.
Options Inherited from DateExtractor
extract_historical_years Extract time-period information from historical documents. This is stored as metadata with the document. There is a search interface for this metadata, which you can include in your collection by adding the statement "format QueryInterface DateSearch" to your collection configuration file.
maximum_year The maximum historical date to be used as metadata (in a Common Era date, such as 1950). Default: 2013
maximum_century The maximum named century to be extracted as historical metadata (e.g. 14 will extract all references up to the 14th century). Default: -1
no_bibliography Do not try to block bibliographic dates when extracting historical dates.
Options Inherited from GISExtractor
extract_placenames Extract placenames from within text and set as metadata. Requires GIS extension to Greenstone.
gazetteer Gazetteer to use to extract placenames from within text and set as metadata. Requires GIS extension to Greenstone.
place_list When extracting placenames, include a list of placenames at the start of the document. Requires GIS extension to Greenstone.
Options Inherited from BasePlugin
process_exp A Perl regular expression to match against filenames. Matching filenames will be processed by this plugin. For example, using '(?i).html?\$' matches all documents ending in .htm or .html (case-insensitive).
no_blocking Don't do any file blocking. Any associated files (e.g. images in a web page) will be added to the collection as documents in their own right.
block_exp Files matching this regular expression will be blocked from being passed to any later plugins in the list.
store_original_file Save the original source document as an associated file. Note this is already done for files like PDF, Word etc. This option is only useful for plugins that don't already store a copy of the original file.
associate_ext Causes files with the same root filename as the document being processed by the plugin AND a filename extension from the comma separated list provided by this argument to be associated with the document being processed rather than handled as a separate list.
associate_tail_re A regular expression to match filenames against to find associated files. Used as a more powerful alternative to associate_ext.
OIDtype The method to use when generating unique identifiers for each document. Default: auto
OIDmetadata Specifies the metadata element that holds the document's unique identifier, for use with -OIDtype=assigned. Default: dc.Identifier
no_cover_image Do not look for a prefix.jpg file (where prefix is the same prefix as the file being processed) to associate as a cover image.
filename_encoding The encoding of the source file filenames. Default: auto
file_rename_method The method to be used in renaming the copy of the imported file and associated files. Default: url
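
The behaviour described for the 'first' option (inherited from AutoExtractMetadata) can be sketched as follows. This is only an illustration of the description in the table, not Greenstone's actual implementation, which is written in Perl and may treat whitespace or markup differently; the function name `first_n_metadata` and the sample text are hypothetical.

```python
def first_n_metadata(text: str, sizes: str) -> dict:
    """Hypothetical sketch: build 'FirstN' fields from a comma-separated size list."""
    fields = {}
    for part in sizes.split(","):
        n = int(part.strip())
        # Each field holds the first n characters of the document text.
        fields[f"First{n}"] = text[:n]
    return fields


# Using the example value "3,5,7" from the option description.
meta = first_n_metadata("Greenstone", "3,5,7")
print(meta)
```

With the value "3,5,7", the result contains the three fields First3, First5 and First7, as the option description states.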