OpenOfficePlugin

Top level plugin that uses Open Office to convert various types of documents.

Processes files with extensions: .doc, .dot, .docx, .odt, .wpd, .ppt, .pptx, .odp, .rtf, .xls, .xlsx, .ods
Perl regular expression: (?i).(doc|dot|docx|odt|wpd|ppt|pptx|odp|rtf|xls|xlsx|ods)$
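As a rough illustration of which filenames the expression above selects, here is a minimal Python sketch. It is not Greenstone code (Greenstone evaluates the expression in Perl). The pattern is copied from the line above, except that the leading dot is escaped as `\.` on the assumption that it is meant to match a literal dot; the function name `is_processed` is hypothetical.

```python
import re

# Pattern from the plugin description. Assumption: the leading dot is meant
# literally, so it is escaped here as "\." rather than left as a wildcard.
PROCESS_EXP = re.compile(
    r"(?i)\.(doc|dot|docx|odt|wpd|ppt|pptx|odp|rtf|xls|xlsx|ods)$"
)


def is_processed(filename: str) -> bool:
    """Return True if the filename ends in one of the listed extensions."""
    return PROCESS_EXP.search(filename) is not None


if __name__ == "__main__":
    for name in ["report.DOCX", "slides.odp", "budget.xls", "notes.txt"]:
        print(name, is_processed(name))
```

Because of the `(?i)` flag, upper-case extensions such as `.DOCX` are matched as well.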

The OpenOfficePlugin is an extension that is available for Greenstone. It is useful for processing both Open Office and Microsoft Office documents, as well as other text formats.

Once OpenOffice and the extension are installed and the Greenstone environment is set up for them, the OpenOfficePlugin becomes available.

Furthermore, an OpenOfficeConverter helper plugin provides a new option for Greenstone's Word, PowerPoint and Excel Plugins, -openoffice_conversion, allowing conversion with Open Office instead of the existing converter. Switching on this new option means that more recent Office formats like docx can be included in Greenstone collections and processed by Greenstone.

Installing OpenOffice

To use OpenOfficePlugin or OpenOfficeConverter, you must first install OpenOffice. Once it is installed, you may have to set environment variables so that Greenstone can find it:

Linux

If SOFFICE_HOME is not set, set it to the full path to your OpenOffice folder.

Windows

In the Start Menu, type "environment" in the search box and select Edit environment variables for your account.
In the Environment Variables window, press New… under the User variables section. For Variable name, enter SOFFICE_HOME. For Variable value, enter the full path to your OpenOffice folder (for example C:\Program Files (x86)\OpenOffice 4) and click OK.
If there is a Path variable in the User variables section, select this and press Edit…. Add ;%SOFFICE_HOME%\program to the end of the Variable value and click OK.
If there is NOT a Path variable in the User variables section, press New…. For Variable name, enter Path. For Variable value, enter %SOFFICE_HOME%\program;%Path% and click OK.
Click OK in the Environment Variables window to save your changes.
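After setting the variable on either platform, a quick check can confirm that it is visible to newly started programs. The following is a minimal sketch, not part of Greenstone. The `program` subfolder comes from the Windows Path step above, and it is an assumption that a Linux installation uses the same layout. The function name `find_program_dir` is hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_program_dir(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return SOFFICE_HOME/program if SOFFICE_HOME is set and that folder exists."""
    home = env.get("SOFFICE_HOME")
    if not home:
        return None  # variable missing or empty
    # Assumption: the OpenOffice folder contains a "program" subfolder.
    program_dir = Path(home) / "program"
    return program_dir if program_dir.is_dir() else None


if __name__ == "__main__":
    found = find_program_dir()
    if found is None:
        print("SOFFICE_HOME is unset or has no 'program' subfolder")
    else:
        print("OpenOffice program folder:", found)
```

Run it from a freshly opened terminal, since terminals that were already open will not see the new variable.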

Installing OpenOffice Extension

Greenstone3

Once you have Open Office set up, download the Greenstone extension for it from here (available in tar.gz and zip formats) and extract it into Greenstone's gs2build/ext folder.

Greenstone2

Once you have Open Office set up, download the Greenstone extension for it from here (available in tar.gz and zip formats) and extract it into Greenstone's ext folder.

If you have Greenstone open during this, be sure to completely exit both the GLI and the server and restart Greenstone in order for the extension to become available in the Greenstone environment.

If you are running Greenstone or GLI from a terminal, start a fresh terminal and run 'source gs3-setup.sh' (Linux) or 'gs3-setup' (Windows) to include the extension setup in the environment.

Note that no other instance of OpenOffice can be running when you use GLI: terminate any previously running instance first. You are also unlikely to be able to start a separate instance of OpenOffice after quitting GLI. To do so, use Task Manager to terminate the OpenOffice process that the extension launched when GLI was run.

Plugin options

The following table lists all of the configuration options available for OpenOfficePlugin.

Option	Description	Value
OpenOfficePlugin Options
process_exp	A Perl regular expression to match against filenames. Matching filenames will be processed by this plugin. For example, using '(?i).html?\$' matches all documents ending in .htm or .html (case-insensitive).	Default: (?i).(doc|dot|docx|odt|wpd|ppt|pptx|odp|rtf|xls|xlsx|ods)$
Options Inherited from OpenOfficeConverter
openoffice_port	Port number for…	Default: 8100 Range: 81,
Options Inherited from BaseMediaConverter
enable_cache	Cache automatically generated files (such as thumbnails and screen-size images) so they don't need to be repeatedly generated.
Options Inherited from ConvertBinaryFile
convert_to	(REQUIRED) Plugin converts to TEXT or HTML or various types of Image (e.g. JPEG, GIF, PNG).	Default: auto List
keep_original_filename	Keep the original filename for the associated file, rather than converting to doc.pdf, doc.doc etc.
title_sub	Substitution expression to modify string stored as Title. Used by, for example, PDFPlugin to remove "Page 1", etc from text used as the title.
apply_fribidi	Run the "fribidi" Unicode Bidirectional Algorithm program over the converted file (for right-to-left text).
use_strings	If set, a simple strings function will be called to extract text if the conversion utility fails.
Options Inherited from AutoExtractMetadata
first	Comma separated list of numbers of characters to extract from the start of the text into a set of metadata fields called 'FirstN', where N is the size. For example, the values "3,5,7" will extract the first 3, 5 and 7 characters into metadata fields called "First3", "First5" and "First7".
Options Inherited from AcronymExtractor
extract_acronyms	Extract acronyms from within text and set as metadata.
markup_acronyms	Add acronym metadata into document text.
Options Inherited from KeyphraseExtractor
extract_keyphrases	Extract keyphrases automatically with Kea (default settings).
extract_keyphrases_kea4	Extract keyphrases automatically with Kea 4.0 (default settings). Kea 4.0 is a new version of Kea that has been developed for controlled indexing of documents in the domain of agriculture.
extract_keyphrase_options	Options for keyphrase extraction with Kea. For example: mALIWEB - use ALIWEB extraction model; n5 - extract 5 keyphrases; eGBK - use GBK encoding.
Options Inherited from EmailAddressExtractor
extract_email	Extract email addresses as metadata.
Options Inherited from DateExtractor
extract_historical_years	Extract time-period information from historical documents. This is stored as metadata with the document. There is a search interface for this metadata, which you can include in your collection by adding the statement, "format QueryInterface DateSearch" to your collection configuration file.
maximum_year	The maximum historical date to be used as metadata (in a Common Era date, such as 1950).	Default: 2013
maximum_century	The maximum named century to be extracted as historical metadata (e.g. 14 will extract all references up to the 14th century).	Default: -1
no_bibliography	Do not try to block bibliographic dates when extracting historical dates.
Options Inherited from GISExtractor
extract_placenames	Extract placenames from within text and set as metadata. Requires GIS extension to Greenstone.
gazetteer	Gazetteer to use to extract placenames from within text and set as metadata. Requires GIS extension to Greenstone.
place_list	When extracting placenames, include a list of placenames at the start of the document. Requires GIS extension to Greenstone.
Options Inherited from BasePlugin
process_exp	A Perl regular expression to match against filenames. Matching filenames will be processed by this plugin. For example, using '(?i).html?\$' matches all documents ending in .htm or .html (case-insensitive).
no_blocking	Don't do any file blocking. Any associated files (e.g. images in a web page) will be added to the collection as documents in their own right.
block_exp	Files matching this regular expression will be blocked from being passed to any later plugins in the list.
store_original_file	Save the original source document as an associated file. Note this is already done for files like PDF, Word etc. This option is only useful for plugins that don't already store a copy of the original file.
associate_ext	Causes files with the same root filename as the document being processed by the plugin AND a filename extension from the comma separated list provided by this argument to be associated with the document being processed rather than handled as a separate list.
associate_tail_re	A regular expression to match filenames against to find associated files. Used as a more powerful alternative to associate_ext.
OIDtype	The method to use when generating unique identifiers for each document.	Default: auto List
OIDmetadata	Specifies the metadata element that holds the document's unique identifier, for use with -OIDtype=assigned.	Default: dc.Identifier
no_cover_image	Do not look for a prefix.jpg file (where prefix is the same prefix as the file being processed) to associate as a cover image.
filename_encoding	The encoding of the source file filenames.	Default: auto List
file_rename_method	The method to be used in renaming the copy of the imported file and associated files.	Default: url List
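
To make the description of the first option in the table above concrete, here is a minimal Python sketch of the behaviour it describes. It is an illustration only: Greenstone's own implementation is written in Perl and may handle whitespace or markup differently. The function name `first_n_metadata` is hypothetical.

```python
from typing import Dict


def first_n_metadata(text: str, sizes: str) -> Dict[str, str]:
    """Build FirstN fields from a comma-separated list of sizes such as "3,5,7"."""
    fields: Dict[str, str] = {}
    for part in sizes.split(","):
        n = int(part.strip())
        # Each field holds the first n characters of the text.
        fields[f"First{n}"] = text[:n]
    return fields


if __name__ == "__main__":
    print(first_n_metadata("Greenstone", "3,5,7"))
```

With the value "3,5,7" from the option description, the sketch produces the fields First3, First5 and First7.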
