Greenstone Wiki

PagedImagePlugin

Plugin for documents made up of a sequence of images, with optional OCR text for each image. This plugin processes .item files, which list the sequence of image and text files and provide metadata for the document.

Processes files with extensions: .item
Perl regular expression: \.item$
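
As a rough illustration of what the default expression accepts, the sketch below applies the same pattern to a few filenames. It uses Python's `re` module rather than Perl, and the function name `is_item_file` and the sample filenames are made up for this example; for a pattern this simple the two regex engines behave the same way.

```python
import re

# Pattern copied from the plugin's default process_exp.
PROCESS_EXP = re.compile(r"\.item$")


def is_item_file(filename: str) -> bool:
    """Return True if the filename matches the default process_exp."""
    return PROCESS_EXP.search(filename) is not None


if __name__ == "__main__":
    # Hypothetical filenames, chosen only to show the matching behaviour.
    for name in ["book1.item", "book1.item.bak", "BOOK2.ITEM", "page001.jpg"]:
        print(name, is_item_file(name))
```

Note that the default pattern is case-sensitive, so a file named BOOK2.ITEM would not be picked up unless process_exp is overridden, for example with a `(?i)` prefix.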

The following table lists all of the configuration options available for PagedImagePlugin.

Option	Description	Value
PagedImagePlugin Options
process_exp	A Perl regular expression to match against filenames. Matching filenames will be processed by this plugin. For example, using '(?i)\.html?$' matches all documents ending in .htm or .html (case-insensitive).	Default: \.item$
title_sub	Substitution expression to modify the string stored as Title. Used by, for example, PDFPlugin to remove "Page 1", etc. from the text used as the title.
headerpage	Add a top level header page (that contains no image) to each document.
documenttype	Set the document type (used for display)	Default: auto List
Options Inherited from ImageConverter
create_thumbnail	If set to true, create a thumbnail version of each image, and add Thumb, ThumbType, thumbicon, ThumbWidth, ThumbHeight metadata.	Default: true List
thumbnailsize	Make thumbnails of size nxn.	Default: 100 Range: 1,
thumbnailtype	Make thumbnails in format 's'.	Default: gif
noscaleup	Don't scale up small images when making thumbnails.
create_screenview	If set to true, create a screen sized image, and set Screen, ScreenType, screenicon, ScreenWidth, ScreenHeight metadata.	Default: true List
screenviewsize	Make screenview images of size nxn.	Default: 500 Range: 1,
screenviewtype	Make screenview images in format 's'.	Default: jpg
converttotype	Convert main image to format 's'.
minimumsize	Ignore images smaller than n bytes.	Default: 100 Range: 1,
apply_aspectpad	{ImageConverter.apply_aspectpad}	Default: false List
aspectpad_ratio	{ImageConverter.aspectpad_ratio}	Default: 2 Range: 1,
aspectpad_mode	{ImageConverter.aspectpad_mode}	Default: al List
aspectpad_colour	{ImageConverter.aspectpad_colour}	Default: transparent
aspectpad_tolerance	{ImageConverter.aspectpad_tolerance}	Default: 0.0 Range: 0,
Options Inherited from BaseMediaConverter
enable_cache	Cache automatically generated files (such as thumbnails and screen-size images) so they don't need to be repeatedly generated.
Options Inherited from ReadTextFile
input_encoding	The encoding of the source documents. Documents will be converted from these encodings and stored internally as utf8.	Default: auto List
default_encoding	Use this encoding if -input_encoding is set to 'auto' and the text categorization algorithm fails to extract the encoding or extracts an encoding unsupported by Greenstone. This option can take the same values as -input_encoding.	Default: utf8 List
extract_language	Identify the language of each document and set 'Language' metadata. Note that this will be done automatically if -input_encoding is 'auto'.
default_language	If Greenstone fails to work out what language a document is in, the 'Language' metadata element will be set to this value. The default is 'en' (ISO 639 language symbols are used: en = English). Note that if -input_encoding is not set to 'auto' and -extract_language is not set, all documents will have their 'Language' metadata set to this value.	Default: en
Options Inherited from AutoExtractMetadata
first	Comma separated list of numbers of characters to extract from the start of the text into a set of metadata fields called 'FirstN', where N is the size. For example, the values "3,5,7" will extract the first 3, 5 and 7 characters into metadata fields called "First3", "First5" and "First7".
Options Inherited from AcronymExtractor
extract_acronyms	Extract acronyms from within text and set as metadata.
markup_acronyms	Add acronym metadata into document text.
Options Inherited from KeyphraseExtractor
extract_keyphrases	Extract keyphrases automatically with Kea (default settings).
extract_keyphrases_kea4	Extract keyphrases automatically with Kea 4.0 (default settings). Kea 4.0 is a new version of Kea that has been developed for controlled indexing of documents in the domain of agriculture.
extract_keyphrase_options	Options for keyphrase extraction with Kea. For example: mALIWEB - use ALIWEB extraction model; n5 - extract 5 keyphrases; eGBK - use GBK encoding.
Options Inherited from EmailAddressExtractor
extract_email	Extract email addresses as metadata.
Options Inherited from DateExtractor
extract_historical_years	Extract time-period information from historical documents. This is stored as metadata with the document. There is a search interface for this metadata, which you can include in your collection by adding the statement, "format QueryInterface DateSearch" to your collection configuration file.
maximum_year	The maximum historical date to be used as metadata (in a Common Era date, such as 1950).	Default: 2013
maximum_century	The maximum named century to be extracted as historical metadata (e.g. 14 will extract all references up to the 14th century).	Default: -1
no_bibliography	Do not try to block bibliographic dates when extracting historical dates.
Options Inherited from GISExtractor
extract_placenames	Extract placenames from within text and set as metadata. Requires GIS extension to Greenstone.
gazetteer	Gazetteer to use to extract placenames from within text and set as metadata. Requires GIS extension to Greenstone.
place_list	When extracting placenames, include a list of placenames at the start of the document. Requires GIS extension to Greenstone.
Options Inherited from BasePlugin
process_exp	A Perl regular expression to match against filenames. Matching filenames will be processed by this plugin. For example, using '(?i)\.html?$' matches all documents ending in .htm or .html (case-insensitive).
no_blocking	Don't do any file blocking. Any associated files (e.g. images in a web page) will be added to the collection as documents in their own right.
block_exp	Files matching this regular expression will be blocked from being passed to any later plugins in the list.
store_original_file	Save the original source document as an associated file. Note this is already done for files like PDF, Word etc. This option is only useful for plugins that don't already store a copy of the original file.
associate_ext	Causes files with the same root filename as the document being processed by the plugin AND a filename extension from the comma separated list provided by this argument to be associated with the document being processed rather than handled as separate documents.
associate_tail_re	A regular expression to match filenames against to find associated files. Used as a more powerful alternative to associate_ext.
OIDtype	The method to use when generating unique identifiers for each document.	Default: auto List
OIDmetadata	Specifies the metadata element that holds the document's unique identifier, for use with -OIDtype=assigned.	Default: dc.Identifier
no_cover_image	Do not look for a prefix.jpg file (where prefix is the same prefix as the file being processed) to associate as a cover image.
filename_encoding	The encoding of the source file filenames.	Default: auto List
file_rename_method	The method to be used in renaming the copy of the imported file and associated files.	Default: url List
Options Inherited from ReadXMLFile
process_exp	A Perl regular expression to match against filenames. Matching filenames will be processed by this plugin. For example, using '(?i)\.html?$' matches all documents ending in .htm or .html (case-insensitive).	Default: (?i)\.xml$
xslt	Transform a matching input document with the XSLT in the named file. A relative filename is assumed to be in the collection's file area, for instance etc/mods2dc.xsl.
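
The 'first' option in the table above (inherited from AutoExtractMetadata) can be pictured with a small sketch. This is not Greenstone's implementation, which is written in Perl and may clean up the text before extracting; the function name `extract_first` is hypothetical and the code only mirrors the "3,5,7" example from the option description.

```python
def extract_first(text: str, sizes: str) -> dict[str, str]:
    """Build 'FirstN' metadata from a comma separated list of sizes.

    Simplifying assumption: characters are taken from the raw text
    without any whitespace normalisation or markup stripping.
    """
    metadata = {}
    for part in sizes.split(","):
        n = int(part.strip())
        # Slice the first n characters and store them under 'FirstN'.
        metadata[f"First{n}"] = text[:n]
    return metadata


if __name__ == "__main__":
    # Hypothetical page text, used only for illustration.
    for name, value in extract_first("Greenstone", "3,5,7").items():
        print(name, repr(value))
```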

documenttype option values

Value	Description
auto	Automatically set document type based on item file format. Uses 'paged' for documents with a single sequence of pages, and 'hierarchy' for documents with internal structure (i.e. from XML item files containing PageGroup elements).
paged	Paged documents have a linear sequence of pages and no internal structure. They will be displayed with next and previous arrows and a 'go to page X' box.
hierarchy	Hierarchical documents have internal structure and will be displayed with a table of contents.
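
The 'auto' rule described above could be sketched roughly as follows. This is an illustration of the stated behaviour, not the plugin's actual code: the function name `guess_document_type` is hypothetical, and apart from PageGroup, every element name and the plain-text line in the sample inputs are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def guess_document_type(item_text: str) -> str:
    """Return 'hierarchy' for XML containing PageGroup, else 'paged'."""
    try:
        root = ET.fromstring(item_text)
    except ET.ParseError:
        # Not XML at all: treat it as a single sequence of pages.
        return "paged"
    has_group = root.tag == "PageGroup" or root.find(".//PageGroup") is not None
    return "hierarchy" if has_group else "paged"


if __name__ == "__main__":
    # Hypothetical item file contents; only PageGroup comes from the docs.
    structured = "<Doc><PageGroup><Page/><Page/></PageGroup></Doc>"
    flat_xml = "<Doc><Page/><Page/></Doc>"
    plain_text = "1:page1.gif:page1.txt"
    for sample in (structured, flat_xml, plain_text):
        print(guess_document_type(sample))
```

Setting documenttype explicitly to 'paged' or 'hierarchy' bypasses this kind of detection.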
