This version (2014/04/14 11:52) is a draft.


A plugin for extracting images and associated text from webpages.

  • Processes files with extensions: .html, .htm, .shtml, .shm, .asp, .php, .cgi
    (Perl regular expression: (?i)(\.html?|\.shtml|\.shm|\.asp|\.php\d?|\.cgi|.+\?.+=.*)$)
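
To make the default expression concrete, here is a minimal sketch that tries it against a few filenames. It assumes the Perl expression behaves the same under Python's `re` module (it uses no Perl-only syntax), and the sample filenames are hypothetical, chosen only to exercise each alternative.

```python
import re

# Default process_exp quoted above, transcribed unchanged into Python's re syntax.
PROCESS_EXP = re.compile(
    r"(?i)(\.html?|\.shtml|\.shm|\.asp|\.php\d?|\.cgi|.+\?.+=.*)$"
)


def is_processed(filename: str) -> bool:
    """Return True if the filename matches the default expression."""
    # Perl's =~ operator searches anywhere in the string, so use search().
    return PROCESS_EXP.search(filename) is not None


# Hypothetical filenames, not taken from any real collection.
samples = [
    "index.html", "PAGE.HTM", "news.shtml", "list.php3",
    "view.cgi", "show?id=42", "photo.jpg", "index.html.bak",
]
for name in samples:
    print(f"{name}: {is_processed(name)}")
```

The last alternative (`.+\?.+=.*`) is what lets mirrored URLs that carry a query string, such as `show?id=42`, be treated as web pages even though they have no recognised extension.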

The following table lists all of the configuration options available for HTMLImagePlugin.

HTMLImagePlugin Options
aggressiveness Range of related text extraction techniques to use. Default: 3
index_pages Index the pages along with the images. Otherwise reference the pages at the source URL.
no_cache_images Don't cache images (point to URL of original).
min_size Bytes. Skip images smaller than this. Default: 2000
min_width Pixels. Skip images narrower than this. Default: 50
min_height Pixels. Skip images shorter than this. Default: 50
thumb_size Max thumbnail size. Both width and height. Default: 100
convert_params Additional parameters for ImageMagick convert on thumbnail creation. For example, '-raise' will give a three-dimensional effect to thumbnail images.
min_near_text Minimum characters of near text or caption to extract. Default: 10
max_near_text Maximum characters near images to extract. Default: 400
smallpage_threshold Images on pages smaller than this (bytes) will have the page (title, keywords, etc) meta-data added. Default: 2048
textrefs_threshold Threshold for textual references. Lower values mean the algorithm is less strict. Default: 2
caption_length Maximum length of captions (in characters). Default: 80
neartext_length Target length of near text (in characters). Default: 300
document_text Add image text as document:text (otherwise IndexedText metadata field).
Options Inherited from HTMLPlugin
process_exp A Perl regular expression to match against filenames. Matching filenames will be processed by this plugin. For example, using '(?i)\.html?$' matches all documents ending in .htm or .html (case-insensitive). Default: (?i)(\.html?|\.shtml|\.shm|\.asp|\.php\d?|\.cgi|.+\?.+=.*)$
block_exp Files matching this regular expression will be blocked from being passed to any later plugins in the list.
nolinks Don't make any attempt to trap links (setting this flag may improve speed of building/importing but any relative links within documents will be broken).
keep_head Don't remove headers from html files.
no_metadata Don't attempt to extract any metadata from files.
metadata_fields Comma separated list of metadata fields to attempt to extract. Capitalise this as you want the metadata capitalised in Greenstone, since the tag extraction is case insensitive. e.g. Title,Date. Use 'tag<tagname>' to have the contents of the first <tag> pair put in a metadata element called 'tagname'. e.g. Title,Date,Author<Creator> Default: Title
metadata_field_separator Separator character used in multi-valued metadata. Will split a metadata field value on this character, and add each item as individual metadata.
hunt_creator_metadata Find as much metadata as possible on authorship and place it in the 'Creator' field.
file_is_url Set if input filenames make up url of original source documents e.g. if a web mirroring tool was used to create the import directory structure.
assoc_files Perl regular expression of file extensions to associate with html documents.
rename_assoc_files Renames files associated with documents (e.g. images). Also creates much shallower directory structure (useful when creating collections to go on cd-rom).
title_sub Substitution expression to modify string stored as Title. Used by, for example, PDFPlugin to remove "Page 1", etc from text used as the title.
description_tags Split document into sub-sections where <Section> tags occur. '-keep_head' will have no effect when this option is set.
no_strip_metadata_html Comma separated list of metadata names, or 'all'. Used with -description_tags, it prevents stripping of HTML tags from the values for the specified metadata.
sectionalise_using_h_tags Automatically create a sectioned document using h1, h2, … hX tags.
use_realistic_book If set, converts an HTML document into well-formed XHTML to enable users to view the document in the book format.
old_style_HDL Marks whether the files in this collection use the old HDL document tag style.
Options Inherited from ReadTextFile
input_encoding The encoding of the source documents. Documents will be converted from these encodings and stored internally as utf8. Default: auto
default_encoding Use this encoding if -input_encoding is set to 'auto' and the text categorization algorithm fails to extract the encoding or extracts an encoding unsupported by Greenstone. This option can take the same values as -input_encoding. Default: utf8
extract_language Identify the language of each document and set 'Language' metadata. Note that this will be done automatically if -input_encoding is 'auto'.
default_language If Greenstone fails to work out what language a document is in, the 'Language' metadata element will be set to this value. The default is 'en' (ISO 639 language symbols are used: en = English). Note that if -input_encoding is not set to 'auto' and -extract_language is not set, all documents will have their 'Language' metadata set to this value. Default: en
Options Inherited from AutoExtractMetadata
first Comma separated list of numbers of characters to extract from the start of the text into a set of metadata fields called 'FirstN', where N is the size. For example, the values "3,5,7" will extract the first 3, 5 and 7 characters into metadata fields called "First3", "First5" and "First7".
Options Inherited from AcronymExtractor
extract_acronyms Extract acronyms from within text and set as metadata.
markup_acronyms Add acronym metadata into document text.
Options Inherited from KeyphraseExtractor
extract_keyphrases Extract keyphrases automatically with Kea (default settings).
extract_keyphrases_kea4 Extract keyphrases automatically with Kea 4.0 (default settings). Kea 4.0 is a new version of Kea that has been developed for controlled indexing of documents in the domain of agriculture.
extract_keyphrase_options Options for keyphrase extraction with Kea. For example: mALIWEB - use ALIWEB extraction model; n5 - extract 5 keyphrases; eGBK - use GBK encoding.
Options Inherited from EmailAddressExtractor
extract_email Extract email addresses as metadata.
Options Inherited from DateExtractor
extract_historical_years Extract time-period information from historical documents. This is stored as metadata with the document. There is a search interface for this metadata, which you can include in your collection by adding the statement, "format QueryInterface DateSearch" to your collection configuration file.
maximum_year The maximum historical date to be used as metadata (in a Common Era date, such as 1950). Default: 2013
maximum_century The maximum named century to be extracted as historical metadata (e.g. 14 will extract all references up to the 14th century). Default: -1
no_bibliography Do not try to block bibliographic dates when extracting historical dates.
Options Inherited from GISExtractor
extract_placenames Extract placenames from within text and set as metadata. Requires GIS extension to Greenstone.
gazetteer Gazetteer to use to extract placenames from within text and set as metadata. Requires GIS extension to Greenstone.
place_list When extracting placenames, include list of placenames at start of the document. Requires GIS extension to Greenstone.
Options Inherited from BasePlugin
process_exp A Perl regular expression to match against filenames. Matching filenames will be processed by this plugin. For example, using '(?i)\.html?$' matches all documents ending in .htm or .html (case-insensitive).
no_blocking Don't do any file blocking. Any associated files (e.g. images in a web page) will be added to the collection as documents in their own right.
block_exp Files matching this regular expression will be blocked from being passed to any later plugins in the list.
store_original_file Save the original source document as an associated file. Note this is already done for files like PDF, Word etc. This option is only useful for plugins that don't already store a copy of the original file.
associate_ext Causes files with the same root filename as the document being processed by the plugin AND a filename extension from the comma separated list provided by this argument to be associated with the document being processed rather than handled as separate documents.
associate_tail_re A regular expression to match filenames against to find associated files. Used as a more powerful alternative to associate_ext.
OIDtype The method to use when generating unique identifiers for each document. Default: auto
OIDmetadata Specifies the metadata element that holds the document's unique identifier, for use with -OIDtype=assigned. Default: dc.Identifier
no_cover_image Do not look for a prefix.jpg file (where prefix is the same prefix as the file being processed) to associate as a cover image.
filename_encoding The encoding of the source file filenames. Default: auto
file_rename_method The method to be used in renaming the copy of the imported file and associated files. Default: url
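
The table above gives worked examples for two of the inherited options, metadata_fields ("Title,Date,Author<Creator>") and first ("3,5,7"). The following sketch shows one way to read those two specifications. It is an illustration of the described behaviour only, not Greenstone's own code; the function names `parse_metadata_fields` and `first_n_metadata` are hypothetical.

```python
import re


def parse_metadata_fields(spec: str) -> dict:
    """Map each HTML tag name (lowercased) to the metadata name it is stored under."""
    mapping = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        # 'tag<tagname>' stores the contents of <tag> under 'tagname'.
        m = re.fullmatch(r"([^<>]+)<([^<>]+)>", item)
        if m:
            tag, name = m.group(1), m.group(2)
        else:
            tag = name = item
        # Tag extraction is case insensitive, so key on the lowercased tag;
        # the metadata name keeps the capitalisation given in the spec.
        mapping[tag.lower()] = name
    return mapping


def first_n_metadata(text: str, sizes: str) -> dict:
    """Build the 'FirstN' fields for a comma separated list of sizes."""
    return {f"First{int(s)}": text[:int(s)] for s in sizes.split(",")}


print(parse_metadata_fields("Title,Date,Author<Creator>"))
print(first_n_metadata("Greenstone", "3,5,7"))
```

Under these assumptions, the first call maps the `author` tag to a metadata element named `Creator`, and the second yields the fields First3, First5 and First7 holding the leading 3, 5 and 7 characters of the text.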

aggressiveness option values

1: Filename, path, alternative text (ALT attributes in img HTML tags) only.
2: All of 1, plus caption where available.
3: All of 2, plus near paragraphs where available.
4: All of 3, plus previous headers (<h1>, <h2>…) where available.
5: All of 4, plus textual references where available.
6: All of 4, plus metadata tags in HTML pages (title, keywords, etc).
7: All of 6, 5 and 4 combined.
8: All of 7, plus duplicating filename, path, alternative text, and caption (raise ranking of more relevant results).
9: All of 1, plus full text of source page.
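
The levels are not strictly cumulative: 5 and 6 both branch from 4, and 9 builds on 1 only. The sketch below models the values listed above as sets to make that explicit. The technique labels (`caption`, `page_metadata`, and so on) are hypothetical names invented for this illustration and are not identifiers from the plugin.

```python
BASIC = {"filename", "path", "alt_text"}


def techniques(level: int) -> set:
    """Return the technique labels that the list above gives for a level (1-9)."""
    levels = {1: set(BASIC)}
    levels[2] = levels[1] | {"caption"}
    levels[3] = levels[2] | {"near_paragraphs"}
    levels[4] = levels[3] | {"previous_headers"}
    levels[5] = levels[4] | {"textual_references"}
    levels[6] = levels[4] | {"page_metadata"}
    levels[7] = levels[4] | levels[5] | levels[6]
    # Level 8 adds no new source of text; it repeats the basic fields and caption.
    levels[8] = levels[7] | {"duplicate_basic_and_caption"}
    levels[9] = levels[1] | {"full_page_text"}
    if level not in levels:
        raise ValueError("aggressiveness must be between 1 and 9")
    return levels[level]


for n in range(1, 10):
    print(n, sorted(techniques(n)))
```

Read this way, the default of 3 uses the filename, path, alternative text, caption and near paragraphs, while 9 drops captions and near text in favour of the full text of the page.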