This version (2014/04/14 11:52) is a draft.

EmailPlugin

A plugin that reads email files. These are named either with a plain number (as they appear in maildir folders) or with the extension .mbx (for the mbox mail file format).

Document text: The document text consists of all the text after the first blank line in the document.
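The rule above can be sketched in a few lines of Python. This is only an illustration of "everything after the first blank line", not the plugin's own code: the function name split_message is hypothetical, and the handling of CRLF line endings and of files with no blank line at all is an assumption.

```python
from typing import Tuple


def split_message(raw: str) -> Tuple[str, str]:
    """Split a raw message into (header block, document text)."""
    # Assumption: normalise CRLF so the blank-line search also works on
    # messages stored with Windows/network line endings.
    text = raw.replace("\r\n", "\n")
    head, sep, body = text.partition("\n\n")
    if not sep:
        # Assumption: with no blank line, treat everything as headers.
        return text, ""
    return head, body


raw = (
    "From: alice@example.org\n"
    "Subject: Hello\n"
    "\n"
    "First line.\n"
    "\n"
    "Second paragraph.\n"
)
headers, body = split_message(raw)
print(body)
```

Only the first blank line acts as the separator, so later blank lines (such as the one between the two paragraphs here) stay inside the document text.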

Metadata (not Dublin Core!):

  • $Headers All the header content (optional, not stored by default)
  • $Subject Subject: header
  • $To To: header
  • $From From: header
  • $FromName Name of sender (where available)
  • $FromAddr E-mail address of sender
  • $DateText Date: header
  • $Date Date: header in GSDL format (e.g. 19990924)

Processes files with the extensions .mbx, .mbox, .email or .eml, or whose names consist entirely of digits.
Perl regular expression: ([\/]\d+|\.(mbo?x|email|eml))$
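To make the metadata fields listed above more concrete, here is a minimal Python sketch that derives comparable values from a message using the standard library's email package. It is not how EmailPlugin itself is implemented: the function name extract_metadata is hypothetical, and taking the calendar date exactly as written in the Date: header (without any time zone conversion) is an assumption.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime


def extract_metadata(raw: str) -> dict:
    """Build a dict of header-derived metadata for one message."""
    msg = message_from_string(raw)
    from_header = msg.get("From", "")
    from_name, from_addr = parseaddr(from_header)
    meta = {
        "Subject": msg.get("Subject", ""),
        "To": msg.get("To", ""),
        "From": from_header,
        "FromAddr": from_addr,
        "DateText": msg.get("Date", ""),
    }
    if from_name:
        # FromName is only set where the header carries a display name.
        meta["FromName"] = from_name
    try:
        # Assumption: use the date as written in the header, formatted
        # as YYYYMMDD to mirror the 19990924 example.
        parsed = parsedate_to_datetime(meta["DateText"])
        meta["Date"] = parsed.strftime("%Y%m%d")
    except (TypeError, ValueError):
        # Missing or unparseable Date: header, so no Date value is set.
        pass
    return meta


raw = (
    "From: Alice Example <alice@example.org>\n"
    "To: bob@example.org\n"
    "Subject: Minutes\n"
    "Date: Fri, 24 Sep 1999 10:15:00 +1200\n"
    "\n"
    "Body text.\n"
)
meta = extract_metadata(raw)
print(meta["FromName"], meta["FromAddr"], meta["Date"])
```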

The following table lists all of the configuration options available for EmailPlugin.

Option Description Value
EmailPlugin Options
process_exp A Perl regular expression to match against filenames. Matching filenames will be processed by this plugin. For example, using '(?i).html?\$' matches all documents ending in .htm or .html (case-insensitive). Default: ([\/]\d+|\.(mbo?x|email|eml))$
no_attachments Do not save message attachments.
headers Store email headers as "Headers" metadata.
OIDtype The method to use when generating unique identifiers for each document. Default: message_id. Allowed values are listed under "OIDtype option values" below.
OIDmetadata Specifies the metadata element that holds the document's unique identifier, for use with -OIDtype=assigned. Default: dc.Identifier
split_exp A Perl regular expression used to split files containing many messages into individual documents. Default: \nFrom .*\d{4}\n
Options Inherited from SplitTextFile
split_exp A Perl regular expression to split input files into segments.
Options Inherited from ReadTextFile
input_encoding The encoding of the source documents. Documents will be converted from these encodings and stored internally as utf8. Default: auto
default_encoding Use this encoding if -input_encoding is set to 'auto' and the text categorization algorithm fails to extract the encoding or extracts an encoding unsupported by Greenstone. This option can take the same values as -input_encoding. Default: utf8
extract_language Identify the language of each document and set 'Language' metadata. Note that this will be done automatically if -input_encoding is 'auto'.
default_language If Greenstone fails to work out what language a document is in, the 'Language' metadata element will be set to this value. The default is 'en' (ISO 639 language symbols are used: en = English). Note that if -input_encoding is not set to 'auto' and -extract_language is not set, all documents will have their 'Language' metadata set to this value. Default: en
Options Inherited from AutoExtractMetadata
first Comma separated list of numbers of characters to extract from the start of the text into a set of metadata fields called 'FirstN', where N is the size. For example, the values "3,5,7" will extract the first 3, 5 and 7 characters into metadata fields called "First3", "First5" and "First7".
Options Inherited from AcronymExtractor
extract_acronyms Extract acronyms from within text and set as metadata.
markup_acronyms Add acronym metadata into document text.
Options Inherited from KeyphraseExtractor
extract_keyphrases Extract keyphrases automatically with Kea (default settings).
extract_keyphrases_kea4 Extract keyphrases automatically with Kea 4.0 (default settings). Kea 4.0 is a new version of Kea that has been developed for controlled indexing of documents in the domain of agriculture.
extract_keyphrase_options Options for keyphrase extraction with Kea. For example: mALIWEB - use ALIWEB extraction model; n5 - extract 5 keyphrases; eGBK - use GBK encoding.
Options Inherited from EmailAddressExtractor
extract_email Extract email addresses as metadata.
Options Inherited from DateExtractor
extract_historical_years Extract time-period information from historical documents. This is stored as metadata with the document. There is a search interface for this metadata, which you can include in your collection by adding the statement "format QueryInterface DateSearch" to your collection configuration file.
maximum_year The maximum historical date to be used as metadata (in a Common Era date, such as 1950). Default: 2013
maximum_century The maximum named century to be extracted as historical metadata (e.g. 14 will extract all references up to the 14th century). Default: -1
no_bibliography Do not try to block bibliographic dates when extracting historical dates.
Options Inherited from GISExtractor
extract_placenames Extract placenames from within text and set as metadata. Requires GIS extension to Greenstone.
gazetteer Gazetteer to use to extract placenames from within text and set as metadata. Requires GIS extension to Greenstone.
place_list When extracting placenames, include a list of placenames at the start of the document. Requires GIS extension to Greenstone.
Options Inherited from BasePlugin
process_exp A Perl regular expression to match against filenames. Matching filenames will be processed by this plugin. For example, using '(?i).html?\$' matches all documents ending in .htm or .html (case-insensitive).
no_blocking Don't do any file blocking. Any associated files (e.g. images in a web page) will be added to the collection as documents in their own right.
block_exp Files matching this regular expression will be blocked from being passed to any later plugins in the list.
store_original_file Save the original source document as an associated file. Note this is already done for files like PDF, Word etc. This option is only useful for plugins that don't already store a copy of the original file.
associate_ext Causes files that have the same root filename as the document being processed by the plugin AND a filename extension from the comma-separated list provided by this argument to be associated with the document being processed, rather than handled as separate documents.
associate_tail_re A regular expression to match filenames against to find associated files. Used as a more powerful alternative to associate_ext.
OIDtype The method to use when generating unique identifiers for each document. Default: auto. Allowed values are listed under "OIDtype option values" below.
OIDmetadata Specifies the metadata element that holds the document's unique identifier, for use with -OIDtype=assigned. Default: dc.Identifier
no_cover_image Do not look for a prefix.jpg file (where prefix is the same prefix as the file being processed) to associate as a cover image.
filename_encoding The encoding of the source file filenames. Default: auto
file_rename_method The method to be used in renaming the copy of the imported file and associated files. Default: url
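
As a rough illustration of the two regular-expression options in the table above, the sketch below applies the default process_exp to a few file paths and the default split_exp to a small mbox-style string. It uses Python's re module rather than Perl, on the assumption that both default patterns behave the same way in the two engines. The helper names is_processed and split_messages and the sample paths are hypothetical and not part of Greenstone.

```python
import re

# Default values copied from the option table.
PROCESS_EXP = re.compile(r"([\/]\d+|\.(mbo?x|email|eml))$")
SPLIT_EXP = re.compile(r"\nFrom .*\d{4}\n")


def is_processed(path: str) -> bool:
    """True if the path matches the default process_exp."""
    return PROCESS_EXP.search(path) is not None


def split_messages(text: str) -> list:
    """Split a multi-message file on the default split_exp."""
    # The pattern has no capturing groups, so the separator lines
    # themselves are dropped from the result.
    return [part for part in SPLIT_EXP.split(text) if part.strip()]


paths = [
    "import/inbox/cur/1234",   # all-digit maildir-style name
    "import/archive.mbx",
    "import/archive.mbox",
    "import/note.eml",
    "import/readme.txt",       # not an email file
    "import/12ab",             # digits followed by letters
]
for p in paths:
    print(p, is_processed(p))

mbox = (
    "From alice@example.org Fri Sep 24 10:15:00 1999\n"
    "Subject: One\n"
    "\n"
    "First body.\n"
    "\n"
    "From bob@example.org Sat Sep 25 09:00:00 1999\n"
    "Subject: Two\n"
    "\n"
    "Second body.\n"
)
parts = split_messages(mbox)
print(len(parts))
```

Because the default split_exp starts with a newline, the very first "From " line of a file is not matched as a separator, so in this sketch the first chunk keeps that line. How the plugin treats that leading line is not stated in the table.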

OIDtype option values

Value Description
auto Use OIDtype set in import.pl.
hash Hash the contents of the file. Document identifiers will be the same every time the collection is imported.
hash_on_ga_xml Hash the contents of the Greenstone Archive XML file. Document identifiers will be the same every time the collection is imported as long as the metadata does not change.
hash_on_full_filename Hash on the full filename to the document within the 'import' folder (and not its contents). Helps make document identifiers more stable across upgrades of the software, although it means that duplicate documents contained in the collection are no longer detected automatically.
assigned Use the metadata value given by the OIDmetadata option; if it is unspecified for a particular document, a hash is used instead. These identifiers should be unique. Numeric identifiers will be preceded by 'D'.
incremental Use a simple document count. Significantly faster than "hash", but does not necessarily assign the same identifier to the same document content if the collection is reimported.
filename Use the tail file name (without the file extension). Requires every filename across all the folders within 'import' to be unique. Numeric identifiers will be preceded by 'D'.
dirname Use the immediate parent directory name. There should only be one document per directory, and directory names should be unique. For example, import/b13as/h15ef/page.html will get an identifier of h15ef. Numeric identifiers will be preceded by 'D'.
full_filename Use the full file name within the 'import' folder as the identifier for the document (with _ and - substitutions made for symbols such as directory separators and the full stop in a filename extension).
message_id Use the message identifier as the document OID. If no message identifier is found, a hash OID is used instead.
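
The fallback behaviour described for the assigned and message_id values can be sketched as follows. This is a simplified model under stated assumptions, not Greenstone's implementation: the function names, the "Message-ID" dictionary key and the use of SHA-1 as the stand-in hash are all hypothetical, since the table does not say which hash algorithm or identifier format Greenstone uses.

```python
import hashlib


def fallback_hash(content: bytes) -> str:
    # Assumption: SHA-1 hex digest stands in for Greenstone's hash OID.
    return hashlib.sha1(content).hexdigest()


def assign_oid(oid_type: str, content: bytes, metadata: dict,
               oid_metadata: str = "dc.Identifier") -> str:
    """Pick an identifier for one document (hypothetical helper)."""
    if oid_type == "message_id":
        value = metadata.get("Message-ID")  # hypothetical key name
    elif oid_type == "assigned":
        value = metadata.get(oid_metadata)
    else:
        raise ValueError("this sketch only models assigned and message_id")
    if not value:
        # No usable value for this document: fall back to a hash.
        return fallback_hash(content)
    if oid_type == "assigned" and value.isdigit():
        # Numeric identifiers are preceded by 'D'.
        return "D" + value
    return value


print(assign_oid("assigned", b"text", {"dc.Identifier": "42"}))
print(assign_oid("message_id", b"text", {"Message-ID": "<abc@example.org>"}))
print(assign_oid("message_id", b"text", {}))
```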