This version (2014/04/14 11:52) is a draft.

EmbeddedMetadataPlugin

Plugin that extracts embedded metadata from a variety of file types. It is based on the CPAN module ExifTool, which includes support for over 70 file formats and 20 metadata formats. Highlights include: video formats such as AVI, ASF, FLV, MPEG, OGG Vorbis, and WMV; image formats such as BMP, GIF, JPEG, JPEG 2000 and PNG; audio formats such as AIFF, RealAudio, FLAC, MP3, and WAV; Office document formats such as Encapsulated PostScript, HTML, PDF, and Word. More details are available at the ExifTool home page.

The following table lists all of the configuration options available for EmbeddedMetadataPlugin.

Option | Description | Value

EmbeddedMetadataPlugin Options
metadata_field_separator | Separator character used in multi-valued metadata. Each metadata field value is split on this character, and each item is added as an individual metadata value. |
input_encoding | The encoding of the source documents. Documents will be converted from these encodings and stored internally as utf8. | Default: auto (see "input_encoding option values" below)
join_before_split | Join fields with multiple entries (e.g. Authors or Keywords) before they are (optionally) split using the specified separator. |
join_character | The character to use with join_before_split (default is a single space). |
trim_whitespace | Trim whitespace from the start and end of any extracted metadata values. Note: this also applies to any values generated through joining with join_before_split or splitting through metadata_field_separator. | Default: true (see "trim_whitespace option values" below)
set_filter_list | A comma-separated list of the metadata sets to retrieve. |
set_filter_regexp | A regular expression that selects the metadata to retrieve. | Default: .*

Options Inherited from BasePlugin
process_exp | A Perl regular expression to match against filenames. Matching filenames will be processed by this plugin. For example, using '(?i).html?\$' matches all documents ending in .htm or .html (case-insensitive). |
no_blocking | Don't do any file blocking. Any associated files (e.g. images in a web page) will be added to the collection as documents in their own right. |
block_exp | Files matching this regular expression will be blocked from being passed to any later plugins in the list. |
store_original_file | Save the original source document as an associated file. Note this is already done for files like PDF, Word etc. This option is only useful for plugins that don't already store a copy of the original file. |
associate_ext | Causes files that have the same root filename as the document being processed by the plugin AND a filename extension from the comma-separated list provided by this argument to be associated with that document rather than being handled separately. |
associate_tail_re | A regular expression to match filenames against to find associated files. Used as a more powerful alternative to associate_ext. |
OIDtype | The method to use when generating unique identifiers for each document. | Default: auto (enumerated list)
OIDmetadata | Specifies the metadata element that holds the document's unique identifier, for use with -OIDtype=assigned. | Default: dc.Identifier
no_cover_image | Do not look for a prefix.jpg file (where prefix is the same prefix as the file being processed) to associate as a cover image. |
filename_encoding | The encoding of the source file filenames. | Default: auto (enumerated list)
file_rename_method | The method to be used in renaming the copy of the imported file and associated files. | Default: url (enumerated list)
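
The filename-matching options in the table above (process_exp and associate_ext) can be illustrated with a short sketch. This is Python rather than the plugin's own Perl, and it is not the plugin's code: the function names and sample filenames are hypothetical, the trailing `\$` in the documented example is assumed to stand for the end-of-string anchor `$`, and extensions are compared exactly (case-sensitive), which is also an assumption.

```python
import os
import re

# Assumption: "\$" in the documented example is an escaped "$" anchor.
# The leading "." is left unescaped, exactly as in the documented pattern.
PROCESS_EXP = r"(?i).html?$"


def matches_process_exp(filename, pattern=PROCESS_EXP):
    """Return True if the filename would be selected by the pattern."""
    return re.search(pattern, filename) is not None


def associated_files(document, candidates, associate_ext):
    """Pick candidates sharing the document's root filename and having an
    extension from the comma-separated associate_ext list."""
    root, _ = os.path.splitext(document)
    wanted = {ext.strip() for ext in associate_ext.split(",") if ext.strip()}
    result = []
    for candidate in candidates:
        cand_root, cand_ext = os.path.splitext(candidate)
        if (candidate != document
                and cand_root == root
                and cand_ext.lstrip(".") in wanted):
            result.append(candidate)
    return result


if __name__ == "__main__":
    for name in ["index.html", "PAGE.HTM", "photo.jpg", "notes.html.bak"]:
        print(name, matches_process_exp(name))
    print(associated_files("talk.mp3",
                           ["talk.mp3", "talk.jpg", "talk.txt", "other.jpg"],
                           "jpg,png"))
```

Because the pattern is anchored at the end of the name, a file such as notes.html.bak is not selected, while the case-insensitive flag lets PAGE.HTM through.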

input_encoding option values

Value | Description
auto | Use a text categorization algorithm to automatically identify the encoding of each source document. This will be slower than explicitly setting the encoding, but will work where more than one encoding is used within the same collection.
ascii | Plain 7 bit ASCII. This may be a bit faster than using iso_8859_1. Beware of using this when the text may contain characters outside the plain 7 bit ASCII set (e.g. German or French text containing accents); use iso_8859_1 instead.
utf8 | Either utf8 or unicode (automatically detected).
unicode | Just unicode.
iso_8859_6 | Arabic
gb | Chinese Simplified (GB)
big5 | Chinese Traditional (Big5)
koi8_r | Cyrillic
iso_8859_5 | Cyrillic
koi8_u | Cyrillic (Ukrainian)
dos_437 | DOS codepage 437 (US English)
dos_850 | DOS codepage 850 (Latin 1)
dos_852 | DOS codepage 852 (Central European)
dos_866 | DOS codepage 866 (Cyrillic)
iso_8859_7 | Greek
iso_8859_8 | Hebrew
iscii_de | ISCII Devanagari
euc_jp | Japanese (EUC)
shift_jis | Japanese (Shift-JIS)
korean | Korean (Unified Hangul Code, i.e. a superset of EUC-KR)
iso_8859_1 | Latin1 (western languages)
iso_8859_15 | Latin15 (revised western)
iso_8859_2 | Latin2 (central and eastern European languages)
iso_8859_3 | Latin3
iso_8859_4 | Latin4
iso_8859_9 | Turkish
windows_1250 | Windows codepage 1250 (WinLatin2)
windows_1251 | Windows codepage 1251 (WinCyrillic)
windows_1252 | Windows codepage 1252 (WinLatin1)
windows_1253 | Windows codepage 1253 (WinGreek)
windows_1254 | Windows codepage 1254 (WinTurkish)
windows_1255 | Windows codepage 1255 (WinHebrew)
windows_1256 | Windows codepage 1256 (WinArabic)
windows_1257 | Windows codepage 1257 (WinBaltic)
windows_1258 | Windows codepage 1258 (Vietnamese)
windows_874 | Windows codepage 874 (Thai)
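
The conversion that input_encoding describes (decode from the source encoding, store as utf8) can be sketched in a few lines. This is an illustration in Python, not the plugin's Perl implementation: the function name `to_utf8` and the mapping from option values to Python codec names are hypothetical, and the mapping covers only a handful of the values listed above.

```python
# Hypothetical, partial mapping from input_encoding option values to
# Python codec names; it is not taken from the plugin itself.
CODEC_FOR_OPTION = {
    "ascii": "ascii",
    "utf8": "utf_8",
    "iso_8859_1": "latin_1",
    "koi8_r": "koi8_r",
    "dos_437": "cp437",
    "windows_1252": "cp1252",
}


def to_utf8(raw, input_encoding):
    """Decode raw bytes using the named source encoding and re-encode
    them as UTF-8. Raises UnicodeDecodeError if the bytes are not valid
    in the chosen encoding."""
    codec = CODEC_FOR_OPTION[input_encoding]
    return raw.decode(codec).encode("utf_8")


if __name__ == "__main__":
    latin1_bytes = b"caf\xe9"  # "café" encoded as Latin1
    print(to_utf8(latin1_bytes, "iso_8859_1"))
    try:
        to_utf8(latin1_bytes, "ascii")
    except UnicodeDecodeError as err:
        # Accented characters fall outside 7 bit ASCII.
        print("ascii failed:", err.reason)
```

The failing second call shows why the ascii entry warns against text with accents: a byte outside the 7 bit range cannot be decoded, whereas iso_8859_1 handles it.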

trim_whitespace option values

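Value | Description
true | Trim whitespace from the start and end of extracted metadata values.
false | Leave extracted metadata values untrimmed.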
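
The options table notes that trim_whitespace also applies to values produced by join_before_split and metadata_field_separator. One way to picture how these options might combine is the sketch below. It is written in Python under stated assumptions and is not the plugin's Perl code: the function name `process_values`, its parameter defaults, and the exact order of steps (join, then split, then trim) are inferred from the option descriptions rather than confirmed against the implementation.

```python
def process_values(values, separator=None, join_before_split=False,
                   join_character=" ", trim_whitespace=True):
    """Apply join, split and trim to a list of metadata values.

    Assumed order, inferred from the option descriptions:
      1. join multiple entries into one string (join_before_split)
      2. split each value on the separator (metadata_field_separator)
      3. strip leading and trailing whitespace (trim_whitespace)
    """
    if join_before_split and len(values) > 1:
        values = [join_character.join(values)]
    if separator:
        values = [part for value in values for part in value.split(separator)]
    if trim_whitespace:
        values = [value.strip() for value in values]
    return values


if __name__ == "__main__":
    keywords = ["cats; dogs", "birds"]
    print(process_values(keywords, separator=";",
                         join_before_split=True, join_character=";"))
    print(process_values(["a ; b"], separator=";", trim_whitespace=False))
```

With trimming enabled the first call yields three clean keywords; with it disabled, the second call keeps the spaces that surrounded the separator.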