EmbeddedMetadataPlugin
Plugin that extracts embedded metadata from a variety of file types. It is based on the CPAN module 'ExifTool', which includes support for over 70 file formats and 20 metadata formats. Highlights include: video formats such as AVI, ASF, FLV, MPEG, OGG Vorbis, and WMV; image formats such as BMP, GIF, JPEG, JPEG 2000, and PNG; audio formats such as AIFF, RealAudio, FLAC, MP3, and WAV; Office document formats such as Encapsulated PostScript, HTML, PDF, and Word. More details are available at the ExifTool home page.
The following table lists all of the configuration options available for EmbeddedMetadataPlugin.
Option | Description | Value |
---|---|---|
EmbeddedMetadataPlugin Options | ||
metadata_field_separator | Separator character used in multi-valued metadata. Will split a metadata field value on this character, and add each item as individual metadata. | |
input_encoding | The encoding of the source documents. Documents will be converted from these encodings and stored internally as utf8. | Default: auto List |
join_before_split | Join fields with multiple entries (e.g. Authors or Keywords) before they are (optionally) split using the specified separator. | |
join_character | The character to use with join_before_split (default is a single space). | |
trim_whitespace | Trim whitespace from start and end of any extracted metadata values (Note: this also applies to any values generated through joining with join_before_split or splitting through metadata_field_separator). | Default: true List |
set_filter_list | A comma-separated list of the metadata sets we would like to retrieve. | |
set_filter_regexp | A regular expression that selects the metadata we would like to retrieve. | Default: .* |
Options Inherited from BasePlugin | ||
process_exp | A perl regular expression to match against filenames. Matching filenames will be processed by this plugin. For example, using '(?i).html?\$' matches all documents ending in .htm or .html (case-insensitive). | |
no_blocking | Don't do any file blocking. Any associated files (e.g. images in a web page) will be added to the collection as documents in their own right. | |
block_exp | Files matching this regular expression will be blocked from being passed to any later plugins in the list. | |
store_original_file | Save the original source document as an associated file. Note this is already done for files like PDF, Word etc. This option is only useful for plugins that don't already store a copy of the original file. | |
associate_ext | Causes files with the same root filename as the document being processed by the plugin AND a filename extension from the comma-separated list provided by this argument to be associated with the document being processed rather than handled as a separate list. | |
associate_tail_re | A regular expression to match filenames against to find associated files. Used as a more powerful alternative to associate_ext. | |
OIDtype | The method to use when generating unique identifiers for each document. | Default: auto List |
OIDmetadata | Specifies the metadata element that holds the document's unique identifier, for use with -OIDtype=assigned. | Default: dc.Identifier |
no_cover_image | Do not look for a prefix.jpg file (where prefix is the same prefix as the file being processed) to associate as a cover image. | |
filename_encoding | The encoding of the source file filenames. | Default: auto List |
file_rename_method | The method to be used in renaming the copy of the imported file and associated files. | Default: url List |
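
The interaction between join_before_split, join_character, metadata_field_separator and trim_whitespace can be easier to follow with a small example. The Python sketch below only mirrors the behaviour described in the table above; it is not the plugin's own (Perl) implementation, and the function name `process_values` and its parameters are hypothetical. Details the table does not specify, such as how empty items are handled, are left as plain assumptions.

```python
def process_values(values, separator=None, join_before_split=False,
                   join_character=" ", trim_whitespace=True):
    """Illustrative only: mimics the documented option semantics."""
    # join_before_split: merge a field with multiple entries into one string.
    if join_before_split and len(values) > 1:
        values = [join_character.join(values)]
    # metadata_field_separator: split each value into individual items.
    if separator:
        values = [part for value in values for part in value.split(separator)]
    # trim_whitespace: also applies to values produced by joining or splitting.
    if trim_whitespace:
        values = [value.strip() for value in values]
    return values


# A field with two entries, one of which itself holds two names.
authors = ["Smith; Jones", "Lee"]
print(process_values(authors, separator=";"))
print(process_values(authors, separator=";", trim_whitespace=False))
print(process_values(["a", "b"], join_before_split=True))
```

With trimming disabled, the item produced by splitting keeps its leading space, which is why the table notes that trim_whitespace also covers values generated by joining or splitting.
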
input_encoding option values
Value | Description |
---|---|
auto | Use text categorization algorithm to automatically identify the encoding of each source document. This will be slower than explicitly setting the encoding but will work where more than one encoding is used within the same collection. |
ascii | Plain 7 bit ASCII. This may be a bit faster than using iso_8859_1. Beware of using this when the text may contain characters outside the plain 7 bit ASCII set (e.g. German or French text containing accents); use iso_8859_1 instead. |
utf8 | Either utf8 or unicode – automatically detected. |
unicode | Just unicode. |
iso_8859_6 | Arabic |
gb | Chinese Simplified (GB) |
big5 | Chinese Traditional (Big5) |
koi8_r | Cyrillic |
iso_8859_5 | Cyrillic |
koi8_u | Cyrillic (Ukrainian) |
dos_437 | DOS codepage 437 (US English) |
dos_850 | DOS codepage 850 (Latin 1) |
dos_852 | DOS codepage 852 (Central European) |
dos_866 | DOS codepage 866 (Cyrillic) |
iso_8859_7 | Greek |
iso_8859_8 | Hebrew |
iscii_de | ISCII Devanagari |
euc_jp | Japanese (EUC) |
shift_jis | Japanese (Shift-JIS) |
korean | Korean (Unified Hangul Code - i.e. a superset of EUC-KR) |
iso_8859_1 | Latin1 (western languages) |
iso_8859_15 | Latin15 (revised western) |
iso_8859_2 | Latin2 (central and eastern European languages) |
iso_8859_3 | Latin3 |
iso_8859_4 | Latin4 |
iso_8859_9 | Turkish |
windows_1250 | Windows codepage 1250 (WinLatin2) |
windows_1251 | Windows codepage 1251 (WinCyrillic) |
windows_1252 | Windows codepage 1252 (WinLatin1) |
windows_1253 | Windows codepage 1253 (WinGreek) |
windows_1254 | Windows codepage 1254 (WinTurkish) |
windows_1255 | Windows codepage 1255 (WinHebrew) |
windows_1256 | Windows codepage 1256 (WinArabic) |
windows_1257 | Windows codepage 1257 (WinBaltic) |
windows_1258 | Windows codepage 1258 (Vietnamese) |
windows_874 | Windows codepage 874 (Thai) |
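
As a rough illustration of what "converted from these encodings and stored internally as utf8" means, the Python sketch below decodes raw bytes using an explicitly chosen encoding and re-encodes them as UTF-8. The mapping `CODECS` and the function `to_utf8` are hypothetical names introduced for this example, only a few of the option values are covered, and the auto detection mode is not modelled.

```python
# Hypothetical mapping from a few option values to Python codec names.
CODECS = {
    "ascii": "ascii",
    "utf8": "utf-8",
    "iso_8859_1": "iso-8859-1",
    "windows_1252": "cp1252",
}


def to_utf8(raw: bytes, input_encoding: str) -> bytes:
    """Decode raw bytes with the given encoding and return UTF-8 bytes."""
    text = raw.decode(CODECS[input_encoding])
    return text.encode("utf-8")


# "café" stored as Latin1: the accented letter is the single byte 0xE9.
latin1_bytes = b"caf\xe9"
print(to_utf8(latin1_bytes, "iso_8859_1"))

# The same bytes are not valid 7 bit ASCII, so decoding them as ascii fails.
try:
    to_utf8(latin1_bytes, "ascii")
except UnicodeDecodeError as err:
    print("not plain ASCII:", err.reason)
```

The failure in the second call corresponds to the warning in the ascii row: accented text needs iso_8859_1 (or another suitable encoding) rather than ascii.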
trim_whitespace option values
Value | Description |
---|---|
true | true |
false | false |
en/plugin/embeddedmetadataplugin.txt · Last modified: 2023/03/13 01:46 by 127.0.0.1