Base class for all the import plugins.
The first table lists all of the configuration options available for BasePlugin. The tables after it give the accepted values for the options that take one of a fixed list of values: OIDtype, filename_encoding and file_rename_method, in that order.
Option | Description | Value |
---|---|---|
BasePlugin Options | ||
process_exp | A Perl regular expression to match against filenames. Matching filenames will be processed by this plugin. For example, using '(?i).html?\$' matches all documents ending in .htm or .html (case-insensitive). | |
no_blocking | Don't do any file blocking. Any associated files (e.g. images in a web page) will be added to the collection as documents in their own right. | |
block_exp | Files matching this regular expression will be blocked from being passed to any later plugins in the list. | |
store_original_file | Save the original source document as an associated file. Note this is already done for files like PDF, Word etc. This option is only useful for plugins that don't already store a copy of the original file. | |
associate_ext | Causes files that share the root filename of the document being processed by the plugin and have a filename extension from the comma-separated list given by this argument to be associated with that document rather than handled separately. | |
associate_tail_re | A regular expression to match filenames against to find associated files. Used as a more powerful alternative to associate_ext. | |
OIDtype | The method to use when generating unique identifiers for each document. | Default: auto (see the OIDtype values table) |
OIDmetadata | Specifies the metadata element that holds the document's unique identifier, for use with -OIDtype=assigned. | Default: dc.Identifier |
no_cover_image | Do not look for a prefix.jpg file (where prefix is the same prefix as the file being processed) to associate as a cover image. | |
filename_encoding | The encoding of the source file filenames. | Default: auto (see the filename_encoding values table) |
file_rename_method | The method to be used in renaming the copy of the imported file and associated files. | Default: url (see the file_rename_method values table) |
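
The way process_exp and block_exp sort filenames can be sketched in a few lines. This is only an illustration in Python, whose `re` syntax is close to, but not identical with, Perl's. The two expressions, the `classify` helper and the order in which the expressions are tested are assumptions made for the example, not the plugin's actual code. The dot is escaped here so that only a literal full stop matches.

```python
import re

# Hypothetical expressions for illustration; not the defaults of any real plugin.
PROCESS_EXP = re.compile(r"(?i)\.html?$")           # filenames ending in .htm or .html
BLOCK_EXP = re.compile(r"(?i)\.(gif|jpe?g|png)$")   # files hidden from later plugins


def classify(filename):
    """Return 'process', 'block' or 'pass' for a filename.

    Assumption: process_exp is tested before block_exp in this sketch.
    """
    if PROCESS_EXP.search(filename):
        return "process"
    if BLOCK_EXP.search(filename):
        return "block"
    return "pass"  # left for later plugins in the list


names = ["index.HTML", "page.htm", "logo.png", "notes.txt"]
decisions = {name: classify(name) for name in names}
```

With no_blocking set, the second branch would simply never apply, and `logo.png` would be offered to later plugins as a document in its own right.
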
Value | Description |
---|---|
auto | Use OIDtype set in import.pl |
hash | Hash the contents of the file. Document identifiers will be the same every time the collection is imported. |
hash_on_ga_xml | Hash the contents of the Greenstone Archive XML file. Document identifiers will be the same every time the collection is imported as long as the metadata does not change. |
hash_on_full_filename | Hash on the full filename to the document within the 'import' folder (and not its contents). Helps make document identifiers more stable across upgrades of the software, although it means that duplicate documents contained in the collection are no longer detected automatically. |
assigned | Use the metadata value given by the OIDmetadata option; if it is unspecified for a particular document, a hash is used instead. These identifiers should be unique. Numeric identifiers will be preceded by 'D'. |
incremental | Use a simple document count. Significantly faster than "hash", but does not necessarily assign the same identifier to the same document content if the collection is reimported. |
filename | Use the tail file name (without the file extension). Requires every filename across all the folders within 'import' to be unique. Numeric identifiers will be preceded by 'D'. |
dirname | Use the immediate parent directory name. There should only be one document per directory, and directory names should be unique. E.g. import/b13as/h15ef/page.html will get an identifier of h15ef. Numeric identifiers will be preceded by 'D'. |
full_filename | Use the full file name within the 'import' folder as the identifier for the document (with _ and - substitutions made for symbols such as directory separators and the full stop in a filename extension). |
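
A rough Python sketch of three of these methods may make the differences concrete. The function names are hypothetical, "numeric" is taken to mean "digits only", and SHA-256 is a stand-in because the table does not say which hash function is used. None of this is the plugin's own implementation.

```python
import hashlib
from pathlib import PurePosixPath


def prefix_if_numeric(identifier):
    # The table states that numeric identifiers are preceded by 'D'.
    # Assumption: "numeric" means the identifier consists only of digits.
    return "D" + identifier if identifier.isdigit() else identifier


def make_oid(path, oid_type, content=b""):
    """Illustrative identifier for a document at `path` (forward-slash separated)."""
    p = PurePosixPath(path)
    if oid_type == "filename":
        return prefix_if_numeric(p.stem)         # tail file name without its extension
    if oid_type == "dirname":
        return prefix_if_numeric(p.parent.name)  # immediate parent directory name
    if oid_type == "hash":
        # Assumption: SHA-256 stands in for whatever hash the software really uses.
        return hashlib.sha256(content).hexdigest()
    raise ValueError(f"OIDtype not covered by this sketch: {oid_type}")


oid_dir = make_oid("import/b13as/h15ef/page.html", "dirname")
oid_file = make_oid("import/2024/0042.html", "filename")
```

Here `oid_dir` reproduces the `h15ef` example from the dirname row, and `oid_file` shows the 'D' prefix applied to an all-digit file name. The hash branch depends only on the content, which is why identical documents in different folders end up with the same identifier.
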
Value | Description |
---|---|
auto | Automatically detect the encoding of the filename. |
auto-language-analysis | Auto-detect the encoding of the filename by analysing it. |
auto-filesystem-encoding | Auto-detect the encoding of the filename using filesystem encoding. |
auto-fl | Uses filesystem encoding then language analysis to detect the filename encoding. |
auto-lf | Uses language analysis then filesystem encoding to detect the filename encoding. |
ascii | Plain 7-bit ASCII. This may be a bit faster than using iso_8859_1. Avoid it when the text may contain characters outside the plain 7-bit ASCII set (e.g. German or French text containing accents); use iso_8859_1 instead. |
utf8 | Either utf8 or unicode – automatically detected. |
unicode | Just unicode. |
iso_8859_6 | Arabic |
gb | Chinese Simplified (GB) |
big5 | Chinese Traditional (Big5) |
koi8_r | Cyrillic |
iso_8859_5 | Cyrillic |
koi8_u | Cyrillic (Ukrainian) |
dos_437 | DOS codepage 437 (US English) |
dos_850 | DOS codepage 850 (Latin 1) |
dos_852 | DOS codepage 852 (Central European) |
dos_866 | DOS codepage 866 (Cyrillic) |
iso_8859_7 | Greek |
iso_8859_8 | Hebrew |
iscii_de | ISCII Devanagari |
euc_jp | Japanese (EUC) |
shift_jis | Japanese (Shift-JIS) |
korean | Korean (Unified Hangul Code - i.e. a superset of EUC-KR) |
iso_8859_1 | Latin1 (western languages) |
iso_8859_15 | Latin15 (revised western) |
iso_8859_2 | Latin2 (central and eastern European languages) |
iso_8859_3 | Latin3 |
iso_8859_4 | Latin4 |
iso_8859_9 | Turkish |
windows_1250 | Windows codepage 1250 (WinLatin2) |
windows_1251 | Windows codepage 1251 (WinCyrillic) |
windows_1252 | Windows codepage 1252 (WinLatin1) |
windows_1253 | Windows codepage 1253 (WinGreek) |
windows_1254 | Windows codepage 1254 (WinTurkish) |
windows_1255 | Windows codepage 1255 (WinHebrew) |
windows_1256 | Windows codepage 1256 (WinArabic) |
windows_1257 | Windows codepage 1257 (WinBaltic) |
windows_1258 | Windows codepage 1258 (Vietnamese) |
windows_874 | Windows codepage 874 (Thai) |
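
The short Python sketch below shows why the choice matters: the same filename bytes decode to different names under different encodings. The `CODECS` mapping from a few option values to Python codec names, and the `decode_filename` helper, are assumptions made for this example. The plugin relies on its own conversion tables.

```python
# Assumed mapping from a few option values to Python codec names (hypothetical).
CODECS = {
    "ascii": "ascii",
    "utf8": "utf-8",
    "iso_8859_1": "iso-8859-1",
    "koi8_r": "koi8_r",
    "windows_1251": "cp1251",
    "dos_866": "cp866",
}


def decode_filename(raw, filename_encoding):
    """Decode raw filename bytes, raising an error when the bytes do not fit."""
    return raw.decode(CODECS[filename_encoding], errors="strict")


raw = "файл.txt".encode("koi8_r")             # bytes as they might be stored on disk
good = decode_filename(raw, "koi8_r")         # the intended name
wrong = decode_filename(raw, "windows_1251")  # decodes without error, but to other letters
```

Decoding `raw` as `ascii` raises `UnicodeDecodeError`, which is the failure the ascii row warns about. The silent mismatch in `wrong` is harder to spot, and is the case the auto-detection values try to avoid.
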
Value | Description |
---|---|
url | Use url encoding in renaming imported files and associated files. |
base64 | Use base64 encoding in renaming imported files and associated files. |
none | Don't rename imported files and associated files. |
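
As a sketch of what the three methods do to a name, the Python function below encodes the whole filename. The `rename` helper is hypothetical. Whether the software encodes the extension as well, and which base64 alphabet it uses, are not stated in the table, so both are assumptions here.

```python
import base64
from urllib.parse import quote


def rename(filename, method):
    """Return an illustrative storage name for `filename` under the given method."""
    if method == "none":
        return filename
    if method == "url":
        # Percent-encode every character except letters, digits and _ . - ~
        return quote(filename, safe="")
    if method == "base64":
        # Assumption: the URL-safe alphabet, so the result never contains '/'.
        return base64.urlsafe_b64encode(filename.encode("utf-8")).decode("ascii")
    raise ValueError(f"unknown file_rename_method: {method}")


url_name = rename("my file.html", "url")
b64_name = rename("my file.html", "base64")
```

In this sketch `url_name` is `my%20file.html`, which stays readable, while `b64_name` is opaque but limited to a small, filesystem-friendly alphabet.
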