===== DSpacePlugin =====

//A [[en:user:plugins|plugin]] that takes a collection of documents exported from [[en:filetype:dspace|DSpace]] and imports them into Greenstone.//

  * Processes ''contents'' files \\ Perl regular expression: ''//(?i)(contents)$//''

When using the DSpacePlugin, move it to the top of the Assigned Plugins list in the Document Plugins section of the Design panel (above the GreenstoneXMLPlugin). Note also that the DSpacePlugin does not process the documents themselves; it passes them along to be processed by their respective plugins (which must be in the Assigned Plugins list).

If any of the documents in your exported DSpace collection have alternate forms (for instance, the same document as both a ''.doc'' and a ''.pdf'') in their directory folder, the DSpacePlugin will, by default, treat these as separate, individual documents. They will appear separately in all browsing and search lists, and will have their own document pages. If you would like to fuse all forms together, treating all alternative forms as a single Greenstone document, you must use one of the following configuration options to indicate which is the //primary// form of the document (Greenstone will handle all other forms as associated files):

  * **only_first_doc**: If this option is checked, the first document referenced in the Dublin Core XML file, regardless of file type, will be treated as the primary document (all others will be associated files).
  * **first_inorder_ext**: This option allows you to choose the primary document based on file extension. Enter a comma-separated list of extensions (e.g. ''doc,pdf,rtf,txt''). The first extension in the list that matches one of the document's files determines the primary document (all others will be associated files).
  * **first_inorder_mime**: This option allows you to choose the primary document based on MIME type. Enter a comma-separated list of MIME types (e.g. ''video/mpeg,video/quicktime,application/x-shockwave-flash''). The first MIME type in the list that matches one of the document's files determines the primary document (all others will be associated files).

With the latter two options, it is best to be exhaustive, providing an order of precedence for all formats included in your collection. If a document directory contains alternative formats and none of them appears in the option list, all the alternative forms for that document will be treated as separate Greenstone documents.

For example, assume your collection includes documents in ''.pdf'', ''.doc'', ''.rtf'', and ''.txt'' formats. You check the **first_inorder_ext** option and type ''pdf,doc''. For any documents with a PDF version, the PDF will be the primary document. For any documents that have a ''.doc'' version and **not** a PDF, the ''.doc'' will be the primary document. However, for any documents that have neither a PDF nor a Word version, all versions of the document will be treated as separate documents. So, for any documents that have both ''.rtf'' and ''.txt'' formats (and no ''.doc'' or ''.pdf''), each of these alternative formats will be treated separately.

The following table lists all of the configuration options available for DSpacePlugin.

^Option^Description^Value^
^//DSpacePlugin Options//^^^
| **process_exp** | A Perl regular expression to match against filenames. Matching filenames will be processed by this plugin. For example, using '(?i).html?\$' matches all documents ending in .htm or .html (case-insensitive). | //Default: (?i)(contents)$// |
| **only_first_doc** | Identifies the primary document file for a DSpace collection document. With this option, the system will treat the first document referenced in the dublin_core metadata file as the primary document file. | |
| **first_inorder_ext** | Identifies the primary document file for a DSpace collection document. With this option, the system will check the listed file extensions in order to find the primary document file. | |
| **first_inorder_mime** | Identifies the primary document file for a DSpace collection document. With this option, the system will check the listed MIME types in order to find the primary document file. | |
| **block_exp** | Files matching this regular expression will be blocked from being passed to any later plugins in the list. | //Default: %%(?i)(handle|\.tx?t)$%%// |
^//Options Inherited from [[ReadTextFile]]//^^^
| **input_encoding** | The encoding of the source documents. Documents will be converted from these encodings and stored internally as utf8. | //Default: auto//\\ [[ReadTextFile#input_encoding option values|List]] |
| **default_encoding** | Use this encoding if -input_encoding is set to 'auto' and the text categorization algorithm fails to extract the encoding or extracts an encoding unsupported by Greenstone. This option can take the same values as -input_encoding. | //Default: utf8//\\ [[ReadTextFile#default_encoding option values|List]] |
| **extract_language** | Identify the language of each document and set 'Language' metadata. Note that this will be done automatically if -input_encoding is 'auto'. | |
| **default_language** | If Greenstone fails to work out what language a document is in, the 'Language' metadata element will be set to this value. The default is 'en' (ISO 639 language symbols are used: en = English). Note that if -input_encoding is not set to 'auto' and -extract_language is not set, all documents will have their 'Language' metadata set to this value. | //Default: en// |
^//Options Inherited from [[AutoExtractMetadata]]//^^^
| **first** | Comma-separated list of numbers of characters to extract from the start of the text into a set of metadata fields called 'FirstN', where N is the size. For example, the values "3,5,7" will extract the first 3, 5 and 7 characters into metadata fields called "First3", "First5" and "First7". | |
^//Options Inherited from [[AcronymExtractor]]//^^^
| **extract_acronyms** | Extract acronyms from within text and set as metadata. | |
| **markup_acronyms** | Add acronym metadata into document text. | |
^//Options Inherited from [[KeyphraseExtractor]]//^^^
| **extract_keyphrases** | Extract keyphrases automatically with Kea (default settings). | |
| **extract_keyphrases_kea4** | Extract keyphrases automatically with Kea 4.0 (default settings). Kea 4.0 is a new version of Kea that has been developed for controlled indexing of documents in the domain of agriculture. | |
| **extract_keyphrase_options** | Options for keyphrase extraction with Kea. For example: mALIWEB - use ALIWEB extraction model; n5 - extract 5 keyphrases; eGBK - use GBK encoding. | |
^//Options Inherited from [[EmailAddressExtractor]]//^^^
| **extract_email** | Extract email addresses as metadata. | |
^//Options Inherited from [[DateExtractor]]//^^^
| **extract_historical_years** | Extract time-period information from historical documents. This is stored as metadata with the document. There is a search interface for this metadata, which you can include in your collection by adding the statement "format QueryInterface DateSearch" to your collection configuration file. | |
| **maximum_year** | The maximum historical date to be used as metadata (in a Common Era date, such as 1950). | //Default: 2013// |
| **maximum_century** | The maximum named century to be extracted as historical metadata (e.g. 14 will extract all references up to the 14th century). | //Default: -1// |
| **no_bibliography** | Do not try to block bibliographic dates when extracting historical dates. | |
^//Options Inherited from [[GISExtractor]]//^^^
| **extract_placenames** | Extract placenames from within text and set as metadata. Requires GIS extension to Greenstone. | |
| **gazetteer** | Gazetteer to use to extract placenames from within text and set as metadata. Requires GIS extension to Greenstone. | |
| **place_list** | When extracting placenames, include a list of placenames at the start of the document. Requires GIS extension to Greenstone. | |
^//Options Inherited from [[BasePlugin]]//^^^
| **process_exp** | A Perl regular expression to match against filenames. Matching filenames will be processed by this plugin. For example, using '(?i).html?\$' matches all documents ending in .htm or .html (case-insensitive). | |
| **no_blocking** | Don't do any file blocking. Any associated files (e.g. images in a web page) will be added to the collection as documents in their own right. | |
| **block_exp** | Files matching this regular expression will be blocked from being passed to any later plugins in the list. | |
| **store_original_file** | Save the original source document as an associated file. Note this is already done for files like PDF, Word etc. This option is only useful for plugins that don't already store a copy of the original file. | |
| **associate_ext** | Causes files with the same root filename as the document being processed by the plugin AND a filename extension from the comma-separated list provided by this argument to be associated with the document being processed rather than handled as a separate list. | |
| **associate_tail_re** | A regular expression to match filenames against to find associated files. Used as a more powerful alternative to associate_ext. | |
| **OIDtype** | The method to use when generating unique identifiers for each document. | //Default: auto//\\ [[BasePlugin#OIDtype option values|List]] |
| **OIDmetadata** | Specifies the metadata element that holds the document's unique identifier, for use with -OIDtype=assigned. | //Default: dc.Identifier// |
| **no_cover_image** | Do not look for a prefix.jpg file (where prefix is the same prefix as the file being processed) to associate as a cover image. | |
| **filename_encoding** | The encoding of the source file filenames. | //Default: auto//\\ [[BasePlugin#filename_encoding option values|List]] |
| **file_rename_method** | The method to be used in renaming the copy of the imported file and associated files. | //Default: url//\\ [[BasePlugin#file_rename_method option values|List]] |
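
The precedence rule described for **first_inorder_ext** can be illustrated with a small Python sketch. This is not the plugin's own (Perl) implementation. The function name ''choose_primary'' and its return shape are hypothetical, and the case-insensitive extension comparison is an assumption made for the illustration.

```python
import os


def choose_primary(filenames, ext_order):
    """Pick a primary file by extension precedence (illustrative only).

    filenames: the files found in one exported document directory.
    ext_order: comma-separated extensions, e.g. "pdf,doc".

    Returns (primary, associated). primary is None when no file matches
    any listed extension; in that case every file would remain a
    separate document, as described for first_inorder_ext.
    """
    # Normalise the precedence list: "pdf, .DOC" -> ["pdf", "doc"]
    wanted = [e.strip().lower().lstrip(".")
              for e in ext_order.split(",") if e.strip()]

    # Walk the extensions in order of precedence; the first one that
    # matches a file in the directory decides the primary document.
    for ext in wanted:
        for name in filenames:
            file_ext = os.path.splitext(name)[1].lower().lstrip(".")
            if file_ext == ext:
                associated = [f for f in filenames if f != name]
                return name, associated

    # Nothing matched: no primary document, nothing is associated.
    return None, []


if __name__ == "__main__":
    print(choose_primary(["report.doc", "report.pdf", "report.txt"], "pdf,doc"))
    print(choose_primary(["notes.rtf", "notes.txt"], "pdf,doc"))
```

With the list ''pdf,doc'', a directory holding ''.doc'', ''.pdf'' and ''.txt'' files yields the PDF as primary and the other two as associated files, while a directory holding only ''.rtf'' and ''.txt'' files yields no primary document. This mirrors the worked example for **first_inorder_ext**.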