===== PDFPlugin =====
//Plugin that processes [[en:filetype:pdf|PDF]] documents.//
* Processes files with extensions: ''.pdf'' \\ Perl regular expression: ''//(?i)\.pdf$//''
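
As a rough illustration of what the default pattern above accepts, here is a minimal Python sketch. It is not the plugin's own Perl code; the helper name ''would_process'' and the sample filenames are hypothetical, and the sketch assumes the pattern is searched against the filename in the way a Perl match would be.

```python
import re

# Python equivalent of the Perl pattern (?i)\.pdf$ listed above.
# (?i) makes the match case-insensitive; $ anchors it to the end of the name.
PROCESS_EXP = re.compile(r"(?i)\.pdf$")


def would_process(filename: str) -> bool:
    """Hypothetical helper: True if the filename matches the default pattern."""
    return PROCESS_EXP.search(filename) is not None


if __name__ == "__main__":
    # Sample filenames are made up for illustration.
    for name in ["report.pdf", "SCAN.PDF", "notes.pdf.bak", "index.html"]:
        print(name, would_process(name))
```

Because the pattern is anchored with ''$'', a name such as ''notes.pdf.bak'' does not match even though it contains ''.pdf''.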
The following table lists all of the configuration options available for PDFPlugin.
^Option^Description^Value^
^//PDFPlugin Options//^^^
| **convert_to** |(REQUIRED) Plugin converts to TEXT or HTML or various types of Image (e.g. JPEG, GIF, PNG). |//Default: html//\\ [[PDFPlugin#convert_to option values|List]] |
| **process_exp** |A Perl regular expression to match against filenames. Matching filenames will be processed by this plugin. For example, '(?i)\.html?$' matches all documents ending in .htm or .html (case-insensitive). |//Default: (?i)\.pdf$// |
| **block_exp** |Files matching this regular expression will be blocked from being passed to any later plugins in the list. | |
| **metadata_fields** |Comma separated list of metadata fields to attempt to extract. Capitalise each field as you want the metadata capitalised in Greenstone, since the tag extraction is case insensitive, e.g. Title,Date. Use 'tag<tagname>' to have the contents of the first <tag> pair put in a metadata element called 'tagname', e.g. Title,Date,Author<Creator>. |//Default: Title,Author,Subject,Keywords// |
| **metadata_field_separator** |Separator character used in multi-valued metadata. Will split a metadata field value on this character, and add each item as individual metadata. | |
| **noimages** |Don't attempt to extract images from PDF. | |
| **allowimagesonly** |Allow PDF files with no extractable text. Avoids the need to have -complex set. Only useful with convert_to html. | |
| **complex** |Create more complex output. With this option set, the output HTML will look much more like the original PDF file. For this to function properly you need Ghostscript installed (on *nix, gs should be on your path; on Windows, gswin32c.exe must be on your path). | |
| **nohidden** |Prevent pdftohtml from attempting to extract hidden text. This is only useful if the -complex option is also set. | |
| **zoom** |The factor by which to zoom the PDF for output (this is only useful if -complex is set). |//Default: 2// //Range: 1,3// |
| **use_sections** |Create a separate section for each page of the PDF file. | |
| **description_tags** |Split document into sub-sections where <Section> tags occur. '-keep_head' will have no effect when this option is set. | |
^//Options Inherited from [[AutoLoadConverters]]//^^^
| **pdfbox_conversion** |Use PDFBox to convert the PDF files. **This option is only available if you have downloaded the [[http://trac.greenstone.org/browser/gs2-extensions/pdf-box/trunk/|PDFBox extension]].** | |
^//Options Inherited from [[ConvertBinaryFile]]//^^^
| **convert_to** |(REQUIRED) Plugin converts to TEXT or HTML or various types of Image (e.g. JPEG, GIF, PNG). |//Default: auto//\\ [[ConvertBinaryFile#convert_to option values|List]] |
| **keep_original_filename** |Keep the original filename for the associated file, rather than converting to doc.pdf, doc.doc etc. | |
| **title_sub** |Substitution expression to modify string stored as Title. Used by, for example, PDFPlugin to remove "Page 1", etc from text used as the title. | |
| **apply_fribidi** |Run the "fribidi" Unicode Bidirectional Algorithm program over the converted file (for right-to-left text). | |
| **use_strings** |If set, a simple strings function will be called to extract text if the conversion utility fails. | |
^//Options Inherited from [[AutoExtractMetadata]]//^^^
| **first** |Comma separated list of numbers of characters to extract from the start of the text into a set of metadata fields called 'FirstN', where N is the size. For example, the values "3,5,7" will extract the first 3, 5 and 7 characters into metadata fields called "First3", "First5" and "First7". | |
^//Options Inherited from [[AcronymExtractor]]//^^^
| **extract_acronyms** |Extract acronyms from within text and set as metadata. | |
| **markup_acronyms** |Add acronym metadata into document text. | |
^//Options Inherited from [[KeyphraseExtractor]]//^^^
| **extract_keyphrases** |Extract keyphrases automatically with Kea (default settings). | |
| **extract_keyphrases_kea4** |Extract keyphrases automatically with Kea 4.0 (default settings). Kea 4.0 is a new version of Kea that has been developed for controlled indexing of documents in the domain of agriculture. | |
| **extract_keyphrase_options** |Options for keyphrase extraction with Kea. For example: mALIWEB - use ALIWEB extraction model; n5 - extract 5 keyphrases; eGBK - use GBK encoding. | |
^//Options Inherited from [[EmailAddressExtractor]]//^^^
| **extract_email** |Extract email addresses as metadata. | |
^//Options Inherited from [[DateExtractor]]//^^^
| **extract_historical_years** |Extract time-period information from historical documents. This is stored as metadata with the document. There is a search interface for this metadata, which you can include in your collection by adding the statement, "format QueryInterface DateSearch" to your collection configuration file. | |
| **maximum_year** |The maximum historical date to be used as metadata (in a Common Era date, such as 1950). |//Default: 2013// |
| **maximum_century** |The maximum named century to be extracted as historical metadata (e.g. 14 will extract all references up to the 14th century). |//Default: -1// |
| **no_bibliography** |Do not try to block bibliographic dates when extracting historical dates. | |
^//Options Inherited from [[GISExtractor]]//^^^
| **extract_placenames** |Extract placenames from within text and set as metadata. Requires GIS extension to Greenstone. | |
| **gazetteer** |Gazetteer to use to extract placenames from within text and set as metadata. Requires GIS extension to Greenstone. | |
| **place_list** |When extracting placenames, include a list of placenames at the start of the document. Requires GIS extension to Greenstone. | |
^//Options Inherited from [[BasePlugin]]//^^^
| **process_exp** |A Perl regular expression to match against filenames. Matching filenames will be processed by this plugin. For example, '(?i)\.html?$' matches all documents ending in .htm or .html (case-insensitive). | |
| **no_blocking** |Don't do any file blocking. Any associated files (e.g. images in a web page) will be added to the collection as documents in their own right. | |
| **block_exp** |Files matching this regular expression will be blocked from being passed to any later plugins in the list. | |
| **store_original_file** |Save the original source document as an associated file. Note this is already done for files like PDF, Word etc. This option is only useful for plugins that don't already store a copy of the original file. | |
| **associate_ext** |Files that have the same root filename as the document being processed by the plugin AND a filename extension from the comma separated list given in this argument are associated with that document rather than handled as separate documents. | |
| **associate_tail_re** |A regular expression to match filenames against to find associated files. Used as a more powerful alternative to associate_ext. | |
| **OIDtype** |The method to use when generating unique identifiers for each document. |//Default: auto//\\ [[BasePlugin#OIDtype option values|List]] |
| **OIDmetadata** |Specifies the metadata element that holds the document's unique identifier, for use with -OIDtype=assigned. |//Default: dc.Identifier// |
| **no_cover_image** |Do not look for a prefix.jpg file (where prefix is the same prefix as the file being processed) to associate as a cover image. | |
| **filename_encoding** |The encoding of the source file filenames. |//Default: auto//\\ [[BasePlugin#filename_encoding option values|List]] |
| **file_rename_method** |The method to be used in renaming the copy of the imported file and associated files. |//Default: url//\\ [[BasePlugin#file_rename_method option values|List]] |
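
The descriptions of **first** and **metadata_field_separator** above can be sketched in a few lines of Python. This is only an illustration of the behaviour as described in the table, not Greenstone's Perl implementation. The function names are hypothetical, and trimming whitespace around split items is an assumption that the table does not state.

```python
def first_n_metadata(text: str, sizes: str) -> dict[str, str]:
    """Sketch of -first: "3,5,7" yields fields First3, First5 and First7."""
    fields = {}
    for size in sizes.split(","):
        n = int(size.strip())
        # Plain character slice from the start of the text (assumption:
        # no extra cleanup of the text is modelled here).
        fields[f"First{n}"] = text[:n]
    return fields


def split_metadata_value(value: str, separator: str) -> list[str]:
    """Sketch of -metadata_field_separator: one value becomes several items."""
    # Assumption: surrounding whitespace is trimmed and empty items dropped.
    return [item.strip() for item in value.split(separator) if item.strip()]


if __name__ == "__main__":
    print(first_n_metadata("Greenstone", "3,5,7"))
    print(split_metadata_value("digital libraries; PDF; metadata", ";"))
```

With the sample input ''Greenstone'' and the value ''3,5,7'', the sketch produces three fields named First3, First5 and First7, matching the naming scheme given for **first**.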
==== convert_to option values ====
^Value^Description^
|auto|Automatically select the format converted to. Format chosen depends on input document type, for example Word will automatically be converted to HTML, whereas PowerPoint will be converted to Greenstone's PagedImage format.|
|html|HTML format.|
|text|Plain text format.|
|pagedimg_jpg|A series of images in JPEG format.|
|pagedimg_gif|A series of images in GIF format.|
|pagedimg_png|A series of images in PNG format.|