Plugins are written in the Perl language. They all derive from a basic plugin called BasePlugin, which performs universally-required operations like creating a new Greenstone archive document to work with, assigning an object identifier (OID), and handling the sections in a document. Plugins are kept in the perllib/plugins directory.
An outline of program flow when using import.pl, for developers writing their own plugins:
import.pl calls the methods begin, read, then end.
- This starts at the import directory.
- RecPlugin handles directories, and will look through a directory to see what files are there.
The metadata_read method only gets called from RecPlugin (and MetadataCSVPlugin).
All plugins inherit from BasePlugin.
- BasePlugin implements the metadata_read and read methods.
- BasePlugin's read method calls the process method.
Most plugins call the BasePlugin read method, then do the format-specific work using their own process method.
- Some plugins override read.
Plugins can implement either read or process (or both).
NOTE on methods: order called
- metadata_read: called first, usually by RecPlugin but also by MetadataCSVPlugin
- in RecPlugin, Greenstone metadata.xml files are read by metadata_read
- in MetadataCSVPlugin, a .csv text file with the first line containing field names is read by metadata_read
- read: called after metadata_read
- process: called last (from within read)
add_utf8_metadata adds metadata that is already in UTF-8.
add_metadata converts to UTF-8 before adding metadata that is not already in UTF-8.
- It's best to put modified plugins into collect/colname/perllib/plugins, so any other collections can still use the standard ones.
- A collection-specific plugin has to have the same name as an existing plugin if you are overriding the system-wide version of the plugin.
- The collection-specific one is used instead of the system-wide one.
- The collection-specific plugin appears in the GLI when you have that collection loaded.
Thanks to Wendy Osborn for most of this text.
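The begin/read/process/end flow outlined above can be modelled in a few lines. The sketch below is written in Python purely for illustration; Greenstone plugins themselves are Perl, and the class names `ToyBasePlugin` and `ToyTextPlugin`, the `accepts` helper, and the document dictionary are all hypothetical stand-ins rather than Greenstone's real API. It only shows the shape of the design: the base class owns the generic `read` step and delegates the format-specific work to `process`.

```python
import itertools


class ToyBasePlugin:
    """Hypothetical stand-in for a shared base plugin (not Greenstone's API)."""

    _oid_counter = itertools.count(1)  # shared source of toy identifiers

    def __init__(self):
        self.calls = []  # records the order in which methods run

    def accepts(self, filename):
        # Hypothetical helper: subclasses decide which files they handle.
        return False

    def begin(self):
        self.calls.append("begin")

    def end(self):
        self.calls.append("end")

    def read(self, filename, metadata):
        # Generic work shared by every plugin: create a document,
        # assign an identifier, then delegate to process().
        self.calls.append("read")
        if not self.accepts(filename):
            return None  # not handled by this plugin
        doc = {
            "oid": f"D{next(self._oid_counter)}",
            "source": filename,
            "metadata": dict(metadata),
            "sections": [],
        }
        self.process(filename, doc)
        return doc

    def process(self, filename, doc):
        raise NotImplementedError("subclasses supply the format-specific step")


class ToyTextPlugin(ToyBasePlugin):
    """Overrides only process(); read() is inherited from the base class."""

    def accepts(self, filename):
        return filename.endswith(".txt")

    def process(self, filename, doc):
        self.calls.append("process")
        doc["sections"].append(f"text of {filename}")


def run_import(plugin, filenames):
    # Mirrors the outline: begin, then read for each file, then end.
    plugin.begin()
    docs = []
    for name in filenames:
        doc = plugin.read(name, {})
        if doc is not None:
            docs.append(doc)
    plugin.end()
    return docs
```

A plugin that needs different generic behaviour would override `read` instead of, or as well as, `process`.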
If you select a plugin and press Configure Plugin…, you will see the configuration options available for the plugin. You might notice that the options are split into sections. The options at the very top are specific to the plugin; the remaining options are inherited from other plugins.
If you are creating your own plugin, you can choose to have it inherit from other, similar plugins (which, in turn, likely inherit from additional plugins). Top-level plugins (including those that you select to process documents) all inherit from other plugins.
Document processing plugins
Document processing plugins are used by the collection-building software to parse each source document in a way that depends on its format. A collection's configuration file lists all plugins that are used when building it. During the import operation, each file or directory is passed to each plugin in turn until one is found that can process it—thus earlier plugins take priority over later ones. If no plugin can process the file, a warning is printed (to standard error) and processing passes to the next file. (This is where the block_exp option can be useful—to prevent these error messages for files that might be present but don't need processing.) During building, the same procedure is used, but the archives directory is processed instead of the import directory.
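As a rough illustration of the ordering rule, the following Python sketch passes each file to each plugin in turn and stops at the first one that deals with it. It is a model only, not Greenstone code: the `ToySuffixPlugin` class, its `accept_exp` parameter, and the returned status strings are assumptions made for the example. Only the idea of a `block_exp` pattern comes from the text above.

```python
import re


class ToySuffixPlugin:
    """Hypothetical plugin that recognises files by regular expression."""

    def __init__(self, name, accept_exp, block_exp=None):
        self.name = name
        self.accept_exp = re.compile(accept_exp)  # assumed parameter name
        self.block_exp = re.compile(block_exp) if block_exp else None

    def read(self, filename):
        # A blocked file counts as dealt with, so no warning is raised
        # and later plugins never see it.
        if self.block_exp and self.block_exp.search(filename):
            return "blocked"
        if self.accept_exp.search(filename):
            return "processed"
        return None  # not recognised; try the next plugin


def run_pipeline(plugins, filenames):
    results, warnings = {}, []
    for filename in filenames:
        for plugin in plugins:  # earlier plugins take priority
            outcome = plugin.read(filename)
            if outcome is not None:
                results[filename] = (plugin.name, outcome)
                break
        else:
            # No plugin claimed the file: warn and move on to the next file.
            warnings.append(f"no plugin could process {filename}")
    return results, warnings
```

With an HTML plugin listed before a text plugin, a `.html` file that both accept goes to the HTML plugin, and a stylesheet matched by the HTML plugin's `block_exp` is silently skipped instead of producing a warning.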
The standard Greenstone plugins are listed here. Recursion is necessary to traverse directory hierarchies. Although the import and build programs do not perform explicit recursion, some plugins cause indirect recursion by passing files or directory names into the plugin pipeline. For example, the standard way of recursing through a directory hierarchy is to specify RecPlugin, which does exactly this. If present, it should be the last element in the pipeline.
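The indirect recursion described here can be sketched as follows. This is an illustrative Python model, not Greenstone's implementation: the in-memory `FILESYSTEM` table and the functions `offer`, `text_plugin`, and `directory_plugin` are hypothetical. The directory plugin does not walk the tree itself; it feeds each entry back into the same pipeline, which is what produces the recursion.

```python
# Toy file system: keys are directories, values are their entries.
# Any path that is not a key is treated as a plain file.
FILESYSTEM = {
    "import": ["a.txt", "sub"],
    "import/sub": ["b.txt", "notes.bin"],
}


def offer(pipeline, path, processed):
    """Pass one path to each plugin in turn; stop at the first that takes it."""
    for plugin in pipeline:
        if plugin(pipeline, path, processed):
            return True
    return False


def text_plugin(pipeline, path, processed):
    if path.endswith(".txt"):
        processed.append(path)
        return True
    return False


def directory_plugin(pipeline, path, processed):
    if path not in FILESYSTEM:
        return False  # not a directory, so this plugin declines it
    for entry in FILESYSTEM[path]:
        # Indirect recursion: each entry re-enters the plugin pipeline.
        offer(pipeline, f"{path}/{entry}", processed)
    return True
```

Placing `directory_plugin` last in the pipeline means every format-specific plugin gets a chance at a path before it is treated as a directory.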
Some plugins are written for specific collections that have a document format not found elsewhere. These collection-specific plugins are found in the collection's perllib/plugins directory. Collection-specific plugins can be used to override general plugins with the same name.
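One way to picture the override rule is as a two-step lookup: check the collection's perllib/plugins directory first, then fall back to the system-wide one. The Python sketch below assumes that layout and assumes plugins are stored as `.pm` files named after the plugin; the function names `find_plugin` and `demo` are hypothetical, and Greenstone's actual lookup may differ in detail.

```python
import tempfile
from pathlib import Path


def find_plugin(name, collection_dir, system_dir):
    """Return the path of the plugin file to use, or None if not found."""
    search_order = (
        Path(collection_dir) / "perllib" / "plugins",  # collection-specific first
        Path(system_dir) / "perllib" / "plugins",      # then system-wide
    )
    for base in search_order:
        candidate = base / f"{name}.pm"  # assumed file naming
        if candidate.is_file():
            return candidate
    return None


def demo():
    """Build a throwaway directory tree and show which version is chosen."""
    with tempfile.TemporaryDirectory() as root:
        system = Path(root) / "system"
        collection = Path(root) / "collect" / "colname"
        for base in (system, collection):
            (base / "perllib" / "plugins").mkdir(parents=True)
        (system / "perllib" / "plugins" / "HTMLPlugin.pm").write_text("system version")
        (system / "perllib" / "plugins" / "TEXTPlugin.pm").write_text("system version")
        (collection / "perllib" / "plugins" / "HTMLPlugin.pm").write_text("collection version")

        chosen = {}
        for name in ("HTMLPlugin", "TEXTPlugin", "NoSuchPlugin"):
            path = find_plugin(name, collection, system)
            chosen[name] = path.read_text() if path else None
        return chosen
```

In the demo, the collection's copy of HTMLPlugin shadows the system-wide one, while TEXTPlugin is still resolved from the system directory.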
Some document-processing plugins use external programs that parse specific proprietary formats—for example, Microsoft Word—into either plain text, images, or HTML. A general plugin called ConvertToPlugin invokes the appropriate conversion program and passes the result to either TEXTPlugin or HTMLPlugin. We describe this in more detail shortly.
Some plugins have individual options, which control what they do in finer detail than the general options allow. Select a plugin from the list of plugins to view a complete list of all of its available options.
Plugins to import proprietary formats
Proprietary formats pose difficult problems for any digital library system. Although documentation may be available about how they work, they are subject to change without notice, and it is difficult to keep up with changes. Greenstone has adopted the policy of using GPL (GNU General Public License) conversion utilities written by people dedicated to the task. Utilities to convert Word and PDF formats are included in the packages directory. These all convert documents to either text or HTML. Then HTMLPlugin and TEXTPlugin are used to further convert them to the Greenstone archive format. ConvertToPlugin is used to include the conversion utilities. Like BasePlugin, it is never called directly. Rather, plugins written for individual formats are derived from it. ConvertToPlugin uses Perl's dynamic inheritance scheme to inherit from either TEXTPlugin or HTMLPlugin, depending on the format to which a source document has been converted.
When ConvertToPlugin receives a document, it calls gsConvert.pl (found in Greenstone3/gs2build/bin/script, or in GSDLHOME/bin/script) to invoke the appropriate conversion utility. Once the document has been converted, it is returned to ConvertToPlugin, which invokes the text or HTML plugin as appropriate. Any plugin derived from ConvertToPlugin has an option convert_to, whose argument is either text or HTML, to specify which intermediate format is preferred. Text is faster, but HTML generally looks better, and includes pictures.
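The effect of the convert_to option can be sketched by choosing a parent class at run time. The Python below is only an analogy for the dynamic inheritance described above, not Greenstone's Perl implementation: `ToyTextPlugin`, `ToyHTMLPlugin`, `fake_convert`, and `make_convert_plugin` are all hypothetical names, and the "conversion" is a placeholder that does no real parsing.

```python
class ToyTextPlugin:
    def process(self, content):
        return {"format": "text", "body": content}


class ToyHTMLPlugin:
    def process(self, content):
        return {"format": "html", "body": content}


def fake_convert(filename, target):
    # Placeholder for an external conversion utility.
    if target == "html":
        return f"<p>{filename}</p>"
    return filename


def make_convert_plugin(convert_to):
    """Build a plugin whose parent class depends on the convert_to value."""
    bases = {"text": ToyTextPlugin, "html": ToyHTMLPlugin}
    if convert_to not in bases:
        raise ValueError(f"convert_to must be 'text' or 'html', got {convert_to!r}")
    base = bases[convert_to]

    class ToyWordPlugin(base):  # parent chosen at run time
        def read(self, filename):
            converted = fake_convert(filename, convert_to)
            # process() comes from whichever parent was selected.
            return self.process(converted)

    return ToyWordPlugin()
```

Calling `make_convert_plugin("html").read("report.doc")` routes the converted content through the HTML parent, while `"text"` routes it through the text parent.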
Sometimes there are several conversion utilities for a particular format, and gsConvert.pl may try different ones on a given document. For example, the preferred Word conversion utility, wvWare, does not cope with documents older than Word 6, so a program called AnyToHTML, which essentially just extracts whatever text strings can be found, is called to convert Word 5 documents.
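The fallback behaviour can be pictured as trying converters in order of preference until one succeeds. The following Python sketch is a simplified model, not the logic of gsConvert.pl: `strict_word_converter` and `extract_strings` are hypothetical stand-ins for a preferred converter and a crude string extractor, and the `word_version` field is invented for the example.

```python
def convert_with_fallback(document, converters):
    """Try each (name, function) pair in order; return the first success."""
    errors = []
    for name, convert in converters:
        try:
            return name, convert(document)
        except ValueError as exc:
            errors.append(f"{name}: {exc}")  # remember why it failed, then move on
    raise RuntimeError("all converters failed: " + "; ".join(errors))


def strict_word_converter(document):
    # Stand-in for a preferred converter that rejects old file versions.
    if document["word_version"] < 6:
        raise ValueError("unsupported Word version")
    return f"<html>{document['text']}</html>"


def extract_strings(document):
    # Stand-in for a last-resort tool that just pulls out the text.
    return document["text"]


CONVERTERS = [
    ("strict", strict_word_converter),
    ("strings", extract_strings),
]
```

A document with `word_version` 5 fails the first converter and is handled by the second, so the caller still gets usable text, just with less formatting.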
The steps involved in adding a new external document conversion utility are:
- Install the new conversion utility so that it is accessible by Greenstone (put it in the packages directory).
- Alter gsConvert.pl to use the new conversion utility. This involves adding a new clause to the if statement in the main function, and adding a function that calls the conversion utility.
- Write a top-level plugin that inherits from ConvertToPlugin to catch the format and pass it on.
Greenstone incorporates plugins for many different file formats, listed on the Plugins page. But we are always looking for more! If there is a specific plugin you would like us to write on a contractual basis then contact us. Also, we welcome contributions of code to enable us to extend Greenstone. The following is a list of plugins we would like.
- Gnumeric Spreadsheet
- KWord (all KOffice formats)
- OpenOffice file formats:
- Writer (.sxw)
- Calc (.sxc)
- Impress (.sxi)
- Draw (.sxd)
- StarOffice formats (.sdc, .sdw etc)
- QuickTime (.mov)
- AVI (Audio Video Interleave), Microsoft video
- Windows Media Audio (.wma)
- Windows audio (.wav)
- Sun Audio (.au)
- Audio Interchange File Format (.aiff)
- MIDI (.mid)
- MIDI karaoke (.kar)
- CD Audio (.cda)
- Shorten (.shn)
- DjVu (.djvu)
- Photoshop (.psd)
- PaintShopPro (.psp)
- .hqx Mac archive
- Self extracting Archive (.sea)
- Scalable Vector Graphics (.svg)
- Synchronized Multimedia Integration Language SMIL (.smil)
- Macromedia Flash (.fla)
- Macromedia Shockwave (.swf)
- TrueType fonts (TTF)