Does Greenstone have a plugin for my data format?
What metadata is available for each plugin?
I'm having problems with my PDF files! What's wrong?
How do I use UnknownPlugin to handle my new format?
How do I get my CDS/ISIS database into Greenstone?
How do I get my PMB database into Greenstone?
How do I get my XML files into Greenstone?
- XSL
- New Plugin
How do I use DatabasePlugin?
How do I build collections of documents with images and OCR text (PagedImagePlugin)?
How do I use CONTENTdmPlugin?
How do I build a collection from a MediaWiki website?

This page is in the 'old' namespace, and was imported from our previous wiki. We recommend checking for more up-to-date information using the search box.

More about Plugins

Does Greenstone have a plugin for my data format?

See this page.

What metadata is available for each plugin?

(This applies to Greenstone 2.80 and earlier. Needs updating for 2.81.)

"Default" means that the metadata fields will be automatically assigned (or extracted if possible), while "Available fields" lists other items of metadata that the plugin may be able to assign based on any arguments given to that plugin in the collect.cfg file. All plugins are derived from BasPlug, and have the following metadata fields:

Plugin name	Default fields	Available fields
BasPlug	Language, Encoding, Source	FirstNNNN, Keyphrases, Acronym

In addition, many plugins have additional fields available:

Plugin name	Default fields	Available fields
BibTexPlug	Title, Creator, Abstract, Author, Booktitle, Chapter, Copyright, Date, Edition, Editor, EntryType, Journal, Keywords, Month, Note, Number, Pages, Publisher, PublisherAddress, Volume, Year
DBPlug		(arbitrary metadata field names based on Database configuration file)
EMAILPlug	Date, DateText, From, FromAddr, FromName, Headers, Subject, Title (based on subject, from, and date), To
ExcelPlug		(all fields as in HTMLPlug)
HTMLPlug	Title, URL	Author, Creator, Email (others as found in the `-metadata_fields` option)
ImagePlug	Image, ImageHeight, ImageSize, ImageType, ImageWidth, ScreenHeight, screenicon, ScreenSize, ScreenType, ScreenWidth, Source, srclink, srcicon, Thumb, ThumbHeight, ThumbType, ThumbWidth
IndexPlug	as in the index.txt file	(use `metadata.xml` files instead of using this plugin)
MARCPlug	Creator, Description, MarcIdentifier, MarcSource, URL, Publisher, Relation, Rights, Subject, Title, Type	(Metadata fields as in the `marctodc.txt` file)
OAIPlug	URL, (all metadata in .oai markup file)
PagedImgPlug	Image, ImageHeight, ImageSize, ImageType, ImageWidth, ScreenHeight, screenicon, ScreenSize, ScreenType, ScreenWidth, Source, srclink, srcicon, Thumb, ThumbHeight, ThumbType, ThumbWidth
PDFPlug		(all fields in HTMLPlug)
PPTPlug		(all fields in HTMLPlug)
PSPlug	Title	Date, Pages, (all fields in TextPlug)
ReferPlug	Abstract, BookConfOnly, Booktitle, Copyright, Creator, Date, Editor, Keywords, Journal, JournalsOnly, Number, Pages, Publisher, Publisheraddr, Report, Title, Volume
RTFPlug		(all fields in HTMLPlug)
SRCPlug	Title, filename, includes, class, classdecl
TEXTPlug	Title
UnknownPlug	(as given in the `-assoc_field` plugin argument)
WordPlug		(all fields in HTMLPlug)

See section two of the Developer's Guide for information about options to plugins, or run the pluginfo.pl command on the plugin name after setting up your environment for Greenstone. (For example, "perl -S pluginfo.pl BasPlug".)

In addition, every document can be manually assigned arbitrary metadata fields and values through use of metadata.xml files, as discussed in the manual.
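As a rough illustration of the metadata.xml approach, the sketch below writes a minimal file with Python's standard library. The element names follow the DirectoryMetadata layout used by Greenstone metadata.xml files; the file name pattern (assumed here to be treated as a regular expression) and the metadata values are made up for the example, so check the manual before relying on the exact structure.

```python
import xml.etree.ElementTree as ET

def build_metadata_xml(entries):
    """Build a DirectoryMetadata tree from {filename_pattern: {field: value}}."""
    root = ET.Element("DirectoryMetadata")
    for pattern, fields in entries.items():
        fileset = ET.SubElement(root, "FileSet")
        ET.SubElement(fileset, "FileName").text = pattern
        description = ET.SubElement(fileset, "Description")
        for name, value in fields.items():
            meta = ET.SubElement(description, "Metadata", name=name)
            meta.text = value
    return root

# Hypothetical file name and values, for illustration only.
root = build_metadata_xml({
    r"DCP_0163\.MOV": {"Title": "Harbour at dusk", "Subject": "Harbours"},
})
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The printed text would be saved as metadata.xml in the same import folder as the files it describes.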

I'm having problems with my PDF files! What's wrong?

The standard PDF Plugin can process PDF versions up to 1.4. To process later versions, you'll need to download the PDFBox extension. See 2.85 Release Notes.
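To find out which of your files are affected, you can look at the version declared in each file's header. The following is a minimal sketch, not part of Greenstone; the function names are hypothetical, and it only inspects the `%PDF-x.y` header line.

```python
import re

def pdf_version(header_bytes):
    """Return the version from a PDF header such as b'%PDF-1.4', or None."""
    match = re.search(rb"%PDF-(\d+)\.(\d+)", header_bytes[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def needs_pdfbox(header_bytes):
    """True when the declared version is later than 1.4."""
    version = pdf_version(header_bytes)
    return version is not None and version > (1, 4)

print(needs_pdfbox(b"%PDF-1.4\n"))  # False
print(needs_pdfbox(b"%PDF-1.6\n"))  # True
```

For a real file, pass in the first kilobyte read in binary mode. Note that a PDF can also declare a later version inside the document itself, so treat the header as a first hint rather than a definitive answer.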

Security settings can prevent Greenstone from processing the files. To check them in Acrobat Reader, open the File menu, choose Document Properties, and select the Security tab.

PDF is a "page description language". This means that the document contains objects and commands such as "draw this text here" and "draw this image here".

Greenstone uses an external program called "pdftohtml" to extract text out of PDF files. Sometimes, there is no text that can be extracted. This often depends on how the PDF was created.

Adobe Acrobat Writer can be used to create PDFs from paper documents that are scanned in by a scanner. In this case, the PDF file contains images of text, rather than computer-readable text. Therefore, pdftohtml cannot find any text to extract.
Some programs (such as older versions of GNU Ghostscript, which is used by ps2pdf on Unix computers) create "bitmap fonts", which means that every character in the document is really an image rather than a computer-readable letter. The LaTeX typesetting program sometimes does this when the "Computer Modern Roman" font is used.
Certain characters and character combinations may be extracted incorrectly, depending on the program that generated the PDF file. For example, "ligatures" such as "fi", "fl", "ff" and "ffl" are often rendered using a special glyph rather than as individual characters, and this information may be lost in the textual representation. Also, some PDF generating programs may not correctly encode accented characters. For example, to draw a lowercase "u" with an umlaut accent, LaTeX draws a "u" and then draws an umlaut accent over it. This means that pdftohtml will extract two separate characters (¨ and 'u') rather than a single accented character (ü).
PDF contains pieces of text, and coordinates for where that text should be displayed. This means that pdftohtml may incorrectly guess the order that the text fragments are supposed to occur in. For example, for text that is in two or more columns, the text may be extracted as the first sentence of each column, then the second sentence of each column, and so on. The extracted text is then still usable for indexing purposes, but should not be displayed. Instead, add a format statement to the collect.cfg file that links to the original PDF file but not to the extracted text, such as:

format SearchVList "<td valign=top>[srclink][srcicon][/srclink]</td><td>[srclink][Title][/srclink]</td>"

Because of the way that images are embedded in PDF files, pdftohtml occasionally extracts an image upside-down, or mirrored. This appears to be a bug in the program.
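If you post-process extracted text yourself, the ligature and accent problems described above can sometimes be repaired with Unicode normalisation. The sketch below is an assumption-laden illustration, not something Greenstone or pdftohtml does: it assumes ligatures arrive as the Unicode ligature glyphs and that a detached accent appears immediately before its letter, as in the umlaut example.

```python
import re
import unicodedata

# Spacing accents that an extractor may emit before the base letter,
# mapped to the combining mark that should follow the letter instead.
SPACING_TO_COMBINING = {
    "\u00a8": "\u0308",  # diaeresis (umlaut)
    "\u00b4": "\u0301",  # acute accent
}

def repair_extracted_text(text):
    # Move a detached accent behind the letter it was drawn over.
    def reorder(match):
        return match.group(2) + SPACING_TO_COMBINING[match.group(1)]
    text = re.sub("([\u00a8\u00b4])([A-Za-z])", reorder, text)
    # Compose letter + combining mark into a single accented character.
    text = unicodedata.normalize("NFC", text)
    # Expand the Latin ligature glyphs U+FB00..U+FB06 into plain letters.
    return re.sub(
        "[\ufb00-\ufb06]",
        lambda m: unicodedata.normalize("NFKC", m.group(0)),
        text,
    )

print(repair_extracted_text("M\u00a8unchen"))
print(repair_extracted_text("\ufb01nal draft"))
```

This cannot recover information that was never in the file, such as text that exists only as scanned images or bitmap fonts.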

How do I use UnknownPlugin to handle my new format?

UnknownPlugin is a simple plugin for importing files in formats that Greenstone doesn't know anything about. A dummy document will be created for every such file, and the file itself will be passed to Greenstone as the "associated file" of the document.

Here's an example where it is useful: A collection has pictures and includes a couple of quicktime movie files with names like DCP_0163.MOV. Rather than write a new plugin for quicktime movies, add this line to the collection configuration file:

plugin UnknownPlugin -process_extension "MOV" -assoc_field "movie"

A document is created for each movie, with the associated movie file's name in the "movie" metadata field. In the collection's format strings, use the {If} macro to output different text for each type of file, like this:

{If}{[movie],<HTML for displaying movie>} {If}{[Image],<HTML for displaying image>}

You can also add extra metadata, such as the Title, Subject, and Duration, using the Librarian Interface (or with metadata.xml files).

The -process_extension option tells UnknownPlugin which file extension it should look for. Alternatively, you can use the -process_exp option which specifies a regular expression to match against entire filenames. You can have several UnknownPlugins specified for a collection, each processing a different kind of file.

The -assoc_field option is the name of the metadata field that will hold the associated file's name. This can be used to test for these files. You can also specify the mime type of the files to be processed using the -mime_type option. To display the original file, use [srclink][/srclink] metadata.
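The difference between matching on an extension and matching on a whole filename can be sketched as follows. This uses Python's re module as a stand-in for Perl regular expressions; the helper functions are hypothetical, and details such as case sensitivity are assumptions rather than a description of what UnknownPlugin does internally.

```python
import re

def matches_extension(filename, extension):
    """Extension-only match, in the spirit of -process_extension "MOV"."""
    return re.search(r"\." + re.escape(extension) + r"$", filename) is not None

def matches_expression(filename, expression):
    """Regular expression tested against the whole filename, like -process_exp."""
    return re.search(expression, filename) is not None

files = ["DCP_0163.MOV", "holiday.MOV", "DCP_0164.JPG", "notes.MOV.txt"]

# Every file that ends in .MOV
by_extension = [f for f in files if matches_extension(f, "MOV")]
# Only camera files named DCP_<digits>.MOV
by_expression = [f for f in files if matches_expression(f, r"^DCP_\d+\.MOV$")]

print(by_extension)
print(by_expression)
```

An expression is the better choice when only some files with a given extension should be picked up by a particular UnknownPlugin line.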

How do I get my CDS/ISIS database into Greenstone?

Creating digital libraries based on CDS/ISIS databases (En Español) is a detailed guide for using CDS/ISIS databases in Greenstone.

How do I get my PMB database into Greenstone?

PMB is an open source integrated library management system. The name stands for "PhpMyBibli". It supports the UNIMARC format (not MARC 21). Greenstone doesn't support PMB files. However, you can use WINISIS as a bridge: export records from PMB and import them with WINISIS. Then you can reorganize the MARC tags to convert from UNIMARC to MARC 21 and integrate the records with Greenstone using the plugin available for CDS/ISIS (see above).

How do I get my XML files into Greenstone?

There are two main options for getting XML files into Greenstone: using XSL or writing a customised plugin.

XSL

Outside of Greenstone, you can use XSL (or another procedure) to generate either HTML, which can be processed by HTMLPlug, or Greenstone Archive files. If you generate archive files, you will not need to run the import phase of collection building. You will also not be able to build the collection in the Librarian Interface. You can use the Librarian Interface to configure your collection, but you will need to build it on the command line. See here for information about command line building.
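As an example of the "other procedure" route, here is a minimal sketch that turns a made-up XML record into an HTML page. The `<record>` format, its element names, and the choice of an Author meta tag are all assumptions for illustration; whether HTMLPlug picks up the meta tag depends on how its `-metadata_fields` option is set.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical source record; your own XML format will differ.
SOURCE = """<record>
  <title>Harbour survey 1998</title>
  <author>J. Smith &amp; co.</author>
  <body>Depth readings were taken at twelve sites.</body>
</record>"""

def record_to_html(xml_text):
    """Convert one <record> into a small HTML page with a metadata tag."""
    record = ET.fromstring(xml_text)
    title = record.findtext("title", default="")
    author = record.findtext("author", default="")
    body = record.findtext("body", default="")
    return (
        "<html><head>\n"
        f"<title>{html.escape(title)}</title>\n"
        f'<meta name="Author" content="{html.escape(author, quote=True)}">\n'
        "</head><body>\n"
        f"<p>{html.escape(body)}</p>\n"
        "</body></html>\n"
    )

page = record_to_html(SOURCE)
print(page)
```

Each generated page would be written to the collection's import folder and then imported in the usual way.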

New Plugin

The other option is to write a new plugin to process your particular XML format. This plugin will inherit from XMLPlug. You need to implement the constructor (the `new` method), as well as the XML parsing callback methods, such as xml_doctype, xml_start_tag, xml_end_tag and xml_text. The plugin will parse the source XML file and build up a doc object in memory, which gets written out as an archive file. greenstone/perllib/plugins/GreenstoneArchivesPlugin.pm inherits from XMLPlug, so you can use it as a model.
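The callback style of parsing can be sketched outside Greenstone. The example below uses Python's expat binding rather than the Perl plugin API, and the `<record>` format and handler class are hypothetical; it only shows how start-tag, end-tag and text callbacks cooperate to build up a document in memory.

```python
import xml.parsers.expat

class RecordHandler:
    """Collects metadata and body text from a hypothetical <record> format."""

    def __init__(self):
        self.metadata = {}
        self.text = []
        self._current = None  # name of the element being read, if any

    def start_tag(self, name, attrs):
        self._current = name

    def end_tag(self, name):
        self._current = None

    def char_data(self, data):
        if self._current == "body":
            self.text.append(data)
        elif self._current is not None and data.strip():
            # Text may arrive in several chunks, so append rather than assign.
            self.metadata[self._current] = self.metadata.get(self._current, "") + data

def parse_record(xml_text):
    handler = RecordHandler()
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = handler.start_tag
    parser.EndElementHandler = handler.end_tag
    parser.CharacterDataHandler = handler.char_data
    parser.Parse(xml_text, True)
    return handler.metadata, "".join(handler.text)

metadata, text = parse_record(
    "<record><title>Harbour survey</title><body>Depth readings.</body></record>"
)
print(metadata, text)
```

In a real plugin the same bookkeeping would live in the xml_start_tag, xml_end_tag and xml_text methods, with the collected values added to the doc object instead of a dictionary.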

How do I use DatabasePlugin?

DatabasePlugin uses Perl's DBI module to get records out of databases, such as MySQL, PostgreSQL, comma separated values (CSV), MS Excel, ODBC, Sybase etc. You will need to have the DBI module installed, as well as the appropriate back end module(s).

Assuming you have all the necessary modules installed, the basic way to use DatabasePlugin is:

Add DatabasePlugin to the list of plugins for your collection.
Copy greenstone/etc/packages/example.dbi into the import directory of your collection.
Modify this file appropriately. You may want to have more than one copy of the file, for different database connections or queries. The name does not matter, but the file extension should be .dbi.
Import and build the collection.
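The general idea of turning query results into documents with metadata can be sketched as follows. This is not DatabasePlugin or DBI: it uses Python's built-in sqlite3 module with an in-memory database, and the table, column names and the "one row becomes one document" mapping are assumptions for illustration. The real mapping is whatever you set up in your .dbi file.

```python
import sqlite3

def rows_as_documents(connection, query):
    """Run a query and return one {column: value} dict per row."""
    cursor = connection.execute(query)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

# Hypothetical table and data, created in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (Title TEXT, Creator TEXT, Year INTEGER)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [("Harbour survey", "J. Smith", 1998), ("Tide tables", "A. Jones", 2001)],
)

documents = rows_as_documents(
    conn, "SELECT Title, Creator, Year FROM books ORDER BY Year"
)
for doc in documents:
    print(doc)
conn.close()
```

Here the column names play the role of metadata field names, which is why the query in a database configuration file is worth choosing carefully.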

Here is what I had to do to process a comma-separated-value file.

Here is what I had to do to use DBPlug to get records out of a MySQL database.

Here is what I had to do to process an Excel file.

Here is what I had to do to use DBPlug to get records out of a MS Access database.

How do I build collections of documents with images and OCR text (PagedImagePlugin)?

Please see the following tutorials:

How do I use CONTENTdmPlugin?

CONTENTdm is a commercial digital library system (http://www.dimema.com/) that provides tools for organizing, managing and searching digital collections over the Internet.

Collections in a CONTENTdm digital library can be exported in RDF format, and CONTENTdmPlugin processes only that RDF file. It treats each <rdf:Description> element in the RDF file as a document and transforms it into a Greenstone archive file, collecting metadata along the way. CONTENTdmPlugin uses a modified version of the XML::Parser class, so it can process both well-formed and not-well-formed RDF files; a warning message is output if the RDF file is not well formed. The image files are handled by PagedImagePlugin, which is the secondary plugin of CONTENTdmPlugin.
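The "one <rdf:Description> per document" idea can be sketched with Python's standard library. The sample RDF, the Dublin Core fields and the function name are assumptions for illustration, and unlike the plugin this sketch only accepts well-formed input.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical export; a real CONTENTdm file will carry more fields.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">
  <rdf:Description rdf:about="http://example.org/item/1">
    <dc:title>Harbour at dusk</dc:title>
    <dc:creator>J. Smith</dc:creator>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/item/2">
    <dc:title>Tide tables</dc:title>
  </rdf:Description>
</rdf:RDF>"""

def split_descriptions(rdf_text):
    """Return one metadata dict per rdf:Description element."""
    root = ET.fromstring(rdf_text)
    documents = []
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        metadata = {}
        for child in description:
            field = child.tag.split("}", 1)[-1]  # drop the namespace part
            metadata.setdefault(field, []).append((child.text or "").strip())
        documents.append(metadata)
    return documents

docs = split_descriptions(SAMPLE)
for doc in docs:
    print(doc)
```

Values are kept in lists because a field such as creator or subject may appear more than once in a single description.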

CONTENTdmPlugin has four parameters:

convert_to: (html|text|pagedimg) compulsory option; html is the default
xslt apply: an XSLT file is applied to the RDF file to leave out some content
process_exp: CONTENTdmPlugin only handles .rdf files by default
block_exp: CONTENTdmPlugin blocks (jpg|jpeg|gif) files by default

For example:

 plugin          CONTENTdmPlugin -convert_to html -keep_original_filename

How do I build a collection from a MediaWiki website?

MediaWikiPlugin processes the HTML pages, suppresses unnecessary fragments such as tabs, the toolbox, and edit links, and converts the files into Greenstone's internal format.

MediaWikiPlugin has eight parameters:

show_toc: display the table of contents from the website's main page on the collection's home page
delete_toc: suppress the table of contents on each page
toc_exp: Perl regular expression for matching the table of contents
delete_nav: suppress the navigation box on each page
nav_div_exp: Perl regular expression for matching the navigation box
delete_searchbox: suppress the search box on each page
searchbox_div_exp: Perl regular expression for matching the search box
remove_title_suffix_exp: Perl regular expression for matching the suffix to remove from the extracted title

Here is an example that uses all the options:

  plugin          MediaWikiPlugin -show_toc -delete_toc -toc_exp <table([^>]*)id=(\"|')toc(\"|')(.|\n)*?</table>
                   -nav_div_exp <div([^>]*)id=(\"|')p-navigation(\"|')(.|\n)*?</div> -delete_nav
                   -searchbox_div_exp <div([^>]*)id=(\"|')p-search(\"|')(.|\n)*?</div> -delete_searchbox 
                   -remove_title_suffix_exp \s-(.*)$
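To see what these expressions match before building a collection, you can try them on a saved page. The sketch below uses Python's re module in place of Perl, with the backslash-escaped quotes from the configuration line written as plain quotes; the sample page and function names are made up for the example.

```python
import re

# The table of contents and title suffix expressions from the example above.
TOC_EXP = r"""<table([^>]*)id=("|')toc("|')(.|\n)*?</table>"""
TITLE_SUFFIX_EXP = r"\s-(.*)$"

def strip_toc(page_html):
    """Remove the table of contents block from a page."""
    return re.sub(TOC_EXP, "", page_html)

def strip_title_suffix(title):
    """Remove everything from the first ' -' onwards in a title."""
    return re.sub(TITLE_SUFFIX_EXP, "", title)

# Hypothetical page fragment.
page = (
    "<h1>Plugins</h1>\n"
    '<table class="x" id="toc">\n<tr><td>Contents</td></tr>\n</table>\n'
    "<p>Body text.</p>"
)

print(strip_toc(page))
print(strip_title_suffix("Plugins - GreenstoneWiki"))
```

If your wiki uses a different skin, the id values in the expressions (toc, p-navigation, p-search) may need adjusting to match its HTML.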
