This version (2014/05/15 10:39) is a draft.

Greenstone Archive Formats

During the import phase of the build process, Greenstone stores everything it gathers from the source documents (extracted text, extracted metadata, and assigned metadata) in an XML format for later use. Then, during the actual collection building, a plugin processes these files, and their contents are used to create browsing classifiers, partitions, and indexes.

By default, Greenstone stores this data in the Greenstone Archive Format (also referred to as GreenstoneXML), which is processed by GreenstoneXMLPlugin. However, Greenstone also has a METS profile approved by the Library of Congress (LOC), which can be used instead. To use GreenstoneMETS, follow the GLI or command-line steps given under "Greenstone METS format".

Greenstone XML Format

During collection "importing", all source documents are brought into the Greenstone system by converting them to a format known as the Greenstone Archive Format. This is an XML style that marks documents into sections, and can hold metadata at the document or section level. During collection "building" these archive documents are processed, and the content indexed and classified.

In XML, markup tags are enclosed in angle brackets. The Greenstone archive format encodes documents that are already in HTML, so any embedded <, >, or " characters within the original text are escaped using the standard entities &lt;, &gt; and &quot;.
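The escaping convention can be illustrated with a short Python sketch. It is not Greenstone's own code; the function names are hypothetical, and it relies only on the standard library. Note that the standard escaping routine also converts & to &amp;, which XML requires even though the text above lists only three characters.

```python
from xml.sax.saxutils import escape, unescape

# escape() handles &, < and > by default; the double quote is mapped explicitly.
QUOTE_ENTITY = {'"': "&quot;"}


def escape_for_archive(html_text: str) -> str:
    """Escape markup characters so HTML can be stored as text inside an XML element.

    Hypothetical helper for illustration only, not part of Greenstone.
    """
    return escape(html_text, QUOTE_ENTITY)


def unescape_from_archive(xml_text: str) -> str:
    """Reverse escape_for_archive()."""
    return unescape(xml_text, {"&quot;": '"'})


original = '<p class="intro">Fish & chips</p>'
escaped = escape_for_archive(original)
print(escaped)
print(unescape_from_archive(escaped) == original)
```

Escaping the ampersand first is what makes the round trip safe: an entity that is already present in the source HTML is not mistaken for one introduced by the escaping step.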

The XML Document Type Definition (DTD) for the Greenstone archive format is simple. A document is split up into Sections, which can be nested. Each Section has a Description that comprises zero or more Metadata items, and a Content part (which may be empty), where the actual document's contents go. Each Metadata element has a name attribute (the name can be anything) and some textual data. In XML, PCDATA stands for "parsed character data": basically text.

Consider a simple document in this format: a short book with two associated images. The book has two sections, called 'Preface' and 'First and only chapter', and the second of these has two subsections. Note that there is no notion of a "chapter" as such: it is represented simply as a top-level section.
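A document with that shape can be assembled with Python's standard ElementTree module, as in the sketch below. This is only an illustration: the helper name, the book title, the subsection titles, the text, and the image file names are all hypothetical, and placing nested Sections after the Content element is an assumption about element order rather than something stated here.

```python
import xml.etree.ElementTree as ET


def make_section(parent=None, title=None, text=None, metadata=()):
    """Create a Section holding a Description and a Content part.

    Hypothetical helper for illustration only, not part of Greenstone.
    """
    if parent is None:
        section = ET.Element("Section")
    else:
        section = ET.SubElement(parent, "Section")
    description = ET.SubElement(section, "Description")
    for name, value in metadata:
        ET.SubElement(description, "Metadata", name=name).text = value
    if title is not None:
        ET.SubElement(description, "Metadata", name="Title").text = title
    # Content may be empty: text=None yields an empty Content element.
    ET.SubElement(section, "Content").text = text
    return section


# Document-level section: metadata here applies to the whole book.
# The gsdlassocfile values are placeholders, not Greenstone's real value syntax.
book = make_section(
    title="A short book",
    metadata=[
        ("gsdlassocfile", "cover.jpg"),
        ("gsdlassocfile", "figure1.png"),
    ],
)
make_section(book, title="Preface", text="Text of the preface.")
chapter = make_section(book, title="First and only chapter")
make_section(chapter, title="Part 1", text="Text of the first part.")
make_section(chapter, title="Part 2", text="Text of the second part.")

print(ET.tostring(book, encoding="unicode"))
```

The "chapter" is just another Section; only its Title metadata distinguishes it from the preface.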

The <Section> tag denotes the start of each document section, and the corresponding </Section> closing tag marks the end of that section. Following each <Section> tag is a <Description> element, which contains any number of <Metadata> elements. Different metadata can thus be associated with individual sections of a document. Most of these elements hold a particular metadata type, such as Title, identified by the name attribute.
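Because metadata lives in the Description of each Section, reading it back is a matter of walking the nested Sections. The following sketch assumes the element names described above and uses a small made-up sample document; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample document for illustration.
SAMPLE = """<Section>
  <Description>
    <Metadata name="Title">A short book</Metadata>
  </Description>
  <Content></Content>
  <Section>
    <Description>
      <Metadata name="Title">Preface</Metadata>
    </Description>
    <Content>Text of the preface.</Content>
  </Section>
</Section>"""


def section_metadata(section, depth=0):
    """Yield (depth, {name: [values]}) for a section and its subsections.

    Sections are visited in document order; a name may carry several values.
    """
    found = {}
    for item in section.findall("Description/Metadata"):
        found.setdefault(item.get("name"), []).append(item.text or "")
    yield depth, found
    for child in section.findall("Section"):
        yield from section_metadata(child, depth + 1)


for depth, metadata in section_metadata(ET.fromstring(SAMPLE)):
    print("  " * depth + str(metadata))
```

Values are collected into lists because the same metadata name can legitimately appear more than once in a Description.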

Some metadata elements are special to Greenstone:

Identifier: the Greenstone identifier for the document. It must be unique within the collection.
gsdlsourcefilename: the original file from which the archive file was generated (path relative to the import directory).
gsdldoctype: generally set to indexed_doc.
gsdlassocfile: one or more associated files that belong to the document. These are copied into the index directory during collection building. They may include a cover image, images that an HTML page links to, or the original file for a Word or PDF document.
assocfilepath: the subdirectory of index/assoc in which this document's associated files are stored.
hascover: set to 1 if the document has a cover image, which must be named cover.jpg.
Source: the original file name.
srclink: a link to the original source file, for file types that have been converted (such as Word or PDF) or binary file types (such as MP3 or images).
srcicon: an appropriate icon for the source file type.
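Since Identifier must be unique within a collection, a quick consistency check over a set of archive documents can be sketched as follows. The function names and the sample identifiers are hypothetical, and the code assumes that the document-level Section is either the root element or the first Section beneath it.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def document_identifier(archive_xml):
    """Return the Identifier metadata of the document-level section, or None."""
    root = ET.fromstring(archive_xml)
    # Assumption: the document-level Section is the root or the first Section below it.
    top = root if root.tag == "Section" else root.find(".//Section")
    if top is None:
        return None
    return top.findtext("Description/Metadata[@name='Identifier']")


def duplicate_identifiers(documents):
    """Return the sorted identifiers that occur in more than one document."""
    counts = Counter(document_identifier(doc) for doc in documents)
    return sorted(i for i, n in counts.items() if i is not None and n > 1)


# Made-up documents; the identifiers are placeholders.
DOC = (
    '<Section><Description><Metadata name="Identifier">{}</Metadata>'
    "</Description><Content /></Section>"
)
documents = [DOC.format(i) for i in ("HASH01", "HASH02", "HASH01")]
print(duplicate_identifiers(documents))
```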

Greenstone METS format

In Greenstone we use METS in a very specific way: as an alternative archive format to the Greenstone Archive Format. If the option '-saveas GreenstoneMETS' is used with import.pl (or export.pl), then source documents are converted to the Greenstone METS profile, which uses Dublin Core for its metadata. This profile divides documents into sections, stores metadata at the section or document level, and uses XML XPointer syntax to locate the content of the source documents, which is stored in a temporary XML file. Then, when building (indexing) the collection, the METS plugin (GreenstoneMETSPlugin) is used to read in the METS files. The plugin is therefore designed to process only METS documents that match the Greenstone METS profile.

If you want to see our METS format, you can import (or export) a collection and save it in the GreenstoneMETS format.

To build a collection using the GreenstoneMETS format in the GLI:

  • Switch to Expert mode (File → Preferences → Mode → Expert)
  • In the Create panel under Import Options, check saveas and select GreenstoneMETS from the dropdown menu
  • In the Design panel under Document Plugins, remove GreenstoneXMLPlugin and add GreenstoneMETSPlugin

On the command line, you can use one of the following commands:

import.pl -saveas GreenstoneMETS collection_name
export.pl -saveas GreenstoneMETS collection_name

Then, in the archives (or export) directory in the collection, you will see two files: docmets.xml, which stores metadata (at the document or section level) and associated file pointers, and doctxt.xml, which stores the content of the source document in an XML format.
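After an import, one way to confirm that every document produced both files is to scan the output directory. The sketch below is not a Greenstone tool: the function name is hypothetical, and walking the directory tree recursively is an assumption about where the files sit. The demonstration uses a temporary directory with a made-up subdirectory name rather than a real collection.

```python
import os
import tempfile

EXPECTED = ("docmets.xml", "doctxt.xml")


def find_mets_documents(archives_dir):
    """Map each directory holding at least one expected file to its missing files.

    An empty list means both docmets.xml and doctxt.xml are present.
    """
    report = {}
    for dirpath, _dirnames, filenames in os.walk(archives_dir):
        present = set(filenames)
        if present & set(EXPECTED):
            report[dirpath] = [name for name in EXPECTED if name not in present]
    return report


# Demonstration on a throwaway directory; "doc1.dir" is a made-up name.
with tempfile.TemporaryDirectory() as archives:
    doc_dir = os.path.join(archives, "doc1.dir")
    os.makedirs(doc_dir)
    for name in EXPECTED:
        open(os.path.join(doc_dir, name), "w").close()
    print(find_mets_documents(archives))
```

To check a real collection, pass the path of its archives (or export) directory instead of the temporary one.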

The Greenstone METS profile has been officially approved by the Library of Congress and you can view the relevant document here.

To add a different kind of METS document to a collection, you will need to convert it to either the Greenstone Archive Format or the Greenstone METS format. This can be done using XSLT. You could convert all the original METS documents into Greenstone METS, put them in the archives directory, and generate an archives.inf file that lists the document IDs and their corresponding files. (Import a small collection, e.g. demo, into METS format and have a look at its archives.inf file to see what it's like.) Then build the collection using buildcol.pl.

Alternatively, the following should work in theory, but I have not tried it.

Put the original METS documents in the import directory and write an XSLT stylesheet to convert them to Greenstone METS format. Use GreenstoneMETSPlugin in your collection, set its process_exp option to match the files you want processed, and set its xslt option to the stylesheet you created (the path is relative to the Greenstone or collection directory). Then import and build as normal.