Greenstone Archive Formats
During the import portion of the build process, Greenstone stores everything it gathers (extracted metadata, extracted text, and assigned metadata) in an XML format for later use. Then, during the actual collection building, a plugin processes these files, and the result is used to create browsing classifiers, partitions, and indexes.
By default, Greenstone stores this information in the Greenstone Archive Format (also referred to as GreenstoneXML), which is processed by GreenstoneXMLPlugin. However, Greenstone also has a METS profile approved by the Library of Congress (LOC), which can be used instead. To use GreenstoneMETS, you must:
- Select GreenstoneMETS as the storage format
- Include the GreenstoneMETSPlugin
Greenstone XML Format
During collection "importing", all source documents are brought into the Greenstone system by converting them to a format known as the Greenstone Archive Format. This is an XML format that divides documents into sections and can hold metadata at the document or section level. During collection "building", these archive documents are processed, and their content is indexed and classified.
In XML, markup tags are enclosed in angle brackets. The Greenstone archive format encodes documents that are already in HTML, so any embedded <, >, or " characters within the original text are escaped using the standard entities &lt;, &gt; and &quot;.
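The escaping convention can be illustrated with a short Python sketch. This is not Greenstone's own code; the function names are hypothetical, and the sketch relies only on the Python standard library. Note that a standard XML escaper also rewrites a literal & as &amp;, which the text above does not mention.

```python
from xml.sax.saxutils import escape, unescape

# escape() handles &, < and > by default; the double quote is added here.
QUOTE_ENTITY = {'"': "&quot;"}


def escape_for_archive(html_text):
    """Escape HTML so it can be stored as text inside an XML element.

    Hypothetical helper for illustration, not part of Greenstone.
    """
    return escape(html_text, QUOTE_ENTITY)


def unescape_from_archive(escaped_text):
    """Reverse escape_for_archive()."""
    return unescape(escaped_text, {"&quot;": '"'})


sample = '<a href="page.html">Next</a>'
escaped = escape_for_archive(sample)
print(escaped)
print(unescape_from_archive(escaped) == sample)
```

The round trip at the end is a quick check that nothing is lost when the original HTML is recovered from the escaped form.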
The structure of the Greenstone archive format is defined by an XML Document Type Definition (DTD). A document is split up into Sections, which can be nested. Each Section has a Description that comprises zero or more Metadata items, and a Content part (which may be empty), which is where the actual document's contents go. Each Metadata element has a name attribute (the name can be anything) and some textual data. In XML, PCDATA stands for "parsed character data", which is essentially text.
Consider a simple document in this format: a short book with two associated images. The book has two sections called 'Preface' and 'First and only chapter' respectively, the second of which has two subsections. Note that there is no notion of a "chapter" as such: it is represented simply as a top-level section.
The <Section> tag denotes the start of each document section, and the corresponding </Section> closing tag marks the end of that section. Following each <Section> tag is a <Description> element, which holds any number of <Metadata> elements. Different metadata can therefore be associated with individual sections of a document. Most of these elements carry particular metadata types, such as Title, identified by the name attribute.
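As a rough illustration of this nesting, the following Python sketch builds the short book described above using the element names mentioned in the text (Section, Description, Metadata with a name attribute, and Content). It is a sketch under stated assumptions, not Greenstone's implementation: the helper name, the book title, the subsection titles, and the sample content are all hypothetical, and any outer wrapper or DTD declaration that a real archive file carries is not reproduced.

```python
import xml.etree.ElementTree as ET


def make_section(metadata, content=None):
    """Build one Section holding a Description and a Content part.

    metadata is a list of (name, value) pairs; content may be None,
    which yields an empty Content element.
    """
    section = ET.Element("Section")
    description = ET.SubElement(section, "Description")
    for name, value in metadata:
        item = ET.SubElement(description, "Metadata", {"name": name})
        item.text = value
    content_el = ET.SubElement(section, "Content")
    content_el.text = content
    return section


# Document level: metadata here applies to the whole book (title is made up).
book = make_section([("Title", "A short book")])

# Two top-level sections; the second has two subsections.
book.append(make_section([("Title", "Preface")], "<p>Preface text</p>"))
chapter = make_section([("Title", "First and only chapter")])
chapter.append(make_section([("Title", "Subsection one")], "<p>Text one</p>"))
chapter.append(make_section([("Title", "Subsection two")], "<p>Text two</p>"))
book.append(chapter)

# ElementTree escapes < and > in the HTML content when serializing.
xml_text = ET.tostring(book, encoding="unicode")
print(xml_text)
```

Because each Section owns its own Description, a title or any other metadata item can be attached at whichever level of the hierarchy it describes.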
Some metadata elements are special to Greenstone:
|Identifier||The Greenstone identifier for the document. Must be unique within the collection.|
|gsdlsourcefilename||The original file from which the archive file was generated (path relative to the import directory).|
|gsdldoctype||Generally set to indexed_doc.|
|gsdlassocfile||One or more associated files that belong to the document. These get copied over during collection building into the index directory. They may include a cover image, images that are linked to by an HTML page, or the original file for a Word or PDF document.|
|assocfilepath||The subdirectory of index/assoc in which this document's associated files are stored.|
|hascover||Set to 1 if the document has a cover image, which must be named cover.jpg.|
|Source||The original file name.|
|srclink||A link to the original source file, for file types that have been converted (such as Word, PDF) or binary file types (such as MP3, images).|
|srcicon||An appropriate icon for the source file type.|
Greenstone METS Format
In Greenstone we use METS in a very specific way: as an alternative archive format to the Greenstone Archive Format. If the option '-saveas GreenstoneMETS' is used with import.pl (or export.pl), source documents are converted to the Greenstone METS profile, which uses Dublin Core for its metadata. This format divides documents into sections, stores metadata at the section or document level, and uses XML XPointer syntax to locate the content of the source documents, which is stored in a temporary XML file. Then, when building (indexing) the collection, GreenstoneMETSPlugin is used to read in the METS files. The plugin is therefore designed to process only METS documents that match the Greenstone METS profile.
If you want to see our METS format, you can import (or export) a collection and save it in the GreenstoneMETS format.
To build a collection using the GreenstoneMETS format in the GLI:
- Switch to Expert mode (File → Preferences → Mode → Expert)
- In the Create panel under Import Options, check saveas and select GreenstoneMETS from the dropdown menu
- In the Design panel under Document Plugins, remove GreenstoneXMLPlugin and add GreenstoneMETSPlugin
On the command line, you can use the following command (substitute export.pl for import.pl to export instead):
import.pl -saveas GreenstoneMETS collection_name
Then, in the archives (or export) directory in the collection, you will see two files: docmets.xml, which stores metadata (at the document or section level) and associated file pointers, and doctxt.xml, which stores the content of the source document in an XML format.
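To confirm that an import produced both files for every document, a small Python sketch such as the following could be used. It is an illustration only: the function name is hypothetical, and the directory names in the demo (including HASH0123.dir) are invented stand-ins for whatever layout Greenstone actually writes under archives. The function searches at any depth, so it does not depend on a particular layout.

```python
import os
import tempfile


def find_mets_documents(archives_dir):
    """Return the directories under archives_dir that contain both
    docmets.xml and doctxt.xml (hypothetical helper, not part of Greenstone).
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(archives_dir):
        if "docmets.xml" in filenames and "doctxt.xml" in filenames:
            found.append(dirpath)
    return sorted(found)


# Demo on a throwaway directory tree with made-up names.
with tempfile.TemporaryDirectory() as root:
    complete = os.path.join(root, "HASH0123.dir")
    os.makedirs(complete)
    for name in ("docmets.xml", "doctxt.xml"):
        open(os.path.join(complete, name), "w").close()

    # A directory missing doctxt.xml should not be reported.
    incomplete = os.path.join(root, "incomplete.dir")
    os.makedirs(incomplete)
    open(os.path.join(incomplete, "docmets.xml"), "w").close()

    demo_result = [os.path.relpath(p, root) for p in find_mets_documents(root)]

print(demo_result)
```

In practice the function would be pointed at the collection's archives (or export) directory rather than a temporary one.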
The Greenstone METS profile has been officially approved by the Library of Congress and you can view the relevant document here.
To add METS documents of a different kind to a collection, you will need to convert them to either our Greenstone Archive Format or our Greenstone METS format. This can be done using XSLT. You could convert all the original METS documents into Greenstone METS, put them in the archives directory, and generate an archives.inf file listing document IDs and corresponding files. (Import a small collection, e.g. demo, into METS format and have a look at the archives.inf file to see what it's like.) Then build the collection using buildcol.pl.
Alternatively, the following should work in theory, but I have not tried it.
Put the original METS documents in the import directory and write an XSLT file to convert them to Greenstone METS format. Use GreenstoneMETSPlugin in your collection, set the process_exp option to match the files you want processed, and set the xslt option to point to the XSLT file that you created (relative to the Greenstone or collection directory). Then import and build as normal.