Greenstone Archive Format

During collection "importing", all source documents are brought into the Greenstone system by converting them to the Greenstone Archive Format. This is an XML-based format that divides documents into sections and can hold metadata at the document or section level. During collection "building", these archive documents are processed, and their content is indexed and classified.

In XML, markup tags are enclosed in angle brackets. The Greenstone archive format encodes documents that are already in HTML, so any embedded <, >, or " characters within the original text are escaped using the standard entities &lt;, &gt; and &quot;.
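The escaping convention above can be sketched in Python with the standard library. This is only an illustration, not Greenstone's own import code; the function names `escape_for_archive` and `unescape_from_archive` are hypothetical.

```python
from xml.sax.saxutils import escape, unescape

# escape() converts <, > and & by default; the double quote is added explicitly.
# The & character is not listed in the text above, but XML requires it to be
# escaped as well, so it is handled here too.
QUOTE_TO_ENTITY = {'"': "&quot;"}
ENTITY_TO_QUOTE = {"&quot;": '"'}


def escape_for_archive(text: str) -> str:
    """Escape markup characters so an HTML fragment can be stored as XML text."""
    return escape(text, QUOTE_TO_ENTITY)


def unescape_from_archive(text: str) -> str:
    """Reverse escape_for_archive, recovering the original HTML fragment."""
    return unescape(text, ENTITY_TO_QUOTE)


html_fragment = '<p class="intro">Fish & chips</p>'
escaped = escape_for_archive(html_fragment)
print(escaped)  # &lt;p class=&quot;intro&quot;&gt;Fish &amp; chips&lt;/p&gt;
```

Unescaping the result returns the original fragment unchanged, which is what lets the HTML survive a round trip through the archive.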

The XML Document Type Definition (DTD) for the Greenstone archive format defines the following structure. A document is split up into Sections, which can be nested. Each Section has a Description that comprises zero or more Metadata items, and a Content part (which may be empty), where the actual document's contents go. Each Metadata element has a name attribute (the name can be anything) and some textual data. In XML, PCDATA stands for "parsed character data": essentially, text.
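To make the structure concrete, here is a minimal sketch that builds one nested Section with Python's `xml.etree.ElementTree`. The helper `make_section` and the metadata name `Title` are illustrative choices, and the child order (Description, then Content, then nested Sections) is an assumption based on the description above rather than a quotation of the DTD.

```python
import xml.etree.ElementTree as ET


def make_section(metadata, content="", subsections=()):
    """Build a Section element (hypothetical helper, not part of Greenstone).

    metadata    -- iterable of (name, value) pairs; the name can be anything
    content     -- the section's text; may be empty
    subsections -- already-built Section elements to nest inside this one
    """
    section = ET.Element("Section")
    description = ET.SubElement(section, "Description")
    for name, value in metadata:
        item = ET.SubElement(description, "Metadata", {"name": name})
        item.text = value
    content_el = ET.SubElement(section, "Content")
    content_el.text = content
    # Assumed order: nested Sections follow the Content element.
    section.extend(subsections)
    return section


subsection = make_section([("Title", "Part one")], "<p>Subsection text</p>")
chapter = make_section(
    [("Title", "First and only chapter")],
    "<p>Chapter text</p>",
    subsections=[subsection],
)

# Serializing escapes the embedded HTML angle brackets automatically.
archive_xml = ET.tostring(chapter, encoding="unicode")
print(archive_xml)
```

Because the content is stored as ordinary element text, the serializer escapes the embedded HTML tags, so the output stays well-formed XML.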

[/gsdoc/others/sample-doc.xml Here] you can see a simple document in this format, comprising a short book with two associated images. The book has two sections called 'Preface' and 'First and only chapter' respectively, the second of which has two subsections. Note that there is no notion of a "chapter" as such: it is represented simply as a top-level section. The <Section> tag denotes the start of each document section, and the corresponding </Section> closing tag marks the end of that section. Following each <Section> tag is a <Description> element. Within this come any number of <Metadata> elements. Thus different metadata can be associated with individual sections of a document. Most of these hold a particular type of metadata, identified by the name attribute.
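Reading such a document back is a matter of walking the nested sections. The sketch below does this in Python on a small made-up archive modeled on the book described above. The `SAMPLE` string is not the contents of sample-doc.xml; the enclosing Section for the whole book, the subsection titles, and the `Title` metadata name are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical archive document, loosely following the description above.
SAMPLE = """\
<Section>
  <Description>
    <Metadata name="Title">A short book</Metadata>
  </Description>
  <Content></Content>
  <Section>
    <Description><Metadata name="Title">Preface</Metadata></Description>
    <Content>&lt;p&gt;Preface text&lt;/p&gt;</Content>
  </Section>
  <Section>
    <Description><Metadata name="Title">First and only chapter</Metadata></Description>
    <Content>&lt;p&gt;Chapter text&lt;/p&gt;</Content>
    <Section>
      <Description><Metadata name="Title">Subsection one</Metadata></Description>
      <Content>&lt;p&gt;First part&lt;/p&gt;</Content>
    </Section>
    <Section>
      <Description><Metadata name="Title">Subsection two</Metadata></Description>
      <Content>&lt;p&gt;Second part&lt;/p&gt;</Content>
    </Section>
  </Section>
</Section>
"""


def walk(section, depth=0):
    """Yield (depth, metadata, content) for a section and all its subsections."""
    # A dict keeps only the last value if a name repeats; fine for this sketch.
    metadata = {
        item.get("name"): (item.text or "")
        for item in section.findall("Description/Metadata")
    }
    content = section.findtext("Content", default="")
    yield depth, metadata, content
    for child in section.findall("Section"):
        yield from walk(child, depth + 1)


root = ET.fromstring(SAMPLE)
outline = [(depth, metadata.get("Title")) for depth, metadata, _ in walk(root)]
for depth, title in outline:
    print("  " * depth + str(title))
```

Each section carries its own metadata dictionary, which mirrors the point that metadata can be attached to individual sections rather than only to the document as a whole. The parser also unescapes the content, so the text yielded for the Preface is the original HTML fragment.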

Some metadata elements are special to Greenstone: