
This page is in the 'old' namespace, and was imported from our previous wiki. We recommend checking for more up-to-date information using the search box.

Greenstone Archive Format

During collection "importing", all source documents are brought into the Greenstone system by converting them to a format known as the Greenstone Archive Format. This is an XML-based format that divides documents into sections and can hold metadata at the document or section level. During collection "building", these archive documents are processed, and their content is indexed and classified.

In XML, markup tags are enclosed in angle brackets. The Greenstone archive format encodes documents that are already in HTML, so any embedded <, >, or " characters within the original text are escaped using the standard entities &lt;, &gt; and &quot;.
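The escaping convention can be illustrated with Python's standard library. This is only a sketch of the general XML convention, not Greenstone's own import code; the helper names are hypothetical, and escaping of & as &amp; is standard XML behaviour that this page does not mention explicitly.

```python
from xml.sax.saxutils import escape, unescape

# escape() handles &, < and > by default; the extra mapping adds double quotes.
QUOTE_ESCAPE = {'"': "&quot;"}
QUOTE_UNESCAPE = {"&quot;": '"'}


def escape_for_archive(text: str) -> str:
    """Escape markup characters so HTML can sit inside an XML element as text.

    Hypothetical helper name, used here for illustration only.
    """
    return escape(text, QUOTE_ESCAPE)


def unescape_from_archive(text: str) -> str:
    """Reverse escape_for_archive()."""
    return unescape(text, QUOTE_UNESCAPE)


html = '<p class="intro">Fish & chips</p>'
escaped = escape_for_archive(html)
print(escaped)
print(unescape_from_archive(escaped) == html)
```

An XML parser performs the reverse step automatically when it reads element text, so the second helper is only needed when handling the raw file as a string.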

The XML Document Type Definition (DTD) for the Greenstone archive format is simple. A document is split up into Sections, which can be nested. Each Section has a Description that comprises zero or more Metadata items, and a Content part (which may be empty); this is where the actual document's contents go. Each Metadata element has a name attribute (the name can be anything) and some textual data. In XML, PCDATA stands for "parsed character data": basically text.
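The structural rules described above can be sketched as a small checker. It is not a DTD validator and is not part of Greenstone; the function name is hypothetical, and it assumes that nested Sections are direct children of their parent Section and that the element passed in is itself a Section.

```python
import xml.etree.ElementTree as ET


def check_section(section, path="Section"):
    """Return a list of problems found in a Section and its nested Sections."""
    problems = []
    descriptions = section.findall("Description")
    if len(descriptions) != 1:
        problems.append(f"{path}: expected exactly one Description")
    for description in descriptions:
        for child in description:
            if child.tag != "Metadata":
                problems.append(f"{path}: unexpected <{child.tag}> in Description")
            elif "name" not in child.attrib:
                problems.append(f"{path}: Metadata without a name attribute")
    if len(section.findall("Content")) != 1:
        problems.append(f"{path}: expected exactly one Content element")
    # Assumption: nested Sections are direct children of the parent Section.
    for index, child in enumerate(section.findall("Section"), start=1):
        problems.extend(check_section(child, f"{path}/Section[{index}]"))
    return problems


# Illustrative input: the nested Section has a Metadata item with no name.
SAMPLE = """
<Section>
  <Description>
    <Metadata name="Title">A short book</Metadata>
  </Description>
  <Content></Content>
  <Section>
    <Description>
      <Metadata>missing name</Metadata>
    </Description>
    <Content>&lt;p&gt;Some text&lt;/p&gt;</Content>
  </Section>
</Section>
""".strip()

problems = check_section(ET.fromstring(SAMPLE))
for problem in problems:
    print(problem)
```

Note that the empty Content element in the outer Section is accepted, matching the rule that the Content part may be empty.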

Consider a simple document in this format: a short book with two associated images. The book has two sections called 'Preface' and 'First and only chapter', the second of which has two subsections. Note that there is no notion of a "chapter" as such: it is represented simply as a top-level section.

The <Section> tag denotes the start of each document section, and the corresponding </Section> closing tag marks the end of that section. Following each <Section> tag is a <Description> element, which contains any number of <Metadata> elements. Thus different metadata can be associated with individual sections of a document. Most of these elements hold particular metadata types, such as Title, identified by the name attribute.
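As a sketch of how section-level metadata can be read, the following builds a hypothetical document shaped like the short book described above and prints an outline from each section's Title metadata. The book title, subsection titles, and content text are invented placeholders, and wrapping the whole book in an outer Section is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; all titles and text other than 'Preface' and
# 'First and only chapter' are placeholders.
SAMPLE = """
<Section>
  <Description>
    <Metadata name="Title">A short book</Metadata>
  </Description>
  <Content></Content>
  <Section>
    <Description>
      <Metadata name="Title">Preface</Metadata>
    </Description>
    <Content>&lt;p&gt;Preface text.&lt;/p&gt;</Content>
  </Section>
  <Section>
    <Description>
      <Metadata name="Title">First and only chapter</Metadata>
    </Description>
    <Content>&lt;p&gt;Chapter introduction.&lt;/p&gt;</Content>
    <Section>
      <Description>
        <Metadata name="Title">First subsection</Metadata>
      </Description>
      <Content>&lt;p&gt;Text of the first subsection.&lt;/p&gt;</Content>
    </Section>
    <Section>
      <Description>
        <Metadata name="Title">Second subsection</Metadata>
      </Description>
      <Content>&lt;p&gt;Text of the second subsection.&lt;/p&gt;</Content>
    </Section>
  </Section>
</Section>
""".strip()


def section_metadata(section):
    """Return (name, value) pairs from this section's own Description."""
    return [(m.get("name"), m.text or "")
            for m in section.findall("Description/Metadata")]


def outline(section, depth=0):
    """Return one indented line per section, using its Title if present."""
    title = dict(section_metadata(section)).get("Title", "(untitled)")
    lines = ["  " * depth + title]
    for child in section.findall("Section"):
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE)
print("\n".join(outline(root)))
```

Because `findall("Description/Metadata")` only looks at a section's own Description, metadata attached to a subsection is never mixed up with that of its parent.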

Some metadata elements are special to Greenstone:

Identifier: the Greenstone identifier for the document. Must be unique within the collection.
gsdlsourcefilename: the original file from which the archive file was generated (path relative to the import directory).
gsdldoctype: generally set to indexed_doc.
gsdlassocfile: one or more associated files that belong to the document. These are copied into the index directory during collection building. They may include a cover image, images that are linked to by an HTML page, or the original file for a Word or PDF document.
assocfilepath: the subdirectory of index/assoc in which this document's associated files are stored.
hascover: set to 1 if the document has a cover image, which must be named cover.jpg.
Source: the original file name.
srclink: a link to the original source file, for file types that have been converted (such as Word or PDF) or binary file types (such as MP3 or images).
srcicon: an appropriate icon for the source file type.
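The special metadata names listed above can be picked out of a document's top-level Description with a few lines of Python. This is a sketch only: the function name is hypothetical, every value in the sample is a made-up placeholder, and the real value format of fields such as gsdlassocfile may carry more information than a bare file name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Names taken from the list above.
SPECIAL_NAMES = {
    "Identifier", "gsdlsourcefilename", "gsdldoctype", "gsdlassocfile",
    "assocfilepath", "hascover", "Source", "srclink", "srcicon",
}

# Placeholder values for illustration; not copied from a real collection.
SAMPLE = """
<Section>
  <Description>
    <Metadata name="Identifier">EXAMPLE-ID-1</Metadata>
    <Metadata name="gsdlsourcefilename">books/short.html</Metadata>
    <Metadata name="gsdldoctype">indexed_doc</Metadata>
    <Metadata name="gsdlassocfile">cover.jpg</Metadata>
    <Metadata name="gsdlassocfile">figure1.png</Metadata>
    <Metadata name="hascover">1</Metadata>
    <Metadata name="Title">A short book</Metadata>
  </Description>
  <Content></Content>
</Section>
""".strip()


def special_metadata(section):
    """Group the Greenstone-specific metadata of one section by name.

    Values are kept in lists because a name such as gsdlassocfile
    can occur more than once.
    """
    found = defaultdict(list)
    for item in section.findall("Description/Metadata"):
        name = item.get("name")
        if name in SPECIAL_NAMES:
            found[name].append(item.text or "")
    return dict(found)


meta = special_metadata(ET.fromstring(SAMPLE))
has_cover = meta.get("hascover") == ["1"]
print(meta)
print("has cover:", has_cover)
```

Ordinary metadata such as Title is deliberately left out of the result, so the dictionary holds only the fields that Greenstone treats specially.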
old/greenstone_archive_format.txt · Last modified: 2023/03/13 01:46 by 127.0.0.1