Greenstone 3 OAI server

The Greenstone3 OAI data provider facility comes enabled by default. It runs as a servlet called "oaiserver", and can be accessed using the same URL as the library, by replacing library with oaiserver. For example, http://localhost:8080/greenstone3/oaiserver.

You can see a demonstration OAI server at http://www.greenstone.org/greenstone3/oaiserver?verb=Identify.
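
As a rough illustration, an OAI request is just the servlet URL plus a verb argument. The Python sketch below builds such URLs. The helper name oai_request_url is made up for this example, and the host and port are taken from the example above, so they will probably differ on your installation.

```python
from urllib.parse import urlencode

# Base URL from the example above; adjust host and port for your server.
BASE_URL = "http://localhost:8080/greenstone3/oaiserver"


def oai_request_url(base_url, verb, **params):
    """Return an OAI request URL for the given verb and optional arguments."""
    query = {"verb": verb}
    # Leave out any argument that was not supplied.
    query.update({k: v for k, v in params.items() if v is not None})
    return base_url + "?" + urlencode(query)


print(oai_request_url(BASE_URL, "Identify"))
# http://localhost:8080/greenstone3/oaiserver?verb=Identify
print(oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc"))
# http://localhost:8080/greenstone3/oaiserver?verb=ListRecords&metadataPrefix=oai_dc
```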

Configuration

Configuration is done via two files: OAIConfig.xml for repository-wide configuration, and collectionConfig.xml for collection-specific configuration.

OAIConfig.xml

This file specifies general information about the repository and can be found in greenstone3/resources/oai. Please edit the file here. When the server starts up, this file will be copied to greenstone3/web/WEB-INF/classes/.

Please modify this file and enter the correct values for repositoryName and repositoryIdentifier. Other values may be modified as needed. The following table lists important configuration options in OAIConfig.xml.

Configurations in OAIConfig.xml
<repositoryName>repository-name</repositoryName>
  The human-readable name of this OAI repository.

<repositoryIdentifier>repository-identifier</repositoryIdentifier>
  The unique identifier of this OAI repository. If using OAI 2.0, this should be the same as your domain name.

<baseURL>your-web-server-domain-name/greenstone3/oaiserver</baseURL>
  The base URL used to access this repository.

<protocolVersion>2.0</protocolVersion>
  The version of the OAI specification this repository supports. The Greenstone 3 OAI server supports both versions 1.1 and 2.0. The OAI organization discontinued registration support for version 1.1 of the protocol on 1 September 2002, but some servers may still use it (for example, the http://rocky.dlib.vt.edu/~jcdlpix/cgi-bin/OAI/jcdlpix.pl OAI server used in the Greenstone tutorial exercises).

<deletedRecord>no</deletedRecord>
  The manner in which the repository supports the notion of deleted records.

<granularity>yyyy-MM-ddTHH:mm:ssZ</granularity>
  The granularity of the datestamp. The meaning of the string is defined in ISO 8601. The only other legitimate value, which is coarser, is YYYY-MM-DD.

<adminEmail>maintainer-email-address</adminEmail>
  The email address of the repository maintainer. There can be more than one address, with one adminEmail element for each.

<oaiInfo><metadata name="meta-name">meta-value</metadata>…</oaiInfo>
  Metadata describing the repository. Any user-defined metadata can go here.

<oaiSuperSet>
  In the Greenstone OAI server, each collection is presented as an OAI set. You can use oaiSuperSet to group several collections together and present them as a single set. See "OAI super sets" below.

<useOAIStylesheet>yes</useOAIStylesheet>
  Specifies a stylesheet for the response, which gives a nicer view of the XML when a response is viewed in a browser. Set to 'no' if you don't want a stylesheet specified.

<OAIStylesheet>url</OAIStylesheet>
  Set the URL here if you want to use a stylesheet other than the default one.

<earliestDatestamp>2001-06-24T18:09:47-05:00Z</earliestDatestamp>
  The Identify response includes earliestDatestamp, which is the earliest datestamp that is valid for the repository. It is generally generated by looking at the earliest datestamp of each collection. If for some reason the collections don't have valid datestamps, this value from the config file is used.

<resumeAfter>250</resumeAfter>
  This value determines whether, and how, list responses are split into parts. In OAI, the commands ListSets, ListRecords, and ListIdentifiers are collectively called list requests. These lists can be large, and it may be practical to partition them across a series of requests and responses. This value sets how many sets/identifiers/records are sent for a request before a resumption token is issued. A value less than 0 (e.g. -1) indicates that the complete list is returned in a single response. See the OAI specification for how flow control is accomplished using resumption tokens.

<resumptionTokenExpiration>7200</resumptionTokenExpiration>
  How long a newly generated resumption token remains valid, in seconds. The default value 7200 is equivalent to 2 hours. This property only matters when resumption tokens are issued: if resumeAfter is negative (any value less than 0), no token is issued.

<ListMetadataFormats>
  A list of metadata formats supported by this repository. Since the Dublin Core metadata format is mandatory according to the OAI specification, there must be a metadataFormat element with the oai_dc prefix specified here, along with the metadata name mappings if necessary. See "Metadata Formats" below for more information.
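
To show what resumeAfter means for a harvester, here is a minimal Python sketch of a client that keeps requesting until the server stops issuing resumption tokens. It is not part of Greenstone. The function name harvest_identifiers and the fetch parameter are assumptions made for this example; fetch stands for any callable that takes a URL and returns the response body as a string.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_identifiers(base_url, fetch, metadata_prefix="oai_dc"):
    """Yield every record identifier, following resumption tokens.

    `fetch` is a hypothetical callable: it takes a URL and returns the
    response body as a string.
    """
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        for header in root.iter(OAI_NS + "header"):
            yield header.findtext(OAI_NS + "identifier")
        token = root.find(".//" + OAI_NS + "resumptionToken")
        # No token element, or an empty one, means the list is complete.
        if token is None or not (token.text or "").strip():
            break
        # A resumption token is an exclusive argument: send it with the verb only.
        params = {"verb": "ListIdentifiers", "resumptionToken": token.text.strip()}
```

In practice fetch could be a thin wrapper around urllib.request.urlopen. With resumeAfter set to a negative value the loop above makes a single request, because the server never issues a token.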

collectionConfig.xml

This file resides in the etc directory of each collection. A serviceRackList contains services that are not defined by the collection building process (services that are defined by the build process end up in the buildConfig.xml file). The OAIPMH ServiceRack enables the collection to be served by the OAI server. It contains information about which metadata formats the collection supports. A mapping list may be provided for each format, mapping Greenstone metadata fields into the fields available for the format. If the collection is part of a super set, that information is added here too.

Here is a sample OAIPMH ServiceRack element.

    <serviceRack name="OAIPMH">
      <setName>Lucene demo collection</setName>
      <setDescription>A demo collection for greenstone</setDescription>
      <!-- States that this collection is part of the humanity super set, which needs to be defined
           in the OAIConfig.xml file. -->
      <oaiSuperSet name="humanity"/>
      <ListMetadataFormats>
        <!-- This collection supports the DC metadata set. -->
        <metadataFormat metadataPrefix="oai_dc">
          <!-- a custom mapping as this collection doesn't have exclusive dc metadata -->
          <!-- this will replace the dc:publisher element from the main set -->
          <element name="dc:publisher">
            <mapping elements="dls.Organization"/>
          </element>
        </metadataFormat>
      </ListMetadataFormats>
    </serviceRack>

OAI super sets

In the Greenstone OAI server, each collection is presented as an OAI set. You can use the oaiSuperSet to group several collections together to be presented as a single set.

The format for a super set specification is like the following:

  <oaiSuperSet>
    <setSpec>oai set identifier</setSpec>
    <setName>Human readable set name</setName>
    <setDescription>Set description</setDescription>
  </oaiSuperSet>

More than one super set can be specified in OAIConfig.xml. Each collection states which super set it belongs to, using the following format, where xxx must match the setSpec of the super set the collection belongs to. This line must be added to the OAIPMH serviceRack element in the collectionConfig.xml file.

<oaiSuperSet name="xxx"/>
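
Because the name in a collection must match a setSpec in OAIConfig.xml, a small consistency check can catch typos. The Python sketch below is only an illustration and not a Greenstone tool. The function name undefined_super_sets is invented for this example, and it assumes the oaiSuperSet elements are not in an XML namespace.

```python
import xml.etree.ElementTree as ET


def undefined_super_sets(oai_config_xml, collection_config_xml):
    """Return super set names a collection uses that OAIConfig.xml does not define.

    Both arguments are the file contents as strings.
    """
    # In OAIConfig.xml each oaiSuperSet carries its identifier in a setSpec child.
    defined = {
        (s.findtext("setSpec") or "").strip()
        for s in ET.fromstring(oai_config_xml).iter("oaiSuperSet")
    }
    # In collectionConfig.xml the super set is referenced by a name attribute.
    used = {
        e.get("name")
        for e in ET.fromstring(collection_config_xml).iter("oaiSuperSet")
        if e.get("name")
    }
    return sorted(used - defined)
```

An empty list means every super set the collection refers to is defined.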

Metadata Formats

A repository must support the Dublin Core format, and may support others. Each format must be listed in the ListMetadataFormats element. Currently only the Dublin Core format is defined. It looks like this:

  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
      <elementList>
        <element name="dc:title"><mapping select="firstvalidmetadata" elements="dc.Title,Title"/></element>
        <element name="dc:creator"><mapping elements="dc.Creator"/></element>
        <element name="dc:subject"><mapping elements="dc.Subject"/></element>
        <element name="dc:description"><mapping elements="dc.Description"/></element>
        <element name="dc:publisher"><mapping elements="dc.Publisher"/></element>
        <element name="dc:contributor"><mapping elements="dc.Contributor"/></element>
        <element name="dc:date"><mapping elements="dc.Date"/></element>
        <element name="dc:type"><mapping elements="dc.Type"/></element>
        <element name="dc:format"><mapping elements="dc.Format"/></element>
        <element name="dc:identifier"><mapping elements="dc.Identifier,Identifier" select="firstvalue"/></element>
        <!-- the following mapping is available from July 2018 -->
        <element name="dc:identifier"><mapping elements="gs.OAIResourceURL,gsflink.source,gsflink.document" select="allvalues"/></element>
        <element name="dc:source"><mapping elements="dc.Source"/></element>
        <element name="dc:language"><mapping elements="dc.Language"/></element>
        <element name="dc:relation"><mapping elements="dc.Relation"/></element>
        <element name="dc:coverage"><mapping elements="dc.Coverage"/></element>
        <element name="dc:rights"><mapping elements="dc.Rights"/></element>
      </elementList>
    </metadataFormat>
  </ListMetadataFormats>

The elementList lists all elements for the metadata format. Mapping rules dictate which elements in the Greenstone collection are output for these OAI elements.

The OAIConfig.xml file defines the metadata formats supported by the repository, and each collection must specify which of these formats it supports. The format is as follows:

  <ListMetadataFormats>
    <metadataFormat metadataPrefix="oai_dc">
      <!-- a custom mapping as this collection doesn't have exclusive dc metadata -->
      <!-- this will replace the dc:publisher element from the main set -->
      <element name="dc:publisher">
        <mapping elements="dls.Organization"/>
      </element>
    </metadataFormat>
  </ListMetadataFormats>

The collection can specify a custom mapping.

Field mapping is done at two levels: globally for all collections in a repository, and specifically for each collection. For a particular collection, the mapping specification in collectionConfig.xml takes precedence over that in OAIConfig.xml. For each element, the server first looks for a mapping in the collection's collectionConfig.xml; if none is found there, it uses the mapping in OAIConfig.xml; if none is specified there either, the standard Dublin Core field names are used to retrieve the collection's metadata.
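
The lookup order described above can be sketched in Python as follows. This only illustrates the precedence and is not Greenstone's implementation. The function name and the dictionary representation are invented for the example, and the final fallback (returning the OAI name unchanged) is an assumption about what "standard Dublin Core field names" means in practice.

```python
def resolve_mapping(oai_name, collection_mappings, global_mappings):
    """Return the Greenstone field names to use for one OAI element.

    Both mapping arguments are dicts from an OAI element name (e.g. "dc:title")
    to a list of Greenstone metadata names (a hypothetical representation of
    the <mapping elements="..."/> attributes).
    """
    # 1. A mapping in the collection's collectionConfig.xml wins.
    if oai_name in collection_mappings:
        return collection_mappings[oai_name]
    # 2. Otherwise fall back to the repository-wide OAIConfig.xml.
    if oai_name in global_mappings:
        return global_mappings[oai_name]
    # 3. Otherwise use the element name itself (assumed fallback).
    return [oai_name]


global_mappings = {"dc:title": ["dc.Title", "Title"], "dc:publisher": ["dc.Publisher"]}
collection_mappings = {"dc:publisher": ["dls.Organization"]}

print(resolve_mapping("dc:publisher", collection_mappings, global_mappings))
# ['dls.Organization']
print(resolve_mapping("dc:title", collection_mappings, global_mappings))
# ['dc.Title', 'Title']
```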

Mappings take the following format:

  <element name="oai-name"/>
  <element name="oai-name"><mapping select="allvalues|firstvalue|firstvalidmetadata" 
    elements="comma-separated-list-of-gs-metadata"/></element>

In the first case, the server will look for 'oai-name' metadata in the collection. No mapping will be done.
In the second case, the server will look for any metadata in the elements list and map it to oai-name in the output. The select attribute determines how many values are output. The default setting is 'allvalues'.

  • allvalues: will display all values of each metadata element
  • firstvalue: will go through each metadata element until it finds a value, and will return only one value.
  • firstvalidmetadata: will go through each element until it finds one that has a value, then output all values of that element.

Some examples:

<element name="dc:title"/>

The server will look for dc:title metadata and output it if found. Note that standard Greenstone metadata uses '.' for namespaces, not ':', so this will not find anything.

<element name="dc:title"><mapping select="firstvalidmetadata" elements="dc.Title,Title"/></element>

This will output all dc.Titles as dc:title metadata. If no dc.Title is found, then any Titles will be output.

<element name="dc:date"><mapping select="allvalues" elements="dc.Date,gs.Date"/></element>

This will output all dc.Dates and gs.Dates as dc:date.
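
The three select settings can be mimicked with a short Python function. This is a sketch of the behaviour as described above, not the server's actual code. The name select_values and the dictionary layout of the document metadata are assumptions for the example.

```python
def select_values(doc_metadata, elements, select="allvalues"):
    """Return the values that would be output for one OAI element.

    doc_metadata maps a Greenstone metadata name to its list of values;
    elements is the list of names from the mapping's elements attribute.
    """
    if select not in ("allvalues", "firstvalue", "firstvalidmetadata"):
        raise ValueError("unknown select value: " + select)
    if select == "allvalues":
        # Every value of every listed element, in order.
        return [value for name in elements for value in doc_metadata.get(name, [])]
    for name in elements:
        values = doc_metadata.get(name, [])
        if values:
            # firstvalue: one value only; firstvalidmetadata: all values of
            # the first element that has any.
            return values[:1] if select == "firstvalue" else list(values)
    return []


doc = {"dc.Title": ["A", "B"], "Title": ["C"]}
print(select_values(doc, ["dc.Title", "Title"], "allvalues"))           # ['A', 'B', 'C']
print(select_values(doc, ["dc.Title", "Title"], "firstvalue"))          # ['A']
print(select_values(doc, ["dc.Title", "Title"], "firstvalidmetadata"))  # ['A', 'B']
```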

OAI Identifiers

Prior to July 2018, an extra oai_dc:identifier element was automatically added to the metadata list, containing a link to the document. The values used would be chosen from the following list, using the first available value:

  • gs.OAIResourceURL - if this metadata was set for a document, this would be used as the identifier URL. This enables you to link to the document outside of Greenstone.
  • a link to the source document, e.g. PDF or Word files. (<gsf:link type="source">)
  • a link to the Greenstone version of the document. (<gsf:link>)

From July 2018, this is instead specified in OAIConfig.xml in the same way as other metadata mappings. gsflink.source and gsflink.document are keywords specifying the source URL and the Greenstone document URL, respectively.

<element name="dc:identifier"><mapping elements="gs.OAIResourceURL,gsflink.source,gsflink.document" select="allvalues"/></element>

This way, users can customize further which identifiers they want - they can use all of them, as the above mapping specifies.

Disabling Collections

By default, in Greenstone3, all collections are enabled for the OAI server, and each collection is mapped into an OAI set. A new collection in Greenstone contains the OAIPMH ServiceRack in its collectionConfig.xml file. To disable the collection in the oaiserver, comment out this ServiceRack. Note, this has to be done by hand, as GLI has not been set up to modify this part of the collectionConfig.xml file. The collectionConfig.xml file lives in greenstone3/web/sites/localsite/collect/<colname>/etc/collectionConfig.xml. Make sure the collection is not open in GLI when you are editing this file by hand.

Disabling the OAI Server

If you do not want to provide an OAI server alongside your Greenstone3 library you'll need to remove the servlet information from the Greenstone3 web.xml. Make sure Tomcat is not running (by closing the Greenstone3 server program, or running ant stop in the greenstone3 folder on the command line). Open up greenstone3/web/WEB-INF/web.xml. There are two parts you'll need to remove, or comment out. The first one is the servlet specification for the oaiserver. It looks like this:

        <servlet>
                <servlet-name>oaiserver</servlet-name>
                <description>an oai servlet</description>
                <servlet-class>org.greenstone.gsdl3.OAIServer</servlet-class>
                <init-param>
                        <param-name>default_lang</param-name>
                        <param-value>en</param-value>
                </init-param>
                <init-param>
                        <param-name>site_name</param-name>
                        <param-value>localsite</param-value>
                </init-param>
        </servlet>

The second part to remove or comment out is the servlet mapping, which maps the url to the servlet. It looks like this:

        <servlet-mapping>
                <servlet-name>oaiserver</servlet-name>
                <url-pattern>/oaiserver</url-pattern>
        </servlet-mapping>

Once these two sections are gone or commented out, Tomcat will no longer provide the oaiserver. You can add them back in if you wish to reinstate it later.

OAI Datestamps

The "datestamp" tag for a record comes from the "oailastmodified" metadata that is added automatically by Greenstone when you build a collection. This value is obtained from the operating system, and is usually the last time the file was edited. However, if the file has been copied (for example if you used the GLI to add the file into your collection) then the oailastmodified value will probably be the time the file was copied.

To manually set the OAI datestamp for a document, add gs.OAIDateStamp metadata. This must be in the form YYYY-MM-DD. This will be used instead of oailastmodified if it exists.
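
The rule above (a manual gs.OAIDateStamp overrides oailastmodified, and must be in YYYY-MM-DD form) can be sketched in Python as below. The function names and the flat dictionary of metadata are assumptions for this example, and the format of the oailastmodified value is deliberately treated as opaque.

```python
import re
from datetime import datetime


def is_valid_manual_datestamp(value):
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    # Require zero-padded fields before checking the date itself.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def effective_datestamp(doc_metadata):
    """Pick the datestamp source: gs.OAIDateStamp if set, else oailastmodified."""
    manual = doc_metadata.get("gs.OAIDateStamp")
    return manual if manual else doc_metadata.get("oailastmodified")


print(is_valid_manual_datestamp("2018-07-01"))  # True
print(is_valid_manual_datestamp("2018-7-1"))    # False
```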

Resetting the server

If you have rebuilt collections, you need to reset the server. You can do this either by restarting it (stop the server and run gs3-server.sh/bat again, or run 'ant restart') or by using the reset command: http://localhost:8383/greenstone3/oaiserver?reset. This makes the server reload all the collection information. (Make sure you use the correct host name and port number.)

Testing

Once you have your OAI service in place, testing can be done via online validation facilities such as the following: http://www.openarchives.org/data/registerasprovider.html or http://re.cs.uct.ac.za/.

The former only verifies the Identify command, while extensive testing can be performed via the latter (called Repository Explorer).

The Greenstone OAI server must be publicly accessible over the Internet to use these validation tools.
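
Before using an online validator, a quick local check of the Identify response can catch missing configuration values. The Python sketch below lists which elements required by OAI-PMH 2.0 are absent or empty in a response. The function name missing_identify_fields is made up for this example, and obtaining the response text (for instance with urllib.request.urlopen) is left to the caller.

```python
import xml.etree.ElementTree as ET

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Elements that OAI-PMH 2.0 requires in an Identify response.
REQUIRED = (
    "repositoryName",
    "baseURL",
    "protocolVersion",
    "adminEmail",
    "earliestDatestamp",
    "deletedRecord",
    "granularity",
)


def missing_identify_fields(response_xml):
    """Return the required Identify elements that are absent or empty."""
    identify = ET.fromstring(response_xml).find(OAI_NS + "Identify")
    if identify is None:
        return list(REQUIRED)
    return [
        name
        for name in REQUIRED
        if not (identify.findtext(OAI_NS + name) or "").strip()
    ]
```

An empty list means all required elements are present. It does not replace the fuller checks done by the validation services above.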