
This is an old revision of the document!


OAI

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) allows for interoperability between document repositories.

  • Data Providers expose their metadata using OAI-PMH.
  • Service Providers, or harvesters, access this metadata by making service requests.

Greenstone allows you to both harvest metadata provided by others and make the metadata for your own collections accessible via OAI-PMH.

Harvesting OAI records using Greenstone

Downloading records from an OAI repository

Greenstone can download records from an OAI repository and build them into a collection. Downloading can be done either from the Download panel of the GLI or from the command line. The following options are available for downloading via OAI:

Option | Description
Source URL (-url <string>) | (REQUIRED) OAI repository URL
Metadata prefix (-metadata_prefix <string>) | The metadata format used in the exported metadata, e.g. oai_dc, qdc, etc. (Default: oai_dc)
Restrict to set (-set <string>) | Restrict the download to the specified set in the repository
Get document (-get_doc) | Download the source document if one is specified in the record
Only include file types (-get_doc_exts <string>) | Permissible filename extensions of documents to get (Default: doc,pdf,ppt)
Max records (-max_records <int>) | Maximum number of records to download. If not specified, all records are downloaded.

In the GLI, clicking Server Information will cause the following request to be sent to the OAI data provider specified by the Source URL argument:

 <url>?verb=Identify

The response is shown in a popup window. You can use the returned server information to help fill out the arguments, for example, the set name and metadata prefix.
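The request above can also be built and inspected outside the GLI. The following is a minimal Python sketch, not part of Greenstone: the function names and SAMPLE_RESPONSE are hypothetical, and the sample is a hand-written Identify response that follows the OAI-PMH 2.0 namespace rather than output captured from a real server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical, hand-written response used only for illustration.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2014-11-03T00:00:00Z</responseDate>
  <request verb="Identify">http://localhost:8080/greenstone3/oaiserver</request>
  <Identify>
    <repositoryName>Example repository</repositoryName>
    <baseURL>http://localhost:8080/greenstone3/oaiserver</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>admin@example.org</adminEmail>
    <earliestDatestamp>2001-06-24</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
  </Identify>
</OAI-PMH>
""".strip()


def identify_url(base_url):
    """Build the <url>?verb=Identify request for a repository base URL."""
    return base_url + "?" + urlencode({"verb": "Identify"})


def summarize_identify(xml_text):
    """Return the child elements of <Identify> as a name -> text dictionary.

    For repeated elements (such as adminEmail) only the first value is kept.
    """
    root = ET.fromstring(xml_text)
    identify = root.find(OAI_NS + "Identify")
    if identify is None:
        return {}
    summary = {}
    for child in identify:
        name = child.tag.replace(OAI_NS, "")
        summary.setdefault(name, child.text.strip() if child.text else "")
    return summary
```

To query a live server, the URL returned by identify_url could be fetched with urllib.request.urlopen and the response body passed to summarize_identify.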

If you are using the GLI, you can view the downloaded files on the Gather panel. On the left-hand side of the panel, double-click the Downloaded Files folder to expand its content. The subfolders are named by the OAI server URL. At the lowest level of each subfolder are the metadata files, which are organized by the specified set name. These metadata files are physically stored in a temporary cache directory. You can build a collection from these downloaded metadata files using the OAIPlugin.

Downloading source documents

If you select the Get document option, Greenstone checks the value of dc.identifier. If it is a URL (starts with http, https, or ftp), Greenstone checks whether the file extension matches one of those listed in the Only include file types argument. If so, the file is downloaded. If the extension does not match and the URL points to an HTML file, Greenstone downloads the page, scans through it for hrefs that match the specified file extensions, and downloads those.
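The decision steps above can be sketched in Python as follows. This is an illustration of the described behaviour, not Greenstone's own code: the function names are hypothetical, and the sketch assumes the caller has already fetched the page and passes its HTML in as page_html, since the text does not say how Greenstone detects that a URL is an HTML file.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Default of the "Only include file types" option.
DEFAULT_EXTS = ("doc", "pdf", "ppt")


def extension_of(url):
    """Return the lower-cased filename extension of a URL path, without the dot."""
    return posixpath.splitext(urlparse(url).path)[1].lstrip(".").lower()


class HrefCollector(HTMLParser):
    """Collect the value of every href attribute found in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def plan_download(identifier, exts=DEFAULT_EXTS, page_html=None):
    """Return the list of URLs that would be downloaded for a dc.identifier value."""
    # Step 1: only http, https and ftp URLs are considered.
    if urlparse(identifier).scheme not in ("http", "https", "ftp"):
        return []
    # Step 2: a matching extension means the file itself is downloaded.
    if extension_of(identifier) in exts:
        return [identifier]
    # Step 3: otherwise, if HTML was supplied, scan it for matching links.
    if page_html is None:
        return []
    collector = HrefCollector()
    collector.feed(page_html)
    links = [urljoin(identifier, href) for href in collector.hrefs]
    return [link for link in links if extension_of(link) in exts]
```

Relative links are resolved against the record's URL with urljoin, which is a design choice of this sketch rather than documented Greenstone behaviour.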

Serving OAI Data using Greenstone

Greenstone comes with a built-in OAI data provider, called oaiserver. A configuration file provides options for setting up the server. Collections can opt in or out of the server, and each collection will be advertised as an OAI set. Multiple collections can be grouped into a single OAI set using Greenstone's OAI super set mechanism.

<TABAREA tabs="Greenstone3,Greenstone2"> <TAB>

The Greenstone3 OAI server

The Greenstone3 OAI data provider facility comes enabled by default. It runs as a servlet called "oaiserver", and can be accessed using the same URL as the library, by replacing library with oaiserver. For example, http://localhost:8080/greenstone3/oaiserver. You can see a demonstration OAI server at http://www.greenstone.org/greenstone3/oaiserver?verb=Identify.

Configuration

Configuration is done via two files: OAIConfig.xml for repository-wide configuration, and collectionConfig.xml for collection-specific configuration.

OAIConfig.xml

This file specifies general information about the repository and can be found in greenstone3/resources/oai. Please edit the file here. When the server starts up, this file will be copied to greenstone3/web/WEB-INF/classes/.

Please modify this file and enter the correct values for repositoryName and repositoryIdentifier. Other values may be modified as needed. The following table lists important configuration options in OAIConfig.xml.

Configurations in OAIConfig.xml

Element | Description
<repositoryName>repository-name</repositoryName> | The human-readable name of this OAI repository.
<repositoryIdentifier>repository-identifier</repositoryIdentifier> | The unique id of this OAI repository. If using OAI 2.0, this should be the same as your domain name.
<baseURL>your-web-server-domain-name/greenstone3/oaiserver</baseURL> | The base URL to access this repository.
<protocolVersion>2.0</protocolVersion> | The version of the OAI specification this repository supports. The Greenstone 3 OAI server supports both version 1.1 and 2.0. The OAI organization discontinued registration support for version 1.1 of the protocol on 1 September 2002, but some repositories may still be using it (for example, the http://rocky.dlib.vt.edu/~jcdlpix/cgi-bin/OAI/jcdlpix.pl OAI server used in the Greenstone tutorial exercises).
<deletedRecord>no</deletedRecord> | The manner in which the repository supports the notion of deleted records.
<granularity>yyyy-MM-ddTHH:mm:ssZ</granularity> | The granularity of the datestamp. The meaning of the string is defined in the ISO 8601 specification. The only other legitimate value, which is coarser than this one, is YYYY-MM-DD.
<adminEmail>maintainer-email-address</adminEmail> | The repository maintainer's email address. There can be more than one email address, with one element for each.
<oaiInfo><metadata name="meta-name">meta-value</metadata>…</oaiInfo> | Metadata describing the repository. Any user-defined metadata can go here.
<oaiSuperSet> | In the Greenstone OAI server, each collection is presented as an OAI set. You can use oaiSuperSet to group several collections together to be presented as a single set. See the OAI super sets section below.
<useOAIStylesheet>yes</useOAIStylesheet> | A stylesheet will be specified for the result, which gives a nicer view of the XML when a response is viewed in a browser. Set to 'no' if you don't want the stylesheet specified.
<OAIStylesheet>url</OAIStylesheet> | Set the URL here if you want to use a different stylesheet from the default one.
<earliestDatestamp>2001-06-24T18:09:47-05:00Z</earliestDatestamp> | The Identify response includes earliestDatestamp, which is the earliest datestamp that is valid for the repository. Generally it is generated by looking at the earliestDatestamp of each collection. If for some reason the collections don't have valid datestamps, this value from the config file is used.
<resumeAfter>250</resumeAfter> | This value determines whether, and after how many items, list responses are split up. In OAI, the commands ListSets, ListRecords, and ListIdentifiers are collectively called list requests. In some cases these lists may be large, and it may be practical to partition them among a series of requests and responses. This value decides how many sets/identifiers/records to send for a request before issuing a resumption token. A value less than 0 (e.g. -1) indicates that a complete list of items will be returned. See the OAI specification for how flow control is accomplished using resumption tokens.
<resumptionTokenExpiration>7200</resumptionTokenExpiration> | The time period, in seconds, for which a newly generated resumption token remains valid. The default value of 7200 is equivalent to 2 hours. The use of this property depends on the value of resumeAfter: if resumeAfter is negative (any value less than 0), no token is issued.
<ListMetadataFormats> | A list of metadata formats supported by this repository. Since the Dublin Core metadata format is mandatory according to the OAI specification, there must be a metadataFormat element with the oai_dc prefix specified here, along with the metadata name mappings if necessary. See the Metadata Formats section below for more information.
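To illustrate how resumeAfter and resumptionTokenExpiration interact, here is a minimal Python sketch of the flow control described in the table. It is not Greenstone's implementation: the function names are hypothetical, a plain list index stands in for the resumption token, and a value of 0 is rejected because the text only defines positive and negative values.

```python
def next_batch(items, resume_after, cursor=0):
    """Return (batch, next_cursor) for one list response.

    next_cursor is None when the list is complete, meaning that no
    resumption token would be issued.
    """
    if resume_after == 0:
        # Not covered by the text; rejected in this sketch (assumption).
        raise ValueError("resume_after must be positive or negative")
    if resume_after < 0:
        # A negative value means the complete list is returned at once.
        return items[cursor:], None
    end = cursor + resume_after
    next_cursor = end if end < len(items) else None
    return items[cursor:end], next_cursor


def token_is_valid(issued_at, now, expiration_seconds=7200):
    """A token remains valid for expiration_seconds after it was issued."""
    return (now - issued_at) <= expiration_seconds


def batch_sizes(total, resume_after):
    """Simulate a harvester following tokens until the list is complete."""
    items = list(range(total))
    sizes, cursor = [], 0
    while cursor is not None:
        batch, cursor = next_batch(items, resume_after, cursor)
        sizes.append(len(batch))
    return sizes
```

With the default resumeAfter of 250, batch_sizes(600, 250) gives [250, 250, 100], while batch_sizes(600, -1) gives [600] because no token is issued.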

collectionConfig.xml

This file resides in the etc directory of each collection. Its serviceRackList contains services that are not defined by the collection building process (services that are defined by the building process end up in the buildConfig.xml file instead). The OAIPMH ServiceRack enables the collection to be served by the OAI server. It contains information about which metadata formats the collection supports. A mapping list may be provided for each format, mapping Greenstone metadata fields into the fields available for the format. If the collection is part of a super set, that information is added here too.

Here is a sample OAIPMH ServiceRack element.

    <serviceRack name="OAIPMH">
      <setName>Lucene demo collection</setName>
      <setDescription>A demo collection for greenstone</setDescription>
      <!-- States that this collection is part of the humanity super set, which needs to be defined
           in the OAIConfig.xml file. -->
      <oaiSuperSet name="humanity"/>
      <ListMetadataFormats>
        <!-- This collection supports the DC metadata set. -->
        <metadataFormat metadataPrefix="oai_dc">
          <!-- a custom mapping as this collection doesn't have exclusive dc metadata -->
          <!-- this will replace the dc:publisher element from the main set -->
          <element name="dc:publisher">
            <mapping elements="dls.Organization"/>
          </element>
        </metadataFormat>
      </ListMetadataFormats>
    </serviceRack>

OAI super sets

In the Greenstone OAI server, each collection is presented as an OAI set. You can use the oaiSuperSet to group several collections together to be presented as a single set.

The format for a super set specification is like the following:

  <oaiSuperSet>
    <setSpec>oai set identifier</setSpec>
    <setName>Human readable set name</setName>
    <setDescription>Set description</setDescription>
  </oaiSuperSet>

There can be more than one super set specified in OAIConfig.xml. Collections themselves state which super set they belong to. The format is the following, where xxx must match the setSpec of the set it belongs to. This line must be added into the OAIPMH serviceRack element in the collectionConfig.xml file.

<oaiSuperSet name="xxx"/>

Metadata Formats

A repository must support the Dublin Core format, and may support others. Each format must be listed in the ListMetadataFormats element. Currently only the Dublin Core format is defined. It looks like this:

  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
      <elementList>
        <element name="dc:title"><mapping select="firstvalidmetadata" elements="dc.Title,Title"/></element>
        <element name="dc:creator"><mapping elements="dc.Creator"/></element>
        <element name="dc:subject"><mapping elements="dc.Subject"/></element>
        <element name="dc:description"><mapping elements="dc.Description"/></element>
        <element name="dc:publisher"><mapping elements="dc.Publisher"/></element>
        <element name="dc:contributor"><mapping elements="dc.Contributor"/></element>
        <element name="dc:date"><mapping elements="dc.Date"/></element>
        <element name="dc:type"><mapping elements="dc.Type"/></element>
        <element name="dc:format"><mapping elements="dc.Format"/></element>
        <element name="dc:identifier"><mapping elements="dc.Identifier,Identifier" select="firstvalue"/></element>
        <element name="dc:source"><mapping elements="dc.Source"/></element>
        <element name="dc:language"><mapping elements="dc.Language"/></element>
        <element name="dc:relation"><mapping elements="dc.Relation"/></element>
        <element name="dc:coverage"><mapping elements="dc.Coverage"/></element>
        <element name="dc:rights"><mapping elements="dc.Rights"/></element>
      </elementList>
    </metadataFormat>
  </ListMetadataFormats>

The elementList lists all elements for the metadata format. Mapping rules dictate which metadata elements in the Greenstone collection are output for these OAI elements.

The OAIConfig.xml file defines the metadata formats supported by the repository, and each collection must specify which of these formats it supports. The format looks like this:

  <ListMetadataFormats>
    <metadataFormat metadataPrefix="oai_dc">
      <!-- a custom mapping as this collection doesn't have exclusive dc metadata -->
      <!-- this will replace the dc:publisher element from the main set -->
      <element name="dc:publisher">
        <mapping elements="dls.Organization"/>
      </element>
    </metadataFormat>
  </ListMetadataFormats>

The collection can specify a custom mapping.

Field mapping is done at two levels: globally for all collections in a repository, and specifically for each collection. For a particular collection, the mapping specification in collectionConfig.xml takes precedence over that in OAIConfig.xml. The server therefore looks for a metadata mapping first in the collection's collectionConfig.xml, then in OAIConfig.xml. If no mapping is specified in either file, the standard Dublin Core field names are used to retrieve the metadata of the collection.
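The lookup order can be sketched in Python as below. This is only an illustration: the function name and the dictionary representation of the two configuration files are hypothetical, and the default case returns None rather than guessing which field names Greenstone falls back to.

```python
def resolve_mapping(oai_element, collection_mappings, repository_mappings):
    """Return (source, mapping) for an OAI element name.

    collection_mappings stands for the mappings in collectionConfig.xml and
    repository_mappings for those in OAIConfig.xml, each given as a
    dictionary from OAI element name to a list of Greenstone field names.
    """
    # 1. The collection-specific mapping takes precedence.
    if oai_element in collection_mappings:
        return ("collection", collection_mappings[oai_element])
    # 2. Otherwise fall back to the repository-wide mapping.
    if oai_element in repository_mappings:
        return ("repository", repository_mappings[oai_element])
    # 3. No mapping anywhere: the standard Dublin Core field names apply.
    return ("default", None)
```

For example, with dc:publisher mapped to dls.Organization in the collection and to dc.Publisher in the repository, the collection's mapping is the one returned.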

Mappings take the following format:

  <element name="oai-name"/>
  <element name="oai-name"><mapping select="allvalues|firstvalue|firstvalidmetadata" 
    elements="comma-separated-list-of-gs-metadata"/></element>

In the first case, the server will look for 'oai-name' metadata in the collection. No mapping will be done.
In the second case, the server will look for any metadata in the elements list and map it to oai-name in the output. The select attribute determines how many values are output. The default setting is 'allvalues'.

  • allvalues: will display all values of each metadata element
  • firstvalue: will go through each metadata element until it finds a value, and will return only one value.
  • firstvalidmetadata: will go through each element until it finds one that has a value, then output all values of that element.

Some examples:

<element name="dc:title"/>

The server will look for dc:title metadata and output it if found. Note that standard Greenstone metadata uses '.' for namespaces, not ':', so this will not find anything.

<element name="dc:title"><mapping select="firstvalidmetadata" elements="dc.Title,Title"/></element>

This will output all dc.Titles as dc:title metadata. If no dc.Title is found, then any Titles will be output.

<element name="dc:date"><mapping select="allvalues" elements="dc.Date,gs.Date"/></element>

This will output all dc.Dates and gs.Dates as dc:date.
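The three select modes can be compared with a short Python sketch. It follows the descriptions above and is not taken from the Greenstone source: apply_mapping is a hypothetical name, and a document's metadata is assumed to be a dictionary from Greenstone field name to a list of values.

```python
def apply_mapping(doc_metadata, elements, select="allvalues"):
    """Return the values output for one OAI element.

    doc_metadata: dictionary of Greenstone field name -> list of values.
    elements: the Greenstone field names from the mapping's elements
    attribute, in order.
    """
    if select == "allvalues":
        # Every value of every listed element.
        return [v for name in elements for v in doc_metadata.get(name, [])]
    if select == "firstvalue":
        # The first value of the first element that has one.
        for name in elements:
            values = doc_metadata.get(name, [])
            if values:
                return values[:1]
        return []
    if select == "firstvalidmetadata":
        # All values of the first element that has any.
        for name in elements:
            values = doc_metadata.get(name, [])
            if values:
                return list(values)
        return []
    raise ValueError("unknown select mode: " + select)
```

For a document with dc.Title values A and B and a Title value C, the dc.Title,Title mapping yields A, B and C with allvalues, only A with firstvalue, and A and B with firstvalidmetadata.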

Disabling Collections

By default, in Greenstone3, all collections are enabled for the OAI server, and each collection is mapped into an OAI set. A new collection in Greenstone contains the OAIPMH ServiceRack in its collectionConfig.xml file. To disable the collection in the oaiserver, comment out this ServiceRack. Note, this has to be done by hand, as GLI has not been set up to modify this part of the collectionConfig.xml file. The collectionConfig.xml file lives in greenstone3/web/sites/localsite/collect/<colname>/etc/collectionConfig.xml. Make sure the collection is not open in GLI when you are editing this file by hand.

Disabling the OAI Server

If you do not want to provide an OAI server alongside your Greenstone3 library, you'll need to remove the servlet information from the Greenstone3 web.xml. Make sure Tomcat is not running (by closing the Greenstone3 server program, or running ant stop in the greenstone3 folder on the command line). Open up greenstone3/web/WEB-INF/web.xml. There are two parts you'll need to remove or comment out. The first one is the servlet specification for the oaiserver. It looks like this:

        <servlet>
                <servlet-name>oaiserver</servlet-name>
                <description>an oai servlet</description>
                <servlet-class>org.greenstone.gsdl3.OAIServer</servlet-class>
                <init-param>
                        <param-name>default_lang</param-name>
                        <param-value>en</param-value>
                </init-param>
                <init-param>
                        <param-name>site_name</param-name>
                        <param-value>localsite</param-value>
                </init-param>
        </servlet>

The second part to remove or comment out is the servlet mapping, which maps the url to the servlet. It looks like this:

        <servlet-mapping>
                <servlet-name>oaiserver</servlet-name>
                <url-pattern>/oaiserver</url-pattern>
        </servlet-mapping>

Once these two sections are gone or commented out, Tomcat will no longer provide the oaiserver. You can add them back in if you wish to reinstate it later.

OAI Datestamps

The "datestamp" tag for a record comes from the "oailastmodified" metadata that is added automatically by Greenstone when you build a collection. This value is obtained from the operating system, and is usually the last time the file was edited. However, if the file has been copied (for example if you used the GLI to add the file into your collection) then the oailastmodified value will probably be the time the file was copied.

To manually set the OAI datestamp for a document, add gs.OAIDateStamp metadata. This must be in the form YYYY-MM-DD. This will be used instead of oailastmodified if it exists.
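The choice between the two metadata values can be sketched in Python as follows. The sketch rests on two assumptions that the text does not confirm: that oailastmodified is stored as seconds since the Unix epoch, and that a gs.OAIDateStamp value not in YYYY-MM-DD form is ignored in favour of oailastmodified. The function name is hypothetical.

```python
import re
from datetime import datetime, timezone


def record_datestamp(metadata):
    """Pick the datestamp for a record from its metadata dictionary."""
    manual = metadata.get("gs.OAIDateStamp")
    if manual and re.fullmatch(r"\d{4}-\d{2}-\d{2}", manual):
        try:
            # Reject well-formed strings that are not real dates, e.g. 2014-13-40.
            datetime.strptime(manual, "%Y-%m-%d")
            return manual
        except ValueError:
            pass
    last_modified = metadata.get("oailastmodified")
    if last_modified is None:
        return None
    # Assumption: the value is seconds since the Unix epoch.
    moment = datetime.fromtimestamp(int(last_modified), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")
```

A manually assigned gs.OAIDateStamp of 2014-11-03 is returned unchanged, whatever the value of oailastmodified.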

Resetting the server

If you have rebuilt collections, you need to reset the server. You can either restart it (stop the server and run gs3-server.sh/bat again, or run 'ant restart'), or reset it using the reset command: http://localhost:8383/greenstone3/oaiserver?reset. This makes the server reload all the collection information. (Make sure you use the correct host name and port number.) </TAB> <TAB>

The Greenstone2 OAI server

Greenstone comes with a built-in OAI data provider. This runs as a CGI program called oaiserver.cgi, and is installed in the Greenstone cgi-bin directory. It can be accessed via the same URL as the Greenstone library (replacing library.cgi with oaiserver.cgi). On Windows, you must be using a web server (e.g. Apache), not the local library server.

Configuration of the server is done via the oai.cfg file in the Greenstone etc directory. This file specifies general information about the repository, lists collections to be made accessible and may include metadata mapping information. Important: the oai.cfg file must be utf-8 encoded.

Please edit oai.cfg and set the repositoryName and repositoryId fields. If you are not using the standard Apache setup that comes with Greenstone, you may need to set oaiserverPath, libraryPath, docRootPath. Optionally, you can set baseServerURL to use a domain name instead of IP address in URLs.

By default, collections are not accessible. To enable a collection, add its name to the oaicollection list.

Greenstone's OAI server currently supports Dublin Core, Qualified Dublin Core, and RFC1807 metadata. For collections that use other metadata sets, including extracted metadata, metadata mapping rules should be provided to map the existing metadata to Dublin Core. oai.cfg has more details.

To add a new metadata set for use with oaiserver

You need to do the following:

  • Create a schema (or find an existing one) for the metadata set. See Greenstone's qualified Dublin Core schema and the OAI standard Dublin Core files for examples.
    • Put the new schema somewhere web accessible
  • Coding in GSDLHOME/runtime-src/src/oaiservr:
    • Create a new metaformat class for the metadata set. See dublincore.h/cpp, qualified_dublincore.h/cpp, rfc1807.h/cpp for examples.
    • Edit Makefile.in, Makefile and win32.mak to use the new files.
    • Edit recordaction.cpp to include the new header file and instantiate the new class (in recordaction())
  • Tell the server to use the new set: edit etc/oai.cfg and add the set name to the oaimetadata line. You may also need to add oaimapping information.
  • Recompile and test.

</TAB></TABAREA>

Testing

Once you have your OAI service in place, you can test it with online validation facilities such as http://www.openarchives.org/data/registerasprovider.html or http://re.cs.uct.ac.za/.

The former only verifies the Identify command, while the latter (called Repository Explorer) supports extensive testing.

The Greenstone OAI server must be publicly accessible over the Internet to use these validation tools.

Additional Resources

<TABAREA tabs="Greenstone3,Greenstone2"> <TAB> There are several tutorials concerning using OAI in Greenstone:

</TAB> <TAB>

There are several tutorials concerning using OAI in Greenstone:

The OAI example collection demonstrates a library built from OAI records.

</TAB> </TABAREA>

en/user_advanced/oai.1414974429.txt.gz · Last modified: 2018/07/30 23:01 (external edit)