Creating a collection from an OAI repository
Greenstone can download records from an OAI repository and build them into a collection. The downloading can be done in two ways:
1. From the GLI
Start the Greenstone Librarian Interface. On the left-hand side of the Librarian Interface's Download panel, select OAI, and then specify the arguments on the right-hand side of the panel. Hovering your mouse over an argument displays its tooltip. There are four buttons beneath the argument fields: Preferences, Server Information, Clear Cache and Download.
Clicking the Preferences button opens the GLI preferences, where you can specify the proxy information for your connection if necessary. Clicking Server Information sends an information request to the OAI data provider specified by the url argument, and the response is shown in a popup window. You can use the returned server information to help fill out the arguments, for example, the set name. Clear Cache deletes all previously downloaded metadata files. To start downloading the metadata records, click the Download button. A download progress panel will appear. If you see something like "Downloaded 0 of 0 files...", click the View Log button to view the log file; the most likely cause is an invalid argument.
Behind the scenes, GLI uses a script called downloadfrom.pl. This can be run from the command line as described in the next section.
You can view the downloaded files on the Gather panel. On the left-hand side of the panel, double click the Downloaded Files folder to expand its content. The subfolders are named by the oai server url. At the lowest level of each subfolder are the metadata files, which are organized by the specified set name. These metadata files are physically stored in a temporary cache directory.
You can build a collection using these downloaded metadata files. OAIPlug must be included in the collection plugin list.
The OAI example collection demonstrates these features.
Downloading source documents
If the get document checkbox is selected, Greenstone checks the value of dc.identifier. If it is a URL (it starts with http, https or ftp), its file extension is compared against the include filetype option, which defaults to doc,pdf,ppt:
- If the file extension matches one of these, the file is downloaded.
- If there is no file extension, or if the file extension is html, the page is downloaded and scanned for hrefs that match the specified file extensions, and those files are downloaded.
It also appears to cope with handle URLs, e.g. http://hdl.handle.net/xxx
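The decision described above can be sketched in a few lines of Python. This is only an illustration of the rules as stated in the text, not Greenstone's actual implementation: the function name classify_identifier is hypothetical, and two details are assumptions (extensions are compared case-insensitively, and URLs with any other extension are skipped).

```python
from urllib.parse import urlparse
import posixpath

DEFAULT_EXTS = ("doc", "pdf", "ppt")  # default "include filetype" value from the text


def classify_identifier(identifier, exts=DEFAULT_EXTS):
    """Return 'download', 'scan' or 'skip' for a dc.identifier value.

    Hypothetical helper: it mirrors the rules in the text, not Greenstone's code.
    """
    parsed = urlparse(identifier.strip())
    if parsed.scheme.lower() not in ("http", "https", "ftp"):
        return "skip"  # not a URL, so there is nothing to fetch
    # Extension of the URL path, without the dot (lower-casing is an assumption)
    ext = posixpath.splitext(parsed.path)[1].lstrip(".").lower()
    if ext in exts:
        return "download"  # fetch the file itself
    if ext in ("", "html"):
        return "scan"  # fetch the page and look for matching hrefs
    return "skip"  # assumption: other extensions are ignored


for value in ("http://example.org/paper.pdf",
              "https://example.org/page.html",
              "http://hdl.handle.net/xxx",
              "A plain text identifier"):
    print(value, "->", classify_identifier(value))
```

Under these rules a handle URL such as http://hdl.handle.net/xxx has no file extension, so it falls into the "scan" case.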
2. Through the command line
GLI uses a perl script, downloadfrom.pl, to do the downloading. This can be run on the command line, outside of GLI. Before you start, you must set up your Greenstone environment in the terminal: Go to your Greenstone folder, and run source setup.bash (Linux/Mac) or setup.bat (Windows).
downloadfrom.pl can download using several different protocols. These are specified using the -mode option.
To see the available options for download mode, run perl -S downloadfrom.pl -h
The current options are
- Web: download a website using http
- MediaWiki: download a MediaWiki website
- OAI: download using OAI
- Z3950: download using z3950
- SRW: download using a SearchRetrieve Webservice
For OAI downloading, use -mode OAI
To see the options for OAI downloading, you can run perl -S downloadinfo.pl OAIDownload. The options are the same as you can see in the GLI OAI download panel. They are:
- -url <string>: (Required) The OAI repository URL.
- -metadata_prefix <string>: The metadata format to be used in the downloaded records. e.g. oai_dc, qdc, etc. Formats available depend on what is offered by the OAI server. All repositories must offer oai_dc. Default: oai_dc
- -set <string>: Restrict the download to the specified set in the repository
- -get_doc: Download source documents too, if available
- -get_doc_exts <string>: If downloading source documents, only download those whose file extensions match this list. Default: doc,pdf,ppt
- -max_records <int>: Maximum number of records to download. If not specified, will download all records.
An example usage would be:
perl -S downloadfrom.pl -mode OAI -url http://www.nzdl.org/cgi-bin/oaiserver.cgi -set demo -max_records 5
This will try to download 5 records from the set demo on the nzdl.org OAI server.
The records (and optionally documents) will be downloaded into the folder the script is run from. To change this, use the -cache_dir full-path-to-folder option.
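If the download needs to be scripted, the command line can be assembled programmatically. The following Python sketch only builds the argument list from the options listed above; the function name build_oai_download_command is hypothetical, and actually running the command assumes the Greenstone environment has already been set up in the calling shell.

```python
import shlex


def build_oai_download_command(url, metadata_prefix=None, set_name=None,
                               get_doc=False, get_doc_exts=None,
                               max_records=None, cache_dir=None):
    """Build the downloadfrom.pl argument list for an OAI download.

    Hypothetical helper: only the options described in the text are used.
    """
    cmd = ["perl", "-S", "downloadfrom.pl", "-mode", "OAI", "-url", url]
    if metadata_prefix:
        cmd += ["-metadata_prefix", metadata_prefix]
    if set_name:
        cmd += ["-set", set_name]
    if get_doc:
        cmd.append("-get_doc")
        if get_doc_exts:
            cmd += ["-get_doc_exts", get_doc_exts]
    if max_records is not None:
        cmd += ["-max_records", str(max_records)]
    if cache_dir:
        cmd += ["-cache_dir", cache_dir]
    return cmd


cmd = build_oai_download_command("http://www.nzdl.org/cgi-bin/oaiserver.cgi",
                                 set_name="demo", max_records=5)
print(shlex.join(cmd))
# To run it (requires a configured Greenstone environment):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a set name or cache directory contains spaces.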
Note: this description is valid for Greenstone 2.85 with the patched OAIDownload file; see the OAI patch section in the 2.85 Release Notes.
The Greenstone OAI server
Greenstone comes with a built-in OAI data provider. This runs as a CGI program called "oaiserver.cgi", and is installed in the Greenstone cgi-bin directory. It can be accessed via the same URL as the Greenstone library (replacing "library.cgi" with "oaiserver.cgi"). On Windows, you must be using a web server (e.g. Apache), not the local library server.
Configuration of the server is done via the oai.cfg file in the Greenstone etc directory. Please edit this file and set the repositoryName and repositoryId fields. If you are not using the standard Apache setup that comes with Greenstone, you may need to set oaiserverPath, libraryPath, docRootPath. Optionally, you can set baseServerURL to use a domain name instead of IP address in URLs.
This file specifies general information about the repository, and lists collections to be made accessible to OAI clients. By default, collections are not accessible. To enable a collection, add its name to the oaicollection list.
Greenstone's OAI server currently supports Dublin Core, Qualified Dublin Core and RFC1807 metadata. For collections that use other metadata sets, including extracted metadata, metadata mapping rules should be provided to map the existing metadata to Dublin Core. See your Greenstone installation's etc/oai.cfg file for details.
To add a new metadata set for use with oaiserver
You need to do the following:
- Create a schema (or find an existing one) for the metadata set. See Greenstone's qualified Dublin Core schema and the OAI standard Dublin Core files for examples.
- Put the new schema somewhere web accessible
- Coding in GSDLHOME/runtime-src/src/oaiservr:
  - Create a new metaformat class for the metadata set. See dublincore.h/cpp, qualified_dublincore.h/cpp, rfc1807.h/cpp for examples.
  - Edit Makefile.in, Makefile and win32.mak to use the new files.
  - Edit recordaction.cpp to include the new header file and instantiate the new class (in recordaction()).
- Tell the server to use the new set: edit etc/oai.cfg and add the set name to the oaimetadata line. You may also need to add oaimapping information.
- Recompile and test.
The Greenstone 3 OAI Server
The Greenstone3 OAI data provider facility is available with versions 3.03 and later. This runs as a servlet called "oaiserver", and can be accessed using the same URL as the library, by replacing library with oaiserver. For example, http://localhost:8080/greenstone3/oaiserver .
You can see a demonstration server at http://www.greenstone.org/greenstone3/oaiserver?verb=Identify.
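Requests to the server are ordinary URLs made up of the base URL, a verb and optional arguments. The following Python sketch shows one way to build them; the function name oai_request_url is hypothetical, and the argument names (metadataPrefix, set) are those of the OAI-PMH protocol rather than anything specific to Greenstone.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, arguments=None):
    """Build an OAI-PMH request URL from a base URL, a verb and arguments.

    Hypothetical helper; arguments is a dict such as {"metadataPrefix": "oai_dc"}.
    """
    query = {"verb": verb}
    query.update(arguments or {})
    return base_url + "?" + urlencode(query)


base = "http://localhost:8080/greenstone3/oaiserver"
print(oai_request_url(base, "Identify"))
print(oai_request_url(base, "ListRecords", {"metadataPrefix": "oai_dc"}))
# To fetch a response (needs a running server):
#   from urllib.request import urlopen
#   xml_bytes = urlopen(oai_request_url(base, "Identify")).read()
```

The arguments are passed as a dictionary because some OAI-PMH argument names, such as from, are reserved words in Python.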
Configuration is done via the two files: OAIConfig.xml for repository wide configuration, and collectionConfig.xml for collection specific configuration.
OAIConfig.xml resides in $GSDL3HOME/WEB-INF/classes/. (Note that GSDL3HOME refers to the location of the web directory, normally .../greenstone3/web, but it may be moved into Tomcat's folders.) This file specifies general information about the repository.
Please modify this file and enter the correct values for repositoryName, and baseURL. Other values may be modified as needed.
The configurations provided in this file are described as follows:
<repositoryName>repository-name</repositoryName> The name of this oai repository, which is human readable.
<baseURL>your-web-server-domain-name/greenstone3/oaiserver </baseURL> The base url to access this repository.
<protocolVersion>2.0</protocolVersion> The version of the OAI specification this repository supports. The Greenstone 3 OAI server supports both version 1.1 and 2.0. Although the OAI organization discontinued registration support for version 1.1 of the protocol on 1 September 2002, some servers may still be using it (for example, the http://rocky.dlib.vt.edu/~jcdlpix/cgi-bin/OAI/jcdlpix.pl OAI server used in the Greenstone tutorial exercises).
<deletedRecord>no</deletedRecord> The manner in which the repository supports the notion of deleted records.
<granularity>yyyy-MM-ddTHH:mm:ssZ</granularity> The granularity of the datestamp. The meaning of the string is defined in the ISO 8601 specification. The only other legitimate value, which is coarser than this one, is YYYY-MM-DD.
<adminEmail>maintainer-email-address</adminEmail> The repository maintainer email address. There can be more than one email address here, one element for each.
The response to an Identify request contains the above information together with the earliestDatestamp element, which is the earliest timestamp among the build times of all collections in the repository. It is not provided here because it has to be generated dynamically by going through all collections to find whichever collection was built the earliest.
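As an illustration of how the earliest datestamp (earliestDatestamp in the OAI-PMH Identify response) relates to the collection build times and the configured granularity, here is a minimal Python sketch. The function names, collection names and build times are made up for the example; the Greenstone servlet itself is not written in Python.

```python
from datetime import datetime, timezone

FINE = "yyyy-MM-ddTHH:mm:ssZ"   # granularity string used in OAIConfig.xml
COARSE = "YYYY-MM-DD"           # the coarser alternative


def format_datestamp(moment, granularity=FINE):
    """Format a timezone-aware datetime as a UTC datestamp."""
    moment = moment.astimezone(timezone.utc)
    if granularity == COARSE:
        return moment.strftime("%Y-%m-%d")
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


def earliest_datestamp(build_times, granularity=FINE):
    """Return the earliest build time among all collections, formatted."""
    return format_datestamp(min(build_times.values()), granularity)


# Hypothetical build times for two collections
build_times = {
    "demo": datetime(2012, 3, 1, 10, 30, 0, tzinfo=timezone.utc),
    "docs": datetime(2011, 7, 15, 8, 0, 0, tzinfo=timezone.utc),
}
print(earliest_datestamp(build_times))          # 2011-07-15T08:00:00Z
print(earliest_datestamp(build_times, COARSE))  # 2011-07-15
```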
The following information also must be specified in the OAIConfig.xml file:
<resumeAfter>250</resumeAfter> This value controls whether, and how, the responses to list requests are split up. In OAI, the commands ListSets, ListRecords, and ListIdentifiers are collectively called list requests. In some cases, these lists may be large and it may be practical to partition them among a series of requests and responses. This value decides how many sets/identifiers/records to send in response to a request before issuing a resumption token. A value less than 0 (e.g. -1) indicates that a complete list of items will be returned. See the OAI specification for how flow control is accomplished by using resumption tokens.
<resumptionTokenExpiration>7200</resumptionTokenExpiration> The time period in which a newly generated resumption token will remain valid, specified in seconds. Hence, the default value 7200 is equivalent to 2 hours. The use of this property depends on the value of resumeAfter. If the resumeAfter parameter is specified to be negative (any value less than 0), there won't be any token issued.
<ListMetadataFormats> A list of metadata formats supported by this repository. Since the Dublin Core metadata format is mandatory according to the OAI specification, there must be a metadataFormat element with the oai_dc prefix specified here, along with the metadata name mappings if necessary. We will get back to this when describing the field mappings later in this document.
An element containing the standard Dublin Core metadata names is also provided here, instead of hard-coded in the program, in case a repository supports only a modification or extension of the Dublin Core standard.
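The interplay of resumeAfter and resumptionTokenExpiration described above can be sketched as follows. This is a simplified Python model of flow control with resumption tokens, not the Greenstone implementation: the class name ListPager, the in-memory token store and the token format are all assumptions, and a resumeAfter of 0 is treated like a negative value to avoid empty pages.

```python
import time
import uuid


class ListPager:
    """Toy model of list-request flow control (hypothetical, for illustration)."""

    def __init__(self, resume_after=250, token_lifetime=7200):
        self.resume_after = resume_after
        self.token_lifetime = token_lifetime  # seconds
        self._pending = {}  # token -> (remaining items, expiry time)

    def first_page(self, items, now=None):
        """Answer the initial list request."""
        return self._page(list(items), now)

    def resume(self, token, now=None):
        """Answer a follow-up request that carries a resumption token."""
        now = time.time() if now is None else now
        remaining, expires = self._pending.pop(token)  # KeyError if unknown
        if now > expires:
            raise ValueError("resumption token expired")
        return self._page(remaining, now)

    def _page(self, items, now):
        now = time.time() if now is None else now
        # Negative resumeAfter: return everything and issue no token
        if self.resume_after <= 0 or len(items) <= self.resume_after:
            return items, None
        token = uuid.uuid4().hex
        self._pending[token] = (items[self.resume_after:],
                                now + self.token_lifetime)
        return items[:self.resume_after], token


pager = ListPager(resume_after=2, token_lifetime=7200)
page, token = pager.first_page(["a", "b", "c", "d", "e"], now=0)
while token is not None:
    print(page, "(more to come)")
    page, token = pager.resume(token, now=60)
print(page, "(complete)")
```

A harvester keeps sending the token it received until a response arrives without one, which marks the end of the list.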
collectionConfig.xml resides in the etc directory of each collection. The only information relating to the OAI configuration of the collection specified in this file is a list of metadata formats that this particular collection supports, along with some metadata field mappings.
Metadata Field Mapping
Metadata mapping is necessary if the metadata fields you have used in your collections are not the ones that you claim to support in the above two configuration files. For example, the Dublin Core metadata format is mandatory for any repository (and hence for all collections in the repository). If a particular collection uses a field name such as Title instead of the Dublin Core name dc.Title (or whatever supported metadata field name is specified in the two configuration files), the field Title must be linked to dc.Title in order for your metadata to be accessible to metadata harvesters.
Field mapping is done at two levels: globally for all collections in a repository, and specifically for one collection. For a particular collection, the mapping specification in collectionConfig.xml takes precedence over that in OAIConfig.xml. Hence, a metadata mapping is first looked for in the collection's collectionConfig.xml; if it is not found there, OAIConfig.xml is consulted; if it is not specified there either, the standard Dublin Core field names are used to retrieve the metadata of the collection.
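The lookup order can be illustrated with a short Python sketch. Plain dictionaries stand in for the mappings read from collectionConfig.xml and OAIConfig.xml; the function name resolve_field and the sample mappings are hypothetical.

```python
def resolve_field(public_name, collection_mapping, repository_mapping):
    """Return the collection field that supplies the value for public_name.

    Lookup order: collection-level mapping, then repository-level mapping,
    then the public (standard Dublin Core) name itself.
    """
    if public_name in collection_mapping:
        return collection_mapping[public_name]
    if public_name in repository_mapping:
        return repository_mapping[public_name]
    return public_name


# Hypothetical mappings standing in for the two configuration files
repository_mapping = {"dc.Title": "Title", "dc.Creator": "Author"}
collection_mapping = {"dc.Title": "ex.Heading"}

print(resolve_field("dc.Title", collection_mapping, repository_mapping))    # ex.Heading
print(resolve_field("dc.Creator", collection_mapping, repository_mapping))  # Author
print(resolve_field("dc.Date", collection_mapping, repository_mapping))     # dc.Date
```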
Mapping in the configuration files takes the following format:
In this case, the first name dc.Title is the publicly accessible field, and the second is the field name that is used in the collection, i.e., the value of the field Title will be returned as the value of dc.Title (if the field dc.Title is requested).
There is another mapping format that is possible if you have created your own metadata fields and want to make them available for harvesting. For example, the following mapping
means the collection supports a metadata format with the prefix oai_gs, and one of the metadata fields used in the collection is called gs.Title.
The concept of 'set' in the OAI specification is mapped into Greenstone as 'collection'. Hence, the setHierarchy mechanism is supported by dividing a repository into collections. By default, all collections in the repository are OAI-enabled. This is done by writing an OAIPMH ServiceRack in the buildConfig.xml file of a collection.
Since it involves the buildConfig.xml file, a collection has to be rebuilt in order to disable its OAI accessibility. This can be done in two ways: by providing the -disable_OAI flag on the command line when executing buildcol.pl, or by ticking the disable_OAI checkbox in the Create panel in GLI (select build options on the left-hand side of the panel).
If you don't want to rebuild the collection (for example, because it is quite large and would take a long while), you can manually delete the OAIPMH ServiceRack element in the buildConfig.xml file of that collection. If you take this alternative approach, please make sure the file is still well-formed after modification. The easiest way of checking this is to open the file in a web browser: you will see an error report instead of proper XML if it is not well-formed. This well-formedness requirement also applies to the two configuration files OAIConfig.xml and collectionConfig.xml.
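As an alternative to opening the file in a web browser, well-formedness can also be checked from a script. The following Python sketch uses the standard library XML parser; the function name well_formedness_error and the sample fragments are hypothetical.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_data):
    """Return None if xml_data is well-formed, otherwise the parser's message."""
    try:
        ET.fromstring(xml_data)
    except ET.ParseError as err:
        return str(err)
    return None


print(well_formedness_error("<list><item/></list>"))       # None
print(well_formedness_error("<list><item></list>"))        # reports a mismatched tag

# To check a configuration file, read it in binary mode and pass the bytes:
#   with open("buildConfig.xml", "rb") as f:
#       print(well_formedness_error(f.read()))
```

This only checks that the XML is well-formed; it does not verify that the elements are ones Greenstone expects.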
Once you have your OAI service in place, testing can be done via the following online validation facilities
The former only verifies the Identify command, while extensive testing can be performed via the latter (called Repository Explorer).