Document Identifiers

Every document in a Greenstone collection is given a unique object identifier (OID), which is used in URLs to refer to that document. There are several methods for assigning identifiers, and the method is selected with the OIDtype option. OIDtype is available both as an import option and as a plugin option. Setting it for import makes all plugins use that setting; setting it for a plugin affects only that plugin and overrides the import setting. You can therefore set a default for import and have one or more plugins override it where needed.
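The override rule can be sketched in a few lines of Perl. This only illustrates the precedence described above and is not Greenstone's own code: the function name resolve_oidtype and the example settings are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Greenstone): work out which OIDtype a
# plugin ends up using, given its own setting and the import-wide setting.
sub resolve_oidtype {
    my ($plugin_setting, $import_setting) = @_;

    # A plugin left at 'auto' (or with no setting at all) defers to import.
    return $import_setting
        if !defined $plugin_setting || $plugin_setting eq 'auto';

    # Any other plugin-level value overrides the import setting.
    return $plugin_setting;
}

my $import_oidtype = 'hash';
my %plugin_oidtype = (    # made-up example settings
    HTMLPlugin => 'auto',
    PDFPlugin  => 'dirname',
);

for my $plugin (sort keys %plugin_oidtype) {
    printf "%s uses %s\n", $plugin,
        resolve_oidtype($plugin_oidtype{$plugin}, $import_oidtype);
}
# prints:
#   HTMLPlugin uses hash
#   PDFPlugin uses dirname
```

In this sketch the plugin left at 'auto' follows the import default, while the plugin with an explicit value keeps its own.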

The options for OIDtype are:

  • auto (default for plugins): Use the OIDtype set in import.pl. (Only for use with a plugin, not with import.pl.)
  • hash (default for import.pl): This usually hashes the contents of the file, so the identifier is the same every time the collection is imported. (Hashing takes an input and generates an identifier that is, in practice, unique to that input.) Duplicate documents will not be included in the collection, as they hash to the same value. For some file types (PDF, MP3, OggVorbis, RealMedia), hashing the contents is not useful; for these files, 'hash_on_ga_xml' is used instead (see below).
  • hash_on_ga_xml: Hash the contents of the Greenstone Archive XML file. Document identifiers will be the same every time the collection is imported as long as the metadata does not change. Identifiers will probably change between software updates, as updates can affect the Archive XML format/contents.
  • hash_on_full_filename: Hash the full filename of the document within the 'import' folder (not its contents). This helps keep document identifiers stable across software upgrades, but duplicate documents in the collection are no longer detected automatically.
  • assigned: Your documents may already have unique identifiers in their metadata. If so, you can set the OIDtype to 'assigned' and use the OIDmetadata option to specify which metadata field contains the document identifiers. OIDmetadata defaults to 'dc.Identifier'. Purely numeric identifiers will be prefixed by 'D', and directory slashes and periods will be replaced by underscores.
  • filename: Use the tail file name (without the file extension) as the document identifier. Requires every filename across all the folders within 'import' to be unique. Numeric identifiers will be preceded by 'D'.
  • dirname: Use the immediate parent directory name. There should only be one document per directory, and directory names should be unique. E.g. import/b13as/h15ef/page.html will get an identifier of h15ef. Numeric identifiers will be preceded by 'D'.
  • full_filename: Use the full file name within the 'import' folder as the identifier for the document (with _ substituted for symbols such as directory separators and the full stop before a filename extension).
  • incremental: Number the documents in the order they are imported. This is significantly faster than hashing, but there is no guarantee that documents will get the same identifier every time the collection is built. Incremental OIDs start with "D" by default (e.g. "D1", "D2", "D3"); they cannot be purely numeric because that would conflict with indexer document numbers.

To change this prefix, replace the "D" in the code below:

	} elsif ($self->{'OIDtype'} eq "incremental") {
	    $OID = "D" . $OIDcount;
	    $OIDcount ++;

(contributed by Michael Dewsnip & Stephen)
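The clean-up rules described for the 'assigned', 'filename', 'dirname' and 'full_filename' types can be sketched as follows. This is a rough illustration based only on the descriptions in the list above, not Greenstone's implementation: all helper names are hypothetical, the example paths (other than the dirname one) are made up, and the exact set of characters Greenstone substitutes may differ.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse dirname basename);

# Hypothetical helper: replace directory slashes and periods with underscores.
sub underscore_separators {
    my ($id) = @_;
    $id =~ s{[/\\.]}{_}g;
    return $id;
}

# Hypothetical helper: prefix purely numeric identifiers with 'D'.
sub prefix_numeric {
    my ($id) = @_;
    return $id =~ /\A[0-9]+\z/ ? "D$id" : $id;
}

# 'assigned': take the value from the metadata field, then tidy it.
sub oid_from_assigned {
    my ($value) = @_;
    return prefix_numeric(underscore_separators($value));
}

# 'filename': tail file name without its extension.
sub oid_from_filename {
    my ($path) = @_;
    my ($name) = fileparse($path, qr/\.[^.]*/);
    return prefix_numeric($name);
}

# 'dirname': name of the immediate parent directory.
sub oid_from_dirname {
    my ($path) = @_;
    return prefix_numeric(basename(dirname($path)));
}

# 'full_filename': path below the 'import' folder, with separators replaced.
sub oid_from_full_filename {
    my ($path) = @_;
    (my $relative = $path) =~ s{^import/}{};
    return underscore_separators($relative);
}

print oid_from_dirname('import/b13as/h15ef/page.html'), "\n";        # h15ef
print oid_from_filename('import/reports/0042.pdf'), "\n";            # D0042
print oid_from_full_filename('import/b13as/h15ef/page.html'), "\n";  # b13as_h15ef_page_html
print oid_from_assigned('12345'), "\n";                              # D12345
print oid_from_assigned('2004/17.3'), "\n";                          # 2004_17_3
```

The sketch applies only the substitutions that the list states for each type, which is why 'filename' and 'dirname' get the 'D' prefix but no underscore replacement here.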