Advanced Metadata Topics

Obtaining a list of all subject headings in a collection

Diego Spano explains:

There is no direct way to do that, but there is a workaround. Run the following commands in your terminal:

1. To set up the GS environment, type:

. ./setup.bash

2. Then type:

perl -S buildcol.pl -store_metadata_coverage your_collection_name

This creates a building folder inside your collection, containing a folder named text. In it you will find the file "your_collection_name.gdb", a database file that holds all the metadata assigned to each document.

Next, export this file to plain text with the db2txt command:

3. Move to the folder that has the gdb file, by typing:

cd /greenstone/collect/your_collection_name/building/text

4. Next, run:

db2txt your_collection_name.gdb > meta.txt

Now you have a plain-text file named meta.txt in the same folder as the gdb file. Open it and take a look: it lists the metadata for each document.

5. Now you can filter it with the grep command (available on Linux; on Windows you can use Cygwin):

grep "<dc.Subject>" ./meta.txt > onlysubject.txt

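The file onlysubject.txt now contains every dc.Subject value in the collection. The values are not de-duplicated, so a subject assigned to several documents appears once per document.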
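
To turn that output into a list of distinct subject headings, the values can be de-duplicated afterwards. The following is a minimal Python sketch, not part of Greenstone. It assumes each metadata entry in the db2txt dump is a single line of the form <dc.Subject>value (the pattern the grep command above matches); the function names unique_values and unique_values_from_file are hypothetical.

```python
def unique_values(lines, field="dc.Subject"):
    """Return the sorted, de-duplicated values of one metadata field.

    Assumption: each entry in the db2txt dump is one line of the form
    <field>value. Unlike grep, this only matches at the start of a line.
    """
    prefix = "<" + field + ">"
    values = set()
    for line in lines:
        line = line.strip()
        if line.startswith(prefix):
            value = line[len(prefix):].strip()
            if value:  # skip entries with an empty value
                values.add(value)
    return sorted(values)


def unique_values_from_file(path, field="dc.Subject"):
    # Hypothetical helper: reads meta.txt (or onlysubject.txt) from disk.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return unique_values(handle, field)
```

For example, unique_values_from_file("meta.txt") would return the distinct dc.Subject values in alphabetical order; pass a different field name to list another metadata element.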

Obtaining metadata coverage statistics for a collection

Metadata coverage statistics can be gathered during collection building by adding the line store_metadata_coverage true to the collection's etc/collect.cfg file. Rebuild the collection (there is no need to reimport); the collection's GDBM database will then contain the following information in the 'collection' entry. The examples are from the demo collection.

  • Which metadata sets have been used in the collection
    • <metadataset>dls
    • <metadataset>ex
  • Which elements are present in each metadata set.
    • <metadatalist-ex>URL
    • <metadatalist-ex>Plugin
    • <metadatalist-ex>Encoding
    • <metadatalist-ex>Language
    • <metadatalist-ex>SourceFile
    • <metadatalist-ex>Source
    • <metadatalist-ex>FileSize
    • <metadatalist-ex>Title
    • <metadatalist-dls>Subject
    • <metadatalist-dls>Language
    • <metadatalist-dls>Keyword
    • <metadatalist-dls>Organization
    • <metadatalist-dls>Title
  • The frequency of each metadata element.
    • <metadatafreq-dls-Subject>17
    • <metadatafreq-dls-Title>11
    • <metadatafreq-dls-Organization>11
    • <metadatafreq-dls-Keyword>6
    • <metadatafreq-dls-Language>11
    • <metadatafreq-ex-SourceFile>11
    • <metadatafreq-ex-Plugin>11
    • <metadatafreq-ex-URL>11
    • <metadatafreq-ex-Title>11
    • <metadatafreq-ex-Encoding>11
    • <metadatafreq-ex-FileSize>11
    • <metadatafreq-ex-Language>11
    • <metadatafreq-ex-Source>11

Note: to view all the entries in the GDBM database, run:

 db2txt path-to-collection/index/text/collname.gdb > database.txt
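
Once the database has been dumped to text, the frequency entries can be collected into a table. This is a minimal Python sketch under stated assumptions: each frequency entry is a single line shaped like the <metadatafreq-dls-Subject>17 examples above, and metadata set names contain no hyphen. The function name coverage_frequencies is hypothetical.

```python
import re

# Matches lines such as <metadatafreq-dls-Subject>17.
# Assumption: the metadata set name (dls, ex, ...) contains no hyphen.
FREQ_LINE = re.compile(r"^<metadatafreq-([^-]+)-(.+)>(\d+)$")


def coverage_frequencies(lines):
    """Map (metadata set, element) to the frequency stored in the dump."""
    freqs = {}
    for line in lines:
        match = FREQ_LINE.match(line.strip())
        if match:
            set_name, element, count = match.groups()
            freqs[(set_name, element)] = int(count)
    return freqs
```

Passing the lines of database.txt to coverage_frequencies gives a dictionary that can be sorted or compared between builds; lines that are not frequency entries are ignored.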

Inserting metadata into a live collection using metadata-server.pl

This functionality is only available in versions of Greenstone newer than 2.83.

The metadata-server.pl script inserts the new value into metadata.xml in the import directory and into the collection.gdb file in the index directory. Important: this only updates the live metadata display; the search index and classifiers are not updated until the collection is rebuilt, so the collection administrator has to rebuild the collection before the new value becomes searchable. In our example, we set up a cron job to rebuild the collection every night.

For Greenstone 3, you can access metadata-server.pl at <server-name>:<server-port>/<greenstone context>/cgi-bin/metadata-server.pl, for example localhost:8383/greenstone3/cgi-bin/metadata-server.pl

Before starting

  • Make the "index" directory writable by the web server
  • Make the "import" directory writable by the web server
  • Add a custom macros folder and an extra.dm file, and define two global macros:
    • _rebuildpendingmessage_ (appears above the newly added metadata value to remind the user that it is not yet searchable)
    • _newline_ (replaces newline characters in the inserted metadata value)

Calling the script

metadata-server.pl has the following arguments:

  • a=insert-metadata
  • un=[Username] - Authentication is not enabled at the moment, so collection developers have to handle authentication themselves
  • c=[Collection Shortname]
  • d=[Document ID] - In format statements, this is referred to as [DocOID]
  • metaname=[Metadata Field Name]
  • metavalue=[New Metadata Value] - This new value will be appended at the end

It returns the following message on success:

 insert-metadata successful: Key[D0]
 [In metadata.xml] giv.submittedText = new metadata value
 [In database] giv.submittedText = _rebuildpendingmessage_new metadata value
 The new text has not been indexed, rebuilding collection is required

You need to customize your collection interface to make an AJAX call to this script when a user submits a new value. The AJAX call has to evaluate the script's return value to determine whether the insert was successful. The possible error messages are:

  • Collection is locked (see the Locking section)
  • Missing compulsory arguments
  • Don't have permission to write to metadata.xml file
  • Don't have permission to write to database file
  • Invalid metadata.xml file
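
The request and the success check can be sketched as follows. This is a minimal Python sketch, not the actual interface code: a real collection would issue the call from the browser. It assumes the arguments are passed as an ordinary URL query string and that a successful response begins with the "insert-metadata successful" text shown above. The function names, the user name and the collection name are hypothetical.

```python
from urllib.parse import urlencode


def build_insert_url(base_url, username, collection, doc_id, metaname, metavalue):
    """Build a metadata-server.pl URL for an insert-metadata request.

    Assumption: the script reads its arguments from the query string.
    """
    params = [
        ("a", "insert-metadata"),
        ("un", username),
        ("c", collection),
        ("d", doc_id),
        ("metaname", metaname),
        ("metavalue", metavalue),
    ]
    # urlencode percent-encodes the values (spaces become "+").
    return base_url + "?" + urlencode(params)


def insert_succeeded(response_text):
    # Assumption: success responses start with this text, as in the
    # example message above; anything else is treated as an error.
    return response_text.lstrip().startswith("insert-metadata successful")
```

The caller would fetch the URL, pass the response body to insert_succeeded, and show the body to the user as an error message if the check fails.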

Locking

This script shares the same lock as the GLI:

  • If GLI is using the collection, this script is locked out.
  • Conversely, while this script is in progress, GLI cannot access the collection.
  • Two users cannot run the script at the same time; one of them will be locked out.
  • It is the caller's responsibility to retry the submission.
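
Since retrying is left to the caller, a simple retry loop might look like the following. This is a minimal Python sketch under assumptions: submit is a hypothetical caller-supplied function that performs one request and returns the response text, and a locked collection is detected by the word "locked" appearing in that text (the exact wording of the error message is not confirmed here).

```python
import time


def submit_with_retry(submit, attempts=5, delay_seconds=2.0, sleep=time.sleep):
    """Call submit() until the response no longer reports a lock.

    Returns the last response text, whether or not it succeeded, so the
    caller can still inspect it for other error messages.
    """
    response = ""
    for attempt in range(attempts):
        response = submit()
        # Assumption: a lock error contains the word "locked".
        if "locked" not in response.lower():
            return response
        if attempt < attempts - 1:
            sleep(delay_seconds)  # wait before the next attempt
    return response
```

The attempt count and delay are arbitrary starting values; a collection that GLI holds open for long editing sessions may need the user to be told to try again later instead.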

Specifying filenames manually in metadata.xml

If you write your own metadata.xml files to specify which metadata is attached to which folders and files, the <FileName> element must be a regular expression, and any file paths in it must be in URI format (that is, with forward slashes). Because these file paths are regular expressions, backslashes can be used to escape special characters, e.g. "\." matches a literal full stop.

An example of a valid metadata.xml file is:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE DirectoryMetadata SYSTEM "http://greenstone.org/dtd/DirectoryMetadata/1.0/DirectoryMetadata.dtd">
<DirectoryMetadata>
    <FileSet>
        <FileName>pinky/golala/filename1\.txt</FileName>
        <Description>
            <Metadata name="dc.Title">Lala</Metadata>
        </Description>
    </FileSet>
    <FileSet>
        <FileName>pinky/nono/filename2\.txt</FileName>
        <Description>
            <Metadata name="dc.Title">Nono</Metadata>
        </Description>
    </FileSet>
    <FileSet>
        <FileName>pinky/toto/filename3\.txt</FileName>
        <Description>
            <Metadata name="dc.Title">Toto</Metadata>
        </Description>       
    </FileSet>
</DirectoryMetadata>
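
When metadata.xml files are generated by a script, the <FileName> values can be derived from ordinary file paths. The following is a minimal Python sketch of that conversion, assuming the only requirements are the two described above (forward slashes, and backslash-escaped regular expression special characters). The function name filename_pattern and the set of characters treated as special are this sketch's own choices.

```python
# Characters escaped with a backslash so that they match literally.
# Assumption: this covers the regular expression metacharacters that
# can occur in file names.
SPECIAL_CHARS = set(".^$*+?()[]{}|\\")


def filename_pattern(path):
    """Turn a relative file path into a <FileName> regular expression."""
    # Use URI-style forward slashes, even for Windows-style input paths.
    uri_path = path.replace("\\", "/")
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in uri_path)
```

For instance, the Windows-style path pinky\golala\filename1.txt becomes pinky/golala/filename1\.txt, as in the first FileSet above. Characters that are special in XML, such as &, would still need XML escaping before the value is written into the file.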

Manually editing extracted metadata

Can extracted metadata be edited manually?

In general, for the default collection building scenario, it is not possible: when you reimport a document, all of its extracted metadata is regenerated. However, if you are happy never to reimport that document, you can edit the metadata. Whether you want to do this depends on your process for adding documents and metadata to a collection.

Summary of the building process: source documents --import--> archive files (Greenstone XML format, containing extracted text and metadata) --index--> search indexes and metadata database.

The importing and indexing phases can be run separately, so it is possible to import your documents, edit the Greenstone XML files to change metadata, then index the modified files. If you later reimport, though, any changes will be lost. This cannot be done in GLI.

Cases where this might be useful:
  • You have a static collection that will not change over time. Import, edit the XML files, then index, and never rebuild the collection.
  • The collection will grow, but you don't modify existing documents. Put the current documents into the import folder, then import them (optionally modifying the archive files) and index them. Clear out the import folder, add the new documents, and run import.pl -keepold. This adds the new documents to the current archives without changing what is already there. Then reindex the collection.
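
The second case can be scripted. This is a minimal Python sketch that only assembles and runs the two commands; it assumes the Greenstone environment has already been set up in the calling shell, and it borrows the "perl -S" invocation form from the buildcol.pl command shown earlier on this page. The function names are hypothetical, and any further options your installation needs are not shown.

```python
import subprocess


def keepold_commands(collection):
    """Commands to add new documents without touching existing archives."""
    return [
        # Import only what is in the import folder, keeping old archives.
        ["perl", "-S", "import.pl", "-keepold", collection],
        # Reindex the whole collection from the archives.
        ["perl", "-S", "buildcol.pl", collection],
    ]


def run_commands(commands):
    # Stops at the first command that exits with an error.
    for command in commands:
        subprocess.run(command, check=True)
```

Calling run_commands(keepold_commands("your_collection_name")) would then run the import and the reindex in order.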

A third option, if you are using Greenstone 3, is to use the web based metadata editing facility.

You build the collection normally (in GLI or command line) for the first time. Then in the browser, you can log in, and if you have edit privileges for that collection, you can modify the section text and/or metadata for documents, from the document view page. This is the easiest solution for the user as you don't need to worry about running build scripts on the command line.

Behind the scenes, this modifies the XML archive files and then reindexes the document. If you use this scenario, you cannot reimport the existing documents or your changes will be lost, so you cannot go back and use GLI to build the collection. You can add new files using the process described above, where you put new documents into an empty import folder and run import.pl -keepold.

A fourth option is also available, which doesn't actually change the extracted metadata. Ask what you want the metadata for. Say you are looking at the Language metadata, and it has been extracted wrongly. This doesn't affect the document unless you want to display the value or use it in a classifier, for example. For these situations, you can use two metadata fields, e.g. ex.Language and dc.Language. If ex.Language has been set wrongly, the user can set dc.Language to the correct value. Then, in classifiers or format statements, you use dc.Language if it is there, or ex.Language if it is not. This option doesn't modify the extracted metadata; it overrides it. It is probably the most useful of the four options, because it places no restrictions on how you build the collection in future.
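
The override rule itself is simple. The following minimal Python sketch illustrates only the logic; in Greenstone the choice is made in the classifier or format statement configuration, not in Python. The function name effective_value and the dictionary representation of a document's metadata are hypothetical.

```python
def effective_value(metadata, element, preferred_set="dc", fallback_set="ex"):
    """Return the user-assigned value if present, else the extracted one.

    metadata is a plain dict such as {"ex.Language": "en"}.
    """
    preferred = metadata.get(preferred_set + "." + element)
    if preferred:  # a missing or empty value falls through to the fallback
        return preferred
    return metadata.get(fallback_set + "." + element)
```

With {"ex.Language": "en", "dc.Language": "mi"} the function returns "mi"; with only ex.Language set it returns the extracted value.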