Advanced Metadata Topics

Obtaining a list of all subject headings in a collection

Diego Spano explains:

There is no direct way to do that, but there is a workaround. Run the following commands in your terminal:

1. To set up the Greenstone (GS) environment, change to your Greenstone installation folder and type:

. ./setup.bash

2. Then type:

perl -S buildcol.pl -store_metadata_coverage your_collection_name

This creates a building folder inside your collection, and inside it a folder named text. That folder contains a file named "your_collection_name.gdb", a database file that holds all the metadata assigned to each document.

Next, export this file to plain text with the db2txt command:

3. Move to the folder that has the gdb file, by typing:

cd /greenstone/collect/your_collection_name/building/text

4. Next, run:

db2txt your_collection_name.gdb > meta.txt

You now have a plain-text file named meta.txt in the same folder as the gdb file. Open it and take a look: it lists the metadata for each document.

5. Now you can filter it with the grep command (available on Linux; on Windows you can use Cygwin):

grep "<dc.Subject>" ./meta.txt > onlysubject.txt

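The file onlysubject.txt now contains all the dc.Subject values. Duplicates are not removed, so a value that occurs several times in the collection appears several times in the file.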
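
Because the grep output keeps duplicates, a short script can reduce it to a list of unique subject headings with counts. The following is a minimal sketch, assuming each line of onlysubject.txt has the form <dc.Subject>value; the function name count_subjects and the sample values are made up for illustration.

```python
from collections import Counter

PREFIX = "<dc.Subject>"


def count_subjects(lines):
    """Count how often each subject value occurs in the grep output lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Assumption: every relevant line starts with the <dc.Subject> tag
        if line.startswith(PREFIX):
            value = line[len(PREFIX):].strip()
            if value:
                counts[value] += 1
    return counts


# Inline sample standing in for the contents of onlysubject.txt (made-up values)
sample = [
    "<dc.Subject>Agriculture",
    "<dc.Subject>Forestry",
    "<dc.Subject>Agriculture",
]
for subject, n in sorted(count_subjects(sample).items()):
    print(f"{n}\t{subject}")
```

To run it on the real file, open onlysubject.txt and pass the file object to count_subjects in place of the sample list.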

Obtaining metadata coverage statistics for a collection

Metadata coverage statistics can be gathered during collection building by adding the line store_metadata_coverage true to the collection's etc/collect.cfg file. Then rebuild the collection (there is no need to re-import). After the rebuild, the collection's GDBM database contains the following information in the 'collection' entry. The examples are from the demo collection.

  • Which metadata sets have been used in the collection
    • <metadataset>dls
    • <metadataset>ex
  • Which elements are present in each metadata set.
    • <metadatalist-ex>URL
    • <metadatalist-ex>Plugin
    • <metadatalist-ex>Encoding
    • <metadatalist-ex>Language
    • <metadatalist-ex>SourceFile
    • <metadatalist-ex>Source
    • <metadatalist-ex>FileSize
    • <metadatalist-ex>Title
    • <metadatalist-dls>Subject
    • <metadatalist-dls>Language
    • <metadatalist-dls>Keyword
    • <metadatalist-dls>Organization
    • <metadatalist-dls>Title
  • The frequency of each metadata element.
    • <metadatafreq-dls-Subject>17
    • <metadatafreq-dls-Title>11
    • <metadatafreq-dls-Organization>11
    • <metadatafreq-dls-Keyword>6
    • <metadatafreq-dls-Language>11
    • <metadatafreq-ex-SourceFile>11
    • <metadatafreq-ex-Plugin>11
    • <metadatafreq-ex-URL>11
    • <metadatafreq-ex-Title>11
    • <metadatafreq-ex-Encoding>11
    • <metadatafreq-ex-FileSize>11
    • <metadatafreq-ex-Language>11
    • <metadatafreq-ex-Source>11

Note: to view all the entries in the GDBM database, run:

 db2txt path-to-collection/index/text/collname.gdb > database.txt
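
Once the database has been dumped to database.txt, the frequency entries can be collected into a table. This is a rough sketch, assuming the dump contains one entry per line in the <metadatafreq-set-element>count form shown in the list above; the function name parse_coverage is hypothetical.

```python
import re

# Matches lines such as "<metadatafreq-dls-Subject>17"
FREQ_LINE = re.compile(r"^<metadatafreq-([^\->]+)-([^>]+)>(\d+)$")


def parse_coverage(lines):
    """Return {(metadata set, element): frequency} from metadatafreq lines."""
    coverage = {}
    for line in lines:
        match = FREQ_LINE.match(line.strip())
        if match:
            set_name, element, freq = match.groups()
            coverage[(set_name, element)] = int(freq)
    return coverage


# Sample lines taken from the demo collection examples above
sample = [
    "<metadataset>dls",
    "<metadatafreq-dls-Subject>17",
    "<metadatafreq-ex-Title>11",
]
for (set_name, element), freq in sorted(parse_coverage(sample).items()):
    print(f"{set_name}.{element}: {freq}")
```

Lines that are not frequency entries, such as <metadataset> lines, are skipped.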

Inserting metadata into a live collection using metadata-server.pl

This functionality is only available in versions of Greenstone newer than 2.83.

The metadata-server.pl script inserts the new value into the metadata.xml file in the import directory and into the collection.gdb file in the index directory. Important: this only updates the metadata displayed by the live collection; the search index and classifiers are not updated until the collection is rebuilt. The collection administrator has to rebuild the collection before the changes show up there. In our example, we set up a cron job to rebuild the collection every night.

Before starting

  • Change the permissions on the "index" directory so that it is writable by the web server
  • Change the permissions on the "import" directory so that it is writable by the web server
  • Add a custom macros folder and extra.dm, and add two Global macros:
    • _rebuildpendingmessage_ (This will appear on top of the newly added metadata value to remind the user that this piece of information is not yet searchable)
    • _newline_ (Newline characters in the inserted metadata value will be replaced by this macro)

Calling the script

metadata-server.pl has the following arguments:

  • a=insert-metadata
  • un=[Username] - Authentication is not enabled at the moment; collection developers will have to handle authentication themselves
  • c=[Collection Shortname]
  • d=[Document ID] - In format statements, it is referred to as [DocOID]
  • metaname=[Metadata Field Name]
  • metavalue=[New Metadata Value] - This new value will be appended at the end

It returns the following message on success:

 insert-metadata successful: Key[D0]
 [In metadata.xml] giv.submittedText = new metadata value
 [In database] giv.submittedText = _rebuildpendingmessage_new metadata value
 The new text has not been indexed, rebuilding collection is required

You need to customize your collection interface so that it makes an AJAX call to this script when a user submits new values. The AJAX call has to evaluate the return value of the script to determine whether the insert was successful. The possible error messages are:

  • Collection is locked (see the Locking section)
  • Missing compulsory arguments
  • Don't have permission to write to metadata.xml file
  • Don't have permission to write to database file
  • Invalid metadata.xml file
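
The request and the success check can be sketched as follows. This is only an illustration under stated assumptions: that the arguments are passed as query-string parameters, that the script is reachable at the hypothetical BASE_URL below, and that a successful reply begins with the "insert-metadata successful" text shown above. The function names are made up, and the sketch only builds the URL; it does not send the request.

```python
from urllib.parse import urlencode

# Hypothetical location of the script; adjust for your own installation
BASE_URL = "http://localhost/greenstone/cgi-bin/metadata-server.pl"


def build_insert_url(username, collection, doc_id, metaname, metavalue):
    """Build a request URL carrying the arguments listed above."""
    params = {
        "a": "insert-metadata",
        "un": username,
        "c": collection,
        "d": doc_id,
        "metaname": metaname,
        "metavalue": metavalue,
    }
    # urlencode takes care of escaping spaces and special characters
    return BASE_URL + "?" + urlencode(params)


def insert_succeeded(response_text):
    """Assumption: a successful reply begins with the success message."""
    return response_text.lstrip().startswith("insert-metadata successful")


url = build_insert_url("alice", "demo", "D0", "giv.submittedText", "new metadata value")
print(url)
print(insert_succeeded(" insert-metadata successful: Key[D0]"))
print(insert_succeeded("Collection is locked"))
```

Any reply that does not start with the success message is treated as an error here, which covers all of the messages in the list above without matching their exact wording.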

Locking

This script shares the same lock as the GLI:

  • If GLI is using the collection, this script will be locked out.
  • On the other hand, while this script is in progress, GLI will not be able to access this collection.
  • Two users cannot run the script at the same time; one of them will be locked out.
  • It is the caller's responsibility to retry the submit process.
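
A caller-side retry could look like the sketch below. It assumes that a locked collection is reported with a reply containing the word "locked" (as in the "Collection is locked" message); submit_with_retry and its parameters are hypothetical names, and submit stands for whatever function sends the request and returns the reply text.

```python
import time


def submit_with_retry(submit, attempts=5, delay=1.0, sleep=time.sleep):
    """Call submit() until the reply no longer reports a lock, or attempts run out."""
    reply = ""
    for attempt in range(attempts):
        reply = submit()
        # Assumption: a lock is signalled by the word "locked" in the reply
        if "locked" not in reply.lower():
            return reply
        if attempt < attempts - 1:
            sleep(delay * (attempt + 1))  # wait a little longer after each failure
    return reply


# Fake submitter for demonstration: locked twice, then successful
replies = iter([
    "Collection is locked",
    "Collection is locked",
    "insert-metadata successful: Key[D0]",
])
print(submit_with_retry(lambda: next(replies), delay=0))
```

If the collection is still locked after the last attempt, the final reply is returned unchanged so that the interface can show the error to the user.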

Specifying filenames manually in metadata.xml

If you write your own metadata.xml files to specify which metadata is attached to which folders and files, the <FileName> element must be given as a regular expression, and any file paths in it must be in URI format (which uses forward slashes). Because such file paths are regular expressions, backslashes can be used to escape special characters, e.g. "\." means the literal full-stop character.

An example of a valid metadata.xml file is:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE DirectoryMetadata SYSTEM "http://greenstone.org/dtd/DirectoryMetadata/1.0/DirectoryMetadata.dtd">
<DirectoryMetadata>
    <FileSet>
        <FileName>pinky/golala/filename1\.txt</FileName>
        <Description>
            <Metadata name="dc.Title">Lala</Metadata>
        </Description>
    </FileSet>
    <FileSet>
        <FileName>pinky/nono/filename2\.txt</FileName>
        <Description>
            <Metadata name="dc.Title">Nono</Metadata>
        </Description>
    </FileSet>
    <FileSet>
        <FileName>pinky/toto/filename3\.txt</FileName>
        <Description>
            <Metadata name="dc.Title">Toto</Metadata>
        </Description>       
    </FileSet>
</DirectoryMetadata>
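
When metadata.xml files are generated by a script, the <FileName> values can be derived from ordinary file paths. The sketch below is one possible approach, not something Greenstone provides: the hypothetical helper filename_pattern joins the path components with forward slashes and escapes regular-expression special characters in each component. It does not apply any further URI encoding.

```python
import re
from pathlib import PureWindowsPath


def filename_pattern(path):
    """Turn a relative file path into a regular expression for <FileName>."""
    # PureWindowsPath accepts both backslashes and forward slashes as separators
    parts = PureWindowsPath(path).parts
    # Escape special characters such as "." inside each path component
    return "/".join(re.escape(part) for part in parts)


print(filename_pattern(r"pinky\golala\filename1.txt"))  # prints pinky/golala/filename1\.txt
```

The result matches the form used in the example file above, with the full stop escaped and forward slashes between folders.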

Manually editing extracted metadata

Last modified: 2017/02/12 20:49 by kjdon