Warning: This page is part of Greenstone's old deprecated wiki. Please visit the new wiki at wiki.greenstone.org/doku.php

Building Greenstone collections

From GreenstoneWiki

Revision as of 01:49, 24 January 2012 by Ak19 (Talk | contribs)

What is the "Greenstone Librarian Interface"?

The Greenstone Librarian Interface (GLI) is a graphical tool for building new collections, altering or deleting existing collections, and exporting existing collections to stand-alone CD-ROMs. It allows you to import or assign metadata, and has an interactive collection design module. Launch the GLI under Windows by selecting Greenstone Digital Library from the Programs section of the Start menu and choosing Librarian Interface. Under Linux, run gli.sh from the gsdl/gli directory. For details on using the Librarian Interface see the Greenstone User's Guide.

What is "the Collector"?

The Collector is a web interface for collection building, altering and exporting. It predates the Librarian Interface and for most practical purposes, the Librarian Interface should be used instead. To begin using the Collector, click the "The Collector" button on your Greenstone home page. For further details on using the Collector see the Greenstone User's Guide.

How do I build a collection from the command line or DOS prompt?

It's occasionally preferable to build your Greenstone collections from the command line rather than from the Librarian Interface, because this gives you greater control over how your new collection turns out. This page has an overview of the collection building process; for detailed step-by-step instructions on building collections from the command line, see the Greenstone Developer's Guide.

I built a new Greenstone collection on my Windows machine. Everything appeared to work fine while building, however when I tried to view the collection some of the documents contained no text. Sometimes Greenstone appeared to crash completely. What have I done wrong?

Are you running Norton Anti-Virus? There are some incompatibilities between Norton and the Greenstone collection building process that cause unpredictable things to happen if you build your collection while Norton is running. Try disabling Norton and rebuilding the collection.

If you do not have Norton or disabling Norton does not solve the problem please contact us for further help.

Why won't the Collector's "export to CD-ROM" function work?

If you downloaded Greenstone from the web you will not have all the components required to make the "export to CD-ROM" function work. These extra components have been made available in a separate download which you can get from the download page.

I'm trying to use the Collector on Windows 2000 but it's running extremely slowly. Is this normal?

Are you using a Netscape web browser with the local library? If so, try using Internet Explorer instead. There are some socket connection problems that show up on Windows 2000 when using Netscape.

What is "the Organizer"?

The Organizer (also called the "Collection Organizer") is a Windows utility used for automatically generating some of the configuration files (metadata.xml, sub.txt etc.) used by complex Greenstone collections.

Where do I get the Organizer?

From the download page.

I'm attempting to build a collection with the Collector but it keeps failing with an error. What am I doing wrong?

There are several reasons why the Collector might fail to build a collection, and the error messages it produces are not always very helpful.

If you changed the default configuration during the configure collection stage, make sure the changes were valid. For example, if you added a new classify or plugin line, check that the classifier and/or plugin names and arguments are all correct. If they're not, the Collector will fail. A good test is to build your collection without changing the configuration. If it builds OK with the default configuration but fails after you change it, look closely at the changes you're making.

Another good approach when you have problems with the Collector is to build your collection from the command line instead. You'll get much more feedback to help debug problems when building this way. For details on how to build a collection from the command line, see the Greenstone Developer's Guide.

What options are available for the collect.cfg file?

See here for a list of all configuration file options.

Where can I find some example collect.cfg configuration files?

The collect.cfg files for many of the collections at www.nzdl.org have been made available here.

How can I build my collection using MGPP?

The MGPP user manual gives some instructions.

How do I fix XML::Parser errors?

Our Linux Greenstone distributions are built on machines using Perl 5.6, and these distributions contain a few binary Perl modules. These cause problems if you are using a more recent version of Perl such as 5.8 or 5.8.1 (type "perl -v" at the command line to see the version).

On the Mac, our distribution contains modules for both perl 5.6 and 5.8 and the correct one should (hopefully) be installed.

A typical error message during import.pl would be:

Uncaught exception from user code: Can't load
'/home/httpd/gsdl/perllib/cpan/auto/XML/Parser/Expat/Expat.so' for module XML::Parser::Expat:/home/httpd/gsdl/perllib/cpan/auto/XML/Parser/Expat/Expat.so:
undefined symbol:PL_sv_undef at /usr/lib/perl5/5.8.0/i386-linux-thread-multi/DynaLoader.pm line 229. at /home/httpd/gsdl/perllib/cpan/XML/Parser.pm line 14

To remedy this, you need to remove the "gsdl/perllib/cpan/perl-5.8/XML" and "gsdl/perllib/cpan/perl-5.8/auto" directories. (For versions earlier than 2.52, remove "gsdl/perllib/cpan/XML" and "gsdl/perllib/cpan/auto".) Then you need to install the perl XML::Parser natively for your system.

On Red Hat or Mandrake, install the .rpm named "perl-XML-Parser"; on Debian, install the "libxml-parser-perl" package. For other Linux distributions, use your distribution's package, or get it from http://search.cpan.org/~msergeant/XML-Parser-2.34/.

You may also need to get Expat, available from http://sourceforge.net/projects/expat/.

An alternative solution that may work is to compile Greenstone from source code on your machine. See this page.

Are there any limits to the size of collections?

The largest collections we have built have been 7 GB of text, and 11 million short documents (about 3 GB of text). These built with no problems. We haven't tried larger amounts of text because we don't have larger amounts of text lying around. It's no good using 7 GB twice over to make 14 GB because the vocabulary hasn't grown accordingly, as it would with a real collection.

There are four main limitations:

  1. Operating system limitations: This is not likely to be an issue on modern operating systems, which have file size limits of around 16 TB (specifically the NTFS file system for Windows and either the ext3 or ext4 file system on Linux). Older file systems, however, have file size limits of 2-4 GB, allowing for a maximum of around 7 GB worth of text before compression.
  2. Technical limitations: There is a Huffman coding limitation on the MG and MGPP indexers which we would expect to run into at collections of around 16 GB. Building with the Lucene indexer however removes this limitation.
  3. Build time limitations: For building a single index on an already-imported collection, extrapolations indicate that on a modern machine with 1 GB of main memory, you should be able to build a 60 GB collection in about 3 days. However, there are often large gaps between theory and practice in this area! The more indexes you have, the longer things take to build.
  4. GLI limitations: The GLI program for building collections with a graphical user interface may be expected to fail for collections smaller than 16 GB if there are large amounts of metadata per record (for example in the case of complex bibliographic records and/or abstracts). Although no benchmarking has been conducted, problems have been experienced for collections approaching 15,000 documents in this case. In the event of such problems, the collection can be readily built in command line mode.

In practice, the solution for very large amounts of data is not to treat the collection as one huge monolith, but to partition it into subcollections and arrange for the search engine to search them all together behind the scenes. However, while you can amalgamate the results of searching subcollections fairly easily, it's much harder with browsing. Of course, A-Z lists and datelists and the like aren't really much use with very large collections. This is where new techniques of hierarchical phrase browsing come into their own. And the really good news is that you can partition a collection into subcollections, each with individual phrase browsers, and arrange to view them all together in a single hierarchical browsing structure, as one coordinated whole. We haven't actually demonstrated this yet, but it seems quite feasible.
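As a rough illustration of amalgamating search results from subcollections, the sketch below merges several ranked result lists into one. It is not Greenstone code: the `(score, doc_id)` tuples and the `merge_ranked` name are hypothetical, and it assumes that scores from different subcollections are directly comparable, which may not hold for a real indexer.

```python
import heapq
from itertools import islice

def merge_ranked(result_lists, limit=10):
    """Merge per-subcollection result lists into one ranked list.

    Each input list holds (score, doc_id) tuples already sorted by
    descending score. Assumption: scores are comparable across
    subcollections.
    """
    merged = heapq.merge(*result_lists, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))

# Hypothetical results from two subcollections.
part_a = [(0.9, "a1"), (0.4, "a2")]
part_b = [(0.7, "b1"), (0.1, "b2")]
print(merge_ranked([part_a, part_b], limit=3))
```

Because each input is already sorted, the merge only ever looks at the head of each list, so the top few results can be produced without reading every hit.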

In 2004 a test collection was built by "Archivo Digital", an office that reports to the "Archivo Nacional de la Memoria" (National Memory Archive in English), in Argentina. It contained sequences of page images with associated OCR text.

Setup details

  • Greenstone version: 2.52
  • Server: Pentium IV 1.8 GHz, 512 MB RAM, Windows XP Prof.
  • Number of indexed documents: 17,655
  • Number of images (tiff format): 980,000
  • Total size of text files: 3.2 GB
  • Built indexes: section:text document:Title
  • Used Plugin: PagedImgPlug
  • 5 classifiers

Statistics

  • Time to import the collection: Almost a week was spent collecting documents and importing them. No image conversion was done.
  • Time to build the collection (excluding import): almost 24 hours. The archives and the indexes were on separate hard disks, to reduce the overhead that reading and writing from the same disk would cause.
  • Time to open a hierarchy node that contains 908 objects: 23 seconds
  • Average Time to search only one word in text index: 2 to 5 seconds
  • Average Time to search 3 words in text index: 2 to 5 seconds
  • Average Time to search exact phrases (includes 4, 5 and 6 words): 30 seconds

Greenstone in practice: The Papers Past collection, a large real life data set

Greenstone has been used to build the digital library collection for the Papers Past initiative of the National Library of New Zealand Te Puna Mātauranga o Aotearoa. The collection contains historic New Zealand newspapers that are out of copyright. According to the Papers Past web site, a third of the collection is now indexed and searchable and the intent is to make all of the contents searchable.

At the start of February 2008, the collection for Papers Past comprised:

  • 1,119,788 newspaper pages, from 207,793 issues. Of those, 91,545 issues—601,516 pages, 6,461,804 articles—have been OCRed to METS/ALTO.
  • As at 06 March 2008, the number of documents—which corresponds to the number of newspaper issues—was 207,844. The space these take up in Greenstone is about 17.25GB (18,524,818,217 bytes).
  • The total built index directory is 87GB. That includes the GDBM databases used to store word coordinates and the Lucene index itself (but no images).
  • The 1,119,788 newspaper images are stored in TIFF format. (The total size of the collection data is still uncertain: it is either 3 TB or, if the images average 500 KB each as estimated, 546 GB of image data.)
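The size figures in the list above can be checked with a little arithmetic. The sketch below uses only the numbers quoted there; the 500 KB per image value is the estimate from the text, not a measurement. Under that estimate the total comes to about 546,771 MB when dividing by 1024, which seems to be where the 546 GB figure comes from. In strictly binary units it is closer to 534 GB.

```python
# Figures quoted in the list above.
PAGE_IMAGES = 1_119_788
ASSUMED_KB_PER_IMAGE = 500        # estimate from the text, not a measurement
DOCUMENT_BYTES = 18_524_818_217   # space the documents take up in Greenstone

def image_total_kb(images=PAGE_IMAGES, kb_each=ASSUMED_KB_PER_IMAGE):
    """Total image size in KB under the per-image estimate."""
    return images * kb_each

total_kb = image_total_kb()
total_mb = total_kb / 1024                    # 1024 KB per MB
total_gb_binary = total_mb / 1024             # 1024 MB per GB
document_gb = DOCUMENT_BYTES / 1024 ** 3      # bytes to GB, binary units

print(f"images: {total_kb} KB = {total_mb:.0f} MB = {total_gb_binary:.0f} GB (binary)")
print(f"documents: {document_gb:.2f} GB")
```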

How do I enter non-English metadata in GLI?

Metadata in the GLI should be entered in UTF-8. If your system doesn't allow typing directly in UTF-8 (your metadata looks like ??? in GLI), then type your metadata in another application such as Notepad, save it as UTF-8, then open it again and cut and paste into GLI. If the metadata has been properly entered in UTF-8, then it should appear fine in a browser once the collection is built.
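If you are unsure whether a file you saved really is UTF-8 before pasting its contents into GLI, a quick check like the following may help. This is a minimal sketch and not part of Greenstone; the function name `is_valid_utf8` is made up for illustration.

```python
def is_valid_utf8(data):
    """Return True if the given bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# The same word encoded two ways: only the first is valid UTF-8.
print(is_valid_utf8("café".encode("utf-8")))
print(is_valid_utf8("café".encode("latin-1")))
```

To check a real file, read it in binary mode, for example `is_valid_utf8(open(path, "rb").read())`, where `path` is whatever file you saved your metadata in.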

If your metadata appears as square boxes in GLI, then you will need to use a different font to display it. You can change the font in GLI by going to File->Preferences. The font that you will need to use depends on what language you are using and what fonts are installed on your computer. A good one to try is Arial Unicode MS, PLAIN, 12.

How do I change the search results order?

The order of search results depends on the kind of query you are running. For simple (MG) collections, search results are either ranked (for a 'some' or ranked search) or in build order (for an 'all' or boolean search). MG cannot do ranking and boolean searching at the same time. For advanced (MGPP) collections, search results can be in ranked or build order as above, but this doesn't depend on the kind of search you are doing. Boolean searches may be ranked.

Build order is the seemingly random order in which documents are processed during import. This can be changed by using the sortmeta option to import.pl. If a metadata element is specified here, then documents will be sorted during import by that metadata.

This option can be passed to import.pl (in GLI Expert mode), or specified in the collect.cfg file. Note that it needs to be added to collect.cfg manually, for example:

sortmeta dc.Date

It cannot be added to the config file using GLI at this stage.

For MGPP collections, in advanced searching mode, the query form has a drop down box specifying "display search results in ranked/natural order". If you have sorted the documents by metadata, then you may like to change the text for this box, e.g. have it display "ranked/date order". To achieve this, add the following line to the collection's collect.cfg file:

collectionmacro query:textnatural "date"

For Lucene collections, sorting does not depend on build order. The user can sort by rank, or by any of the fields that have been indexed (apart from text and allfields). This will be offered automatically by Greenstone. If you are indexing sections and want sorting by field to work, you need to make sure that each section has the appropriate metadata. Use the "-sections_index_document_metadata unless_section_metadata_exists" option to buildcol.pl to give each section all the metadata from the top level document.

What's the difference between MG, MGPP, Lucene?

Greenstone gives you a choice of three indexing tools to index your collection. MG is the default indexer; MGPP and Lucene can be used by turning on "Enable Advanced Searching" in the "Search Types" section of the "Design" panel in the Librarian Interface.

MG
This is the original indexer used by Greenstone, developed mainly by Alistair Moffat and described in the classic book Managing Gigabytes. It does section level indexing, and searches can be boolean or ranked (not both at once). For each index specified in the collection, a separate physical index is created. For phrase searching, Greenstone does an "AND" search on all the terms, then scans the resulting hits to see if the phrase is present. It has been extensively tested on very large collections (many GB of text). MG in Greenstone.
MGPP
This new version of MG (MG plus plus) was developed by the New Zealand Digital Library Project. It does word level indexing, which allows fielded, phrase and proximity searching to be handled by the indexer. Boolean searches can be ranked. Only a single index is created for a Greenstone collection: document/section levels and text/metadata fields are all handled by the one index. For collections with many indexes, this results in a smaller collection size than using MG. For large collections, searching may be a bit slower due to the index being word level rather than section level. MGPP user guide
Lucene
Lucene was developed by the Apache Software Foundation. It handles field and proximity searching, but only at a single level (e.g. complete documents or individual sections, but not both). Therefore document and section indexes for a collection require two separate indexes. It provides a similar range of search functionality to MGPP with the addition of single-character wildcards and range searching. It was added to Greenstone to facilitate incremental collection building, which MG and MGPP can't provide. Lucene home page
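The two-step phrase search described for MG above (an "AND" search on all the terms, followed by a scan of the hits) can be sketched as follows. This is only an illustration of the idea in Python, not MG's actual implementation. The toy index, the whitespace tokenisation and the function names are all assumptions.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def phrase_search(phrase, docs, index):
    """Find documents containing the phrase as consecutive words."""
    terms = phrase.lower().split()
    if not terms:
        return []
    # Step 1: "AND" search, keeping only documents that contain every term.
    candidates = set.intersection(*(index.get(t, set()) for t in terms))
    # Step 2: scan each candidate to see whether the terms are adjacent.
    hits = []
    n = len(terms)
    for doc_id in sorted(candidates):
        words = docs[doc_id].lower().split()
        if any(words[i:i + n] == terms for i in range(len(words) - n + 1)):
            hits.append(doc_id)
    return hits

docs = {
    "a": "the quick brown fox",
    "b": "brown the quick fox jumps",
    "c": "quick silver",
}
index = build_index(docs)
print(phrase_search("quick brown", docs, index))
```

Documents "a" and "b" both pass the AND step, but only "a" survives the scan. A word-level index such as MGPP's can skip the scan because it already records word positions.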


How do I build my collection incrementally?

At the moment, it's best to create and configure your collection using the Librarian Interface, but then do the building phase using the command line. For a brief introduction to command line building, see this page.

You need to use Lucene as your indexer (set this on Design->Search Indexes).

We only support incremental addition, so if you want to change the documents or metadata, then you will need to do a full import and build.

If you change the design of the collection (plugin options, search indexes, classifiers) then you will need to do a full rebuild.

Once you have set up the collection using GLI, it will live in greenstone/collect/collname, where collname is the short collection name shown in brackets in the GLI title bar.

The source documents and metadata live in the import subfolder.

To build the collection the first time, do:

import.pl collname
buildcol.pl collname
rename building to index

Next time you need to update the collection, you can add documents and metadata for those documents directly into the import directory, or using the Librarian Interface.

To add these into the collection, do:

import.pl -incremental collname
buildcol.pl -incremental -builddir <path-to-index-dir> collname

The -incremental option to both import and buildcol tells it not to delete the current archives/building directory, and to only import/index those documents that are new. The builddir option to buildcol is necessary if the building directory has been deleted or renamed. buildcol by default puts the output into building. If you have renamed the initial building directory to index, then you need to tell buildcol to use that directory instead.
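If you rebuild often, the command sequences above can be wrapped in a small script. The sketch below only assembles the same commands shown on this page. It assumes that the Greenstone environment has already been set up in the current shell so that import.pl and buildcol.pl can be found and run directly. The function names are made up for illustration.

```python
import subprocess

def build_commands(collname, incremental=False, builddir=None):
    """Return the import and build commands as argument lists."""
    import_cmd = ["import.pl"]
    build_cmd = ["buildcol.pl"]
    if incremental:
        # Keep the existing archives/building data and add only new documents.
        import_cmd.append("-incremental")
        build_cmd.append("-incremental")
    if builddir is not None:
        # Needed when the building directory has been renamed, e.g. to index.
        build_cmd += ["-builddir", builddir]
    import_cmd.append(collname)
    build_cmd.append(collname)
    return [import_cmd, build_cmd]

def run_build(collname, incremental=False, builddir=None):
    """Run the commands in order, stopping at the first failure."""
    for cmd in build_commands(collname, incremental, builddir):
        subprocess.run(cmd, check=True)

# Print the commands for an incremental update without running them.
# "collname" and the index path are placeholders.
for cmd in build_commands("collname", incremental=True, builddir="path/to/index"):
    print(" ".join(cmd))
```

Keeping command construction separate from execution makes it easy to print and inspect the commands before running anything against a real collection.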


Can I build collections using the Librarian Interface on a remote server?

The Greenstone installation running on the server will need to be set up for remote collection building. Full instructions on how to set this up and use it can be found on the remote building page.

What is the Depositor?

The depositor is a web interface for adding new documents, along with metadata, to existing collections. Please see this page for more information about how to enable and use the Depositor.

Can I get any information about the metadata coverage in my collection?

Metadata coverage statistics can be gathered during collection building by adding the line

store_metadata_coverage true

to the collection's etc/collect.cfg file. Rebuild the collection (there is no need to reimport), and the collection's GDBM database will then contain the following information in the 'collection' entry. The examples are from the demo collection.

  • Which metadata sets have been used in the collection
<metadataset>dls
<metadataset>ex
  • Which elements are present in each metadata set.
<metadatalist-ex>URL
<metadatalist-ex>Plugin
<metadatalist-ex>Encoding
<metadatalist-ex>Language
<metadatalist-ex>SourceFile
<metadatalist-ex>Source
<metadatalist-ex>FileSize
<metadatalist-ex>Title
<metadatalist-dls>Subject
<metadatalist-dls>Language
<metadatalist-dls>Keyword
<metadatalist-dls>Organization
<metadatalist-dls>Title
  • The frequency of each metadata element.
<metadatafreq-dls-Subject>17
<metadatafreq-dls-Title>11
<metadatafreq-dls-Organization>11
<metadatafreq-dls-Keyword>6
<metadatafreq-dls-Language>11
<metadatafreq-ex-SourceFile>11
<metadatafreq-ex-Plugin>11
<metadatafreq-ex-URL>11
<metadatafreq-ex-Title>11
<metadatafreq-ex-Encoding>11
<metadatafreq-ex-FileSize>11
<metadatafreq-ex-Language>11
<metadatafreq-ex-Source>11

Note, to view all the entries in the GDBM database, run

db2txt path-to-collection/index/text/collname.gdb > database.txt
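Once the database has been dumped to a text file, the frequency entries can be pulled out with a few lines of script. The sketch below assumes the entries appear one per line in exactly the form shown in the examples above. Other lines are ignored. The function name and the "set.element" key format are choices made for this illustration.

```python
import re

# Matches lines such as <metadatafreq-dls-Subject>17
FREQ_LINE = re.compile(r"^<metadatafreq-([^->]+)-([^>]+)>(\d+)$")

def parse_coverage(lines):
    """Return {"set.element": count} for every metadatafreq line."""
    freq = {}
    for line in lines:
        match = FREQ_LINE.match(line.strip())
        if match:
            set_name, element, count = match.groups()
            freq[set_name + "." + element] = int(count)
    return freq

sample = [
    "<metadataset>dls",
    "<metadatalist-dls>Subject",
    "<metadatafreq-dls-Subject>17",
    "<metadatafreq-ex-Title>11",
]
print(parse_coverage(sample))
```

To use it on a real dump, pass the open file in place of the sample list, for example `parse_coverage(open("database.txt", encoding="utf-8"))`.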

How to manually specify filenames in metadata.xml

If you're writing your own metadata.xml files that will specify what metadata is attached to which folders and files, you will need to specify the <FileName> element as a regular expression and any filepaths must be in URI format (which uses forward slashes). Because such filepaths represent regular expressions, backslashes can be used to escape special characters, e.g. "\." means the literal full-stop character.

An example of a valid metadata.xml file is:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE DirectoryMetadata SYSTEM "http://greenstone.org/dtd/DirectoryMetadata/1.0/DirectoryMetadata.dtd">
<DirectoryMetadata>
    <FileSet>
        <FileName>pinky/golala/filename1\.txt</FileName>
        <Description>
            <Metadata name="dc.Title">Lala</Metadata>
        </Description>
    </FileSet>
    <FileSet>
        <FileName>pinky/nono/filename2\.txt</FileName>
        <Description>
            <Metadata name="dc.Title">Nono</Metadata>
        </Description>
    </FileSet>
    <FileSet>
        <FileName>pinky/toto/filename3\.txt</FileName>
        <Description>
            <Metadata name="dc.Title">Toto</Metadata>
        </Description>       
    </FileSet>
</DirectoryMetadata>
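When generating metadata.xml files from a script, each file path has to be turned into a FileName pattern of the kind shown above. The following is a minimal sketch of that conversion: it switches backslashes to forward slashes and escapes characters that are special in regular expressions. The function name and the exact set of escaped characters are assumptions. Characters that are special in XML, such as & and <, would also need escaping before being written to the file.

```python
# Characters treated as special in regular expressions (assumed set).
REGEX_SPECIALS = set(".^$*+?()[]{}|")

def to_filename_pattern(path):
    """Convert a file path into a FileName-style regular expression."""
    # Use URI-style forward slashes, whatever the source platform.
    uri_style = path.replace("\\", "/")
    # Put a backslash in front of each regex special character.
    return "".join("\\" + ch if ch in REGEX_SPECIALS else ch for ch in uri_style)

print(to_filename_pattern("pinky\\golala\\filename1.txt"))
```

For the Windows-style path above this prints `pinky/golala/filename1\.txt`, the same pattern used in the first FileSet of the example.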