Building Greenstone collections

= Using GLI: Greenstone Librarian Interface =

What is the "Greenstone Librarian Interface"?
The Greenstone Librarian Interface (GLI) is a graphical tool for building new collections, altering or deleting existing collections, and exporting existing collections to stand-alone CD-ROMs. It allows you to import or assign metadata, and has an interactive collection design module. Launch the GLI under Windows by selecting Greenstone Digital Library from the Programs section of the Start menu and choosing Librarian Interface. Under Linux, run gli.sh from the gsdl/gli directory. For details on using the Librarian Interface see the Greenstone User's Guide.

How do I enter non-English metadata in GLI?
Metadata in the GLI should be entered in UTF-8. If your system doesn't allow typing directly in UTF-8 (your metadata looks like ??? in GLI), then type your metadata in another application such as Notepad, save it as UTF-8, then open it again and cut and paste into GLI. If the metadata has been properly entered in UTF-8, then it should appear fine in a browser once the collection is built.

If your metadata appears as square boxes in GLI, then you will need to use a different font to display it. You can change the font in GLI by going to File->Preferences. The font that you will need to use depends on what language you are using and what fonts are installed on your computer. A good one to try is Arial Unicode MS, PLAIN, 12.

Running GLI in debug mode: GLI failed to start up properly, how do I find out what just happened?
Sometimes you run GLI and it doesn't start up: a black window appears and then disappears, and you're left wondering what just happened. At other times you get a whole sequence of error messages, none of which may be the root problem, after which GLI fails to work.

In order to help the Greenstone team find out what went wrong, you can run GLI in debug mode, which will print out the errors as it runs.

To do so:

1. Make sure to run GLI from a terminal instead of from the Windows Start menu, by opening a DOS prompt (see Miscellaneous_Questions).

2. Next, use your DOS prompt to Change Directory (cd) into your Greenstone installation's gli folder. For example, if your Greenstone is installed in C:\Program Files\Greenstone, you'd type the following line and hit Enter: cd "C:\Program Files\Greenstone\gli"

3. Now run GLI in debug mode by typing the following: gli.bat -debug

On Linux and Mac, type: ./gli.sh -debug

4. The black DOS screen will no longer flash and disappear, but will remain in the background. If GLI starts up momentarily, several additional dialogs (error consoles) will open alongside it. When GLI finally fails, errors are likely to be printed to the black DOS window. If they are, copy the error messages and send them to the Greenstone mailing list, as explained in Miscellaneous_Questions.
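On Linux and Mac, the debug output can scroll past quickly, so it may help to save a copy while still watching it in the terminal. The following is a minimal sketch, assuming a POSIX shell. The function name run_gli_debug and the default log file name gli-debug.log are illustrative choices and not part of Greenstone; only the -debug flag comes from the steps above.

```shell
# Sketch: run GLI in debug mode and keep a copy of everything it prints.
# run_gli_debug and gli-debug.log are hypothetical names chosen for this example.
run_gli_debug() {
    gli_dir="$1"                          # path to your Greenstone gli folder
    log_file="${2:-$PWD/gli-debug.log}"   # where to save the output
    # Run in a subshell so the caller's working directory is unchanged.
    # 2>&1 merges error messages into normal output; tee shows and saves both.
    ( cd "$gli_dir" && ./gli.sh -debug 2>&1 ) | tee "$log_file"
}

# Example (adjust the path to your installation):
#   run_gli_debug /home/me/greenstone/gli
```

The saved log file can then be attached to, or pasted into, a message to the mailing list.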

Getting better error reporting in GLI (the Greenstone Librarian Interface)
Sometimes, GLI runs and tries to build your collection, but something goes wrong. The build log tells you that some error occurred and your collection is left unbuilt. In cases like these, when you contact the mailing list, it is helpful to provide as much of the error information as is available. To do this, you want to increase the verbosity of the build process and error reporting.

1. In GLI, go to File > Preferences > Mode tab. Set the mode to Expert.

2. Then, turn on the verbosity for both import and building as follows:
 * In GLI's Create tab, click the "Import Options" on the left. On the right, scroll down to the Verbosity field and tick it. Then set its text field value to 5.
 * Do the same for the "Build Options", once again on the left part of the Create tab.

3. Before rebuilding, still on the left of the Create tab, choose "Message log" to return to viewing the build output.

4. Now rebuild the collection once more and you will hopefully get more error reporting. Copy the relevant portions of the error output (using Ctrl+C), and paste them into a new email message (with Ctrl+V) and mail it to us.

= Manually Using the Command Line =

How do I build a collection from the command line or DOS prompt?
It's occasionally preferable to build your Greenstone collections from the command line rather than from the Librarian Interface. This allows you greater control over how your new collection turns out. This page has an overview of the collection building process. For detailed step-by-step instructions on building collections from the command line, see the Greenstone Developer's Guide.

How do I manually (re)build a collection?
When GLI fails mysteriously, you can try manually rebuilding a collection to let us know if the problem is with GLI or with the underlying building scripts. Manually building a collection is also helpful if GLI for some reason will not rebuild a collection.

1. On Linux and Mac, open a terminal. On Windows you would open a DOS console.

2. Use the terminal or DOS console to Change Directory into your Greenstone installation folder. Let's assume you have Greenstone installed in C:\Program Files\Greenstone. If so, you'd type the following line and hit Enter: cd "C:\Program Files\Greenstone"

On Linux your slashes will go the other way. For example, /home/me/greenstone.

3. Now you've set your DOS prompt to be in your Greenstone installation folder. At this point, you want to run some Greenstone scripts, starting with the script that sets up the Greenstone environment. On Windows, type the following and hit Enter: setup.bat

On Linux and Mac, type: source setup.bash

4. Now that you've set up your Greenstone's environment:

If you're rebuilding an existing collection, skip to step 5 just below. If you want to create a new collection first, do the following: perl -S mkcol.pl COLLECTION-NAME

This will create an empty collection. You need to use your file browser to copy across the files you want to gather into your collection. Do this by copying the selected files into your Greenstone installation's collect/COLLECTION-NAME/import folder.

5. To build your collection manually:
 * EITHER: In newer versions of Greenstone

Type the following, substituting the name of the collection you wish to build, and then hit Enter: perl -S full-rebuild.pl COLLECTION-NAME

On Linux and Mac, you can leave out the perl -S at the start.

If you want more error reporting, use: perl -S full-rebuild.pl -verbosity 5 COLLECTION-NAME
 * OR: In older versions of Greenstone

You'll run the 2 scripts that perform the 2 stages of the Greenstone build operation (import and buildcol). Type the following, substituting the name of the collection you wish to build, then hit Enter:

a. First the import phase of building a collection: perl -S import.pl COLLECTION-NAME

On Linux and Mac, you can leave out the "perl -S" at the start.

If you want more error reporting, use: perl -S import.pl -verbosity 5 COLLECTION-NAME

b. Then the buildcol step: perl -S buildcol.pl COLLECTION-NAME

If you want more error reporting, use: perl -S buildcol.pl -verbosity 5 COLLECTION-NAME

c. This will have generated a folder called "building" inside your collection's folder. You will need to rename it to "index" (after either deleting your old index folder or moving it out of the way). Use Windows Explorer to navigate into your Greenstone installation's collect/COLLECTION-NAME directory and to rename your collection's new "building" folder to "index".

d. If any errors appear in the terminal, copy and paste them into your message to the Greenstone mailing list when explaining your problem. If your terminal is a Windows DOS prompt, see Miscellaneous_Questions.

6. Finally, run the Greenstone server and see if your collection looks okay.
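For Linux and Mac, steps a to c of the older two-stage procedure can be sketched as a small shell script. This is only an illustration under stated assumptions: the function names activate_build and manual_build are hypothetical, the script assumes setup.bash has already been sourced in the same shell, and it keeps the old index folder under a timestamped name rather than deleting it.

```shell
# Sketch: the two-stage build for older Greenstone versions on Linux or Mac.
# Function names are hypothetical. Assumes setup.bash has already been
# sourced in this shell so that import.pl and buildcol.pl can be found.

# Rename the freshly built "building" folder to "index" (step c).
activate_build() {
    coll_dir="$1"      # e.g. /home/me/greenstone/collect/COLLECTION-NAME
    if [ ! -d "$coll_dir/building" ]; then
        echo "No building folder in $coll_dir" >&2
        return 1
    fi
    # Move any old index out of the way instead of deleting it.
    if [ -d "$coll_dir/index" ]; then
        mv "$coll_dir/index" "$coll_dir/index.old.$(date +%Y%m%d%H%M%S)" || return 1
    fi
    mv "$coll_dir/building" "$coll_dir/index"
}

# Run import, then buildcol, then activate the result (steps a to c).
manual_build() {
    gsdl_home="$1"     # Greenstone installation folder
    coll="$2"          # COLLECTION-NAME
    perl -S import.pl -verbosity 5 "$coll" || return 1
    perl -S buildcol.pl -verbosity 5 "$coll" || return 1
    activate_build "$gsdl_home/collect/$coll"
}

# Example (adjust to your installation):
#   manual_build /home/me/greenstone COLLECTION-NAME
```

Stopping at the first failing stage keeps the existing index folder untouched, so the previous build remains viewable while you investigate the errors.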

= Remotely =

Can I build collections using the Librarian Interface on a remote server?
Yes, but the Greenstone installation running on the server will need to be set up for remote collection building. Full instructions on how to set this up and use it can be found on the remote building page.

What is the Depositor?
The depositor is a web interface for adding new documents, along with metadata, to existing collections. Please see this page for more information about how to enable and use the Depositor.

What is "the Collector"?
[Deprecated] The Collector is a web interface for collection building, altering and exporting. It predates the Librarian Interface and for most practical purposes, the Librarian Interface should be used instead. To begin using the Collector, click the "The Collector" button on your Greenstone home page. For further details on using the Collector see the Greenstone User's Guide.

I'm trying to use the Collector on Windows 2000 but it's running extremely slowly. Is this normal?
Are you using a Netscape web browser with the local library? If so, try using Internet Explorer instead. There are some socket connection problems that show up on Windows 2000 when using Netscape.

I'm attempting to build a collection with the Collector but it keeps failing with an error. What am I doing wrong?
There are several reasons that the Collector might fail to build a collection, and the error messages it produces are not always very helpful.

If you changed the default configuration during the configure collection stage, you'll need to make sure the changes were valid. For example, if you added a new classify or plugin line, you'll need to make sure that the classifier and/or plugin names and arguments are all correct. If they're not, the Collector will fail. A good test is to build your collection without changing the configuration. If it builds OK with the default configuration but fails after you change the configuration, you'll need to look closely at the changes you're making.

Another good thing to do if you are having problems with the Collector is to build your collection from the command line instead. You'll get much more feedback to help debug problems when building in this way. For details on how to build a collection from the command line, see the Greenstone Developer's Guide.

==I built a new Greenstone collection on my Windows machine. Everything appeared to work fine while building, however when I tried to view the collection some of the documents contained no text. Sometimes Greenstone appeared to crash completely. What have I done wrong?==
Are you running Norton Anti-Virus? There are some incompatibilities between Norton and the Greenstone collection building process that cause unpredictable things to happen if you build your collection while Norton is running. Try disabling Norton and rebuilding the collection.

If you do not have Norton, or disabling Norton does not solve the problem, please contact us for further help.

= Building Issues and Suggested Solutions =

Windows Building error: can't spawn cmd.exe
When building a collection in GLI on Windows, if you see an error message like the following in the build log output: Can't spawn "cmd.exe": No such file or directory ... then try the steps below. This error has been reported by some users. It appears to occur because %SystemRoot%, an environment variable used in the %PATH% which evaluates to C:\windows, isn't being passed in to the perl code from GLI. As a result, the PATH does not contain c:\windows\system32 which contains cmd.exe used by perl to execute commands. However, this problem does not appear to manifest for all Windows users of Greenstone.

The solution that has worked for the 2 members of the mailing list who reported this problem is as follows.

1. In a text editor, open the file gli.bat, which is located in your Greenstone installation's "gli" folder.

2. Near the top, just after set GLILANG=en add the following line: set PATH=c:\windows\system32;%PATH%

The above manually prefixes "c:\windows\system32;" to the PATH during GLI's execution.

3. Now try running GLI and building the collection again.

= General =

How do I build my collection incrementally?
If you are using GLI, you need to select "Minimal Rebuild" on the Create Pane.

If you are building on the command line, you can use incremental-rebuild.pl, incremental-import.pl and incremental-buildcol.pl in place of full-rebuild.pl, import.pl and buildcol.pl.

Incremental importing: New documents will be imported. Modified documents will be re-imported. Deleted documents will be removed from the collection. If metadata has changed, then documents will be reimported.

Important note for collection design: GLI can notice that metadata in a folder has been added or changed, but it is not smart enough to tell which documents in the folder have changed metadata. Therefore, if metadata in a folder has changed (including new metadata being added), then all documents in that folder will be reimported. This means that if you have all your documents in the top level import folder, adding new metadata or changing any metadata for any document will result in all documents being reimported. If you intend to do incremental import, then please organise your documents into subfolders. That way modifying metadata for some documents won't result in all other documents being reimported.

Incremental indexing: Currently only the Lucene indexer can do incremental indexing. If you are using MG/MGPP then a full buildcol pass will be done, even if incremental-buildcol.pl is used.

If the collection design has changed, then you will need to do a full rebuild. Changes to plugin options and to some import options will necessitate a full import. Changes to search indexes, partition indexes or browsing classifiers will necessitate a full buildcol.

Note, changes on the Format pane do not require a rebuild at all.
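On the command line, the choice between an incremental and a full rebuild can be sketched as a small shell helper. This is an illustration only: rebuild_collection and the GS_DRY_RUN variable are hypothetical names introduced for this example, and the helper assumes the Greenstone environment has already been set up in the current shell. The script names incremental-rebuild.pl and full-rebuild.pl are the ones mentioned above.

```shell
# Sketch: pick the rebuild script according to what has changed.
# rebuild_collection and GS_DRY_RUN are hypothetical names for this example.
# Assumes the Greenstone environment has been set up (setup.bash sourced).
rebuild_collection() {
    coll="$1"                   # COLLECTION-NAME
    mode="${2:-incremental}"    # "incremental" or "full"
    case "$mode" in
        incremental) script=incremental-rebuild.pl ;;  # documents or metadata changed
        full)        script=full-rebuild.pl ;;         # collection design changed
        *) echo "Unknown mode: $mode" >&2; return 2 ;;
    esac
    if [ -n "${GS_DRY_RUN:-}" ]; then
        echo "perl -S $script $coll"   # show the command without running it
    else
        perl -S "$script" "$coll"
    fi
}

# Examples:
#   rebuild_collection COLLECTION-NAME          # incremental rebuild
#   rebuild_collection COLLECTION-NAME full     # after changing indexes or plugins
#   GS_DRY_RUN=1 rebuild_collection COLLECTION-NAME full   # only print the command
```

Defaulting to the incremental script reflects the common case of adding or editing documents; the full mode is there for changes to plugin options, indexes or classifiers.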

What options are available for the collect.cfg file?
See here for a list of all configuration file options.

Where can I find some example collect.cfg configuration files?
The collect.cfg files for many of the collections at www.nzdl.org have been made available here.

How do I fix XML::Parser errors?
Our Mac OS X Greenstone distributions are built on machines using Perl 5.6, and these distributions contain a few binary perl modules. These cause problems if you are using a recent version of perl like 5.8 or 5.8.1 (you can type "perl -v" from the command line to see the version).

On the Mac, our distribution contains modules for both perl 5.6 and 5.8 and the correct one should (hopefully) be installed.

A typical error message during import.pl would be:
 * Uncaught exception from user code: Can't load '/home/httpd/gsdl/perllib/cpan/auto/XML/Parser/Expat/Expat.so' for module XML::Parser::Expat:/home/httpd/gsdl/perllib/cpan/auto/XML/Parser/Expat/Expat.so: undefined symbol:PL_sv_undef at /usr/lib/perl5/5.8.0/i386-linux-thread-multi/DynaLoader.pm line 229. at /home/httpd/gsdl/perllib/cpan/XML/Parser.pm line 14

To remedy this, you need to remove the "gsdl/perllib/cpan/perl-5.8/XML" and "gsdl/perllib/cpan/perl-5.8/auto" directories. (For versions earlier than 2.52, remove "gsdl/perllib/cpan/XML" and "gsdl/perllib/cpan/auto".) Then you need to install the perl XML::Parser natively for your system.

On Red Hat or Mandrake, install the .rpm named "perl-XML-Parser". On Debian, install the "libxml-parser-perl" package. For other Linux distributions, use your distribution's package, or you can get it from http://search.cpan.org/~msergeant/XML-Parser-2.34/.

You may also need to get Expat, available from http://sourceforge.net/projects/expat/.

An alternative solution that may work is to compile Greenstone from source code on your machine. See this page.
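After installing XML::Parser natively, a quick way to see whether your system perl can load it is sketched below. This is a minimal sketch, assuming a POSIX shell with perl on the PATH; perl_module_loads is a hypothetical helper name. Note that it tests the module installed for your system perl, not the copy bundled inside the Greenstone installation.

```shell
# Sketch: check whether a Perl module loads with the system perl.
# perl_module_loads is a hypothetical helper name for this example.
perl_module_loads() {
    # -M loads the named module; -e 1 runs a trivial program.
    # The exit status is non-zero if the module, or a binary part of it
    # such as Expat.so, cannot be loaded.
    perl -M"$1" -e 1 2>/dev/null
}

# Examples:
#   perl -v                                   # shows the perl version
#   perl_module_loads XML::Parser && echo "XML::Parser loads"
#   perl_module_loads XML::Parser || echo "XML::Parser is missing or broken"
```

If the check fails, running perl -MXML::Parser -e 1 directly, without the redirection, shows the underlying error message.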

Are there any limits to the size of collections?
The largest collections we have built have been 7 GB of text, and 11 million short documents (about 3 GB of text). These built with no problems. We haven't tried larger amounts of text because we don't have larger amounts of text lying around. It's no good using 7 GB twice over to make 14 GB because the vocabulary hasn't grown accordingly, as it would with a real collection.

There are four main limitations:
 * Operating system limitations: This is not likely to be an issue on modern operating systems, whose file systems (specifically NTFS on Windows and ext3 or ext4 on Linux) have file size limits of around 16 TB. Older file systems, however, have file size limits of 2-4 GB, allowing for a maximum of around 7 GB worth of text before compression.
 * Technical limitations: There is a Huffman coding limitation on the MG and MGPP indexers which we would expect to run into at collections of around 16 GB. Building with the Lucene indexer however removes this limitation.
 * Build time limitations: For building a single index on an already-imported collection, extrapolations indicate that on a modern machine with 1 GB of main memory, you should be able to build a 60 GB collection in about 3 days. However, there are often large gaps between theory and practice in this area! The more indexes you have, the longer things take to build.
 * GLI limitations: The GLI program for building collections with a graphical user interface may be expected to fail for collections smaller than 16 GB if there are large amounts of metadata per record (for example in the case of complex bibliographic records and/or abstracts). Although no benchmarking has been conducted, problems have been experienced for collections approaching 15,000 documents in this case. In the event of such problems, the collection can be readily built in command line mode.

In practice, the solution for very large amounts of data is not to treat the collection as one huge monolith, but to partition it into subcollections and arrange for the search engine to search them all together behind the scenes. However, while you can amalgamate the results of searching subcollections fairly easily, it's much harder with browsing. Of course, A-Z lists and datelists and the like aren't really much use with very large collections. This is where new techniques of hierarchical phrase browsing come into their own. And the really good news is that you can partition a collection into subcollections, each with individual phrase browsers, and arrange to view them all together in a single hierarchical browsing structure, as one coordinated whole. We haven't actually demonstrated this yet, but it seems quite feasible.

In 2004 a test collection was built by "Archivo Digital", an office that depends on the "Archivo Nacional de la Memoria" (National Memory Archive in English), in Argentina. It contained sequences of page images with associated OCR text.

Setup details
 * Greenstone version: 2.52
 * Server: Pentium IV 1.8 GHz, 512 MB RAM, Windows XP Prof.
 * Number of indexed documents: 17,655
 * Number of images (tiff format): 980,000
 * Total size of text files: 3.2 GB
 * Built indexes: section:text document:Title
 * Used Plugin: PagedImgPlug
 * 5 classifiers

Statistics
 * Time to import the collection: Almost a week was spent collecting documents and importing them. No image conversion was done.
 * Time to build the collection (excluding import): almost 24 hours. The archives and the indexes were on separate hard disks, to reduce the overhead that reading and writing from the same disk would cause.
 * Time to open a hierarchy node that contains 908 objects: 23 seconds
 * Average Time to search only one word in text index: 2 to 5 seconds
 * Average Time to search 3 words in text index: 2 to 5 seconds
 * Average Time to search exact phrases (includes 4, 5 and 6 words): 30 seconds

Greenstone in practice: The Papers Past collection, a large real-life data set
Greenstone has been used to build the digital library collection for the Papers Past initiative of the National Library of New Zealand Te Puna Mātauranga o Aotearoa. The collection contains historic New Zealand newspapers that are out of copyright. According to the Papers Past web site, a third of the collection is now indexed and searchable and the intent is to make all of the contents searchable.

At the start of February 2008, the collection for Papers Past comprised:
 * 1,119,788 newspaper pages, from 207,793 issues. Of those, 91,545 issues (601,516 pages, 6,461,804 articles) have been OCRed to METS/ALTO.
 * As at 06 March 2008, the number of documents, which corresponds to the number of newspaper issues, was 207,844. The space these take up in Greenstone is about 17.25 GB (18,524,818,217 bytes).
 * The total built index directory is 87 GB. That includes the GDBM databases used to store word coordinates and the Lucene index itself (but no images).
 * The 1,119,788 newspaper images are stored in TIFF format. (The total size of the collection data is still uncertain: it is either 3 TB or, if the images average 500 KB each as they have been estimated to, 546 GB of image data.)

What is "the Organizer"?
The Organizer (also called the "Collection Organizer") is a Windows utility used for automatically generating some of the configuration files (metadata.xml, sub.txt etc.) used by complex Greenstone collections.

Where do I get the Organizer?
From the download page.

Are there really multiple copies of my resource documents stored in my Greenstone installation?
You've just built your collection of documents and discover that there appears to be a copy of the documents you gathered not just in your import folder but in the archives and index folders too. Surely this must be taking up three times the space needed? Fortunately, Greenstone uses hard links not just on Linux but also on Windows, so in reality Greenstone keeps just the one set of your documents and hard-links to these instead of making copies. Hard links are like shortcuts, but your operating system sees the hard-linked items (that are located elsewhere) as being "really" there.

The reason for the confusion is that, by default, Windows doesn't show you when files on your filesystem are hard-linked. If you choose to install the Windows extension program Link Shell Extension (LSE), it will put red arrows on files that are hard linked.
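On Linux and Mac, you can check from a terminal whether two files are hard links to the same data. The following is a minimal sketch, assuming a POSIX shell; same_file is a hypothetical helper name, and it compares inode numbers, so it assumes both paths are on the same file system. The example paths are placeholders and not actual Greenstone file locations.

```shell
# Sketch: report whether two paths are hard links to the same file.
# same_file is a hypothetical helper; it compares inode numbers and so
# assumes both paths are on the same file system.
same_file() {
    inode_a=$(ls -di "$1" 2>/dev/null | awk '{print $1}')
    inode_b=$(ls -di "$2" 2>/dev/null | awk '{print $1}')
    # Succeed only if both inode numbers were found and they match.
    [ -n "$inode_a" ] && [ "$inode_a" = "$inode_b" ]
}

# Example with placeholder paths:
#   same_file path/in/import/doc.pdf path/in/archives/doc.pdf \
#       && echo "hard-linked: stored only once on disk"
```

The second column of ls -l output shows the link count of a file; a count greater than 1 also indicates that the file has more than one hard-linked name.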