Advanced collection building

In the GLI, the actual building of the collection is done in the Create panel. There are many options associated with the building process. Many of these are available in the GLI when in expert mode (File → Preferences → Mode → Expert). To access the full list of options, build from the command line.

Importing

Building a collection is split into two stages: importing and building. During the import stage, the documents and metadata are brought into the collection: documents are processed by their respective plugins, metadata is extracted, and both assigned and extracted metadata are stored in a format Greenstone can use.

You have several import options available in the GLI:

Greenstone can store metadata in two different formats: GreenstoneXML (default) and GreenstoneMETS. Use the saveas option to change the storage format. (Note that you must also ensure the correct plugin, GreenstoneXMLPlugin or GreenstoneMETSPlugin, is in the Document Plugins list. See Greenstone metadata formats for more information.)

Logs: Logs provide information on what has occurred during the import/build process. The main message log, which appears in the Create panel while your collection is being built, is also written to a file in the collection's log folder. You can control how detailed the log is by changing the verbosity: the higher the verbosity (from 0 to 3), the more information is printed in the log. This is especially helpful when there are errors. There is a second log file, the faillog, which contains only a list of documents that failed to be processed. By default, this file is called FAIL.log and is located in the collection's etc folder. You can change the name and location of the fail log by putting a complete file path in the faillog option.

maxdocs allows you to set the maximum number of documents to process. This is useful when you are designing your collection. If you have a large collection, importing all of the documents and building the collection can take a very long time. By setting a low number for maxdocs, you can build a subset of the collection quickly to see design changes you have made.
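As a rough illustration of how these options map onto a command-line import, the sketch below assembles an argument list for Greenstone's import.pl script without running it. The helper name import_command, the collection name "mycol", and the single-dash option form are assumptions; check the script's own usage output for your version, and note that a Greenstone3 installation may need further arguments.

```python
def import_command(collection, maxdocs=None, verbosity=None, faillog=None):
    """Assemble an import.pl argument list (hypothetical helper; nothing is executed)."""
    args = ["import.pl"]
    if maxdocs is not None:
        args += ["-maxdocs", str(maxdocs)]      # process at most this many documents
    if verbosity is not None:
        if not 0 <= verbosity <= 3:
            raise ValueError("verbosity is expected to be between 0 and 3")
        args += ["-verbosity", str(verbosity)]  # higher values log more detail
    if faillog is not None:
        args += ["-faillog", faillog]           # complete path for the fail log
    args.append(collection)                     # assumed: collection name goes last
    return args


# Quick test build: only 50 documents, with the most detailed log.
print(" ".join(import_command("mycol", maxdocs=50, verbosity=3)))
```

Keeping the arguments in a list rather than one string makes it straightforward to hand them to a process launcher later without worrying about quoting.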

sortmeta will import the documents in order depending on the value of one or more metadata fields. This is useful for giving some order to non-ranked search results, as the results will be displayed in the order they were imported. (removeprefix and removesuffix can be used in conjunction with sortmeta.)
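To make the effect of sortmeta concrete, here is a small sketch that orders documents by one or more metadata fields, optionally ignoring a leading pattern in each value. It is not Greenstone's implementation: the sort_for_import helper, the sample records, and the treatment of removeprefix as a regular expression matched at the start of each value are assumptions for illustration.

```python
import re


def sort_for_import(docs, fields, removeprefix=None):
    """Return docs ordered by the given metadata fields (illustrative only)."""
    prefix = re.compile(removeprefix) if removeprefix else None

    def key(doc):
        values = []
        for field in fields:
            value = doc.get(field, "")
            if prefix:
                # Strip the pattern only when it matches at the start of the value.
                match = prefix.match(value)
                if match:
                    value = value[match.end():]
            values.append(value.lower())
        return tuple(values)

    return sorted(docs, key=key)


docs = [
    {"dc.Title": "The Apple"},
    {"dc.Title": "Banana"},
    {"dc.Title": "A Cherry"},
]
ordered = sort_for_import(docs, ["dc.Title"], removeprefix=r"(The|An?)\s+")
print([d["dc.Title"] for d in ordered])
```

Without the prefix pattern, "A Cherry" would sort first and "The Apple" last; with it, the leading articles are ignored when comparing titles.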

Every document in the collection must have a unique object identifier (OID). Greenstone gives you four options for assigning this identifier (OIDtype):

Greenstone3

Greenstone3 → gs2build → perllib → doc.pm

Greenstone2

Greenstone3 → perllib → doc.pm

by replacing the "D" in the code below:

	} elsif ($self->{'OIDtype'} eq "incremental") {
	    $OID = "D" . $OIDcount;
	    $OIDcount ++;

(contributed by Michael Dewsnip & Stephen)
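The effect of that change can be sketched outside Greenstone. The following is an illustrative Python restatement of the Perl fragment above, not code from doc.pm: the incremental_oids helper, the alternative prefix "DOC", and the starting count of 1 are all assumptions.

```python
from itertools import count


def incremental_oids(prefix="D", start=1):
    """Yield identifiers made of a fixed prefix and a running document count."""
    for n in count(start):
        yield f"{prefix}{n}"


default_ids = incremental_oids()
print([next(default_ids) for _ in range(3)])   # prefix "D", as in the fragment above

custom_ids = incremental_oids(prefix="DOC")
print([next(custom_ids) for _ in range(3)])    # what replacing "D" would change
```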

Building

The actual building part of the build process takes the imported documents and metadata and builds the indexes, classifiers, and partitions specified in the Design panel.

Options include only building specific indexes (index) or only doing a specific part of the build process (mode), and removing classifiers that are empty (remove_empty_classifications). You can also specify how detailed the log is for the build process (verbosity) and the name of the file containing a list of documents that failed to process (faillog).

Greenstone3

Greenstone2

Scheduling builds

Greenstone allows you to set your collection to automatically rebuild either hourly, daily, or weekly. In Schedule Options in the Create panel, check schedule, select the frequency, and, under action, specify whether you are adding a new schedule, or modifying or deleting an existing schedule. Then click Schedule Action. Read more about scheduled building.

Incremental Building

Incremental building is a way to import only new or modified documents and update the indexes, instead of completely rebuilding them. This can save a lot of time, especially with larger collections. Incremental building can be done in the GLI by selecting the Minimal Rebuild option on the Create panel. Incremental building can only be performed in certain circumstances (for instance, you must be using Lucene if you want to incrementally build indexes). If the conditions aren't met, Greenstone will simply end up doing a complete rebuild instead. You can read more about incremental building to understand when it will work and to take full advantage of it.
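The idea behind a minimal rebuild can be illustrated with a short sketch that picks out only the files changed since the previous import. This is not how Greenstone tracks changes internally; the changed_since helper and the comparison of file modification times against a stored timestamp are assumptions made for illustration.

```python
import os
import tempfile
import time


def changed_since(folder, last_build_time):
    """Return sorted relative paths of files modified after last_build_time."""
    changed = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) > last_build_time:
                changed.append(os.path.relpath(path, folder))
    return sorted(changed)


with tempfile.TemporaryDirectory() as import_dir:
    old = os.path.join(import_dir, "old.txt")
    new = os.path.join(import_dir, "new.txt")
    for path in (old, new):
        with open(path, "w") as handle:
            handle.write("example document\n")

    last_build = time.time() - 3600                     # pretend the last build was an hour ago
    os.utime(old, (last_build - 60, last_build - 60))   # untouched since before that build
    print(changed_since(import_dir, last_build))        # only new.txt qualifies
```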

Apparent multiple copies of documents

When you look at a built collection, you may notice that there appear to be three copies of each document: in the import, archives, and index folders. This is not the case. Greenstone uses hard links instead of making copies. Hard links are like shortcuts, but your computer sees the hard-linked items (that are located elsewhere) as being "really" there. This gets confusing on Windows, because Windows doesn't show you when files on your filesystem are hard-linked. If you want to see files that are hard-linked on Windows, you can install the Link Shell Extension (LSE) program, which will put red arrows on files that are hard-linked.
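A short sketch can show why hard-linked files are not extra copies. It demonstrates hard links in general rather than anything Greenstone-specific; the file names are hypothetical, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    original = os.path.join(folder, "import_doc.txt")
    linked = os.path.join(folder, "archives_doc.txt")

    with open(original, "w") as handle:
        handle.write("one copy of the data\n")

    os.link(original, linked)  # create a second name for the same file

    same_file = os.path.samefile(original, linked)  # both names refer to one file
    link_count = os.stat(original).st_nlink         # number of names for that file
    print(same_file, link_count)
```

Both names point at the same data on disk, so the space is used once even though a file browser lists two entries.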

Collection Size Limits

The largest collections we have built have been 7 GB of text, and 11 million short documents (about 3 GB of text). These built with no problems. We haven't tried larger amounts of text because we don't have larger amounts of text lying around. It's no good using 7 GB twice over to make 14 GB because the vocabulary hasn't grown accordingly, as it would with a real collection.

There are four main limitations:

In practice, the solution for very large amounts of data is not to treat the collection as one huge monolith, but to partition it into subcollections and arrange for the search engine to search them all together behind the scenes. However, while you can amalgamate the results of searching subcollections fairly easily, it's much harder with browsing. Of course, A-Z lists and datelists and the like aren't really much use with very large collections. This is where new techniques of hierarchical phrase browsing come into their own. And the really good news is that you can partition a collection into subcollections, each with individual phrase browsers, and arrange to view them all together in a single hierarchical browsing structure, as one coordinated whole. We haven't actually demonstrated this yet, but it seems quite feasible.
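The "search them all together" step can be sketched as a merge of ranked result lists, one per subcollection. This is only an illustration of the idea, not Greenstone's implementation: the merge_results helper and the sample hits are hypothetical, and it assumes each list is already sorted by descending score and that scores are comparable across subcollections.

```python
import heapq
from itertools import islice


def merge_results(result_lists, limit=10):
    """Merge per-subcollection (score, doc_id) lists into one ranked list."""
    # Each input list is assumed to be sorted by descending score already.
    merged = heapq.merge(*result_lists, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))


part_a = [(0.92, "a/doc7"), (0.40, "a/doc2")]
part_b = [(0.75, "b/doc1"), (0.10, "b/doc9")]
print(merge_results([part_a, part_b], limit=3))
```

Because the merge is lazy, only as many hits as the limit requires are pulled from each subcollection's list.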

In 2004 a test collection was built by "Archivo Digital", an office that depends on the "Archivo Nacional de la Memoria" (National Memory Archive in English), in Argentina. It contained sequences of page images with associated OCR text.

Setup details

Statistics

The Papers Past collection

Greenstone has been used to build the digital library collection for the Papers Past initiative of the National Library of New Zealand Te Puna Mātauranga o Aotearoa. The collection contains historic New Zealand newspapers that are out of copyright. According to the Papers Past web site, a third of the collection is now indexed and searchable and the intent is to make all of the contents searchable.

At the start of February 2008, the collection for Papers Past comprised:

Using the GLI with command line building

In some circumstances you may want to use the Librarian Interface to design your collection, then actually build it using command line building. When you click "Create Collection" in the GLI, it's carrying out the last three steps: import, buildcol, and renaming building to index. So you can do the earlier steps using the Librarian Interface, and then import and build on the command line. If you are generating archive files by hand, then you will need to do this as you will not be able to use the Librarian Interface to "build" the collection.
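The three steps named above can be sketched as follows. The script invocations are only assembled, not run, and the rename is demonstrated on a throwaway folder. The helper names and the collection name "mycol" are hypothetical, removing an existing index folder before the rename is an assumption, and a real run needs the Greenstone environment set up and possibly extra arguments for Greenstone3.

```python
import os
import shutil
import tempfile


def build_commands(collection):
    """Return the two script invocations as argument lists (not executed here)."""
    return [["import.pl", collection], ["buildcol.pl", collection]]


def activate_build(collection_dir):
    """Replace the live 'index' folder with the freshly built 'building' folder."""
    building = os.path.join(collection_dir, "building")
    index = os.path.join(collection_dir, "index")
    if os.path.isdir(index):
        shutil.rmtree(index)        # assumed: discard the previous build first
    os.rename(building, index)      # the new build becomes the live index


for command in build_commands("mycol"):
    print(" ".join(command))

# Demonstrate the rename on a temporary folder rather than a real collection.
with tempfile.TemporaryDirectory() as collection_dir:
    os.mkdir(os.path.join(collection_dir, "building"))
    activate_build(collection_dir)
    print(sorted(os.listdir(collection_dir)))
```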

Additional Resources