Advanced collection building

In the GLI, the actual building of the collection is done in the Create panel. There are many options associated with the building process. Many of these are available in the GLI when in expert mode (File → Preferences → Mode → Expert). To access the full list of options, you can build from the command-line.

Importing

Building collections is split into two segments: importing and building. The import process is when all of the documents and metadata are actually imported into the collection–documents are processed by their respective plugins, metadata is extracted, and both assigned and extracted metadata are stored in a format Greenstone can use.

You have several import options available in the GLI:

Greenstone can store metadata in two different formats: GreenstoneXML (default) and GreenstoneMETS. Use the saveas option to change the storage format. (Note: that you must also ensure the correct plugin is in the Document Plugins list–GreenstoneXMLPlugin or GreenstoneMETSPlugin. See Greenstone metadata formats for more information.)

Logs: Logs provide information on what has occurred during the import/build process. The main message log, which appears in the Create panel while your collection is being built, is also written to a file, which can be found in the collection's log folder. You can control how detailed the log is by changing the verbosity. The higher the verbosity (from 0-3), the more information is printed in the log. This is especially helpful when there are any errors. There is a second log file–the faillog–which contains only a list of documents that failed the be processed. By default, this file is called FAIL.log and is located in the collection's etc folder. You can change the name and location of the fail log by putting a complete file path in the faillog option.

maxdocs allows you to set the maximum number of documents to process. This is useful when you are designing your collection. If you have a large collection, importing all of the documents and building the collection can take a very long time. By setting a low number for maxdocs, you can build a subset of the collection quickly to see design changes you have made.

sortmeta will import the documents in order depending on the value of one or more metadata fields. This is useful for giving some order to non-ranked search results, as the results will be displayed in the word they were imported. (removeprefix and removesuffix can be used in conjunction with sortmeta.)

Every document in the collection must have a unique object identifier (OID). Greenstone gives you four options for assigning this identifier (OIDtype):

  • hash (default): This creates a hash of the contents of the file, and will be consistent every time the collection is imported
  • assigned: Your documents may already have unique identifiers in their metadata. If so, you can set the OIDtype to assigned and use the OIDmetadata option to specify which metadata field contains the document identifiers.
  • dirname: If your documents are each in individual folders, and every folder has a different name, you can specify the OIDtype to be the name of this folder.
  • incremental: This option will just number the documents in order as it imports them. While it is significantly faster than using the hash, there is no guarantee that the documents will get the same identifier every time the collection is built. Incremental OID's start with "D" by default (e.g. "D1", "D2", "D3"), as they cannot be purely numerical to avoid conflicting with indexer document numbers. You can change this prefix in

by replacing the "D" in the code below:

	} elsif ($self->{'OIDtype'} eq "incremental") {
	    $OID = "D" . $OIDcount;
	    $OIDcount ++;

(contributed by Michael Dewsnip & Stephen)

Building

The actual building part of the build process takes the imported documents and metadata and builds the indexes, classifiers, and partitions specified in the Design panel.

Options include only building specific indexes (index) or only doing a specific part of the build process (mode), and removing classifiers that are empty (remove_empty_classifications). You can also specify how detailed the log is for the build process (verbosity) and the name of the file containing a list of documents that failed to process (faillog).

Incremental Building

Incremental building is a way to only import new/modified documents and update indexes, instead of completely rebuilding them. This can save a lot of time, especially with larger collections. Incremental building can be done in the GLI by selecting the Minimal Rebuild option on the Create panel. Incremental building can only be performed in certain circumstances (for instance, you must be using Lucene if you want to incrementally build indexes). If the conditions aren't met, Greenstone will simply end up doing a complete rebuild instead. You can read more about incremental building to understand when it will work and to take full advantage it.

Apparent multiple copies of documents

When you look at a built collection, you may notice that there appear to be three copies of each document: in the import, archives, and index folders. This is not the case. Greenstone uses something called hard-links instead of making copies. Hard-links are like shortcuts, but your computer sees the hard-linked items (that are located elsewhere) as being "really" there. This gets confusing on Windows, because Windows doesn't show you when files on your filesystem are hard-linked. If you want to see files that are hard-linked on windows, you can install the Link Shell Extension (LSE) program, which will put red arrows on files that are hard-linked.

Collection Size Limits

The largest collections we have built have been 7 GB of text, and 11 million short documents (about 3 GB of text). These built with no problems. We haven't tried larger amounts of text because we don't have larger amounts of text lying around. It's no good using 7 GB twice over to make 14 GB because the vocabulary hasn't grown accordingly, as it would with a real collection.

There are four main limitations:

  • Operating system limitations: Although this is not likely to be an issue on modern operating systems—which have a file size limits of around 16 TB (specifically the NTFS file system for Windows and either ext3 or ext4 file system on Linux)—older file systems have file size limits of 2-4 GB, allowing for a maximum of around 7 GB worth of text before compression.
  • Technical limitations: There is a Huffman coding limitation on the MG and MGPP indexers which we would expect to run into at collections of around 16 GB. Building with the Lucene indexer however removes this limitation.
  • Build time limitations: For building a single index on an already-imported collection, extrapolations indicate that on a modern machine with 1 GB of main memory, you should be able to build a 60 GB collection in about 3 days. However, there are often large gaps between theory and practice in this area! The more indexes you have, the longer things take to build.
  • GLI limitations: The GLI program for building collections with a graphical user interface may be expected to fail for collections smaller than 16 GB if there are large amounts of metadata per record (for example in the case of complex bibliographic records and/or abstracts). Although no benchmarking has been conducted, problems have been experienced for collections approaching 15,000 documents in this case. In the event of such problems, the collection can be readily built in command line mode.

In practice, the solution for very large amounts of data is not to treat the collection as one huge monolith, but to partition it into subcollections and arrange for the search engine to search them all together behind the scenes. However, while you can amalgamate the results of searching subcollections fairly easily, it's much harder with browsing. Of course, A-Z lists and datelists and the like aren't really much use with very large collections. This is where new techniques of hierarchical phrase browsing come into their own. And the really good news is that you can partition a collection into subcollections, each with individual phrase browsers, and arrange to view them all together in a single hierarchical browsing structure, as one coordinated whole. We haven't actually demonstrated this yet, but it seems quite feasible.

In 2004 a test collection was built by "Archivo Digital", an office that depends on the "Archivo Nacional de la Memoria" (National Memory Archive in English), in Argentina. It contained sequences of page images with associated OCR text.

Setup details

  • Greenstone version: 2.52
  • Server: Pentium IV 1.8 GHz, 512 Mb RAM, Windows XP Prof.
  • Number of indexed documents: 17,655
  • Number of images (tiff format): 980,000
  • Total size of text files: 3.2 GB
  • Built indexes: section:text document:Title
  • Used Plugin: PagedImgPlug
  • 5 classifiers

Statistics

  • Time to import the collection: Almost a week was spent collecting documents and importing them. No image conversion was done.
  • Time to build the collection (excluding import): almost 24 hours. The archives and the indexes were on separate hard disks, to reduce the overhead that reading and writing from the same disk would cause.
  • Time to open a hierarchy node that contains 908 objects: 23 seconds
  • Average Time to search only one word in text index: 2 to 5 seconds
  • Average Time to search 3 words in text index: 2 to 5 seconds
  • Average Time to search exact phrases (includes 4, 5 and 6 words): 30 seconds

The Papers Past collection

Greenstone has been used to build the digital library collection for the Papers Past initiative of the National Library of New Zealand Te Puna Mātauranga o Aotearoa. The collection contains historic New Zealand newspapers that are out of copyright. According to the Papers Past web site, a third of the collection is now indexed and searchable and the intent is to make all of the contents searchable.

At the start of February 2008, the collection for Papers Past comprised:

  • 1,119,788 newspaper pages, from 207,793 issues. Of those, 91,545 issues–601,516 pages, 6,461,804 articles–have been OCRed to METS/ALTO.
  • As at 06 March 2008, the number of documents–which corresponds to the number of newspaper issues–was 207,844. The space these take up in Greenstone is about 17.25GB (18,524,818,217 bytes).
  • The total built index directory is 87GB. That includes the GDBM databases used to store word coordinates and the Lucene index itself (but no images).
  • The 1,119,788 newspapers images are stored in TIFF format. (The total size of the collection data is still uncertain: it is either 3Tb or–if the images average 500kb each, as they have been estimated to–it is 546GB of image data.)

Using the GLI with command line building

In some circumstances you may want to use the Librarian Interface to design your collection, then actually build it using command line building. When you click "Create Collection" in the GLI, its carrying out the last three steps: import, buildcol, and renaming building to index. So you can do the earlier steps using the Librarian Interface, and then import and build on the command line. If you are generating archive files by hand, then you will need to do this as you will not be able to use the Librarian Interface to "build" the collection.

Additional Resources