The Greenstone 2 Collector

The Collector is a deprecated facility for creating collections.

The Collector is a facility that helps you create new collections, modify or add to existing ones, or delete collections. To do this you will be guided through a sequence of web pages which request the information that is needed. The sequence is self-explanatory: this section takes you through it. As an alternative to using the Collector, you can also build collections from the command line. The Collector predates the librarian interface, and for most practical purposes the librarian interface (GLI) should be used instead of the Collector.
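For readers curious about the command-line route mentioned above, the following is a minimal Python sketch that assembles the usual sequence of Greenstone 2 build scripts (mkcol.pl, import.pl, buildcol.pl). It assumes the Greenstone environment has already been set up in the shell so that the scripts can be found. The collection name and E-mail address are hypothetical, and the sketch only prints the commands rather than running them. After building, the contents of the collection's building directory typically still have to be moved into its index directory; that step is not shown.

```python
def build_commands(collection, creator_email):
    """Return argv lists for creating, importing and building a collection.

    Assumption: the scripts are invoked through "perl -S" so that Perl
    searches the path for them; adjust to suit your installation.
    """
    return [
        # Create the collection skeleton and record its creator.
        ["perl", "-S", "mkcol.pl", "-creator", creator_email, collection],
        # Convert the source documents into Greenstone's internal format.
        ["perl", "-S", "import.pl", collection],
        # Build the indexes and browsing structures.
        ["perl", "-S", "buildcol.pl", collection],
    ]


if __name__ == "__main__":
    # "mycol" and the address below are placeholders, not real values.
    for argv in build_commands("mycol", "me@example.org"):
        print(" ".join(argv))
```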

To access the Collector, click the appropriate link on the digital library home page.

In Greenstone, the structure of a particular collection is determined when the collection is set up. This includes such things as the format of the source documents, how they should be displayed on the screen, the source of metadata, what browsing facilities should be provided, what full-text search indexes should be provided, and how the search results should be displayed. Once the collection is in place, it is easy to add new documents to it—so long as they have the same format as the existing documents, and the same type of metadata is provided, in exactly the same way.

The Collector has the following basic functions:

  1. create a new collection with the same structure as an existing one;
  2. create a new collection with a different structure from existing ones;
  3. add new material to an existing collection;
  4. modify the structure of an existing collection;
  5. delete a collection; and
  6. write an existing collection to a self-contained, self-installing CD-ROM.

You must first decide whether to work with an existing collection or build a new one. The latter case covers options 1 and 2 above; the former covers options 3 to 6.

Logging in

Either way it is necessary to log in before proceeding. Note that in general, people use their web browser to access the collection-building facility on a remote computer, and build the collection on that server. Of course, we cannot allow arbitrary people to build collections (for reasons of propriety if nothing else), so Greenstone contains a security system which forces people who want to build collections to log in first. This allows a central system to offer a service to those wishing to build information collections and use that server to make them available to others. Alternatively, if you are running Greenstone on your own computer you can build collections locally, but it is still necessary to log in because other people who use the Greenstone system on your computer should not be allowed to build collections without prior permission.

Dialog structure

Upon completion of login, a page appears showing the sequence of steps that are involved in collection building. They are:

  1. Collection information
  2. Source data
  3. Configuring the collection
  4. Building the collection
  5. Viewing the collection.

The first step is to specify the collection's name and associated information. The second is to say where the source data is to come from. The third is to adjust the configuration options, a step that becomes more useful as you gain experience with Greenstone. The fourth step is where all the (computer's) work is done. During the “building” process the system makes all the indexes and gathers together any other information that is required to make the collection operate. The fifth step is to view the collection that has been created.

These five steps are displayed as a linear sequence of gray buttons at the bottom of the screen, and at the bottom of all other pages generated by the Collector. This display helps you keep track of where you are in the process. The button that should be clicked to continue the sequence is shown in green. The gray buttons are inactive. The buttons change to yellow as you proceed through the sequence, and you can return to an earlier step by clicking the corresponding yellow button. This display is modeled after the “wizards” that are widely used in commercial software to guide users through the steps involved in installing new software.

Collection information

The next step in the sequence is collection information. When creating a new collection, it is necessary to enter some information about it:

  • title,
  • contact E-mail address, and
  • brief description.

The collection title is a short phrase used throughout the digital library to identify the content of the collection. Example titles include Food and Nutrition Library, World Environmental Library, Development Library, and so on. The E-mail address specifies the first point of contact for any problems encountered with the collection. If the Greenstone software detects a problem, a diagnostic report may be sent to this address. Finally, the brief description is a statement describing the principles that govern what is included in the collection. It appears under the heading About this collection on the first page when the collection is presented.

Source data

Next, the user specifies the source text that comprises the collection. You may either base your collection on a default structure that is provided, or on the structure of an existing collection.

If you opt for the default structure, the new collection may contain HTML documents (files ending in .htm, .html), plain text documents (files ending in .txt, .text), Microsoft Word documents (files ending in .doc), PDF documents (files ending in .pdf), or E-mail documents (files ending in .email). More information about the different document formats that can be accommodated is given in the section on “Document formats” below.
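The extension list above can be restated as a small lookup table. The sketch below is illustrative only: it is not how Greenstone itself recognizes files, and the case-insensitive matching is an assumption made for convenience.

```python
import os

# Extensions named in the text for the default collection structure.
DEFAULT_TYPES = {
    ".htm": "HTML",
    ".html": "HTML",
    ".txt": "plain text",
    ".text": "plain text",
    ".doc": "Microsoft Word",
    ".pdf": "PDF",
    ".email": "E-mail",
}


def document_type(filename):
    """Return the document type for a file name, or None if unrecognized."""
    _, ext = os.path.splitext(filename)
    # Assumption: extensions are compared without regard to case.
    return DEFAULT_TYPES.get(ext.lower())
```

A file such as report.pdf maps to "PDF", while a file with an extension outside the list (or with no extension at all) yields None and would need a different collection structure.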

If you base your new collection on an existing one, the files in the new collection must be exactly the same type as those used to build the existing one. Note that some collections use non-standard input file formats, while others use metadata specified in auxiliary files. If your new input lacks this information, some browsing facilities may not work properly. For example, if you clone the Demo collection you may find that the subjects, organization, and how to buttons don't work.

Boxes are provided to indicate where the source documents are located. Three input boxes are shown initially; if you need more, just click the button marked “more sources.”

There are three kinds of specification:

  • a directory name on the Greenstone server system (beginning with “file://”)
  • an address beginning with “http://” for files to be downloaded from the web
  • an address beginning with “ftp://” for files to be downloaded using anonymous FTP.
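The three kinds of specification differ only in their prefix, so they can be told apart by the URL scheme. The following Python sketch shows one way to do that with the standard library; the function name and the error message are hypothetical and are not part of Greenstone.

```python
from urllib.parse import urlparse

# The three prefixes the Collector accepts, as listed in the text.
SOURCE_KINDS = {
    "file": "directory or file on the Greenstone server",
    "http": "files to be downloaded from the web",
    "ftp": "files to be downloaded using anonymous FTP",
}


def source_kind(spec):
    """Return 'file', 'http' or 'ftp' for a source specification.

    Raises ValueError for anything else, including bare paths that
    lack the "file://" prefix.
    """
    scheme = urlparse(spec).scheme.lower()
    if scheme not in SOURCE_KINDS:
        raise ValueError("unsupported source specification: %r" % spec)
    return scheme
```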

If you use file:// or ftp:// to specify a file, that file will be downloaded.

If you use http://, what happens depends on whether the URL gives you a normal web page in your browser or a list of files. If it is a page, that page will be downloaded, and so will all pages it links to, all pages they link to, and so on, provided they reside on the same site, below the URL.
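The condition "on the same site, below the URL" can be made concrete with a short Python sketch. The interpretation used here is an assumption: a page counts as "below" the starting URL if it is on the same host and its path begins with the directory part of the starting URL. Greenstone's actual download rules may differ in detail.

```python
from urllib.parse import urlparse


def is_below(start_url, candidate_url):
    """Return True if candidate_url is on the same site, below start_url."""
    start = urlparse(start_url)
    cand = urlparse(candidate_url)
    # Same site: scheme and host must match (host compared case-insensitively).
    if (start.scheme, start.netloc.lower()) != (cand.scheme, cand.netloc.lower()):
        return False
    # Treat the directory containing the start page as the root of the crawl.
    base = start.path.rsplit("/", 1)[0] + "/"
    return cand.path.startswith(base)
```

Starting from http://example.org/docs/index.html (a hypothetical address), a link to http://example.org/docs/a/b.html would be followed, whereas links to http://example.org/other/x.html or to a different host would not.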

If you use file:// or ftp:// to specify a folder or directory, or give a http:// URL that leads to a list of files, everything in the folder and all its subfolders will be included in the collection.

You can specify sources of more than one type, for instance, documents taken from both a local file system and a remote web site.

When you click the configure collection button to proceed to the next stage of building, the Collector checks that all the sources of input you specified can be reached. This might take a few seconds, or even a few minutes if you have specified several sources. If one or more of the input sources you specified is unavailable, you will be presented with a page where the unavailable sources are marked.

Sources might be unavailable because

  • the file, FTP site or URL does not exist;
  • you need to dial up your ISP first;
  • you are trying to access a URL from behind a firewall.
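If you want to check your sources yourself before handing them to the Collector, a rough pre-check might look like the Python sketch below. It is not the Collector's own test: the function names, the timeout, and the decision to treat any error as "unavailable" are all assumptions. Note that a check run on your own machine can succeed where the Greenstone server fails, for example when the server sits behind a firewall.

```python
import os
from urllib.error import URLError
from urllib.parse import urlparse
from urllib.request import url2pathname, urlopen


def is_reachable(source, timeout=10):
    """Best-effort test of whether a file://, http:// or ftp:// source exists."""
    parts = urlparse(source)
    if parts.scheme == "file":
        # Local files and directories: just look on disk.
        return os.path.exists(url2pathname(parts.path))
    try:
        # Remote sources: try to open the address and close it again.
        with urlopen(source, timeout=timeout):
            return True
    except (URLError, OSError, ValueError):
        return False


def unavailable(sources, probe=is_reachable):
    """Return the sources that the probe reports as unreachable."""
    return [s for s in sources if not probe(s)]
```

The probe argument lets you substitute a different test, for instance one that runs on the server itself.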

The last case is potentially the most mysterious. It occurs if you normally have to present a username and password to access the Internet. Sometimes it happens that you can see the page from your web browser if you enter the URL, but the Collector claims that it is unavailable. The explanation is that the page in your browser may be coming from a locally cached copy. Unfortunately, locally cached copies are invisible to the Collector. In this case we recommend that you download the pages using your browser first.

Configuring the collection

The construction and presentation of all collections is controlled by specifications in a special collection configuration file (see below). Advanced users may use this page to alter the configuration settings. Most, however, will proceed directly to the final stage. Indeed, both the configure collection and the build collection buttons are displayed in green, signifying that step 3 can be bypassed completely.

Building the collection

Up until now, the responses to the dialog have merely been recorded in a temporary file. The building stage is where the action takes place.

During building, indexes for both browsing and searching are constructed according to instructions in the collection configuration file. The building process takes some time: minutes to hours, depending on the size of the collection and the speed of your computer. Some very large collections take a day or more to build.

When you reach this stage in the interaction, a status line at the bottom of the web page gives feedback on how the operation is progressing, updated every five seconds.

Warnings are written if input files or URLs are requested that do not exist, if they exist but no plugin can process them, or if a plugin cannot find an associated file, such as an image file embedded in an HTML document. The intention is that you will monitor progress by keeping this window open in your browser. If any errors cause the process to terminate, they are recorded in this status area.

You can stop the building process at any time by clicking on the stop building button. If you leave the web page (and have not cancelled the building process with the stop building button), the building operation will continue, and the new collection will be installed when the operation completes.

Viewing the collection

When the collection is built and installed, the sequence of progress buttons appears, with the View collection button active. This takes the user directly to the newly built collection.

Finally, there is a facility for E-mail to be sent to the collection's contact E-mail address, and to the system's administrator, whenever a collection is created (or modified). This allows those responsible to check when changes occur, and monitor what is happening on the system. The facility is disabled by default but can be enabled by editing the main.cfg configuration file.

Working with existing collections

When you enter the Collector you have to specify whether you want to create an entirely new collection or work with an existing one, adding data to it or deleting it. By creating all searching and browsing structures automatically from the documents themselves, Greenstone makes it easy to add new information to existing collections. Because no links are inserted by hand, when new documents in the same format become available they can be merged into the collection automatically.

To work with an existing collection, you first select the collection from a list that is provided. Some collections are “write protected” and cannot be altered: these do not appear in the selection list. With the selected collection, you can

  • Add more data and rebuild the collection
  • Edit the collection configuration file
  • Delete the collection entirely
  • Export the collection to CD-ROM.

Add new data

The files that you specify will be added to the collection. Make sure that you do not re-specify files that are already in the collection; otherwise two copies will be included. Files are identified by their full pathname, web pages by their absolute web address. You specify directories and files just as you do when building a new collection.
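Because duplicates are detected only by full pathname or absolute web address, it can help to filter your list of new sources before submitting it. The Python sketch below shows one way to do this. It is a hypothetical helper rather than part of Greenstone, and it assumes you keep your own record of what was added previously.

```python
import os
from urllib.parse import urlparse


def source_key(spec):
    """Return the identity of a source: absolute address or full pathname."""
    parts = urlparse(spec)
    if parts.scheme in ("http", "ftp"):
        # Remote items are identified by their absolute address.
        return spec
    # Local items: accept either a file:// URL or a plain path.
    path = parts.path if parts.scheme == "file" else spec
    return os.path.abspath(path)


def new_sources(existing, candidates):
    """Return the candidates not already present, keeping their order."""
    seen = {source_key(s) for s in existing}
    fresh = []
    for cand in candidates:
        key = source_key(cand)
        if key not in seen:
            seen.add(key)
            fresh.append(cand)
    return fresh
```

Only the entries returned by new_sources would then be entered into the Collector's source boxes.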

If you add data to a collection and for some reason the building process fails, the old version of the collection remains unchanged.

Edit configuration file

Advanced users can edit the collection configuration file, just as they can when a new collection is built.

Delete the collection

You will be asked to confirm whether you really want to delete the collection. Once a collection has been deleted, Greenstone cannot bring it back!

Export the collection

You can export the collection in a form that allows it to be written to a self-contained, self-installing Greenstone CD-ROM for Windows. Because commercial software that creates self-installing CD-ROMs is expensive, this facility includes a homegrown installer module.

When you export the collection, the dialog informs you of the name of the directory in which the result has been placed. The entire contents of the directory should be written to CD-ROM using a standard CD-writing utility.

The immense variety of different possible Windows configurations has made it difficult for us to test and debug the Greenstone installer under all possible conditions. Although the installer produces CD-ROMs that operate on most Windows systems, it is still under development. If you experience problems and you possess a commercial installation package (e.g. InstallShield), you can use it to create CD-ROMs from the information that Greenstone provides. The above-mentioned export directory contains four files that relate to the installation process, and three subdirectories that contain the complete collection and software. Remove the four files and use InstallShield to make a CD-ROM image that installs these directories and creates a shortcut to the program gsdl\server.exe.