List of Greenstone 2 collect.cfg file options

This page lists the options available in the Greenstone 2 collection configuration file, collect.cfg. GLI generates this file, but if you build collections from the command line you may want to edit it manually. A few of the options listed here are not available through GLI.

Options only available in the configuration file

Options marked M in the Multiple? column can be specified more than once.

Option	Value(s)	Multiple?	Description
creator	single email address		Email address of the collection creator.
maintainer	multiple email addresses		Email address(es) of collection maintainer(s).
public	true/false		If true, the collection is displayed on the home page; if false, it is not. Even if false, the collection is still available to anyone who knows its URL.
plugin	plugin-name [plugin-options]	M	A list of plugins (and their options) to use for the collection. This determines which kinds of documents can be included in the collection. See here for more details.
buildtype	mg/mgpp/lucene		Determines which indexer should be used for the collection.
indexes	list of indexes		A list of indexes that should be built. For MG collections, the format of each index is level:fields, where level is one of 'document', 'section', 'paragraph', and fields is a comma separated list of 'text' or metadata names. For example, document:text,Title. For MGPP/Lucene collections, each index is a comma-separated list of 'text' or metadata names. For example, text dc.Subject dc.Title,Title. The order specifies the order they will be displayed in the drop-down box on the interface.
defaultindex	one of the indexes		The index that will be selected by default on the search page.
levels	one or more of 'document', 'section', 'paragraph'		Only for MGPP/Lucene collections, specifies which levels to index at. The order specifies the order they will be displayed in the drop-down box on the interface.
subcollection	id pattern	M	Subcollection definitions. The id is a name that can be used in the indexsubcollections option. The pattern is like "[!]field/expression/[i]", where ! and i are optional. Field can be 'filename' or any metadata name - this is the metadata that will be tested. The expression is a (perl) regular expression that defines the matching pattern. Documents whose 'field' metadata matches 'expression' will be included in this subcollection. A ! in front negates it, so only documents that don't match will be included. i specifies that the match should be case insensitive.
indexsubcollections	list of subcollection ids		A list of subcollections to index, where each entry is a comma separated list of subcollection identifiers. The order specifies the order they will be displayed in the drop-down box on the interface.
languages	list of language identifiers		Like subcollections, but based on the language of the documents. For example, "languages en fr en,fr" will provide subindexes for English documents, French documents, and both together. The order specifies the order they will be displayed in the drop-down box on the interface.
language_metadata	a single metadata name		The metadata element to use to determine the language of each document. The default is ex.Language.
classify	classifier-name classifier-options	M	Specifications for classifiers for browsing.
format	option-name option-value	M	Formatting options for the collection. See here for more details.
collectionmeta	key [l=xx] value	M	Specifies language specific strings for some components of the interface. See here for more details.
supercollection	a space separated list of collection names		A list of collections that should be searched together (cross collection searching). The user will be given this list on the preferences page and can change which collections are included in a search.
supercollectionoptions	uniform_search_results_formatting		By default, individual search results when cross-collection searching are formatted according to the collection each result came from. Setting this option will make all search results use the format statement of the collection the user is currently in.
maxnumeric	integer		The maximum number of digits a 'word' can have in the index dictionary. Default is 4. This means that large numbers will be split into several words for indexing. For example, if maxnumeric is 4, "1342663" will be split into "1342" and "663".
mirror	"interval N"		Used by update.pl to specify that the collection is mirrored, and what interval the update should be done at (number of days). Requires some wget/w3mirror config files to be in the etc directory of the collection.
acquire	"OAI [-getdoc] -src <url-to-oai-repository>"	M	Specifies the repository(ies) to download records from. Currently only the OAI protocol is supported. If -getdoc is specified, the document is downloaded too; otherwise only the metadata is downloaded.
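
Two of the entries above, maxnumeric and subcollection, describe small algorithms. The following Python sketch illustrates the behaviour as the table describes it; it is not Greenstone's own code. The function names are hypothetical, Python's re module stands in for Perl regular expressions (the two are similar but not identical), and treating a missing metadata field as an empty string is an assumption.

```python
import re


def split_numeric(digits, maxnumeric=4):
    """Split a run of digits into chunks of at most `maxnumeric` digits."""
    return [digits[i:i + maxnumeric] for i in range(0, len(digits), maxnumeric)]


def parse_subcollection_pattern(pattern):
    """Break "[!]field/expression/[i]" into its four parts."""
    negate = pattern.startswith("!")
    if negate:
        pattern = pattern[1:]
    # The field name runs up to the first slash.
    field, sep, rest = pattern.partition("/")
    if not sep:
        raise ValueError("expected field/expression/")
    # The expression runs up to the last slash; anything after it is the flag.
    expression, sep, flag = rest.rpartition("/")
    if not sep or flag not in ("", "i"):
        raise ValueError("expected field/expression/ with an optional trailing i")
    return negate, field, expression, flag == "i"


def in_subcollection(pattern, metadata):
    """Return True if a document with this metadata belongs to the subcollection."""
    negate, field, expression, ignore_case = parse_subcollection_pattern(pattern)
    # Assumption: a document without the field is tested as an empty string.
    value = metadata.get(field, "")
    flags = re.IGNORECASE if ignore_case else 0
    matched = re.search(expression, value, flags) is not None
    # A leading "!" inverts the result.
    return matched != negate
```

With these definitions, split_numeric("1342663") returns ["1342", "663"], matching the example in the table, and in_subcollection("!Title/^a/i", {"Title": "Apple"}) returns False because the title matches and the pattern is negated.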

Options also available as options to import.pl and buildcol.pl

Some options can also be specified on the command line to import.pl and/or buildcol.pl. In general the syntax is the same in both places, except for on/off options: in the configuration file they must be given a value (true), while on the command line they are plain flags (-optionname). Setting the flag makes the option true; omitting it makes it false.

Option	Usage	Description
archivedir	full path to a directory	Produce the archives in this directory instead of the default gsdl/collect/<collection-name>/archives
maxdocs	integer	Maximum number of documents to import/build.
verbosity	0-5	Indicates the level of output desired. The higher the number, the more verbose the output.
debug	true	Run import/build in debug mode
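
The difference between the two syntaxes can be sketched in Python as follows. This is only an illustration: to_command_line is a hypothetical helper, not part of Greenstone, and it assumes that options with a value are written on the command line as -name followed by the value.

```python
def to_command_line(options):
    """Turn collect.cfg style option/value pairs into command-line arguments."""
    args = []
    for name, value in options.items():
        if value == "true":
            # On/off option switched on: a bare flag on the command line.
            args.append(f"-{name}")
        elif value == "false":
            # On/off option switched off: the flag is simply left out.
            continue
        else:
            # Assumption: valued options are passed as "-name value".
            args.extend([f"-{name}", value])
    return args
```

For example, to_command_line({"maxdocs": "100", "debug": "true"}) returns ["-maxdocs", "100", "-debug"].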

Options also available as options to import.pl

Option	Usage	Description
importdir	full path to directory	Use a different import directory instead of the default gsdl/collect/<collection-name>/import
removeold	true	Remove the current contents of the archives directory
keepold	false	Don't keep the current contents of the archives directory.
gzip	true	Use gzip to compress archive files. ZIPPlug must then be added to the plugin list to enable building from the compressed documents.
OIDtype	hash/incremental/assigned/dirname	Use this type of identifier generation scheme (default hash).
groupsize	integer	Group this many documents into a single archive file. Useful for bibliographic collections where there are many very small documents.
sortmeta	metadata name	Sort documents by this metadata for building. Search results for boolean queries will be displayed in this order.
saveas	METS/GA	Generate the archives in this format (default GA).
separate_cjk	true	Insert spaces between Chinese/Japanese/Korean characters to make each character a word. (These languages don't have spaces and so entire sentences can end up as 'words' in the index.)

Options also available as options to buildcol.pl

Option	Usage	Description
builddir	full path to directory	Produce the indexes in this directory instead of the default gsdl/collect/<collection-name>/building
cachedir	full path to directory	??
keepold	true	Keep the contents of the old building directory (useful when used with the mode option).
textcompress	comma separated list of 'text' and/or metadata names	Use the specified fields in the compressed text (default text). For MGPP collections only.
no_text	true	Don't store any compressed text
no_strip_html	true	Don't strip HTML tags from indexed text (MGPP/Lucene collections only)
remove_empty_classifications	true	Remove empty classifiers and empty nodes from other classifiers.
mode	all/compress_text/build_index/infodb	Carry out only a certain part of the build process (default all).
create_images	true	Attempt to create collection images. Relies on Gimp and Perl Gimp support being available.
dontbuild	list of indexes	Don't build the specified indexes (instead of building all specified in indexes)
index	one index name	Only build this one index (instead of building all specified in indexes)
dontgdbm	list of metadata fields	Don't store the specified metadata fields in the GDBM database
sections_index_document_metadata	never/always/unless_section_metadata_exists	Index document level metadata in each section

Collectionmeta options

There are some standard collectionmeta options:

collectionname	The full name of the collection
collectionextra	A short description of the collection
iconcollection	The icon to be used on the collection home page
iconcollectionsmall	The icon to be used on the library home page. iconcollection will be used if this is not specified.

Other collection metadata is based on indexes, subcollections and languages. Each key must match the corresponding index, level, subcollection or language name, preceded by a dot ('.'). Here are some examples:

indexes document:text,Title section:text
- collectionmeta .document:text,Title [l=en] "text and titles"
- collectionmeta .section:text [l=en] "section text"
indexes text Title Subject (MGPP indexes)
- collectionmeta .text [l=en] "full text"
- collectionmeta .Title [l=en] "titles"
- collectionmeta .Subject [l=en] "subjects"
levels document section
- collectionmeta .document [l=en] "document"
- collectionmeta .section [l=en] "chapter"
languages en fr es en,fr,es
- collectionmeta .en [l=en] "english"
- collectionmeta .fr [l=en] "french"
- collectionmeta .es [l=en] "spanish"
- collectionmeta .en,fr,es [l=en] "all"
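
The naming rule shown in these examples can be checked mechanically. The Python sketch below is an illustration only; both helper names are hypothetical, and the parsing assumes that values containing spaces are wrapped in double quotes, as they are in the examples above.

```python
import shlex


def expected_keys(option_line):
    """List the collectionmeta keys implied by an indexes/levels/languages line."""
    # The first token is the option name; each later token is one entry.
    return ["." + entry for entry in option_line.split()[1:]]


def parse_collectionmeta(line):
    """Split a collectionmeta line into (key, language, value)."""
    tokens = shlex.split(line)  # shlex keeps a quoted value together as one token
    if len(tokens) < 3 or tokens[0] != "collectionmeta":
        raise ValueError("not a complete collectionmeta line")
    key, rest = tokens[1], tokens[2:]
    language = None
    # The [l=xx] language marker is optional.
    if len(rest) > 1 and rest[0].startswith("[l=") and rest[0].endswith("]"):
        language = rest[0][3:-1]
        rest = rest[1:]
    return key, language, " ".join(rest)
```

For instance, expected_keys("indexes document:text,Title section:text") returns [".document:text,Title", ".section:text"], and parsing the line collectionmeta .section:text [l=en] "section text" yields (".section:text", "en", "section text").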
