List of Greenstone 2 collect.cfg file options
This page lists the options available in the Greenstone 2 collection configuration file, collect.cfg. GLI generates this file, but if you are building from the command line you may want to edit it manually. A few of the options listed here are not available through GLI.
Options only available in the configuration file
An M in the Multiple? column means that the option can be specified more than once.
Option | Value(s) | Multiple? | Description |
---|---|---|---|
creator | single email address | | Email address of the collection creator. |
maintainer | multiple email addresses | | Email address(es) of collection maintainer(s). |
public | true/false | | If true, the collection will be displayed on the home page, not otherwise. Even if false, the collection will still be available to anyone provided they know the URL to it. |
plugin | plugin-name [plugin-options] | M | A list of plugins (and their options) to use for the collection. This determines which kinds of documents can be included in the collection. See here for more details. |
buildtype | mg/mgpp/lucene | | Determines which indexer should be used for the collection. |
indexes | list of indexes | | A list of indexes that should be built. For MG collections, the format of each index is level:fields, where level is one of 'document', 'section', 'paragraph', and fields is a comma separated list of 'text' or metadata names. For example, document:text,Title. For MGPP/Lucene collections, each index is a comma-separated list of 'text' or metadata names. For example, text dc.Subject dc.Title,Title. The order specifies the order they will be displayed in the drop-down box on the interface. |
defaultindex | one of the indexes | | The index that will be selected by default on the search page. |
levels | one or more of 'document', 'section', 'paragraph' | | Only for MGPP/Lucene collections, specifies which levels to index at. The order specifies the order they will be displayed in the drop-down box on the interface. |
subcollection | id pattern | M | Subcollection definitions. The id is a name that can be used in the indexsubcollections option. The pattern is like "[!]field/expression/[i]", where ! and i are optional. Field can be 'filename' or any metadata name - this is the metadata that will be tested. The expression is a (perl) regular expression that defines the matching pattern. Documents whose 'field' metadata matches 'expression' will be included in this subcollection. A ! in front negates it, so only documents that don't match will be included. i specifies that the match should be case insensitive. |
indexsubcollections | list of subcollection ids | | A list of subcollections to index, where each entry is a comma separated list of subcollection identifiers. The order specifies the order they will be displayed in the drop-down box on the interface. |
languages | list of language identifiers | | Like subcollections, but based on the language of the documents. For example, "languages en fr en,fr" will provide subindexes for English documents, French documents, and both together. The order specifies the order they will be displayed in the drop-down box on the interface. |
language_metadata | a single metadata name | | The metadata element to use to determine the language of each document. The default is ex.Language. |
classify | classifier-name classifier-options | M | Specifications for classifiers for browsing. |
format | option-name option-value | M | Formatting options for the collection. See here for more details. |
collectionmeta | key [l=xx] value | M | Specifies language specific strings for some components of the interface. See here for more details. |
supercollection | a space separated list of collection names | | A list of collections that should be searched together (cross collection searching). The user will be given this list on the preferences page and can change which collections are included in a search. |
supercollectionoptions | uniform_search_results_formatting | | By default, individual search results when cross-collection searching are formatted according to the collection that each result came from. Setting this option will make all search results use the format statement of the collection the user is currently in. |
maxnumeric | integer | | The maximum number of digits a 'word' can have in the index dictionary. Default is 4. This means that large numbers will be split into several words for indexing. For example, if maxnumeric is 4, "1342663" will be split into "1342" and "663". |
mirror | "interval N" | | Used by update.pl to specify that the collection is mirrored, and what interval the update should be done at (number of days). Requires some wget/w3mirror config files to be in the etc directory of the collection. |
acquire | "OAI [-getdoc] -src <url-to-oai-repository>" | M | Specifies the repository(ies) to download records from. Currently only OAI protocol is supported. If -getdoc is specified, download the document too. Otherwise only the metadata will be downloaded. |
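The subcollection pattern syntax described in the table can be illustrated with a small Python sketch. This is not Greenstone's own code: the function names are hypothetical, a document is assumed to be a plain dict of field names to values, and Python's `re` module stands in for Perl regular expressions, which it resembles but does not match exactly.

```python
import re

# Assumed shape of a subcollection pattern: "[!]field/expression/[i]".
_PATTERN = re.compile(r"^(!?)([^/]+)/(.*)/(i?)$")


def parse_subcollection(pattern):
    """Split a pattern into (negate, field, expression, ignore_case)."""
    m = _PATTERN.match(pattern)
    if m is None:
        raise ValueError(f"not a subcollection pattern: {pattern!r}")
    negate, field, expression, flag = m.groups()
    return negate == "!", field, expression, flag == "i"


def in_subcollection(pattern, metadata):
    """Return True if a document (dict of field -> value) is in the subcollection."""
    negate, field, expression, ignore_case = parse_subcollection(pattern)
    flags = re.IGNORECASE if ignore_case else 0
    # Assumption: the match is unanchored, and a missing field is treated as "".
    matched = re.search(expression, metadata.get(field, ""), flags) is not None
    # A leading "!" inverts the result.
    return matched != negate
```

For instance, `in_subcollection("Title/^a/i", {"Title": "Apple"})` is true, while the same pattern without the trailing `i` does not match because the comparison becomes case sensitive.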
Options also available as options to import.pl and buildcol.pl
Some options can also be given on the command line to import.pl and/or buildcol.pl. In general the syntax is the same in both places, except for on/off options: in the config file they must be given a value (true), while on the command line they are plain flags (-optionname), where setting the flag makes the option true and omitting it makes it false.
Option | Usage | Description |
---|---|---|
archivedir | full path to a directory | Produce the archives in this directory instead of the default gsdl/collect/<collection-name>/archives |
maxdocs | integer | Maximum number of documents to import/build. |
verbosity | 0-5 | Indicates the level of output desired. The higher the number, the more verbose the output. |
debug | true | Run import/build in debug mode |
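The difference between the two syntaxes can be sketched in Python as a mapping from config-file entries to command-line arguments. The helper name is hypothetical, and treating every value of "true" or "false" as an on/off option is an assumption made for the sketch.

```python
def config_to_args(options):
    """Turn collect.cfg-style (name, value) pairs into command-line arguments."""
    args = []
    for name, value in options:
        if value == "true":
            # On/off option: the flag on its own means true.
            args.append(f"-{name}")
        elif value == "false":
            # Leaving the flag out means false.
            continue
        else:
            # Valued option: same name and value in both places.
            args.extend([f"-{name}", str(value)])
    return args


print(config_to_args([("maxdocs", "100"), ("debug", "true"), ("verbosity", "3")]))
```

The printed list is `['-maxdocs', '100', '-debug', '-verbosity', '3']`, which could be appended to an import.pl or buildcol.pl invocation.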
Options also available as options to import.pl
Option | Usage | Description |
---|---|---|
importdir | full path to directory | Use a different import directory instead of the default gsdl/collect/<collection-name>/import |
removeold | true | Remove the current contents of the archives directory |
keepold | false | Don't keep the current contents of the archives directory. |
gzip | true | Use gzip to compress archive files. Then ZIPPlug will need to be added to the plugin list to enable building from compressed documents. |
OIDtype | hash/incremental/assigned/dirname | Use this type of identifier generation scheme (default hash). |
groupsize | integer | Group this many documents into a single archive file. Useful for bibliographic collections where there are many very small documents. |
sortmeta | metadata name | Sort documents by this metadata for building. Search results for boolean queries will be displayed in this order. |
saveas | METS/GA | Generate the archives in this format (default GA). |
separate_cjk | true | Insert spaces between Chinese/Japanese/Korean characters to make each character a word. (These languages don't have spaces and so entire sentences can end up as 'words' in the index.) |
Options also available as options to buildcol.pl
Option | Usage | Description |
---|---|---|
builddir | full path to directory | Produce the indexes in this directory instead of the default gsdl/collect/<collection-name>/building |
cachedir | full path to directory | ?? |
keepold | true | Keep the contents of the old building directory (useful when used with the mode option). |
textcompress | comma separated list of 'text' and/or metadata names | Use the specified fields in the compressed text (default text). For MGPP collections only. |
no_text | true | Don't store any compressed text |
no_strip_html | true | Don't strip HTML tags from indexed text (MGPP/Lucene collections only) |
remove_empty_classifications | true | Remove empty classifiers and empty nodes from other classifiers. |
mode | all/compress_text/build_index/infodb | Carry out only a certain part of the build process (default all). |
create_images | true | Attempt to create collection images. Relies on Gimp and Perl Gimp support being available. |
dontbuild | list of indexes | Don't build the specified indexes (instead of building all specified in indexes) |
index | one index name | Only build this one index (instead of building all specified in indexes) |
dontgdbm | list of metadata fields | Don't store the specified metadata fields in the GDBM database |
sections_index_document_metadata | never/always/unless_section_metadata_exists | Index document level metadata in each section |
Collectionmeta options
There are some standard collectionmeta options:
Key | Description |
---|---|
collectionname | The full name of the collection |
collectionextra | A short description of the collection |
iconcollection | The icon to be used on the collection home page |
iconcollectionsmall | The icon to be used on the library home page. iconcollection will be used if this is not specified. |
Other collection metadata is based on indexes, levels, subcollections and languages. Each key must match the corresponding index, level, subcollection or language name, preceded by a dot ('.'). Here are some examples:
- indexes document:text,Title section:text
  - collectionmeta .document:text,Title [l=en] "text and titles"
  - collectionmeta .section:text [l=en] "section text"
- indexes text Title Subject (MGPP indexes)
  - collectionmeta .text [l=en] "full text"
  - collectionmeta .Title [l=en] "titles"
  - collectionmeta .Subject [l=en] "subjects"
- levels document section
  - collectionmeta .document [l=en] "document"
  - collectionmeta .section [l=en] "chapter"
- languages en fr es en,fr,es
  - collectionmeta .en [l=en] "english"
  - collectionmeta .fr [l=en] "french"
  - collectionmeta .es [l=en] "spanish"
  - collectionmeta .en,fr,es [l=en] "all"
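The key-naming rule in the examples above can be sketched in Python as a helper that produces one collectionmeta stub per entry of an indexes, levels or languages line. The function name is hypothetical, and using the entry itself as placeholder display text is an assumption; the text would normally be edited by hand afterwards.

```python
def collectionmeta_stubs(option_line, lang="en"):
    """Build one collectionmeta stub per entry of an indexes/levels/languages line."""
    # The first token is the option name; the rest are the entries.
    _name, *entries = option_line.split()
    # Each key is the entry preceded by a dot; the entry doubles as placeholder text.
    return [f'collectionmeta .{entry} [l={lang}] "{entry}"' for entry in entries]


for line in collectionmeta_stubs("indexes document:text,Title section:text"):
    print(line)
```

For the indexes line shown, this prints a stub for `.document:text,Title` and one for `.section:text`, matching the keys used in the first example.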