Greenstone Scripts

This is a list of Greenstone scripts, with all available options and their descriptions and value information.

To run any of these scripts, you must first set up the Greenstone environment in your terminal. To do this, cd to the Greenstone folder and run the setup script for your platform.


To get information on any of these scripts simply run:

perl -S <script filename> -h

mkcol.pl

Perl script used to create the directory structure for a new Greenstone collection.

GLI option | Command-line option | Description | Value
creator -creator <string> The collection creator's e-mail address.
optionfile -optionfile <string> Get options from file, useful on systems where long command lines may cause problems.
maintainer -maintainer <string> The collection maintainer's email address (if different from the creator).
group -group Create a new collection group instead of a standard collection.
gs3mode -gs3mode Mode for Greenstone 3 collections.
collectdir -collectdir <string> Directory where new collection will be created.
site -site <string> In gs3mode, uses this site name with the GSDL3HOME environment variable to determine collectdir, unless -collectdir is specified.
public -public <enum> Whether this collection allows anonymous access. Default: true (values listed below)
title -title <string> The title of the collection.
about -about <string> The about text for the collection.
buildtype -buildtype <enum> The 'buildtype' for the collection (e.g. mg, mgpp, lucene). Default: mgpp (values listed below)
infodbtype -infodbtype <enum> The 'infodbtype' for the collection (e.g. gdbm, jdbm, sqlite). Default: gdbm (values listed below)
plugin -plugin <string> Perl plugin module to use (there may be multiple plugin entries).
quiet -quiet Operate quietly.
language -language <string> Language to display option descriptions in (e.g. 'en_US' specifies American English). Requires translations of the option descriptions to exist in the perllib/strings_language-code.rb file.
win31compat -win31compat <enum> Whether the named collection directory must conform to Windows 3.1 file conventions (i.e. at most 8 characters long). Default: false (values listed below)
not available -gli
not available -xml Produces the information in an XML form, without 'pretty' comments but with much more detail.

public option values

Value: Description
true: Collection is public
false: Collection is private

buildtype option values

Value: Description
mgpp: {mkcol.buildtype.mgpp}
lucene: {mkcol.buildtype.lucene}
mg: {mkcol.buildtype.mg}

infodbtype option values

Value: Description
gdbm: {mkcol.infodbtype.gdbm}
sqlite: {mkcol.infodbtype.sqlite}
jdbm: {mkcol.infodbtype.jdbm}
mssql: {mkcol.infodbtype.mssql}
gdbm-txtgz: {mkcol.infodbtype.gdbm-txtgz}

win31compat option values

Value: Description
true: Directory name 8 characters or less
false: Directory name any length
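
As a rough illustration of how the mkcol.pl options above combine on one command line, the sketch below assembles a call and, by default, only prints it. The collection name demo, the address librarian@example.org, and the run helper with its DRY_RUN variable are assumptions made for this example and are not part of Greenstone. It also assumes the collection name is passed as the final argument; check perl -S mkcol.pl -h for the exact usage.

```shell
#!/bin/sh
# Dry-run helper (hypothetical, not part of Greenstone):
# prints the command unless DRY_RUN=0 is set, in which case it runs it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Every flag below is taken from the mkcol.pl option table;
# the values and the collection name "demo" are placeholders.
run perl -S mkcol.pl \
    -creator librarian@example.org \
    -public true \
    -buildtype lucene \
    -infodbtype jdbm \
    -title Demo \
    demo
```

Running the script with DRY_RUN=0 in a terminal where the Greenstone environment has been set up would execute the command instead of printing it.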

downloadfrom.pl

Downloads files from an external server.

These are the basic options. Additional options depend on the value of -download_mode and are listed below.

GLI option | Command-line option | Description | Value
-download_mode <enum> (REQUIRED) The type of server to download from. Allowable values: Web, MediaWiki, OAI, Z3950, and SRW
not available -cache_dir <string> The location of the cache directory
not available -gli
Server Information -info Print information about the server, rather than downloading
Use proxy connection? -proxy_on Indicates you are using a proxy connection
Proxy host -proxy_host <string> Proxy host
Proxy port -proxy_port <string> Proxy port
Proxy username -user_name <string> Proxy username
Proxy password -user_password <string> Proxy password

MediaWikiDownload

A module for downloading from MediaWiki websites

GLI option | Command-line option | Description | Value
url -url <string> (REQUIRED) Source URL. In case of HTTP redirects, this value may change
depth -depth <int> How many hyperlinks deep to go when downloading Default: 0 Range: 0,
below -below Only mirror files below this URL
within -within Only mirror files within the same site
reject_files -reject_files <string> List of URL patterns to ignore, separated by commas, e.g. *cgi-bin*,*.ppt ignores hyperlinks that contain either 'cgi-bin' or '.ppt'. Default: *action=*,*diff=*,*oldid=*,*printable*,*Recentchangeslinked*,*Userlogin*,*Whatlinkshere*,*redirect*,*Special:*,Talk:*,Image:*,*.ppt,*.pdf,*.zip,*.doc
exclude_directories -exclude_directories <string> List of directories to exclude (each must be an absolute path), e.g. /people,/documentation will exclude the 'people' and 'documentation' subdirectories under the site currently being crawled. Default: /wiki/index.php/Special:Recentchangeslinked,/wiki/index.php/Special:Whatlinkshere,/wiki/index.php/Talk:Creating_CD

OAIDownload

A module for downloading from OAI repositories

GLI option | Command-line option | Description | Value
url -url <string> (REQUIRED) OAI repository URL
metadata_prefix -metadata_prefix <string> The metadata format used in the exported records, e.g. oai_dc, qdc, etc. Press the <Server information> button to find out what formats are supported. Default: oai_dc
set -set <string> Restrict the download to the specified set in the repository
get_doc -get_doc Download the source document if one is specified in the record
get_doc_exts -get_doc_exts <string> Permissible filename extensions of documents to get Default: doc,pdf,ppt
max_records -max_records <int> Maximum number of records to download Range: 1,

SRWDownload

A module for downloading from SRW (Search/Retrieve Web Service) repositories

GLI option | Command-line option | Description | Value
host -host <string> (REQUIRED) Host URL
port -port <string> (REQUIRED) Port number of the repository
database -database <string> (REQUIRED) Database to search for records in
find -find <string> (REQUIRED) Retrieve records containing the specified search term
max_records -max_records <int> Maximum number of records to download Default: 500

WebDownload

A module for downloading from the Internet via HTTP or FTP

GLI option | Command-line option | Description | Value
url -url <string> (REQUIRED) Source URL. In case of HTTP redirects, this value may change
depth -depth <int> How many hyperlinks deep to go when downloading Default: 0 Range: 0,
below -below Only mirror files below this URL
within -within Only mirror files within the same site
html_only -html_only Download only HTML files, and ignore associated files, e.g. images and stylesheets

Z3950Download

A module for downloading from Z3950 repositories

GLI option | Command-line option | Description | Value
host -host <string> (REQUIRED) Host URL
port -port <string> (REQUIRED) Port number of the repository
database -database <string> (REQUIRED) Database to search for records in
find -find <string> (REQUIRED) Retrieve records containing the specified search term
max_records -max_records <int> Maximum number of records to download Default: 500
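
To summarise which options each download mode insists on, the following sketch maps a -download_mode value to the options marked (REQUIRED) in the tables above. The function name required_options is made up for this example; the mapping itself is read directly off those tables and nothing here calls Greenstone.

```shell
#!/bin/sh
# Print the options marked (REQUIRED) for a given -download_mode value.
# required_options is a hypothetical helper, not a Greenstone command.
required_options() {
    case "$1" in
        Web|MediaWiki|OAI) echo "-url" ;;
        Z3950|SRW)         echo "-host -port -database -find" ;;
        *) echo "unknown download mode: $1" >&2; return 1 ;;
    esac
}

# Show the required options for every documented mode.
for mode in Web MediaWiki OAI Z3950 SRW; do
    echo "$mode: $(required_options "$mode")"
done
```

A wrapper script could use such a check before calling downloadfrom.pl, so that a missing -url or -find is reported before any download starts.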

downloadinfo.pl

Provides information on the options available for downloadfrom.pl in each -download_mode.

perl -S downloadinfo.pl [options] [download-module]

explode_metadata_database.pl

Explode a metadata database

GLI option | Command-line option | Description | Value
not available -language <string> Language to display option descriptions in (e.g. 'en_US' specifies American English). Requires translations of the option descriptions to exist in the perllib/strings_language-code.rb file.
not available -plugin <string> (REQUIRED) Plugin to use for exploding
input_encoding -input_encoding <enum> Encoding to use when reading in the database file Default: auto (values listed below)
metadata_set -metadata_set <string> Metadata set (namespace) to export all metadata as
document_field -document_field <string> The metadata element specifying the file name of documents to obtain and include in the collection.
document_prefix -document_prefix <string> A prefix for the document locations (for use with the document_field option).
document_suffix -document_suffix <string> A suffix for the document locations (for use with the document_field option).
records_per_folder -records_per_folder <int> The number of records to put in each subfolder. Default: 100 Range: 0,
not available -collectdir <string> The path of the "collect" directory.
not available -site <string> Site to find collect directory in (for Greenstone 3 installation).
not available -collection <string> The collection name. Some plugins look for auxiliary files in the collection folder.
not available -use_collection_plugin_options Read the collection configuration file and use the options for the specified plugin. Requires the -collection option. Cannot be used with -plugin_options.
not available -plugin_options <string> Options to pass to the plugin before exploding. Option names must start with -. Separate option names and values with a space. Cannot be used with -use_collection_plugin_options.
verbosity -verbosity <int> Controls the quantity of output. 0=none, 3=lots. Default: 1 Range: 0,
not available -xml

input_encoding option values

Value: Description
auto: Use a text categorization algorithm to automatically identify the encoding of each source document. This is slower than explicitly setting the encoding but works where more than one encoding is used within the same collection.
ascii: Plain 7-bit ASCII. This may be a bit faster than using iso_8859_1. Avoid it when the text may contain characters outside the plain 7-bit ASCII set (e.g. German or French text containing accents); use iso_8859_1 instead.
utf8: Either utf8 or unicode, automatically detected.
unicode: Just unicode.
iso_8859_6: Arabic
gb: Chinese Simplified (GB)
big5: Chinese Traditional (Big5)
koi8_r: Cyrillic
iso_8859_5: Cyrillic
koi8_u: Cyrillic (Ukrainian)
dos_437: DOS codepage 437 (US English)
dos_850: DOS codepage 850 (Latin 1)
dos_852: DOS codepage 852 (Central European)
dos_866: DOS codepage 866 (Cyrillic)
iso_8859_7: Greek
iso_8859_8: Hebrew
iscii_de: ISCII Devanagari
euc_jp: Japanese (EUC)
shift_jis: Japanese (Shift-JIS)
korean: Korean (Unified Hangul Code, i.e. a superset of EUC-KR)
iso_8859_1: Latin1 (western languages)
iso_8859_15: Latin15 (revised western)
iso_8859_2: Latin2 (central and eastern European languages)
iso_8859_3: Latin3
iso_8859_4: Latin4
iso_8859_9: Turkish
windows_1250: Windows codepage 1250 (WinLatin2)
windows_1251: Windows codepage 1251 (WinCyrillic)
windows_1252: Windows codepage 1252 (WinLatin1)
windows_1253: Windows codepage 1253 (WinGreek)
windows_1254: Windows codepage 1254 (WinTurkish)
windows_1255: Windows codepage 1255 (WinHebrew)
windows_1256: Windows codepage 1256 (WinArabic)
windows_1257: Windows codepage 1257 (WinBaltic)
windows_1258: Windows codepage 1258 (Vietnamese)
windows_874: Windows codepage 874 (Thai)

import.pl

Perl script used to import files into a format (GreenstoneXML or GreenstoneMETS) ready for building.

GLI option | Command-line option | Description | Value
saveas -saveas <enum> Format that the archive files should be saved as. Default: GreenstoneXML (values listed below)
not available -archivedir <string> Where the converted material ends up.
not available -importdir <string> Where the original material lives.
not available -collectdir <string> The path of the "collect" directory.
not available -site <string> Site to find collect directory in (for Greenstone 3 installation).
not available -manifest <string> An XML file that details what files are to be imported. Used instead of recursively descending the import folder, typically for incremental building.
not available -debug Print imported text to STDOUT (for GreenstoneXML importing)
faillog -faillog <string> Fail log filename. This log receives the filenames of any files which fail to be processed.
not available -incremental Only import documents which are newer (by timestamp) than the current archives files. Implies -keepold.
not available -keepold Will not destroy the current contents of the archives directory.
not available -removeold Will remove the old contents of the archives directory.
not available -language <string> Language to display option descriptions in (e.g. 'en_US' specifies American English). Requires translations of the option descriptions to exist in the perllib/strings_language-code.rb file.
maxdocs -maxdocs <int> Maximum number of documents to import. Range: 1,
OIDtype -OIDtype <enum> The method to use when generating unique identifiers for each document. (values listed below)
OIDmetadata -OIDmetadata <string> Specifies the metadata element that holds the document's unique identifier, for use with -OIDtype=assigned.
not available -out <string> Filename or handle to print output status to. Default: STDERR
sortmeta -sortmeta <string> Sort documents alphabetically by metadata for building. Search results for boolean queries will be displayed in this order. This will be disabled if groupsize > 1. May be a comma-separated list to sort by more than one metadata value. Use the ArchiveInfPlugin options to dictate whether sorting will be ascending or descending.
removeprefix -removeprefix <regexp> A prefix to ignore in metadata values when sorting.
removesuffix -removesuffix <regexp> A suffix to ignore in metadata values when sorting.
groupsize -groupsize <int> Number of import documents to group into one XML file. Default: 1
gzip -gzip Use gzip to compress resulting XML documents (don't forget to include ZIPPlugin in your plugin list when building from compressed documents).
not available -statsfile <string> Filename or handle to print import statistics to. Default: STDERR
verbosity -verbosity <int> Controls the quantity of output. 0=none, 3=lots. Range: 0,
not available -gli A flag set when running this script from GLI; enables output specific to GLI.
not available -xml Produces the information in an XML form, without 'pretty' comments but with much more detail.

saveas option values

Value: Description
GreenstoneXML: Greenstone XML Archive format
GreenstoneMETS: METS format using the Greenstone profile.

OIDtype option values

Value: Description
hash: Hash the contents of the file. Document identifiers will be the same every time the collection is imported.
hash_on_full_filename: Hash on the full filename to the document within the 'import' folder (and not its contents). Helps make document identifiers more stable across upgrades of the software, although it means that duplicate documents contained in the collection are no longer detected automatically.
assigned: Use the metadata value given by the OIDmetadata option (preceded by 'D'); if unspecified, for a particular document a hash is used instead. These identifiers should be unique.
incremental: Use a simple document count. Significantly faster than "hash", but does not necessarily assign the same identifier to the same document content if the collection is reimported.
dirname: Use the parent directory name (preceded by 'J'). There should only be one document per directory, and directory names should be unique. E.g. import/b13as/h15ef/page.html will get an identifier of Jh15ef.
full_filename: Use the full file name within the 'import' folder as the identifier for the document (with _ and - substitutions made for symbols such as directory separators and the full stop in a filename extension).
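
The dirname rule above can be mimicked with plain shell string handling. This is only a sketch that reproduces the documented example (import/b13as/h15ef/page.html becomes Jh15ef); the function name dirname_oid is invented here, and import.pl itself may treat unusual directory names differently.

```shell
#!/bin/sh
# Derive a dirname-style identifier from a document path:
# take the immediate parent directory name and prefix it with 'J'.
# dirname_oid is a hypothetical helper, not part of Greenstone.
dirname_oid() {
    parent=${1%/*}                   # drop the file name: import/b13as/h15ef
    printf 'J%s\n' "${parent##*/}"   # keep only the last directory component
}

dirname_oid import/b13as/h15ef/page.html   # prints Jh15ef
```

Because only the immediate parent directory is used, two documents in the same directory would receive the same identifier, which is why the option expects one document per uniquely named directory.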

buildcol.pl

Perl script used to build a Greenstone collection from archive documents.

GLI option | Command-line option | Description | Value
remove_empty_classifications -remove_empty_classifications Hide empty classifiers and classification nodes (those that contain no documents).
not available -archivedir <string> Where the archives live.
not available -builddir <string> Where to put the built indexes.
not available -collectdir <string> The path of the "collect" directory.
not available -site <string> {buildcol.site}
not available -debug Print output to STDOUT.
faillog -faillog <string> Fail log filename. This log receives the filenames of any files which fail to be processed.
index -index <string> Index to build (will build all in config file if not set).
not available -incremental Only index documents which have not been previously indexed. Implies -keepold. Relies on the lucene indexer.
not available -keepold Will not destroy the current contents of the building directory.
not available -removeold Will remove the old contents of the building directory.
language -language <string> Language to display option descriptions in (e.g. 'en_US' specifies American English). Requires translations of the option descriptions to exist in the perllib/strings_language-code.rb file.
not available -maxdocs <int> Maximum number of documents to build.
maxnumeric -maxnumeric <int> The maximum number of digits a 'word' can have in the index dictionary. Large numbers are split into several words for indexing. For example, if maxnumeric is 4, "1342663" will be split into "1342" and "663". Default: 4 Range: 4,512
mode -mode <enum> The parts of the building process to carry out. (values listed below)
no_strip_html -no_strip_html Do not strip the HTML tags from the indexed text (only used for mgpp collections).
store_metadata_coverage -store_metadata_coverage Include statistics about which metadata sets are used in a collection, including which actual metadata terms are used. This is useful in the built collection if you want to list the metadata values that are used in a particular collection.
no_text -no_text Don't store compressed text. This option is useful for minimizing the size of the built indexes if you intend always to display the original documents at run time (i.e. you won't be able to retrieve the compressed text version).
sections_index_document_metadata -sections_index_document_metadata <enum> Index document level metadata at section level (values listed below)
not available -out <string> Filename or handle to print output status to. Default: STDERR
verbosity -verbosity <int> Controls the quantity of output. 0=none, 3=lots.
not available -gli
not available -xml Produces the information in an XML form, without 'pretty' comments but with much more detail.
not available -activate Run activate.pl after buildcol has finished, which will move building to index.

mode option values

Value: Description
all: Do everything.
compress_text: Just compress the text.
build_index: Just index the text.
infodb: Just build the metadata database.

sections_index_document_metadata option values

Value: Description
never: Don't index any document metadata at section level.
always: Add all specified document level metadata even if section level metadata of that name exists.
unless_section_metadata_exists: Only add document level metadata if no section level metadata of that name exists.
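
The -maxnumeric description in the buildcol.pl table gives one worked case: with a limit of 4, "1342663" is split into "1342" and "663". The sketch below reproduces that splitting from the left in fixed-size chunks. The function name split_numeric is invented for this example, and it only mirrors the documented case; how the indexer treats other inputs is not stated on this page.

```shell
#!/bin/sh
# Split a run of digits into chunks of at most $2 digits, left to right.
# split_numeric is a hypothetical helper that mirrors the documented example only.
split_numeric() {
    digits=$1
    max=$2
    out=
    while [ "${#digits}" -gt "$max" ]; do
        head=$(printf '%s' "$digits" | cut -c "1-$max")           # first $max digits
        out="$out$head "
        digits=$(printf '%s' "$digits" | cut -c "$((max + 1))-")  # the remainder
    done
    printf '%s\n' "$out$digits"
}

split_numeric 1342663 4   # prints: 1342 663
```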

full-rebuild.pl

This program runs import.pl followed by buildcol.pl (in both cases removing any previously generated files in 'archives' or 'building'), and then replaces the content of the collection's 'index' directory with 'building'.

full-rebuild.pl [options] collection

Remember that for Greenstone 3 you should always include the option -site site-name.

If a minus option is shared between import.pl and buildcol.pl then it can appear as is, such as -verbosity 5. This value will be passed to both programs. If a minus option is specific to one of the programs in particular, then prefix it with import: or buildcol: respectively, as in -import:OIDtype hash_on_full_filename.
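
The prefix convention just described can be pictured as a small routing step. The sketch below is not how full-rebuild.pl is implemented; it is an illustration, with an invented function name route_options, of which arguments would reach which program. It assumes that an option's value directly follows the option and does not itself start with a minus sign.

```shell
#!/bin/sh
# Illustrate how shared and prefixed options are divided between
# import.pl and buildcol.pl. route_options is a hypothetical helper.
route_options() {
    import_args=
    buildcol_args=
    target=both
    for arg in "$@"; do
        case "$arg" in
            -import:*)   target=import;   arg="-${arg#-import:}" ;;
            -buildcol:*) target=buildcol; arg="-${arg#-buildcol:}" ;;
            -*)          target=both ;;
        esac
        # A value (no leading minus) follows the target of the option before it.
        case "$target" in
            import)   import_args="$import_args $arg" ;;
            buildcol) buildcol_args="$buildcol_args $arg" ;;
            both)     import_args="$import_args $arg"
                      buildcol_args="$buildcol_args $arg" ;;
        esac
    done
    printf 'import.pl%s\n' "$import_args"
    printf 'buildcol.pl%s\n' "$buildcol_args"
}

route_options -verbosity 5 -import:OIDtype hash_on_full_filename
```

With the arguments shown, the first printed line carries both -verbosity 5 and -OIDtype hash_on_full_filename, while the buildcol.pl line carries only -verbosity 5.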

schedule.pl

Interaction with Cron

GLI option | Command-line option | Description | Value
schedule -schedule Select to set up scheduled automatic collection re-building
frequency -frequency <enum> How often to automatically re-build the collection Default: daily (values listed below)
action -action <enum> How to set up automatic re-building Default: add (values listed below)
not available -import <quotestr> (REQUIRED) The import command to be scheduled
not available -build <quotestr> (REQUIRED) The buildcol command to be scheduled
not available -colname <string> The collection name for which scheduling will be set up
not available -xml Produces the information in an XML form, without 'pretty' comments but with much more detail.
not available -language <string> Language to display option descriptions in (e.g. 'en_US' specifies American English). Requires translations of the option descriptions to exist in the perllib/strings_language-code.rb file.
email -email Send email notification
toaddr -toaddr <string> The email address to send scheduled build notifications to
fromaddr -fromaddr <string> The sender email address
smtp -smtp <string> The mail server that sendmail must contact to send email
not available -out <string> Filename or handle to print output status to. Default: STDERR
not available -gli Running from the GLI

frequency option values

Value: Description
hourly: Re-build every hour
daily: Re-build every day
weekly: Re-build every week

action option values

Value: Description
add: Schedule automatic re-building
update: Update existing scheduling
delete: Delete existing scheduling

export.pl

Perl script used to export files in a Greenstone collection to another format.

GLI option | Command-line option | Description | Value
saveas -saveas <enum> Format to export documents as. Default: GreenstoneMETS (values listed below)
not available -exportdir <string> Where the export material ends up.
not available -importdir <string> Where the original material lives.
not available -collectdir <string> The path of the "collect" directory.
not available -site <string> Site to find collect directory in (for Greenstone 3 installation).
not available -manifest <string> An XML file that details what files are to be imported. Used instead of recursively descending the import folder, typically for incremental building.
not available -debug Print exported text to STDOUT (for GreenstoneXML exporting)
faillog -faillog <string> Fail log filename. This log receives the filenames of any files which fail to be processed. (Default: collectdir/collname/etc/fail.log)
not available -incremental Only import documents which are newer (by timestamp) than the current archives files. Implies -keepold.
not available -keepold Will not destroy the current contents of the export directory.
not available -removeold Will remove the old contents of the export directory.
not available -language <string> Language to display option descriptions in (e.g. 'en_US' specifies American English). Requires translations of the option descriptions to exist in the perllib/strings_language-code.rb file.
maxdocs -maxdocs <int> Maximum number of documents to export. Range: 1,
OIDtype -OIDtype <enum> The method to use when generating unique identifiers for each document. (values listed below)
OIDmetadata -OIDmetadata <string> Specifies the metadata element that holds the document's unique identifier, for use with -OIDtype=assigned. Default: dc.Identifier
not available -out <string> Filename or handle to print output status to. Default: STDERR
not available -statsfile <string> Filename or handle to print export statistics to. Default: STDERR
not available -xsltfile <string> Transform a document with the XSLT in the named file.
xslt_txt -xslt_txt <string> Transform a METS document's doctxt.xml with the XSLT in the named file.
xslt_mets -xslt_mets <string> Transform a METS document's docmets.xml with the XSLT in the named file.
fedora_namespace -fedora_namespace <string> The prefix used in Fedora for process ids (PIDS) e.g. greenstone:HASH0122efe4a2c58d0 (-saveas FedoraMETS) Default: greenstone
mapping_file -mapping_file <string> Use the named mapping file for the transformation. (-saveas MARCXML)
group_marc -group_marc Output the marc xml records into a single file. (-saveas MARCXML)
metadata_prefix -metadata_prefix <string> Comma-separated list of metadata prefixes to include in the exported data. For example, setting this value to 'dls' will generate a metadata_dls.xml file for each document exported in the format needed by DSpace. (-saveas DSpace)
verbosity -verbosity <int> Controls the quantity of output. 0=none, 3=lots. Default: 2 Range: 0,3
not available -gli A flag set when running this script from gli, enables output specific for gli.
listall -listall List all the saveas formats
not available -xml Produces the information in an XML form, without 'pretty' comments but with much more detail.

saveas option values

Value: Description
GreenstoneMETS: METS format using the Greenstone profile.
FedoraMETS: METS format using the Fedora profile.
MARCXML: MARC XML format (an XML version of MARC 21)
DSpace: DSpace Archive format.

OIDtype option values

Value: Description
hash: Hash the contents of the file. Document identifiers will be the same every time the collection is imported.
assigned: Use the metadata value given by the OIDmetadata option; if unspecified, for a particular document a hash is used instead. These identifiers should be unique. Numeric identifiers will be preceded by 'D'.
incremental: Use a simple document count. Significantly faster than "hash", but does not necessarily assign the same identifier to the same document content if the collection is reimported.
dirname: Use the immediate parent directory name. There should only be one document per directory, and directory names should be unique. E.g. import/b13as/h15ef/page.html will get an identifier of h15ef. Numeric identifiers will be preceded by 'D'.

exportcol.pl

Perl script used to export one or more collections to a Windows CD-ROM.

GLI option | Command-line option | Description | Value
cdname -cdname <string> The name of the CD-ROM – this is what will appear in the start menu once the CD-ROM is installed. Default: Greenstone Collections
cddir -cddir <string> The name of the directory that the CD contents are exported to. Default: exported_collections
not available -collectdir <string> The path of the "collect" directory.
noinstall -noinstall Create a CD-ROM where the library runs directly off the CD-ROM and nothing is installed on the host computer.
language -language <string> Language to display option descriptions in (e.g. 'en_US' specifies American English). Requires translations of the option descriptions to exist in the perllib/strings_language-code.rb file.
out -out <string> Filename or handle to print output status to. Default: STDERR
not available -xml Produces the information in an XML form, without 'pretty' comments but with much more detail.
not available -gli