This is a list of Greenstone scripts, with all available options, their descriptions, and value information.
To run any of these scripts, you must first set up the Greenstone environment in your terminal. To do this, cd to the Greenstone folder and run the setup script:
Greenstone 3, Linux/Mac
source gs3-setup.bash
Greenstone 3, Windows
gs3-setup.bat
Greenstone 2, Linux/Mac
source setup.bash
Greenstone 2, Windows
setup.bat
To get information on any of these scripts simply run:
perl -S <script filename> -h
PERL script used to create the directory structure for a new Greenstone collection.
Option | Description | Value | |
---|---|---|---|
GLI | Command line | ||
creator | -creator <string> | The collection creator's e-mail address. | |
optionfile | -optionfile <string> | Get options from file, useful on systems where long command lines may cause problems. | |
maintainer | -maintainer <string> | The collection maintainer's email address (if different from the creator). | |
group | -group | Create a new collection group instead of a standard collection. | |
gs3mode | -gs3mode | Mode for Greenstone 3 collections. | |
collectdir | -collectdir <string> | Directory where new collection will be created. | |
site | -site <string> | In gs3mode, uses this site name with the GSDL3HOME environment variable to determine collectdir, unless -collectdir is specified. | |
public | -public <enum> | If this collection has anonymous access. | Default: true List |
title | -title <string> | The title of the collection. | |
about | -about <string> | The about text for the collection. | |
buildtype | -buildtype <enum> | The 'buildtype' for the collection (e.g. mg, mgpp, lucene) | Default: mgpp List |
infodbtype | -infodbtype <enum> | The 'infodbtype' for the collection (e.g. gdbm, jdbm, sqlite) | Default: gdbm List |
plugin | -plugin <string> | Perl plugin module to use (there may be multiple plugin entries). | |
quiet | -quiet | Operate quietly. | |
language | -language <string> | Language to display option descriptions in (eg. 'en_US' specifies American English). Requires translations of the option descriptions to exist in the perllib/strings_language-code.rb file. | |
win31compat | -win31compat <enum> | Whether the named collection directory must conform to Windows 3.1 file conventions (i.e. a name of at most 8 characters). | Default: false List |
not available | -gli | ||
not available | -xml | Produces the information in an XML form, without 'pretty' comments but with much more detail. |
Value | Description |
---|---|
true | Collection is public |
false | Collection is private |
Value | Description |
---|---|
mgpp | {mkcol.buildtype.mgpp} |
lucene | {mkcol.buildtype.lucene} |
mg | {mkcol.buildtype.mg} |
Value | Description |
---|---|
gdbm | {mkcol.infodbtype.gdbm} |
sqlite | {mkcol.infodbtype.sqlite} |
jdbm | {mkcol.infodbtype.jdbm} |
mssql | {mkcol.infodbtype.mssql} |
gdbm-txtgz | {mkcol.infodbtype.gdbm-txtgz} |
Value | Description |
---|---|
true | Directory name 8 characters or less |
false | Directory name any length |
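As a rough illustration of how the options above combine, the following Python sketch assembles an mkcol.pl command line without running it. The helper name `mkcol_command` and the example values are hypothetical, and placing the collection name as the final positional argument is an assumption rather than something stated in the table.

```python
import shlex

# Values taken from the buildtype and infodbtype tables above.
BUILDTYPES = {"mgpp", "lucene", "mg"}
INFODBTYPES = {"gdbm", "sqlite", "jdbm", "mssql", "gdbm-txtgz"}


def mkcol_command(name, creator, title=None, buildtype="mgpp",
                  infodbtype="gdbm", public=True):
    """Return the argument list for an mkcol.pl call (nothing is executed)."""
    if buildtype not in BUILDTYPES:
        raise ValueError(f"unknown buildtype: {buildtype}")
    if infodbtype not in INFODBTYPES:
        raise ValueError(f"unknown infodbtype: {infodbtype}")
    args = ["perl", "-S", "mkcol.pl",
            "-creator", creator,
            "-buildtype", buildtype,
            "-infodbtype", infodbtype,
            "-public", "true" if public else "false"]
    if title is not None:
        args += ["-title", title]
    # Assumption: the collection name is the last, positional argument.
    args.append(name)
    return args


cmd = mkcol_command("demo", "me@example.org", title="Demo collection")
print(shlex.join(cmd))
```

Building the call as a list keeps values that contain spaces, such as the title, intact when the command is later passed to a process launcher.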
Downloads files from an external server
These are the basic options. Additional options depend on the value of -download_mode (these options are listed below).
Option | Description | Value | |
---|---|---|---|
GLI | Command line | ||
-download_mode <enum> | (REQUIRED) The type of server to download from | Allowable values: Web, MediaWiki, OAI, Z3950, and SRW |
not available | -cache_dir <string> | The location of the cache directory | |
not available | -gli | ||
Server Information | -info | Print information about the server, rather than downloading | |
Use proxy connection? | -proxy_on | Indicates you are using a proxy connection | |
Proxy host | -proxy_host <string> | Proxy host | |
Proxy port | -proxy_port <string> | Proxy port | |
Proxy username | -user_name <string> | Proxy username | |
Proxy password | -user_password <string> | Proxy password |
A module for downloading from MediaWiki websites
Option | Description | Value | |
---|---|---|---|
GLI | Command line | ||
url | -url <string> | (REQUIRED) Source URL. In case of HTTP redirects, this value may change | |
depth | -depth <int> | How many hyperlinks deep to go when downloading | Default: 0 Range: 0, |
below | -below | Only mirror files below this URL | |
within | -within | Only mirror files within the same site | |
reject_files | -reject_files <string> | Comma-separated list of URL patterns to ignore, e.g. *cgi-bin*,*.ppt ignores hyperlinks that contain either 'cgi-bin' or '.ppt' | Default: *action=*,*diff=*,*oldid=*,*printable*, *Recentchangeslinked*,*Userlogin*,*Whatlinkshere*, *redirect*,*Special:*,Talk:*, Image:*,*.ppt,*.pdf,*.zip,*.doc |
exclude_directories | -exclude_directories <string> | Comma-separated list of directories to exclude (each must be an absolute path), e.g. /people,/documentation will exclude the 'people' and 'documentation' subdirectories under the site currently being crawled. | Default: /wiki/index.php/Special:Recentchangeslinked,/wiki/index.php/Special:Whatlinkshere,/wiki/index.php/Talk:Creating_CD |
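To make the reject_files description concrete, here is a minimal Python sketch of comma-separated wildcard matching. It assumes the patterns behave like shell globs matched against the whole URL; the downloader itself may apply them differently, and the function name `is_rejected` and the example.org URLs are hypothetical.

```python
from fnmatch import fnmatchcase


def is_rejected(url, reject_files):
    """Return True if url matches any pattern in a comma-separated list."""
    # Split on commas and drop surrounding whitespace and empty entries.
    patterns = [p.strip() for p in reject_files.split(",") if p.strip()]
    # Assumption: each pattern is a case-sensitive glob over the full URL.
    return any(fnmatchcase(url, pattern) for pattern in patterns)


reject = "*cgi-bin*,*.ppt"
print(is_rejected("http://example.org/cgi-bin/search", reject))  # True
print(is_rejected("http://example.org/slides.ppt", reject))      # True
print(is_rejected("http://example.org/index.html", reject))      # False
```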
A module for downloading from OAI repositories
Option | Description | Value | |
---|---|---|---|
GLI | Command line | ||
url | -url <string> | (REQUIRED) OAI repository URL | |
metadata_prefix | -metadata_prefix <string> | The metadata format of the exported records, e.g. oai_dc, qdc, etc. Press the <Server information> button to find out what formats are supported. | Default: oai_dc |
set | -set <string> | Restrict the download to the specified set in the repository | |
get_doc | -get_doc | Download the source document if one is specified in the record | |
get_doc_exts | -get_doc_exts <string> | Permissible filename extensions of documents to get | Default: doc,pdf,ppt |
max_records | -max_records <int> | Maximum number of records to download | Range: 1, |
A module for downloading from SRW (Search/Retrieve Web Service) repositories
Option | Description | Value | |
---|---|---|---|
GLI | Command line | ||
host | -host <string> | (REQUIRED) Host URL | |
port | -port <string> | (REQUIRED) Port number of the repository | |
database | -database <string> | (REQUIRED) Database to search for records in | |
find | -find <string> | (REQUIRED) Retrieve records containing the specified search term | |
max_records | -max_records <int> | Maximum number of records to download | Default: 500 |
A module for downloading from the Internet via HTTP or FTP
Option | Description | Value | |
---|---|---|---|
GLI | Command line | ||
url | -url <string> | (REQUIRED) Source URL. In case of HTTP redirects, this value may change | |
depth | -depth <int> | How many hyperlinks deep to go when downloading | Default: 0 Range: 0, |
below | -below | Only mirror files below this URL | |
within | -within | Only mirror files within the same site | |
html_only | -html_only | Download only HTML files, and ignore associated files, e.g. images and stylesheets |
A module for downloading from Z3950 repositories
Option | Description | Value | |
---|---|---|---|
GLI | Command line | ||
host | -host <string> | (REQUIRED) Host URL | |
port | -port <string> | (REQUIRED) Port number of the repository | |
database | -database <string> | (REQUIRED) Database to search for records in | |
find | -find <string> | (REQUIRED) Retrieve records containing the specified search term | |
max_records | -max_records <int> | Maximum number of records to download | Default: 500 |
Provides information on the options available for downloadfrom.pl in each -download_mode.
perl -S downloadinfo.pl [options] [download-module]
Explode a metadata database
Option | Description | Value | |
---|---|---|---|
GLI | Command line | ||
not available | -language <string> | Language to display option descriptions in (eg. 'en_US' specifies American English). Requires translations of the option descriptions to exist in the perllib/strings_language-code.rb file. | |
not available | -plugin <string> | (REQUIRED) Plugin to use for exploding | |
input_encoding | -input_encoding <enum> | Encoding to use when reading in the database file | Default: auto List |
metadata_set | -metadata_set <string> | Metadata set (namespace) to export all metadata as | |
document_field | -document_field <string> | The metadata element specifying the file name of documents to obtain and include in the collection. | |
document_prefix | -document_prefix <string> | A prefix for the document locations (for use with the document_field option). | |
document_suffix | -document_suffix <string> | A suffix for the document locations (for use with the document_field option). | |
records_per_folder | -records_per_folder <int> | The number of records to put in each subfolder. | Default: 100 Range: 0, |
not available | -collectdir <string> | The path of the "collect" directory. | |
not available | -site <string> | Site to find collect directory in (for Greenstone 3 installation). | |
not available | -collection <string> | The collection name. Some plugins look for auxiliary files in the collection folder. | |
not available | -use_collection_plugin_options | Read the collection configuration file and use the options for the specified plugin. Requires the -collection option. Cannot be used with -plugin_options. | |
not available | -plugin_options <string> | Options to pass to the plugin before exploding. Option names must start with -. Separate option names and values with a space. Cannot be used with -use_collection_plugin_options. | |
verbosity | -verbosity <int> | Controls the quantity of output. 0=none, 3=lots. | Default: 1 Range: 0, |
not available | -xml |
Value | Description |
---|---|
auto | Use text categorization algorithm to automatically identify the encoding of each source document. This will be slower than explicitly setting the encoding but will work where more than one encoding is used within the same collection. |
ascii | Plain 7 bit ASCII. This may be a bit faster than using iso_8859_1. Beware of using this when the text may contain characters outside the plain 7 bit ASCII set though (e.g. German or French text containing accents), use iso_8859_1 instead. |
utf8 | Either utf8 or unicode – automatically detected. |
unicode | Just unicode. |
iso_8859_6 | Arabic |
gb | Chinese Simplified (GB) |
big5 | Chinese Traditional (Big5) |
koi8_r | Cyrillic |
iso_8859_5 | Cyrillic |
koi8_u | Cyrillic (Ukrainian) |
dos_437 | DOS codepage 437 (US English) |
dos_850 | DOS codepage 850 (Latin 1) |
dos_852 | DOS codepage 852 (Central European) |
dos_866 | DOS codepage 866 (Cyrillic) |
iso_8859_7 | Greek |
iso_8859_8 | Hebrew |
iscii_de | ISCII Devanagari |
euc_jp | Japanese (EUC) |
shift_jis | Japanese (Shift-JIS) |
korean | Korean (Unified Hangul Code - i.e. a superset of EUC-KR) |
iso_8859_1 | Latin1 (western languages) |
iso_8859_15 | Latin15 (revised western) |
iso_8859_2 | Latin2 (central and eastern european languages) |
iso_8859_3 | Latin3 |
iso_8859_4 | Latin4 |
iso_8859_9 | Turkish |
windows_1250 | Windows codepage 1250 (WinLatin2) |
windows_1251 | Windows codepage 1251 (WinCyrillic) |
windows_1252 | Windows codepage 1252 (WinLatin1) |
windows_1253 | Windows codepage 1253 (WinGreek) |
windows_1254 | Windows codepage 1254 (WinTurkish) |
windows_1255 | Windows codepage 1255 (WinHebrew) |
windows_1256 | Windows codepage 1256 (WinArabic) |
windows_1257 | Windows codepage 1257 (WinBaltic) |
windows_1258 | Windows codepage 1258 (Vietnamese) |
windows_874 | Windows codepage 874 (Thai) |
PERL script used to import files into a format (GreenstoneXML or GreenstoneMETS) ready for building.
Option | Description | Value | |
---|---|---|---|
GLI | Command line | ||
saveas | -saveas <enum> | Format that the archive files should be saved as. | Default: GreenstoneXML List |
not available | -archivedir <string> | Where the converted material ends up. | |
not available | -importdir <string> | Where the original material lives. | |
not available | -collectdir <string> | The path of the "collect" directory. | |
not available | -site <string> | Site to find collect directory in (for Greenstone 3 installation). | |
not available | -manifest <string> | An XML file that details what files are to be imported. Used instead of recursively descending the import folder, typically for incremental building. | |
not available | -debug | Print imported text to STDOUT (for GreenstoneXML importing) | |
faillog | -faillog <string> | Fail log filename. This log receives the filenames of any files which fail to be processed. | |
not available | -incremental | Only import documents which are newer (by timestamp) than the current archives files. Implies -keepold. | |
not available | -keepold | Will not destroy the current contents of the archives directory. | |
not available | -removeold | Will remove the old contents of the archives directory. | |
not available | -language <string> | Language to display option descriptions in (eg. 'en_US' specifies American English). Requires translations of the option descriptions to exist in the perllib/strings_language-code.rb file. | |
maxdocs | -maxdocs <int> | Maximum number of documents to import. | Range: 1, |
OIDtype | -OIDtype <enum> | The method to use when generating unique identifiers for each document. | List |
OIDmetadata | -OIDmetadata <string> | Specifies the metadata element that holds the document's unique identifier, for use with -OIDtype=assigned. | |
not available | -out <string> | Filename or handle to print output status to. | Default: STDERR |
sortmeta | -sortmeta <string> | Sort documents alphabetically by metadata for building. Search results for boolean queries will be displayed in this order. This will be disabled if groupsize > 1. May be a comma separated list to sort by more than one metadata value. Use the ArchiveInfPlugin options to dictate whether sorting will be ascending or descending. | |
removeprefix | -removeprefix <regexp> | A prefix to ignore in metadata values when sorting. | |
removesuffix | -removesuffix <regexp> | A suffix to ignore in metadata values when sorting. | |
groupsize | -groupsize <int> | Number of import documents to group into one XML file. | Default: 1 |
gzip | -gzip | Use gzip to compress resulting xml documents (don't forget to include ZIPPlugin in your plugin list when building from compressed documents). | |
not available | -statsfile <string> | Filename or handle to print import statistics to. | Default: STDERR |
verbosity | -verbosity <int> | Controls the quantity of output. 0=none, 3=lots. | Range: 0, |
not available | -gli | A flag set when running this script from gli, enables output specific for gli. | |
not available | -xml | Produces the information in an XML form, without 'pretty' comments but with much more detail. |
Value | Description |
---|---|
GreenstoneXML | Greenstone XML Archive format |
GreenstoneMETS | METS format using the Greenstone profile. |
Value | Description |
---|---|
hash | Hash the contents of the file. Document identifiers will be the same every time the collection is imported. |
hash_on_full_filename | Hash on the full filename to the document within the 'import' folder (and not its contents). Helps make document identifiers more stable across upgrades of the software, although it means that duplicate documents contained in the collection are no longer detected automatically. |
assigned | Use the metadata value given by the OIDmetadata option (preceded by 'D'); if unspecified, for a particular document a hash is used instead. These identifiers should be unique. |
incremental | Use a simple document count. Significantly faster than "hash", but does not necessarily assign the same identifier to the same document content if the collection is reimported. |
dirname | Use the parent directory name (preceded by 'J'). There should only be one document per directory, and directory names should be unique. E.g. import/b13as/h15ef/page.html will get an identifier of Jh15ef. |
full_filename | Use the full file name within the 'import' folder as the identifier for the document (with _ and - substitutions made for symbols such as directory separators and the fullstop in a filename extension) |
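The dirname rule in the table above can be sketched in a few lines of Python. This only reproduces the documented example (parent directory name preceded by 'J'); the function name `dirname_oid` is hypothetical and the real script may normalise names further.

```python
from pathlib import PurePosixPath


def dirname_oid(import_path):
    """Derive a dirname-style identifier: 'J' plus the parent directory name."""
    parent = PurePosixPath(import_path).parent.name
    if not parent:
        # A file with no parent directory gives nothing to name it after.
        raise ValueError(f"no parent directory in {import_path!r}")
    return "J" + parent


print(dirname_oid("import/b13as/h15ef/page.html"))  # Jh15ef
```

Because only the immediate parent is used, two documents in the same directory would receive the same identifier, which is why the table asks for one document per directory.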
PERL script used to build a Greenstone collection from archive documents.
Option | Description | Value | |
---|---|---|---|
GLI | Command line | ||
remove_empty_classifications | -remove_empty_classifications | Hide empty classifiers and classification nodes (those that contain no documents). | |
not available | -archivedir <string> | Where the archives live. | |
not available | -builddir <string> | Where to put the built indexes. | |
not available | -collectdir <string> | The path of the "collect" directory. | |
not available | -site <string> | {buildcol.site} | |
not available | -debug | Print output to STDOUT. | |
faillog | -faillog <string> | Fail log filename. This log receives the filenames of any files which fail to be processed. | |
index | -index <string> | Index to build (will build all in config file if not set). | |
not available | -incremental | Only index documents which have not been previously indexed. Implies -keepold. Relies on the lucene indexer. | |
not available | -keepold | Will not destroy the current contents of the building directory. | |
not available | -removeold | Will remove the old contents of the building directory. | |
language | -language <string> | Language to display option descriptions in (eg. 'en_US' specifies American English). Requires translations of the option descriptions to exist in the perllib/strings_language-code.rb file. | |
not available | -maxdocs <int> | Maximum number of documents to build. | |
maxnumeric | -maxnumeric <int> | The maximum number of digits a 'word' can have in the index dictionary. Large numbers are split into several words for indexing. For example, if maxnumeric is 4, "1342663" will be split into "1342" and "663". | Default: 4 Range: 4,512 |
mode | -mode <enum> | The parts of the building process to carry out. | List |
no_strip_html | -no_strip_html | Do not strip the html tags from the indexed text (only used for mgpp collections). | |
store_metadata_coverage | -store_metadata_coverage | Include statistics about which metadata sets are used in a collection, including which actual metadata terms are used. This is useful in the built collection if you want to list the metadata values that are used in a particular collection. | |
no_text | -no_text | Don't store compressed text. This option is useful for minimizing the size of the built indexes if you intend always to display the original documents at run time (i.e. you won't be able to retrieve the compressed text version). | |
sections_index_document_metadata | -sections_index_document_metadata <enum> | Index document level metadata at section level | List |
not available | -out <string> | Filename or handle to print output status to. | Default: STDERR |
verbosity | -verbosity <int> | Controls the quantity of output. 0=none, 3=lots. | |
not available | -gli | ||
not available | -xml | Produces the information in an XML form, without 'pretty' comments but with much more detail. | |
not available | -activate | Run activate.pl after buildcol has finished, which will move building to index. |
Value | Description |
---|---|
all | Do everything. |
compress_text | Just compress the text. |
build_index | Just index the text. |
infodb | Just build the metadata database. |
Value | Description |
---|---|
never | Don't index any document metadata at section level. |
always | Add all specified document level metadata even if section level metadata of that name exists. |
unless_section_metadata_exists | Only add document level metadata if no section level metadata of that name exists. |
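The maxnumeric option in the table above describes splitting long numbers into shorter 'words'. A minimal Python sketch of that behaviour, assuming simple left-to-right chunking (which matches the documented example but is not confirmed as the indexer's exact method), might look like this. The function name `split_number` is hypothetical.

```python
def split_number(digits, maxnumeric=4):
    """Split a digit string into chunks of at most maxnumeric digits."""
    # The documented range for -maxnumeric is 4 to 512.
    if not 4 <= maxnumeric <= 512:
        raise ValueError("maxnumeric must be between 4 and 512")
    return [digits[i:i + maxnumeric]
            for i in range(0, len(digits), maxnumeric)]


print(split_number("1342663"))  # ['1342', '663']
```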
This program runs import.pl followed by buildcol.pl (in both cases removing any previously generated files in 'archives' or 'building'), and then replaces the content of the collection's 'index' directory with 'building'.
full-rebuild.pl [options] collection
Remember that for Greenstone 3 you should always include the option -site site-name.
If a minus option is shared between import.pl and buildcol.pl then it can appear as is, such as -verbosity 5. This value will be passed to both programs. If a minus option is specific to one of the programs in particular, then prefix it with import: or buildcol: respectively, as in -import:OIDtype hash_on_full_filename.
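The prefix rule for shared and program-specific options can be illustrated with a short Python sketch. It is not the script's own parsing code: the function name `route_options` is hypothetical, it handles options only (not the trailing collection name), and it assumes that a token not starting with '-' is the value of the option before it.

```python
def route_options(args):
    """Split full-rebuild style options into import and buildcol lists."""
    import_args, buildcol_args = [], []
    i = 0
    while i < len(args):
        opt = args[i]
        if not opt.startswith("-"):
            raise ValueError(f"expected an option, got {opt!r}")
        # Assumption: a following token without a leading '-' is this
        # option's value; otherwise the option is a bare flag.
        values = []
        if i + 1 < len(args) and not args[i + 1].startswith("-"):
            values = [args[i + 1]]
        if opt.startswith("-import:"):
            import_args += ["-" + opt[len("-import:"):]] + values
        elif opt.startswith("-buildcol:"):
            buildcol_args += ["-" + opt[len("-buildcol:"):]] + values
        else:
            # Unprefixed options are passed to both programs.
            import_args += [opt] + values
            buildcol_args += [opt] + values
        i += 1 + len(values)
    return import_args, buildcol_args


imp, bld = route_options(
    ["-verbosity", "5", "-import:OIDtype", "hash_on_full_filename"])
print(imp)  # ['-verbosity', '5', '-OIDtype', 'hash_on_full_filename']
print(bld)  # ['-verbosity', '5']
```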
Interaction with Cron
Option | Description | Value | |
---|---|---|---|
GLI | Command line | ||
schedule | -schedule | Select to set up scheduled automatic collection re-building | |
frequency | -frequency <enum> | How often to automatically re-build the collection | Default: daily List |
action | -action <enum> | How to set up automatic re-building | Default: add List |
not available | -import <quotestr> | (REQUIRED) The import command to be scheduled | |
not available | -build <quotestr> | (REQUIRED) The buildcol command to be scheduled | |
not available | -colname <string> | The collection name for which scheduling will be set up | |
not available | -xml | Produces the information in an XML form, without 'pretty' comments but with much more detail. | |
not available | -language <string> | Language to display option descriptions in (eg. 'en_US' specifies American English). Requires translations of the option descriptions to exist in the perllib/strings_language-code.rb file. | |
-email | Send email notification | ||
toaddr | -toaddr <string> | The email address to send scheduled build notifications to | |
fromaddr | -fromaddr <string> | The sender email address | |
smtp | -smtp <string> | The mail server that sendmail must contact to send email | |
not available | -out <string> | Filename or handle to print output status to. | Default: STDERR |
not available | -gli | Running from the GLI |
Value | Description |
---|---|
hourly | Re-build every hour |
daily | Re-build every day |
weekly | Re-build every week |
Value | Description |
---|---|
add | Schedule automatic re-building |
update | Update existing scheduling |
delete | Delete existing scheduling |
PERL script used to export files in a Greenstone collection to another format.
Option | Description | Value | |
---|---|---|---|
GLI | Command line | ||
saveas | -saveas <enum> | Format to export documents as. | Default: GreenstoneMETS List |
not available | -exportdir <string> | Where the export material ends up. | |
not available | -importdir <string> | Where the original material lives. | |
not available | -collectdir <string> | The path of the "collect" directory. | |
not available | -site <string> | Site to find collect directory in (for Greenstone 3 installation). | |
not available | -manifest <string> | An XML file that details what files are to be imported. Used instead of recursively descending the import folder, typically for incremental building. | |
not available | -debug | Print exported text to STDOUT (for GreenstoneXML exporting) | |
faillog | -faillog <string> | Fail log filename. This log receives the filenames of any files which fail to be processed. (Default: collectdir/collname/etc/fail.log) | |
not available | -incremental | Only import documents which are newer (by timestamp) than the current archives files. Implies -keepold. | |
not available | -keepold | Will not destroy the current contents of the export directory. | |
not available | -removeold | Will remove the old contents of the export directory. | |
not available | -language <string> | Language to display option descriptions in (eg. 'en_US' specifies American English). Requires translations of the option descriptions to exist in the perllib/strings_language-code.rb file. | |
maxdocs | -maxdocs <int> | Maximum number of documents to export. | Range: 1, |
OIDtype | -OIDtype <enum> | The method to use when generating unique identifiers for each document. | List |
OIDmetadata | -OIDmetadata <string> | Specifies the metadata element that holds the document's unique identifier, for use with -OIDtype=assigned. | Default: dc.Identifier |
not available | -out <string> | Filename or handle to print output status to. | Default: STDERR |
not available | -statsfile <string> | Filename or handle to print export statistics to. | Default: STDERR |
not available | -xsltfile <string> | Transform a document with the XSLT in the named file. | |
xslt_txt | -xslt_txt <string> | Transform a mets's doctxt.xml with the XSLT in the named file. | |
xslt_mets | -xslt_mets <string> | Transform a mets's docmets.xml with the XSLT in the named file. | |
fedora_namespace | -fedora_namespace <string> | The prefix used in Fedora for process ids (PIDS) e.g. greenstone:HASH0122efe4a2c58d0 (-saveas FedoraMETS) | Default: greenstone |
mapping_file | -mapping_file <string> | Use the named mapping file for the transformation. (-saveas MARCXML) | |
group_marc | -group_marc | Output the marc xml records into a single file. (-saveas MARCXML) | |
metadata_prefix | -metadata_prefix <string> | Comma separated list of metadata prefixes to include in the exported data. For example, setting this value to 'dls' will generate a metadata_dls.xml file for each document exported in the format needed by DSpace. (-saveas DSpace) | |
verbosity | -verbosity <int> | Controls the quantity of output. 0=none, 3=lots. | Default: 2 Range: 0,3 |
not available | -gli | A flag set when running this script from gli, enables output specific for gli. | |
listall | -listall | List all the saveas formats | |
not available | -xml | Produces the information in an XML form, without 'pretty' comments but with much more detail. |
Value | Description |
---|---|
GreenstoneMETS | METS format using the Greenstone profile. |
FedoraMETS | METS format using the Fedora profile. |
MARCXML | MARC XML format (an XML version of MARC 21) |
DSpace | DSpace Archive format. |
Value | Description |
---|---|
hash | Hash the contents of the file. Document identifiers will be the same every time the collection is imported. |
assigned | Use the metadata value given by the OIDmetadata option; if unspecified, for a particular document a hash is used instead. These identifiers should be unique. Numeric identifiers will be preceded by 'D'. |
incremental | Use a simple document count. Significantly faster than "hash", but does not necessarily assign the same identifier to the same document content if the collection is reimported. |
dirname | Use the immediate parent directory name. There should only be one document per directory, and directory names should be unique. E.g. import/b13as/h15ef/page.html will get an identifier of h15ef. Numeric identifiers will be preceded by 'D'. |
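The note that numeric identifiers are preceded by 'D' can be sketched as follows. This assumes "numeric" means the identifier consists only of digits, which the table does not spell out; the function name `export_oid` is hypothetical.

```python
def export_oid(identifier):
    """Prefix purely numeric identifiers with 'D'; leave others unchanged."""
    # Assumption: an identifier counts as numeric when it is all digits.
    if identifier.isdigit():
        return "D" + identifier
    return identifier


print(export_oid("12345"))  # D12345
print(export_oid("h15ef"))  # h15ef
```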
PERL script used to export one or more collections to a Windows CD-ROM.
Option | Description | Value | |
---|---|---|---|
GLI | Command line | ||
cdname | -cdname <string> | The name of the CD-ROM – this is what will appear in the start menu once the CD-ROM is installed. | Default: Greenstone Collections |
cddir | -cddir <string> | The name of the directory that the CD contents are exported to. | Default: exported_collections |
not available | -collectdir <string> | The path of the "collect" directory. | |
noinstall | -noinstall | Create a CD-ROM where the library runs directly off the CD-ROM and nothing is installed on the host computer. | |
language | -language <string> | Language to display option descriptions in (eg. 'en_US' specifies American English). Requires translations of the option descriptions to exist in the perllib/strings_language-code.rb file. | |
out | -out <string> | Filename or handle to print output status to. | Default: STDERR |
not available | -xml | Produces the information in an XML form, without 'pretty' comments but with much more detail. | |
not available | -gli |