— | legacy:manuals:en:develop:understanding_the_collection-building_process [2023/03/13 01:46] (current) – created - external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | |||
+ | |||
+ | |||
+ | ====== Understanding the collection-building process ====== | ||
+ | |||
+ | End users of Greenstone can build collections using the Collector, described in the // | ||
+ | |||
+ | We assume throughout this manual that you have installed Greenstone on your computer, be it Windows or Unix. If you have not yet done this you should consult the // | ||
+ | |||
+ | ===== Building collections from the command line ===== | ||
+ | |||
+ | Let us begin by walking through the operations involved in building a collection from the command line, to help understand the process better. Of course, for more user-friendly collection building, you should use the Collector instead. The collection we take as an example is one that comes on the Greenstone software distribution CD-ROM, and contains the WWW home pages of many of the people who have worked on the New Zealand Digital Library Project and the Greenstone software. | ||
+ | |||
+ | Separate subsections follow for building under Windows and Unix. In fact, the two subsections are very nearly identical—you need only go through the one that pertains to your system. When following the walkthrough, | ||
+ | |||
+ | ==== Collection building under Windows ==== | ||
+ | |||
+ | The first challenge when building a Greenstone collection from the command line under Windows is to get at the “command prompt,” the place where you type commands. Try looking in the //Start// menu, or under the // | ||
+ | |||
+ | Change into the directory where Greenstone has been installed. Assuming Greenstone was installed in its default location, you can move there by typing | ||
+ | |||
+ | < | ||
+ | cd " | ||
+ | </ | ||
+ | |||
+ | (You need the quotation marks because of the space in //Program Files//.) Next, at the prompt type | ||
+ | |||
+ | < | ||
+ | setup.bat | ||
+ | </ | ||
+ | |||
+ | This batch file (which you can read if you like) tells the system where to look for Greenstone programs.((On Windows 95/98 systems running // | ||
+ | |||
+ | Now you are in a position to make, build and rebuild collections. The first program we will look at is the Perl program // | ||
+ | |||
+ | Let us now use the command to create the initial files and subdirectories necessary for our home page collection of Greenstone Digital Library project members. To assign the collection the name // | ||
+ | |||
+ | < | ||
+ | perl -S mkcol.pl -creator [email protected] dlpeople | ||
+ | </ | ||
+ | |||
+ | (or //mkcol.pl -creator [email protected] dlpeople// if Perl is associated with the //.pl// file extension). Please substitute your email address for mine! | ||
+ | |||
+ | To view the newly created files, move to the newly created collection directory by typing | ||
+ | |||
+ | < | ||
+ | cd " | ||
+ | </ | ||
+ | |||
+ | < | ||
+ | < | ||
+ | creator [email protected] | ||
+ | maintainer [email protected] | ||
+ | public true | ||
+ | beta true | ||
+ | indexes document: | ||
+ | defaultindex document: | ||
+ | plugin ZIPPlug | ||
+ | plugin GAPlug | ||
+ | plugin TEXTPlug | ||
+ | plugin HTMLPlug | ||
+ | plugin EMAILPlug | ||
+ | plugin ArcPlug | ||
+ | plugin RecPlug | ||
+ | classify AZList -metadata " | ||
+ | collectionmeta collectionname " | ||
+ | collectionmeta iconcollection "" | ||
+ | collectionmeta collectionextra "" | ||
+ | collectionmeta .document: | ||
+ | </ | ||
+ | |||
+ | |||
+ | |||
+ | You can list the contents of this directory by typing //dir//. There should be seven subdirectories: | ||
+ | |||
+ | Now we must populate the collection with sample documents. Source material for the // | ||
+ | |||
+ | > select the contents of the // | ||
+ | |||
+ | Alternatively, | ||
+ | |||
+ | < | ||
+ | xcopy /s d: | ||
+ | </ | ||
+ | |||
+ | In the collection' | ||
+ | |||
+ | Now you are ready to “import” the collection. This is the process of bringing the documents into the Greenstone system, standardising the document format, the way that metadata is specified, and the file structure in which the documents are stored. Type //perl -S import.pl// at the prompt to get a list of all the options for the import program. The //-removeold// option is used to ensure that any previously imported documents are removed first. | ||
+ | |||
+ | < | ||
+ | perl -S import.pl -removeold dlpeople | ||
+ | </ | ||
+ | |||
+ | Don't worry about all the text that scrolls past—it' | ||
+ | |||
+ | Now let's make some changes to the collection configuration file to customise its appearance. First, give the collection a name. This will be treated by web browsers as the page title for the front page of the collection, and used as the collection icon in the absence of a picture. Change the line that reads // | ||
+ | |||
+ | Add a description of your collection between the quotes of the line that reads // | ||
+ | |||
+ | You can use any picture you can view in a web browser for a collection icon—the image I created is shown in Figure <imgref figure_collection_icon> | ||
+ | |||
+ | Save the collection configuration file, and close it—you won't need to look at it again during this tutorial. | ||
+ | |||
+ | The next phase is to “build” the collection, which creates all the indexes and files that make the collection work. Type //perl -S buildcol.pl// | ||
+ | |||
+ | < | ||
+ | perl -S buildcol.pl dlpeople | ||
+ | </ | ||
+ | |||
+ | Again, don't worry about the “progress report” text that scrolls past. | ||
+ | |||
+ | Make the collection “live” as follows: | ||
+ | |||
+ | select the contents of the // | ||
+ | |||
+ | Alternatively, | ||
+ | |||
+ | < | ||
+ | rd /s index # on Windows NT/2000 | ||
+ | deltree /Y index # on Windows 95/98 | ||
+ | </ | ||
+ | |||
+ | and then change the name of the // | ||
+ | |||
+ | < | ||
+ | ren building index | ||
+ | </ | ||
+ | |||
+ | Finally, type | ||
+ | |||
+ | < | ||
+ | mkdir building | ||
+ | </ | ||
+ | |||
+ | in preparation for any future rebuilds. It is important that these commands are issued from the correct directory (unlike the Greenstone commands // | ||
+ | |||
+ | You should be able to access the newly built collection from your Greenstone homepage. You will have to reload the page if you already had it open in your browser, or perhaps even close the browser and restart it (to prevent caching problems). Alternatively, | ||
+ | |||
+ | < | ||
+ | {{..: | ||
+ | |||
+ | In summary then, the commands typed to produce the // | ||
+ | |||
+ | < | ||
+ | cd " | ||
+ | setup.bat | ||
+ | perl -S mkcol.pl -creator [email protected] dlpeople | ||
+ | cd " | ||
+ | xcopy / | ||
+ | perl -S import.pl dlpeople | ||
+ | perl -S buildcol.pl dlpeople | ||
+ | rd /s index # on Windows NT/2000 | ||
+ | deltree /Y index # on Windows 95/98 | ||
+ | ren building index | ||
+ | mkdir building | ||
+ | </ | ||
+ | |||
+ | ==== Collection building under Unix ==== | ||
+ | |||
+ | First change into the directory where Greenstone has been installed. For example, if Greenstone is installed under its default name at the top level of your home account you can move there by typing | ||
+ | |||
+ | < | ||
+ | cd ~/gsdl | ||
+ | </ | ||
+ | |||
+ | Next at the prompt, type | ||
+ | |||
+ | < | ||
+ | source setup.bash # if you are using the bash shell | ||
+ | source setup.csh # if you are using the C shell (csh) | ||
+ | </ | ||
+ | |||
+ | These setup scripts (which you can read if you like) tell the system where to look for Greenstone programs. If, later on in your command-line session with Greenstone, you wish to return to the top level Greenstone directory you can accomplish this by typing //cd $GSDLHOME// | ||
+ | |||
+ | If you are unsure of the shell type you are using, enter //echo $0// at your command-line prompt; it will print the name of the shell. If you are using a different shell, contact your system administrator for advice. | ||
+ | |||
+ | With the appropriate setup file sourced, we are now in a position to make, build and rebuild collections. The first program we will look at is the Perl program // | ||
+ | |||
+ | < | ||
+ | {{..: | ||
+ | |||
+ | Let us now use the command to create the initial files and directories necessary for our home page collection of Greenstone Digital Library project members. To assign the collection the name // | ||
+ | |||
+ | < | ||
+ | mkcol.pl -creator [email protected] dlpeople | ||
+ | </ | ||
+ | |||
+ | Please substitute your email address for mine! | ||
+ | |||
+ | To view the newly created files, move to the newly created collection directory by typing | ||
+ | |||
+ | < | ||
+ | cd $GSDLHOME/ | ||
+ | </ | ||
+ | |||
+ | You can list the contents of this directory by typing //ls//. There should be seven subdirectories: | ||
+ | |||
+ | Now we must populate the collection with sample documents. Source material for the // | ||
+ | |||
+ | < | ||
+ | mount /cdrom | ||
+ | </ | ||
+ | |||
+ | at the prompt (this command may differ from one system to another). Once mounted, the CD-ROM can be used like any other directory, so type //ls / | ||
+ | |||
+ | Next, copy the contents of the /// | ||
+ | |||
+ | < | ||
+ | cp -r / | ||
+ | </ | ||
+ | |||
+ | Then type | ||
+ | |||
+ | < | ||
+ | umount /cdrom | ||
+ | </ | ||
+ | |||
+ | to unmount the CD-ROM. | ||
+ | |||
+ | In the collection' | ||
+ | |||
+ | Now you are ready to “import” the collection. This is the process of bringing the documents into the Greenstone system, standardising the document format, the way that metadata is specified, and the file structure in which the documents are stored. Type // | ||
+ | |||
+ | < | ||
+ | import.pl -removeold dlpeople | ||
+ | </ | ||
+ | |||
+ | Don't worry about all the text that scrolls past—it' | ||
+ | |||
+ | Now let's make some changes to the collection configuration file to customise its appearance. First, give the collection a name. This will be treated by web browsers as the page title for the front page of the collection, and used as the collection icon in the absence of a picture. Change the line that reads // | ||
+ | |||
+ | Add a description of your collection between the quotes of the line that reads // | ||
+ | |||
+ | You can use any picture you can view in a web browser for a collection icon—the image I created is shown in Figure <imgref figure_collection_icon> | ||
+ | |||
+ | Save the collection configuration file, and close it—you won't need to look at it again during this tutorial. | ||
+ | |||
+ | The next phase is to “build” the collection, which creates all the indexes and files that make the collection work. Type // | ||
+ | |||
+ | < | ||
+ | buildcol.pl dlpeople | ||
+ | </ | ||
+ | |||
+ | at the prompt. Again, don't worry about the “progress report” text that scrolls past. | ||
+ | |||
+ | Make the collection “live” by putting all the material that has just been put in the collection' | ||
+ | |||
+ | < | ||
+ | rm -r index/* | ||
+ | </ | ||
+ | |||
+ | (assuming you are in the // | ||
+ | |||
+ | < | ||
+ | mv building/* index/ | ||
+ | </ | ||
+ | |||
+ | < | ||
+ | |< - 265 265 >| | ||
+ | | **Windows** | **Linux** | | ||
+ | | Run // | ||
+ | | Copy files from CD-ROM using the visual manager or Windows commands | Copy files from CD-ROM using //mount// and Unix commands | | ||
+ | | Old collection index replaced by typing //rd /s index// then //ren building index// followed by //mkdir building//, or by using the visual file manager. | Old collection index replaced by typing //rm -r index/*// then //mv building/* index// | | ||
+ | |||
+ | You should be able to access the collection from your Greenstone homepage. You will have to reload the page if you already had it open in your browser, or perhaps even close the browser and restart it (to prevent caching problems). To view the new collection, click on the image. The result should look something like Figure <imgref figure_about_page_for_the_new_collection> | ||
+ | |||
+ | In summary then, the commands typed to produce the // | ||
+ | |||
+ | < | ||
+ | cd ~/gsdl # assuming default Greenstone in home directory | ||
+ | source setup.bash # if you are using the bash shell | ||
+ | source setup.csh # if you are using the C shell (csh) | ||
+ | mkcol.pl -creator [email protected] dlpeople | ||
+ | cd $GSDLHOME/ | ||
+ | mount /cdrom # assuming this is where CD-ROM is mapped to | ||
+ | cp -r / | ||
+ | umount /cdrom | ||
+ | import.pl dlpeople | ||
+ | buildcol.pl dlpeople | ||
+ | rm -r index/* | ||
+ | mv building/* index | ||
+ | </ | ||
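The Greenstone steps in this summary lend themselves to scripting. The following is a minimal Python sketch, not part of Greenstone: the helper name //collection_commands// and its parameters are hypothetical, it only assembles argument lists without running anything, and it assumes the setup script has already been sourced so that the Perl programs can be found. The step of copying source documents into the collection is deliberately left out.

```python
from typing import List


def collection_commands(name: str, creator: str, windows: bool = False) -> List[List[str]]:
    """Return the mkcol/import/buildcol invocations for one collection.

    Hypothetical helper: it builds argument lists only and executes nothing.
    """
    # On Windows the walkthrough runs each script through "perl -S".
    prefix = ["perl", "-S"] if windows else []
    return [
        prefix + ["mkcol.pl", "-creator", creator, name],
        prefix + ["import.pl", "-removeold", name],
        prefix + ["buildcol.pl", name],
    ]


if __name__ == "__main__":
    # "me@example.org" is a placeholder address, not one from the manual.
    for argv in collection_commands("dlpeople", "me@example.org"):
        print(" ".join(argv))
```

Each returned list could be handed to a process launcher once the source documents are in place; keeping the lists separate from execution makes the sequence easy to inspect first.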
+ | |||
+ | ==== Differences between Windows and Unix ==== | ||
+ | |||
+ | The collection building process under Unix is very similar to that under Windows, but there are some small differences which are summarised in Table <tblref table_collection-building_differences_between_windows_and_linux> | ||
+ | |||
+ | ===== Greenstone directories ===== | ||
+ | |||
+ | Figure <imgref figure_structure_of_the_gsdlhome_directory> | ||
+ | |||
+ | < | ||
+ | {{..: | ||
+ | |||
+ | < | ||
+ | |< - 132 331 66 >| | ||
+ | | | **Contents** | Section | | ||
+ | | //bin// | Executable code, including binaries in the directory with your O/S name. | — | | ||
+ | | // | ||
+ | | //perllib// | Perl modules used at import and build time (plugins, for example). | 2.1 | | ||
+ | | // | ||
+ | | // | ||
+ | | //cgi-bin// | All Greenstone CGI scripts, which are moved to the system cgi-bin directory. | — | | ||
+ | | //tmp// | Directory used by Greenstone for storing temporary files. | — | | ||
+ | | //etc// | Configuration files, initialisation and error logs, user authorisation databases. | — | | ||
+ | | //src// | C++ code used for serving collections via a web server. | 3 | | ||
+ | | // | ||
+ | | // | ||
+ | | // | ||
+ | | // | ||
+ | | // | ||
+ | | //macros// | The macro files used for the user interface. | 2.4 | | ||
+ | | //collect// | Collections being served from this copy of Greenstone. | 1.1 | | ||
+ | | //lib// | C++ source code used by both the collection server and the receptionist. | 3.1 | | ||
+ | | //images// | Images used in the user interface. | — | | ||
+ | | //docs// | Documentation. | — | | ||
+ | |||
+ | ===== Import and build processes ===== | ||
+ | |||
+ | In the command-line collection-building process of Section [[# | ||
+ | |||
+ | The import and build processes have many similarities, | ||
+ | |||
+ | < | ||
+ | |< - 132 104 293 >| | ||
+ | | | **Argument** | **Function** | | ||
+ | | // | ||
+ | | // | ||
+ | | // | ||
+ | | // | ||
+ | | //-out// | Filename | Specify a file to which to write all output messages, which defaults to standard error (the screen). Useful when working with debugging statements. | | ||
+ | | // | ||
+ | | // | ||
+ | |||
+ | ==== The import process ==== | ||
+ | |||
+ | The import process' | ||
+ | |||
+ | < | ||
+ | |< - 132 104 293 >| | ||
+ | | | **Argument** | **Function** | | ||
+ | | // | ||
+ | | // | ||
+ | | //-gzip// | None | Zip up the Greenstone archive documents produced by //import// (ZIPPlug must be included in the plugin list, and //gzip// must be installed on your machine). | | ||
+ | | // | ||
+ | | // | ||
+ | | // | ||
+ | |||
+ | < | ||
+ | {{..: | ||
+ | |||
+ | Figure <imgref figure_steps_in_the_import_process> | ||
+ | |||
+ | For step 3, note that import variables like // | ||
+ | |||
+ | In step 6, the archives information file (// | ||
+ | |||
+ | Step 7 creates an object that knows where documents are to be saved, and obeys any special saving instructions (such as // | ||
+ | |||
+ | Most of the work done in the import process is actually accomplished by plugins, which are called by the //plugin// module. This module creates a pipeline of the plugins specified in the collection configuration file. It also handles the writing of Greenstone archive documents (using a // | ||
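The pipeline idea can be illustrated with a small Python sketch. It is not Greenstone's Perl code: the //Plugin// class, its //process// method and the file extensions are hypothetical, and the rule that a file is offered to each plugin in turn until one accepts it is an assumption made for the illustration.

```python
from typing import List, Optional


class Plugin:
    """Hypothetical plugin that claims files by extension."""

    def __init__(self, name: str, extensions: List[str]) -> None:
        self.name = name
        self.extensions = extensions

    def process(self, filename: str) -> Optional[str]:
        # Return a description if this plugin handles the file, else None.
        if any(filename.lower().endswith(ext) for ext in self.extensions):
            return f"{self.name} processed {filename}"
        return None


def run_pipeline(plugins: List[Plugin], filename: str) -> Optional[str]:
    """Offer the file to each plugin in configuration order."""
    for plugin in plugins:
        result = plugin.process(filename)
        if result is not None:
            return result
    return None  # no plugin recognised the file


if __name__ == "__main__":
    # Plugin names are taken from the sample configuration file;
    # the extensions attached to them here are assumptions.
    pipeline = [Plugin("TEXTPlug", [".txt"]), Plugin("HTMLPlug", [".htm", ".html"])]
    print(run_pipeline(pipeline, "index.html"))
```

In this model the order of the plugin list matters, because the first plugin that accepts a file prevents later ones from seeing it.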
+ | |||
+ | ==== The build process ==== | ||
+ | |||
+ | During the building process the text is compressed, and the full-text indexes that are specified in the collection configuration file are created. Furthermore, | ||
+ | |||
+ | < | ||
+ | |< - 132 104 293 >| | ||
+ | | | **Argument** | **Function** | | ||
+ | | // | ||
+ | | //-index// | Index name (e.g.< | ||
+ | | // | ||
+ | | // | ||
+ | | //-mode// | //all//, < | ||
+ | | // | ||
+ | |||
+ | < | ||
+ | {{..: | ||
+ | |||
+ | The diagram in Figure <imgref figure_steps_in_the_build_process> | ||
+ | |||
+ | Step 5 first checks to see whether there is a collection-specific build procedure. A few collections require special build-time processing, in which case a collection-specific builder must be written and placed in the collection' | ||
+ | |||
+ | Step 6 is the building step, in which the document text is compressed and indexed, collection titles and icons are stored in a collection information database, and data structures are built to support the classifiers that are called for in the collection' | ||
+ | |||
+ | The parts of the collection that are built can be specified by the //mode// option, but the default is to build everything—compressed text, indexes, and collection information database. | ||
+ | |||
+ | To make a collection available over the web once it is built, you must move it from the collection' | ||
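The step of making a built collection live can be sketched in Python as follows. This is an illustration under assumptions, not a Greenstone tool: the function name //make_live// is hypothetical, and it mirrors the Windows sequence from the walkthrough (remove //index//, rename //building// to //index//, recreate //building//).

```python
import shutil
from pathlib import Path


def make_live(collection_dir: Path) -> None:
    """Replace the index directory with the freshly built files."""
    index = collection_dir / "index"
    building = collection_dir / "building"
    # Remove the old index, if there is one.
    if index.exists():
        shutil.rmtree(index)
    # Promote the build output to be the live index.
    building.rename(index)
    # Recreate an empty building directory for future rebuilds.
    building.mkdir()
```

Because the old index is deleted before the new one is in place, the collection is briefly unavailable; that is the same trade-off the manual commands make.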
+ | |||
+ | ===== Greenstone archive documents ===== | ||
+ | |||
+ | All source documents are brought into the Greenstone system by converting them to a format known as the Greenstone Archive Format. This is an XML style that marks documents into sections, and can hold metadata at the document or section level. You should not have to create Greenstone archive files manually—that is the job of the document processing plugins described in the next chapter. However, it may be helpful to understand the format of Greenstone files, and so we describe it here. | ||
+ | |||
+ | In XML, tags are enclosed in angle brackets for markup. The Greenstone archive format encodes documents that are already in HTML, and any embedded <, >, or " characters within the original text are escaped using the standard convention //& | ||
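The escaping convention can be demonstrated with Python's standard library. This is a sketch for illustration only; the function names are hypothetical and nothing here is taken from Greenstone's own code. Note that //escape// also converts the ampersand itself, which is standard XML practice and is needed for the conversion to be reversible.

```python
from xml.sax.saxutils import escape, unescape


def escape_for_archive(text: str) -> str:
    """Escape &, <, > and double quotes so the text can sit inside XML."""
    return escape(text, {'"': "&quot;"})


def restore_from_archive(text: str) -> str:
    """Reverse escape_for_archive."""
    return unescape(text, {"&quot;": '"'})


if __name__ == "__main__":
    original = '<a href="home.html">Home & away</a>'
    stored = escape_for_archive(original)
    print(stored)
    print(restore_from_archive(stored) == original)
```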
+ | |||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | ]> | ||
+ | </ | ||
+ | |||
+ | |||
+ | |||
+ | < | ||
+ | < | ||
+ | <?xml version=" | ||
+ | < | ||
+ | " | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | </ | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | </ | ||
+ | < | ||
+ | < | ||
+ | </ | ||
+ | </ | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | </ | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | </ | ||
+ | < | ||
+ | < | ||
+ | </ | ||
+ | </ | ||
+ | < | ||
+ | < | ||
+ | < | ||
+ | </ | ||
+ | < | ||
+ | < | ||
+ | </ | ||
+ | </ | ||
+ | </ | ||
+ | </ | ||
+ | </ | ||
+ | |||
+ | |||
+ | |||
+ | Figure <imgref figure_greenstone_archive_format> | ||
+ | |||
+ | Figure <imgref figure_greenstone_archive_format_1> | ||
+ | |||
+ | < | ||
+ | |< - 130 400 >| | ||
+ | | // | ||
+ | | // | ||
+ | |||
+ | The //< | ||
+ | |||
+ | In some collections documents are split into individual pages. These are treated as sections. For example, a book might have first-level sections that correspond to chapters, within each of which are defined a number of “sections” that actually correspond to the individual pages of the chapter. | ||
+ | |||
+ | ==== Document metadata ==== | ||
+ | |||
+ | Metadata is descriptive information such as author, title, date, keywords, and so on, that is associated with a document. It has already been mentioned that metadata is stored with documents. Looking at Figure <imgref figure_greenstone_archive_format>, | ||
+ | |||
+ | Table <tblref table_dublin_core_metadata_standard> | ||
+ | |||
+ | < | ||
+ | |< - 133 94 305 >| | ||
+ | | **Name** | **Metadata < | ||
+ | | *Title | //Title// | A name given to the resource | | ||
+ | | *Creator | //Creator// | An entity primarily responsible for making the content of the resource | | ||
+ | | *Subject and keywords | //Subject// | The topic of the content of the resource | | ||
+ | | *Description | // | ||
+ | | *Publisher | // | ||
+ | | Contributor | // | ||
+ | | *Date | //Date// | The date that the resource was published or some other important date associated with the resource. | | ||
+ | | Resource type | //Type// | The nature or genre of the content of the resource | | ||
+ | | Format | //Format// | The physical or digital manifestation of the resource | | ||
+ | | *Resource identifier | // | ||
+ | | *Source | //Source// | A reference to a resource from which the present resource is derived | | ||
+ | | *Language | // | ||
+ | | Relation | // | ||
+ | | *Coverage | // | ||
+ | | Rights management | //Rights// | Information about rights held in and over the resource | | ||
+ | |||
+ | ==== Inside Greenstone archive documents ==== | ||
+ | |||
+ | Within a single document, the Greenstone archive format imposes a limited amount of structure. Documents are divided into paragraphs. They can be split hierarchically into sections and subsections; | ||
+ | |||
+ | When you read a book in a Greenstone collection, the section hierarchy is manifested in the table of contents of the book. For example, books in the Demo collection have a hierarchical table of contents showing chapters, sections, and subsections, | ||
+ | |||
+ | < | ||
+ | {{..: | ||
+ | |||
+ | < | ||
+ | {{..: | ||
+ | |||
+ | The document structure is also used for searchable indexes. There are three possible levels of index: // | ||
+ | |||
+ | The pulldown menu in Figure <imgref figure_hierarchical_structure_in_the_demo_collection_1> | ||
+ | |||
+ | ===== Collection configuration file ===== | ||
+ | |||
+ | The collection configuration file governs the structure of a collection as seen by the user, allowing you to customise the “look and feel” of your collection and the way in which its documents are processed and presented. A simple collection configuration file is created when you run // | ||
+ | |||
+ | < | ||
+ | |< - 132 397 >| | ||
+ | | '' | ||
+ | | ''< | ||
+ | | ''< | ||
+ | | ''< | ||
+ | | ''< | ||
+ | | ''< | ||
+ | | ''< | ||
+ | | ''< | ||
+ | | ''< | ||
+ | | ''< | ||
+ | | ''< | ||
+ | | ''< | ||
+ | | ''< | ||
+ | | ''< | ||
+ | | ''< | ||
+ | |||
+ | Each line of the collection configuration file is essentially an “attribute, | ||
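The line-oriented layout of the configuration file makes it easy to read programmatically. The following Python sketch is an assumption-laden illustration rather than Greenstone's own parser: the function name //parse_collect_cfg// is hypothetical, it treats the first token of each line as the attribute and the rest as its values, and it handles only simple quoting.

```python
import shlex
from collections import defaultdict
from typing import Dict, List


def parse_collect_cfg(text: str) -> Dict[str, List[List[str]]]:
    """Group configuration lines by attribute name."""
    entries: Dict[str, List[List[str]]] = defaultdict(list)
    for line in text.splitlines():
        tokens = shlex.split(line)  # keeps quoted values together
        if not tokens:
            continue  # skip blank lines
        attribute, values = tokens[0], tokens[1:]
        # Attributes such as "plugin" repeat, so collect every occurrence.
        entries[attribute].append(values)
    return dict(entries)


# A few lines in the style of the file created for the walkthrough.
SAMPLE = '''\
public true
beta true
plugin TEXTPlug
plugin HTMLPlug
collectionmeta collectionextra ""
'''

if __name__ == "__main__":
    for attribute, occurrences in parse_collect_cfg(SAMPLE).items():
        print(attribute, occurrences)
```

Keeping a list per attribute preserves the order of repeated lines, which matters for entries such as the plugin list.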
+ | |||
+ | The collection configuration file created by the // | ||
+ | |||
+ | < | ||
+ | |< - 132 132 265 >| | ||
+ | | | **Attribute** | **Value** | | ||
+ | | '' | ||
+ | | '' | ||
+ | | '' | ||
+ | | '' | ||
+ | | '' | ||
+ | | '' | ||
+ | | '' | ||
+ | | '' | ||
+ | | '' | ||
+ | | '' | ||
+ | | '' | ||
+ | | '' | ||
+ | | '' | ||
+ | | '' | ||
+ | | '' | ||
+ | | '' | ||
+ | | '' | ||
+ | | '' | ||
+ | |||
+ | Line 3 indicates whether the collection will be available to the public when it is built, and is either //true// (the default, meaning that the collection is publicly available), or //false// (meaning that it is not). This is useful when building collections to test software, or building collections of material for personal use. Line 4 indicates whether the collection is beta or not (this also defaults to //true//, meaning that the collection is a beta release). | ||
+ | |||
+ | Line 5 determines what collection indexes are created at build time: in this example only the document text is to be indexed. Indexes can be constructed at the // | ||
+ | |||
+ | Lines 7–13 specify which plugins to use when converting documents to the Greenstone archive format and when building collections from archive files. Section [[# | ||
+ | |||
+ | Line 14 specifies that an alphabetic list of titles is to be created for browsing purposes. Browsing structures are constructed by “classifiers”. Section [[# | ||
+ | |||
+ | Lines 15–18 are used to specify collection-level metadata. Specified through // | ||
+ | |||
+ | > collectionmeta collectionextra " | ||
+ | |||
+ | > collectionmeta collectionextra [l=fr] " | ||
+ | |||
+ | > collectionmeta collectionextra [l=mi] " | ||
+ | |||
+ | If the interface language is set to “fr” or “mi”, the appropriate version of the description will be displayed. For other languages the default version will appear. | ||
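The fallback rule for language-specific metadata can be modelled in a few lines of Python. This is a sketch under assumptions: the function names and the sample descriptions are invented for the illustration, and the regular expression covers only the simple line shape shown above.

```python
import re
from typing import Dict, Iterable, Optional

# collectionmeta <name> [l=<lang>] "<value>", with the language tag optional.
LINE = re.compile(r'^collectionmeta\s+(\S+)\s+(?:\[l=(\w+)\]\s+)?"(.*)"$')


def load_meta(lines: Iterable[str], name: str) -> Dict[Optional[str], str]:
    """Collect the values of one metadata item, keyed by language."""
    values: Dict[Optional[str], str] = {}
    for line in lines:
        match = LINE.match(line.strip())
        if match and match.group(1) == name:
            # The key is None for the default (untagged) version.
            values[match.group(2)] = match.group(3)
    return values


def pick(values: Dict[Optional[str], str], language: str) -> str:
    """Return the version for this language, else the default."""
    return values.get(language, values.get(None, ""))


# Placeholder descriptions, invented for this example.
SAMPLE_LINES = [
    'collectionmeta collectionextra "Default description"',
    'collectionmeta collectionextra [l=fr] "Description in French"',
    'collectionmeta collectionextra [l=mi] "Description in Maori"',
]

if __name__ == "__main__":
    meta = load_meta(SAMPLE_LINES, "collectionextra")
    print(pick(meta, "fr"))
    print(pick(meta, "es"))
```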
+ | |||
+ | This simple collection configuration file does not include any examples of format strings, nor of the subcollection and language facilities provided by the configuration file. Format strings are covered more thoroughly in Section [[# | ||
+ | |||
+ | ==== Subcollections ==== | ||
+ | |||
+ | Greenstone allows you to define subcollections and build separate indexes for each one. For example, in one collection there is a large subset of documents called //Food and Nutrition Bulletin//. We use this collection as an example. | ||
+ | |||
+ | This collection has three indexes, all at the section level: one for the whole collection, one for the //Food and Nutrition Bulletin//, and the third for the remaining documents. The relevant lines from the collection configuration file can be seen below. | ||
+ | |||
+ | < | ||
+ | indexes | ||
+ | subcollection | ||
+ | subcollection | ||
+ | indexsubcollections fn | ||
+ | </ | ||
+ | |||
+ | The second and third lines define subcollections called //fn//, which contains the //Food and Nutrition Bulletin// documents, and //other//, which contains the remaining documents. The third field of these definitions is a Perl regular expression that identifies these subsets using the //Title// metadata: we seek titles that begin with //Food and Nutrition Bulletin// in the first case and ones that do not in the second case (note the “!”). The final //i// makes the pattern-matching case-insensitive. The metadata field, in this case //Title//, can be any valid field, or // | ||
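The behaviour of such a filter can be approximated in Python. This sketch makes several assumptions: the filter is taken to have the form //Field/pattern/flags// with an optional leading “!”, the function name //compile_filter// is hypothetical, and Python's regular expressions stand in for Perl's, which differ in some details.

```python
import re
from typing import Callable, Dict


def compile_filter(spec: str) -> Callable[[Dict[str, str]], bool]:
    """Turn a spec such as '!Title/^Food/i' into a predicate on metadata."""
    negate = spec.startswith("!")
    if negate:
        spec = spec[1:]
    field, rest = spec.split("/", 1)
    pattern, flags = rest.rsplit("/", 1)
    regex = re.compile(pattern, re.IGNORECASE if "i" in flags else 0)

    def matches(metadata: Dict[str, str]) -> bool:
        found = regex.search(metadata.get(field, "")) is not None
        # A leading "!" inverts the result.
        return found != negate

    return matches


if __name__ == "__main__":
    in_fn = compile_filter("Title/^Food and Nutrition Bulletin/i")
    in_other = compile_filter("!Title/^Food and Nutrition Bulletin/i")
    doc = {"Title": "Food and Nutrition Bulletin Volume 1"}
    print(in_fn(doc), in_other(doc))
```

Because the second filter is the exact negation of the first, every document falls into exactly one of the two subcollections.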
+ | |||
+ | If a collection contains documents in different languages, separate indexes can be built for each language. Language is a metadata statement; values are specified using the ISO 639 standard two-letter codes for representing the names of languages—for example, //en// is English, //zh// is Chinese, and //mi// is Maori. Since metadata values can be specified at the section level, parts of a document can be in different languages. | ||
+ | |||
+ | For example, if the configuration file contained | ||
+ | |||
+ | < | ||
+ | indexes section: | ||
+ | languages en zh mi | ||
+ | </ | ||
+ | |||
+ | section text, section title, document text, and paragraph text indexes would be created for English, Chinese, and Maori—twelve indexes altogether. Adding a couple of subcollections multiplies the number of indexes again. Care is necessary to guard against index bloat. | ||
+ | |||
+ | (This index specification could be defined using the // | ||
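The index arithmetic in this section is easy to check. The Python sketch below is only an illustration of the multiplication: the descriptive index labels are taken from the text rather than from the configuration syntax, and the function name //index_combinations// is hypothetical.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

INDEXES = ["section text", "section title", "document text", "paragraph text"]
LANGUAGES = ["en", "zh", "mi"]


def index_combinations(
    indexes: Sequence[str],
    languages: Sequence[str],
    subcollections: Sequence[Optional[str]] = (None,),
) -> List[Tuple[str, Optional[str], str]]:
    """List every index/subcollection/language combination to be built."""
    return list(product(indexes, subcollections, languages))


if __name__ == "__main__":
    print(len(index_combinations(INDEXES, LANGUAGES)))
    print(len(index_combinations(INDEXES, LANGUAGES, ("fn", "other"))))
```

Four indexes in three languages give the twelve indexes mentioned above; indexing two subcollections as well would double that to twenty-four, which is why the text warns about index bloat.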
+ | |||
+ | ==== Cross-collection searching ==== | ||
+ | |||
+ | Greenstone has a facility for “cross-collection searching, | ||
+ | |||
+ | Cross-collection searching is enabled by a line | ||
+ | |||
+ | < | ||
+ | supercollection col_1 col_2 … | ||
+ | </ | ||
+ | |||
+ | where the collections involved are called //col_1//, //col_2//, … The same line should appear in the configuration file of every collection that is involved. | ||