legacy:manuals:en:develop:the_greenstone_runtime_system [2013/10/17 11:44] (current)
 +====== <!-- id:756 -->The Greenstone runtime system ======
 +
 +<!-- id:757 -->This chapter describes the Greenstone runtime system so that you can understand, augment and extend its capabilities. The software is written in C++ and makes extensive use of virtual inheritance. If you are unfamiliar with this language you should learn about it before proceeding. Deitel and Deitel (1994) provide a comprehensive tutorial, while Stroustrup (1997) is the definitive reference.
 +
 +<!-- id:758 -->We begin by explaining the design philosophy behind the runtime system, since this has a strong bearing on implementation. Then we provide the implementation details, which form the main part of this chapter. The version of Greenstone described here is the CGI version (Web Library for Windows users). The Windows Local Library uses the same source code but has a built-in webserver front end. It is also a persistent process.
 +
 +===== <!-- id:759 -->​Process structure =====
 +
 +<​imgcaption figure_overview_of_a_general_greenstone_system|%!-- id:760 --%Overview of a general Greenstone system ></​imgcaption>​
 +{{..:​images:​dev_fig_20.png?​382x335&​direct}}
 +
 +<!-- id:761 -->​Figure <imgref figure_overview_of_a_general_greenstone_system>​ shows several users, represented by computer terminals at the top of the diagram, accessing three Greenstone collections. Before going online, these collections undergo the importing and building processes described in earlier chapters. First, documents, shown at the bottom of the figure, are imported into the XML-compliant Greenstone Archive Format. Then the archive files are built into various searchable indexes and a collection information database that includes the hierarchical structures that support browsing. This done, the collection is ready to go online and respond to requests for information.
 +
 +<!-- id:762 -->Two components are central to the design of the runtime system: “receptionists” and “collection servers.” From a user's point of view, a receptionist is the point of contact with the digital library. It accepts user input, typically in the form of keyboard entry and mouse clicks; analyzes it; and then dispatches a request to the appropriate collection server (or servers). This locates the requested piece of information and returns it to the receptionist for presentation to the user. Collection servers act as an abstract mechanism that handles the content of the collection, while receptionists are responsible for the user interface.
 +
 +<​imgcaption figure_greenstone_system_using_the_null_protocol|%!-- id:763 --%Greenstone system using the “null protocol” ></​imgcaption>​
 +{{..:​images:​dev_fig_21.png?​329x424&​direct}}
 +
 +<!-- id:764 -->As Figure <imgref figure_overview_of_a_general_greenstone_system>​ shows, receptionists communicate with collection servers using a defined protocol. The implementation of this protocol depends on the computer configuration on which the digital library system is running. The most common case, and the simplest, is when there is one receptionist and one collection server, and both run on the same computer. This is what you get when you install the default Greenstone. In this case the two processes are combined to form a single executable (called //​library//​),​ and consequently using the protocol reduces to making function calls. We call this the //null protocol//. It forms the basis for the standard out-of-the-box Greenstone digital library system. This simplified configuration is illustrated in Figure <imgref figure_greenstone_system_using_the_null_protocol>,​ with the receptionist,​ protocol and collection server bound together as one entity, the //library// program. The aim of this chapter is to show how it works.
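
The idea that using the protocol "reduces to making function calls" can be illustrated with a minimal sketch. The class names //CollectionServer//, //Protocol// and //NullProtocol//, and the functions //get_about()// and //render_about_page()//, are hypothetical names chosen for this illustration; they are not Greenstone's actual classes.

```cpp
#include <map>
#include <string>

// Hypothetical collection server: owns the content of the collections.
class CollectionServer {
public:
    void add_collection(const std::string &name, const std::string &about) {
        about_text_[name] = about;
    }
    // Returns the "about" text, or an empty string for an unknown collection.
    std::string about(const std::string &name) const {
        std::map<std::string, std::string>::const_iterator it = about_text_.find(name);
        return it == about_text_.end() ? std::string() : it->second;
    }
private:
    std::map<std::string, std::string> about_text_;
};

// Abstract protocol: the receptionist only ever sees this interface.
class Protocol {
public:
    virtual ~Protocol() {}
    virtual std::string get_about(const std::string &collection) = 0;
};

// Null protocol: the server lives in the same executable, so a protocol
// call is nothing more than an ordinary member function call.
class NullProtocol : public Protocol {
public:
    explicit NullProtocol(CollectionServer &server) : server_(server) {}
    virtual std::string get_about(const std::string &collection) {
        return server_.about(collection);
    }
private:
    CollectionServer &server_;
};

// Hypothetical receptionist routine: it talks to the protocol, never to
// the collection server directly.
std::string render_about_page(Protocol &protocol, const std::string &collection) {
    return "<h1>" + collection + "</h1><p>" + protocol.get_about(collection) + "</p>";
}
```

Because the receptionist depends only on the abstract interface, a networked implementation could in principle be substituted for //NullProtocol// without changing //render_about_page()//.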
 +
 +<!-- id:765 -->​Usually,​ a “server” is a persistent process that, once started, runs indefinitely,​ responding to any requests that come in. Despite its name, however, the collection server in the null protocol configuration is not a server in this sense. In fact, every time any Greenstone web page is requested, the //library// program is started up (by the CGI mechanism), responds to the request, and then exits. We call it a “server” because it is also designed to work in the more general configuration of Figure <imgref figure_overview_of_a_general_greenstone_system>​.
 +
 +<!-- id:766 -->​Surprisingly,​ this start-up, process and exit cycle is not as slow as one might expect, and results in a perfectly usable service. However, it is clearly inefficient. There is a mechanism called Fast-CGI (// www.fastcgi.com //) which provides a middle ground. Using it, the //library// program can remain in memory at the end of the first execution, and have subsequent sets of CGI arguments fed to it, thus avoiding repeated initialisation overheads and accomplishing much the same behaviour as a server. Using Fast-CGI is an option in Greenstone, and is enabled by recompiling the source code with appropriate libraries.
 +
 +<!-- id:767 -->As an alternative to the null protocol, the Greenstone protocol has also been implemented using the well-known CORBA scheme (Slama //et al.//, 1999). This uses a unified object oriented paradigm to enable different processes, running on different computer platforms and implemented in different programming languages, to access the same set of distributed objects over the Internet (or any other network). Then, scenarios like Figure <imgref figure_overview_of_a_general_greenstone_system>​ can be fully implemented,​ with all the receptionists and collection servers running on different computers.
 +
 +<​imgcaption figure_graphical_query_interface_to_greenstone|%!-- id:768 --%Graphical query interface to Greenstone ></​imgcaption>​
 +{{..:​images:​dev_fig_22.png?​397x348&​direct}}
 +
 +<!-- id:769 -->This allows far more sophisticated interfaces to be set up to exactly the same digital library collections. As just one example, Figure <imgref figure_graphical_query_interface_to_greenstone>​ shows a graphical query interface, based on Venn diagrams, that lets users manipulate Boolean queries directly. Written in Java, the interface runs locally on the user's own computer. Using CORBA, it accesses a remote Greenstone collection server, written in C++.
 +
 +<!-- id:770 -->The distributed protocol is still being refined and readied for use, and so this manual does not discuss it further (see Bainbridge //et al.//, 2001, for more information).
 +
 +===== <!-- id:771 -->​Conceptual framework =====
 +
 +<​imgcaption figure_generating_the_about_this_collection_page|%!-- id:772 --%Generating the “about this collection” page ></​imgcaption>​
 +{{..:​images:​dev_fig_23.png?​398x354&​direct}}
 +
 +<!-- id:773 -->​Figure <imgref figure_generating_the_about_this_collection_page>​ shows the “about this collection” page of a particular Greenstone collection (the Project Gutenberg collection). Look at the URL at the top. The page is generated as a result of running the CGI program //​library//,​ which is the above-mentioned executable comprising both receptionist and collection server connected by the null protocol. The arguments to //library// are //​c=gberg//,​ //a=p//, and //​p=about//​. They can be interpreted as follows:
 +
 +> <!-- id:774 -->//For the Project Gutenberg collection (c=gberg), the action is to generate a page (a=p), and the page to generate is called “about” (p=about).//​
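
As a rough illustration of this first step, the sketch below splits a query string into name/value pairs. It assumes the arguments arrive in the usual CGI form //c=gberg&a=p&p=about//; the function name //parse_cgi_args()// is hypothetical, and percent-decoding is deliberately left out to keep the example short.

```cpp
#include <map>
#include <string>

// Split a query string such as "c=gberg&a=p&p=about" into name/value pairs.
// Pairs without a name or without '=' are ignored.
std::map<std::string, std::string> parse_cgi_args(const std::string &query) {
    std::map<std::string, std::string> args;
    std::string::size_type start = 0;
    while (start <= query.size()) {
        std::string::size_type end = query.find('&', start);
        if (end == std::string::npos) end = query.size();
        const std::string pair = query.substr(start, end - start);
        const std::string::size_type eq = pair.find('=');
        if (eq != std::string::npos && eq > 0) {
            args[pair.substr(0, eq)] = pair.substr(eq + 1);
        }
        start = end + 1;  // step past the '&' (or past the end, ending the loop)
    }
    return args;
}
```

With the arguments of the example above, the resulting map holds //c// = //gberg//, //a// = //p// and //p// = //about//, which is enough to select the action and the page.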
 +
 +<​imgcaption figure_greenstone_runtime_system|%!-- id:775 --%Greenstone runtime system ></​imgcaption>​
 +{{..:​images:​dev_fig_24.gif?​534x345&​direct}}
 +
 +<!-- id:776 -->​Figure <imgref figure_greenstone_runtime_system>​ illustrates the main parts of the Greenstone runtime system. At the top, the receptionist first initialises its components, then parses the CGI arguments to decide which action to call. In performing the action (which includes further processing of the CGI arguments), the software uses the protocol to access the content of the collection. The response is used to generate a web page, with assistance from the format component and the macro language.
 +
 +<!-- id:777 -->The macro language, which we met in Section [[#​controlling_the_greenstone_user_interface|controlling_the_greenstone_user_interface]],​ is used to provide a Greenstone digital library system with a consistent style, and to create interfaces in different languages. Interacting with the library generates the bare bones of web pages; the macros in //​GSDLHOME/​macros//​ wrap them in flesh.
 +
 +<!-- id:778 -->The Macro Language object in Figure <imgref figure_greenstone_runtime_system>​ is responsible for reading these files and storing the parsed result in memory. Any action can use this object to expand a macro. It can even create new macro definitions and override existing ones, adding a dynamic dimension to macro use.
 +
 +<!-- id:779 -->The layout of the “about this collection” page (Figure <imgref figure_generating_the_about_this_collection_page>​) is known before runtime, and encoded in the macro file //​about.dm//​. Headers, footers, and the background image are not even mentioned because they are located in the //Global// macro package. However, the specific “about” text for a particular collection is not known in advance, but is stored in the collection information database during the building process. This information is retrieved using the protocol, and stored as //​_collectionextra_//​ in the //Global// macro package. Note this macro name is essentially the same name used to express this information in the collection configuration file described in Section [[#​configuration_file|configuration_file]]. To generate the content of the page, the //​_content_//​ macro in the //about// package (shown in Figure <imgref figure_part_of_the_aboutdm_macro_file>​) is expanded. This in turn expands //​_textabout_//,​ which itself accesses //​_collectionextra_//,​ which had just been dynamically placed there.
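
The chain of expansions described above can be sketched as follows. This is a simplified model, not the real macro language: it ignores packages and macro parameters, and the names //MacroTable// and //expand_macros()// are hypothetical.

```cpp
#include <map>
#include <string>

typedef std::map<std::string, std::string> MacroTable;

// Expand every _name_ token whose name is defined in the table. Tokens with
// unknown names are copied through unchanged. The depth limit guards against
// macros that (directly or indirectly) refer to themselves.
std::string expand_macros(const std::string &text, const MacroTable &macros, int depth = 0) {
    if (depth > 16) return text;
    std::string out;
    std::string::size_type pos = 0;
    while (pos < text.size()) {
        if (text[pos] == '_') {
            const std::string::size_type close = text.find('_', pos + 1);
            if (close != std::string::npos) {
                const std::string name = text.substr(pos + 1, close - pos - 1);
                const MacroTable::const_iterator it = macros.find(name);
                if (it != macros.end()) {
                    // Definitions may themselves contain macros: recurse.
                    out += expand_macros(it->second, macros, depth + 1);
                    pos = close + 1;
                    continue;
                }
            }
        }
        out += text[pos];
        ++pos;
    }
    return out;
}
```

In this model, storing a value under //collectionextra// at runtime before expanding //_content_// plays the role of the dynamic definition described in the text.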
 +
 +<!-- id:780 -->One further important ingredient is the Format object. Format statements in the collection configuration file affect the presentation of particular pieces of information,​ as described in Section [[#​formatting_greenstone_output|formatting_greenstone_output]]. They are handled by the Format object in Figure <imgref figure_greenstone_runtime_system>​. This object'​s main task is to parse and evaluate statements such as the format strings in Figure <imgref figure_excerpt_from_the_demo_collection_collect>​. As we learned in Section [[#​formatting_greenstone_output|formatting_greenstone_output]],​ these can include references to metadata in square brackets (e.g. //​[Title]//​),​ which need to be retrieved from the collection server. Interaction occurs between the Format object and the Macro Language object, because format statements can include macros that, when expanded, include metadata, which when expanded include macros, and so on.
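
The metadata substitution step can be sketched as below. This assumes the metadata values have already been fetched from the collection server, and it handles only plain //[Name]// references; the real Format object does considerably more. The names //Metadata// and //apply_format()// are hypothetical.

```cpp
#include <map>
#include <string>

typedef std::map<std::string, std::string> Metadata;

// Replace each [Name] in the format string with the corresponding metadata
// value. A name with no metadata is replaced by nothing. An unmatched '['
// is copied through as ordinary text.
std::string apply_format(const std::string &format, const Metadata &metadata) {
    std::string out;
    std::string::size_type pos = 0;
    while (pos < format.size()) {
        const std::string::size_type open = format.find('[', pos);
        const std::string::size_type close =
            open == std::string::npos ? std::string::npos : format.find(']', open + 1);
        if (close == std::string::npos) {
            out.append(format, pos, std::string::npos);  // no more references
            break;
        }
        out.append(format, pos, open - pos);             // text before '['
        const Metadata::const_iterator it =
            metadata.find(format.substr(open + 1, close - open - 1));
        if (it != metadata.end()) out += it->second;
        pos = close + 1;
    }
    return out;
}
```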
 +
 +<!-- id:781 -->At the bottom of Figure <imgref figure_greenstone_runtime_system>,​ the collection server also goes through an initialisation process, setting up Filter and Source objects to respond to incoming protocol requests, and a Search object to assist in this task. Ultimately these access the indexes and the collection information database, both formed during collection building.
 +
 +<!-- id:782 -->​Ignoring blank lines, the receptionist contains 15,000 lines of code. The collection server contains only 5,000 lines (75% of which are taken up by header files). The collection server is more compact because content retrieval is accomplished through two pre-compiled programs. mg, a full-text retrieval system, is used for searching, and gdbm, a database management system, is used to hold the collection information database.
 +
 +<!-- id:783 -->To encourage extensibility and flexibility,​ Greenstone uses inheritance widely—in particular, within Action, Filter, Source, and Search. For a simple digital library dedicated to text-based collections,​ this means that you need to learn slightly more to program the system. However, it also means that mg and gdbm could easily be replaced should the need arise. Furthermore,​ the software architecture is rich enough to support full multimedia capabilities,​ such as controlling the interface through speech input, or submitting queries as graphically drawn pictures.
 +
 +===== <!-- id:784 -->How the conceptual framework fits together =====
 +
 +<!-- id:785 -->​Sections [[#​collection_server|collection_server]] and [[#​receptionist|receptionist]] explain the operation of the collection server and receptionist in more detail, expanding on each module in Figure <imgref figure_greenstone_runtime_system>​ and describing how it is implemented. It is helpful to first work through examples of a user interacting with Greenstone, and describe what goes on behind the scenes. For the moment, we assume that all objects are correctly initialised. Initialisation is a rather intricate procedure that we revisit in Section [[#​initialisation|initialisation]].
 +
 +==== <!-- id:786 -->​Performing a search ====
 +
 +<​imgcaption figure_searching_gutenberg_for_darcy|%!-- id:787 --%Searching Gutenberg for //Darcy// ></​imgcaption>​
 +{{..:​images:​dev_fig_25.png?​396x353&​direct}}
 +
 +<!-- id:788 -->When a user submits a query by pressing //Begin search// on the search page, a new Greenstone action is invoked, which ends by generating a new HTML page using the macro language. Figure <imgref figure_searching_gutenberg_for_darcy> shows the result of searching the Project Gutenberg collection for the name //Darcy//. Hidden within the HTML of the original search page is the statement //a=q//. When the search button is pressed this statement is activated, and sets the new action to be //queryaction//. Executing //queryaction// sets up a call to the designated collection's Filter object (//c=gberg//) through the protocol.
 +
 +<!-- id:789 -->​Filters are an important basic function of collection servers. Tailored for both searching and browsing activities, they provide a way of selecting a subset of information from a collection. In this case, the //​queryaction//​ sets up a filter request by:
 +
 +  * <!-- id:790 -->setting the filter request type to be //QueryFilter// (Section [[#collection_server|collection_server]] describes the different filter types);
 +  * <!-- id:791 -->​storing the user's search preferences—case-folding,​ stemming and so on—in the filter request;
 +  * <!-- id:792 -->​calling the //​filter()//​ function using the null protocol.
 +
 +<!-- id:793 -->Calls to the protocol are synchronous. The receptionist is effectively blocked until the filter request has been processed by the collection server and any data generated has been returned.
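
The three steps listed above, followed by a blocking call, might look roughly like this. The structures //FilterRequest// and //FilterResponse//, the option names //Term// and //Casefold//, and the toy in-memory //Documents// map are all assumptions made for this sketch; Greenstone's own request and response types are richer.

```cpp
#include <cctype>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, std::string> Documents;  // toy collection: id -> text

struct FilterRequest {
    std::string filter_name;                     // e.g. "QueryFilter"
    std::map<std::string, std::string> options;  // the user's search preferences
};

struct FilterResponse {
    std::vector<std::string> doc_ids;            // identifiers of matching documents
};

std::string lowercase(std::string s) {
    for (std::string::size_type i = 0; i < s.size(); ++i)
        s[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(s[i])));
    return s;
}

// Stand-in for the collection server's side of the call (plain substring match).
FilterResponse filter(const Documents &docs, const FilterRequest &request) {
    FilterResponse response;
    if (request.filter_name != "QueryFilter") return response;
    std::map<std::string, std::string>::const_iterator term = request.options.find("Term");
    if (term == request.options.end() || term->second.empty()) return response;
    std::map<std::string, std::string>::const_iterator fold = request.options.find("Casefold");
    const bool casefold = fold != request.options.end() && fold->second == "true";
    const std::string needle = casefold ? lowercase(term->second) : term->second;
    for (Documents::const_iterator d = docs.begin(); d != docs.end(); ++d) {
        const std::string haystack = casefold ? lowercase(d->second) : d->second;
        if (haystack.find(needle) != std::string::npos) response.doc_ids.push_back(d->first);
    }
    return response;
}

// What the query action does: set the type, store preferences, call filter().
FilterResponse run_query(const Documents &docs, const std::string &term, bool casefold) {
    FilterRequest request;
    request.filter_name = "QueryFilter";
    request.options["Term"] = term;
    request.options["Casefold"] = casefold ? "true" : "false";
    return filter(docs, request);  // synchronous: returns only when results are ready
}
```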
 +
 +<!-- id:794 -->When a protocol call of type //​QueryFilter//​ is made, the Filter object (in Figure <imgref figure_greenstone_runtime_system>​) decodes the options and makes a call to the Search object, which uses mg to do the actual search. The role of the Search object is to provide an abstract program interface that supports searching, regardless of the underlying search tool being used. The format used for returning results also enforces abstraction,​ requiring the Search object to translate the data generated by the search tool into a standard form.
 +
 +<!-- id:795 -->Once the search results have been returned to the receptionist, the action proceeds by formatting the results for display, using the Format object and the Macro Language. As Figure <imgref figure_searching_gutenberg_for_darcy> shows, this involves generating the standard Greenstone header, footer, navigation bar and background; repeating the main part of the query page just beneath the navigation bar; and displaying a book icon, title and author for each matching entry. The format of this last part is governed by the //format SearchVList// statement in the collection configuration file. Before title and author metadata can be displayed, they must be retrieved from the collection server. This requires further calls to the protocol, this time using //BrowseFilter//.
 +
 +==== <!-- id:796 -->​Retrieving a document ====
 +
 +<!-- id:797 -->​Following the above query for //Darcy//, consider what happens when a document is displayed. Figure <imgref figure_the_golf_course_mystery>​ shows the result of clicking on the icon beside //The Golf Course Mystery// in Figure <imgref figure_searching_gutenberg_for_darcy>​.
 +
 +<​imgcaption figure_the_golf_course_mystery|%!-- id:798 --%//The Golf Course Mystery// ></​imgcaption>​
 +{{..:​images:​dev_fig_26.png?​393x349&​direct}}
 +
 +<!-- id:799 -->The source text for the Gutenberg collection comprises one long file per book. At build time, these files are split into separate pages every 200 lines or so, and relevant information for each page is stored in the indexes and collection information database. The top of Figure <imgref figure_the_golf_course_mystery>​ shows that this book contains 104 computer-generated pages, and below it is the beginning of page one: who entered it, the title, the author, and the beginnings of a table of contents (this table forms part of the Gutenberg source text, and was not generated by Greenstone). At the top left are buttons that control the document'​s appearance: just one page or the whole document; whether query term highlighting is on or off; and whether or not the book should be displayed in its own window, detached from the main searching and browsing activity. At the top right is a navigation aid that supports direct access to any page in the book: simply type in the page number and press the “go to page” button. Alternatively,​ the next and previous pages are retrieved by clicking on the arrow icons either side of the page selection widget.
 +
 +<!-- id:800 -->The action for retrieving documents, //​documentaction//,​ is specified by setting //a=d// and takes several additional arguments. Most important is the document to retrieve: this is specified through the //d// variable. In Figure <imgref figure_the_golf_course_mystery>​ it is set to //​d=HASH51e598821ed6cbbdf0942b.1//​ to retrieve the first page of the document with the identifier //​HASH51e598821ed6cbbdf0942b//,​ known in more friendly terms as //The Golf Course Mystery//. There are further variables: whether query term highlighting is on or off (//hl//) and which page within a book is displayed (//gt//). These variables are used to support the activities offered by the buttons on the page in Figure <imgref figure_the_golf_course_mystery>,​ described above. Defaults are used if any of these variables are omitted.
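
As a small illustration of how the //d// argument carries both a document identifier and a position within it, the sketch below splits the value at the first period. The structure //DocRef// and the function //split_doc_arg()// are hypothetical names, and the sketch assumes that everything after the first period designates the part of the document to show.

```cpp
#include <string>

struct DocRef {
    std::string document;  // identifier of the whole document
    std::string section;   // part after the first period; empty if absent
};

// Split a value such as "HASH51e598821ed6cbbdf0942b.1" into its two parts.
DocRef split_doc_arg(const std::string &d) {
    DocRef ref;
    const std::string::size_type dot = d.find('.');
    if (dot == std::string::npos) {
        ref.document = d;
    } else {
        ref.document = d.substr(0, dot);
        ref.section = d.substr(dot + 1);
    }
    return ref;
}
```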
 +
 +<!-- id:801 -->The action follows a similar procedure to //​queryaction//:​ appraise the CGI arguments, access the collection server using the protocol, and use the result to generate a web page. Options relating to the document are decoded from the CGI arguments and stored in the object for further work. To retrieve the document from the collection server, only the document identifier is needed to set up the protocol call to //​get_document()//​. Once the text is returned, considerable formatting must be done. To achieve this, the code for //​documentaction//​ accesses the stored arguments and makes use of the Format object and the Macro Language.
 +
 +==== <!-- id:802 -->​Browsing a hierarchical classifier ====
 +
 +<!-- id:803 -->​Figure <imgref figure_browsing_titles_in_the_gutenberg_collection>​ shows an example of browsing, where the user has chosen //Titles A-Z// and accessed the hyperlink for the letter //K//. The action that supports this is also //​documentaction//,​ given by the CGI argument //a=d// as before. However, whereas before a //d// variable was included, this time there is none. Instead, the node within the browsable classification hierarchy to display is specified in the variable //cl//. In our case this represents titles grouped under the letter //K//. This list was formed at build time and stored in the collection information database.
 +
 +<​imgcaption figure_browsing_titles_in_the_gutenberg_collection|%!-- id:804 --%Browsing titles in the Gutenberg collection ></​imgcaption>​
 +{{..:​images:​dev_fig_27.png?​394x351&​direct}}
 +
 +<!-- id:805 -->​Records that represent classifier nodes in the database use the prefix //CL//, followed by numbers separated by periods (.) to designate where they lie within the nested structure. Ignoring the search button (leftmost in the navigation bar), classifiers are numbered sequentially in increasing order, left to right, starting at 1. Thus the top level classifier node for titles in our example is //CL1// and the page sought is generated by setting //​cl=CL1.11//​. This can be seen in the URL at the top of Figure <imgref figure_browsing_titles_in_the_gutenberg_collection>​.
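
The naming scheme for classifier nodes is easy to decode mechanically. The following sketch, using the hypothetical helper names //parse_classifier_id()// and //parent_classifier_id()//, turns an identifier such as //CL1.11// into its numeric path and finds the enclosing node.

```cpp
#include <string>
#include <vector>

// Parse "CL1.11" into the path {1, 11}. Returns an empty vector if the
// identifier does not follow the CL<number>(.<number>)* pattern.
std::vector<int> parse_classifier_id(const std::string &id) {
    std::vector<int> path;
    if (id.size() < 3 || id.compare(0, 2, "CL") != 0) return path;
    int value = 0;
    bool have_digit = false;
    for (std::string::size_type i = 2; i < id.size(); ++i) {
        const char c = id[i];
        if (c >= '0' && c <= '9') {
            value = value * 10 + (c - '0');
            have_digit = true;
        } else if (c == '.' && have_digit) {
            path.push_back(value);  // finished one level of the hierarchy
            value = 0;
            have_digit = false;
        } else {
            return std::vector<int>();  // malformed identifier
        }
    }
    if (!have_digit) return std::vector<int>();  // trailing period
    path.push_back(value);
    return path;
}

// "CL1.11" -> "CL1"; a top-level node such as "CL1" has no parent ("").
std::string parent_classifier_id(const std::string &id) {
    const std::string::size_type dot = id.rfind('.');
    return dot == std::string::npos ? std::string() : id.substr(0, dot);
}
```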
 +
 +<!-- id:806 -->To process a //cl// document request, the Filter object is used to retrieve the node over the protocol. Depending on the data returned, further protocol calls are made to retrieve document metadata. In this case, the titles of the books are retrieved. However, if the node were an interior one whose children are themselves nodes, the titles of the child nodes would be retrieved. From a coding point of view this amounts to the same thing, and is handled by the same mechanism.
 +
 +<!-- id:807 -->​Finally,​ all the retrieved information is bound together, using the macro language, to produce the web page shown in Figure <imgref figure_browsing_titles_in_the_gutenberg_collection>​.
 +
 +==== <!-- id:808 -->​Generating the home page ====
 +
 +<​imgcaption figure_greenstone_home_page|%!-- id:809 --%Greenstone home page ></​imgcaption>​
 +{{..:​images:​dev_fig_28.png?​397x393&​direct}}
 +
 +<!-- id:810 -->As a final example, we look at generating the Greenstone home page. Figure <imgref figure_greenstone_home_page> shows the home page of the default Greenstone installation after some test collections have been installed. Its URL, which you can see at the top of the screen, includes the arguments //a=p// and //p=home//. Thus, like the “about this collection” page, it is generated by a //pageaction// (//a=p//), but this time the page to produce is //home// (//p=home//). The macro language, therefore, accesses the content of //home.dm//. There is no need to specify a collection (with the //c// variable) in this case.
 +
 +<!-- id:811 -->The purpose of the home page is to show what collections are available. Clicking on an icon takes the user to the “about this collection” page for that collection. The menu of collections is dynamically generated every time the page is loaded, based on the collections that are in the file system at that time. When a new one comes online, it automatically appears on the home page when that page is reloaded (provided the collection is stipulated to be “public”).
 +
 +<!-- id:812 -->To do this the receptionist uses the protocol (of course). As part of appraising the CGI arguments, //​pageaction//​ is programmed to detect the special case when //p=home//. Then, the action uses the protocol call //​get_collection_list()//​ to establish the current set of online collections. For each of these it calls //​get_collectinfo()//​ to obtain information about it. This information includes whether the collection is publicly available, what the URL is for the collection'​s icon (if any), and the collection'​s full name. This information is used to generate an appropriate entry for the collection on the home page.
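
The loop over collections can be sketched as follows. The function names //get_collection_list()// and //get_collectinfo()// come from the text, but their signatures here, the //CollectInfo// fields, the //ToyProtocol// class and the generated HTML are assumptions made for the sake of a self-contained example.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical subset of the information returned for a collection.
struct CollectInfo {
    bool is_public;
    std::string icon_url;   // empty if the collection has no icon
    std::string full_name;
};

// In-memory stand-in for the protocol.
class ToyProtocol {
public:
    void add(const std::string &name, const CollectInfo &info) { collections_[name] = info; }
    std::vector<std::string> get_collection_list() const {
        std::vector<std::string> names;
        for (std::map<std::string, CollectInfo>::const_iterator it = collections_.begin();
             it != collections_.end(); ++it) names.push_back(it->first);
        return names;
    }
    CollectInfo get_collectinfo(const std::string &name) const {
        std::map<std::string, CollectInfo>::const_iterator it = collections_.find(name);
        return it == collections_.end() ? CollectInfo() : it->second;
    }
private:
    std::map<std::string, CollectInfo> collections_;
};

// Build one link per public collection, pointing at its "about" page.
std::string home_page_entries(const ToyProtocol &protocol) {
    std::string html;
    const std::vector<std::string> names = protocol.get_collection_list();
    for (std::vector<std::string>::size_type i = 0; i < names.size(); ++i) {
        const CollectInfo info = protocol.get_collectinfo(names[i]);
        if (!info.is_public) continue;  // private collections are not listed
        const std::string label = info.icon_url.empty()
            ? info.full_name
            : "<img src=\"" + info.icon_url + "\" alt=\"" + info.full_name + "\">";
        html += "<a href=\"library?a=p&amp;p=about&amp;c=" + names[i] + "\">" + label + "</a>\n";
    }
    return html;
}
```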
 +
 +===== <!-- id:813 -->​Source code =====
 +
 +<​tblcaption table_standalone_programs_included_in_greenstone|Standalone programs included in Greenstone></​tblcaption>​
 +|< - 132 397 >|
 +| <!-- id:815 -->//​setpasswd///​ | <!-- id:816 -->​Password support for Windows. |
 +| <!-- id:817 -->//​getpw///​ | <!-- id:818 -->​Password support for Unix. |
 +| <!-- id:819 -->//​txt2db///​ | <!-- id:820 -->​Convert an XML-like ASCII text format to Gnu's database format. |
 +| <!-- id:821 -->//​db2txt///​ | <!-- id:822 -->​Convert the Gnu database format to an XML-like ASCII text format. |
 +| <!-- id:823 -->//​phind///​ | <!-- id:824 -->​Hierarchical phrase browsing tool. |
 +| <!-- id:825 -->//​hashfile///​ | <!-- id:826 -->​Compute unique document ID based on content of file. |
 +| <!-- id:827 -->//​mgpp///​ | <!-- id:828 -->​Rewritten and updated version of Managing Gigabytes package in C++. |
 +| <!-- id:829 -->//​w32server///​ | <!-- id:830 -->Local library server for Windows. |
 +| <!-- id:831 -->//​checkis///​ | <!-- id:832 -->​Specific support for installing Greenstone under Windows. |
 +
 +<!-- id:833 -->The source code for the runtime system resides in //​GSDLHOME/​src//​. It occupies two subdirectories,​ //recpt// for the receptionist'​s code and //​colservr//​ for the collection server'​s. Greenstone runs on Windows systems right down to Windows 3.1, and unfortunately this imposes an eight-character limit on file and directory names. This explains why cryptic abbreviations like //recpt// and //​colservr//​ are used. The remaining subdirectories include standalone utilities, mostly in support of the building process. They are listed in Table <tblref table_standalone_programs_included_in_greenstone>​.
 +
 +<!-- id:834 -->​Another directory, //​GSDLHOME/​lib//,​ includes low-level objects that are used by both receptionist and collection server. This code is described in Section [[#​common_greenstone_types|common_greenstone_types]].
 +
 +<!-- id:835 -->​Greenstone makes extensive use of the Standard Template Library (STL), a widely-used C++ library from Silicon Graphics (// www.sgi.com //) that is the result of many years of design and development. Like all programming libraries it takes some time to learn. Appendix A gives a brief overview of key parts that are used throughout the Greenstone code. For a fuller description,​ consult the official STL reference manual, available online at // www.sgi.com //, or one of the many STL textbooks, for example Josuttis (1999).
 +
 +===== <!-- id:836 -->​Common Greenstone types =====
 +
 +<!-- id:837 -->The objects defined in //​GSDLHOME/​lib//​ are low-level Greenstone objects, built on top of STL, which pervade the entire source code. First we describe //text_t//, an object used to represent Unicode text, in some detail. Then we summarize the purpose of each library file.
 +
 +==== <!-- id:838 -->The text_t object ====
 +
 +<!-- id:839 -->​Greenstone works with multiple languages, both for the content of a collection and its user interface. To support this, Unicode is used throughout the source code. The underlying object that realises a Unicode string is //text_t//.
 +
 +<​imgcaption figure_the_text_t_api|%!-- id:840 --%The //text_t// API (abridged) %!-- withLineNumber --%></​imgcaption>​
 +<code 1>
 +typedef vector<unsigned short> usvector;

 +class text_t {
 +protected:
 +   usvector text;
 +   unsigned short encoding; // 0 = unicode, 1 = other

 +public:
 +   // constructors
 +   text_t ();
 +   text_t (int i);
 +   text_t (char *s); // assumed to be a normal c string

 +   void setencoding (unsigned short theencoding);
 +   unsigned short getencoding ();

 +   // STL container support
 +   iterator begin ();
 +   iterator end ();

 +   void erase(iterator pos);
 +   void push_back(unsigned short c);
 +   void pop_back();

 +   void reserve (size_type n);

 +   bool empty () const {return text.empty();}
 +   size_type size() const {return text.size();}

 +   // added functionality
 +   void clear ();
 +   void append (const text_t &t);

 +   // support for integers
 +   void appendint (int i);
 +   void setint (int i);
 +   int getint () const;

 +   // support for arrays of chars
 +   void appendcarr (char *s, size_type len);
 +   void setcarr (char *s, size_type len);
 +};
 +</​code>​
 +
 +
 +
 +<!-- id:841 -->Greenstone stores each Unicode character in two bytes. Figure <imgref figure_the_text_t_api> shows the main features of the //text_t// Application Program Interface (API). It fulfils the two-byte requirement using the C++ built-in type //short//, which is at least two bytes wide and exactly two bytes on most platforms. The data type central to the //text_t// object is a dynamic array of unsigned shorts built using the STL declaration //vector<unsigned short>// and given the abbreviated name //usvector//.
 +
 +<!-- id:842 -->The constructor functions (lines 10 to 12) explicitly support three forms of initialisation: construction with no parameters, which generates an empty Unicode string; construction with an integer parameter, which generates a Unicode text version of the numeric value provided; and construction with a //char*// parameter, which treats the argument as a null-terminated C string and generates a Unicode version of it.
 +
 +<!-- id:843 -->Following this, most of the detail (lines 17 to 28) is taken up maintaining an STL vector-style container: //begin()//, //end()//, //push_back()//, //empty()// and so forth. There is also support for clearing and appending strings, as well as for converting an integer value into a Unicode text string, and returning the corresponding integer value of text that represents a number.
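
To give a feel for what the integer and C string support involves, here is a minimal sketch built directly on a //usvector//. It is not the //text_t// implementation: the free functions //from_c_string()//, //append_int()// and //get_int()// are hypothetical stand-ins, and the sketch assumes each input byte maps to one 16-bit value.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

typedef std::vector<unsigned short> usvector;

// Widen a null-terminated C string: each byte becomes one 16-bit unit.
usvector from_c_string(const char *s) {
    usvector text;
    for (; *s != '\0'; ++s)
        text.push_back(static_cast<unsigned short>(static_cast<unsigned char>(*s)));
    return text;
}

// Append the decimal representation of i, one 16-bit unit per digit.
void append_int(usvector &text, int i) {
    const std::string digits = std::to_string(i);  // requires C++11
    for (std::string::size_type k = 0; k < digits.size(); ++k)
        text.push_back(static_cast<unsigned short>(static_cast<unsigned char>(digits[k])));
}

// Read a leading, optionally signed, decimal integer; returns 0 if there is none.
int get_int(const usvector &text) {
    std::string ascii;
    for (usvector::size_type k = 0; k < text.size(); ++k) {
        const unsigned short c = text[k];
        const bool sign = (k == 0 && (c == '-' || c == '+'));
        if (!sign && (c < '0' || c > '9')) break;
        ascii.push_back(static_cast<char>(c));
    }
    return static_cast<int>(std::strtol(ascii.c_str(), NULL, 10));
}
```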
 +
 +<​imgcaption figure_overloaded_operators_to_text_t|%!-- id:844 --%Overloaded operators to //text_t// %!-- withLineNumber --%></​imgcaption>​
 +<code 1>
 +class text_t {
 +   // ...
 +public:
 +   text_t &operator=(const text_t &x);
 +   text_t &operator+= (const text_t &t);
 +   reference operator[](size_type n);

 +   text_t &operator=(int i);
 +   text_t &operator+= (int i);
 +   text_t &operator= (char *s);
 +   text_t &operator+= (char *s);

 +   friend inline bool operator!=(const text_t& x, const text_t& y);
 +   friend inline bool operator==(const text_t& x, const text_t& y);
 +   friend inline bool operator< (const text_t& x, const text_t& y);
 +   friend inline bool operator> (const text_t& x, const text_t& y);
 +   friend inline bool operator>=(const text_t& x, const text_t& y);
 +   friend inline bool operator<=(const text_t& x, const text_t& y);
 +   // ...
 +};
 +</​code>​
 +
 +
 +
 +<!-- id:845 -->There are many overloaded operators that do not appear in Figure <imgref figure_the_text_t_api>. To give a flavour of the operations supported, these are shown in Figure <imgref figure_overloaded_operators_to_text_t>. Line 4 supports assignment of one //text_t// object to another, and line 5 overloads the //+=// operator to provide a more natural way to append one //text_t// object to the end of another. It is also possible, through line 6, to access a particular Unicode character (represented as a //short//) using array subscripting [ ]. Assign and append operators are also provided for integers and C strings. Lines 12 to 18 provide Boolean operators for comparing two //text_t// objects: equals, does not equal, precedes alphabetically, and so on.
 +
 +<!-- id:846 -->​Member functions that take //const// arguments instead of non- //const// ones are also provided (but not shown here). Such repetition is routine in C++ objects, making the API fatter but no bigger conceptually. In reality, many of these functions are implemented as single in-line statements. For more detail, refer to the source file //​GSDLHOME/​lib/​text_t.h//​.
 +
 +==== <!-- id:847 -->The Greenstone library code ====
 +
 +<!-- id:848 -->The header files in //​GSDLHOME/​lib//​ include a mixture of functions and objects that provide useful support for the Greenstone runtime system. Where efficiency is of concern, functions and member functions are declared //inline//. For the most part, implementation details are contained within a header file's //.cpp// counterpart.
 +
 +<​tblcaption table_table|##​HIDDEN##></​tblcaption>​
 +|< - 100 450 >|
 +| <!-- id:849 -->​**cfgread.h** | <!-- id:850 -->​Functions to read and write configuration files. For example, //​read_cfg_line()//​ takes as arguments the input stream to use and the //​text_tarray//​ (shorthand for //​vector<​text_t>//​) to fill out with the data that is read. |
 +| <!-- id:851 -->​**display.h** | <!-- id:852 -->A sophisticated object used by the receptionist for setting, storing and expanding macros, plus supporting types. Section [[#​receptionist|receptionist]] gives further details. |
 +| <!-- id:853 -->​**fileutil.h** | <!-- id:854 -->​Function support for several file utilities in an operating system independent way. For example, //​filename_cat()//​ takes up to six //text_t// arguments and returns a //text_t// that is the result of concatenating the items together using the appropriate directory separator for the current operating system. |
 +| <!-- id:855 -->​**gsdlconf.h** | <!-- id:856 -->​System-specific functions that answer questions such as: does the operating system being used for compilation need to access //​strings.h//​ as well as //​string.h//?​ Are all the appropriate values for file locking correctly defined? |
 +| <!-- id:857 -->**gsdltimes.h** | <!-- id:858 -->Function support for dates and times. For example, //time2text()// takes computer time, expressed as the number of seconds that have elapsed since 1 January 1970, and converts it into the form YYYY/MM/DD hh:mm:ss, which it returns as type //text_t//. |
 +| <!-- id:859 -->**gsdltools.h** | <!-- id:860 -->Miscellaneous support for the Greenstone runtime system: determine whether the machine is little-endian or big-endian; check whether Perl is available; execute a system command (with a few bells and whistles); and escape special macro characters in a //text_t// string. |
 +| <!-- id:861 -->​**gsdlunicode.h** | <!-- id:862 -->A series of inherited objects that support processing Unicode //text_t// strings through IO streams, such as Unicode to UTF-8 and //vice versa//; and the removal of zero-width spaces. Support for map files is also provided through the //​mapconvert//​ object, with mappings loaded from //​GSDLHOME/​mappings//​. |
 +| <!-- id:863 -->​**text_t.h** | <!-- id:864 -->​Primarily the Unicode text object described above. It also provides two classes for converting streams: //​inconvertclass//​ and //​outconvertclass//​. These are the base classes used in //​gsdlunicode.h//​. |
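 +
 +To make one of these helpers concrete, the conversion attributed to //time2text()// can be sketched with the standard C library alone. This is not the Greenstone code: the name //time2text_sketch// is hypothetical, the result is a //std::string// rather than a //text_t//, and the sketch assumes UTC, whereas the real function may well use local time.
 +
```cpp
#include <cassert>
#include <ctime>
#include <string>

// Hypothetical stand-in for time2text(): formats seconds since
// 1 January 1970 as YYYY/MM/DD hh:mm:ss. Assumes UTC (an assumption
// of this sketch, not a documented property of the real function).
std::string time2text_sketch(std::time_t seconds) {
    std::tm *parts = std::gmtime(&seconds);
    assert(parts != nullptr);  // gmtime fails only for unrepresentable times
    char buffer[32];
    std::size_t len =
        std::strftime(buffer, sizeof buffer, "%Y/%m/%d %H:%M:%S", parts);
    return std::string(buffer, len);
}
```
 +
 +Under these assumptions, //time2text_sketch(0)// gives 1970/01/01 00:00:00.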
 +
 +===== <!-- id:865 -->​Collection server =====
 +
 +<!-- id:866 -->Now we systematically explain all the objects in the conceptual framework of Figure <imgref figure_greenstone_runtime_system>. We start at the bottom of the diagram, which is also the foundation of the system, with Search, Source and Filter, and work our way up through the protocol layer and on to the central components in the receptionist: Actions, Format and Macro Language. Then we focus on object initialisation, since this is easier to understand once the role of the various objects is known.
 +
 +<!-- id:867 -->Most of the classes central to the conceptual framework are expressed using virtual inheritance (here meaning inheritance from a base class whose member functions are declared //virtual//) to aid extensibility. With virtual inheritance, inherited objects can be passed around as their base class, but when a member function is called it is the version defined in the inherited object that is invoked. Because the Greenstone source code uses the base class throughout, except at the point of object construction, different implementations (using, perhaps, radically different underlying technologies) can be slotted into place easily.
 +
 +<!-- id:868 -->For example, suppose a base class called //BaseCalc// provides basic arithmetic: add, subtract, multiply and divide. If all its functions are declared virtual, and arguments and return types are all declared as strings, we can easily implement inherited versions of the object. One, called //FixedPrecisionCalc//, might use C library functions to convert between strings and integers and back again, implementing the calculations using the standard arithmetic operators: //+//, //-//, //*//, and /////. Another, called //InfinitePrecisionCalc//, might access the string arguments a character at a time, implementing arithmetic operations that are in principle infinite in their precision. By writing a main program that uses //BaseCalc// throughout, the implementation can be switched between fixed precision and infinite precision by editing just one line: the point where the calculator object is constructed.
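 +
 +The calculator example can be sketched as follows. Only addition is shown, the infinite precision version is restricted to non-negative decimal integers, and the helper //make_calc()// is an invention of this sketch that marks the single point of construction.
 +
```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <string>

// Base class: every operation takes and returns strings.
class BaseCalc {
public:
    virtual ~BaseCalc() {}
    virtual std::string add(const std::string &a,
                            const std::string &b) const = 0;
};

// Converts to machine integers and uses the built-in + operator.
class FixedPrecisionCalc : public BaseCalc {
public:
    std::string add(const std::string &a,
                    const std::string &b) const override {
        return std::to_string(std::stol(a) + std::stol(b));
    }
};

// Works a character at a time, so operand length is limited only by memory.
// For brevity this sketch handles non-negative decimal integers only.
class InfinitePrecisionCalc : public BaseCalc {
public:
    std::string add(const std::string &a,
                    const std::string &b) const override {
        std::string result;
        int carry = 0;
        std::size_t i = a.size(), j = b.size();
        while (i > 0 || j > 0 || carry != 0) {
            int sum = carry;
            if (i > 0) sum += digit(a[--i]);
            if (j > 0) sum += digit(b[--j]);
            result.push_back(static_cast<char>('0' + sum % 10));
            carry = sum / 10;
        }
        if (result.empty()) result = "0";
        std::reverse(result.begin(), result.end());
        return result;
    }

private:
    static int digit(char c) {
        assert(c >= '0' && c <= '9');  // sketch supports decimal digits only
        return c - '0';
    }
};

// The only place that names a concrete class: change this one line to
// switch between the two implementations.
std::unique_ptr<BaseCalc> make_calc() {
    return std::unique_ptr<BaseCalc>(new InfinitePrecisionCalc());
}
```
 +
 +Code that calls //make_calc()// and then //add()// through the returned //BaseCalc// pointer never names a concrete class, so replacing //InfinitePrecisionCalc// with //FixedPrecisionCalc// inside //make_calc()// is the one-line change described above.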
 +
 +==== <!-- id:869 -->The Search object ====
 +
 +<​imgcaption figure_search_base_class_api|%!-- id:870 --%Search base class API ></​imgcaption>​
 +<​code>​
 +class searchclass {
 +public:
 +   ​searchclass ();
 +   ​virtual ~searchclass ();
 +   // the index directory must be set before any searching
 +   // is done
 +   ​virtual void setcollectdir (const text_t &​thecollectdir);​
 +   // the search results are returned in queryresults
 +   // search returns '​true'​ if it was able to do a search
 +   ​virtual bool search(const queryparamclass &​queryparams,​
 +                         ​queryresultsclass &​queryresults)=0;​
 +   // the document text for '​docnum'​ is placed in '​output'​
 +   // docTargetDocument returns '​true'​ if it was able to
 +   // try to get a document
 +   // collection is needed to see if an index from the
 +   // collection is loaded. If no index has been loaded
 +   // defaultindex is needed to load one
 +   ​virtual bool docTargetDocument(const text_t &​defaultindex,​
 +                                               const text_t &​defaultsubcollection,​
 +                                               const text_t &​defaultlanguage,​
 +                                               const text_t &​collection,​
 +                                               int docnum,
 +                                               ​text_t &​output)=0;​
 +protected:
 +   ​querycache *cache;
 +   ​text_t collectdir; // the collection directory
 +};
 +</​code>​
 +
 +
 +
 +<!-- id:871 -->Figure <imgref figure_search_base_class_api> shows the base class API for the Search object in Figure <imgref figure_greenstone_runtime_system>. Its two central virtual member functions are //search()// and //docTargetDocument()//. As signified by the //=0// that follows the argument declaration, these are //pure// virtual functions, meaning that a class that inherits from this object must implement both before it can be instantiated (otherwise the compiler will complain).
 +
 +<!-- id:872 -->The class also includes two protected data fields: //collectdir// and //cache//. A Search object is instantiated for a particular collection, and the //collectdir// field is used to store where on the file system that collection (and more importantly its index files) resides. The //cache// field retains the result of a query. This is used to speed up subsequent queries that duplicate the query (and its settings). While identical queries may seem unlikely, in fact they occur on a regular basis. The Greenstone protocol is stateless. To generate a results page like Figure <imgref figure_searching_gutenberg_for_darcy> but for matches 11 to 20 of the same query, the search is transmitted again, this time specifying that documents 11 to 20 are returned. Caching makes this efficient, because the fact that the search has already been performed is detected and the results are lifted straight from the cache.
 +
 +<!-- id:873 -->Both data fields are applicable to every inherited object that implements a searching mechanism. This is why they appear in the base class, and are declared within a protected section of the class so that inherited classes can access them directly.
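 +
 +The effect of the cache can be illustrated with a much simplified sketch. The names //QueryKey// and //CachingSearch// are hypothetical, the key holds only an index name and a query string, and the search itself is a placeholder that pretends 25 documents match; the real //querycache// and //queryparamclass// are richer than this.
 +
```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the query parameters.
struct QueryKey {
    std::string index;
    std::string querystring;
    bool operator<(const QueryKey &other) const {
        if (index != other.index) return index < other.index;
        return querystring < other.querystring;
    }
};

class CachingSearch {
public:
    CachingSearch() : searches_run(0) {}

    // Returns matches first..last (1-based, inclusive) for the query.
    std::vector<int> results(const QueryKey &key,
                             std::size_t first, std::size_t last) {
        assert(first >= 1 && first <= last);
        std::map<QueryKey, std::vector<int> >::iterator hit = cache.find(key);
        if (hit == cache.end()) {
            // First time this query is seen: run it and remember the result.
            hit = cache.insert(std::make_pair(key, run_search(key))).first;
        }
        const std::vector<int> &all = hit->second;
        std::vector<int> page;
        for (std::size_t i = first; i <= last && i <= all.size(); ++i) {
            page.push_back(all[i - 1]);
        }
        return page;
    }

    int searches_run;  // how many times the expensive search really ran

private:
    // Placeholder for the real index lookup: pretends 25 documents match.
    std::vector<int> run_search(const QueryKey &) {
        ++searches_run;
        std::vector<int> docnums;
        for (int d = 1; d <= 25; ++d) docnums.push_back(1000 + d);
        return docnums;
    }

    std::map<QueryKey, std::vector<int> > cache;
};
```
 +
 +Requesting matches 1 to 10 and then 11 to 20 for the same key runs the placeholder search once; the second request is answered from the cache.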
 +
 +==== <!-- id:874 -->​Search and retrieval with MG ====
 +
 +<!-- id:875 -->Greenstone uses MG (short for Managing Gigabytes, see Witten //et al//., 1999) to index and retrieve documents, and the source code is included in the //GSDLHOME/packages// directory. MG uses compression techniques to minimise disk space usage without compromising execution speed. For a collection of English documents, the compressed text and full text indexes together typically occupy one third of the space of the original uncompressed text alone. Search and retrieval are often quicker than the equivalent operations on the uncompressed version, because there are fewer disk operations.
 +
 +<​imgcaption figure_api_for_direct_access_to_mg|%!-- id:876 --%API for direct access to MG (abridged) ></​imgcaption>​
 +<​code>​
 +enum result_kinds {
 +   result_docs,      // Return the documents found in last search
 +   result_docnums,   // Return document id numbers and weights
 +   result_termfreqs, // Return terms and frequencies
 +   result_terms      // Return matching query terms
 +};
 +int mgq_ask(char *line);
 +int mgq_results(enum result_kinds kind, int skip, int howmany,
 +                int (*sender)(char *, int, int, float, void *),
 +                void *ptr);
 +int mgq_numdocs(void);
 +int mgq_numterms(void);
 +int mgq_equivterms(unsigned char *wordstem,
 +                   int (*sender)(char *, int, int, float, void *),
 +                   void *ptr);
 +int mgq_docsretrieved (int *total_retrieved, int *is_approx);
 +int mgq_getmaxstemlen ();
 +void mgq_stemword (unsigned char *word);
 +
 +
 +
 +<!-- id:877 -->MG is normally used interactively by typing commands from the command line, and one way to implement //​mgsearchclass//​ would be to use the C library //​system()//​ call within the object to issue the appropriate mg commands. A more efficient approach, however, is to tap directly into the mg code using function calls. While this requires a deeper understanding of the mg code, much of the complexity can be hidden behind a new API that becomes the point of contact for the object //​mgsearchclass//​. This is the role of //​colserver/​mgq.c//,​ whose API is shown in Figure <imgref figure_api_for_direct_access_to_mg>​.
 +
 +<!-- id:878 -->The way to supply parameters to mg is via //​mgq_ask()//,​ which takes text options in a format identical to that used at the command line, such as:
 +
 +<​code>​
 +mgq_ask(".set casefold off");
 +</​code>​
 +
 +<!-- id:879 -->It is also used to invoke a query. Results are accessed through //mgq_results()//, which takes a pointer to a function as its fourth parameter. This provides a flexible way of converting the information returned in mg data structures into those needed by //mgsearchclass//. Calls such as //mgq_numdocs()//, //mgq_numterms()//, and //mgq_docsretrieved()// also return information, but this time more tightly prescribed. The last two functions in the figure, //mgq_getmaxstemlen()// and //mgq_stemword()//, give support for stemming.
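 +
 +The callback mechanism can be sketched without MG itself. Below, //fake_results()// is a hypothetical stand-in for //mgq_results()// that reports three made-up results, and the meaning given to each callback argument (text, length, document number, weight, caller's pointer) is an assumption of the sketch, not something stated in the API above. The point is the //void *// argument, which lets the caller choose the data structure that results are converted into.
 +
```cpp
#include <cassert>
#include <string>
#include <vector>

// Same shape as the sender argument of mgq_results(). The meaning of each
// parameter is an assumption of this sketch: text, its length, a document
// number, a weight, and the caller's own pointer.
typedef int (*sender_fn)(char *, int, int, float, void *);

struct DocHit {
    int docnum;
    float weight;
};

// Callback: copies each result into the vector that ptr refers to.
int collect_hit(char *, int, int docnum, float weight, void *ptr) {
    assert(ptr != nullptr);
    std::vector<DocHit> *hits = static_cast<std::vector<DocHit> *>(ptr);
    DocHit hit = {docnum, weight};
    hits->push_back(hit);
    return 0;
}

// Hypothetical stand-in for mgq_results(): reports up to howmany of three
// fixed, made-up results, skipping the first skip of them.
int fake_results(int skip, int howmany, sender_fn sender, void *ptr) {
    assert(skip >= 0);
    static const int docnums[] = {42, 7, 168483};
    static const float weights[] = {0.9f, 0.5f, 0.1f};
    int sent = 0;
    for (int i = skip; i < 3 && sent < howmany; ++i, ++sent) {
        sender(nullptr, 0, docnums[i], weights[i], ptr);
    }
    return sent;
}
```
 +
 +A caller declares a //std::vector<DocHit>//, passes its address as the last argument together with //collect_hit//, and finds the vector filled in when the call returns. A different callback could fill a different structure without any change to the function that produces the results.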
 +
 +==== <!-- id:880 -->The Source object ====
 +
 +<​imgcaption figure_source_base_class_api|%!-- id:881 --%Source base class API ></​imgcaption>​
 +<​code>​
 +class sourceclass {
 +public:
 +   ​sourceclass ();
 +   ​virtual ~sourceclass ();
 +   // configure should be called once for each configuration line
 +   ​virtual void configure (const text_t &key, const text_tarray &​cfgline);​
 +   // init should be called after all the configuration is done but
 +   // before any other methods are called
 +   ​virtual bool init (ostream &​logout);​
 +   // translate_OID translates OIDs using ".pr", ".fc" etc.
 +   virtual bool translate_OID (const text_t &OIDin, text_t &OIDout, comerror_t &err, ostream &logout);
 +   // get_metadata fills out the metadata if possible, if it is not
 +   // responsible for the given OID then it returns false.
 +   ​virtual bool get_metadata (const text_t &​requestParams,​ const text_t &​refParams,​
 +                           bool getParents, const text_tset &​fields,​ const text_t &OID,
 +                           ​MetadataInfo_tmap &​metadata,​ comerror_t &err, ostream &​logout);​
 +   ​virtual bool get_document (const text_t &OID, text_t &​doc, ​
 +                           ​comerror_t &err, ostream &​logout);​
 +};
 +</​code>​
 +
 +
 +
 +<!-- id:882 -->The role of Source in Figure <imgref figure_greenstone_runtime_system>​ is to access document metadata and document text, and its base class API is shown in Figure <imgref figure_source_base_class_api>​. A member function maps to each task: //​get_metadata()//​ and //​get_document()//​ respectively. Both are declared //​virtual//,​ so the version provided by a particular implementation of the base class is called at runtime. One inherited version of this object uses gdbm to implement //​get_metadata()//​ and mg to implement //​get_document()//:​ we detail this version below.
 +
 +<!-- id:883 -->Other member functions seen in Figure <imgref figure_source_base_class_api>​ are //​configure()//,​ //init()//, and //​translate_OID()//​. The first two relate to the initialisation process described in Section [[#​initialisation|initialisation]].
 +
 +<!-- id:884 -->The remaining one, //​translate_OID()//,​ handles the syntax for expressing document identifiers. In Figure <imgref figure_the_golf_course_mystery>​ we saw how a page number could be appended to a document identifier to retrieve just that page. This was possible because pages were stored as “sections” when the collection was built. Appending “.1” to an OID retrieves the first section of the corresponding document. Sections can be nested, and are accessed by concatenating section numbers separated by periods.
 +
 +<!-- id:885 -->As well as hierarchical section numbers, the document identifier syntax supports a form of relative access. For the current section of a document it is possible to access the //first child// by appending //.fc//, the //last child// by appending //.lc//, the //parent// by appending //.pr//, the //next sibling// by appending //.ns//, and the //previous sibling// by appending //.ps//.
 +
 +<!-- id:886 -->The //​translate_OID()//​ function uses parameters //OIDin// and //OIDout// to hold the source and result of the conversion. It takes two further parameters, //err// and //logout//. These communicate any error status that may arise during the translation operation, and determine where to send logging information. The parameters are closely aligned with the protocol, as we shall see in Section [[#​protocol|protocol]].
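 +
 +A simplified sketch of this relative addressing follows. It is not the Greenstone implementation: the function name //translate_oid_sketch//, the //ChildCounts// table standing in for the collection database, and the assumption that child sections are numbered consecutively from 1 are all made up for illustration, and the //err// and //logout// parameters are omitted.
 +
```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the collection database: maps an OID to the
// number of child sections it has (children are assumed numbered 1..n).
typedef std::map<std::string, int> ChildCounts;

// Splits "HASH01.2.3" into parent "HASH01.2" and section number 3.
// Returns false if there is no numeric final component.
static bool split_oid(const std::string &oid, std::string &parent,
                      int &section) {
    std::string::size_type dot = oid.rfind('.');
    if (dot == std::string::npos || dot + 1 == oid.size()) return false;
    const std::string tail = oid.substr(dot + 1);
    if (tail.find_first_not_of("0123456789") != std::string::npos) return false;
    parent = oid.substr(0, dot);
    section = std::stoi(tail);
    return true;
}

static int child_count(const ChildCounts &db, const std::string &oid) {
    ChildCounts::const_iterator it = db.find(oid);
    return it == db.end() ? 0 : it->second;
}

// Resolves a trailing .fc, .lc, .pr, .ns or .ps. Returns false when the
// requested relative does not exist; OIDout is then left unchanged.
// An OID without such a suffix is copied through as it stands.
bool translate_oid_sketch(const ChildCounts &db, const std::string &OIDin,
                          std::string &OIDout) {
    if (OIDin.size() < 4 || OIDin[OIDin.size() - 3] != '.') {
        OIDout = OIDin;
        return true;
    }
    const std::string base = OIDin.substr(0, OIDin.size() - 3);
    const std::string op = OIDin.substr(OIDin.size() - 2);
    if (op == "fc" || op == "lc") {
        int n = child_count(db, base);
        if (n == 0) return false;
        OIDout = base + "." + std::to_string(op == "fc" ? 1 : n);
        return true;
    }
    if (op != "pr" && op != "ns" && op != "ps") {
        OIDout = OIDin;  // an ordinary section number such as ".10"
        return true;
    }
    std::string parent;
    int section = 0;
    if (!split_oid(base, parent, section)) return false;
    if (op == "pr") {
        OIDout = parent;
        return true;
    }
    int target = (op == "ns") ? section + 1 : section - 1;
    if (target < 1 || target > child_count(db, parent)) return false;
    OIDout = parent + "." + std::to_string(target);
    return true;
}
```
 +
 +With a table recording that //HASH01// has three children, the input //HASH01.2.ns// translates to //HASH01.3//, while //HASH01.3.ns// fails because there is no fourth section.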
 +
 +=== <!-- id:887 -->​Database retrieval with gdbm ===
 +
 +<!-- id:888 -->GDBM is the GNU database manager program (//www.gnu.org//). It implements a flat record structure of key/data pairs, and is backwards compatible with dbm and ndbm. Operations include storage, retrieval and deletion of records by key, and an unordered traversal of all keys.
 +
 +<​imgcaption figure_gdbm_database_for_the_gutenberg_collection|%!-- id:889 --%Gdbm database for the Gutenberg collection (excerpt) ></​imgcaption>​
 +<​code>​
 +[HASH01d7b30d4827b51282919e9b]
 +<​doctype> ​        doc
 +<​hastxt> ​          0
 +<​Title> ​            The Winter'​s Tale
 +<​Creator> ​        ​William Shakespeare
 +<​archivedir> ​  ​HASH01d7/​b30d4827.dir
 +<​thistype> ​      ​Invisible
 +<​childtype> ​    Paged
 +<contains>     ".1;".2;".3;".4;".5;".6;".7;".8;".9;".10;".11;".12; \
 +               ".13;".14;".15;".16;".17;".18;".19;".20;".21;".22; \
 +               ".23;".24;".25;".26;".27;".28;".29;".30;".31;".32; \
 +               ".33;".34;".35
 +<docnum>       168483
 +----------------------------------------------------------------------
 +[CL1]
 +<​doctype> ​        ​classify
 +<​hastxt> ​          0
 +<​childtype> ​    HList
 +<​Title> ​            Title
 +<​numleafdocs>​ 1818
 +<​thistype> ​      ​Invisible
 +<contains>     ".1;".2;".3;".4;".5;".6;".7;".8;".9;".10;".11;".12; \
 +               ".13;".14;".15;".16;".17;".18;".19;".20;".21;".22; \
 +               ".23;".24
 +----------------------------------------------------------------------
 +[CL1.1]
 +<​doctype> ​        ​classify
 +<​hastxt> ​          0
 +<​childtype> ​    VList
 +<​Title> ​            A
 +<​numleafdocs>​ 118
 +<​contains> ​     HASH0130bc5f9f90089b3723431f;​HASH9cba43bacdab5263c98545;​\
 +                          HASH12c88a01da6e8379df86a7;​HASH9c86579a83e1a2e4cf9736; ​  \
 +                           ​HASHdc2951a7ada1f36a6c3aca;​HASHea4dda6bbc7cdeb4abfdee; ​  \
 +                          HASHce55006513c47235ac38ba;​HASH012a33acaa077c0e612b9351;​\
 +                          HASH010dd1e923a123826ae30e4b;​HASHaf674616785679fed4b7ee;​\
 +                         ​HASH0147eef4b9d1cb135e096619;​HASHe69b9dbaa83ffb045d963b;​\
 +                         ​HASH01abc61c646c8e7a8ce88b10;​HASH5f9cd13678e21820e32f3a;​\
 +                         ​HASHe8cbba1594c72c98f9aa1b;​HASH01292a2b7b6b60dec96298bc;​\
 +                         ...
 +</​code>​
 +
 +
 +
 +<!-- id:890 -->Figure <imgref figure_gdbm_database_for_the_gutenberg_collection> shows an excerpt from the collection information database that is created when building the Gutenberg collection. The excerpt was produced using the Greenstone utility //db2txt//, which converts the gdbm binary database format into textual form. Figure <imgref figure_gdbm_database_for_the_gutenberg_collection> contains three records, separated by horizontal rules. The first is a document entry; the other two are part of the hierarchy created by the //AZList// classifier for titles in the collection. The first line of each record is its key.
 +
 +<!-- id:891 -->The document record stores the book's title, author, and any other metadata provided (or extracted) when the collection was built. It also records values for internal use: where files associated with this document reside (//<​archivedir>//​) and the document number used internally by mg (//<​docnum>//​).
 +
 +<!-- id:892 -->The //<​contains>//​ field stores a list of elements, separated by semicolons, that point to related records in the database. For a document record, //<​contains>//​ is used to point to the nested sections. Subsequent record keys are formed by concatenating the current key with one of the child elements (separated by a period).
 +
 +<!-- id:893 -->The second record in Figure <imgref figure_gdbm_database_for_the_gutenberg_collection> is the top node for the classification hierarchy of //Titles A-Z//. Its children, accessed through the //<contains>// field, include //CL1.1//, //CL1.2//, //CL1.3// and so on, and correspond to the individual pages for the letters //A//, //B//, //C// etc. There are only 24 children: the //AZList// classifier merged the //Q-R// and //Y-Z// entries because they covered only a few titles.
 +
 +<!-- id:894 -->The children in the //<contains>// field of the third record, //CL1.1//, are the documents themselves. More complicated structures are possible: the //<contains>// field can include a mixture of documents and further //CL// nodes. Keys expressed relative to the current one are distinguished from absolute keys because they begin with a quotation mark (").
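 +
 +Expanding a //<contains>// field into absolute keys can be sketched as below. The function //expand_contains()// is hypothetical, it uses //std::string// in place of //text_t//, and it assumes that a leading quotation mark stands for the key of the current record with the rest of the element appended unchanged.
 +
```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Expands a <contains> value into absolute record keys. Elements are
// separated by semicolons; an element that starts with a quotation mark
// is relative, and the mark is replaced by the key of the current record.
std::vector<std::string> expand_contains(const std::string &current_key,
                                         const std::string &contains) {
    std::vector<std::string> keys;
    std::string::size_type start = 0;
    while (start <= contains.size()) {
        std::string::size_type end = contains.find(';', start);
        if (end == std::string::npos) end = contains.size();
        std::string element = contains.substr(start, end - start);
        // Drop any spaces left over from formatting; keys contain none.
        element.erase(std::remove(element.begin(), element.end(), ' '),
                      element.end());
        if (!element.empty()) {
            if (element[0] == '"') {
                keys.push_back(current_key + element.substr(1));  // relative
            } else {
                keys.push_back(element);                          // absolute
            }
        }
        start = end + 1;
    }
    return keys;
}
```
 +
 +Applied to the record //CL1//, the relative elements expand to //CL1.1//, //CL1.2// and so on, while the hash identifiers listed under //CL1.1// pass through unchanged because they are already absolute.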
 +
 +=== <!-- id:895 -->Using MG and GDBM to implement a Source object ===
 +
 +<​imgcaption figure_api_for_mg_and_gdbm_based_version_of_sourceclass|%!-- id:896 --%API for mg and gdbm based version of //​sourceclass//​ (abridged) ></​imgcaption>​
 +<​code>​
 +class mggdbmsourceclass : public sourceclass {
 +protected:
 +   // Omitted, data fields that store:
 +   // ​    ​collection specific file information
 +   // ​    index substructure
 +   // ​    ​information about parent
 +   // ​    ​pointers to gdbm and mgsearch objects
 +public:
 +   ​mggdbmsourceclass ();
 +   ​virtual ~mggdbmsourceclass ();
 +   void set_gdbmptr (gdbmclass *thegdbmptr);​
 +   void set_mgsearchptr (searchclass *themgsearchptr);​
 +   void configure (const text_t &key, const text_tarray &​cfgline);​
 +   bool init (ostream &​logout);​
 +   bool translate_OID (const text_t &OIDin, text_t &​OIDout,​
 +                                           ​comerror_t &err, ostream &​logout);​
 +   bool get_metadata (const text_t &​requestParams,​
 +                                           const text_t &​refParams,​
 +                                         bool getParents, const text_tset &​fields,​
 +                                         const text_t &OID, MetadataInfo_tmap &​metadata,​
 +                                         ​comerror_t &err, ostream &​logout);​
 +   bool get_document (const text_t &OID, text_t &doc,
 +                                         ​comerror_t &err, ostream &​logout);​
 +};
 +</​code>​
 +
 +
 +
 +<!-- id:897 -->The object that puts mg and gdbm together to realise an implementation of //​sourceclass//​ is //​mggdbmsourceclass//​. Figure <imgref figure_api_for_mg_and_gdbm_based_version_of_sourceclass>​ shows its API. The two new member functions //​set_gdbmptr()//​ and //​set_mgsearchptr()//​ store pointers to their respective objects, so that the implementations of //​get_metadata()//​ and //​get_document()//​ can access the appropriate tools to complete the job.
 +
 +==== <!-- id:898 -->The Filter object ====
 +
 +<​imgcaption figure_api_for_the_filter_base_class|%!-- id:899 --%API for the Filter base class ></​imgcaption>​
 +<​code>​
 +class filterclass {
 +protected:
 +   ​text_t gsdlhome;
 +   ​text_t collection;
 +   ​text_t collectdir;
 +   ​FilterOption_tmap filterOptions;​
 +public:
 +   ​filterclass ();
 +   ​virtual ~filterclass ();
 +   ​virtual void configure (const text_t &​key,​ const text_tarray &​cfgline);​
 +   ​virtual bool init (ostream &​logout);​
 +   // returns the name of this filter
 +   ​virtual text_t get_filter_name ();
 +   // returns the current filter options
 +   ​virtual void get_filteroptions (InfoFilterOptionsResponse_t &​response,​
 +                                 ​comerror_t &err, ostream &​logout);​
 +   ​virtual void filter (const FilterRequest_t &​request,​
 +                         ​FilterResponse_t &​response,​
 +                         ​comerror_t &err, ostream &​logout);​
 +};
 +</​code>​
 +
 +
 +
 +<!-- id:900 -->The base class API for the Filter object in Figure <imgref figure_greenstone_runtime_system>​ is shown in Figure <imgref figure_api_for_the_filter_base_class>​. It begins with the protected data fields //​gsdlhome//,​ //​collection//,​ and //​collectdir//​. These commonly occur in classes that need to access collection-specific files.
 +
 +  * <!-- id:901 -->//​gsdlhome//​ is the same as //​GSDLHOME//,​ so that the object can locate the Greenstone files.
 +  * <!-- id:902 -->//​collection//​ is the name of the directory corresponding to the collection.
 +  * <!-- id:903 -->//​collectdir//​ is the full pathname of the collection directory (this is needed because a collection does not have to reside within the //​GSDLHOME//​ area).
 +
 +<!-- id:904 -->//mggdbmsourceclass// is another class that includes these three data fields.
 +
 +<!-- id:905 -->The member functions //​configure()//​ and //init()// (first seen in //​sourceclass//​) are used by the initialisation process. The object itself is closely aligned with the corresponding filter part of the protocol; in particular //​get_filteroptions()//​ and //​filter()//​ match one for one.
 +
 +<​imgcaption figure_how_a_filter_option_is_stored|%!-- id:906 --%How a filter option is stored ></​imgcaption>​
 +<​code>​
 +struct FilterOption_t {
 +   void clear ();
 +   void check_defaultValue ();
 +   ​FilterOption_t () {clear();}
 +   ​text_t name;
 +   enum type_t {booleant=0,​ integert=1, enumeratedt=2,​ stringt=3};
 +   ​type_t type;
 +   enum repeatable_t {onePerQuery=0,​ onePerTerm=1,​ nPerTerm=2};​
 +   ​repeatable_t repeatable;
 +   ​text_t defaultValue;​
 +   ​text_tarray validValues;​
 +};
 +struct OptionValue_t {
 +   void clear ();
 +   ​text_t name;
 +   ​text_t value;
 +};
 +</​code>​
 +
 +
 +
 +<!-- id:907 -->Central to the filter options are the two classes shown in Figure <imgref figure_how_a_filter_option_is_stored>. Stored inside //FilterOption_t// are the //name// of the option, its //type//, and whether or not it is //repeatable//. The interpretation of //validValues// depends on the option type. For a Boolean type the first value is //false// and the second is //true//. For an integer type the first value is the minimum number, the second the maximum. For an enumerated type all values are listed. For a string type the value is ignored. For simpler situations, //OptionValue_t// is used, which records as a //text_t// the //name// of the option and its //value//.
 +
 +<!-- id:908 -->The request and response objects passed as parameters to //​filterclass//​ are constructed from these two classes, using associative arrays to store a set of options such as those required for //​InfoFilterOptionsResponse_t//​. More detail can be found in //​GSDLHOME/​src/​recpt/​comtypes.h//​.
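 +
 +How //validValues// is read for each option type can be sketched as a validity check. //OptionSketch// is a cut-down stand-in for //FilterOption_t// using //std::string// instead of //text_t//, and //is_valid_value()// is a hypothetical helper; neither is part of the Greenstone source.
 +
```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Simplified stand-in for FilterOption_t, using std::string for text_t.
struct OptionSketch {
    enum type_t {booleant = 0, integert = 1, enumeratedt = 2, stringt = 3};
    std::string name;
    type_t type;
    std::vector<std::string> validValues;
};

// Returns true if value is acceptable for the option, reading
// validValues according to the option's type.
bool is_valid_value(const OptionSketch &opt, const std::string &value) {
    switch (opt.type) {
    case OptionSketch::booleant:
        // validValues[0] spells "false", validValues[1] spells "true".
        assert(opt.validValues.size() == 2);
        return value == opt.validValues[0] || value == opt.validValues[1];
    case OptionSketch::integert: {
        // validValues[0] is the minimum, validValues[1] the maximum.
        assert(opt.validValues.size() == 2);
        if (value.empty()) return false;
        char *end = nullptr;
        long n = std::strtol(value.c_str(), &end, 10);
        if (*end != '\0') return false;  // trailing non-digits
        long lo = std::strtol(opt.validValues[0].c_str(), nullptr, 10);
        long hi = std::strtol(opt.validValues[1].c_str(), nullptr, 10);
        return lo <= n && n <= hi;
    }
    case OptionSketch::enumeratedt:
        // Every permitted value is listed.
        for (std::size_t i = 0; i < opt.validValues.size(); ++i) {
            if (opt.validValues[i] == value) return true;
        }
        return false;
    case OptionSketch::stringt:
        return true;  // validValues is ignored for strings
    }
    return false;
}
```
 +
 +For an integer option whose //validValues// are 1 and 500, the value 50 is accepted and 501 is rejected; for an enumerated option only the listed spellings are accepted. The option names and limits used to exercise the sketch are invented.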
 +
 +==== <!-- id:909 -->​Inherited Filter objects ====
 +
 +<​imgcaption figure_inheritance_hierarchy_for_filter|%!-- id:910 --%Inheritance hierarchy for Filter ></​imgcaption>​
 +{{..:​images:​dev_fig_38.gif?​337x245&​direct}}
 +
 +<!-- id:911 -->Two levels of inheritance are used for filters, as illustrated in Figure <imgref figure_inheritance_hierarchy_for_filter>​. First a distinction is made between Query and Browse filters, and then for the former there is a specific implementation based on mg. To operate correctly, //​mgqueryfilterclass//​ needs access to mg through //​mgsearchclass//​ and to gdbm through //​gdbmclass//​. //​browsefilterclass//​ only needs access to gdbm. Pointers to these objects are stored as protected data fields within the respective classes.
 +
 +==== <!-- id:912 -->The collection server code ====
 +
 +<!-- id:913 -->Here are the header files in //​GSDLHOME/​src/​colservr//,​ with a description of each. The filename generally repeats the object name defined within it.
 +
 +<​tblcaption table_table_1|##​HIDDEN##></​tblcaption>​
 +|< - 120 420 >|
 +| <!-- id:914 -->​**browsefilter.h** | <!-- id:915 -->​Inherited from //​filterclass//,​ this object provides access to gdbm. (Described above.) |
 +| <!-- id:916 -->​**collectserver.h** | <!-- id:917 -->This object binds Filters and Sources for one collection together, to form the Collection object depicted in Figure <imgref figure_greenstone_runtime_system>​. |
 +| <!-- id:918 -->​**colservrconfig.h** | <!-- id:919 -->​Function support for reading the collection-specific files //​etc/​collect.cfg//​ and //​index/​build.cfg//​. The former is the collection'​s configuration file. The latter is a file generated by the building process that records the time of the last successful build, an index map list, how many documents were indexed, and how large they are in bytes (uncompressed). |
 +| <!-- id:920 -->​**filter.h** | <!-- id:921 -->The base class Filter object //​filterclass//​ described above. |
 +| <!-- id:922 -->​**maptools.h** | <!-- id:923 -->​Defines a class called //​stringmap//​ that provides a mapping that remembers the original order of a //text_t// map, but is fast to look up. Used in //​mggdbmsourceclass//​ and //​queryfilterclass//​. |
 +| <!-- id:924 -->​**mggdbmsource.h** | <!-- id:925 -->​Inherited from //​sourceclass//,​ this object provides access to mg and gdbm. (Described above.) |
 +| <!-- id:926 -->​**mgppqueryfilter.h** | <!-- id:927 -->​Inherited from //​queryfilterclass//,​ this object provides an implementation of //​QueryFilter//​ based upon mg++, an improved version of mg written in C++. Note that Greenstone is set up to use mg by default, since mg++ is still under development. |
 +| <!-- id:928 -->​**mgppsearch.h** | <!-- id:929 -->​Inherited from //​searchclass//,​ this object provides an implementation of Search using mg++. Like //​mgppqueryfilterclass//,​ this is not used by default. |
 +| <!-- id:930 -->**mgq.h** | <!-- id:931 -->Function-level interface to the mg package. Principal functions are //mgq_ask()// and //mgq_results()//. |
 +| <!-- id:932 -->​**mgqueryfilter.h** | <!-- id:933 -->​Inherited from //​queryfilterclass//,​ this object provides an implementation of //​QueryFilter//​ based upon mg. |
 +| <!-- id:934 -->​**mgsearch.h** | <!-- id:935 -->​Inherited from //​searchclass//,​ this object provides an implementation of Search using mg. (Described above.) |
 +| <!-- id:936 -->**phrasequeryfilter.h** | <!-- id:937 -->Inherited from //mgqueryfilterclass//, this object provides a phrase-based query class. It is not used in the default installation. Instead //mgqueryfilterclass// provides this capability through functional support from //phrasesearch.h//. |
 +| <!-- id:938 -->​**phrasesearch.h** | <!-- id:939 -->​Functional support to implement phrase searching as a post-processing operation. |
 +| <!-- id:940 -->​**querycache.h** | <!-- id:941 -->Used by //​searchclass//​ and its inherited classes to cache the results of a query, in order to make the generation of further search results pages more efficient. (Described above.) |
 +| <!-- id:942 -->​**queryfilter.h** | <!-- id:943 -->​Inherited from the Filter base class //​filterclass//,​ this object establishes a base class for Query filter objects. (Described above.) |
 +| <!-- id:944 -->​**queryinfo.h** | <!-- id:945 -->​Support for searching: data structures and objects to hold query parameters, document results and term frequencies. |
 +| <!-- id:946 -->​**search.h** | <!-- id:947 -->The base class Search object //​searchclass//​. (Described above.) |
 +| <!-- id:948 -->​**source.h** | <!-- id:949 -->The base class Source object //​sourceclass//​. (Described above.) |
 +
 +===== <!-- id:950 -->​Protocol =====
 +
 +<​tblcaption table_list_of_protocol_calls|List of protocol calls></​tblcaption>​
 +|< - 132 397 >|
 +| <!-- id:952 -->//​get_protocol_name()//​ | <!-- id:953 -->​Returns the name of this protocol. Choices include //​nullproto//,​ //​corbaproto//,​ and //​z3950proto//​. Used by protocol-sensitive parts of the runtime system to decide which code to execute. |
 +| <!-- id:954 -->//​get_collection_list()//​ | <!-- id:955 -->​Returns the list of collections that this protocol knows about. |
 +| <!-- id:956 -->//​has_collection()//​ | <!-- id:957 -->​Returns //true// if the protocol can communicate with the named collection, i.e. it is within its collection list. |
 +| <!-- id:958 -->//​ping()//​ | <!-- id:959 -->​Returns //true// if a successful connection was made to the named collection. In the null protocol the implementation is identical to //​has_collection().//​ |
 +| <!-- id:960 -->//get_collectinfo()// | <!-- id:961 -->Obtains general information about the named collection: when it was last built, how many documents it contains, and so on. Also includes metadata from the collection configuration file: the “about this collection” text, the collection icon to use, and so on. |
 +| <!-- id:962 -->//​get_filterinfo()//​ | <!-- id:963 -->Gets a list of all Filters for the named collection. |
 +| <!-- id:964 -->//​get_filteroptions()//​ | <!-- id:965 -->Gets all options for a particular Filter within the named collection. |
 +| <!-- id:966 -->//​filter()//​ | <!-- id:967 -->​Supports searching and browsing. For a given filter type and option settings, it accesses the content of the named collections to produce a result set that is filtered in accordance with the option settings. The data fields returned also depend on the option settings: examples include query term frequency and document metadata. |
 +| <!-- id:968 -->//​get_document()//​ | <!-- id:969 -->Gets a document or section of a document. |
 +
 +<!-- id:970 -->Table <tblref table_list_of_protocol_calls> lists the function calls to the protocol, with a summary for each entry. The examples in Section [[#how_the_conceptual_framework_fits_together|how_the_conceptual_framework_fits_together]] covered most of these. Functions not previously mentioned are //has_collection()//, //ping()//, //get_protocol_name()// and //get_filteroptions()//. The first two provide yes/no answers to the questions “does the collection exist on this server?” and “is it running?” respectively. The purpose of the other two is to support multiple protocols within an architecture that is distributed over different computers, not just the null-protocol based single executable described here. One of these, //get_protocol_name()//, distinguishes which protocol is being used. The other, //get_filteroptions()//, lets a receptionist interrogate a collection server to find what options are supported, and so dynamically configure itself to take full advantage of the services offered by a particular server.
 +
 +<​imgcaption figure_null_protocol_api|%!-- id:971 --%Null protocol API (abridged) ></​imgcaption>​
 +<​code>​
 +class nullproto : public recptproto {
 +public:
 +   ​virtual text_t get_protocol_name ();
 +   ​virtual void get_collection_list (text_tarray &​collist,​
 +                comerror_t &err, ostream &​logout);​
 +   ​virtual void has_collection (const text_t &​collection,​
 +                bool &​hascollection,​
 +                comerror_t &err, ostream &​logout);​
 +   ​virtual void ping (const text_t &​collection,​
 +                  bool &​wassuccess,​
 +                  comerror_t &err, ostream &​logout);​
 +   ​virtual void get_collectinfo (const text_t &​collection,​
 +                  ColInfoResponse_t &​collectinfo,​
 +                  comerror_t &err, ostream &​logout);​
 +   ​virtual void get_filterinfo (const text_t &​collection,​
 +                   ​InfoFiltersResponse_t &​response,​
 +                   ​comerror_t &err, ostream &​logout);​
 +   ​virtual void get_filteroptions (const text_t &​collection,​
 +                   const InfoFilterOptionsRequest_t &​request,​
 +                   ​InfoFilterOptionsResponse_t &​response,​
 +                   ​comerror_t &err, ostream &​logout);​
 +   ​virtual void filter (const text_t &​collection,​
 +                   ​FilterRequest_t &​request,​
 +                   ​FilterResponse_t &​response,​
 +                   ​comerror_t &err, ostream &​logout);​
 +   ​virtual void get_document (const text_t &​collection,​
 +                  const DocumentRequest_t &​request,​
 +                  DocumentResponse_t &​response,​
 +                  comerror_t &err, ostream &​logout);​
 +};
 +</​code>​
 +
 +
 +
 +<!-- id:972 -->​Figure <imgref figure_null_protocol_api>​ shows the API for the null protocol. Comments, and certain low level details, have been omitted (see the source file //​recpt/​nullproto.h//​ for full details).
 +
 +<!-- id:973 -->This protocol inherits from the base class //​recptproto//​. Virtual inheritance is used so that more than one type of protocol—including protocols not even conceived yet—can be easily supported in the rest of the source code. This is possible because the base class object //​recptproto//​ is used throughout the source code, with the exception of the point of construction. Here we specify the actual variety of protocol we wish to use—in this case, the null protocol.
 +
 +<!-- id:974 -->With the exception of //​get_protocol_name()//,​ which takes no parameters and returns the protocol name as a Unicode-compliant text string, all protocol functions include an error parameter and an output stream as the last two arguments. The error parameter records any errors that occur during the execution of the protocol call, and the output stream is for logging purposes. The functions have type //​void//​—they do not explicitly return information as their final statement, but instead return data through designated parameters such as the already-introduced error parameter. In some programming languages, such routines would be defined as procedures rather than functions, but C++ makes no syntactic distinction.
 +
 +<!-- id:975 -->Most functions take the collection name as an argument. Three of the member functions, //​get_filteroptions()//,​ //​filter()//,​ and //​get_document()//,​ follow the pattern of providing a Request parameter and receiving the results in a Response parameter.
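The calling convention described above can be illustrated with a minimal sketch. The types below are simplified stand-ins written for this example (the real definitions live in the protocol headers and carry more fields), and the enumerator names and the body of the function are assumptions, not the Greenstone implementation.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical, cut-down stand-ins for the protocol types.
enum comerror_t { noError, protocolError };
struct DocumentRequest_t  { std::string OID; };
struct DocumentResponse_t { std::string doc; };

// Mirrors the pattern described above: the function is void, results come
// back through the Response parameter, failures through err, and
// diagnostics are written to logout.
void get_document(const std::string &collection,
                  const DocumentRequest_t &request,
                  DocumentResponse_t &response,
                  comerror_t &err, std::ostream &logout) {
  err = noError;
  response.doc.clear();
  if (collection.empty() || request.OID.empty()) {
    err = protocolError;
    logout << "get_document: missing collection or OID\n";
    return;
  }
  // Placeholder content; a real server would fetch the document text.
  response.doc = "text of " + request.OID + " from " + collection;
}
```

A caller checks the error parameter after every call, since the function itself returns nothing.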
 +
 +===== <!-- id:976 -->​Receptionist =====
 +
 +<!-- id:977 -->The final layer of the conceptual model is the receptionist. Once the CGI arguments are parsed, the main activity is the execution of an Action, supported by the Format and Macro Language objects. These are described below. Although they are represented as objects in the conceptual framework, Format and Macro Language objects are not strictly objects in the C++ sense. In reality, Format is a collection of data structures with a set of functions that operate on them, and the Macro Language object is built around //​displayclass//,​ defined in //​lib/​display.h//,​ with stream conversion support from //​lib/​gsdlunicode.h//​.
 +
 +==== <!-- id:978 -->​Actions ====
 +
 +<​tblcaption table_actions_in_greenstone|Actions in Greenstone></​tblcaption>​
 +|< - 132 397 >|
 +| <!-- id:980 -->//​action//​ | <!-- id:981 -->Base class for virtual inheritance. |
 +| <!-- id:982 -->//​authenaction//​ | <!-- id:983 -->​Supports user authentication:​ prompts the user for a password if one has not been entered; checks whether it is valid; and forces the user to log in again if sufficient time lapses between accesses. |
 +| <!-- id:984 -->//​collectoraction//​ | <!-- id:985 -->​Generates the pages for the Collector. |
 +| <!-- id:986 -->//​documentaction//​ | <!-- id:987 -->​Retrieves documents, document sections, parts of the classification hierarchy, or formatting information. |
 +| <!-- id:988 -->//​extlinkaction//​ | <!-- id:989 -->Takes a user directly to a URL that is external to a collection, possibly generating an alert page first (dictated by the Preferences). |
 +| <!-- id:990 -->//​pageaction//​ | <!-- id:991 -->​Generates a page in conjunction with the macro language. |
 +| <!-- id:992 -->//​pingaction//​ | <!-- id:993 -->​Checks to see whether a collection is online. |
 +| <!-- id:994 -->//​queryaction//​ | <!-- id:995 -->​Performs a search. |
 +| <!-- id:996 -->//​statusaction//​ | <!-- id:997 -->​Generates the administration pages. |
 +| <!-- id:998 -->//​tipaction//​ | <!-- id:999 -->​Brings up a random tip for the user. |
 +| <!-- id:1000 -->//​usersaction//​ | <!-- id:1001 -->​Supports adding, deleting, and managing user access. |
 +
 +<!-- id:1002 -->​Greenstone supports the eleven actions summarised in Table <tblref table_actions_in_greenstone>​.
 +
 +<​imgcaption figure_using_the_cgiargsinfoclass_from_pageactioncpp|%!-- id:1003 --%Using the //​cgiargsinfoclass//​ from //​pageaction.cpp//​ %!-- withLineNumber --%></​imgcaption>​
 +<code 1>
 +cgiarginfo arg_ainfo;
 +arg_ainfo.shortname = "a";
 +arg_ainfo.longname = "action";
 +arg_ainfo.multiplechar = true;
 +arg_ainfo.argdefault = "p";
 +arg_ainfo.defaultstatus = cgiarginfo::​weak;​
 +arg_ainfo.savedarginfo = cgiarginfo::​must;​
 +argsinfo.addarginfo (NULL, arg_ainfo);
 +
 +arg_ainfo.shortname = "p";
 +arg_ainfo.longname = "page";
 +arg_ainfo.multiplechar = true;
 +arg_ainfo.argdefault = "home";
 +arg_ainfo.defaultstatus = cgiarginfo::​weak;​
 +arg_ainfo.savedarginfo = cgiarginfo::​must;​
 +argsinfo.addarginfo (NULL, arg_ainfo);
 +</​code>​
 +
 +
 +
 +<!-- id:1004 -->The CGI arguments needed by an action are formally declared in its constructor function using //cgiarginfo// (defined in //recpt/cgiargs.h//). Figure <imgref figure_using_the_cgiargsinfoclass_from_pageactioncpp> shows an excerpt from the //pageaction// constructor function, which defines the size and properties of the CGI arguments //a// and //p//.
 +
 +<!-- id:1005 -->For each CGI argument, the constructor must specify: its short name (lines 2 and 10), which is the name of the CGI variable itself; a long name (lines 3 and 11), which gives a more meaningful description of the argument; whether it represents a single or multiple character value (lines 4 and 12); a possible default value (lines 5 and 13); what happens when more than one default value is supplied (lines 6 and 14), since defaults can also be set in configuration files; and whether or not the value is preserved at the end of this action (lines 7 and 15).
 +
 +<!-- id:1006 -->Since it is built into the code, web pages that detail this information can be generated automatically. The //​statusaction//​ produces this information. It can be viewed by entering the URL for the Greenstone administration page.
 +
 +<!-- id:1007 -->The inherited actions are constructed in //main()//, the top-level function for the //library// executable, whose definition is given in //recpt/librarymain.cpp//. This is also where the //receptionist// object (defined in //recpt/receptionist.cpp//) is constructed. Responsibility for all the actions is passed to the //receptionist//, which processes them by maintaining, as a data field, an associative array of the Action base class, indexed by action name.
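As a rough illustration of an associative array of base-class actions indexed by name, the following sketch uses ''std::map''. The class ''actionmap'' and its member names are hypothetical, the action classes are reduced to a single member function, and the name "q" for the query action is an assumption made for the example.

```cpp
#include <cassert>
#include <map>
#include <string>

// Simplified stand-in for the action base class.
class action {
public:
  virtual ~action() {}
  virtual std::string get_action_name() { return ""; }
};

class pageaction : public action {
public:
  std::string get_action_name() override { return "p"; }
};

class queryaction : public action {
public:
  std::string get_action_name() override { return "q"; }
};

// Hypothetical registry: base-class pointers indexed by action name.
// The registry does not own the actions it points to.
class actionmap {
  std::map<std::string, action *> actions;
public:
  void add(action *a) { actions[a->get_action_name()] = a; }

  // Returns nullptr when no action is registered under this name.
  action *get(const std::string &name) const {
    std::map<std::string, action *>::const_iterator it = actions.find(name);
    return it == actions.end() ? nullptr : it->second;
  }
};
```

Because only the base class appears in the registry, a new kind of action can be added without changing the lookup code.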
 +
 +<​imgcaption figure_action_base_class_api|%!-- id:1008 --%Action base class API ></​imgcaption>​
 +<​code>​
 +class action {
 +protected:
 +   ​cgiargsinfoclass argsinfo;
 +   ​text_t gsdlhome;
 +public:
 +   ​action ();
 +   ​virtual ~action ();
 +   ​virtual void configure (const text_t &key, const text_tarray &​cfgline);​
 +   ​virtual bool init (ostream &​logout);​
 +   ​virtual text_t get_action_name ();
 +   ​cgiargsinfoclass getargsinfo ();
 +   ​virtual bool check_cgiargs (cgiargsinfoclass &​argsinfo,​
 +              cgiargsclass &args, ostream &​logout);​
 +   ​virtual bool check_external_cgiargs (cgiargsinfoclass &​argsinfo,​
 +              cgiargsclass &args,
 +              outconvertclass &​outconvert,​
 +              const text_t &​saveconf,​
 +              ostream &​logout);​
 +   ​virtual void get_cgihead_info (cgiargsclass &args,
 +              recptprotolistclass *protos,
 +              response_t &​response,​
 +              text_t &​response_data,​
 +              ostream &​logout);​
 +   ​virtual bool uses_display (cgiargsclass &args);
 +   ​virtual void define_internal_macros (displayclass &disp,
 +              cgiargsclass &args,
 +              recptprotolistclass *protos,
 +              ostream &​logout);​
 +   ​virtual void define_external_macros (displayclass &disp,
 +              cgiargsclass &args,
 +              recptprotolistclass *protos,
 +              ostream &​logout);​
 +   ​virtual bool do_action (cgiargsclass &args,
 +              recptprotolistclass *protos,
 +              browsermapclass *browsers,
 +              displayclass &disp,
 +              outconvertclass &​outconvert,​
 +              ostream &​textout,​
 +              ostream &​logout);​
 +};
 +</​code>​
 +
 +
 +
 +<!-- id:1009 -->Figure <imgref figure_action_base_class_api> shows the API for the Action base class. When executing an action, //receptionist// calls several functions, starting with //check_cgiargs()//. Most help to check, set up, and define values and macros, while //do_action()// actually generates the output page. If a particular inherited object has no definition for a particular member function, the call passes through to the base class definition, which implements appropriate default behaviour.
 +
 +<!-- id:1010 -->​Explanations of the member functions are as follows.
 +
 +  * <!-- id:1011 -->//​get_action_name()//​ returns the CGI //a// argument value that specifies this action. The name should be short but may be more than one character long.
 +  * <!-- id:1012 -->//​check_cgiargs()//​ is called before //​get_cgihead_info()//,​ //​define_external_macros()//,​ and //​do_action()//​. If an error is found a message is written to //logout//; if it is serious the function returns //false// and no page content is produced.
 +  * <!-- id:1013 -->//​check_external_cgiargs()//​ is called after //​check_cgiargs()//​ for all actions. It is intended for use only to override some other normal behaviour, for example producing a login page when the requested page needs authentication.
 +  * <!-- id:1014 -->//​get_cgihead_info()//​ sets the CGI header information. If //​response//​ is set to //​location//,​ then //​response_data//​ contains the redirect address. If //​response//​ is set to //​content//,​ then //​response_data//​ contains the content type.
 +  * <!-- id:1015 -->//​uses_display()//​ returns //true// if the //​displayclass//​ is needed to output the page content (the default).
 +  * <!-- id:1016 -->//​define_internal_macros()//​ defines all macros that are related to pages generated by this action.
 +  * <!-- id:1017 -->//​define_external_macros()//​ defines all macros that might be used by other actions to produce pages.
 +  * <!-- id:1018 -->//do_action()// generates the output page, normally streamed through the macro language object //disp// and the output conversion object //outconvert// to the output stream //textout//. Returns //false// if there was an error that prevented the action from outputting anything.
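The calling order described in the list above can be sketched as follows. This is a minimal model, not the receptionist code: ''tracedaction'' and ''run_action()'' are hypothetical names, all parameters are omitted, and the relative order of the two macro-definition steps is an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for an action: each member function only records
// that it was called, so the calling order can be inspected afterwards.
class tracedaction {
public:
  std::vector<std::string> calls;
  bool args_ok;

  tracedaction() : args_ok(true) {}
  virtual ~tracedaction() {}

  virtual bool check_cgiargs() {
    calls.push_back("check_cgiargs");
    return args_ok;
  }
  virtual void get_cgihead_info() { calls.push_back("get_cgihead_info"); }
  virtual void define_internal_macros() { calls.push_back("define_internal_macros"); }
  virtual void define_external_macros() { calls.push_back("define_external_macros"); }
  virtual bool do_action() {
    calls.push_back("do_action");
    return true;
  }
};

// Hypothetical driver: argument checking comes first, and a serious
// error there means no page content is produced.
bool run_action(tracedaction &a) {
  if (!a.check_cgiargs()) return false;
  a.get_cgihead_info();
  a.define_internal_macros();  // order of the two macro steps is assumed
  a.define_external_macros();
  return a.do_action();
}
```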
 +
 +<!-- id:1019 -->At the beginning of the class definition, //argsinfo// is the protected data field (used in the code excerpt shown in Figure <imgref figure_using_the_cgiargsinfoclass_from_pageactioncpp>) that stores the CGI argument information specified in an inherited Action constructor function. The other data field, //gsdlhome//, records //GSDLHOME// for convenient access.((<!-- id:1249 -->The value for //gsdlhome// comes from //gsdlsite.cfg// located in the same directory as the CGI executable //library//, whereas //GSDLHOME// is set by running the //setup// script which accesses a different file, so technically it is possible for the two values to be different. While possible, it is not desirable, and the above text is written assuming they are the same.)) The object also includes //configure()// and //init()// for initialisation purposes.
 +
 +==== <!-- id:1020 -->​Formatting ====
 +
 +<​imgcaption figure_core_data_structures_in_format|%!-- id:1021 --%Core data structures in Format ></​imgcaption>​
 +<​code>​
 +enum command_t {comIf, comOr, comMeta, comText, comLink, comEndLink,
 +         ​comNum,​ comIcon, comDoc,
 +         ​comHighlight,​ comEndHighlight};​
 +enum pcommand_t {pNone, pImmediate, pTop, pAll};
 +enum dcommand_t {dMeta, dText};
 +enum mcommand_t {mNone, mCgiSafe};
 +struct metadata_t {
 +   void clear();
 +   ​metadata_t () {clear();}
 +   ​text_t metaname;
 +   ​mcommand_t metacommand;​
 +   ​pcommand_t parentcommand;​
 +   ​text_t parentoptions;​
 +};
 +// The decision component of an {If}{decision,​true-text,​false-text}
 +// formatstring. The decision can be based on metadata or on text;
 +// normally that text would be a macro like
 +// _cgiargmode_.
 +struct decision_t {
 +   void clear();
 +   ​decision_t () {clear();}
 +   ​dcommand_t command;
 +   ​metadata_t meta;
 +   ​text_t text;
 +};
 +struct format_t {
 +   void clear();
 +   ​format_t () {clear();}
 +   ​command_t command;
 +   ​decision_t decision;
 +   ​text_t text;
 +   ​metadata_t meta;
 +   ​format_t *nextptr;
 +   ​format_t *ifptr;
 +   ​format_t *elseptr;
 +   ​format_t *orptr;
 +};
 +</​code>​
 +
 +
 +
 +<!-- id:1022 -->​Although formatting is represented as a single entity in Figure <imgref figure_greenstone_runtime_system>,​ in reality it constitutes a collection of data structures and functions. They are gathered together under the header file //​recpt/​formattools.h//​. The core data structures are shown in Figure <imgref figure_core_data_structures_in_format>​.
 +
 +<​imgcaption figure_data_structures_built_for_sample_format_statement|%!-- id:1023 --%Data structures built for sample //format// statement ></​imgcaption>​
 +{{..:​images:​dev_fig_43.png?​395x152&​direct}}
 +
 +<!-- id:1024 -->The implementation is best explained using an example. When the format statement
 +
 +<​code>​
 +format CL1Vlist
 +<!-- id:1025 -->"[link][Title]{If}{[Creator], by [Creator]}[/link]"
 +</​code>​
 +
 +<!-- id:1026 -->is read from a collection configuration file, it is parsed by functions in //​formattools.cpp//​ and the interconnected data structure shown in Figure <imgref figure_data_structures_built_for_sample_format_statement>​ is built. When the format statement needs to be evaluated by an action, the data structure is traversed. The route taken at //comIf// and //comOr// nodes depends on the metadata that is returned from a call to the protocol.
 +
 +<!-- id:1027 -->One complication is that when metadata is retrieved, it might include further macros and format syntax. This is handled by switching back and forth between parsing and evaluating, as needed.
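To make the traversal concrete, here is a minimal sketch that evaluates a linked structure of format nodes against a set of metadata values. It uses a cut-down ''format_t'' with only the fields needed here, stores the metadata name in the ''text'' field, and treats an //If// decision as "is this metadata non-empty"; all of these are simplifying assumptions rather than the behaviour of //formattools.cpp//.

```cpp
#include <cassert>
#include <map>
#include <string>

enum command_t { comText, comMeta, comIf };

// Cut-down node: text holds literal text (comText) or a metadata name
// (comMeta, and the name tested by comIf).
struct format_t {
  command_t command;
  std::string text;
  format_t *nextptr;
  format_t *ifptr;
  format_t *elseptr;
  format_t(command_t c, const std::string &t)
      : command(c), text(t), nextptr(nullptr), ifptr(nullptr), elseptr(nullptr) {}
};

typedef std::map<std::string, std::string> metadata_map;

// Returns the metadata value, or an empty string if it is not set.
std::string lookup(const metadata_map &meta, const std::string &name) {
  metadata_map::const_iterator it = meta.find(name);
  return it == meta.end() ? std::string() : it->second;
}

// Walks the nextptr chain; at a comIf node the branch taken depends on
// whether the tested metadata is available.
std::string evaluate(const format_t *node, const metadata_map &meta) {
  std::string out;
  for (; node != nullptr; node = node->nextptr) {
    switch (node->command) {
      case comText:
        out += node->text;
        break;
      case comMeta:
        out += lookup(meta, node->text);
        break;
      case comIf: {
        bool present = !lookup(meta, node->text).empty();
        out += evaluate(present ? node->ifptr : node->elseptr, meta);
        break;
      }
    }
  }
  return out;
}
```

With nodes for //[Title]//, an //If// on //Creator//, the text ", by " and //[Creator]//, a document that has only a title yields the title alone, while one that also has a creator yields the title followed by ", by " and the creator.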
 +
 +==== <!-- id:1028 -->Macro language ====
 +
 +<!-- id:1029 -->The Macro Language entity in Figure <imgref figure_greenstone_runtime_system>,​ like Format, does not map to a single C++ class. In this case there is a core class, but the implementation of the macro language also calls upon supporting functions and classes.
 +
 +<!-- id:1030 -->​Again,​ the implementation is best explained using an example. First we give some sample macro definitions that illustrate macro precedence, then—with the aid of a diagram—we describe the core data structures built to support this activity. Finally we present and describe the public member functions to //​displayclass//,​ the top-level macro object.
 +
 +<​imgcaption figure_illustration_of_macro_precedence|%!-- id:1031 --%Illustration of macro precedence ></​imgcaption>​
 +<​code>​
 +package query
 +_header_ []         ​{_querytitle_}
 +_header_ [l=en] ​    ​{Search page}
 +_header_ [c=demo] ​  ​{<​table bgcolor=green><​tr><​td>​_querytitle_</​td></​tr></​table>​}
 +_header_ [v=1]      {_textquery_}
 +_header_ [l=fr,​v=1,​c=hdl] {HDL Page de recherche}
 +</​code>​
 +
 +
 +
 +<!-- id:1032 -->In a typical Greenstone installation, macro precedence is normally: //c// (for the collection) takes precedence over //v// (for graphical or text-only interface), which takes precedence over //l// (for the language). This is accomplished by the line
 +
 +<​code>​
 +macroprecedence c,v,l
 +</​code>​
 +
 +<!-- id:1033 -->in the main configuration file //main.cfg//. The macro statements in Figure <imgref figure_illustration_of_macro_precedence> define sample macros for //_header_// in the //query// package for various settings of //c//, //v//, and //l//. If the CGI arguments given when an action is invoked included //c=dls//, //v=1//, and //l=en//, the macro //_header_[v=1]// would be selected for display. It would be selected ahead of //_header_[l=en]// because //v// has a higher precedence than //l//. The //_header_[l=fr,v=1,c=hdl]// macro would not be selected because its //l// and //c// settings differ from the page parameters.
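One way to model this selection is sketched below. The scoring rule is an assumption chosen to reproduce the example above, not the algorithm used by //displayclass//: a definition is eligible only if every parameter it names matches the page parameters, and among eligible definitions the one naming the higher-precedence parameters wins. The names ''macrodef'', ''score()'' and ''select_macro()'' are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, std::string> params_t;

struct macrodef {
  params_t params;    // e.g. l=en; empty for the [] default
  std::string value;  // macro body
};

// Returns -1 if any parameter of the definition conflicts with the page
// parameters; otherwise a weight in which earlier entries of precedence
// (highest first, e.g. c, v, l) count for more.
int score(const macrodef &def, const params_t &page,
          const std::vector<std::string> &precedence) {
  int total = 0;
  for (params_t::const_iterator it = def.params.begin();
       it != def.params.end(); ++it) {
    params_t::const_iterator p = page.find(it->first);
    if (p == page.end() || p->second != it->second) return -1;
    for (std::size_t i = 0; i < precedence.size(); ++i)
      if (precedence[i] == it->first)
        total += 1 << (precedence.size() - 1 - i);
  }
  return total;
}

// Picks the eligible definition with the highest score, or nullptr.
const macrodef *select_macro(const std::vector<macrodef> &defs,
                             const params_t &page,
                             const std::vector<std::string> &precedence) {
  const macrodef *best = nullptr;
  int bestscore = -1;
  for (std::size_t i = 0; i < defs.size(); ++i) {
    int s = score(defs[i], page, precedence);
    if (s > bestscore) {
      bestscore = s;
      best = &defs[i];
    }
  }
  return best;
}
```

With the five //_header_// definitions from the figure and page parameters //c=dls//, //v=1//, //l=en//, this rule picks the //v=1// definition, in line with the description above.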
 +
 +<​imgcaption figure_data_structures_representing_the_default_macros|%!-- id:1034 --%Data structures representing the default macros ></​imgcaption>​
 +{{..:​images:​dev_fig_45.png?​382x194&​direct}}
 +
 +<!-- id:1035 -->​Figure <imgref figure_data_structures_representing_the_default_macros>​ shows the core data structure built when reading the macro files specified in //​etc/​main.cfg//​. Essentially,​ it is an associative array of associative arrays of associative arrays. The top layer (shown on the left) indexes which package the macro is from, and the second layer indexes the macro name. The final layer indexes any parameters that were specified, storing each one as the type //mvalue// which records, along with the macro value, the file it came from. For example, the text defined for //​_header_[l=en]//​ in Figure <imgref figure_illustration_of_macro_precedence>​ can be seen stored in the lower of the two //mvalue// records in Figure <imgref figure_data_structures_representing_the_default_macros>​.
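The three-layer structure can be written down directly with nested ''std::map'' types. This is a sketch only: the typedef names, the use of a parameter string such as "l=en" as the innermost key, and the helper ''find_macro()'' are assumptions for illustration, and the real //mvalue// type may hold more than is shown.

```cpp
#include <cassert>
#include <map>
#include <string>

// Macro value plus the file it was read from.
struct mvalue {
  std::string filename;
  std::string value;
};

typedef std::map<std::string, mvalue> paramsmap;     // "l=en"   -> mvalue
typedef std::map<std::string, paramsmap> macromap;   // "header" -> parameters
typedef std::map<std::string, macromap> packagemap;  // "query"  -> macros

// Looks up one definition; returns nullptr if any layer has no entry.
const mvalue *find_macro(const packagemap &defaults,
                         const std::string &package,
                         const std::string &macroname,
                         const std::string &params) {
  packagemap::const_iterator p = defaults.find(package);
  if (p == defaults.end()) return nullptr;
  macromap::const_iterator m = p->second.find(macroname);
  if (m == p->second.end()) return nullptr;
  paramsmap::const_iterator v = m->second.find(params);
  return v == m->second.end() ? nullptr : &v->second;
}
```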
 +
 +<​imgcaption figure_displayclass_api|%!-- id:1036 --%//​Displayclass//​ API (abridged) ></​imgcaption>​
 +<​code>​
 +class displayclass
 +{
 +public:
 +   ​displayclass ();
 +   ​~displayclass ();
 +   int isdefaultmacro (text_t package, const text_t &​macroname);​
 +   int setdefaultmacro (text_t package, const text_t &​macroname,  ​
 +         ​text_t params, const text_t &​macrovalue);​
 +   int loaddefaultmacros (text_t thisfilename);​
 +   void openpage (const text_t &​thispageparams,​
 +         const text_t &​thisprecedence);​
 +   void setpageparams (text_t thispageparams,​
 +         ​text_t thisprecedence);​
 +   int setmacro (const text_t &​macroname,​
 +         ​text_t package,
 +         const text_t &​macrovalue);​
 +   void expandstring (const text_t &​inputtext,​ text_t &​outputtext);​
 +   void expandstring (text_t package, const text_t &​inputtext,​
 +         ​text_t &​outputtext,​ int recursiondepth = 0);
 +   void setconvertclass (outconvertclass *theoutc) {outc = theoutc;}
 +   ​outconvertclass *getconvertclass () {return outc;}
 +   ​ostream *setlogout (ostream *thelogout);​
 +};
 +</​code>​
 +
 +
 +
 +<!-- id:1037 -->The central object that supports the macro language is //displayclass//, defined in //lib/display.h//. Its public member functions are shown in Figure <imgref figure_displayclass_api>. The class reads the specified macro files using //loaddefaultmacros()//, storing in a protected section of the class (not shown) the type of data structure shown in Figure <imgref figure_data_structures_representing_the_default_macros>. Macros can also be set by the runtime system using //setmacro()//. In the last example of Section [[#how_the_conceptual_framework_fits_together|how_the_conceptual_framework_fits_together]], //pageaction// uses this function to set //_homeextra_// to the dynamically generated table of available collections. This is supported by a set of associative arrays similar to those used to represent macro files (it is not identical, because it does not require the “parameter” layer). In //displayclass//, macros read from file are referred to as //default macros//. Local macros specified through //setmacro()// are referred to as //current macros//, and are cleared from memory once the page has been generated.
 +
 +<!-- id:1038 -->When a page is to be produced, //openpage()// is first called to communicate the current settings of the page parameters (//l=en// and so on). Following that, text and macros are streamed through the class (typically from within an //actionclass//) using code along the lines of:
 +
 +<​code>​
 +cout << text_t2ascii << display << "​_amacro_ "
 +         <<​ "​_anothermacro_ ";
 +</​code>​
 +
 +<!-- id:1039 -->The result is that macros are expanded according to the page parameter settings. If required, these settings can be changed partway through an action by using //​setpageparams()//​. The remaining public member functions provide lower level support.
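The idea of macro expansion can be shown with a small stand-alone function. It is not the //displayclass// implementation: it ignores packages and page parameters, recognises only names made of letters and digits between underscores, and uses an arbitrary recursion limit; the free function ''expandstring()'' here merely borrows the name of the member function.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

typedef std::map<std::string, std::string> macros_t;

// Replaces each _name_ that has a definition with its (recursively
// expanded) value. Unknown names are copied through unchanged.
std::string expandstring(const std::string &input, const macros_t &macros,
                         int recursiondepth = 0) {
  const int maxdepth = 30;  // arbitrary guard against self-referencing macros
  if (recursiondepth > maxdepth) return input;

  std::string out;
  std::size_t i = 0;
  while (i < input.size()) {
    if (input[i] == '_') {
      // Scan the candidate macro name up to the closing underscore.
      std::size_t j = i + 1;
      while (j < input.size() &&
             std::isalnum(static_cast<unsigned char>(input[j])))
        ++j;
      if (j > i + 1 && j < input.size() && input[j] == '_') {
        macros_t::const_iterator it =
            macros.find(input.substr(i + 1, j - i - 1));
        if (it != macros.end()) {
          out += expandstring(it->second, macros, recursiondepth + 1);
          i = j + 1;
          continue;
        }
      }
    }
    out += input[i++];
  }
  return out;
}
```

A macro whose value mentions another macro is expanded in turn, which is why a depth limit is needed.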
 +
 +==== <!-- id:1040 -->The receptionist code ====
 +
 +<!-- id:1041 -->The principal objects in the receptionist have now been described. Below we detail the supporting classes, which reside in //​GSDLHOME/​src/​recpt//​. Except where efficiency is paramount—in which case definitions are in-line—implementation details are contained within a header file's //.cpp// counterpart. Supporting files often include the word //tool// as part of the file name, as in //​OIDtools.h//​ and //​formattools.h//​.
 +
 +<!-- id:1042 -->A second set of lexically scoped files includes the prefix //z3950//. These files provide remote access to online databases and catalogs that make their content publicly available using the Z39.50 protocol.
 +
 +<!-- id:1043 -->Another large group of supporting files includes the term //browserclass//. These files are related through a virtual inheritance hierarchy. As a group they support an abstract notion of browsing: serial page generation of compartmentalised document content or metadata. Browsing activities include perusing documents ordered alphabetically by title or chronologically by date; progressing through the titles returned by a query ten entries at a time; and accessing individual pages of a book using the “go to page” mechanism. Each browsing activity inherits from //browserclass//, the base class:
 +
 +  * <!-- id:1044 -->//​datelistbrowserclass//​ provides support for chronological lists;
 +  * <!-- id:1045 -->//​hlistbrowserclass//​ provides support for horizontal lists;
 +  * <!-- id:1046 -->//​htmlbrowserclass//​ provides support for pages of html;
 +  * <!-- id:1047 -->//​invbrowserclass//​ provides support for invisible lists;
 +  * <!-- id:1048 -->//​pagedbrowserclass//​ provides go to page support;
 +  * <!-- id:1049 -->//​vlistbrowserclass//​ provides support for vertical lists.
 +
 +<!-- id:1050 -->​Actions access //​browserclass//​ objects through //​browsetools.h//​.
 +
 +<​tblcaption table_table_2|##​HIDDEN##></​tblcaption>​
 +|< - 140 390 >|
 +| <!-- id:1051 -->​**OIDtools.h** | <!-- id:1052 -->​Function support for evaluating document identifiers over the protocol. |
 +| <!-- id:1053 -->​**action.h** | <!-- id:1054 -->Base class for the Actions entity depicted in Figure <imgref figure_greenstone_runtime_system>​. |
 +| <!-- id:1055 -->​**authenaction.h** | <!-- id:1056 -->​Inherited action for handling authentication of a user. |
 +| <!-- id:1057 -->​**browserclass.h** | <!-- id:1058 -->Base class for abstract browsing activities. |
 +| <!-- id:1059 -->**browsetools.h** | <!-- id:1060 -->Function support that accesses the //browserclass// hierarchy. Functionality includes expanding and contracting contents, outputting a table of contents, and generating controls such as the “go to page” mechanism. |
 +| <!-- id:1061 -->​**cgiargs.h** | <!-- id:1062 -->​Defines //​cgiarginfo//​ used in Figure <imgref figure_using_the_cgiargsinfoclass_from_pageactioncpp>,​ and other data structure support for CGI arguments. |
 +| <!-- id:1063 -->​**cgiutils.h** | <!-- id:1064 -->​Function support for CGI arguments using the data structures defined in //​cgiargs.h//​. |
 +| <!-- id:1065 -->​**cgiwrapper.h** | <!-- id:1066 -->​Function support that does everything necessary to output a page using the CGI protocol. Access is through the function \\ ''​void cgiwrapper (receptionist &recpt, text_t collection);''​ \\ <!-- id:1067 -->which is the only function declared in the header file. Everything else in the //.cpp// counterpart is lexically scoped to be local to the file (using the C++ keyword //​static//​). If the function is being run for a particular collection then //​collection//​ should be set, otherwise it should be the empty string ""​. The code includes support for Fast-CGI. |
 +| <!-- id:1068 -->​**collectoraction.h** | <!-- id:1069 -->​Inherited action that facilitates end-user collection-building through the Collector. The page generated comes from //​collect.dm//​ and is controlled by the CGI argument //p=page//. |
 +| <!-- id:1070 -->​**comtypes.h** | <!-- id:1071 -->Core types for the protocol. |
 +| <!-- id:1072 -->​**converter.h** | <!-- id:1073 -->​Object support for stream converters. |
 +| <!-- id:1074 -->​**datelistbrowserclass.h** | <!-- id:1075 -->​Inherited from //​browserclass//,​ this object provides browsing support for chronological lists such as that seen in the Greenstone Archives collection under “dates” in the navigation bar. |
 +| <!-- id:1076 -->​**documentaction.h** | <!-- id:1077 -->​Inherited action used to retrieve a document or part of a classification hierarchy. |
 +| <!-- id:1078 -->​**extlinkaction.h** | <!-- id:1079 -->​Inherited action that controls whether or not a user goes straight to an external link or passes through a warning page alerting the user to the fact that they are about to move outside the digital library system. |
 +| <!-- id:1080 -->**formattools.h** | <!-- id:1081 -->Function support for parsing and evaluating collection configuration //format// statements. Described in more detail in Section [[#formatting|formatting]] above. |
 +| <!-- id:1082 -->​**historydb.h** | <!-- id:1083 -->Data structures and function support for managing a database of previous queries so a user can start a new query that includes previous query terms. |
 +| <!-- id:1084 -->​**hlistbrowserclass.h** | <!-- id:1085 -->​Inherited from //​browserclass//,​ this object provides browsing support for horizontal lists. |
 +| <!-- id:1086 -->​**htmlbrowserclass.h** | <!-- id:1087 -->​Inherited from //​browserclass//,​ this object provides browsing support for html pages. |
 +| <!-- id:1088 -->​**htmlgen.h** | <!-- id:1089 -->​Function support to highlight query terms in a //text_t// string. |
 +| <!-- id:1090 -->​**htmlutils.h** | <!-- id:1091 -->​Function support that converts a //text_t// string into the equivalent html. The symbols ", //&//, //<//, and //>// are converted into //&​quot;//,​ //&​amp;//,​ //&​lt;//​ and //&​gt;//​ respectively. |
 +| <!-- id:1092 -->​**infodbclass.h** | <!-- id:1093 -->​Defines two classes: //​gdbmclass//​ and //​infodbclass//​. The former provides the Greenstone API to gdbm; the latter is the object class used to store a record entry read in from a gdbm database, and is essentially an associative array of integer-indexed arrays of //text_t// strings. |
 +| <!-- id:1094 -->​**invbrowserclass.h** | <!-- id:1095 -->​Inherited from //​browserclass//,​ this object provides browsing support for lists that are not intended for display (invisible). |
 +| <!-- id:1096 -->​**nullproto.h** | <!-- id:1097 -->​Inherited from //​recptproto//,​ this class realises the null protocol, implemented through function calls from the receptionist to the collection server. |
 +| <!-- id:1098 -->​**pageaction.h** | <!-- id:1099 -->​Inherited action that, in conjunction with the macro file named in //p=page//, generates a web page. |
 +| <!-- id:1100 -->​**pagedbrowserclass.h** | <!-- id:1101 -->​Inherited from //​browserclass//,​ this object provides browsing support for the “go to page” mechanism seen (for example) in the Gutenberg collection. |
 +| <!-- id:1102 -->​**pingaction.h** | <!-- id:1103 -->​Inherited action that checks to see whether a particular collection is responding. |
 +| <!-- id:1104 -->​**queryaction.h** | <!-- id:1105 -->​Inherited action that takes the stipulated query, settings and preferences and performs a search, generating as a result the subset of //o=num// matching documents starting at position //r=num//. |
 +| <!-- id:1106 -->​**querytools.h** | <!-- id:1107 -->​Function support for querying. |
 +| <!-- id:1108 -->​**receptionist.h** | <!-- id:1109 -->​Top-level object for the receptionist. Maintains a record of CGI argument information,​ instantiations of each inherited action, instantiations of each inherited browser, the core macro language object //​displayclass//,​ and all possible converters. |
 +| <!-- id:1110 -->​**recptconfig.h** | <!-- id:1111 -->​Function support for reading the site and main configuration files. |
 +| <!-- id:1112 -->​**recptproto.h** | <!-- id:1113 -->Base class for the protocol. |
 +| <!-- id:1114 -->​**statusaction.h** | <!-- id:1115 -->​Inherited action that generates, in conjunction with //​status.dm//,​ the various administration pages. |
 +| <!-- id:1116 -->​**tipaction.h** | <!-- id:1117 -->​Inherited action that produces, in conjunction with //tip.dm//, a web page containing a tip taken at random from a list of tips stored in //tip.dm//. |
 +| <!-- id:1118 -->​**userdb.h** | <!-- id:1119 -->Data structure and function support for maintaining a gdbm database of users: their password, groups, and so on. |
 +| <!-- id:1120 -->​**usersaction.h** | <!-- id:1121 -->An administrator action inherited from the base class that supports adding and deleting users, as well as modifying the groups they are in. |
 +| <!-- id:1122 -->​**vlistbrowserclass.h** | <!-- id:1123 -->​Inherited from //​browserclass//,​ this object provides browsing support for vertical lists, the mainstay of classifiers. For example, the children of the node for titles beginning with the letter //N// are stipulated to be a //VList//. |
 +| <!-- id:1124 -->**z3950cfg.h** | <!-- id:1125 -->Data structure support for the Z39.50 protocol. Used by //z3950proto.cpp//, which defines the main protocol class (inherited from the base class //recptproto//), and by the configuration file parser //zparse.y// (written using Yacc). |
 +| <!-- id:1126 -->​**z3950proto.h** | <!-- id:1127 -->​Inherited from //​recptproto//,​ this class realises the Z39.50 protocol so that a Greenstone receptionist can access remote library sites running Z39.50 servers. |
 +| <!-- id:1128 -->​**z3950server.h** | <!-- id:1129 -->​Further support for the Z39.50 protocol. |
 +
 +===== <!-- id:1130 -->​Initialisation =====
 +
 +<!-- id:1131 -->​Initialisation in Greenstone is an intricate operation that processes configuration files and assigns default values to data fields. In addition to inheritance and constructor functions, core objects define //init()// and //​configure()//​ functions to help standardise the task. Even so, the order of execution can be difficult to follow. This section describes what happens.
 +
 +<!-- id:1132 -->​Greenstone uses several configuration files for different purposes, but all follow the same syntax. Unless a line starts with the hash symbol (#) or consists entirely of white space, the first word defines a keyword, and the remaining words represent a particular setting for that keyword.
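 +
 +The line syntax described above can be sketched in a few lines of C++. This is a minimal illustration rather than Greenstone's own parser: the names //ConfigLine// and //parse_config_line// are hypothetical, words are split on white space only, and tolerating white space before the hash symbol is an assumption.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical holder for one parsed configuration line.
struct ConfigLine {
    std::string keyword;               // first word on the line
    std::vector<std::string> values;   // the remaining words
};

// Splits one line into a keyword and its setting words.
// Returns false for comment lines and for lines that are entirely white space.
bool parse_config_line(const std::string& line, ConfigLine& out) {
    out.keyword.clear();
    out.values.clear();

    // Assumption: white space before the '#' is tolerated.
    const std::string::size_type first = line.find_first_not_of(" \t\r\n");
    if (first == std::string::npos || line[first] == '#') {
        return false;
    }

    std::istringstream words(line);
    words >> out.keyword;              // the first word is the keyword
    std::string word;
    while (words >> word) {            // everything after it is the setting
        out.values.push_back(word);
    }
    return true;
}
```

 +A caller would read a configuration file line by line, skip every line for which the function returns false, and hand the keyword and values of the others to //configure()//.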
 +
 +<!-- id:1133 -->The lines from configuration files are passed, one at a time, to //configure()// as two arguments: the keyword and an array of the remaining words. Based on the keyword, a particular version of //configure()// decides whether the information is of interest, and if so stores it. For example, //collectserver// (which maps to the Collection object in Figure <imgref figure_greenstone_runtime_system>) processes the format statements in a collection's configuration file. When the keyword //format// is passed to //configure()//, an //if// statement matches it and stores a copy of the function's second argument in the object.
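 +
 +The keyword test can be pictured as follows. This is a hedged sketch, not the real //collectserver// code: the class name //CollectionSketch//, its member names and the argument types are all assumptions made for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a collection object; not the real collectserver class.
class CollectionSketch {
public:
    // Called once per configuration line: the keyword plus the remaining words.
    void configure(const std::string& key, const std::vector<std::string>& values) {
        if (key == "format") {
            format_values_ = values;   // keep a copy of the second argument
        }
        // Any other keyword is of no interest to this sketch and is ignored.
    }

    const std::vector<std::string>& format() const { return format_values_; }

private:
    std::vector<std::string> format_values_;
};
```

 +Calling //configure()// with any keyword other than //format// leaves the stored value untouched, which is how each object picks out only the settings that concern it.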
 +
 +<!-- id:1134 -->After processing the keyword and before the function terminates, some versions of //​configure()//​ pass the data to //​configure()//​ functions in other objects. The Receptionist object calls //​configure()//​ for Actions, Protocols, and Browsers. The NullProtocol object calls //​configure()//​ for each Collection object; Collection calls Filters and Sources.
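 +
 +The forwarding pattern described above might look like the sketch below. The types //Configurable//, //Leaf// and //Forwarder// are hypothetical and much simpler than the Greenstone classes; in particular, this sketch forwards every keyword unconditionally, whereas a real object may decline to pass on keywords it has no interest in.

```cpp
#include <string>
#include <vector>

typedef std::vector<std::string> Words;

// Hypothetical common interface for anything that accepts configuration.
struct Configurable {
    virtual ~Configurable() {}
    virtual void configure(const std::string& key, const Words& values) = 0;
};

// Stands in for an Action, Browser, Filter or Source: records the keywords it sees.
struct Leaf : Configurable {
    std::vector<std::string> seen;
    void configure(const std::string& key, const Words&) override {
        seen.push_back(key);
    }
};

// Stands in for a Receptionist, NullProtocol or Collection:
// processes the keyword itself, then passes it on to the objects it holds.
struct Forwarder : Configurable {
    std::vector<Configurable*> children;   // not owned by this object
    std::vector<std::string> seen;
    void configure(const std::string& key, const Words& values) override {
        seen.push_back(key);               // own processing comes first
        for (Configurable* child : children) {
            child->configure(key, values); // then forward to each child
        }
    }
};
```

 +Chaining three //Forwarder// objects (receptionist, protocol, collection) with //Leaf// objects at the ends reproduces the cascade: a single call on the outermost object reaches every filter and source.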
 +
 +<!-- id:1135 -->In C++, data fields are normally initialised by the object's constructor function. However, in Greenstone some initialisation depends on values read from configuration files, so a second round of initialisation is needed. This is the purpose of the //init()// member functions, and in some cases it leads to further calls to //configure()//.
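 +
 +A minimal sketch of this two-round initialisation follows. The class //ReceptionistSketch// is hypothetical, and deriving //collectdir// from //gsdlhome// inside //init()// is an assumption chosen only to show //init()// triggering a further //configure()// call.

```cpp
#include <string>
#include <vector>

// Hypothetical class illustrating constructor defaults followed by init().
class ReceptionistSketch {
public:
    // Round one: the constructor can only assign fixed defaults.
    ReceptionistSketch() : gsdlhome_(), collectdir_(), initialised_(false) {}

    void configure(const std::string& key, const std::vector<std::string>& values) {
        if (values.empty()) return;
        if (key == "gsdlhome") {
            gsdlhome_ = values[0];
        } else if (key == "collectdir") {
            collectdir_ = values[0];
        }
    }

    // Round two: runs after the configuration files have been read.
    bool init() {
        if (gsdlhome_.empty()) {
            return false;   // required setting was never configured
        }
        // Assumption for illustration: collectdir defaults to gsdlhome.
        configure("collectdir", std::vector<std::string>(1, gsdlhome_));
        initialised_ = true;
        return true;
    }

    const std::string& collectdir() const { return collectdir_; }
    bool initialised() const { return initialised_; }

private:
    std::string gsdlhome_;
    std::string collectdir_;
    bool initialised_;
};
```

 +The constructor cannot know //gsdlhome//, so anything that depends on it has to wait until //init()// is called.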
 +
 +<​imgcaption figure_initialising_greenstone_using_the_null_protocol|%!-- id:1136 --%Initialising Greenstone using the null protocol ></​imgcaption>​
 +<​code>​
 +============
 +Main program
 +============
 +Statically construct Receptionist
 +Statically construct NullProtocol
 +Establish the value for ’gsdlhome’ by reading gsdlsite.cfg
 +Foreach directory in GSDLHOME/​collect that isn’t "​modelcol":​
 +  Add directory name (now treated as collection name) to NullProtocol:​
 +    Dynamically construct Collection
 +    Dynamically construct Gdbm class
 +    Dynamically construct the Null Filter
 +    Dynamically construct the Browse Filter
 +    Dynamically construct MgSearch
 +    Dynamically construct the QueryFilter
 +    Dynamically construct the MgGdbmSource
 +    Configure Collection with ’collection’
 +      Passing ’collection’ value on to Filters and Sources:
 +Configure Receptionist with ’collectinfo’:​
 +      Passing ’collectinfo’ value on to Actions, Protocols, and Browsers:
 +Add NullProtocol to Receptionist
 +Add in UTF-8 converter
 +Add in GB converter
 +Add in Arabic converter
 +Foreach Action:
 +  Statically construct Action
 +  Add Action to Receptionist
 +Foreach Browsers:
 +  Statically construct Browser
 +  Add Browser to Receptionist
 +Call function cgiwrapper:
 +  =================
 +  Configure objects
 +  =================
 +  Configure Receptionist with ’collection’
 +    Passing ’collection’ value on to Actions, Protocols, and Browsers:
 +    NullProtocol not interested in ’collection’
 +  Configure Receptionist with ’httpimg’
 +    Passing ’httpimg’ value on to Actions, Protocols, and Browsers:
 +    NullProtocol passing ’httpimg’ on to Collection
 +    Passing ’httpimg’ value on to Filters and Sources:
 +  Configure Receptionist with ’gwcgi’
 +    Passing ’gwcgi’ value on to Actions, Protocols, and Browsers:
 +    NullProtocol passing ’gwcgi’ on to Collection
 +    Passing ’gwcgi’ value on to Filters and Sources:
 +  Reading in site configuration file gsdlsite.cfg
 +    Configure Receptionist with ’gsdlhome’
 +      Passing ’gsdlhome’ value on to Actions, Protocols, and Browsers:
 +      NullProtocol passing ’gsdlhome’ on to Collection
 +        Passing ’gsdlhome’ value on to Filters and Sources:
 +    Configure Receptionist with ...
 +    ... and so on for all entries in gsdlsite.cfg
 +  Reading in main configuration file main.cfg
 +    Configure Receptionist with ...
 +    ... and so on for all entries in main.cfg
 +  ====================
 +  Initialising objects
 +  ====================
 +  Initialise the Receptionist
 +    Configure Receptionist with ’collectdir’
 +      Passing ’collectdir’ value on to Actions, Protocols, and Browsers:
 +      NullProtocol not interested in ’collectdir’
 +    Read in Macro files
 +    Foreach Actions
 +      Initialise Action
 +    Foreach Protocol
 +      Initialise Protocol
 +      When Protocol==NullProtocol:​
 +        Foreach Collection
 +          Reading Collection’s build.cfg
 +          Reading Collection’s collect.cfg
 +            Configure Collection with ’creator’
 +              Passing ’creator’ value on to Filters and Sources:
 +            Configure Collection with ’maintainer’
 +              Passing ’maintainer’ value on to Filters and Sources:
 +            ... and so on for all entries in collect.cfg
 +    Foreach Browsers
 +      Initialise Browser
 +  =============
 +  Generate page
 +  =============
 +  Parse CGI arguments
 +  Execute designated Action to produce page
 +End.
 +</​code>​
 +
 +
 +
 +<!-- id:1137 -->Figure <imgref figure_initialising_greenstone_using_the_null_protocol> shows diagnostic statements generated from a version of Greenstone augmented to highlight the initialisation process. The program starts in the //main()// function in //recpt/librarymain.cpp//. It constructs a Receptionist object and a NullProtocol object, then scans //gsdlsite.cfg// (located in the same directory as the //library// executable) for //gsdlhome// and stores its value in a variable. For each online collection, as established by reading the directories present in //GSDLHOME/collect//, it uses the NullProtocol object to construct a Collection object that contains Filters, Search and Source objects, and makes a few hardwired calls to //configure()//.
 +
 +<!-- id:1138 -->Next //main()// adds the NullProtocol object to the Receptionist,​ which keeps a base class array of protocols in a protected data field, and then sets up several converters. //main()// constructs all Actions and Browsers used in the executable and adds them to the Receptionist. The function concludes by calling //​cgiwrapper()//​ in //​cgiwrapper.cpp//,​ which itself includes substantial object initialisation.
 +
 +<!-- id:1139 -->There are three sections to //​cgiwrapper()//:​ configuration,​ initialisation and page generation. First some hardwired calls to //​configure()//​ are made. Then //​gsdlsite.cfg//​ is read and //​configure()//​ is called for each line. The same is done for //​etc/​main.cfg//​.
 +
 +<!-- id:1140 -->The second phase of //cgiwrapper()// makes calls to //init()//. Only one call is made to the Receptionist's //init()// function, but that call in turn invokes the //init()// functions of the various objects stored within it. First a hardwired call to //configure()// is made to set //collectdir//, then the macro files are read. For each action, its //init()// function is called. The same occurs for each protocol stored in the Receptionist, although in the system being described only one protocol is stored, the NullProtocol. Calling //init()// for this object causes further configuration: for each collection in the NullProtocol, its collection-specific //build.cfg// and //collect.cfg// are read and processed, with a call to //configure()// for each line.
 +
 +<!-- id:1141 -->The final phase of //​cgiwrapper()//​ is to parse the CGI arguments, and then call the appropriate action. Both these calls are made with the support of the Receptionist object.
 +
 +<!-- id:1142 -->The configuration, initialisation, and page generation code is kept separate because Greenstone is optimised to run as a server (using FastCGI, the CORBA protocol, or the Windows Local Library). In this mode of operation, the configuration and initialisation code is executed once, then the program remains in memory and generates many web pages in response to requests from clients, without requiring re-initialisation.
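 +
 +The benefit of this separation can be seen in a small sketch of a persistent process. Everything here is hypothetical (//ServerSketch//, //serve()// and the counters are not Greenstone names); the point is only that the first two phases run once while page generation runs once per request.

```cpp
#include <string>
#include <vector>

// Hypothetical server object that counts how often each phase runs.
struct ServerSketch {
    int configure_runs = 0;
    int init_runs = 0;
    int pages_generated = 0;

    void configure_all() { ++configure_runs; }   // read configuration files
    void init_all() { ++init_runs; }             // second round of initialisation

    // Page generation: parse the CGI arguments and produce a page.
    std::string generate_page(const std::string& cgi_args) {
        ++pages_generated;
        return "page for " + cgi_args;
    }
};

// Configure and initialise once, then answer every request from memory.
std::vector<std::string> serve(ServerSketch& server,
                               const std::vector<std::string>& requests) {
    server.configure_all();
    server.init_all();

    std::vector<std::string> pages;
    for (const std::string& request : requests) {
        pages.push_back(server.generate_page(request));
    }
    return pages;
}
```

 +When the program runs as a plain CGI executable the request list effectively has one entry, so the cost of the first two phases is paid for every page; as a persistent server it is paid once.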
 +