The Greenstone runtime system
This chapter describes the Greenstone runtime system so that you can understand, augment and extend its capabilities. The software is written in C++ and makes extensive use of virtual inheritance. If you are unfamiliar with this language you should learn about it before proceeding. Deitel and Deitel (1994) provide a comprehensive tutorial, while Stroustrup (1997) is the definitive reference.
We begin by explaining the design philosophy behind the runtime system since this has a strong bearing on implementation. Then we provide the implementation details, which form the main part of this chapter. The version of Greenstone described here is the CGI version (called the Web Library for Windows users). The Windows Local Library uses the same source code but has a built-in web server front end. Also, the Local Library is a persistent process.
Process structure
<imgcaption figure_overview_of_a_general_greenstone_system|%!– id:760 –%Overview of a general Greenstone system ></imgcaption>
Figure <imgref figure_overview_of_a_general_greenstone_system> shows several users, represented by computer terminals at the top of the diagram, accessing three Greenstone collections. Before going online, these collections undergo the importing and building processes described in earlier chapters. First, documents, shown at the bottom of the figure, are imported into the XML-compliant Greenstone Archive Format. Then the archive files are built into various searchable indexes and a collection information database that includes the hierarchical structures that support browsing. This done, the collection is ready to go online and respond to requests for information.
Two components are central to the design of the runtime system: “receptionists” and “collection servers.” From a user's point of view, a receptionist is the point of contact with the digital library. It accepts user input, typically in the form of keyboard entry and mouse clicks; analyzes it; and then dispatches a request to the appropriate collection server (or servers). This locates the requested piece of information and returns it to the receptionist for presentation to the user. Collection servers act as an abstract mechanism that handles the content of the collection, while receptionists are responsible for the user interface.
<imgcaption figure_greenstone_system_using_the_null_protocol|%!– id:763 –%Greenstone system using the “null protocol” ></imgcaption>
As Figure <imgref figure_overview_of_a_general_greenstone_system> shows, receptionists communicate with collection servers using a defined protocol. The implementation of this protocol depends on the computer configuration on which the digital library system is running. The most common case, and the simplest, is when there is one receptionist and one collection server, and both run on the same computer. This is what you get when you install the default Greenstone. In this case the two processes are combined to form a single executable (called library), and consequently using the protocol reduces to making function calls. We call this the null protocol. It forms the basis for the standard out-of-the-box Greenstone digital library system. This simplified configuration is illustrated in Figure <imgref figure_greenstone_system_using_the_null_protocol>, with the receptionist, protocol and collection server bound together as one entity, the library program. The aim of this chapter is to show how it works.
Usually, a “server” is a persistent process that, once started, runs indefinitely, responding to any requests that come in. Despite its name, however, the collection server in the null protocol configuration is not a server in this sense. In fact, every time any Greenstone web page is requested, the library program is started up (by the CGI mechanism), responds to the request, and then exits. We call it a “server” because it is also designed to work in the more general configuration of Figure <imgref figure_overview_of_a_general_greenstone_system>.
Surprisingly, this start-up, process and exit cycle is not as slow as one might expect, and results in a perfectly usable service. However, it is clearly inefficient. There is a mechanism called Fast-CGI ( www.fastcgi.com ) which provides a middle ground. Using it, the library program can remain in memory at the end of the first execution, and have subsequent sets of CGI arguments fed to it, thus avoiding repeated initialisation overheads and accomplishing much the same behaviour as a server. Using Fast-CGI is an option in Greenstone, and is enabled by recompiling the source code with appropriate libraries.
As an alternative to the null protocol, the Greenstone protocol has also been implemented using the well-known CORBA scheme (Slama et al., 1999). This uses a unified object-oriented paradigm to enable different processes, running on different computer platforms and implemented in different programming languages, to access the same set of distributed objects over the Internet (or any other network). Then, scenarios like Figure <imgref figure_overview_of_a_general_greenstone_system> can be fully implemented, with all the receptionists and collection servers running on different computers.
<imgcaption figure_graphical_query_interface_to_greenstone|%!– id:768 –%Graphical query interface to Greenstone ></imgcaption>
This allows far more sophisticated interfaces to be set up to exactly the same digital library collections. As just one example, Figure <imgref figure_graphical_query_interface_to_greenstone> shows a graphical query interface, based on Venn diagrams, that lets users manipulate Boolean queries directly. Written in Java, the interface runs locally on the user's own computer. Using CORBA, it accesses a remote Greenstone collection server, written in C++.
The distributed protocol is still being refined and readied for use, and so this manual does not discuss it further (see Bainbridge et al., 2001, for more information).
Conceptual framework
<imgcaption figure_generating_the_about_this_collection_page|%!– id:772 –%Generating the “about this collection” page ></imgcaption>
Figure <imgref figure_generating_the_about_this_collection_page> shows the “about this collection” page of a particular Greenstone collection (the Project Gutenberg collection). Look at the URL at the top. The page is generated as a result of running the CGI program library, which is the above-mentioned executable comprising both receptionist and collection server connected by the null protocol. The arguments to library are c=gberg, a=p, and p=about. They can be interpreted as follows:
For the Project Gutenberg collection (c=gberg), the action is to generate a page (a=p), and the page to generate is called “about” (p=about).
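The interpretation of the CGI arguments can be illustrated with a minimal sketch. It is not the Greenstone implementation: the helper names parse_cgi_args() and action_for() are hypothetical, URL decoding is omitted, and only the three argument values mentioned in this chapter (p, q and d) are mapped to action names.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: split a query string such as "c=gberg&a=p&p=about"
// into name/value pairs. URL decoding is deliberately left out.
std::map<std::string, std::string> parse_cgi_args(const std::string &query) {
  std::map<std::string, std::string> args;
  std::string::size_type start = 0;
  while (start <= query.size()) {
    std::string::size_type end = query.find('&', start);
    if (end == std::string::npos) end = query.size();
    const std::string pair = query.substr(start, end - start);
    const std::string::size_type eq = pair.find('=');
    // Keep only well-formed "name=value" pairs with a non-empty name.
    if (eq != std::string::npos && eq > 0) {
      args[pair.substr(0, eq)] = pair.substr(eq + 1);
    }
    start = end + 1;
  }
  return args;
}

// Hypothetical helper: map the "a" argument to the action names used in
// this chapter. Anything else is reported as "unknown".
std::string action_for(const std::map<std::string, std::string> &args) {
  const auto it = args.find("a");
  if (it == args.end()) return "unknown";
  if (it->second == "p") return "pageaction";
  if (it->second == "q") return "queryaction";
  if (it->second == "d") return "documentaction";
  return "unknown";
}
```

Given the arguments c=gberg, a=p and p=about, parse_cgi_args() yields three entries and action_for() selects pageaction; the remaining arguments are then available to that action for further processing.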
<imgcaption figure_greenstone_runtime_system|%!– id:775 –%Greenstone runtime system ></imgcaption>
Figure <imgref figure_greenstone_runtime_system> illustrates the main parts of the Greenstone runtime system. At the top, the receptionist first initialises its components, then parses the CGI arguments to decide which action to call. In performing the action (which includes further processing of the CGI arguments), the software uses the protocol to access the content of the collection. The response is used to generate a web page, with assistance from the format component and the macro language.
The macro language, which we met in Section controlling_the_greenstone_user_interface, is used to provide a Greenstone digital library system with a consistent style, and to create interfaces in different languages. Interacting with the library generates the bare bones of web pages; the macros in GSDLHOME/macros wrap them in flesh.
The Macro Language object in Figure <imgref figure_greenstone_runtime_system> is responsible for reading these files and storing the parsed result in memory. Any action can use this object to expand a macro. It can even create new macro definitions and override existing ones, adding a dynamic dimension to macro use.
The layout of the “about this collection” page (Figure <imgref figure_generating_the_about_this_collection_page>) is known before runtime, and encoded in the macro file about.dm. Headers, footers, and the background image are not even mentioned because they are located in the Global macro package. However, the specific “about” text for a particular collection is not known in advance, but is stored in the collection information database during the building process. This information is retrieved using the protocol, and stored as _collectionextra_ in the Global macro package. Note this macro name is essentially the same name used to express this information in the collection configuration file described in Section configuration_file. To generate the content of the page, the _content_ macro in the about package (shown in Figure <imgref figure_part_of_the_aboutdm_macro_file>) is expanded. This in turn expands _textabout_, which itself accesses _collectionextra_, which had just been dynamically placed there.
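The chain of expansions described above (_content_ expands _textabout_, which in turn accesses _collectionextra_) can be sketched as a recursive substitution over a table of macro definitions. This is only an illustration under simplifying assumptions: the real macro language also has packages and parameters, and the names macro_table and expand() are invented for the example.

```cpp
#include <cassert>
#include <map>
#include <string>

typedef std::map<std::string, std::string> macro_table;

// Expand every _name_ whose name is defined in the table, recursively.
// Text between underscores that is not a known macro is copied unchanged.
std::string expand(const std::string &text, const macro_table &macros,
                   int depth = 0) {
  if (depth > 16) return text;  // guard against self-referential macros
  std::string out;
  std::string::size_type i = 0;
  while (i < text.size()) {
    if (text[i] == '_') {
      const std::string::size_type close = text.find('_', i + 1);
      if (close != std::string::npos) {
        const macro_table::const_iterator it =
            macros.find(text.substr(i + 1, close - i - 1));
        if (it != macros.end()) {
          out += expand(it->second, macros, depth + 1);
          i = close + 1;
          continue;
        }
      }
    }
    out += text[i];
    ++i;
  }
  return out;
}
```

In this sketch, defining collectionextra at runtime (as the receptionist does after querying the collection server) is just an assignment into the table before expand("_content_", macros) is called, which mirrors the dynamic dimension of macro use described earlier.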
One further important ingredient is the Format object. Format statements in the collection configuration file affect the presentation of particular pieces of information, as described in Section formatting_greenstone_output. They are handled by the Format object in Figure <imgref figure_greenstone_runtime_system>. This object's main task is to parse and evaluate statements such as the format strings in Figure <imgref figure_excerpt_from_the_demo_collection_collect>. As we learned in Section formatting_greenstone_output, these can include references to metadata in square brackets (e.g. [Title]), which need to be retrieved from the collection server. Interaction occurs between the Format object and the Macro Language object, because format statements can include macros that, when expanded, include metadata, which when expanded include macros, and so on.
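As a rough illustration of the metadata substitution step only, the following sketch replaces each [Name] in a format string with a value from a metadata table. It assumes the metadata has already been fetched from the collection server; the function name apply_format() is hypothetical, and the other features of real format statements are not modelled.

```cpp
#include <cassert>
#include <map>
#include <string>

// Replace each [Name] with the matching metadata value. A name with no
// metadata is replaced by nothing, so the brackets simply disappear.
std::string apply_format(const std::string &format,
                         const std::map<std::string, std::string> &metadata) {
  std::string out;
  std::string::size_type i = 0;
  while (i < format.size()) {
    if (format[i] == '[') {
      const std::string::size_type close = format.find(']', i + 1);
      if (close != std::string::npos) {
        const auto it = metadata.find(format.substr(i + 1, close - i - 1));
        if (it != metadata.end()) out += it->second;
        i = close + 1;
        continue;
      }
    }
    out += format[i];
    ++i;
  }
  return out;
}
```

Because the substituted values may themselves contain macros, a complete implementation would alternate this step with macro expansion until nothing is left to expand, which is the interaction between the Format object and the Macro Language object noted above.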
At the bottom of Figure <imgref figure_greenstone_runtime_system>, the collection server also goes through an initialisation process, setting up Filter and Source objects to respond to incoming protocol requests, and a Search object to assist in this task. Ultimately these access the indexes and the collection information database, both formed during collection building.
Ignoring blank lines, the receptionist contains 15,000 lines of code. The collection server contains only 5,000 lines (75% of which are taken up by header files). The collection server is more compact because content retrieval is accomplished through two pre-compiled programs: mg, a full-text retrieval system, is used for searching, and gdbm, a database management system, is used to hold the collection information database.
To encourage extensibility and flexibility, Greenstone uses inheritance widely—in particular, within Action, Filter, Source, and Search. For a simple digital library dedicated to text-based collections, this means that you need to learn slightly more to program the system. However, it also means that mg and gdbm could easily be replaced should the need arise. Furthermore, the software architecture is rich enough to support full multimedia capabilities, such as controlling the interface through speech input, or submitting queries as graphically drawn pictures.
How the conceptual framework fits together
Sections collection_server and receptionist explain the operation of the collection server and receptionist in more detail, expanding on each module in Figure <imgref figure_greenstone_runtime_system> and describing how it is implemented. It is helpful to first work through examples of a user interacting with Greenstone, and describe what goes on behind the scenes. For the moment, we assume that all objects are correctly initialised. Initialisation is a rather intricate procedure that we revisit in Section initialisation.
Performing a search
<imgcaption figure_searching_gutenberg_for_darcy|%!– id:787 –%Searching Gutenberg for Darcy ></imgcaption>
When a user enters a query by pressing Begin search on the search page, a new Greenstone action is invoked, which ends up by generating a new HTML page using the macro language. Figure <imgref figure_searching_gutenberg_for_darcy> shows the result of searching the Project Gutenberg collection for the name Darcy. Hidden within the HTML of the original search page is the statement a=q. When the search button is pressed this statement is activated, and sets the new action to be queryaction. Executing queryaction sets up a call to the designated collection's Filter object (c=gberg) through the protocol.
Filters are an important basic function of collection servers. Tailored for both searching and browsing activities, they provide a way of selecting a subset of information from a collection. In this case, the queryaction sets up a filter request by:
- setting the filter request type to be QueryFilter (Section collection_server describes the different filter types);
- storing the user's search preferences—case-folding, stemming and so on—in the filter request;
- calling the filter() function using the null protocol.
Calls to the protocol are synchronous. The receptionist is effectively blocked until the filter request has been processed by the collection server and any data generated has been returned.
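The steps above can be sketched as follows. Everything here is a simplified stand-in: the types filter_request, filter_response, toy_collection_server and null_protocol, and the option names "Term" and "Casefold", are assumptions made for illustration and are not the Greenstone types. The point of the sketch is that, under the null protocol, the protocol call is an ordinary function call that returns only when the collection server has finished.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>
#include <vector>

struct filter_request {                        // hypothetical request type
  std::string filter_name;                     // e.g. "QueryFilter"
  std::map<std::string, std::string> options;  // search preferences
};

struct filter_response {                       // hypothetical response type
  std::vector<std::string> doc_ids;
};

static std::string fold(std::string s) {       // lower-case a string
  for (char &c : s) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  return s;
}

static std::string option(const filter_request &request, const std::string &name) {
  const auto it = request.options.find(name);
  return it == request.options.end() ? std::string() : it->second;
}

class toy_collection_server {                  // stands in for a collection server
public:
  void add_document(const std::string &id, const std::string &text) {
    documents_[id] = text;
  }
  filter_response filter(const filter_request &request) const {
    filter_response response;
    if (request.filter_name != "QueryFilter") return response;
    const bool casefold = option(request, "Casefold") == "true";
    const std::string term =
        casefold ? fold(option(request, "Term")) : option(request, "Term");
    if (term.empty()) return response;
    for (const auto &doc : documents_) {
      const std::string text = casefold ? fold(doc.second) : doc.second;
      if (text.find(term) != std::string::npos) response.doc_ids.push_back(doc.first);
    }
    return response;
  }
private:
  std::map<std::string, std::string> documents_;
};

class null_protocol {                          // the protocol as a plain function call
public:
  explicit null_protocol(const toy_collection_server &server) : server_(server) {}
  filter_response filter(const filter_request &request) const {
    return server_.filter(request);            // blocks until the server returns
  }
private:
  const toy_collection_server &server_;
};
```

A receptionist-side caller fills in the request, calls protocol.filter(request), and continues only once the response is in hand. Replacing null_protocol with a networked implementation would change where the work happens but not the calling code.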
When a protocol call of type QueryFilter is made, the Filter object (in Figure <imgref figure_greenstone_runtime_system>) decodes the options and makes a call to the Search object, which uses mg to do the actual search. The role of the Search object is to provide an abstract program interface that supports searching, regardless of the underlying search tool being used. The format used for returning results also enforces abstraction, requiring the Search object to translate the data generated by the search tool into a standard form.
Once the search results have been returned to the receptionist, the action proceeds by formatting the results for display, using the Format object and the Macro Language. As Figure <imgref figure_searching_gutenberg_for_darcy> shows, this involves generating the standard Greenstone header, footer, navigation bar and background; repeating the main part of the query page just beneath the navigation bar; and displaying a book icon, title and author for each matching entry. The format of this last part is governed by the format SearchVList statement in the collection configuration file. Before title and author metadata can be displayed, they must be retrieved from the collection server. This requires further calls to the protocol, this time using BrowseFilter.
Retrieving a document
Following the above query for Darcy, consider what happens when a document is displayed. Figure <imgref figure_the_golf_course_mystery> shows the result of clicking on the icon beside The Golf Course Mystery in Figure <imgref figure_searching_gutenberg_for_darcy>.
<imgcaption figure_the_golf_course_mystery|%!– id:798 –%The Golf Course Mystery ></imgcaption>
The source text for the Gutenberg collection comprises one long file per book. At build time, these files are split into separate pages every 200 lines or so, and relevant information for each page is stored in the indexes and collection information database. The top of Figure <imgref figure_the_golf_course_mystery> shows that this book contains 104 computer-generated pages, and below it is the beginning of page one: who entered it, the title, the author, and the beginnings of a table of contents (this table forms part of the Gutenberg source text, and was not generated by Greenstone). At the top left are buttons that control the document's appearance: just one page or the whole document; whether query term highlighting is on or off; and whether or not the book should be displayed in its own window, detached from the main searching and browsing activity. At the top right is a navigation aid that supports direct access to any page in the book: simply type in the page number and press the “go to page” button. Alternatively, the next and previous pages are retrieved by clicking on the arrow icons either side of the page selection widget.
The action for retrieving documents, documentaction, is specified by setting a=d and takes several additional arguments. Most important is the document to retrieve: this is specified through the d variable. In Figure <imgref figure_the_golf_course_mystery> it is set to d=HASH51e598821ed6cbbdf0942b.1 to retrieve the first page of the document with the identifier HASH51e598821ed6cbbdf0942b, known in more friendly terms as The Golf Course Mystery. There are further variables: whether query term highlighting is on or off (hl) and which page within a book is displayed (gt). These variables are used to support the activities offered by the buttons on the page in Figure <imgref figure_the_golf_course_mystery>, described above. Defaults are used if any of these variables are omitted.
The action follows a similar procedure to queryaction: appraise the CGI arguments, access the collection server using the protocol, and use the result to generate a web page. Options relating to the document are decoded from the CGI arguments and stored in the object for further work. To retrieve the document from the collection server, only the document identifier is needed to set up the protocol call to get_document(). Once the text is returned, considerable formatting must be done. To achieve this, the code for documentaction accesses the stored arguments and makes use of the Format object and the Macro Language.
Browsing a hierarchical classifier
Figure <imgref figure_browsing_titles_in_the_gutenberg_collection> shows an example of browsing, where the user has chosen Titles A-Z and accessed the hyperlink for the letter K. The action that supports this is also documentaction, given by the CGI argument a=d as before. However, whereas before a d variable was included, this time there is none. Instead, the node within the browsable classification hierarchy to display is specified in the variable cl. In our case this represents titles grouped under the letter K. This list was formed at build time and stored in the collection information database.
<imgcaption figure_browsing_titles_in_the_gutenberg_collection|%!– id:804 –%Browsing titles in the Gutenberg collection ></imgcaption>
Records that represent classifier nodes in the database use the prefix CL, followed by numbers separated by periods (.) to designate where they lie within the nested structure. Ignoring the search button (leftmost in the navigation bar), classifiers are numbered sequentially in increasing order, left to right, starting at 1. Thus the top level classifier node for titles in our example is CL1 and the page sought is generated by setting cl=CL1.11. This can be seen in the URL at the top of Figure <imgref figure_browsing_titles_in_the_gutenberg_collection>.
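To make the numbering scheme concrete, here is a small sketch that splits an identifier such as CL1.11 into its numeric components and derives the identifier of the parent node. The function names parse_classifier_id() and parent_id() are hypothetical; Greenstone's own identifier handling is not shown in this chapter.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split "CL1.11" into {1, 11}. Returns an empty vector if the identifier
// does not have the form CL<number>(.<number>)*.
std::vector<int> parse_classifier_id(const std::string &id) {
  std::vector<int> path;
  if (id.size() < 3 || id.compare(0, 2, "CL") != 0) return path;
  int value = 0;
  bool have_digit = false;
  for (std::string::size_type i = 2; i < id.size(); ++i) {
    const char c = id[i];
    if (c >= '0' && c <= '9') {
      value = value * 10 + (c - '0');
      have_digit = true;
    } else if (c == '.' && have_digit) {
      path.push_back(value);
      value = 0;
      have_digit = false;
    } else {
      return std::vector<int>();  // malformed identifier
    }
  }
  if (!have_digit) return std::vector<int>();  // trailing period
  path.push_back(value);
  return path;
}

// Strip the last period-separated component: "CL1.11" becomes "CL1".
// A top-level identifier has no parent, so the result is empty.
std::string parent_id(const std::string &id) {
  const std::string::size_type dot = id.rfind('.');
  return dot == std::string::npos ? std::string() : id.substr(0, dot);
}
```

For cl=CL1.11 the path is {1, 11}: the first classifier in the navigation bar, and its eleventh child. The same period-separated convention appears in document identifiers, so parent_id() applied to HASH51e598821ed6cbbdf0942b.1 gives the identifier of the whole book.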
To process a cl document request, the Filter object is used to retrieve the node over the protocol. Depending on the data returned, further protocol calls are made to retrieve document metadata. In this case, the titles of the books are retrieved. However, if the node were an interior one whose children are themselves nodes, the titles of the child nodes would be retrieved. From a coding point of view this amounts to the same thing, and is handled by the same mechanism.
Finally, all the retrieved information is bound together, using the macro language, to produce the web page shown in Figure <imgref figure_browsing_titles_in_the_gutenberg_collection>.
Generating the home page
<imgcaption figure_greenstone_home_page|%!– id:809 –%Greenstone home page ></imgcaption>
As a final example, we look at generating the Greenstone home page. Figure <imgref figure_greenstone_home_page> shows the home page of the default Greenstone installation after some test collections have been installed. Its URL, which you can see at the top of the screen, includes the arguments a=p and p=home. Thus, like the “about this collection” page, it is generated by a pageaction (a=p), but this time the page to produce is home (p=home). The macro language, therefore, accesses the content of home.dm. There is no need to specify a collection (with the c variable) in this case.
The purpose of the home page is to show what collections are available. Clicking on an icon takes the user to the “about this collection” page for that collection. The menu of collections is dynamically generated every time the page is loaded, based on the collections that are in the file system at that time. When a new one comes online, it automatically appears on the home page when that page is reloaded (provided the collection is stipulated to be “public”).
To do this the receptionist uses the protocol (of course). As part of appraising the CGI arguments, pageaction is programmed to detect the special case when p=home. Then, the action uses the protocol call get_collection_list() to establish the current set of online collections. For each of these it calls get_collectinfo() to obtain information about it. This information includes whether the collection is publicly available, what the URL is for the collection's icon (if any), and the collection's full name. This information is used to generate an appropriate entry for the collection on the home page.
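The loop just described can be sketched as below. The function names get_collection_list() and get_collectinfo() come from the text, but their signatures here are invented, as are the collect_info structure, the toy_protocol class and the HTML that is produced. Treat it as an outline of the control flow rather than the actual pageaction code.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct collect_info {          // hypothetical summary of one collection
  bool is_public;
  std::string icon_url;        // empty if the collection has no icon
  std::string full_name;
};

class toy_protocol {           // stands in for the protocol object
public:
  void add(const std::string &name, const collect_info &info) {
    collections_[name] = info;
  }
  std::vector<std::string> get_collection_list() const {
    std::vector<std::string> names;
    for (const auto &entry : collections_) names.push_back(entry.first);
    return names;
  }
  collect_info get_collectinfo(const std::string &name) const {
    return collections_.at(name);
  }
private:
  std::map<std::string, collect_info> collections_;
};

// Build one link per public collection, pointing at its "about" page.
std::string home_page_entries(const toy_protocol &protocol) {
  std::string html;
  for (const std::string &name : protocol.get_collection_list()) {
    const collect_info info = protocol.get_collectinfo(name);
    if (!info.is_public) continue;  // private collections are not listed
    html += "<a href=\"library?c=" + name + "&amp;a=p&amp;p=about\">";
    if (info.icon_url.empty()) {
      html += info.full_name;
    } else {
      html += "<img src=\"" + info.icon_url + "\" alt=\"" + info.full_name + "\">";
    }
    html += "</a>\n";
  }
  return html;
}
```

Because the list is rebuilt on every request, a collection added to toy_protocol between two calls to home_page_entries() appears in the second result without any other change, which is the behaviour described for the real home page.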
Source code
<tblcaption table_standalone_programs_included_in_greenstone|Standalone programs included in Greenstone></tblcaption>
< - 132 397 > | |
setpasswd/ | Password support for Windows. |
getpw/ | Password support for Unix. |
txt2db/ | Convert an XML-like ASCII text format to the GNU database format. |
db2txt/ | Convert the GNU database format to an XML-like ASCII text format. |
phind/ | Hierarchical phrase browsing tool. |
hashfile/ | Compute unique document ID based on content of file. |
mgpp/ | Rewritten and updated version of Managing Gigabytes package in C++. |
w32server/ | Local library server for Windows. |
checkis/ | Specific support for installing Greenstone under Windows. |
The source code for the runtime system resides in GSDLHOME/src. It occupies two subdirectories, recpt for the receptionist's code and colservr for the collection server's. Greenstone runs on Windows systems right down to Windows 3.1, and unfortunately this imposes an eight-character limit on file and directory names. This explains why cryptic abbreviations like recpt and colservr are used. The remaining subdirectories include standalone utilities, mostly in support of the building process. They are listed in Table <tblref table_standalone_programs_included_in_greenstone>.
Another directory, GSDLHOME/lib, includes low-level objects that are used by both receptionist and collection server. This code is described in Section common_greenstone_types.
Greenstone makes extensive use of the Standard Template Library (STL), a widely-used C++ library from Silicon Graphics ( www.sgi.com ) that is the result of many years of design and development. Like all programming libraries it takes some time to learn. Appendix A gives a brief overview of key parts that are used throughout the Greenstone code. For a fuller description, consult the official STL reference manual, available online at www.sgi.com , or one of the many STL textbooks, for example Josuttis (1999).
Common Greenstone types
The objects defined in GSDLHOME/lib are low-level Greenstone objects, built on top of STL, which pervade the entire source code. First we describe text_t, an object used to represent Unicode text, in some detail. Then we summarize the purpose of each library file.
The text_t object
Greenstone works with multiple languages, both for the content of a collection and its user interface. To support this, Unicode is used throughout the source code. The underlying object that realises a Unicode string is text_t.
<imgcaption figure_the_text_t_api|%!– id:840 –%The text_t API (abridged) %!– withLineNumber –%></imgcaption>
typedef vector<unsigned short> usvector;

class text_t {
protected:
  usvector text;
  unsigned short encoding; // 0 = unicode, 1 = other

public:
  // constructors
  text_t ();
  text_t (int i);
  text_t (char *s); // assumed to be a normal c string

  void setencoding (unsigned short theencoding);
  unsigned short getencoding ();

  // STL container support
  iterator begin ();
  iterator end ();

  void erase(iterator pos);
  void push_back(unsigned short c);
  void pop_back();

  void reserve (size_type n);

  bool empty () const {return text.empty();}
  size_type size() const {return text.size();}

  // added functionality
  void clear ();
  void append (const text_t &t);

  // support for integers
  void appendint (int i);
  void setint (int i);
  int getint () const;

  // support for arrays of chars
  void appendcarr (char *s, size_type len);
  void setcarr (char *s, size_type len);
};
Greenstone stores each Unicode character in two bytes. Figure <imgref figure_the_text_t_api> shows the main features of the text_t Application Program Interface (API). It meets the two-byte requirement using the C++ built-in type short, which is a two-byte integer on common platforms (the C++ standard guarantees only that it is at least two bytes). The data type central to the text_t object is a dynamic array of unsigned shorts built using the STL declaration vector<unsigned short> and given the abbreviated name usvector.
The constructor functions (lines 10 to 12) explicitly support three forms of initialisation: construction with no parameters, which generates an empty Unicode string; construction with an integer parameter, which generates a Unicode text version of the numeric value provided; and construction with a char* parameter, which treats the argument as a null-terminated C string and generates a Unicode version of it.
Following this, most of the detail (lines 17—28) is taken up maintaining an STL vector-style container: begin(), end(), push_back(), empty() and so forth. There is also support for clearing and appending strings, as well as for converting an integer value into a Unicode text string, and returning the corresponding integer value of text that represents a number.
<imgcaption figure_overloaded_operators_to_text_t|%!– id:844 –%Overloaded operators to text_t %!– withLineNumber –%></imgcaption>
class text_t {
  // ...
public:
  text_t &operator=(const text_t &x);
  text_t &operator+= (const text_t &t);
  reference operator[](size_type n);

  text_t &operator=(int i);
  text_t &operator+= (int i);
  text_t &operator= (char *s);
  text_t &operator+= (char *s);

  friend inline bool operator!=(const text_t& x, const text_t& y);
  friend inline bool operator==(const text_t& x, const text_t& y);
  friend inline bool operator< (const text_t& x, const text_t& y);
  friend inline bool operator> (const text_t& x, const text_t& y);
  friend inline bool operator>=(const text_t& x, const text_t& y);
  friend inline bool operator<=(const text_t& x, const text_t& y);
  // ...
};
There are many overloaded operators that do not appear in Figure <imgref figure_the_text_t_api>. To give a flavour of the operations supported, these are shown in Figure <imgref figure_overloaded_operators_to_text_t>. Line 4 supports assignment of one text_t object to another, and line 5 overloads the += operator to provide a more natural way to append one text_t object to the end of another. It is also possible, through line 6, to access a particular Unicode character (represented as a short) using array subscripting [ ]. Assign and append operators are also provided for integers and C strings. Lines 12 to 18 provide Boolean operators for comparing two text_t objects: equals, does not equal, precedes alphabetically, and so on.
Member functions that take const arguments instead of non-const ones are also provided (but not shown here). Such repetition is routine in C++ objects, making the API fatter but no bigger conceptually. In reality, many of these functions are implemented as single in-line statements. For more detail, refer to the source file GSDLHOME/lib/text_t.h.
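To show how little machinery such an object needs, here is a heavily reduced stand-in called mini_text (a hypothetical name). It is not text_t: it handles only 7-bit C strings and plain integers, and it omits encodings, iterators and most operators. It does, however, follow the same design of wrapping a vector of unsigned shorts and adding conversions on top.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

class mini_text {  // illustrative stand-in, not Greenstone's text_t
public:
  mini_text() {}
  mini_text(const char *s) { append_chars(s); }  // null-terminated C string
  mini_text(int i) { append_chars(std::to_string(i).c_str()); }

  std::size_t size() const { return text_.size(); }
  bool empty() const { return text_.empty(); }
  void push_back(unsigned short c) { text_.push_back(c); }
  unsigned short &operator[](std::size_t n) { return text_[n]; }

  // Append another string; the copy keeps self-append (t += t) safe.
  mini_text &operator+=(const mini_text &t) {
    const std::vector<unsigned short> copy(t.text_);
    text_.insert(text_.end(), copy.begin(), copy.end());
    return *this;
  }

  // Integer value of the leading digits, with an optional minus sign.
  int getint() const {
    int value = 0;
    std::size_t i = 0;
    const bool negative = !text_.empty() && text_[0] == '-';
    if (negative) i = 1;
    for (; i < text_.size() && text_[i] >= '0' && text_[i] <= '9'; ++i)
      value = value * 10 + (text_[i] - '0');
    return negative ? -value : value;
  }

  friend bool operator==(const mini_text &x, const mini_text &y) {
    return x.text_ == y.text_;
  }
  friend bool operator<(const mini_text &x, const mini_text &y) {
    return x.text_ < y.text_;  // lexicographic comparison of code values
  }

private:
  void append_chars(const char *s) {
    for (; *s != '\0'; ++s) text_.push_back(static_cast<unsigned char>(*s));
  }
  std::vector<unsigned short> text_;
};
```

With these definitions, writing mini_text t("page "); followed by t += 104; relies on the integer constructor to convert 104 before appending, which is the kind of convenience the overloaded operators of text_t provide.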
The Greenstone library code
The header files in GSDLHOME/lib include a mixture of functions and objects that provide useful support for the Greenstone runtime system. Where efficiency is of concern, functions and member functions are declared inline. For the most part, implementation details are contained within a header file's .cpp counterpart.
<tblcaption table_table|##HIDDEN##></tblcaption>
< - 100 450 > | |
cfgread.h | Functions to read and write configuration files. For example, read_cfg_line() takes as arguments the input stream to use and the text_tarray (shorthand for vector<text_t>) to fill out with the data that is read. |
display.h | A sophisticated object used by the receptionist for setting, storing and expanding macros, plus supporting types. Section receptionist gives further details. |
fileutil.h | Function support for several file utilities in an operating system independent way. For example, filename_cat() takes up to six text_t arguments and returns a text_t that is the result of concatenating the items together using the appropriate directory separator for the current operating system. |
gsdlconf.h | System-specific functions that answer questions such as: does the operating system being used for compilation need to access strings.h as well as string.h? Are all the appropriate values for file locking correctly defined? |
gsdltimes.h | Function support for date and times. For example, time2text() takes computer time, expressed as the number of seconds that have elapsed since 1 January 1970, and converts it into the form YYYY/MM/DD hh:mm:ss, which it returns as type text_t. |
gsdltools.h | Miscellaneous support for the Greenstone runtime system: clarify if littleEndian or bigEndian; check whether Perl is available; execute a system command (with a few bells and whistles); and escape special macro characters in a text_t string. |
gsdlunicode.h | A series of inherited objects that support processing Unicode text_t strings through IO streams, such as Unicode to UTF-8 and vice versa; and the removal of zero-width spaces. Support for map files is also provided through the mapconvert object, with mappings loaded from GSDLHOME/mappings. |
text_t.h | Primarily the Unicode text object described above. It also provides two classes for converting streams: inconvertclass and outconvertclass. These are the base classes used in gsdlunicode.h. |
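As an illustration of the kind of helper listed in the table, the conversion described for time2text() can be approximated with the standard C library. This sketch is an assumption-laden equivalent, not the Greenstone function: it is named time_to_text(), returns a std::string rather than a text_t, and formats the time in UTC, whereas the time zone used by the real function is not stated here.

```cpp
#include <cassert>
#include <cstdio>
#include <ctime>
#include <string>

// Convert seconds since 1 January 1970 into "YYYY/MM/DD hh:mm:ss" (UTC).
// Returns an empty string if the time cannot be converted.
std::string time_to_text(std::time_t seconds) {
  const std::tm *utc = std::gmtime(&seconds);  // note: not thread-safe
  if (utc == nullptr) return std::string();
  char buffer[32];
  const std::size_t written =
      std::strftime(buffer, sizeof buffer, "%Y/%m/%d %H:%M:%S", utc);
  return std::string(buffer, written);
}
```

Calling time_to_text(0) produces the start of the epoch in the stated format, and larger values advance from there one second at a time.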
Collection server
Now we systematically explain all the objects in the conceptual framework of Figure <imgref figure_greenstone_runtime_system>. We start at the bottom of the diagram, which is also the foundation of the system, with Search, Source and Filter, and work our way up through the protocol layer and on to the central components in the receptionist: Actions, Format and Macro Language. Then we focus on object initialisation, since this is easier to understand once the role of the various objects is known.
Most of the classes central to the conceptual framework are expressed using virtual inheritance (that is, inheritance in which member functions are declared virtual) to aid extensibility. With virtual inheritance, inherited objects can be passed around as their base class, but when a member function is called it is the version defined in the inherited object that is invoked. Because the Greenstone source code uses the base class throughout, except at the point of object construction, different implementations (using, perhaps, radically different underlying technologies) can be slotted into place easily.
For example, suppose a base class called BaseCalc provides basic arithmetic: add, subtract, multiply and divide. If all its functions are declared virtual, and arguments and return types are all declared as strings, we can easily implement inherited versions of the object. One, called FixedPrecisionCalc, might use C library functions to convert between strings and integers and back again, implementing the calculations using the standard arithmetic operators: +, -, *, and /. Another, called InfinitePrecisionCalc, might access the string arguments a character at a time, implementing arithmetic operations that are in principle infinite in their precision. By writing a main program that uses BaseCalc throughout, the implementation can be switched between fixed precision and infinite precision by editing just one line: the point where the calculator object is constructed.
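The calculator example can be written out as a short sketch. The class names are those used in the paragraph above; the details are assumptions made for illustration. Only add is implemented, the infinite precision version handles non-negative decimal strings only, and the helper twice() is a hypothetical piece of client code that knows nothing beyond the base class.

```cpp
#include <cassert>
#include <string>

class BaseCalc {
public:
  virtual ~BaseCalc() {}
  virtual std::string add(const std::string &a, const std::string &b) const = 0;
};

// Converts the strings to machine integers and uses the built-in operator.
class FixedPrecisionCalc : public BaseCalc {
public:
  std::string add(const std::string &a, const std::string &b) const override {
    return std::to_string(std::stol(a) + std::stol(b));
  }
};

// Schoolbook addition, one decimal digit at a time, right to left.
// Assumes both arguments are non-negative and contain only digits.
class InfinitePrecisionCalc : public BaseCalc {
public:
  std::string add(const std::string &a, const std::string &b) const override {
    std::string result;
    int carry = 0;
    std::string::size_type i = a.size(), j = b.size();
    while (i > 0 || j > 0 || carry > 0) {
      int sum = carry;
      if (i > 0) sum += a[--i] - '0';
      if (j > 0) sum += b[--j] - '0';
      result.insert(result.begin(), static_cast<char>('0' + sum % 10));
      carry = sum / 10;
    }
    return result.empty() ? "0" : result;
  }
};

// Client code written purely against the base class.
std::string twice(const BaseCalc &calc, const std::string &x) {
  return calc.add(x, x);
}
```

Passing a FixedPrecisionCalc or an InfinitePrecisionCalc to twice() changes which add() runs without touching twice() itself. For small inputs the two agree; for numbers too large for a machine integer only the infinite precision version produces a result.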
The Search object
<imgcaption figure_search_base_class_api|%!– id:870 –%Search base class API ></imgcaption>
  class searchclass {
  public:
    searchclass ();
    virtual ~searchclass ();

    // the index directory must be set before any searching
    // is done
    virtual void setcollectdir (const text_t &thecollectdir);

    // the search results are returned in queryresults
    // search returns 'true' if it was able to do a search
    virtual bool search(const queryparamclass &queryparams,
                        queryresultsclass &queryresults)=0;

    // the document text for 'docnum' is placed in 'output'
    // docTargetDocument returns 'true' if it was able to
    // try to get a document
    // collection is needed to see if an index from the
    // collection is loaded. If no index has been loaded
    // defaultindex is needed to load one
    virtual bool docTargetDocument(const text_t &defaultindex,
                                   const text_t &defaultsubcollection,
                                   const text_t &defaultlanguage,
                                   const text_t &collection,
                                   int docnum,
                                   text_t &output)=0;

  protected:
    querycache *cache;
    text_t collectdir; // the collection directory
  };
Figure <imgref figure_search_base_class_api> shows the base class API for the Search object in Figure <imgref figure_greenstone_runtime_system>. It declares two pure virtual member functions: search() and docTargetDocument(). As signified by the =0 that follows the argument declaration, these functions have no implementation in the base class, so a class that inherits from this object must implement both (otherwise the compiler will refuse to instantiate it).
The class also includes two protected data fields: collectdir and cache. A Search object is instantiated for a particular collection, and the collectdir field is used to store where on the file system that collection (and more importantly its index files) resides. The cache field retains the result of a query. This is used to speed up subsequent queries that duplicate the query (and its settings). While identical queries may seem unlikely, in fact they occur on a regular basis. The Greenstone protocol is stateless. To generate a results page like Figure <imgref figure_searching_gutenberg_for_darcy> but for matches 11–20 of the same query, the search is transmitted again, this time specifying that documents 11–20 are returned. Caching makes this efficient, because the fact that the search has already been performed is detected and the results are lifted straight from the cache.
Both data fields are applicable to every inherited object that implements a searching mechanism. This is why they appear in the base class, and are declared within a protected section of the class so that inherited classes can access them directly.
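The effect of the cache can be illustrated with a minimal sketch. It does not reproduce Greenstone's querycache class: SlowSearch and CachedSearch are hypothetical names, the query and its settings are reduced to a single string, and results are plain document numbers.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a search back end; counts how often it runs.
struct SlowSearch {
  int runs = 0;
  std::vector<int> run(const std::string & /*query*/) {
    ++runs;
    std::vector<int> docs;
    for (int i = 1; i <= 25; ++i) docs.push_back(i);  // pretend 25 matches
    return docs;
  }
};

// Caches complete result lists, keyed by the query text and its settings.
class CachedSearch {
public:
  explicit CachedSearch(SlowSearch &backend) : backend_(backend) {}

  // Returns matches first..last (1-based, inclusive) for the query.
  std::vector<int> search(const std::string &query,
                          std::size_t first, std::size_t last) {
    std::map<std::string, std::vector<int> >::iterator hit = cache_.find(query);
    if (hit == cache_.end())  // not seen before: run the real search once
      hit = cache_.insert(std::make_pair(query, backend_.run(query))).first;
    const std::vector<int> &all = hit->second;
    if (first < 1 || first > all.size() || last < first) return {};
    last = std::min(last, all.size());
    return std::vector<int>(all.begin() + (first - 1), all.begin() + last);
  }

private:
  SlowSearch &backend_;
  std::map<std::string, std::vector<int> > cache_;
};
```

Requesting matches 1 to 10 and then 11 to 20 of the same query runs the back end only once; the second request is served from the cache.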
Search and retrieval with MG
Greenstone uses MG (short for Managing Gigabytes, see Witten et al., 1999) to index and retrieve documents, and the source code is included in the GSDLHOME/packages directory. MG uses compression techniques to minimise disk space usage without compromising execution speed. For a collection of English documents, the compressed text and full text indexes together typically occupy one third the space of the original uncompressed text alone. Search and retrieval is often quicker than the equivalent operation on the uncompressed version, because there are fewer disk operations.
<imgcaption figure_api_for_direct_access_to_mg|%!– id:876 –%API for direct access to MG (abridged) ></imgcaption>
  enum result_kinds {
    result_docs,      // Return the documents found in last search
    result_docnums,   // Return document id numbers and weights
    result_termfreqs, // Return terms and frequencies
    result_terms      // Return matching query terms
  };

  int mgq_ask(char *line);
  int mgq_results(enum result_kinds kind, int skip, int howmany,
                  int (*sender)(char *, int, int, float, void *),
                  void *ptr);
  int mgq_numdocs(void);
  int mgq_numterms(void);
  int mgq_equivterms(unsigned char *wordstem,
                     int (*sender)(char *, int, int, float, void *),
                     void *ptr);
  int mgq_docsretrieved (int *total_retrieved, int *is_approx);
  int mgq_getmaxstemlen ();
  void mgq_stemword (unsigned char *word);
MG is normally used interactively by typing commands from the command line, and one way to implement mgsearchclass would be to use the C library system() call within the object to issue the appropriate mg commands. A more efficient approach, however, is to tap directly into the mg code using function calls. While this requires a deeper understanding of the mg code, much of the complexity can be hidden behind a new API that becomes the point of contact for the object mgsearchclass. This is the role of colservr/mgq.c, whose API is shown in Figure <imgref figure_api_for_direct_access_to_mg>.
The way to supply parameters to mg is via mgq_ask(), which takes text options in a format identical to that used at the command line, such as:
  mgq_ask(".set casefold off");
It is also used to invoke a query. Results are accessed through mgq_results(), which takes a pointer to a function as its fourth parameter. This provides a flexible way of converting the information returned in mg data structures into those needed by mgsearchclass. Calls such as mgq_numdocs(), mgq_numterms(), and mgq_docsretrieved() also return information, but this time more tightly prescribed. The last two functions in Figure <imgref figure_api_for_direct_access_to_mg>, mgq_getmaxstemlen() and mgq_stemword(), give support for stemming.
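The callback mechanism can be sketched without linking against mg. In the example below, collect_hit() has the same function-pointer type as the sender parameter of mgq_results(), and the untyped ptr argument carries a caller-owned container. The Hit type, the fake_results() stand-in, and the meaning assigned to each callback argument (text, length, document number, weight) are assumptions made for illustration; the text above does not specify them.

```cpp
#include <vector>

// One search hit in the form the caller wants to keep (hypothetical type).
struct Hit {
  int docnum;
  float weight;
};

// Callback with the same shape as the 'sender' parameter of mgq_results().
// The argument meanings (text, length, docnum, weight) are assumed here.
int collect_hit(char * /*text*/, int /*length*/, int docnum, float weight,
                void *ptr) {
  // 'ptr' is whatever the caller passed in: here, a vector to append to.
  std::vector<Hit> *hits = static_cast<std::vector<Hit> *>(ptr);
  Hit h = {docnum, weight};
  hits->push_back(h);
  return 0;
}

// Stand-in for mgq_results(): reports two made-up hits through 'sender'.
int fake_results(int (*sender)(char *, int, int, float, void *), void *ptr) {
  char none[] = "";
  sender(none, 0, 42, 0.75f, ptr);
  sender(none, 0, 7, 0.50f, ptr);
  return 0;
}
```

Calling fake_results(collect_hit, &hits) with a std::vector<Hit> named hits fills the vector. The library code never needs to know the caller's data structures, which is the flexibility the function pointer provides.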
The Source object
<imgcaption figure_source_base_class_api|%!– id:881 –%Source base class API ></imgcaption>
  class sourceclass {
  public:
    sourceclass ();
    virtual ~sourceclass ();

    // configure should be called once for each configuration line
    virtual void configure (const text_t &key, const text_tarray &cfgline);

    // init should be called after all the configuration is done but
    // before any other methods are called
    virtual bool init (ostream &logout);

    // translate_OID translates OIDs using ".pr", ".fc" etc.
    virtual bool translate_OID (const text_t &OIDin, text_t &OIDout,
                                comerror_t &err, ostream &logout);

    // get_metadata fills out the metadata if possible, if it is not
    // responsible for the given OID then it returns false.
    virtual bool get_metadata (const text_t &requestParams,
                               const text_t &refParams,
                               bool getParents, const text_tset &fields,
                               const text_t &OID, MetadataInfo_tmap &metadata,
                               comerror_t &err, ostream &logout);

    virtual bool get_document (const text_t &OID, text_t &doc,
                               comerror_t &err, ostream &logout);
  };
The role of Source in Figure <imgref figure_greenstone_runtime_system> is to access document metadata and document text, and its base class API is shown in Figure <imgref figure_source_base_class_api>. A member function maps to each task: get_metadata() and get_document() respectively. Both are declared virtual, so the version provided by a particular implementation of the base class is called at runtime. One inherited version of this object uses gdbm to implement get_metadata() and mg to implement get_document(): we detail this version below.
Other member functions seen in Figure <imgref figure_source_base_class_api> are configure(), init(), and translate_OID(). The first two relate to the initialisation process described in Section initialisation.
The remaining one, translate_OID(), handles the syntax for expressing document identifiers. In Figure <imgref figure_the_golf_course_mystery> we saw how a page number could be appended to a document identifier to retrieve just that page. This was possible because pages were stored as “sections” when the collection was built. Appending “.1” to an OID retrieves the first section of the corresponding document. Sections can be nested, and are accessed by concatenating section numbers separated by periods.
As well as hierarchical section numbers, the document identifier syntax supports a form of relative access. For the current section of a document it is possible to access the first child by appending .fc, the last child by appending .lc, the parent by appending .pr, the next sibling by appending .ns, and the previous sibling by appending .ps.
The translate_OID() function uses parameters OIDin and OIDout to hold the source and result of the conversion. It takes two further parameters, err and logout. These communicate any error status that may arise during the translation operation, and determine where to send logging information. The parameters are closely aligned with the protocol, as we shall see in Section protocol.
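The relative forms that can be resolved from the identifier text alone are easy to sketch. The function below is not Greenstone's translate_OID(): translate_relative() is a hypothetical name, std::string stands in for text_t, and the err and logout parameters are omitted. It also does not check that the resulting section exists, and it refuses .lc, because both need the collection information database.

```cpp
#include <string>

// Resolves the purely syntactic suffixes .pr, .fc, .ns and .ps.
// Returns false for anything it cannot resolve, such as .lc (which needs
// the number of children) or .pr applied to a top-level document.
bool translate_relative(const std::string &in, std::string &out) {
  std::string::size_type dot = in.rfind('.');
  if (dot == std::string::npos) { out = in; return true; }  // nothing to do
  const std::string base = in.substr(0, dot);
  const std::string suffix = in.substr(dot + 1);

  if (suffix == "fc") { out = base + ".1"; return true; }   // first child
  if (suffix == "lc") return false;                         // needs the database
  if (suffix != "pr" && suffix != "ns" && suffix != "ps") {
    out = in;                                               // already absolute
    return true;
  }

  std::string::size_type last = base.rfind('.');
  if (last == std::string::npos) return false;              // no parent or siblings
  if (suffix == "pr") { out = base.substr(0, last); return true; }

  // .ns and .ps adjust the final section number by one.
  const std::string num = base.substr(last + 1);
  if (num.empty() || num.size() > 9 ||
      num.find_first_not_of("0123456789") != std::string::npos)
    return false;
  int n = std::stoi(num) + (suffix == "ns" ? 1 : -1);
  if (n < 1) return false;                                  // no previous sibling
  out = base.substr(0, last + 1) + std::to_string(n);
  return true;
}
```

For example, an identifier ending in .2.ns becomes one ending in .3, and one ending in .2.3.pr becomes one ending in .2.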
Database retrieval with gdbm
GDBM is the GNU database manager program (www.gnu.org). It implements a flat record structure of key/data pairs, and is backwards compatible with dbm and ndbm. Operations include storage, retrieval and deletion of records by key, and an unordered traversal of all keys.
<imgcaption figure_gdbm_database_for_the_gutenberg_collection|%!– id:889 –%Gdbm database for the Gutenberg collection (excerpt) ></imgcaption>
  [HASH01d7b30d4827b51282919e9b]
  <doctype> doc
  <hastxt> 0
  <Title> The Winter's Tale
  <Creator> William Shakespeare
  <archivedir> HASH01d7/b30d4827.dir
  <thistype> Invisible
  <childtype> Paged
  <contains> ".1;".2;".3;".4;".5;".6;".7;".8;".9;".10;".11;".12; \
             ".13;".14;".15;".16;".17;".18;".19;".20;".21;".22; \
             ".23;".24;".25;".26;".27;".28;".29;".30;".31;".32; \
             ".33;".34;".35
  <docnum> 168483
  ----------------------------------------------------------------------
  [CL1]
  <doctype> classify
  <hastxt> 0
  <childtype> HList
  <Title> Title
  <numleafdocs> 1818
  <thistype> Invisible
  <contains> ".1;".2;".3;".4;".5;".6;".7;".8;".9;".10;".11;".12; \
             ".13;".14;".15;".16;".17;".18;".19;".20;".21;".22; \
             ".23;".24
  ----------------------------------------------------------------------
  [CL1.1]
  <doctype> classify
  <hastxt> 0
  <childtype> VList
  <Title> A
  <numleafdocs> 118
  <contains> HASH0130bc5f9f90089b3723431f;HASH9cba43bacdab5263c98545;\
             HASH12c88a01da6e8379df86a7;HASH9c86579a83e1a2e4cf9736; \
             HASHdc2951a7ada1f36a6c3aca;HASHea4dda6bbc7cdeb4abfdee; \
             HASHce55006513c47235ac38ba;HASH012a33acaa077c0e612b9351;\
             HASH010dd1e923a123826ae30e4b;HASHaf674616785679fed4b7ee;\
             HASH0147eef4b9d1cb135e096619;HASHe69b9dbaa83ffb045d963b;\
             HASH01abc61c646c8e7a8ce88b10;HASH5f9cd13678e21820e32f3a;\
             HASHe8cbba1594c72c98f9aa1b;HASH01292a2b7b6b60dec96298bc;\
             ...
Figure <imgref figure_gdbm_database_for_the_gutenberg_collection> shows an excerpt from the collection information database that is created when building the Gutenberg collection. The excerpt was produced using the Greenstone utility db2txt, which converts the gdbm binary database format into textual form. Figure <imgref figure_gdbm_database_for_the_gutenberg_collection> contains three records, separated by horizontal rules. The first is a document entry; the other two are part of the hierarchy created by the AZList classifier for titles in the collection. The first line of each record is its key.
The document record stores the book's title, author, and any other metadata provided (or extracted) when the collection was built. It also records values for internal use: where files associated with this document reside (<archivedir>) and the document number used internally by mg (<docnum>).
The <contains> field stores a list of elements, separated by semicolons, that point to related records in the database. For a document record, <contains> is used to point to the nested sections. Subsequent record keys are formed by concatenating the current key with one of the child elements (separated by a period).
The second record in Figure <imgref figure_gdbm_database_for_the_gutenberg_collection> is the top node for the classification hierarchy of Titles A–Z. Its children, accessed through the <contains> field, include CL1.1, CL1.2, CL1.3 and so on, and correspond to the individual pages for the letters A, B, C etc. There are only 24 children: the AZList classifier merged the Q–R and Y–Z entries because they covered only a few titles.
The children in the <contains> field of the third record, CL1.1, are the documents themselves. More complicated structures are possible—the <contains> field can include a mixture of documents and further CL nodes. Keys expressed relative to the current one are distinguished from absolute keys because they begin with a quotation mark (").
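The rule for turning a <contains> value into record keys can be sketched as follows. The function name child_keys() is hypothetical and std::string stands in for Greenstone's text_t; the sketch assumes the line continuations shown in the figure have already been joined into one value.

```cpp
#include <string>
#include <vector>

// Splits a <contains> value on semicolons and turns each element into an
// absolute record key. An element that starts with a quotation mark is
// relative: the mark stands for the key of the current record.
std::vector<std::string> child_keys(const std::string &key,
                                    const std::string &contains) {
  std::vector<std::string> keys;
  std::string::size_type start = 0;
  while (start <= contains.size()) {
    std::string::size_type end = contains.find(';', start);
    if (end == std::string::npos) end = contains.size();
    std::string item = contains.substr(start, end - start);
    if (!item.empty())
      keys.push_back(item[0] == '"' ? key + item.substr(1) : item);
    start = end + 1;
  }
  return keys;
}
```

Applied to the record CL1, the element ".1 yields the key CL1.1, whereas the absolute HASH elements of CL1.1 are returned unchanged.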
Using MG and GDBM to implement a Source object
<imgcaption figure_api_for_mg_and_gdbm_based_version_of_sourceclass|%!– id:896 –%API for mg and gdbm based version of sourceclass (abridged) ></imgcaption>
  class mggdbmsourceclass : public sourceclass {
  protected:
    // Omitted, data fields that store:
    //   collection specific file information
    //   index substructure
    //   information about parent
    //   pointers to gdbm and mgsearch objects

  public:
    mggdbmsourceclass ();
    virtual ~mggdbmsourceclass ();

    void set_gdbmptr (gdbmclass *thegdbmptr);
    void set_mgsearchptr (searchclass *themgsearchptr);

    void configure (const text_t &key, const text_tarray &cfgline);
    bool init (ostream &logout);
    bool translate_OID (const text_t &OIDin, text_t &OIDout,
                        comerror_t &err, ostream &logout);
    bool get_metadata (const text_t &requestParams,
                       const text_t &refParams,
                       bool getParents, const text_tset &fields,
                       const text_t &OID, MetadataInfo_tmap &metadata,
                       comerror_t &err, ostream &logout);
    bool get_document (const text_t &OID, text_t &doc,
                       comerror_t &err, ostream &logout);
  };
The object that puts mg and gdbm together to realise an implementation of sourceclass is mggdbmsourceclass. Figure <imgref figure_api_for_mg_and_gdbm_based_version_of_sourceclass> shows its API. The two new member functions set_gdbmptr() and set_mgsearchptr() store pointers to their respective objects, so that the implementations of get_metadata() and get_document() can access the appropriate tools to complete the job.
The Filter object
<imgcaption figure_api_for_the_filter_base_class|%!– id:899 –%API for the Filter base class ></imgcaption>
  class filterclass {
  protected:
    text_t gsdlhome;
    text_t collection;
    text_t collectdir;

    FilterOption_tmap filterOptions;

  public:
    filterclass ();
    virtual ~filterclass ();

    virtual void configure (const text_t &key, const text_tarray &cfgline);
    virtual bool init (ostream &logout);

    // returns the name of this filter
    virtual text_t get_filter_name ();

    // returns the current filter options
    virtual void get_filteroptions (InfoFilterOptionsResponse_t &response,
                                    comerror_t &err, ostream &logout);

    virtual void filter (const FilterRequest_t &request,
                         FilterResponse_t &response,
                         comerror_t &err, ostream &logout);
  };
The base class API for the Filter object in Figure <imgref figure_greenstone_runtime_system> is shown in Figure <imgref figure_api_for_the_filter_base_class>. It begins with the protected data fields gsdlhome, collection, and collectdir. These commonly occur in classes that need to access collection-specific files.
- gsdlhome is the same as GSDLHOME, so that the object can locate the Greenstone files.
- collection is the name of the directory corresponding to the collection.
- collectdir is the full pathname of the collection directory (this is needed because a collection does not have to reside within the GSDLHOME area).
mggdbmsourceclass is another class that includes these three data fields.
The member functions configure() and init() (first seen in sourceclass) are used by the initialisation process. The object itself is closely aligned with the corresponding filter part of the protocol; in particular get_filteroptions() and filter() match one for one.
<imgcaption figure_how_a_filter_option_is_stored|%!– id:906 –%How a filter option is stored ></imgcaption>
  struct FilterOption_t {
    void clear ();
    void check_defaultValue ();
    FilterOption_t () {clear();}

    text_t name;

    enum type_t {booleant=0, integert=1, enumeratedt=2, stringt=3};
    type_t type;

    enum repeatable_t {onePerQuery=0, onePerTerm=1, nPerTerm=2};
    repeatable_t repeatable;

    text_t defaultValue;
    text_tarray validValues;
  };

  struct OptionValue_t {
    void clear ();

    text_t name;
    text_t value;
  };
Central to the filter options are the two classes shown in Figure <imgref figure_how_a_filter_option_is_stored>. Stored inside FilterOption_t is the name of the option, its type, and whether or not it is repeatable. The interpretation of validValues depends on the option type. For a Boolean type the first value is false and the second is true. For an integer type the first value is the minimum number, the second the maximum. For an enumerated type all values are listed. For a string type the value is ignored. For simpler situations, OptionValue_t is used, which records as a text_t the name of the option and its value.
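The type-dependent interpretation of validValues can be made concrete with a small sketch. The Option struct below is a simplified stand-in for FilterOption_t that uses std::string and std::vector in place of text_t and text_tarray, and is_valid() is a hypothetical helper, not a Greenstone function.

```cpp
#include <algorithm>
#include <cstdlib>
#include <string>
#include <vector>

// Simplified stand-in for FilterOption_t (only the fields needed here).
struct Option {
  enum type_t {booleant = 0, integert = 1, enumeratedt = 2, stringt = 3};
  type_t type;
  std::vector<std::string> validValues;
};

// Checks a candidate value against the option's validValues, interpreting
// the list according to the option type.
bool is_valid(const Option &opt, const std::string &value) {
  const std::vector<std::string> &v = opt.validValues;
  switch (opt.type) {
    case Option::booleant:     // validValues = {false value, true value}
      return v.size() == 2 && (value == v[0] || value == v[1]);
    case Option::integert: {   // validValues = {minimum, maximum}
      if (v.size() != 2 || value.empty()) return false;
      char *end = 0;
      long n = std::strtol(value.c_str(), &end, 10);
      if (*end != '\0') return false;  // trailing non-numeric characters
      return n >= std::atol(v[0].c_str()) && n <= std::atol(v[1].c_str());
    }
    case Option::enumeratedt:  // validValues lists every permitted value
      return std::find(v.begin(), v.end(), value) != v.end();
    case Option::stringt:      // validValues is ignored
      return true;
  }
  return false;
}
```

With an integer option whose validValues are 1 and 500, the value 50 is accepted and 501 is rejected; an enumerated option accepts only the values it lists.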
The request and response objects passed as parameters to filterclass are constructed from these two classes, using associative arrays to store a set of options such as those required for InfoFilterOptionsResponse_t. More detail can be found in GSDLHOME/src/recpt/comtypes.h.
Inherited Filter objects
<imgcaption figure_inheritance_hierarchy_for_filter|%!– id:910 –%Inheritance hierarchy for Filter ></imgcaption>
Two levels of inheritance are used for filters, as illustrated in Figure <imgref figure_inheritance_hierarchy_for_filter>. First a distinction is made between Query and Browse filters, and then for the former there is a specific implementation based on mg. To operate correctly, mgqueryfilterclass needs access to mg through mgsearchclass and to gdbm through gdbmclass. browsefilterclass only needs access to gdbm. Pointers to these objects are stored as protected data fields within the respective classes.
The collection server code
Here are the header files in GSDLHOME/src/colservr, with a description of each. The filename generally repeats the object name defined within it.
<tblcaption table_table_1|##HIDDEN##></tblcaption>
< - 120 420 > | |
browsefilter.h | Inherited from filterclass, this object provides access to gdbm. (Described above.) |
collectserver.h | This object binds Filters and Sources for one collection together, to form the Collection object depicted in Figure <imgref figure_greenstone_runtime_system>. |
colservrconfig.h | Function support for reading the collection-specific files etc/collect.cfg and index/build.cfg. The former is the collection's configuration file. The latter is a file generated by the building process that records the time of the last successful build, an index map list, how many documents were indexed, and how large they are in bytes (uncompressed). |
filter.h | The base class Filter object filterclass described above. |
maptools.h | Defines a class called stringmap that provides a mapping that remembers the original order of a text_t map, but is fast to look up. Used in mggdbmsourceclass and queryfilterclass. |
mggdbmsource.h | Inherited from sourceclass, this object provides access to mg and gdbm. (Described above.) |
mgppqueryfilter.h | Inherited from queryfilterclass, this object provides an implementation of QueryFilter based upon mg++, an improved version of mg written in C++. Note that Greenstone is set up to use mg by default, since mg++ is still under development. |
mgppsearch.h | Inherited from searchclass, this object provides an implementation of Search using mg++. Like mgppqueryfilterclass, this is not used by default. |
mgq.h | Function-level interface to the mg package. Principal functions are mgq_ask() and mgq_results(). |
mgqueryfilter.h | Inherited from queryfilterclass, this object provides an implementation of QueryFilter based upon mg. |
mgsearch.h | Inherited from searchclass, this object provides an implementation of Search using mg. (Described above.) |
phrasequeryfilter.h | Inherited from mgqueryfilterclass, this object provides a phrase-based query class. It is not used in the default installation. Instead mgqueryfilterclass provides this capability through functional support from phrasesearch.h. |
phrasesearch.h | Functional support to implement phrase searching as a post-processing operation. |
querycache.h | Used by searchclass and its inherited classes to cache the results of a query, in order to make the generation of further search results pages more efficient. (Described above.) |
queryfilter.h | Inherited from the Filter base class filterclass, this object establishes a base class for Query filter objects. (Described above.) |
queryinfo.h | Support for searching: data structures and objects to hold query parameters, document results and term frequencies. |
search.h | The base class Search object searchclass. (Described above.) |
source.h | The base class Source object sourceclass. (Described above.) |
Protocol
<tblcaption table_list_of_protocol_calls|List of protocol calls></tblcaption>
< - 132 397 > | |
get_protocol_name() | Returns the name of this protocol. Choices include nullproto, corbaproto, and z3950proto. Used by protocol-sensitive parts of the runtime system to decide which code to execute. |
get_collection_list() | Returns the list of collections that this protocol knows about. |
has_collection() | Returns true if the protocol can communicate with the named collection, i.e. it is within its collection list. |
ping() | Returns true if a successful connection was made to the named collection. In the null protocol the implementation is identical to has_collection(). |
get_collectinfo() | Obtains general information about the named collection: when it was last built, how many documents it contains, and so on. Also includes metadata from the collection configuration file: “about this collection” text; the collection icon to use, and so on. |
get_filterinfo() | Gets a list of all Filters for the named collection. |
get_filteroptions() | Gets all options for a particular Filter within the named collection. |
filter() | Supports searching and browsing. For a given filter type and option settings, it accesses the content of the named collections to produce a result set that is filtered in accordance with the option settings. The data fields returned also depend on the option settings: examples include query term frequency and document metadata. |
get_document() | Gets a document or section of a document. |
Table <tblref table_list_of_protocol_calls> lists the function calls to the protocol, with a summary for each entry. The examples in Section how_the_conceptual_framework_fits_together covered most of these. Functions not previously mentioned are has_collection(), ping(), get_protocol_name() and get_filteroptions(). The first two provide yes/no answers to the questions “does the collection exist on this server?” and “is it running?” respectively. The other two support multiple protocols within an architecture that is distributed over different computers, not just the null-protocol based single executable described here. get_protocol_name() identifies which protocol is being used. get_filteroptions() lets a receptionist interrogate a collection server to find what options are supported, and so dynamically configure itself to take full advantage of the services offered by a particular server.
<imgcaption figure_null_protocol_api|%!– id:971 –%Null protocol API (abridged) ></imgcaption>
  class nullproto : public recptproto {
  public:
    virtual text_t get_protocol_name ();

    virtual void get_collection_list (text_tarray &collist,
                                      comerror_t &err, ostream &logout);
    virtual void has_collection (const text_t &collection,
                                 bool &hascollection,
                                 comerror_t &err, ostream &logout);
    virtual void ping (const text_t &collection,
                       bool &wassuccess,
                       comerror_t &err, ostream &logout);
    virtual void get_collectinfo (const text_t &collection,
                                  ColInfoResponse_t &collectinfo,
                                  comerror_t &err, ostream &logout);
    virtual void get_filterinfo (const text_t &collection,
                                 InfoFiltersResponse_t &response,
                                 comerror_t &err, ostream &logout);
    virtual void get_filteroptions (const text_t &collection,
                                    const InfoFilterOptionsRequest_t &request,
                                    InfoFilterOptionsResponse_t &response,
                                    comerror_t &err, ostream &logout);
    virtual void filter (const text_t &collection,
                         FilterRequest_t &request,
                         FilterResponse_t &response,
                         comerror_t &err, ostream &logout);
    virtual void get_document (const text_t &collection,
                               const DocumentRequest_t &request,
                               DocumentResponse_t &response,
                               comerror_t &err, ostream &logout);
  };
Figure <imgref figure_null_protocol_api> shows the API for the null protocol. Comments, and certain low level details, have been omitted (see the source file recpt/nullproto.h for full details).
This protocol inherits from the base class recptproto. Virtual inheritance is used so that more than one type of protocol—including protocols not even conceived yet—can be easily supported in the rest of the source code. This is possible because the base class object recptproto is used throughout the source code, with the exception of the point of construction. Here we specify the actual variety of protocol we wish to use—in this case, the null protocol.
With the exception of get_protocol_name(), which takes no parameters and returns the protocol name as a Unicode-compliant text string, all protocol functions include an error parameter and an output stream as the last two arguments. The error parameter records any errors that occur during the execution of the protocol call, and the output stream is for logging purposes. The functions have type void—they do not explicitly return information as their final statement, but instead return data through designated parameters such as the already-introduced error parameter. In some programming languages, such routines would be defined as procedures rather than functions, but C++ makes no syntactic distinction.
Most functions take the collection name as an argument. Three of the member functions, get_filteroptions(), filter(), and get_document(), follow the pattern of providing a Request parameter and receiving the results in a Response parameter.
Receptionist
The final layer of the conceptual model is the receptionist. Once the CGI arguments are parsed, the main activity is the execution of an Action, supported by the Format and Macro Language objects. These are described below. Although they are represented as objects in the conceptual framework, Format and Macro Language objects are not strictly objects in the C++ sense. In reality, Format is a collection of data structures with a set of functions that operate on them, and the Macro Language object is built around displayclass, defined in lib/display.h, with stream conversion support from lib/gsdlunicode.h.
Actions
<tblcaption table_actions_in_greenstone|Actions in Greenstone></tblcaption>
< - 132 397 > | |
action | Base class for virtual inheritance. |
authenaction | Supports user authentication: prompts the user for a password if one has not been entered; checks whether it is valid; and forces the user to log in again if sufficient time lapses between accesses. |
collectoraction | Generates the pages for the Collector. |
documentaction | Retrieves documents, document sections, parts of the classification hierarchy, or formatting information. |
extlinkaction | Takes a user directly to a URL that is external to a collection, possibly generating an alert page first (dictated by the Preferences). |
pageaction | Generates a page in conjunction with the macro language. |
pingaction | Checks to see whether a collection is online. |
queryaction | Performs a search. |
statusaction | Generates the administration pages. |
tipaction | Brings up a random tip for the user. |
usersaction | Supports adding, deleting, and managing user access. |
Greenstone supports the eleven actions summarised in Table <tblref table_actions_in_greenstone>.
<imgcaption figure_using_the_cgiargsinfoclass_from_pageactioncpp|%!– id:1003 –%Using the cgiargsinfoclass from pageaction.cpp %!– withLineNumber –%></imgcaption>
  cgiarginfo arg_ainfo;
  arg_ainfo.shortname = "a";
  arg_ainfo.longname = "action";
  arg_ainfo.multiplechar = true;
  arg_ainfo.argdefault = "p";
  arg_ainfo.defaultstatus = cgiarginfo::weak;
  arg_ainfo.savedarginfo = cgiarginfo::must;
  argsinfo.addarginfo (NULL, arg_ainfo);

  arg_ainfo.shortname = "p";
  arg_ainfo.longname = "page";
  arg_ainfo.multiplechar = true;
  arg_ainfo.argdefault = "home";
  arg_ainfo.defaultstatus = cgiarginfo::weak;
  arg_ainfo.savedarginfo = cgiarginfo::must;
  argsinfo.addarginfo (NULL, arg_ainfo);
The CGI arguments needed by an action are formally declared in its constructor function using cgiarginfo (defined in recpt/cgiargs.h). Figure <imgref figure_using_the_cgiargsinfoclass_from_pageactioncpp> shows an excerpt from the pageaction constructor function, which defines the size and properties of the CGI arguments a and p.
For each CGI argument, the constructor must specify its short name (lines 2 and 10), which is the name of the CGI variable itself; a long name (lines 3 and 11) that provides a more meaningful description of the argument; whether it represents a single or multiple character value (lines 4 and 12); a possible default value (lines 5 and 13); what happens when more than one default value is supplied (lines 6 and 14), since defaults can also be set in configuration files; and whether or not the value is preserved at the end of this action (lines 7 and 15).
Since it is built into the code, web pages that detail this information can be generated automatically. The statusaction produces this information. It can be viewed by entering the URL for the Greenstone administration page.
The ten inherited actions are constructed in main(), the top-level function for the library executable, whose definition is given in recpt/librarymain.cpp. This is also where the receptionist object (defined in recpt/receptionist.cpp) is constructed. Responsibility for all the actions is passed to the receptionist, which processes them by maintaining, as a data field, an associative array of the Action base class, indexed by action name.
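The associative array of actions can be sketched as a map from action name to base class pointer. This is an illustration under simplifying assumptions, not the Greenstone receptionist: the classes are cut down to two member functions, do_action() returns a string instead of streaming a page, and the action name "q" for queries is assumed for the example.

```cpp
#include <map>
#include <string>

// Simplified stand-in for the action base class.
class Action {
public:
  virtual ~Action() {}
  virtual std::string get_action_name() const = 0;
  virtual std::string do_action() const = 0;
};

class PageAction : public Action {
public:
  std::string get_action_name() const override { return "p"; }
  std::string do_action() const override { return "page output"; }
};

class QueryAction : public Action {
public:
  std::string get_action_name() const override { return "q"; }
  std::string do_action() const override { return "query output"; }
};

// Keeps the actions in an associative array indexed by action name.
class Receptionist {
public:
  // Registers 'a' for dispatch; the caller keeps ownership of the object.
  void add_action(Action *a) { actions_[a->get_action_name()] = a; }

  // Looks up the action named by the CGI argument a and runs it.
  // Returns false if no action with that name has been registered.
  bool run(const std::string &name, std::string &out) const {
    std::map<std::string, Action *>::const_iterator it = actions_.find(name);
    if (it == actions_.end()) return false;
    out = it->second->do_action();
    return true;
  }

private:
  std::map<std::string, Action *> actions_;
};
```

Because the map stores base class pointers, adding a new kind of action requires only constructing it and registering it; the dispatch code is unchanged.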
<imgcaption figure_action_base_class_api|%!– id:1008 –%Action base class API ></imgcaption>
  class action {
  protected:
    cgiargsinfoclass argsinfo;
    text_t gsdlhome;

  public:
    action ();
    virtual ~action ();

    virtual void configure (const text_t &key, const text_tarray &cfgline);
    virtual bool init (ostream &logout);
    virtual text_t get_action_name ();
    cgiargsinfoclass getargsinfo ();

    virtual bool check_cgiargs (cgiargsinfoclass &argsinfo,
                                cgiargsclass &args, ostream &logout);
    virtual bool check_external_cgiargs (cgiargsinfoclass &argsinfo,
                                         cgiargsclass &args,
                                         outconvertclass &outconvert,
                                         const text_t &saveconf,
                                         ostream &logout);
    virtual void get_cgihead_info (cgiargsclass &args,
                                   recptprotolistclass *protos,
                                   response_t &response,
                                   text_t &response_data,
                                   ostream &logout);
    virtual bool uses_display (cgiargsclass &args);
    virtual void define_internal_macros (displayclass &disp,
                                         cgiargsclass &args,
                                         recptprotolistclass *protos,
                                         ostream &logout);
    virtual void define_external_macros (displayclass &disp,
                                         cgiargsclass &args,
                                         recptprotolistclass *protos,
                                         ostream &logout);
    virtual bool do_action (cgiargsclass &args,
                            recptprotolistclass *protos,
                            browsermapclass *browsers,
                            displayclass &disp,
                            outconvertclass &outconvert,
                            ostream &textout,
                            ostream &logout);
  };
Figure <imgref figure_action_base_class_api> shows the API for the Action base class. When executing an action, the receptionist calls several functions, starting with check_cgiargs(). Most help to check, set up, and define values and macros, while do_action() actually generates the output page. If a particular inherited object has no definition for a particular member function, the call passes through to the base class definition, which implements appropriate default behaviour.
Explanations of the member functions are as follows.
- get_action_name() returns the CGI a argument value that specifies this action. The name should be short but may be more than one character long.
- check_cgiargs() is called before get_cgihead_info(), define_external_macros(), and do_action(). If an error is found a message is written to logout; if it is serious the function returns false and no page content is produced.
- check_external_cgiargs() is called after check_cgiargs() for all actions. It is intended for use only to override some other normal behaviour, for example producing a login page when the requested page needs authentication.
- get_cgihead_info() sets the CGI header information. If response is set to location, then response_data contains the redirect address. If response is set to content, then response_data contains the content type.
- uses_display() returns true if the displayclass is needed to output the page content (the default).
- define_internal_macros() defines all macros that are related to pages generated by this action.
- define_external_macros() defines all macros that might be used by other actions to produce pages.
- do_action() generates the output page, normally streamed through the macro language object disp and the output conversion object outconvert to textout. Returns false if there was an error that prevented the action from outputting anything.
At the beginning of the class definition, argsinfo is the protected data field (used in the code excerpt shown in Figure <imgref figure_using_the_cgiargsinfoclass_from_pageactioncpp>) that stores the CGI argument information specified in the constructor of a derived Action. The other data field, gsdlhome, records GSDLHOME for convenient access. The object also includes configure() and init() for initialisation purposes.
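As a rough illustration of how a derived action overrides only the members it needs, the following sketch uses deliberately simplified stand-ins: text_t is replaced by std::string, cgiargsclass by a std::map, and the parameter lists are shortened. The class helloaction and the helper run_action() are hypothetical and are not part of Greenstone.

```cpp
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Simplified stand-ins (assumptions): the real text_t and cgiargsclass
// are Greenstone classes with richer interfaces.
typedef std::string text_t;
typedef std::map<text_t, text_t> cgiargsclass;

// Cut-down base class: three of the virtual members from the figure,
// with shortened parameter lists and default behaviour.
class action {
public:
  virtual ~action() {}
  virtual text_t get_action_name() { return ""; }
  virtual bool check_cgiargs(cgiargsclass &, std::ostream &) { return true; }
  virtual bool do_action(cgiargsclass &, std::ostream &, std::ostream &) {
    return false;  // the base class produces no page
  }
};

// Hypothetical derived action, selected by the CGI argument a=hello.
class helloaction : public action {
public:
  text_t get_action_name() { return "hello"; }

  bool check_cgiargs(cgiargsclass &args, std::ostream &logout) {
    if (args.find("name") == args.end()) {
      logout << "helloaction: missing 'name' argument\n";
      return false;  // serious error: no page content is produced
    }
    return true;
  }

  bool do_action(cgiargsclass &args, std::ostream &textout, std::ostream &) {
    textout << "<p>Hello, " << args["name"] << "</p>\n";
    return true;
  }
};

// Mimics the calling order described above: check first, then generate.
text_t run_action(action &a, cgiargsclass &args, std::ostream &logout) {
  std::ostringstream page;
  if (!a.check_cgiargs(args, logout)) return "";
  if (!a.do_action(args, page, logout)) return "";
  return page.str();
}
```

Calling run_action() with a helloaction and an argument map containing name=World yields a one-line page, while an empty argument map writes a message to the log stream and yields no page.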
Formatting
<imgcaption figure_core_data_structures_in_format|%!– id:1021 –%Core data structures in Format ></imgcaption>
enum command_t {comIf, comOr, comMeta, comText, comLink, comEndLink,
                comNum, comIcon, comDoc,
                comHighlight, comEndHighlight};

enum pcommand_t {pNone, pImmediate, pTop, pAll};
enum dcommand_t {dMeta, dText};
enum mcommand_t {mNone, mCgiSafe};

struct metadata_t {
  void clear();
  metadata_t () {clear();}

  text_t metaname;
  mcommand_t metacommand;
  pcommand_t parentcommand;
  text_t parentoptions;
};

// The decision component of an {If}{decision,true-text,false-text}
// formatstring. The decision can be based on metadata or on text;
// normally that text would be a macro like
// _cgiargmode_.
struct decision_t {
  void clear();
  decision_t () {clear();}

  dcommand_t command;
  metadata_t meta;
  text_t text;
};

struct format_t {
  void clear();
  format_t () {clear();}

  command_t command;
  decision_t decision;
  text_t text;
  metadata_t meta;
  format_t *nextptr;
  format_t *ifptr;
  format_t *elseptr;
  format_t *orptr;
};
Although formatting is represented as a single entity in Figure <imgref figure_greenstone_runtime_system>, in reality it constitutes a collection of data structures and functions. They are gathered together under the header file recpt/formattools.h. The core data structures are shown in Figure <imgref figure_core_data_structures_in_format>.
<imgcaption figure_data_structures_built_for_sample_format_statement|%!– id:1023 –%Data structures built for sample format statement ></imgcaption>
The implementation is best explained using an example. When the format statement
format CL1Vlist "[link][Title]{If}{[Creator], by [Creator]}[/link]"
is read from a collection configuration file, it is parsed by functions in formattools.cpp and the interconnected data structure shown in Figure <imgref figure_data_structures_built_for_sample_format_statement> is built. When the format statement needs to be evaluated by an action, the data structure is traversed. The route taken at comIf and comOr nodes depends on the metadata that is returned from a call to the protocol.
One complication is that when metadata is retrieved, it might include further macros and format syntax. This is handled by switching back and forth between parsing and evaluating, as needed.
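The traversal can be pictured with a much reduced node type. In the sketch below, format_node is a hypothetical simplification of format_t in which the decision is just a metadata name, metadata_map stands in for the metadata returned by the protocol, and the link target doc.html is invented for the example. It is not the code in formattools.cpp.

```cpp
#include <map>
#include <string>

typedef std::string text_t;                     // stand-in for Greenstone's text_t
typedef std::map<text_t, text_t> metadata_map;  // assumption: metadata already fetched

enum command_t {comIf, comMeta, comText, comLink, comEndLink};

// Cut-down version of format_t.
struct format_node {
  command_t command;
  text_t text;           // literal text (comText) or metadata name (comMeta, comIf)
  format_node *nextptr;  // next node in sequence
  format_node *ifptr;    // branch taken when the decision metadata is non-empty
  format_node *elseptr;  // branch taken otherwise (may be null)
};

static text_t lookup(const metadata_map &meta, const text_t &name) {
  metadata_map::const_iterator it = meta.find(name);
  return it == meta.end() ? text_t() : it->second;
}

// Walks the list, following ifptr or elseptr at each comIf node.
text_t evaluate(const format_node *node, const metadata_map &meta,
                const text_t &link) {
  text_t out;
  for (; node != 0; node = node->nextptr) {
    switch (node->command) {
      case comText:    out += node->text; break;
      case comMeta:    out += lookup(meta, node->text); break;
      case comLink:    out += "<a href=\"" + link + "\">"; break;
      case comEndLink: out += "</a>"; break;
      case comIf:
        out += evaluate(lookup(meta, node->text).empty() ? node->elseptr
                                                         : node->ifptr,
                        meta, link);
        break;
    }
  }
  return out;
}

// Builds by hand the structure for
//   [link][Title]{If}{[Creator], by [Creator]}[/link]
// and evaluates it against the supplied metadata.
text_t format_sample(const metadata_map &meta) {
  format_node creator = {comMeta, "Creator", 0, 0, 0};
  format_node by      = {comText, ", by ", &creator, 0, 0};
  format_node endlink = {comEndLink, "", 0, 0, 0};
  format_node cond    = {comIf, "Creator", &endlink, &by, 0};
  format_node title   = {comMeta, "Title", &cond, 0, 0};
  format_node link    = {comLink, "", &title, 0, 0};
  return evaluate(&link, meta, "doc.html");
}
```

With only Title set, the comIf node follows its null elseptr and contributes nothing; once Creator is present, the ifptr branch appends the ", by ..." text inside the link.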
Macro language
The Macro Language entity in Figure <imgref figure_greenstone_runtime_system>, like Format, does not map to a single C++ class. In this case there is a core class, but the implementation of the macro language also calls upon supporting functions and classes.
Again, the implementation is best explained using an example. First we give some sample macro definitions that illustrate macro precedence, then—with the aid of a diagram—we describe the core data structures built to support this activity. Finally we present and describe the public member functions to displayclass, the top-level macro object.
<imgcaption figure_illustration_of_macro_precedence|%!– id:1031 –%Illustration of macro precedence ></imgcaption>
package query

_header_ [] {_querytitle_}
_header_ [l=en] {Search page}
_header_ [c=demo] {<table bgcolor=green><tr><td>_querytitle_</td></tr></table>}
_header_ [v=1] {_textquery_}
_header_ [l=fr,v=1,c=hdl] {HDL Page de recherche}
In a typical Greenstone installation, macro precedence is normally: c (for the collection) takes precedence over v (for graphical or text-only interface), which takes precedence over l (for the language). This is accomplished by the line
macroprecedence c,v,l
in the main configuration file main.cfg. The macro statements in Figure <imgref figure_illustration_of_macro_precedence> define sample macros for _header_ in the query package for various settings of c, v, and l. If the CGI arguments given when an action is invoked included c=hdl, v=1, and l=en, the macro _header_[v=1] would be selected for display. It would be selected ahead of _header_[l=en] because v has a higher precedence than l. The _header_[l=fr,v=1,c=hdl] macro would not be selected because the page parameter for l is different.
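One way to picture the selection rule is as a scoring scheme. The sketch below is an assumption made for illustration, not the algorithm used by Greenstone: a macro variant is rejected if any of its parameters disagrees with the page parameters, and otherwise scores higher the earlier its parameters appear in the macroprecedence line. All function names are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

typedef std::map<std::string, std::string> params_t;

// Splits "a,b,c" on commas; an empty string gives an empty vector.
std::vector<std::string> split_commas(const std::string &s) {
  std::vector<std::string> parts;
  std::istringstream in(s);
  std::string item;
  while (std::getline(in, item, ',')) parts.push_back(item);
  return parts;
}

// Parses "l=fr,v=1,c=hdl" into a map; "" gives an empty map.
params_t parse_params(const std::string &s) {
  params_t p;
  std::vector<std::string> parts = split_commas(s);
  for (std::size_t i = 0; i < parts.size(); ++i) {
    std::string::size_type eq = parts[i].find('=');
    if (eq != std::string::npos)
      p[parts[i].substr(0, eq)] = parts[i].substr(eq + 1);
  }
  return p;
}

// Returns -1 if the macro's parameters conflict with the page parameters,
// otherwise a score in which earlier precedence entries weigh more.
int score(const params_t &macro, const params_t &page,
          const std::vector<std::string> &precedence) {
  int total = 0;
  for (params_t::const_iterator it = macro.begin(); it != macro.end(); ++it) {
    params_t::const_iterator pg = page.find(it->first);
    if (pg == page.end() || pg->second != it->second) return -1;
    for (std::size_t i = 0; i < precedence.size(); ++i)
      if (precedence[i] == it->first)
        total += 1 << (precedence.size() - 1 - i);
  }
  return total;
}

// Returns the parameter string of the best-scoring candidate.
std::string select_macro(const std::vector<std::string> &candidates,
                         const std::string &pageparams,
                         const std::string &precedence_line) {
  const params_t page = parse_params(pageparams);
  const std::vector<std::string> precedence = split_commas(precedence_line);
  int best = -1;
  std::string chosen;
  for (std::size_t i = 0; i < candidates.size(); ++i) {
    int s = score(parse_params(candidates[i]), page, precedence);
    if (s > best) { best = s; chosen = candidates[i]; }
  }
  return chosen;
}
```

Given the five parameter lists from the figure, page parameters c=hdl,v=1,l=en and precedence c,v,l, this scheme picks v=1: the l=en variant also matches but scores lower, and the l=fr,v=1,c=hdl variant is rejected on l.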
<imgcaption figure_data_structures_representing_the_default_macros|%!– id:1034 –%Data structures representing the default macros ></imgcaption>
Figure <imgref figure_data_structures_representing_the_default_macros> shows the core data structure built when reading the macro files specified in etc/main.cfg. Essentially, it is an associative array of associative arrays of associative arrays. The top layer (shown on the left) indexes which package the macro is from, and the second layer indexes the macro name. The final layer indexes any parameters that were specified, storing each one as the type mvalue which records, along with the macro value, the file it came from. For example, the text defined for _header_[l=en] in Figure <imgref figure_illustration_of_macro_precedence> can be seen stored in the lower of the two mvalue records in Figure <imgref figure_data_structures_representing_the_default_macros>.
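The three layers can be sketched with nested std::map containers. The type and function names below are hypothetical, as are the field names inside mvalue and the file name english.dm used in the usage note; the real class uses Greenstone's own text_t and container types.

```cpp
#include <map>
#include <string>

// Assumed shape of an mvalue record: the macro text plus its source file.
struct mvalue {
  std::string filename;
  std::string value;
};

typedef std::map<std::string, mvalue>     paramtable;    // parameters -> mvalue
typedef std::map<std::string, paramtable> macrotable;    // macro name -> parameters
typedef std::map<std::string, macrotable> packagetable;  // package -> macro names

// Stores one macro definition, creating intermediate layers on demand.
void set_default_macro(packagetable &table, const std::string &package,
                       const std::string &name, const std::string &params,
                       const std::string &value, const std::string &filename) {
  mvalue &slot = table[package][name][params];
  slot.value = value;
  slot.filename = filename;
}

// Looks a macro up layer by layer; returns null if any layer is missing.
const mvalue *find_default_macro(const packagetable &table,
                                 const std::string &package,
                                 const std::string &name,
                                 const std::string &params) {
  packagetable::const_iterator p = table.find(package);
  if (p == table.end()) return 0;
  macrotable::const_iterator m = p->second.find(name);
  if (m == p->second.end()) return 0;
  paramtable::const_iterator v = m->second.find(params);
  return v == m->second.end() ? 0 : &v->second;
}
```

For instance, after set_default_macro(t, "query", "header", "l=en", "Search page", "english.dm"), a lookup with the same three keys returns a record whose value is "Search page".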
<imgcaption figure_displayclass_api|%!– id:1036 –%Displayclass API (abridged) ></imgcaption>
class displayclass {
public:
  displayclass ();
  ~displayclass ();

  int isdefaultmacro (text_t package, const text_t &macroname);
  int setdefaultmacro (text_t package, const text_t &macroname,
                       text_t params, const text_t &macrovalue);
  int loaddefaultmacros (text_t thisfilename);

  void openpage (const text_t &thispageparams, const text_t &thisprecedence);
  void setpageparams (text_t thispageparams, text_t thisprecedence);

  int setmacro (const text_t &macroname, text_t package,
                const text_t &macrovalue);

  void expandstring (const text_t &inputtext, text_t &outputtext);
  void expandstring (text_t package, const text_t &inputtext,
                     text_t &outputtext, int recursiondepth = 0);

  void setconvertclass (outconvertclass *theoutc) {outc = theoutc;}
  outconvertclass *getconvertclass () {return outc;}

  ostream *setlogout (ostream *thelogout);
};
The central object that supports the macro language is displayclass, defined in lib/display.h. Its public member functions are shown in Figure <imgref figure_displayclass_api>. The class reads the specified macro files using loaddefaultmacros(), storing in a protected section of the class (not shown) the type of data structure shown in Figure <imgref figure_data_structures_representing_the_default_macros>. Macros can also be set by the runtime system using setmacro(): in the last example of Section how_the_conceptual_framework_fits_together, pageaction uses this function to set _homeextra_ to the dynamically generated table of available collections. This is supported by a set of associative arrays similar to those used to represent macro files (they are not identical, because macros set at runtime do not require the “parameter” layer). In displayclass, macros read from the macro files are referred to as default macros. Local macros specified through setmacro() are referred to as current macros, and are cleared from memory once the page has been generated.
When a page is to be produced, openpage() is first called to communicate the current settings of the page parameters (l=en and so on). Following that, text and macros are streamed through the class, typically from within an action class, using code along the lines of:
cout << text_t2ascii << display << "_amacro_ " << "_anothermacro_ ";
The result is that macros are expanded according to the page parameter settings. If required, these settings can be changed partway through an action by using setpageparams(). The remaining public member functions provide lower level support.
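To give a feel for what expansion involves, here is a toy expander. It is a sketch under strong simplifying assumptions and not the displayclass implementation: it ignores packages and page parameters, treats any text between two underscores as a candidate macro name, and uses an arbitrary recursion limit of 10.

```cpp
#include <map>
#include <string>

typedef std::map<std::string, std::string> macro_map;  // name -> replacement text

// Replaces each _name_ found in macros, expanding the replacement recursively.
// Names that are not defined are copied through unchanged.
std::string expand(const macro_map &macros, const std::string &input,
                   int depth = 0) {
  if (depth > 10) return input;  // guard against self-referential macros
  std::string out;
  std::string::size_type i = 0;
  while (i < input.size()) {
    if (input[i] == '_') {
      std::string::size_type end = input.find('_', i + 1);
      if (end != std::string::npos) {
        macro_map::const_iterator it =
            macros.find(input.substr(i + 1, end - i - 1));
        if (it != macros.end()) {
          out += expand(macros, it->second, depth + 1);
          i = end + 1;
          continue;
        }
      }
    }
    out += input[i++];
  }
  return out;
}
```

If header is defined as "<h1>_querytitle_</h1>" and querytitle as "Search", expanding "_header_ text" gives "<h1>Search</h1> text", while an undefined name such as _foo_ passes through untouched.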
The receptionist code
The principal objects in the receptionist have now been described. Below we detail the supporting classes, which reside in GSDLHOME/src/recpt. Except where efficiency is paramount—in which case definitions are in-line—implementation details are contained within a header file's .cpp counterpart. Supporting files often include the word tool as part of the file name, as in OIDtools.h and formattools.h.
A second set of lexically scoped files include the prefix z3950. The files provide remote access to online databases and catalogs that make their content publicly available using the Z39.50 protocol.
Another large group of supporting files include the term browserclass. These files are related through a virtual inheritance hierarchy. As a group they support an abstract notion of browsing: serial page generation of compartmentalised document content or metadata. Browsing activities include perusing documents ordered alphabetically by title or chronologically by date; progressing through the titles returned by a query ten entries at a time; and accessing individual pages of a book using the “go to page” mechanism. Each browsing activity inherits from browserclass, the base class:
- datelistbrowserclass provides support for chronological lists;
- hlistbrowserclass provides support for horizontal lists;
- htmlbrowserclass provides support for pages of html;
- invbrowserclass provides support for invisible lists;
- pagedbrowserclass provides go to page support;
- vlistbrowserclass provides support for vertical lists.
Actions access browserclass objects through browsetools.h.
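The shape of the hierarchy can be sketched as follows. The member output_list(), the browser_map typedef and the two free functions are hypothetical simplifications; the real browserclass interface takes protocol, display and formatting arguments that are omitted here.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef std::vector<std::string> item_list;

// Cut-down base class: derived browsers override what they need.
class browserclass {
public:
  virtual ~browserclass() {}
  virtual std::string get_browser_name() { return ""; }
  virtual std::string output_list(const item_list &) { return ""; }
};

// Vertical list: one item per line.
class vlistbrowserclass : public browserclass {
public:
  std::string get_browser_name() { return "VList"; }
  std::string output_list(const item_list &items) {
    std::string out;
    for (std::size_t i = 0; i < items.size(); ++i) out += items[i] + "<br>\n";
    return out;
  }
};

// Horizontal list: items on one line, separated by bars.
class hlistbrowserclass : public browserclass {
public:
  std::string get_browser_name() { return "HList"; }
  std::string output_list(const item_list &items) {
    std::string out;
    for (std::size_t i = 0; i < items.size(); ++i) {
      if (i > 0) out += " | ";
      out += items[i];
    }
    return out;
  }
};

// Browsers are registered under their name and looked up by list type.
typedef std::map<std::string, browserclass *> browser_map;

void add_browser(browser_map &m, browserclass *b) {
  m[b->get_browser_name()] = b;
}

std::string browse(const browser_map &m, const std::string &type,
                   const item_list &items) {
  browser_map::const_iterator it = m.find(type);
  return it == m.end() ? std::string() : it->second->output_list(items);
}
```

Because dispatch goes through the base class pointer, code that calls browse() does not need to know which concrete browser handles a given list type.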
<tblcaption table_table_2|##HIDDEN##></tblcaption>
File | Description |
OIDtools.h | Function support for evaluating document identifiers over the protocol. |
action.h | Base class for the Actions entity depicted in Figure <imgref figure_greenstone_runtime_system>. |
authenaction.h | Inherited action for handling authentication of a user. |
browserclass.h | Base class for abstract browsing activities. |
browsetools.h | Function support that accesses the browserclass hierarchy. Functionality includes expanding and contracting contents, outputting a table of contents, and generating controls such as the “go to page” mechanism. |
cgiargs.h | Defines cgiarginfo used in Figure <imgref figure_using_the_cgiargsinfoclass_from_pageactioncpp>, and other data structure support for CGI arguments. |
cgiutils.h | Function support for CGI arguments using the data structures defined in cgiargs.h. |
cgiwrapper.h | Function support that does everything necessary to output a page using the CGI protocol. Access is through the function void cgiwrapper (receptionist &recpt, text_t collection); which is the only function declared in the header file. Everything else in the .cpp counterpart is lexically scoped to be local to the file (using the C++ keyword static). If the function is being run for a particular collection then collection should be set, otherwise it should be the empty string "". The code includes support for Fast-CGI. |
collectoraction.h | Inherited action that facilitates end-user collection-building through the Collector. The page generated comes from collect.dm and is controlled by the CGI argument p=page. |
comtypes.h | Core types for the protocol. |
converter.h | Object support for stream converters. |
datelistbrowserclass.h | Inherited from browserclass, this object provides browsing support for chronological lists such as that seen in the Greenstone Archives collection under “dates” in the navigation bar. |
documentaction.h | Inherited action used to retrieve a document or part of a classification hierarchy. |
extlinkaction.h | Inherited action that controls whether or not a user goes straight to an external link or passes through a warning page alerting the user to the fact that they are about to move outside the digital library system. |
formattools.h | Function support for parsing and evaluating collection configuration format statements. Described in more detail in Section formatting above. |
historydb.h | Data structures and function support for managing a database of previous queries so a user can start a new query that includes previous query terms. |
hlistbrowserclass.h | Inherited from browserclass, this object provides browsing support for horizontal lists. |
htmlbrowserclass.h | Inherited from browserclass, this object provides browsing support for html pages. |
htmlgen.h | Function support to highlight query terms in a text_t string. |
htmlutils.h | Function support that converts a text_t string into the equivalent html. The symbols ", &, <, and > are converted into &quot;, &amp;, &lt; and &gt; respectively. |
infodbclass.h | Defines two classes: gdbmclass and infodbclass. The former provides the Greenstone API to gdbm; the latter is the object class used to store a record entry read in from a gdbm database, and is essentially an associative array of integer-indexed arrays of text_t strings. |
invbrowserclass.h | Inherited from browserclass, this object provides browsing support for lists that are not intended for display (invisible). |
nullproto.h | Inherited from recptproto, this class realises the null protocol, implemented through function calls from the receptionist to the collection server. |
pageaction.h | Inherited action that, in conjunction with the macro file named in p=page, generates a web page. |
pagedbrowserclass.h | Inherited from browserclass, this object provides browsing support for the “go to page” mechanism seen (for example) in the Gutenberg collection. |
pingaction.h | Inherited action that checks to see whether a particular collection is responding. |
queryaction.h | Inherited action that takes the stipulated query, settings and preferences and performs a search, generating as a result the subset of o=num matching documents starting at position r=num. |
querytools.h | Function support for querying. |
receptionist.h | Top-level object for the receptionist. Maintains a record of CGI argument information, instantiations of each inherited action, instantiations of each inherited browser, the core macro language object displayclass, and all possible converters. |
recptconfig.h | Function support for reading the site and main configuration files. |
recptproto.h | Base class for the protocol. |
statusaction.h | Inherited action that generates, in conjunction with status.dm, the various administration pages. |
tipaction.h | Inherited action that produces, in conjunction with tip.dm, a web page containing a tip taken at random from a list of tips stored in tip.dm. |
userdb.h | Data structure and function support for maintaining a gdbm database of users: their password, groups, and so on. |
usersaction.h | An administrator action inherited from the base class that supports adding and deleting users, as well as modifying the groups they are in. |
vlistbrowserclass.h | Inherited from browserclass, this object provides browsing support for vertical lists, the mainstay of classifiers. For example, the children of the node for titles beginning with the letter N are stipulated to be a VList. |
z3950cfg.h | Data structure support for the Z39.50 protocol. Used by z3950proto.cpp, which defines the main protocol class (inherited from the base class recptproto), and by the configuration file parser zparse.y (written using Yacc). |
z3950proto.h | Inherited from recptproto, this class realises the Z39.50 protocol so that a Greenstone receptionist can access remote library sites running Z39.50 servers. |
z3950server.h | Further support for the Z39.50 protocol. |
Initialisation
Initialisation in Greenstone is an intricate operation that processes configuration files and assigns default values to data fields. In addition to inheritance and constructor functions, core objects define init() and configure() functions to help standardise the task. Even so, the order of execution can be difficult to follow. This section describes what happens.
Greenstone uses several configuration files for different purposes, but all follow the same syntax. Unless a line starts with the hash symbol (#) or consists entirely of white space, the first word defines a keyword, and the remaining words represent a particular setting for that keyword.
The lines from configuration files are passed, one at a time, to configure() as two arguments: the keyword and an array of the remaining words. Based on the keyword, a particular version of configure() decides whether the information is of interest, and if so stores it. For example, collectserver (which maps to the Collection object in Figure <imgref figure_greenstone_runtime_system>) processes the format statements in a collection's configuration file. When the keyword format is passed to configure(), an if statement is triggered that stores in the object a copy of the function's second argument.
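The line-splitting step and a configure() that reacts to one keyword might look like the sketch below. It is a simplification under stated assumptions: words are split on white space only (any quoting rules of the real parser are not modelled), and parse_config_line() and collection_sketch are hypothetical names.

```cpp
#include <sstream>
#include <string>
#include <vector>

typedef std::string text_t;               // stand-in for Greenstone's text_t
typedef std::vector<text_t> text_tarray;  // stand-in for text_tarray

// Splits one configuration line into a keyword and the remaining words.
// Returns false for comment lines and lines that are entirely white space.
bool parse_config_line(const std::string &line, text_t &key,
                       text_tarray &cfgline) {
  key.clear();
  cfgline.clear();
  std::istringstream in(line);
  if (!(in >> key) || key[0] == '#') return false;
  text_t word;
  while (in >> word) cfgline.push_back(word);
  return true;
}

// A configure() that is only interested in the keyword "format".
struct collection_sketch {
  std::vector<text_tarray> formats;

  void configure(const text_t &key, const text_tarray &cfgline) {
    if (key == "format") formats.push_back(cfgline);  // keep a copy
    // every other keyword is silently ignored by this object
  }
};
```

A caller would read the file line by line, skip lines for which parse_config_line() returns false, and hand the rest to configure().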
After processing the keyword and before the function terminates, some versions of configure() pass the data to configure() functions in other objects. The Receptionist object calls configure() for Actions, Protocols, and Browsers. The NullProtocol object calls configure() for each Collection object; Collection calls Filters and Sources.
In C++, data fields are normally initialized by the object's constructor function. However, in Greenstone some initialisation depends on values read from configuration files, so a second round of initialisation is needed. This is the purpose of the init() member functions, and in some cases it leads to further calls to configure().
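The two rounds of initialisation, and the way configure() calls are handed down to contained objects, can be sketched as follows. The class names component and receptionist_sketch are hypothetical, and the only keyword handled is gsdlhome; real objects respond to many more.

```cpp
#include <cstddef>
#include <string>
#include <vector>

typedef std::string text_t;
typedef std::vector<text_t> text_tarray;

class component {
public:
  text_t gsdlhome;
  bool ready;

  component() : ready(false) {}  // round one: constructor defaults
  virtual ~component() {}

  virtual void configure(const text_t &key, const text_tarray &cfgline) {
    if (key == "gsdlhome" && !cfgline.empty()) gsdlhome = cfgline[0];
  }

  virtual bool init() {          // round two: depends on configured values
    ready = !gsdlhome.empty();
    return ready;
  }
};

// Forwards every configure() and init() call to the objects it holds.
class receptionist_sketch : public component {
public:
  std::vector<component *> children;  // actions, protocols, browsers

  void configure(const text_t &key, const text_tarray &cfgline) {
    component::configure(key, cfgline);
    for (std::size_t i = 0; i < children.size(); ++i)
      children[i]->configure(key, cfgline);
  }

  bool init() {
    bool ok = component::init();
    for (std::size_t i = 0; i < children.size(); ++i)
      ok = children[i]->init() && ok;
    return ok;
  }
};
```

In this sketch init() fails if it is called before gsdlhome has been configured, which is why configuration has to be completed first.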
<imgcaption figure_initialising_greenstone_using_the_null_protocol|%!– id:1136 –%Initialising Greenstone using the null protocol ></imgcaption>
============
Main program
============
Statically construct Receptionist
Statically construct NullProtocol
Establish the value for 'gsdlhome' by reading gsdlsite.cfg
Foreach directory in GSDLHOME/collect that isn't "modelcol":
  Add directory name (now treated as collection name) to NullProtocol:
    Dynamically construct Collection
    Dynamically construct Gdbm class
    Dynamically construct the Null Filter
    Dynamically construct the Browse Filter
    Dynamically construct MgSearch
    Dynamically construct the QueryFilter
    Dynamically construct the MgGdbmSource
    Configure Collection with 'collection'
      Passing 'collection' value on to Filters and Sources:
    Configure Receptionist with 'collectinfo':
      Passing 'collectinfo' value on to Actions, Protocols, and Browsers:
Add NullProtocol to Receptionist
Add in UTF-8 converter
Add in GB converter
Add in Arabic converter
Foreach Action:
  Statically construct Action
  Add Action to Receptionist
Foreach Browsers:
  Statically construct Browser
  Add Browser to Receptionist
Call function cgiwrapper:
  =================
  Configure objects
  =================
  Configure Receptionist with 'collection'
    Passing 'collection' value on to Actions, Protocols, and Browsers:
      NullProtocol not interested in 'collection'
  Configure Receptionist with 'httpimg'
    Passing 'httpimg' value on to Actions, Protocols, and Browsers:
      NullProtocol passing 'httpimg' on to Collection
        Passing 'httpimg' value on to Filters and Sources:
  Configure Receptionist with 'gwcgi'
    Passing 'gwcgi' value on to Actions, Protocols, and Browsers:
      NullProtocol passing 'gwcgi' on to Collection
        Passing 'gwcgi' value on to Filters and Sources:
  Reading in site configuration file gsdlsite.cfg
    Configure Receptionist with 'gsdlhome'
      Passing 'gsdlhome' value on to Actions, Protocols, and Browsers:
        NullProtocol passing 'gsdlhome' on to Collection
          Passing 'gsdlhome' value on to Filters and Sources:
    Configure Receptionist with ...
    ... and so on for all entries in gsdlsite.cfg
  Reading in main configuration file main.cfg
    Configure Receptionist with ...
    ... and so on for all entries in main.cfg
  ====================
  Initialising objects
  ====================
  Initialise the Receptionist
    Configure Receptionist with 'collectdir'
      Passing 'collectdir' value on to Actions, Protocols, and Browsers:
        NullProtocol not interested in 'collectdir'
    Read in Macro files
    Foreach Actions
      Initialise Action
    Foreach Protocol
      Initialise Protocol
      When Protocol==NullProtocol:
        Foreach Collection
          Reading Collection's build.cfg
          Reading Collection's collect.cfg
            Configure Collection with 'creator'
              Passing 'creator' value on to Filters and Sources:
            Configure Collection with 'maintainer'
              Passing 'maintainer' value on to Filters and Sources:
            ... and so on for all entries in collect.cfg
    Foreach Browsers
      Initialise Browser
  =============
  Generate page
  =============
  Parse CGI arguments
  Execute designated Action to produce page
End.
Figure <imgref figure_initialising_greenstone_using_the_null_protocol> shows diagnostic statements generated from a version of Greenstone augmented to highlight the initialisation process. The program starts in the main() function in recpt/librarymain.cpp. It constructs a Receptionist object and a NullProtocol object, then scans gsdlsite.cfg (located in the same directory as the library executable) for gsdlhome and stores its value in a variable. For each online collection (as established by reading in the directories present in GSDLHOME/collect) it constructs, through the NullProtocol object, a Collection object that includes within it Filters, Search and Source, and makes a few hardwired calls to configure().
Next main() adds the NullProtocol object to the Receptionist, which keeps a base class array of protocols in a protected data field, and then sets up several converters. main() constructs all Actions and Browsers used in the executable and adds them to the Receptionist. The function concludes by calling cgiwrapper() in cgiwrapper.cpp, which itself includes substantial object initialisation.
There are three sections to cgiwrapper(): configuration, initialisation and page generation. First some hardwired calls to configure() are made. Then gsdlsite.cfg is read and configure() is called for each line. The same is done for etc/main.cfg.
The second phase of cgiwrapper() makes calls to init(). The Receptionist makes only one call to its init() function, but the act of invoking this calls init() functions in the various objects stored within it. First a hardwired call to configure() is made to set collectdir, then the macro files are read. For each action, its init() function is called. The same occurs for each protocol stored in the receptionist—but in the system being described only one protocol is stored, the NullProtocol. Calling init() for this object causes further configuration: for each collection in the NullProtocol, its collection-specific build.cfg and collect.cfg are read and processed, with a call to configure() for each line.
The final phase of cgiwrapper() is to parse the CGI arguments, and then call the appropriate action. Both these calls are made with the support of the Receptionist object.
The reason for the separation of the configuration, initialisation, and page generation code is that Greenstone is optimised to be run as a server (using Fast-CGI, or the CORBA protocol, or the Windows Local Library). In this mode of operation, the configuration and initialisation code is executed once, then the program remains in memory and generates many web pages in response to requests from clients, without requiring re-initialisation.
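The benefit of this separation can be shown with a very small model of a persistent process. Everything in the sketch is hypothetical (library_sketch, its counters and serve()); it only illustrates that the expensive phases run once however many pages are generated.

```cpp
#include <cstddef>
#include <string>
#include <vector>

class library_sketch {
public:
  int configure_runs;  // how often configuration and initialisation ran
  int pages_served;    // how many pages were generated

  library_sketch() : configure_runs(0), pages_served(0) {}

  // Stands in for the configuration and initialisation phases.
  void configure_and_init() { ++configure_runs; }

  // Stands in for the page generation phase.
  std::string generate_page(const std::string &cgi_query) {
    ++pages_served;
    return "page for " + cgi_query;
  }
};

// Persistent mode: set up once, then answer each request in turn.
std::vector<std::string> serve(library_sketch &lib,
                               const std::vector<std::string> &requests) {
  lib.configure_and_init();
  std::vector<std::string> pages;
  for (std::size_t i = 0; i < requests.size(); ++i)
    pages.push_back(lib.generate_page(requests[i]));
  return pages;
}
```

Under plain CGI, by contrast, every request starts a fresh process, so both counters would advance together.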