Greenstone3 And Seaweed

The Pei Jones project was undertaken over the summer of 2010-2011. The main goal of the project was to present a digital collection of images and OCR'd text in Greenstone. A secondary goal was to provide functionality that is not standard in a Greenstone Digital Library, such as live editing of metadata and full text.

Switching On CGI for Tomcat
The live metadata is updated through Perl scripts that Tomcat runs, so the first step is to activate CGI for Tomcat. A page on the Greenstone Wiki already summarises how to do this; see: Remote Greenstone3. Follow these steps from the Installation section of that page:
<ul>
<li>Step 2.1.b (Java 1.6 was installed)</li>
<li>Step 2.2</li>
<li>Step 3</li>
<li>Step 4</li>
<li>Step 5, you can go ahead and do this step for all *.pl files in this directory</li>
<li>Step 6</li>
<li>Step 8</li>
</ul>
Some of these steps may already be done by default in your version of Greenstone.

Step 6 describes how to check that gliserver.pl is working. This work, however, uses metadata-server.pl, which you can check by visiting: http://<your-machine-name>: /greenstone3/cgi-bin/metadata-server.pl? You should get a page back starting with "No action (a=...) specified."

Splicing Seaweed into the Document View (Editing Metadata)
As mentioned in the last section this work makes significant use of metadata-server.pl; this is because (through metadataaction.pm) it provides actions such as set_live_metadata and set_metadata. Perl server programs can be found in $GS3Home\web\WEB-INF\cgi and their associated actions can be found in $GS3Home\gs2build\perllib\cgiactions.

Modifications to document.xsl
In order to make use of the actions associated with metadata-server.pl modifications are made to document.xsl ($GS3Home\web\interfaces\gs2\transform\document.xsl) and methods from gsajaxapi.js are used ($GS3Home\web\interfaces\gs2\js\gsajaxapi.js).

Complete Code
The following is the full code inserted into the pageStyle template of document.xsl; each part of it is explained in the sections that follow. Editing the pageStyle template inserts the added code into every collection page that Greenstone serves up.

<script type="text/javascript" src="interfaces/gs2/js/gsajaxapi.js"><xsl:text disable-output-escaping="yes"> </xsl:text></script>
<script type="text/javascript" src="interfaces/gs2/js/seaweed.js"><xsl:text disable-output-escaping="yes"> </xsl:text></script>
<script type="text/javascript">
<xsl:text disable-output-escaping="yes">
<![CDATA[
var gsapi = new GSAjaxAPI("/greenstone3/cgi-bin/gliserver.pl","jonesmin");

window.onload = function() {
    try {
        de.init();

        if (window.addEventListener) {
            window.addEventListener('beforeunload',saveMetadata,true);
        }
        else {
            window.attachEvent('onbeforeunload',saveMetadata);
        }

        de.doc.declarePropertySets({
            metadata: {
                phMarkup: ' [Enter a value] ',
                name: "metadata"
            }
        });
    }
    catch(err) {
        alert("SeaWeed failed to initialise: " + err.message);
    }
}

function saveMetadataElement(editedElem) {
    //alert("in saveMetadataElement");
    var docoid = editedElem.getAttribute("docoid");
    //alert("the docoid is: " + docoid);
    if (gsdefined(docoid)) {
        var metaname = editedElem.getAttribute("metaname");
        var metapos = editedElem.getAttribute("metapos");
        var metavalue = editedElem.innerHTML;
        var metavalue = metavalue.replace(/ /g, " ");
        //metavalue = metavalue.replace(/ /g, " ");
        metavalue = escape(metavalue);
        //alert("metaname: " + metaname + ";metapos: " + metapos + ";metavalue: " + metavalue);

        // alert("docoid = " + docoid + " metaname = " + metaname + " metapos = " + metapos + " metavalue = " + metavalue);

        // console.log("docoid = " + docoid + " metaname = " + metaname + " metapos = " + metapos + " metavalue = " + metavalue);

        gsapi.setDocumentMetadata(docoid,metaname,metapos,metavalue);
        //alert("document metadata set");
        // figure out if needs to be exploded or if can be saved with setImportMetadata
        var needsExploding = 0;

        var metanameParts = metaname.split(/\./);
        //alert("metanameParts: " + metanameParts);
        if (metanameParts.length==1) {
            needsExploding = 1;
        }
        else if (metanameParts[0] == "ex") {
            needsExploding = 1;
        }
        var docParts = docoid.split(/\./);
        if (docParts.length>=2) {
            needsExploding = 1;
        }
        //alert("needsExploding: " + needsExploding);

        if (needsExploding) {
            if (confirm("Document needs to be exploded for this edit of " + metaname + " to be retained.\nProceed?")) {
                gsapi.explodeDocument(docoid);
                //alert("document exploded")
            }
        }
        else {
            gsapi.setImportMetadata(docoid,metaname,metapos,metavalue);
        }
    }
    return true;
}

function gsdefined(val) {
    return (typeof(val) != "undefined");
}

function saveMetadata() {
    //alert("in saveMetadata");
    var editedArray = de.Changes.getChangedEditableSections();
    //alert("obtained edited sections");
    // editedHashSet.iterate(function(item) { console.log(item.innerHTML); return true } );

    if (editedArray.length>0) {
        var commitChanges = confirm("Commit edited metadata?");
        if (commitChanges) {
            //alert("Commiting Changes: " + editedArray.length);
            for (var i=0; i<editedArray.length; i++) {
                //alert("Commiting change: " + i);
                saveMetadataElement(editedArray[i]);
            }
        }

        // whether committed or not, now clear list of edited values
        //clearEditedElements();
        gsapi.urlGetSync("library?a=s&sa=c&sc=jonesmin");
    }
}
]]>
</xsl:text>
</script>

Note: There is some CSS already present in this part of document.xsl; it is left unmodified. The code was written as part of a project that produced Seaweed; it makes use of GSAjaxApi (gsajaxapi.js) as well as the main seaweed.js.

On Load explained
window.onload = function() {
    try {
        de.init();

        if (window.addEventListener) {
            window.addEventListener('beforeunload',saveMetadata,true);
        }
        else {
            window.attachEvent('onbeforeunload',saveMetadata);
        }

        de.doc.declarePropertySets({
            metadata: {
                phMarkup: ' [Enter a value] ',
                name: "metadata"
            }
        });
    }
    catch(err) {
        alert("SeaWeed failed to initialise: " + err.message);
    }
}

In the above block of code a function is attached to the window.onload event of the browser. When the page loads, this function initialises Seaweed, registers saveMetadata to be called when someone navigates away from the page, and declares a property set named metadata. If any of this fails, an alert reports the error.
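The cross-browser registration can be factored into a small helper. The following is a minimal sketch and not part of the project's code; the helper name onBeforeUnload and the fake window objects used to exercise it outside a browser are assumptions made for illustration.

```javascript
// Hypothetical helper: register a handler for the page-unload event,
// preferring the standard addEventListener and falling back to the
// legacy attachEvent offered by old versions of Internet Explorer.
function onBeforeUnload(win, handler) {
  if (typeof win.addEventListener === "function") {
    win.addEventListener("beforeunload", handler, true);
    return "addEventListener";
  }
  if (typeof win.attachEvent === "function") {
    win.attachEvent("onbeforeunload", handler);
    return "attachEvent";
  }
  return "unsupported";
}

// Stand-ins for a browser window so the sketch can run outside a browser.
function makeModernWindow() {
  const calls = [];
  return {
    calls,
    addEventListener: (type, fn, capture) => calls.push([type, fn, capture]),
  };
}

function makeLegacyWindow() {
  const calls = [];
  return { calls, attachEvent: (type, fn) => calls.push([type, fn]) };
}
```

In the page itself the call would be onBeforeUnload(window, saveMetadata).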

Save Metadata explained
function saveMetadata() {
    var editedArray = de.Changes.getChangedEditableSections();

    if (editedArray.length>0) {
        var commitChanges = confirm("Commit edited metadata?");
        if (commitChanges) {
            for (var i=0; i<editedArray.length; i++) {
                saveMetadataElement(editedArray[i]);
            }
        }
        gsapi.urlGetSync("library?a=s&sa=c&sc=jonesmin");
    }
}

As mentioned above, the saveMetadata function is called when the user navigates away from the page. Three things happen here. Firstly, Seaweed is used to get the sections that are editable and have been changed. Secondly, if the user confirms, each change is passed to the saveMetadataElement method. That method is covered in the next section, but first it should be mentioned that Seaweed needs a little help to know which sections of the document are editable. Seaweed looks for elements that have been marked up in a particular way (in this project, span elements with the class editable-metadata); how these are specified is explained in the section Modifications of Format Statements. Thirdly, once all the modified elements have been saved, the GSAjaxApi is used to tell Tomcat to re-read its knowledge about the collections; otherwise values such as ex.Title will remain cached.
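The control flow of saveMetadata can be sketched with its collaborators passed in as arguments, which makes the three steps easy to exercise without a browser, Seaweed or Greenstone. This is an illustrative restructuring rather than the project's code; the function name commitEdits and its parameter names are hypothetical.

```javascript
// Sketch of the commit flow with injected collaborators.
//   changed - array of edited elements (from Seaweed in the real page)
//   ask     - function(message) returning true or false (confirm in a browser)
//   saveOne - function(element) that saves one edit
//   refresh - function() that asks the server to reload collection info
function commitEdits(changed, ask, saveOne, refresh) {
  if (changed.length === 0) {
    return "nothing-to-do"; // no edits: no prompt and no refresh
  }
  const commit = ask("Commit edited metadata?");
  if (commit) {
    for (let i = 0; i < changed.length; i++) {
      saveOne(changed[i]); // one call per edited element
    }
  }
  refresh(); // runs whether or not the user committed
  return commit ? "committed" : "discarded";
}
```

As in the original function, the refresh step runs even when the user declines to commit, and nothing at all happens when there are no edits.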

Save Metadata Element Explained
function saveMetadataElement(editedElem) {
    var docoid = editedElem.getAttribute("docoid");
    if (gsdefined(docoid)) {
        var metaname = editedElem.getAttribute("metaname");
        var metapos = editedElem.getAttribute("metapos");
        var metavalue = editedElem.innerHTML;
        var metavalue = metavalue.replace(/ /g, " ");
        metavalue = escape(metavalue);
        gsapi.setDocumentMetadata(docoid,metaname,metapos,metavalue);
        var needsExploding = 0;

        var metanameParts = metaname.split(/\./);
        if (metanameParts.length==1) {
            needsExploding = 1;
        }
        else if (metanameParts[0] == "ex") {
            needsExploding = 1;
        }
        var docParts = docoid.split(/\./);
        if (docParts.length>=2) {
            needsExploding = 1;
        }

        if (needsExploding) {
            if (confirm("Document needs to be exploded for this edit of " + metaname + " to be retained.\nProceed?")) {
                gsapi.explodeDocument(docoid);
            }
        }
        else {
            gsapi.setImportMetadata(docoid,metaname,metapos,metavalue);
        }
    }
    return true;
}

This method is responsible for actually using the GSAjaxApi to save the metadata. Before it does that, it reads the required information (docoid, metaname, metapos and the edited value) from the edited element supplied by Seaweed. After updating the live metadata it may also need to explode the document, so that the change is also written to the import folder and survives a rebuild.
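The decision about whether a document must be exploded depends only on the docoid and the metadata name, so it can be written as a small pure function. This is a sketch: the function name is hypothetical, the three rules are taken from saveMetadataElement, and the comments give an interpretation of each rule rather than a statement from the Greenstone documentation.

```javascript
// Decide whether an edit requires the document to be exploded.
// Returns true if any of the three rules in saveMetadataElement applies.
function needsExploding(docoid, metaname) {
  const metanameParts = metaname.split(".");
  if (metanameParts.length === 1) {
    return true; // name has no namespace prefix, e.g. "Title"
  }
  if (metanameParts[0] === "ex") {
    return true; // name is in the "ex" namespace, e.g. "ex.Title"
  }
  if (docoid.split(".").length >= 2) {
    return true; // docoid contains a dot, e.g. "HASH0123.1"
  }
  return false; // otherwise setImportMetadata is enough
}
```

For example, needsExploding("HASH0123", "dc.Title") is false, while the same docoid with "ex.Title" gives true.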

Things of Note
There are a couple of things to watch out for in this section.
<ul>
<li>GSAjaxApi is declared with the collection specified as jonesmin; this is hard-coded but should not be.</li>
<li>All of this code is included directly in every web page that Greenstone serves. This was done for simplicity, but for the sake of efficiency and tidiness there should be some way of including it only when it is needed.</li>
</ul>
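One way to avoid the hard-coded collection name would be to read it from the page's own query string. The sketch below assumes that the page URL carries the collection in a parameter named c (the same name used for the collection elsewhere in this project); whether that holds for every Greenstone3 page is not verified here, and the helper name collectionFromQuery is hypothetical.

```javascript
// Hypothetical helper: read the collection name from the "c" parameter
// of a query string such as "?a=d&c=jonesmin&d=HASH01".
// Returns null when no collection is present.
function collectionFromQuery(query) {
  const trimmed = query.replace(/^\?/, "");
  for (const pair of trimmed.split("&")) {
    const eq = pair.indexOf("=");
    if (eq === -1) continue; // skip fragments without a value
    const key = decodeURIComponent(pair.slice(0, eq));
    if (key === "c") {
      return decodeURIComponent(pair.slice(eq + 1)) || null;
    }
  }
  return null;
}
```

In the page this could be used as new GSAjaxAPI(url, collectionFromQuery(window.location.search)), with a fallback for the case where null is returned.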

Modifications of Format Statements
This part of the project used Format Statements for two purposes; a third purpose appears in Section 3 of this document.
<ol>
<li>The usual formatting task: controlling how things were displayed</li>
<li>Marking up editable sections with tags that Seaweed can find</li>
</ol>

The Complete Format Statement
The following is the complete format statement for the display format (more is added in Section 3).

<gsf:option name="TOC" value="true"/>
<gsf:template match="documentNode" mode="content">
    <xsl:variable name="docID" select="@nodeID"/>docID = <xsl:value-of select="$docID"/>
    <xsl:variable name="httpPath" select="/page/pageResponse/collection/metadataList/metadata[@name='httpPath']"/>
    <xsl:variable name="Screen">
        <gsf:metadata name="Screen"/>
    </xsl:variable>
    <xsl:variable name="Image">
        <gsf:metadata name="Image"/>
    </xsl:variable>
    <xsl:variable name="assocPath">
        <gsf:metadata name="root_assocfilepath"/>
    </xsl:variable>
    <xsl:if test="metadataList/metadata[@name='Title']">
        <span class="editable-metadata" docoid="{$docID}" metaname="ex.Title" metapos="0">
            <xsl:value-of disable-output-escaping="yes" select="metadataList/metadata[@name='Title']"/>
        </span>
    </xsl:if>
    <xsl:if test="metadataList/metadata[@name='Image']">
    </xsl:if>
</gsf:template>
<xsl:template match="nodeContent">
    <xsl:for-each select="node">
        <xsl:choose>
            <xsl:when test="not(name)">
                <xsl:value-of disable-output-escaping="yes" select="."/>
            </xsl:when>
            <xsl:otherwise>
                <xsl:apply-templates select="."/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:for-each>
</xsl:template>

Variables Explained
<xsl:variable name="docID" select="@nodeID"/>docID = <xsl:value-of select="$docID"/>
<xsl:variable name="httpPath" select="/page/pageResponse/collection/metadataList/metadata[@name='httpPath']"/>
<xsl:variable name="Screen">
    <gsf:metadata name="Screen"/>
</xsl:variable>
<xsl:variable name="Image">
    <gsf:metadata name="Image"/>
</xsl:variable>
<xsl:variable name="assocPath">
    <gsf:metadata name="root_assocfilepath"/>
</xsl:variable>

This section of the Format Statement declares variables for use in the rest of the format statement.

Marking Title for Seaweed Explained
<xsl:if test="metadataList/metadata[@name='Title']">
    <span class="editable-metadata" docoid="{$docID}" metaname="ex.Title" metapos="0">
        <xsl:value-of disable-output-escaping="yes" select="metadataList/metadata[@name='Title']"/>
    </span>
</xsl:if>

This is the section where we specify that the title of the document can be edited. Seaweed looks for a span element with the class editable-metadata, and the other attributes (docoid, metaname and metapos) are used when interacting with the GSAjaxApi.

Showing and Linking Image; Displaying Text Explained
<xsl:if test="metadataList/metadata[@name='Image']"> </xsl:if>

This section of the code controls the layout of the image and its attached text. A screen version of the image is displayed, hyperlinked to the actual image. The test <xsl:if test="nodeContent.... checks that there is some full text associated with the document before displaying it.

Working With Full Text (OCR & Updating)
This part of the project set out to let the user re-OCR a page and manually edit the attached text. This was not as straightforward as allowing the title to be edited, as was done in Section 2 of this article, because Greenstone is not set up to treat full text the way it treats other metadata.

The Complete Format Statement
This is the updated Format Statement, with additions made for the new functionality.

<gsf:option name="TOC" value="true"/>
<gsf:template match="documentNode" mode="content">
    <xsl:variable name="docID" select="@nodeID"/>docID = <xsl:value-of select="$docID"/>
    <xsl:variable name="httpPath" select="/page/pageResponse/collection/metadataList/metadata[@name='httpPath']"/>
    <xsl:variable name="Screen">
        <gsf:metadata name="Screen"/>
    </xsl:variable>
    <xsl:variable name="Image">
        <gsf:metadata name="Image"/>
    </xsl:variable>
    <xsl:variable name="assocPath">
        <gsf:metadata name="root_assocfilepath"/>
    </xsl:variable>
    <xsl:if test="metadataList/metadata[@name='Title']">
        <span class="editable-metadata" docoid="{$docID}" metaname="ex.Title" metapos="0">
            <xsl:value-of disable-output-escaping="yes" select="metadataList/metadata[@name='Title']"/>
        </span>
    </xsl:if>
    <xsl:if test="metadataList/metadata[@name='Image']">
    </xsl:if>
</gsf:template>
<xsl:template match="nodeContent">
    <xsl:for-each select="node">
        <xsl:choose>
            <xsl:when test="not(name)">
                <xsl:value-of disable-output-escaping="yes" select="."/>
            </xsl:when>
            <xsl:otherwise>
                <xsl:apply-templates select="."/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:for-each>
</xsl:template>

The Addition of the Form
The table element present in the format statement of Section 2 was straightforward, but letting the user manipulate the full text makes things slightly more complicated. A form element is created with:
<ul>
<li>the full text inside a textarea element so that it can be edited (notice the name assigned to the text box)</li>
<li>a Save Changes button that calls a JavaScript method saveText when clicked, the idea being that the user can correct the OCR and save the changes</li>
<li>a Revert button that simply refreshes the page, the idea being that the user may want to discard changes before saving them</li>
<li>an Attempt to OCR button that calls a JavaScript method preformOcr when clicked, the idea being that a new OCR attempt might improve on a previous attempt</li>
<li>a hidden field with the id imageTif; the value of this field is required by the preformOcr JavaScript method so that it knows which image to OCR</li>
</ul>
The functions saveText and preformOcr are available on the page because of additions made to document.xsl, as described in the next sub-section.

Additions to document.xsl
Rather than reproducing all of the JavaScript present in document.xsl, I will show and explain only the new methods. Section 2.1 explains what was already present before these additions.

saveText function
function saveText() {
    var elems = document.getElementsByName("fulltext");
    var fulltext = "";
    for (var i = 0; i < elems.length; i++) {
        // use the loop index and read the text out of each element
        fulltext += elems[i].value;
    }
    gsapi.setFullText(fulltext,
                      document.getElementsByName("c")[0].value,
                      document.getElementsByName("d")[0].value);
}

As mentioned in sub-section 3.1.2, the saveText function is called by one of the buttons associated with the form surrounding the full text of the document. It makes use of:
<ul>
<li>the fact that the textarea element that contains the full text is named fulltext</li>
<li>the fact that the collection name and docId are stored in the document and can be looked up through their assigned names</li>
</ul>
I chose to give the textarea element a name rather than an id mostly in anticipation of a feature I would have liked to implement but did not have time for. See sub-section 3.5. The saveText function ends by calling another new function that has been created inside GSAjaxApi; this is discussed in the next sub-section.
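The concatenation step in saveText can be isolated as a small function. This is a minimal sketch, assuming each element exposes its text through a value property (as textarea elements do); the name collectFullText is hypothetical.

```javascript
// Join the current values of every element named "fulltext".
// `elems` is anything array-like whose items have a `value` property,
// for example the result of document.getElementsByName("fulltext").
function collectFullText(elems) {
  let fulltext = "";
  for (let i = 0; i < elems.length; i++) {
    fulltext += elems[i].value; // index with i, and read .value
  }
  return fulltext;
}
```

Using a name rather than an id is what makes more than one matching element possible, which is why the loop is needed at all.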

preformOcr function
function preformOcr() {
    gsapi.preformOcr(document.getElementsByName("c")[0].value,
                     document.getElementsByName("d")[0].value,
                     document.getElementById("imageTif").value);
}

The preformOcr function is a very basic function that simply calls a function of the same name in the GSAjaxApi. It looks up the collection and docid the same way saveText does, and it also uses the hidden element in the form that was mentioned in sub-section 3.1.2. The preformOcr function in GSAjaxApi is another newly created function that is discussed in the next section.

Additions to GSAjaxApi
The GSAjaxApi (gsajaxapi.js) file comes with methods capable of doing things like making HTTP get requests and updating standard metadata. Rather than listing them all here I will simply explain the methods added as part of this project.

setFullText function (Unfinished)
this.setFullText = function(fulltext,collectionName,docId) {
    var url = this.ocrserverURL;
    var params = "a=setText&c=" + collectionName + "&site=localsite&d=" + docId + "&newtext=" + fulltext;
    this.urlPostSync(url,params); //this.urlGetSync(url + "?" + params);
}

The setFullText function submits a request to the ocr-server asking it to perform the action setText. The ocr-server is another newly created part of this project and is explained in more detail in Section 3.4. This function counts as unfinished because of a problem with the way Perl programs interact with POST requests. Normally Perl programs are run via a GET request, but that is not possible in this case: the fulltext parameter that has to be passed through is likely to be too long for a GET request to handle.
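Because the full text can contain characters such as & and =, a request body built by plain string concatenation can be misread by the server. The following sketch shows one way to build the parameter string with every value encoded. It is not the project's code: the function name buildSetTextParams is hypothetical, and the parameter names are the ones used in setFullText.

```javascript
// Build an application/x-www-form-urlencoded body for the setText action.
// Each value is passed through encodeURIComponent so that characters such
// as "&", "=" and newlines in the OCR text cannot break the parameter list.
function buildSetTextParams(collectionName, docId, fulltext) {
  const fields = [
    ["a", "setText"],
    ["c", collectionName],
    ["site", "localsite"],
    ["d", docId],
    ["newtext", fulltext],
  ];
  return fields
    .map(([key, value]) => encodeURIComponent(key) + "=" + encodeURIComponent(value))
    .join("&");
}
```

The receiving script then needs to decode the values again, which standard CGI parameter parsing normally does.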

preformOcr function (Unfinished)
this.preformOcr = function(collectionName,docId,imageName) {
    var url = this.ocrserverURL + "?a=preformOcr&c=" + collectionName + "&site=localsite" + "&d=" + docId + "&imagename=" + imageName;
    var result = this.urlGetSync(url);
    alert(result);
    this.setFullText(result,collectionName,docId);
}

The preformOcr function makes a GET request to the ocr-server asking it to perform the preformOcr action on the specified image. This part works: the alert call pops up a dialog with the result of OCRing the image. Unfortunately, because the setFullText function is not working, the new OCR result cannot be made permanent.

urlPostSync function (Troublesome)
this.urlPostSync = function(url, params) {
    var xmlHttp;
    try {
        xmlHttp=new XMLHttpRequest();
    }
    catch (e) {
        try {
            xmlHttp=new ActiveXObject("Msxml2.XMLHTTP");
        }
        catch (e) {
            try {
                xmlHttp=new ActiveXObject("Microsoft.XMLHTTP");
            }
            catch (e) {
                alert("Your browser does not support AJAX!");
                return false;
            }
        }
    }
    // open must be called before any request headers are set
    xmlHttp.open("POST",url,false);
    xmlHttp.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
    xmlHttp.setRequestHeader("Content-length", params.length);
    xmlHttp.setRequestHeader("Connection","close");
    xmlHttp.send(params);
}

The urlPostSync function was designed to call the ocr-server with a POST request. Unfortunately it did not work during the project, which was put down to the way POST requests interact with the Perl programs. Note that XMLHttpRequest requires open to be called before setRequestHeader; the original version set the headers first, so the listing above makes the calls in the required order.
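The order in which the XMLHttpRequest methods are called matters: a real XMLHttpRequest raises an InvalidStateError if setRequestHeader is called before open. The sketch below shows the required order and uses a small fake object to demonstrate it outside a browser. The names postSync, createXhr and makeFakeXhr are hypothetical and exist only for this illustration.

```javascript
// Sketch of a synchronous POST that calls the XMLHttpRequest methods in
// the order the API requires: open, then setRequestHeader, then send.
// `createXhr` is a factory so that a fake can be injected for testing.
function postSync(createXhr, url, params) {
  const xhr = createXhr();
  xhr.open("POST", url, false); // false = synchronous
  xhr.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
  xhr.send(params);
  return xhr.responseText;
}

// Minimal fake that records calls and rejects headers set before open(),
// mimicking the behaviour of a real XMLHttpRequest in that respect.
function makeFakeXhr(log) {
  let opened = false;
  return {
    responseText: "",
    open(method, url, isAsync) {
      opened = true;
      log.push("open:" + method);
    },
    setRequestHeader(name, value) {
      if (!opened) throw new Error("InvalidStateError");
      log.push("header:" + name);
    },
    send(body) {
      log.push("send:" + body);
      this.responseText = "ok";
    },
  };
}
```

The sketch sets only Content-Type. Content-Length and Connection are headers that browsers manage themselves and do not allow scripts to set, so they are left out here.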

New Perl Scripts
Two new Perl scripts were created to allow manipulation of the full text associated with documents.
<ul>
<li>ocr-server.pl is located in $GS3HOME\web\WEB-INF\cgi</li>
<li>ocraction.pm is located in $GS3HOME\gs2build\perllib\cgiactions</li>
</ul>
Rather than describing each file in full, I will describe the parts deemed important. The parts I do not explain are common to all of the files in the corresponding directory.

main
sub main {
    my $gsdl_cgi = new gsdlCGI();
    $gsdl_cgi->setup_gsdl();
    my $gsdlhome = $ENV{'GSDLHOME'};
    $gsdl_cgi->checked_chdir($gsdlhome);

    require cgiactions::ocraction;
    $gsdl_cgi->parse_cgi_args();
    $gsdl_cgi->{'xml'} = 0;
    my $action = new ocraction($gsdl_cgi,$iis6_mode);
    $action->do_action();
}

The server files act as gateways. They parse the arguments passed in, construct the appropriate action object (ocraction in this case) and call its do_action method. While the ocraction object does not specifically have a do_action method, it can be used as if it did thanks to its constructor.

action-table variable
my $action_table = {
    "preformOcr" => { 'compulsory-args' => [ "d", "imagename" ],
                      'optional-args'   => [] },
    "setText"    => { 'compulsory-args' => [ "d", "newtext" ],
                      'optional-args'   => [] }
};

The action_table variable specifies what actions ocraction.pm can carry out. As it stands it can do:
<ul>
<li>preformOcr with parameters d and imagename</li>
<li>setText with parameters d and newtext</li>
</ul>
The parameter d is the docid and the other two are self-explanatory.
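To illustrate what a table of compulsory arguments makes possible, the sketch below mirrors the action table in JavaScript and checks a set of request arguments against it. It only illustrates the idea and is not how Greenstone's Perl code performs the check; the names actionTable and missingArgs are hypothetical.

```javascript
// JavaScript mirror of the Perl action table, for illustration only.
const actionTable = {
  preformOcr: { compulsoryArgs: ["d", "imagename"], optionalArgs: [] },
  setText: { compulsoryArgs: ["d", "newtext"], optionalArgs: [] },
};

// Return the compulsory arguments that are missing from `args`,
// or null if the action itself is not in the table.
function missingArgs(action, args) {
  if (!Object.prototype.hasOwnProperty.call(actionTable, action)) {
    return null;
  }
  return actionTable[action].compulsoryArgs.filter(
    (name) => args[name] === undefined || args[name] === ""
  );
}
```

A request for preformOcr that supplies d but not imagename would yield a list containing only imagename, which a server could report back to the caller.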

preformOcr
sub preformOcr {
    my $self = shift @_;

    my $collection = $self->{'collect'};
    my $site = $self->{'site'};
    my $gsdl_cgi = $self->{'gsdl_cgi'};
    my $docId = $self->{'d'};
    my $imagename = $self->{'imagename'};

    # Change URL style '/' to Windows style
    $imagename =~ s/\//\\/g;

    my $indexDir = "C:\\Users\\Bryce\\Desktop\\Research\\greenstone3-svn-bryce64\\web\\sites\\localsite\\collect\\jonesmin\\";
    ##my $hashDir = $archivesDir. substr($docId,0,8). ".dir\\";

    my $imageFile = $indexDir. $imagename;

    my $outputFile = $ENV{'TEMP'}. "\\out";
    my $cmd = "tesseract $imageFile $outputFile";
    ## print STDERR "\n\n CMD=\n$cmd \n\n";

    my $status = system($cmd);
    if ($status != 0) {
        print STDERR "\n\n Fail to run \n $cmd \n $! \n\n";
    }
    if (open (FILE, "<$outputFile". ".txt")==0) {
        $gsdl_cgi->generate_error("Unable to open file containing OCR'd text: $!");
        return;
    }
    my $result = "";
    my $line;
    while (defined ($line=<FILE>)) {
        $result .= $line;
    }
    close FILE;
    unlink($outputFile . ".txt");
    $gsdl_cgi->generate_ok_message($result);
}

The preformOcr method works by:
<ol>
<li>building up a file path to the image to OCR</li>
<li>building a file path in the temp directory for the result of the OCR</li>
<li>running tesseract, passing in the image to OCR and the base name of the output file</li>
<li>reading in the result and then deleting the file that held it</li>
<li>replying with the result of the OCR</li>
</ol>
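The first three steps above amount to assembling two paths and a command line. The sketch below shows that assembly in JavaScript (Node.js) with the collection directory and temp directory supplied by the caller instead of being hard-coded, and with the path module choosing the directory separators. It is an illustration only; the function name buildOcrJob and its parameters are hypothetical, and the example paths are made up.

```javascript
const path = require("path");

// Hypothetical helper: assemble the paths and command for one OCR run.
//   pathLib    - path, path.win32 or path.posix (decides the separators)
//   collectDir - directory of the collection that holds the image
//   imageName  - image location in URL style, e.g. "index/assoc/page.tif"
//   tempDir    - directory for the temporary OCR output
function buildOcrJob(pathLib, collectDir, imageName, tempDir) {
  // Split the URL-style name so the join inserts the right separators.
  const imageFile = pathLib.join(collectDir, ...imageName.split("/"));
  const outputBase = pathLib.join(tempDir, "out");
  return {
    imageFile,
    outputBase,
    outputFile: outputBase + ".txt", // tesseract appends ".txt" itself
    command: "tesseract",
    args: [imageFile, outputBase],
  };
}
```

Keeping the command and its arguments separate means they can be handed to something like Node's child_process.execFileSync(job.command, job.args) without building a shell string, which avoids quoting problems when a path contains spaces.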

Things of note
<ul>
<li>tesseract (the OCR program) is assumed to be on your path</li>
<li>The indexDir variable has been hard-coded</li>
<li>The image filename handling has been hard-coded to work with Windows directory separators</li>
</ul>

setText (Unfinished - Stub)
sub setText {
    my $self = shift @_;
    my $collection = $self->{'collect'};
    my $site = $self->{'site'};
    my $docId = $self->{'d'};
    my $newtext = $self->{'newtext'};
}

The setText method has only been implemented as a stub (it extracts its parameters and does nothing else) because there is currently no sensible way of calling it, due to the problems of POST requests interacting with Perl.

In order to finish implementing it you will need to: <ul> <li>Have it edit the appropriate file in the import and archives directory</li> <li>Run a buildcol</li> </ul>

Further functionality ideas
Once the ability to update the full text of a document is implemented, some ideas for further enhancements would be:
<ul>
<li>Allow rollbacks (a history of edits), so that if someone makes a mistake and saves it, it is easy to fix</li>
<li>Edit permissions: allow only certain IPs to edit the full text, rather than everyone</li>
<li>Allow sections of the full text to be split up, for example two different text areas that both count as part of the full text but are displayed in some meaningful way</li>
<li>The ability to activate and deactivate functionality like that described in this project through the preferences page of any Greenstone collection</li>
</ul>