NZDL projects and Demonstrations
New Zealand Digital Library Project members have developed a range of practical software packages in the course of their research. Much of this software is available for download.
Digital libraries and indexing
Greenstone is the digital library system that generates most of the pages of the New Zealand Digital Library website. It is freely available under the GNU General Public License and has been adopted by numerous other projects. Humanitarian organisations, including Global Help Projects and United Nations organisations, use it to disseminate information. Greenstone is available for download from http://www.greenstone.org/download.
MG is an enhancement of the Managing Gigabytes full-text retrieval system that provides flexible stemming methods, weighting terms, term frequencies, merged indexes, machine-independent indexes, and a port to MS-DOS.
PreScript converts PostScript to plain ASCII or HTML. It detects paragraph boundaries, removes hyphenation, and interprets many ligatures.
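The three clean-up steps named above (paragraph detection, hyphenation removal, ligature interpretation) can be illustrated on already-extracted plain text. This is only a sketch of the idea, not PreScript's implementation: PreScript works from PostScript, whereas this example assumes the text has already been extracted, that paragraphs are separated by blank lines, and that ligatures appear as Unicode presentation forms. The function name and the ligature table are hypothetical.

```python
import re

# Assumption: ligatures arrive as Unicode presentation forms (U+FB00..U+FB04).
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def clean_extracted_text(raw):
    """Return a list of paragraphs with ligatures expanded and
    end-of-line hyphenation removed (hypothetical helper)."""
    # 1. Expand ligature characters into their component letters.
    for ligature, plain in LIGATURES.items():
        raw = raw.replace(ligature, plain)
    # 2. Join words that were split by a hyphen at the end of a line.
    #    This simple rule also joins genuinely hyphenated compounds.
    raw = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # 3. Treat blank lines as paragraph boundaries and reflow each paragraph.
    paragraphs = re.split(r"\n\s*\n", raw)
    return [" ".join(p.split()) for p in paragraphs if p.strip()]


sample = "The \ufb01rst para-\ngraph spans\ntwo lines.\n\nSecond paragraph."
for paragraph in clean_extracted_text(sample):
    print(paragraph)
```

A real converter would need layout information such as font changes and line positions to find paragraph boundaries reliably; the blank-line rule here is only a stand-in.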
Extracting and enriching data and metadata
Sequitur is a method for inferring compositional hierarchies from strings: it detects repetition and factors it out of the string by forming rules in a grammar. Sequitur is useful for recognizing lexical structure in strings, and excels at very long sequences. The Sequitur WWW interface detects structure in text sequences. See also the Wikipedia page on the Sequitur algorithm.
Kea is a program for automatically extracting keywords and keyphrases from the full text of documents. Candidate keyphrases are identified using rudimentary lexical processing, features are computed for each candidate, and machine learning is used to determine which candidates should be assigned as keyphrases.
Maui is an indexing tool that automatically identifies main topics in text documents. Depending on the task, topics are tags, keywords, keyphrases, vocabulary terms, descriptors, index terms or titles of Wikipedia articles. Maui builds on the Kea algorithm but provides additional functionality: it can assign topics to documents based on terms from Wikipedia, using Wikipedia Miner. Maui also has many new features that help identify topics more accurately.
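The idea of factoring repetition out of a string into grammar rules can be sketched in a few lines. Note that this is not the Sequitur algorithm itself: Sequitur works incrementally and enforces its own constraints on the grammar as each symbol arrives, whereas the sketch below swaps in a simpler greedy, offline scheme that repeatedly replaces a repeated pair of adjacent symbols with a new rule. The function names and the `R1`, `R2`, ... rule names are assumptions made for illustration.

```python
from collections import Counter


def replace_pair(seq, pair, name):
    """Replace non-overlapping occurrences of `pair` in `seq` with `name`.
    Returns the new sequence and the number of replacements made."""
    out, i, replaced = [], 0, 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(name)
            i += 2
            replaced += 1
        else:
            out.append(seq[i])
            i += 1
    return out, replaced


def infer_grammar(text):
    """Greedy offline pair replacement (a simplification, NOT Sequitur).
    Terminals are single characters, so rule names like 'R1' cannot clash."""
    seq, rules = list(text), {}
    while True:
        name = f"R{len(rules) + 1}"
        found = False
        for pair, freq in Counter(zip(seq, seq[1:])).most_common():
            if freq < 2:
                break  # remaining pairs occur only once
            new_seq, replaced = replace_pair(seq, pair, name)
            if replaced >= 2:  # skip pairs that only repeat by overlapping
                rules[name] = pair
                seq, found = new_seq, True
                break
        if not found:
            return seq, rules


def expand(symbols, rules):
    """Expand a sequence of symbols back into the original string."""
    return "".join(expand(rules[s], rules) if s in rules else s for s in symbols)


start, rules = infer_grammar("abcabcabc")
print("S ->", " ".join(start))
for name, body in rules.items():
    print(name, "->", " ".join(body))
```

Expanding the start sequence with `expand(start, rules)` reproduces the input, which is a convenient check that the rules form a hierarchy describing the original string.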
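The first two stages of that pipeline, candidate identification and feature computation, can be sketched as follows. This is a simplified illustration and not Kea's code: the stopword list, the three-word limit, the smoothing in the IDF term and all function names are assumptions. The two features shown, TF x IDF and relative position of first occurrence, are the ones usually associated with Kea; the final machine learning step, which would be trained on documents with known keyphrases, is left out.

```python
import math
import re

# Assumption: a tiny illustrative stopword list.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "is", "are"}


def candidates(text, max_len=3):
    """Collect word n-grams that neither start nor end with a stopword.
    Returns {phrase: (count, index of first occurrence)} and the word total."""
    words = re.findall(r"[a-z]+", text.lower())
    found = {}
    for i in range(len(words)):
        for n in range(1, max_len + 1):
            if i + n > len(words):
                break
            gram = words[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            phrase = " ".join(gram)
            count, first = found.get(phrase, (0, i))
            found[phrase] = (count + 1, first)
    return found, len(words)


def features(text, doc_freq, num_docs):
    """Compute (tf-idf, relative first occurrence) for each candidate.
    `doc_freq` maps a phrase to the number of corpus documents containing it."""
    found, total = candidates(text)
    result = {}
    for phrase, (count, first) in found.items():
        tf = count / total
        idf = math.log((num_docs + 1) / (doc_freq.get(phrase, 0) + 1))
        result[phrase] = (tf * idf, first / total)
    return result


text = "Digital libraries store documents. Digital libraries index documents."
feats = features(text, doc_freq={"documents": 9}, num_docs=10)
for phrase, (tfidf, position) in sorted(feats.items(), key=lambda kv: -kv[1][0])[:3]:
    print(f"{phrase}: tfidf={tfidf:.3f} first={position:.2f}")
```

A classifier would then take these feature pairs as input and estimate, for each candidate, how likely it is to be a keyphrase.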
Wikipedia Miner is an open-source software system that allows researchers and developers to integrate Wikipedia's rich semantics into their own applications. The toolkit creates databases that contain summarized versions of Wikipedia's content and structure, and includes a Java API to provide access to them.
Realistic Books is a suite of programs for creating and interacting with a three-dimensional simulation of a paper-based book.
3D Book Visualizer
The 3D Book Visualizer is an early version of the Realistic Book software. It supports these interactive features:
- Spinning the book around
- Zooming in and out
- Turning a single page or a wodge of pages
- Flipping through key pages
- Switching between handling mode and reading mode.
It supports the PDF and DjVu document formats.
MAT: Metadata Analysis Tool
MAT is a tool for producing statistics and visualisations of repository metadata.
Phind is an interface for browsing the phrases that occur in a collection. The phrases form an approximation of the topics covered. They are extracted from the noun phrases occurring in the text, so nonsense phrases and phrases with very little information content are excluded. Each phrase is part of a hierarchy, and at any point the user can browse more specialised topics or retrieve the documents that contain the phrase. You can see Phind in action in the UN Food and Agriculture Organisation collection.
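The two browsing operations described above, moving to more specialised phrases and retrieving the documents that contain a phrase, can be sketched as follows. This is only an illustration of the idea and not Phind's implementation: it assumes the phrases have already been extracted, and the function names and the sample data are hypothetical.

```python
import re


def tokens(text):
    """Lower-case word tokens of a phrase or document."""
    return re.findall(r"\w+", text.lower())


def contains(longer, shorter):
    """True if the words of `shorter` occur contiguously in `longer`."""
    lw, sw = tokens(longer), tokens(shorter)
    return any(lw[i:i + len(sw)] == sw for i in range(len(lw) - len(sw) + 1))


def expansions(phrase, phrases):
    """More specialised phrases: longer phrases that contain `phrase`."""
    return sorted(p for p in phrases if p != phrase and contains(p, phrase))


def documents_with(phrase, documents):
    """Identifiers of the documents whose text contains `phrase`."""
    return [doc_id for doc_id, text in documents.items() if contains(text, phrase)]


# Hypothetical sample data, for illustration only.
phrases = ["forest", "forest management", "sustainable forest management", "rain forest"]
documents = {
    "d1": "Sustainable forest management in practice.",
    "d2": "Rain forest soils",
}

print(expansions("forest", phrases))
print(documents_with("forest management", documents))
```

Starting from a general phrase such as "forest", a user could repeatedly call `expansions` to narrow the topic and `documents_with` at any level to see the matching documents.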
The collage applet dynamically displays a given set of images. When an image is clicked, a new browser window opens and the associated URL is displayed.
The applet can be used in two different contexts: either within the Greenstone Digital Library Software or externally using a directory of images and associated links.
Collages have been included in the following Greenstone collections:
- First Aid in Pictures
A collage using a directory of images can be found at Ian Witten's Collage.
Chinese Text Segmentation
Word segmentation finds word boundaries in languages such as Chinese and Japanese, which (unlike English) are written without spaces or other word delimiters apart from punctuation marks. It plays a significant role in applications that use the word as the basic unit, because machine-readable Chinese text is invariably stored in unsegmented form.
We have implemented a WWW interface for segmenting Chinese text. The demo formerly available at www.nzdl.org/cgi-bin/congb is no longer running, but you can see an illustration of the transform at http://community.nzdl.org/www/chinese-text-segmenter/demo1.htm (originally at http://www.nzdl.org/chinese-text-segmenter/demo1.htm).
The code can be found on community.nzdl.org, in the chinese-text-segmenter directory.
More information can be found in the paper: A Compression-based Algorithm for Chinese Word Segmentation
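To make the task concrete, here is a minimal sketch of segmentation as a search for the cheapest way to split a string into words. It is not the compression-based algorithm of the paper: it swaps in a much simpler word-frequency model, where each word costs the number of bits needed to encode it, and uses dynamic programming to choose the split with the lowest total cost. The word counts, the cost for unknown characters, the maximum word length and the function name are all assumptions made for illustration.

```python
import math


def segment(text, word_costs, unknown_cost=20.0, max_word_len=4):
    """Split `text` into words so that the summed cost is minimal.
    `word_costs` maps known words to a cost in bits; any single character
    not in the table may still stand alone at `unknown_cost`."""
    n = len(text)
    best = [0.0] + [math.inf] * n   # best[i]: cheapest cost of text[:i]
    back = [0] * (n + 1)            # back[i]: start of the last word in text[:i]
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in word_costs:
                cost = word_costs[word]
            elif len(word) == 1:
                cost = unknown_cost
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the words.
    words, end = [], n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


# Hypothetical word counts, converted to costs of -log2(probability).
counts = {"我": 50, "喜欢": 30, "读书": 20, "读": 5, "书": 10}
total = sum(counts.values())
costs = {word: -math.log2(count / total) for word, count in counts.items()}

print(" ".join(segment("我喜欢读书", costs)))
```

Measuring cost in bits links the sketch loosely to the compression view: the preferred segmentation is the one that would be cheapest to encode under the model, although the model here is far cruder than the one the paper describes.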
Music Query Corpus
For details about the Waikato corpus of music queries, see our paper Forming a Corpus of Voice Queries for Music Information Retrieval: A Pilot Study.
Electronic Lexical Knowledge Base (ELKB) is software for accessing and exploring Roget's Thesaurus. It also provides solutions for various natural language processing tasks. All scripts were originally developed as part of Mario Jarmasz's Master's thesis at the University of Ottawa, Canada.