Differences

This shows you the differences between two versions of the page.

--- nzdl:projects [2017/09/25 01:34] – [Text Mining] kjdon
+++ nzdl:projects [2023/03/13 01:46] (current) – external edit 127.0.0.1
@@ Line 1: / Line 1: @@
 ====== NZDL projects and Demonstrations ======
-New Zealand Digital Library Project members have developed a range of practical software packages in the course of their research. Much of this software is available for [[download]].
+New Zealand Digital Library Project members have developed a range of practical software packages in the course of their research. Much of this software is available for download.
 =====Digital libraries and indexing=====
@@ Line 16: / Line 19: @@
-=====Extracting data and metadata=====
+===== Extracting and enriching data and metadata =====
 ====Sequitur====
-links dont work. http://sequence.rutgers.edu/sequitur/
+[[http://www.sequitur.info/ |Sequitur]] is a method for inferring compositional hierarchies from strings by detecting repetition and factoring it out of the string by forming rules in a grammar. Sequitur is useful for recognizing lexical structure in strings, and excels at very long sequences. The Sequitur WWW interface detects structure in text sequences. See also the Wikipedia page on the [[https://en.wikipedia.org/wiki/Sequitur_algorithm |Sequitur algorithm]].
-Sequitur is a method for inferring compositional hierarchies from strings by detecting repetition and factoring it out of the string by forming rules in a grammar. Sequitur is useful for recognizing lexical structure in strings, and excels at very long sequences. The Sequitur WWW interface detects structure in text sequences.
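The idea of factoring repetition out of a string into grammar rules can be illustrated with a small sketch. This is not the Sequitur algorithm itself, which works incrementally and enforces its own constraints as each symbol arrives; it is a simpler offline, Re-Pair-style substitution, swapped in here because it fits in a few lines. The function names (''count_digrams'', ''replace_pair'', ''infer_grammar'', ''expand'') and the rule names ''R1'', ''R2'', ... are hypothetical and do not come from the Sequitur software.

```python
from collections import Counter


def count_digrams(seq):
    """Count non-overlapping occurrences of each adjacent symbol pair."""
    counts = Counter()
    last_pos = {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        # Skip an occurrence that overlaps the previous one, e.g. "aaa".
        if last_pos.get(pair) == i - 1:
            continue
        counts[pair] += 1
        last_pos[pair] = i
    return counts


def replace_pair(seq, pair, name):
    """Replace each left-to-right occurrence of `pair` with the symbol `name`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(name)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def infer_grammar(text):
    """Return (top-level sequence, rules) with repeated digrams factored out.

    Assumption: an offline Re-Pair-style loop, not Sequitur's online procedure.
    """
    seq = list(text)
    rules = {}
    while True:
        counts = count_digrams(seq)
        if not counts:
            break
        pair, freq = max(counts.items(), key=lambda item: item[1])
        if freq < 2:
            break  # no digram repeats any more
        name = "R%d" % (len(rules) + 1)
        rules[name] = list(pair)
        seq = replace_pair(seq, pair, name)
    return seq, rules


def expand(seq, rules):
    """Expand a symbol sequence back into the original string."""
    parts = []
    for symbol in seq:
        if symbol in rules:
            parts.append(expand(rules[symbol], rules))
        else:
            parts.append(symbol)
    return "".join(parts)
```

For the input ''abab'' this yields the top-level sequence ''R1 R1'' with the single rule ''R1 -> a b'', and ''expand'' recovers the original string from any grammar the sketch produces.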
@@ Line 27: / Line 29: @@
 [[http://www.nzdl.org/Kea/|Kea]] is a program for automatically extracting keywords and keyphrases from the full text of documents. Candidate keyphrases are identified using rudimentary lexical processing, features are computed for each candidate, and machine learning is used to determine which candidates should be assigned as keyphrases.
-=====Text Mining=====
+==== Maui ====
-See our Text Mining Webpage. ?? what link? http://www.cs.waikato.ac.nz/~nzdl/textmining/
+[[https://code.google.com/archive/p/maui-indexer/ |Maui]] is an indexing tool that automatically identifies main topics in text documents. Depending on the task, topics are tags, keywords, keyphrases, vocabulary terms, descriptors, index terms or titles of Wikipedia articles. Maui builds on the Kea algorithm, but provides additional functionality: it allows the assignment of topics to documents based on terms from Wikipedia using Wikipedia Miner. Maui also has many new features that help identify topics more accurately.
+==== Wikipedia Miner ====
+[[http://nzdl.org/wikipediaminer | Wikipedia Miner]] is an open-source software system that allows researchers and developers to integrate Wikipediaʼs rich semantics into their own applications. The toolkit creates databases that contain summarized versions of Wikipediaʼs content and structure, and includes a Java API to provide access to them.
 =====Browsing interfaces=====
@@ Line 38: / Line 43: @@
 ==== 3D Book Visualizer ====
-The [[http://www.nzdl.org/html/open_the_book/|3D Book Visualizer]] is an early version of the Realistic Book software.  It supports these interactive features:
+The [[http://www.nzdl.org/open_the_book/|3D Book Visualizer]] is an early version of the Realistic Book software.  It supports these interactive features:
   * Spinning the book around
@@ Line 47: / Line 52: @@
 It supports the PDF and DjVu document formats.
+==== MAT: Metadata Analysis Tool ====
+[[nzdl:mat|MAT]] is a tool for producing statistics and visualisations of repository metadata.
 ==== Phind====
-Phind is an interface for browsing the phrases that occur in a collection. The phrases form an approximation of the topics covered. They are extracted from the noun-phrases occuring in the text, so nonsense phrases and phrases with very little information content are excluded. Each phrase is part of a hierarchy, and the user can browse more specialised topics, or retrieve documents that contain the phrase, at any point. You can see Phind in action in the [[http://collections.nzdl.org/gsdlmod?a=p&p=about&c=fi1998|UN Food and Agriculture Organisation collection]].
+[[http://www.nzdl.org/phind|Phind]] is an interface for browsing the phrases that occur in a collection. The phrases form an approximation of the topics covered. They are extracted from the noun phrases occurring in the text, so nonsense phrases and phrases with very little information content are excluded. Each phrase is part of a hierarchy, and the user can browse more specialised topics, or retrieve documents that contain the phrase, at any point. You can see Phind in action in the [[http://collections.nzdl.org/gsdlmod?a=p&p=about&c=fi1998|UN Food and Agriculture Organisation collection]].
 ==== Collage====
@@ Line 66: / Line 76: @@
 A collage using a directory of images can be found at [[http://www.cs.waikato.ac.nz/~ihw/collage/index.html|Ian Witten's Collage]].
-=====Word segmentation=====
+===== Chinese Text Segmentation=====
-[[http://www.nzdl.org/cgi-bin/congb]]
-[[http://www.nzdl.org/chinese/demo1.htm]] - there are some pages here but this redirects to chinese collection...
 Word segmentation finds word boundaries in languages such as Chinese and Japanese, which (unlike English) are written without spaces or other word delimiters apart from punctuation marks. It plays a significant role in applications that use the word as the basic unit, because machine-readable Chinese text is invariably stored in unsegmented form.
-We have implemented a WWW interface for segmanting Chinese text.
+We have implemented a WWW interface for segmenting Chinese text. A demo used to be available at www.nzdl.org/cgi-bin/congb but that is no longer running. You can see an illustration of the transform at [[http://www.nzdl.org/chinese-text-segmenter/demo1.htm]]. (Currently at [[http://community.nzdl.org/www/chinese-text-segmenter/demo1.htm]])
+(Note: the code can be found on the community.nzdl.org server, in the chinese-text-segmenter directory.)
+More information can be found in the paper: [[https://www.cs.waikato.ac.nz/~ihw/papers/00WT-YW-RMN-IHW-Comprsbased.pdf| A Compression-based Algorithm for Chinese Word Segmentation]]
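To make the segmentation task concrete, here is a minimal sketch of one simple baseline, greedy forward maximum matching against a word list. This is not the compression-based method described in the paper above; it is a different, dictionary-based technique chosen only because it is short enough to show inline. The function name ''segment'', the ''lexicon'' parameter and the sample word list are all hypothetical.

```python
def segment(text, lexicon, max_len=None):
    """Split unsegmented text into words by greedy forward maximum matching.

    At each position, take the longest lexicon entry that matches; if none
    matches, fall back to a single character. `lexicon` is any set of words
    (an assumption for this sketch, not part of the NZDL segmenter).
    """
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


if __name__ == "__main__":
    sample_lexicon = {"我们", "喜欢", "数字", "图书", "图书馆"}
    print(" ".join(segment("我们喜欢数字图书馆", sample_lexicon)))
```

With the sample word list, the longest match wins at each step, so the final three characters are kept together as 图书馆 rather than split into 图书 and 馆. A greedy matcher like this depends entirely on its word list and cannot handle words it has never seen.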
+===== Music Query Corpus =====
-If your web browsers does not support Chinese text, [[http://collections.nzdl.org/chinese/demo1.htm|illustrations of the transformation]] are available.
+For details about the [[http://community.nzdl.org/www/waikato-music-query-corpus/waikato-query-corpus.zip|Waikato corpus of music queries]], see our paper
+[[http://ismir2002.ismir.net/proceedings/03-SP04-2.pdf|Forming a Corpus of Voice Queries for Music Information Retrieval: A Pilot Study]].
 =====Others=====
-[[http://collections.nzdl.org/ELKB/|Electronic Lexical Knowledge Base (ELKB)]] is software for accessing and exploring the Roget's thesaurus. It also provides solutions for various natural language processing tasks. All scripts were originally developed as a part of Mario Jarmasz' Master thesis at the [[http://engineering.uottawa.ca/eecs/|University of Ottawa]], Canada.
+[[http://nzdl.org/ELKB/|Electronic Lexical Knowledge Base (ELKB)]] is software for accessing and exploring Roget's Thesaurus. It also provides solutions for various natural language processing tasks. All scripts were originally developed as part of Mario Jarmasz's Master's thesis at the [[http://engineering.uottawa.ca/eecs/|University of Ottawa]], Canada.