The UnknownConverterPlugin

  • There's a Greenstone 3 tutorial demonstrating how to use the UnknownConverterPlugin. It requires you to find and install a tool that can be run from the command line to convert the unknown document format to text or HTML.
    As an example, the tutorial covers how Greenstone can be made to process djvu files using the UnknownConverterPlugin and a free command line tool that converts djvu to text or HTML.
  • Instead of installing LibreOffice so that Greenstone can extract text from docx files, you can use the UnknownConverterPlugin together with Apache Tika to build a collection in which docx files are searchable. The difference is that, without LibreOffice, Greenstone does not produce an equivalent HTML version of the docx file: you get a text-only HTML version that allows full text searching, but not HTML that reproduces the formatting and non-text content of the docx file. If you want the latter, you still need the free LibreOffice (or Microsoft Word itself) installed.

Using the UnknownConverterPlugin with Apache Tika to process docx (and other) files

Apache Tika is Apache's open-source software for extracting text from a wide range of (textual) document types, one of which is docx. While you can write code that calls the Apache Tika API, the ready-made jar file contains everything we needed to get Greenstone to index the text in docx files.

All that's necessary is to drop an Apache Tika jar file into a "tika" subfolder of your GS3/gs2build/ext and then configure an UnknownConverterPlugin instance to make use of it. Building the collection this way allows Greenstone to process and index docx files, making them searchable without requiring users to install LibreOffice.

The UnknownConverterPlugin has been officially available since Greenstone 3.09, so 3.09 users can also start using Tika with the plugin by

1. creating a subfolder called "tika" inside their GS3-install-dir/gs2build/ext,

2. downloading the Apache Tika binary jar file from https://www.apache.org/dyn/closer.cgi/tika/tika-app-1.24.1.jar (or by visiting http://trac.greenstone.org/browser/main/trunk/greenstone2/ext/tika/tika-app-1.24.1.jar and clicking the link labelled "downloading" there), then dropping the downloaded jar file into GS3/gs2build/ext/tika,

3. and then configuring an UnknownConverterPlugin instance for any collection that needs docx processing as follows:

All three of the above steps are already set up for you in the GS3 binaries that are generated every night and available from http://www.greenstone.org/caveat-emptor/
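
If you are setting things up by hand, steps 1 and 2 amount to creating a folder and copying a file into it. The following is a minimal sketch in Python, assuming the jar has already been downloaded somewhere on disk. The function names and the `gs3_home` parameter are illustrative only and are not part of Greenstone.

```python
import shutil
from pathlib import Path

# Jar name taken from the download step above.
TIKA_JAR_NAME = "tika-app-1.24.1.jar"


def tika_ext_dir(gs3_home: Path) -> Path:
    """Return the gs2build/ext/tika folder inside a Greenstone 3 installation."""
    return gs3_home / "gs2build" / "ext" / "tika"


def install_tika_jar(gs3_home: Path, downloaded_jar: Path) -> Path:
    """Create gs2build/ext/tika if it is missing and copy the jar into it.

    gs3_home is assumed to be the top-level Greenstone 3 installation folder;
    downloaded_jar is wherever the browser saved the Tika jar.
    """
    target_dir = tika_ext_dir(gs3_home)
    target_dir.mkdir(parents=True, exist_ok=True)  # step 1: create the subfolder
    target = target_dir / downloaded_jar.name
    shutil.copy2(downloaded_jar, target)           # step 2: drop the jar in place
    return target
```

For example, `install_tika_jar(Path("/path/to/GS3"), Path.home() / "Downloads" / TIKA_JAR_NAME)` would place the jar where step 2 expects it; both paths here are placeholders.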

Untried: Greenstone 2 users can try grabbing a nightly GS2 binary from http://www.greenstone.org/caveat-emptor/ as it should also come with an UnknownConverterPlugin. The nightly GS2 binaries should already have an ext/tika subfolder within the GS2 installation folder, containing the Tika jar file. Otherwise, you can create this folder yourself and download the Tika jar file into that location as in step 2. Next, configure your UnknownConverterPlugin as in step 3 above before building your GS2 collection containing docx files.

You're not limited to processing docx files when using the UnknownConverterPlugin with Tika. You can process other textual doctypes, whether or not they are already supported by existing Greenstone plugins, by configuring a new instance of the UnknownConverterPlugin and setting the mime_type, srcicon, process_extension (and file_format) fields appropriately for that doctype.

For every doctype to be processed by the UnknownConverterPlugin, the plugin requires you to have a command line tool installed that can convert that doctype to text or HTML. Apache Tika is such a tool: it can be run from the command line to convert a textual doctype to text or HTML. Next time you have a collection containing doctypes for which Greenstone does not provide existing plugins, experiment with combining the UnknownConverterPlugin with Tika.
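
To see what a command line conversion with Tika looks like on its own, the sketch below builds and runs a `java -jar` call using tika-app's `--text` and `--html` options, which print the converted document to standard output. It only illustrates the kind of command the plugin needs and is not the plugin's own configuration. The function names and paths are hypothetical, and it assumes `java` is on the PATH.

```python
import subprocess
from pathlib import Path
from typing import List

# tika-app command line options for the two output types the plugin can consume.
TIKA_FORMAT_FLAGS = {"text": "--text", "html": "--html"}


def tika_command(tika_jar: Path, source: Path, output_format: str = "text") -> List[str]:
    """Build the argument list for converting one file with the tika-app jar."""
    if output_format not in TIKA_FORMAT_FLAGS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    return ["java", "-jar", str(tika_jar), TIKA_FORMAT_FLAGS[output_format], str(source)]


def convert_with_tika(tika_jar: Path, source: Path, output_format: str = "text") -> str:
    """Run Tika and return the converted document it writes to standard output.

    Requires a Java runtime on the PATH; raises CalledProcessError if Tika fails.
    """
    result = subprocess.run(
        tika_command(tika_jar, source, output_format),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Running `convert_with_tika(jar, Path("report.docx"), "html")` by hand on a sample file (the file name is a placeholder) is a quick way to check that Tika can handle a doctype before configuring a plugin instance for it.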

en/plugin/unknownconverterplugin.1596786109.txt.gz · Last modified: 2020/08/07 07:41 by anupama