The contents of this page were originally created as the final section of the Greenstone 3 tutorial "Using the UnknownConverterPlugin to make unsupported document formats searchable", which was intended to be published officially with GS3.09. However, this section has been removed from the tutorial for the following reasons:
PdfAct, formerly known as Icecite (the name used for the software on the rest of this page), is an open-source tool that can do many PDF-related tasks, including extracting text from a PDF. In this part of the tutorial, we're going to learn how to run Icecite's PDF-to-text conversion utility from the command line. Based on that command, we'll configure the UnknownConverterPlugin to launch Icecite from GLI to convert a PDF document in a Greenstone collection. This is a useful exercise for cases where certain PDFs aren't recognised by Greenstone's PDFPlugin, even when the pdfbox_conversion option (which uses the PDFBox tool for the conversion) is switched on. In such cases, you can use what you learn here.
As Icecite needs Java 8, you need to have either a JDK8 or a JRE8 installed in order to proceed with this portion of the tutorial.
export JAVA_HOME=/PATH/TO/YOUR-JAVA-8-HOME
export PATH=$JAVA_HOME/bin:$PATH
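Before going further, it can help to confirm that the java now found first on your PATH really is Java 8. The following is a minimal sketch for a Linux-style shell; the helper name is_java8 is made up for this page, and it assumes that a Java 8 runtime reports a version string beginning with 1.8 when asked for its version.

```shell
# Hypothetical helper: succeeds if the given `java -version` output
# mentions a 1.8.x version string (assumed here to indicate Java 8).
is_java8() {
  case "$1" in
    *'version "1.8.'*) return 0 ;;
    *) return 1 ;;
  esac
}

# `java -version` writes to stderr, so redirect it to capture the text.
if command -v java >/dev/null 2>&1 && is_java8 "$(java -version 2>&1)"; then
  echo "Java 8 is first on PATH"
else
  echo "Java 8 is not first on PATH; check JAVA_HOME and PATH" >&2
fi
```

If the second message appears, re-check the two export lines above before trying the conversion command.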
Run the command for your operating system, replacing the <PLACEHOLDERS> below.
On Windows:
java -classpath "<DRIVE:\PATH-TO-GS-INSTALLATION>\ext\icecite\gs-installed-jars\*;<DRIVE:\PATH-TO-GS-INSTALLATION>\ext\icecite\pdf-cli\target\pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar" cli.PdfParserCommandLine --format txt --feature paragraphs <DRIVE:\FULL\PATH\TO\YOUR.pdf> <DRIVE:\FULL\PATH\TO\CONVERTED.txt>
On Linux:
java -classpath '/<PATH-TO-GS-INSTALLATION>/ext/icecite/gs-installed-jars/*:/<PATH-TO-GS-INSTALLATION>/ext/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature paragraphs </PATH/TO/YOUR.pdf> </PATH/TO/CONVERTED.txt>
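If you expect to run the conversion more than once while experimenting, the Linux command can be wrapped in a small shell function. This is only a sketch: the function names icecite_classpath and pdf_to_text are invented for this page, the jar locations are the ones used in the command above, and it assumes the options are typed with two plain hyphens (--format, --feature) and that a Java 8 java is first on your PATH.

```shell
# Hypothetical helper: prints the Icecite classpath for a Greenstone
# installation directory passed as $1 (Linux path separator ':').
icecite_classpath() {
  printf '%s/ext/icecite/gs-installed-jars/*:%s/ext/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' "$1" "$1"
}

# Hypothetical wrapper: pdf_to_text GS_HOME INPUT.pdf OUTPUT.txt
pdf_to_text() {
  gs_home=$1
  in_pdf=$2
  out_txt=$3

  # Fail early if the input PDF does not exist.
  [ -f "$in_pdf" ] || { echo "No such PDF: $in_pdf" >&2; return 1; }

  # The classpath is quoted so the shell leaves the '*' wildcard for Java to expand.
  java -classpath "$(icecite_classpath "$gs_home")" \
    cli.PdfParserCommandLine --format txt --feature paragraphs \
    "$in_pdf" "$out_txt" || return 1

  # Treat a missing or empty output file as a failed conversion.
  [ -s "$out_txt" ] || { echo "No text was written to $out_txt" >&2; return 1; }
}

# Example call, with placeholder paths to replace:
# pdf_to_text /PATH-TO-GS-INSTALLATION /PATH/TO/YOUR.pdf /PATH/TO/CONVERTED.txt
```

Checking that the output file is non-empty is a cheap way to spot PDFs from which no text could be extracted before involving GLI at all.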
We're now ready to use the UnknownConverterPlugin to launch Icecite as the external tool to do the conversion, producing output that Greenstone's building scripts can ingest into Greenstone and index for searching.
Set convert_to to the text option; this is the output format upon conversion.
Set mime_type to application/pdf.
Set srcicon to iconpdf, since Greenstone already knows about this macro, already has an icon for PDFs and knows to associate the two.
Set process_extension to pdf; this is the input format of the files that this instance of the UnknownConverterPlugin will process.
Set the exec_cmd field as follows, depending on your operating system:
On Windows:
DRIVE:\PATH\TO\YOUR-JAVA-8-HOME\bin\java -classpath "%%GSDL3SRCHOME\ext\icecite\gs-installed-jars\*;%%GSDL3SRCHOME\ext\icecite\pdf-cli\target\pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar" cli.PdfParserCommandLine --format txt --feature paragraphs %%INPUT_FILE %%OUTPUT
On Linux:
/PATH/TO/YOUR-JAVA-8-HOME/bin/java -classpath '%%GSDL3SRCHOME/ext/icecite/gs-installed-jars/*:%%GSDL3SRCHOME/ext/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature paragraphs %%INPUT_FILE %%OUTPUT
Note: When filling in the exec_cmd field, leave the words with %% signs in front of them intact. They are placeholders for Greenstone to replace.
You will, however, need to adjust the above value for exec_cmd by finding out where your Java 8 is installed and replacing /PATH/TO/YOUR-JAVA-8-HOME with it. You need to provide the full path to the Java 8 executable because, at present, GLI binaries ship with Java 7, which is incompatible with the precompiled Icecite. If you're running Greenstone from source, you may also have a different version of Java set up in your environment. By providing the full path to the Java 8 executable above, you force Icecite's PDF conversion program to run with Java 8.
On Windows, if there are spaces in any filepaths in the command, other than in the parameter value to -classpath, remember to bookend those filepaths within double quotes escaped with a backslash, \".
The above command uses the java executable to run the Java Icecite program that does the actual PDF-to-text conversion. Greenstone runs the command after first filling in %%GSDL3SRCHOME, %%INPUT_FILE and %%OUTPUT appropriately. %%GSDL3SRCHOME works out to be the Greenstone 3 installation directory, %%INPUT_FILE is whichever matching PDF is being processed, and %%OUTPUT is the file (or folder of files) produced by the conversion process. In this case, the output type is txt, as that's what Icecite produces. Once the conversion to text has finished, Greenstone can process the result as usual, such as indexing the extracted contents to make the document searchable.
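To make the placeholder substitution more concrete, here is a rough shell sketch of the kind of textual replacement described above. It is not Greenstone's own code: the function name expand_exec_cmd and the sample paths are invented for illustration, and it assumes the placeholders are written as %%GSDL3SRCHOME, %%INPUT_FILE and %%OUTPUT.

```shell
# Hypothetical illustration only: expand_exec_cmd TEMPLATE GS_HOME INPUT OUTPUT
# prints TEMPLATE with the three placeholders replaced by the given values.
# Values containing '|' or '&' would need escaping before being passed to sed.
expand_exec_cmd() {
  template=$1
  gs_home=$2
  input_file=$3
  output=$4
  printf '%s\n' "$template" | sed \
    -e "s|%%GSDL3SRCHOME|$gs_home|g" \
    -e "s|%%INPUT_FILE|$input_file|g" \
    -e "s|%%OUTPUT|$output|g"
}

# Example with made-up paths; prints the command line that would be run.
expand_exec_cmd \
  "/PATH/TO/YOUR-JAVA-8-HOME/bin/java -classpath '%%GSDL3SRCHOME/ext/icecite/gs-installed-jars/*' cli.PdfParserCommandLine --format txt --feature paragraphs %%INPUT_FILE %%OUTPUT" \
  /opt/greenstone3 /tmp/import/sample.pdf /tmp/cache/sample.txt
```

Printing the expanded command like this is a handy way to check your exec_cmd value by eye, since the result should look just like the command you ran by hand earlier.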