Differences
This shows you the differences between two versions of the page.
en:user_advanced:ice_cite [2019/02/21 05:58] – created anupama | en:user_advanced:ice_cite [2023/03/13 01:46] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | |||
+ | |||
+ | |||
====== Processing PDFs with Icecite and the UnknownConverterPlugin ====== | ====== Processing PDFs with Icecite and the UnknownConverterPlugin ====== | ||
The contents of this page were originally created as the final section of the Greenstone 3 tutorial "Using the UnknownConverterPlugin to make unsupported document formats searchable", | The contents of this page were originally created as the final section of the Greenstone 3 tutorial "Using the UnknownConverterPlugin to make unsupported document formats searchable", | ||
Line 8: | Line 11: | ||
==== Using the Icecite' | ==== Using the Icecite' | ||
- | Icecite (now called | + | // |
- | As Icecite needs Java 8, you need to have either a JDK8 or a JRE8 installed in order to proceed with this portion of the tutorial. | + | |
- | Grab the pre-compiled Icecite zip file from http://trac.greenstone.org/ | + | //As Icecite needs Java 8, you need to have either a JDK8 or a JRE8 installed in order to proceed with this portion of the tutorial.// |
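Before going further, it can help to confirm that the `java` on your PATH really is Java 8. The following is a minimal sketch, not part of the original tutorial: the function names `extract_java_version` and `is_java8_version` are hypothetical helpers, and the check assumes the legacy `1.8.x` version string that Java 8 reports.

```shell
# Extract the quoted version from the first line of `java -version` output,
# e.g. 'openjdk version "1.8.0_292"' yields 1.8.0_292.
extract_java_version() {
  printf '%s\n' "$1" | sed -n '1s/.*version "\([^"]*\)".*/\1/p'
}

# Succeed only if the version string uses the Java 8 scheme (1.8 or 1.8.x).
is_java8_version() {
  case "$1" in
    1.8|1.8.*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A possible use, noting that `java -version` writes to stderr: `is_java8_version "$(extract_java_version "$(java -version 2>&1)")" && echo "Java 8 found"`.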
- | + | ||
- | Now you're ready to test Icecite' | + | |
- | + | ||
- | Set up your environment for Java 8: | + | |
- | + | ||
- | + | ||
- | export JAVA_HOME=/ | + | |
- | + | ||
- | You would need to run Icecite from the terminal wherein you set up the Java 8 environment. Run it on a PDF as follows, after first replacing the < | + | |
- | + | ||
- | The command will look as follows on Windows, note the use of double quotes around the classpath value and the use of semi-colon as the path separator on Windows: | + | |
- | java -classpath "< | + | - Grab the pre-compiled Icecite zip file from http:// |
+ | - Set up your environment for Java 8:\\ < | ||
+ | - Icecite must be run from the same terminal in which you set up the Java 8 environment. Run it on a PDF as follows, after first replacing the ''< | ||
+ | * On Windows, the command looks as follows. Note the double quotes around the classpath value and the semicolon used as the path separator:\\ '' | ||
+ | * On Unix systems, the command takes the following form, where single quotes are acceptable around the classpath value and the colon is the path separator: | ||
+ | - It can take a while for the PDF to get converted, but once it has finished, you can inspect the text file produced, denoted by the placeholder string ''</ | ||
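The difference in path separators described in the steps above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function names are hypothetical, the jar names in the usage note are placeholders rather than the actual Icecite jar files, and the Windows branch only applies to Unix-like shells on Windows (Cygwin, MSYS, Git Bash), not to `cmd.exe`.

```shell
# Print the classpath separator for the current platform:
# ';' under Windows-hosted Unix-like shells, ':' everywhere else.
classpath_separator() {
  case "$(uname -s)" in
    CYGWIN*|MINGW*|MSYS*) printf ';' ;;
    *) printf ':' ;;
  esac
}

# Join all remaining arguments using the separator given as the first argument.
join_classpath() {
  sep=$1
  shift
  result=""
  for entry in "$@"; do
    if [ -z "$result" ]; then
      result=$entry
    else
      result="$result$sep$entry"
    fi
  done
  printf '%s\n' "$result"
}
```

For example, `join_classpath "$(classpath_separator)" first.jar second.jar` builds a value suitable for `-classpath`; quote the result when passing it to `java`, since a semicolon is otherwise treated by the shell as a command separator.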
+ | |||
+ | ==== Using the UnknownConverterPlugin to launch Icecite from GLI to do the PDF to text conversion ==== | ||
+ | // | ||
+ | - Run GLI | ||
+ | - Create a new collection called Icecite. In the **Gather** pane, drop the sample PDF file into your collection. | ||
+ | - In the **Design** pane, select **Document Plugins** from the list on the left. Add the **UnknownConverterPlugin**. Having tried out the Icecite conversion command manually in the previous part of this tutorial, we're now ready to use it when configuring the **UnknownConverterPlugin**. Click **< | ||
+ | * set '' | ||
+ | * set '' | ||
+ | * set '' | ||
+ | * set '' | ||
+ | * set the '' | ||
+ | * on Windows:\\ '' | ||
+ | * on Unix systems:\\ ''/ | ||
- | On Unix systems, the command will be of the following form, where single quotes | + | Note: When filling in the '' |
- | java -classpath | + | You will however need to adjust the above value for exec_cmd by finding out where your Java 8 is installed and replacing ''/ |
+ | //On Windows, if there are spaces in any filepaths in the command, other than in the parameter value to -classpath, remember to enclose those filepaths in double quotes escaped with a backslash, \".// | ||
+ | |||
+ | //The above command uses the java executable to run the Java-based Icecite program that does the actual PDF to text conversion. Greenstone will run the given command after first filling in the %%GSDL3SRCHOME, | ||
+ | - Having sufficiently configured the **UnknownConverterPlugin**, | ||
+ | - Select the **UnknownConverterPlugin** in the list of plugins and keep pressing the **<Move Up>** button to shift it upwards, until it appears in the plugin pipeline above the existing **PDFPlugin**, | ||
+ | - Move to the **Create** pane and build the collection. Once more, when the Icecite conversion utility is called by Greenstone' | ||
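To make the role of the `%%` placeholders in `exec_cmd` more concrete, here is a rough shell sketch of what placeholder substitution amounts to. It is not Greenstone's actual implementation: the function `expand_placeholder` is hypothetical, the template string in the usage note is made up for illustration, and only the `%%GSDL3SRCHOME` name comes from the tutorial text.

```shell
# Replace every occurrence of %%NAME in a template with a value.
#   $1 = template text, $2 = placeholder name without the %% prefix, $3 = value
# Caveat: the value must not contain '|', '&' or backslashes, because it is
# passed straight into a sed replacement in this simple sketch.
expand_placeholder() {
  printf '%s\n' "$1" | sed "s|%%$2|$3|g"
}
```

For instance, `expand_placeholder 'java -classpath %%GSDL3SRCHOME/example.jar' GSDL3SRCHOME /opt/greenstone3` prints the command with the install path filled in. This is why the `%%` words must be left intact when you type the `exec_cmd` value: if they are edited, there is nothing left for Greenstone to substitute.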
- | It can take a while for the PDF to get converted, but once it has finished, you can inspect the text file produced, denoted by the placeholder string </ | ||
- | You can experiment with using --feature words or --feature lines above, in place of --feature paragraphs, to find out the effect of such a change on the output file, particularly if --feature paragraphs does not produce the desired results for your PDFs. | + | <!-- |
- | Using the UnknownConverterPlugin to launch Icecite from GLI to do the PDF to text conversion | + | USING THE ICECITE TOOL TO CONVERT FROM PDF TO TXT |
- | We're now ready to use the UnknownConverterPlugin to launch Icecite as the external tool to do the conversion, producing output that Greenstone' | + | |
- | Run GLI | + | |
- | Create a new collection called | + | 1. Need Java 8 for compiling and probably also for running |
+ | < | ||
+ | export JAVA_HOME=/ | ||
+ | export PATH=$JAVA_HOME/ | ||
+ | </ | ||
- | In the Design pane and select Document Plugins from the list on the left. Add the UnknownConverterPlugin. Having tried out the Icecite conversion command manually in the previous part of this tutorial, we're now ready to use it when configuring | + | 2. Get and compile icecite, following |
+ | <code> | ||
+ | git clone https:// | ||
+ | cd icecite | ||
+ | git pull --recurse-submodules | ||
+ | cd pdf-parent/ | ||
+ | mvn install | ||
+ | </code> | ||
- | set convert_to to the text option, this is the output format upon conversion | + | 3. Run icecite, general instructions at https:// |
- | set mime_type to application/pdf | + | < |
- | set srcicon to the iconpdf, since Greenstone already knows about this macro and already has an icon for PDFs and knows to associate the two | + | cd ../../ |
- | set process_extension to pdf, this is the input format of the files that this instance of the UnknownConverterPlugin will process | + | cd icecite/pdf-cli |
- | set the exec_cmd field as follows, depending on your operating system: | + | java -jar target/pdf-cli-*-jar-with-dependencies.jar [options] <input> [< |
- | on Windows: | + | </ |
+ | Examples: | ||
+ | greenstone@bedrock:~/ | ||
- | DRIVE:\PATH\TO\YOUR-JAVA-8-HOME\bin\java -classpath " | + | greenstone@bedrock:~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature |
- | on Unix systems: | + | greenstone@bedrock:~/ |
- | / | + | (Also tried with input file pdf01.pdf from the Reports collection) |
+ | 4. If you see the exception | ||
+ | --- | ||
+ | Exception in thread " | ||
+ | at org.apache.pdfbox.pdmodel.encryption.PDEncryption.< | ||
+ | at org.apache.pdfbox.pdfparser.PDFParser.prepareDecryption(PDFParser.java: | ||
+ | at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java: | ||
+ | at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java: | ||
+ | at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java: | ||
+ | at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java: | ||
+ | at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java: | ||
+ | at parser.pdfbox.core.PdfStreamEngine.processFile(PdfStreamEngine.java: | ||
+ | at parser.pdfbox.PdfBoxParser.parse(PdfBoxParser.java: | ||
+ | at cli.PdfParserCommandLine.parse(PdfParserCommandLine.java: | ||
+ | at cli.PdfParserCommandLine.processFile(PdfParserCommandLine.java: | ||
+ | at cli.PdfParserCommandLine.process(PdfParserCommandLine.java: | ||
+ | at cli.PdfParserCommandLine.main(PdfParserCommandLine.java: | ||
+ | Caused by: java.lang.ClassNotFoundException: | ||
+ | at java.net.URLClassLoader$1.run(URLClassLoader.java: | ||
+ | at java.net.URLClassLoader$1.run(URLClassLoader.java: | ||
+ | at java.security.AccessController.doPrivileged(Native Method) | ||
+ | at java.net.URLClassLoader.findClass(URLClassLoader.java: | ||
+ | at java.lang.ClassLoader.loadClass(ClassLoader.java: | ||
+ | at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java: | ||
+ | at java.lang.ClassLoader.loadClass(ClassLoader.java: | ||
+ | ... 13 more | ||
- | Note: When filling in the exec_cmd field, leave the words with %% signs in front of them intact. They are placeholders for Greenstone to replace. | + | --- |
- | You will however need to adjust the above value for exec_cmd by finding out where your Java 8 is installed and replacing | + | Then: |
+ | a. Obtain bouncycastle (encryption? | ||
- | On Windows, if there are spaces in any filepaths in the command, other than in the parameter value to -classpath, remember to bookend those filepaths within double quotes escaped with a backslash, \". | + | Download both jar files listed under the " |
- | The above command will use the java executable to run the java Icecite program | + | |
- | Having sufficiently configured the UnknownConverterPlugin, | + | |
- | Select the UnknownConverterPlugin in the list of plugins and keep pressing the <Move Up> button | + | b. Then see https:// |
+ | for how to run a java programme when you have multiple jar files on classpath, as you can't run java with both -cp and -jar. | ||
- | Move to the Create pane and build the collection. Once more, when Icecite conversion utility is called by Greenstone's building process, the conversion will take some time processing. But after a minute or so, the building will be done and you can Preview the collection. Search for some terms. | + | greenstone@bedrock: |
+ | --> |
en/user_advanced/ice_cite.txt · Last modified: 2023/03/13 01:46 by 127.0.0.1