en:user_advanced:ice_cite
Differences
This shows you the differences between two versions of the page.
en:user_advanced:ice_cite [2019/02/21 05:58] – created anupama | en:user_advanced:ice_cite [2019/04/24 09:29] – anupama | ||
---|---|---|---|
Line 8: | Line 8: | ||
==== Using the Icecite' | ==== Using the Icecite' | ||
- | Icecite (now called | + | // |
- | As Icecite needs Java 8, you need to have either a JDK8 or a JRE8 installed in order to proceed with this portion of the tutorial. | + | |
- | Grab the pre-compiled Icecite zip file from http://trac.greenstone.org/ | + | //As Icecite needs Java 8, you need to have either a JDK8 or a JRE8 installed in order to proceed with this portion of the tutorial.// |
- | Now you're ready to test Icecite' | + | - Grab the pre-compiled Icecite zip file from http:// |
+ | - Set up your environment for Java 8:\\ < | ||
+ | - Run Icecite from the same terminal in which you set up the Java 8 environment. Run it on a PDF as follows, after first replacing the ''< | ||
+ | * On Windows, the command will look as follows. Note the double quotes around the classpath value and the semicolon used as the path separator:\\ '' | ||
+ | * On Unix systems, the command will be of the following form, where single quotes are acceptable around the value for classpath and where colon is the path separator: | ||
+ | - It can take a while for the PDF to get converted, but once it has finished, you can inspect the text file produced, denoted by the placeholder string ''</ | ||
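The Java 8 requirement and the classpath conventions described above can be checked from a terminal before running Icecite. The following is a minimal sketch, not part of the Icecite or Greenstone distribution: the function names ''extract_java_version'', ''is_java8_version'' and ''join_classpath'' are hypothetical helpers, and the classpath entries in the demo are placeholders rather than real installation paths.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names come from Icecite or Greenstone.

# Read `java -version` output on stdin and print the quoted version string,
# e.g. 1.8.0_201 from: java version "1.8.0_201"
extract_java_version() {
  sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed only if the version string denotes Java 8 (reported as 1.8.x).
is_java8_version() {
  case "$1" in
    1.8|1.8.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Join classpath entries with a separator: ":" on Unix, ";" on Windows.
# Usage: join_classpath SEPARATOR ENTRY...
join_classpath() {
  sep=$1
  shift
  result=
  for entry in "$@"; do
    if [ -z "$result" ]; then
      result=$entry
    else
      result=$result$sep$entry
    fi
  done
  printf '%s\n' "$result"
}

# Demo: report whether the java on PATH is Java 8 (java prints its version to stderr).
if command -v java >/dev/null 2>&1; then
  version=$(java -version 2>&1 | extract_java_version)
  if is_java8_version "$version"; then
    echo "Java 8 found: $version"
  else
    echo "java on PATH reports version '$version', not Java 8"
  fi
else
  echo "java not found on PATH"
fi

# Demo with placeholder entries; prints /path/one/*:/path/two/*
join_classpath ':' '/path/one/*' '/path/two/*'
```

Keeping the entries quoted, as in the demo, stops the shell from expanding the ''*'' wildcard so that Java receives it unchanged.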
+ | |||
+ | ==== Using the UnknownConverterPlugin to launch Icecite from GLI to do the PDF to text conversion ==== | ||
+ | // | ||
+ | - Run GLI | ||
+ | - Create a new collection called Icecite. In the **Gather** pane, drop the sample PDF file into your collection. | ||
+ | - In the **Design** pane, select **Document Plugins** from the list on the left. Add the **UnknownConverterPlugin**. Having tried out the Icecite conversion command manually in the previous part of this tutorial, we're now ready to use it when configuring the **UnknownConverterPlugin**. Click **< | ||
+ | * set '' | ||
+ | * set '' | ||
+ | * set '' | ||
+ | * set '' | ||
+ | * set the '' | ||
+ | * on Windows:\\ '' | ||
+ | * on Unix systems:\\ ''/ | ||
- | Set up your environment for Java 8: | + | Note: When filling in the '' |
+ | You will however need to adjust the above value for exec_cmd by finding out where your Java 8 is installed and replacing ''/ | ||
- | export JAVA_HOME=/PATH/TO/YOUR-JAVA-8-HOME export PATH=$JAVA_HOME/ | + | //On Windows, if there are spaces in any filepaths in the command, other than in the parameter value to -classpath, remember to bookend those filepaths within double quotes escaped with a backslash, \".// |
+ | |||
+ | //The above command will use the java executable to run the java Icecite program that does the actual PDF to text conversion. Greenstone will run the command given after first filling in the %%GSDL3SRCHOME, | ||
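To illustrate the idea of a command template whose %%-prefixed placeholders are filled in before the command is run, here is a minimal sketch of that kind of substitution. It is not Greenstone's implementation: ''fill_placeholder'' is a hypothetical helper, the template string is only illustrative, and ''/opt/greenstone3'' is an assumed installation path.

```shell
#!/bin/sh
# Hypothetical helper, not Greenstone code: replace every %%NAME in a template.
# Usage: fill_placeholder TEMPLATE NAME VALUE
# Assumes VALUE contains no '|', '&' or backslash, which are special to sed here,
# so this sketch suits Unix-style paths only.
fill_placeholder() {
  printf '%s\n' "$1" | sed "s|%%$2|$3|g"
}

# Illustrative template; the real exec_cmd value is longer.
template="java -classpath '%%GSDL3SRCHOME/ext/icecite/gs-installed-jars/*'"

# /opt/greenstone3 is an assumed installation path.
fill_placeholder "$template" GSDL3SRCHOME /opt/greenstone3
```

This is why the placeholder words must be left intact when typing the command: they only work if they still match exactly at substitution time.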
+ | | ||
+ | | ||
+ | | ||
- | Run Icecite from the same terminal in which you set up the Java 8 environment. Run it on a PDF as follows, after first replacing the < | ||
- | On Windows, the command will look as follows. Note the double quotes around the classpath value and the semicolon used as the path separator: | + | <!-- |
+ | USING THE ICECITE TOOL TO CONVERT FROM PDF TO TXT | ||
- | java -classpath "<DRIVE: | + | 1. Need Java 8 for compiling and probably also for running Icecite |
+ | <code> | ||
+ | export JAVA_HOME=/ | ||
+ | export | ||
+ | </code> | ||
- | On Unix systems, the command will be of the following | + | 2. Get and compile icecite, following the instructions at https:// |
+ | < | ||
+ | git clone https:// | ||
+ | cd icecite | ||
+ | git pull --recurse-submodules | ||
+ | cd pdf-parent/ | ||
+ | mvn install | ||
+ | </ | ||
- | java -classpath '/<PATH-TO-GS-INSTALLATION>/ext/icecite/gs-installed-jars/*:/<PATH-TO-GS-INSTALLATION>/ext/ | + | 3. Run icecite, general instructions at https:// |
+ | <code> | ||
+ | cd ../../ | ||
+ | cd icecite/pdf-cli | ||
+ | java -jar target/pdf-cli-*-jar-with-dependencies.jar [options] <input> [< | ||
+ | </code> | ||
+ | Examples: | ||
+ | greenstone@bedrock: | ||
+ | greenstone@bedrock: | ||
- | It can take a while for the PDF to get converted, but once it has finished, you can inspect the text file produced, denoted by the placeholder string </PATH/TO/CONVERTED.txt> | + | greenstone@bedrock: |
- | You can experiment | + | (Also tried with input file pdf01.pdf from the Reports collection) |
- | Using the UnknownConverterPlugin to launch Icecite | + | |
- | We're now ready to use the UnknownConverterPlugin to launch Icecite as the external tool to do the conversion, producing output that Greenstone' | + | |
- | Run GLI | + | |
- | Create a new collection called Icecite. In the Gather pane, drop the sample PDF file into your collection. | ||
- | In the Design pane, select Document Plugins from the list on the left. Add the UnknownConverterPlugin. Having tried out the Icecite conversion command manually | + | 4. If you see the exception |
+ | --- | ||
+ | Exception | ||
+ | at org.apache.pdfbox.pdmodel.encryption.PDEncryption.< | ||
+ | at org.apache.pdfbox.pdfparser.PDFParser.prepareDecryption(PDFParser.java: | ||
+ | at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java: | ||
+ | at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java: | ||
+ | at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java: | ||
+ | at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java: | ||
+ | at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java: | ||
+ | at parser.pdfbox.core.PdfStreamEngine.processFile(PdfStreamEngine.java: | ||
+ | at parser.pdfbox.PdfBoxParser.parse(PdfBoxParser.java: | ||
+ | at cli.PdfParserCommandLine.parse(PdfParserCommandLine.java: | ||
+ | at cli.PdfParserCommandLine.processFile(PdfParserCommandLine.java: | ||
+ | at cli.PdfParserCommandLine.process(PdfParserCommandLine.java: | ||
+ | at cli.PdfParserCommandLine.main(PdfParserCommandLine.java: | ||
+ | Caused by: java.lang.ClassNotFoundException: | ||
+ | at java.net.URLClassLoader$1.run(URLClassLoader.java: | ||
+ | at java.net.URLClassLoader$1.run(URLClassLoader.java: | ||
+ | at java.security.AccessController.doPrivileged(Native Method) | ||
+ | at java.net.URLClassLoader.findClass(URLClassLoader.java: | ||
+ | at java.lang.ClassLoader.loadClass(ClassLoader.java: | ||
+ | at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java: | ||
+ | at java.lang.ClassLoader.loadClass(ClassLoader.java: | ||
+ | ... 13 more | ||
- | set convert_to to the text option, this is the output format upon conversion | + | --- |
- | set mime_type to application/ | + | |
- | set srcicon to the iconpdf, since Greenstone already knows about this macro and already has an icon for PDFs and knows to associate the two | + | |
- | set process_extension to pdf, this is the input format of the files that this instance of the UnknownConverterPlugin will process | + | |
- | set the exec_cmd field as follows, depending on your operating system: | + | |
- | on Windows: | + | |
- | DRIVE:\PATH\TO\YOUR-JAVA-8-HOME\bin\java -classpath " | + | Then: |
+ | a. Obtain bouncycastle (encryption? | ||
- | on Unix systems: | + | Download both jar files listed under the " |
- | /PATH/TO/YOUR-JAVA-8-HOME/bin/java -classpath ' | + | b. Then see https://stackoverflow.com/questions/15930782/call-java-jar-myfile-jar-with-additional-classpath-option |
+ | for how to run a java programme when you have multiple | ||
- | + | greenstone@bedrock:~/ | |
- | + | --> | |
- | Note: When filling in the exec_cmd field, leave the words with %% signs in front of them intact. They are placeholders for Greenstone to replace. | + | |
- | + | ||
- | You will however need to adjust the above value for exec_cmd by finding out where your Java 8 is installed and replacing | + | |
- | + | ||
- | On Windows, if there are spaces in any filepaths in the command, other than in the parameter value to -classpath, remember to bookend those filepaths within double quotes escaped with a backslash, \". | + | |
- | The above command will use the java executable to run the java Icecite program that does the actual PDF to text conversion. Greenstone will run the command given after first filling in the %%GSDL3SRCHOME, | + | |
- | Having sufficiently configured the UnknownConverterPlugin, | + | |
- | + | ||
- | Select the UnknownConverterPlugin in the list of plugins and keep pressing the <Move Up> button to shift it upwards, until it appears in the plugin pipeline above the existing PDFPlugin, so that this instance of UnknownConverterPlugin, | + | |
- | + | ||
- | Move to the Create pane and build the collection. Once more, when the Icecite conversion utility is called by Greenstone' |
en/user_advanced/ice_cite.txt · Last modified: 2023/03/13 01:46 by 127.0.0.1