en:user_advanced:ice_cite
en:user_advanced:ice_cite [2019/02/21 06:18] – anupama | en:user_advanced:ice_cite [2023/03/13 01:46] (current) – external edit 127.0.0.1 | ||
====== Processing PDFs with Icecite and the UnknownConverterPlugin ======
The contents of this page were originally created as the final section of the Greenstone 3 tutorial "Using the UnknownConverterPlugin to make unsupported document formats searchable",
==== Using the Icecite'
//As Icecite needs Java 8, you need to have either a JDK8 or a JRE8 installed in order to proceed with this portion of the tutorial.//
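Before going further, it may help to confirm which Java version is on your PATH. The following is a minimal sketch for Unix shells, not part of Greenstone or Icecite: the function name ''is_java8_version_line'' is made up for this illustration, and it assumes the usual Java 8 banner format, in which ''java -version'' reports the version as ''1.8.x''.

```shell
# Return success if a "java -version" banner line reports Java 8 (1.8.x).
# Hypothetical helper for illustration only.
is_java8_version_line() {
  case "$1" in
    *'version "1.8.'*) return 0 ;;
    *) return 1 ;;
  esac
}

if command -v java >/dev/null 2>&1; then
  # java -version writes its banner to stderr; keep only the first line.
  banner="$(java -version 2>&1 | head -n 1)"
  if is_java8_version_line "$banner"; then
    echo "Java 8 found: $banner"
  else
    echo "java is on PATH but does not look like Java 8: $banner"
  fi
else
  echo "no java found on PATH"
fi
```

If the check reports a different version, a Java 8 installation may still exist elsewhere on the machine; it is simply not the first ''java'' on the PATH.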
  - Run GLI
  - Create a new collection called Icecite. In the **Gather** pane, drop the sample PDF file into your collection.
  - Go to the **Design** pane and select **Document Plugins** from the list on the left. Add the **UnknownConverterPlugin**. Having tried out the Icecite conversion command manually in the previous part of this tutorial, we're now ready to use it when configuring the **UnknownConverterPlugin**. Click **<
    * set ''convert_to'' to the text option; this is the output format upon conversion
    * set ''mime_type'' to application/
    * set ''srcicon'' to iconpdf, since Greenstone already knows about this macro, already has an icon for PDFs, and knows to associate the two
    * set ''process_extension'' to pdf; this is the input format of the files that this instance of the UnknownConverterPlugin will process
    * set the ''exec_cmd'' field as follows, depending on your operating system:
      * on Windows:\\ ''
      * on Unix systems:\\ ''/
Note: When filling in the ''
You will, however, need to adjust the above value for exec_cmd by finding out where your Java 8 is installed and replacing ''/
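One way to find out where a Java installation lives on a Unix system is to resolve the ''java'' found on the PATH to its real location and strip the trailing ''/bin/java''. The sketch below is illustrative only: ''java_home_from_binary'' is a made-up helper name, and ''readlink -f'' is the GNU coreutils form, which may not be available on every Unix.

```shell
# Print the installation directory for a given java executable path,
# i.e. the path with its trailing /bin/java removed.
# Hypothetical helper for illustration only.
java_home_from_binary() {
  dirname "$(dirname "$1")"
}

if command -v java >/dev/null 2>&1; then
  # Follow symlinks to the real executable (GNU readlink assumed).
  real_java="$(readlink -f "$(command -v java)")"
  echo "java executable: $real_java"
  echo "installation directory: $(java_home_from_binary "$real_java")"
else
  echo "no java found on PATH"
fi
```

For example, ''java_home_from_binary /opt/jdk8/bin/java'' prints ''/opt/jdk8'' (a hypothetical path). If the ''java'' on your PATH is not Java 8, run the helper on the path of the Java 8 executable instead.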
//The above command will use the java executable to run the Java Icecite program that does the actual PDF to text conversion. Greenstone will run the command given after first filling in the %%GSDL3SRCHOME,
<!--
USING THE ICECITE TOOL TO CONVERT FROM PDF TO TXT

1. Need Java 8 for compiling and probably also for running Icecite
<
export JAVA_HOME=/
export PATH=$JAVA_HOME/
</

2. Get and compile icecite, following the instructions at https://
<
git clone https://
cd icecite
git pull --recurse-submodules
cd pdf-parent/
mvn install
</

3. Run icecite, general instructions at https://
<
cd ../../
cd icecite/
java -jar target/
</
Examples:
greenstone@bedrock:

greenstone@bedrock:

greenstone@bedrock:

(Also tried with input file pdf01.pdf from the Reports collection)


4. If you see the exception
---
Exception in thread "
at org.apache.pdfbox.pdmodel.encryption.PDEncryption.<
at org.apache.pdfbox.pdfparser.PDFParser.prepareDecryption(PDFParser.java:
at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:
at parser.pdfbox.core.PdfStreamEngine.processFile(PdfStreamEngine.java:
at parser.pdfbox.PdfBoxParser.parse(PdfBoxParser.java:
at cli.PdfParserCommandLine.parse(PdfParserCommandLine.java:
at cli.PdfParserCommandLine.processFile(PdfParserCommandLine.java:
at cli.PdfParserCommandLine.process(PdfParserCommandLine.java:
at cli.PdfParserCommandLine.main(PdfParserCommandLine.java:
Caused by: java.lang.ClassNotFoundException:
at java.net.URLClassLoader$1.run(URLClassLoader.java:
at java.net.URLClassLoader$1.run(URLClassLoader.java:
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:
at java.lang.ClassLoader.loadClass(ClassLoader.java:
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:
at java.lang.ClassLoader.loadClass(ClassLoader.java:
... 13 more

---

Then:
a. Obtain bouncycastle (encryption?

Download both jar files listed under the "

b. Then see https://
for how to run a java programme when you have multiple jar files on classpath, as you can't run java with both -cp and -jar.

greenstone@bedrock:
-->
en/user_advanced/ice_cite.txt · Last modified: 2023/03/13 01:46 by 127.0.0.1