Differences

This shows you the differences between two versions of the page.

en:user_advanced:ice_cite [2019/03/13 18:57]
anupama
en:user_advanced:ice_cite [2019/04/24 21:29] (current)
anupama
Line 42: Line 42:
   - Select the **UnknownConverterPlugin** in the list of plugins and keep pressing the **<Move Up>** button until it appears in the plugin pipeline above the existing **PDFPlugin**. This instance of **UnknownConverterPlugin**, now configured to handle PDF files, will then take precedence in processing such files.
   - Move to the **Create** pane and build the collection. As before, the conversion takes some time when Greenstone's building process calls the Icecite conversion utility. After a minute or so the build will finish and you can Preview the collection. Search for some terms.
 +
 +
 +<!--
 +USING THE ICECITE TOOL TO CONVERT FROM PDF TO TXT
 +
 +1. Need Java 8 for compiling and probably also for running Icecite
 +<code>
 +export JAVA_HOME=/opt/java8/
 +export PATH=$JAVA_HOME/bin:$PATH
 +</code>
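Since Icecite appears to need Java 8, it can help to confirm which java is first on the PATH after the exports above. The following is a minimal sketch, not part of Icecite or Greenstone: the helper name is_java8 is hypothetical, and it assumes the usual Java 8 banner, whose version string starts with 1.8.

```shell
# Hypothetical helper (not part of Icecite): succeeds only if the given
# `java -version` banner reports a 1.8.x version, i.e. Java 8.
is_java8() {
  case "$1" in
    *'version "1.8.'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Check the java that is currently first on PATH, if there is one.
# `java -version` prints to stderr, hence the 2>&1.
if command -v java >/dev/null 2>&1; then
  if is_java8 "$(java -version 2>&1)"; then
    echo "Java 8 is active"
  else
    echo "Active java is not Java 8; check JAVA_HOME and PATH"
  fi
fi
```

If the second message appears, re-run the two export lines in the same shell before calling mvn or java.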
 +
 +2. Get and compile Icecite, following the instructions at https://github.com/ckorzen/icecite
 +<code>
 +git clone https://github.com/ckorzen/icecite.git --recursive
 +cd icecite
 +git pull --recurse-submodules
 +cd pdf-parent/
 +mvn install
 +</code>
 +
 +3. Run Icecite; general instructions are at https://github.com/ckorzen/icecite
 +<code>
 +cd ../../
 +cd icecite/pdf-cli
 +java -jar target/pdf-cli-*-jar-with-dependencies.jar [options] <input> [<output>]
 +</code>
 +Examples:
 + greenstone@bedrock:~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature words ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted1.txt
 +
 + greenstone@bedrock:~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature lines ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted2.txt
 +
 + greenstone@bedrock:~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature paragraphs ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted3.txt
 +
 +(Also tried with input file pdf01.pdf from the Reports collection)
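The three example runs differ only in the value given to --feature, so they can be wrapped in a loop. This is a rough sketch under some assumptions: it is run from icecite/pdf-cli, the jar has the 0.0.1-SNAPSHOT name shown in the examples, and the helper names out_name and convert_all are made up for illustration.

```shell
# Assumed jar location, relative to icecite/pdf-cli; override JAR if yours differs.
JAR="${JAR:-target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar}"

# Hypothetical helper: output file name for one input PDF and one feature,
# e.g. /tmp/report.pdf + words -> report-words.txt
out_name() {
  base=$(basename "$1" .pdf)
  echo "${base}-$2.txt"
}

# Hypothetical helper: convert one PDF once per --feature value used in the
# examples, writing the .txt files into the current directory.
convert_all() {
  input=$1
  for feature in words lines paragraphs; do
    java -jar "$JAR" --format txt --feature "$feature" "$input" "$(out_name "$input" "$feature")"
  done
}
```

For instance, convert_all ~/Downloads/A9-access-best-practices.pdf would produce one text file per feature, which makes it easier to compare the three outputs side by side.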
 +
 +
 +4. If you see the exception
 +---
 +Exception in thread "main" java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
 + at org.apache.pdfbox.pdmodel.encryption.PDEncryption.<init>(PDEncryption.java:96)
 + at org.apache.pdfbox.pdfparser.PDFParser.prepareDecryption(PDFParser.java:282)
 + at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:199)
 + at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:249)
 + at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:847)
 + at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:803)
 + at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
 + at parser.pdfbox.core.PdfStreamEngine.processFile(PdfStreamEngine.java:120)
 + at parser.pdfbox.PdfBoxParser.parse(PdfBoxParser.java:44)
 + at cli.PdfParserCommandLine.parse(PdfParserCommandLine.java:268)
 + at cli.PdfParserCommandLine.processFile(PdfParserCommandLine.java:247)
 + at cli.PdfParserCommandLine.process(PdfParserCommandLine.java:233)
 + at cli.PdfParserCommandLine.main(PdfParserCommandLine.java:168)
 +Caused by: java.lang.ClassNotFoundException: org.bouncycastle.jce.provider.BouncyCastleProvider
 + at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
 + at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 + at java.security.AccessController.doPrivileged(Native Method)
 + at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
 + at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 + at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 + at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 + ... 13 more
 +
 +---
 +
 +Then:
 +a. Obtain the Bouncy Castle (encryption?) jar files from https://www.bouncycastle.org/latest_releases.html
 +
 +Download both jar files listed under the "Provider" column for the row "JDK 1.5 - JDK 1.8" (not sure that both are necessary) and put them in, for example, the icecite/pdf-cli folder.
 +
 +b. Then see https://stackoverflow.com/questions/15930782/call-java-jar-myfile-jar-with-additional-classpath-option
 +for how to run a Java programme with multiple jar files on the classpath. You can't combine -cp with -jar, because java ignores the classpath setting when -jar is used. Instead, put every jar on the classpath and name the main class (cli.PdfParserCommandLine, as seen in the stack trace) explicitly:
 +
 + greenstone@bedrock:~/icecite/pdf-cli$ java -classpath '.:/home/greenstone/icecite/pdf-cli/*:target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature words ~/Desktop/24.pdf ~/Desktop/24converted.txt
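The quoting in the command above matters: the * must reach java unexpanded, because java itself expands a trailing /* in a classpath entry to all jar files in that directory. A small sketch of a wrapper that keeps this straight follows. The names build_classpath, run_icecite, PDF_CLI_DIR and JAR are hypothetical, and the default paths are simply the ones used in these notes.

```shell
# Assumed locations, taken from the command above; adjust to your own setup.
PDF_CLI_DIR="${PDF_CLI_DIR:-/home/greenstone/icecite/pdf-cli}"
JAR="${JAR:-target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar}"

# Hypothetical helper: build the classpath string from the folder holding the
# Bouncy Castle jars ($1) and the Icecite jar ($2). The * stays literal because
# it is inside double quotes; java expands it to the jars in that folder.
build_classpath() {
  echo ".:$1/*:$2"
}

# Hypothetical helper: run the Icecite command-line class with that classpath,
# passing all arguments straight through.
run_icecite() {
  java -classpath "$(build_classpath "$PDF_CLI_DIR" "$JAR")" cli.PdfParserCommandLine "$@"
}
```

Run from icecite/pdf-cli, a call such as run_icecite --format txt --feature words ~/Desktop/24.pdf ~/Desktop/24converted.txt should then be equivalent to the long command above.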
 +-->