en:user_advanced:ice_cite

Differences

This shows you the differences between two versions of the page.

en:user_advanced:ice_cite [2019/03/13 05:57] anupama
en:user_advanced:ice_cite [2019/04/24 09:29] anupama
Line 42:
   - Select the **UnknownConverterPlugin** in the list of plugins and keep pressing the **<Move Up>** button until it appears in the plugin pipeline above the existing **PDFPlugin**. This instance of **UnknownConverterPlugin**, now configured to handle PDF files, will then take precedence in processing such files.
   - Move to the **Create** pane and build the collection. Once again, when Greenstone's building process calls the Icecite conversion utility, the conversion will take some time. After a minute or so, the build will finish and you can Preview the collection. Search for some terms.
 +
 +
 +<!--
 +USING THE ICECITE TOOL TO CONVERT FROM PDF TO TXT
 +
 +1. Java 8 is needed to compile Icecite, and probably also to run it
 +<code>
 +export JAVA_HOME=/opt/java8/
 +export PATH=$JAVA_HOME/bin:$PATH
 +</code>
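Since these notes depend on Java 8, it can help to confirm which Java is actually on the PATH before building. The following is a minimal sketch: the function name is_java8_version_line is hypothetical, and it assumes that the first line of the java -version output contains a quoted version string starting with 1.8, as Java 8 builds commonly print.

```shell
# Hypothetical helper: succeeds if a `java -version` line reports a 1.8.x (Java 8) version.
is_java8_version_line() {
    case "$1" in
        *'"1.8.'*) return 0 ;;
        *) return 1 ;;
    esac
}

# `java -version` prints to stderr, so redirect it before taking the first line.
if is_java8_version_line "$(java -version 2>&1 | head -n 1)"; then
    echo "Java 8 found on PATH"
else
    echo "Java 8 not found on PATH; check JAVA_HOME and PATH"
fi
```

If the check fails after the exports above, JAVA_HOME most likely points somewhere other than a Java 8 installation.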
 +
 +2. Get and compile icecite, following the instructions at https://github.com/ckorzen/icecite
 +<code>
 +git clone https://github.com/ckorzen/icecite.git --recursive
 +cd icecite
 +git pull --recurse-submodules
 +cd pdf-parent/
 +mvn install
 +</code>
 +
 +3. Run icecite, general instructions at https://github.com/ckorzen/icecite
 +<code>
 +cd ../../
 +cd icecite/pdf-cli
 +java -jar target/pdf-cli-*-jar-with-dependencies.jar [options] <input> [<output>]
 +</code>
 +Examples:
 + greenstone@bedrock:~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature words ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted1.txt
 +
 + greenstone@bedrock:~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature lines ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted2.txt
 +
 + greenstone@bedrock:~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature paragraphs ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted3.txt
 +
 +(Also tried with input file pdf01.pdf from the Reports collection)
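The three example runs above differ only in the value given to --feature, so they can be wrapped in a loop. This is a sketch rather than part of Icecite: the helper names output_name and convert_all_features and the ICECITE_JAR variable are hypothetical, and the jar path assumes you are in icecite/pdf-cli with the 0.0.1-SNAPSHOT build shown in the examples.

```shell
# Assumed location of the built jar; the version in the file name may differ on your system.
ICECITE_JAR=${ICECITE_JAR:-target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar}

# Hypothetical helper: derive an output name such as report-words.txt from report.pdf.
output_name() {
    # $1 = input PDF, $2 = feature (words, lines or paragraphs)
    printf '%s-%s.txt\n' "$(basename "$1" .pdf)" "$2"
}

# Convert one PDF once per feature value used in the examples above.
# Output files are written to the current directory.
convert_all_features() {
    for feature in words lines paragraphs; do
        java -jar "$ICECITE_JAR" --format txt --feature "$feature" \
            "$1" "$(output_name "$1" "$feature")"
    done
}
```

For example, convert_all_features ~/Downloads/A9-access-best-practices.pdf would produce one text file per feature, which makes it easy to compare the three outputs side by side.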
 +
 +
 +4. If you see the exception
 +---
 +Exception in thread "main" java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
 + at org.apache.pdfbox.pdmodel.encryption.PDEncryption.<init>(PDEncryption.java:96)
 + at org.apache.pdfbox.pdfparser.PDFParser.prepareDecryption(PDFParser.java:282)
 + at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:199)
 + at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:249)
 + at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:847)
 + at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:803)
 + at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
 + at parser.pdfbox.core.PdfStreamEngine.processFile(PdfStreamEngine.java:120)
 + at parser.pdfbox.PdfBoxParser.parse(PdfBoxParser.java:44)
 + at cli.PdfParserCommandLine.parse(PdfParserCommandLine.java:268)
 + at cli.PdfParserCommandLine.processFile(PdfParserCommandLine.java:247)
 + at cli.PdfParserCommandLine.process(PdfParserCommandLine.java:233)
 + at cli.PdfParserCommandLine.main(PdfParserCommandLine.java:168)
 +Caused by: java.lang.ClassNotFoundException: org.bouncycastle.jce.provider.BouncyCastleProvider
 + at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
 + at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 + at java.security.AccessController.doPrivileged(Native Method)
 + at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
 + at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 + at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 + at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 + ... 13 more
 +
 +---
 +
 +Then:
 +a. Obtain the Bouncy Castle (cryptography provider) jar files from https://www.bouncycastle.org/latest_releases.html
 +
 +Download both jar files listed under the "Provider" column for the row "JDK 1.5 - JDK 1.8" (it is not clear that both are necessary) and put them somewhere convenient, for example in the icecite/pdf-cli folder
 +
 +b. Then see https://stackoverflow.com/questions/15930782/call-java-jar-myfile-jar-with-additional-classpath-option
 +for how to run a Java program when you have multiple jar files on the classpath. You can't combine -cp with -jar, because java ignores the classpath option when -jar is given, so name the main class explicitly instead:
 +
 +greenstone@bedrock:~/icecite/pdf-cli$ java -classpath '.:/home/greenstone/icecite/pdf-cli/*:target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature words ~/Desktop/24.pdf ~/Desktop/24converted.txt
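The classpath in the command above can be assembled by a small wrapper so that it does not have to be retyped for every file. This is a sketch under stated assumptions: the function names icecite_classpath and run_icecite are hypothetical, the ":" separator is the Unix one, and cli.PdfParserCommandLine is taken from the stack trace above as the main class.

```shell
# Hypothetical helper: build a classpath from the current directory, every jar in
# a directory (Java's own "dir/*" wildcard) and the Icecite jar itself.
icecite_classpath() {
    # $1 = directory holding the extra (Bouncy Castle) jars, $2 = Icecite jar
    printf '.:%s/*:%s\n' "$1" "$2"
}

# Run the Icecite command-line main class with that classpath.
run_icecite() {
    # $1 = extra jar directory, $2 = Icecite jar, remaining arguments go to Icecite
    icecite_cp=$(icecite_classpath "$1" "$2")
    shift 2
    # Quote the classpath so the shell leaves "*" alone and java expands it.
    java -classpath "$icecite_cp" cli.PdfParserCommandLine "$@"
}
```

A call such as run_icecite ~/icecite/pdf-cli target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature words ~/Desktop/24.pdf ~/Desktop/24converted.txt should then behave like the command above. The quoting matters for the same reason the original command uses single quotes: an unquoted "*" would be expanded by the shell into a space-separated list, which java does not accept as a classpath.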
 +-->
en/user_advanced/ice_cite.txt · Last modified: 2023/03/13 01:46 by 127.0.0.1