en:tutorials

Differences

This shows you the differences between two versions of the page.

en:tutorials [2017/10/05 02:42] anupama | en:tutorials [2019/04/24 09:30] anupama
Line 443: Line 443:
  
 </TABAREA> </TABAREA>
-<!-- 
-USING THE ICECITE TOOL TO CONVERT FROM PDF TO TXT 
  
-1. Java 8 is needed to compile Icecite, and probably also to run it
-<code> 
-export JAVA_HOME=/opt/java8/ 
-export PATH=$JAVA_HOME/bin:$PATH 
-</code> 
- 
-2. Get and compile icecite, following the instructions at https://github.com/ckorzen/icecite 
-<code> 
-git clone https://github.com/ckorzen/icecite.git --recursive 
-cd icecite 
-git pull --recurse-submodules 
-cd pdf-parent/ 
-mvn install 
-</code> 
- 
-3. Run icecite, general instructions at https://github.com/ckorzen/icecite 
-<code> 
-cd ../../ 
-cd icecite/pdf-cli 
-java -jar target/pdf-cli-*-jar-with-dependencies.jar [options] <input> [<output>] 
-</code> 
-Examples: 
- greenstone@bedrock:~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature words ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted1.txt 
- 
- greenstone@bedrock:~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature lines ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted2.txt 
- 
- greenstone@bedrock:~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature paragraphs ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted3.txt 
- 
-(Also tried with input file pdf01.pdf from the Reports collection) 
- 
- 
-4. If you see the exception 
---- 
-Exception in thread "main" java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider 
- at org.apache.pdfbox.pdmodel.encryption.PDEncryption.<init>(PDEncryption.java:96) 
- at org.apache.pdfbox.pdfparser.PDFParser.prepareDecryption(PDFParser.java:282) 
- at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:199) 
- at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:249) 
- at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:847) 
- at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:803) 
- at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757) 
- at parser.pdfbox.core.PdfStreamEngine.processFile(PdfStreamEngine.java:120) 
- at parser.pdfbox.PdfBoxParser.parse(PdfBoxParser.java:44) 
- at cli.PdfParserCommandLine.parse(PdfParserCommandLine.java:268) 
- at cli.PdfParserCommandLine.processFile(PdfParserCommandLine.java:247) 
- at cli.PdfParserCommandLine.process(PdfParserCommandLine.java:233) 
- at cli.PdfParserCommandLine.main(PdfParserCommandLine.java:168) 
-Caused by: java.lang.ClassNotFoundException: org.bouncycastle.jce.provider.BouncyCastleProvider 
- at java.net.URLClassLoader$1.run(URLClassLoader.java:372) 
- at java.net.URLClassLoader$1.run(URLClassLoader.java:361) 
- at java.security.AccessController.doPrivileged(Native Method) 
- at java.net.URLClassLoader.findClass(URLClassLoader.java:360) 
- at java.lang.ClassLoader.loadClass(ClassLoader.java:424) 
- at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) 
- at java.lang.ClassLoader.loadClass(ClassLoader.java:357) 
- ... 13 more 
- 
---- 
- 
-Then:
-a. Obtain the Bouncy Castle (cryptography provider) jar files from https://www.bouncycastle.org/latest_releases.html
-
-Download both jar files listed under the "Provider" column for the row "JDK 1.5 - JDK 1.8" (it is not certain that both are necessary) and put them in, for example, the icecite/pdf-cli folder.
-
-b. Then see https://stackoverflow.com/questions/15930782/call-java-jar-myfile-jar-with-additional-classpath-option
-for how to run a Java program when you have multiple jar files on the classpath: java ignores -cp when -jar is given, so the main class has to be named explicitly instead.
- 
-greenstone@bedrock:~/icecite/pdf-cli$ java -classpath '.:/home/greenstone/icecite/pdf-cli/*:target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature words ~/Desktop/24.pdf ~/Desktop/24converted.txt 
---> 
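The classpath used in step 4b of the removed notes can be put together with a small helper. This is only a sketch: `build_classpath` is a hypothetical function name, not part of icecite, and the paths in the commented invocation are the ones from the notes and will differ on other machines.

```shell
#!/bin/sh
# Sketch only: build_classpath is a hypothetical helper, not part of icecite.
# It prints a classpath of the form used in step 4b:
#   .:<folder with the extra jars>/*:<icecite jar-with-dependencies>
build_classpath() {
    jar_dir=$1    # folder holding the extra jars (e.g. the Bouncy Castle ones)
    main_jar=$2   # the icecite pdf-cli jar-with-dependencies
    # The "/*" wildcard is expanded by java itself to every jar in the folder,
    # so the string has to stay quoted to keep the shell from globbing it.
    printf '%s\n' ".:${jar_dir}/*:${main_jar}"
}

# Example invocation, using the paths from the notes (adjust to your setup):
#   cp=$(build_classpath /home/greenstone/icecite/pdf-cli \
#        target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar)
#   java -classpath "$cp" cli.PdfParserCommandLine \
#        --format txt --feature words ~/Desktop/24.pdf ~/Desktop/24converted.txt
```

The main class (`cli.PdfParserCommandLine`, taken from the stack trace above) is named explicitly because `-jar` cannot be combined with an additional classpath.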
en/tutorials.txt · Last modified: 2023/11/29 00:21 by kjdon