Differences

This shows you the differences between two versions of the page.

en:user_advanced:ice_cite [2019/02/21 06:26] – [Using the UnknownConverterPlugin to launch Icecite from GLI to do the PDF to text conversion] anupama
en:user_advanced:ice_cite [2023/03/13 01:46] (current) – external edit 127.0.0.1
Line 1: Line 1:
 +
 +
 +
 ====== Processing PDFs with Icecite and the UnknownConverterPlugin ======
 The contents of this page were originally created as the final section of the Greenstone 3 tutorial "Using the UnknownConverterPlugin to make unsupported document formats searchable", with that tutorial intended to appear officially in public with GS3.09. However this section has been removed from the tutorial for the following reasons:
Line 8: Line 11:
  
 ==== Using the Icecite's commandline tool to convert from PDF to text =====
-//[[https://github.com/ad-freiburg/pdfact|PdfAct]], formerly known as [[https://github.com/ckorzen/icecite|Icecite]] and which is the name used for the software on the rest of this page, is an open-source tool that can do many PDF related tasks, including extracting text from a PDF. In this part of the tutorial, we're going to learn how to run Icecite's PDF to text conversion utility from the command line. Based on that command, we'll configure the UnknownConverterPlugin to launch Icecite from GLI, to do the conversion on a PDF document in a Greenstone collection. This ends up being a useful exercise in instances where certain PDFs aren't recognised by Greenstone's PDFPlugin, even when pdfbox_conversion option (which uses the PDFBox tool for the conversion) is switched on. In such cases, you can use what you learn here.//
+//[[https://github.com/ad-freiburg/pdfact|PdfAct]], formerly known as **[[https://github.com/ckorzen/icecite|Icecite]]** which is the name used for the software on the rest of this page, is an open-source tool that can do many PDF related tasks, including extracting text from a PDF. In this part of the tutorial, we're going to learn how to run Icecite's PDF to text conversion utility from the command line. Based on that command, we'll configure the UnknownConverterPlugin to launch Icecite from GLI, to do the conversion on a PDF document in a Greenstone collection. This ends up being a useful exercise in instances where certain PDFs aren't recognised by Greenstone's PDFPlugin, even when pdfbox_conversion option (which uses the PDFBox tool for the conversion) is switched on. In such cases, you can use what you learn here.//
    
 //As Icecite needs Java 8, you need to have either a JDK8 or a JRE8 installed in order to proceed with this portion of the tutorial.//
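Since the text above requires Java 8, a quick way to confirm which Java is on the PATH may be useful before going further. The following is only a sketch: the function name `java_major` is hypothetical, and it assumes the version string follows the usual `1.8.0_201` (Java 8 and earlier) or `11.0.2` (Java 9 and later) shapes.

```shell
#!/bin/sh
# Hypothetical helper: print the major Java version for a version string.
# Assumes strings shaped like 1.8.0_201 (old scheme) or 11.0.2 (new scheme).
java_major() {
    v=$1
    # Old-style versions start with "1.", so the major version is the next field.
    case $v in
        1.*) v=${v#1.} ;;
    esac
    # Keep everything before the first '.', '_' or '-'.
    printf '%s\n' "${v%%[._-]*}"
}
```

To apply it to the installed Java, the quoted version can be pulled out of the `java -version` output (which is written to standard error), for example with `java_major "$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p')"`. A result other than 8 suggests that JAVA_HOME and PATH should be pointed at a Java 8 installation first.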
Line 42: Line 45:
   - Select the **UnknownConverterPlugin** in the list of plugins and keep pressing the **<Move Up>** button to shift it upwards, until it appears in the plugin pipeline above the existing **PDFPlugin**, so that this instance of **UnknownConverterPlugin**, configured as it has now been to handle PDF files, will take precedence in processing such files.
   - Move to the **Create** pane and build the collection. Once more, when Icecite conversion utility is called by Greenstone's building process, the conversion will take some time processing. But after a minute or so, the building will be done and you can Preview the collection. Search for some terms.
 +
 +
 +<!--
 +USING THE ICECITE TOOL TO CONVERT FROM PDF TO TXT
 +
 +1. Need Java 8 for compiling and probably also for running Icecite
 +<code>
 +export JAVA_HOME=/opt/java8/
 +export PATH=$JAVA_HOME/bin:$PATH
 +</code>
 +
 +2. Get and compile icecite, following the instructions at https://github.com/ckorzen/icecite
 +<code>
 +git clone https://github.com/ckorzen/icecite.git --recursive
 +cd icecite
 +git pull --recurse-submodules
 +cd pdf-parent/
 +mvn install
 +</code>
 +
 +3. Run icecite, general instructions at https://github.com/ckorzen/icecite
 +<code>
 +cd ../../
 +cd icecite/pdf-cli
 +java -jar target/pdf-cli-*-jar-with-dependencies.jar [options] <input> [<output>]
 +</code>
 +Examples:
 + greenstone@bedrock:~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature words ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted1.txt
 +
 + greenstone@bedrock:~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature lines ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted2.txt
 +
 + greenstone@bedrock:~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature paragraphs ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted3.txt
 +
 +(Also tried with input file pdf01.pdf from the Reports collection)
 +
 +
 +4. If you see the exception
 +---
 +Exception in thread "main" java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
 + at org.apache.pdfbox.pdmodel.encryption.PDEncryption.<init>(PDEncryption.java:96)
 + at org.apache.pdfbox.pdfparser.PDFParser.prepareDecryption(PDFParser.java:282)
 + at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:199)
 + at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:249)
 + at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:847)
 + at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:803)
 + at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
 + at parser.pdfbox.core.PdfStreamEngine.processFile(PdfStreamEngine.java:120)
 + at parser.pdfbox.PdfBoxParser.parse(PdfBoxParser.java:44)
 + at cli.PdfParserCommandLine.parse(PdfParserCommandLine.java:268)
 + at cli.PdfParserCommandLine.processFile(PdfParserCommandLine.java:247)
 + at cli.PdfParserCommandLine.process(PdfParserCommandLine.java:233)
 + at cli.PdfParserCommandLine.main(PdfParserCommandLine.java:168)
 +Caused by: java.lang.ClassNotFoundException: org.bouncycastle.jce.provider.BouncyCastleProvider
 + at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
 + at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 + at java.security.AccessController.doPrivileged(Native Method)
 + at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
 + at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 + at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 + at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 + ... 13 more
 +
 +---
 +
 +Then:
 +a. Obtain bouncycastle (encryption?) jar files from https://www.bouncycastle.org/latest_releases.html
 +
 +Download both jar files listed under the "Provider" column for row "JDK 1.5 - JDK 1.8" (not sure that both are necessary) and put them in icecite/pdf-cli folder (for example)
 +
 +b. Then see https://stackoverflow.com/questions/15930782/call-java-jar-myfile-jar-with-additional-classpath-option
 +for how to run a java programme when you have multiple jar files on classpath, as you can't run java with both -cp and -jar.
 +
 +greenstone@bedrock:~/icecite/pdf-cli$ java -classpath '.:/home/greenstone/icecite/pdf-cli/*:target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature words ~/Desktop/24.pdf ~/Desktop/24converted.txt
 +-->
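The classpath workaround in step 4b of the notes above could be wrapped in a small helper so the long command does not have to be retyped. This is a minimal sketch and not part of the original page: the function names `build_classpath` and `icecite_cmd` are hypothetical, as is the idea of keeping the Bouncy Castle jars in a separate directory. The main class `cli.PdfParserCommandLine` and the `--format` and `--feature` options are taken from the commands shown in the notes. The helper only prints the command so it can be inspected before being run.

```shell
#!/bin/sh
# Hypothetical helper: join the Icecite fat jar with a directory of extra jars
# (for example the Bouncy Castle provider jars) into one classpath string.
build_classpath() {
    fat_jar=$1
    extra_dir=$2
    # The java launcher itself expands a trailing /* to every jar in that
    # directory, so the wildcard is passed through literally, not globbed here.
    printf '%s:%s/*\n' "$fat_jar" "$extra_dir"
}

# Hypothetical helper: print (not run) the PDF to text conversion command.
# Arguments: fat jar, extra jar directory, feature, input PDF, output file.
icecite_cmd() {
    cp=$(build_classpath "$1" "$2")
    printf "java -classpath '%s' cli.PdfParserCommandLine --format txt --feature %s %s %s\n" \
        "$cp" "$3" "$4" "$5"
}
```

For example, `icecite_cmd target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar /home/greenstone/icecite/pdf-cli words in.pdf out.txt` prints a command of the same shape as the one in step 4b. Printing first, then copying the line into the terminal, keeps the single quotes around the classpath intact so the shell does not expand the wildcard.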
en/user_advanced/ice_cite.1550730390.txt.gz · Last modified: 2019/02/21 06:26 by anupama