en:user_advanced:ice_cite

Differences

This shows you the differences between two versions of the page.

en:user_advanced:ice_cite [2019/02/21 05:58] – created anupama
en:user_advanced:ice_cite [2019/04/24 09:29] anupama
Line 8: Line 8:
  
 ==== Using Icecite's command-line tool to convert from PDF to text ====
-Icecite (now called PdfAct) is an open-source tool that can do many PDF related tasks, including extracting text from a PDF. In this part of the tutorial, we're going to learn how to run Icecite's PDF to text conversion utility from the command line. Based on that command, we'll configure the UnknownConverterPlugin to launch Icecite from GLI, to do the conversion on a PDF document in a Greenstone collection. This ends up being a useful exercise in instances where certain PDFs aren't recognised by Greenstone's PDFPlugin, even when pdfbox_conversion option (which uses the PDFBox tool for the conversion) is switched on. In such cases, you can use what you learn here.
+//[[https://github.com/ad-freiburg/pdfact|PdfAct]], formerly known as **[[https://github.com/ckorzen/icecite|Icecite]]** (the name used for the software on the rest of this page), is an open-source tool that can do many PDF-related tasks, including extracting text from a PDF. In this part of the tutorial, we're going to learn how to run Icecite's PDF to text conversion utility from the command line. Based on that command, we'll configure the UnknownConverterPlugin to launch Icecite from GLI, to do the conversion on a PDF document in a Greenstone collection. This is a useful exercise for cases where certain PDFs aren't recognised by Greenstone's PDFPlugin, even when the pdfbox_conversion option (which uses the PDFBox tool for the conversion) is switched on. In such cases, you can use what you learn here.//
-As Icecite needs Java 8, you need to have either a JDK8 or a JRE8 installed in order to proceed with this portion of the tutorial.
+
-Grab the pre-compiled Icecite zip file from http://trac.greenstone.org/export/head/gs3-extensions/gs-icecite/gs-icecite.zip (or from http://trac.greenstone.org/export/head/gs3-extensions/gs-icecite/gs-icecite.tar.gz, if you prefer a tarball) and decompress it into your Greenstone installation's ext subfolder.
+//As Icecite needs Java 8, you need to have either a JDK8 or a JRE8 installed in order to proceed with this portion of the tutorial.//
 
-Now you're ready to test Icecite's PDF to text conversion abilities manually, by running Icecite from the command line.
+  - Grab the pre-compiled Icecite zip file from http://trac.greenstone.org/export/head/gs3-extensions/gs-icecite/gs-icecite.zip (or from http://trac.greenstone.org/export/head/gs3-extensions/gs-icecite/gs-icecite.tar.gz, if you prefer a tarball) and decompress it into your Greenstone installation's ext subfolder.\\ Now you're ready to test Icecite's PDF to text conversion abilities manually, by running Icecite from the command line.
 +  - Set up your environment for Java 8:\\ <code>
 +export JAVA_HOME=/PATH/TO/YOUR-JAVA-8-HOME
 +export PATH=$JAVA_HOME/bin:$PATH
 +</code>
 +  - Run Icecite from the same terminal in which you set up the Java 8 environment. Run it on a PDF as follows, after first replacing the ''<PLACEHOLDERS>'' below.
 +    * On Windows, the command will look as follows. Note the double quotes around the classpath value and the semicolon as the path separator:\\ ''java -classpath "<DRIVE:\PATH-TO-GS-INSTALLATION>\ext\icecite\gs-installed-jars\*;<DRIVE:\PATH-TO-GS-INSTALLATION>\ext\icecite\pdf-cli\target\pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar" cli.PdfParserCommandLine --format txt --feature paragraphs <DRIVE:\FULL\PATH\TO\YOUR.pdf> <DRIVE:\FULL\PATH\TO\CONVERTED.txt>''
 +    * On Unix systems, the command will be of the following form, where single quotes are acceptable around the value for classpath and where colon is the path separator:\\ ''java -classpath '/<PATH-TO-GS-INSTALLATION>/ext/icecite/gs-installed-jars/*:/<PATH-TO-GS-INSTALLATION>/ext/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature paragraphs </PATH/TO/YOUR.pdf> </PATH/TO/CONVERTED.txt>''\\ 
 +  - It can take a while for the PDF to get converted, but once it has finished, you can inspect the text file produced, denoted by the placeholder string ''</PATH/TO/CONVERTED.txt>''\\ //You can experiment with using --feature words or --feature lines above, in place of --feature paragraphs, to find out the effect of such a change on the output file, particularly if --feature paragraphs does not produce the desired results for your PDFs.// 
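To compare the three ''--feature'' settings on one PDF, the Unix command above can be wrapped in a small shell helper. This is only a sketch: the function names and the ''GS_HOME'' argument are made up for illustration, while the jar locations and the ''cli.PdfParserCommandLine'' class are the ones used in the steps above. It assumes a Java 8 ''java'' is first on your ''PATH''.

```shell
# Sketch only: the function names and the GS_HOME argument are assumptions
# for illustration. The jar paths and main class come from the tutorial.

# Print the Unix classpath for the Icecite jars under a Greenstone installation.
# The '*' is kept literal so that java itself expands the jar wildcard.
icecite_classpath() {
    gs_home=$1
    printf '%s:%s\n' \
        "$gs_home/ext/icecite/gs-installed-jars/*" \
        "$gs_home/ext/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar"
}

# Convert one PDF three times, once per --feature value, writing
# OUTPUT_PREFIX-words.txt, OUTPUT_PREFIX-lines.txt and OUTPUT_PREFIX-paragraphs.txt.
# Usage: icecite_compare_features GS_HOME INPUT.pdf OUTPUT_PREFIX
icecite_compare_features() {
    gs_home=$1
    input_pdf=$2
    out_prefix=$3
    for feature in words lines paragraphs; do
        java -classpath "$(icecite_classpath "$gs_home")" cli.PdfParserCommandLine \
            --format txt --feature "$feature" "$input_pdf" "$out_prefix-$feature.txt" || return 1
    done
}
```

For example, ''icecite_compare_features /PATH/TO/GS-INSTALLATION /PATH/TO/YOUR.pdf /tmp/converted'' would leave three text files side by side, which makes it easier to judge which feature setting suits your PDFs.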
 +  
 +==== Using the UnknownConverterPlugin to launch Icecite from GLI to do the PDF to text conversion ==== 
 +//We're now ready to use the **UnknownConverterPlugin** to launch Icecite as the external tool to do the conversion, producing output that Greenstone's building scripts can ingest into Greenstone and index for searching.//  
 +  - Run GLI 
 +  - Create a new collection called Icecite. In the **Gather** pane, drop the sample PDF file into your collection.
 +  - Go to the **Design** pane and select **Document Plugins** from the list on the left. Add the **UnknownConverterPlugin**. Having tried out the Icecite conversion command manually in the previous part of this tutorial, we can now use it to configure the plugin. Click **<Configure Plugin...>** and set up the plugin with the following settings:
 +     * set ''convert_to'' to the ''text'' option; this is the output format upon conversion
 +     * set ''mime_type'' to ''application/pdf''
 +     * set ''srcicon'' to ''iconpdf'', since Greenstone already knows about this macro, already has an icon for PDFs and knows to associate the two
 +     * set ''process_extension'' to ''pdf''; this is the input format of the files that this instance of the **UnknownConverterPlugin** will process
 +     * set the ''exec_cmd'' field as follows, depending on your operating system: 
 +        * on Windows (note the semicolon as the path separator in the classpath):\\ ''DRIVE:\PATH\TO\YOUR-JAVA-8-HOME\bin\java -classpath "%%GSDL3SRCHOME\ext\icecite\gs-installed-jars\*;%%GSDL3SRCHOME\ext\icecite\pdf-cli\target\pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar" cli.PdfParserCommandLine --format txt --feature paragraphs %%INPUT_FILE %%OUTPUT''
 +        * on Unix systems:\\ ''/PATH/TO/YOUR-JAVA-8-HOME/bin/java -classpath '%%GSDL3SRCHOME/ext/icecite/gs-installed-jars/*:%%GSDL3SRCHOME/ext/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature paragraphs %%INPUT_FILE %%OUTPUT'' 
  
-Set up your environment for Java 8:
+Note: When filling in the ''exec_cmd'' field, leave the words with ''%''''%'' signs in front of them intact. They are placeholders for Greenstone to replace.
  
 +You will, however, need to adjust the above value for exec_cmd by finding out where your Java 8 is installed and replacing ''/PATH/TO/YOUR-JAVA-8-HOME'' (or ''DRIVE:\PATH\TO\YOUR-JAVA-8-HOME'' on Windows) with it. You need to provide the full path to the Java 8 executable because, at present, GLI binaries ship with Java 7, which is incompatible with the precompiled Icecite. If you're running Greenstone from source, you may also have a different version of Java set up in your environment. By providing the full path to the Java 8 executable above, you force Icecite's PDF conversion program to run with Java 8.
  
-export JAVA_HOME=/PATH/TO/YOUR-JAVA-8-HOME export PATH=$JAVA_HOME/bin:$PATH
+//On Windows, if there are spaces in any filepaths in the command, other than in the parameter value to -classpath, remember to enclose those filepaths in double quotes escaped with a backslash, \".//
 +  
 +//The above command will use the java executable to run the java Icecite program that does the actual PDF to text conversion. Greenstone will run the command given after first filling in the %%GSDL3SRCHOME, %%INPUT_FILE and %%OUTPUT appropriately. %%GSDL3SRCHOME works out to be the Greenstone 3 installation directory, whereas %%INPUT_FILE is whichever matching PDF it's processing and %%OUTPUT is likewise the file (or folder of files) produced by the conversion process. In this case, the output type is txt, as that's what Icecite produces. Once the conversion to text has finished, Greenstone will be able to process it as usual, such as indexing the extracted contents to make the document searchable.// 
 +  Having sufficiently configured the **UnknownConverterPlugin**, click the **<OK>** button to close its configuration dialog. 
 +  Select the **UnknownConverterPlugin** in the list of plugins and keep pressing the **<Move Up>** button to shift it upwards, until it appears in the plugin pipeline above the existing **PDFPlugin**, so that this instance of **UnknownConverterPlugin**, configured as it has now been to handle PDF files, will take precedence in processing such files. 
 +  Move to the **Create** pane and build the collection. As before, when Greenstone's building process calls the Icecite conversion utility, the conversion will take some time. After a minute or so, the building will be done and you can preview the collection. Search for some terms to check that the extracted text has been indexed.
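Before pasting a Java home into ''exec_cmd'', it can help to confirm that the ''java'' executable under it really is Java 8. The following is a minimal sketch, not part of Greenstone or Icecite: the function names are invented for illustration, and it relies on the fact that Java 8 reports its version as ''1.8.x'' in the first line printed by ''java -version''.

```shell
# Sketch only: both function names are assumptions made up for this example.

# Succeed if a `java -version` banner line reports Java 8 (version string 1.8.x).
is_java8_banner() {
    case $1 in
        *'version "1.8.'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the java executable under a candidate Java home.
# `java -version` writes to stderr, hence the 2>&1 redirection.
# Usage: check_java8_home /PATH/TO/YOUR-JAVA-8-HOME
check_java8_home() {
    java_home=$1
    banner=$("$java_home/bin/java" -version 2>&1 | head -n 1)
    is_java8_banner "$banner"
}
```

If ''check_java8_home /PATH/TO/YOUR-JAVA-8-HOME'' succeeds, that same path is the one to substitute into the ''exec_cmd'' value on Unix systems.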
  
-You would need to run Icecite from the terminal wherein you set up the Java 8 environment. Run it on a PDF as follows, after first replacing the <PLACEHOLDERS> below. 
  
-The command will look as follows on Windows, note the use of double quotes around the classpath value and the use of semi-colon as the path separator on Windows:
+<!--
 +USING THE ICECITE TOOL TO CONVERT FROM PDF TO TXT
  
-java -classpath "<DRIVE:\PATH-TO-GS-INSTALLATION>\ext\icecite\gs-installed-jars\*;<DRIVE:\PATH-TO-GS-INSTALLATION>\ext\icecite\pdf-cli\target\pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar" cli.PdfParserCommandLine --format txt --feature paragraphs <DRIVE:\FULL\PATH\TO\YOUR.pdf> <DRIVE:\FULL\PATH\TO\CONVERTED.txt>
+1. Need Java 8 for compiling and probably also for running Icecite
 +<code> 
 +export JAVA_HOME=/opt/java8/ 
 +export PATH=$JAVA_HOME/bin:$PATH 
 +</code>
  
-On Unix systems, the command will be of the following form, where single quotes are acceptable around the value for classpath and where colon is the path separator:
+2. Get and compile icecite, following the instructions at https://github.com/ckorzen/icecite
 +<code> 
 +git clone https://github.com/ckorzen/icecite.git --recursive 
 +cd icecite 
 +git pull --recurse-submodules 
 +cd pdf-parent/ 
 +mvn install 
 +</code>
  
-java -classpath '/<PATH-TO-GS-INSTALLATION>/ext/icecite/gs-installed-jars/*:/<PATH-TO-GS-INSTALLATION>/ext/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature paragraphs </PATH/TO/YOUR.pdf> </PATH/TO/CONVERTED.txt>
+3. Run icecite, general instructions at https://github.com/ckorzen/icecite
 +<code> 
 +cd ../../ 
 +cd icecite/pdf-cli 
 +java -jar target/pdf-cli-*-jar-with-dependencies.jar [options] <input> [<output>]
 +</code> 
 +Examples: 
 + greenstone@bedrock:~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature words ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted1.txt
  
 + greenstone@bedrock:~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature lines ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted2.txt
  
-It can take a while for the PDF to get converted, but once it has finished, you can inspect the text file produced, denoted by the placeholder string </PATH/TO/CONVERTED.txt>
+ greenstone@bedrock:~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature paragraphs ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted3.txt
  
-You can experiment with using --feature words or --feature lines above, in place of --feature paragraphs, to find out the effect of such a change on the output file, particularly if --feature paragraphs does not produce the desired results for your PDFs +(Also tried with input file pdf01.pdf from the Reports collection)
-Using the UnknownConverterPlugin to launch Icecite from GLI to do the PDF to text conversion  +
-We're now ready to use the UnknownConverterPlugin to launch Icecite as the external tool to do the conversion, producing output that Greenstone's building scripts can ingest into Greenstone and index for searching.  +
-Run GLI+
  
-Create a new collection called Icecite. In the Gather pane, drop in the sample PDF file into your collection. 
  
-In the Design pane and select Document Plugins from the list on the left. Add the UnknownConverterPlugin. Having tried out the Icecite conversion command manually in the previous part of this tutorial, we're now ready to use it when configuring the UnknownConverterPlugin. Click <Configure Plugin...> and set up the plugin with the following settings:
+4. If you see the exception
 +--- 
 +Exception in thread "main" java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider 
 + at org.apache.pdfbox.pdmodel.encryption.PDEncryption.<init>(PDEncryption.java:96) 
 + at org.apache.pdfbox.pdfparser.PDFParser.prepareDecryption(PDFParser.java:282) 
 + at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:199) 
 + at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:249) 
 + at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:847) 
 + at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:803) 
 + at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757) 
 + at parser.pdfbox.core.PdfStreamEngine.processFile(PdfStreamEngine.java:120) 
 + at parser.pdfbox.PdfBoxParser.parse(PdfBoxParser.java:44) 
 + at cli.PdfParserCommandLine.parse(PdfParserCommandLine.java:268) 
 + at cli.PdfParserCommandLine.processFile(PdfParserCommandLine.java:247) 
 + at cli.PdfParserCommandLine.process(PdfParserCommandLine.java:233) 
 + at cli.PdfParserCommandLine.main(PdfParserCommandLine.java:168) 
 +Caused by: java.lang.ClassNotFoundException: org.bouncycastle.jce.provider.BouncyCastleProvider 
 + at java.net.URLClassLoader$1.run(URLClassLoader.java:372) 
 + at java.net.URLClassLoader$1.run(URLClassLoader.java:361) 
 + at java.security.AccessController.doPrivileged(Native Method) 
 + at java.net.URLClassLoader.findClass(URLClassLoader.java:360) 
 + at java.lang.ClassLoader.loadClass(ClassLoader.java:424) 
 + at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) 
 + at java.lang.ClassLoader.loadClass(ClassLoader.java:357) 
 + ... 13 more
  
-set convert_to to the text option, this is the output format upon conversion +---
-set mime_type to application/pdf  +
-set srcicon to the iconpdf, since Greenstone already knows about this macro and already has an icon for PDFs and knows to associate the two +
-set process_extension to pdf, this is the input format of the files that this instance of the UnknownConverterPlugin will process +
-set the exec_cmd field as follows, depending on your operating system: +
-on Windows:+
  
-DRIVE:\PATH\TO\YOUR-JAVA-8-HOME\bin\java -classpath "%%GSDL3SRCHOME\ext\icecite\gs-installed-jars\*:%%GSDL3SRCHOME\ext\icecite\pdf-cli\target\pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar" cli.PdfParserCommandLine --format txt --feature paragraphs %%INPUT_FILE %%OUTPUT +Then: 
 +a. Obtain bouncycastle (encryption?) jar files from https://www.bouncycastle.org/latest_releases.html
  
-on Unix systems:
+Download both jar files listed under the "Provider" column for row "JDK 1.5 - JDK 1.8" (not sure that both are necessary) and put them in icecite/pdf-cli folder (for example)
  
-/PATH/TO/YOUR-JAVA-8-HOME/bin/java -classpath '%%GSDL3SRCHOME/ext/icecite/gs-installed-jars/*:%%GSDL3SRCHOME/ext/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature paragraphs %%INPUT_FILE %%OUTPUT +b. Then see https://stackoverflow.com/questions/15930782/call-java-jar-myfile-jar-with-additional-classpath-option 
 +for how to run a java programme when you have multiple jar files on classpath, as you can't run java with both -cp and -jar.
  
- +greenstone@bedrock:~/icecite/pdf-cli$ java -classpath '.:/home/greenstone/icecite/pdf-cli/*:target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature words ~/Desktop/24.pdf ~/Desktop/24converted.txt
- +-->
-Note: When filling in the exec_cmd field, leave the words with %% signs in front of them intact. They are placeholders for Greenstone to replace. +
- +
-You will however need to adjust the above value for exec_cmd by finding out where your Java 8 is installed and replacing /PATH/TO/YOUR-JAVA-8-HOME with it. The reason you need to provide the full path to the Java 8 executable is because, at present, GLI binaries ship with Java 7, which is incompatible with the precompiled Icecite. And if you're running Greenstone from a source, you may have a different version of Java set up in your environment too. However, by providing the full path to the Java 8 executable above, you force Icecite's PDF conversion program to run with Java 8. +
- +
-On Windows, if there are spaces in any filepaths in the command, other than in the parameter value to -classpath, remember to bookend those filepaths within double quotes escaped with a backslash, \".  +
-The above command will use the java executable to run the java Icecite program that does the actual PDF to text conversion. Greenstone will run the command given after first filling in the %%GSDL3SRCHOME, %%INPUT_FILE and %%OUTPUT appropriately. %%GSDL3SRCHOME works out to be the Greenstone 3 installation directory, whereas %%INPUT_FILE is whichever matching PDF it's processing and %%OUTPUT is likewise the file (or folder of files) produced by the conversion process. In this case, the output type is txt, as that's what Icecite produces. Once the conversion to text has finished, Greenstone will be able to process it as usual, such as indexing the extracted contents to make the document searchable.  +
-Having sufficiently configured the UnknownConverterPlugin, click the <OK> button to close its configuration dialog. +
- +
-Select the UnknownConverterPlugin in the list of plugins and keep pressing the <Move Up> button to shift it upwards, until it appears in the plugin pipeline above the existing PDFPlugin, so that this instance of UnknownConverterPlugin, configured as it has now been to handle PDF files, will take precedence in processing such files. +
- +
-Move to the Create pane and build the collection. Once more, when Icecite conversion utility is called by Greenstone's building process, the conversion will take some time processing. But after a minute or so, the building will be done and you can Preview the collection. Search for some terms.+
en/user_advanced/ice_cite.txt · Last modified: 2023/03/13 01:46 by 127.0.0.1