PDF Documents in Greenstone
PDFPlugin
Some of these options use the standard pdftohtml program; others use ImageMagick and Ghostscript to convert the file to a series of images. Ghostscript is a program that can convert PostScript and PDF files to other formats. You can download it from here (follow the link to the current stable release).
PDFBox
The standard PDF Plugin can process PDF versions up to 1.4. To process later versions, you'll need to download the PDFBox extension. See 2.85 Release Notes.
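To see ahead of time which files are likely to need the PDFBox extension, the version declared in each file's header can be inspected. The following is a minimal sketch, not part of Greenstone: the function names `pdf_header_version` and `needs_pdfbox` are hypothetical, and it assumes the `%PDF-x.y` header is accurate (a PDF may also declare a newer version inside its document catalog, which this sketch does not read).

```python
import re

def pdf_header_version(data: bytes):
    """Return (major, minor) from a '%PDF-x.y' header, or None if absent."""
    # The header is normally the first line of the file; searching the first
    # 1024 bytes tolerates files that start with a little leading junk.
    match = re.search(rb"%PDF-(\d)\.(\d)", data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def needs_pdfbox(data: bytes) -> bool:
    """True if the declared version is later than 1.4 (assumed threshold
    taken from the text above)."""
    version = pdf_header_version(data)
    return version is not None and version > (1, 4)
```

In practice the bytes would come from something like `open(path, "rb").read(1024)`, where `path` is whichever PDF is about to be imported.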
Troubleshooting PDF document problems in Greenstone
Security settings in a PDF can prevent Greenstone from processing the file. To check them in Acrobat Reader, open the File menu, choose Document Properties, and go to the Security tab.
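When many files need checking, opening each one in Acrobat Reader is tedious. As a rough first pass, a script can look for the `/Encrypt` entry that protected PDFs carry in their trailer. This is only a heuristic sketch with a hypothetical function name: it can report false positives (for example, if the string happens to appear inside document content), and it says nothing about which permissions are restricted.

```python
def looks_encrypted(data: bytes) -> bool:
    """Heuristic: report whether the raw PDF bytes mention an /Encrypt entry.

    Encrypted PDFs reference an encryption dictionary from the trailer via
    the /Encrypt key, so its absence suggests the file is not protected.
    """
    return b"/Encrypt" in data
```

A file flagged by this check is worth opening in Acrobat Reader to confirm its security settings before importing it.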
PDF is a "page description language". This means that the document contains objects and commands such as "draw this text here" and "draw this image here".
Greenstone uses an external program called "pdftohtml" to extract text out of PDF files. Sometimes, there is no text that can be extracted. This often depends on how the PDF was created.
- Adobe Acrobat Writer can be used to create PDFs from paper documents that are scanned in by a scanner. In this case, the PDF file contains images of text, rather than computer-readable text. Therefore, pdftohtml cannot find any text to extract.
- Some programs (such as older versions of GNU Ghostscript, which is used by ps2pdf on Unix computers) sometimes create "bitmap fonts", which means that every character in the document is really an image rather than a computer-readable letter. The LaTeX typesetting program sometimes does this when the "Computer Modern Roman" font is used.
- Certain characters and character combinations may be extracted incorrectly, depending on the program that generated the PDF file. For example, "ligatures" such as "fi", "fl", "ff" and "ffl" are often rendered using a special glyph rather than as individual characters, and this information may be lost in the textual representation. Also, some PDF-generating programs may not correctly encode accented characters. For example, to draw a lowercase "u" with an umlaut accent, LaTeX draws a "u" and then draws an umlaut accent over it. This means that pdftohtml will extract two separate characters (¨ and 'u') rather than a single accented character (ü).
- PDF contains pieces of text, and coordinates for where that text should be displayed. This means that pdftohtml may incorrectly guess the order in which the text fragments are supposed to occur. For example, for text that is in two or more columns, the text may be extracted as the first sentence of each column, then the second sentence of each column, and so on. In this case, the extracted text is still usable for indexing purposes, but should not be displayed. Instead, add a format statement to the collect.cfg file that provides a link to the original PDF file but not to the extracted text, such as:
  format SearchVList "<td valign=top>[srclink][srcicon][/srclink]</td><td>[srclink][Title][/srclink]</td>"
- Because of the way that images are embedded in PDF files, pdftohtml occasionally extracts an image upside-down, or mirrored. This appears to be a bug in the program.
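The ligature and accent problems described in the list above can sometimes be repaired after extraction. The sketch below is an illustration only, not something Greenstone does itself: `repair_extracted_text` is a hypothetical helper, the accent table covers just two marks, and it assumes the accent was extracted immediately before its base letter (as in the ¨ and 'u' example). It cannot help when the ligature or accent was dropped from the extracted text entirely.

```python
import re
import unicodedata

# Ligature code points mapped to the plain letters they stand for.
LIGATURES = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
})

# Spacing accents mapped to their combining equivalents
# (illustrative, not exhaustive).
SPACING_TO_COMBINING = {
    "\u00a8": "\u0308",  # diaeresis (umlaut)
    "\u00b4": "\u0301",  # acute accent
}

_ACCENT_THEN_LETTER = re.compile("([\u00a8\u00b4])([A-Za-z])")

def repair_extracted_text(text: str) -> str:
    """Expand ligature glyphs and merge 'accent + letter' pairs."""
    text = text.translate(LIGATURES)
    # Move the accent behind its base letter as a combining mark.
    text = _ACCENT_THEN_LETTER.sub(
        lambda m: m.group(2) + SPACING_TO_COMBINING[m.group(1)], text
    )
    # NFC composes letter + combining mark into one precomposed character.
    return unicodedata.normalize("NFC", text)

print(repair_extracted_text("M\u00a8uller"))  # Müller
```

An explicit ligature table is used here rather than blanket NFKC normalization, since NFKC would also rewrite unrelated characters such as superscripts.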
