This version (2014/04/14 11:52) is a draft.

HTML files

Downloading files from the web

See the download page for general information on downloading records through Greenstone.

Greenstone can download records over HTTP or FTP, either from the GLI (in the Download panel) or from the command line (using the downloadfrom.pl script). Either way, the following options are available:

Argument                            Description
Source URL (-url <string>)          (REQUIRED) Source URL. In case of HTTP redirects, this value may change.
Download Depth (-depth <int>)       How many hyperlinks deep to go when downloading (Default: 0)
Only files below URL (-below)       Only mirror files below this URL
Only files within site (-within)    Only mirror files within the same site
Only HTML files (-html_only)        Download only HTML files, and ignore associated files, e.g. images and stylesheets

If you download via the GLI, you can view the downloaded files on the Gather panel. On the left-hand side of the panel, double-click the Downloaded Files folder to expand its contents. The subfolders are named after the URL. These files are physically stored in a temporary cache directory. You can build a collection from these downloaded files by dragging them across to the Collection section on the right-hand side of the Gather panel.

An example web download on the command line would be:

 perl -S downloadfrom.pl -document_mode Web -url http://www.waikato.ac.nz/ -depth 1 -below -html_only

This downloads only HTML files below the URL http://www.waikato.ac.nz/, following links one hyperlink deep.
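If you run web downloads often, it may help to wrap the command in a small shell function. The following is only a sketch: build_download_cmd is a hypothetical helper name, not part of Greenstone, and the flags are copied verbatim from the example above. It prints the command line rather than running it, so you can check the result before executing it yourself.

```shell
#!/bin/sh
# Hypothetical helper (not part of Greenstone): prints a downloadfrom.pl
# command line assembled from a URL, a depth, and optional boolean flags.
# It only prints the command; it does not run the download.
build_download_cmd() {
    if [ "$#" -lt 2 ]; then
        echo "usage: build_download_cmd <url> <depth> [flags...]" >&2
        return 1
    fi
    url=$1      # value for -url (required)
    depth=$2    # value for -depth
    shift 2     # anything left is passed through, e.g. -below -html_only
    printf 'perl -S downloadfrom.pl -document_mode Web -url %s -depth %s' "$url" "$depth"
    for flag in "$@"; do
        printf ' %s' "$flag"
    done
    printf '\n'
}

# Reproduces the example command shown above.
build_download_cmd http://www.waikato.ac.nz/ 1 -below -html_only
```

Once the printed command looks right, you can copy it into a terminal where the Greenstone environment has been set up.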

Downloaded HTML files are handled by the HTMLPlugin.

Additional Resources