HTML files
Downloading files from the web
See the download page for general information on downloading records through Greenstone.
Greenstone can download records over HTTP or FTP, either from the GLI (in the Download panel) or from the command line (using the downloadfrom.pl script). Either way, the following options are available:
Argument | Description |
---|---|
Source URL (-url <string>) | (REQUIRED) Source URL. If the server issues an HTTP redirect, this value may change |
Download Depth (-depth <int>) | How many hyperlinks deep to go when downloading (Default: 0) |
Only files below URL (-below) | Only mirror files below this URL |
Only files within site (-within) | Only mirror files within the same site |
Only HTML files (-html_only) | Download only HTML files, and ignore associated files, e.g. images and stylesheets |
If you download via the GLI, you can view the downloaded files on the Gather panel. On the left-hand side of the panel, double-click the Downloaded Files folder to expand its contents. The subfolders are named after the source URL. These files are physically stored in a temporary cache directory. You can build a collection from the downloaded files by dragging them across to the Collection section on the right-hand side of the Gather panel.
An example web download on the command line would be:
perl -S downloadfrom.pl -document_mode Web -url http://www.waikato.ac.nz/ -depth 1 -below -html_only
This downloads only HTML files below the URL http://www.waikato.ac.nz/, following links one hyperlink deep.
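To show how the options in the table map onto command-line flags, here is a minimal shell sketch. The function name build_download_cmd and its positional parameters are hypothetical and not part of Greenstone; the flags themselves (including -document_mode Web) are copied from the table and example above. The function only prints the command so that it can be reviewed before running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Greenstone): prints a downloadfrom.pl
# command line assembled from the options listed in the table above.
build_download_cmd() {
    url=$1              # value for -url (required)
    depth=${2:-0}       # value for -depth; 0 is the documented default
    scope=${3:-}        # "below", "within", or empty for no restriction
    html_only=${4:-no}  # "yes" adds -html_only

    if [ -z "$url" ]; then
        echo "build_download_cmd: a source URL is required" >&2
        return 1
    fi

    # Base command, with the same leading flags as the example above.
    cmd="perl -S downloadfrom.pl -document_mode Web -url $url -depth $depth"

    # Optional restriction on which links are followed.
    case $scope in
        below)  cmd="$cmd -below" ;;
        within) cmd="$cmd -within" ;;
    esac

    # Optionally skip associated files such as images and stylesheets.
    if [ "$html_only" = yes ]; then
        cmd="$cmd -html_only"
    fi

    printf '%s\n' "$cmd"
}

# Reproduces the example command above.
build_download_cmd http://www.waikato.ac.nz/ 1 below yes

# Variation: stay within the same site, two hyperlinks deep,
# keeping associated files such as images and stylesheets.
build_download_cmd http://www.waikato.ac.nz/ 2 within
```

Once the printed command looks right, it can be pasted into a terminal where the Greenstone environment has been set up.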
Downloaded HTML files are handled by the HTMLPlugin.
Additional Resources
Greenstone3
There are several tutorials on creating collections of HTML documents on Greenstone:
Greenstone2
There are several tutorials on creating collections of HTML documents on Greenstone: