+ | |||
+ | |||
+ | ====== Advanced collection building ====== | ||
+ | |||
+ | In the GLI, the actual building of the collection is done in the **Create** panel. There are many options associated with the building process. Many of these are available in the GLI when in expert mode ('' | ||
+ | |||
+ | ===== Importing ===== | ||
+ | Building collections is split into two segments: importing and building. The import process is when all of the documents and metadata are actually imported into the collection--documents are processed by their respective plugins, metadata is extracted, and both assigned and extracted metadata are stored in a format Greenstone can use. | ||
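
If you are working from the command line rather than the GLI, the import stage corresponds to running ''import.pl''. A minimal sketch, assuming a collection called ''mycol'' (a placeholder name) and that your Greenstone environment has been set up:

<code bash>
# Run just the import stage for the collection "mycol"
perl -S import.pl mycol
</code>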
+ | |||
+ | |||
+ | You have several import options available in the GLI: | ||
+ | |||
+ | Greenstone can store metadata in two different formats: **GreenstoneXML** (default) and **GreenstoneMETS**. Use the '' | ||
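
For example, to store the imported documents in METS form rather than the default (a sketch; the collection name is again a placeholder):

<code bash>
perl -S import.pl -saveas GreenstoneMETS mycol
</code>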
+ | |||
+ | **Logs**: Logs provide information on what has occurred during the import/ | ||
+ | |||
+ | |||
+ | '' | ||
+ | |||
+ | |||
+ | '' | ||
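
Several of these options can be combined on the command line (a sketch; values are illustrative):

<code bash>
# Re-import from scratch, stop after 100 documents,
# and report extra detail while doing so
perl -S import.pl -removeold -maxdocs 100 -verbosity 3 mycol
</code>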
+ | |||
+ | Every document in the collection must have a unique object identifier (OID). Greenstone gives you four options for assigning this identifier ('' | ||
+ | * **hash** (default): This creates a hash of the contents of the file, and will be consistent every time the collection is imported | ||
+ | * **assigned**: | ||
+ | * **dirname**: | ||
+ | * **incremental**: | ||
<tabbox Greenstone3>The OID assignment code lives in ''gs2build/perllib/doc.pm'' under your Greenstone 3 installation.
<tabbox Greenstone2>The OID assignment code lives in ''perllib/doc.pm'' under your Greenstone 2 installation.
</tabbox>
If none of these schemes suits you, you can create your own style of identifier by replacing the "hash" case in this file with your own code. The "incremental" scheme, for instance, is implemented by the following case:
<code perl>
} elsif ($self->{'OIDtype'} eq "incremental") {
    # simple sequential identifiers: D0, D1, D2, ...
    $OID = "D" . $OIDcount;
    $OIDcount ++;
</code>
(//Identifiers assigned this way depend on the order in which documents are imported, so they may change between rebuilds.//)
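
To pick one of the built-in schemes at import time, pass ''OIDtype'' (and, for the assigned scheme, ''OIDmetadata'') on the command line. A sketch:

<code bash>
# Use the value of dc.Identifier metadata as each document's OID
perl -S import.pl -OIDtype assigned -OIDmetadata dc.Identifier mycol
</code>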
+ | |||
+ | |||
+ | ===== Building ===== | ||
+ | The actual building part of the build process takes the imported documents and metadata and builds the indexes, classifiers, | ||
+ | |||
+ | Options include only building specific indexes ('' | ||
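
On the command line these correspond to options of ''buildcol.pl''. A sketch:

<code bash>
# Build at most 100 documents, without storing the compressed text
perl -S buildcol.pl -no_text -maxdocs 100 mycol
</code>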
+ | |||
+ | <tabbox Greenstone3> | ||
+ | <tabbox Greenstone2> | ||
+ | ===== Scheduling builds ===== | ||
+ | Greenstone allows you to set your collection to automatically re-build either hourly, daily, or weekly. In **Schedule Options** in the **Create** panel, check '' | ||
+ | </ | ||
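
If you would rather manage the schedule yourself, you can invoke a rebuild from the system scheduler directly. A sketch using cron (paths and the collection name are assumptions; ''full-rebuild.pl'' is the helper script in recent Greenstone 2 releases that runs import, buildcol, and the final rename in sequence):

<code bash>
# Example crontab entry: rebuild "mycol" every Sunday at 2am
0 2 * * 0 bash -c 'cd /path/to/greenstone && source ./setup.bash && perl -S full-rebuild.pl mycol'
</code>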
+ | |||
+ | |||
+ | ===== Incremental Building ===== | ||
+ | Incremental building is a way to only import new/ | ||
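
From the command line, the ''incremental'' flag to ''import.pl'' and ''buildcol.pl'' requests this behaviour (a sketch; note that incremental index updates require an indexer that supports them, such as Lucene):

<code bash>
# Import only new or changed documents, then update the indexes in place
perl -S import.pl -incremental mycol
perl -S buildcol.pl -incremental mycol
</code>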
+ | |||
+ | |||
+ | |||
+ | ===== Apparent multiple copies of documents ===== | ||
+ | When you look at a built collection, you may notice that there appear to be three copies of each document: in the //import//, // | ||
+ | |||
+ | |||
+ | =====Collection Size Limits===== | ||
+ | The largest collections we have built have been 7 GB of text, and 11 million short documents (about 3 GB of text). These built with no problems. We haven' | ||
+ | |||
+ | There are four main limitations: | ||
+ | * Operating system limitations: | ||
+ | * Technical limitations: | ||
+ | * Build time limitations: | ||
+ | * GLI limitations: | ||
+ | |||
+ | In practice, the solution for very large amounts of data is not to treat the collection as one huge monolith, but to partition it into subcollections and arrange for the search engine to search them all together behind the scenes. However, while you can amalgamate the results of searching subcollections fairly easily, it's much harder with browsing. Of course, A-Z lists and datelists and the like aren't really much use with very large collections. This is where new techniques of hierarchical phrase browsing come into their own. And the really good news is that you can partition a collection into subcollections, | ||
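
In Greenstone 2, one way to achieve this is cross-collection searching: each participating collection lists its companions on a ''supercollection'' line in its ''etc/collect.cfg'', and searches are then run across all of them together. A minimal sketch, with placeholder collection names:

<code>
# In etc/collect.cfg of each participating collection:
supercollection papers1 papers2 papers3
</code>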
+ | |||
+ | In 2004 a test collection was built by " | ||
+ | |||
+ | //Setup details// | ||
+ | * Greenstone version: 2.52 | ||
+ | * Server: Pentium IV 1.8 GHz, 512 Mb RAM, Windows XP Prof. | ||
+ | * Number of indexed documents: 17,655 | ||
+ | * Number of images (tiff format): 980,000 | ||
+ | * Total size of text files: 3.2 GB | ||
+ | * Built indexes: section: | ||
+ | * Used Plugin: PagedImgPlug | ||
+ | * 5 classifiers | ||
+ | |||
+ | // | ||
+ | * Time to import the collection: Almost a week was spent collecting documents and importing them. No image conversion was done. | ||
+ | * Time to build the collection (excluding import): almost 24 hours. The archives and the indexes were on separate hard disks, to reduce the overhead that reading and writing from the same disk would cause. | ||
+ | * Time to open a hierarchy node that contains 908 objects: 23 seconds | ||
+ | * Average Time to search only one word in text index: 2 to 5 seconds | ||
+ | * Average Time to search 3 words in text index: 2 to 5 seconds | ||
+ | * Average Time to search exact phrases (includes 4, 5 and 6 words): 30 seconds | ||
+ | |||
+ | ====The Papers Past collection==== | ||
+ | Greenstone has been used to build the digital library collection for the [[http:// | ||
+ | |||
+ | At the start of February 2008, the collection for Papers Past comprised: | ||
+ | * 1,119,788 newspaper pages, from 207,793 issues. Of those, 91,545 issues--601, | ||
+ | * As at 06 March 2008, the number of documents--which corresponds to the number of newspaper issues--was 207,844. The space these take up in Greenstone is about 17.25GB (18, | ||
+ | * The total built index directory is 87GB. That includes the GDBM databases used to store word coordinates and the Lucene index itself (but no images). | ||
+ | * The 1,119,788 newspapers images are stored in TIFF format. (The total size of the collection data is still uncertain: it is either 3Tb or--if the images average 500kb each, as they have been estimated to--it is 546GB of image data.) | ||
+ | |||
+ | |||
+ | ===== Using the GLI with command line building===== | ||
+ | |||
+ | In some circumstances you may want to use the Librarian Interface to design your collection, | ||
+ | then actually build it using command line building. When you click " | ||
+ | its carrying out the last three steps: import, buildcol, and renaming building to index. | ||
+ | So you can do the earlier steps using the Librarian Interface, and then import and build | ||
+ | on the command line. If you are generating archive files by hand, then you will need to do | ||
+ | this as you will not be able to use the Librarian Interface to " | ||
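
A sketch of those three steps on the command line, assuming Greenstone 2 and a collection named ''mycol'':

<code bash>
cd $GSDLHOME
perl -S import.pl mycol
perl -S buildcol.pl mycol
# swap the newly built indexes into place
rm -rf collect/mycol/index
mv collect/mycol/building collect/mycol/index
</code>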
+ | |||
+ | |||
+ | |||
+ | |||
+ | ===== Additional Resources ===== | ||
+ | * **[[Command line building]]** offers more, advanced options for collection importing and building | ||