Code Notes about Incremental Building.

archiveinf-src.gdb is a reverse lookup from each import file to the OID(s) it is used for. One file may end up with several OIDs, e.g. if it is a database file of many records.

archiveinf-doc.gdb stores a list of OIDs. For each one, it stores the following:

 [HASHe407a49647d425a685f959]
 <sort-meta>
 <doc-file>HASHe407.dir/doc.xml
 <src-file>/research/kjdon/home/gsdl/collect/incre4/import/1-1.jpeg
 <assoc-file>
 <meta-file>
 <index-status>I

src-file is the primary import document that generated this Greenstone document. doc-file is the file in archives. sort-meta is the metadata used for sorting the build (if -sortmeta was specified for import).

assoc-file and meta-file store auxiliary files that were used for this document, e.g. txt files for a pagedimage, or an OAI metadata file.
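As a rough illustration of the record layout above, here is a minimal sketch that parses the text form of one record into a Perl hash. parse_doc_record is a hypothetical helper, not part of Greenstone, and it assumes that each field sits on a single "<name>value" line and that a repeated field (e.g. several assoc-file lines) should be collected into a list.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one record in the text form shown above.
# Returns the OID and a hash mapping each field name to a list of values.
# Assumption: one "<name>value" pair per line; repeated names accumulate.
sub parse_doc_record {
    my ($text) = @_;
    my ($oid, %fields);
    foreach my $line (split /\n/, $text) {
        $line =~ s/^\s+|\s+$//g;          # trim surrounding whitespace
        next if $line eq '';
        if ($line =~ /^\[(.+)\]$/) {      # "[OID]" starts the record
            $oid = $1;
        } elsif ($line =~ /^<([^>]+)>(.*)$/) {
            push @{ $fields{$1} }, $2;    # value may be empty
        }
    }
    return ($oid, \%fields);
}

my $record = <<'END';
 [HASHe407a49647d425a685f959]
 <sort-meta>
 <doc-file>HASHe407.dir/doc.xml
 <src-file>/research/kjdon/home/gsdl/collect/incre4/import/1-1.jpeg
 <assoc-file>
 <meta-file>
 <index-status>I
END

my ($oid, $fields) = parse_doc_record($record);
print "$oid: doc-file is $fields->{'doc-file'}[0]\n";
```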

For a metadata file to be listed in these databases, an entry must be added to extrametafile:

 $extrametafile->{$filename_for_metadata}->{$file} = $filename_full_path;

$filename_for_metadata is the primary import file, $file is a short version of the meta file, and $filename_full_path is the full path to the metadata file.
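To make the shape of that structure concrete, the following sketch fills in one entry and reads it back. note_extra_meta_file is a hypothetical helper, the file paths are made up, and using the basename as the "short version" of the metadata file name is an assumption.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper (not part of Greenstone): record that the metadata
# file at $meta_full_path supplies metadata for the primary import file.
sub note_extra_meta_file {
    my ($extrametafile, $filename_for_metadata, $meta_full_path) = @_;
    my $file = basename($meta_full_path);   # assumption: short name = basename
    $extrametafile->{$filename_for_metadata}->{$file} = $meta_full_path;
    return $file;
}

my $extrametafile = {};
# Made-up example paths, for illustration only.
note_extra_meta_file($extrametafile, 'docs/report.pdf', '/tmp/import/docs/report.oai');

# List every metadata file recorded against each primary file.
foreach my $primary (sort keys %$extrametafile) {
    foreach my $short (sort keys %{ $extrametafile->{$primary} }) {
        print "$primary <- $short ($extrametafile->{$primary}{$short})\n";
    }
}
```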

Note, metadata.xml files and metadatacsv files do not do this. Instead, during the file_block_read, they are marked as metadata files:

   $block_hash->{'metadata_files'}->{$filename_full_path} = 1;

It seems that metadata_read is always carried out, so if a doc is changed it will always get its metadata. If a metadata file is changed, then the whole folder must be reimported, as we don't know whether new docs have had metadata added to that file.
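The folder rule can be sketched as follows. files_to_reimport is a hypothetical helper, not the actual import code; it takes the changed metadata files and the known import files and returns every import file in an affected folder. Treating subfolders as affected too is an assumption.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);

# Hypothetical helper: given the changed metadata files and all known
# import files, return the import files that must be reimported.
# Assumption: a metadata file affects its own folder and all subfolders.
sub files_to_reimport {
    my ($changed_meta_files, $all_import_files) = @_;
    my %dirs = map { dirname($_) => 1 } @$changed_meta_files;
    my @result;
    foreach my $file (@$all_import_files) {
        foreach my $dir (keys %dirs) {
            # Prefix match on "dir/" so that /a does not match /ab.
            if (index($file, "$dir/") == 0) {
                push @result, $file;
                last;
            }
        }
    }
    my @sorted = sort @result;
    return @sorted;
}
```

The trailing slash in the prefix test is what keeps a sibling folder with a similar name from being pulled in by mistake.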

Because OAI files currently only give metadata to one document, I am assuming that the document that it refers to will never change, and so we do a lookup for the file it belongs to, like for associated files. We don't reimport the whole folder.

Plugins need to call associate_source_file for any auxiliary files in import that they are using (e.g. images in an HTML page, image and text files in a pagedimage, etc). This records these files as associated files, and the info goes into the two databases.

HTMLPlugin and PagedImagePlugin are sometimes used to process converted files - e.g. PDF converted to HTML/item, then processed by the HTML/PagedImage plugin. In this case, we do not want to record the auxiliary files, as they are in the temp dir and will not be modified directly. When loading up the secondary plugin, the main plugin must set the -tmp_file_process option, and the secondary plugin will then not store these auxiliary files.

If a doc or one of its associated files changes, then the doc is marked for reimporting, and the old record is marked for deletion (the doc may end up with a new id after reimporting).
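A minimal sketch of that decision, using in-memory hashes as stand-ins for the two databases. build_src_lookup and plan_update are hypothetical names, and the hash layout (OID => src-file plus lists of assoc-file and meta-file) is an assumption made for illustration; the real code reads the .gdb files.

```perl
use strict;
use warnings;

# Build the reverse lookup file => [OIDs] (the role of archiveinf-src.gdb)
# from an in-memory stand-in for archiveinf-doc.gdb.
# Assumed layout: OID => { 'src-file' => path,
#                          'assoc-file' => [...], 'meta-file' => [...] }
sub build_src_lookup {
    my ($doc_db) = @_;
    my %src;
    foreach my $oid (sort keys %$doc_db) {
        my $rec = $doc_db->{$oid};
        my @files = ($rec->{'src-file'},
                     @{ $rec->{'assoc-file'} || [] },
                     @{ $rec->{'meta-file'}  || [] });
        push @{ $src{$_} }, $oid foreach @files;
    }
    return \%src;
}

# For each changed file, mark the owning OIDs for deletion and their
# primary source files for reimport.
sub plan_update {
    my ($doc_db, $changed_files) = @_;
    my $src = build_src_lookup($doc_db);
    my (%delete, %reimport);
    foreach my $file (@$changed_files) {
        foreach my $oid (@{ $src->{$file} || [] }) {
            $delete{$oid} = 1;
            $reimport{ $doc_db->{$oid}{'src-file'} } = 1;
        }
    }
    return ([sort keys %delete], [sort keys %reimport]);
}
```

Note that a change to an associated file leads back to the primary src-file, since that is the file that has to be run through import again.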

Some annoying/leftover things:

  • docs that fail to be imported will show up on every import, as they are not recorded anywhere and so always look like new documents. This includes docs that should be processed but fail, and docs that are normally blocked (e.g. ~ docs)
  • I haven't looked at the DSpace and CONTENTdm plugins properly. Not sure if they will work incrementally. May need to add more docs using associate_source_file
  • I am currently using full paths in src db, and in most of doc db (except meta-file ends up relative due to old code). Maybe change back to relative file paths so that collection can be moved and still be updated.
  • If we ever add -document_field option to database plugins (marc, marcxml, bibtex etc), then we can end up with the situation of one file holding metadata for new records, and metadata for documents. Need to think about whether to set the file as metadata file, and whether to use extrametafile too.