Documentation: HowTo about choosing the right harvester
This page contains important information to help new users build a working metadata portal.
Choosing the right repository type for new metadata providers and the corresponding harvester
There are four ways to set up a remote metadata repository. Each type has advantages and disadvantages.
- Static OAI Repository: A static repository consists of one single (possibly large) XML file containing all metadata. In principle there is no limit on the number of records in a static repository, as long as the harvester parses the file sequentially (not DOM-based), which is the case for panFMP. The only problem with many records is that the repository cannot be harvested incrementally (that is, fetching only the records changed since the last harvest). For a static repository, the harvester only checks whether the static repository file has changed, and if so, it downloads and parses it completely. The corresponding index is completely rebuilt.
Static repositories are therefore only suitable for small collections (<100 documents). The harvester for indexing this type of repository is de.pangaea.metadataportal.harvester.OAIStaticRepositoryHarvester.
- For bigger repositories, another possibility is harvesting plain XML files from a web server, similar to a web crawler.
The panFMP developers do this for one metadata portal: http://www.st.nmfs.gov/plankton/content/xml_src/ (repository of NMFS COPEPOD in DIF format). As you can see, the URL points to a directory on the remote web server (with directory index switched on).
The harvester collects all links from the HTML output of the directory listing and harvests all destinations. It checks every link's MIME type and last-modified date: if the target is text/xml and has changed, the harvester harvests the file; if it is text/html, the harvester recursively looks for further links. Because every single file must be checked for its datestamp and each check costs a separate HTTP connection, this harvester is very slow when there are many files. It can be used for repositories with fewer than 1500 files.
The harvester for indexing this type of repository is de.pangaea.metadataportal.harvester.WebCrawlingHarvester. Ask the data provider to put the plain XML metadata files in a directory on their web server, to enable dynamic HTML directory listings, and not to touch the timestamps of unchanged XML files, e.g. during bulk exports.
- Another possibility is harvesting the contents of a single ZIP file using de.pangaea.metadataportal.harvester.ZipFileHarvester (at the moment it is not possible to combine WebCrawlingHarvester with ZipFileHarvester). Ask the metadata provider to place a ZIP file on their web server and point the harvester property zipFile to the correct URL (which may also be a local file). For network-based ZIP harvesting, it is recommended to set the harvester property useZipFileDate to "yes": the harvester then reads the file only when it has changed, and reharvests all documents in it. If this setting is "no", the harvester always scans the whole ZIP file and updates only those records that changed inside it (according to the last-modified dates of the ZIP entries). The "no" setting is recommended for very large local ZIP files in which only some of the included files change. Because the whole file has to be downloaded to find out the last-modified dates inside it, the "no" setting is not practical for files from URLs.
- For huge repositories, there is no way around a real OAI-PMH server! This is the most flexible solution. It is possible to harvest millions of records into panFMP indexes and update them later with incremental harvesting. This is controlled by the OAI-PMH repository server and is fully supported by panFMP automatically.
For building simple repositories we recommend DLESE jOAI, which creates a repository around a collection of XML files. It is simple to install, secure enough (Java-based, running in a webapp container), and simple to manage. If the provider's metadata is stored in a database, records can be created by an XML marshalling solution that generates a batch of XML files. More sophisticated solutions use the database directly and serve the XML records on the fly. This is possible without hassle because the OAI-PMH protocol is simple to implement.
The harvester for indexing this type of repository is de.pangaea.metadataportal.harvester.OAIHarvester.
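To illustrate why sequential parsing matters for the single-file static repository described above, here is a minimal sketch in Python (not panFMP's own Java code). It streams through an XML document and counts record elements without building a full DOM. The element name "record", the sample document, and the function count_records are assumptions made for this illustration only.

```python
import io
import xml.etree.ElementTree as ET


def count_records(xml_stream):
    """Count <record> elements while reading the stream sequentially."""
    count = 0
    for _event, element in ET.iterparse(xml_stream, events=("end",)):
        # Compare the local name only, ignoring any XML namespace prefix.
        if element.tag.rsplit("}", 1)[-1] == "record":
            count += 1
            # Drop the record's children once it has been handled.
            element.clear()
    return count


# Hypothetical sample document, not the real static repository schema.
sample = (
    b"<repository>"
    b"<record><title>A</title></record>"
    b"<record><title>B</title></record>"
    b"</repository>"
)
print(count_records(io.BytesIO(sample)))
```

Because each record is handled and cleared as soon as its end tag is read, memory use stays roughly independent of the number of records, whereas a DOM parser would hold the whole file in memory.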
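The link handling described above for the web-crawling approach can be sketched as follows. This is an illustration in Python using only the standard library, not the actual WebCrawlingHarvester code. The class LinkCollector, the function decide, and the example.org URL are hypothetical names chosen for this example.

```python
from datetime import datetime
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute targets of all <a href="..."> links on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def decide(mime_type, last_modified, last_harvest):
    """Return 'harvest', 'recurse', or 'skip' for one link target."""
    if mime_type == "text/html":
        return "recurse"  # another listing page: look for further links
    if mime_type == "text/xml" and last_modified > last_harvest:
        return "harvest"  # XML file changed since the last run
    return "skip"


# Hypothetical directory listing as a web server might generate it.
listing = '<html><body><a href="a.xml">a.xml</a> <a href="sub/">sub/</a></body></html>'
collector = LinkCollector("http://example.org/xml_src/")
collector.feed(listing)
print(collector.links)
print(decide("text/xml", datetime(2024, 6, 1), datetime(2024, 1, 1)))
```

In a real crawl the MIME type and last-modified date come from one HTTP request per link, which is the per-file overhead that makes this approach slow for large directories. Generated listings usually also contain parent-directory and sorting links, which a crawler has to filter out.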
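The per-entry comparison that the ZIP harvester performs when useZipFileDate is "no" can be pictured with this small Python sketch. It uses the standard zipfile module and is not ZipFileHarvester's implementation. The function changed_entries and the file names are assumptions for illustration.

```python
import io
import zipfile
from datetime import datetime


def changed_entries(zip_bytes, last_harvest):
    """Return the names of ZIP entries modified after last_harvest."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [
            info.filename
            for info in archive.infolist()
            if datetime(*info.date_time) > last_harvest
        ]


# Build a small in-memory ZIP file with one old and one new entry.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(zipfile.ZipInfo("old.xml", date_time=(2020, 1, 1, 0, 0, 0)), "<record/>")
    archive.writestr(zipfile.ZipInfo("new.xml", date_time=(2024, 6, 1, 12, 0, 0)), "<record/>")

print(changed_entries(buffer.getvalue(), datetime(2023, 1, 1)))
```

The sketch needs the complete ZIP file in hand before it can list any entry dates, which mirrors why this mode suits local files better than files fetched from a URL.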
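Incremental harvesting from an OAI-PMH server, as mentioned above, comes down to sending a ListRecords request with a from date. The following Python sketch only builds such request URLs; it is not how OAIHarvester is implemented. The base URL http://example.org/oai and the helper list_records_url are hypothetical, while verb, metadataPrefix, from, and resumptionToken are request arguments defined by the OAI-PMH protocol.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, from_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument: it is sent
        # alone with the verb to fetch the next chunk of a large result.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            # Only records changed on or after this date are returned.
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


# Full harvest, then an incremental harvest since the last run.
print(list_records_url("http://example.org/oai", "oai_dc"))
print(list_records_url("http://example.org/oai", "oai_dc", from_date="2024-01-15"))
```

The second URL asks the server to do the change detection, so the harvester transfers only new or modified records instead of checking every file itself.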