Class ZipFileHarvester


  • public class ZipFileHarvester
    extends SingleFileEntitiesHarvester
    Harvester for unzipping ZIP files and reading their contents. Identifiers look like: "zip:<identifierPrefix><entryFilename>"

    This harvester supports the following additional harvester properties:

    • zipFile: filename or URL of ZIP file to harvest
    • identifierPrefix: this prefix is prepended to all identifiers (that is, the identifiers of the documents) (default: "")
    • filenameFilter: regex to match the entry filename (default: none)
    • useZipFileDate: if "yes", check only the modification date of the ZIP file itself and re-harvest the archive completely when it has changed; if "no", look at each file in the archive and store its modification date in the index. Use "yes" for ZIP files from network connections that seldom change, as it avoids scanning the complete ZIP file. "no" is recommended for large local files with many modifications in only some files (default: yes)
    • retryCount: how often to retry on HTTP errors (default: 5)
    • retryAfterSeconds: time between retries in seconds (default: 60)
    • timeoutAfterSeconds: HTTP timeout for harvesting, in seconds
    Author:
    Uwe Schindler
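
    The identifier scheme and the filenameFilter property described above can be illustrated with a small standalone sketch. It is not the harvester's actual implementation: the class ZipIdentifierSketch and its methods listIdentifiers and sampleZip are hypothetical names, and it assumes that the regex must match the whole entry filename and that directory entries are skipped. Only java.util.zip and java.util.regex from the JDK are used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration class; not part of the harvester's API.
public class ZipIdentifierSketch {

    // Returns "zip:<identifierPrefix><entryFilename>" for every file entry
    // whose name matches filenameFilter (null means "no filter").
    // Assumption: the regex has to match the complete entry filename.
    public static List<String> listIdentifiers(InputStream in, String identifierPrefix,
                                               Pattern filenameFilter) throws IOException {
        List<String> identifiers = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // assumption: directories are not documents
                }
                String name = entry.getName();
                if (filenameFilter != null && !filenameFilter.matcher(name).matches()) {
                    continue; // entry rejected by the filter
                }
                identifiers.add("zip:" + identifierPrefix + name);
            }
        }
        return identifiers;
    }

    // Builds a small in-memory ZIP archive so the sketch needs no external file.
    public static byte[] sampleZip(String... entryNames) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                out.putNextEntry(new ZipEntry(name));
                out.write("<doc/>".getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = sampleZip("a.xml", "notes.txt", "sub/b.xml");
        Pattern filter = Pattern.compile(".*\\.xml");
        for (String id : listIdentifiers(new ByteArrayInputStream(zip), "demo/", filter)) {
            System.out.println(id);
        }
    }
}
```

    With the sample archive and the prefix "demo/", the sketch prints zip:demo/a.xml and zip:demo/sub/b.xml; notes.txt is dropped by the filter. Passing null as the filter corresponds to the documented default of "none".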