Class SingleFileEntitiesHarvester

  • Direct Known Subclasses:
    DirectoryHarvester, ElasticsearchHarvester, PanFMP1IndexHarvester, PushWrapperHarvester, WebCrawlingHarvester, ZipFileHarvester

    public abstract class SingleFileEntitiesHarvester
    extends Harvester
    Abstract harvester class for single-file entities (such as files from a web page or from a local directory). The harvester adds XML documents, each supplied as a Source, to the index. If a fatal parse error occurs while a document is harvested, the harvester either stops harvesting (as it would with OAI-PMH), ignores the document, or deletes it (if it exists in the index), depending on the harvester property "parseErrorAction".

    This panFMP harvester supports the following harvester properties in addition to the default ones:

    • parseErrorAction: what to do if a parse error occurs. Can be STOP, IGNOREDOCUMENT, or DELETEDOCUMENT (default: IGNOREDOCUMENT).
    • deleteMissingDocuments: after harvesting, remove documents from the index that were deleted from the source (this may be a heavy operation). (default: true)
    Author:
    Uwe Schindler
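The three parseErrorAction values described above can be pictured as a small dispatch. The following is a minimal sketch, not panFMP source code: the class ParseErrorActionSketch, the enum, and both methods are hypothetical names introduced for illustration, and the case-insensitive matching of the property value is an assumption.

```java
import java.util.Locale;

// Hypothetical sketch; the names below are illustrative and not taken from panFMP.
class ParseErrorActionSketch {

    // The three values documented for the "parseErrorAction" property.
    enum ParseErrorAction { STOP, IGNOREDOCUMENT, DELETEDOCUMENT }

    // Maps the raw property string to an action. A missing or blank value falls
    // back to the documented default (ignore the document).
    // Assumption: values are matched case-insensitively.
    static ParseErrorAction fromProperty(String value) {
        if (value == null || value.trim().isEmpty()) {
            return ParseErrorAction.IGNOREDOCUMENT;
        }
        return ParseErrorAction.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    // Describes what happens to a document that failed to parse.
    // STOP aborts by throwing; the other two actions let harvesting continue.
    static String onParseError(ParseErrorAction action, String identifier, Exception cause) {
        switch (action) {
            case STOP:
                throw new IllegalStateException("Harvesting stopped at " + identifier, cause);
            case DELETEDOCUMENT:
                return "delete " + identifier;
            default: // IGNOREDOCUMENT
                return "ignore " + identifier;
        }
    }

    public static void main(String[] args) {
        ParseErrorAction action = fromProperty("deleteDocument");
        // prints "delete doc-1"
        System.out.println(onParseError(action, "doc-1", new Exception("malformed XML")));
    }
}
```

In this sketch only STOP interrupts the run; with the other two actions a single malformed document does not prevent the remaining documents from being harvested.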
    • Method Detail

      • close

        public void close(boolean cleanShutdown)
                   throws Exception
        Description copied from class: Harvester
        Closes the harvester. All resources are freed and the Harvester.processor is closed.
        Overrides:
        close in class Harvester
        Parameters:
        cleanShutdown - enables writing of status information to the Elasticsearch instance for the next harvesting. This should not be done if an error occurred during harvesting.
        Throws:
        Exception - if an exception occurs during closing (various types of exceptions can be thrown). Exceptions can be thrown asynchronously and may not relate to the document currently being processed.
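The cleanShutdown contract suggests a calling pattern in which the flag is true only when harvesting completed without an error. The sketch below illustrates that pattern under stated assumptions: ClosableHarvester, RecordingHarvester, and runAndClose are hypothetical stand-ins, and only the close(boolean) signature mirrors the method documented here.

```java
// Hypothetical stand-ins; only the close(boolean) signature mirrors the documented method.
class CloseSketch {

    interface ClosableHarvester {
        void harvest() throws Exception;
        void close(boolean cleanShutdown) throws Exception;
    }

    // Test double that optionally fails during harvest and records the flag passed to close.
    static class RecordingHarvester implements ClosableHarvester {
        private final boolean failDuringHarvest;
        Boolean closedCleanly; // null until close(...) has been called

        RecordingHarvester(boolean failDuringHarvest) {
            this.failDuringHarvest = failDuringHarvest;
        }

        @Override
        public void harvest() throws Exception {
            if (failDuringHarvest) {
                throw new Exception("simulated harvesting error");
            }
        }

        @Override
        public void close(boolean cleanShutdown) {
            closedCleanly = cleanShutdown;
        }
    }

    // Runs a harvest and always closes the harvester afterwards.
    // cleanShutdown is true only if harvest() returned normally.
    static void runAndClose(ClosableHarvester harvester) throws Exception {
        boolean clean = false;
        try {
            harvester.harvest();
            clean = true;
        } finally {
            harvester.close(clean);
        }
    }
}
```

Because close runs in a finally block, resources are released on both paths, while status information for the next harvest would be written only after a successful run.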
      • addDocument

        protected final void addDocument(String identifier,
                                         long lastModified,
                                         Source xml)
                                  throws Exception
        Adds a document to the Harvester.processor working in the background. If a parsing error occurs, the document is handled according to parseErrorAction. The identifier is also added to the valid identifiers (if unseen documents should be deleted).
        Parameters:
        identifier - the document's identifier in the index
        lastModified - the last-modification date, which is used to calculate the next harvesting start date. If the document is older than the last harvesting, it is skipped.
        xml - the transformer source of the document; null to only update the document status (lastModified) and add the identifier to the valid identifiers
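The parameter rules for addDocument can be modeled in a few lines. This is a simplified, hypothetical model rather than the panFMP implementation: AddDocumentSketch and its fields are invented for illustration, the background processor is reduced to a list, and it is an assumption that a skipped document still counts as a valid identifier. Source and StreamSource are the standard javax.xml.transform types.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;

// Hypothetical model of the documented addDocument behavior; not panFMP code.
class AddDocumentSketch {
    private final long lastHarvested;               // time of the previous harvest (epoch millis)
    final Set<String> validIdentifiers = new HashSet<>();
    final List<String> queued = new ArrayList<>();  // stands in for the background processor

    AddDocumentSketch(long lastHarvested) {
        this.lastHarvested = lastHarvested;
    }

    // Returns true if the document was queued for indexing.
    boolean addDocument(String identifier, long lastModified, Source xml) {
        // Assumption: every identifier that was seen counts as valid, even if skipped below.
        validIdentifiers.add(identifier);
        if (xml == null) {
            return false; // status-only update, nothing to parse
        }
        if (lastModified < lastHarvested) {
            return false; // older than the last harvesting: skipped
        }
        queued.add(identifier);
        return true;
    }

    public static void main(String[] args) {
        AddDocumentSketch harvester = new AddDocumentSketch(1_000L);
        Source xml = new StreamSource(new StringReader("<doc/>"));
        System.out.println(harvester.addDocument("new.xml", 2_000L, xml));     // true
        System.out.println(harvester.addDocument("old.xml", 500L, xml));       // false
        System.out.println(harvester.addDocument("status.xml", 2_000L, null)); // false
    }
}
```

All three identifiers end up in validIdentifiers in this model, so none of them would be treated as missing after the harvest.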
      • cancelMissingDocumentDelete

        protected void cancelMissingDocumentDelete()
        Disables the property "deleteMissingDocuments" for this instance. This is useful when a container (such as a ZIP file) was not modified and the documents it contains are therefore not enumerated. Call this method to prevent all of those documents from being deleted.
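The interaction between the valid identifiers and cancelMissingDocumentDelete can be sketched as a set difference that is switched off on demand. The class below is a hypothetical illustration, not the panFMP implementation: MissingDocumentsSketch, markSeen, and identifiersToDelete are invented names, and representing the disabled state as a null set is an assumption.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of missing-document deletion; not panFMP code.
class MissingDocumentsSketch {
    // Identifiers seen during this harvest.
    // Assumption: null means that deletion of missing documents is disabled.
    private Set<String> validIdentifiers = new HashSet<>();

    // Records an identifier that was enumerated during the harvest.
    void markSeen(String identifier) {
        if (validIdentifiers != null) {
            validIdentifiers.add(identifier);
        }
    }

    // Mirrors the documented method: stop tracking, so nothing is deleted afterwards.
    void cancelMissingDocumentDelete() {
        validIdentifiers = null;
    }

    // Returns the identifiers that are in the index but were not seen during the harvest.
    Set<String> identifiersToDelete(Set<String> indexedIdentifiers) {
        if (validIdentifiers == null) {
            return Collections.emptySet();
        }
        Set<String> missing = new TreeSet<>(indexedIdentifiers);
        missing.removeAll(validIdentifiers);
        return missing;
    }
}
```

If an unmodified ZIP file is skipped without enumerating its entries, none of them are marked as seen. Without the cancel call, every document from that ZIP file would appear in the set to delete.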