Class Harvester

  • Direct Known Subclasses:
    NoOpHarvester, OAIHarvesterBase, Rebuilder, SingleFileEntitiesHarvester

    public abstract class Harvester
    extends Object
    Harvester interface to panFMP. This class is the abstract superclass of all harvesters. It also supplies an entry point for the command line interface.

    All panFMP harvesters support the following harvester properties:

    • harvestMessageStep: After how many documents should a status message be printed out by the method addDocument(de.pangaea.metadataportal.processor.MetadataDocument)? (default: 100)
    • numThreads: How many threads should process documents (XPath queries and XSL templates)? (default: 1) Raise this value if the indexer waits too often for more documents and you have more than one processor. The optimal value is one lower than the number of processors. If you have very simple metadata documents (simple XML schema) and few fields, lower values may be enough. The optimal value can only be found by testing.
    • maxQueue: size of queue for threads. (default 100 metadata documents)
    • bulkSize: size of bulk requests sent to Elasticsearch. (default 100 metadata documents)
    • concurrentBulkRequests: how many bulk requests can be sent in parallel to Elasticsearch. (default 1)
    • maxBulkMemory: maximum size of CBOR/JSON source for a bulk request. Once a bulk grows larger than this, it is submitted. Note that a bulk may become significantly larger than this limit, because the check is done after the document is added. Must be given with a unit, such as MB for megabytes. (default 5 MB)
    • validate: validate harvested documents against schema given in configuration? (default: true, if schema given)
    • conversionErrorAction: What to do if a conversion error occurs (e.g. number format error)? Can be STOP, IGNOREDOCUMENT, DELETEDOCUMENT (default is to stop conversion)
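The numThreads rule of thumb and the "check after add" behaviour of maxBulkMemory can be illustrated with a small stand-alone sketch. Everything in it is hypothetical: the class and method names are not part of panFMP, and treating bulkSize as a simple document counter is an assumption made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in panFMP.
class HarvesterTuningSketch {

  // Rule of thumb from the property description: one thread fewer than
  // the number of processors, but never fewer than one.
  static int suggestedNumThreads(int processors) {
    return Math.max(1, processors - 1);
  }

  // Simulates bulk building: each document is added first and the limits
  // are checked afterwards, so a submitted bulk can exceed maxBulkBytes.
  static List<Integer> bulkSizesInBytes(int[] docSizes, int bulkSize, long maxBulkBytes) {
    List<Integer> submitted = new ArrayList<>();
    int docs = 0;
    int bytes = 0;
    for (int size : docSizes) {
      docs++;
      bytes += size;
      if (docs >= bulkSize || bytes > maxBulkBytes) {
        submitted.add(bytes); // "submit" the bulk and start a new one
        docs = 0;
        bytes = 0;
      }
    }
    if (docs > 0) {
      submitted.add(bytes); // remaining documents form the last bulk
    }
    return submitted;
  }

  public static void main(String[] args) {
    int processors = Runtime.getRuntime().availableProcessors();
    System.out.println("numThreads suggestion: " + suggestedNumThreads(processors));
    // With a 5-byte limit, the first bulk is submitted at 6 bytes: prints [6, 4]
    System.out.println(bulkSizesInBytes(new int[] {3, 3, 3, 1}, 100, 5));
  }
}
```

The suggested thread count is only a starting point; as noted above, the optimal value can only be found by testing.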
    Author:
    Uwe Schindler
    • Constructor Detail

      • Harvester

        public Harvester​(HarvesterConfig iconfig)
        Default constructor.
    • Method Detail

      • main

        public static void main​(String[] args)
        External entry point to the harvester interface. Called from the Java command line with two parameters (config file, harvester name)
      • runHarvester

        public static boolean runHarvester​(Config conf,
                                           String harvesterId)
        Harvests one (harvesterId='name') or more (harvesterId='*') sources. The harvester implementation is defined by the given configuration.
      • runHarvester

        protected static boolean runHarvester​(Config conf,
                                              String id,
                                              Class<? extends Harvester> harvesterClass)
        Harvests one (id="name") or more (id="*", "all", or null) sources. The harvester implementation is defined by the given configuration, unless harvesterClass is not null, in which case the specified harvester class is used instead. This variant is used by Rebuilder; public code should use runHarvester(Config,String).
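The selection rule described above can be sketched as follows. This is a stand-alone illustration, not the panFMP implementation: the class HarvesterIdSketch and the select method are hypothetical, and the body of isAllIndexes is only inferred from the documented wildcard values "*", "all", and null.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the actual panFMP code.
class HarvesterIdSketch {

  // Assumption: "all sources" is requested by "*", "all", or null,
  // as listed in the description of the protected runHarvester variant.
  static boolean isAllIndexes(String id) {
    return id == null || "*".equals(id) || "all".equals(id);
  }

  // Returns the configured source names that a given id would select.
  static List<String> select(List<String> configured, String id) {
    if (isAllIndexes(id)) {
      return new ArrayList<>(configured);
    }
    List<String> result = new ArrayList<>();
    if (configured.contains(id)) {
      result.add(id);
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> configured = List.of("sourceA", "sourceB");
    System.out.println(select(configured, "sourceA"));
    System.out.println(select(configured, "*"));
  }
}
```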
      • isAllIndexes

        protected static boolean isAllIndexes​(String id)
      • prepareReindex

        public void prepareReindex​(ElasticsearchConnection es,
                                   String targetIndex)
                            throws Exception
        Prepares the harvester for rebuilding the index by Rebuilder. By default this method does nothing, but it can be overridden by subclasses that need to set up additional things.
        Throws:
        Exception - if an exception occurs during opening (various types of exceptions can be thrown).
      • finishReindex

        public void finishReindex​(boolean cleanShutdown)
                           throws Exception
        Does cleanup work after rebuilding the index by Rebuilder. By default this method does nothing, but it can be overridden by subclasses that need to shut down additional things.
        Throws:
        Exception - if an exception occurs during closing (various types of exceptions can be thrown). Exceptions can be thrown asynchronously and may not relate to the document currently being processed.
      • isClosed

        public boolean isClosed()
        Checks if harvester is closed.
      • close

        public void close​(boolean cleanShutdown)
                   throws Exception
        Closes harvester. All resources are freed and the processor is closed.
        Parameters:
        cleanShutdown - enables writing of status information to the Elasticsearch instance for the next harvesting. This should be false if an error occurred during harvesting.
        Throws:
        Exception - if an exception occurs during closing (various types of exceptions can be thrown). Exceptions can be thrown asynchronously and may not relate to the document currently being processed.
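A caller would typically pass cleanShutdown=true only when harvesting finished without errors. The sketch below shows one way to arrange this with try/finally. DemoHarvester is a hypothetical stand-in that only mimics the documented method shapes of close(boolean) and isClosed(); it is not a panFMP class.

```java
// Hypothetical stand-in classes; only the method shapes follow the documentation.
class CloseSketch {

  static class DemoHarvester {
    private boolean closed = false;
    boolean statusWritten = false; // stands in for "status written to Elasticsearch"

    void harvest(boolean fail) throws Exception {
      if (fail) {
        throw new Exception("simulated harvesting error");
      }
    }

    boolean isClosed() {
      return closed;
    }

    void close(boolean cleanShutdown) throws Exception {
      if (closed) {
        return;
      }
      closed = true;
      statusWritten = cleanShutdown; // status is only recorded on a clean shutdown
    }
  }

  static DemoHarvester run(boolean fail) {
    DemoHarvester h = new DemoHarvester();
    boolean clean = false;
    try {
      h.harvest(fail);
      clean = true; // reached only if harvesting did not throw
    } catch (Exception e) {
      System.err.println("harvesting failed: " + e.getMessage());
    } finally {
      try {
        h.close(clean); // always close, but flag whether the run was clean
      } catch (Exception e) {
        System.err.println("closing failed: " + e.getMessage());
      }
    }
    return h;
  }

  public static void main(String[] args) {
    System.out.println("clean run wrote status: " + run(false).statusWritten);
    System.out.println("failed run wrote status: " + run(true).statusWritten);
  }
}
```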
      • createMetadataDocumentInstance

        public MetadataDocument createMetadataDocumentInstance()
        Creates an instance of MetadataDocument and initializes it with the harvester config. This method should be overridden if a harvester uses another class.
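Overriding a factory method like this one follows a common pattern, sketched below with hypothetical stand-in classes. Doc, CustomDoc, BaseHarvester, and CustomHarvester are illustrative only and do not reproduce the real MetadataDocument or Harvester types or their constructors.

```java
// Hypothetical stand-ins for MetadataDocument and Harvester; not the panFMP classes.
class DocumentFactorySketch {

  static class Doc {
    String kind() {
      return "default";
    }
  }

  static class CustomDoc extends Doc {
    @Override
    String kind() {
      return "custom";
    }
  }

  static class BaseHarvester {
    // Factory method: subclasses override it to supply their own document class.
    Doc createMetadataDocumentInstance() {
      return new Doc();
    }

    // Code in the base class always obtains documents through the factory method.
    String process() {
      return createMetadataDocumentInstance().kind();
    }
  }

  static class CustomHarvester extends BaseHarvester {
    @Override
    Doc createMetadataDocumentInstance() {
      return new CustomDoc();
    }
  }

  public static void main(String[] args) {
    System.out.println(new BaseHarvester().process());
    System.out.println(new CustomHarvester().process());
  }
}
```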
      • isDocumentOutdated

        protected boolean isDocumentOutdated​(Instant lastModified)
        Checks whether a document with the supplied datestamp needs harvesting.
      • setHarvestingDateReference

        protected void setHarvestingDateReference​(Instant harvestingDateReference)
        Sets the reference date of this harvesting event (in the time reference of the original server). This date is used on the next harvesting run in the variable fromDateReference. As long as this is null, the harvester will not write or update the value in Elasticsearch.
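How a stored reference date could drive the check in isDocumentOutdated(Instant) is sketched below. The comparison rule is an assumption made for illustration (a document counts as outdated when it was modified after the previous reference date, or when no reference date exists yet); the documentation above does not specify the actual logic, and the two-argument method here is hypothetical.

```java
import java.time.Instant;

// Hypothetical illustration; the real comparison in panFMP may differ.
class OutdatedCheckSketch {

  // Assumption: with no previous reference date (or no datestamp),
  // every document needs harvesting.
  static boolean isDocumentOutdated(Instant lastModified, Instant fromDateReference) {
    if (fromDateReference == null || lastModified == null) {
      return true;
    }
    return lastModified.isAfter(fromDateReference);
  }

  public static void main(String[] args) {
    Instant lastHarvest = Instant.parse("2020-01-01T00:00:00Z");
    System.out.println(isDocumentOutdated(Instant.parse("2020-06-01T00:00:00Z"), lastHarvest));
    System.out.println(isDocumentOutdated(Instant.parse("2019-06-01T00:00:00Z"), lastHarvest));
  }
}
```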
      • setValidIdentifiers

        protected void setValidIdentifiers​(Set<String> validIdentifiers)
        Sets the set of all "seen" valid identifiers. Must be called before close(boolean), because the information is passed to the processor before the index is finalized.
      • enumerateValidHarvesterPropertyNames

        protected void enumerateValidHarvesterPropertyNames​(Set<String> props)
        Subclasses use this method to enumerate all harvester properties they implement. Override this method in your own implementation and add all harvester property names to the supplied Set. The public API for client code requesting property names is getValidHarvesterPropertyNames().
        See Also:
        getValidHarvesterPropertyNames()
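The relationship between enumerateValidHarvesterPropertyNames(Set) and getValidHarvesterPropertyNames() can be sketched as follows. BaseHarvester and MyHarvester are hypothetical stand-ins, the property "myCustomProperty" is invented for the example, and the body and return type of getValidHarvesterPropertyNames() are assumptions; only the method names and the Set<String> parameter come from the documentation above.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-ins; only the method names follow the documentation.
class PropertyNamesSketch {

  static class BaseHarvester {
    // The base class contributes the properties it implements itself.
    protected void enumerateValidHarvesterPropertyNames(Set<String> props) {
      props.add("harvestMessageStep");
      props.add("numThreads");
      props.add("bulkSize");
    }

    // Assumed shape of the public accessor: collect names from the whole class chain.
    public final Set<String> getValidHarvesterPropertyNames() {
      Set<String> props = new TreeSet<>();
      enumerateValidHarvesterPropertyNames(props);
      return Collections.unmodifiableSet(props);
    }
  }

  static class MyHarvester extends BaseHarvester {
    @Override
    protected void enumerateValidHarvesterPropertyNames(Set<String> props) {
      super.enumerateValidHarvesterPropertyNames(props); // keep inherited names
      props.add("myCustomProperty"); // invented property name for this example
    }
  }

  public static void main(String[] args) {
    System.out.println(new MyHarvester().getValidHarvesterPropertyNames());
  }
}
```

Calling the superclass implementation first keeps the inherited property names valid for the subclass.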