Class OAIHarvesterBase

  • Direct Known Subclasses:
    OAIHarvester, OAIStaticRepositoryHarvester

    public abstract class OAIHarvesterBase
    extends Harvester
    Abstract base class for OAI harvesting support in panFMP. Use one of the subclasses for harvesting OAI-PMH or OAI Static Repositories.

    This harvester supports the following additional harvester properties:

    • setSpec: OAI set to harvest (default: none)
    • retryCount: how often to retry on HTTP errors (default: 5)
    • retryAfterSeconds: time between retries in seconds (default: 60)
    • timeoutAfterSeconds: HTTP Timeout for harvesting in seconds
    • authorizationHeader: Optional 'Authorization' HTTP header contents to be sent with request.
    • metadataPrefix: OAI metadata prefix to harvest
    • identifierPrefix: prepend all identifiers returned by OAI with this string
    • ignoreDatestamps: does full harvesting, while ignoring all datestamps. They are saved, but ignored, if invalid.
    • deleteMissingDocuments: after harvesting, remove documents that were deleted from the source (may be a heavy operation). The harvester only does this on full harvesting, not on incremental harvesting. (default: true)
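
As a rough illustration of how the documented defaults combine, the following sketch reads some of the properties above from a java.util.Properties object. The class OaiHarvesterSettings is hypothetical and is not part of panFMP; how panFMP actually parses its configuration is not shown on this page.

```java
import java.util.Properties;

/** Hypothetical holder for some of the properties listed above; not a panFMP class. */
final class OaiHarvesterSettings {
    final String setSpec;                 // null means no set restriction (documented default: none)
    final int retryCount;                 // documented default: 5
    final int retryAfterSeconds;          // documented default: 60
    final boolean ignoreDatestamps;       // documented default: false
    final boolean deleteMissingDocuments; // documented default: true

    OaiHarvesterSettings(Properties props) {
        setSpec = props.getProperty("setSpec");
        retryCount = Integer.parseInt(props.getProperty("retryCount", "5"));
        retryAfterSeconds = Integer.parseInt(props.getProperty("retryAfterSeconds", "60"));
        ignoreDatestamps = Boolean.parseBoolean(props.getProperty("ignoreDatestamps", "false"));
        deleteMissingDocuments = Boolean.parseBoolean(props.getProperty("deleteMissingDocuments", "true"));
    }
}
```

Properties without a documented default (for example timeoutAfterSeconds) are left out on purpose.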
    Author:
    Uwe Schindler
    • Field Detail

      • USER_AGENT

        public static final String USER_AGENT
      • metadataPrefix

        protected final String metadataPrefix
        the used metadata prefix from the configuration
      • identifierPrefix

        protected final String identifierPrefix
        prepend all identifiers returned by OAI with this string
      • sets

        protected final Set<String> sets
        the sets to harvest from the configuration, null to harvest all
      • retryCount

        protected final int retryCount
        the retryCount from configuration
      • retryTime

        protected final int retryTime
        the retryTime from configuration
      • timeout

        protected final Duration timeout
        the timeout from configuration
      • authorizationHeader

        protected final String authorizationHeader
        the authorizationHeader from configuration
      • ignoreDatestamps

        protected final boolean ignoreDatestamps
        If enabled, does full harvesting, while ignoring all datestamps (default is false). They are saved, but ignored, if invalid.
      • deleteMissingDocuments

        protected final boolean deleteMissingDocuments
        If enabled, any kind of full harvesting tracks all valid identifiers it encounters and afterwards deletes all documents from the index whose identifiers were not seen.
      • httpClient

        protected final HttpClient httpClient
        HttpClient to use, configured with correct connect timeout.
      • filterIncomingSets

        protected boolean filterIncomingSets
        Whether the harvester should filter incoming documents according to their set metadata. Should be disabled for the OAI-PMH protocol when only one set is harvested. Default is true.
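
A minimal sketch of what filtering by set metadata could look like, assuming each incoming record carries the set names it belongs to and that a record is kept when it shares at least one set with the configured sets. The class SetFilterSketch and its exact-match rule are assumptions for illustration, not the panFMP implementation.

```java
import java.util.Collections;
import java.util.Set;

/** Hypothetical illustration of set filtering; not a panFMP class. */
final class SetFilterSketch {
    /**
     * Returns true if a record should be kept.
     * configuredSets == null mirrors the documented "null to harvest all".
     */
    static boolean accept(Set<String> configuredSets, Set<String> recordSets) {
        if (configuredSets == null) {
            return true; // no set restriction configured
        }
        // keep the record if it belongs to at least one configured set
        return !Collections.disjoint(configuredSets, recordSets);
    }
}
```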
    • Constructor Detail

    • Method Detail

      • createMetadataDocumentInstance

        public MetadataDocument createMetadataDocumentInstance()
        Description copied from class: Harvester
        Creates an instance of MetadataDocument and initializes it with the harvester config. This method should be overridden if a harvester uses another class.
        Overrides:
        createMetadataDocumentInstance in class Harvester
      • getMetadataDocumentFactory

        protected org.apache.commons.digester.ObjectCreationFactory getMetadataDocumentFactory()
        Returns a factory for creating the MetadataDocuments in Digester code (using FactoryCreateRule).
        See Also:
        createMetadataDocumentInstance()
      • recreateDigester

        protected abstract void recreateDigester()
        Recreates all digesters that are used for parsing the OAI XML. This method is called once initially and again after network errors, before the same document is parsed a second time. This allows recovery when parsing fails somewhere in the middle of a document.
      • doParse

        protected boolean doParse(Supplier<ExtendedDigester> digSupplier,
                                  String url,
                                  AtomicReference<Instant> checkModifiedDate)
                           throws Exception
        Harvests a URL using the supplied digester.
        Parameters:
        digSupplier - a Supplier that gives access to a (possibly recreated) digester instance.
        url - the URL that is parsed by this digester instance.
        checkModifiedDate - for static repositories, a reference to an Instant can be supplied to check the last modification. In this case false is returned if the URL was not modified. If it was modified, the reference is updated to contain a new Instant with the new modification date. Supply null to skip the last-modification check; no modification date is returned in that case (as there is no reference).
        Returns:
        true if harvested, false if not modified and no harvesting was done.
        Throws:
        Exception
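
The interplay between a digester Supplier, retryCount and retryTime can be pictured as a retry loop that asks the supplier for a parser on every attempt, so that a freshly recreated instance can be handed out after a failure. The sketch below is only an assumption about the general pattern; RetrySketch, Attempt and withRetries are hypothetical names and do not reflect the actual doParse implementation.

```java
import java.util.function.Supplier;

/** Hypothetical retry helper illustrating the supplier pattern; not a panFMP class. */
final class RetrySketch {
    /** A parsing attempt that may fail with any exception. */
    interface Attempt<P, R> {
        R run(P parser) throws Exception;
    }

    /**
     * Runs the attempt, fetching the parser from the supplier each time.
     * After a failure it waits retryMillis and tries again, up to retryCount retries.
     */
    static <P, R> R withRetries(Supplier<P> parserSupplier, Attempt<P, R> attempt,
                                int retryCount, long retryMillis) throws Exception {
        if (retryCount < 0) {
            throw new IllegalArgumentException("retryCount must be >= 0");
        }
        Exception last = null;
        for (int i = 0; i <= retryCount; i++) {
            try {
                // the supplier may return a recreated parser after a failed attempt
                return attempt.run(parserSupplier.get());
            } catch (Exception e) {
                last = e;
                if (i < retryCount) {
                    Thread.sleep(retryMillis);
                }
            }
        }
        throw last; // all attempts failed
    }
}
```

In this sketch the last exception is rethrown once all retries are used up; whether panFMP reports the failure the same way is not stated here.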
      • getInputSource

        protected InputSource getInputSource(URI url,
                                             AtomicReference<Instant> checkModifiedDate)
                                      throws IOException
        Returns a SAX InputSource for retrieving the stream data of a URL. It is optimized for HTTP(S) compression and timeout checking.
        Parameters:
        url - the URL to open
        checkModifiedDate - for static repositories, a reference to an Instant can be supplied to check the last modification. In this case null is returned if the URL was not modified. If it was modified, the reference is updated to contain a new Instant with the new modification date. Supply null to skip the last-modification check; no modification date is returned in that case (as there is no reference).
        Throws:
        IOException
        See Also:
        getEntityResolver(org.xml.sax.EntityResolver)
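
One common way to implement such a last-modification check over HTTP is a conditional GET: send If-Modified-Since, treat status 304 as "not modified", and otherwise read Last-Modified into the reference. The sketch below uses the standard java.net.http API and assumes this mechanism; the class ConditionalGetSketch and its methods are hypothetical, and the page does not say that getInputSource works exactly this way.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical conditional-GET helper; not a panFMP class. */
final class ConditionalGetSketch {
    // HTTP-date in the fixed-length form, always expressed in GMT
    static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
            .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
            .withZone(ZoneOffset.UTC);

    /** Builds a GET request with a timeout and, if a date is referenced, If-Modified-Since. */
    static HttpRequest buildRequest(URI url, Duration timeout,
                                    AtomicReference<Instant> checkModifiedDate) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(url)
                .timeout(timeout)
                // asks for a compressed response; the caller must then decompress the body itself
                .header("Accept-Encoding", "gzip");
        if (checkModifiedDate != null && checkModifiedDate.get() != null) {
            builder.header("If-Modified-Since", HTTP_DATE.format(checkModifiedDate.get()));
        }
        return builder.GET().build();
    }

    /**
     * Returns false for 304 (not modified). Otherwise stores the parsed
     * Last-Modified value in the reference, if both are present, and returns true.
     */
    static boolean handleStatus(int status, String lastModified,
                                AtomicReference<Instant> checkModifiedDate) {
        if (status == 304) {
            return false;
        }
        if (checkModifiedDate != null && lastModified != null) {
            checkModifiedDate.set(HTTP_DATE.parse(lastModified, Instant::from));
        }
        return true;
    }
}
```

The request would then be sent with an HttpClient whose connect timeout is configured separately, as described for the httpClient field.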
      • reset

        protected void reset()
        Resets the internal variables.
      • enableMissingDocumentDelete

        protected void enableMissingDocumentDelete()
        Enables deletion of unseen documents. The harvester should call this before calling addDocument(MetadataDocument), so that tracking is active from the start.
      • cancelMissingDocumentDelete

        protected void cancelMissingDocumentDelete()
        Disables the property "deleteMissingDocuments" for this instance. This can be used when the container (like a ZIP file) was not modified and the documents it contains are therefore not enumerated. Call this to prevent the deletion of all these documents.
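
The enable/cancel pair can be pictured as a set of seen identifiers that is compared against the index after a full harvest. The following sketch assumes exactly that; MissingDocumentTracker and its methods are hypothetical and only illustrate the idea, not how panFMP stores or deletes documents.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical tracker for identifiers seen during a full harvest; not a panFMP class. */
final class MissingDocumentTracker {
    private Set<String> seen = null; // null means tracking is disabled

    /** Starts tracking; call before the first document is added. */
    void enable() {
        if (seen == null) {
            seen = new HashSet<>();
        }
    }

    /** Stops tracking so that nothing is deleted, e.g. for an unmodified container. */
    void cancel() {
        seen = null;
    }

    /** Records an identifier if tracking is active. */
    void addDocument(String identifier) {
        if (seen != null) {
            seen.add(identifier);
        }
    }

    /** Returns the indexed identifiers that were not seen; empty if tracking is disabled. */
    Set<String> toDelete(Set<String> indexedIdentifiers) {
        Set<String> stale = new HashSet<>();
        if (seen != null) {
            stale.addAll(indexedIdentifiers);
            stale.removeAll(seen);
        }
        return stale;
    }
}
```

Cancelling matters because an unmodified container yields no documents at all; without it, every indexed identifier would count as unseen.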
      • close

        public void close(boolean cleanShutdown)
                   throws Exception
        Description copied from class: Harvester
        Closes harvester. All resources are freed and the Harvester.processor is closed.
        Overrides:
        close in class Harvester
        Parameters:
        cleanShutdown - enables writing of status information to the Elasticsearch instance for the next harvesting. If an error occurred during harvesting this should not be done.
        Throws:
        Exception - if an exception occurs during closing (various types of exceptions can be thrown). Exceptions can be thrown asynchronously and may not relate to the document that caused them.