Class WebCrawlingHarvester


  • public class WebCrawlingHarvester
    extends SingleFileEntitiesHarvester
    Harvester for traversing websites and harvesting XML documents. If the baseURL (from the config) points to an XML file with the correct MIME type, it is harvested directly. An HTML webpage is analyzed and all links are followed and checked for XML files with the correct MIME type. This is done recursively, but crawling never leaves the server or the directory of the baseURL.
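
    The following is a minimal, simplified sketch of how such a crawl could work; it is not the implementation of this class. The class name CrawlSketch, the methods crawl/harvestXml and the naive href regex are illustrative assumptions only, and a real crawler would also honor retryCount, pauseBetweenRequests and the other properties listed below.

        import java.io.IOException;
        import java.net.URI;
        import java.net.http.HttpClient;
        import java.net.http.HttpRequest;
        import java.net.http.HttpResponse;
        import java.util.HashSet;
        import java.util.Set;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class CrawlSketch {
          // Hypothetical: harvest only these MIME types (cf. the "contentTypes" property).
          private static final Set<String> XML_TYPES = Set.of("text/xml", "application/xml");
          // Naive link extraction; a real crawler would use a proper HTML parser.
          private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

          private final HttpClient client = HttpClient.newHttpClient();
          private final Set<URI> seen = new HashSet<>();
          private final URI base;

          public CrawlSketch(URI base) { this.base = base; }

          public void crawl(URI url) throws IOException, InterruptedException {
            // Never escape the server and baseURL directory, and visit each URL only once.
            if (!url.toString().startsWith(base.toString()) || !seen.add(url)) return;
            HttpResponse<String> resp = client.send(
                HttpRequest.newBuilder(url).GET().build(),
                HttpResponse.BodyHandlers.ofString());
            String type = resp.headers().firstValue("Content-Type").orElse("");
            if (XML_TYPES.stream().anyMatch(type::startsWith)) {
              harvestXml(url, resp.body());             // hand the XML document over for indexing
            } else if (type.startsWith("text/html")) {
              Matcher m = HREF.matcher(resp.body());
              while (m.find()) {
                crawl(url.resolve(m.group(1)));         // follow links recursively
              }
            }
          }

          private void harvestXml(URI url, String xml) {
            System.out.println("Harvested: " + url);    // placeholder for the real harvesting step
          }
        }
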

    This harvester supports the following additional harvester properties:

    • baseUrl: URL to start crawling (should point to an HTML page).
    • retryCount: how often to retry on HTTP errors (default: 5)
    • retryAfterSeconds: time between retries in seconds (default: 60)
    • timeoutAfterSeconds: HTTP timeout for harvesting, in seconds
    • authorizationHeader: Optional 'Authorization' HTTP header contents to be sent with request.
    • filenameFilter: regular expression to match the filename. The regex is matched against the whole filename (as if anchored with ^pattern$). (default: none)
    • contentTypes: MIME types of documents to index (possibly further limited by filenameFilter). (default: "text/xml,application/xml")
    • excludeUrlPattern: A regex that is applied to all URLs encountered during the harvesting process. Matching URLs (partial matches are allowed; use ^ and $ for start/end anchors) are excluded and not traversed further. (default: none)
    • pauseBetweenRequests: to avoid overloading the harvested server, wait the given number of milliseconds after each HTTP request (default: none)
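
    To illustrate the regex semantics described above (filenameFilter is matched against the whole filename, excludeUrlPattern allows partial matches), here is a small standalone Java snippet using java.util.regex; the sample patterns, filenames and URL are assumptions for demonstration only.

        import java.util.regex.Pattern;

        public class FilterSemanticsDemo {
          public static void main(String[] args) {
            // filenameFilter is applied against the whole filename (implicitly ^pattern$):
            Pattern filenameFilter = Pattern.compile(".*\\.xml");                        // hypothetical pattern
            System.out.println(filenameFilter.matcher("record-0001.xml").matches());     // true  -> harvested
            System.out.println(filenameFilter.matcher("record-0001.xml.bak").matches()); // false -> skipped

            // excludeUrlPattern allows partial matches; anchor with ^ and $ if needed:
            Pattern excludeUrlPattern = Pattern.compile("/private/");                    // hypothetical pattern
            String url = "http://www.example.org/data/private/file.xml";
            System.out.println(excludeUrlPattern.matcher(url).find());                   // true -> URL excluded
          }
        }
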
    Author:
    Uwe Schindler