org.archive.crawler.prefetch
Class RuntimeLimitEnforcer

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.prefetch.RuntimeLimitEnforcer
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, FetchStatusCodes

public class RuntimeLimitEnforcer
extends Processor
implements FetchStatusCodes

A processor to enforce runtime limits on crawls.

This processor extends and improves on the 'max-time' capability of Heritrix. The 'Terminate job' option behaves the same way as 'max-time'. In addition, the processor can pause the crawl or block all URIs once the runtime is exceeded. The available end operations are:

  1. Pause job - Pauses the crawl. Increasing the runtime duration makes it possible to resume the crawl. Attempts to resume the crawl without modifying the runtime will cause it to be paused again immediately.
  2. Terminate job - Terminates the job. Equivalent to using the max-time setting on the CrawlController.
  3. Block URIs - Blocks each URI with a -5002 (blocked by custom processor) fetch status code. This will cause all the queued URIs to wind up in the crawl.log.

The processor allows a variable runtime based on host (or other override/refinement criteria). However, such overrides only make sense with 'Block URIs', because pause and terminate have global impact once they are triggered anywhere.
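
The three end operations described above can be modelled with a small, dependency-free sketch. This is not the Heritrix implementation: the class RuntimeLimitSketch, the EndOperation enum, the FakeController stand-in and the enforce method are all hypothetical names introduced for illustration, and the strict "elapsed greater than limit" comparison is an assumption. Only the -5002 status value and the behaviour of each option are taken from the description above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the three end operations; not Heritrix code. */
class RuntimeLimitSketch {
    enum EndOperation { PAUSE, TERMINATE, BLOCK_URIS }

    /** Status value quoted in the class description for "blocked by custom processor". */
    static final int BLOCKED_STATUS = -5002;

    /** Stand-in for the crawl controller; records which global action was requested. */
    static class FakeController {
        final List<String> actions = new ArrayList<>();
        void requestPause() { actions.add("pause"); }
        void requestTerminate() { actions.add("terminate"); }
    }

    /**
     * Returns the fetch status to assign to the current URI, or 0 if the URI
     * is left untouched. Pause and terminate act on the whole crawl; only
     * BLOCK_URIS acts on the individual URI.
     */
    static int enforce(long startedMs, long nowMs, long limitMs,
                       EndOperation op, FakeController controller) {
        if (nowMs - startedMs <= limitMs) {
            return 0; // assumed comparison: limit not yet exceeded
        }
        switch (op) {
            case PAUSE:
                controller.requestPause();
                return 0;
            case TERMINATE:
                controller.requestTerminate();
                return 0;
            case BLOCK_URIS:
                return BLOCKED_STATUS;
            default:
                throw new IllegalArgumentException("Unknown operation: " + op);
        }
    }

    public static void main(String[] args) {
        FakeController controller = new FakeController();
        // Limit exceeded with 'Pause job': the crawl is paused.
        enforce(0L, 1500L, 1000L, EndOperation.PAUSE, controller);
        // Resuming without raising the limit pauses again on the next URI.
        enforce(0L, 1600L, 1000L, EndOperation.PAUSE, controller);
        // After raising the limit, processing continues without a pause.
        enforce(0L, 1700L, 5000L, EndOperation.PAUSE, controller);
        System.out.println(controller.actions);
        // 'Block URIs' leaves the crawl running and marks the URI instead.
        System.out.println(enforce(0L, 1500L, 1000L, EndOperation.BLOCK_URIS, controller));
    }
}
```

Because BLOCK_URIS is the only branch that touches the individual URI rather than the controller, it is the only option for which a per-host limit can have a per-host effect.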

      Author:
      Kristinn Sigurðsson
      See Also:
      Serialized Form

      Nested Class Summary
       
      Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
      ComplexType.MBeanAttributeInfoIterator
       
      Field Summary
      static java.lang.String ATTR_END_OPERATION
                 
      static java.lang.String ATTR_RUNTIME_SECONDS
                 
      protected static java.lang.String[] AVAILABLE_END_OPERATIONS
                 
      protected static java.lang.String DEFAULT_END_OPERATION
                 
      protected static long DEFAULT_RUNTIME_SECONDS
                 
      protected  java.util.logging.Logger logger
                 
      protected static java.lang.String OP_BLOCK_URIS
                 
      protected static java.lang.String OP_PAUSE
                 
      protected static java.lang.String OP_TERMINATE
                 
       
      Fields inherited from class org.archive.crawler.framework.Processor
      ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules
       
      Fields inherited from class org.archive.crawler.settings.ComplexType
      definition, definitionMap
       
      Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
      S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE
       
      Constructor Summary
      RuntimeLimitEnforcer(java.lang.String name)
                 
       
      Method Summary
      protected  long getRuntime(CrawlURI curi)
                Returns the amount of time to allow the crawl to run before this processor interrupts.
      protected  void innerProcess(CrawlURI curi)
                Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.
       
      Methods inherited from class org.archive.crawler.framework.Processor
      checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
       
      Methods inherited from class org.archive.crawler.settings.ModuleType
      addElement, listUsedFiles
       
      Methods inherited from class org.archive.crawler.settings.ComplexType
      addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
       
      Methods inherited from class org.archive.crawler.settings.Type
      addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
       
      Methods inherited from class javax.management.Attribute
      getName, hashCode
       
      Methods inherited from class java.lang.Object
      clone, finalize, getClass, notify, notifyAll, wait, wait, wait
       

      Field Detail

      logger

      protected java.util.logging.Logger logger

      ATTR_RUNTIME_SECONDS

      public static final java.lang.String ATTR_RUNTIME_SECONDS

      DEFAULT_RUNTIME_SECONDS

      protected static final long DEFAULT_RUNTIME_SECONDS
      See Also:
      Constant Field Values

      ATTR_END_OPERATION

      public static final java.lang.String ATTR_END_OPERATION

      OP_PAUSE

      protected static final java.lang.String OP_PAUSE

      OP_TERMINATE

      protected static final java.lang.String OP_TERMINATE

      OP_BLOCK_URIS

      protected static final java.lang.String OP_BLOCK_URIS

      DEFAULT_END_OPERATION

      protected static final java.lang.String DEFAULT_END_OPERATION

      AVAILABLE_END_OPERATIONS

      protected static final java.lang.String[] AVAILABLE_END_OPERATIONS
      Constructor Detail

      RuntimeLimitEnforcer

      public RuntimeLimitEnforcer(java.lang.String name)
      Method Detail

      innerProcess

      protected void innerProcess(CrawlURI curi)
                           throws java.lang.InterruptedException
      Description copied from class: Processor
      Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.

      Overrides:
      innerProcess in class Processor
      Parameters:
      curi - The CrawlURI being processed.
      Throws:
      java.lang.InterruptedException
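
The override contract described for innerProcess follows the template-method pattern: a fixed entry point on the base class delegates to a hook that subclasses supply. The sketch below illustrates that pattern only. BaseProcessor, RecordingProcessor and the run helper are hypothetical stand-ins, a plain String replaces CrawlURI, and the enabled check is an assumption suggested by the inherited isEnabled method rather than a copy of the Processor source.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the override contract; not Heritrix code. */
class TemplateMethodSketch {

    /** Stand-in for a processor base class. */
    abstract static class BaseProcessor {
        private boolean enabled = true;

        void setEnabled(boolean enabled) { this.enabled = enabled; }

        /** Fixed entry point: skips disabled processors, otherwise calls the hook. */
        final void process(String uri) throws InterruptedException {
            if (!enabled) {
                return;
            }
            innerProcess(uri);
        }

        /** Hook that subclasses override with their custom action. */
        protected abstract void innerProcess(String uri) throws InterruptedException;
    }

    /** Example subclass whose custom action is to record each URI it sees. */
    static class RecordingProcessor extends BaseProcessor {
        final List<String> seen = new ArrayList<>();

        @Override
        protected void innerProcess(String uri) {
            seen.add(uri);
        }
    }

    /** Convenience helper: runs the URIs through a processor and returns what it saw. */
    static List<String> run(boolean enabled, String... uris) {
        RecordingProcessor p = new RecordingProcessor();
        p.setEnabled(enabled);
        try {
            for (String uri : uris) {
                p.process(uri);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return p.seen;
    }

    public static void main(String[] args) {
        System.out.println(run(true, "http://example.org/a", "http://example.org/b"));
        System.out.println(run(false, "http://example.org/a"));
    }
}
```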

      getRuntime

      protected long getRuntime(CrawlURI curi)
      Returns the amount of time to allow the crawl to run before this processor interrupts.

      Returns:
      the amount of time in milliseconds.
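
Since the configured attribute is named in seconds (ATTR_RUNTIME_SECONDS) while getRuntime is documented as returning milliseconds, a unit conversion presumably happens somewhere in between. The following sketch shows one way such a lookup could work, including a per-host override. It is an illustration under stated assumptions: RuntimeLookupSketch, overrideForHost and runtimeMillisFor are hypothetical names, a plain map stands in for the Heritrix settings framework, and the 3600-second default is an example value, not the value of DEFAULT_RUNTIME_SECONDS.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

/** Hypothetical per-host runtime lookup; not Heritrix code. */
class RuntimeLookupSketch {
    private final long defaultSeconds;
    private final Map<String, Long> perHostSeconds = new HashMap<>();

    RuntimeLookupSketch(long defaultSeconds) {
        this.defaultSeconds = defaultSeconds;
    }

    /** Registers a host-specific limit, standing in for a settings override. */
    void overrideForHost(String host, long seconds) {
        perHostSeconds.put(host, seconds);
    }

    /** Resolves the limit for a host and converts it from seconds to milliseconds. */
    long runtimeMillisFor(String host) {
        Long override = perHostSeconds.get(host);
        long seconds = (override != null) ? override : defaultSeconds;
        return TimeUnit.SECONDS.toMillis(seconds);
    }

    public static void main(String[] args) {
        RuntimeLookupSketch lookup = new RuntimeLookupSketch(3600L); // example default
        lookup.overrideForHost("example.org", 60L);
        System.out.println(lookup.runtimeMillisFor("example.org"));   // 60000
        System.out.println(lookup.runtimeMillisFor("other.example")); // 3600000
    }
}
```

Keeping the configured value in seconds and converting once at lookup time lets the result be compared directly against elapsed times measured in milliseconds.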


      Copyright © 2003-2011 Internet Archive. All Rights Reserved.