org.archive.crawler.extractor
Class ExtractorImpliedURI

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.extractor.Extractor
                          extended by org.archive.crawler.extractor.ExtractorImpliedURI
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants

public class ExtractorImpliedURI
extends Extractor
implements CoreAttributeConstants

An extractor that finds 'implied' URIs inside other URIs. When a URI matches the 'trigger' regex, a new URI is constructed from the 'build' replacement pattern. Unlike most other extractors, this one works on URIs discovered by previous extractors, so it should appear near the end of any set of extractors. In its initial form, it only finds absolute HTTP(S) URIs in the query string or its parameters. TODO: extend to find URIs in path-info.
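
The behaviour described above can be illustrated with a small standalone sketch. It is not the Heritrix implementation: the class name ImpliedLinkPassSketch, the method addImplied, the example trigger regex, and the use of plain strings in place of CrawlURI outlinks are all assumptions made for illustration. It also assumes that the trigger must match the whole URI.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Heritrix.
final class ImpliedLinkPassSketch {

    /**
     * Walks links found by earlier extractors and adds an implied URI for
     * every link that matches the trigger. When removeTriggerUris is true,
     * the matching (triggering) link itself is dropped from the result.
     */
    static List<String> addImplied(List<String> discovered, String trigger,
                                   String build, boolean removeTriggerUris) {
        Pattern pattern = Pattern.compile(trigger);
        List<String> result = new ArrayList<String>();
        for (String link : discovered) {
            Matcher m = pattern.matcher(link);
            if (m.matches()) { // assumption: whole-URI match
                if (!removeTriggerUris) {
                    result.add(link);
                }
                // $1 in the build pattern refers to group 1 of the trigger
                result.add(m.replaceFirst(build));
            } else {
                result.add(link);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Example trigger: an absolute HTTP(S) URI carried in a "url" query parameter
        String trigger = "^https?://[^?]*\\?url=(https?://[^&]+).*$";
        List<String> links = Arrays.asList(
                "http://example.com/a",
                "http://example.com/r?url=http://example.org/b");
        System.out.println(addImplied(links, trigger, "$1", false));
        System.out.println(addImplied(links, trigger, "$1", true));
    }
}
```

With the remove flag set to false the triggering link is kept alongside the implied one. With it set to true only the implied URI remains in place of the trigger.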

Author:
Gordon Mohr
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
static java.lang.String ATTR_BUILD_PATTERN
          replacement pattern used to build 'implied' URI
static java.lang.String ATTR_REMOVE_TRIGGER_URIS
          whether to remove URIs that trigger addition of 'implied' URI; default false
static java.lang.String ATTR_TRIGGER_REGEXP
          regex which when matched triggers addition of 'implied' URI
 
Fields inherited from class org.archive.crawler.framework.Processor
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
 
Constructor Summary
ExtractorImpliedURI(java.lang.String name)
          Creates an extractor with the given name.
 
Method Summary
 void extract(CrawlURI curi)
          Perform usual extraction on a CrawlURI
protected static java.lang.String extractImplied(java.lang.CharSequence uri, java.lang.String trigger, java.lang.String build)
          Utility method for extracting an 'implied' URI given a source URI, trigger pattern, and build pattern.
 java.lang.String report()
          Compiles and returns a report (in human readable form) about the status of the processor.
 
Methods inherited from class org.archive.crawler.extractor.Extractor
innerProcess, isHttpTransactionContentToProcess, isIndependentExtractors
 
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, kickUpdate, process, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

ATTR_TRIGGER_REGEXP

public static final java.lang.String ATTR_TRIGGER_REGEXP
regex which when matched triggers addition of 'implied' URI

See Also:
Constant Field Values

ATTR_BUILD_PATTERN

public static final java.lang.String ATTR_BUILD_PATTERN
replacement pattern used to build 'implied' URI

See Also:
Constant Field Values

ATTR_REMOVE_TRIGGER_URIS

public static final java.lang.String ATTR_REMOVE_TRIGGER_URIS
whether to remove URIs that trigger addition of 'implied' URI; default false

See Also:
Constant Field Values

Constructor Detail

ExtractorImpliedURI

public ExtractorImpliedURI(java.lang.String name)
Creates an extractor with the given name.

Parameters:
name - the name of this extractor

Method Detail

extract

public void extract(CrawlURI curi)
Perform usual extraction on a CrawlURI

Specified by:
extract in class Extractor
Parameters:
curi - Crawl URI to process.

extractImplied

protected static java.lang.String extractImplied(java.lang.CharSequence uri,
                                                 java.lang.String trigger,
                                                 java.lang.String build)
Utility method for extracting an 'implied' URI given a source URI, trigger pattern, and build pattern.

Parameters:
uri - source to check for implied URI
trigger - regex pattern which if matched implies another URI
build - replacement pattern to build the implied URI
Returns:
implied URI, or null if none
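
The contract above (trigger regex, build replacement, null when nothing is implied) could be sketched roughly as follows with java.util.regex. This is a hedged approximation, not the method's actual source: the class name ImpliedUriSketch is hypothetical, and both the whole-string match and the short-circuit on an empty trigger are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the documented utility method; not Heritrix code.
final class ImpliedUriSketch {

    /**
     * Returns the URI built from the build pattern when the trigger matches
     * the whole source URI, or null when it does not.
     */
    static String extractImplied(CharSequence uri, String trigger, String build) {
        if (trigger.isEmpty()) {
            return null; // assumption: an empty trigger implies nothing
        }
        Matcher m = Pattern.compile(trigger).matcher(uri);
        if (m.matches()) {
            // $1, $2, ... in build refer to capture groups of the trigger
            return m.replaceFirst(build);
        }
        return null;
    }

    public static void main(String[] args) {
        String trigger = "^https?://[^?]*\\?url=(https?://[^&]+).*$";
        System.out.println(extractImplied(
                "http://example.com/redirect?url=http://example.org/page&x=1",
                trigger, "$1"));
        System.out.println(extractImplied("http://example.com/plain", trigger, "$1"));
    }
}
```

The first call prints http://example.org/page, the second prints null because the source URI has no query string for the example trigger to match.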

report

public java.lang.String report()
Description copied from class: Processor
Compiles and returns a report (in human readable form) about the status of the processor. The processor's name (of implementing class) should always be included.

Examples of stats declared would include:
* Number of CrawlURIs handled.
* Number of links extracted (for link extractors)
etc.

Overrides:
report in class Processor
Returns:
A human readable report on the processor's state.


Copyright © 2003-2011 Internet Archive. All Rights Reserved.