org.archive.crawler.extractor
Class ExtractorImpliedURI
java.lang.Object
javax.management.Attribute
org.archive.crawler.settings.Type
org.archive.crawler.settings.ComplexType
org.archive.crawler.settings.ModuleType
org.archive.crawler.framework.Processor
org.archive.crawler.extractor.Extractor
org.archive.crawler.extractor.ExtractorImpliedURI
- All Implemented Interfaces:
- java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants
public class ExtractorImpliedURI
- extends Extractor
- implements CoreAttributeConstants
An extractor for finding 'implied' URIs inside other URIs. If the
'trigger' regex is matched, a new URI will be constructed from the
'build' replacement pattern.
Unlike most other extractors, this works on URIs discovered by
previous extractors. Thus it should appear near the end of any
set of extractors.
The initial implementation only finds absolute HTTP(S) URIs in the
query string or its parameters.
TODO: extend to find URIs in path-info
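The pass described above can be sketched roughly as follows. This is a minimal illustration and not the Heritrix source: the class name ImpliedUriPassSketch, the method addImplied, and the sample TRIGGER and BUILD values are all hypothetical, and the sketch assumes the trigger must match the whole URI. It works on a plain list of URI strings already found by earlier extractors and honours a flag in the spirit of ATTR_REMOVE_TRIGGER_URIS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the extractor's post-processing pass.
class ImpliedUriPassSketch {
    // Assumed sample trigger: group 1 captures an absolute HTTP(S) URI
    // carried in a "url" query parameter.
    static final String TRIGGER = "^.*[?&]url=(https?://[^&]+).*$";
    // Assumed sample build pattern: the implied URI is just group 1.
    static final String BUILD = "$1";

    // Returns the discovered URIs plus any implied URIs; trigger URIs are
    // dropped when removeTriggerUris is true.
    static List<String> addImplied(List<String> discovered, boolean removeTriggerUris) {
        Pattern trigger = Pattern.compile(TRIGGER);
        List<String> result = new ArrayList<>();
        for (String uri : discovered) {
            Matcher m = trigger.matcher(uri);
            if (m.matches()) {
                if (!removeTriggerUris) {
                    result.add(uri); // keep the URI that triggered the match
                }
                result.add(m.replaceFirst(BUILD)); // add the implied URI
            } else {
                result.add(uri); // untouched: trigger did not match
            }
        }
        return result;
    }
}
```

Because the pass consumes links found earlier in the chain, placing it before the other extractors would leave it nothing to inspect, which is why the class belongs near the end of the extractor set.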
- Author:
- Gordon Mohr
- See Also:
- Serialized Form
Field Summary
static java.lang.String ATTR_BUILD_PATTERN
- replacement pattern used to build 'implied' URI
static java.lang.String ATTR_REMOVE_TRIGGER_URIS
- whether to remove URIs that trigger addition of 'implied' URI; default false
static java.lang.String ATTR_TRIGGER_REGEXP
- regex which when matched triggers addition of 'implied' URI
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
Method Summary
void extract(CrawlURI curi)
- Perform usual extraction on a CrawlURI
protected static java.lang.String extractImplied(java.lang.CharSequence uri, java.lang.String trigger, java.lang.String build)
- Utility method for extracting 'implied' URI given a source uri, trigger pattern, and build pattern.
java.lang.String report()
- Compiles and returns a report (in human readable form) about the status of the processor.
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, kickUpdate, process, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
Methods inherited from class javax.management.Attribute
getName, hashCode
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
ATTR_TRIGGER_REGEXP
public static final java.lang.String ATTR_TRIGGER_REGEXP
- regex which when matched triggers addition of 'implied' URI
- See Also:
- Constant Field Values
ATTR_BUILD_PATTERN
public static final java.lang.String ATTR_BUILD_PATTERN
- replacement pattern used to build 'implied' URI
- See Also:
- Constant Field Values
ATTR_REMOVE_TRIGGER_URIS
public static final java.lang.String ATTR_REMOVE_TRIGGER_URIS
- whether to remove URIs that trigger addition of 'implied' URI;
default false
- See Also:
- Constant Field Values
ExtractorImpliedURI
public ExtractorImpliedURI(java.lang.String name)
- Constructor
- Parameters:
name
- name of this module
extract
public void extract(CrawlURI curi)
- Perform usual extraction on a CrawlURI
- Specified by:
extract
in class Extractor
- Parameters:
curi
- Crawl URI to process.
extractImplied
protected static java.lang.String extractImplied(java.lang.CharSequence uri,
java.lang.String trigger,
java.lang.String build)
- Utility method for extracting 'implied' URI given a source uri,
trigger pattern, and build pattern.
- Parameters:
uri
- source to check for implied URI
trigger
- regex pattern which if matched implies another URI
build
- replacement pattern to build the implied URI
- Returns:
- implied URI, or null if none
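As an illustration of the contract documented above, here is a minimal sketch built only on java.util.regex. It is not the actual Heritrix implementation: the class ExtractImpliedSketch is hypothetical, the sample trigger and URIs are invented for the example, and it assumes that the trigger must match the entire source URI and that the build pattern uses the usual $n group references.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical re-creation of the documented behaviour, for illustration only.
class ExtractImpliedSketch {
    // Returns the implied URI built from 'build', or null if 'trigger'
    // does not match the whole of 'uri'.
    static String extractImplied(CharSequence uri, String trigger, String build) {
        Matcher m = Pattern.compile(trigger).matcher(uri);
        if (m.matches()) {
            // replaceFirst resets the matcher and substitutes group
            // references such as $1 in the build pattern.
            return m.replaceFirst(build);
        }
        return null; // no implied URI
    }

    public static void main(String[] args) {
        // Assumed sample trigger: an absolute HTTP(S) URI in a "to" parameter.
        String trigger = "^.*[?&]to=(https?://.+)$";
        System.out.println(extractImplied(
                "http://example.com/go?to=https://archive.org/", trigger, "$1"));
        // prints "https://archive.org/"
        System.out.println(extractImplied(
                "http://example.com/index.html", trigger, "$1"));
        // prints "null"
    }
}
```

Callers should check for null before queuing the result, since most URIs will not match the trigger.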
report
public java.lang.String report()
- Description copied from class:
Processor
- Compiles and returns a report (in human readable form) about the status
of the processor. The processor's name (of implementing class) should
always be included.
Examples of stats declared would include:
* Number of CrawlURIs handled.
* Number of links extracted (for link extractors)
etc.
- Overrides:
report
in class Processor
- Returns:
- A human readable report on the processor's state.
Copyright © 2003-2011 Internet Archive. All Rights Reserved.