org.archive.crawler.extractor
Class ExtractorURI
java.lang.Object
javax.management.Attribute
org.archive.crawler.settings.Type
org.archive.crawler.settings.ComplexType
org.archive.crawler.settings.ModuleType
org.archive.crawler.framework.Processor
org.archive.crawler.extractor.Extractor
org.archive.crawler.extractor.ExtractorURI
- All Implemented Interfaces:
- java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants
public class ExtractorURI
- extends Extractor
- implements CoreAttributeConstants
An extractor for finding URIs inside other URIs. Unlike most other
extractors, this works on URIs discovered by previous extractors. Thus
it should appear near the end of any set of extractors.
The initial implementation finds only absolute HTTP(S) URIs in the
query string or in its parameters.
TODO: extend to find URIs in path-info
- Author:
- Gordon Mohr
- See Also:
- Serialized Form
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants |
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX |
Constructor Summary |
ExtractorURI(java.lang.String name)
Constructor |
Method Summary |
void |
extract(CrawlURI curi)
Perform usual extraction on a CrawlURI |
protected void |
extractLink(CrawlURI curi,
Link wref)
Consider a single Link for internal URIs |
protected static java.util.List<java.lang.String> |
extractQueryStringLinks(UURI source)
Look for URIs inside the supplied UURI. |
java.lang.String |
report()
Compiles and returns a report (in human readable form) about the status
of the processor. |
Methods inherited from class org.archive.crawler.framework.Processor |
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, kickUpdate, process, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn |
Methods inherited from class org.archive.crawler.settings.ComplexType |
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute |
Methods inherited from class org.archive.crawler.settings.Type |
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient |
Methods inherited from class javax.management.Attribute |
getName, hashCode |
Methods inherited from class java.lang.Object |
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
ABS_HTTP_URI_PATTERN
static final java.lang.String ABS_HTTP_URI_PATTERN
- See Also:
- Constant Field Values
ExtractorURI
public ExtractorURI(java.lang.String name)
- Constructor
- Parameters:
name
-
extract
public void extract(CrawlURI curi)
- Perform usual extraction on a CrawlURI
- Specified by:
extract
in class Extractor
- Parameters:
curi
- Crawl URI to process.
extractLink
protected void extractLink(CrawlURI curi,
Link wref)
- Consider a single Link for internal URIs
- Parameters:
curi
- CrawlURI to add discoveries to
wref
- Link to examine for internal URIs
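The class description notes that this extractor works on links found by earlier extractors. The following is a minimal sketch of that pass, not the Heritrix implementation: Page is a hypothetical stand-in for CrawlURI, and innerUriFinder is a hypothetical hook standing in for the per-link examination. It iterates over a snapshot of the existing outlinks so that newly discovered URIs can be appended safely during the loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class OutlinkPassSketch {

    /** Hypothetical stand-in for a CrawlURI: a URI plus its discovered outlinks. */
    static final class Page {
        final String uri;
        final List<String> outlinks = new ArrayList<>();

        Page(String uri) {
            this.uri = uri;
        }
    }

    /**
     * Examines every outlink already present on the page and appends any
     * inner URIs reported by the finder. Returns the number of URIs added.
     */
    static int extract(Page page, Function<String, List<String>> innerUriFinder) {
        int added = 0;
        // Iterate over a copy: the loop body appends to page.outlinks.
        for (String link : new ArrayList<>(page.outlinks)) {
            for (String inner : innerUriFinder.apply(link)) {
                if (!page.outlinks.contains(inner)) {
                    page.outlinks.add(inner);
                    added++;
                }
            }
        }
        return added;
    }
}
```

Because the pass only sees outlinks that already exist, it finds nothing if it runs before the extractors that populate them, which is why the class description recommends placing it near the end of the extractor chain.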
extractQueryStringLinks
protected static java.util.List<java.lang.String> extractQueryStringLinks(UURI source)
- Look for URIs inside the supplied UURI.
Static for ease of testing or outside use.
- Parameters:
source
- UURI to examine
- Returns:
- List of discovered String URIs.
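As an illustration of the behavior described above, here is a self-contained sketch that looks for absolute HTTP(S) URIs in a query string and in its parameter values. It does not use Heritrix classes: it takes a plain String instead of a UURI, the class and method names are hypothetical, and the regular expression is an assumed pattern, not the actual value of ABS_HTTP_URI_PATTERN.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class QueryStringUriSketch {

    // Assumed pattern for illustration only; the real constant may differ.
    private static final Pattern ABS_HTTP =
            Pattern.compile("https?://[^\\s<>]+", Pattern.CASE_INSENSITIVE);

    /** Returns absolute HTTP(S) URIs found in the query string of source. */
    static List<String> findUris(String source) {
        List<String> found = new ArrayList<>();
        String query;
        try {
            query = URI.create(source).getRawQuery();
        } catch (IllegalArgumentException e) {
            return found; // source is not a parseable URI
        }
        if (query == null || query.isEmpty()) {
            return found;
        }
        // Case 1: the whole query string is itself a URI.
        consider(query, found);
        // Case 2: the value of a name=value parameter is a URI.
        for (String param : query.split("&")) {
            int eq = param.indexOf('=');
            if (eq >= 0) {
                consider(param.substring(eq + 1), found);
            }
        }
        return found;
    }

    /** Percent-decodes the candidate and keeps it if it is an absolute HTTP(S) URI. */
    private static void consider(String raw, List<String> found) {
        String decoded;
        try {
            decoded = URLDecoder.decode(raw, "UTF-8");
        } catch (UnsupportedEncodingException | IllegalArgumentException e) {
            return; // malformed escape sequence: skip this candidate
        }
        if (ABS_HTTP.matcher(decoded).matches() && !found.contains(decoded)) {
            found.add(decoded);
        }
    }

    public static void main(String[] args) {
        System.out.println(findUris(
                "http://example.com/redirect?url=http%3A%2F%2Fexample.org%2Fpage&x=1"));
    }
}
```

Keeping the lookup in a static method with no crawler state, as the Javadoc notes for the real method, makes it easy to exercise from a unit test with nothing more than an input URI.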
report
public java.lang.String report()
- Description copied from class:
Processor
- Compiles and returns a report (in human readable form) about the status
of the processor. The processor's name (of implementing class) should
always be included.
Examples of stats declared would include:
* Number of CrawlURIs handled.
* Number of links extracted (for link extractors)
etc.
- Overrides:
report
in class Processor
- Returns:
- A human readable report on the processor's state.
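A report following the guidance above might be assembled as in the sketch below. The layout, the labels, and the ReportSketch class are assumptions made for illustration; the actual text produced by ExtractorURI.report() is not shown in this documentation.

```java
final class ReportSketch {

    /**
     * Builds a human-readable status report that names the processor and
     * lists the two example statistics mentioned in the description.
     */
    static String report(String processorName, long urisHandled, long linksExtracted) {
        StringBuilder sb = new StringBuilder();
        sb.append("Processor: ").append(processorName).append('\n');
        sb.append("  CrawlURIs handled: ").append(urisHandled).append('\n');
        sb.append("  Links extracted:   ").append(linksExtracted).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report("org.archive.crawler.extractor.ExtractorURI", 10, 3));
    }
}
```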
Copyright © 2003-2011 Internet Archive. All Rights Reserved.