ExtractorUniversal (Heritrix 1.15.5-201106092337)

org.archive.crawler.extractor
Class ExtractorUniversal

java.lang.Object
  javax.management.Attribute
      org.archive.crawler.settings.Type
          org.archive.crawler.settings.ComplexType
              org.archive.crawler.settings.ModuleType
                  org.archive.crawler.framework.Processor
                      org.archive.crawler.extractor.Extractor
                          org.archive.crawler.extractor.ExtractorUniversal

All Implemented Interfaces: java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants

public class ExtractorUniversal
extends Extractor
implements CoreAttributeConstants

A last-ditch extractor that looks at the raw bytes of a document and tries to extract anything that looks like a link. If used, it should always be specified as the last link extractor in the order file.

To accomplish this, it scans through the raw bytes and tries to build up strings of consecutive bytes that all represent characters that are valid in a URL (see #isURLableChar(int) for details). Once it hits the end of such a string (i.e. finds a character that should not be in a URL), it tries to determine whether it has found a URL. This is done by checking whether the string is an IP address prefixed with http(s):// or contains a dot followed by a top-level domain and then either the end of the string or a slash.
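
The scan described above can be sketched roughly as follows. This is a minimal illustration, not the Heritrix source: the class name `UniversalScanSketch`, the helper `isUrlChar` (a simplified stand-in for `isURLableChar(int)`), the two patterns, and the four-entry TLD list are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class UniversalScanSketch {
    // Assumption: http(s):// followed by four dot-separated groups of 1-3 digits.
    static final Pattern IP_LIKE =
        Pattern.compile("^https?://\\d{1,3}(?:\\.\\d{1,3}){3}(?:[:/]|$)");
    // Assumption: a tiny sample TLD list; a real list would be far longer.
    static final Pattern DOT_TLD =
        Pattern.compile("\\.(?:com|org|net|is)(?:/|$)");

    // Simplified stand-in for isURLableChar(int); the real method may differ.
    static boolean isUrlChar(int b) {
        return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
            || (b >= '0' && b <= '9') || "-._~:/?#@%&=+".indexOf(b) >= 0;
    }

    static boolean looksLikeUrl(String s) {
        return IP_LIKE.matcher(s).find() || DOT_TLD.matcher(s).find();
    }

    // Collects runs of URL-able bytes and keeps the runs that look like URLs.
    static List<String> extract(byte[] content) {
        List<String> links = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i <= content.length; i++) {
            // -1 marks the end of input so the final run is also checked.
            int b = i < content.length ? (content[i] & 0xff) : -1;
            if (b >= 0 && isUrlChar(b)) {
                run.append((char) b);
            } else {
                if (run.length() > 0 && looksLikeUrl(run.toString())) {
                    links.add(run.toString());
                }
                run.setLength(0);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        byte[] raw = "\u0001see www.example.org/about now\u0002http://10.0.0.1/x\u0003 junk.bin"
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(extract(raw));
    }
}
```

Because every run of URL-able bytes is a candidate, a scan like this is prone to false positives on binary content, which is consistent with the advice to place this extractor last in the order file.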

Author: Kristinn Sigurdsson
See Also: Serialized Form

Nested Class Summary

Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
`ComplexType.MBeanAttributeInfoIterator`

Field Summary
`(package private) static java.lang.String`	`IP_ADDRESS` Matches any string that begins with http:// or https:// followed by something that looks like an IP address (four numbers, none longer than 3 characters, separated by 3 dots).
`protected long`	`numberOfCURIsHandled`
`protected long`	`numberOfLinksExtracted`
`static java.lang.String`	`TLDs` Matches any string that begins with a TLD (no .) followed by a '/' slash or end of string.

Fields inherited from class org.archive.crawler.framework.Processor
`ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules`

Fields inherited from class org.archive.crawler.settings.ComplexType
`definition, definitionMap`

Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
`A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX`

Constructor Summary
`ExtractorUniversal(java.lang.String name)` Constructor

Method Summary
`protected void`	`extract(CrawlURI curi)`
`java.lang.String`	`report()` Compiles and returns a report (in human readable form) about the status of the processor.

Methods inherited from class org.archive.crawler.extractor.Extractor
`innerProcess, isHttpTransactionContentToProcess, isIndependentExtractors`

Methods inherited from class org.archive.crawler.framework.Processor
`checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, kickUpdate, process, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn`

Methods inherited from class org.archive.crawler.settings.ModuleType
`addElement, listUsedFiles`

Methods inherited from class org.archive.crawler.settings.ComplexType
`addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute`

Methods inherited from class org.archive.crawler.settings.Type
`addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient`

Methods inherited from class javax.management.Attribute
`getName, hashCode`

Methods inherited from class java.lang.Object
`clone, finalize, getClass, notify, notifyAll, wait, wait, wait`

Field Detail

IP_ADDRESS

static final java.lang.String IP_ADDRESS

Matches any string that begins with http:// or https:// followed by something that looks like an IP address (four numbers, none longer than 3 characters, separated by 3 dots). Does not ensure that the numbers are each in the range 0-255.
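
As an illustration of the behaviour described above, a pattern along the following lines would accept IP-like URLs without range-checking the numbers. The expression is an assumption written for this example and is not the actual value of `IP_ADDRESS`; the class and method names are hypothetical as well.

```java
import java.util.regex.Pattern;

class IpAddressPatternSketch {
    // Hypothetical pattern: scheme, then four groups of 1-3 digits joined by dots.
    static final Pattern IP_LIKE =
        Pattern.compile("^https?://\\d{1,3}(?:\\.\\d{1,3}){3}");

    // True when the string begins with an http(s) URL whose host looks like an IP.
    static boolean beginsWithIpUrl(String s) {
        return IP_LIKE.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(beginsWithIpUrl("http://192.168.0.1/index.html"));
        // Out-of-range numbers still match, mirroring the caveat in the field description.
        System.out.println(beginsWithIpUrl("https://999.999.999.999"));
        System.out.println(beginsWithIpUrl("http://example.org"));
    }
}
```

A caller that needs valid addresses would have to parse the four numbers and check each against 0-255 separately.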

See Also: Constant Field Values

TLDs

public static final java.lang.String TLDs

Matches any string that begins with a TLD (no .) followed by a '/' slash or end of string. If followed by slash then nothing after the slash is of consequence.
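
One plausible way to apply such a pattern is to test the text that follows each dot in a candidate string. The sketch below assumes a four-entry TLD list and hypothetical names (`TldPatternSketch`, `TLD_LIKE`, `hasTldAfterSomeDot`); the real `TLDs` constant and the way the extractor applies it may differ.

```java
import java.util.regex.Pattern;

class TldPatternSketch {
    // Hypothetical pattern: a TLD (no leading dot), then end of string or a slash
    // followed by anything at all.
    static final Pattern TLD_LIKE =
        Pattern.compile("^(?:com|org|net|is)(?:/.*)?$", Pattern.CASE_INSENSITIVE);

    // Checks the text after each dot in the candidate against the TLD pattern.
    static boolean hasTldAfterSomeDot(String candidate) {
        int dot = candidate.indexOf('.');
        while (dot >= 0) {
            if (TLD_LIKE.matcher(candidate.substring(dot + 1)).matches()) {
                return true;
            }
            dot = candidate.indexOf('.', dot + 1);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasTldAfterSomeDot("www.example.org/about"));
        System.out.println(hasTldAfterSomeDot("archive.tar"));
    }
}
```

Requiring a slash or the end of the string directly after the TLD keeps a string such as `organic` from being read as the TLD `org`.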

See Also: Constant Field Values

numberOfCURIsHandled

protected long numberOfCURIsHandled

numberOfLinksExtracted

protected long numberOfLinksExtracted

Constructor Detail