org.archive.crawler.extractor
Class ExtractorUniversal
java.lang.Object
javax.management.Attribute
org.archive.crawler.settings.Type
org.archive.crawler.settings.ComplexType
org.archive.crawler.settings.ModuleType
org.archive.crawler.framework.Processor
org.archive.crawler.extractor.Extractor
org.archive.crawler.extractor.ExtractorUniversal
- All Implemented Interfaces:
- java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants
public class ExtractorUniversal
- extends Extractor
- implements CoreAttributeConstants
A last-ditch extractor that looks at the raw bytes of a document and tries to
extract anything that looks like a link.
If used, it should always be specified as the last link extractor in the
order file.
To accomplish this it scans through the raw bytes and tries to build up
strings of consecutive bytes that all represent characters that are valid
in a URL (see #isURLableChar(int) for details).
Once it hits the end of such a string (i.e. finds a character that
should not be in a URL) it tries to determine whether it has found a URL.
This is done by checking whether the string is an IP address prefixed with
http(s):// or contains a dot followed by a top-level domain and then either
the end of the string or a slash.
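The scan described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the class's actual implementation: the class name UrlCandidateScanner, the character set accepted by isUrlChar, the two regular expressions, and the short TLD list are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Heritrix.
final class UrlCandidateScanner {
    // Assumption: a rough approximation of "characters valid in a URL";
    // the real isURLableChar(int) may accept a different set.
    static boolean isUrlChar(int b) {
        return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
            || (b >= '0' && b <= '9')
            || "-._~:/?#[]@!$&'()*+,;=%".indexOf(b) >= 0;
    }

    // Assumption: illustrative patterns, not the IP_ADDRESS and TLDs constants.
    static final Pattern IP_URL = Pattern.compile(
        "https?://\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}(?:[:/].*)?");
    static final Pattern DOT_TLD = Pattern.compile(
        ".*\\.(?:com|org|net|edu|gov)(?:/.*)?", Pattern.CASE_INSENSITIVE);

    static boolean looksLikeUrl(String s) {
        return IP_URL.matcher(s).matches() || DOT_TLD.matcher(s).matches();
    }

    // Collects runs of URL-able bytes and keeps those that look like URLs.
    static List<String> scan(byte[] data) {
        List<String> found = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i <= data.length; i++) {
            // -1 marks end of input so the final run is also checked.
            int b = i < data.length ? (data[i] & 0xFF) : -1;
            if (b >= 0 && isUrlChar(b)) {
                run.append((char) b);
            } else {
                if (run.length() > 0 && looksLikeUrl(run.toString())) {
                    found.add(run.toString());
                }
                run.setLength(0);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        byte[] data =
            "\u0000\u0001see www.example.com/a now\u0002http://10.0.0.1/x\u0000junk.bin"
                .getBytes(StandardCharsets.ISO_8859_1);
        // prints [www.example.com/a, http://10.0.0.1/x]
        System.out.println(scan(data));
    }
}
```

Because any run of URL-able bytes is a candidate, a scan of this kind tends to produce false positives, which is one reason to run it only after the more specific extractors.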
- Author:
- Kristinn Sigurdsson
- See Also:
- Serialized Form
Field Summary |
(package private) static java.lang.String |
IP_ADDRESS
Matches any string that begins with http:// or https:// followed by
something that looks like an IP address (four numbers, none longer than
3 characters, separated by 3 dots). |
protected long |
numberOfCURIsHandled
|
protected long |
numberOfLinksExtracted
|
static java.lang.String |
TLDs
Matches any string that begins with a TLD (no .) followed by a slash ('/')
or end of string. |
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants |
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX |
Method Summary |
protected void |
extract(CrawlURI curi)
|
java.lang.String |
report()
Compiles and returns a report (in human readable form) about the status
of the processor. |
Methods inherited from class org.archive.crawler.framework.Processor |
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, kickUpdate, process, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn |
Methods inherited from class org.archive.crawler.settings.ComplexType |
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute |
Methods inherited from class org.archive.crawler.settings.Type |
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient |
Methods inherited from class javax.management.Attribute |
getName, hashCode |
Methods inherited from class java.lang.Object |
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
IP_ADDRESS
static final java.lang.String IP_ADDRESS
- Matches any string that begins with http:// or https:// followed by
something that looks like an IP address (four numbers, none longer than
3 characters, separated by 3 dots). Does not ensure that the numbers are
each in the range 0-255.
- See Also:
- Constant Field Values
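As an illustration of the kind of match described for IP_ADDRESS, the sketch below uses a pattern written for this example. It is an assumption, not the constant's actual value, and the class and method names are hypothetical. It shows that each number is limited to three digits but is not range-checked.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Heritrix.
final class IpUrlPatternSketch {
    // Assumption: an illustrative pattern, not the actual value of IP_ADDRESS.
    static final Pattern IP_URL_PREFIX = Pattern.compile(
        "^https?://\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}(?!\\d)");

    // True if the string begins with http(s):// followed by four
    // dot-separated groups of one to three digits.
    static boolean startsWithIpUrl(String s) {
        return IP_URL_PREFIX.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(startsWithIpUrl("http://192.168.0.1/index.html")); // true
        System.out.println(startsWithIpUrl("https://999.999.999.999"));       // true: range is not checked
        System.out.println(startsWithIpUrl("http://1234.1.1.1"));             // false: first group has 4 digits
        System.out.println(startsWithIpUrl("ftp://10.0.0.1"));                // false: wrong scheme
    }
}
```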
TLDs
public static final java.lang.String TLDs
- Matches any string that begins with a TLD (no .) followed by a slash ('/')
or end of string. If the TLD is followed by a slash, nothing after the slash
is of consequence.
- See Also:
- Constant Field Values
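The sketch below illustrates how a pattern of this shape might be applied to the text that follows each dot in a candidate string. The pattern, the five-entry TLD list, and the class and method names are assumptions for the example; the real TLDs constant presumably covers far more domains.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Heritrix.
final class TldPatternSketch {
    // Assumption: a tiny sample list, not the actual value of TLDs.
    static final Pattern TLD_THEN_SLASH_OR_END = Pattern.compile(
        "^(?:com|org|net|is|uk)(?:/.*|$)",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // True if the text begins with a TLD followed by '/' or end of string.
    static boolean beginsWithTld(String afterDot) {
        return TLD_THEN_SLASH_OR_END.matcher(afterDot).find();
    }

    // Tries the text after every dot, mirroring "a dot followed by a TLD".
    static boolean containsDotTld(String candidate) {
        for (int dot = candidate.indexOf('.'); dot >= 0;
                 dot = candidate.indexOf('.', dot + 1)) {
            if (beginsWithTld(candidate.substring(dot + 1))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(beginsWithTld("com/some/path"));        // true
        System.out.println(beginsWithTld("comet"));                // false
        System.out.println(containsDotTld("www.example.org"));     // true
        System.out.println(containsDotTld("archive.tar.gz"));      // false
    }
}
```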
numberOfCURIsHandled
protected long numberOfCURIsHandled
numberOfLinksExtracted
protected long numberOfLinksExtracted
ExtractorUniversal
public ExtractorUniversal(java.lang.String name)
- Constructor
- Parameters:
name
- The name of the module.
extract
protected void extract(CrawlURI curi)
- Specified by:
extract
in class Extractor
report
public java.lang.String report()
- Description copied from class:
Processor
- Compiles and returns a report (in human readable form) about the status
of the processor. The processor's name (of implementing class) should
always be included.
Examples of stats declared would include:
* Number of CrawlURIs handled.
* Number of links extracted (for link extractors)
etc.
- Overrides:
report
in class Processor
- Returns:
- A human readable report on the processor's state.
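A report along the lines described might be assembled as in the sketch below. The class ReportSketch is a standalone stand-in and does not extend Processor, and the exact layout and labels are assumptions; only the two counter names come from this page.

```java
// Hypothetical stand-in for a processor; not part of Heritrix.
final class ReportSketch {
    // Counter names taken from the field summary of ExtractorUniversal.
    long numberOfCURIsHandled;
    long numberOfLinksExtracted;

    // Builds a short human-readable status report that includes the
    // implementing class name and the two counters.
    String report() {
        StringBuilder sb = new StringBuilder();
        sb.append("Processor: ").append(getClass().getName()).append('\n');
        sb.append("  CrawlURIs handled: ").append(numberOfCURIsHandled).append('\n');
        sb.append("  Links extracted:   ").append(numberOfLinksExtracted).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        ReportSketch p = new ReportSketch();
        p.numberOfCURIsHandled = 3;
        p.numberOfLinksExtracted = 7;
        System.out.print(p.report());
    }
}
```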
Copyright © 2003-2011 Internet Archive. All Rights Reserved.