java.lang.Object
  javax.management.Attribute
    org.archive.crawler.settings.Type
      org.archive.crawler.settings.ComplexType
        org.archive.crawler.settings.ModuleType
          org.archive.crawler.framework.Processor
            org.archive.crawler.extractor.Extractor
              org.archive.crawler.extractor.ExtractorHTML
public class ExtractorHTML
Basic link-extraction, from an HTML content-body, using regular expressions.
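As a rough illustration of what regular-expression link extraction from an HTML body can look like, here is a minimal sketch. It is not the code of ExtractorHTML: the class name RegexLinkSketch, the pattern, and the method extractLinks are all hypothetical, and the pattern only handles quoted href/src values and finds at most one such attribute per tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the ExtractorHTML implementation.
final class RegexLinkSketch {
    // Assumed, simplified pattern: the first quoted href or src value in a tag.
    private static final Pattern LINK_ATTR = Pattern.compile(
        "(?is)<[a-z][^>]*?\\s(?:href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    // Returns the raw attribute values in document order.
    static List<String> extractLinks(CharSequence html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK_ATTR.matcher(html);
        while (m.find()) {
            // Group 1 is the double-quoted form, group 2 the single-quoted form.
            links.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about.html\">About</a> <img src='logo.gif'>";
        System.out.println(extractLinks(html));
    }
}
```

A regex scan like this tolerates malformed markup that would trip a strict parser, at the cost of occasionally matching text that is not a real link.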
Nested Class Summary |
---|
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType |
---|
ComplexType.MBeanAttributeInfoIterator |
Field Summary | |
---|---|
(package private) static java.lang.String | APPLET |
static java.lang.String | ATTR_EXTRACT_JAVASCRIPT: whether to try finding links in JavaScript; default true |
static java.lang.String | ATTR_EXTRACT_ONLY_FORM_GETS |
static java.lang.String | ATTR_IGNORE_FORM_ACTION_URLS |
static java.lang.String | ATTR_IGNORE_UNEXPECTED_HTML |
static java.lang.String | ATTR_TREAT_FRAMES_AS_EMBED_LINKS |
(package private) static java.lang.String | BASE |
(package private) static java.lang.String | CLASSEXT |
(package private) static java.lang.String | EACH_ATTRIBUTE_EXTRACTOR |
static java.lang.String | EXTRACT_VALUE_ATTRIBUTES |
(package private) static java.lang.String | FRAME |
(package private) static java.lang.String | IFRAME |
(package private) static java.lang.String | JAVASCRIPT |
(package private) static java.lang.String | LINK |
(package private) static int | MAX_ATTR_VAL_LENGTH |
(package private) static java.lang.String | NON_HTML_PATH_EXTENSION |
protected long | numberOfCURIsHandled |
protected long | numberOfLinksExtracted |
(package private) static java.lang.String | RELEVANT_TAG_EXTRACTOR |
(package private) static java.lang.String | WHITESPACE |
Fields inherited from class org.archive.crawler.framework.Processor |
---|
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules |
Fields inherited from class org.archive.crawler.settings.ComplexType |
---|
definition, definitionMap |
Constructor Summary | |
---|---|
ExtractorHTML(java.lang.String name) |
ExtractorHTML(java.lang.String name, java.lang.String description) |
Method Summary | |
---|---|
protected void | addLinkFromString(CrawlURI curi, java.lang.CharSequence uri, java.lang.CharSequence context, char hopType) |
protected void | considerIfLikelyUri(CrawlURI curi, java.lang.CharSequence candidate, java.lang.CharSequence valueContext, char hopType): Consider whether a given string is URI-like. |
protected void | considerQueryStringValues(CrawlURI curi, java.lang.CharSequence queryString, java.lang.CharSequence valueContext, char hopType): Consider a query-string-like collection of key=value[&key=value] pairs for URI-like strings in the values. |
void | extract(CrawlURI curi) |
(package private) void | extract(CrawlURI curi, java.lang.CharSequence cs): Run extractor. |
protected boolean | isHtmlExpectedHere(CrawlURI curi): Test whether this HTML is so unexpected (e.g., in place of a GIF URI) that it shouldn't be scanned for links. |
protected void | processEmbed(CrawlURI curi, java.lang.CharSequence value, java.lang.CharSequence context) |
protected void | processEmbed(CrawlURI curi, java.lang.CharSequence value, java.lang.CharSequence context, char hopType) |
protected void | processGeneralTag(CrawlURI curi, java.lang.CharSequence element, java.lang.CharSequence cs) |
protected void | processLink(CrawlURI curi, java.lang.CharSequence value, java.lang.CharSequence context): Handle generic HREF cases. |
protected boolean | processMeta(CrawlURI curi, java.lang.CharSequence cs): Process metadata tags. |
protected void | processScript(CrawlURI curi, java.lang.CharSequence sequence, int endOfOpenTag) |
protected void | processScriptCode(CrawlURI curi, java.lang.CharSequence cs): Extract the (java)script source in the given CharSequence. |
protected void | processStyle(CrawlURI curi, java.lang.CharSequence sequence, int endOfOpenTag): Process style text. |
java.lang.String | report(): Compiles and returns a report (in human readable form) about the status of the processor. |
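The considerQueryStringValues summary describes scanning key=value[&key=value] pairs for URI-like values. The sketch below shows one way such a scan could work. It is an assumption-laden illustration, not the actual method: the class QueryValueSketch, the helpers looksLikeUri and uriLikeValues, and the "has a scheme or starts with a slash" heuristic are all hypothetical, and no URL-decoding is attempted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the ExtractorHTML implementation.
final class QueryValueSketch {
    // Assumed heuristic: scheme://something, or an absolute path.
    static boolean looksLikeUri(String value) {
        return value.matches("(?i)^[a-z][a-z0-9+.-]*://\\S+$")
            || value.matches("^/\\S+$");
    }

    // Splits on '&', takes the text after the first '=', keeps URI-like values.
    static List<String> uriLikeValues(CharSequence queryString) {
        List<String> found = new ArrayList<>();
        for (String pair : queryString.toString().split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // a bare key has no value to consider
            }
            String value = pair.substring(eq + 1);
            if (looksLikeUri(value)) {
                found.add(value);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(
            uriLikeValues("id=7&next=http://example.com/a&img=/pics/x.gif&q=cats"));
    }
}
```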
Methods inherited from class org.archive.crawler.extractor.Extractor |
---|
innerProcess, isHttpTransactionContentToProcess, isIndependentExtractors |
Methods inherited from class org.archive.crawler.framework.Processor |
---|
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, kickUpdate, process, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn |
Methods inherited from class org.archive.crawler.settings.ModuleType |
---|
addElement, listUsedFiles |
Methods inherited from class org.archive.crawler.settings.Type |
---|
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient |
Methods inherited from class javax.management.Attribute |
---|
getName, hashCode |
Methods inherited from class java.lang.Object |
---|
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
static final java.lang.String RELEVANT_TAG_EXTRACTOR
static final int MAX_ATTR_VAL_LENGTH
static final java.lang.String EACH_ATTRIBUTE_EXTRACTOR
static final java.lang.String WHITESPACE
static final java.lang.String CLASSEXT
static final java.lang.String APPLET
static final java.lang.String BASE
static final java.lang.String LINK
static final java.lang.String FRAME
static final java.lang.String IFRAME
public static final java.lang.String ATTR_TREAT_FRAMES_AS_EMBED_LINKS
public static final java.lang.String ATTR_IGNORE_FORM_ACTION_URLS
public static final java.lang.String ATTR_EXTRACT_ONLY_FORM_GETS
public static final java.lang.String ATTR_EXTRACT_JAVASCRIPT
public static final java.lang.String EXTRACT_VALUE_ATTRIBUTES
public static final java.lang.String ATTR_IGNORE_UNEXPECTED_HTML
protected long numberOfCURIsHandled
protected long numberOfLinksExtracted
static final java.lang.String JAVASCRIPT
static final java.lang.String NON_HTML_PATH_EXTENSION
Constructor Detail |
---|
public ExtractorHTML(java.lang.String name)
public ExtractorHTML(java.lang.String name, java.lang.String description)
Method Detail |
---|
protected void processGeneralTag(CrawlURI curi, java.lang.CharSequence element, java.lang.CharSequence cs)
protected void considerQueryStringValues(CrawlURI curi, java.lang.CharSequence queryString, java.lang.CharSequence valueContext, char hopType)
Parameters:
curi - origin CrawlURI
queryString - query-string-like string
valueContext - page context where found
protected void considerIfLikelyUri(CrawlURI curi, java.lang.CharSequence candidate, java.lang.CharSequence valueContext, char hopType)
Parameters:
curi - origin CrawlURI
candidate - string to consider as a possible URI
valueContext - page context where found
protected void processScriptCode(CrawlURI curi, java.lang.CharSequence cs)
Parameters:
curi - source CrawlURI
cs - CharSequence of javascript code
protected void processLink(CrawlURI curi, java.lang.CharSequence value, java.lang.CharSequence context)
Parameters:
curi -
value -
context -
protected void addLinkFromString(CrawlURI curi, java.lang.CharSequence uri, java.lang.CharSequence context, char hopType)
protected final void processEmbed(CrawlURI curi, java.lang.CharSequence value, java.lang.CharSequence context)
protected void processEmbed(CrawlURI curi, java.lang.CharSequence value, java.lang.CharSequence context, char hopType)
public void extract(CrawlURI curi)
extract in class Extractor
void extract(CrawlURI curi, java.lang.CharSequence cs)
Parameters:
curi - CrawlURI we're processing.
cs - Sequence from underlying ReplayCharSequence. This is TRANSIENT data. Make a copy if you want the data to live outside of this extractor's lifetime.
protected boolean isHtmlExpectedHere(CrawlURI curi) throws org.apache.commons.httpclient.URIException
Parameters:
curi - CrawlURI to examine.
Throws:
org.apache.commons.httpclient.URIException
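One plausible way to decide that HTML is "unexpected" for a URI is to look at the extension of its path. The sketch below illustrates that idea only. The class UnexpectedHtmlSketch, the method isHtmlExpected, and the extension list are assumptions made for this example; the actual value of NON_HTML_PATH_EXTENSION is not shown on this page.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not the ExtractorHTML implementation.
final class UnexpectedHtmlSketch {
    // Assumed extension list; the real NON_HTML_PATH_EXTENSION may differ.
    private static final Pattern NON_HTML_EXT =
        Pattern.compile("(?i)\\.(?:gif|jpe?g|png|css|js|pdf|zip)$");

    // HTML is treated as expected unless the path ends in a non-HTML extension.
    static boolean isHtmlExpected(String path) {
        if (path == null) {
            return true; // nothing to judge by, so do not rule HTML out
        }
        return !NON_HTML_EXT.matcher(path).find();
    }

    public static void main(String[] args) {
        System.out.println(isHtmlExpected("/images/logo.GIF"));
        System.out.println(isHtmlExpected("/index.html"));
    }
}
```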
protected void processScript(CrawlURI curi, java.lang.CharSequence sequence, int endOfOpenTag)
protected boolean processMeta(CrawlURI curi, java.lang.CharSequence cs)
Parameters:
curi - CrawlURI we're processing.
cs - Sequence from underlying ReplayCharSequence. This is TRANSIENT data. Make a copy if you want the data to live outside of this extractor's lifetime.
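The TRANSIENT warning above means that a CharSequence handed to the extractor should not be kept by reference. A minimal sketch of the defensive copy follows. The class TransientCopySketch and its methods are hypothetical, and a StringBuilder stands in for the buffer-backed sequence, since ReplayCharSequence itself is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the ExtractorHTML implementation.
final class TransientCopySketch {
    private final List<String> retained = new ArrayList<>();

    // Keeps an independent String rather than a view onto the transient sequence.
    void remember(CharSequence cs, int start, int end) {
        retained.add(cs.subSequence(start, end).toString());
    }

    List<String> retained() {
        return retained;
    }

    public static void main(String[] args) {
        StringBuilder buffer = new StringBuilder("<a href=x>"); // stand-in buffer
        TransientCopySketch sketch = new TransientCopySketch();
        sketch.remember(buffer, 3, 9);
        buffer.setLength(0); // simulate the underlying data going away
        System.out.println(sketch.retained());
    }
}
```

Calling toString() yields an immutable String, so later changes to the underlying buffer do not affect the retained value.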
protected void processStyle(CrawlURI curi, java.lang.CharSequence sequence, int endOfOpenTag)
Parameters:
curi - CrawlURI we're processing.
sequence - Sequence from underlying ReplayCharSequence. This is TRANSIENT data. Make a copy if you want the data to live outside of this extractor's lifetime.
endOfOpenTag -
public java.lang.String report()
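Description copied from class: Processor
Compiles and returns a report (in human readable form) about the status of the processor. Examples of stats declared would include:
* Number of CrawlURIs handled.
* Number of links extracted (for link extractors)
etc.
report in class Processor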
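To make the kind of report described above concrete, here is a small sketch that keeps the two counters named in the field list and formats them as text. The class ReportSketch, the helper recordHandled, and the output layout are assumptions for illustration; the real report() format is not documented on this page.

```java
// Hypothetical illustration only; not the ExtractorHTML implementation.
final class ReportSketch {
    // Counter names mirror the documented fields; their use here is assumed.
    protected long numberOfCURIsHandled = 0;
    protected long numberOfLinksExtracted = 0;

    // Records one processed CrawlURI and the number of links found in it.
    void recordHandled(int linksFound) {
        numberOfCURIsHandled++;
        numberOfLinksExtracted += linksFound;
    }

    // Builds a short human-readable status report (assumed layout).
    String report() {
        StringBuilder sb = new StringBuilder();
        sb.append("Processor: ReportSketch\n");
        sb.append("  CrawlURIs handled: ").append(numberOfCURIsHandled).append('\n');
        sb.append("  Links extracted: ").append(numberOfLinksExtracted).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        ReportSketch r = new ReportSketch();
        r.recordHandled(3);
        r.recordHandled(2);
        System.out.print(r.report());
    }
}
```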