java.lang.Object
  javax.management.Attribute
    org.archive.crawler.settings.Type
      org.archive.crawler.settings.ComplexType
        org.archive.crawler.settings.ModuleType
          org.archive.crawler.framework.Processor
            org.archive.crawler.fetcher.FetchHTTP
public class FetchHTTP
HTTP fetcher that uses Apache Jakarta Commons HttpClient library.
Nested Class Summary | |
---|---|
(package private) class |
FetchHTTP.PostRestore
|
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType |
---|
ComplexType.MBeanAttributeInfoIterator |
Field Summary | |
---|---|
static java.lang.String |
ATTR_ACCEPT_HEADERS
|
static java.lang.String |
ATTR_BDB_COOKIES
|
static java.lang.String |
ATTR_DEFAULT_ENCODING
|
static java.lang.String |
ATTR_DIGEST_ALGORITHM
|
static java.lang.String |
ATTR_DIGEST_CONTENT
|
static java.lang.String |
ATTR_FETCH_BANDWIDTH_MAX
|
static java.lang.String |
ATTR_HTTP_BIND_ADDRESS
|
static java.lang.String |
ATTR_HTTP_PROXY_HOST
|
static java.lang.String |
ATTR_HTTP_PROXY_PORT
|
static java.lang.String |
ATTR_IGNORE_COOKIES
|
static java.lang.String |
ATTR_LOAD_COOKIES
|
static java.lang.String |
ATTR_MAX_LENGTH_BYTES
|
static java.lang.String |
ATTR_MIDFETCH_DECIDE_RULES
Rules to apply mid-fetch, just after receipt of the response headers before we start to download body. |
static java.lang.String |
ATTR_SAVE_COOKIES
|
static java.lang.String |
ATTR_SEND_CONNECTION_CLOSE
|
static java.lang.String |
ATTR_SEND_IF_MODIFIED_SINCE
|
static java.lang.String |
ATTR_SEND_IF_NONE_MATCH
|
static java.lang.String |
ATTR_SEND_RANGE
|
static java.lang.String |
ATTR_SEND_REFERER
|
static java.lang.String |
ATTR_SOTIMEOUT_MS
|
static java.lang.String |
ATTR_TIMEOUT_SECONDS
|
static java.lang.String |
ATTR_TRUST
SSL trust level setting attribute name. |
protected com.sleepycat.je.Database |
cookieDb
Database backing cookie map, if using BDB |
static java.lang.String |
COOKIEDB_NAME
Name of cookie BDB Database |
static java.lang.String |
DEFAULT_DIGEST_ALGORITHM
Default algorithm to use for message digesting. |
(package private) static java.lang.Boolean |
DEFAULT_DIGEST_CONTENT
Default whether to perform on-the-fly digest hashing of content-bodies. |
static java.lang.String |
DESC_DIGEST_ALGORITHM
|
static java.lang.String |
DESC_DIGEST_CONTENT
|
static java.lang.String[] |
DIGEST_ALGORITHMS
|
static java.lang.String |
HTTP_SCHEME
|
static java.lang.String |
HTTPS_SCHEME
|
static java.lang.String |
MD5
|
static java.lang.String |
RANGE
|
static java.lang.String |
RANGE_PREFIX
|
static java.lang.String |
REFERER
|
(package private) static java.lang.String |
SERVER_CACHE_KEY
|
static java.lang.String |
SHA1
The digest algorithms to choose between: currently SHA-1 or MD5. |
(package private) static java.lang.String |
SSL_FACTORY_KEY
|
Fields inherited from class org.archive.crawler.framework.Processor |
---|
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules |
Fields inherited from class org.archive.crawler.settings.ComplexType |
---|
definition, definitionMap |
Constructor Summary | |
---|---|
FetchHTTP(java.lang.String name)
Constructor. |
Method Summary | |
---|---|
protected void |
addResponseContent(org.apache.commons.httpclient.HttpMethod method,
CrawlURI curi)
This method populates curi with response status and
content type. |
protected boolean |
checkMidfetchAbort(CrawlURI curi,
HttpRecorderMethod method,
org.apache.commons.httpclient.HttpConnection conn)
|
protected void |
cleanupHttp()
Perform any final cleanup related to the HttpClient instance. |
protected void |
configureHttp()
|
protected org.apache.commons.httpclient.HostConfiguration |
configureMethod(CrawlURI curi,
org.apache.commons.httpclient.HttpMethod method)
Configure the HttpMethod, setting options and headers. |
void |
crawlCheckpoint(java.io.File checkpointDir)
Called by CrawlController when checkpointing. |
void |
crawlEnded(java.lang.String sExitMessage)
Called when a CrawlController has ended a crawl and is about to exit. |
void |
crawlEnding(java.lang.String sExitMessage)
Called when a CrawlController is ending a crawl (for any reason) |
void |
crawlPaused(java.lang.String statusMessage)
Called when a CrawlController is actually paused (all threads are idle). |
void |
crawlPausing(java.lang.String statusMessage)
Called when a CrawlController is going to be paused. |
void |
crawlResuming(java.lang.String statusMessage)
Called when a CrawlController is resuming a crawl that had been paused. |
void |
crawlStarted(java.lang.String message)
Called on crawl start. |
protected void |
doAbort(CrawlURI curi,
org.apache.commons.httpclient.HttpMethod method,
java.lang.String annotation)
|
void |
finalTasks()
Classes subclassing this one should override this method to perform processor-specific actions. |
protected java.lang.Object |
getAttributeEither(CrawlURI curi,
java.lang.String key)
Get a value either from inside the CrawlURI instance, or from settings (module attributes). |
protected org.apache.commons.httpclient.auth.AuthScheme |
getAuthScheme(org.apache.commons.httpclient.HttpMethod method,
CrawlURI curi)
|
protected org.apache.commons.httpclient.HttpClient |
getHttp()
|
protected DecideRule |
getMidfetchRule(java.lang.Object o)
|
protected void |
handle401(org.apache.commons.httpclient.HttpMethod method,
CrawlURI curi)
Server is looking for basic/digest auth credentials (RFC2617). |
void |
initialTasks()
Classes subclassing this one should override this method to perform processor-specific actions. |
protected void |
innerProcess(CrawlURI curi)
Classes subclassing this one should override this method to perform their custom actions on the CrawlURI. |
protected void |
listUsedFiles(java.util.List<java.lang.String> list)
Those Modules that use files on disk should list them all when this method is called. |
void |
loadCookies()
Load cookies from the file specified in the order file. |
void |
loadCookies(java.lang.String cookiesFile)
Load cookies from a file before the first fetch. |
java.lang.String |
report()
Compiles and returns a report (in human readable form) about the status of the processor. |
void |
saveCookies()
Saves cookies to the file specified in the order file. |
void |
saveCookies(java.lang.String saveCookiesFile)
Saves cookies to a file. |
protected void |
setConditionalGetHeader(CrawlURI curi,
org.apache.commons.httpclient.HttpMethod method,
java.lang.String setting,
java.lang.String sourceHeader,
java.lang.String targetHeader)
Set the given conditional-GET header, if the setting is enabled and a suitable value is available in the URI history. |
protected void |
setSizes(CrawlURI curi,
HttpRecorder rec)
Update CrawlURI internal sizes based on current transaction (and in the case of 304s, history) |
Methods inherited from class org.archive.crawler.framework.Processor |
---|
checkForInterrupt, getController, getDecideRule, getDefaultNextProcessor, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn |
Methods inherited from class org.archive.crawler.settings.ModuleType |
---|
addElement |
Methods inherited from class org.archive.crawler.settings.Type |
---|
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient |
Methods inherited from class javax.management.Attribute |
---|
getName, hashCode |
Methods inherited from class java.lang.Object |
---|
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
public static final java.lang.String ATTR_HTTP_PROXY_HOST
public static final java.lang.String ATTR_HTTP_PROXY_PORT
public static final java.lang.String ATTR_TIMEOUT_SECONDS
public static final java.lang.String ATTR_SOTIMEOUT_MS
public static final java.lang.String ATTR_MAX_LENGTH_BYTES
public static final java.lang.String ATTR_LOAD_COOKIES
public static final java.lang.String ATTR_SAVE_COOKIES
public static final java.lang.String ATTR_ACCEPT_HEADERS
public static final java.lang.String ATTR_DEFAULT_ENCODING
public static final java.lang.String ATTR_DIGEST_CONTENT
public static final java.lang.String ATTR_DIGEST_ALGORITHM
public static final java.lang.String ATTR_FETCH_BANDWIDTH_MAX
public static final java.lang.String DESC_DIGEST_CONTENT
public static final java.lang.String DESC_DIGEST_ALGORITHM
public static final java.lang.String ATTR_TRUST
static java.lang.Boolean DEFAULT_DIGEST_CONTENT
public static final java.lang.String SHA1
public static final java.lang.String MD5
public static java.lang.String[] DIGEST_ALGORITHMS
public static final java.lang.String DEFAULT_DIGEST_ALGORITHM
public static final java.lang.String ATTR_MIDFETCH_DECIDE_RULES
public static final java.lang.String ATTR_SEND_CONNECTION_CLOSE
public static final java.lang.String ATTR_SEND_REFERER
public static final java.lang.String ATTR_SEND_RANGE
public static final java.lang.String ATTR_SEND_IF_MODIFIED_SINCE
public static final java.lang.String ATTR_SEND_IF_NONE_MATCH
public static final java.lang.String REFERER
public static final java.lang.String RANGE
public static final java.lang.String RANGE_PREFIX
public static final java.lang.String HTTP_SCHEME
public static final java.lang.String HTTPS_SCHEME
public static final java.lang.String ATTR_IGNORE_COOKIES
public static final java.lang.String ATTR_BDB_COOKIES
public static final java.lang.String ATTR_HTTP_BIND_ADDRESS
protected com.sleepycat.je.Database cookieDb
public static final java.lang.String COOKIEDB_NAME
static final java.lang.String SERVER_CACHE_KEY
static final java.lang.String SSL_FACTORY_KEY
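The digest-related fields above (ATTR_DIGEST_CONTENT, ATTR_DIGEST_ALGORITHM, SHA1, MD5) describe on-the-fly digest hashing of content bodies. The following is a minimal sketch of that idea using only the JDK, not the FetchHTTP implementation itself. The class name `DigestSketch`, the method `digestWhileReading`, and the use of the JCA algorithm names "SHA-1" and "MD5" are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Heritrix: hashes a body while it is being read.
final class DigestSketch {

    /** Reads the stream to its end and returns the hex digest of every byte seen. */
    static String digestWhileReading(InputStream body, String algorithm) {
        final MessageDigest md;
        try {
            md = MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown digest algorithm: " + algorithm, e);
        }
        // DigestInputStream updates the digest as a side effect of each read.
        try (DigestInputStream in = new DigestInputStream(body, md)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // A real fetcher would record or process the bytes here.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        byte[] body = "abc".getBytes(StandardCharsets.US_ASCII);
        // prints a9993e364706816aba3e25717850c26c9cd0d89d
        System.out.println(digestWhileReading(new ByteArrayInputStream(body), "SHA-1"));
        // prints 900150983cd24fb0d6963f7d28e17f72
        System.out.println(digestWhileReading(new ByteArrayInputStream(body), "MD5"));
    }
}
```

Hashing during the read avoids a second pass over the content, which matters when bodies are large or are streamed straight to disk.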
Constructor Detail |
---|
public FetchHTTP(java.lang.String name)
name - Name of this processor.
Method Detail |
---|
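protected void innerProcess(CrawlURI curi) throws java.lang.InterruptedException
Description copied from class: Processor
Classes subclassing this one should override this method to perform their custom actions on the CrawlURI.
Overrides: innerProcess in class Processor
Parameters:
curi - The CrawlURI being processed.
Throws: java.lang.InterruptedException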
protected void setSizes(CrawlURI curi, HttpRecorder rec)
Parameters:
curi - CrawlURI
rec - HttpRecorder
protected void doAbort(CrawlURI curi, org.apache.commons.httpclient.HttpMethod method, java.lang.String annotation)
protected boolean checkMidfetchAbort(CrawlURI curi, HttpRecorderMethod method, org.apache.commons.httpclient.HttpConnection conn)
protected DecideRule getMidfetchRule(java.lang.Object o)
protected void addResponseContent(org.apache.commons.httpclient.HttpMethod method, CrawlURI curi)
This method populates curi with response status and content type.
Parameters:
curi - CrawlURI to populate.
method - Method to get response status and headers from.
protected org.apache.commons.httpclient.HostConfiguration configureMethod(CrawlURI curi, org.apache.commons.httpclient.HttpMethod method)
Parameters:
curi - CrawlURI from which we pull configuration.
method - The Method to configure.
protected void setConditionalGetHeader(CrawlURI curi, org.apache.commons.httpclient.HttpMethod method, java.lang.String setting, java.lang.String sourceHeader, java.lang.String targetHeader)
Parameters:
curi - source CrawlURI
method - HTTP operation pending
setting - name of the true/false setting that enables this header
sourceHeader - header to consult in URI history
targetHeader - header to set if possible
protected java.lang.Object getAttributeEither(CrawlURI curi, java.lang.String key)
Parameters:
curi - CrawlURI to consult
key - key to look up
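The two helpers above can be sketched together: a value is looked up first on the URI and then in the settings, and a conditional-GET header is set only if its setting is enabled and the URI history holds a suitable value. This is an illustrative sketch with plain maps standing in for CrawlURI, the settings, and the HttpMethod. The class name `ConditionalGetSketch` and the setting keys `send-if-modified-since` and `send-if-none-match` are assumptions, not values taken from this page.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: maps replace CrawlURI data, module settings, and request headers.
final class ConditionalGetSketch {

    /** A value stored on the URI wins; otherwise fall back to the settings. */
    static Object getAttributeEither(Map<String, Object> uriData,
                                     Map<String, Object> settings,
                                     String key) {
        Object value = uriData.get(key);
        return value != null ? value : settings.get(key);
    }

    /** Copies history[sourceHeader] into request[targetHeader] if the setting is enabled. */
    static void setConditionalGetHeader(Map<String, Object> uriData,
                                        Map<String, Object> settings,
                                        Map<String, String> history,
                                        Map<String, String> request,
                                        String setting,
                                        String sourceHeader,
                                        String targetHeader) {
        if (!Boolean.TRUE.equals(getAttributeEither(uriData, settings, setting))) {
            return; // setting disabled or absent
        }
        String previous = history.get(sourceHeader);
        if (previous != null) {
            request.put(targetHeader, previous);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> uriData = new HashMap<>();
        Map<String, Object> settings = new HashMap<>();
        settings.put("send-if-modified-since", Boolean.TRUE); // assumed key
        settings.put("send-if-none-match", Boolean.FALSE);    // assumed key

        Map<String, String> history = new HashMap<>();
        history.put("Last-Modified", "Tue, 15 Nov 1994 12:45:26 GMT");
        history.put("ETag", "\"abc\"");

        Map<String, String> request = new HashMap<>();
        setConditionalGetHeader(uriData, settings, history, request,
                "send-if-modified-since", "Last-Modified", "If-Modified-Since");
        setConditionalGetHeader(uriData, settings, history, request,
                "send-if-none-match", "ETag", "If-None-Match");

        // prints {If-Modified-Since=Tue, 15 Nov 1994 12:45:26 GMT}
        System.out.println(request);
    }
}
```

Pairing Last-Modified with If-Modified-Since and ETag with If-None-Match follows standard HTTP conditional requests; a server that finds the resource unchanged can then answer 304 without resending the body.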
protected void handle401(org.apache.commons.httpclient.HttpMethod method, CrawlURI curi)
Parameters:
method - Method that got a 401.
curi - CrawlURI that got a 401.
protected org.apache.commons.httpclient.auth.AuthScheme getAuthScheme(org.apache.commons.httpclient.HttpMethod method, CrawlURI curi)
Parameters:
method - Method that got a 401.
curi - CrawlURI that got a 401.
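A 401 response carries a WWW-Authenticate challenge naming the scheme (basic or digest, per RFC 2617) and usually a realm. The sketch below shows, under simplifying assumptions, how a scheme and realm could be read from one such header value. It uses only JDK regular expressions rather than the HttpClient AuthScheme classes named in the signatures above, handles a single challenge per header, and the class name `AuthChallengeSketch` is hypothetical.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Heritrix or HttpClient.
final class AuthChallengeSketch {

    // The scheme is the first token of the header value.
    private static final Pattern SCHEME =
            Pattern.compile("^\\s*([A-Za-z][A-Za-z0-9-]*)");
    // The realm is a quoted-string parameter; the parameter name is case-insensitive.
    private static final Pattern REALM =
            Pattern.compile("realm\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the lower-cased scheme name, or null if none is found. */
    static String scheme(String headerValue) {
        Matcher m = SCHEME.matcher(headerValue);
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : null;
    }

    /** Returns the realm without its quotes, or null if none is found. */
    static String realm(String headerValue) {
        Matcher m = REALM.matcher(headerValue);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String challenge = "Digest realm=\"archive\", nonce=\"abc123\"";
        // prints digest / archive
        System.out.println(scheme(challenge) + " / " + realm(challenge));
    }
}
```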
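public void initialTasks()
Description copied from class: Processor
Classes subclassing this one should override this method to perform processor-specific actions.
This method is guaranteed to be called after the crawl is set up, but before any URI processing has occurred.
Overrides: initialTasks in class Processor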
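public void finalTasks()
Description copied from class: Processor
Classes subclassing this one should override this method to perform processor-specific actions.
Overrides: finalTasks in class Processor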
protected void cleanupHttp()
protected void configureHttp() throws java.lang.RuntimeException
Throws: java.lang.RuntimeException
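public void loadCookies(java.lang.String cookiesFile)
Load cookies from a file before the first fetch.
The file is a text file in Netscape's 'cookies.txt' format.
Example entry of a cookies.txt file:
www.archive.org FALSE / FALSE 1074567117 details-visit texts-cralond
Each line has 7 tab-separated fields: domain, domain-wide access flag (TRUE/FALSE), path, secure flag (TRUE/FALSE), expiration time (Unix seconds), name, and value.
Parameters:
cookiesFile - file in Netscape's 'cookies.txt' format.
public java.lang.String report()
Description copied from class: Processor
Compiles and returns a report (in human readable form) about the status of the processor.
Examples of stats declared would include:
* Number of CrawlURIs handled.
* Number of links extracted (for link extractors)
etc.
Overrides: report in class Processor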
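public void loadCookies()
Load cookies from the file specified in the order file.
The file is a text file in Netscape's 'cookies.txt' format.
Example entry of a cookies.txt file:
www.archive.org FALSE / FALSE 1074567117 details-visit texts-cralond
Each line has 7 tab-separated fields: domain, domain-wide access flag (TRUE/FALSE), path, secure flag (TRUE/FALSE), expiration time (Unix seconds), name, and value.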
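The cookies.txt layout described for both loadCookies variants can be illustrated with a small parser. This is a sketch under stated assumptions, not the FetchHTTP loading code: the types `CookieLine` and `CookiesTxtSketch` are hypothetical, the field order follows the common Netscape convention (domain, domain-wide flag, path, secure flag, expiration, name, value), and lines that are blank, start with `#`, or do not have exactly 7 fields are simply skipped.

```java
// Hypothetical value type for one cookies.txt entry (not a Heritrix class).
final class CookieLine {
    final String domain;
    final boolean domainWide;        // TRUE if other hosts in the domain may read the cookie
    final String path;
    final boolean secure;            // TRUE if the cookie requires a secure connection
    final long expiresEpochSeconds;  // Unix time in seconds
    final String name;
    final String value;

    CookieLine(String[] f) {
        this.domain = f[0];
        this.domainWide = Boolean.parseBoolean(f[1]); // case-insensitive "TRUE"
        this.path = f[2];
        this.secure = Boolean.parseBoolean(f[3]);
        this.expiresEpochSeconds = Long.parseLong(f[4]);
        this.name = f[5];
        this.value = f[6];
    }
}

final class CookiesTxtSketch {

    /** Parses one line; returns null for comments, blanks, and malformed lines. */
    static CookieLine parseLine(String line) {
        if (line.trim().isEmpty() || line.startsWith("#")) {
            return null;
        }
        String[] fields = line.split("\t", -1); // -1 keeps empty trailing values
        if (fields.length != 7) {
            return null;
        }
        try {
            return new CookieLine(fields);
        } catch (NumberFormatException e) {
            return null; // expiration was not a number
        }
    }

    public static void main(String[] args) {
        CookieLine c = parseLine(
                "www.archive.org\tFALSE\t/\tFALSE\t1074567117\tdetails-visit\ttexts-cralond");
        // prints www.archive.org details-visit=texts-cralond
        System.out.println(c.domain + " " + c.name + "=" + c.value);
    }
}
```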
public void saveCookies()
public void saveCookies(java.lang.String saveCookiesFile)
Parameters:
saveCookiesFile - output file.
protected void listUsedFiles(java.util.List<java.lang.String> list)
Description copied from class: ModuleType
Those Modules that use files on disk should list them all when this method is called.
Each file (as a string name with full path) should be added to the provided list.
Modules that do not use any files can safely ignore this method.
Overrides: listUsedFiles in class ModuleType
Parameters:
list - The list to add files to.
protected org.apache.commons.httpclient.HttpClient getHttp()
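public void crawlStarted(java.lang.String message)
Description copied from interface: CrawlStatusListener
Called on crawl start.
Specified by: crawlStarted in interface CrawlStatusListener
Parameters:
message - Start message.
public void crawlCheckpoint(java.io.File checkpointDir)
Description copied from interface: CrawlStatusListener
Called by CrawlController when checkpointing.
Specified by: crawlCheckpoint in interface CrawlStatusListener
Parameters:
checkpointDir - Checkpoint dir. Write checkpoint state here.
public void crawlEnding(java.lang.String sExitMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is ending a crawl (for any reason).
Specified by: crawlEnding in interface CrawlStatusListener
Parameters:
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also: CrawlJob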
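public void crawlEnded(java.lang.String sExitMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController has ended a crawl and is about to exit.
Specified by: crawlEnded in interface CrawlStatusListener
Parameters:
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also: CrawlJob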
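public void crawlPausing(java.lang.String statusMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is going to be paused.
Specified by: crawlPausing in interface CrawlStatusListener
Parameters:
statusMessage - Should be STATUS_WAITING_FOR_PAUSE. Passed for convenience.
public void crawlPaused(java.lang.String statusMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is actually paused (all threads are idle).
Specified by: crawlPaused in interface CrawlStatusListener
Parameters:
statusMessage - Should be CrawlJob.STATUS_PAUSED. Passed for convenience.
public void crawlResuming(java.lang.String statusMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is resuming a crawl that had been paused.
Specified by: crawlResuming in interface CrawlStatusListener
Parameters:
statusMessage - Should be CrawlJob.STATUS_RUNNING. Passed for convenience.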