org.archive.crawler.fetcher
Class FetchFTP

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.fetcher.FetchFTP
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants, FetchStatusCodes

public class FetchFTP
extends Processor
implements CoreAttributeConstants, FetchStatusCodes

Fetches documents and directory listings using FTP. This class also tries to extract FTP "links" from directory listings. To archive a directory listing, the remote FTP server must support the NLST command, which most modern FTP servers do.

Author:
pjack
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
static java.lang.String ATTR_BANDWIDTH
          The name for the fetch-bandwidth attribute.
static java.lang.String ATTR_MAX_LENGTH
          The name for the max-length-bytes attribute.
static java.lang.String ATTR_PASSWORD
          The name for the password attribute.
static java.lang.String ATTR_TIMEOUT
          The name for the timeout-seconds attribute.
static java.lang.String ATTR_USERNAME
          The name for the username attribute.
 
Fields inherited from class org.archive.crawler.framework.Processor
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
 
Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE
 
Constructor Summary
FetchFTP(java.lang.String name)
          Constructs a new FetchFTP.
 
Method Summary
 boolean getExtractFromDirs(CrawlURI curi)
          Returns the extract-from-dirs attribute for this FetchFTP and the given curi.
 boolean getExtractParent(CrawlURI curi)
          Returns the extract-parent attribute for this FetchFTP and the given curi.
 int getFetchBandwidth(CrawlURI curi)
          Returns the fetch-bandwidth attribute for this FetchFTP and the given curi.
 long getMaxLength(CrawlURI curi)
          Returns the max-length-bytes attribute for this FetchFTP and the given curi.
 int getTimeout(CrawlURI curi)
          Returns the timeout-seconds attribute for this FetchFTP and the given curi.
 void innerProcess(CrawlURI curi)
          Processes the given URI.
 
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

ATTR_USERNAME

public static final java.lang.String ATTR_USERNAME
The name for the username attribute.

See Also:
Constant Field Values

ATTR_PASSWORD

public static final java.lang.String ATTR_PASSWORD
The name for the password attribute.

See Also:
Constant Field Values

ATTR_MAX_LENGTH

public static final java.lang.String ATTR_MAX_LENGTH
The name for the max-length-bytes attribute.

See Also:
Constant Field Values

ATTR_BANDWIDTH

public static final java.lang.String ATTR_BANDWIDTH
The name for the fetch-bandwidth attribute.

See Also:
Constant Field Values

ATTR_TIMEOUT

public static final java.lang.String ATTR_TIMEOUT
The name for the timeout-seconds attribute.

See Also:
Constant Field Values

Constructor Detail

FetchFTP

public FetchFTP(java.lang.String name)
Constructs a new FetchFTP.

Parameters:
name - the name of this processor

Method Detail

innerProcess

public void innerProcess(CrawlURI curi)
                  throws java.lang.InterruptedException
Processes the given URI. If the given URI is not an FTP URI, then this method does nothing. Otherwise an attempt is made to connect to the FTP server.

If the connection is successful, an attempt will be made to CD to the path specified in the URI. If the remote CD command succeeds, then it is assumed that the URI represents a directory. If the CD command fails, then it is assumed that the URI represents a file.

For directories, the directory listing will be fetched using the FTP NLST command and saved to the HttpRecorder. If the extract-from-dirs attribute is set to true, the entries in the fetched listing will be added to the curi as extracted FTP links. (It was easier to do that here than to write a separate FTPExtractor.)

For files, the file will be fetched using the FTP RETR command, and saved to the HttpRecorder.

All transfers, including directory listings, use binary mode. Local passive mode is always used so that transfers work through firewalls.

Overrides:
innerProcess in class Processor
Parameters:
curi - the curi to process
Throws:
java.lang.InterruptedException - if the thread is interrupted during processing
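
The directory-or-file decision described above can be sketched as follows. This is only an illustration under stated assumptions, not the FetchFTP source: `FtpFetchSketch`, `FtpSession`, `Plan`, and `FakeSession` are hypothetical names, and the in-memory fake replaces a real FTP connection. The choice of "." as the listing argument assumes the earlier CD already moved into the directory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class FtpFetchSketch {

    /** Hypothetical stand-in for an FTP client; not a Heritrix type. */
    interface FtpSession {
        /** Returns true if the server accepted the change-directory command. */
        boolean changeWorkingDirectory(String path);
        void enterLocalPassiveMode();
        void setBinary();
    }

    /** The command chosen for one URI: a name listing or a file retrieval. */
    static final class Plan {
        final String command;
        final String argument;

        private Plan(String command, String argument) {
            this.command = command;
            this.argument = argument;
        }
    }

    /** If CD succeeds, treat the path as a directory; otherwise treat it as a file. */
    static Plan plan(FtpSession session, String path) {
        boolean isDirectory = session.changeWorkingDirectory(path);
        session.enterLocalPassiveMode(); // passive mode, to cooperate with firewalls
        session.setBinary();             // binary transfer for listings and files alike
        return isDirectory
                ? new Plan("NLST", ".")  // assumption: list the directory we just entered
                : new Plan("RETR", path);
    }

    /** In-memory fake used only to exercise the sketch. */
    static final class FakeSession implements FtpSession {
        private final Set<String> directories;
        final List<String> log = new ArrayList<>();

        FakeSession(String... directories) {
            this.directories = new HashSet<>(Arrays.asList(directories));
        }

        @Override public boolean changeWorkingDirectory(String path) {
            log.add("CWD " + path);
            return directories.contains(path);
        }

        @Override public void enterLocalPassiveMode() { log.add("passive"); }

        @Override public void setBinary() { log.add("binary"); }
    }

    public static void main(String[] args) {
        FakeSession session = new FakeSession("/pub");
        Plan dir = plan(session, "/pub");
        Plan file = plan(session, "/pub/readme.txt");
        System.out.println(dir.command + " " + dir.argument);
        System.out.println(file.command + " " + file.argument);
    }
}
```

Using the outcome of CD as the directory test avoids parsing server-specific listing formats, at the cost of one extra round trip per URI.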

getExtractFromDirs

public boolean getExtractFromDirs(CrawlURI curi)
Returns the extract-from-dirs attribute for this FetchFTP and the given curi.

Parameters:
curi - the curi whose attribute to return
Returns:
that curi's extract-from-dirs
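
As a rough illustration of what extracting links from a directory listing could involve, the sketch below turns a name listing (one entry per line) into child FTP URIs. `FtpListingLinks` and `extract` are hypothetical names, not part of FetchFTP. The sketch assumes the server returns bare entry names and does no URI escaping of unusual file names.

```java
import java.util.ArrayList;
import java.util.List;

final class FtpListingLinks {

    /**
     * Builds one child URI per non-empty listing line.
     * Assumption: each line holds a bare entry name relative to directoryUri.
     */
    static List<String> extract(String directoryUri, String listing) {
        String base = directoryUri.endsWith("/") ? directoryUri : directoryUri + "/";
        List<String> links = new ArrayList<>();
        for (String line : listing.split("\r?\n")) {
            String name = line.trim();
            // Skip blank lines and the self/parent entries some servers include.
            if (name.isEmpty() || name.equals(".") || name.equals("..")) {
                continue;
            }
            links.add(base + name);
        }
        return links;
    }

    public static void main(String[] args) {
        String listing = "README\r\nsub\r\n..\r\n";
        for (String link : extract("ftp://example.org/pub", listing)) {
            System.out.println(link);
        }
    }
}
```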

getExtractParent

public boolean getExtractParent(CrawlURI curi)
Returns the extract-parent attribute for this FetchFTP and the given curi.

Parameters:
curi - the curi whose attribute to return
Returns:
that curi's extract-parent
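
This page does not say what the extract-parent attribute does. Assuming, from its name only, that it causes the parent directory of a fetched FTP URI to be added as a link, the parent could be derived as in the sketch below. `FtpParentLink` and `parentOf` are hypothetical names, not part of FetchFTP.

```java
import java.net.URI;
import java.util.Optional;

final class FtpParentLink {

    /**
     * Returns the URI of the containing directory, or empty for the server root.
     * Assumption: the input is a hierarchical URI with an authority, such as ftp://host/path.
     */
    static Optional<String> parentOf(String uri) {
        URI parsed = URI.create(uri);
        String path = parsed.getRawPath();
        if (path == null || path.isEmpty() || path.equals("/")) {
            return Optional.empty(); // the root has no parent
        }
        // Drop one trailing slash so that a directory URI resolves to its own parent.
        String trimmed = path.endsWith("/") ? path.substring(0, path.length() - 1) : path;
        String parentPath = trimmed.substring(0, trimmed.lastIndexOf('/') + 1);
        return Optional.of(parsed.getScheme() + "://" + parsed.getRawAuthority() + parentPath);
    }

    public static void main(String[] args) {
        System.out.println(parentOf("ftp://example.org/pub/file.txt").orElse("(none)"));
        System.out.println(parentOf("ftp://example.org/pub/sub/").orElse("(none)"));
        System.out.println(parentOf("ftp://example.org/").orElse("(none)"));
    }
}
```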

getTimeout

public int getTimeout(CrawlURI curi)
Returns the timeout-seconds attribute for this FetchFTP and the given curi.

Parameters:
curi - the curi whose attribute to return
Returns:
that curi's timeout-seconds

getMaxLength

public long getMaxLength(CrawlURI curi)
Returns the max-length-bytes attribute for this FetchFTP and the given curi.

Parameters:
curi - the curi whose attribute to return
Returns:
that curi's max-length-bytes
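
A byte limit such as max-length-bytes is typically enforced while copying the transfer stream. The sketch below shows one way that might be done. `CappedCopy` and its methods are hypothetical names, and treating a limit of 0 as "no limit" is an assumption made here, not something this page states.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

final class CappedCopy {

    /**
     * Copies at most maxLength bytes from in to out and returns the number copied.
     * Assumption: maxLength <= 0 means "no limit".
     */
    static long copy(InputStream in, OutputStream out, long maxLength) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        while (true) {
            int want = buffer.length;
            if (maxLength > 0) {
                long remaining = maxLength - total;
                if (remaining <= 0) {
                    break; // limit reached: the rest of the stream is left unread
                }
                want = (int) Math.min(buffer.length, remaining);
            }
            int read = in.read(buffer, 0, want);
            if (read < 0) {
                break; // end of stream
            }
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    /** Convenience wrapper for in-memory data, used by the demo below. */
    static byte[] capBytes(byte[] data, long maxLength) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            copy(new ByteArrayInputStream(data), out, maxLength);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = new byte[10];
        System.out.println(capBytes(data, 4).length); // 4
        System.out.println(capBytes(data, 0).length); // 10
    }
}
```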

getFetchBandwidth

public int getFetchBandwidth(CrawlURI curi)
Returns the fetch-bandwidth attribute for this FetchFTP and the given curi.

Parameters:
curi - the curi whose attribute to return
Returns:
that curi's fetch-bandwidth


Copyright © 2003-2011 Internet Archive. All Rights Reserved.