org.archive.crawler.frontier
Class RecoveryJournal

java.lang.Object
  extended by org.archive.crawler.io.CrawlerJournal
      extended by org.archive.crawler.frontier.RecoveryJournal
All Implemented Interfaces:
FrontierJournal

public class RecoveryJournal
extends CrawlerJournal
implements FrontierJournal

Helper class for managing a simple journal of Frontier change events, useful for recovering from crawl problems. Replaying the journal into a new Frontier brings its state (at least with respect to URIs alreadyIncluded and in pending queues) in line with that of the original Frontier. This allows a pseudo-resume of a previous crawl, at least as far as URI visitation and coverage are concerned.
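
The replay idea can be illustrated with a minimal, self-contained sketch. It does not use the Heritrix classes: the tag strings (ADD, EMIT, SUCCESS, FAILURE, DISREGARD, RESCHEDULE), the ReplayState holder, and the one-event-per-line layout are all assumptions made for illustration, not the journal's actual format.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: tags and line layout are assumptions,
// not the real RecoveryJournal format.
class JournalReplaySketch {

    /** State rebuilt by replaying the journal. */
    static final class ReplayState {
        final Set<String> alreadyIncluded = new HashSet<>();
        final Deque<String> pending = new ArrayDeque<>();
    }

    /** Replays "TAG uri" lines, in order, into a fresh state. */
    static ReplayState replay(List<String> lines) {
        ReplayState state = new ReplayState();
        for (String line : lines) {
            int space = line.indexOf(' ');
            if (space < 0) {
                continue; // skip malformed lines
            }
            String tag = line.substring(0, space);
            String uri = line.substring(space + 1);
            switch (tag) {
                case "ADD":
                    // A URI is queued only the first time it is seen.
                    if (state.alreadyIncluded.add(uri)) {
                        state.pending.addLast(uri);
                    }
                    break;
                case "SUCCESS":
                case "FAILURE":
                case "DISREGARD":
                    // Finished URIs stay included but leave the queue.
                    state.alreadyIncluded.add(uri);
                    state.pending.remove(uri);
                    break;
                default:
                    // EMIT and RESCHEDULE leave the URI pending in this sketch.
                    break;
            }
        }
        return state;
    }

    public static void main(String[] args) {
        ReplayState s = replay(List.of(
                "ADD http://example.com/a",
                "ADD http://example.com/b",
                "EMIT http://example.com/a",
                "SUCCESS http://example.com/a"));
        System.out.println(s.alreadyIncluded.size() + " included, pending=" + s.pending);
    }
}
```

In this sketch every finished URI is treated as included regardless of outcome; a real recovery may treat failures differently.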

Author:
gojomo

Field Summary
static java.lang.String F_ADD
           
static java.lang.String F_DISREGARD
           
static java.lang.String F_EMIT
           
static java.lang.String F_FAILURE
           
static java.lang.String F_RESCHEDULE
           
static java.lang.String F_SUCCESS
           
 
Fields inherited from class org.archive.crawler.io.CrawlerJournal
accumulatingBuffer, GZIP_SUFFIX, gzipFile, lines, LOG_ERROR, LOG_TIMESTAMP, out, timestamp_interval
 
Fields inherited from interface org.archive.crawler.frontier.FrontierJournal
LOGNAME_RECOVER
 
Constructor Summary
RecoveryJournal(java.lang.String path, java.lang.String filename)
          Create a new recovery journal at the given location.
 
Method Summary
 void added(CandidateURI curi)
           
 void emitted(CandidateURI curi)
          Note that a CrawlURI was emitted for processing.
 void finishedDisregard(CandidateURI curi)
           
 void finishedFailure(CandidateURI curi)
           
 void finishedSuccess(CandidateURI curi)
           
static void importRecoverLog(java.io.File source, CrawlController controller, boolean retainFailures)
          Utility method for scanning a recovery journal and applying it to a Frontier.
 void rescheduled(CandidateURI curi)
           
 void writeLongUriLine(java.lang.String tag, CandidateURI curi)
           
 
Methods inherited from class org.archive.crawler.io.CrawlerJournal
checkpoint, close, considerTimestamp, getBufferedInput, getBufferedReader, getBufferedReader, initialize, noteLine, seriousError, writeLine, writeLine, writeLine, writeLine
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.archive.crawler.frontier.FrontierJournal
checkpoint, close, seriousError
 

Field Detail

F_ADD

public static final java.lang.String F_ADD
See Also:
Constant Field Values

F_EMIT

public static final java.lang.String F_EMIT
See Also:
Constant Field Values

F_DISREGARD

public static final java.lang.String F_DISREGARD
See Also:
Constant Field Values

F_RESCHEDULE

public static final java.lang.String F_RESCHEDULE
See Also:
Constant Field Values

F_SUCCESS

public static final java.lang.String F_SUCCESS
See Also:
Constant Field Values

F_FAILURE

public static final java.lang.String F_FAILURE
See Also:
Constant Field Values
Constructor Detail

RecoveryJournal

public RecoveryJournal(java.lang.String path,
                       java.lang.String filename)
                throws java.io.IOException
Create a new recovery journal at the given location.

Parameters:
path - Directory to make the recovery journal in.
filename - Name to use for recovery journal file.
Throws:
java.io.IOException
Method Detail

added

public void added(CandidateURI curi)
Specified by:
added in interface FrontierJournal
Parameters:
curi - CrawlURI that has been scheduled to be added to the Frontier.

writeLongUriLine

public void writeLongUriLine(java.lang.String tag,
                             CandidateURI curi)

finishedSuccess

public void finishedSuccess(CandidateURI curi)
Specified by:
finishedSuccess in interface FrontierJournal
Parameters:
curi - CrawlURI that finished successfully.

emitted

public void emitted(CandidateURI curi)
Description copied from interface: FrontierJournal
Note that a CrawlURI was emitted for processing. If not followed by a finished or rescheduled notation in the journal, the CrawlURI was still in-process when the journal ended.

Specified by:
emitted in interface FrontierJournal
Parameters:
curi - CrawlURI emitted.
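
The rule described above (an emitted URI with no later finished or rescheduled notation was still in-process when the journal ended) can be sketched as a single pass over the journal lines. This is a hedged illustration, not the Heritrix implementation: the tag strings and the "TAG uri" line layout are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: tags and line layout are assumptions.
class InProcessSketch {

    /** Returns URIs that were emitted but never finished or rescheduled. */
    static Set<String> inProcessAtEnd(List<String> lines) {
        // LinkedHashSet keeps insertion order of the URIs still open.
        Set<String> inProcess = new LinkedHashSet<>();
        for (String line : lines) {
            int space = line.indexOf(' ');
            if (space < 0) {
                continue; // skip malformed lines
            }
            String tag = line.substring(0, space);
            String uri = line.substring(space + 1);
            if (tag.equals("EMIT")) {
                inProcess.add(uri);
            } else if (tag.equals("SUCCESS") || tag.equals("FAILURE")
                    || tag.equals("DISREGARD") || tag.equals("RESCHEDULE")) {
                // Any finished or rescheduled notation closes the emit.
                inProcess.remove(uri);
            }
        }
        return inProcess;
    }

    public static void main(String[] args) {
        System.out.println(inProcessAtEnd(List.of(
                "EMIT http://example.com/a",
                "EMIT http://example.com/b",
                "SUCCESS http://example.com/a")));
    }
}
```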

finishedDisregard

public void finishedDisregard(CandidateURI curi)
Specified by:
finishedDisregard in interface FrontierJournal
Parameters:
curi - CrawlURI that finished as disregarded (an uncounted failure).

finishedFailure

public void finishedFailure(CandidateURI curi)
Specified by:
finishedFailure in interface FrontierJournal
Parameters:
curi - CrawlURI finished unsuccessfully.

rescheduled

public void rescheduled(CandidateURI curi)
Specified by:
rescheduled in interface FrontierJournal
Parameters:
curi - CrawlURI that was returned to the Frontier for another try.

importRecoverLog

public static void importRecoverLog(java.io.File source,
                                    CrawlController controller,
                                    boolean retainFailures)
                             throws java.io.IOException
Utility method for scanning a recovery journal and applying it to a Frontier.

Parameters:
source - Recover log path.
controller - CrawlController for the crawl whose Frontier receives the journal entries.
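retainFailures - If true, failures recorded in the log count as already included; if false, they are ignored and the corresponding URIs are retried in the recovered crawl.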
Throws:
java.io.IOException
See Also:
Frontier.importRecoverLog(String, boolean)
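
The scanning half of this utility can be sketched with the standard library alone. The sketch below assumes the journal is line-oriented text that is gzip-compressed when its file name ends in ".gz" (suggested by the inherited GZIP_SUFFIX field, but the exact suffix and the UTF-8 encoding are assumptions). The JournalScanSketch class and readLines method are hypothetical names, and applying the lines to a Frontier is left out.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical illustration only: not the Heritrix implementation.
class JournalScanSketch {

    /** Reads all non-empty lines from a plain or gzip-compressed journal. */
    static List<String> readLines(File source) throws IOException {
        InputStream in = new FileInputStream(source);
        try {
            // Assumption: a ".gz" suffix marks a gzip-compressed journal.
            if (source.getName().endsWith(".gz")) {
                in = new GZIPInputStream(in);
            }
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8));
            List<String> lines = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    lines.add(line);
                }
            }
            return lines;
        } finally {
            // Closing the outermost stream also closes the file stream.
            in.close();
        }
    }
}
```

A caller would pass the recover log as a java.io.File and hand each returned line to whatever replay logic it uses.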


Copyright © 2003-2011 Internet Archive. All Rights Reserved.