org.archive.crawler.processor.recrawl
Class PersistProcessor

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.processor.recrawl.PersistProcessor
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean
Direct Known Subclasses:
PersistLogProcessor, PersistOnlineProcessor

public abstract class PersistProcessor
extends Processor

Superclass for Processors that use BDB-JE to persist URI state (most notably history). Also includes many static utility methods, including a main().

Author:
gojomo
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
static java.lang.String URI_HISTORY_DBNAME
          name of history Database
 
Fields inherited from class org.archive.crawler.framework.Processor
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Constructor Summary
PersistProcessor(java.lang.String name, java.lang.String string)
          Usual constructor
 
Method Summary
static int copyPersistSourceToHistoryMap(java.io.File context, java.lang.String sourcePath, com.sleepycat.collections.StoredSortedMap<java.lang.String,st.ata.util.AList> historyMap)
          Populates a given StoredSortedMap (history map) from an old environment db or a persist log.
protected static com.sleepycat.je.DatabaseConfig historyDatabaseConfig()
           
static void main(java.lang.String[] args)
          Utility main for importing a log into a BDB-JE environment or moving a database between environments (2 arguments), or simply dumping a log to stderr in a more readable format (1 argument).
 java.lang.String persistKeyFor(CrawlURI curi)
          Return a preferred String key for persisting the given CrawlURI's AList state.
static int populatePersistEnv(java.lang.String sourcePath, java.io.File envFile)
          Populates a new environment db from an old environment db or a persist log.
static EnhancedEnvironment setupCopyEnvironment(java.io.File env)
           
static EnhancedEnvironment setupCopyEnvironment(java.io.File env, boolean readOnly)
           
protected  boolean shouldLoad(CrawlURI curi)
          Whether the current CrawlURI's state should be loaded.
protected  boolean shouldStore(CrawlURI curi)
          Whether the current CrawlURI's state should be persisted (to log or direct to database).
 
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, initialTasks, innerProcess, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

URI_HISTORY_DBNAME

public static final java.lang.String URI_HISTORY_DBNAME
name of history Database

See Also:
Constant Field Values
Constructor Detail

PersistProcessor

public PersistProcessor(java.lang.String name,
                        java.lang.String string)
Usual constructor

Parameters:
name -
string -
Method Detail

historyDatabaseConfig

protected static com.sleepycat.je.DatabaseConfig historyDatabaseConfig()
Returns:
DatabaseConfig for history Database

persistKeyFor

public java.lang.String persistKeyFor(CrawlURI curi)
Return a preferred String key for persisting the given CrawlURI's AList state.

Parameters:
curi - CrawlURI
Returns:
String key

shouldStore

protected boolean shouldStore(CrawlURI curi)
Whether the current CrawlURI's state should be persisted (to log or direct to database).

Parameters:
curi - CrawlURI
Returns:
true if any records were written to a WARC for this URL, or if the content matches a previous fetch that was written

shouldLoad

protected boolean shouldLoad(CrawlURI curi)
Whether the current CrawlURI's state should be loaded.

Parameters:
curi - CrawlURI
Returns:
true if state should be loaded; false to skip loading
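
Example (illustrative only):
The shouldStore and shouldLoad hooks let a subclass decide, per URI, whether state is persisted or loaded. The sketch below shows that hook pattern with stand-in types so it runs without Heritrix on the classpath. Uri, BasePersist, and SuccessOnlyPersist are hypothetical names, not Heritrix classes, and the "store only successful fetches" policy is an assumption chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class PersistHooksSketch {
    /** Hypothetical stand-in for CrawlURI: a key plus a success flag. */
    static final class Uri {
        final String key;
        final boolean success;

        Uri(String key, boolean success) {
            this.key = key;
            this.success = success;
        }
    }

    /** Hypothetical base class mirroring the shouldStore/shouldLoad hooks. */
    static abstract class BasePersist {
        final Map<String, String> store = new HashMap<>();

        protected boolean shouldStore(Uri uri) { return true; }

        protected boolean shouldLoad(Uri uri) { return true; }

        /** Persists state only when the shouldStore hook approves. */
        void maybeStore(Uri uri, String state) {
            if (shouldStore(uri)) {
                store.put(uri.key, state);
            }
        }

        /** Returns stored state, or null when loading is skipped or nothing is stored. */
        String maybeLoad(Uri uri) {
            return shouldLoad(uri) ? store.get(uri.key) : null;
        }
    }

    /** Assumed example policy: persist only URIs whose fetch succeeded. */
    static final class SuccessOnlyPersist extends BasePersist {
        @Override
        protected boolean shouldStore(Uri uri) {
            return uri.success;
        }
    }

    public static void main(String[] args) {
        BasePersist p = new SuccessOnlyPersist();
        p.maybeStore(new Uri("a", true), "state-a");
        p.maybeStore(new Uri("b", false), "state-b"); // skipped by the hook
        System.out.println(p.store.keySet());
    }
}
```

A real subclass would override the protected methods of PersistProcessor itself and receive a CrawlURI rather than the stand-in type used here.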

populatePersistEnv

public static int populatePersistEnv(java.lang.String sourcePath,
                                     java.io.File envFile)
                              throws com.sleepycat.je.DatabaseException,
                                     java.io.IOException
Populates a new environment db from an old environment db or a persist log. If the path to the new environment is not provided (envFile is null), only logs the entries that would have been populated.

Parameters:
sourcePath - source of old entries: can be a path to an existing environment db, or a URL or path to a persist log
envFile - path to new environment db (or null for a dry run)
Returns:
number of records
Throws:
com.sleepycat.je.DatabaseException
java.io.IOException
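
Example (illustrative only):
The sourcePath parameter accepts three kinds of input: an environment db directory, a persist log file, or a persist log URL. This page does not document how the method tells them apart, so the sketch below is only one plausible approach. It assumes that an existing directory is an environment, an existing regular file is a log, and anything else is treated as a URL. SourceKindSketch, Kind, and classify are hypothetical names.

```java
import java.io.File;

class SourceKindSketch {
    /** The three kinds of source described for sourcePath. */
    enum Kind { ENVIRONMENT_DIR, LOG_FILE, LOG_URL }

    /**
     * Classifies a source string. Assumption: local paths are checked first,
     * and a string that is neither a directory nor a file is taken to be a URL.
     */
    static Kind classify(String sourcePath) {
        File f = new File(sourcePath);
        if (f.isDirectory()) {
            return Kind.ENVIRONMENT_DIR;
        }
        if (f.isFile()) {
            return Kind.LOG_FILE;
        }
        return Kind.LOG_URL;
    }

    public static void main(String[] args) {
        // The temp directory exists on any JVM, so it stands in for an environment dir.
        System.out.println(classify(System.getProperty("java.io.tmpdir")));
        System.out.println(classify("http://example.com/persist.log"));
    }
}
```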

copyPersistSourceToHistoryMap

public static int copyPersistSourceToHistoryMap(java.io.File context,
                                                java.lang.String sourcePath,
                                                com.sleepycat.collections.StoredSortedMap<java.lang.String,st.ata.util.AList> historyMap)
                                         throws com.sleepycat.je.DatabaseException,
                                                java.io.IOException,
                                                java.net.MalformedURLException,
                                                java.io.UnsupportedEncodingException
Populates a given StoredSortedMap (history map) from an old environment db or a persist log. If a map is not provided, only logs the entries that would have been populated.

Parameters:
sourcePath - source of old entries: can be a path to an existing environment db, or a URL or path to a persist log
historyMap - map to populate (or null for a dry run)
Returns:
number of records
Throws:
com.sleepycat.je.DatabaseException
java.io.IOException
java.net.MalformedURLException
java.io.UnsupportedEncodingException
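
Example (illustrative only):
The "null map means dry run" contract described above can be sketched with plain JDK collections. This is a minimal stand-in, not the Heritrix implementation: DryRunCopySketch and copyEntries are hypothetical names, a java.util.SortedMap replaces the StoredSortedMap, String values replace AList, and it is an assumption that the returned count covers every entry read, in both modes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class DryRunCopySketch {
    /**
     * Copies entries from source into dest and returns the number of entries read.
     * A null dest means dry run: each entry is only written to stderr.
     */
    static int copyEntries(Map<String, String> source, SortedMap<String, String> dest) {
        int count = 0;
        for (Map.Entry<String, String> e : source.entrySet()) {
            if (dest != null) {
                dest.put(e.getKey(), e.getValue());
            } else {
                System.err.println(e.getKey() + " " + e.getValue());
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, String> source = new LinkedHashMap<>();
        source.put("http://example.com/b", "history-b");
        source.put("http://example.com/a", "history-a");

        int dryRun = copyEntries(source, null); // nothing is stored
        SortedMap<String, String> history = new TreeMap<>();
        int copied = copyEntries(source, history); // entries end up in key order

        System.out.println(dryRun + " " + copied + " " + history.firstKey());
    }
}
```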

main

public static void main(java.lang.String[] args)
                 throws com.sleepycat.je.DatabaseException,
                        java.io.IOException
Utility main for importing a log into a BDB-JE environment or moving a database between environments (2 arguments), or simply dumping a log to stderr in a more readable format (1 argument).

Parameters:
args - command-line arguments
Throws:
com.sleepycat.je.DatabaseException
java.io.IOException
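
Example (illustrative only):
The description above distinguishes two modes by argument count. The sketch below shows that dispatch in isolation. PersistMainSketch and modeFor are hypothetical names, and the argument order (source first, target environment second) is an assumption that this page does not confirm.

```java
class PersistMainSketch {
    /** Describes which mode the given argument count selects. */
    static String modeFor(String[] args) {
        switch (args.length) {
            case 1:
                // One argument: dump the log in a more readable form.
                return "dump " + args[0] + " to stderr";
            case 2:
                // Two arguments: import a log or move a database (order assumed).
                return "copy " + args[0] + " into environment " + args[1];
            default:
                return "usage: <source> [<targetEnvDir>]";
        }
    }

    public static void main(String[] args) {
        System.out.println(modeFor(args));
    }
}
```

The one-argument mode matches the dry-run behaviour of populatePersistEnv when envFile is null.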

setupCopyEnvironment

public static EnhancedEnvironment setupCopyEnvironment(java.io.File env)
                                                throws com.sleepycat.je.DatabaseException
Throws:
com.sleepycat.je.DatabaseException

setupCopyEnvironment

public static EnhancedEnvironment setupCopyEnvironment(java.io.File env,
                                                       boolean readOnly)
                                                throws com.sleepycat.je.DatabaseException
Throws:
com.sleepycat.je.DatabaseException


Copyright © 2003-2011 Internet Archive. All Rights Reserved.