org.archive.crawler.util
Class BdbUriUniqFilter

java.lang.Object
  extended by org.archive.crawler.util.SetBasedUriUniqFilter
      extended by org.archive.crawler.util.BdbUriUniqFilter
All Implemented Interfaces:
java.io.Serializable, UriUniqFilter

public class BdbUriUniqFilter
extends SetBasedUriUniqFilter
implements java.io.Serializable

A Berkeley DB (BDB) implementation of an AlreadySeen list. It performs adequately without exhausting the heap. See AlreadySeen.

Builds keys so that URIs from the same server sort close to each other. Mercator and section 2.3.5, 'Eliminating Already-Visited URLs', of 'Mining the Web' by Soumen Chakrabarti describe a two-level key whose first 24 bits are a hash of the host plus port and whose last 40 bits are a hash of the path. Testing showed that adopting such a scheme halved lookup times. This implementation actually uses scheme + host for the first 24 bits and path + query for the trailing 40 bits.
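
The two-level key layout described above can be illustrated with a minimal sketch. It is not the code of this class: the class name TwoLevelKeySketch, the method twoLevelKey, and the FNV-1a-style hash are assumptions chosen so the example is self-contained, and the hash function BdbUriUniqFilter really uses is not stated on this page.

```java
// Hypothetical illustration of a 24-bit + 40-bit two-level key.
// The hash (FNV-1a style over chars) is an assumption, not the one used by BdbUriUniqFilter.
final class TwoLevelKeySketch {
    private static final long FNV_OFFSET = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    // Simple 64-bit FNV-1a-style hash over the characters of s.
    static long hash64(CharSequence s) {
        long h = FNV_OFFSET;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= FNV_PRIME;
        }
        return h;
    }

    // Top 24 bits come from scheme + host, low 40 bits from path + query.
    static long twoLevelKey(CharSequence uri) {
        String url = uri.toString();
        int schemeEnd = url.indexOf("://");
        int pathStart = (schemeEnd < 0) ? -1 : url.indexOf('/', schemeEnd + 3);
        String hostPart = (pathStart < 0) ? url : url.substring(0, pathStart);
        String pathPart = (pathStart < 0) ? "" : url.substring(pathStart);
        long high = hash64(hostPart) & 0xFFFFFF0000000000L; // keep the top 24 bits
        long low = hash64(pathPart) >>> 24;                 // 40 bits in the low positions
        return high | low;
    }

    public static void main(String[] args) {
        // Keys for the same host share their leading 24 bits (first 6 hex digits).
        System.out.println(String.format("%016x", twoLevelKey("http://example.com/a")));
        System.out.println(String.format("%016x", twoLevelKey("http://example.com/b?x=1")));
    }
}
```

Because keys for one host share the same leading bits, they sort next to each other in an ordered store, which is one plausible reason lookups for a host's URIs would touch fewer pages.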

Version:
$Date: 2007-02-21 10:18:39 +0000 (Wed, 21 Feb 2007) $, $Revision: 4927 $
Author:
stack
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from interface org.archive.crawler.datamodel.UriUniqFilter
UriUniqFilter.HasUriReceiver
 
Field Summary
protected  com.sleepycat.je.Database alreadySeen
           
protected  long count
           
protected  boolean createdEnvironment
           
protected  long lastCacheMiss
           
protected  long lastCacheMissDiff
           
protected static com.sleepycat.je.DatabaseEntry ZERO_LENGTH_ENTRY
           
 
Fields inherited from class org.archive.crawler.util.SetBasedUriUniqFilter
duplicateCount, duplicatesAtLastSample, profileLog, receiver
 
Constructor Summary
protected BdbUriUniqFilter()
          Default constructor, made protected to prevent general use.
  BdbUriUniqFilter(com.sleepycat.je.Environment environment)
          Constructor.
  BdbUriUniqFilter(java.io.File bdbEnv)
          Constructor.
  BdbUriUniqFilter(java.io.File bdbEnv, int cacheSizePercentage)
          Constructor.
 
Method Summary
 void close()
          Close down any allocated resources.
static long createKey(java.lang.CharSequence uri)
          Create fingerprint.
 long flush()
           
 long getCacheMisses()
           
protected  com.sleepycat.je.DatabaseConfig getDatabaseConfig()
           
 long getLastCacheMissDiff()
           
protected  void initialize(com.sleepycat.je.Environment env)
          Method shared by constructors.
protected  void open(com.sleepycat.je.Environment env, com.sleepycat.je.DatabaseConfig dbConfig)
           
 void reopen(com.sleepycat.je.Environment env)
          Call after deserializing an instance of this class.
protected  boolean setAdd(java.lang.CharSequence uri)
           
protected  long setCount()
           
protected  boolean setRemove(java.lang.CharSequence uri)
           
 
Methods inherited from class org.archive.crawler.util.SetBasedUriUniqFilter
add, addForce, addNow, count, forget, note, pending, profileLog, requestFlush, setDestination, setProfileLog
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

createdEnvironment

protected boolean createdEnvironment

lastCacheMiss

protected long lastCacheMiss

lastCacheMissDiff

protected long lastCacheMissDiff

alreadySeen

protected transient com.sleepycat.je.Database alreadySeen

ZERO_LENGTH_ENTRY

protected static com.sleepycat.je.DatabaseEntry ZERO_LENGTH_ENTRY

count

protected long count
Constructor Detail

BdbUriUniqFilter

protected BdbUriUniqFilter()
Default constructor, made protected to prevent general use.


BdbUriUniqFilter

public BdbUriUniqFilter(com.sleepycat.je.Environment environment)
                 throws java.io.IOException
Constructor.

Parameters:
environment - A ready-configured bdb environment.
Throws:
java.io.IOException

BdbUriUniqFilter

public BdbUriUniqFilter(java.io.File bdbEnv)
                 throws java.io.IOException
Constructor.

Parameters:
bdbEnv - The directory that holds the bdb environment. A database is created under this directory if one doesn't already exist; otherwise any existing databases are reopened.
Throws:
java.io.IOException

BdbUriUniqFilter

public BdbUriUniqFilter(java.io.File bdbEnv,
                        int cacheSizePercentage)
                 throws java.io.IOException
Constructor.

Parameters:
bdbEnv - The directory that holds the bdb environment. A database is created under this directory if one doesn't already exist; otherwise any existing databases are reopened.
cacheSizePercentage - Percentage of JVM memory that bdb allocates as its cache. Pass -1 to get the default cache size.
Throws:
java.io.IOException
Method Detail

initialize

protected void initialize(com.sleepycat.je.Environment env)
                   throws com.sleepycat.je.DatabaseException
Method shared by constructors.

Parameters:
env - Environment to use.
Throws:
com.sleepycat.je.DatabaseException

getDatabaseConfig

protected com.sleepycat.je.DatabaseConfig getDatabaseConfig()
Returns:
DatabaseConfig to use

reopen

public void reopen(com.sleepycat.je.Environment env)
            throws com.sleepycat.je.DatabaseException
Call after deserializing an instance of this class. Opens the already-seen database in the passed environment.

Parameters:
env - DB Environment to use.
Throws:
com.sleepycat.je.DatabaseException
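
The reason reopen exists is that the alreadySeen database handle is declared transient, so it is not restored by Java deserialization. The following is a minimal sketch of that pattern only. The class ReopenAfterDeserializeSketch and its in-memory Set stand in for this class and its BDB Database; both are assumptions made so the example runs without any library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in: a serializable filter whose backing store is transient.
final class ReopenAfterDeserializeSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    long count;                // serialized along with the instance
    transient Set<Long> store; // not serialized; null after deserialization

    // Re-attach the backing store, analogous to reopen(Environment).
    void reopen(Set<Long> backingStore) {
        this.store = backingStore;
    }

    boolean add(long key) {
        if (store == null) {
            throw new IllegalStateException("call reopen() before use");
        }
        boolean added = store.add(key);
        if (added) {
            count++;
        }
        return added;
    }

    // Serialize and deserialize in memory to mimic a checkpoint and recovery.
    static ReopenAfterDeserializeSketch roundTrip(ReopenAfterDeserializeSketch f) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(f);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ReopenAfterDeserializeSketch) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Set<Long> persistent = new HashSet<>(); // plays the role of the on-disk database
        ReopenAfterDeserializeSketch filter = new ReopenAfterDeserializeSketch();
        filter.reopen(persistent);
        filter.add(42L);

        ReopenAfterDeserializeSketch restored = roundTrip(filter);
        restored.reopen(persistent);            // required before the restored instance is usable
        System.out.println(restored.add(42L));  // the key is already in the store
    }
}
```

In this sketch the serialized count survives the round trip while the store must be supplied again, which mirrors the split between the non-transient count field and the transient alreadySeen field listed in Field Detail.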

open

protected void open(com.sleepycat.je.Environment env,
                    com.sleepycat.je.DatabaseConfig dbConfig)
             throws com.sleepycat.je.DatabaseException
Throws:
com.sleepycat.je.DatabaseException

close

public void close()
Description copied from interface: UriUniqFilter
Close down any allocated resources. It makes sense to call this when checkpointing.

Specified by:
close in interface UriUniqFilter
Overrides:
close in class SetBasedUriUniqFilter

getCacheMisses

public long getCacheMisses()
                    throws com.sleepycat.je.DatabaseException
Throws:
com.sleepycat.je.DatabaseException

getLastCacheMissDiff

public long getLastCacheMissDiff()

createKey

public static long createKey(java.lang.CharSequence uri)
Create fingerprint. Public access so test code can access createKey.

Parameters:
uri - URI to fingerprint.
Returns:
Fingerprint of the passed URI.

setAdd

protected boolean setAdd(java.lang.CharSequence uri)
Specified by:
setAdd in class SetBasedUriUniqFilter

setCount

protected long setCount()
Specified by:
setCount in class SetBasedUriUniqFilter

setRemove

protected boolean setRemove(java.lang.CharSequence uri)
Specified by:
setRemove in class SetBasedUriUniqFilter

flush

public long flush()


Copyright © 2003-2011 Internet Archive. All Rights Reserved.