java.lang.Object
  org.archive.crawler.util.SetBasedUriUniqFilter
    org.archive.crawler.util.BdbUriUniqFilter
public class BdbUriUniqFilter
A BDB implementation of an AlreadySeen list. This implementation performs adequately without exhausting the heap. See AlreadySeen.
Keys are built so that URIs from the same server sit close to each other. Mercator and section 2.3.5, 'Eliminating Already-Visited URLs', of 'Mining the Web' by Soumen Chakrabarti describe a two-level key in which the first 24 bits are a hash of the host plus port and the last 40 bits are a hash of the path. In testing, adopting such a scheme halved lookup times. (This implementation actually puts scheme + host in the first 24 bits and path + query in the trailing 40 bits.)
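The two-level key layout described above can be sketched as follows. This is a minimal illustration, not the class's actual code: the 64-bit FNV-1a hash, the `hostEnd` splitting rule, and the class name `TwoLevelKeySketch` are all assumptions made for the example; the source states only the 24/40 bit split and which URI parts feed each half.

```java
import java.nio.charset.StandardCharsets;

class TwoLevelKeySketch {
    // Masks for the two halves of the 64-bit key.
    static final long HIGH_24 = 0xFFFFFF0000000000L;
    static final long LOW_40 = 0x000000FFFFFFFFFFL;

    // 64-bit FNV-1a, a stand-in hash chosen for this sketch (assumption).
    static long fnv1a64(CharSequence s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.toString().getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Hypothetical splitting rule: scheme + host ends at the first '/'
    // after "://", or at the end of the string if there is no path.
    static int hostEnd(CharSequence uri) {
        String s = uri.toString();
        int schemeEnd = s.indexOf("://");
        int from = schemeEnd < 0 ? 0 : schemeEnd + 3;
        int slash = s.indexOf('/', from);
        return slash < 0 ? s.length() : slash;
    }

    // First 24 bits from scheme + host, trailing 40 bits from path + query.
    static long createKey(CharSequence uri) {
        int split = hostEnd(uri);
        long hostBits = fnv1a64(uri.subSequence(0, split)) & HIGH_24;
        long pathBits = fnv1a64(uri.subSequence(split, uri.length())) & LOW_40;
        return hostBits | pathBits;
    }

    public static void main(String[] args) {
        long a = createKey("http://example.com/a.html");
        long b = createKey("http://example.com/b/c?x=1");
        long c = createKey("http://other.example.org/a.html");
        // Same scheme + host: the leading 24 bits match.
        System.out.println((a >>> 40) == (b >>> 40)); // prints true
        // Same path on a different host: the trailing 40 bits match.
        System.out.println((a & LOW_40) == (c & LOW_40)); // prints true
    }
}
```

Because all URIs from one server share the same leading 24 bits, their keys cluster together in a sorted store, which is the locality the class description credits for the faster lookups.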
Nested Class Summary |
---|
Nested classes/interfaces inherited from interface org.archive.crawler.datamodel.UriUniqFilter |
---|
UriUniqFilter.HasUriReceiver |
Field Summary | |
---|---|
protected com.sleepycat.je.Database | alreadySeen |
protected long | count |
protected boolean | createdEnvironment |
protected long | lastCacheMiss |
protected long | lastCacheMissDiff |
protected static com.sleepycat.je.DatabaseEntry | ZERO_LENGTH_ENTRY |
Fields inherited from class org.archive.crawler.util.SetBasedUriUniqFilter |
---|
duplicateCount, duplicatesAtLastSample, profileLog, receiver |
Constructor Summary | |
---|---|
protected | BdbUriUniqFilter() - Shutdown default constructor. |
public | BdbUriUniqFilter(com.sleepycat.je.Environment environment) - Constructor. |
public | BdbUriUniqFilter(java.io.File bdbEnv) - Constructor. |
public | BdbUriUniqFilter(java.io.File bdbEnv, int cacheSizePercentage) - Constructor. |
Method Summary | |
---|---|
void | close() - Close down any allocated resources. |
static long | createKey(java.lang.CharSequence uri) - Create fingerprint. |
long | flush() |
long | getCacheMisses() |
protected com.sleepycat.je.DatabaseConfig | getDatabaseConfig() |
long | getLastCacheMissDiff() |
protected void | initialize(com.sleepycat.je.Environment env) - Method shared by constructors. |
protected void | open(com.sleepycat.je.Environment env, com.sleepycat.je.DatabaseConfig dbConfig) |
void | reopen(com.sleepycat.je.Environment env) - Call after deserializing an instance of this class. |
protected boolean | setAdd(java.lang.CharSequence uri) |
protected long | setCount() |
protected boolean | setRemove(java.lang.CharSequence uri) |
Methods inherited from class org.archive.crawler.util.SetBasedUriUniqFilter |
---|
add, addForce, addNow, count, forget, note, pending, profileLog, requestFlush, setDestination, setProfileLog |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected boolean createdEnvironment
protected long lastCacheMiss
protected long lastCacheMissDiff
protected transient com.sleepycat.je.Database alreadySeen
protected static com.sleepycat.je.DatabaseEntry ZERO_LENGTH_ENTRY
protected long count
Constructor Detail |
---|
protected BdbUriUniqFilter()
public BdbUriUniqFilter(com.sleepycat.je.Environment environment) throws java.io.IOException
Parameters:
environment - A ready-configured bdb environment.
Throws:
java.io.IOException
public BdbUriUniqFilter(java.io.File bdbEnv) throws java.io.IOException
Parameters:
bdbEnv - The directory that holds the bdb environment. A database is created under this directory if one doesn't already exist; otherwise any existing dbs are reopened.
Throws:
java.io.IOException
public BdbUriUniqFilter(java.io.File bdbEnv, int cacheSizePercentage) throws java.io.IOException
Parameters:
bdbEnv - The directory that holds the bdb environment. A database is created under this directory if one doesn't already exist; otherwise any existing dbs are reopened.
cacheSizePercentage - Percentage of JVM memory that bdb allocates as its cache. Pass -1 to get the default cache size.
Throws:
java.io.IOException
Method Detail |
---|
protected void initialize(com.sleepycat.je.Environment env) throws com.sleepycat.je.DatabaseException
Parameters:
env - Environment to use.
Throws:
com.sleepycat.je.DatabaseException
protected com.sleepycat.je.DatabaseConfig getDatabaseConfig()
public void reopen(com.sleepycat.je.Environment env) throws com.sleepycat.je.DatabaseException
Parameters:
env - DB Environment to use.
Throws:
com.sleepycat.je.DatabaseException
protected void open(com.sleepycat.je.Environment env, com.sleepycat.je.DatabaseConfig dbConfig) throws com.sleepycat.je.DatabaseException
Throws:
com.sleepycat.je.DatabaseException
public void close()
Description copied from interface: UriUniqFilter
Close down any allocated resources.
close in interface UriUniqFilter
close in class SetBasedUriUniqFilter
public long getCacheMisses() throws com.sleepycat.je.DatabaseException
Throws:
com.sleepycat.je.DatabaseException
public long getLastCacheMissDiff()
public static long createKey(java.lang.CharSequence uri)
Parameters:
uri - URI to fingerprint.
Returns:
Fingerprint of the passed uri.
protected boolean setAdd(java.lang.CharSequence uri)
setAdd in class SetBasedUriUniqFilter
protected long setCount()
setCount in class SetBasedUriUniqFilter
protected boolean setRemove(java.lang.CharSequence uri)
setRemove in class SetBasedUriUniqFilter
public long flush()
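The setAdd, setRemove and setCount methods above suggest a set-of-fingerprints contract. A minimal in-memory sketch of that contract follows, for illustration only. It is not backed by BDB, and the class name `InMemorySeenSetSketch`, the `fingerprint` helper, and the return-value semantics (true when the set actually changed) are assumptions, since this page does not document them.

```java
import java.util.HashSet;
import java.util.Set;

class InMemorySeenSetSketch {
    // Fingerprints of URIs seen so far; the real class keeps these in a
    // BDB database rather than on the heap.
    private final Set<Long> seen = new HashSet<>();

    // Hypothetical stand-in for createKey(CharSequence); collisions are
    // possible with this simple hash.
    static long fingerprint(CharSequence uri) {
        return uri.toString().hashCode();
    }

    // Assumed semantics: true if the URI was not already present.
    boolean setAdd(CharSequence uri) {
        return seen.add(fingerprint(uri));
    }

    // Assumed semantics: true if the URI was present and has been removed.
    boolean setRemove(CharSequence uri) {
        return seen.remove(fingerprint(uri));
    }

    // Number of fingerprints currently held.
    long setCount() {
        return seen.size();
    }

    public static void main(String[] args) {
        InMemorySeenSetSketch filter = new InMemorySeenSetSketch();
        System.out.println(filter.setAdd("http://example.com/a"));    // prints true
        System.out.println(filter.setAdd("http://example.com/a"));    // prints false
        System.out.println(filter.setCount());                        // prints 1
        System.out.println(filter.setRemove("http://example.com/a")); // prints true
        System.out.println(filter.setCount());                        // prints 0
    }
}
```

Keeping only fixed-size fingerprints rather than full URI strings is what lets an already-seen list stay compact; the trade-off is that two distinct URIs with the same fingerprint are treated as one.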