java.lang.Object
  org.archive.crawler.util.FPMergeUriUniqFilter
public abstract class FPMergeUriUniqFilter extends java.lang.Object implements UriUniqFilter
UriUniqFilter based on merging FP arrays (in memory or from disk). Inspired by the approach in Najork and Heydon, "High-Performance Web Crawling" (2001), section 3.2, "Efficient Duplicate URL Eliminators".
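The merge approach described above can be illustrated with a minimal, self-contained sketch. It is not the Heritrix implementation: the class `FpMergeSketch` and its members are hypothetical names, and it assumes the complete fingerprint list is a sorted `long[]` held in memory. Candidates collect in a sorted pending set, and a single linear pass merges them into the known list, reporting only fingerprints that were not already present.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Illustrative only: names here are hypothetical, not Heritrix API. */
class FpMergeSketch {
    /** Sorted array of all fingerprints accepted so far. */
    private long[] known = new long[0];
    /** Sorted, de-duplicated fingerprints awaiting the next merge. */
    private final TreeSet<Long> pending = new TreeSet<>();

    /** Queue a fingerprint; returns false if it is already pending. */
    boolean pend(long fp) {
        return pending.add(fp);
    }

    /** Merge pending into known in one linear pass; returns the newly seen fingerprints. */
    List<Long> flush() {
        List<Long> fresh = new ArrayList<>();
        long[] merged = new long[known.length + pending.size()];
        int k = 0; // read index into known
        int m = 0; // write index into merged
        for (long fp : pending) {
            // Copy every known fingerprint smaller than the candidate.
            while (k < known.length && known[k] < fp) {
                merged[m++] = known[k++];
            }
            if (k < known.length && known[k] == fp) {
                continue; // duplicate: the known copy is written by a later step
            }
            merged[m++] = fp;
            fresh.add(fp);
        }
        // Copy whatever remains of the old list.
        while (k < known.length) {
            merged[m++] = known[k++];
        }
        known = Arrays.copyOf(merged, m);
        pending.clear();
        return fresh;
    }
}
```

Because both inputs are sorted, each flush costs time proportional to the combined size of the two lists, which is why batching many candidates per merge pays off.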
Nested Class Summary | |
---|---|
class |
FPMergeUriUniqFilter.PendingItem
Represents a long fingerprint and (possibly) its corresponding CandidateURI, awaiting the next merge in a 'pending' state. |
Nested classes/interfaces inherited from interface org.archive.crawler.datamodel.UriUniqFilter |
---|
UriUniqFilter.HasUriReceiver |
Field Summary | |
---|---|
static int |
DEFAULT_MAX_PENDING
|
static long |
FLUSH_DELAY_FACTOR
|
protected int |
maxPending
size at which to force flush of pending items |
protected long |
mergeDupAtLast
|
protected long |
mergeDuplicateCount
|
protected long |
nextFlushAllowableAfter
time-based throttle on flush-merge operations |
protected long |
pendDupAtLast
|
protected long |
pendDuplicateCount
|
protected java.util.TreeSet<FPMergeUriUniqFilter.PendingItem> |
pendingSet
items awaiting merge. TODO: consider only sorting just pre-merge. TODO: consider using a fastutil long->Object class. TODO: consider actually writing items to disk file, as in Najork/Heydon. |
protected java.io.PrintWriter |
profileLog
|
protected ArrayLongFPCache |
quickCache
cache of most recently seen FPs |
protected long |
quickDupAtLast
|
protected long |
quickDuplicateCount
|
protected UriUniqFilter.HasUriReceiver |
receiver
|
Constructor Summary | |
---|---|
FPMergeUriUniqFilter()
|
Method Summary | |
---|---|
void |
add(java.lang.String key,
CandidateURI value)
Add given uri, if not already present. |
void |
addForce(java.lang.String key,
CandidateURI value)
Add given uri, all the way through to underlying destination, even if already present. |
protected abstract void |
addNewFp(long fp)
Add an FP (which may be an old or new FP) to the new complete list. |
void |
addNow(java.lang.String key,
CandidateURI value)
Immediately add uri. |
protected abstract it.unimi.dsi.fastutil.longs.LongIterator |
beginFpMerge()
Begin merging pending candidates with complete list. |
void |
close()
Close down any allocated resources. |
static long |
createFp(java.lang.CharSequence key)
Create a fingerprint from the given key |
protected abstract void |
finishFpMerge()
Complete the merge of candidate and previously-known FPs (closing files/iterators as appropriate). |
long |
flush()
Perform a merge of all 'pending' items to the overall fingerprint list. |
void |
forget(java.lang.String key,
CandidateURI value)
Forget that the item was seen. |
void |
note(java.lang.String key)
Note item as seen, without passing through to receiver. |
protected void |
pend(long fp,
CandidateURI value)
Place the given FP/CandidateURI pair into the pending set, awaiting a merge to determine if it's actually accepted. |
long |
pending()
Count of items added, but not yet filtered in or out. |
protected void |
profileLog(java.lang.String key)
|
long |
requestFlush()
Request that any pending items be added/dropped. |
void |
setDestination(UriUniqFilter.HasUriReceiver receiver)
Receiver of unique URIs. |
void |
setMaxPending(int max)
|
void |
setProfileLog(java.io.File logfile)
Set a File to receive a log for replay profiling. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Methods inherited from interface org.archive.crawler.datamodel.UriUniqFilter |
---|
count |
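The three abstract methods (beginFpMerge, addNewFp, finishFpMerge) read like template-method hooks that a storage-specific subclass fills in. The sketch below shows one way such a contract could be driven and implemented in memory. It is an assumption-laden stand-in, not the Heritrix code: `MergeHooksSketch` and `InMemoryMergeSketch` are hypothetical classes, and `java.util.PrimitiveIterator.OfLong` replaces the fastutil `LongIterator` so the example has no external dependencies.

```java
import java.util.Arrays;
import java.util.PrimitiveIterator;
import java.util.TreeSet;

/** Hypothetical stand-in for the hook contract; not the Heritrix class. */
abstract class MergeHooksSketch {
    /** Sorted candidate fingerprints awaiting a merge. */
    protected final TreeSet<Long> pending = new TreeSet<>();

    /** Start a merge; returns the previously known FPs in ascending order. */
    protected abstract PrimitiveIterator.OfLong beginFpMerge();

    /** Append one FP (old or new) to the list under construction. */
    protected abstract void addNewFp(long fp);

    /** Finish the merge and publish the new list. */
    protected abstract void finishFpMerge();

    /** Drives the hooks; returns the number of FPs that were new. */
    long flush() {
        long added = 0;
        PrimitiveIterator.OfLong old = beginFpMerge();
        boolean haveOld = old.hasNext();
        long oldFp = haveOld ? old.nextLong() : 0L;
        for (long fp : pending) {
            // Emit old FPs that sort before the candidate.
            while (haveOld && oldFp < fp) {
                addNewFp(oldFp);
                haveOld = old.hasNext();
                if (haveOld) oldFp = old.nextLong();
            }
            if (haveOld && oldFp == fp) {
                continue; // already known; the old copy is emitted later
            }
            addNewFp(fp);
            added++;
        }
        // Emit the rest of the old list.
        while (haveOld) {
            addNewFp(oldFp);
            haveOld = old.hasNext();
            if (haveOld) oldFp = old.nextLong();
        }
        finishFpMerge();
        pending.clear();
        return added;
    }
}

/** In-memory implementation: keeps the complete list in a sorted long[]. */
class InMemoryMergeSketch extends MergeHooksSketch {
    long[] fps = new long[0];
    private long[] next;
    private int size;

    @Override
    protected PrimitiveIterator.OfLong beginFpMerge() {
        next = new long[fps.length + pending.size()];
        size = 0;
        return Arrays.stream(fps).iterator();
    }

    @Override
    protected void addNewFp(long fp) {
        next[size++] = fp;
    }

    @Override
    protected void finishFpMerge() {
        fps = Arrays.copyOf(next, size);
        next = null;
    }
}
```

A disk-backed variant would, under the same assumptions, open an input file in the begin hook, append to an output file in the add hook, and swap the files in the finish hook.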
Field Detail |
---|
protected UriUniqFilter.HasUriReceiver receiver
protected java.io.PrintWriter profileLog
protected long quickDuplicateCount
protected long quickDupAtLast
protected long pendDuplicateCount
protected long pendDupAtLast
protected long mergeDuplicateCount
protected long mergeDupAtLast
protected java.util.TreeSet<FPMergeUriUniqFilter.PendingItem> pendingSet
protected int maxPending
public static final int DEFAULT_MAX_PENDING
protected long nextFlushAllowableAfter
public static final long FLUSH_DELAY_FACTOR
protected ArrayLongFPCache quickCache
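The fields nextFlushAllowableAfter and FLUSH_DELAY_FACTOR point to a time-based throttle on flush-merge operations. This page does not state the update rule, so the sketch below is only one plausible reading: it assumes that after a flush, further requested flushes are held off for a multiple of the time the flush took. The class `FlushThrottleSketch`, its methods, and the factor value of 100 are all hypothetical.

```java
/** Hypothetical throttle; the factor value and the update rule are assumptions. */
class FlushThrottleSketch {
    /** Assumed multiplier, for illustration only. */
    static final long DELAY_FACTOR = 100;

    /** Earliest time (epoch millis) at which a requested flush may run again. */
    private long nextFlushAllowableAfter = 0;

    /** True if a requested (non-forced) flush may run at time nowMillis. */
    boolean mayFlush(long nowMillis) {
        return nowMillis > nextFlushAllowableAfter;
    }

    /** Record a finished flush so later requests wait in proportion to its cost. */
    void flushed(long startMillis, long endMillis) {
        long duration = endMillis - startMillis;
        nextFlushAllowableAfter = endMillis + DELAY_FACTOR * duration;
    }
}
```

Under this rule an expensive merge earns a long quiet period, which bounds the share of time spent merging at roughly 1 / (1 + factor).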
Constructor Detail |
---|
public FPMergeUriUniqFilter()
Method Detail |
---|
public void setMaxPending(int max)
public long pending()
Specified by: pending in interface UriUniqFilter
public void setDestination(UriUniqFilter.HasUriReceiver receiver)
Specified by: setDestination in interface UriUniqFilter
Parameters:
receiver - Object that will be passed items. Must implement the HasUriReceiver interface.
protected void profileLog(java.lang.String key)
public void add(java.lang.String key, CandidateURI value)
Specified by: add in interface UriUniqFilter
Parameters:
key - Usually a canonicalized version of value. This is the key used when doing lookups, forgets and insertions on the already-included list.
value - item to add.
protected void pend(long fp, CandidateURI value)
Parameters:
fp - long fingerprint
value - CandidateURI, or null if fp only needs merging (as when the CandidateURI was already forced in)
public static long createFp(java.lang.CharSequence key)
Parameters:
key - CharSequence (URI) to fingerprint
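To make the idea of reducing a URI key to a 64-bit fingerprint concrete, here is a minimal sketch. It deliberately uses FNV-1a over the UTF-8 bytes of the key as a stand-in; this page does not say which fingerprint function the class actually applies, and `FingerprintSketch` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;

/** Stand-in fingerprint: FNV-1a, chosen for brevity, not taken from Heritrix. */
class FingerprintSketch {
    private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    /** Fold the UTF-8 bytes of the key into a 64-bit value. */
    static long createFp(CharSequence key) {
        long hash = FNV_OFFSET_BASIS;
        for (byte b : key.toString().getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);  // mix in one byte
            hash *= FNV_PRIME;   // multiply, wrapping modulo 2^64
        }
        return hash;
    }
}
```

Any such function trades exactness for space: two distinct URIs can in principle share a fingerprint, in which case the second would be treated as already seen.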
public void addNow(java.lang.String key, CandidateURI value)
Specified by: addNow in interface UriUniqFilter
Parameters:
key - Usually a canonicalized version of the URI. This is the key used when doing lookups, forgets and insertions on the already-included list.
value - item to add.
public void addForce(java.lang.String key, CandidateURI value)
Specified by: addForce in interface UriUniqFilter
Parameters:
key - Usually a canonicalized version of the URI. This is the key used when doing lookups, forgets and insertions on the already-included list.
value - item to add.
public void note(java.lang.String key)
Specified by: note in interface UriUniqFilter
Parameters:
key - Usually a canonicalized version of a URI. This is the key used when doing lookups, forgets and insertions on the already-included list.
public void forget(java.lang.String key, CandidateURI value)
Specified by: forget in interface UriUniqFilter
Parameters:
key - Usually a canonicalized version of a URI. This is the key used when doing lookups, forgets and insertions on the already-included list.
value - item to forget.
public long requestFlush()
Specified by: requestFlush in interface UriUniqFilter
public long flush()
protected abstract it.unimi.dsi.fastutil.longs.LongIterator beginFpMerge()
protected abstract void addNewFp(long fp)
Parameters:
fp - the FP to add
protected abstract void finishFpMerge()
public void close()
Specified by: close in interface UriUniqFilter
public void setProfileLog(java.io.File logfile)
Specified by: setProfileLog in interface UriUniqFilter