org.archive.crawler.util
Class FPMergeUriUniqFilter

java.lang.Object
  extended by org.archive.crawler.util.FPMergeUriUniqFilter
All Implemented Interfaces:
UriUniqFilter
Direct Known Subclasses:
DiskFPMergeUriUniqFilter, MemFPMergeUriUniqFilter

public abstract class FPMergeUriUniqFilter
extends java.lang.Object
implements UriUniqFilter

UriUniqFilter based on merging FP arrays (in memory or from disk). Inspired by the approach in Najork and Heydon, "High-Performance Web Crawling" (2001), section 3.2, "Efficient Duplicate URL Eliminators".

Author:
gojomo

Nested Class Summary
 class FPMergeUriUniqFilter.PendingItem
          Represents a long fingerprint and (possibly) its corresponding CandidateURI, awaiting the next merge in a 'pending' state.
 
Nested classes/interfaces inherited from interface org.archive.crawler.datamodel.UriUniqFilter
UriUniqFilter.HasUriReceiver
 
Field Summary
static int DEFAULT_MAX_PENDING
           
static long FLUSH_DELAY_FACTOR
           
protected  int maxPending
          size at which to force flush of pending items
protected  long mergeDupAtLast
           
protected  long mergeDuplicateCount
           
protected  long nextFlushAllowableAfter
          time-based throttle on flush-merge operations
protected  long pendDupAtLast
           
protected  long pendDuplicateCount
           
protected  java.util.TreeSet<FPMergeUriUniqFilter.PendingItem> pendingSet
          items awaiting merge TODO: consider only sorting just pre-merge TODO: consider using a fastutil long->Object class TODO: consider actually writing items to disk file, as in Najork/Heydon
protected  java.io.PrintWriter profileLog
           
protected  ArrayLongFPCache quickCache
          cache of most recently seen FPs
protected  long quickDupAtLast
           
protected  long quickDuplicateCount
           
protected  UriUniqFilter.HasUriReceiver receiver
           
 
Constructor Summary
FPMergeUriUniqFilter()
           
 
Method Summary
 void add(java.lang.String key, CandidateURI value)
          Add given uri, if not already present.
 void addForce(java.lang.String key, CandidateURI value)
          Add given uri, all the way through to underlying destination, even if already present.
protected abstract  void addNewFp(long fp)
          Add an FP (which may be an old or new FP) to the new complete list.
 void addNow(java.lang.String key, CandidateURI value)
          Immediately add uri.
protected abstract  it.unimi.dsi.fastutil.longs.LongIterator beginFpMerge()
          Begin merging pending candidates with complete list.
 void close()
          Close down any allocated resources.
static long createFp(java.lang.CharSequence key)
          Create a fingerprint from the given key
protected abstract  void finishFpMerge()
          Complete the merge of candidate and previously-known FPs (closing files/iterators as appropriate).
 long flush()
          Perform a merge of all 'pending' items to the overall fingerprint list.
 void forget(java.lang.String key, CandidateURI value)
          Forget item was seen
 void note(java.lang.String key)
          Note item as seen, without passing through to receiver.
protected  void pend(long fp, CandidateURI value)
          Place the given FP/CandidateURI pair into the pending set, awaiting a merge to determine if it's actually accepted.
 long pending()
          Count of items added, but not yet filtered in or out.
protected  void profileLog(java.lang.String key)
           
 long requestFlush()
          Request that any pending items be added/dropped.
 void setDestination(UriUniqFilter.HasUriReceiver receiver)
          Receiver of uniq URIs.
 void setMaxPending(int max)
           
 void setProfileLog(java.io.File logfile)
          Set a File to receive a log for replay profiling.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.archive.crawler.datamodel.UriUniqFilter
count
 

Field Detail

receiver

protected UriUniqFilter.HasUriReceiver receiver

profileLog

protected java.io.PrintWriter profileLog

quickDuplicateCount

protected long quickDuplicateCount

quickDupAtLast

protected long quickDupAtLast

pendDuplicateCount

protected long pendDuplicateCount

pendDupAtLast

protected long pendDupAtLast

mergeDuplicateCount

protected long mergeDuplicateCount

mergeDupAtLast

protected long mergeDupAtLast

pendingSet

protected java.util.TreeSet<FPMergeUriUniqFilter.PendingItem> pendingSet
items awaiting merge TODO: consider only sorting just pre-merge TODO: consider using a fastutil long->Object class TODO: consider actually writing items to disk file, as in Najork/Heydon


maxPending

protected int maxPending
size at which to force flush of pending items


DEFAULT_MAX_PENDING

public static final int DEFAULT_MAX_PENDING
See Also:
Constant Field Values

nextFlushAllowableAfter

protected long nextFlushAllowableAfter
time-based throttle on flush-merge operations


FLUSH_DELAY_FACTOR

public static final long FLUSH_DELAY_FACTOR
See Also:
Constant Field Values

quickCache

protected ArrayLongFPCache quickCache
cache of most recently seen FPs

Constructor Detail

FPMergeUriUniqFilter

public FPMergeUriUniqFilter()
Method Detail

setMaxPending

public void setMaxPending(int max)

pending

public long pending()
Description copied from interface: UriUniqFilter
Count of items added, but not yet filtered in or out. Some implementations may buffer up large numbers of pending items to be evaluated in a later large batch/scan/merge with disk files.

Specified by:
pending in interface UriUniqFilter
Returns:
Count of items added not yet evaluated

setDestination

public void setDestination(UriUniqFilter.HasUriReceiver receiver)
Description copied from interface: UriUniqFilter
Receiver of uniq URIs. Items that have not been seen before are pass through to this object.

Specified by:
setDestination in interface UriUniqFilter
Parameters:
receiver - Object that will be passed items. Must implement HasUriReceiver interface.

profileLog

protected void profileLog(java.lang.String key)

add

public void add(java.lang.String key,
                CandidateURI value)
Description copied from interface: UriUniqFilter
Add given uri, if not already present.

Specified by:
add in interface UriUniqFilter
Parameters:
key - Usually a canonicalized version of value. This is the key used doing lookups, forgets and insertions on the already included list.
value - item to add.

pend

protected void pend(long fp,
                    CandidateURI value)
Place the given FP/CandidateURI pair into the pending set, awaiting a merge to determine if it's actually accepted.

Parameters:
fp - long fingerprint
value - CandidateURI or null, if fp only needs merging (as when CandidateURI was already forced in

createFp

public static long createFp(java.lang.CharSequence key)
Create a fingerprint from the given key

Parameters:
key - CharSequence (URI) to fingerprint
Returns:
long fingerprint

addNow

public void addNow(java.lang.String key,
                   CandidateURI value)
Description copied from interface: UriUniqFilter
Immediately add uri.

Specified by:
addNow in interface UriUniqFilter
Parameters:
key - Usually a canonicalized version of uri. This is the key used doing lookups, forgets and insertions on the already included list.
value - item to add.

addForce

public void addForce(java.lang.String key,
                     CandidateURI value)
Description copied from interface: UriUniqFilter
Add given uri, all the way through to underlying destination, even if already present. (Sometimes a URI must be fetched, or refetched, for example when DNS or robots info expires or the operator forces a refetch. A normal add() or addNow() would drop the URI without forwarding on once it is determmined to already be in the filter.)

Specified by:
addForce in interface UriUniqFilter
Parameters:
key - Usually a canonicalized version of uri. This is the key used doing lookups, forgets and insertions on the already included list.
value - item to add.

note

public void note(java.lang.String key)
Description copied from interface: UriUniqFilter
Note item as seen, without passing through to receiver.

Specified by:
note in interface UriUniqFilter
Parameters:
key - Usually a canonicalized version of an URI. This is the key used doing lookups, forgets and insertions on the already included list.

forget

public void forget(java.lang.String key,
                   CandidateURI value)
Description copied from interface: UriUniqFilter
Forget item was seen

Specified by:
forget in interface UriUniqFilter
Parameters:
key - Usually a canonicalized version of an URI. This is the key used doing lookups, forgets and insertions on the already included list.
value - item to add.

requestFlush

public long requestFlush()
Description copied from interface: UriUniqFilter
Request that any pending items be added/dropped. Implementors may ignore the request if a flush would be too expensive/too soon.

Specified by:
requestFlush in interface UriUniqFilter
Returns:
Number added.

flush

public long flush()
Perform a merge of all 'pending' items to the overall fingerprint list. If the pending item is new, and has an associated CandidateURI, pass that URI along to the 'receiver' (frontier) for queueing.

Returns:
number of pending items actually added

beginFpMerge

protected abstract it.unimi.dsi.fastutil.longs.LongIterator beginFpMerge()
Begin merging pending candidates with complete list. Return an Iterator which will return all previously-known FPs in turn.

Returns:
Iterator over all previously-known FPs

addNewFp

protected abstract void addNewFp(long fp)
Add an FP (which may be an old or new FP) to the new complete list. Should only be called after beginFpMerge() and before finishFpMerge().

Parameters:
fp - the FP to add

finishFpMerge

protected abstract void finishFpMerge()
Complete the merge of candidate and previously-known FPs (closing files/iterators as appropriate).


close

public void close()
Description copied from interface: UriUniqFilter
Close down any allocated resources. Makes sense calling this when checkpointing.

Specified by:
close in interface UriUniqFilter

setProfileLog

public void setProfileLog(java.io.File logfile)
Description copied from interface: UriUniqFilter
Set a File to receive a log for replay profiling.

Specified by:
setProfileLog in interface UriUniqFilter


Copyright © 2003-2011 Internet Archive. All Rights Reserved.