org.archive.crawler.extractor
Class ExtractorTool

java.lang.Object
  extended by org.archive.crawler.extractor.ExtractorTool

public class ExtractorTool
extends java.lang.Object

Runs the named extractors against the passed ARC file. This tool runs suboptimally: it takes each ARC file record, writes it to a new scratch file, and then runs each listed extractor against that scratch file. It works this way because extractors want a CharSequence, so that they can refer to characters by absolute position, but ARCs are compressed streams. The work to get a CharSequence on an underlying compressed stream has not been done. Another open issue is the need to set up a CrawlerSetting environment so that extractors can run.
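
The scratch-file approach described above can be illustrated with a minimal sketch that uses only the JDK. It is not the code of ExtractorTool: the class name ScratchSpoolSketch, the helpers spool and hrefs, and the regex "extractor" are all hypothetical stand-ins. It also assumes the record is small enough to load whole into a String. The sketch spools a forward-only gzip stream to a scratch file so that the content can then be addressed by absolute character position.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ScratchSpoolSketch {

    /**
     * Copies a forward-only stream (for example a gzip stream) to a scratch
     * file, then loads the file as a CharSequence with random access.
     * Assumption: the content fits in memory; ISO-8859-1 maps bytes 1:1 to chars.
     */
    static CharSequence spool(InputStream in, Path scratch) throws IOException {
        Files.copy(in, scratch, StandardCopyOption.REPLACE_EXISTING);
        return new String(Files.readAllBytes(scratch), StandardCharsets.ISO_8859_1);
    }

    /** Hypothetical stand-in for an extractor: collects href attribute values. */
    static List<String> hrefs(CharSequence cs) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"([^\"]*)\"").matcher(cs);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Build a small gzip-compressed "record" in memory for the demo.
        byte[] body = "<a href=\"http://example.com/a\">a</a> <a href=\"/b\">b</a>"
                .getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
            gz.write(body);
        }

        Path scratch = Files.createTempFile("extractor-scratch", ".tmp");
        try (InputStream in = new GZIPInputStream(
                new ByteArrayInputStream(compressed.toByteArray()))) {
            CharSequence cs = spool(in, scratch);
            System.out.println(hrefs(cs));
        } finally {
            Files.deleteIfExists(scratch);
        }
    }
}
```

The extra copy to disk is the cost the description calls suboptimal: every record is decompressed and written once before any extractor sees it.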

Version:
$Date: 2006-09-26 23:47:15 +0000 (Tue, 26 Sep 2006) $, $Revision: 4671 $
Author:
stack

Constructor Summary
ExtractorTool()
           
ExtractorTool(java.lang.String[] e, java.lang.String scratch)
           
 
Method Summary
 void extract(java.lang.String resource)
           
protected  CrawlURI getCrawlURI(ARCRecord record, HttpRecorder hr)
           
static void main(java.lang.String[] args)
           
protected  void outlinks(CrawlURI curi)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ExtractorTool

public ExtractorTool()
              throws java.lang.Exception
Throws:
java.lang.Exception

ExtractorTool

public ExtractorTool(java.lang.String[] e,
                     java.lang.String scratch)
              throws java.lang.Exception
Throws:
java.lang.Exception

Method Detail

extract

public void extract(java.lang.String resource)
             throws java.io.IOException,
                    org.apache.commons.httpclient.URIException,
                    java.lang.InterruptedException
Throws:
java.io.IOException
org.apache.commons.httpclient.URIException
java.lang.InterruptedException

outlinks

protected void outlinks(CrawlURI curi)

getCrawlURI

protected CrawlURI getCrawlURI(ARCRecord record,
                               HttpRecorder hr)
                        throws org.apache.commons.httpclient.URIException
Throws:
org.apache.commons.httpclient.URIException

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception
Throws:
java.lang.Exception


Copyright © 2003-2011 Internet Archive. All Rights Reserved.