org.archive.crawler.extractor
Class ExtractorTool
java.lang.Object
org.archive.crawler.extractor.ExtractorTool
public class ExtractorTool
- extends java.lang.Object
Runs the named extractors against the passed ARC file.
This extractor tool runs suboptimally. It takes each ARC file record,
writes it to a new scratch file, and then runs each listed
extractor against the scratch file. It works this way because
extractors want a CharSequence, so that they can refer to characters
by absolute position, but ARCs are compressed streams. The work
to get a CharSequence on an underlying compressed stream has not
been done. Another issue is the need to set up a CrawlerSetting environment
so that extractors can run.
- Version:
- $Date: 2006-09-26 23:47:15 +0000 (Tue, 26 Sep 2006) $, $Revision: 4671 $
- Author:
- stack
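The scratch-file approach described above can be sketched with the standard library alone. This is a minimal illustration, not the tool's actual code: the class name ScratchFileSketch and its methods are hypothetical, a single gzip stream stands in for one ARC record, and ISO-8859-1 decoding is an assumption made so that each byte maps to one character position.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: not part of org.archive.crawler.extractor.
class ScratchFileSketch {

    // Decompress the stream into the scratch file, overwriting any old content.
    static Path spoolToScratch(InputStream compressed, Path scratch) throws IOException {
        try (InputStream in = new GZIPInputStream(compressed)) {
            Files.copy(in, scratch, StandardCopyOption.REPLACE_EXISTING);
        }
        return scratch;
    }

    // Load the scratch file as a CharSequence so callers can address
    // characters by absolute position (assumes one byte per character).
    static CharSequence asCharSequence(Path scratch) throws IOException {
        return new String(Files.readAllBytes(scratch), StandardCharsets.ISO_8859_1);
    }

    // Helper for the demo only: gzip a string in memory.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.ISO_8859_1));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Path scratch = Files.createTempFile("extractor-scratch", ".tmp");
        try {
            byte[] record = gzip("<a href=\"http://example.org/\">x</a>");
            spoolToScratch(new ByteArrayInputStream(record), scratch);
            CharSequence content = asCharSequence(scratch);
            // Random access by position, which a compressed stream cannot offer.
            System.out.println(content.charAt(0));
        } finally {
            Files.deleteIfExists(scratch);
        }
    }
}
```

The cost the description calls suboptimal is visible here: every record is decompressed to disk and read back before any extractor sees it.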
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
ExtractorTool
public ExtractorTool()
throws java.lang.Exception
- Throws:
java.lang.Exception
ExtractorTool
public ExtractorTool(java.lang.String[] e,
java.lang.String scratch)
throws java.lang.Exception
- Throws:
java.lang.Exception
extract
public void extract(java.lang.String resource)
throws java.io.IOException,
org.apache.commons.httpclient.URIException,
java.lang.InterruptedException
- Throws:
java.io.IOException
org.apache.commons.httpclient.URIException
java.lang.InterruptedException
outlinks
protected void outlinks(CrawlURI curi)
getCrawlURI
protected CrawlURI getCrawlURI(ARCRecord record,
HttpRecorder hr)
throws org.apache.commons.httpclient.URIException
- Throws:
org.apache.commons.httpclient.URIException
main
public static void main(java.lang.String[] args)
throws java.lang.Exception
- Throws:
java.lang.Exception
Copyright © 2003-2011 Internet Archive. All Rights Reserved.