org.archive.crawler.extractor
Class PDFParser

java.lang.Object
  extended by org.archive.crawler.extractor.PDFParser

public class PDFParser
extends java.lang.Object

Supports PDF parsing operations. For now this primarily means extracting URIs, but the logic in extractURIs() could easily be adapted or extended for a variety of PDF processing tasks.

Author:
Parker Thompson

Field Summary
(package private)  com.lowagie.text.pdf.PdfDictionary catalog
           
(package private)  byte[] document
           
(package private)  com.lowagie.text.pdf.PdfReader documentReader
           
(package private)  java.util.ArrayList<java.util.ArrayList<java.lang.Integer>> encounteredReferences
           
(package private)  java.util.ArrayList<java.lang.String> foundURIs
           
 
Constructor Summary
PDFParser(byte[] doc)
           
PDFParser(java.lang.String doc)
           
 
Method Summary
 java.util.ArrayList extractURIs()
          Extract URIs from all objects found in a Pdf document's catalog.
protected  void extractURIs(com.lowagie.text.pdf.PdfObject entity)
          Parse a PdfDictionary, looking for URIs recursively and adding them to foundURIs.
protected  void getInFromFile(java.lang.String doc)
          Read a file named 'doc' and store its bytes for later processing.
 java.util.ArrayList getURIs()
          Get a list of URIs retrieved from the Pdf during the extractURIs operation.
protected  boolean haveSeen(int generation, int id)
          Indicates, based on a PDFObject's generation/id pair, whether the parser has already encountered this object (or a reference to it), so that we don't loop infinitely on cycles within the PDF.
protected  void initialize()
          Opens the document for reading.
static void main(java.lang.String[] argv)
           
protected  void markAsSeen(int generation, int id)
          Record that an object (id/generation pair) has been seen by this parser so that it can be handled differently when it is encountered again.
protected  void resetState()
          Reinitialize the object as though a new one were created.
 void resetState(byte[] doc)
          Reset the object and initialize it with a new byte array (the document).
 void resetState(java.lang.String doc)
          Reinitialize the object as though a new one were created, complete with a valid pointer to a document that can be read.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

foundURIs

java.util.ArrayList<java.lang.String> foundURIs

encounteredReferences

java.util.ArrayList<java.util.ArrayList<java.lang.Integer>> encounteredReferences

documentReader

com.lowagie.text.pdf.PdfReader documentReader

document

byte[] document

catalog

com.lowagie.text.pdf.PdfDictionary catalog
Constructor Detail

PDFParser

public PDFParser(java.lang.String doc)
          throws java.io.IOException
Throws:
java.io.IOException

PDFParser

public PDFParser(byte[] doc)
          throws java.io.IOException
Throws:
java.io.IOException
Method Detail

resetState

protected void resetState()
Reinitialize the object as though a new one were created.


resetState

public void resetState(byte[] doc)
                throws java.io.IOException
Reset the object and initialize it with a new byte array (the document).

Parameters:
doc - the new document, as a byte array
Throws:
java.io.IOException

resetState

public void resetState(java.lang.String doc)
                throws java.io.IOException
Reinitialize the object as though a new one were created, complete with a valid pointer to a document that can be read.

Parameters:
doc - name of the document file to read
Throws:
java.io.IOException

getInFromFile

protected void getInFromFile(java.lang.String doc)
                      throws java.io.IOException
Read a file named 'doc' and store its bytes for later processing.

Parameters:
doc - name of the file to read
Throws:
java.io.IOException
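
The following is a minimal sketch of the "read a file and keep its bytes" step, not the Heritrix implementation. The class name FileBytesSketch and the method readAll are hypothetical, and it assumes the standard java.nio.file API rather than whatever I/O calls the real getInFromFile uses.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical stand-in, not the Heritrix source: loads a whole file into memory.
final class FileBytesSketch {
    private FileBytesSketch() {}

    /** Reads the file named by doc and returns its contents as a byte array. */
    static byte[] readAll(String doc) throws IOException {
        Path path = Paths.get(doc);
        // readAllBytes opens the file, reads it fully, and closes it again.
        return Files.readAllBytes(path);
    }
}
```

Reading the whole file up front matches the byte[] document field listed above, at the cost of holding the entire PDF in memory.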

haveSeen

protected boolean haveSeen(int generation,
                           int id)
Indicates, based on a PDFObject's generation/id pair, whether the parser has already encountered this object (or a reference to it), so that we don't loop infinitely on cycles within the PDF.

Parameters:
generation - generation number of the PDF object
id - id of the PDF object
Returns:
True if already seen.

markAsSeen

protected void markAsSeen(int generation,
                          int id)
Record that an object (id/generation pair) has been seen by this parser so that it can be handled differently when it is encountered again.

Parameters:
generation - generation number of the PDF object
id - id of the PDF object
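
The haveSeen/markAsSeen pair can be illustrated with a small sketch. This is not the Heritrix code: the documented field encounteredReferences is a list of integer lists, whereas this hypothetical SeenReferencesSketch class swaps in a HashSet of packed long keys, which is one simple way to get the same "have I visited this generation/id pair" answer.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not the Heritrix source: remembers which
// (generation, id) pairs have already been visited.
final class SeenReferencesSketch {
    // Each pair is packed into one long: generation in the high 32 bits,
    // id in the low 32 bits.
    private final Set<Long> seen = new HashSet<>();

    private static long key(int generation, int id) {
        return ((long) generation << 32) | (id & 0xFFFFFFFFL);
    }

    /** Returns true if this pair was previously recorded with markAsSeen. */
    boolean haveSeen(int generation, int id) {
        return seen.contains(key(generation, id));
    }

    /** Records the pair so that later encounters can be skipped. */
    void markAsSeen(int generation, int id) {
        seen.add(key(generation, id));
    }
}
```

A caller would typically check haveSeen before descending into a referenced object and call markAsSeen immediately afterwards, so that a reference back to an ancestor ends the descent instead of repeating it.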

getURIs

public java.util.ArrayList getURIs()
Get a list of URIs retrieved from the Pdf during the extractURIs operation.

Returns:
A list of URIs retrieved from the Pdf during the extractURIs operation.

initialize

protected void initialize()
                   throws java.io.IOException
Opens the document for reading. The constructor does this implicitly, so this method should only need to be called directly after a reset.

Throws:
java.io.IOException

extractURIs

public java.util.ArrayList extractURIs()
Extract URIs from all objects found in a Pdf document's catalog. Returns an array list representing all URIs found in the document catalog tree.

Returns:
URIs from all objects found in a Pdf document's catalog.

extractURIs

protected void extractURIs(com.lowagie.text.pdf.PdfObject entity)
Parse a PdfDictionary, looking for URIs recursively and adding them to foundURIs.

Parameters:
entity - the PDF object to search for URIs
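
The recursive search described above can be sketched without the PDF library. Everything here is an assumption made for illustration: the class UriWalkSketch is hypothetical, java.util.Map and java.util.List stand in for PDF dictionaries and arrays, any string stored under a key named "URI" is treated as a link, and visited containers are tracked by object identity rather than by generation/id pair.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the Heritrix source. Maps stand in for PDF
// dictionaries, Lists for PDF arrays, and Strings for string values.
final class UriWalkSketch {
    private final List<String> foundURIs = new ArrayList<>();
    // Identity-based set of the container instances already visited.
    private final Set<Object> visited =
            Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());

    /** Walks the graph rooted at entity and returns each string found under a "URI" key. */
    List<String> extract(Object entity) {
        walk(entity);
        return foundURIs;
    }

    private void walk(Object entity) {
        if (entity instanceof Map) {
            if (!visited.add(entity)) {
                return; // already visited: stop to avoid looping on a cycle
            }
            for (Map.Entry<?, ?> e : ((Map<?, ?>) entity).entrySet()) {
                if ("URI".equals(e.getKey()) && e.getValue() instanceof String) {
                    foundURIs.add((String) e.getValue());
                } else {
                    walk(e.getValue());
                }
            }
        } else if (entity instanceof List) {
            if (!visited.add(entity)) {
                return;
            }
            for (Object child : (List<?>) entity) {
                walk(child);
            }
        }
        // Any other value type has no nested objects and is ignored.
    }
}
```

The cycle guard matters because a PDF page tree commonly links children back to their parent; without it, the walk over such a structure would never terminate.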

main

public static void main(java.lang.String[] argv)


Copyright © 2003-2011 Internet Archive. All Rights Reserved.