PDFParser (Heritrix 1.15.5-201106092337)

org.archive.crawler.extractor
Class PDFParser

java.lang.Object
  org.archive.crawler.extractor.PDFParser

public class PDFParser
extends java.lang.Object

Supports PDF parsing operations. For now this primarily means extracting URIs, but the logic in extractURIs() could easily be adapted or extended for a variety of PDF processing tasks.

Author:: Parker Thompson

Field Summary
`(package private) com.lowagie.text.pdf.PdfDictionary`	`catalog`
`(package private) byte[]`	`document`
`(package private) com.lowagie.text.pdf.PdfReader`	`documentReader`
`(package private) java.util.ArrayList<java.util.ArrayList<java.lang.Integer>>`	`encounteredReferences`
`(package private) java.util.ArrayList<java.lang.String>`	`foundURIs`

Constructor Summary
`PDFParser(byte[] doc)`
`PDFParser(java.lang.String doc)`

Method Summary
`java.util.ArrayList`	`extractURIs()` Extract URIs from all objects found in a PDF document's catalog.
`protected void`	`extractURIs(com.lowagie.text.pdf.PdfObject entity)` Parse a PdfDictionary, looking for URIs recursively and adding them to foundURIs.
`protected void`	`getInFromFile(java.lang.String doc)` Read a file named 'doc' and store its bytes for later processing.
`java.util.ArrayList`	`getURIs()` Get a list of URIs retrieved from the PDF during the extractURIs operation.
`protected boolean`	`haveSeen(int generation, int id)` Indicates, based on a PDF object's generation/id pair, whether the parser has already encountered this object (or a reference to it), so that it does not loop forever on cycles within the PDF.
`protected void`	`initialize()` Initialize opens the document for reading.
`static void`	`main(java.lang.String[] argv)`
`protected void`	`markAsSeen(int generation, int id)` Record that an object (id/generation pair) has been seen by this parser so that it can be handled differently when it is encountered again.
`protected void`	`resetState()` Reinitialize the object as though a new one were created.
`void`	`resetState(byte[] doc)` Reset the object and initialize it with a new byte array (the document).
`void`	`resetState(java.lang.String doc)` Reinitialize the object as though a new one were created, complete with a valid pointer to a document that can be read.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

foundURIs

java.util.ArrayList<java.lang.String> foundURIs

encounteredReferences

java.util.ArrayList<java.util.ArrayList<java.lang.Integer>> encounteredReferences

documentReader

com.lowagie.text.pdf.PdfReader documentReader

document

byte[] document

catalog

com.lowagie.text.pdf.PdfDictionary catalog

Constructor Detail

PDFParser

public PDFParser(java.lang.String doc)
          throws java.io.IOException

Throws:: java.io.IOException

PDFParser

public PDFParser(byte[] doc)
          throws java.io.IOException

Throws:: java.io.IOException

Method Detail

resetState

protected void resetState()

Reinitialize the object as though a new one were created.

resetState

public void resetState(byte[] doc)
                throws java.io.IOException

Reset the object and initialize it with a new byte array (the document).

Parameters:: doc - the new document, as a byte array
Throws:: java.io.IOException

resetState

public void resetState(java.lang.String doc)
                throws java.io.IOException

Reinitialize the object as though a new one were created, complete with a valid pointer to a document that can be read.

Parameters:: doc - name of the file containing the document
Throws:: java.io.IOException

getInFromFile

protected void getInFromFile(java.lang.String doc)
                      throws java.io.IOException

Read a file named 'doc' and store its bytes for later processing.

Parameters:: doc - name of the file to read
Throws:: java.io.IOException
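
The following is a minimal sketch of the "read a file and keep its bytes" step described above, not the Heritrix implementation. The class name FileBytesSketch is hypothetical, and the sketch assumes java.nio.file (Java 7 or later), which the original code may well not use.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical stand-in for the parser; only the file-loading step is modelled.
class FileBytesSketch {
    // Plays the role of the package-private 'document' field.
    byte[] document;

    // Read the file named 'doc' fully into memory for later processing.
    void getInFromFile(String doc) throws IOException {
        document = Files.readAllBytes(Paths.get(doc));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: FileBytesSketch <file>");
            return;
        }
        FileBytesSketch sketch = new FileBytesSketch();
        sketch.getInFromFile(args[0]);
        System.out.println("read " + sketch.document.length + " bytes");
    }
}
```

Holding the whole file in a byte array matches the byte[] document field and the PDFParser(byte[] doc) constructor, which suggests the parser works on an in-memory copy.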

haveSeen

protected boolean haveSeen(int generation,
                           int id)

Indicates, based on a PDF object's generation/id pair, whether the parser has already encountered this object (or a reference to it), so that it does not loop forever on cycles within the PDF.

Parameters:: generation - the object's generation number; id - the object's id
Returns:: True if already seen.

markAsSeen

protected void markAsSeen(int generation,
                          int id)

Record that an object (id/generation pair) has been seen by this parser so that it can be handled differently when it is encountered again.

Parameters:: generation - the object's generation number; id - the object's id
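
To illustrate how haveSeen and markAsSeen could cooperate, here is a small self-contained sketch. It is not the Heritrix source: the class name SeenTracker is hypothetical, and the layout of encounteredReferences (outer index is the generation, inner list holds the ids seen in that generation) is an assumption based only on the field's declared type.

```java
import java.util.ArrayList;

// Hypothetical stand-in for the parser's bookkeeping of visited objects.
class SeenTracker {
    // Same declared type as the encounteredReferences field.
    // Assumed layout: encounteredReferences.get(generation) lists the ids seen.
    final ArrayList<ArrayList<Integer>> encounteredReferences =
            new ArrayList<ArrayList<Integer>>();

    // True if this generation/id pair was recorded earlier.
    boolean haveSeen(int generation, int id) {
        if (generation < 0 || generation >= encounteredReferences.size()) {
            return false;
        }
        return encounteredReferences.get(generation).contains(id);
    }

    // Record a generation/id pair, growing the outer list as needed.
    void markAsSeen(int generation, int id) {
        if (generation < 0) {
            throw new IllegalArgumentException("generation must be >= 0");
        }
        while (encounteredReferences.size() <= generation) {
            encounteredReferences.add(new ArrayList<Integer>());
        }
        ArrayList<Integer> ids = encounteredReferences.get(generation);
        if (!ids.contains(id)) {
            ids.add(id);
        }
    }

    public static void main(String[] args) {
        SeenTracker tracker = new SeenTracker();
        System.out.println(tracker.haveSeen(0, 12)); // false
        tracker.markAsSeen(0, 12);
        System.out.println(tracker.haveSeen(0, 12)); // true
    }
}
```

A traversal would typically call haveSeen before following a reference and markAsSeen right after deciding to follow it, so each object is expanded at most once.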

getURIs

public java.util.ArrayList getURIs()

Get a list of URIs retrieved from the PDF during the extractURIs operation.

Returns:: A list of URIs retrieved from the PDF during the extractURIs operation.

initialize

protected void initialize()
                   throws java.io.IOException

Opens the document for reading. This is done implicitly by the constructor, so it should only need to be called directly following a reset.

Throws:: java.io.IOException

extractURIs

public java.util.ArrayList extractURIs()

Extract URIs from all objects found in a PDF document's catalog. Returns an array list representing all URIs found in the document catalog tree.

Returns:: URIs from all objects found in a PDF document's catalog.

extractURIs

protected void extractURIs(com.lowagie.text.pdf.PdfObject entity)

Parse a PdfDictionary, looking for URIs recursively and adding them to foundURIs.

Parameters:: entity - the PDF object to search
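
The recursive search with a guard against reference cycles can be sketched without any PDF library. Everything below is an illustrative assumption rather than the actual implementation: dictionaries are modelled as Map, arrays as List, indirect references as a hypothetical Ref class, and a link is taken to be a string stored under the key "URI".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of a PDF object graph; not based on the iText classes.
class UriWalkSketch {
    // Stand-in for an indirect reference, identified by id and generation.
    static final class Ref {
        final int id;
        final int generation;
        Ref(int id, int generation) { this.id = id; this.generation = generation; }
        String key() { return generation + ":" + id; }
    }

    final Map<String, Object> objects = new HashMap<String, Object>(); // ref key -> object
    final List<String> foundURIs = new ArrayList<String>();
    final Set<String> seen = new HashSet<String>(); // ref keys already followed

    void define(Ref ref, Object value) { objects.put(ref.key(), value); }

    // Walk the entity depth-first, collecting every string stored under "URI".
    void extractURIs(Object entity) {
        if (entity instanceof Ref) {
            Ref ref = (Ref) entity;
            if (!seen.add(ref.key())) {
                return; // followed before: stop so cycles cannot loop forever
            }
            extractURIs(objects.get(ref.key()));
        } else if (entity instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) entity).entrySet()) {
                if ("URI".equals(e.getKey()) && e.getValue() instanceof String) {
                    foundURIs.add((String) e.getValue());
                } else {
                    extractURIs(e.getValue());
                }
            }
        } else if (entity instanceof List) {
            for (Object item : (List<?>) entity) {
                extractURIs(item);
            }
        }
        // Anything else (numbers, other strings, null) holds no links here.
    }

    public static void main(String[] args) {
        UriWalkSketch walk = new UriWalkSketch();
        Ref catalog = new Ref(1, 0);
        Ref page = new Ref(2, 0);
        Map<String, Object> action = new HashMap<String, Object>();
        action.put("URI", "http://example.com/");
        Map<String, Object> pageDict = new HashMap<String, Object>();
        pageDict.put("Parent", catalog); // back-reference that forms a cycle
        pageDict.put("A", action);
        List<Object> kids = new ArrayList<Object>();
        kids.add(page);
        Map<String, Object> catalogDict = new HashMap<String, Object>();
        catalogDict.put("Kids", kids);
        walk.define(catalog, catalogDict);
        walk.define(page, pageDict);
        walk.extractURIs(catalog);
        System.out.println(walk.foundURIs); // [http://example.com/]
    }
}
```

The seen set plays the same role as haveSeen and markAsSeen: the Parent entry points back to the catalog, and the walk terminates because that reference has already been followed.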

main

public static void main(java.lang.String[] argv)
