java.lang.Object org.archive.crawler.extractor.PDFParser
public class PDFParser
Supports PDF parsing operations. For now this primarily means extracting URIs, but the logic in extractURIs() could easily be adapted or extended for a variety of PDF processing tasks.
Field Summary | |
---|---|
(package private) com.lowagie.text.pdf.PdfDictionary | catalog |
(package private) byte[] | document |
(package private) com.lowagie.text.pdf.PdfReader | documentReader |
(package private) java.util.ArrayList<java.util.ArrayList<java.lang.Integer>> | encounteredReferences |
(package private) java.util.ArrayList<java.lang.String> | foundURIs |
Constructor Summary | |
---|---|
PDFParser(byte[] doc) | |
PDFParser(java.lang.String doc) | |
Method Summary | |
---|---|
java.util.ArrayList | extractURIs(): Extract URIs from all objects found in a PDF document's catalog. |
protected void | extractURIs(com.lowagie.text.pdf.PdfObject entity): Parse a PdfDictionary, recursively looking for URIs and adding them to foundURIs. |
protected void | getInFromFile(java.lang.String doc): Read a file named 'doc' and store its bytes for later processing. |
java.util.ArrayList | getURIs(): Get the list of URIs retrieved from the PDF during the extractURIs operation. |
protected boolean | haveSeen(int generation, int id): Indicates, based on a PDF object's generation/id pair, whether the parser has already encountered this object (or a reference to it), so that it does not loop infinitely on cycles within the PDF. |
protected void | initialize(): Opens the document for reading. |
static void | main(java.lang.String[] argv) |
protected void | markAsSeen(int generation, int id): Note that an object (id/generation pair) has been seen by this parser so that it can be handled differently when it is encountered again. |
protected void | resetState(): Reinitialize the object as though a new one were created. |
void | resetState(byte[] doc): Reset the object and initialize it with a new byte array (the document). |
void | resetState(java.lang.String doc): Reinitialize the object as though a new one were created, complete with a valid pointer to a document that can be read. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
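The haveSeen/markAsSeen pair described in the method summary can be illustrated with a small stand-alone sketch. It is not the Heritrix source: the class name SeenTracker is hypothetical, and the layout (outer list indexed by generation, inner list holding the object ids seen for that generation) is only an assumption suggested by the declared type of encounteredReferences.

```java
import java.util.ArrayList;

// Hypothetical stand-in for the bookkeeping behind haveSeen/markAsSeen.
class SeenTracker {
    // Assumed layout: index = generation number, inner list = ids seen in that generation.
    private final ArrayList<ArrayList<Integer>> encounteredReferences = new ArrayList<>();

    // Returns true if this generation/id pair was recorded earlier.
    boolean haveSeen(int generation, int id) {
        if (generation < 0 || generation >= encounteredReferences.size()) {
            return false; // no list exists for this generation, so nothing was recorded
        }
        return encounteredReferences.get(generation).contains(id);
    }

    // Records the pair, growing the outer list until the generation has a slot.
    void markAsSeen(int generation, int id) {
        if (generation < 0) {
            throw new IllegalArgumentException("generation must be non-negative");
        }
        while (encounteredReferences.size() <= generation) {
            encounteredReferences.add(new ArrayList<>());
        }
        if (!haveSeen(generation, id)) {
            encounteredReferences.get(generation).add(id);
        }
    }

    public static void main(String[] args) {
        SeenTracker tracker = new SeenTracker();
        System.out.println(tracker.haveSeen(0, 12)); // nothing recorded yet
        tracker.markAsSeen(0, 12);
        System.out.println(tracker.haveSeen(0, 12)); // recorded by the call above
    }
}
```

A traversal would call haveSeen before following an indirect reference and markAsSeen when it first visits the object, which is what keeps circular references from causing endless recursion.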
Field Detail |
---|
java.util.ArrayList<java.lang.String> foundURIs
java.util.ArrayList<java.util.ArrayList<java.lang.Integer>> encounteredReferences
com.lowagie.text.pdf.PdfReader documentReader
byte[] document
com.lowagie.text.pdf.PdfDictionary catalog
Constructor Detail |
---|
public PDFParser(java.lang.String doc) throws java.io.IOException
Throws: java.io.IOException
public PDFParser(byte[] doc) throws java.io.IOException
Throws: java.io.IOException
Method Detail |
---|
protected void resetState()
public void resetState(byte[] doc) throws java.io.IOException
Parameters: doc
Throws: java.io.IOException
public void resetState(java.lang.String doc) throws java.io.IOException
Parameters: doc
Throws: java.io.IOException
protected void getInFromFile(java.lang.String doc) throws java.io.IOException
Parameters: doc
Throws: java.io.IOException
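As a rough illustration of what getInFromFile is documented to do (read the file named by doc and keep its bytes for later processing), the following sketch uses only the standard library. The class FileBytes and the method readAll are hypothetical names, and the actual PDFParser implementation may read the file differently.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFParser.
class FileBytes {
    // Reads the whole file named by doc into memory; propagates IOException on failure.
    static byte[] readAll(String doc) throws IOException {
        return Files.readAllBytes(Paths.get(doc));
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny placeholder file so the example runs without a real PDF.
        Path tmp = Files.createTempFile("sample", ".pdf");
        Files.write(tmp, "%PDF".getBytes(StandardCharsets.US_ASCII));
        byte[] document = readAll(tmp.toString());
        System.out.println(document.length + " bytes read");
        Files.delete(tmp);
    }
}
```

The resulting byte array is the kind of value that the PDFParser(byte[] doc) constructor and resetState(byte[] doc) accept.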
protected boolean haveSeen(int generation, int id)
Parameters: generation, id
protected void markAsSeen(int generation, int id)
Parameters: generation, id
public java.util.ArrayList getURIs()
protected void initialize() throws java.io.IOException
Throws: java.io.IOException
public java.util.ArrayList extractURIs()
protected void extractURIs(com.lowagie.text.pdf.PdfObject entity)
Parameters: entity
public static void main(java.lang.String[] argv)
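The recursive search performed by extractURIs(PdfObject) can be pictured with a simplified model. The sketch below deliberately avoids the iText classes: plain java.util Map and List values stand in for PDF dictionaries and arrays, the class name UriWalk is hypothetical, and an identity-based visited set replaces the generation/id bookkeeping. It assumes that a link target is stored as a string under a "URI" key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model of a recursive URI search; not the PDFParser implementation.
class UriWalk {
    private final List<String> foundURIs = new ArrayList<>();
    // Identity-based set, so a container that refers back to itself is visited once.
    private final Set<Object> seen =
            Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());

    List<String> extractURIs(Object root) {
        walk(root);
        return foundURIs;
    }

    private void walk(Object entity) {
        if (entity instanceof Map) {
            if (!seen.add(entity)) {
                return; // already visited: stop here to avoid looping on cycles
            }
            for (Map.Entry<?, ?> e : ((Map<?, ?>) entity).entrySet()) {
                if ("URI".equals(e.getKey()) && e.getValue() instanceof String) {
                    foundURIs.add((String) e.getValue());
                } else {
                    walk(e.getValue());
                }
            }
        } else if (entity instanceof List) {
            if (!seen.add(entity)) {
                return;
            }
            for (Object child : (List<?>) entity) {
                walk(child);
            }
        }
        // Any other value (number, string, null) is a leaf and is ignored.
    }

    public static void main(String[] args) {
        Map<String, Object> action = new HashMap<>();
        action.put("URI", "http://example.com/");
        Map<String, Object> catalog = new HashMap<>();
        catalog.put("A", action);
        catalog.put("Parent", catalog); // deliberate cycle
        System.out.println(new UriWalk().extractURIs(catalog));
    }
}
```

The same shape (visit a container, check whether it was seen, collect matching entries, recurse into children) is what would be adapted when reusing the extraction logic for other PDF processing tasks.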