java.lang.Object
  org.archive.util.ms.PieceTable
class PieceTable
The piece table of a .doc file.
The piece table maps logical character positions in a document's text stream to actual file stream positions. It is stored as two parallel arrays. The first array contains 32-bit integers representing the logical character positions. The second array contains 64-bit data structures that are mostly mysterious to me, except that each contains a 32-bit subfile offset. The second array is stored immediately after the first. I call the first array the charPos array and the second the filePos array.
The arrays are preceded by a special tag byte (value 2) and then by the combined size of both arrays in bytes. The number of piece table entries is not stored explicitly; it must be deduced from this byte size.
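The deduction of the entry count from the byte size can be sketched as follows. This is an illustration only, not the code of this class. It assumes that the size field is a 32-bit little-endian integer and that, as in the published Word binary format, the charPos array holds one more value than there are pieces (n + 1 boundaries for n pieces). The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class PieceTableLayoutSketch {
    static final int TAG = 2;            // tag byte named in the class description
    static final int CHAR_POS_BYTES = 4; // 32-bit logical character position
    static final int FILE_POS_BYTES = 8; // 64-bit filePos structure

    /** Deduces the number of pieces from the combined byte size of both arrays. */
    static int entryCount(int sizeInBytes) {
        // Assumption: n + 1 charPos values followed by n filePos structures,
        // so sizeInBytes = 4 * (n + 1) + 8 * n.
        int rest = sizeInBytes - CHAR_POS_BYTES;
        if (rest < 0 || rest % (CHAR_POS_BYTES + FILE_POS_BYTES) != 0) {
            throw new IllegalArgumentException("unexpected piece table size: " + sizeInBytes);
        }
        return rest / (CHAR_POS_BYTES + FILE_POS_BYTES);
    }

    /** Byte offset of charPos[i], relative to the start of the first array. */
    static int charPosOffset(int i) {
        return i * CHAR_POS_BYTES;
    }

    /** Byte offset of filePos[i], relative to the start of the first array. */
    static int filePosOffset(int i, int count) {
        return (count + 1) * CHAR_POS_BYTES + i * FILE_POS_BYTES;
    }

    /** Reads a 32-bit little-endian integer (assumed encoding of the size field). */
    static int readLittleInt(InputStream in) throws IOException {
        int b0 = in.read(), b1 = in.read(), b2 = in.read(), b3 = in.read();
        if ((b0 | b1 | b2 | b3) < 0) {
            throw new EOFException();
        }
        return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24);
    }

    /** Checks the tag byte, then reads the size field and deduces the entry count. */
    static int readEntryCount(InputStream in) throws IOException {
        if (in.read() != TAG) {
            throw new IOException("piece table tag byte not found");
        }
        return entryCount(readLittleInt(in));
    }

    public static void main(String[] args) throws IOException {
        byte[] header = {2, 16, 0, 0, 0}; // tag, then size = 16 bytes
        int n = readEntryCount(new ByteArrayInputStream(header));
        System.out.println(n + " piece(s); filePos[0] starts "
                + filePosOffset(0, n) + " bytes into the table");
    }
}
```

Under these assumptions, the two halves of entry i sit at `charPosOffset(i)` and `filePosOffset(i, n)`, and the distance between them grows with the number of entries.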
Because of this bizarre structure, caching piece table entries is something of a challenge. A single piece table entry is actually located in two different file locations. If there are many piece table entries, the charPos and filePos information for one entry may be separated by many bytes, potentially crossing block boundaries. The approach I took was to use two different buffered streams. Up to n charPos offsets and n filePos structures can be buffered in the two streams, so that looking up piece information requires no file seeking. (File seeking must still occur to jump from one piece to the next.)
Note that the vast majority of .doc files in the world will have exactly one piece table entry, representing the complete text of the document. Only documents that were "fast-saved" should have multiple pieces.
Finally, the text in a .doc file can be stored either as 16-bit Unicode characters (charset UTF-16LE) or as 8-bit Cp1252 characters. One .doc file can contain both kinds of pieces. Whether a piece is Cp1252 is stored, bizarrely enough, as a flag in the filePos value. If the flag is set, the actual file position is the filePos with the flag cleared, then divided by 2.
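A minimal sketch of that decoding step is shown below. The constant names match the fields of this class, but their values are not given on this page; the sketch assumes the flag is bit 30 of the 32-bit filePos, as in the published Word binary format. The class and method names are hypothetical.

```java
final class FilePosSketch {
    // Assumed values: the real constants are package-private and not shown here.
    static final int CP1252_INDICATOR = 1 << 30;
    static final int CP1252_MASK = ~CP1252_INDICATOR;

    /** True if the flag bit marks the piece as 8-bit Cp1252 text. */
    static boolean isCp1252(int filePos) {
        return (filePos & CP1252_INDICATOR) != 0;
    }

    /** Clears the flag and halves the value for Cp1252 pieces; otherwise returns filePos as is. */
    static int actualFilePos(int filePos) {
        if (isCp1252(filePos)) {
            return (filePos & CP1252_MASK) / 2;
        }
        return filePos;
    }

    /** Name of the charset to decode the piece with. */
    static String charsetNameFor(int filePos) {
        return isCp1252(filePos) ? "windows-1252" : "UTF-16LE";
    }

    public static void main(String[] args) {
        int flagged = CP1252_INDICATOR | 2048;
        System.out.println(charsetNameFor(flagged) + " at " + actualFilePos(flagged));
        System.out.println(charsetNameFor(2048) + " at " + actualFilePos(2048));
    }
}
```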
Field Summary | |
---|---|
(package private) static int | CP1252_INDICATOR: The bit that indicates whether a piece uses Cp1252 or Unicode. |
(package private) static int | CP1252_MASK: The mask to use to clear the Cp1252 flag bit. |
(package private) static java.util.logging.Logger | LOGGER |
Constructor Summary |
---|
PieceTable(SeekInputStream tableStream, int offset, int maxCharPos, int cachedRecords): Constructor. |
Method Summary | |
---|---|
int | getMaxCharPos(): Returns the maximum character position. |
Piece | next(): Returns the next piece in the piece table. |
Piece | pieceFor(int charPos): Returns the piece containing the given character position. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
static final java.util.logging.Logger LOGGER
static final int CP1252_INDICATOR
static final int CP1252_MASK
Constructor Detail |
---|
public PieceTable(SeekInputStream tableStream, int offset, int maxCharPos, int cachedRecords) throws java.io.IOException
Parameters:
tableStream - the stream containing the piece table
offset - the starting offset of the piece table
maxCharPos - the total number of characters in the document
cachedRecords - the number of piece table entries to cache
Throws:
java.io.IOException - if an I/O error occurs
Method Detail |
---|
public int getMaxCharPos()
public Piece next() throws java.io.IOException
Throws:
java.io.IOException - if an I/O error occurs
public Piece pieceFor(int charPos) throws java.io.IOException
Parameters:
charPos - the character position whose piece to return
Throws:
java.io.IOException - if an I/O error occurs
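Conceptually, a lookup such as pieceFor(int charPos) amounts to finding the pair of adjacent charPos values that enclose the requested position. The sketch below shows that idea on an in-memory array. It is not how this class is implemented (the class reads through buffered streams), and the class and method names are hypothetical. It assumes the boundaries are strictly ascending and that there is one more boundary than there are pieces.

```java
import java.util.Arrays;

final class PieceLookupSketch {
    /**
     * Returns the index of the piece whose range [boundaries[i], boundaries[i + 1])
     * contains charPos, or -1 if charPos lies outside the table.
     */
    static int pieceIndexFor(int[] boundaries, int charPos) {
        int last = boundaries.length - 1;
        if (last < 1 || charPos < boundaries[0] || charPos >= boundaries[last]) {
            return -1;
        }
        int idx = Arrays.binarySearch(boundaries, charPos);
        // Exact hit: charPos is the first character of piece idx.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the containing
        // piece is the one just before the insertion point.
        return idx >= 0 ? idx : -idx - 2;
    }

    public static void main(String[] args) {
        int[] boundaries = {0, 10, 25}; // two pieces: [0, 10) and [10, 25)
        System.out.println(pieceIndexFor(boundaries, 9));
        System.out.println(pieceIndexFor(boundaries, 10));
        System.out.println(pieceIndexFor(boundaries, 25));
    }
}
```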