org.archive.util.ms
Class PieceTable

java.lang.Object
  extended by org.archive.util.ms.PieceTable

 class PieceTable
extends java.lang.Object

The piece table of a .doc file.

The piece table maps logical character positions in a document's text stream to actual positions in the file stream. The piece table is stored as two parallel arrays. The first array contains 32-bit integers representing the logical character positions. The second array contains 64-bit data structures that are mostly mysterious to me, except that each one contains a 32-bit subfile offset. The second array is stored immediately after the first. I call the first array the charPos array and the second array the filePos array.

The arrays are preceded by a special tag byte (2), followed by the combined size of both arrays in bytes. The number of piece table entries must be deduced from this byte size.
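
As a rough illustration of deducing the entry count from the byte size, the sketch below assumes, as in Word's binary format, that the charPos array holds one more value than there are pieces (the last value marking the end of the final piece), so n pieces occupy 4 * (n + 1) + 8 * n bytes. That extra value is an assumption not stated on this page, and the class and method names are hypothetical.

```java
// Hypothetical helper for illustration; not part of org.archive.util.ms.
final class PieceCountSketch {

    /** Size in bytes of one charPos value (a 32-bit integer). */
    static final int CHAR_POS_SIZE = 4;

    /** Size in bytes of one filePos structure (64 bits). */
    static final int FILE_POS_SIZE = 8;

    /**
     * Deduces the number of pieces from the combined byte size of both arrays.
     * ASSUMPTION: the charPos array has one more value than there are pieces.
     */
    static int entryCount(int sizeInBytes) {
        int withoutExtra = sizeInBytes - CHAR_POS_SIZE;
        int perPiece = CHAR_POS_SIZE + FILE_POS_SIZE;
        if (withoutExtra < 0 || withoutExtra % perPiece != 0) {
            throw new IllegalArgumentException("unexpected size: " + sizeInBytes);
        }
        return withoutExtra / perPiece;
    }

    public static void main(String[] args) {
        System.out.println(entryCount(16)); // one piece: 2 * 4 + 1 * 8 bytes
        System.out.println(entryCount(28)); // two pieces: 3 * 4 + 2 * 8 bytes
    }
}
```

Rejecting sizes that do not divide evenly is a design choice of this sketch; it makes a corrupt or misread size field fail early instead of yielding a wrong count.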

Because of this bizarre structure, caching piece table entries is something of a challenge. A single piece table entry is actually located in two different file locations. If there are many piece table entries, the charPos and filePos information for one entry may be separated by many bytes, potentially crossing block boundaries. The approach I took was to use two different buffered streams, one per array. Up to n charPos offsets and n filePos structures can be buffered in the two streams (n being the cachedRecords constructor argument), so that no file seeking occurs when looking up piece information. (File seeking must still occur to jump from one piece to the next.)
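
To see why one entry ends up in two places, the following sketch computes where the two halves of entry i would sit, given only what is stated above: 4-byte charPos values, 8-byte filePos structures, and the second array starting right after the first. The class, the method names, and the charPosCount parameter (the number of values in the first array) are hypothetical.

```java
// Hypothetical helper for illustration; not part of org.archive.util.ms.
final class ParallelArraySketch {

    /** Offset of the index-th charPos value, relative to the same base as arraysStart. */
    static long charPosOffset(long arraysStart, int index) {
        return arraysStart + 4L * index;
    }

    /** Offset of the index-th filePos structure, which follows the whole charPos array. */
    static long filePosOffset(long arraysStart, int charPosCount, int index) {
        return arraysStart + 4L * charPosCount + 8L * index;
    }

    public static void main(String[] args) {
        long start = 100;        // made-up starting offset of the arrays
        int charPosCount = 1000; // made-up size of the first array
        long a = charPosOffset(start, 3);
        long b = filePosOffset(start, charPosCount, 3);
        // The two halves of entry 3 are thousands of bytes apart.
        System.out.println(a + " " + b + " gap=" + (b - a));
    }
}
```

The gap between the two halves grows with the size of the first array, which is why a single buffer over one region is not enough and two buffered streams are used.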

Note that the vast majority of .doc files in the world will have exactly one piece table entry, representing the complete text of the document. Only documents that were "fast-saved" should have multiple pieces.

Finally, the text in a .doc file can be stored either as 16-bit Unicode characters (charset UTF-16LE) or as 8-bit Cp1252 characters. One .doc file can contain both kinds of pieces. Whether or not a piece is Cp1252 is stored as a flag in the filePos value, bizarrely enough. If the flag is set, the actual file position is the filePos value with the flag cleared and the result divided by 2.
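
The decoding rule above can be sketched as follows. The flag's bit position (bit 30) is an assumption taken from Word's binary format; the real values are the CP1252_INDICATOR and CP1252_MASK constants of this class, which this page does not list. The class, constant, and method names in the sketch are hypothetical.

```java
// Hypothetical helper for illustration; not part of org.archive.util.ms.
final class FilePosSketch {

    /** ASSUMPTION: bit 30 marks a Cp1252 piece. */
    static final int FLAG = 1 << 30;

    /** Mask that clears the assumed flag bit. */
    static final int CLEAR_FLAG = ~FLAG;

    /** Returns true if the raw filePos value has the Cp1252 flag set. */
    static boolean isCp1252(int rawFilePos) {
        return (rawFilePos & FLAG) != 0;
    }

    /** Applies the rule: clear the flag, then halve, but only for Cp1252 pieces. */
    static int actualFilePos(int rawFilePos) {
        if (isCp1252(rawFilePos)) {
            return (rawFilePos & CLEAR_FLAG) / 2;
        }
        return rawFilePos;
    }

    public static void main(String[] args) {
        int cp1252Raw = FLAG | 2048; // flagged piece
        int unicodeRaw = 2048;       // unflagged piece
        System.out.println(isCp1252(cp1252Raw) + " " + actualFilePos(cp1252Raw));
        System.out.println(isCp1252(unicodeRaw) + " " + actualFilePos(unicodeRaw));
    }
}
```

The sketch does not handle raw values with the top bit set; a real reader would need to decide how to treat them.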

Author:
pjack

Field Summary
(package private) static int CP1252_INDICATOR
          The bit that indicates whether a piece uses Cp1252 or Unicode (UTF-16LE).
(package private) static int CP1252_MASK
          The mask to use to clear the Cp1252 flag bit.
(package private) static java.util.logging.Logger LOGGER
           
 
Constructor Summary
PieceTable(SeekInputStream tableStream, int offset, int maxCharPos, int cachedRecords)
          Creates a piece table reader over the given table stream.
 
Method Summary
 int getMaxCharPos()
          Returns the maximum character position.
 Piece next()
          Returns the next piece in the piece table.
 Piece pieceFor(int charPos)
          Returns the piece containing the given character position.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOGGER

static final java.util.logging.Logger LOGGER

CP1252_INDICATOR

static final int CP1252_INDICATOR
The bit that indicates whether a piece uses Cp1252 or Unicode (UTF-16LE).

See Also:
Constant Field Values

CP1252_MASK

static final int CP1252_MASK
The mask to use to clear the Cp1252 flag bit.

See Also:
Constant Field Values
Constructor Detail

PieceTable

public PieceTable(SeekInputStream tableStream,
                  int offset,
                  int maxCharPos,
                  int cachedRecords)
           throws java.io.IOException
Creates a piece table reader over the given table stream.

Parameters:
tableStream - the stream containing the piece table
offset - the starting offset of the piece table
maxCharPos - the total number of characters in the document
cachedRecords - the number of piece table entries to cache
Throws:
java.io.IOException - if an IO error occurs
Method Detail

getMaxCharPos

public int getMaxCharPos()
Returns the maximum character position. Put another way, returns the total number of characters in the document.

Returns:
the maximum character position

next

public Piece next()
           throws java.io.IOException
Returns the next piece in the piece table.

Returns:
the next piece in the piece table, or null if there is no next piece
Throws:
java.io.IOException - if an IO error occurs

pieceFor

public Piece pieceFor(int charPos)
               throws java.io.IOException
Returns the piece containing the given character position.

Parameters:
charPos - the character position whose piece to return
Returns:
the piece containing charPos, or null if no such piece exists (that is, if charPos is greater than getMaxCharPos())
Throws:
java.io.IOException - if an IO error occurs


Copyright © 2003-2011 Internet Archive. All Rights Reserved.