org.archive.util.ms
Class DefaultBlockFileSystem

java.lang.Object
  extended by org.archive.util.ms.DefaultBlockFileSystem
All Implemented Interfaces:
BlockFileSystem

public class DefaultBlockFileSystem
extends java.lang.Object
implements BlockFileSystem

Default implementation of the Block File System.

The overall structure of a BlockFileSystem file (such as a .doc file) is as follows. The file is divided into blocks, which are of uniform length (512 bytes). The first block (at file pointer 0) is called the header block. It's used to look up other blocks in the file.

Subfiles contained within the .doc file are organized using a Block Allocation Table, or BAT. The BAT is basically a linked list; given a block number, the BAT will tell you the next block number. Note that the header block has no number; block #0 is the first block after the header. Thus, to convert a block number to a file pointer: int filePointer = (blockNumber + 1) * BLOCK_SIZE.
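
The conversion above can be sketched as a small helper. This is an illustration only, not code from this class: the name BlockPointers is hypothetical, and it uses long rather than int so that very large block numbers cannot overflow.

```java
/** Hypothetical helper illustrating the block-number-to-file-pointer rule. */
final class BlockPointers {
    /** Uniform block length described above. */
    static final int BLOCK_SIZE = 512;

    /**
     * Converts a block number to a file pointer. Block #0 is the first
     * block after the header, so one extra block is skipped.
     */
    static long filePointer(int blockNumber) {
        if (blockNumber < 0) {
            throw new IllegalArgumentException("negative block number: " + blockNumber);
        }
        return (blockNumber + 1L) * BLOCK_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(filePointer(0)); // 512, directly after the header block
        System.out.println(filePointer(3)); // 2048
    }
}
```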

The BAT itself is not stored contiguously, however. To find the blocks that comprise the BAT, you have to look in the header block. The header block contains an array of 109 pointers to the blocks that comprise the BAT. If more than 109 BAT blocks are required (in other words, if the .doc file is larger than ~6 megabytes), then something called the XBAT comes into play.
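
As a rough check of the size threshold, the arithmetic can be worked through as below. The sketch assumes each BAT entry is a 4-byte integer, so that one 512-byte BAT block describes 128 blocks; that entry size is an assumption not stated on this page, and the class name BatCapacity is hypothetical.

```java
/** Hypothetical arithmetic sketch: how much file 109 BAT blocks can describe. */
final class BatCapacity {
    static final int BLOCK_SIZE = 512;
    /** Number of BAT block pointers held in the header block. */
    static final int HEADER_BAT_POINTERS = 109;
    /** Assumption: each BAT entry is a 4-byte block number. */
    static final int BAT_ENTRY_SIZE = 4;

    /** Total bytes covered by the blocks that 109 BAT blocks can track. */
    static long maxBytesWithoutXbat() {
        long entriesPerBatBlock = BLOCK_SIZE / BAT_ENTRY_SIZE;             // 128
        long trackedBlocks = HEADER_BAT_POINTERS * entriesPerBatBlock;     // 13952
        return trackedBlocks * BLOCK_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(maxBytesWithoutXbat()); // 7143424
    }
}
```

Under that assumption the limit is 7,143,424 bytes, which is in line with the approximate figure given above. The tracked blocks include the BAT blocks themselves, so the usable space is somewhat smaller.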

XBAT blocks contain pointers to the 110th BAT block and beyond. The first XBAT block is stored at a file pointer listed in the header. The other XBAT blocks are always stored in order after the first; the XBAT table is contiguous. One is inclined to wonder why the BAT itself is not so stored, but oh well.

The BAT only tells you the next block for a given block. To find the first block for a subfile, you have to look up that subfile's directory entry. Each directory entry is a 128-byte structure in the file, so four of them fit in a block. The number of the first block of the entry list is stored in the header. To find subsequent entry blocks, the BAT must be used.
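
The entry lookup described above might be sketched as follows. Nothing here is taken from this class: EntryLocator and entryFilePointer are hypothetical names, and the BAT is stood in for by a plain function that maps a block number to its successor (negative meaning no successor).

```java
import java.util.function.IntUnaryOperator;

/** Hypothetical sketch: locating a 128-byte directory entry by number. */
final class EntryLocator {
    static final int BLOCK_SIZE = 512;
    static final int ENTRY_SIZE = 128;
    static final int ENTRIES_PER_BLOCK = BLOCK_SIZE / ENTRY_SIZE; // 4

    /**
     * Returns the file pointer of the given entry, or -1 if the chain of
     * entry blocks ends before that entry is reached.
     *
     * @param firstEntryBlock block number of the first entry block (from the header)
     * @param entryNumber     zero-based entry number
     * @param nextBlock       stand-in for a BAT lookup; negative means "no next block"
     */
    static long entryFilePointer(int firstEntryBlock, int entryNumber,
                                 IntUnaryOperator nextBlock) {
        if (firstEntryBlock < 0 || entryNumber < 0) {
            return -1;
        }
        int block = firstEntryBlock;
        // Follow the chain once for every full block of entries to skip.
        for (int i = 0; i < entryNumber / ENTRIES_PER_BLOCK; i++) {
            block = nextBlock.applyAsInt(block);
            if (block < 0) {
                return -1;
            }
        }
        long blockStart = (block + 1L) * BLOCK_SIZE; // skip the header block
        return blockStart + (long) (entryNumber % ENTRIES_PER_BLOCK) * ENTRY_SIZE;
    }

    public static void main(String[] args) {
        // Made-up chain for illustration: block 10 is followed by block 20, then ends.
        IntUnaryOperator bat = b -> b == 10 ? 20 : -1;
        System.out.println(entryFilePointer(10, 0, bat)); // 5632
        System.out.println(entryFilePointer(10, 5, bat)); // 10880
        System.out.println(entryFilePointer(10, 8, bat)); // -1
    }
}
```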

I'm telling you all this so that you understand the caching that this class provides.

First, directory entries are not cached. It's assumed that they will be looked up at the beginning of a lengthy operation, and then forgotten about. This is certainly the case for Doc.getText(BlockFileSystem). If you need to remember directory entries, you can manually store the Entry objects in a map or similar structure, as they don't grow stale.

This class keeps all 512 bytes of the header block in memory at all times. This prevents a potentially expensive file pointer repositioning every time you're trying to figure out what comes next.

BAT and XBAT blocks are stored in a least-recently-used (LRU) cache. The n most recently used BAT and XBAT blocks are remembered, where n is set at construction time. The minimum value of n is 1. For small files, this can prevent file pointer repositioning for BAT lookups.
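
For readers unfamiliar with the technique, a least-recently-used block cache can be sketched with java.util.LinkedHashMap in access order. This is not necessarily how this class implements its cache; BlockLruCache is a hypothetical name used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache mapping block numbers to block contents. */
final class BlockLruCache extends LinkedHashMap<Integer, byte[]> {
    private static final long serialVersionUID = 1L;

    private final int maxBlocks;

    BlockLruCache(int maxBlocks) {
        // accessOrder = true: iteration order runs from least to most recently used.
        super(16, 0.75f, true);
        if (maxBlocks < 1) {
            throw new IllegalArgumentException("cache size must be at least 1");
        }
        this.maxBlocks = maxBlocks;
    }

    /** Evicts the least recently used block once the limit is exceeded. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<Integer, byte[]> eldest) {
        return size() > maxBlocks;
    }

    public static void main(String[] args) {
        BlockLruCache cache = new BlockLruCache(2);
        cache.put(1, new byte[512]);
        cache.put(2, new byte[512]);
        cache.get(1);                 // block 1 becomes the most recently used
        cache.put(3, new byte[512]);  // evicts block 2
        System.out.println(cache.keySet()); // [1, 3]
    }
}
```

A map like this holds only the blocks actually inserted, so a large configured limit costs nothing for a file with few BAT blocks.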

The BAT/XBAT cache only takes up memory as needed. If the specified cache size is 100 blocks, but the file only has 4 BAT blocks, then only 2048 bytes will be used by the cache.

Note that this class only caches BAT and XBAT blocks. It does not cache the blocks that actually make up a subfile's contents. It is assumed that those blocks will only be accessed once per operation (again, this is what Doc.getText(BlockFileSystem) typically requires).

Author:
pjack
See Also:
http://jakarta.apache.org/poi/poifs/fileformat.html

Field Summary
 
Fields inherited from interface org.archive.util.ms.BlockFileSystem
BLOCK_SIZE
 
Constructor Summary
DefaultBlockFileSystem(SeekInputStream input, int batCacheSize)
          Creates a block file system that reads from the given input and caches up to batCacheSize BAT and XBAT blocks.
 
Method Summary
(package private)  Entry getEntry(int entryNumber)
          Returns the entry with the given number.
 int getNextBlock(int block)
          Returns the number of the block that follows the given block.
 SeekInputStream getRawInput()
          Returns the raw input stream for this file system.
 Entry getRoot()
          Returns the root entry of the file system.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DefaultBlockFileSystem

public DefaultBlockFileSystem(SeekInputStream input,
                              int batCacheSize)
                       throws java.io.IOException
Creates a block file system that reads from the given input and caches up to batCacheSize BAT and XBAT blocks.

Parameters:
input - the file to read from
batCacheSize - number of BAT and XBAT blocks to cache
Throws:
java.io.IOException - if an IO error occurs
Method Detail

getRoot

public Entry getRoot()
              throws java.io.IOException
Description copied from interface: BlockFileSystem
Returns the root entry of the file system. Subfiles and directories can be found by searching the returned entry.

Specified by:
getRoot in interface BlockFileSystem
Returns:
the root entry
Throws:
java.io.IOException - if an IO error occurs

getEntry

Entry getEntry(int entryNumber)
         throws java.io.IOException
Returns the entry with the given number.

Parameters:
entryNumber - the number of the entry to return
Returns:
that entry, or null if no such entry exists
Throws:
java.io.IOException - if an IO error occurs

getNextBlock

public int getNextBlock(int block)
                 throws java.io.IOException
Description copied from interface: BlockFileSystem
Returns the number of the block that follows the given block. The internal block allocation tables are consulted to determine the next block. A return value that is less than zero indicates that there is no next block.

Specified by:
getNextBlock in interface BlockFileSystem
Parameters:
block - the number of the block whose successor to return
Returns:
the successor of that block
Throws:
java.io.IOException - if an IO error occurs
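
A caller would typically use getNextBlock to walk the whole chain of blocks belonging to a subfile. The sketch below shows one way to do that. To stay self-contained it declares a hypothetical one-method interface, NextBlockLookup, with the same signature as getNextBlock; a BlockFileSystem could be passed as fs::getNextBlock. The maxBlocks guard is an added safety measure against corrupt files with cyclic chains, not something this page describes.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: collecting a block chain by repeated successor lookups. */
final class ChainWalker {
    /** Stand-in with the same shape as BlockFileSystem.getNextBlock(int). */
    interface NextBlockLookup {
        int getNextBlock(int block) throws IOException;
    }

    /**
     * Returns the block numbers in the chain that starts at firstBlock.
     * A negative successor ends the chain, as documented for getNextBlock.
     */
    static List<Integer> chain(NextBlockLookup fs, int firstBlock, int maxBlocks)
            throws IOException {
        List<Integer> blocks = new ArrayList<>();
        int block = firstBlock;
        while (block >= 0) {
            if (blocks.size() >= maxBlocks) {
                throw new IOException("chain exceeds " + maxBlocks + " blocks; possible cycle");
            }
            blocks.add(block);
            block = fs.getNextBlock(block);
        }
        return blocks;
    }

    public static void main(String[] args) throws IOException {
        // Made-up allocation table for illustration: 4 -> 9 -> 2 -> end of chain.
        NextBlockLookup fake = b -> b == 4 ? 9 : (b == 9 ? 2 : -1);
        System.out.println(chain(fake, 4, 100)); // [4, 9, 2]
    }
}
```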

getRawInput

public SeekInputStream getRawInput()
Description copied from interface: BlockFileSystem
Returns the raw input stream for this file system. Typically this will be the random access file containing the .doc.

Specified by:
getRawInput in interface BlockFileSystem
Returns:
the raw input stream for this file system


Copyright © 2003-2011 Internet Archive. All Rights Reserved.