java.lang.Object org.archive.util.ms.DefaultBlockFileSystem
public class DefaultBlockFileSystem
Default implementation of the Block File System.
The overall structure of a BlockFileSystem file (such as a .doc file) is as follows. The file is divided into blocks, which are of uniform length (512 bytes). The first block (at file pointer 0) is called the header block. It's used to look up other blocks in the file.
Subfiles contained within the .doc file are organized using a Block
Allocation Table, or BAT. The BAT is basically a linked list; given a
block number, the BAT will tell you the next block number. Note that
the header block has no number; block #0 is the first block after the
header. Thus, to convert a block number to a file pointer:
int filePointer = (blockNumber + 1) * BLOCK_SIZE
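As a rough illustration of the linked-list behaviour described above, the following sketch walks a chain of blocks and converts each block number to a file pointer. The class name BlockChainSketch, the toy BAT array, and the use of an IntUnaryOperator in place of getNextBlock are all hypothetical. It also assumes that the BAT marks the end of a chain with a negative value (-2 in the OLE2 format), which this page does not state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

class BlockChainSketch {
    static final int BLOCK_SIZE = 512;
    // Assumption: a negative BAT entry (-2 in OLE2) terminates a chain.
    static final int END_OF_CHAIN = -2;

    /** Converts a block number to a file pointer; the header occupies the first 512 bytes. */
    static long filePointer(int blockNumber) {
        return (blockNumber + 1L) * BLOCK_SIZE;
    }

    /**
     * Follows the chain starting at firstBlock and collects each block's file pointer.
     * nextBlock stands in for a BAT lookup such as getNextBlock(int).
     * Note: a corrupt, cyclic BAT would make this loop forever; real code should guard against that.
     */
    static List<Long> chainPointers(int firstBlock, IntUnaryOperator nextBlock) {
        List<Long> pointers = new ArrayList<>();
        for (int block = firstBlock; block >= 0; block = nextBlock.applyAsInt(block)) {
            pointers.add(filePointer(block));
        }
        return pointers;
    }

    public static void main(String[] args) {
        // Toy BAT: block 0 -> 3 -> 1 -> end of chain. Block 2 is unused.
        int[] bat = {3, END_OF_CHAIN, -1, 1};
        System.out.println(chainPointers(0, b -> bat[b])); // prints [512, 2048, 1024]
    }
}
```

The arithmetic is done in long so that large block numbers do not overflow an int file pointer.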
The BAT itself is not stored contiguously, however. To find the blocks that make up the BAT, you have to look in the header block, which contains an array of 109 pointers to BAT blocks. If more than 109 BAT blocks are required (in other words, if the .doc file is larger than roughly 7 megabytes), then something called the XBAT comes into play.
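The sketch below shows one way the header lookup could work, and where the size limit comes from. It is not this class's actual code. It assumes details taken from the OLE2 format rather than from this page: BAT entries are 4-byte little-endian integers (so 128 entries fit in a block), and the array of 109 pointers starts at byte offset 76 of the header. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class BatLookupSketch {
    static final int BLOCK_SIZE = 512;
    static final int ENTRIES_PER_BAT_BLOCK = BLOCK_SIZE / 4; // assumption: 4-byte entries
    static final int HEADER_BAT_COUNT = 109;
    static final int HEADER_BAT_OFFSET = 76; // assumption: offset of the pointer array in the header

    /** Which BAT block (0-based, in BAT order) holds the entry for the given block. */
    static int batIndex(int block) {
        return block / ENTRIES_PER_BAT_BLOCK;
    }

    /** Byte offset of that entry inside its BAT block. */
    static int offsetInBatBlock(int block) {
        return (block % ENTRIES_PER_BAT_BLOCK) * 4;
    }

    /** Reads the block number of a BAT block from the in-memory header. */
    static int batBlockNumber(ByteBuffer header, int batIndex) {
        if (batIndex < 0 || batIndex >= HEADER_BAT_COUNT) {
            throw new IllegalArgumentException("BAT block " + batIndex + " is listed in the XBAT");
        }
        return header.getInt(HEADER_BAT_OFFSET + batIndex * 4);
    }

    /** Largest amount of block data that 109 BAT blocks can describe. */
    static long maxBytesWithoutXbat() {
        return (long) HEADER_BAT_COUNT * ENTRIES_PER_BAT_BLOCK * BLOCK_SIZE;
    }

    public static void main(String[] args) {
        ByteBuffer header = ByteBuffer.allocate(BLOCK_SIZE).order(ByteOrder.LITTLE_ENDIAN);
        header.putInt(HEADER_BAT_OFFSET + 4, 42); // pretend the second BAT block is block #42

        int block = 200;
        System.out.println(batIndex(block));                         // prints 1
        System.out.println(offsetInBatBlock(block));                 // prints 288
        System.out.println(batBlockNumber(header, batIndex(block))); // prints 42
        System.out.println(maxBytesWithoutXbat());                   // prints 7143424
    }
}
```

Under these assumptions the XBAT is needed once the file holds more than 7,143,424 bytes of blocks, or about 6.8 MiB.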
XBAT blocks contain pointers to the 110th BAT block and beyond. The first XBAT block is stored at a file pointer listed in the header. The other XBAT blocks are always stored in order after the first; the XBAT table is contiguous. One is inclined to wonder why the BAT itself is not so stored, but oh well.
The BAT only tells you the next block for a given block. To find the first block for a subfile, you have to look up that subfile's directory entry. Each directory entry is a 128 byte structure in the file, so four of them fit in a block. The number of the first block of the entry list is stored in the header. To find subsequent entry blocks, the BAT must be used.
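To make the directory arithmetic concrete, here is a minimal sketch of how an entry number could be turned into a file pointer. The class EntryLocationSketch and its methods are hypothetical, and the IntUnaryOperator stands in for a BAT lookup such as getNextBlock. It is not necessarily how getEntry is implemented.

```java
import java.util.function.IntUnaryOperator;

class EntryLocationSketch {
    static final int BLOCK_SIZE = 512;
    static final int ENTRY_SIZE = 128;
    static final int ENTRIES_PER_BLOCK = BLOCK_SIZE / ENTRY_SIZE; // 4

    /** Number of BAT hops needed from the first entry block to reach the entry's block. */
    static int hops(int entryNumber) {
        return entryNumber / ENTRIES_PER_BLOCK;
    }

    /** Byte offset of the entry within its block. */
    static int offsetInBlock(int entryNumber) {
        return (entryNumber % ENTRIES_PER_BLOCK) * ENTRY_SIZE;
    }

    /** File pointer of the entry, given the first entry block (from the header) and a BAT lookup. */
    static long entryFilePointer(int entryNumber, int firstEntryBlock, IntUnaryOperator nextBlock) {
        int block = firstEntryBlock;
        for (int i = 0; i < hops(entryNumber); i++) {
            block = nextBlock.applyAsInt(block);
        }
        // Block #0 is the first block after the header, hence the +1.
        return (block + 1L) * BLOCK_SIZE + offsetInBlock(entryNumber);
    }

    public static void main(String[] args) {
        // Toy BAT in which entry block 2 is followed by entry block 5.
        int[] bat = {-1, -1, 5, -1, -1, -2};
        // Entry 6 is the third entry (offset 256) of the second entry block (block 5).
        System.out.println(entryFilePointer(6, 2, b -> bat[b])); // prints 3328
    }
}
```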
I'm telling you all this so that you understand the caching that this class provides.
First, directory entries are not cached. It's assumed that they will
be looked up at the beginning of a lengthy operation, and then forgotten
about. This is certainly the case for Doc.getText(BlockFileSystem).
If you need to remember directory entries, you can manually store the Entry
objects in a map or something, as they don't grow stale.
This class keeps all 512 bytes of the header block in memory at all times. This prevents a potentially expensive file pointer repositioning every time you're trying to figure out what comes next.
BAT and XBAT blocks are stored in a least-recently-used (LRU) cache. The n most recently used BAT and XBAT blocks are remembered, where n is set at construction time. The minimum value of n is 1. For small files, this can prevent file pointer repositioning for BAT lookups.
The BAT/XBAT cache only takes up memory as needed. If the specified cache size is 100 blocks, but the file only has 4 BAT blocks, then only 2048 bytes will be used by the cache.
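The caching behaviour described above can be sketched with java.util.LinkedHashMap in access order. This is only an illustration of the technique, not necessarily how this class implements its cache. The class name BatCacheSketch is hypothetical, and clamping the size to a minimum of 1 is an assumption about how the stated minimum might be enforced.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Maps a block number to that block's 512 bytes, evicting the least recently used block. */
class BatCacheSketch extends LinkedHashMap<Integer, byte[]> {
    private static final long serialVersionUID = 1L;
    private final int maxBlocks;

    BatCacheSketch(int maxBlocks) {
        super(16, 0.75f, true); // true = iterate in access order, oldest access first
        this.maxBlocks = Math.max(1, maxBlocks); // assumption: sizes below 1 are clamped to 1
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Integer, byte[]> eldest) {
        // Called after each insertion; returning true drops the least recently used block.
        return size() > maxBlocks;
    }

    public static void main(String[] args) {
        BatCacheSketch cache = new BatCacheSketch(2);
        cache.put(0, new byte[512]);
        cache.put(1, new byte[512]);
        cache.get(0);                // block 0 is now the most recently used
        cache.put(2, new byte[512]); // evicts block 1
        System.out.println(cache.keySet()); // prints [0, 2]
    }
}
```

Because blocks are only allocated when they are inserted, a cache sized for 100 blocks that has seen just 4 BAT blocks holds 4 * 512 = 2048 bytes of block data.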
Note that this class only caches BAT and XBAT blocks. It does not cache the blocks that actually make up a subfile's contents. It is assumed that those blocks will only be accessed once per operation (again, this is what Doc.getText(BlockFileSystem) typically requires).
See Also: http://jakarta.apache.org/poi/poifs/fileformat.html
Field Summary |
---|
Fields inherited from interface org.archive.util.ms.BlockFileSystem |
---|
BLOCK_SIZE |
Constructor Summary | |
---|---|
DefaultBlockFileSystem(SeekInputStream input, int batCacheSize) | Constructor. |
Method Summary | |
---|---|
(package private) Entry | getEntry(int entryNumber): Returns the entry with the given number. |
int | getNextBlock(int block): Returns the number of the block that follows the given block. |
SeekInputStream | getRawInput(): Returns the raw input stream for this file system. |
Entry | getRoot(): Returns the root entry of the file system. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public DefaultBlockFileSystem(SeekInputStream input, int batCacheSize) throws java.io.IOException
Parameters:
input - the file to read from
batCacheSize - number of BAT and XBAT blocks to cache
Throws:
java.io.IOException - if an IO error occurs
Method Detail |
---|
public Entry getRoot() throws java.io.IOException
Returns the root entry of the file system.
Specified by: getRoot in interface BlockFileSystem
Throws:
java.io.IOException - if an IO error occurs

Entry getEntry(int entryNumber) throws java.io.IOException
Returns the entry with the given number.
Parameters:
entryNumber - the number of the entry to return
Throws:
java.io.IOException - if an IO error occurs

public int getNextBlock(int block) throws java.io.IOException
Returns the number of the block that follows the given block.
Specified by: getNextBlock in interface BlockFileSystem
Parameters:
block - the number of the block whose successor to return
Throws:
java.io.IOException - if an IO error occurs

public SeekInputStream getRawInput()
Returns the raw input stream for this file system.
Specified by: getRawInput in interface BlockFileSystem