DefaultBlockFileSystem (Heritrix 1.15.5-201106092337)

org.archive.util.ms
Class DefaultBlockFileSystem

java.lang.Object
  org.archive.util.ms.DefaultBlockFileSystem

All Implemented Interfaces: BlockFileSystem

public class DefaultBlockFileSystem
extends java.lang.Object
implements BlockFileSystem

Default implementation of the Block File System.

The overall structure of a BlockFileSystem file (such as a .doc file) is as follows. The file is divided into blocks, which are of uniform length (512 bytes). The first block (at file pointer 0) is called the header block. It's used to look up other blocks in the file.

Subfiles contained within the .doc file are organized using a Block Allocation Table, or BAT. The BAT is basically a linked list; given a block number, the BAT will tell you the next block number. Note that the header block has no number; block #0 is the first block after the header. Thus, to convert a block number to a file pointer: int filePointer = (blockNumber + 1) * BLOCK_SIZE.
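As a rough illustration of the conversion and of following the BAT like a linked list, here is a minimal sketch. It is not this class's implementation: the in-memory `int[]` table, the class name `BatChainSketch`, and the `END_OF_CHAIN` marker value of -2 are assumptions made for the example, and the file pointer is computed as a `long` to avoid overflow on large files.

```java
import java.util.ArrayList;
import java.util.List;

public class BatChainSketch {
    static final int BLOCK_SIZE = 512;
    // Assumed end-of-chain marker for this sketch; any negative value stops the walk.
    static final int END_OF_CHAIN = -2;

    // The header block has no number, so block #0 starts at byte 512.
    static long filePointer(int blockNumber) {
        return (blockNumber + 1L) * BLOCK_SIZE;
    }

    // Follows the chain that starts at firstBlock: bat[n] is the block after n.
    static List<Integer> chain(int[] bat, int firstBlock) {
        List<Integer> blocks = new ArrayList<>();
        int block = firstBlock;
        while (block >= 0) {
            if (blocks.size() > bat.length) {
                throw new IllegalStateException("cycle in BAT chain");
            }
            blocks.add(block);
            block = bat[block];
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Hypothetical table: block 0 -> block 2 -> block 1 -> end of chain.
        int[] bat = {2, END_OF_CHAIN, 1, END_OF_CHAIN};
        for (int block : chain(bat, 0)) {
            System.out.println("block " + block + " at file pointer " + filePointer(block));
        }
    }
}
```

In the real file the table is spread over several BAT blocks rather than held in one array, which is why the lookups described below can require repositioning the file pointer.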

The BAT itself is not stored contiguously, however. To find the blocks that comprise the BAT, you have to look in the header block. The header block contains an array of 109 pointers to the blocks that comprise the BAT. If more than 109 BAT blocks are required (in other words, if the .doc file is larger than ~6 megabytes), then something called the XBAT comes into play.
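The size threshold can be checked with a little arithmetic. The sketch below assumes each BAT entry is a 4-byte integer, which is not stated on this page; the class and constant names are hypothetical.

```java
public class BatCapacitySketch {
    static final int BLOCK_SIZE = 512;
    static final int HEADER_BAT_POINTERS = 109;
    // Assumption for this sketch: one BAT entry occupies 4 bytes.
    static final int BYTES_PER_BAT_ENTRY = 4;

    // Largest amount of block data addressable using only the header's BAT pointers.
    static long maxBytesWithoutXbat() {
        int entriesPerBatBlock = BLOCK_SIZE / BYTES_PER_BAT_ENTRY;
        long addressableBlocks = (long) HEADER_BAT_POINTERS * entriesPerBatBlock;
        return addressableBlocks * BLOCK_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(maxBytesWithoutXbat() + " bytes");
    }
}
```

Under that assumption, 109 BAT blocks of 128 entries each cover 13,952 blocks, or 7,143,424 bytes (about 6.8 MiB), in line with the rough figure given above.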

XBAT blocks contain pointers to the 110th BAT block and beyond. The first XBAT block is stored at a file pointer listed in the header. The other XBAT blocks are always stored in order after the first; the XBAT table is contiguous. One is inclined to wonder why the BAT itself is not so stored, but oh well.

The BAT only tells you the next block for a given block. To find the first block for a subfile, you have to look up that subfile's directory entry. Each directory entry is a 128-byte structure in the file, so four of them fit in a block. The number of the first block of the entry list is stored in the header. To find subsequent entry blocks, the BAT must be used.
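To make the entry lookup concrete, the following sketch computes where a directory entry would sit in the file, given the first entry block from the header. It is illustrative only: the in-memory `int[]` table and the names `EntryLocationSketch` and `entryFilePointer` are assumptions, not part of this class.

```java
public class EntryLocationSketch {
    static final int BLOCK_SIZE = 512;
    static final int ENTRY_SIZE = 128;
    static final int ENTRIES_PER_BLOCK = BLOCK_SIZE / ENTRY_SIZE; // four per block

    // bat[n] is the block that follows block n; a negative value ends the chain.
    static long entryFilePointer(int[] bat, int firstEntryBlock, int entryNumber) {
        int block = firstEntryBlock;
        // Hop through the BAT once for every full block of entries to skip.
        for (int i = 0; i < entryNumber / ENTRIES_PER_BLOCK; i++) {
            block = bat[block];
            if (block < 0) {
                throw new IllegalArgumentException("no entry " + entryNumber);
            }
        }
        long blockStart = (block + 1L) * BLOCK_SIZE; // skip the unnumbered header block
        return blockStart + (long) (entryNumber % ENTRIES_PER_BLOCK) * ENTRY_SIZE;
    }

    public static void main(String[] args) {
        // Hypothetical layout: entries start in block 0 and continue in block 3.
        int[] bat = {3, -2, -2, -2};
        System.out.println(entryFilePointer(bat, 0, 5));
    }
}
```

Each hop in the loop is a BAT lookup, so a high entry number can touch several BAT blocks before the entry itself is read.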

I'm telling you all this so that you understand the caching that this class provides.

First, directory entries are not cached. It's assumed that they will be looked up at the beginning of a lengthy operation, and then forgotten about. This is certainly the case for Doc#getText(BlockFileSystem). If you need to remember directory entries, you can manually store the Entry objects in a map or something, as they don't grow stale.

This class keeps all 512 bytes of the header block in memory at all times. This prevents a potentially expensive file pointer repositioning every time you're trying to figure out what comes next.

BAT and XBAT blocks are stored in a least-recently-used (LRU) cache. The n most recently used BAT and XBAT blocks are remembered, where n is set at construction time. The minimum value of n is 1. For small files, this can prevent file pointer repositioning for BAT lookups.

The BAT/XBAT cache only takes up memory as needed. If the specified cache size is 100 blocks, but the file only has 4 BAT blocks, then only 2048 bytes will be used by the cache.
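One common way to get this behavior in Java is a `LinkedHashMap` in access order. The sketch below shows the general technique only; whether this class is built the same way is not stated here, and the name `BatLruCacheSketch` is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BatLruCacheSketch extends LinkedHashMap<Integer, byte[]> {
    private final int maxBlocks;

    public BatLruCacheSketch(int maxBlocks) {
        super(16, 0.75f, true); // true = iterate in access order, oldest access first
        if (maxBlocks < 1) {
            throw new IllegalArgumentException("cache must hold at least 1 block");
        }
        this.maxBlocks = maxBlocks;
    }

    // Called by put(); evicts the least recently used block once the cap is exceeded.
    @Override
    protected boolean removeEldestEntry(Map.Entry<Integer, byte[]> eldest) {
        return size() > maxBlocks;
    }

    public static void main(String[] args) {
        BatLruCacheSketch cache = new BatLruCacheSketch(2);
        cache.put(1, new byte[512]);
        cache.put(2, new byte[512]);
        cache.get(1);                // block 1 becomes the most recently used
        cache.put(3, new byte[512]); // evicts block 2
        System.out.println(cache.keySet());
    }
}
```

Because entries are only created when a block is first read, a map like this holds memory for the blocks actually seen, not for the full configured capacity.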

Note that this class only caches BAT and XBAT blocks. It does not cache the blocks that actually make up a subfile's contents. It is assumed that those blocks will only be accessed once per operation (again, this is what Doc#getText(BlockFileSystem) typically requires).

Author: pjack
See Also: http://jakarta.apache.org/poi/poifs/fileformat.html

Field Summary

Fields inherited from interface org.archive.util.ms.BlockFileSystem
`BLOCK_SIZE`

Constructor Summary
`DefaultBlockFileSystem(SeekInputStream input, int batCacheSize)` Constructor.

Method Summary
`(package private) Entry`	`getEntry(int entryNumber)` Returns the entry with the given number.
`int`	`getNextBlock(int block)` Returns the number of the block that follows the given block.
`SeekInputStream`	`getRawInput()` Returns the raw input stream for this file system.
`Entry`	`getRoot()` Returns the root entry of the file system.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail