org.archive.io
Class Latin1ByteReplayCharSequence

java.lang.Object
  extended by org.archive.io.Latin1ByteReplayCharSequence
All Implemented Interfaces:
java.lang.CharSequence, ReplayCharSequence

 class Latin1ByteReplayCharSequence
extends java.lang.Object
implements ReplayCharSequence

Provides a (Replay)CharSequence view on recorded stream bytes (a prefix buffer and an overflow backing file). Assumes the byte stream is ISO-8859-1 text, taking advantage of the fact that each byte in the stream corresponds to a single Unicode character with the same numerical value as the byte.
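
The byte-to-character correspondence described above can be illustrated with a minimal sketch. The class name Latin1Demo and the helper toChar are hypothetical and not part of this API; the only point is that masking a signed Java byte with 0xFF yields the matching Unicode code point.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of org.archive.io.
class Latin1Demo {
    // Widen one ISO-8859-1 byte to the char with the same numeric value.
    // The mask is needed because Java bytes are signed.
    static char toChar(byte b) {
        return (char) (b & 0xFF);
    }

    public static void main(String[] args) {
        byte[] raw = { 0x48, 0x69, (byte) 0xE9 }; // 'H', 'i', U+00E9 in ISO-8859-1
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            sb.append(toChar(b));
        }
        // The JDK's own Latin-1 decoder gives the same string.
        String viaCharset = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(sb.toString().equals(viaCharset)); // prints "true"
    }
}
```

Because no multi-byte decoding is involved, a byte offset in the stream is also the character index, which is what makes random access by index cheap for this encoding.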

Uses a wraparound rolling buffer of the last windowSize bytes read from disk in memory; as long as the 'random access' of a CharSequence user stays within this window, access should remain fairly efficient. (So design any regexps pointed at these CharSequences to work within that range!)

When rereading of a location is necessary, the whole window is recentered around the location requested. (TODO: More research into whether this is the best strategy.)
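
The recentering idea can be sketched as follows. This is an illustration under simplifying assumptions, not the class's actual code: the hypothetical RecenteringWindow uses a byte array as a stand-in for the backing file, a plain (non-wraparound) window, and a reloads counter added only to make the behaviour observable.

```java
// Hypothetical sketch of a window that recenters on a miss.
class RecenteringWindow {
    private final byte[] source;  // stand-in for the backing file (assumption)
    private final byte[] window;  // in-memory window of recently loaded bytes
    private int windowStart = 0;  // source offset of window[0]
    private int windowLen = 0;    // number of valid bytes currently in window
    int reloads = 0;              // how many times the window was refilled

    RecenteringWindow(byte[] source, int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.source = source;
        this.window = new byte[windowSize];
    }

    char charAt(int index) {
        if (index < 0 || index >= source.length) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        // A miss on either side of the window triggers a recenter.
        if (index < windowStart || index >= windowStart + windowLen) {
            recenter(index);
        }
        return (char) (window[index - windowStart] & 0xFF);
    }

    // Reload the window so that the requested index sits near its middle.
    private void recenter(int index) {
        int start = Math.max(0, index - window.length / 2);
        int len = Math.min(window.length, source.length - start);
        System.arraycopy(source, start, window, 0, len);
        windowStart = start;
        windowLen = len;
        reloads++;
    }

    public static void main(String[] args) {
        byte[] data = new byte[100];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        RecenteringWindow w = new RecenteringWindow(data, 16);
        w.charAt(50); // first access loads the window around index 50
        w.charAt(45); // still inside the window, no reload
        System.out.println(w.reloads);
    }
}
```

Centering leaves roughly half the window behind the requested position, which favours callers that backtrack a little (as regex matching does) at the cost of less read-ahead.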

An implementation of a ReplayCharSequence done with ByteBuffers (one wrapping the passed prefix buffer, the other a memory-mapped ByteBuffer view into the backing file) was consistently about 10% slower. My tests did the following: I made a buffer filled with regular content and used it as the prefix buffer. The buffer content was written MULTIPLER times to a backing file. I then did accesses with the following pattern: skip forward 32 bytes, go back 16 bytes, then read forward from byte 16 to 32. Repeat. Though I varied the size of the buffer relative to the size of the backing file, from 3 to 10, the difference of 10% or so persisted. The same held when I tried to favor get() over get(index). I used a profiler, JMP, to study the times taken. (The above comment is by St.Ack.)

TODO: Determine whether memory-mapped files are a better way to do this; probably not, since they don't offer the level of control over total memory used that this approach does.

Version:
$Revision: 6512 $, $Date: 2009-09-23 03:02:29 +0000 (Wed, 23 Sep 2009) $
Author:
Gordon Mohr

Field Summary
protected  int length
          Total length of character stream to replay minus the HTTP headers if present.
protected static java.util.logging.Logger logger
           
 
Constructor Summary
Latin1ByteReplayCharSequence(byte[] buffer, long size, long responseBodyStart, java.lang.String backingFilename)
          Constructor.
 
Method Summary
 char charAt(int index)
          Get character at passed absolute position.
 void close()
          Cleanup resources.
protected  void finalize()
           
 int length()
           
 java.lang.CharSequence subSequence(int start, int end)
           
 java.lang.String substring(int offset, int len)
          Deprecated. Please use subSequence() and then toString() directly.
 java.lang.String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

logger

protected static java.util.logging.Logger logger

length

protected int length
Total length of character stream to replay minus the HTTP headers if present. Used to find EOS.

Constructor Detail

Latin1ByteReplayCharSequence

public Latin1ByteReplayCharSequence(byte[] buffer,
                                    long size,
                                    long responseBodyStart,
                                    java.lang.String backingFilename)
                             throws java.io.IOException
Constructor.

Parameters:
buffer - In-memory buffer of the recording's prefix. We read from here first and only go to the backing file if the size requested is greater than buffer.length.
size - Total size of the stream to replay, in bytes. Used to find EOS. This is the total length of the content, including HTTP headers if present.
responseBodyStart - Where the response body starts, in bytes. Used to skip over the HTTP headers if present.
backingFilename - Path to the backing file holding content in excess of what's in buffer.
Throws:
java.io.IOException
Method Detail

length

public int length()
Specified by:
length in interface java.lang.CharSequence
Returns:
Length of characters in stream to replay. Starts counting at the HTTP header/body boundary.

charAt

public char charAt(int index)
Get character at passed absolute position. Called by charAt(int) which has a relative index into the content, one that doesn't account for HTTP header if present.

Specified by:
charAt in interface java.lang.CharSequence
Parameters:
index - Index into content adjusted to accommodate initial offset to get us past the HTTP header if present (i.e. contentOffset).
Returns:
Character at offset index.
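
The relationship between the constructor's size and responseBodyStart arguments, the reported length, and the header offset applied to an index can be sketched as simple arithmetic. This is an inference from the parameter descriptions on this page, not a copy of the class's code; the class BodyOffsets and both helpers are hypothetical, and how the real class treats sizes beyond the int range is not stated here.

```java
// Hypothetical helpers illustrating the header-offset arithmetic.
class BodyOffsets {
    // Characters available for replay: the total size minus the header bytes.
    // Math.toIntExact throws ArithmeticException if the result exceeds int range
    // (an assumption made for this sketch only).
    static int bodyLength(long size, long responseBodyStart) {
        return Math.toIntExact(size - responseBodyStart);
    }

    // Translate a body-relative index into an absolute position in the recording.
    static long absolutePosition(int index, long responseBodyStart) {
        return responseBodyStart + index;
    }

    public static void main(String[] args) {
        long size = 1000;             // total recorded bytes, headers included
        long responseBodyStart = 200; // headers occupy bytes 0..199
        System.out.println(bodyLength(size, responseBodyStart));    // prints 800
        System.out.println(absolutePosition(0, responseBodyStart)); // prints 200
    }
}
```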

subSequence

public java.lang.CharSequence subSequence(int start,
                                          int end)
Specified by:
subSequence in interface java.lang.CharSequence

close

public void close()
           throws java.io.IOException
Cleanup resources.

Specified by:
close in interface ReplayCharSequence
Throws:
java.io.IOException - Failed close of random access file.

finalize

protected void finalize()
                 throws java.lang.Throwable
Overrides:
finalize in class java.lang.Object
Throws:
java.lang.Throwable

substring

public java.lang.String substring(int offset,
                                  int len)
Deprecated. Please use subSequence() and then toString() directly.

Convenience method for getting a substring.
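
A migration away from the deprecated method might look like the sketch below. It assumes, from the parameter names alone, that substring takes an offset and a length, whereas subSequence takes start and end indexes; the helper class SubstringMigration is hypothetical and works against any CharSequence.

```java
// Hypothetical helper showing the subSequence-based replacement.
class SubstringMigration {
    // Equivalent of substring(offset, len), assuming len is a character count:
    // the end index passed to subSequence is offset + len.
    static String substring(CharSequence cs, int offset, int len) {
        return cs.subSequence(offset, offset + len).toString();
    }

    public static void main(String[] args) {
        CharSequence cs = "Hello, world";
        System.out.println(substring(cs, 7, 5)); // prints "world"
    }
}
```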


toString

public java.lang.String toString()
Specified by:
toString in interface java.lang.CharSequence
Overrides:
toString in class java.lang.Object


Copyright © 2003-2011 Internet Archive. All Rights Reserved.