Developer Manual Introduction

This section is for observers and contributors who'd like to build from source. Here we cover CVS access, the code layout, core technologies and key technical decisions.

Obtaining Heritrix

Heritrix can be obtained as a packaged binary or source download from the crawler sourceforge home page, or via CVS checkout from cvs.sourceforge.net. See the crawler sourceforge cvs page for how to fetch from CVS. (Note that anonymous access does not give you the current HEAD but a snapshot that can sometimes be up to 24 hours behind HEAD.) The packaged binary is named heritrix-?.?.?.tar.gz (or heritrix-?.?.?.zip) and the packaged source is named heritrix-?.?.?-src.tar.gz (or heritrix-?.?.?-src.zip), where ?.?.? is the Heritrix release version.

Building Heritrix

You can build Heritrix from source using Maven. The Heritrix build uses Maven 1.0-rc1. See maven.apache.org for how to obtain the binary and set up your Maven environment.

Building Heritrix with Maven

To build a CVS source checkout with Maven:

% cd CVS_CHECKOUT_DIR 
% $MAVEN_HOME/bin/maven dist

In the target/distribution subdir, you will find packaged source and binary builds. Run $MAVEN_HOME/bin/maven -g for other Maven possibilities.

Running Heritrix

See the User Manual for how to run the built Heritrix.

Eclipse

At the head of the CVS tree, you'll find Eclipse .project and .classpath configuration files that should make integrating the CVS checkout into your Eclipse development environment straightforward.

Unit Tests Code

"[A] popular convention is to place all test classes in a parallel directory structure. This allows you to use the same Java package names for your tests, while keeping the source files separate. To be honest, we do not like this approach because you must look in two different directories to find files." From Section 4.11.3, Java Extreme Programming Cookbook, by Eric M. Burke and Brian M. Coyner. We agree with the above, so we put unit test classes beside the classes they are testing in the source tree, giving them the name of the class they are testing with a Test suffix.

Another advantage is that test classes of the same package can get at testee's default access methods and members, something a test in another package would not be able to do.
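
As a minimal sketch of this point, assume a hypothetical class WordList with a default-access method normalize. Neither name comes from the Heritrix tree, and the sketch uses a plain main rather than JUnit so that it stays self-contained. Both classes are shown in one listing; in a source tree each would be its own file in the same package directory.

```java
// Hypothetical layout (illustrative names only):
//   org/example/WordList.java
//   org/example/WordListTest.java
import java.util.ArrayList;
import java.util.List;

class WordList {
    private final List<String> words = new ArrayList<String>();

    public void add(String word) {
        words.add(normalize(word));
    }

    public int size() {
        return words.size();
    }

    // Default (package) access: hidden from other packages, but visible
    // to a test class that sits beside this one.
    String normalize(String word) {
        return word.trim();
    }
}

// Named after the class under test, with a Test suffix.
class WordListTest {
    static void testNormalize() {
        WordList list = new WordList();
        // Compiles only because the test shares the package of WordList.
        String result = list.normalize("  crawl ");
        if (!result.equals("crawl")) {
            throw new AssertionError("unexpected: " + result);
        }
    }

    public static void main(String[] args) {
        testNormalize();
        System.out.println("WordListTest passed");
    }
}
```

Had WordListTest lived in a different package, the call to normalize would not compile, and the method would have to be made public just to be tested.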

Coding Conventions

Heritrix baselines on SUN's Code Conventions for the Java(TM) Programming Language. It'd be hard not to, they say so little. They do at least specify a maximum line length of 80 characters. Below are the tightenings of the SUN conventions used in Heritrix.

We also favor much of what is written in the document Java Programming Style Guidelines.

No Tabs

No tabs in source code. Set your editor to indent with spaces.
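
The tab rule and the 80-character limit can be checked mechanically. The following is only a sketch of such a check, not a tool shipped with Heritrix: the class name StyleCheck and its message format are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: flags tabs and lines longer than 80 characters.
class StyleCheck {
    static final int MAX_LINE_LENGTH = 80;

    // Returns one message per violation; an empty list means no findings.
    static List<String> check(String source) {
        List<String> problems = new ArrayList<String>();
        String[] lines = source.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            int lineNumber = i + 1;
            if (lines[i].indexOf('\t') >= 0) {
                problems.add("line " + lineNumber + ": contains a tab");
            }
            if (lines[i].length() > MAX_LINE_LENGTH) {
                problems.add("line " + lineNumber
                        + ": longer than " + MAX_LINE_LENGTH + " characters");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // The second line is indented with a tab, so it is reported.
        String sample = "if (true) {\n\treturn true;\n}\n";
        for (String problem : check(sample)) {
            System.out.println(problem);
        }
    }
}
```

The sketch deliberately leaves indent width alone, since continuation lines and comment blocks make a simple "multiple of 4" test unreliable.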

Indent Width

Indents are 4 characters wide.

Function/Block Bracket Open on Same Line

Preference is to have the bracket that opens functions and blocks on the same line as the function declaration or block test, rather than on a new line of its own. For example:

if (true) {
    return true;
}
                    
and

public static void main(String[] args) {
    System.out.println("Hello world");
}

File comment

Here is the eclipse template we use for the file header comment:

/* ${type_name}
 * 
 * $$Id$$
 * 
 * Created on ${date}
 *
 * Copyright (C) ${year} Internet Archive.
 * 
 * This file is part of the Heritrix web crawler (crawler.archive.org).
 * 
 * Heritrix is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or
 * any later version.
 * 
 * Heritrix is distributed in the hope that it will be useful, 
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU Lesser Public License for more details.
 * 
 * You should have received a copy of the GNU Lesser Public License
 * along with Heritrix; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 */
${package_declaration}

Version and Release Numbering

Heritrix uses a version numbering scheme modelled after the one used for Linux kernels. Versions are 3 numbers:

[major].[minor/mode].[patchlevel]

The major version number, currently at zero, increments upon significant architectural changes or the achievement of important milestones in capabilities. The minor/mode version number increments as progress is made within a major version, with the added constraint that all external releases have an even minor/mode version number, and all internal/development versions have an odd minor/mode version number.

The patchlevel number increments for small sets of changes, providing the most fine-grained timeline of software evolution. Patchlevels increment regularly for internal/development (odd minor level) work, but only increment for external releases when an official update to the previous release version has been tested and packaged.

In the CVS HEAD, version numbers are applied as tags of the form "heritrix-#_#_#". When a particular development version is thought appropriate to become an external/"stable" release, it is considered a "Release Candidate". If testing confirms it is suitable for release, it is assigned the next even minor/mode value (and a zero patchlevel), CVS version-labelled, and packaged for release. Immediately after release, and before additional coding occurs, the CVS HEAD is assigned the next odd minor/mode value (and a zero patchlevel) in project/source files.

If patches are required to a released version, before the next release is ready, they are applied to a CVS branch from the release version tag, tested, and released as the subsequent patchlevel.
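
The cycle described above can be sketched in a few lines of Java. HeritrixVersion below is a hypothetical helper written for this page, not a class from the Heritrix source tree, and it does not model major version increments, which are a matter of judgment rather than arithmetic.

```java
// Hypothetical helper illustrating the release numbering cycle.
final class HeritrixVersion {
    final int major;
    final int minor;      // the minor/mode number
    final int patchlevel;

    HeritrixVersion(int major, int minor, int patchlevel) {
        this.major = major;
        this.minor = minor;
        this.patchlevel = patchlevel;
    }

    // External releases carry an even minor/mode number.
    boolean isRelease() {
        return minor % 2 == 0;
    }

    // A release candidate gets the next even minor/mode, patchlevel zero.
    HeritrixVersion promoteToRelease() {
        if (isRelease()) {
            throw new IllegalStateException("already a release: " + this);
        }
        return new HeritrixVersion(major, minor + 1, 0);
    }

    // Right after a release, HEAD gets the next odd minor/mode.
    HeritrixVersion nextDevelopment() {
        if (!isRelease()) {
            throw new IllegalStateException("not a release: " + this);
        }
        return new HeritrixVersion(major, minor + 1, 0);
    }

    // A patch tested on the release branch becomes the next patchlevel.
    HeritrixVersion nextPatchlevel() {
        return new HeritrixVersion(major, minor, patchlevel + 1);
    }

    // CVS tag of the form heritrix-#_#_#.
    String cvsTag() {
        return "heritrix-" + major + "_" + minor + "_" + patchlevel;
    }

    public String toString() {
        return major + "." + minor + "." + patchlevel;
    }

    public static void main(String[] args) {
        HeritrixVersion release =
                new HeritrixVersion(0, 3, 7).promoteToRelease();
        System.out.println(release + " tagged " + release.cvsTag());
        System.out.println("HEAD becomes " + release.nextDevelopment());
        System.out.println("first patch is " + release.nextPatchlevel());
    }
}
```

Starting from a development version 0.3.7, the sketch yields release 0.4.0 (tag heritrix-0_4_0), a new HEAD of 0.5.0, and 0.4.1 as the first patch release.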

Keep in mind that each version number is an integer, not merely a decimal digit. To use an extreme example: development version 2.99.99 would be followed by either the 2.99.100 development version patchlevel or the 2.100.0 release. (And after such a release, the next development version would be 2.101.0.)
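
Because each component is an integer, version strings must be compared numerically; a plain string comparison would sort 2.100.0 before 2.99.99. The comparator below is a sketch of the numeric ordering. VersionOrder is a hypothetical name, not part of Heritrix.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper: orders "major.minor.patchlevel" strings numerically.
class VersionOrder implements Comparator<String> {
    public int compare(String a, String b) {
        int[] x = parse(a);
        int[] y = parse(b);
        for (int i = 0; i < 3; i++) {
            if (x[i] != y[i]) {
                return x[i] < y[i] ? -1 : 1;
            }
        }
        return 0;
    }

    // Splits on dots and reads each part as a whole integer.
    static int[] parse(String version) {
        String[] parts = version.split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                    "expected three numbers: " + version);
        }
        int[] numbers = new int[3];
        for (int i = 0; i < 3; i++) {
            numbers[i] = Integer.parseInt(parts[i]);
        }
        return numbers;
    }

    public static void main(String[] args) {
        String[] versions = {"2.101.0", "2.99.100", "2.100.0", "2.99.99"};
        Arrays.sort(versions, new VersionOrder());
        // prints [2.99.99, 2.99.100, 2.100.0, 2.101.0]
        System.out.println(Arrays.toString(versions));
    }
}
```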

Making a Heritrix Release

Before initiating a release, it's assumed that the current HEAD version has been run through the integration self test, that all unit tests pass, that the (as yet non-existent) test suite has been exercised, and that general usage shows HEAD to be release worthy.

  1. Send a mail to the list to freeze commits until the all-clear is given.
  2. Up the project.xml 'currentVersion' element and the build.xml 'version' property. Ensure they are the same version number. (See Version and Release Numbering on this page for guidance on what version number to use.)
  3. Update xdocs/changes.xml with bugs and RFEs closed since last release.
  4. (TODO: Changelog based off CVS history).
  5. Add news of the new release to the site main home page.
  6. Generate the site. Review all documentation making sure it remains applicable. Fix at least the embarrassing. Make issues to have all that remains addressed.
  7. Update the README.txt. Do html2txt on maven generated xdocs (I did 'cat PAGE.html| w3m -dump -T text/html > PAGE.txt', catted the product together and then did vi regex'ing to clean out website navigations and extra whitespace).
  8. Commit all of the changes made above in a single commit, with a log message about the new release. Commit the files with the new version -- the build.xml and project.xml -- as well as the README.txt, home page, and all changes in documentation including the changelog additions.
  9. Wait for a successful cruisecontrol build of everything just committed. Download the latest src and binary builds from under the cruisecontrol 'build artifacts' link.
  10. Build the cruisecontrol produced src distribution version.
  11. Run both the binary and src-built product through the integration self test suite: % $HERITRIX_HOME/bin/heritrix --selftest
  12. Tag the CVS repository: % cvs -q tag heritrix-?_?_?
  13. Update the project.xml 'currentVersion' and build.xml 'version' property to both be a version number beyond that of the release currently being made (If we're releasing 0.2.0, then increment to 0.3.0).
  14. Login and upload the maven 'dist' product to sourceforge into the admin->File releases section.
  15. Send announcement to mailinglist -- and give an all-clear that commits may resume -- and update our release state on freshmeat site (Here is the URL I used creating our freshmeat project: http://freshmeat.net/add-project/all-done/43820/46804/ -- 46804 is our project ID).
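
Step 2 asks that the two version numbers agree, which is easy to get wrong by hand. The following is a sketch of an automated check, under two assumptions: that project.xml holds the version in a currentVersion element (as named in step 2), and that build.xml declares it Ant-style as a property element with name="version" and a value attribute. The class name VersionSync is hypothetical, and the XML is passed as strings to keep the sketch self-contained; a real check would read the two files from the checkout.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: compares the versions declared in two build files.
class VersionSync {
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("unparseable XML", e);
        }
    }

    // Text of the first <currentVersion> element, or null if absent.
    static String projectVersion(String projectXml) {
        NodeList nodes =
                parse(projectXml).getElementsByTagName("currentVersion");
        if (nodes.getLength() == 0) {
            return null;
        }
        return nodes.item(0).getTextContent().trim();
    }

    // Value of <property name="version" value="..."/>, or null if absent.
    // The attribute layout is an assumption about build.xml.
    static String buildVersion(String buildXml) {
        NodeList nodes = parse(buildXml).getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element property = (Element) nodes.item(i);
            if ("version".equals(property.getAttribute("name"))) {
                return property.getAttribute("value");
            }
        }
        return null;
    }

    static boolean inSync(String projectXml, String buildXml) {
        String project = projectVersion(projectXml);
        return project != null && project.equals(buildVersion(buildXml));
    }

    public static void main(String[] args) {
        String projectXml = "<project>"
                + "<currentVersion>0.2.0</currentVersion></project>";
        String buildXml = "<project>"
                + "<property name=\"version\" value=\"0.2.0\"/></project>";
        System.out.println("versions match: "
                + inSync(projectXml, buildXml));
    }
}
```

Running such a check before the commit in step 8, and again after the increment in step 13, would catch a mismatch before cruisecontrol builds it.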

Integration self test

Run the integration self test on the command line by doing the following:

% $HERITRIX_HOME/bin/heritrix --selftest

This will set the crawler going against itself, in particular against the selftest webapp. When done, it runs an analysis of the produced arc files and logs, and dumps a ruling into heritrix_out.log. See the org.archive.crawler.selftest package for more on how the selftest works.

cruisecontrol

See src/cc for a config.xml that will run the heritrix maven build under cruisecontrol. See the README.txt in the same directory for how to set up continuous building using cc.

Settings XML Schema

The XML Schema that describes the crawler job order file can be viewed as xhtml here: heritrix_settings.html.