4. A quick guide to running your first crawl job

Once you've installed Heritrix and logged in to the WUI (see above), you are presented with the web Console page. Near the top of the page is a row of tabs.

Step 1. Create a job

To create a new job, choose the Jobs tab. This will take you to the Jobs page, where you are presented with three options for creating a new job. Select 'With defaults'. This will create a new job based on the default profile (see Section 5.2, “Profile”).

On the screen that comes next you will be asked to supply a name, description and a seed list for the new job.

For the name, supply a short text with no spaces or special characters (dash and underscore are allowed). You can skip the description if you like. In the seeds list, type in the URLs of the sites you are interested in harvesting, one URL per line.
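The naming and seed-list rules above can be checked before you type anything into the form. The following is a minimal sketch in Python, not part of Heritrix: the function names are hypothetical, and the pattern assumes that "no special characters" means ASCII letters, digits, dash and underscore only.

```python
import re

# Assumption: a job name may contain only ASCII letters, digits,
# dash and underscore. The WUI may be more or less permissive.
JOB_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_job_name(name):
    """Return True if the whole name matches the assumed pattern."""
    return JOB_NAME_PATTERN.fullmatch(name) is not None


def parse_seed_list(text):
    """Split seed text into URLs, one per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    print(is_valid_job_name("my-first_crawl"))
    print(parse_seed_list("http://example.org/\nhttp://example.com/\n"))
```

The sketch only checks the shape of the input; it does not verify that the seed URLs are reachable.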

Creating a job is covered in greater detail in Section 5, “Creating jobs and profiles”.

Step 2. Configure the job

Once you've entered this information, you are ready to go to the configuration pages. Click the Modules button in the row of buttons at the bottom of the page.

This will take you to the modules configuration page (more details in Section 6.1, “Modules (Scope, Frontier, and Processors)”). For now, the only option of interest is the second from the top, named Select crawl scope. It allows you to specify the limits of the crawl. By default the crawl is limited to the domains that your seeds span, which may be suitable for your purposes. If not, you can choose a broad scope (not limited to the domains of its seeds) or the more restrictive host scope, which limits the crawl to the hosts that its seeds span. For more on scopes refer to Section 6.1.1, “Crawl Scope”.

To change scopes, select the new one from the combobox and click the Change button.
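To make the difference between the three scopes concrete, here is a simplified sketch in Python. It is an illustration only and not Heritrix's implementation: the function names and the scope labels "broad", "domain" and "host" are assumptions, and a seed's domain is approximated as its host name with a leading "www." removed.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased host name of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def in_scope(url, seeds, scope):
    """Decide whether url falls inside the given scope for these seeds."""
    host = host_of(url)
    if scope == "broad":
        # Broad scope: not limited to the domains of the seeds.
        return True
    seed_hosts = {host_of(seed) for seed in seeds}
    if scope == "host":
        # Host scope: only the exact hosts the seeds span.
        return host in seed_hosts
    if scope == "domain":
        # Domain scope (approximation): the seed host without a leading
        # "www.", plus any subdomain of it.
        for seed_host in seed_hosts:
            domain = seed_host[4:] if seed_host.startswith("www.") else seed_host
            if host == domain or host.endswith("." + domain):
                return True
        return False
    raise ValueError("unknown scope: " + scope)


if __name__ == "__main__":
    seeds = ["http://www.example.org/"]
    for scope in ("domain", "host", "broad"):
        print(scope, in_scope("http://news.example.org/a", seeds, scope))
```

With the single seed shown, a page on news.example.org falls inside the domain and broad scopes but outside the host scope.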

Next, turn your attention to the second row of tabs at the top of the page, below the usual tabs. You are currently on the far-left tab. Now select the tab called Settings, near the middle of the row.

This takes you to the Settings page, which allows you to configure various details of the crawl. Exhaustive coverage of this page can be found in Section 6.3, “Settings”. For now we are only interested in the two settings under http-headers. These are the user-agent and from fields of the HTTP headers in the crawler's requests. You must set them to valid values before a crawl can be run. In the default values, the parts that need replacing are written in upper case. If you have trouble with that, please refer to Section 6.3.1.3, “HTTP headers” for what is regarded as a valid value.
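A rough pre-check of the two header values might look like the sketch below. The rules here are assumptions made for illustration (that the user-agent should contain a URL and that from should look like an email address); the authoritative rules are the ones described in Section 6.3.1.3. The function name and the sample values are hypothetical.

```python
import re

# Assumed, simplified notions of "valid"; not taken from Heritrix itself.
URL_IN_AGENT = re.compile(r"https?://\S+\.\S+")
EMAIL = re.compile(r"\S+@\S+\.\S+")


def check_http_headers(user_agent, from_address):
    """Return a list of problems found; an empty list means none."""
    problems = []
    if not URL_IN_AGENT.search(user_agent):
        problems.append("user-agent does not contain a URL")
    if not EMAIL.fullmatch(from_address):
        problems.append("from is not an email address")
    return problems


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(check_http_headers(
        "Mozilla/5.0 (compatible; heritrix +http://example.org/crawler)",
        "crawl-admin@example.org",
    ))
```

A value that still contains an unreplaced upper-case placeholder would normally fail both checks, since it contains neither a URL nor an "@" sign.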

Once you've set the http-headers settings to proper values (and made any other desired changes), you can click the Submit job tab at the far right of the second row of tabs. The crawl job is now configured and ready to run.

Configuring a job is covered in greater detail in Section 6, “Configuring jobs and profiles”.

Step 3. Run the job

Newly submitted jobs are placed in a queue of pending jobs. The crawler does not start processing jobs from this queue until the crawler itself is started. While the crawler is stopped, jobs are simply held.

To start the crawler, click on the Console tab. Once on the Console page, you will find the Start option at the top of the Crawler Status box, just to the right of the current status indicator. Clicking this option will put the crawler into Crawling Jobs mode, in which it will begin crawling the next pending job, such as the job you just created and configured.
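The relationship between the pending queue and the Start option can be pictured with a toy model. This is a hypothetical sketch in Python, not Heritrix code: the class and its methods are invented for illustration and only mirror the behaviour described above, where jobs are held while the crawler is stopped and the next pending job begins once it is started.

```python
from collections import deque


class CrawlerModel:
    """Toy model of a crawler with a queue of pending jobs."""

    def __init__(self):
        self.pending = deque()  # submitted jobs waiting to run
        self.running = False    # False until start() is called
        self.current = None     # job being crawled, if any

    def submit(self, job_name):
        """Queue a job; it begins at once only if the crawler is idle and started."""
        self.pending.append(job_name)
        self._maybe_begin()

    def start(self):
        """Switch to crawling mode and pick up the next pending job."""
        self.running = True
        self._maybe_begin()

    def _maybe_begin(self):
        if self.running and self.current is None and self.pending:
            self.current = self.pending.popleft()


if __name__ == "__main__":
    crawler = CrawlerModel()
    crawler.submit("my-first_crawl")
    print(crawler.current)  # still held: the crawler has not been started
    crawler.start()
    print(crawler.current)
```

The model leaves out job completion, pausing and everything else the real crawler does; it only shows why a submitted job does nothing until Start is clicked.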

The Console now displays progress information about the ongoing crawl. To update this information, click the Refresh option (or the Heritrix logo at the top left).

For more information about running a job see Section 7, “Running a job”.

Detailed information about evaluating the progress of a job can be found in Section 8, “Analysis of jobs”.