5. Creating jobs and profiles

To run a crawl, a configuration that defines it must first be created. In Heritrix, such a configuration is called a crawl job.

5.1. Crawl job

A crawl job encompasses the configurations needed to run a single crawl. It also contains some additional elements, such as file locations and status.

Once logged in to the WUI, new jobs can be created by going to the Jobs tab. Once the Jobs page loads, users can create jobs by choosing one of the following three options:

  1. Based on existing job

    This option allows the user to create a job by basing it on any existing job, regardless of whether it has been crawled or not. This can be useful for repeating crawls or recovering a crawl that had problems (see Section 9.3, “Recovery of Frontier State and recover.gz”).

  2. Based on a profile

    This option allows the user to create a job by basing it on any existing profile.

  3. With defaults

    This option creates a new crawl job based on the default profile.

Options 1 and 2 will display a list of the existing jobs or profiles to choose from. Initially there are two profiles and no existing jobs.

All crawl jobs are created by basing them on profiles (see Section 5.2, “Profile”) or existing jobs.

Once the proper profile/job has been chosen to base the new job on, a simple page will appear asking for the new job's:

  1. Name

    The name may contain only letters, numbers, dashes (-) and underscores (_). No other characters are allowed. This name will be used to identify the crawl in the WUI, but it need not be unique. The name cannot be changed later.

  2. Description

    A short description of the job. This is a freetext input box and can be edited later.

  3. Seeds

    The seed URIs to use for the job. This list can be edited later along with the general configurations.
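
The naming rule for jobs can be checked before the form is filled in. The following is a minimal sketch in Python, not code from Heritrix itself; the function name is_valid_job_name is hypothetical, and restricting "letters" to ASCII is an assumption, since the rule above does not say whether other letters are accepted.

```python
import re

# Assumed rule, taken from the name description above: letters, numbers,
# dash (-) and underscore (_) only. Limiting letters to ASCII is an
# assumption; the manual does not state how non-ASCII letters are treated.
_JOB_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_job_name(name: str) -> bool:
    """Return True if name is non-empty and uses only the allowed characters."""
    return _JOB_NAME_PATTERN.fullmatch(name) is not None
```

With this sketch, a name such as weekly-crawl_01 is accepted, while names containing spaces or dots are rejected.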

Below these input fields there are several buttons. The last one, Submit job, will immediately submit the job and (assuming it is properly configured) it will be ready to run (see Section 7, “Running a job”). The other buttons will take the user to the relevant configuration pages (covered in detail in Section 6, “Configuring jobs and profiles”). Once all desired changes have been made to the configuration, click the 'Submit job' tab (usually displayed top and bottom right) to submit the job to the list of waiting jobs.

Note

Changes made afterwards to the original jobs or profiles that a new job is based on will not in any way affect the newly created job.
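
The behavior described in this note is that of an independent copy rather than a live link. As an illustration only, the sketch below models it in Python with a deep copy; the settings dictionary and the function create_job_from are hypothetical stand-ins and do not reflect how Heritrix stores its configuration.

```python
import copy

# Hypothetical, simplified stand-in for the settings held by a profile.
profile = {
    "description": "Default profile",
    "seeds": ["http://example.com/"],
}


def create_job_from(base: dict, name: str) -> dict:
    """Create a new job as an independent copy of the base settings."""
    job = copy.deepcopy(base)  # nested values such as the seeds list are copied too
    job["name"] = name
    return job


job = create_job_from(profile, "first-crawl")

# Editing the original afterwards leaves the new job untouched.
profile["seeds"].append("http://example.org/")
```

After the last line, the job still holds only the seed it was created with, and the profile has not gained a name entry.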

Note

Jobs based on the default profile provided with Heritrix are not ready to run as is. Their HTTP header information must be set to valid values. See Section 6.3.1.3, “HTTP headers” for details.

5.2. Profile

A profile is a template for a crawl job. It contains all the configurations that a crawl job would, but is not considered to be 'crawlable'. That is, Heritrix will not allow you to crawl a profile directly, only jobs based on profiles. The reason is that while a profile may in fact be complete, it need not be.

A common example is leaving the HTTP headers (user-agent, from) in an illegal state in a profile to force the user to input valid data. This applies to the default profile (named default) that comes with Heritrix. Other examples would be leaving the seeds list empty or not specifying some processors (such as the writer/indexer).

In general there is less error checking of profiles.

To manage profiles, go to the Profiles tab in the WUI. That page will display a list of existing profiles. To create a new profile, select the "New profile based on it" option for the existing profile that is to serve as a template. Profiles can only be created based on other profiles; unlike jobs, they cannot be based on existing jobs.

The process from there on mirrors the creation of jobs. A page will ask for the new profile's name, description and seeds list. Unlike job names, profile names must be unique among profiles, although a job and a profile can share the same name. Otherwise the same rules apply.

The user then proceeds to the configuration pages (see Section 6, “Configuring jobs and profiles”) to modify the behavior of the new profile from that of the parent profile.

Note

Even though profiles are based on other profiles, changes made to the original profiles afterwards will not affect the new ones.