7. Running a job

Once a crawl job has been created and properly configured, it can be run. To start a crawl, the user must go to the web Console page (via the Console tab).

7.1. Web Console

The web Console presents an overview of the current status of the crawler.

7.1.1. Crawler Status Box

The following information is always provided:

  • Crawler Status

    Is the crawler in Holding Jobs or Crawling Jobs mode? If holding, no pending or newly created jobs will be started (though a job already begun will continue). If crawling, the next pending or created job will be started as soon as possible, for example when a previous job finishes. For more detail see "Holding Jobs" vs. "Crawling Jobs".

    To the right of the current crawler status, a control link reading either "Start" or "Hold" will toggle the crawler between the two modes.

  • Jobs

    If a current job is in progress, its status and name will appear. Alternatively, "None running" will appear to indicate no job is in progress because the crawler is holding, or "None available" if no job is in progress because no jobs have been queued.

    Below the current job info, the number of jobs pending and completed is shown. The completed count includes those that failed to start for some reason (see Section 7.3.2, “Job failed to start” for more on misconfigured jobs).

  • Alerts

    The total number of alerts and, in brackets, the number of new alerts, if any.

    See Section 7.3.4, “Alerts” for more on alerts.

  • Memory

    The amount of memory currently used, the size of the Java heap, and the maximum size to which the heap can possibly grow are all displayed, in kilobytes (KB).
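
For reference, these three figures correspond to the standard Java heap measurements. Below is a minimal, illustrative sketch of how such values can be obtained from the JVM; it is not the crawler's own reporting code, and the class name is hypothetical.

    // Illustrative only: how the three heap figures relate to standard JVM calls.
    public class HeapFigures {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            long usedKB = (rt.totalMemory() - rt.freeMemory()) / 1024; // memory currently used
            long heapKB = rt.totalMemory() / 1024;                     // current size of the Java heap
            long maxKB  = rt.maxMemory() / 1024;                       // maximum size the heap can grow to
            System.out.println(usedKB + " KB used, " + heapKB + " KB heap, " + maxKB + " KB max");
        }
    }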

7.1.2. Job Status Box

If a job is in-progress -- running, paused, or between job states -- the following information is also provided in a second area underneath the Crawler Status Box.

  • Job Status

    The current status of the job in progress. Jobs being crawled are usually running or paused.

    To the right of the current status, controls for pausing/resuming or terminating the current job will appear as appropriate.

    When a job is terminated, its status will be marked as 'Ended by operator'. All currently active threads will be allowed to finish behind the scenes, even though the WUI will report the job as terminated at once. If the crawler is in "Crawling Jobs" mode, the next pending job, if any, will start immediately.

    When a running job is paused, it may take some time for all the active threads to enter a paused state. Until then the job is considered to be still running and 'pausing'. It is possible to resume from this interim state.

    Once paused, a job is considered suspended, and time spent in that state does not count towards elapsed job time or rates.

  • Rates

    The number of URIs successfully processed per second is shown, both the rate in the latest sampling interval and (in parentheses) the average rate since the crawl began. The sampling interval is typically about 20 seconds, and is adjustable via the "interval-seconds" setting. The latest rate of progress can fluctuate considerably, as the crawler workload varies and housekeeping memory and file operations occur -- especially if the sampling interval has been set to a low value.

    Also shown is the rate of successful content collection, in KB/sec, for the latest sampling interval and (in parentheses) the average since the crawl began. (See Bytes, KB and statistics.) A worked sketch of this arithmetic is given just after this list.

  • Time

    The amount of time that has elapsed since the crawl began (excluding any time spent paused) is displayed, as well as a very crude estimate of the time remaining. (This estimate does not yet consider the near-certainty of discovering more URIs to crawl, and ignores other factors, so it should not be relied upon until it can be improved in future releases.)

  • Load

    A number of measures are shown of how busy or loaded the job has made the crawler. The number of active threads, compared to the total available, is shown. Typically, if only a small number of threads are active, it is because activating more threads would exceed the configured politeness settings, given the remaining URI workload. (For example, if all remaining URIs are on a single host, no more than one thread will be active -- and often none will be, as polite delays are observed between requests.)

    The congestion ratio is a rough estimate of how much additional capacity, as a multiple of current capacity, would be necessary to crawl the current workload at the maximum rate allowable by politeness settings. (It is calculated by comparing the number of internal queues that are progressing with those that are waiting for a thread to become available; a sketch of this calculation is given just after this list.)

    The deepest queue number indicates the longest chain of pending URIs that must be processed sequentially, which is a better indicator of the work remaining than the total number of URIs pending. (A thousand URIs in a thousand independent queues can complete in parallel very quickly; a thousand in one queue will take longer.)

    The average depth number indicates the average depth of the last URI in every active sequential queue.

  • Totals

    A progress bar indicates the relative percentage of completed URIs to those known and pending. As with the remaining time estimate, no consideration is given to the likelihood of discovering additional URIs to crawl. So, the percentage completed can shrink as well as grow, especially in broader crawls.

    To the left of the progress bar, the total number of URIs successfully downloaded is shown; to the right, the total number of URIs queued for future processing. Beneath the bar, the total of downloaded plus queued is shown, as well as the uncompressed total size of successfully downloaded data in kilobytes. See Bytes, KB and statistics. (Compressed ARCs on disk will be somewhat smaller than this figure.)

  • Paused Operations

    When the job is paused, additional options will appear such as View or Edit Frontier URIs.

    The View or Edit Frontier URIs option takes the operator to a page allowing the lookup and deletion of URIs in the frontier by using a regular expression, or addition of URIs from an external file (even URIs that have already been processed).
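
To make the Rates arithmetic described above concrete, the following minimal sketch computes the latest-interval and average figures from hypothetical counters. The variable names, sample values, and the 20-second interval are assumptions for illustration only, not Heritrix's actual code.

    // Illustrative sketch of the Rates arithmetic (hypothetical counters and values).
    public class RateSketch {
        public static void main(String[] args) {
            long intervalSeconds = 20;          // sampling interval ("interval-seconds")
            long urisThisInterval = 250;        // URIs finished during the last interval (assumed)
            long bytesThisInterval = 5_120_000; // bytes collected during the last interval (assumed)
            long urisTotal = 90_000;            // URIs finished since the crawl began (assumed)
            long bytesTotal = 1_843_200_000L;   // bytes collected since the crawl began (assumed)
            long elapsedSeconds = 7_200;        // elapsed crawl time, excluding pauses (assumed)

            double currentUriRate = (double) urisThisInterval / intervalSeconds;  // URIs/sec, latest interval
            double averageUriRate = (double) urisTotal / elapsedSeconds;          // URIs/sec, crawl average
            double currentKbRate  = bytesThisInterval / 1024.0 / intervalSeconds; // KB/sec, latest interval
            double averageKbRate  = bytesTotal / 1024.0 / elapsedSeconds;         // KB/sec, crawl average

            System.out.printf("%.1f URIs/sec (%.1f), %.0f KB/sec (%.0f)%n",
                    currentUriRate, averageUriRate, currentKbRate, averageKbRate);
        }
    }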
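
The congestion ratio described under Load can be illustrated in the same spirit. The sketch below simply mirrors the description given above (queues progressing compared with queues waiting for a thread); the queue counts and names are hypothetical, and the crawler's internal calculation may differ in detail.

    // Illustrative sketch of the congestion ratio (hypothetical queue counts).
    public class LoadSketch {
        public static void main(String[] args) {
            int activeQueues  = 40;  // internal queues currently progressing (assumed)
            int waitingQueues = 120; // queues waiting for a thread to become available (assumed)

            // Rough multiple of current capacity that would be needed to crawl the
            // waiting work at the maximum rate politeness settings allow.
            double congestionRatio = (double) (activeQueues + waitingQueues) / activeQueues;

            System.out.printf("congestion ratio: %.1f%n", congestionRatio); // 4.0 in this example
        }
    }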

Some of this information is replicated in the head of each page (see Section 7.3.3, “All page status header”).

7.1.3. Console Bottom Operations

7.1.3.1. Refresh

Update the status display. The status display does not update itself and quickly becomes out of date as crawling proceeds. This also refreshes the options available if they've changed as a result of a change in the state of the job being crawled.

7.1.3.2. Shut down Heritrix

It is possible to shut down Heritrix through this option. Doing so terminates the Java process running Heritrix; because the WUI is disabled along with it, the only way to start Heritrix again is via the command line.

The user is asked to confirm this action twice to prevent accidental shut downs.

This option will try to terminate any current job gracefully but will only wait a very short time for active threads to finish.

7.2. Pending jobs

At any given time there can be any number of crawl jobs waiting for their turn to be crawled.

From the Jobs tab the user can access a list of these pending jobs (it is also possible to get to them from the header, see Section 7.3.3, “All page status header”).

The list displays the name of each job, its status (currently all pending jobs have the status 'Pending') and offers the following options for each job:

  • View order

    Opens up the actual XML configuration file in a separate window. Of interest to advanced users only.

  • Edit configuration

    Takes the user to the Settings page of the job's configuration (see Section 6.3, “Settings”).

  • Journal

    Takes the user to the job's Journal (see Section 7.4.1, “Journal”).

  • Delete

    Deletes the job (it is only marked as deleted; nothing is removed from disk).

7.3. Monitoring a running job

In addition to the logs and reports generally available on all jobs (see Section 8.2, “Logs” and Section 8.3, “Reports”) some information is provided only for jobs being crawled.

Note

The Crawl Report (see Section 8.3.1, “Crawl report”) contains one piece of information that is only available for active crawls: the amount of time that has elapsed since a URI belonging to each host was last finished.

7.3.1. Internal reports on ongoing crawl

The following reports are only available while the crawler is running. They provide information about the internal status of certain parts of the crawler. Generally this information is only of interest to advanced users who possess detailed knowledge of the internal workings of said modules.

These reports can be accessed from the Reports tab when a job is being crawled.

7.3.1.1. Frontier report

A report on the internal state of the frontier. In large crawls (with thousands of hosts having pending URIs) it can be unwieldy in size, or in the amount of time and memory it takes to compose.

7.3.1.2. Thread report

Contains information about what each toe thread is doing and how long it has been doing it. Also allows users to terminate threads that have become stuck. Terminated threads will not actually be removed from memory (Java does not provide a way of doing that); instead they are isolated from the rest of the running program, and the URI they were working on is reported back to the frontier as if it had failed to be processed.

Caution

Terminating threads should only be done by advanced users who understand the effect of doing so.

7.3.1.3. Processors report

A report on each processor (not all processors provide reports). Typically these report the number of URIs handled, links extracted, etc.

This report is saved to a file at the end of the crawl (see Section 9.1.6, “processors-report.txt”).

7.3.2. Job failed to start

If a job is misconfigured in such a way that no crawling is possible, it might seem as if it never started. What actually happens is that the crawl is started, but during initialization it is immediately terminated and sent to the list of completed jobs (Section 8.1, “Completed jobs”). In those instances an explanation of what went wrong is displayed on the completed jobs page. An alert will also be created.

A common cause of this is forgetting to set the HTTP header's user-agent and from attributes to valid values (see Section 6.3.1.3, “HTTP headers”).

If no processors are set on the job (or the modules are otherwise badly misconfigured) the job may succeed in initializing but immediately exhaust the seed list, failing to actually download anything. This will not trigger any errors, but a review of the job's logs should highlight the problem. So if a job terminates immediately after starting without errors, the configuration (especially the modules) should be reviewed for mistakes.

7.3.3. All page status header

At the top of every page in the WUI, right next to the Heritrix logo, is a brief overview of the crawler's current status.

The three lines contain the following information (starting at the top left and working across and down).

The first piece of information is the time at which the page was displayed. This is useful since the status of the crawler continues to change after a page loads, but those changes are not reflected on the page until it is reloaded (usually manually by the user). As always, this time is in GMT.

Right next to it is the number of current and new alerts.

The second line tells the user if the crawler is in "Crawling Jobs" or "Holding Jobs" mode. (See "Holding Jobs" vs. "Crawling Jobs"). If a job is in progress, its status and name will also be shown.

At the beginning of the final line the numbers of pending and completed jobs are displayed. Clicking on either value takes the user to the related overview page. Finally, if a job is in progress, the current totals for URIs completed, elapsed time, and URIs/sec are shown.

7.3.4. Alerts

The number of existing and new alerts is displayed both in the Console (Section 7.1, “Web Console”) and the header of each page (Section 7.3.3, “All page status header”).

Clicking on the link made up of those numbers takes the user to an overview of the alerts. The alerts are presented as messages, with unread ones clearly marked in bold; the user has the option of reading them, marking them as read, or deleting them.

Clicking an alert brings up a screen with its details.

Alerts are generated in response to an error or problem of some form. Alerts have severity levels that mirror the Java log levels.

Serious exceptions that occur will have a Severe level. These may be indicative of bugs in the code or problems with the configuration of a crawl job.
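
For reference, the Java log levels that the alert severities mirror are those defined by java.util.logging.Level. A minimal sketch listing them from most to least severe:

    // The standard java.util.logging levels, from most to least severe.
    import java.util.logging.Level;

    public class AlertLevels {
        public static void main(String[] args) {
            Level[] levels = {
                Level.SEVERE, Level.WARNING, Level.INFO, Level.CONFIG,
                Level.FINE, Level.FINER, Level.FINEST
            };
            for (Level l : levels) {
                System.out.println(l.getName() + " (" + l.intValue() + ")");
            }
        }
    }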

7.4. Editing a running job

The configuration of a job can be edited while it is running. This option is accessed from the Jobs tab (Current job/Edit configuration). When selected, the user is taken to the settings section of the job's configuration (see Section 6.3, “Settings”).

When a configuration file is edited, the old version is saved to a new file (named <oldFilename>_<timestamp>.xml) before it is updated. This way a record is kept of any changes, although only of changes made after crawling begins.

It is not possible to edit all aspects of the configuration after crawling starts. Most noticeably, the Modules section is disabled. Also, although not enforced by the WUI, making changes to certain settings (in particular filenames, directory locations, etc.) will have no effect (doing so will typically not harm the crawl; the change is simply ignored).

However, most settings can be changed, including the number of threads being used and the seeds list. Although it is not possible to remove modules, most can be disabled; setting a module's enabled attribute to false effectively removes it from the configuration.

If changing more than an existing atomic value -- for example, adding a new filter -- it is good practice to pause the crawl first, as some modifications to composite configuration entities may not occur in a thread-safe manner with respect to ongoing crawling otherwise.

7.4.1. Journal

The user can add notes to a journal that is kept for each job. No entries are made automatically in the journal; it is only for user-added comments.

It can be useful for documenting the reasons behind configuration changes, preserving that information alongside the changes themselves.

The journal can be accessed from the Pending jobs page (Section 7.2, “Pending jobs”) for pending jobs, the Jobs tab for currently running jobs and the Completed jobs page (Section 8.1, “Completed jobs”) for completed jobs.

The journal is written to a plain text file that is stored along with the logs.