Heritrix Negotiation of Authentication Schemes

A Proposal to address RFE [ 914301 ] Logging in (HTTP POST, Basic Auth, etc.)

Michael Stack

Internet Archive

1. Introduction
1.1. Scope
1.1.1. Delivery timeline
1.1.2. Common web authentication schemes only
1.1.3. Connection-based authentication schemes
1.2. Assumptions
1.2.1. Heritrix has been granted necessary authentication credentials
1.2.2. Heritrix URI processing chain
1.2.3. No means of recording credentials used authenticating in an ARC
1.2.4. Credentials store does not need to be secured
2. Authentication Schemes
2.1. Basic and Digest Access Authentication
2.2. HTTP POST and GET of Authentication Credentials
2.3. X509 Client Certificates
2.4. NTLM
3. Proposal
3.1. Basic and Digest Access Authentication
3.1.1. CrawlServer
3.1.2. HTTPClient
3.1.3. RFC2617 Record
3.2. HTTP POST and GET of Authentication Credentials
3.2.1. Login Record
3.3. Commonage
3.3.1. URI#authority as URI canonical root URL
3.3.2. Population of Domain/VirtualDomain object with Credentials
3.3.3. Caching of Credentials
3.3.4. Credential Stores
3.3.5. Logging
3.3.6. Debugging tool
4. Design
4.1. Configuration
4.2. Credential store
5. Future
5.1. Same URL different Page Content
5.2. Integration with the UI
Bibliography

Abstract

Description of common web authentication schemes. Description of the problem volunteering credentials at the appropriate juncture. Proposal for navigating HTTP POST login and Basic Auth for when Heritrix has been supplied credentials ahead of the authorization challenge.

1. Introduction

This document is divided into two parts. The first part discusses common web authentication schemes, eliminating the less common. The second part outlines Heritrix negotiation of HTML login forms and Basic/Digest Auth authentication schemes. At the end is a list of items to consider for future versions of the authentication system.

The intent of this document is to solicit feedback in advance of implementation.

The rest of this introduction is given over to scope and assumptions made in this document.

1.1. Scope

1.1.1. Delivery timeline

Delivery on the proposal is to be parcelled out over Heritrix versions. A first cut at Heritrix form-based POST/GET authentication is to be included in version 1.0 (End of April, 2004).

1.1.2. Common web authentication schemes only

This proposal is for the common web authentication schemes only: e.g. HTTP POST to an HTML form, and Basic and Digest Auth. This proposal does not cover the Heritrix crawler authenticating against an LDAP server or PAM, getting tickets from a Kerberos server, negotiating single sign-ons, etc.

1.1.3. Connection-based authentication schemes

Connection-based authentication schemes are outside the scope of this proposal. They are antithetical to the current Heritrix mode of operation. Consideration of connection-based authentication schemes is postponed until Heritrix does other than HTTP/1.0 behavior of getting a new connection per request.

1.2. Assumptions

1.2.1. Heritrix has been granted necessary authentication credentials

Assumption is that Heritrix has been granted legitimate access to the site we're trying to log into ahead of the login attempt; that the site owners have given permission and the necessary login/password combination and/or certificates necessary to gain access.

1.2.2. Heritrix URI processing chain

Assumption is that this proposal integrate with the Heritrix URI processing chains model [See URI Processing Chains ] rather than go to an authentication framework such as JAAS and encapsulate the complete authentication dialog within a JAAS LoginModule plugin, with a plugin per authentication scheme supported. On the one hand, the Heritrix URI processing chain lends itself naturally to the processing of the common web authentication mechanisms with its core notions of HTML fetching and extracting, and besides, the authentication dialog will likely have links to harvest. On the other hand, authentication will be spread about the application.

1.2.3. No means of recording credentials used authenticating in an ARC

There is currently no means of recording in an ARC file the credentials used getting to pages (if we recorded the request, we'd have some hope of archiving them).

1.2.4. Credentials store does not need to be secured

Assumption is that Heritrix does not need to secure the store in which we keep credentials to offer up during authentications; the credentials store does not need to be saved on disk encrypted and password protected.

2. Authentication Schemes

This section discusses common web authentication schemes and, where applicable, practical issues navigating the schemes' requirements. The first two described, Section 2.1, “Basic and Digest Access Authentication” and Section 2.2, “HTTP POST and GET of Authentication Credentials”, are assumed most commonly used.

2.1. Basic and Digest Access Authentication [rfc2617]

The server returns an HTTP response code of 401 Unauthorized or 407 Proxy Authentication Required when it requires authentication of the client.

The realm directive (case-insensitive) is required for all authentication schemes that issue a challenge. The realm value (case-sensitive), in combination with the canonical root URL...of the server being accessed, defines the protection space. [rfc2617]

The canonical root URL is discussed in this message, Re: IE and cached passwords. It's scheme + hostname + port only; path and query string have been stripped. Effectively, it equates to scheme + URI authority.
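As an illustration of the reduction just described, here is a minimal sketch that strips a URI down to scheme + hostname + port. The class name CanonicalRoot is hypothetical, java.net.URI stands in for whatever URI class Heritrix uses, and filling in the default port (80 or 443) when none is given is an assumption made so that equivalent URIs produce the same key.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper: reduces an absolute http(s) URI to scheme + host + port. */
final class CanonicalRoot {

    /** Returns the canonical root URL; path and query string are dropped. */
    static String of(String uri) {
        URI u = URI.create(uri);
        String scheme = u.getScheme().toLowerCase(Locale.ROOT);
        String host = u.getHost().toLowerCase(Locale.ROOT);
        int port = u.getPort();
        if (port == -1) {
            // Assumption: use the scheme's default port so that
            // "http://host/" and "http://host:80/" yield the same key.
            port = scheme.equals("https") ? 443 : 80;
        }
        return scheme + "://" + host + ":" + port;
    }
}
```

With this sketch, "http://www.archive.org/private/index.html?x=1" reduces to "http://www.archive.org:80".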

A client SHOULD assume that all paths at or deeper than the depth of the last symbolic element in the path field of the Request-URI also are within the protection space specified by the Basic realm value of the current challenge. A client MAY preemptively send the corresponding Authorization header with requests for resources in that space without receipt of another challenge from the server. [rfc2617]
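The two rules quoted above can be sketched roughly as follows: which paths fall inside the protection space of a challenged URI, and what a preemptively sent Basic Authorization header value looks like. The class and method names are hypothetical, and encoding the login:password pair as ISO-8859-1 is an assumption (rfc2617 does not fix a character set).

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper illustrating Basic auth protection spaces and headers. */
final class BasicAuthSketch {

    /** Path prefix up to and including the last '/' of the challenged path. */
    static String spacePrefix(String challengedPath) {
        int lastSlash = challengedPath.lastIndexOf('/');
        return lastSlash < 0 ? "/" : challengedPath.substring(0, lastSlash + 1);
    }

    /** True if otherPath is at or deeper than the challenged path's directory. */
    static boolean covers(String challengedPath, String otherPath) {
        return otherPath.startsWith(spacePrefix(challengedPath));
    }

    /** Value for an Authorization header carrying Basic credentials. */
    static String headerValue(String login, String password) {
        // Assumption: ISO-8859-1 bytes for the "login:password" pair.
        byte[] pair = (login + ":" + password).getBytes(StandardCharsets.ISO_8859_1);
        return "Basic " + Base64.getEncoder().encodeToString(pair);
    }
}
```

For the user-id "Aladdin" and password "open sesame", headerValue returns "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==", which matches the example given in rfc2617.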

2.2. HTTP POST and GET of Authentication Credentials

Generally, this scheme works as follows. When an unauthenticated client attempts to access a protected area, they are redirected by the server to a page with an HTML login form. The client must then HTTP POST or HTTP GET the HTML form with the client access credentials filled in. Upon verification of the credentials by the server, the client is given access. So that the client does not need to pass credentials on all subsequent accesses to the protected areas of the site, the server will usually mark the client in one of two ways: It will write a special, usually time- and scope-limited, token, or "cookie", back to the client which the client volunteers on all subsequent accesses, or the server will serve pages that have embedded URLs rewritten to include a special token. The tokens are examined by the server on each subsequent access for validity and access continues while the token remains valid.

There is no standard for how this dialogue is supposed to proceed. Myriad are the implementations of this basic scheme. Below is a listing of common difficulties:

  • Form field item names vary.

  • Means by which unsuccessful login is reported to the client varies. A client can be redirected to a new failed-login page, or the original login page is redrawn with the inclusion of a banner message reporting on the failed login.

  • Following on from the previous point, should a solution POST authentication and then do all necessary to ensure a successful login -- i.e. follow redirects, regex over the result page to ensure it says "successful login", etc. -- or should a solution do nought but POST and then give the resultant page, whatever it is, to the Heritrix URI processing chain whether successful or not?

    Processing of form success page?

    The result page should probably be let through. It may have valuable links on board. The alternative would necessitate our running an out-of-band subset of the Heritrix URI processing chain POSTing/GETting authentication running extractors to verify result of login attempt. This mini authentication chain could be kept tidy encapsulated within a login module -- see Section 1.2.2, “Heritrix URI processing chain”-- but ugly would be how to transfer such as the cookies from the mini chain over to the main URI processing chain.

  • The aforementioned differing ways in which the server parks in the client a validated token.

  • What if login attempt fails? Should we retry? For how long? Means maintaining a state across URI processing?

  • Should there be tools to help an operator develop Heritrix authentication configuration? Should a tool be developed that runs the login outside of the Heritrix context to make it easier on operator developing the authentication configuration?

2.3. X509 Client Certificates

To gain access, the client must volunteer a trusted certificate setting up an SSL connection to the server. Upon receipt, the server tests the client is entitled to access.

It's probably rare that client certificates alone will be used as access protection. More likely, certificates will be used in combination with one of the above listed schemes.

The certificate the client is to volunteer needs to be in a local TrustStore available to the Heritrix TrustManager making the SSL connection (Heritrix already maintains its own keystore of certificates to use verifying server proffered certs).

Testing

Test to see if certificates are volunteered even in case where we're running in open trust mode. Test to see how hard to append a host-particular keystore to the general Heritrix keystore at runtime.

2.4. NTLM [ntlm]

NTLM is...a proprietary protocol designed by Microsoft with no publicly available specification. Early version of NTLM were less secure than Digest authentication due to faults in the design, however these were fixed in a service pack for Windows NT 4 and the protocol is now considered more secure than Digest authentication... There are some significant differences in the way that NTLM works compared with basic and digest authentication...NTLM authenticates a connection and not a request, so you need to authenticate every time a new connection is made and keeping the connection open during authentication is vital. Due to this, NTLM cannot be used to authenticate with both a proxy and the server, nor can NTLM be used with HTTP 1.0 connections or servers that do not support HTTP keep-alives. [httpclient]

NTLM is put outside the scope of this proposal because its nature is antithetical to how Heritrix works: i.e. it authenticates the connection, not a session [Also see Section 1.1.3, “Connection-based authentication schemes”]. Relatedly, the implementation is incomplete in httpclient. NTLM will not be discussed further.

3. Proposal

Proposal is to put off implementation of client-side certificates in Heritrix. Rare is the case where it's needed.

Workaround?

It should be possible to just add the client certificate to the local truststore and all would just work. Test.

Having cut Section 2.4, “NTLM” and Section 2.3, “X509 Client Certificates”, we're left with Section 2.1, “Basic and Digest Access Authentication” and Section 2.2, “HTTP POST and GET of Authentication Credentials”, the assumed most commonly used web authentication schemes.

Reading the above Section 2, “Authentication Schemes”, it may be apparent that there cannot be one solution that will work for both schemes. The discussion in the following two sections -- a section per scheme under consideration -- should bring this fact out and help identify facility common to the two schemes, detailed later in Section 3.3, “Commonage”.

3.1. Basic and Digest Access Authentication [rfc2617]

A basic implementation would, upon receipt of a 401 response status code, extract a realm from the 401 response and use this realm + URI canonical root URL as a compound key to do a look up into a store of Basic/Digest Auth credentials. If a match is found, the persistent domain/virtualdomain object made for the current domain is loaded with the discovered credentials and the 401'ing current URI is marked for retry (If no matching credentials found, the current URI is marked failed with a 401 response code).

Let it be a given that any rfc2617 credentials found in a persistent domain/virtualdomain object always get loaded into the HTTP GET request.

When our 401'ing URI comes around again for retry, since credentials were loaded the last time this URI was seen, credentials will be found in the persistent domain/virtualdomain object and will be added to the request headers. This time around the authentication should succeed.

Any other URI that is a member of this realm will also subsequently successfully authenticate given the above rule whereby we always load any found credentials into the current request.
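The 401 handling described above might be sketched as follows. This is only an illustration under stated assumptions: the class Rfc2617Flow, its two maps (one standing in for the credential store, one for the credentials loaded into the persistent domain/virtualdomain object) and the Outcome values are all hypothetical, and the realm extraction is a simple regex that does not handle escaped quotes inside the realm value.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the "look up credentials on 401, then retry" flow. */
final class Rfc2617Flow {
    enum Outcome { RETRY, FAILED }

    // The realm directive name is case-insensitive; the value is kept as-is.
    private static final Pattern REALM =
        Pattern.compile("realm\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Stand-in for the credential store: compound key -> "login:password". */
    final Map<String, String> store = new HashMap<>();
    /** Stand-in for credentials loaded into the per-server object. */
    final Map<String, String> loadedOnServer = new HashMap<>();

    /** Extracts the realm from a WWW-Authenticate header value, or null. */
    static String realm(String wwwAuthenticate) {
        Matcher m = REALM.matcher(wwwAuthenticate);
        return m.find() ? m.group(1) : null;
    }

    /** Compound key: canonical root URL plus realm. */
    static String key(String canonicalRootUrl, String realm) {
        return canonicalRootUrl + "\n" + realm;
    }

    /** Called on a 401: load matching credentials and retry, else fail. */
    Outcome on401(String canonicalRootUrl, String wwwAuthenticate) {
        String realm = realm(wwwAuthenticate);
        if (realm == null) {
            return Outcome.FAILED;
        }
        String k = key(canonicalRootUrl, realm);
        String credentials = store.get(k);
        if (credentials == null) {
            return Outcome.FAILED;          // URI is marked failed with a 401
        }
        loadedOnServer.put(k, credentials); // found next time the URI comes around
        return Outcome.RETRY;
    }
}
```

On the retry, the fetcher would find the entry in loadedOnServer and add it to the request headers, as would any later request for a URI in the same realm.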

Let the above be the default behavior. Configuration options would:

  • Enable or disable this feature.

  • Pre-populate the persistent domain/virtualdomain object with all rfc2617 credentials upon construction, thereby avoiding 401s altogether since we'd be sending all credentials in advance of any challenge (preemptive authentication). A domain might have many rfc2617 realms. Preemptive authentication would have us volunteering all of a domain's realms' credentials in each request.

    The query of the store pre-populating the persistent domain/virtualdomain object would use the URI canonical root URL for a key.

    This configuration could be set globally for all Heritrix requests or per URI canonical root URL by setting a property on the corresponding record in the store.

  • Upon receipt of a 401 and on successfully locating appropriate credentials in the store (or already loaded in the persistent domain/virtualdomain object), immediately retry the request rather than letting the 401 percolate down through the Heritrix processing chain and back up out of the Frontier (enabling this configuration would leave no trace of the 401 in the ARC).

The simplest implementation would have us always do preemptive authentication. Configuration would turn this feature on or off, and that'd be all.

Below we look with more detail at aspects of the above proposed implementation.

3.1.1. CrawlServer

In Heritrix, the persistent domain/virtualdomain object is org.archive.crawler.datamodel.CrawlServer. It's created inside org.archive.crawler.basic.Frontier#next() if no extant CrawlServer is found in the org.archive.crawler.datamodel.ServerCache. The lookup is done using a (decoded) URI authority. The currently processed URI has easy access to its corresponding CrawlServer. See CrawlURI#getServer().

3.1.2. HTTPClient

HTTPClient has built-in support for Basic, Digest and NTLM. It takes care of sending the appropriate Authorization headers.

Digest Authentication generally works but has a ways to go according to the comment made on 2004-03-11 16:21 in Wrong reauthentication when using DigestAuthentication.

Multiple Realms

What to do if a host has multiple realms? Will HTTPClient [httpclient] do the right thing and offer all credentials available appropriately? Need to test.

The HTTPClient authentication code was just refactored extensively in HEAD -- post 2.0 release. Reported problems authenticating via a proxy going over SSL.

3.1.3. RFC2617 Record

An RFC2617 record would be keyed by URI canonical root URL. It would contain a realm, login and password. We'd not distinguish proxy (407) records.

3.2. HTTP POST and GET of Authentication Credentials

Every URI processed by Heritrix first has preconditions checked. Example preconditions are the fetching of a domain's DNS record and its robots.txt file before proceeding to make requests against the domain. This proposal is to add a new login precondition after the fashion of the robots and DNS preconditions -- See org.archive.crawler.basic.PreconditionEnforcer -- and a facility for having our HTTP fetcher run a configurable one time login.

The new login precondition will test the current URI against a preloaded list of login URI patterns. Each login URI pattern describes a protected area of a domain (or virtualdomain): e.g. "http://www.archive.org/private/*". Each login URI pattern serves as a key to an associated login record. A login record has all information necessary for negotiation of a successful login such as the HTML form content to submit -- username, password, submit button name, etc. -- and whether login requires POSTing or GETting the login form. The login record also has a ran login flag that says whether or not the login has been run previously against this protected area.

Ran Login flag

The ran login flag says whether the login has been run, not whether or not login succeeded. Gauging whether the login was successful or not is difficult. It varies with the login implementation as already noted.

Also part of the login record is a login URI. The login URI is the login page whose successful navigation gives access to the protected space: e.g. If the pattern we used testing was, "http://www.archive.org/private/*", the login URI might be "http://www.archive.org/private/login.html".

If the current URI matches one of the patterns in the login URI pattern list, we pull the matched pattern's associated login record. If the ran login flag has not been set, the login URI is force queued. It's force queued in case the URI has been seen (GET'd) already. The login URI (somehow) has the login record associated. The presence of the login record distinguishes the login URI. The current URI is requeued (Precondition not met). Otherwise the current URI is let run through as per normal.

When the login URI becomes the current URI and is being processed by the HTTP fetcher, the presence of the login record with its ran login flag set to false signals the HTTP fetcher to run the abnormal login sequence rather than do its usual GET. The login record has all the HTTP fetcher needs to execute the login. Upon completion, the ran login flag is set in the login record and the login record is removed from the login URI.
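The precondition decision described above could look something like the sketch below. All names here (LoginPrecondition, LoginRecord, Decision) are hypothetical and do not correspond to existing Heritrix classes. Two simplifications are assumed: '*' in a login URI pattern matches any run of characters, and a URI is recognized as the login URI by string equality rather than by carrying an attached login record.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch of the login precondition check. */
final class LoginPrecondition {
    enum Decision { PROCEED, QUEUE_LOGIN_AND_REQUEUE }

    /** Minimal login record: the login URI and the ran login flag. */
    static final class LoginRecord {
        final String loginUri;
        boolean ranLogin; // set once the login has run, whether or not it succeeded
        LoginRecord(String loginUri) { this.loginUri = loginUri; }
    }

    private final Map<Pattern, LoginRecord> records = new LinkedHashMap<>();

    /** Registers a pattern such as "http://www.archive.org/private/*". */
    void add(String loginUriPattern, LoginRecord record) {
        String[] literals = loginUriPattern.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*"); // assumption: '*' matches any characters
            }
            regex.append(Pattern.quote(literals[i]));
        }
        records.put(Pattern.compile(regex.toString()), record);
    }

    /** Decides whether the current URI may proceed or must wait for a login. */
    Decision check(String currentUri) {
        for (Map.Entry<Pattern, LoginRecord> e : records.entrySet()) {
            LoginRecord record = e.getValue();
            if (!e.getKey().matcher(currentUri).matches()) {
                continue;
            }
            if (currentUri.equals(record.loginUri)) {
                return Decision.PROCEED; // the fetcher runs the login itself
            }
            if (!record.ranLogin) {
                return Decision.QUEUE_LOGIN_AND_REQUEUE;
            }
        }
        return Decision.PROCEED;
    }
}
```

In this sketch the HTTP fetcher would set ranLogin on the record once it has executed the login, after which every URI under the pattern is let through.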

GET of the login URI

What if we haven't already seen the login page? Should the login precondition first force fetch the login URI without the login record loaded so it's first GET'd before we run a login?

This implementation cannot guarantee successful login nor is there provision for retries. The general notion is that the single running of the login succeeds and that the produced success cookie or rewritten URI makes it back to the Heritrix client gaining us access to the protected area.

Configuration would enable or disable this feature.

3.2.1. Login Record

A login record would be keyed by the pattern it applies to and would contain aforementioned ran login flag and login URI. Tied to the login URI would be a list of key-value pairs to hold the login form content as well as specification of whether the form is to be POSTed or GETed.
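A rough sketch of the form part of such a record follows: the login URI, the key-value form items, and the POST/GET choice, together with how the items might be serialized for the request. The class LoginForm and its methods are hypothetical, and encoding the items as UTF-8 application/x-www-form-urlencoded is an assumption; a given site may expect something else.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the form content held by a login record. */
final class LoginForm {
    enum Method { GET, POST }

    final String loginUri;
    final Method method;
    /** Form items in submission order: username, password, submit button, etc. */
    final Map<String, String> items = new LinkedHashMap<>();

    LoginForm(String loginUri, Method method) {
        this.loginUri = loginUri;
        this.method = method;
    }

    /** Items joined as name=value pairs separated by '&', each part encoded. */
    String encodedItems() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : items.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    /** For GET the items travel in the query string; for POST the URI is unchanged. */
    String requestUri() {
        return method == Method.GET ? loginUri + "?" + encodedItems() : loginUri;
    }

    /** For POST the items travel in the request body; a GET has no body. */
    String requestBody() {
        return method == Method.POST ? encodedItems() : "";
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8"); // assumption: UTF-8 form encoding
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

Keeping the items in an ordered map preserves the field order of the original HTML form, which some login implementations may be sensitive to.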

3.3. Commonage

Here we discuss features common to the two above authentication scheme implementations.

3.3.1. URI#authority as URI canonical root URL

Proposal is to equate the two. Doing so means no need to change CrawlServer. Currently the CrawlServer is constructed wrapping the URI#authority portion of a URI. URI#authority is URI canonical root URL absent the scheme. Assuming CrawlServer is for http only, then it should be safe making this equation.

DNS

Are there CrawlServer instances made for anything but http schemes?

HTTPS

Check that URI canonical root URLs of http://www.example.com and https://www.example.com result in different CrawlServer instances.
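The reason this check matters can be seen in a small sketch: the authority alone does not carry the scheme, so keying on it makes the http and https roots of the same host collide. Here java.net.URI stands in for the Heritrix URI class, and the helper names are hypothetical; how Heritrix actually keys CrawlServer instances still has to be verified as noted above.

```java
import java.net.URI;

/** Hypothetical helper contrasting authority with scheme + authority. */
final class AuthorityCheck {

    /** The authority portion only, e.g. "www.example.com:8080". */
    static String authority(String uri) {
        return URI.create(uri).getAuthority();
    }

    /** Scheme plus authority, which keeps http and https apart. */
    static String schemeAndAuthority(String uri) {
        URI u = URI.create(uri);
        return u.getScheme() + "://" + u.getAuthority();
    }
}
```

With this sketch, http://www.example.com and https://www.example.com share the authority "www.example.com" but differ once the scheme is included.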

3.3.2. Population of Domain/VirtualDomain object with Credentials

Proposal is that CrawlServer encapsulate credentials store accessing, that it read the store upon construction.

3.3.3. Caching of Credentials

Once read from the store, we need to cache the credentials in CrawlServer.

3.3.3.1. JAAS Subject, Principal and Credentials [jaas]

Proposal is that we at least look at selectively exploiting this library for caching credentials. For example, a CrawlServer might hold a javax.security.auth.Subject (Subject is a final class, so it cannot be implemented or subclassed). To this Subject, we'd add implementations of the java.security.Principal interface and credential objects (Makes sense for the carrying of RFC2617 credentials. Less so for login credentials. TBD).
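A minimal sketch of what caching RFC2617 credentials in a JAAS Subject might look like is given below. In the JDK, Subject is a final class in javax.security.auth, so the sketch has the caller hold a Subject rather than implement one. RealmPrincipal, LoginPassword and SubjectCache are hypothetical names introduced for illustration; JAAS itself does not define a password credential class in the JDK, so credentials are stored as plain objects in the Subject's private credential set.

```java
import java.security.Principal;
import javax.security.auth.Subject;

/** Hypothetical sketch: caching realm credentials in a JAAS Subject. */
final class SubjectCache {

    /** Hypothetical principal naming an RFC2617 realm. */
    static final class RealmPrincipal implements Principal {
        private final String realm;
        RealmPrincipal(String realm) { this.realm = realm; }
        @Override public String getName() { return realm; }
        @Override public boolean equals(Object o) {
            return o instanceof RealmPrincipal && ((RealmPrincipal) o).realm.equals(realm);
        }
        @Override public int hashCode() { return realm.hashCode(); }
    }

    /** Hypothetical credential object: login and password for one realm. */
    static final class LoginPassword {
        final String realm;
        final String login;
        final String password;
        LoginPassword(String realm, String login, String password) {
            this.realm = realm;
            this.login = login;
            this.password = password;
        }
    }

    /** Builds a Subject carrying one realm's principal and credentials. */
    static Subject subjectFor(String realm, String login, String password) {
        Subject subject = new Subject();
        subject.getPrincipals().add(new RealmPrincipal(realm));
        subject.getPrivateCredentials().add(new LoginPassword(realm, login, password));
        return subject;
    }

    /** Returns the cached credentials for the realm, or null if none. */
    static LoginPassword find(Subject subject, String realm) {
        for (LoginPassword c : subject.getPrivateCredentials(LoginPassword.class)) {
            if (c.realm.equals(realm)) {
                return c;
            }
        }
        return null;
    }
}
```

A server with several realms would simply add one principal and one credential object per realm to the same Subject.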

3.3.4. Credential Stores

The credential store would be on disk.

For convenience, particularly listing credentials in a global file store, credentials can be grouped first by host (the base domain -- domain minus port #) and then by URI#authority (domain plus any port #).

Configuration would allow us to point at a global store of credentials.

3.3.4.1. Layering of Credential Stores

Subsequently, we'd add support for layering stores. Modeled after Apache's .htaccess mechanism for selectively overriding the main server configuration on a directory scope, or, closer to home, on how Heritrix settings can be overridden on a per-host basis, it'd be possible to point the store querying code at a directory whose subdirectories are named for domains, progressing from a root down through the macro level org, com, gov, etc., with subdomains getting progressively more precise: e.g. travel.yahoo.com would be found under the yahoo.com directory, which would be under the com directory. Searching for credentials, we'd search up through the directory structure going from the current domain on up to the root, using the realm + canonical root URL key. If not found in the domain store, or if a domain store did not exist, we'd back up the settings hierarchy until we hit the global store.
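The search order this implies can be sketched as a list of store directories, most specific first, ending at the global store. The class StoreSearchPath is hypothetical, as is the exact layout assumed here: one directory per domain suffix (com, then yahoo.com, then travel.yahoo.com), with the empty path standing for the global store at the root.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: directories to search for a domain's credentials. */
final class StoreSearchPath {

    /** Joins the last 'count' labels with dots, e.g. "yahoo.com". */
    private static String suffix(String[] labels, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = labels.length - count; i < labels.length; i++) {
            if (sb.length() > 0) {
                sb.append('.');
            }
            sb.append(labels[i]);
        }
        return sb.toString();
    }

    /** Relative store directories, most specific first; "" is the global store. */
    static List<String> forDomain(String domain) {
        String[] labels = domain.toLowerCase(Locale.ROOT).split("\\.");
        List<String> paths = new ArrayList<>();
        for (int depth = labels.length; depth >= 0; depth--) {
            StringBuilder path = new StringBuilder();
            for (int k = 1; k <= depth; k++) {
                if (path.length() > 0) {
                    path.append('/');
                }
                path.append(suffix(labels, k));
            }
            paths.add(path.toString());
        }
        return paths;
    }
}
```

For travel.yahoo.com this yields com/yahoo.com/travel.yahoo.com, then com/yahoo.com, then com, and finally the global store; the lookup would stop at the first store holding a matching record.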

3.3.4.2. Exploit the settings framework implementing credentials store

Propose extending or adapting the Heritrix settings framework to have it manage our credentials store so we can exploit code already written.

3.3.5. Logging

A new log will trace authentication transactions. The log will include a listing of credentials offered, new cookies, query parameters, and pertinent HTTP headers returned by the submitted authentication, and, where possible, a report on whether authentication succeeded or not.

3.3.6. Debugging tool

A command-line tool that runs single logins will aid development and debugging, and be of use to operators.

4. Design

4.1. Configuration

Will add to the HTTP Fetcher options that enable, disable and configure the two authentication types supported.

4.2. Credential store

Below is a static class model diagram for accessing the credential store.

Implementation looks nothing like the above

Ignore the above design. The implementation turned out to be something else altogether. The model was effectively inverted (credentials hold domains) and notions of going via a CredentialManager/CredentialStore to do all operations on the store were removed. While the resultant implementation is not a good OOM, it's amenable to UI manipulation (and sits easily atop the Heritrix settings system).

5. Future

This section has issues to be addressed later, probably in a version 2.0 of the authentication system.

5.1. Same URL different Page Content

Heritrix distinguishes pages by URIs. Pages seen can be different whether logged in or not. We'll need some way to force/suggest sets of URIs are revisitable after a login token is received. This might mean the 'fingerprint' of a URI includes any authentication information to be used.
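One way to picture a fingerprint that includes authentication information is sketched below. This is purely illustrative: the class AuthAwareFingerprint is hypothetical, the use of SHA-1 is an arbitrary choice for the sketch rather than how Heritrix fingerprints URIs, and the authContext string (for example a login URI pattern or a realm) is an assumed stand-in for whatever authentication information would be used.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch: a URI fingerprint that also covers the auth context. */
final class AuthAwareFingerprint {

    /** Hex digest over the URI and, if present, the authentication context. */
    static String of(String uri, String authContext) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            sha1.update(uri.getBytes(StandardCharsets.UTF_8));
            sha1.update((byte) 0); // separator between URI and auth context
            String context = authContext == null ? "" : authContext;
            sha1.update(context.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : sha1.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-1 is a standard JDK algorithm
        }
    }
}
```

Under this sketch the same URI fetched anonymously and fetched after login yields two different fingerprints, so the logged-in fetch would not be discarded as already seen.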

5.2. Integration with the UI

Add/Edit/Delete of Credentials via the UI. Flagging the operator about 401s and likely HTML login forms.

Bibliography

[httpclient] Apache Jakarta Commons HTTPClient Authentication Guide. Commons HTTPClient version 2.0.