When crawling a page, it fails with a "forbidden" error (code 403)


As well as trying to access a resource you simply don't have permission for, web servers can return a 403 (Forbidden) status for other reasons. This article describes two other common cases Cyotek has identified.

Incorrect handling of HEAD requests

This first case can occur when you try to crawl a website which doesn't allow the HEAD method, but returns 403 rather than 405 (Method Not Allowed).

WebCopy 1.8 and above will try to mitigate this by retrying with GET when a HEAD request fails, and automatically disabling header checking for affected domains.

While the HTTP protocol defines a number of methods, WebCopy makes use of only three of these - HEAD, GET and POST.

By default, WebCopy issues a HEAD request before crawling any URI; this provides important information, such as the content type and length, before any content is actually downloaded. This speeds up crawls where you are excluding content types that belong to large binary files. However, if the web server doesn't support, or has disabled, the HEAD method, any crawl of that server will fail.
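The retry-with-GET mitigation used by WebCopy 1.8 and above can be sketched roughly as follows. This is an illustrative Python sketch, not WebCopy's actual code, and the fetch_headers helper is hypothetical:

```python
import urllib.error
import urllib.request


def fetch_headers(url):
    """Try a cheap HEAD request first; if the server rejects the method
    with 403 or 405, retry with GET. Returns the method that worked and
    the response headers."""
    for method in ("HEAD", "GET"):
        request = urllib.request.Request(url, method=method)
        try:
            with urllib.request.urlopen(request) as response:
                return method, dict(response.headers)
        except urllib.error.HTTPError as error:
            if method == "HEAD" and error.code in (403, 405):
                continue  # server mishandles HEAD; fall back to GET
            raise
```

A real crawler would remember which domains needed the fallback so it can skip the doomed HEAD request on subsequent URIs.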

If this happens, you need to disable the use of the HEAD command by WebCopy. To do this, display the Project Properties dialogue, select the Advanced category, then uncheck Use Header Checking. Click OK to save your changes and close the dialogue, then retry the crawl.

How can I check up front if HEAD is supported?

You can use the Test URI feature of WebCopy to determine if the URI you want to crawl supports the HEAD method. Simply click Test URI from the toolbar, enter the URL of the site to test, and click Test. WebCopy will try to retrieve the headers, and will notify you of any problems. You can then use the same window to switch to GET and see if this works.
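Outside of WebCopy, you can perform the same check with a few lines of script. The Python sketch below mirrors what the Test URI window does; the supports_head helper is a hypothetical name, not part of WebCopy:

```python
import urllib.error
import urllib.request


def supports_head(url):
    """Return True if the server answers a HEAD request for this URL
    successfully, False if it rejects it with an HTTP error status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request):
            return True
    except urllib.error.HTTPError:
        return False
```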

Servers rejecting custom user agents

Another common case is where a server checks the client user agent and returns 403 if it doesn't match what it is expecting. In this scenario, changing the user agent to mimic one used by a traditional web browser may help.

WebCopy 1.9 and above will try to mitigate this by testing with the default user agent; in the event of a 403 response it will retest with a generic agent and automatically use this if successful.

To change the user agent, display the Project Properties dialogue and select the User Agent category. Next, select Use custom user agent and either enter a custom value or choose one from the pre-defined list. Click OK to save your changes and close the dialogue, then retry the crawl.
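The effect of this setting is simply that requests are sent with a different User-Agent header. As a rough Python illustration, with a hypothetical fetch_status helper; the browser-style agent string shown is just an example, not WebCopy's default:

```python
import urllib.request


def fetch_status(url, user_agent):
    """Request a URL with an explicit User-Agent header and return the
    HTTP status code."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.status


# A generic browser-like agent string that strict servers often accept:
BROWSER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
```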


Update History

  • 2013-04-28 - First published
  • 2021-04-02 - Updated to include user agents